Re: [PATCH v3 2/2] fs/xattr: add *at family syscalls
On Fri, Apr 26, 2024, at 18:20, Christian Göttsche wrote: > From: Christian Göttsche > > Add the four syscalls setxattrat(), getxattrat(), listxattrat() and > removexattrat(). Those can be used to operate on extended attributes, > especially security related ones, either relative to a pinned directory > or on a file descriptor without read access, avoiding a > /proc//fd/ detour, requiring a mounted procfs. > > One use case will be setfiles(8) setting SELinux file contexts > ("security.selinux") without race conditions and without a file > descriptor opened with read access requiring SELinux read permission. > > Use the do_{name}at() pattern from fs/open.c. > > Pass the value of the extended attribute, its length, and for > setxattrat(2) the command (XATTR_CREATE or XATTR_REPLACE) via an added > struct xattr_args to not exceed six syscall arguments and not > merging the AT_* and XATTR_* flags. > > Signed-off-by: Christian Göttsche > CC: x...@kernel.org > CC: linux-al...@vger.kernel.org > CC: linux-ker...@vger.kernel.org > CC: linux-arm-ker...@lists.infradead.org > CC: linux-i...@vger.kernel.org > CC: linux-m...@lists.linux-m68k.org > CC: linux-m...@vger.kernel.org > CC: linux-par...@vger.kernel.org > CC: linuxppc-dev@lists.ozlabs.org > CC: linux-s...@vger.kernel.org > CC: linux...@vger.kernel.org > CC: sparcli...@vger.kernel.org > CC: linux-fsde...@vger.kernel.org > CC: au...@vger.kernel.org > CC: linux-a...@vger.kernel.org > CC: linux-...@vger.kernel.org > CC: linux-security-mod...@vger.kernel.org > CC: seli...@vger.kernel.org I checked that the syscalls are all well-formed regarding argument types, number of arguments and (absence of) compat handling, and that they are wired up correctly across architectures I did not look at the actual implementation in detail. Reviewed-by: Arnd Bergmann
[PATCH v3 2/2] fs/xattr: add *at family syscalls
From: Christian Göttsche Add the four syscalls setxattrat(), getxattrat(), listxattrat() and removexattrat(). Those can be used to operate on extended attributes, especially security related ones, either relative to a pinned directory or on a file descriptor without read access, avoiding a /proc//fd/ detour, requiring a mounted procfs. One use case will be setfiles(8) setting SELinux file contexts ("security.selinux") without race conditions and without a file descriptor opened with read access requiring SELinux read permission. Use the do_{name}at() pattern from fs/open.c. Pass the value of the extended attribute, its length, and for setxattrat(2) the command (XATTR_CREATE or XATTR_REPLACE) via an added struct xattr_args to not exceed six syscall arguments and not merging the AT_* and XATTR_* flags. Signed-off-by: Christian Göttsche CC: x...@kernel.org CC: linux-al...@vger.kernel.org CC: linux-ker...@vger.kernel.org CC: linux-arm-ker...@lists.infradead.org CC: linux-i...@vger.kernel.org CC: linux-m...@lists.linux-m68k.org CC: linux-m...@vger.kernel.org CC: linux-par...@vger.kernel.org CC: linuxppc-dev@lists.ozlabs.org CC: linux-s...@vger.kernel.org CC: linux...@vger.kernel.org CC: sparcli...@vger.kernel.org CC: linux-fsde...@vger.kernel.org CC: au...@vger.kernel.org CC: linux-a...@vger.kernel.org CC: linux-...@vger.kernel.org CC: linux-security-mod...@vger.kernel.org CC: seli...@vger.kernel.org --- v3: - pass value, size and xattr_flags via new struct xattr_args to split AT_* and XATTR_* flags v2: https://lore.kernel.org/lkml/20230511150802.737477-1-cgzo...@googlemail.com/ - squash syscall introduction and wire up commits - add AT_XATTR_CREATE and AT_XATTR_REPLACE constants v1 discussion: https://lore.kernel.org/all/20220830152858.14866-2-cgzo...@googlemail.com/ Previous approach ("f*xattr: allow O_PATH descriptors"): https://lore.kernel.org/all/20220607153139.35588-1-cgzo...@googlemail.com/ --- arch/alpha/kernel/syscalls/syscall.tbl | 4 + arch/arm/tools/syscall.tbl | 4 + arch/arm64/include/asm/unistd.h | 2 +- arch/arm64/include/asm/unistd32.h | 8 ++ arch/m68k/kernel/syscalls/syscall.tbl | 4 + arch/microblaze/kernel/syscalls/syscall.tbl | 4 + arch/mips/kernel/syscalls/syscall_n32.tbl | 4 + arch/mips/kernel/syscalls/syscall_n64.tbl | 4 + arch/mips/kernel/syscalls/syscall_o32.tbl | 4 + arch/parisc/kernel/syscalls/syscall.tbl | 4 + arch/powerpc/kernel/syscalls/syscall.tbl| 4 + arch/s390/kernel/syscalls/syscall.tbl | 4 + arch/sh/kernel/syscalls/syscall.tbl | 4 + arch/sparc/kernel/syscalls/syscall.tbl | 4 + arch/x86/entry/syscalls/syscall_32.tbl | 4 + arch/x86/entry/syscalls/syscall_64.tbl | 4 + arch/xtensa/kernel/syscalls/syscall.tbl | 4 + fs/xattr.c | 128 include/asm-generic/audit_change_attr.h | 6 + include/linux/syscalls.h| 10 ++ include/uapi/asm-generic/unistd.h | 12 +- include/uapi/linux/xattr.h | 6 + 22 files changed, 208 insertions(+), 24 deletions(-) diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl index 8ff110826ce2..fdc11249f1b8 100644 --- a/arch/alpha/kernel/syscalls/syscall.tbl +++ b/arch/alpha/kernel/syscalls/syscall.tbl @@ -501,3 +501,7 @@ 569common lsm_get_self_attr sys_lsm_get_self_attr 570common lsm_set_self_attr sys_lsm_set_self_attr 571common lsm_list_modulessys_lsm_list_modules +572common setxattrat sys_setxattrat +573common getxattrat sys_getxattrat +574common listxattrat sys_listxattrat +575common removexattrat sys_removexattrat diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl index b6c9e01e14f5..22fbbcd8e2b5 100644 --- a/arch/arm/tools/syscall.tbl +++ b/arch/arm/tools/syscall.tbl @@ -475,3 +475,7 @@ 459common lsm_get_self_attr sys_lsm_get_self_attr 460common lsm_set_self_attr sys_lsm_set_self_attr 461common lsm_list_modulessys_lsm_list_modules +462common setxattrat sys_setxattrat +463common getxattrat sys_getxattrat +464common listxattrat sys_listxattrat +465common removexattrat sys_removexattrat diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h index 491b2b9bd553..f3a77719eb05 100644 --- a/arch/arm64/include/asm/unistd.h +++ b/arch/arm64/include/asm/unistd.h @@ -39,7 +39,7 @@ #define __ARM_NR_compat_set_tls(__ARM_NR_COMPAT_BASE + 5) #define __ARM_NR_COMPAT_END(__ARM_NR_COMPAT_BASE + 0x800) -#define __NR_compat_syscalls 462 +#define
Re: [PATCH v6 00/16] mm: jit/text allocator
On Fri, Apr 26, 2024 at 11:28:38AM +0300, Mike Rapoport wrote: > From: "Mike Rapoport (IBM)" > > Hi, > > The patches are also available in git: > https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=execmem/v6 > > v6 changes: > * restore patch "arm64: extend execmem_info for generated code > allocations" that disappeared in v5 rebase > * update execmem initialization so that by default it will be > initialized early while late initialization will be an opt-in I've taken this new iteration again through modules-next so to help get more testing exposure to this. If we run into conflicts again with mm we can see if Andrew is willing to take this in through there. However, it may make sense to only consider up to "mm: introduce execmem_alloc() and execmem_free()" for v6.10 given we're bound to likely find more issues and we are already at rc5. Luis
Re: [PATCH v1 1/3] mm/gup: consistently name GUP-fast functions
On Fri, Apr 26, 2024 at 11:33:08PM +0200, David Hildenbrand wrote: > I raised this topic in the past, and IMHO we either (a) never should have > added COW support; or (b) added COW support by using ordinary anonymous > memory (hey, partial mappings of hugetlb pages! ;) ). > > After all, COW is an optimization to speed up fork and defer copying. It > relies on memory overcommit, but that doesn't really apply to hugetlb, so we > fake it ... Good summary. > > One easy ABI break I had in mind was to simply *not* allow COW-sharing of > anon hugetlb folios; for example, simply don't copy the page into the child. > Chances are there are not really a lot of child processes that would fail > ... but likely we would break *something*. So there is no easy way out :( Right, not easy. The thing is this is one spot out of many of the specialties, it also may or may not be worthwhile to have dedicated time while nobody yet has a problem with it. It might be easier to start with v2, even though that's also hard to nail everything properly - the challenge can come from different angles. Thanks for the sharings, helpful. I'll go ahead with the Power fix on hugepd putting this aside. I hope that before the end of this year, whatever I'll fix can go away, by removing hugepd completely from Linux. For now that may or may not be as smooth, so we'd better still fix it. -- Peter Xu
Re: [PATCH] powerpc/pseries: Enforce hcall result buffer validity and size
Nathan Lynch writes: > Michael Ellerman writes: >> Nathan Lynch via B4 Relay >> writes: >>> >>> plpar_hcall(), plpar_hcall9(), and related functions expect callers to >>> provide valid result buffers of certain minimum size. Currently this >>> is communicated only through comments in the code and the compiler has >>> no idea. >>> >>> For example, if I write a bug like this: >>> >>> long retbuf[PLPAR_HCALL_BUFSIZE]; // should be PLPAR_HCALL9_BUFSIZE >>> plpar_hcall9(H_ALLOCATE_VAS_WINDOW, retbuf, ...); >>> >>> This compiles with no diagnostics emitted, but likely results in stack >>> corruption at runtime when plpar_hcall9() stores results past the end >>> of the array. (To be clear this is a contrived example and I have not >>> found a real instance yet.) >> >> We did have some real stack corruption bugs in the past. >> >> I referred to them in my previous (much uglier) attempt at a fix: >> >> >> https://patchwork.ozlabs.org/project/linuxppc-dev/patch/1476780032-21643-2-git-send-email-...@ellerman.id.au/ >> >> Annoyingly I didn't describe them in any detail, but at least one of them >> was: >> >> 24c65bc7037e ("hwrng: pseries - port to new read API and fix stack >> corruption") > > Thanks for this background. > > >> Will this catch a case like that? Where the too-small buffer is not >> declared locally but rather comes into the function as a pointer? > > No, unfortunately. But here's a sketch that forces retbuf to be an > array [...] I've made some attempts to improve on this, but I think the original patch as written may be the best we can do without altering existing call sites or introducing new APIs and types. FWIW, GCC is capable of warning when a too-small dynamically allocated buffer is used. I don't think it would have caught the pseries-rng bug, but it works when the size of the buffer is available e.g. #include long plpar_hcall(long opcode, long rets[static 4], ...); void f(void) { long retbuf_stack_4[4]; long retbuf_stack_3[3]; long *retbuf_heap_4 = malloc(4 * sizeof(long)); long *retbuf_heap_3 = malloc(3 * sizeof(long)); plpar_hcall(0, retbuf_stack_4); plpar_hcall(0, retbuf_stack_3); // bug plpar_hcall(0, retbuf_heap_4); plpar_hcall(0, retbuf_heap_3); // bug } :13:5: warning: 'plpar_hcall' accessing 32 bytes in a region of size 24 [-Wstringop-overflow=] 13 | plpar_hcall(0, retbuf_stack_3); // bug | ^~ :13:5: note: referencing argument 2 of type 'long int[4]' :3:6: note: in a call to function 'plpar_hcall' 3 | long plpar_hcall(long opcode, long rets[static 4], ...); | ^~~ :15:5: warning: 'plpar_hcall' accessing 32 bytes in a region of size 24 [-Wstringop-overflow=] 15 | plpar_hcall(0, retbuf_heap_3); // bug | ^ :15:5: note: referencing argument 2 of type 'long int[4]' :3:6: note: in a call to function 'plpar_hcall' 3 | long plpar_hcall(long opcode, long rets[static 4], ...); | ^~~ Compiler Explorer link for anyone interested in experimenting: https://godbolt.org/z/x9GKMTzdb It looks like -Wstringop-overflow is disabled in Linux's build for now, but hopefully that will change in the future. OK with taking the patch as-is?
Re: [PATCH v1 1/3] mm/gup: consistently name GUP-fast functions
Hmm, so when I enable 2M hugetlb I found ./cow is even failing on x86. # ./cow | grep -B1 "not ok" # [RUN] vmsplice() + unmap in child ... with hugetlb (2048 kB) not ok 161 No leak from parent into child -- # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with hugetlb (2048 kB) not ok 215 No leak from parent into child -- # [RUN] vmsplice() before fork(), unmap in parent after fork() ... with hugetlb (2048 kB) not ok 269 No leak from child into parent -- # [RUN] vmsplice() + unmap in parent after fork() ... with hugetlb (2048 kB) not ok 323 No leak from child into parent And it looks like it was always failing.. perhaps since the start? We Yes! commit 7dad331be7816103eba8c12caeb88fbd3599c0b9 Author: David Hildenbrand Date: Tue Sep 27 13:01:17 2022 +0200 selftests/vm: anon_cow: hugetlb tests Let's run all existing test cases with all hugetlb sizes we're able to detect. Note that some tests cases still fail. This will, for example, be fixed once vmsplice properly uses FOLL_PIN instead of FOLL_GET for pinning. With 2 MiB and 1 GiB hugetlb on x86_64, the expected failures are: # [RUN] vmsplice() + unmap in child ... with hugetlb (2048 kB) not ok 23 No leak from parent into child # [RUN] vmsplice() + unmap in child ... with hugetlb (1048576 kB) not ok 24 No leak from parent into child # [RUN] vmsplice() before fork(), unmap in parent after fork() ... with hugetlb (2048 kB) not ok 35 No leak from child into parent # [RUN] vmsplice() before fork(), unmap in parent after fork() ... with hugetlb (1048576 kB) not ok 36 No leak from child into parent # [RUN] vmsplice() + unmap in parent after fork() ... with hugetlb (2048 kB) not ok 47 No leak from child into parent # [RUN] vmsplice() + unmap in parent after fork() ... with hugetlb (1048576 kB) not ok 48 No leak from child into parent As it keeps confusing people (until somebody cares enough to fix vmsplice), I already thought about just disabling the test and adding a comment why it happens and why nobody cares. I think we should, and when doing so maybe add a rich comment in hugetlb_wp() too explaining everything? Likely yes. Let me think of something. didn't do the same on hugetlb v.s. normal anon from that regard on the vmsplice() fix. I drafted a patch to allow refcount>1 detection as the same, then all tests pass for me, as below. David, I'd like to double check with you before I post anything: is that your intention to do so when working on the R/O pinning or not? Here certainly the "if it's easy it would already have done" principle applies. :) The issue is the following: hugetlb pages are scarce resources that cannot usually be overcommitted. For ordinary memory, we don't care if we COW in some corner case because there is an unexpected reference. You temporarily consume an additional page that gets freed as soon as the unexpected reference is dropped. For hugetlb, it is problematic. Assume you have reserved a single 1 GiB hugetlb page and your process uses that in a MAP_PRIVATE mapping. Then it calls fork() and the child quits immediately. If you decide to COW, you would need a second hugetlb page, which we don't have, so you have to crash the program. And in hugetlb it's extremely easy to not get folio_ref_count() == 1: hugetlb_fault() will do a folio_get(folio) before calling hugetlb_wp()! ... so you essentially always copy. Hmm yes there's one extra refcount. I think this is all fine, we can simply take all of them into account when making a CoW decision. However crashing an userspace can be a problem for sure. Right, and a simple reference from page migration or some other PFN walker would be sufficient for that. I did not dare being responsible for that, even though races are rare :) The vmsplice leak is not worth that: hugetlb with MAP_PRIVATE to COW-share data between processes with different privilege levels is not really common. At that point I walked away from that, letting vmsplice() be fixed at some point. Dave Howells was close at some point IIRC ... I had some ideas about retrying until the other reference is gone (which cannot be a longterm GUP pin), but as vmsplice essentially does without FOLL_PIN|FOLL_LONGTERM, it's quit hopeless to resolve that as long as vmsplice holds longterm references the wrong way. --- One could argue that fork() with hugetlb and MAP_PRIVATE is stupid and fragile: assume your child MM is torn down deferred, and will unmap the hugetlb page deferred. Or assume you access the page concurrently with fork(). You'd have to COW and crash the program. BUT, there is a horribly ugly hack in hugetlb COW code where you *steal* the page form the child program and crash your child. I'm not making that up, it's horrible. I didn't notice that code before; doesn't sound like a very
Re: [PATCH v1 1/3] mm/gup: consistently name GUP-fast functions
On Fri, Apr 26, 2024 at 07:28:31PM +0200, David Hildenbrand wrote: > On 26.04.24 18:12, Peter Xu wrote: > > On Fri, Apr 26, 2024 at 09:44:58AM -0400, Peter Xu wrote: > > > On Fri, Apr 26, 2024 at 09:17:47AM +0200, David Hildenbrand wrote: > > > > On 02.04.24 14:55, David Hildenbrand wrote: > > > > > Let's consistently call the "fast-only" part of GUP "GUP-fast" and > > > > > rename > > > > > all relevant internal functions to start with "gup_fast", to make it > > > > > clearer that this is not ordinary GUP. The current mixture of > > > > > "lockless", "gup" and "gup_fast" is confusing. > > > > > > > > > > Further, avoid the term "huge" when talking about a "leaf" -- for > > > > > example, we nowadays check pmd_leaf() because pmd_huge() is gone. For > > > > > the > > > > > "hugepd"/"hugepte" stuff, it's part of the name ("is_hugepd"), so that > > > > > stays. > > > > > > > > > > What remains is the "external" interface: > > > > > * get_user_pages_fast_only() > > > > > * get_user_pages_fast() > > > > > * pin_user_pages_fast() > > > > > > > > > > The high-level internal functions for GUP-fast (+slow fallback) are > > > > > now: > > > > > * internal_get_user_pages_fast() -> gup_fast_fallback() > > > > > * lockless_pages_from_mm() -> gup_fast() > > > > > > > > > > The basic GUP-fast walker functions: > > > > > * gup_pgd_range() -> gup_fast_pgd_range() > > > > > * gup_p4d_range() -> gup_fast_p4d_range() > > > > > * gup_pud_range() -> gup_fast_pud_range() > > > > > * gup_pmd_range() -> gup_fast_pmd_range() > > > > > * gup_pte_range() -> gup_fast_pte_range() > > > > > * gup_huge_pgd() -> gup_fast_pgd_leaf() > > > > > * gup_huge_pud() -> gup_fast_pud_leaf() > > > > > * gup_huge_pmd() -> gup_fast_pmd_leaf() > > > > > > > > > > The weird hugepd stuff: > > > > > * gup_huge_pd() -> gup_fast_hugepd() > > > > > * gup_hugepte() -> gup_fast_hugepte() > > > > > > > > I just realized that we end up calling these from follow_hugepd() as > > > > well. > > > > And something seems to be off, because gup_fast_hugepd() won't have the > > > > VMA > > > > even in the slow-GUP case to pass it to gup_must_unshare(). > > > > > > > > So these are GUP-fast functions and the terminology seem correct. But > > > > the > > > > usage from follow_hugepd() is questionable, > > > > > > > > commit a12083d721d703f985f4403d6b333cc449f838f6 > > > > Author: Peter Xu > > > > Date: Wed Mar 27 11:23:31 2024 -0400 > > > > > > > > mm/gup: handle hugepd for follow_page() > > > > > > > > > > > > states "With previous refactors on fast-gup gup_huge_pd(), most of the > > > > code > > > > can be leveraged", which doesn't look quite true just staring the the > > > > gup_must_unshare() call where we don't pass the VMA. Also, > > > > "unlikely(pte_val(pte) != pte_val(ptep_get(ptep)" doesn't make any > > > > sense for > > > > slow GUP ... > > > > > > Yes it's not needed, just doesn't look worthwhile to put another helper on > > > top just for this. I mentioned this in the commit message here: > > > > > >There's something not needed for follow page, for example, > > > gup_hugepte() > > >tries to detect pgtable entry change which will never happen with slow > > >gup (which has the pgtable lock held), but that's not a problem to > > > check. > > > > > > > > > > > @Peter, any insights? > > > > > > However I think we should pass vma in for sure, I guess I overlooked that, > > > and it didn't expose in my tests too as I probably missed ./cow. > > > > > > I'll prepare a separate patch on top of this series and the gup-fast > > > rename > > > patches (I saw this one just reached mm-stable), and I'll see whether I > > > can > > > test it too if I can find a Power system fast enough. I'll probably drop > > > the "fast" in the hugepd function names too. > > > > For the missing VMA parameter, the cow.c test might not trigger it. We never > need the VMA to make > a pinning decision for anonymous memory. We'll trigger an unsharing fault, > get an exclusive anonymous page > and can continue. > > We need the VMA in gup_must_unshare(), when long-term pinning a file hugetlb > page. I *think* > the gup_longterm.c selftest should trigger that, especially: > > # [RUN] R/O longterm GUP-fast pin in MAP_SHARED file mapping ... with memfd > hugetlb (2048 kB) > ... > # [RUN] R/O longterm GUP-fast pin in MAP_SHARED file mapping ... with memfd > hugetlb (1048576 kB) > > > We need a MAP_SHARED page where the PTE is R/O that we want to long-term pin > R/O. > I don't remember from the top of my head if the test here might have a > R/W-mapped > folio. If so, we could extend it to cover that. Let me try both then. > > > Hmm, so when I enable 2M hugetlb I found ./cow is even failing on x86. > > > ># ./cow | grep -B1 "not ok" > ># [RUN] vmsplice() + unmap in child ... with hugetlb (2048 kB) > >not ok 161 No leak from parent into child > >-- > ># [RUN] vmsplice() + unmap in child with mprotect() optimization
Re: [PATCH v6 08/16] mm/execmem, arch: convert remaining overrides of module_alloc to execmem
On Fri, Apr 26, 2024 at 1:30 AM Mike Rapoport wrote: > > From: "Mike Rapoport (IBM)" > > Extend execmem parameters to accommodate more complex overrides of > module_alloc() by architectures. > > This includes specification of a fallback range required by arm, arm64 > and powerpc, EXECMEM_MODULE_DATA type required by powerpc, support for > allocation of KASAN shadow required by s390 and x86 and support for > late initialization of execmem required by arm64. > > The core implementation of execmem_alloc() takes care of suppressing > warnings when the initial allocation fails but there is a fallback range > defined. > > Signed-off-by: Mike Rapoport (IBM) > Acked-by: Will Deacon nit: We should probably move the logic for ARCH_WANTS_EXECMEM_LATE to a separate patch. Otherwise, Acked-by: Song Liu
Re: [PATCH v6 07/16] mm/execmem, arch: convert simple overrides of module_alloc to execmem
On Fri, Apr 26, 2024 at 1:30 AM Mike Rapoport wrote: > > From: "Mike Rapoport (IBM)" > > Several architectures override module_alloc() only to define address > range for code allocations different than VMALLOC address space. > > Provide a generic implementation in execmem that uses the parameters for > address space ranges, required alignment and page protections provided > by architectures. > > The architectures must fill execmem_info structure and implement > execmem_arch_setup() that returns a pointer to that structure. This way the > execmem initialization won't be called from every architecture, but rather > from a central place, namely a core_initcall() in execmem. > > The execmem provides execmem_alloc() API that wraps __vmalloc_node_range() > with the parameters defined by the architectures. If an architecture does > not implement execmem_arch_setup(), execmem_alloc() will fall back to > module_alloc(). > > Signed-off-by: Mike Rapoport (IBM) Acked-by: Song Liu
Re: [PATCH v6 06/16] mm: introduce execmem_alloc() and execmem_free()
On Fri, Apr 26, 2024 at 1:30 AM Mike Rapoport wrote: > > From: "Mike Rapoport (IBM)" > > module_alloc() is used everywhere as a mean to allocate memory for code. > > Beside being semantically wrong, this unnecessarily ties all subsystems > that need to allocate code, such as ftrace, kprobes and BPF to modules and > puts the burden of code allocation to the modules code. > > Several architectures override module_alloc() because of various > constraints where the executable memory can be located and this causes > additional obstacles for improvements of code allocation. > > Start splitting code allocation from modules by introducing execmem_alloc() > and execmem_free() APIs. > > Initially, execmem_alloc() is a wrapper for module_alloc() and > execmem_free() is a replacement of module_memfree() to allow updating all > call sites to use the new APIs. > > Since architectures define different restrictions on placement, > permissions, alignment and other parameters for memory that can be used by > different subsystems that allocate executable memory, execmem_alloc() takes > a type argument, that will be used to identify the calling subsystem and to > allow architectures define parameters for ranges suitable for that > subsystem. > > No functional changes. > > Signed-off-by: Mike Rapoport (IBM) > Acked-by: Masami Hiramatsu (Google) Acked-by: Song Liu
Re: [PATCH v6 05/16] module: make module_memory_{alloc,free} more self-contained
On Fri, Apr 26, 2024 at 1:30 AM Mike Rapoport wrote: > > From: "Mike Rapoport (IBM)" > > Move the logic related to the memory allocation and freeing into > module_memory_alloc() and module_memory_free(). > > Signed-off-by: Mike Rapoport (IBM) Acked-by: Song Liu
Re: [PATCH v1 1/3] mm/gup: consistently name GUP-fast functions
On 26.04.24 18:12, Peter Xu wrote: On Fri, Apr 26, 2024 at 09:44:58AM -0400, Peter Xu wrote: On Fri, Apr 26, 2024 at 09:17:47AM +0200, David Hildenbrand wrote: On 02.04.24 14:55, David Hildenbrand wrote: Let's consistently call the "fast-only" part of GUP "GUP-fast" and rename all relevant internal functions to start with "gup_fast", to make it clearer that this is not ordinary GUP. The current mixture of "lockless", "gup" and "gup_fast" is confusing. Further, avoid the term "huge" when talking about a "leaf" -- for example, we nowadays check pmd_leaf() because pmd_huge() is gone. For the "hugepd"/"hugepte" stuff, it's part of the name ("is_hugepd"), so that stays. What remains is the "external" interface: * get_user_pages_fast_only() * get_user_pages_fast() * pin_user_pages_fast() The high-level internal functions for GUP-fast (+slow fallback) are now: * internal_get_user_pages_fast() -> gup_fast_fallback() * lockless_pages_from_mm() -> gup_fast() The basic GUP-fast walker functions: * gup_pgd_range() -> gup_fast_pgd_range() * gup_p4d_range() -> gup_fast_p4d_range() * gup_pud_range() -> gup_fast_pud_range() * gup_pmd_range() -> gup_fast_pmd_range() * gup_pte_range() -> gup_fast_pte_range() * gup_huge_pgd() -> gup_fast_pgd_leaf() * gup_huge_pud() -> gup_fast_pud_leaf() * gup_huge_pmd() -> gup_fast_pmd_leaf() The weird hugepd stuff: * gup_huge_pd() -> gup_fast_hugepd() * gup_hugepte() -> gup_fast_hugepte() I just realized that we end up calling these from follow_hugepd() as well. And something seems to be off, because gup_fast_hugepd() won't have the VMA even in the slow-GUP case to pass it to gup_must_unshare(). So these are GUP-fast functions and the terminology seem correct. But the usage from follow_hugepd() is questionable, commit a12083d721d703f985f4403d6b333cc449f838f6 Author: Peter Xu Date: Wed Mar 27 11:23:31 2024 -0400 mm/gup: handle hugepd for follow_page() states "With previous refactors on fast-gup gup_huge_pd(), most of the code can be leveraged", which doesn't look quite true just staring the the gup_must_unshare() call where we don't pass the VMA. Also, "unlikely(pte_val(pte) != pte_val(ptep_get(ptep)" doesn't make any sense for slow GUP ... Yes it's not needed, just doesn't look worthwhile to put another helper on top just for this. I mentioned this in the commit message here: There's something not needed for follow page, for example, gup_hugepte() tries to detect pgtable entry change which will never happen with slow gup (which has the pgtable lock held), but that's not a problem to check. @Peter, any insights? However I think we should pass vma in for sure, I guess I overlooked that, and it didn't expose in my tests too as I probably missed ./cow. I'll prepare a separate patch on top of this series and the gup-fast rename patches (I saw this one just reached mm-stable), and I'll see whether I can test it too if I can find a Power system fast enough. I'll probably drop the "fast" in the hugepd function names too. For the missing VMA parameter, the cow.c test might not trigger it. We never need the VMA to make a pinning decision for anonymous memory. We'll trigger an unsharing fault, get an exclusive anonymous page and can continue. We need the VMA in gup_must_unshare(), when long-term pinning a file hugetlb page. I *think* the gup_longterm.c selftest should trigger that, especially: # [RUN] R/O longterm GUP-fast pin in MAP_SHARED file mapping ... with memfd hugetlb (2048 kB) ... # [RUN] R/O longterm GUP-fast pin in MAP_SHARED file mapping ... with memfd hugetlb (1048576 kB) We need a MAP_SHARED page where the PTE is R/O that we want to long-term pin R/O. I don't remember from the top of my head if the test here might have a R/W-mapped folio. If so, we could extend it to cover that. Hmm, so when I enable 2M hugetlb I found ./cow is even failing on x86. # ./cow | grep -B1 "not ok" # [RUN] vmsplice() + unmap in child ... with hugetlb (2048 kB) not ok 161 No leak from parent into child -- # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with hugetlb (2048 kB) not ok 215 No leak from parent into child -- # [RUN] vmsplice() before fork(), unmap in parent after fork() ... with hugetlb (2048 kB) not ok 269 No leak from child into parent -- # [RUN] vmsplice() + unmap in parent after fork() ... with hugetlb (2048 kB) not ok 323 No leak from child into parent And it looks like it was always failing.. perhaps since the start? We Yes! commit 7dad331be7816103eba8c12caeb88fbd3599c0b9 Author: David Hildenbrand Date: Tue Sep 27 13:01:17 2022 +0200 selftests/vm: anon_cow: hugetlb tests Let's run all existing test cases with all hugetlb sizes we're able to detect. Note that some tests cases still fail. This will, for example, be fixed once vmsplice properly uses FOLL_PIN instead of FOLL_GET for pinning. With 2 MiB and 1 GiB hugetlb on
Re: [PATCH v1 1/3] mm/gup: consistently name GUP-fast functions
On Fri, Apr 26, 2024 at 09:44:58AM -0400, Peter Xu wrote: > On Fri, Apr 26, 2024 at 09:17:47AM +0200, David Hildenbrand wrote: > > On 02.04.24 14:55, David Hildenbrand wrote: > > > Let's consistently call the "fast-only" part of GUP "GUP-fast" and rename > > > all relevant internal functions to start with "gup_fast", to make it > > > clearer that this is not ordinary GUP. The current mixture of > > > "lockless", "gup" and "gup_fast" is confusing. > > > > > > Further, avoid the term "huge" when talking about a "leaf" -- for > > > example, we nowadays check pmd_leaf() because pmd_huge() is gone. For the > > > "hugepd"/"hugepte" stuff, it's part of the name ("is_hugepd"), so that > > > stays. > > > > > > What remains is the "external" interface: > > > * get_user_pages_fast_only() > > > * get_user_pages_fast() > > > * pin_user_pages_fast() > > > > > > The high-level internal functions for GUP-fast (+slow fallback) are now: > > > * internal_get_user_pages_fast() -> gup_fast_fallback() > > > * lockless_pages_from_mm() -> gup_fast() > > > > > > The basic GUP-fast walker functions: > > > * gup_pgd_range() -> gup_fast_pgd_range() > > > * gup_p4d_range() -> gup_fast_p4d_range() > > > * gup_pud_range() -> gup_fast_pud_range() > > > * gup_pmd_range() -> gup_fast_pmd_range() > > > * gup_pte_range() -> gup_fast_pte_range() > > > * gup_huge_pgd() -> gup_fast_pgd_leaf() > > > * gup_huge_pud() -> gup_fast_pud_leaf() > > > * gup_huge_pmd() -> gup_fast_pmd_leaf() > > > > > > The weird hugepd stuff: > > > * gup_huge_pd() -> gup_fast_hugepd() > > > * gup_hugepte() -> gup_fast_hugepte() > > > > I just realized that we end up calling these from follow_hugepd() as well. > > And something seems to be off, because gup_fast_hugepd() won't have the VMA > > even in the slow-GUP case to pass it to gup_must_unshare(). > > > > So these are GUP-fast functions and the terminology seem correct. But the > > usage from follow_hugepd() is questionable, > > > > commit a12083d721d703f985f4403d6b333cc449f838f6 > > Author: Peter Xu > > Date: Wed Mar 27 11:23:31 2024 -0400 > > > > mm/gup: handle hugepd for follow_page() > > > > > > states "With previous refactors on fast-gup gup_huge_pd(), most of the code > > can be leveraged", which doesn't look quite true just staring the the > > gup_must_unshare() call where we don't pass the VMA. Also, > > "unlikely(pte_val(pte) != pte_val(ptep_get(ptep)" doesn't make any sense for > > slow GUP ... > > Yes it's not needed, just doesn't look worthwhile to put another helper on > top just for this. I mentioned this in the commit message here: > > There's something not needed for follow page, for example, gup_hugepte() > tries to detect pgtable entry change which will never happen with slow > gup (which has the pgtable lock held), but that's not a problem to check. > > > > > @Peter, any insights? > > However I think we should pass vma in for sure, I guess I overlooked that, > and it didn't expose in my tests too as I probably missed ./cow. > > I'll prepare a separate patch on top of this series and the gup-fast rename > patches (I saw this one just reached mm-stable), and I'll see whether I can > test it too if I can find a Power system fast enough. I'll probably drop > the "fast" in the hugepd function names too. Hmm, so when I enable 2M hugetlb I found ./cow is even failing on x86. # ./cow | grep -B1 "not ok" # [RUN] vmsplice() + unmap in child ... with hugetlb (2048 kB) not ok 161 No leak from parent into child -- # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with hugetlb (2048 kB) not ok 215 No leak from parent into child -- # [RUN] vmsplice() before fork(), unmap in parent after fork() ... with hugetlb (2048 kB) not ok 269 No leak from child into parent -- # [RUN] vmsplice() + unmap in parent after fork() ... with hugetlb (2048 kB) not ok 323 No leak from child into parent And it looks like it was always failing.. perhaps since the start? We didn't do the same on hugetlb v.s. normal anon from that regard on the vmsplice() fix. I drafted a patch to allow refcount>1 detection as the same, then all tests pass for me, as below. David, I'd like to double check with you before I post anything: is that your intention to do so when working on the R/O pinning or not? Thanks, = >From 7300c249738dadda1457c755b597c1551dfe8dc6 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Fri, 26 Apr 2024 11:41:12 -0400 Subject: [PATCH] mm/hugetlb: Fix vmsplice case on memory leak once more Signed-off-by: Peter Xu --- mm/hugetlb.c | 9 ++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 417fc5cdb6ee..1ca102013561 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -5961,10 +5961,13 @@ static vm_fault_t hugetlb_wp(struct folio *pagecache_folio, retry_avoidcopy: /* -* If no-one else is actually using this page, we're the exclusive -* owner and can reuse this
Re: [PATCH v3 1/3] PCI/AER: Store UNCOR_STATUS bits that might be ANFE in aer_err_info
On Tue, 23 Apr 2024 02:25:05 + "Duan, Zhenzhong" wrote: > >-Original Message- > >From: Jonathan Cameron > >Subject: Re: [PATCH v3 1/3] PCI/AER: Store UNCOR_STATUS bits that might > >be ANFE in aer_err_info > > > >On Wed, 17 Apr 2024 14:14:05 +0800 > >Zhenzhong Duan wrote: > > > >> In some cases the detector of a Non-Fatal Error(NFE) is not the most > >> appropriate agent to determine the type of the error. For example, > >> when software performs a configuration read from a non-existent > >> device or Function, completer will send an ERR_NONFATAL Message. > >> On some platforms, ERR_NONFATAL results in a System Error, which > >> breaks normal software probing. > >> > >> Advisory Non-Fatal Error(ANFE) is a special case that can be used > >> in above scenario. It is predominantly determined by the role of the > >> detecting agent (Requester, Completer, or Receiver) and the specific > >> error. In such cases, an agent with AER signals the NFE (if enabled) > >> by sending an ERR_COR Message as an advisory to software, instead of > >> sending ERR_NONFATAL. > >> > >> When processing an ANFE, ideally both correctable error(CE) status and > >> uncorrectable error(UE) status should be cleared. However, there is no > >> way to fully identify the UE associated with ANFE. Even worse, a Fatal > >> Error(FE) or Non-Fatal Error(NFE) may set the same UE status bit as > >> ANFE. Treating an ANFE as NFE will reproduce above mentioned issue, > >> i.e., breaking softwore probing; treating NFE as ANFE will make us > >> ignoring some UEs which need active recover operation. To avoid clearing > >> UEs that are not ANFE by accident, the most conservative route is taken > >> here: If any of the FE/NFE Detected bits is set in Device Status, do not > >> touch UE status, they should be cleared later by the UE handler. Otherwise, > >> a specific set of UEs that may be raised as ANFE according to the PCIe > >> specification will be cleared if their corresponding severity is Non-Fatal. > >> > >> To achieve above purpose, store UNCOR_STATUS bits that might be ANFE > >> in aer_err_info.anfe_status. So that those bits could be printed and > >> processed later. > >> > >> Tested-by: Yudong Wang > >> Co-developed-by: "Wang, Qingshun" > >> Signed-off-by: "Wang, Qingshun" > >> Signed-off-by: Zhenzhong Duan > >> --- > >> drivers/pci/pci.h | 1 + > >> drivers/pci/pcie/aer.c | 45 > >++ > >> 2 files changed, 46 insertions(+) > >> > >> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h > >> index 17fed1846847..3f9eb807f9fd 100644 > >> --- a/drivers/pci/pci.h > >> +++ b/drivers/pci/pci.h > >> @@ -412,6 +412,7 @@ struct aer_err_info { > >> > >>unsigned int status;/* COR/UNCOR Error Status */ > >>unsigned int mask; /* COR/UNCOR Error Mask */ > >> + unsigned int anfe_status; /* UNCOR Error Status for ANFE */ > >>struct pcie_tlp_log tlp;/* TLP Header */ > >> }; > >> > >> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c > >> index ac6293c24976..27364ab4b148 100644 > >> --- a/drivers/pci/pcie/aer.c > >> +++ b/drivers/pci/pcie/aer.c > >> @@ -107,6 +107,12 @@ struct aer_stats { > >>PCI_ERR_ROOT_MULTI_COR_RCV | > > \ > >>PCI_ERR_ROOT_MULTI_UNCOR_RCV) > >> > >> +#define AER_ERR_ANFE_UNC_MASK > > (PCI_ERR_UNC_POISON_TLP | \ > >> + PCI_ERR_UNC_COMP_TIME | > > \ > >> + PCI_ERR_UNC_COMP_ABORT | > > \ > >> + PCI_ERR_UNC_UNX_COMP | > > \ > >> + PCI_ERR_UNC_UNSUP) > >> + > >> static int pcie_aer_disable; > >> static pci_ers_result_t aer_root_reset(struct pci_dev *dev); > >> > >> @@ -1196,6 +1202,41 @@ void aer_recover_queue(int domain, unsigned > >int bus, unsigned int devfn, > >> EXPORT_SYMBOL_GPL(aer_recover_queue); > >> #endif > >> > >> +static void anfe_get_uc_status(struct pci_dev *dev, struct aer_err_info > >*info) > >> +{ > >> + u32 uncor_mask, uncor_status; > >> + u16 device_status; > >> + int aer = dev->aer_cap; > >> + > >> + if (pcie_capability_read_word(dev, PCI_EXP_DEVSTA, > >_status)) > >> + return; > >> + /* > >> + * Take the most conservative route here. If there are > >> + * Non-Fatal/Fatal errors detected, do not assume any > >> + * bit in uncor_status is set by ANFE. > >> + */ > >> + if (device_status & (PCI_EXP_DEVSTA_NFED | PCI_EXP_DEVSTA_FED)) > >> + return; > >> + > > > >Is there not a race here? If we happen to get either an NFED or FED > >between the read of device_status above and here we might pick up a status > >that corresponds to that (and hence clear something we should not). > > In this scenario, info->anfe_status is 0. OK. In that case what is the point of the check above? If the code is safe to
Re: [PATCH v1 1/3] mm/gup: consistently name GUP-fast functions
On Fri, Apr 26, 2024 at 09:17:47AM +0200, David Hildenbrand wrote: > On 02.04.24 14:55, David Hildenbrand wrote: > > Let's consistently call the "fast-only" part of GUP "GUP-fast" and rename > > all relevant internal functions to start with "gup_fast", to make it > > clearer that this is not ordinary GUP. The current mixture of > > "lockless", "gup" and "gup_fast" is confusing. > > > > Further, avoid the term "huge" when talking about a "leaf" -- for > > example, we nowadays check pmd_leaf() because pmd_huge() is gone. For the > > "hugepd"/"hugepte" stuff, it's part of the name ("is_hugepd"), so that > > stays. > > > > What remains is the "external" interface: > > * get_user_pages_fast_only() > > * get_user_pages_fast() > > * pin_user_pages_fast() > > > > The high-level internal functions for GUP-fast (+slow fallback) are now: > > * internal_get_user_pages_fast() -> gup_fast_fallback() > > * lockless_pages_from_mm() -> gup_fast() > > > > The basic GUP-fast walker functions: > > * gup_pgd_range() -> gup_fast_pgd_range() > > * gup_p4d_range() -> gup_fast_p4d_range() > > * gup_pud_range() -> gup_fast_pud_range() > > * gup_pmd_range() -> gup_fast_pmd_range() > > * gup_pte_range() -> gup_fast_pte_range() > > * gup_huge_pgd() -> gup_fast_pgd_leaf() > > * gup_huge_pud() -> gup_fast_pud_leaf() > > * gup_huge_pmd() -> gup_fast_pmd_leaf() > > > > The weird hugepd stuff: > > * gup_huge_pd() -> gup_fast_hugepd() > > * gup_hugepte() -> gup_fast_hugepte() > > I just realized that we end up calling these from follow_hugepd() as well. > And something seems to be off, because gup_fast_hugepd() won't have the VMA > even in the slow-GUP case to pass it to gup_must_unshare(). > > So these are GUP-fast functions and the terminology seem correct. But the > usage from follow_hugepd() is questionable, > > commit a12083d721d703f985f4403d6b333cc449f838f6 > Author: Peter Xu > Date: Wed Mar 27 11:23:31 2024 -0400 > > mm/gup: handle hugepd for follow_page() > > > states "With previous refactors on fast-gup gup_huge_pd(), most of the code > can be leveraged", which doesn't look quite true just staring the the > gup_must_unshare() call where we don't pass the VMA. Also, > "unlikely(pte_val(pte) != pte_val(ptep_get(ptep)" doesn't make any sense for > slow GUP ... Yes it's not needed, just doesn't look worthwhile to put another helper on top just for this. I mentioned this in the commit message here: There's something not needed for follow page, for example, gup_hugepte() tries to detect pgtable entry change which will never happen with slow gup (which has the pgtable lock held), but that's not a problem to check. > > @Peter, any insights? However I think we should pass vma in for sure, I guess I overlooked that, and it didn't expose in my tests too as I probably missed ./cow. I'll prepare a separate patch on top of this series and the gup-fast rename patches (I saw this one just reached mm-stable), and I'll see whether I can test it too if I can find a Power system fast enough. I'll probably drop the "fast" in the hugepd function names too. Thanks, -- Peter Xu
[PATCH v6 16/16] bpf: remove CONFIG_BPF_JIT dependency on CONFIG_MODULES of
From: "Mike Rapoport (IBM)" BPF just-in-time compiler depended on CONFIG_MODULES because it used module_alloc() to allocate memory for the generated code. Since code allocations are now implemented with execmem, drop dependency of CONFIG_BPF_JIT on CONFIG_MODULES and make it select CONFIG_EXECMEM. Suggested-by: Björn Töpel Signed-off-by: Mike Rapoport (IBM) --- kernel/bpf/Kconfig | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/bpf/Kconfig b/kernel/bpf/Kconfig index bc25f5098a25..f999e4e0b344 100644 --- a/kernel/bpf/Kconfig +++ b/kernel/bpf/Kconfig @@ -43,7 +43,7 @@ config BPF_JIT bool "Enable BPF Just In Time compiler" depends on BPF depends on HAVE_CBPF_JIT || HAVE_EBPF_JIT - depends on MODULES + select EXECMEM help BPF programs are normally handled by a BPF interpreter. This option allows the kernel to generate native code when a program is loaded -- 2.43.0
[PATCH v6 15/16] kprobes: remove dependency on CONFIG_MODULES
From: "Mike Rapoport (IBM)" kprobes depended on CONFIG_MODULES because it has to allocate memory for code. Since code allocations are now implemented with execmem, kprobes can be enabled in non-modular kernels. Add #ifdef CONFIG_MODULE guards for the code dealing with kprobes inside modules, make CONFIG_KPROBES select CONFIG_EXECMEM and drop the dependency of CONFIG_KPROBES on CONFIG_MODULES. Signed-off-by: Mike Rapoport (IBM) --- arch/Kconfig| 2 +- include/linux/module.h | 9 ++ kernel/kprobes.c| 55 +++-- kernel/trace/trace_kprobe.c | 20 +- 4 files changed, 63 insertions(+), 23 deletions(-) diff --git a/arch/Kconfig b/arch/Kconfig index 4fd0daa54e6c..caa459964f09 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -52,9 +52,9 @@ config GENERIC_ENTRY config KPROBES bool "Kprobes" - depends on MODULES depends on HAVE_KPROBES select KALLSYMS + select EXECMEM select TASKS_RCU if PREEMPTION help Kprobes allows you to trap at almost any kernel address and diff --git a/include/linux/module.h b/include/linux/module.h index 1153b0d99a80..ffa1c603163c 100644 --- a/include/linux/module.h +++ b/include/linux/module.h @@ -605,6 +605,11 @@ static inline bool module_is_live(struct module *mod) return mod->state != MODULE_STATE_GOING; } +static inline bool module_is_coming(struct module *mod) +{ +return mod->state == MODULE_STATE_COMING; +} + struct module *__module_text_address(unsigned long addr); struct module *__module_address(unsigned long addr); bool is_module_address(unsigned long addr); @@ -857,6 +862,10 @@ void *dereference_module_function_descriptor(struct module *mod, void *ptr) return ptr; } +static inline bool module_is_coming(struct module *mod) +{ + return false; +} #endif /* CONFIG_MODULES */ #ifdef CONFIG_SYSFS diff --git a/kernel/kprobes.c b/kernel/kprobes.c index ddd7cdc16edf..ca2c6cbd42d2 100644 --- a/kernel/kprobes.c +++ b/kernel/kprobes.c @@ -1588,7 +1588,7 @@ static int check_kprobe_address_safe(struct kprobe *p, } /* Get module refcount and reject __init functions for loaded modules. */ - if (*probed_mod) { + if (IS_ENABLED(CONFIG_MODULES) && *probed_mod) { /* * We must hold a refcount of the probed module while updating * its code to prohibit unexpected unloading. @@ -1603,12 +1603,13 @@ static int check_kprobe_address_safe(struct kprobe *p, * kprobes in there. */ if (within_module_init((unsigned long)p->addr, *probed_mod) && - (*probed_mod)->state != MODULE_STATE_COMING) { + !module_is_coming(*probed_mod)) { module_put(*probed_mod); *probed_mod = NULL; ret = -ENOENT; } } + out: preempt_enable(); jump_label_unlock(); @@ -2488,24 +2489,6 @@ int kprobe_add_area_blacklist(unsigned long start, unsigned long end) return 0; } -/* Remove all symbols in given area from kprobe blacklist */ -static void kprobe_remove_area_blacklist(unsigned long start, unsigned long end) -{ - struct kprobe_blacklist_entry *ent, *n; - - list_for_each_entry_safe(ent, n, _blacklist, list) { - if (ent->start_addr < start || ent->start_addr >= end) - continue; - list_del(>list); - kfree(ent); - } -} - -static void kprobe_remove_ksym_blacklist(unsigned long entry) -{ - kprobe_remove_area_blacklist(entry, entry + 1); -} - int __weak arch_kprobe_get_kallsym(unsigned int *symnum, unsigned long *value, char *type, char *sym) { @@ -2570,6 +2553,25 @@ static int __init populate_kprobe_blacklist(unsigned long *start, return ret ? : arch_populate_kprobe_blacklist(); } +#ifdef CONFIG_MODULES +/* Remove all symbols in given area from kprobe blacklist */ +static void kprobe_remove_area_blacklist(unsigned long start, unsigned long end) +{ + struct kprobe_blacklist_entry *ent, *n; + + list_for_each_entry_safe(ent, n, _blacklist, list) { + if (ent->start_addr < start || ent->start_addr >= end) + continue; + list_del(>list); + kfree(ent); + } +} + +static void kprobe_remove_ksym_blacklist(unsigned long entry) +{ + kprobe_remove_area_blacklist(entry, entry + 1); +} + static void add_module_kprobe_blacklist(struct module *mod) { unsigned long start, end; @@ -2672,6 +2674,17 @@ static struct notifier_block kprobe_module_nb = { .priority = 0 }; +static int kprobe_register_module_notifier(void) +{ + return register_module_notifier(_module_nb); +} +#else +static int kprobe_register_module_notifier(void) +{ +
[PATCH v6 14/16] powerpc: use CONFIG_EXECMEM instead of CONFIG_MODULES where appropriate
From: "Mike Rapoport (IBM)" There are places where CONFIG_MODULES guards the code that depends on memory allocation being done with module_alloc(). Replace CONFIG_MODULES with CONFIG_EXECMEM in such places. Signed-off-by: Mike Rapoport (IBM) --- arch/powerpc/Kconfig | 2 +- arch/powerpc/include/asm/kasan.h | 2 +- arch/powerpc/kernel/head_8xx.S | 4 ++-- arch/powerpc/kernel/head_book3s_32.S | 6 +++--- arch/powerpc/lib/code-patching.c | 2 +- arch/powerpc/mm/book3s32/mmu.c | 2 +- 6 files changed, 9 insertions(+), 9 deletions(-) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 1c4be3373686..2e586733a464 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -285,7 +285,7 @@ config PPC select IOMMU_HELPER if PPC64 select IRQ_DOMAIN select IRQ_FORCED_THREADING - select KASAN_VMALLOCif KASAN && MODULES + select KASAN_VMALLOCif KASAN && EXECMEM select LOCK_MM_AND_FIND_VMA select MMU_GATHER_PAGE_SIZE select MMU_GATHER_RCU_TABLE_FREE diff --git a/arch/powerpc/include/asm/kasan.h b/arch/powerpc/include/asm/kasan.h index 365d2720097c..b5bbb94c51f6 100644 --- a/arch/powerpc/include/asm/kasan.h +++ b/arch/powerpc/include/asm/kasan.h @@ -19,7 +19,7 @@ #define KASAN_SHADOW_SCALE_SHIFT 3 -#if defined(CONFIG_MODULES) && defined(CONFIG_PPC32) +#if defined(CONFIG_EXECMEM) && defined(CONFIG_PPC32) #define KASAN_KERN_START ALIGN_DOWN(PAGE_OFFSET - SZ_256M, SZ_256M) #else #define KASAN_KERN_START PAGE_OFFSET diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S index 647b0b445e89..edc479a7c2bc 100644 --- a/arch/powerpc/kernel/head_8xx.S +++ b/arch/powerpc/kernel/head_8xx.S @@ -199,12 +199,12 @@ instruction_counter: mfspr r10, SPRN_SRR0 /* Get effective address of fault */ INVALIDATE_ADJACENT_PAGES_CPU15(r10, r11) mtspr SPRN_MD_EPN, r10 -#ifdef CONFIG_MODULES +#ifdef CONFIG_EXECMEM mfcrr11 compare_to_kernel_boundary r10, r10 #endif mfspr r10, SPRN_M_TWB /* Get level 1 table */ -#ifdef CONFIG_MODULES +#ifdef CONFIG_EXECMEM blt+3f rlwinm r10, r10, 0, 20, 31 orisr10, r10, (swapper_pg_dir - PAGE_OFFSET)@ha diff --git a/arch/powerpc/kernel/head_book3s_32.S b/arch/powerpc/kernel/head_book3s_32.S index c1d89764dd22..57196883a00e 100644 --- a/arch/powerpc/kernel/head_book3s_32.S +++ b/arch/powerpc/kernel/head_book3s_32.S @@ -419,14 +419,14 @@ InstructionTLBMiss: */ /* Get PTE (linux-style) and check access */ mfspr r3,SPRN_IMISS -#ifdef CONFIG_MODULES +#ifdef CONFIG_EXECMEM lis r1, TASK_SIZE@h /* check if kernel address */ cmplw 0,r1,r3 #endif mfspr r2, SPRN_SDR1 li r1,_PAGE_PRESENT | _PAGE_ACCESSED | _PAGE_EXEC rlwinm r2, r2, 28, 0xf000 -#ifdef CONFIG_MODULES +#ifdef CONFIG_EXECMEM li r0, 3 bgt-112f lis r2, (swapper_pg_dir - PAGE_OFFSET)@ha /* if kernel address, use */ @@ -442,7 +442,7 @@ InstructionTLBMiss: andc. r1,r1,r2/* check access & ~permission */ bne-InstructionAddressInvalid /* return if access not permitted */ /* Convert linux-style PTE to low word of PPC-style PTE */ -#ifdef CONFIG_MODULES +#ifdef CONFIG_EXECMEM rlwimi r2, r0, 0, 31, 31 /* userspace ? -> PP lsb */ #endif ori r1, r1, 0xe06 /* clear out reserved bits */ diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c index c6ab46156cda..7af791446ddf 100644 --- a/arch/powerpc/lib/code-patching.c +++ b/arch/powerpc/lib/code-patching.c @@ -225,7 +225,7 @@ void __init poking_init(void) static unsigned long get_patch_pfn(void *addr) { - if (IS_ENABLED(CONFIG_MODULES) && is_vmalloc_or_module_addr(addr)) + if (IS_ENABLED(CONFIG_EXECMEM) && is_vmalloc_or_module_addr(addr)) return vmalloc_to_pfn(addr); else return __pa_symbol(addr) >> PAGE_SHIFT; diff --git a/arch/powerpc/mm/book3s32/mmu.c b/arch/powerpc/mm/book3s32/mmu.c index 100f999871bc..625fe7d08e06 100644 --- a/arch/powerpc/mm/book3s32/mmu.c +++ b/arch/powerpc/mm/book3s32/mmu.c @@ -184,7 +184,7 @@ unsigned long __init mmu_mapin_ram(unsigned long base, unsigned long top) static bool is_module_segment(unsigned long addr) { - if (!IS_ENABLED(CONFIG_MODULES)) + if (!IS_ENABLED(CONFIG_EXECMEM)) return false; if (addr < ALIGN_DOWN(MODULES_VADDR, SZ_256M)) return false; -- 2.43.0
[PATCH v6 13/16] x86/ftrace: enable dynamic ftrace without CONFIG_MODULES
From: "Mike Rapoport (IBM)" Dynamic ftrace must allocate memory for code and this was impossible without CONFIG_MODULES. With execmem separated from the modules code, execmem_text_alloc() is available regardless of CONFIG_MODULES. Remove dependency of dynamic ftrace on CONFIG_MODULES and make CONFIG_DYNAMIC_FTRACE select CONFIG_EXECMEM in Kconfig. Signed-off-by: Mike Rapoport (IBM) --- arch/x86/Kconfig | 1 + arch/x86/kernel/ftrace.c | 10 -- 2 files changed, 1 insertion(+), 10 deletions(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 4474bf32d0a4..f2917ccf4fb4 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -34,6 +34,7 @@ config X86_64 select SWIOTLB select ARCH_HAS_ELFCORE_COMPAT select ZONE_DMA32 + select EXECMEM if DYNAMIC_FTRACE config FORCE_DYNAMIC_FTRACE def_bool y diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c index c8ddb7abda7c..8da0e66ca22d 100644 --- a/arch/x86/kernel/ftrace.c +++ b/arch/x86/kernel/ftrace.c @@ -261,8 +261,6 @@ void arch_ftrace_update_code(int command) /* Currently only x86_64 supports dynamic trampolines */ #ifdef CONFIG_X86_64 -#ifdef CONFIG_MODULES -/* Module allocation simplifies allocating memory for code */ static inline void *alloc_tramp(unsigned long size) { return execmem_alloc(EXECMEM_FTRACE, size); @@ -271,14 +269,6 @@ static inline void tramp_free(void *tramp) { execmem_free(tramp); } -#else -/* Trampolines can only be created if modules are supported */ -static inline void *alloc_tramp(unsigned long size) -{ - return NULL; -} -static inline void tramp_free(void *tramp) { } -#endif /* Defined as markers to the end of the ftrace default trampolines */ extern void ftrace_regs_caller_end(void); -- 2.43.0
[PATCH v6 12/16] arch: make execmem setup available regardless of CONFIG_MODULES
From: "Mike Rapoport (IBM)" execmem does not depend on modules, on the contrary modules use execmem. To make execmem available when CONFIG_MODULES=n, for instance for kprobes, split execmem_params initialization out from arch/*/kernel/module.c and compile it when CONFIG_EXECMEM=y Signed-off-by: Mike Rapoport (IBM) --- arch/arm/kernel/module.c | 43 -- arch/arm/mm/init.c | 45 +++ arch/arm64/kernel/module.c | 140 - arch/arm64/mm/init.c | 140 + arch/loongarch/kernel/module.c | 19 - arch/loongarch/mm/init.c | 21 + arch/mips/kernel/module.c | 22 -- arch/mips/mm/init.c| 23 ++ arch/nios2/kernel/module.c | 20 - arch/nios2/mm/init.c | 21 + arch/parisc/kernel/module.c| 20 - arch/parisc/mm/init.c | 23 +- arch/powerpc/kernel/module.c | 63 --- arch/powerpc/mm/mem.c | 64 +++ arch/riscv/kernel/module.c | 44 --- arch/riscv/mm/init.c | 45 +++ arch/s390/kernel/module.c | 27 --- arch/s390/mm/init.c| 30 +++ arch/sparc/kernel/module.c | 19 - arch/sparc/mm/Makefile | 2 + arch/sparc/mm/execmem.c| 21 + arch/x86/kernel/module.c | 27 --- arch/x86/mm/init.c | 29 +++ 23 files changed, 463 insertions(+), 445 deletions(-) create mode 100644 arch/sparc/mm/execmem.c diff --git a/arch/arm/kernel/module.c b/arch/arm/kernel/module.c index a98fdf6ff26c..677f218f7e84 100644 --- a/arch/arm/kernel/module.c +++ b/arch/arm/kernel/module.c @@ -12,57 +12,14 @@ #include #include #include -#include #include #include -#include -#include #include #include #include #include -#ifdef CONFIG_XIP_KERNEL -/* - * The XIP kernel text is mapped in the module area for modules and - * some other stuff to work without any indirect relocations. - * MODULES_VADDR is redefined here and not in asm/memory.h to avoid - * recompiling the whole kernel when CONFIG_XIP_KERNEL is turned on/off. - */ -#undef MODULES_VADDR -#define MODULES_VADDR (((unsigned long)_exiprom + ~PMD_MASK) & PMD_MASK) -#endif - -#ifdef CONFIG_MMU -static struct execmem_info execmem_info __ro_after_init; - -struct execmem_info __init *execmem_arch_setup(void) -{ - unsigned long fallback_start = 0, fallback_end = 0; - - if (IS_ENABLED(CONFIG_ARM_MODULE_PLTS)) { - fallback_start = VMALLOC_START; - fallback_end = VMALLOC_END; - } - - execmem_info = (struct execmem_info){ - .ranges = { - [EXECMEM_DEFAULT] = { - .start = MODULES_VADDR, - .end= MODULES_END, - .pgprot = PAGE_KERNEL_EXEC, - .alignment = 1, - .fallback_start = fallback_start, - .fallback_end = fallback_end, - }, - }, - }; - - return _info; -} -#endif - bool module_init_section(const char *name) { return strstarts(name, ".init") || diff --git a/arch/arm/mm/init.c b/arch/arm/mm/init.c index e8c6f4be0ce1..5345d218899a 100644 --- a/arch/arm/mm/init.c +++ b/arch/arm/mm/init.c @@ -22,6 +22,7 @@ #include #include #include +#include #include #include @@ -486,3 +487,47 @@ void free_initrd_mem(unsigned long start, unsigned long end) free_reserved_area((void *)start, (void *)end, -1, "initrd"); } #endif + +#ifdef CONFIG_EXECMEM + +#ifdef CONFIG_XIP_KERNEL +/* + * The XIP kernel text is mapped in the module area for modules and + * some other stuff to work without any indirect relocations. + * MODULES_VADDR is redefined here and not in asm/memory.h to avoid + * recompiling the whole kernel when CONFIG_XIP_KERNEL is turned on/off. + */ +#undef MODULES_VADDR +#define MODULES_VADDR (((unsigned long)_exiprom + ~PMD_MASK) & PMD_MASK) +#endif + +#ifdef CONFIG_MMU +static struct execmem_info execmem_info __ro_after_init; + +struct execmem_info __init *execmem_arch_setup(void) +{ + unsigned long fallback_start = 0, fallback_end = 0; + + if (IS_ENABLED(CONFIG_ARM_MODULE_PLTS)) { + fallback_start = VMALLOC_START; + fallback_end = VMALLOC_END; + } + + execmem_info = (struct execmem_info){ + .ranges = { + [EXECMEM_DEFAULT] = { + .start = MODULES_VADDR, + .end= MODULES_END, + .pgprot = PAGE_KERNEL_EXEC, + .alignment = 1, + .fallback_start = fallback_start, + .fallback_end = fallback_end, + }, + }, +
[PATCH v6 11/16] powerpc: extend execmem_params for kprobes allocations
From: "Mike Rapoport (IBM)" powerpc overrides kprobes::alloc_insn_page() to remove writable permissions when STRICT_MODULE_RWX is on. Add definition of EXECMEM_KRPOBES to execmem_params to allow using the generic kprobes::alloc_insn_page() with the desired permissions. As powerpc uses breakpoint instructions to inject kprobes, it does not need to constrain kprobe allocations to the modules area and can use the entire vmalloc address space. Signed-off-by: Mike Rapoport (IBM) --- arch/powerpc/kernel/kprobes.c | 20 arch/powerpc/kernel/module.c | 7 +++ 2 files changed, 7 insertions(+), 20 deletions(-) diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c index 9fcd01bb2ce6..14c5ddec3056 100644 --- a/arch/powerpc/kernel/kprobes.c +++ b/arch/powerpc/kernel/kprobes.c @@ -126,26 +126,6 @@ kprobe_opcode_t *arch_adjust_kprobe_addr(unsigned long addr, unsigned long offse return (kprobe_opcode_t *)(addr + offset); } -void *alloc_insn_page(void) -{ - void *page; - - page = execmem_alloc(EXECMEM_KPROBES, PAGE_SIZE); - if (!page) - return NULL; - - if (strict_module_rwx_enabled()) { - int err = set_memory_rox((unsigned long)page, 1); - - if (err) - goto error; - } - return page; -error: - execmem_free(page); - return NULL; -} - int arch_prepare_kprobe(struct kprobe *p) { int ret = 0; diff --git a/arch/powerpc/kernel/module.c b/arch/powerpc/kernel/module.c index ac80559015a3..2a23cf7e141b 100644 --- a/arch/powerpc/kernel/module.c +++ b/arch/powerpc/kernel/module.c @@ -94,6 +94,7 @@ static struct execmem_info execmem_info __ro_after_init; struct execmem_info __init *execmem_arch_setup(void) { + pgprot_t kprobes_prot = strict_module_rwx_enabled() ? PAGE_KERNEL_ROX : PAGE_KERNEL_EXEC; pgprot_t prot = strict_module_rwx_enabled() ? PAGE_KERNEL : PAGE_KERNEL_EXEC; unsigned long fallback_start = 0, fallback_end = 0; unsigned long start, end; @@ -132,6 +133,12 @@ struct execmem_info __init *execmem_arch_setup(void) .fallback_start = fallback_start, .fallback_end = fallback_end, }, + [EXECMEM_KPROBES] = { + .start = VMALLOC_START, + .end= VMALLOC_END, + .pgprot = kprobes_prot, + .alignment = 1, + }, [EXECMEM_MODULE_DATA] = { .start = VMALLOC_START, .end= VMALLOC_END, -- 2.43.0
[PATCH v6 10/16] arm64: extend execmem_info for generated code allocations
From: "Mike Rapoport (IBM)" The memory allocations for kprobes and BPF on arm64 can be placed anywhere in vmalloc address space and currently this is implemented with overrides of alloc_insn_page() and bpf_jit_alloc_exec() in arm64. Define EXECMEM_KPROBES and EXECMEM_BPF ranges in arm64::execmem_info and drop overrides of alloc_insn_page() and bpf_jit_alloc_exec(). Signed-off-by: Mike Rapoport (IBM) Acked-by: Will Deacon --- arch/arm64/kernel/module.c | 12 arch/arm64/kernel/probes/kprobes.c | 7 --- arch/arm64/net/bpf_jit_comp.c | 11 --- 3 files changed, 12 insertions(+), 18 deletions(-) diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c index b7a7a23f9f8f..a52240ea084b 100644 --- a/arch/arm64/kernel/module.c +++ b/arch/arm64/kernel/module.c @@ -146,6 +146,18 @@ struct execmem_info __init *execmem_arch_setup(void) .fallback_start = fallback_start, .fallback_end = fallback_end, }, + [EXECMEM_KPROBES] = { + .start = VMALLOC_START, + .end= VMALLOC_END, + .pgprot = PAGE_KERNEL_ROX, + .alignment = 1, + }, + [EXECMEM_BPF] = { + .start = VMALLOC_START, + .end= VMALLOC_END, + .pgprot = PAGE_KERNEL, + .alignment = 1, + }, }, }; diff --git a/arch/arm64/kernel/probes/kprobes.c b/arch/arm64/kernel/probes/kprobes.c index 327855a11df2..4268678d0e86 100644 --- a/arch/arm64/kernel/probes/kprobes.c +++ b/arch/arm64/kernel/probes/kprobes.c @@ -129,13 +129,6 @@ int __kprobes arch_prepare_kprobe(struct kprobe *p) return 0; } -void *alloc_insn_page(void) -{ - return __vmalloc_node_range(PAGE_SIZE, 1, VMALLOC_START, VMALLOC_END, - GFP_KERNEL, PAGE_KERNEL_ROX, VM_FLUSH_RESET_PERMS, - NUMA_NO_NODE, __builtin_return_address(0)); -} - /* arm kprobe: install breakpoint in text */ void __kprobes arch_arm_kprobe(struct kprobe *p) { diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c index 122021f9bdfc..456f5af239fc 100644 --- a/arch/arm64/net/bpf_jit_comp.c +++ b/arch/arm64/net/bpf_jit_comp.c @@ -1793,17 +1793,6 @@ u64 bpf_jit_alloc_exec_limit(void) return VMALLOC_END - VMALLOC_START; } -void *bpf_jit_alloc_exec(unsigned long size) -{ - /* Memory is intended to be executable, reset the pointer tag. */ - return kasan_reset_tag(vmalloc(size)); -} - -void bpf_jit_free_exec(void *addr) -{ - return vfree(addr); -} - /* Indicate the JIT backend supports mixing bpf2bpf and tailcalls. */ bool bpf_jit_supports_subprog_tailcalls(void) { -- 2.43.0
[PATCH v6 09/16] riscv: extend execmem_params for generated code allocations
From: "Mike Rapoport (IBM)" The memory allocations for kprobes and BPF on RISC-V are not placed in the modules area and these custom allocations are implemented with overrides of alloc_insn_page() and bpf_jit_alloc_exec(). Slightly reorder execmem_params initialization to support both 32 and 64 bit variants, define EXECMEM_KPROBES and EXECMEM_BPF ranges in riscv::execmem_params and drop overrides of alloc_insn_page() and bpf_jit_alloc_exec(). Signed-off-by: Mike Rapoport (IBM) Reviewed-by: Alexandre Ghiti --- arch/riscv/kernel/module.c | 28 +--- arch/riscv/kernel/probes/kprobes.c | 10 -- arch/riscv/net/bpf_jit_core.c | 13 - 3 files changed, 25 insertions(+), 26 deletions(-) diff --git a/arch/riscv/kernel/module.c b/arch/riscv/kernel/module.c index 182904127ba0..2ecbacbc9993 100644 --- a/arch/riscv/kernel/module.c +++ b/arch/riscv/kernel/module.c @@ -906,19 +906,41 @@ int apply_relocate_add(Elf_Shdr *sechdrs, const char *strtab, return 0; } -#if defined(CONFIG_MMU) && defined(CONFIG_64BIT) +#ifdef CONFIG_MMU static struct execmem_info execmem_info __ro_after_init; struct execmem_info __init *execmem_arch_setup(void) { + unsigned long start, end; + + if (IS_ENABLED(CONFIG_64BIT)) { + start = MODULES_VADDR; + end = MODULES_END; + } else { + start = VMALLOC_START; + end = VMALLOC_END; + } + execmem_info = (struct execmem_info){ .ranges = { [EXECMEM_DEFAULT] = { - .start = MODULES_VADDR, - .end= MODULES_END, + .start = start, + .end= end, .pgprot = PAGE_KERNEL, .alignment = 1, }, + [EXECMEM_KPROBES] = { + .start = VMALLOC_START, + .end= VMALLOC_END, + .pgprot = PAGE_KERNEL_READ_EXEC, + .alignment = 1, + }, + [EXECMEM_BPF] = { + .start = BPF_JIT_REGION_START, + .end= BPF_JIT_REGION_END, + .pgprot = PAGE_KERNEL, + .alignment = PAGE_SIZE, + }, }, }; diff --git a/arch/riscv/kernel/probes/kprobes.c b/arch/riscv/kernel/probes/kprobes.c index 2f08c14a933d..e64f2f3064eb 100644 --- a/arch/riscv/kernel/probes/kprobes.c +++ b/arch/riscv/kernel/probes/kprobes.c @@ -104,16 +104,6 @@ int __kprobes arch_prepare_kprobe(struct kprobe *p) return 0; } -#ifdef CONFIG_MMU -void *alloc_insn_page(void) -{ - return __vmalloc_node_range(PAGE_SIZE, 1, VMALLOC_START, VMALLOC_END, -GFP_KERNEL, PAGE_KERNEL_READ_EXEC, -VM_FLUSH_RESET_PERMS, NUMA_NO_NODE, -__builtin_return_address(0)); -} -#endif - /* install breakpoint in text */ void __kprobes arch_arm_kprobe(struct kprobe *p) { diff --git a/arch/riscv/net/bpf_jit_core.c b/arch/riscv/net/bpf_jit_core.c index 6b3acac30c06..e238fdbd5dbc 100644 --- a/arch/riscv/net/bpf_jit_core.c +++ b/arch/riscv/net/bpf_jit_core.c @@ -219,19 +219,6 @@ u64 bpf_jit_alloc_exec_limit(void) return BPF_JIT_REGION_SIZE; } -void *bpf_jit_alloc_exec(unsigned long size) -{ - return __vmalloc_node_range(size, PAGE_SIZE, BPF_JIT_REGION_START, - BPF_JIT_REGION_END, GFP_KERNEL, - PAGE_KERNEL, 0, NUMA_NO_NODE, - __builtin_return_address(0)); -} - -void bpf_jit_free_exec(void *addr) -{ - return vfree(addr); -} - void *bpf_arch_text_copy(void *dst, void *src, size_t len) { int ret; -- 2.43.0
[PATCH v6 08/16] mm/execmem, arch: convert remaining overrides of module_alloc to execmem
From: "Mike Rapoport (IBM)" Extend execmem parameters to accommodate more complex overrides of module_alloc() by architectures. This includes specification of a fallback range required by arm, arm64 and powerpc, EXECMEM_MODULE_DATA type required by powerpc, support for allocation of KASAN shadow required by s390 and x86 and support for late initialization of execmem required by arm64. The core implementation of execmem_alloc() takes care of suppressing warnings when the initial allocation fails but there is a fallback range defined. Signed-off-by: Mike Rapoport (IBM) Acked-by: Will Deacon --- arch/Kconfig | 8 arch/arm/kernel/module.c | 41 arch/arm64/Kconfig | 1 + arch/arm64/kernel/module.c | 55 ++ arch/powerpc/kernel/module.c | 60 +++-- arch/s390/kernel/module.c| 54 +++--- arch/x86/kernel/module.c | 70 +++-- include/linux/execmem.h | 30 ++- include/linux/moduleloader.h | 12 -- kernel/module/main.c | 26 +++-- mm/execmem.c | 75 ++-- 11 files changed, 247 insertions(+), 185 deletions(-) diff --git a/arch/Kconfig b/arch/Kconfig index 65afb1de48b3..4fd0daa54e6c 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -960,6 +960,14 @@ config ARCH_WANTS_MODULES_DATA_IN_VMALLOC For architectures like powerpc/32 which have constraints on module allocation and need to allocate module data outside of module area. +config ARCH_WANTS_EXECMEM_LATE + bool + help + For architectures that do not allocate executable memory early on + boot, but rather require its initialization late when there is + enough entropy for module space randomization, for instance + arm64. + config HAVE_IRQ_EXIT_ON_IRQ_STACK bool help diff --git a/arch/arm/kernel/module.c b/arch/arm/kernel/module.c index e74d84f58b77..a98fdf6ff26c 100644 --- a/arch/arm/kernel/module.c +++ b/arch/arm/kernel/module.c @@ -16,6 +16,7 @@ #include #include #include +#include #include #include @@ -34,23 +35,31 @@ #endif #ifdef CONFIG_MMU -void *module_alloc(unsigned long size) +static struct execmem_info execmem_info __ro_after_init; + +struct execmem_info __init *execmem_arch_setup(void) { - gfp_t gfp_mask = GFP_KERNEL; - void *p; - - /* Silence the initial allocation */ - if (IS_ENABLED(CONFIG_ARM_MODULE_PLTS)) - gfp_mask |= __GFP_NOWARN; - - p = __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END, - gfp_mask, PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE, - __builtin_return_address(0)); - if (!IS_ENABLED(CONFIG_ARM_MODULE_PLTS) || p) - return p; - return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END, - GFP_KERNEL, PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE, - __builtin_return_address(0)); + unsigned long fallback_start = 0, fallback_end = 0; + + if (IS_ENABLED(CONFIG_ARM_MODULE_PLTS)) { + fallback_start = VMALLOC_START; + fallback_end = VMALLOC_END; + } + + execmem_info = (struct execmem_info){ + .ranges = { + [EXECMEM_DEFAULT] = { + .start = MODULES_VADDR, + .end= MODULES_END, + .pgprot = PAGE_KERNEL_EXEC, + .alignment = 1, + .fallback_start = fallback_start, + .fallback_end = fallback_end, + }, + }, + }; + + return _info; } #endif diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 7b11c98b3e84..74b34a78b7ac 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -105,6 +105,7 @@ config ARM64 select ARCH_WANT_FRAME_POINTERS select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36) select ARCH_WANT_LD_ORPHAN_WARN + select ARCH_WANTS_EXECMEM_LATE if EXECMEM select ARCH_WANTS_NO_INSTR select ARCH_WANTS_THP_SWAP if ARM64_4K_PAGES select ARCH_HAS_UBSAN diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c index e92da4da1b2a..b7a7a23f9f8f 100644 --- a/arch/arm64/kernel/module.c +++ b/arch/arm64/kernel/module.c @@ -20,6 +20,7 @@ #include #include #include +#include #include #include @@ -108,41 +109,47 @@ static int __init module_init_limits(void) return 0; } -subsys_initcall(module_init_limits); -void *module_alloc(unsigned long size) +static struct execmem_info execmem_info __ro_after_init; + +struct execmem_info __init *execmem_arch_setup(void) { - void *p = NULL; +
[PATCH v6 07/16] mm/execmem, arch: convert simple overrides of module_alloc to execmem
From: "Mike Rapoport (IBM)" Several architectures override module_alloc() only to define address range for code allocations different than VMALLOC address space. Provide a generic implementation in execmem that uses the parameters for address space ranges, required alignment and page protections provided by architectures. The architectures must fill execmem_info structure and implement execmem_arch_setup() that returns a pointer to that structure. This way the execmem initialization won't be called from every architecture, but rather from a central place, namely a core_initcall() in execmem. The execmem provides execmem_alloc() API that wraps __vmalloc_node_range() with the parameters defined by the architectures. If an architecture does not implement execmem_arch_setup(), execmem_alloc() will fall back to module_alloc(). Signed-off-by: Mike Rapoport (IBM) --- arch/loongarch/kernel/module.c | 19 -- arch/mips/kernel/module.c | 20 -- arch/nios2/kernel/module.c | 21 --- arch/parisc/kernel/module.c| 24 arch/riscv/kernel/module.c | 24 arch/sparc/kernel/module.c | 20 -- include/linux/execmem.h| 47 mm/execmem.c | 67 -- mm/mm_init.c | 2 + 9 files changed, 210 insertions(+), 34 deletions(-) diff --git a/arch/loongarch/kernel/module.c b/arch/loongarch/kernel/module.c index c7d0338d12c1..ca6dd7ea1610 100644 --- a/arch/loongarch/kernel/module.c +++ b/arch/loongarch/kernel/module.c @@ -18,6 +18,7 @@ #include #include #include +#include #include #include #include @@ -490,10 +491,22 @@ int apply_relocate_add(Elf_Shdr *sechdrs, const char *strtab, return 0; } -void *module_alloc(unsigned long size) +static struct execmem_info execmem_info __ro_after_init; + +struct execmem_info __init *execmem_arch_setup(void) { - return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END, - GFP_KERNEL, PAGE_KERNEL, 0, NUMA_NO_NODE, __builtin_return_address(0)); + execmem_info = (struct execmem_info){ + .ranges = { + [EXECMEM_DEFAULT] = { + .start = MODULES_VADDR, + .end= MODULES_END, + .pgprot = PAGE_KERNEL, + .alignment = 1, + }, + }, + }; + + return _info; } static void module_init_ftrace_plt(const Elf_Ehdr *hdr, diff --git a/arch/mips/kernel/module.c b/arch/mips/kernel/module.c index 9a6c96014904..59225a3cf918 100644 --- a/arch/mips/kernel/module.c +++ b/arch/mips/kernel/module.c @@ -20,6 +20,7 @@ #include #include #include +#include #include struct mips_hi16 { @@ -32,11 +33,22 @@ static LIST_HEAD(dbe_list); static DEFINE_SPINLOCK(dbe_lock); #ifdef MODULES_VADDR -void *module_alloc(unsigned long size) +static struct execmem_info execmem_info __ro_after_init; + +struct execmem_info __init *execmem_arch_setup(void) { - return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END, - GFP_KERNEL, PAGE_KERNEL, 0, NUMA_NO_NODE, - __builtin_return_address(0)); + execmem_info = (struct execmem_info){ + .ranges = { + [EXECMEM_DEFAULT] = { + .start = MODULES_VADDR, + .end= MODULES_END, + .pgprot = PAGE_KERNEL, + .alignment = 1, + }, + }, + }; + + return _info; } #endif diff --git a/arch/nios2/kernel/module.c b/arch/nios2/kernel/module.c index 9c97b7513853..0d1ee86631fc 100644 --- a/arch/nios2/kernel/module.c +++ b/arch/nios2/kernel/module.c @@ -18,15 +18,26 @@ #include #include #include +#include #include -void *module_alloc(unsigned long size) +static struct execmem_info execmem_info __ro_after_init; + +struct execmem_info __init *execmem_arch_setup(void) { - return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END, - GFP_KERNEL, PAGE_KERNEL_EXEC, - VM_FLUSH_RESET_PERMS, NUMA_NO_NODE, - __builtin_return_address(0)); + execmem_info = (struct execmem_info){ + .ranges = { + [EXECMEM_DEFAULT] = { + .start = MODULES_VADDR, + .end= MODULES_END, + .pgprot = PAGE_KERNEL_EXEC, + .alignment = 1, + }, + }, + }; + + return _info; } int apply_relocate_add(Elf32_Shdr *sechdrs, const char *strtab, diff --git a/arch/parisc/kernel/module.c
[PATCH v6 06/16] mm: introduce execmem_alloc() and execmem_free()
From: "Mike Rapoport (IBM)" module_alloc() is used everywhere as a mean to allocate memory for code. Beside being semantically wrong, this unnecessarily ties all subsystems that need to allocate code, such as ftrace, kprobes and BPF to modules and puts the burden of code allocation to the modules code. Several architectures override module_alloc() because of various constraints where the executable memory can be located and this causes additional obstacles for improvements of code allocation. Start splitting code allocation from modules by introducing execmem_alloc() and execmem_free() APIs. Initially, execmem_alloc() is a wrapper for module_alloc() and execmem_free() is a replacement of module_memfree() to allow updating all call sites to use the new APIs. Since architectures define different restrictions on placement, permissions, alignment and other parameters for memory that can be used by different subsystems that allocate executable memory, execmem_alloc() takes a type argument, that will be used to identify the calling subsystem and to allow architectures define parameters for ranges suitable for that subsystem. No functional changes. Signed-off-by: Mike Rapoport (IBM) Acked-by: Masami Hiramatsu (Google) --- arch/powerpc/kernel/kprobes.c| 6 ++-- arch/s390/kernel/ftrace.c| 4 +-- arch/s390/kernel/kprobes.c | 4 +-- arch/s390/kernel/module.c| 5 +-- arch/sparc/net/bpf_jit_comp_32.c | 8 ++--- arch/x86/kernel/ftrace.c | 6 ++-- arch/x86/kernel/kprobes/core.c | 4 +-- include/linux/execmem.h | 57 include/linux/moduleloader.h | 3 -- kernel/bpf/core.c| 6 ++-- kernel/kprobes.c | 8 ++--- kernel/module/Kconfig| 1 + kernel/module/main.c | 25 +- mm/Kconfig | 3 ++ mm/Makefile | 1 + mm/execmem.c | 32 ++ 16 files changed, 128 insertions(+), 45 deletions(-) create mode 100644 include/linux/execmem.h create mode 100644 mm/execmem.c diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c index bbca90a5e2ec..9fcd01bb2ce6 100644 --- a/arch/powerpc/kernel/kprobes.c +++ b/arch/powerpc/kernel/kprobes.c @@ -19,8 +19,8 @@ #include #include #include -#include #include +#include #include #include #include @@ -130,7 +130,7 @@ void *alloc_insn_page(void) { void *page; - page = module_alloc(PAGE_SIZE); + page = execmem_alloc(EXECMEM_KPROBES, PAGE_SIZE); if (!page) return NULL; @@ -142,7 +142,7 @@ void *alloc_insn_page(void) } return page; error: - module_memfree(page); + execmem_free(page); return NULL; } diff --git a/arch/s390/kernel/ftrace.c b/arch/s390/kernel/ftrace.c index c46381ea04ec..798249ef5646 100644 --- a/arch/s390/kernel/ftrace.c +++ b/arch/s390/kernel/ftrace.c @@ -7,13 +7,13 @@ * Author(s): Martin Schwidefsky */ -#include #include #include #include #include #include #include +#include #include #include #include @@ -220,7 +220,7 @@ static int __init ftrace_plt_init(void) { const char *start, *end; - ftrace_plt = module_alloc(PAGE_SIZE); + ftrace_plt = execmem_alloc(EXECMEM_FTRACE, PAGE_SIZE); if (!ftrace_plt) panic("cannot allocate ftrace plt\n"); diff --git a/arch/s390/kernel/kprobes.c b/arch/s390/kernel/kprobes.c index f0cf20d4b3c5..3c1b1be744de 100644 --- a/arch/s390/kernel/kprobes.c +++ b/arch/s390/kernel/kprobes.c @@ -9,7 +9,6 @@ #define pr_fmt(fmt) "kprobes: " fmt -#include #include #include #include @@ -21,6 +20,7 @@ #include #include #include +#include #include #include #include @@ -38,7 +38,7 @@ void *alloc_insn_page(void) { void *page; - page = module_alloc(PAGE_SIZE); + page = execmem_alloc(EXECMEM_KPROBES, PAGE_SIZE); if (!page) return NULL; set_memory_rox((unsigned long)page, 1); diff --git a/arch/s390/kernel/module.c b/arch/s390/kernel/module.c index 42215f9404af..ac97a905e8cd 100644 --- a/arch/s390/kernel/module.c +++ b/arch/s390/kernel/module.c @@ -21,6 +21,7 @@ #include #include #include +#include #include #include #include @@ -76,7 +77,7 @@ void *module_alloc(unsigned long size) #ifdef CONFIG_FUNCTION_TRACER void module_arch_cleanup(struct module *mod) { - module_memfree(mod->arch.trampolines_start); + execmem_free(mod->arch.trampolines_start); } #endif @@ -510,7 +511,7 @@ static int module_alloc_ftrace_hotpatch_trampolines(struct module *me, size = FTRACE_HOTPATCH_TRAMPOLINES_SIZE(s->sh_size); numpages = DIV_ROUND_UP(size, PAGE_SIZE); - start = module_alloc(numpages * PAGE_SIZE); + start = execmem_alloc(EXECMEM_FTRACE, numpages * PAGE_SIZE); if (!start) return -ENOMEM;
[PATCH v6 05/16] module: make module_memory_{alloc,free} more self-contained
From: "Mike Rapoport (IBM)" Move the logic related to the memory allocation and freeing into module_memory_alloc() and module_memory_free(). Signed-off-by: Mike Rapoport (IBM) --- kernel/module/main.c | 64 +++- 1 file changed, 39 insertions(+), 25 deletions(-) diff --git a/kernel/module/main.c b/kernel/module/main.c index e1e8a7a9d6c1..5b82b069e0d3 100644 --- a/kernel/module/main.c +++ b/kernel/module/main.c @@ -1203,15 +1203,44 @@ static bool mod_mem_use_vmalloc(enum mod_mem_type type) mod_mem_type_is_core_data(type); } -static void *module_memory_alloc(unsigned int size, enum mod_mem_type type) +static int module_memory_alloc(struct module *mod, enum mod_mem_type type) { + unsigned int size = PAGE_ALIGN(mod->mem[type].size); + void *ptr; + + mod->mem[type].size = size; + if (mod_mem_use_vmalloc(type)) - return vzalloc(size); - return module_alloc(size); + ptr = vmalloc(size); + else + ptr = module_alloc(size); + + if (!ptr) + return -ENOMEM; + + /* +* The pointer to these blocks of memory are stored on the module +* structure and we keep that around so long as the module is +* around. We only free that memory when we unload the module. +* Just mark them as not being a leak then. The .init* ELF +* sections *do* get freed after boot so we *could* treat them +* slightly differently with kmemleak_ignore() and only grey +* them out as they work as typical memory allocations which +* *do* eventually get freed, but let's just keep things simple +* and avoid *any* false positives. +*/ + kmemleak_not_leak(ptr); + + memset(ptr, 0, size); + mod->mem[type].base = ptr; + + return 0; } -static void module_memory_free(void *ptr, enum mod_mem_type type) +static void module_memory_free(struct module *mod, enum mod_mem_type type) { + void *ptr = mod->mem[type].base; + if (mod_mem_use_vmalloc(type)) vfree(ptr); else @@ -1229,12 +1258,12 @@ static void free_mod_mem(struct module *mod) /* Free lock-classes; relies on the preceding sync_rcu(). */ lockdep_free_key_range(mod_mem->base, mod_mem->size); if (mod_mem->size) - module_memory_free(mod_mem->base, type); + module_memory_free(mod, type); } /* MOD_DATA hosts mod, so free it at last */ lockdep_free_key_range(mod->mem[MOD_DATA].base, mod->mem[MOD_DATA].size); - module_memory_free(mod->mem[MOD_DATA].base, MOD_DATA); + module_memory_free(mod, MOD_DATA); } /* Free a module, remove from lists, etc. */ @@ -2225,7 +2254,6 @@ static int find_module_sections(struct module *mod, struct load_info *info) static int move_module(struct module *mod, struct load_info *info) { int i; - void *ptr; enum mod_mem_type t = 0; int ret = -ENOMEM; @@ -2234,26 +2262,12 @@ static int move_module(struct module *mod, struct load_info *info) mod->mem[type].base = NULL; continue; } - mod->mem[type].size = PAGE_ALIGN(mod->mem[type].size); - ptr = module_memory_alloc(mod->mem[type].size, type); - /* - * The pointer to these blocks of memory are stored on the module - * structure and we keep that around so long as the module is - * around. We only free that memory when we unload the module. - * Just mark them as not being a leak then. The .init* ELF - * sections *do* get freed after boot so we *could* treat them - * slightly differently with kmemleak_ignore() and only grey - * them out as they work as typical memory allocations which - * *do* eventually get freed, but let's just keep things simple - * and avoid *any* false positives. -*/ - kmemleak_not_leak(ptr); - if (!ptr) { + + ret = module_memory_alloc(mod, type); + if (ret) { t = type; goto out_enomem; } - memset(ptr, 0, mod->mem[type].size); - mod->mem[type].base = ptr; } /* Transfer each section which specifies SHF_ALLOC */ @@ -2296,7 +2310,7 @@ static int move_module(struct module *mod, struct load_info *info) return 0; out_enomem: for (t--; t >= 0; t--) - module_memory_free(mod->mem[t].base, t); + module_memory_free(mod, t); return ret; } -- 2.43.0
[PATCH v6 04/16] sparc: simplify module_alloc()
From: "Mike Rapoport (IBM)" Define MODULES_VADDR and MODULES_END as VMALLOC_START and VMALLOC_END for 32-bit and reduce module_alloc() to __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END, ...) as with the new defines the allocations becomes identical for both 32 and 64 bits. While on it, drop unused include of Suggested-by: Sam Ravnborg Signed-off-by: Mike Rapoport (IBM) --- arch/sparc/include/asm/pgtable_32.h | 2 ++ arch/sparc/kernel/module.c | 25 + 2 files changed, 3 insertions(+), 24 deletions(-) diff --git a/arch/sparc/include/asm/pgtable_32.h b/arch/sparc/include/asm/pgtable_32.h index 9e85d57ac3f2..62bcafe38b1f 100644 --- a/arch/sparc/include/asm/pgtable_32.h +++ b/arch/sparc/include/asm/pgtable_32.h @@ -432,6 +432,8 @@ static inline int io_remap_pfn_range(struct vm_area_struct *vma, #define VMALLOC_START _AC(0xfe60,UL) #define VMALLOC_END _AC(0xffc0,UL) +#define MODULES_VADDR VMALLOC_START +#define MODULES_END VMALLOC_END /* We provide our own get_unmapped_area to cope with VA holes for userland */ #define HAVE_ARCH_UNMAPPED_AREA diff --git a/arch/sparc/kernel/module.c b/arch/sparc/kernel/module.c index 66c45a2764bc..d37adb2a0b54 100644 --- a/arch/sparc/kernel/module.c +++ b/arch/sparc/kernel/module.c @@ -21,35 +21,12 @@ #include "entry.h" -#ifdef CONFIG_SPARC64 - -#include - -static void *module_map(unsigned long size) +void *module_alloc(unsigned long size) { - if (PAGE_ALIGN(size) > MODULES_LEN) - return NULL; return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END, GFP_KERNEL, PAGE_KERNEL, 0, NUMA_NO_NODE, __builtin_return_address(0)); } -#else -static void *module_map(unsigned long size) -{ - return vmalloc(size); -} -#endif /* CONFIG_SPARC64 */ - -void *module_alloc(unsigned long size) -{ - void *ret; - - ret = module_map(size); - if (ret) - memset(ret, 0, size); - - return ret; -} /* Make generic code ignore STT_REGISTER dummy undefined symbols. */ int module_frob_arch_sections(Elf_Ehdr *hdr, -- 2.43.0
[PATCH v6 03/16] nios2: define virtual address space for modules
From: "Mike Rapoport (IBM)" nios2 uses kmalloc() to implement module_alloc() because CALL26/PCREL26 cannot reach all of vmalloc address space. Define module space as 32MiB below the kernel base and switch nios2 to use vmalloc for module allocations. Suggested-by: Thomas Gleixner Acked-by: Dinh Nguyen Acked-by: Song Liu Signed-off-by: Mike Rapoport (IBM) --- arch/nios2/include/asm/pgtable.h | 5 - arch/nios2/kernel/module.c | 19 --- 2 files changed, 8 insertions(+), 16 deletions(-) diff --git a/arch/nios2/include/asm/pgtable.h b/arch/nios2/include/asm/pgtable.h index d052dfcbe8d3..eab87c6beacb 100644 --- a/arch/nios2/include/asm/pgtable.h +++ b/arch/nios2/include/asm/pgtable.h @@ -25,7 +25,10 @@ #include #define VMALLOC_START CONFIG_NIOS2_KERNEL_MMU_REGION_BASE -#define VMALLOC_END(CONFIG_NIOS2_KERNEL_REGION_BASE - 1) +#define VMALLOC_END(CONFIG_NIOS2_KERNEL_REGION_BASE - SZ_32M - 1) + +#define MODULES_VADDR (CONFIG_NIOS2_KERNEL_REGION_BASE - SZ_32M) +#define MODULES_END(CONFIG_NIOS2_KERNEL_REGION_BASE - 1) struct mm_struct; diff --git a/arch/nios2/kernel/module.c b/arch/nios2/kernel/module.c index 76e0a42d6e36..9c97b7513853 100644 --- a/arch/nios2/kernel/module.c +++ b/arch/nios2/kernel/module.c @@ -21,23 +21,12 @@ #include -/* - * Modules should NOT be allocated with kmalloc for (obvious) reasons. - * But we do it for now to avoid relocation issues. CALL26/PCREL26 cannot reach - * from 0x8000 (vmalloc area) to 0xc (kernel) (kmalloc returns - * addresses in 0xc000) - */ void *module_alloc(unsigned long size) { - if (size == 0) - return NULL; - return kmalloc(size, GFP_KERNEL); -} - -/* Free memory returned from module_alloc */ -void module_memfree(void *module_region) -{ - kfree(module_region); + return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END, + GFP_KERNEL, PAGE_KERNEL_EXEC, + VM_FLUSH_RESET_PERMS, NUMA_NO_NODE, + __builtin_return_address(0)); } int apply_relocate_add(Elf32_Shdr *sechdrs, const char *strtab, -- 2.43.0
[PATCH v6 02/16] mips: module: rename MODULE_START to MODULES_VADDR
From: "Mike Rapoport (IBM)" and MODULE_END to MODULES_END to match other architectures that define custom address space for modules. Signed-off-by: Mike Rapoport (IBM) --- arch/mips/include/asm/pgtable-64.h | 4 ++-- arch/mips/kernel/module.c | 4 ++-- arch/mips/mm/fault.c | 4 ++-- 3 files changed, 6 insertions(+), 6 deletions(-) diff --git a/arch/mips/include/asm/pgtable-64.h b/arch/mips/include/asm/pgtable-64.h index 20ca48c1b606..c0109aff223b 100644 --- a/arch/mips/include/asm/pgtable-64.h +++ b/arch/mips/include/asm/pgtable-64.h @@ -147,8 +147,8 @@ #if defined(CONFIG_MODULES) && defined(KBUILD_64BIT_SYM32) && \ VMALLOC_START != CKSSEG /* Load modules into 32bit-compatible segment. */ -#define MODULE_START CKSSEG -#define MODULE_END (FIXADDR_START-2*PAGE_SIZE) +#define MODULES_VADDR CKSSEG +#define MODULES_END(FIXADDR_START-2*PAGE_SIZE) #endif #define pte_ERROR(e) \ diff --git a/arch/mips/kernel/module.c b/arch/mips/kernel/module.c index 7b2fbaa9cac5..9a6c96014904 100644 --- a/arch/mips/kernel/module.c +++ b/arch/mips/kernel/module.c @@ -31,10 +31,10 @@ struct mips_hi16 { static LIST_HEAD(dbe_list); static DEFINE_SPINLOCK(dbe_lock); -#ifdef MODULE_START +#ifdef MODULES_VADDR void *module_alloc(unsigned long size) { - return __vmalloc_node_range(size, 1, MODULE_START, MODULE_END, + return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END, GFP_KERNEL, PAGE_KERNEL, 0, NUMA_NO_NODE, __builtin_return_address(0)); } diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c index aaa9a242ebba..37fedeaca2e9 100644 --- a/arch/mips/mm/fault.c +++ b/arch/mips/mm/fault.c @@ -83,8 +83,8 @@ static void __do_page_fault(struct pt_regs *regs, unsigned long write, if (unlikely(address >= VMALLOC_START && address <= VMALLOC_END)) goto VMALLOC_FAULT_TARGET; -#ifdef MODULE_START - if (unlikely(address >= MODULE_START && address < MODULE_END)) +#ifdef MODULES_VADDR + if (unlikely(address >= MODULES_VADDR && address < MODULES_END)) goto VMALLOC_FAULT_TARGET; #endif -- 2.43.0
[PATCH v6 01/16] arm64: module: remove unneeded call to kasan_alloc_module_shadow()
From: "Mike Rapoport (IBM)" Since commit f6f37d9320a1 ("arm64: select KASAN_VMALLOC for SW/HW_TAGS modes") KASAN_VMALLOC is always enabled when KASAN is on. This means that allocations in module_alloc() will be tracked by KASAN protection for vmalloc() and that kasan_alloc_module_shadow() will be always an empty inline and there is no point in calling it. Drop meaningless call to kasan_alloc_module_shadow() from module_alloc(). Signed-off-by: Mike Rapoport (IBM) --- arch/arm64/kernel/module.c | 5 - 1 file changed, 5 deletions(-) diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c index 47e0be610bb6..e92da4da1b2a 100644 --- a/arch/arm64/kernel/module.c +++ b/arch/arm64/kernel/module.c @@ -141,11 +141,6 @@ void *module_alloc(unsigned long size) __func__); } - if (p && (kasan_alloc_module_shadow(p, size, GFP_KERNEL) < 0)) { - vfree(p); - return NULL; - } - /* Memory is intended to be executable, reset the pointer tag. */ return kasan_reset_tag(p); } -- 2.43.0
[PATCH v6 00/16] mm: jit/text allocator
From: "Mike Rapoport (IBM)" Hi, The patches are also available in git: https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=execmem/v6 v6 changes: * restore patch "arm64: extend execmem_info for generated code allocations" that disappeared in v5 rebase * update execmem initialization so that by default it will be initialized early while late initialization will be an opt-in v5: https://lore.kernel.org/all/20240422094436.3625171-1-r...@kernel.org * rebase on v6.9-rc4 to avoid a conflict in kprobes * add copyrights to mm/execmem.c (Luis) * fix spelling (Ingo) * define MODULES_VADDDR for sparc (Sam) * consistently initialize struct execmem_info (Peter) * reduce #ifdefs in function bodies in kprobes (Masami) v4: https://lore.kernel.org/all/20240411160051.2093261-1-r...@kernel.org * rebase on v6.9-rc2 * rename execmem_params to execmem_info and execmem_arch_params() to execmem_arch_setup() * use single execmem_alloc() API instead of execmem_{text,data}_alloc() (Song) * avoid extra copy of execmem parameters (Rick) * run execmem_init() as core_initcall() except for the architectures that may allocated text really early (currently only x86) (Will) * add acks for some of arm64 and riscv changes, thanks Will and Alexandre * new commits: - drop call to kasan_alloc_module_shadow() on arm64 because it's not needed anymore - rename MODULE_START to MODULES_VADDR on MIPS - use CONFIG_EXECMEM instead of CONFIG_MODULES on powerpc as per Christophe: https://lore.kernel.org/all/79062fa3-3402-47b3-8920-9231ad05e...@csgroup.eu/ v3: https://lore.kernel.org/all/20230918072955.2507221-1-r...@kernel.org * add type parameter to execmem allocation APIs * remove BPF dependency on modules v2: https://lore.kernel.org/all/20230616085038.4121892-1-r...@kernel.org * Separate "module" and "others" allocations with execmem_text_alloc() and jit_text_alloc() * Drop ROX entailment on x86 * Add ack for nios2 changes, thanks Dinh Nguyen v1: https://lore.kernel.org/all/20230601101257.530867-1-r...@kernel.org = Cover letter from v1 (sligtly updated) = module_alloc() is used everywhere as a mean to allocate memory for code. Beside being semantically wrong, this unnecessarily ties all subsystmes that need to allocate code, such as ftrace, kprobes and BPF to modules and puts the burden of code allocation to the modules code. Several architectures override module_alloc() because of various constraints where the executable memory can be located and this causes additional obstacles for improvements of code allocation. A centralized infrastructure for code allocation allows allocations of executable memory as ROX, and future optimizations such as caching large pages for better iTLB performance and providing sub-page allocations for users that only need small jit code snippets. Rick Edgecombe proposed perm_alloc extension to vmalloc [1] and Song Liu proposed execmem_alloc [2], but both these approaches were targeting BPF allocations and lacked the ground work to abstract executable allocations and split them from the modules core. Thomas Gleixner suggested to express module allocation restrictions and requirements as struct mod_alloc_type_params [3] that would define ranges, protections and other parameters for different types of allocations used by modules and following that suggestion Song separated allocations of different types in modules (commit ac3b43283923 ("module: replace module_layout with module_memory")) and posted "Type aware module allocator" set [4]. I liked the idea of parametrising code allocation requirements as a structure, but I believe the original proposal and Song's module allocator was too module centric, so I came up with these patches. This set splits code allocation from modules by introducing execmem_alloc() and and execmem_free(), APIs, replaces call sites of module_alloc() and module_memfree() with the new APIs and implements core text and related allocations in a central place. Instead of architecture specific overrides for module_alloc(), the architectures that require non-default behaviour for text allocation must fill execmem_info structure and implement execmem_arch_setup() that returns a pointer to that structure. If an architecture does not implement execmem_arch_setup(), the defaults compatible with the current modules::module_alloc() are used. Since architectures define different restrictions on placement, permissions, alignment and other parameters for memory that can be used by different subsystems that allocate executable memory, execmem APIs take a type argument, that will be used to identify the calling subsystem and to allow architectures to define parameters for ranges suitable for that subsystem. The new infrastructure allows decoupling of BPF, kprobes and ftrace from modules, and most importantly it paves the way for ROX allocations for executable memory. [1] https://lore.kernel.org/lkml/20201120202426.18009-1-rick.p.edgeco...@intel.com/ [2]
Re: [PATCH v13 25/35] KVM: selftests: Convert lib's mem regions to KVM_SET_USER_MEMORY_REGION2
On Thu Apr 25, 2024 at 6:09 PM EEST, Sean Christopherson wrote: > + __TEST_REQUIRE(kvm_has_cap(KVM_CAP_USER_MEMORY2), > + "KVM selftests from v6.8+ require > KVM_SET_USER_MEMORY_REGION2"); This would work also for casual (but not seasoned) visitor in KVM code as additionl documentation. BR, Jarkko
Re: [PATCH v1 1/3] mm/gup: consistently name GUP-fast functions
On 02.04.24 14:55, David Hildenbrand wrote: Let's consistently call the "fast-only" part of GUP "GUP-fast" and rename all relevant internal functions to start with "gup_fast", to make it clearer that this is not ordinary GUP. The current mixture of "lockless", "gup" and "gup_fast" is confusing. Further, avoid the term "huge" when talking about a "leaf" -- for example, we nowadays check pmd_leaf() because pmd_huge() is gone. For the "hugepd"/"hugepte" stuff, it's part of the name ("is_hugepd"), so that stays. What remains is the "external" interface: * get_user_pages_fast_only() * get_user_pages_fast() * pin_user_pages_fast() The high-level internal functions for GUP-fast (+slow fallback) are now: * internal_get_user_pages_fast() -> gup_fast_fallback() * lockless_pages_from_mm() -> gup_fast() The basic GUP-fast walker functions: * gup_pgd_range() -> gup_fast_pgd_range() * gup_p4d_range() -> gup_fast_p4d_range() * gup_pud_range() -> gup_fast_pud_range() * gup_pmd_range() -> gup_fast_pmd_range() * gup_pte_range() -> gup_fast_pte_range() * gup_huge_pgd() -> gup_fast_pgd_leaf() * gup_huge_pud() -> gup_fast_pud_leaf() * gup_huge_pmd() -> gup_fast_pmd_leaf() The weird hugepd stuff: * gup_huge_pd() -> gup_fast_hugepd() * gup_hugepte() -> gup_fast_hugepte() I just realized that we end up calling these from follow_hugepd() as well. And something seems to be off, because gup_fast_hugepd() won't have the VMA even in the slow-GUP case to pass it to gup_must_unshare(). So these are GUP-fast functions and the terminology seem correct. But the usage from follow_hugepd() is questionable, commit a12083d721d703f985f4403d6b333cc449f838f6 Author: Peter Xu Date: Wed Mar 27 11:23:31 2024 -0400 mm/gup: handle hugepd for follow_page() states "With previous refactors on fast-gup gup_huge_pd(), most of the code can be leveraged", which doesn't look quite true just staring the the gup_must_unshare() call where we don't pass the VMA. Also, "unlikely(pte_val(pte) != pte_val(ptep_get(ptep)" doesn't make any sense for slow GUP ... @Peter, any insights? -- Cheers, David / dhildenb
Re: linux-next: boot failure after merge of the modules tree
Hi Mike, On Wed, 24 Apr 2024 12:14:49 +0300 Mike Rapoport wrote: > > This should fix it for now, I'll rework initialization a bit in v6 > > diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig > index 1c4be3373686..bea33bf538e9 100644 > --- a/arch/powerpc/Kconfig > +++ b/arch/powerpc/Kconfig > @@ -176,6 +176,7 @@ config PPC > select ARCH_WANT_IRQS_OFF_ACTIVATE_MM > select ARCH_WANT_LD_ORPHAN_WARN > select ARCH_WANT_OPTIMIZE_DAX_VMEMMAP if PPC_RADIX_MMU > + select ARCH_WANTS_EXECMEM_EARLY if EXECMEM > select ARCH_WANTS_MODULES_DATA_IN_VMALLOC if PPC_BOOK3S_32 || > PPC_8xx > select ARCH_WEAK_RELEASE_ACQUIRE > select BINFMT_ELF I added the above to today's merge of the modules tree and it made the boot failure go away. -- Cheers, Stephen Rothwell pgp23z3ezFCNO.pgp Description: OpenPGP digital signature
[PATCH v19 6/6] powerpc/crash: add crash memory hotplug support
Extend the arch crash hotplug handler, as introduced by the patch title ("powerpc: add crash CPU hotplug support"), to also support memory add/remove events. Elfcorehdr describes the memory of the crash kernel to capture the kernel; hence, it needs to be updated if memory resources change due to memory add/remove events. Therefore, arch_crash_handle_hotplug_event() is updated to recreate the elfcorehdr and replace it with the previous one on memory add/remove events. The memblock list is used to prepare the elfcorehdr. In the case of memory hot remove, the memblock list is updated after the arch crash hotplug handler is triggered, as depicted in Figure 1. Thus, the hot-removed memory is explicitly removed from the crash memory ranges to ensure that the memory ranges added to elfcorehdr do not include the hot-removed memory. Memory remove | v Offline pages | v Initiate memory notify call <> crash hotplug handler chain for MEM_OFFLINE event | v Update memblock list Figure 1 There are two system calls, `kexec_file_load` and `kexec_load`, used to load the kdump image. A few changes have been made to ensure that the kernel can safely update the elfcorehdr component of the kdump image for both system calls. For the kexec_file_load syscall, kdump image is prepared in the kernel. To support an increasing number of memory regions, the elfcorehdr is built with extra buffer space to ensure that it can accommodate additional memory ranges in future. For the kexec_load syscall, the elfcorehdr is updated only if the KEXEC_CRASH_HOTPLUG_SUPPORT kexec flag is passed to the kernel by the kexec tool. Passing this flag to the kernel indicates that the elfcorehdr is built to accommodate additional memory ranges and the elfcorehdr segment is not considered for SHA calculation, making it safe to update. The changes related to this feature are kept under the CRASH_HOTPLUG config, and it is enabled by default. Signed-off-by: Sourabh Jain Acked-by: Hari Bathini Cc: Akhil Raj Cc: Andrew Morton Cc: Aneesh Kumar K.V Cc: Baoquan He Cc: Borislav Petkov (AMD) Cc: Boris Ostrovsky Cc: Christophe Leroy Cc: Dave Hansen Cc: Dave Young Cc: David Hildenbrand Cc: Greg Kroah-Hartman Cc: Laurent Dufour Cc: Mahesh Salgaonkar Cc: Michael Ellerman Cc: Mimi Zohar Cc: Naveen N Rao Cc: Oscar Salvador Cc: Stephen Rothwell Cc: Thomas Gleixner Cc: Valentin Schneider Cc: Vivek Goyal Cc: ke...@lists.infradead.org Cc: x...@kernel.org --- Changes in v19: * Fix a build warning: remove NULL check before freeing memory for elfbuf in update_crash_elfcorehdr function. arch/powerpc/include/asm/kexec.h| 3 + arch/powerpc/include/asm/kexec_ranges.h | 1 + arch/powerpc/kexec/crash.c | 94 - arch/powerpc/kexec/file_load_64.c | 20 +- arch/powerpc/kexec/ranges.c | 85 ++ 5 files changed, 201 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h index e75970351bcd..95a98b390d62 100644 --- a/arch/powerpc/include/asm/kexec.h +++ b/arch/powerpc/include/asm/kexec.h @@ -141,6 +141,9 @@ void arch_crash_handle_hotplug_event(struct kimage *image, void *arg); int arch_crash_hotplug_support(struct kimage *image, unsigned long kexec_flags); #define arch_crash_hotplug_support arch_crash_hotplug_support + +unsigned int arch_crash_get_elfcorehdr_size(void); +#define crash_get_elfcorehdr_size arch_crash_get_elfcorehdr_size #endif /* CONFIG_CRASH_HOTPLUG */ extern int crashing_cpu; diff --git a/arch/powerpc/include/asm/kexec_ranges.h b/arch/powerpc/include/asm/kexec_ranges.h index 8489e844b447..14055896cbcb 100644 --- a/arch/powerpc/include/asm/kexec_ranges.h +++ b/arch/powerpc/include/asm/kexec_ranges.h @@ -7,6 +7,7 @@ void sort_memory_ranges(struct crash_mem *mrngs, bool merge); struct crash_mem *realloc_mem_ranges(struct crash_mem **mem_ranges); int add_mem_range(struct crash_mem **mem_ranges, u64 base, u64 size); +int remove_mem_range(struct crash_mem **mem_ranges, u64 base, u64 size); int get_exclude_memory_ranges(struct crash_mem **mem_ranges); int get_reserved_memory_ranges(struct crash_mem **mem_ranges); int get_crash_memory_ranges(struct crash_mem **mem_ranges); diff --git a/arch/powerpc/kexec/crash.c b/arch/powerpc/kexec/crash.c index 8938a19af12f..9ac3266e4965 100644 --- a/arch/powerpc/kexec/crash.c +++ b/arch/powerpc/kexec/crash.c @@ -17,6 +17,7 @@ #include #include #include +#include #include #include @@ -25,6 +26,7 @@ #include #include #include +#include /* * The primary CPU waits a while for all secondary CPUs to enter. This is to @@ -398,6 +400,93 @@ void default_machine_crash_shutdown(struct pt_regs *regs) #undef pr_fmt #define pr_fmt(fmt) "crash hp: " fmt +/* + * Advertise preferred elfcorehdr size to userspace via + * /sys/kernel/crash_elfcorehdr_size sysfs
[PATCH v19 5/6] powerpc/crash: add crash CPU hotplug support
Due to CPU/Memory hotplug or online/offline events, the elfcorehdr (which describes the CPUs and memory of the crashed kernel) and FDT (Flattened Device Tree) of kdump image becomes outdated. Consequently, attempting dump collection with an outdated elfcorehdr or FDT can lead to failed or inaccurate dump collection. Going forward, CPU hotplug or online/offline events are referred as CPU/Memory add/remove events. The current solution to address the above issue involves monitoring the CPU/Memory add/remove events in userspace using udev rules and whenever there are changes in CPU and memory resources, the entire kdump image is loaded again. The kdump image includes kernel, initrd, elfcorehdr, FDT, purgatory. Given that only elfcorehdr and FDT get outdated due to CPU/Memory add/remove events, reloading the entire kdump image is inefficient. More importantly, kdump remains inactive for a substantial amount of time until the kdump reload completes. To address the aforementioned issue, commit 247262756121 ("crash: add generic infrastructure for crash hotplug support") added a generic infrastructure that allows architectures to selectively update the kdump image component during CPU or memory add/remove events within the kernel itself. In the event of a CPU or memory add/remove events, the generic crash hotplug event handler, `crash_handle_hotplug_event()`, is triggered. It then acquires the necessary locks to update the kdump image and invokes the architecture-specific crash hotplug handler, `arch_crash_handle_hotplug_event()`, to update the required kdump image components. This patch adds crash hotplug handler for PowerPC and enable support to update the kdump image on CPU add/remove events. Support for memory add/remove events is added in a subsequent patch with the title "powerpc: add crash memory hotplug support" As mentioned earlier, only the elfcorehdr and FDT kdump image components need to be updated in the event of CPU or memory add/remove events. However, on PowerPC architecture crash hotplug handler only updates the FDT to enable crash hotplug support for CPU add/remove events. Here's why. The elfcorehdr on PowerPC is built with possible CPUs, and thus, it does not need an update on CPU add/remove events. On the other hand, the FDT needs to be updated on CPU add events to include the newly added CPU. If the FDT is not updated and the kernel crashes on a newly added CPU, the kdump kernel will fail to boot due to the unavailability of the crashing CPU in the FDT. During the early boot, it is expected that the boot CPU must be a part of the FDT; otherwise, the kernel will raise a BUG and fail to boot. For more information, refer to commit 36ae37e3436b0 ("powerpc: Make boot_cpuid common between 32 and 64-bit"). Since it is okay to have an offline CPU in the kdump FDT, no action is taken in case of CPU removal. There are two system calls, `kexec_file_load` and `kexec_load`, used to load the kdump image. Few changes have been made to ensure kernel can safely update the FDT of kdump image loaded using both system calls. For kexec_file_load syscall the kdump image is prepared in kernel. So to support an increasing number of CPUs, the FDT is constructed with extra buffer space to ensure it can accommodate a possible number of CPU nodes. Additionally, a call to fdt_pack (which trims the unused space once the FDT is prepared) is avoided if this feature is enabled. For the kexec_load syscall, the FDT is updated only if the KEXEC_CRASH_HOTPLUG_SUPPORT kexec flag is passed to the kernel by userspace (kexec tools). When userspace passes this flag to the kernel, it indicates that the FDT is built to accommodate possible CPUs, and the FDT segment is excluded from SHA calculation, making it safe to update. The changes related to this feature are kept under the CRASH_HOTPLUG config, and it is enabled by default. Signed-off-by: Sourabh Jain Acked-by: Hari Bathini Cc: Akhil Raj Cc: Andrew Morton Cc: Aneesh Kumar K.V Cc: Baoquan He Cc: Borislav Petkov (AMD) Cc: Boris Ostrovsky Cc: Christophe Leroy Cc: Dave Hansen Cc: Dave Young Cc: David Hildenbrand Cc: Greg Kroah-Hartman Cc: Laurent Dufour Cc: Mahesh Salgaonkar Cc: Michael Ellerman Cc: Mimi Zohar Cc: Naveen N Rao Cc: Oscar Salvador Cc: Stephen Rothwell Cc: Thomas Gleixner Cc: Valentin Schneider Cc: Vivek Goyal Cc: ke...@lists.infradead.org Cc: x...@kernel.org --- * No changes in v19. arch/powerpc/Kconfig | 4 ++ arch/powerpc/include/asm/kexec.h | 8 +++ arch/powerpc/kexec/crash.c| 103 ++ arch/powerpc/kexec/elf_64.c | 3 +- arch/powerpc/kexec/file_load_64.c | 17 + 5 files changed, 134 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 1c4be3373686..a1a3b3363008 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -686,6 +686,10 @@ config ARCH_SELECTS_CRASH_DUMP depends on CRASH_DUMP select
[PATCH v19 3/6] powerpc/kexec: move *_memory_ranges functions to ranges.c
Move the following functions form kexec/{file_load_64.c => ranges.c} and make them public so that components other than KEXEC_FILE can also use these functions. 1. get_exclude_memory_ranges 2. get_reserved_memory_ranges 3. get_crash_memory_ranges 4. get_usable_memory_ranges Later in the series get_crash_memory_ranges function is utilized for in-kernel updates to kdump image during CPU/Memory hotplug or online/offline events for both kexec_load and kexec_file_load syscalls. Since the above functions are moved to ranges.c, some of the helper functions in ranges.c are no longer required to be public. Mark them as static and removed them from kexec_ranges.h header file. Finally, remove the CONFIG_KEXEC_FILE build dependency for range.c because it is required for other config, such as CONFIG_CRASH_DUMP. No functional changes are intended. Signed-off-by: Sourabh Jain Acked-by: Hari Bathini Cc: Akhil Raj Cc: Andrew Morton Cc: Aneesh Kumar K.V Cc: Baoquan He Cc: Borislav Petkov (AMD) Cc: Boris Ostrovsky Cc: Christophe Leroy Cc: Dave Hansen Cc: Dave Young Cc: David Hildenbrand Cc: Greg Kroah-Hartman Cc: Laurent Dufour Cc: Mahesh Salgaonkar Cc: Michael Ellerman Cc: Mimi Zohar Cc: Naveen N Rao Cc: Oscar Salvador Cc: Stephen Rothwell Cc: Thomas Gleixner Cc: Valentin Schneider Cc: Vivek Goyal Cc: ke...@lists.infradead.org Cc: x...@kernel.org --- * No changes in v19. arch/powerpc/include/asm/kexec_ranges.h | 19 +- arch/powerpc/kexec/Makefile | 4 +- arch/powerpc/kexec/file_load_64.c | 190 arch/powerpc/kexec/ranges.c | 227 +++- 4 files changed, 224 insertions(+), 216 deletions(-) diff --git a/arch/powerpc/include/asm/kexec_ranges.h b/arch/powerpc/include/asm/kexec_ranges.h index f83866a19e87..8489e844b447 100644 --- a/arch/powerpc/include/asm/kexec_ranges.h +++ b/arch/powerpc/include/asm/kexec_ranges.h @@ -7,19 +7,8 @@ void sort_memory_ranges(struct crash_mem *mrngs, bool merge); struct crash_mem *realloc_mem_ranges(struct crash_mem **mem_ranges); int add_mem_range(struct crash_mem **mem_ranges, u64 base, u64 size); -int add_tce_mem_ranges(struct crash_mem **mem_ranges); -int add_initrd_mem_range(struct crash_mem **mem_ranges); -#ifdef CONFIG_PPC_64S_HASH_MMU -int add_htab_mem_range(struct crash_mem **mem_ranges); -#else -static inline int add_htab_mem_range(struct crash_mem **mem_ranges) -{ - return 0; -} -#endif -int add_kernel_mem_range(struct crash_mem **mem_ranges); -int add_rtas_mem_range(struct crash_mem **mem_ranges); -int add_opal_mem_range(struct crash_mem **mem_ranges); -int add_reserved_mem_ranges(struct crash_mem **mem_ranges); - +int get_exclude_memory_ranges(struct crash_mem **mem_ranges); +int get_reserved_memory_ranges(struct crash_mem **mem_ranges); +int get_crash_memory_ranges(struct crash_mem **mem_ranges); +int get_usable_memory_ranges(struct crash_mem **mem_ranges); #endif /* _ASM_POWERPC_KEXEC_RANGES_H */ diff --git a/arch/powerpc/kexec/Makefile b/arch/powerpc/kexec/Makefile index 8e469c4da3f8..470eb0453e17 100644 --- a/arch/powerpc/kexec/Makefile +++ b/arch/powerpc/kexec/Makefile @@ -3,11 +3,11 @@ # Makefile for the linux kernel. # -obj-y += core.o core_$(BITS).o +obj-y += core.o core_$(BITS).o ranges.o obj-$(CONFIG_PPC32)+= relocate_32.o -obj-$(CONFIG_KEXEC_FILE) += file_load.o ranges.o file_load_$(BITS).o elf_$(BITS).o +obj-$(CONFIG_KEXEC_FILE) += file_load.o file_load_$(BITS).o elf_$(BITS).o obj-$(CONFIG_VMCORE_INFO) += vmcore_info.o obj-$(CONFIG_CRASH_DUMP) += crash.o diff --git a/arch/powerpc/kexec/file_load_64.c b/arch/powerpc/kexec/file_load_64.c index 1bc65de6174f..6a01f62b8fcf 100644 --- a/arch/powerpc/kexec/file_load_64.c +++ b/arch/powerpc/kexec/file_load_64.c @@ -47,83 +47,6 @@ const struct kexec_file_ops * const kexec_file_loaders[] = { NULL }; -/** - * get_exclude_memory_ranges - Get exclude memory ranges. This list includes - * regions like opal/rtas, tce-table, initrd, - * kernel, htab which should be avoided while - * setting up kexec load segments. - * @mem_ranges:Range list to add the memory ranges to. - * - * Returns 0 on success, negative errno on error. - */ -static int get_exclude_memory_ranges(struct crash_mem **mem_ranges) -{ - int ret; - - ret = add_tce_mem_ranges(mem_ranges); - if (ret) - goto out; - - ret = add_initrd_mem_range(mem_ranges); - if (ret) - goto out; - - ret = add_htab_mem_range(mem_ranges); - if (ret) - goto out; - - ret = add_kernel_mem_range(mem_ranges); - if (ret) - goto out; - - ret = add_rtas_mem_range(mem_ranges); - if (ret) - goto out; - - ret = add_opal_mem_range(mem_ranges); -
[PATCH v19 4/6] PowerPC/kexec: make the update_cpus_node() function public
Move the update_cpus_node() from kexec/{file_load_64.c => core_64.c} to allow other kexec components to use it. Later in the series, this function is used for in-kernel updates to the kdump image during CPU/memory hotplug or online/offline events for both kexec_load and kexec_file_load syscalls. No functional changes are intended. Signed-off-by: Sourabh Jain Acked-by: Hari Bathini Cc: Akhil Raj Cc: Andrew Morton Cc: Aneesh Kumar K.V Cc: Baoquan He Cc: Borislav Petkov (AMD) Cc: Boris Ostrovsky Cc: Christophe Leroy Cc: Dave Hansen Cc: Dave Young Cc: David Hildenbrand Cc: Greg Kroah-Hartman Cc: Laurent Dufour Cc: Mahesh Salgaonkar Cc: Michael Ellerman Cc: Mimi Zohar Cc: Naveen N Rao Cc: Oscar Salvador Cc: Stephen Rothwell Cc: Thomas Gleixner Cc: Valentin Schneider Cc: Vivek Goyal Cc: ke...@lists.infradead.org Cc: x...@kernel.org --- * No changes in v19. arch/powerpc/include/asm/kexec.h | 4 ++ arch/powerpc/kexec/core_64.c | 91 +++ arch/powerpc/kexec/file_load_64.c | 87 - 3 files changed, 95 insertions(+), 87 deletions(-) diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h index fdb90e24dc74..d9ff4d0e392d 100644 --- a/arch/powerpc/include/asm/kexec.h +++ b/arch/powerpc/include/asm/kexec.h @@ -185,6 +185,10 @@ static inline void crash_send_ipi(void (*crash_ipi_callback)(struct pt_regs *)) #endif /* CONFIG_CRASH_DUMP */ +#if defined(CONFIG_KEXEC_FILE) || defined(CONFIG_CRASH_DUMP) +int update_cpus_node(void *fdt); +#endif + #ifdef CONFIG_PPC_BOOK3S_64 #include #endif diff --git a/arch/powerpc/kexec/core_64.c b/arch/powerpc/kexec/core_64.c index 762e4d09aacf..85050be08a23 100644 --- a/arch/powerpc/kexec/core_64.c +++ b/arch/powerpc/kexec/core_64.c @@ -17,6 +17,7 @@ #include #include #include +#include #include #include @@ -30,6 +31,7 @@ #include #include #include +#include int machine_kexec_prepare(struct kimage *image) { @@ -419,3 +421,92 @@ static int __init export_htab_values(void) } late_initcall(export_htab_values); #endif /* CONFIG_PPC_64S_HASH_MMU */ + +#if defined(CONFIG_KEXEC_FILE) || defined(CONFIG_CRASH_DUMP) +/** + * add_node_props - Reads node properties from device node structure and add + * them to fdt. + * @fdt:Flattened device tree of the kernel + * @node_offset:offset of the node to add a property at + * @dn: device node pointer + * + * Returns 0 on success, negative errno on error. + */ +static int add_node_props(void *fdt, int node_offset, const struct device_node *dn) +{ + int ret = 0; + struct property *pp; + + if (!dn) + return -EINVAL; + + for_each_property_of_node(dn, pp) { + ret = fdt_setprop(fdt, node_offset, pp->name, pp->value, pp->length); + if (ret < 0) { + pr_err("Unable to add %s property: %s\n", pp->name, fdt_strerror(ret)); + return ret; + } + } + return ret; +} + +/** + * update_cpus_node - Update cpus node of flattened device tree using of_root + *device node. + * @fdt: Flattened device tree of the kernel. + * + * Returns 0 on success, negative errno on error. + */ +int update_cpus_node(void *fdt) +{ + struct device_node *cpus_node, *dn; + int cpus_offset, cpus_subnode_offset, ret = 0; + + cpus_offset = fdt_path_offset(fdt, "/cpus"); + if (cpus_offset < 0 && cpus_offset != -FDT_ERR_NOTFOUND) { + pr_err("Malformed device tree: error reading /cpus node: %s\n", + fdt_strerror(cpus_offset)); + return cpus_offset; + } + + if (cpus_offset > 0) { + ret = fdt_del_node(fdt, cpus_offset); + if (ret < 0) { + pr_err("Error deleting /cpus node: %s\n", fdt_strerror(ret)); + return -EINVAL; + } + } + + /* Add cpus node to fdt */ + cpus_offset = fdt_add_subnode(fdt, fdt_path_offset(fdt, "/"), "cpus"); + if (cpus_offset < 0) { + pr_err("Error creating /cpus node: %s\n", fdt_strerror(cpus_offset)); + return -EINVAL; + } + + /* Add cpus node properties */ + cpus_node = of_find_node_by_path("/cpus"); + ret = add_node_props(fdt, cpus_offset, cpus_node); + of_node_put(cpus_node); + if (ret < 0) + return ret; + + /* Loop through all subnodes of cpus and add them to fdt */ + for_each_node_by_type(dn, "cpu") { + cpus_subnode_offset = fdt_add_subnode(fdt, cpus_offset, dn->full_name); + if (cpus_subnode_offset < 0) { + pr_err("Unable to add %s subnode: %s\n", dn->full_name, + fdt_strerror(cpus_subnode_offset)); + ret = cpus_subnode_offset; +
[PATCH v19 2/6] crash: add a new kexec flag for hotplug support
Commit a72bbec70da2 ("crash: hotplug support for kexec_load()") introduced a new kexec flag, `KEXEC_UPDATE_ELFCOREHDR`. Kexec tool uses this flag to indicate to the kernel that it is safe to modify the elfcorehdr of the kdump image loaded using the kexec_load system call. However, it is possible that architectures may need to update kexec segments other then elfcorehdr. For example, FDT (Flatten Device Tree) on PowerPC. Introducing a new kexec flag for every new kexec segment may not be a good solution. Hence, a generic kexec flag bit, `KEXEC_CRASH_HOTPLUG_SUPPORT`, is introduced to share the CPU/Memory hotplug support intent between the kexec tool and the kernel for the kexec_load system call. Now we have two kexec flags that enables crash hotplug support for kexec_load system call. First is KEXEC_UPDATE_ELFCOREHDR (only used in x86), and second is KEXEC_CRASH_HOTPLUG_SUPPORT (for all architectures). To simplify the process of finding and reporting the crash hotplug support the following changes are introduced. 1. Define arch specific function to process the kexec flags and determine crash hotplug support 2. Rename the @update_elfcorehdr member of struct kimage to @hotplug_support and populate it for both kexec_load and kexec_file_load syscalls, because architecture can update more than one kexec segment 3. Let generic function crash_check_hotplug_support report hotplug support for loaded kdump image based on value of @hotplug_support To bring the x86 crash hotplug support in line with the above points, the following changes have been made: - Introduce the arch_crash_hotplug_support function to process kexec flags and determine crash hotplug support - Remove the arch_crash_hotplug_[cpu|memory]_support functions Signed-off-by: Sourabh Jain Acked-by: Baoquan He Acked-by: Hari Bathini Cc: Akhil Raj Cc: Andrew Morton Cc: Aneesh Kumar K.V Cc: Borislav Petkov (AMD) Cc: Boris Ostrovsky Cc: Christophe Leroy Cc: Dave Hansen Cc: Dave Young Cc: David Hildenbrand Cc: Eric DeVolder Cc: Greg Kroah-Hartman Cc: Laurent Dufour Cc: Mahesh Salgaonkar Cc: Michael Ellerman Cc: Mimi Zohar Cc: Naveen N Rao Cc: Oscar Salvador Cc: Stephen Rothwell Cc: Thomas Gleixner Cc: Valentin Schneider Cc: Vivek Goyal Cc: ke...@lists.infradead.org Cc: x...@kernel.org --- * No changes in v19. arch/x86/include/asm/kexec.h | 11 ++- arch/x86/kernel/crash.c | 28 +--- drivers/base/cpu.c | 2 +- drivers/base/memory.c| 2 +- include/linux/crash_core.h | 13 ++--- include/linux/kexec.h| 11 +++ include/uapi/linux/kexec.h | 1 + kernel/crash_core.c | 15 ++- kernel/kexec.c | 4 ++-- kernel/kexec_file.c | 5 + 10 files changed, 48 insertions(+), 44 deletions(-) diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h index cb1320ebbc23..ae5482a2f0ca 100644 --- a/arch/x86/include/asm/kexec.h +++ b/arch/x86/include/asm/kexec.h @@ -210,15 +210,8 @@ extern void kdump_nmi_shootdown_cpus(void); void arch_crash_handle_hotplug_event(struct kimage *image, void *arg); #define arch_crash_handle_hotplug_event arch_crash_handle_hotplug_event -#ifdef CONFIG_HOTPLUG_CPU -int arch_crash_hotplug_cpu_support(void); -#define crash_hotplug_cpu_support arch_crash_hotplug_cpu_support -#endif - -#ifdef CONFIG_MEMORY_HOTPLUG -int arch_crash_hotplug_memory_support(void); -#define crash_hotplug_memory_support arch_crash_hotplug_memory_support -#endif +int arch_crash_hotplug_support(struct kimage *image, unsigned long kexec_flags); +#define arch_crash_hotplug_support arch_crash_hotplug_support unsigned int arch_crash_get_elfcorehdr_size(void); #define crash_get_elfcorehdr_size arch_crash_get_elfcorehdr_size diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c index 2a682fe86352..f06501445cd9 100644 --- a/arch/x86/kernel/crash.c +++ b/arch/x86/kernel/crash.c @@ -402,20 +402,26 @@ int crash_load_segments(struct kimage *image) #undef pr_fmt #define pr_fmt(fmt) "crash hp: " fmt -/* These functions provide the value for the sysfs crash_hotplug nodes */ -#ifdef CONFIG_HOTPLUG_CPU -int arch_crash_hotplug_cpu_support(void) +int arch_crash_hotplug_support(struct kimage *image, unsigned long kexec_flags) { - return crash_check_update_elfcorehdr(); -} -#endif -#ifdef CONFIG_MEMORY_HOTPLUG -int arch_crash_hotplug_memory_support(void) -{ - return crash_check_update_elfcorehdr(); -} +#ifdef CONFIG_KEXEC_FILE + if (image->file_mode) + return 1; #endif + /* +* Initially, crash hotplug support for kexec_load was added +* with the KEXEC_UPDATE_ELFCOREHDR flag. Later, this +* functionality was expanded to accommodate multiple kexec +* segment updates, leading to the introduction of the +* KEXEC_CRASH_HOTPLUG_SUPPORT kexec flag bit. Consequently, +* when the kexec tool sends either
[PATCH v19 0/6]powerpc/crash: Kernel handling of CPU and memory hotplug
Commit 247262756121 ("crash: add generic infrastructure for crash hotplug support") added a generic infrastructure that allows architectures to selectively update the kdump image component during CPU or memory add/remove events within the kernel itself. This patch series adds crash hotplug handler for PowerPC and enable support to update the kdump image on CPU/Memory add/remove events. Among the 6 patches in this series, the first two patches make changes to the generic crash hotplug handler to assist PowerPC in adding support for this feature. The last four patches add support for this feature. The following section outlines the problem addressed by this patch series, along with the current solution, its shortcomings, and the proposed resolution. Problem: Due to CPU/Memory hotplug or online/offline events the elfcorehdr (which describes the CPUs and memory of the crashed kernel) and FDT (Flattened Device Tree) of kdump image becomes outdated. Consequently, attempting dump collection with an outdated elfcorehdr or FDT can lead to failed or inaccurate dump collection. Going forward CPU hotplug or online/offline events are referred as CPU/Memory add/remove events. Existing solution and its shortcoming: == The current solution to address the above issue involves monitoring the CPU/memory add/remove events in userspace using udev rules and whenever there are changes in CPU and memory resources, the entire kdump image is loaded again. The kdump image includes kernel, initrd, elfcorehdr, FDT, purgatory. Given that only elfcorehdr and FDT get outdated due to CPU/Memory add/remove events, reloading the entire kdump image is inefficient. More importantly, kdump remains inactive for a substantial amount of time until the kdump reload completes. Proposed solution: == Instead of initiating a full kdump image reload from userspace on CPU/Memory hotplug and online/offline events, the proposed solution aims to update only the necessary kdump image component within the kernel itself. Git tree for testing: = https://github.com/sourabhjains/linux/tree/kdump-in-kernel-crash-update-v19 Above tree is rebased on top of v6.9-rc5 branch. To realize this feature, the kdump udev rule must be updated. On RHEL, add the following two lines at the top of the "/usr/lib/udev/rules.d/98-kexec.rules" file. SUBSYSTEM=="cpu", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end" SUBSYSTEM=="memory", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end" With the above change to the kdump udev rule, kdump reload is avoided during CPU/Memory add/remove events if this feature is enabled in the kernel. Note: only kexec_file_load syscall will work. For kexec_load minor changes are required in kexec tool. Changelog: -- v19: - Fix a build warning, remove NULL check before freeing memory. 6/6 Reported by kernel test robot - Rebase it to 6.9-rc5 v18: [No functional changes] - https://lore.kernel.org/all/20240326055413.186534-1-sourabhj...@linux.ibm.com/ - Update a comment in 2/6. - Describe the clean-up done on x86 in patch description 2/6. - Fix a minor typo in the patch description of 3/6. v17: [https://lore.kernel.org/all/20240226084118.16310-1-sourabhj...@linux.ibm.com/] - Rebase the patch series on top linux-next tree and below patch series https://lore.kernel.org/all/20240213113150.1148276-1-hbath...@linux.ibm.com/ - Split 0003 patch from v16 into two patches 1. Move get_crash_memory_ranges() along with other *_memory_ranges() functions to ranges.c and make them public. 2. Make update_cpus_node function public and take this function out of file_load_64.c - Keep arch_crash_hotplug_support in crash.c instead of core_64.c [05/06] - Use CONFIG_CRASH_MAX_MEMORY_RANGES to find extra elfcorehdr size [06/06] v16: [https://lore.kernel.org/all/20240217081452.164571-1-sourabhj...@linux.ibm.com/] - Remove the unused #define `crash_hotplug_cpu_support` and `crash_hotplug_memory_support` in `arch/x86/include/asm/kexec.h`. - Document why two kexec flag bits are used in `arch_crash_hotplug_memory_support` (x86). - Use a switch case to handle different hotplug operations in `arch_crash_handle_hotplug_event` for PowerPC. - Fix a typo in 4/5. v15: - Remove the patch that adds a new kexec flag for FDT update. - Introduce a generic kexec flag bit to share hotplug support intent between the kexec tool and the kernel for the kexec_load syscall. (2/5) - Introduce an architecture-specific handler to process the kexec flag for crash hotplug support. (2/5) - Rename the @update_elfcorehdr member of the struct kimage to @hotplug_support. (2/5) - Use a common function to advertise hotplug support for both CPU and Memory. (2/5) v14: - Fix build warnings by including necessary header files - Rebase to v6.7-rc5 v13: - Fix a build warning, take ranges.c out
[PATCH v19 1/6] crash: forward memory_notify arg to arch crash hotplug handler
In the event of memory hotplug or online/offline events, the crash memory hotplug notifier `crash_memhp_notifier()` receives a `memory_notify` object but doesn't forward that object to the generic and architecture-specific crash hotplug handler. The `memory_notify` object contains the starting PFN (Page Frame Number) and the number of pages in the hot-removed memory. This information is necessary for architectures like PowerPC to update/recreate the kdump image, specifically `elfcorehdr`. So update the function signature of `crash_handle_hotplug_event()` and `arch_crash_handle_hotplug_event()` to accept the `memory_notify` object as an argument from crash memory hotplug notifier. Since no such object is available in the case of CPU hotplug event, the crash CPU hotplug notifier `crash_cpuhp_online()` passes NULL to the crash hotplug handler. Signed-off-by: Sourabh Jain Acked-by: Baoquan He Acked-by: Hari Bathini Cc: Akhil Raj Cc: Andrew Morton Cc: Aneesh Kumar K.V Cc: Borislav Petkov (AMD) Cc: Boris Ostrovsky Cc: Christophe Leroy Cc: Dave Hansen Cc: Dave Young Cc: David Hildenbrand Cc: Greg Kroah-Hartman Cc: Laurent Dufour Cc: Mahesh Salgaonkar Cc: Michael Ellerman Cc: Mimi Zohar Cc: Naveen N Rao Cc: Oscar Salvador Cc: Stephen Rothwell Cc: Thomas Gleixner Cc: Valentin Schneider Cc: Vivek Goyal Cc: ke...@lists.infradead.org Cc: x...@kernel.org --- * No changes in v19. arch/x86/include/asm/kexec.h | 2 +- arch/x86/kernel/crash.c | 4 +++- include/linux/crash_core.h | 2 +- kernel/crash_core.c | 14 +++--- 4 files changed, 12 insertions(+), 10 deletions(-) diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h index 91ca9a9ee3a2..cb1320ebbc23 100644 --- a/arch/x86/include/asm/kexec.h +++ b/arch/x86/include/asm/kexec.h @@ -207,7 +207,7 @@ int arch_kimage_file_post_load_cleanup(struct kimage *image); extern void kdump_nmi_shootdown_cpus(void); #ifdef CONFIG_CRASH_HOTPLUG -void arch_crash_handle_hotplug_event(struct kimage *image); +void arch_crash_handle_hotplug_event(struct kimage *image, void *arg); #define arch_crash_handle_hotplug_event arch_crash_handle_hotplug_event #ifdef CONFIG_HOTPLUG_CPU diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c index e74d0c4286c1..2a682fe86352 100644 --- a/arch/x86/kernel/crash.c +++ b/arch/x86/kernel/crash.c @@ -432,10 +432,12 @@ unsigned int arch_crash_get_elfcorehdr_size(void) /** * arch_crash_handle_hotplug_event() - Handle hotplug elfcorehdr changes * @image: a pointer to kexec_crash_image + * @arg: struct memory_notify handler for memory hotplug case and + * NULL for CPU hotplug case. * * Prepare the new elfcorehdr and replace the existing elfcorehdr. */ -void arch_crash_handle_hotplug_event(struct kimage *image) +void arch_crash_handle_hotplug_event(struct kimage *image, void *arg) { void *elfbuf = NULL, *old_elfcorehdr; unsigned long nr_mem_ranges; diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h index d33352c2e386..647e928efee8 100644 --- a/include/linux/crash_core.h +++ b/include/linux/crash_core.h @@ -37,7 +37,7 @@ static inline void arch_kexec_unprotect_crashkres(void) { } #ifndef arch_crash_handle_hotplug_event -static inline void arch_crash_handle_hotplug_event(struct kimage *image) { } +static inline void arch_crash_handle_hotplug_event(struct kimage *image, void *arg) { } #endif int crash_check_update_elfcorehdr(void); diff --git a/kernel/crash_core.c b/kernel/crash_core.c index 78b5dc7cee3a..70fa8111a9d6 100644 --- a/kernel/crash_core.c +++ b/kernel/crash_core.c @@ -534,7 +534,7 @@ int crash_check_update_elfcorehdr(void) * list of segments it checks (since the elfcorehdr changes and thus * would require an update to purgatory itself to update the digest). */ -static void crash_handle_hotplug_event(unsigned int hp_action, unsigned int cpu) +static void crash_handle_hotplug_event(unsigned int hp_action, unsigned int cpu, void *arg) { struct kimage *image; @@ -596,7 +596,7 @@ static void crash_handle_hotplug_event(unsigned int hp_action, unsigned int cpu) image->hp_action = hp_action; /* Now invoke arch-specific update handler */ - arch_crash_handle_hotplug_event(image); + arch_crash_handle_hotplug_event(image, arg); /* No longer handling a hotplug event */ image->hp_action = KEXEC_CRASH_HP_NONE; @@ -612,17 +612,17 @@ static void crash_handle_hotplug_event(unsigned int hp_action, unsigned int cpu) crash_hotplug_unlock(); } -static int crash_memhp_notifier(struct notifier_block *nb, unsigned long val, void *v) +static int crash_memhp_notifier(struct notifier_block *nb, unsigned long val, void *arg) { switch (val) { case MEM_ONLINE: crash_handle_hotplug_event(KEXEC_CRASH_HP_ADD_MEMORY, - KEXEC_CRASH_HP_INVALID_CPU); + KEXEC_CRASH_HP_INVALID_CPU,