RE: [PATCH v3 1/3] PCI/AER: Store UNCOR_STATUS bits that might be ANFE in aer_err_info
>-----Original Message-----
>From: Jonathan Cameron
>Subject: Re: [PATCH v3 1/3] PCI/AER: Store UNCOR_STATUS bits that might
>be ANFE in aer_err_info
>
>On Wed, 17 Apr 2024 14:14:05 +0800
>Zhenzhong Duan wrote:
>
>> In some cases the detector of a Non-Fatal Error(NFE) is not the most
>> appropriate agent to determine the type of the error. For example,
>> when software performs a configuration read from a non-existent
>> device or Function, completer will send an ERR_NONFATAL Message.
>> On some platforms, ERR_NONFATAL results in a System Error, which
>> breaks normal software probing.
>>
>> Advisory Non-Fatal Error(ANFE) is a special case that can be used
>> in above scenario. It is predominantly determined by the role of the
>> detecting agent (Requester, Completer, or Receiver) and the specific
>> error. In such cases, an agent with AER signals the NFE (if enabled)
>> by sending an ERR_COR Message as an advisory to software, instead of
>> sending ERR_NONFATAL.
>>
>> When processing an ANFE, ideally both correctable error(CE) status and
>> uncorrectable error(UE) status should be cleared. However, there is no
>> way to fully identify the UE associated with ANFE. Even worse, a Fatal
>> Error(FE) or Non-Fatal Error(NFE) may set the same UE status bit as
>> ANFE. Treating an ANFE as NFE will reproduce above mentioned issue,
>> i.e., breaking software probing; treating NFE as ANFE will make us
>> ignore some UEs which need an active recovery operation. To avoid clearing
>> UEs that are not ANFE by accident, the most conservative route is taken
>> here: If any of the FE/NFE Detected bits is set in Device Status, do not
>> touch UE status, they should be cleared later by the UE handler. Otherwise,
>> a specific set of UEs that may be raised as ANFE according to the PCIe
>> specification will be cleared if their corresponding severity is Non-Fatal.
>>
>> To achieve above purpose, store UNCOR_STATUS bits that might be ANFE
>> in aer_err_info.anfe_status so that those bits can be printed and
>> processed later.
>>
>> Tested-by: Yudong Wang
>> Co-developed-by: "Wang, Qingshun"
>> Signed-off-by: "Wang, Qingshun"
>> Signed-off-by: Zhenzhong Duan
>> ---
>>  drivers/pci/pci.h      |  1 +
>>  drivers/pci/pcie/aer.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
>>  2 files changed, 46 insertions(+)
>>
>> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
>> index 17fed1846847..3f9eb807f9fd 100644
>> --- a/drivers/pci/pci.h
>> +++ b/drivers/pci/pci.h
>> @@ -412,6 +412,7 @@ struct aer_err_info {
>>
>>  	unsigned int status;		/* COR/UNCOR Error Status */
>>  	unsigned int mask;		/* COR/UNCOR Error Mask */
>> +	unsigned int anfe_status;	/* UNCOR Error Status for ANFE */
>>  	struct pcie_tlp_log tlp;	/* TLP Header */
>>  };
>>
>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>> index ac6293c24976..27364ab4b148 100644
>> --- a/drivers/pci/pcie/aer.c
>> +++ b/drivers/pci/pcie/aer.c
>> @@ -107,6 +107,12 @@ struct aer_stats {
>>  					PCI_ERR_ROOT_MULTI_COR_RCV |	\
>>  					PCI_ERR_ROOT_MULTI_UNCOR_RCV)
>>
>> +#define AER_ERR_ANFE_UNC_MASK		(PCI_ERR_UNC_POISON_TLP |	\
>> +					PCI_ERR_UNC_COMP_TIME |		\
>> +					PCI_ERR_UNC_COMP_ABORT |	\
>> +					PCI_ERR_UNC_UNX_COMP |		\
>> +					PCI_ERR_UNC_UNSUP)
>> +
>>  static int pcie_aer_disable;
>>  static pci_ers_result_t aer_root_reset(struct pci_dev *dev);
>>
>> @@ -1196,6 +1202,41 @@ void aer_recover_queue(int domain, unsigned int bus, unsigned int devfn,
>>  EXPORT_SYMBOL_GPL(aer_recover_queue);
>>  #endif
>>
>> +static void anfe_get_uc_status(struct pci_dev *dev, struct aer_err_info *info)
>> +{
>> +	u32 uncor_mask, uncor_status;
>> +	u16 device_status;
>> +	int aer = dev->aer_cap;
>> +
>> +	if (pcie_capability_read_word(dev, PCI_EXP_DEVSTA, &device_status))
>> +		return;
>> +	/*
>> +	 * Take the most conservative route here. If there are
>> +	 * Non-Fatal/Fatal errors detected, do not assume any
>> +	 * bit in uncor_status is set by ANFE.
>> +	 */
>> +	if (device_status & (PCI_EXP_DEVSTA_NFED | PCI_EXP_DEVSTA_FED))
>> +		return;
>> +
>
>Is there not a race here?
If we happen to get either an NFED or FED
>between the read of device_status above and here we might pick up a status
>that corresponds to that (and hence clear something we should not).

In this scenario, info->anfe_status is 0.

>
>Or am I missing that race being closed somewhere?

The bits leading to NFED or FED are masked out when assigning
info->anfe_status. Bits for FED are masked out by ~info->severity, and the
bit for NFED is masked out by AER_ERR_ANFE_UNC_MASK. So we never clear
status bits for NFED or FED in the ANFE handler. See the assignment of
info->anfe_status further down in the patch.

Thanks
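
For readers following the thread, here is a minimal userspace sketch of the
masking described in the reply. It is a model, not the patch's code: the
quoted hunk stops before the assignment of info->anfe_status, so the
expression is reconstructed from the reply text. The helper name
anfe_candidate_bits() is hypothetical, and applying ~uncor_mask is an
assumption based on the uncor_mask local declared in the quoted function.
The bit values are believed to match include/uapi/linux/pci_regs.h.

```c
#include <assert.h>
#include <stdint.h>

/* Bit values assumed to match include/uapi/linux/pci_regs.h. */
#define PCI_EXP_DEVSTA_NFED	0x0002u		/* Non-Fatal Error Detected */
#define PCI_EXP_DEVSTA_FED	0x0004u		/* Fatal Error Detected */

#define PCI_ERR_UNC_POISON_TLP	0x00001000u
#define PCI_ERR_UNC_COMP_TIME	0x00004000u
#define PCI_ERR_UNC_COMP_ABORT	0x00008000u
#define PCI_ERR_UNC_UNX_COMP	0x00010000u
#define PCI_ERR_UNC_UNSUP	0x00100000u

#define AER_ERR_ANFE_UNC_MASK	(PCI_ERR_UNC_POISON_TLP |	\
				 PCI_ERR_UNC_COMP_TIME |	\
				 PCI_ERR_UNC_COMP_ABORT |	\
				 PCI_ERR_UNC_UNX_COMP |		\
				 PCI_ERR_UNC_UNSUP)

/*
 * Hypothetical helper: return the UNCOR_STATUS bits that may belong to
 * an ANFE. Returns 0 when Device Status reports NFED or FED, which is
 * the conservative early return from the quoted hunk.
 */
uint32_t anfe_candidate_bits(uint16_t device_status, uint32_t uncor_status,
			     uint32_t uncor_mask, uint32_t uncor_severity)
{
	if (device_status & (PCI_EXP_DEVSTA_NFED | PCI_EXP_DEVSTA_FED))
		return 0;

	/* ~uncor_mask is an assumption; the other two terms are from the reply. */
	return uncor_status & ~uncor_mask & ~uncor_severity &
	       AER_ERR_ANFE_UNC_MASK;
}
```

In this model a bit configured as Fatal in the severity register never
survives the mask, and a status bit outside AER_ERR_ANFE_UNC_MASK (for
example Malformed TLP) is dropped as well, which is the property the reply
relies on.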
Re: [RFC PATCH 2/2] objtool/powerpc: Enhance objtool to fixup alternate feature relative addresses
Hi Sathvika,

On Mon, Apr 22, 2024 at 02:52:06PM +0530, Sathvika Vasireddy wrote:
> Implement build-time fixup of alternate feature relative addresses for
> the out-of-line (else) patch code. Initial posting to achieve the same
> using another tool can be found at [1]. Idea is to implement this using
> objtool instead of introducing another tool since it already has elf
> parsing and processing covered.
>
> Introduce --ftr-fixup as an option to objtool to do feature fixup at
> build-time.
>
> Couple of issues and warnings encountered while implementing feature
> fixup using objtool are as follows:
>
> 1. libelf is creating corrupted vmlinux file after writing necessary
>    changes to the file. Due to this, kexec is not able to load new
>    kernel.
>
>    It gives the following error:
>       ELF Note corrupted !
>       Cannot determine the file type of vmlinux
>
>    To fix this issue, after opening vmlinux file, make a call to
>    elf_flagelf (e, ELF_C_SET, ELF_F_LAYOUT). This instructs libelf not
>    to touch the segment and section layout. It informs the library
>    that the application will take responsibility for the layout of the
>    file and that the library should not insert any padding between
>    sections.
>
> 2. Fix can't find starting instruction warnings when run on vmlinux
>
>    Objtool throws a lot of can't find starting instruction warnings
>    when run on vmlinux with --ftr-fixup option.
>
>    These warnings are seen because find_insn() function looks for
>    instructions at offsets that are relative to the start of the section.
>    In case of individual object files (.o), there are no can't find
>    starting instruction warnings seen because the actual offset
>    associated with an instruction is itself a relative offset since the
>    sections start at offset 0x0.
>
>    However, in case of vmlinux, find_insn() function fails to find
>    instructions at the actual offset associated with an instruction
>    since the sections in vmlinux do not start at offset 0x0.
>    Due to this, find_insn() will look for absolute offset and not the
>    relative offset. This is resulting in a lot of can't find starting
>    instruction warnings when objtool is run on vmlinux.
>
>    To fix this, pass offset that is relative to the start of the section
>    to find_insn().
>
>    find_insn() is also looking for symbols of size 0. But, objtool does
>    not store empty STT_NOTYPE symbols in the rbtree. Due to this,
>    for empty symbols, objtool is throwing can't find starting
>    instruction warnings. Fix this by ignoring symbols that are of
>    size 0 since objtool does not add them to the rbtree.
>
> 3. Objtool is throwing unannotated intra-function call warnings
>    when run on vmlinux with --ftr-fixup option.
>
>    One such example:
>
>    vmlinux: warning: objtool: .text+0x3d94:
>    unannotated intra-function call
>
>    .text + 0x3d94 = c0008000 + 3d94 = c00081d4
>
>    c00081d4: 45 24 02 48   bl c002a618
>
>    c002a610 <system_reset_exception>:
>    c002a610: 0e 01 4c 3c   addis r2,r12,270
>        c002a610: R_PPC64_REL16_HA .TOC.
>    c002a614: f0 6c 42 38   addi r2,r2,27888
>        c002a614: R_PPC64_REL16_LO .TOC.+0x4
>    c002a618: a6 02 08 7c   mflr r0
>
>    This is happening because we should be looking for destination
>    symbols that are at absolute offsets instead of relative offsets.
>    After fixing dest_off to point to absolute offset, there are still
>    a lot of these warnings shown.
>
>    In the above example, objtool is computing the destination
>    offset to be c002a618, which points to a completely
>    different instruction. find_call_destination() is looking for this
>    offset and failing. Instead, we should be looking for destination
>    offset c002a610 which points to system_reset_exception
>    function.
>
>    Even after fixing the way destination offset is computed, and
>    after looking for dest_off - 0x8 in cases where the original offset
>    is not found, there are still a lot of unannotated intra-function
>    call warnings generated. This is due to symbols that are not
>    properly annotated.
>
>    So, for now, as a hack to curb these warnings, do not emit
>    unannotated intra-function call warnings when objtool is run
>    with --ftr-fixup option.
>
> TODO:
> This patch enables build time feature fixup only for powerpc little
> endian configs. There are boot failures with big endian configs.
> Posting this as an initial RFC to get some review comments while I work
> on big endian issues.
>
> [1]
> https://lore.kernel.org/linuxppc-dev/20170521010130.13552-1-npig...@gmail.com/
>
> Co-developed-by: Nicholas Piggin
> Signed-off-by: Nicholas Piggin
> Signed-off-by: Sathvika Vasireddy

When I build this series with LLVM 14 [1] (due to an issue I report below),
I am getting a crash when CONFIG_FTR_FIXUP_SELFTEST is disabled.

diff --git a/arch/powerpc/configs/ppc64_defconfig
Re: [RFC PATCH 1/2] objtool: Run objtool only if either of the config options are selected
Hi Masahiro, thanks for reviewing.

On 4/22/24 5:39 PM, Masahiro Yamada wrote:
> On Mon, Apr 22, 2024 at 6:25 PM Sathvika Vasireddy wrote:
>> Currently, when objtool is enabled and none of the supported options
>> are triggered, kernel build errors out with the below error:
>> error: objtool: At least one command required.
>
> Then, I think CONFIG_OBJTOOL should be disabled.

A subsequent patch introduces --ftr-fixup as an option to objtool to do
feature fixup at build-time via the CONFIG_HAVE_OBJTOOL_FTR_FIXUP option.
If CONFIG_OBJTOOL is not selected, then objtool cannot be run with the
--ftr-fixup option.

In cases where none of the supported options (like --mcount on powerpc,
for example) is triggered but --ftr-fixup still needs to be passed to
objtool, we see "error: objtool: At least one command required" errors.
So, to address this, run objtool only when either of the config options
is selected.

Thanks,
Sathvika
[PATCH v2] powerpc/pseries/iommu: LPAR panics during boot up with a frozen PE
At the time of LPAR boot up, partition firmware provides Open Firmware
property ibm,dma-window for the PE. This property is provided on the PCI
bus the PE is attached to.

There are exceptions where the partition firmware might not provide this
property for the PE at the time of LPAR boot up. One such scenario is
where the firmware has frozen the PE due to some error condition. This
PE stays frozen for 24 hours or until the whole system is reinitialized.

Within this time frame, if the LPAR is booted, the frozen PE will be
presented to the LPAR but ibm,dma-window property could be missing.

Today, under these circumstances, the LPAR oopses with NULL pointer
dereference, when configuring the PCI bus the PE is attached to.

BUG: Kernel NULL pointer dereference on read at 0x00c8
Faulting instruction address: 0xc01024c0
Oops: Kernel access of bad area, sig: 7 [#1]
LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
Modules linked in:
Supported: Yes
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.4.0-150600.9-default #1
Hardware name: IBM,9043-MRX POWER10 (raw) 0x800200 0xf06 of:IBM,FW1060.00 (NM1060_023) hv:phyp pSeries
NIP: c01024c0 LR: c01024b0 CTR: c0102450
REGS: c37db5c0 TRAP: 0300 Not tainted (6.4.0-150600.9-default)
MSR: 82009033 CR: 28000822 XER:
CFAR: c010254c DAR: 00c8 DSISR: 0008 IRQMASK: 0
...
NIP [c01024c0] pci_dma_bus_setup_pSeriesLP+0x70/0x2a0
LR [c01024b0] pci_dma_bus_setup_pSeriesLP+0x60/0x2a0
Call Trace:
  pci_dma_bus_setup_pSeriesLP+0x60/0x2a0 (unreliable)
  pcibios_setup_bus_self+0x1c0/0x370
  __of_scan_bus+0x2f8/0x330
  pcibios_scan_phb+0x280/0x3d0
  pcibios_init+0x88/0x12c
  do_one_initcall+0x60/0x320
  kernel_init_freeable+0x344/0x3e4
  kernel_init+0x34/0x1d0
  ret_from_kernel_user_thread+0x14/0x1c

Fixes: b1fc44eaa9ba ("pseries/iommu/ddw: Fix kdump to work in absence of ibm,dma-window")
Signed-off-by: Gaurav Batra
---
 arch/powerpc/platforms/pseries/iommu.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index e8c4129697b1..b1e6d275cda9 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -786,8 +786,16 @@ static void pci_dma_bus_setup_pSeriesLP(struct pci_bus *bus)
 	 * parent bus. During reboot, there will be ibm,dma-window property to
 	 * define DMA window. For kdump, there will at least be default window or DDW
 	 * or both.
+	 * There is an exception to the above. In case the PE goes into frozen
+	 * state, firmware may not provide ibm,dma-window property at the time
+	 * of LPAR boot up.
 	 */
 
+	if (!pdn) {
+		pr_debug("  no ibm,dma-window property !\n");
+		return;
+	}
+
 	ppci = PCI_DN(pdn);
 
 	pr_debug("  parent is %pOF, iommu_table: 0x%p\n",

base-commit: 0bbac3facb5d6cc0171c45c9873a2dc96bea9680
-- 
2.39.3 (Apple Git-146)
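
The shape of the fix can be illustrated with a small userspace model. This
is only a sketch under stated assumptions: struct node, find_dma_window_node()
and bus_setup() are hypothetical stand-ins, not the kernel's struct
device_node or the real pci_dma_bus_setup_pSeriesLP(), and the model returns
an int purely so the skip path can be observed (the kernel function is void).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a device-tree node. */
struct node {
	struct node *parent;
	int has_dma_window;	/* models presence of "ibm,dma-window" */
};

/*
 * Walk from dn towards the root and return the first node that carries
 * the property, or NULL when no ancestor has it (the frozen-PE case).
 */
struct node *find_dma_window_node(struct node *dn)
{
	for (; dn != NULL; dn = dn->parent)
		if (dn->has_dma_window)
			return dn;
	return NULL;
}

/* Returns 0 when the bus would be configured, -1 when setup is skipped. */
int bus_setup(struct node *dn)
{
	struct node *pdn = find_dma_window_node(dn);

	if (!pdn)
		return -1;	/* bail out instead of dereferencing NULL */

	/* The real code goes on to use pdn here (PCI_DN(pdn) and so on). */
	return 0;
}
```

Without the early return, the code after the walk would dereference a NULL
pdn, which corresponds to the oops in the log above.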
[PATCH v10 3/3] Documentation/powerpc: update fadump implementation details
The patch titled ("powerpc: make fadump resilient with memory add/remove
events") has made significant changes to the implementation of fadump,
particularly on elfcorehdr creation and fadump crash info header
structure. Therefore, update the fadump implementation documentation to
reflect those changes.

Following updates are done to firmware assisted dump documentation:

1. The elfcorehdr is no longer stored after fadump HDR in the reserved
   dump area. Instead, the second kernel dynamically allocates memory
   for the elfcorehdr within the address range from 0 to the boot memory
   size. Therefore, update figures 1 and 2 of Memory Reservation during
   the first and second kernels to reflect this change.

2. A version field has been added to the fadump header to manage future
   changes to the fadump crash info header structure without changing
   the fadump header magic number. Therefore, remove the corresponding
   TODO from the document.

Signed-off-by: Sourabh Jain
Cc: Aditya Gupta
Cc: Aneesh Kumar K.V
Cc: Hari Bathini
Cc: Mahesh Salgaonkar
Cc: Michael Ellerman
Cc: Naveen N Rao
---
 .../arch/powerpc/firmware-assisted-dump.rst | 91 +--
 1 file changed, 42 insertions(+), 49 deletions(-)

diff --git a/Documentation/arch/powerpc/firmware-assisted-dump.rst b/Documentation/arch/powerpc/firmware-assisted-dump.rst
index e363fc48529a..7e37aadd1f77 100644
--- a/Documentation/arch/powerpc/firmware-assisted-dump.rst
+++ b/Documentation/arch/powerpc/firmware-assisted-dump.rst
@@ -134,12 +134,12 @@ that are run. If there is dump data, then the
 memory is held.
 
 If there is no waiting dump data, then only the memory required to
-hold CPU state, HPTE region, boot memory dump, FADump header and
-elfcore header, is usually reserved at an offset greater than boot
-memory size (see Fig. 1). This area is *not* released: this region
-will be kept permanently reserved, so that it can act as a receptacle
-for a copy of the boot memory content in addition to CPU state and
-HPTE region, in the case a crash does occur.
+hold CPU state, HPTE region, boot memory dump, and FADump header is
+usually reserved at an offset greater than boot memory size (see Fig. 1).
+This area is *not* released: this region will be kept permanently
+reserved, so that it can act as a receptacle for a copy of the boot
+memory content in addition to CPU state and HPTE region, in the case
+a crash does occur.
 
 Since this reserved memory area is used only after the system crash,
 there is no point in blocking this significant chunk of memory from
@@ -153,22 +153,22 @@ that were present in CMA region::
 
   o Memory Reservation during first kernel
 
[Fig. 1 hunk: the ASCII diagram did not survive archiving. The removed
version draws the reserved dump area (Permanent Reservation), between boot
memory size and top of memory, as CPU | HPTE | DUMP | HDR | ELF. The added
version drops the ELF segment, leaving CPU | HPTE | DUMP | HDR. In both,
boot memory content is transferred to the reserved area by firmware at the
time of crash, and HDR is labelled "FADump Header (meta area)".]

 Metadata: This area holds a metadata structure whose
@@ -186,13 +186,20 @@ that were present in
[PATCH v10 2/3] powerpc/fadump: add hotplug_ready sysfs interface
The elfcorehdr describes the CPUs and memory of the crashed kernel to
the kernel that captures the dump, known as the second or fadump kernel.
The elfcorehdr needs to be updated if the system's memory changes due to
memory hotplug or online/offline events.

Currently, memory hotplug events are monitored in userspace by udev
rules, and fadump is re-registered, which recreates the elfcorehdr with
the latest available memory in the system.

However, the previous patch ("powerpc: make fadump resilient with memory
add/remove events") moved the creation of elfcorehdr to the second or
fadump kernel. This eliminates the need to regenerate the elfcorehdr
during memory hotplug or online/offline events.

Create a sysfs entry at /sys/kernel/fadump/hotplug_ready to let
userspace know that fadump re-registration is not required for memory
add/remove events.

Signed-off-by: Sourabh Jain
Cc: Aditya Gupta
Cc: "Aneesh Kumar K.V"
Cc: Hari Bathini
Cc: Mahesh Salgaonkar
Cc: Michael Ellerman
Cc: Naveen N Rao
---
 Documentation/ABI/testing/sysfs-kernel-fadump | 11 +++++++++++
 arch/powerpc/kernel/fadump.c                  | 14 ++++++++++++++
 2 files changed, 25 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-kernel-fadump b/Documentation/ABI/testing/sysfs-kernel-fadump
index 8f7a64a81783..c586054657d6 100644
--- a/Documentation/ABI/testing/sysfs-kernel-fadump
+++ b/Documentation/ABI/testing/sysfs-kernel-fadump
@@ -38,3 +38,14 @@ Contact:	linuxppc-dev@lists.ozlabs.org
 Description:	read only
 		Provide information about the amount of memory reserved by
 		FADump to save the crash dump in bytes.
+
+What:		/sys/kernel/fadump/hotplug_ready
+Date:		Apr 2024
+Contact:	linuxppc-dev@lists.ozlabs.org
+Description:	read only
+		Kdump udev rule re-registers fadump on memory add/remove events,
+		primarily to update the elfcorehdr. This sysfs indicates to the
+		kdump udev rule that fadump re-registration is not required on
+		memory add/remove events because elfcorehdr is now prepared in
+		the second/fadump kernel.
+User:		kexec-tools
diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index 35254fc1516b..dfab452e947b 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -1442,6 +1442,18 @@ static ssize_t enabled_show(struct kobject *kobj,
 	return sprintf(buf, "%d\n", fw_dump.fadump_enabled);
 }
 
+/*
+ * /sys/kernel/fadump/hotplug_ready sysfs node returns 1, which indicates
+ * to userspace that fadump re-registration is not required on memory
+ * hotplug events.
+ */
+static ssize_t hotplug_ready_show(struct kobject *kobj,
+				      struct kobj_attribute *attr,
+				      char *buf)
+{
+	return sprintf(buf, "%d\n", 1);
+}
+
 static ssize_t mem_reserved_show(struct kobject *kobj,
 				 struct kobj_attribute *attr,
 				 char *buf)
@@ -1514,11 +1526,13 @@ static struct kobj_attribute release_attr = __ATTR_WO(release_mem);
 static struct kobj_attribute enable_attr = __ATTR_RO(enabled);
 static struct kobj_attribute register_attr = __ATTR_RW(registered);
 static struct kobj_attribute mem_reserved_attr = __ATTR_RO(mem_reserved);
+static struct kobj_attribute hotplug_ready_attr = __ATTR_RO(hotplug_ready);
 
 static struct attribute *fadump_attrs[] = {
 	&enable_attr.attr,
 	&register_attr.attr,
 	&mem_reserved_attr.attr,
+	&hotplug_ready_attr.attr,
 	NULL,
 };
 
-- 
2.44.0
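
On the consumer side, a userspace tool could test the new node roughly as
follows. This is a sketch only: the helper name fadump_hotplug_ready() is
hypothetical, it is not taken from kexec-tools, and it treats a missing
node (an older kernel) the same as a value of 0, which is an assumption
about the desired fallback.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: return 1 if the file at path starts with '1',
 * otherwise 0. A missing or unreadable file also yields 0, so callers
 * fall back to re-registering fadump on memory add/remove events.
 */
int fadump_hotplug_ready(const char *path)
{
	FILE *f = fopen(path, "r");
	int c;

	if (!f)
		return 0;

	c = fgetc(f);
	fclose(f);
	return c == '1';
}
```

A caller would pass "/sys/kernel/fadump/hotplug_ready" and skip
re-registration when the helper returns 1.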
[PATCH v10 1/3] powerpc: make fadump resilient with memory add/remove events
Due to changes in memory resources caused by either memory hotplug or
online/offline events, the elfcorehdr, which describes the CPUs and
memory of the crashed kernel to the kernel that collects the dump (known
as second/fadump kernel), becomes outdated. Consequently, attempting
dump collection with an outdated elfcorehdr can lead to failed or
inaccurate dump collection.

Memory hotplug or online/offline events are referred to as memory
add/remove events in the rest of the commit message.

The current solution to address the aforementioned issue is as follows:
Monitor memory add/remove events in userspace using udev rules, and
re-register fadump whenever there are changes in memory resources. This
leads to the creation of a new elfcorehdr with updated system memory
information.

There are several notable issues associated with re-registering fadump
for every memory add/remove event.

1. Bulk memory add/remove events with udev-based fadump re-registration
   can lead to race conditions and, more importantly, it creates a wide
   window during which fadump is inactive until all memory add/remove
   events are settled.
2. Re-registering fadump for every memory add/remove event is
   inefficient.
3. The memory for elfcorehdr is allocated based on the memblock regions
   available during early boot and remains fixed thereafter. However, if
   elfcorehdr is later recreated with additional memblock regions, its
   size will increase, potentially leading to memory corruption.

Address the aforementioned challenges by shifting the creation of
elfcorehdr from the first kernel (also referred to as the crashed
kernel), where it was created and frequently recreated for every memory
add/remove event, to the fadump kernel. As a result, the elfcorehdr only
needs to be created once, thus eliminating the necessity to re-register
fadump during memory add/remove events.

At present, the first kernel prepares fadump header and stores it in the
fadump reserved area. The fadump header includes the start address of
the elfcorehdr, crashing CPU details, and other relevant information. In
the event of a crash in the first kernel, the second/fadump kernel boots
and accesses the fadump header prepared by the first kernel. It then
performs the following steps in a platform-specific function
[rtas|opal]_fadump_process:

1. Sanity check for fadump header
2. Update CPU notes in elfcorehdr

Along with the above, update the setup_fadump()/fadump.c to create
elfcorehdr and set its address to the global variable elfcorehdr_addr
for the vmcore module to process it in the second/fadump kernel.

Section below outlines the information required to create the
elfcorehdr and the changes made to make it available to the fadump
kernel if it's not already.

To create elfcorehdr, the following crashed kernel information is
required: CPU notes, vmcoreinfo, and memory ranges.

At present, the CPU notes are already prepared in the fadump kernel, so
no changes are needed in that regard. The fadump kernel has access to
all crashed kernel memory regions, including boot memory regions that
are relocated by firmware to fadump reserved areas, so no changes for
that either. However, it is necessary to add new members to the fadump
header, i.e., the 'fadump_crash_info_header' structure, in order to pass
the crashed kernel's vmcoreinfo address and its size to fadump kernel.

In addition to the vmcoreinfo address and size, there are a few other
attributes also added to the fadump_crash_info_header structure.

1. version:
   It stores the fadump header version, which is currently set to 1.
   This provides flexibility to update the fadump crash info header in
   the future without changing the magic number. For each change in the
   fadump header, the version will be increased. This will help the
   updated kernel determine how to handle kernel dumps from older
   kernels. The magic number remains relevant for checking fadump header
   corruption.

2. pt_regs_sz/cpu_mask_sz:
   Store size of pt_regs and cpu_mask structure of first kernel. These
   attributes are used to prevent dump processing if the sizes of
   pt_regs or cpu_mask structure differ between the first and fadump
   kernels.

Note: if either the first/crashed kernel or the second/fadump kernel does
not have the changes introduced here, the kernel fails to collect the dump
and prints a relevant error message on the console.

Signed-off-by: Sourabh Jain
Cc: Aditya Gupta
Cc: "Aneesh Kumar K.V"
Cc: Hari Bathini
Cc: Mahesh Salgaonkar
Cc: Michael Ellerman
Cc: Naveen N Rao
---
 arch/powerpc/include/asm/fadump-internal.h   |  31 +-
 arch/powerpc/kernel/fadump.c                 | 361 +++
 arch/powerpc/platforms/powernv/opal-fadump.c |  22 +-
 arch/powerpc/platforms/pseries/rtas-fadump.c |  34 +-
 4 files changed, 242 insertions(+), 206 deletions(-)

diff --git a/arch/powerpc/include/asm/fadump-internal.h b/arch/powerpc/include/asm/fadump-internal.h
index 27f9e11eda28..5d706a7acc8a 100644
---
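
The header checks described in the commit message can be pictured with the
following standalone sketch. Everything here is illustrative: the struct
layout, the DEMO_ names and the magic value are hypothetical and do not
reproduce the kernel's fadump_crash_info_header, and rejecting any version
other than the current one is an assumption made for brevity (the commit
message leaves room for a newer kernel to handle older versions).

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_FADUMP_MAGIC	0x46414455u	/* hypothetical magic value */
#define DEMO_FADUMP_VERSION	1u		/* current header version */

/* Hypothetical, reduced model of the crash info header. */
struct demo_crash_info_header {
	uint64_t magic_number;
	uint32_t version;
	uint32_t pt_regs_sz;	/* sizeof(struct pt_regs) in first kernel */
	uint32_t cpu_mask_sz;	/* sizeof(struct cpumask) in first kernel */
};

/*
 * Return 1 if the fadump kernel should process the dump, 0 otherwise.
 * my_pt_regs_sz and my_cpu_mask_sz are the sizes seen by the fadump
 * kernel itself.
 */
int demo_header_ok(const struct demo_crash_info_header *h,
		   uint32_t my_pt_regs_sz, uint32_t my_cpu_mask_sz)
{
	if (h->magic_number != DEMO_FADUMP_MAGIC)
		return 0;	/* corrupted or foreign header */
	if (h->version != DEMO_FADUMP_VERSION)
		return 0;	/* assumption: only the current version is accepted */
	if (h->pt_regs_sz != my_pt_regs_sz)
		return 0;	/* register layout differs between kernels */
	if (h->cpu_mask_sz != my_cpu_mask_sz)
		return 0;	/* cpumask size differs between kernels */
	return 1;
}
```

The magic number guards against corruption, while the version and size
fields guard against the two kernels disagreeing on the header contents.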
[PATCH v10 0/3] powerpc: make fadump resilient with memory add/remove events
Problem:
========
Due to changes in memory resources caused by either memory hotplug or
online/offline events, the elfcorehdr, which describes the CPUs and
memory of the crashed kernel to the kernel that collects the dump (known
as second/fadump kernel), becomes outdated. Consequently, attempting
dump collection with an outdated elfcorehdr can lead to failed or
inaccurate dump collection.

Memory hotplug or online/offline events are referred to as memory
add/remove events in the rest of the patch series.

Existing solution:
==================
Monitor memory add/remove events in userspace using udev rules, and
re-register fadump whenever there are changes in memory resources. This
leads to the creation of a new elfcorehdr with updated system memory
information.

Challenges with existing solution:
==================================
1. Performing bulk memory add/remove with udev-based fadump
   re-registration can lead to race conditions and, more importantly,
   it creates a wide window during which fadump is inactive until all
   memory add/remove events are settled.

2. Re-registering fadump for every memory add/remove event is
   inefficient.

3. Memory for elfcorehdr is allocated based on the memblock regions
   available during first kernel early boot and it remains fixed
   thereafter. However, if the elfcorehdr is later recreated with
   additional memblock regions, its size will increase, potentially
   leading to memory corruption.

Proposed solution:
==================
Address the aforementioned challenges by shifting the creation of
elfcorehdr from the first kernel (also referred to as the crashed
kernel), where it was created and frequently recreated for every memory
add/remove event, to the fadump kernel. As a result, the elfcorehdr only
needs to be created once, thus eliminating the necessity to re-register
fadump during memory add/remove events.

To know more about elfcorehdr creation in the fadump kernel, refer to
the first patch in this series.

The second patch includes a new sysfs interface that tells userspace
that fadump re-registration isn't needed for memory add/remove events.
Note that userspace changes do not need to be in sync with kernel
changes; they can roll out independently.

Since there are significant changes in the fadump implementation, the
third patch updates the fadump documentation to reflect the changes made
in this patch series.

Kernel tree rebased on 6.9.0-rc5 with patch series applied:
===========================================================
https://github.com/sourabhjains/linux/tree/fadump-mem-hotplug-v10

Userspace changes:
==================
To realize this feature, one must update the kdump udev rules to prevent
fadump re-registration during memory add/remove events.

On rhel apply the following changes to file
/usr/lib/udev/rules.d/98-kexec.rules

-RUN+="/bin/sh -c '/usr/bin/systemctl is-active kdump.service || exit 0; /usr/bin/systemd-run --quiet --no-block /usr/lib/udev/kdump-udev-throttler'"
+# don't re-register fadump if the value of the node
+# /sys/kernel/fadump/hotplug_ready is 1.
+
+RUN+="/bin/sh -c '/usr/bin/systemctl is-active kdump.service || exit 0; ! test -f /sys/kernel/fadump_enabled || cat /sys/kernel/fadump_enabled | grep 0 || ! test -f /sys/kernel/fadump/hotplug_ready || cat /sys/kernel/fadump/hotplug_ready | grep 0 || exit 0; /usr/bin/systemd-run --quiet --no-block /usr/lib/udev/kdump-udev-throttler'"

Changelog:
==========
v10: 23 Apr 2024
  - Fix a type cast build error. 1/3
  - Rebase it to 6.9-rc5.

v9: 16 Apr 2024
  https://lore.kernel.org/all/20240416080848.347602-1-sourabhj...@linux.ibm.com/
  - Set the physical address of elfcorehdr to elfcorehdr_addr. 1/3
  - Set elfcorehdr_addr to ELFCORE_ADDR_ERR before freeing the
    elfcorehdr. 1/3
  - Mark the populate_elf_pt_load function as __init. 1/3
  - Rename a function from process_fadump to fadump_process. 1/3
  - Make minor changes to the commit message and a couple of comments
    in 1/3.
  - Update date of introduction of /sys/kernel/fadump/hotplug_ready
    sysfs. 2/3
  - Rebase it to 6.9-rc3.

v8: 16 Feb 2024
  https://lore.kernel.org/all/20240217072004.148293-1-sourabhj...@linux.ibm.com/
  - Move `elfcorehdr_addr` and `elfcorehdr_size` struct attributes from
    `struct fadump_crash_info_header` to `struct fw_dump`.
  - Make minor changes in commit message 1/3.
  - Rebase it to 6.8-rc4.

v7: 11 Jan 2024
  https://lore.kernel.org/all/2024040943.297501-1-sourabhj...@linux.ibm.com/
  - Rebase it to 6.7

v6: 8 Dec 2023
  https://lore.kernel.org/all/20231208115159.82236-1-sourabhj...@linux.ibm.com/
  - Add size fields for `pt_regs` and `cpumask` in the fadump header
    structure
  - Don't process the dump if the size of `pt_regs` and `cpu_mask` is
    not same in the crashed and fadump kernel
  - Include an additional check for endianness mismatch when the magic
    number doesn't match, to print the relevant error message
  - Don't process the dump if the fadump header contains an old
Re: [PATCH v4 05/15] mm: introduce execmem_alloc() and execmem_free()
Hi Masami and Mike,

On Sat, Apr 20, 2024 at 2:11 AM Masami Hiramatsu wrote:
[...]
> > >
> > > IIUC, we need to update __execmem_cache_alloc() to take a range pointer as
> > > input. module text will use "range" for EXECMEM_MODULE_TEXT, while kprobe
> > > will use "range" for EXECMEM_KPROBE. Without "map to" concept or sharing
> > > the "range" object, we will have to compare different range parameters to
> > > check we can share cached pages between module text and kprobe, which is
> > > not efficient. Did I miss something?
>
> Song, thanks for trying to explain. I think I need to explain why I used
> module_alloc() originally.
>
> This depends on how kprobe features are implemented on the architecture, and
> how many features are supported on kprobes.
>
> Because kprobe jump optimization and kprobe jump-back optimization need to
> use a jump instruction to jump into the trampoline and jump back from the
> trampoline directly, if the architecture jmp instruction supports +-2GB range
> like x86, it needs to allocate the trampoline buffer inside such address
> space.
> This requirement is similar to the modules (because module function needs to
> call other functions in the kernel etc.), at least kprobes on x86 used
> module_alloc().
>
> However, if an architecture only supports breakpoint/trap based kprobe,
> it does not need to consider whether the execmem is allocated.
>
> >
> > We can always share large ROX pages as long as they are within the correct
> > address space. The permissions for them are ROX and the alignment
> > differences are due to KASAN and this is handled during allocation of the
> > large page to refill the cache. __execmem_cache_alloc() only needs to limit
> > the search for the address space of the range.
>
> So I don't think EXECMEM_KPROBE is always the same as EXECMEM_MODULE_TEXT,
> it should be configured for each arch. Especially, if it is only used for
> searching parameter, it looks OK to me.

Thanks for the explanation!

I was thinking "we can have EXECMEM_KPROBE share the same parameters as
EXECMEM_MODULE_TEXT for all architectures". But this thought is built on
top of assumptions about future changes/improvements within multiple
subsystems. At this moment, I have no objections to moving forward with
the current execmem APIs.

Thanks,
Song
Re: [PATCH v3 14/15] riscv: Add support for suppressing warning backtraces
On Wed, Apr 03, 2024 at 06:19:35AM -0700, Guenter Roeck wrote:
> Add name of functions triggering warning backtraces to the __bug_table
> object section to enable support for suppressing WARNING backtraces.
>
> To limit image size impact, the pointer to the function name is only added
> to the __bug_table section if both CONFIG_KUNIT_SUPPRESS_BACKTRACE and
> CONFIG_DEBUG_BUGVERBOSE are enabled. Otherwise, the __func__ assembly
> parameter is replaced with a (dummy) NULL parameter to avoid an image size
> increase due to unused __func__ entries (this is necessary because __func__
> is not a define but a virtual variable).
>
> To simplify the implementation, unify the __BUG_ENTRY_ADDR and
> __BUG_ENTRY_FILE macros into a single macro named __BUG_REL() which takes
> the address, file, or function reference as parameter.
>
> Tested-by: Linux Kernel Functional Testing
> Acked-by: Dan Carpenter
> Cc: Paul Walmsley
> Cc: Palmer Dabbelt
> Cc: Albert Ou
> Signed-off-by: Guenter Roeck
> ---
> v2:
> - Rebased to v6.9-rc1
> - Added Tested-by:, Acked-by:, and Reviewed-by: tags
> - Introduced KUNIT_SUPPRESS_BACKTRACE configuration option
> v3:
> - Rebased to v6.9-rc2
>
>  arch/riscv/include/asm/bug.h | 38
>  1 file changed, 26 insertions(+), 12 deletions(-)
>
> diff --git a/arch/riscv/include/asm/bug.h b/arch/riscv/include/asm/bug.h
> index 1aaea81fb141..79f360af4ad8 100644
> --- a/arch/riscv/include/asm/bug.h
> +++ b/arch/riscv/include/asm/bug.h
> @@ -30,26 +30,39 @@
>  typedef u32 bug_insn_t;
>
>  #ifdef CONFIG_GENERIC_BUG_RELATIVE_POINTERS
> -#define __BUG_ENTRY_ADDR	RISCV_INT " 1b - ."
> -#define __BUG_ENTRY_FILE	RISCV_INT " %0 - ."
> +#define __BUG_REL(val)	RISCV_INT " " __stringify(val) " - ."
>  #else
> -#define __BUG_ENTRY_ADDR	RISCV_PTR " 1b"
> -#define __BUG_ENTRY_FILE	RISCV_PTR " %0"
> +#define __BUG_REL(val)	RISCV_PTR " " __stringify(val)
>  #endif
>
>  #ifdef CONFIG_DEBUG_BUGVERBOSE
> +
> +#ifdef CONFIG_KUNIT_SUPPRESS_BACKTRACE
> +# define HAVE_BUG_FUNCTION
> +# define __BUG_FUNC_PTR	__BUG_REL(%1)
> +#else
> +# define __BUG_FUNC_PTR
> +#endif /* CONFIG_KUNIT_SUPPRESS_BACKTRACE */
> +
>  #define __BUG_ENTRY			\
> -	__BUG_ENTRY_ADDR "\n\t"		\
> -	__BUG_ENTRY_FILE "\n\t"		\
> -	RISCV_SHORT " %1\n\t"		\
> -	RISCV_SHORT " %2"
> +	__BUG_REL(1b) "\n\t"		\
> +	__BUG_REL(%0) "\n\t"		\
> +	__BUG_FUNC_PTR "\n\t"		\
> +	RISCV_SHORT " %2\n\t"		\
> +	RISCV_SHORT " %3"
>  #else
>  #define __BUG_ENTRY			\
> -	__BUG_ENTRY_ADDR "\n\t"		\
> -	RISCV_SHORT " %2"
> +	__BUG_REL(1b) "\n\t"		\
> +	RISCV_SHORT " %3"
>  #endif
>
>  #ifdef CONFIG_GENERIC_BUG
> +#ifdef HAVE_BUG_FUNCTION
> +# define __BUG_FUNC	__func__
> +#else
> +# define __BUG_FUNC	NULL
> +#endif
> +
>  #define __BUG_FLAGS(flags)					\
>  do {								\
>  	__asm__ __volatile__ (					\
> @@ -58,10 +71,11 @@ do {								\
>  		".pushsection __bug_table,\"aw\"\n\t"		\
>  	"2:\n\t"						\
>  		__BUG_ENTRY "\n\t"				\
> -		".org 2b + %3\n\t"				\
> +		".org 2b + %4\n\t"				\
>  		".popsection"					\
>  	:							\
> -	: "i" (__FILE__), "i" (__LINE__),			\
> +	: "i" (__FILE__), "i" (__BUG_FUNC),			\
> +	  "i" (__LINE__),					\
>  	  "i" (flags),						\
>  	  "i" (sizeof(struct bug_entry)));			\
>  } while (0)
> --
> 2.39.2

Reviewed-by: Charlie Jenkins

- Charlie
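
To make the __BUG_REL() unification easier to follow, the sketch below reproduces the string pasting in plain userspace C. DEMO_INT and DEMO_PTR are stand-ins chosen for illustration (the real RISCV_INT/RISCV_PTR come from the riscv asm headers and may expand differently), and DEMO_STR() only mimics the two-level stringification idea behind __stringify().

```c
#include <assert.h>
#include <string.h>

/* Two-level stringification, the same idea as the kernel's __stringify(). */
#define DEMO_STR_1(...)	#__VA_ARGS__
#define DEMO_STR(...)	DEMO_STR_1(__VA_ARGS__)

/* Assumed stand-ins for RISCV_INT / RISCV_PTR, for illustration only. */
#define DEMO_INT	".word"
#define DEMO_PTR	".dword"

/* Relative-pointer variant: "<directive> <val> - ." */
#define DEMO_BUG_REL(val)	DEMO_INT " " DEMO_STR(val) " - ."
/* Absolute-pointer variant: "<directive> <val>" */
#define DEMO_BUG_ABS(val)	DEMO_PTR " " DEMO_STR(val)

/* One macro now covers the address, file and function references. */
static const char *demo_addr_entry(void) { return DEMO_BUG_REL(1b); }
static const char *demo_file_entry(void) { return DEMO_BUG_REL(%0); }
static const char *demo_func_entry(void) { return DEMO_BUG_REL(%1); }
static const char *demo_abs_entry(void)  { return DEMO_BUG_ABS(%1); }

/* Usage example: the pasted strings are what the assembler would see. */
static void demo_bug_rel_selfcheck(void)
{
	assert(strcmp(demo_addr_entry(), ".word 1b - .") == 0);
	assert(strcmp(demo_file_entry(), ".word %0 - .") == 0);
	assert(strcmp(demo_func_entry(), ".word %1 - .") == 0);
	assert(strcmp(demo_abs_entry(), ".dword %1") == 0);
}
```

Because the function-name operand is inserted as %1, the later operands shift by one, which matches the patch renumbering %1/%2/%3 to %2/%3/%4.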
Re: [PATCH v3 1/3] PCI/AER: Store UNCOR_STATUS bits that might be ANFE in aer_err_info
On Wed, 17 Apr 2024 14:14:05 +0800
Zhenzhong Duan wrote:

> In some cases the detector of a Non-Fatal Error(NFE) is not the most
> appropriate agent to determine the type of the error. For example,
> when software performs a configuration read from a non-existent
> device or Function, completer will send an ERR_NONFATAL Message.
> On some platforms, ERR_NONFATAL results in a System Error, which
> breaks normal software probing.
>
> Advisory Non-Fatal Error(ANFE) is a special case that can be used
> in above scenario. It is predominantly determined by the role of the
> detecting agent (Requester, Completer, or Receiver) and the specific
> error. In such cases, an agent with AER signals the NFE (if enabled)
> by sending an ERR_COR Message as an advisory to software, instead of
> sending ERR_NONFATAL.
>
> When processing an ANFE, ideally both correctable error(CE) status and
> uncorrectable error(UE) status should be cleared. However, there is no
> way to fully identify the UE associated with ANFE. Even worse, a Fatal
> Error(FE) or Non-Fatal Error(NFE) may set the same UE status bit as
> ANFE. Treating an ANFE as NFE will reproduce above mentioned issue,
> i.e., breaking software probing; treating NFE as ANFE will make us
> ignore some UEs which need active recover operation. To avoid clearing
> UEs that are not ANFE by accident, the most conservative route is taken
> here: If any of the FE/NFE Detected bits is set in Device Status, do not
> touch UE status, they should be cleared later by the UE handler. Otherwise,
> a specific set of UEs that may be raised as ANFE according to the PCIe
> specification will be cleared if their corresponding severity is Non-Fatal.
>
> To achieve above purpose, store UNCOR_STATUS bits that might be ANFE
> in aer_err_info.anfe_status. So that those bits could be printed and
> processed later.
>
> Tested-by: Yudong Wang
> Co-developed-by: "Wang, Qingshun"
> Signed-off-by: "Wang, Qingshun"
> Signed-off-by: Zhenzhong Duan
> ---
>  drivers/pci/pci.h      |  1 +
>  drivers/pci/pcie/aer.c | 45 ++
>  2 files changed, 46 insertions(+)
>
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 17fed1846847..3f9eb807f9fd 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -412,6 +412,7 @@ struct aer_err_info {
>
>  	unsigned int status;		/* COR/UNCOR Error Status */
>  	unsigned int mask;		/* COR/UNCOR Error Mask */
> +	unsigned int anfe_status;	/* UNCOR Error Status for ANFE */
>  	struct pcie_tlp_log tlp;	/* TLP Header */
>  };
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index ac6293c24976..27364ab4b148 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -107,6 +107,12 @@ struct aer_stats {
>  					PCI_ERR_ROOT_MULTI_COR_RCV |	\
>  					PCI_ERR_ROOT_MULTI_UNCOR_RCV)
>
> +#define AER_ERR_ANFE_UNC_MASK		(PCI_ERR_UNC_POISON_TLP |	\
> +					PCI_ERR_UNC_COMP_TIME |		\
> +					PCI_ERR_UNC_COMP_ABORT |	\
> +					PCI_ERR_UNC_UNX_COMP |		\
> +					PCI_ERR_UNC_UNSUP)
> +
>  static int pcie_aer_disable;
>  static pci_ers_result_t aer_root_reset(struct pci_dev *dev);
>
> @@ -1196,6 +1202,41 @@ void aer_recover_queue(int domain, unsigned int bus, unsigned int devfn,
>  EXPORT_SYMBOL_GPL(aer_recover_queue);
>  #endif
>
> +static void anfe_get_uc_status(struct pci_dev *dev, struct aer_err_info *info)
> +{
> +	u32 uncor_mask, uncor_status;
> +	u16 device_status;
> +	int aer = dev->aer_cap;
> +
> +	if (pcie_capability_read_word(dev, PCI_EXP_DEVSTA, &device_status))
> +		return;
> +	/*
> +	 * Take the most conservative route here. If there are
> +	 * Non-Fatal/Fatal errors detected, do not assume any
> +	 * bit in uncor_status is set by ANFE.
> +	 */
> +	if (device_status & (PCI_EXP_DEVSTA_NFED | PCI_EXP_DEVSTA_FED))
> +		return;
> +

Is there not a race here? If we happen to get either an NFED or FED
between the read of device_status above and here we might pick up a status
that corresponds to that (and hence clear something we should not).

Or am I missing that race being closed somewhere?

> +	pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_STATUS, &uncor_status);
> +	pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_MASK, &uncor_mask);
> +	/*
> +	 * According to PCIe Base Specification Revision 6.1,
> +	 * Section 6.2.3.2.4, if an UNCOR error is raised as
> +	 * Advisory Non-Fatal error, it will match the following
> +	 * conditions:
> +	 *	a. The severity of the error is Non-Fatal.
> +	 *	b. The error is one of the following:
> +	 *		1. Poisoned TLP (Section 6.2.3.2.4.3)
> +
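
As a reading aid, the filtering described in the commit message can be sketched in userspace C as below. This is only an illustration under assumptions: the rest of the quoted function is cut off above, so combining the status with the mask and severity registers this way is inferred from the commit text ("cleared if their corresponding severity is Non-Fatal"), and demo_anfe_candidates() is a made-up name. The bit values are believed to match include/uapi/linux/pci_regs.h. The sketch works on register snapshots and therefore does nothing about the read ordering race raised above.

```c
#include <assert.h>
#include <stdint.h>

/* Device Status bits (values believed to match pci_regs.h). */
#define PCI_EXP_DEVSTA_NFED	0x0002	/* Non-Fatal Error Detected */
#define PCI_EXP_DEVSTA_FED	0x0004	/* Fatal Error Detected */

/* Uncorrectable Error Status bits that may be raised as ANFE. */
#define PCI_ERR_UNC_POISON_TLP	0x00001000
#define PCI_ERR_UNC_COMP_TIME	0x00004000
#define PCI_ERR_UNC_COMP_ABORT	0x00008000
#define PCI_ERR_UNC_UNX_COMP	0x00010000
#define PCI_ERR_UNC_UNSUP	0x00100000

#define AER_ERR_ANFE_UNC_MASK	(PCI_ERR_UNC_POISON_TLP |	\
				 PCI_ERR_UNC_COMP_TIME |	\
				 PCI_ERR_UNC_COMP_ABORT |	\
				 PCI_ERR_UNC_UNX_COMP |		\
				 PCI_ERR_UNC_UNSUP)

/*
 * Hypothetical helper: given snapshots of the registers, return the
 * UNCOR_STATUS bits that could plausibly belong to an ANFE.
 * A severity bit of 1 means Fatal, 0 means Non-Fatal.
 */
static uint32_t demo_anfe_candidates(uint16_t device_status,
				     uint32_t uncor_status,
				     uint32_t uncor_mask,
				     uint32_t uncor_severity)
{
	/* Conservative route: a real NFE/FE is pending, claim nothing. */
	if (device_status & (PCI_EXP_DEVSTA_NFED | PCI_EXP_DEVSTA_FED))
		return 0;

	return uncor_status & ~uncor_mask & ~uncor_severity &
	       AER_ERR_ANFE_UNC_MASK;
}

/* Usage example with made-up register contents. */
static void demo_anfe_selfcheck(void)
{
	/* Poisoned TLP, unmasked, Non-Fatal severity: treated as ANFE. */
	assert(demo_anfe_candidates(0, PCI_ERR_UNC_POISON_TLP, 0, 0) ==
	       PCI_ERR_UNC_POISON_TLP);
	/* Same bit, but Device Status reports an NFE: left alone. */
	assert(demo_anfe_candidates(PCI_EXP_DEVSTA_NFED,
				    PCI_ERR_UNC_POISON_TLP, 0, 0) == 0);
}
```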
Re: [PATCH v5 RESEND] arch/powerpc/kvm: Add support for reading VPA counters for pseries guests
On Tue, Apr 02, 2024 at 12:36:54PM +0530, Gautam Menghani wrote:
> PAPR hypervisor has introduced three new counters in the VPA area of
> LPAR CPUs for KVM L2 guest (see [1] for terminology) observability - 2
> for context switches from host to guest and vice versa, and 1 counter
> for getting the total time spent inside the KVM guest. Add a tracepoint
> that enables reading the counters for use by ftrace/perf. Note that this
> tracepoint is only available for nestedv2 API (i.e, KVM on PowerVM).
>
> Also maintain an aggregation of the context switch times in vcpu->arch.
> This will be useful in getting the aggregate times with a pmu driver
> which will be upstreamed in the near future.

It would be better to add code to maintain aggregate times as part of
that pmu driver.

>
> [1] Terminology:
> a. L1 refers to the VM (LPAR) booted on top of PAPR hypervisor
> b. L2 refers to the KVM guest booted on top of L1.
>
> Signed-off-by: Vaibhav Jain
> Signed-off-by: Gautam Menghani
> ---
> v5 RESEND:
> 1. Add the changelog
>
> v4 -> v5:
> 1. Define helper functions for getting/setting the accumulation counter
> in L2's VPA
>
> v3 -> v4:
> 1. After vcpu_run, check the VPA flag instead of checking for tracepoint
> being enabled for disabling the cs time accumulation.
>
> v2 -> v3:
> 1. Move the counter disabling and zeroing code to a different function.
> 2. Move the get_lppaca() inside the tracepoint_enabled() branch.
> 3. Add the aggregation logic to maintain total context switch time.
>
> v1 -> v2:
> 1. Fix the build error due to invalid struct member reference.
>
>  arch/powerpc/include/asm/kvm_host.h |  5
>  arch/powerpc/include/asm/lppaca.h   | 11 +---
>  arch/powerpc/kvm/book3s_hv.c        | 40 +
>  arch/powerpc/kvm/trace_hv.h         | 25 ++
>  4 files changed, 78 insertions(+), 3 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> index 8abac532146e..d953b32dd68a 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -847,6 +847,11 @@ struct kvm_vcpu_arch {
>  	gpa_t nested_io_gpr;
>  	/* For nested APIv2 guests*/
>  	struct kvmhv_nestedv2_io nestedv2_io;
> +
> +	/* Aggregate context switch and guest run time info (in ns) */
> +	u64 l1_to_l2_cs_agg;
> +	u64 l2_to_l1_cs_agg;
> +	u64 l2_runtime_agg;

Can be dropped from this patch.

>  #endif
>
>  #ifdef CONFIG_KVM_BOOK3S_HV_EXIT_TIMING
> diff --git a/arch/powerpc/include/asm/lppaca.h b/arch/powerpc/include/asm/lppaca.h
> index 61ec2447dabf..bda6b86b9f13 100644
> --- a/arch/powerpc/include/asm/lppaca.h
> +++ b/arch/powerpc/include/asm/lppaca.h
> @@ -62,7 +62,8 @@ struct lppaca {
>  	u8	donate_dedicated_cpu;	/* Donate dedicated CPU cycles */
>  	u8	fpregs_in_use;
>  	u8	pmcregs_in_use;
> -	u8	reserved8[28];
> +	u8	l2_accumul_cntrs_enable; /* Enable usage of counters for KVM guest */

A simpler name - l2_counters_enable or such?

> +	u8	reserved8[27];
>  	__be64	wait_state_cycles;	/* Wait cycles for this proc */
>  	u8	reserved9[28];
>  	__be16	slb_count;		/* # of SLBs to maintain */
> @@ -92,9 +93,13 @@ struct lppaca {
>  	/* cacheline 4-5 */
>
>  	__be32	page_ins;		/* CMO Hint - # page ins by OS */
> -	u8	reserved12[148];
> +	u8	reserved12[28];
> +	volatile __be64 l1_to_l2_cs_tb;
> +	volatile __be64 l2_to_l1_cs_tb;
> +	volatile __be64 l2_runtime_tb;
> +	u8 reserved13[96];
>  	volatile __be64 dtl_idx;	/* Dispatch Trace Log head index */
> -	u8	reserved13[96];
> +	u8	reserved14[96];
>  } ____cacheline_aligned;
>
>  #define lppaca_of(cpu)	(*paca_ptrs[cpu]->lppaca_ptr)
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 8e86eb577eb8..fea1c1429975 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -4108,6 +4108,37 @@ static void vcpu_vpa_increment_dispatch(struct kvm_vcpu *vcpu)
>  	}
>  }
>
> +static inline int kvmhv_get_l2_accumul(void)
> +{
> +	return get_lppaca()->l2_accumul_cntrs_enable;
> +}
> +
> +static inline void kvmhv_set_l2_accumul(int val)
                                          ^^^
                                          bool?

> +{
> +	get_lppaca()->l2_accumul_cntrs_enable = val;
> +}
> +
> +static void do_trace_nested_cs_time(struct kvm_vcpu *vcpu)
> +{
> +	struct lppaca *lp = get_lppaca();
> +	u64 l1_to_l2_ns, l2_to_l1_ns, l2_runtime_ns;
> +
> +	l1_to_l2_ns = tb_to_ns(be64_to_cpu(lp->l1_to_l2_cs_tb));
> +	l2_to_l1_ns = tb_to_ns(be64_to_cpu(lp->l2_to_l1_cs_tb));
> +	l2_runtime_ns = tb_to_ns(be64_to_cpu(lp->l2_runtime_tb));
> +	trace_kvmppc_vcpu_exit_cs_time(vcpu, l1_to_l2_ns, l2_to_l1_ns,
> +					l2_runtime_ns);
> +
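
The conversion done in do_trace_nested_cs_time() can be pictured with the userspace sketch below. It is not the kernel's tb_to_ns() (which uses a precomputed scale and shift); it assumes a fixed 512 MHz timebase purely for illustration, and demo_be64_to_cpu(), demo_tb_to_ns() and struct demo_cs_totals are hypothetical names. Whether the aggregation belongs in this patch or in the later pmu driver is exactly the open review question above.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed timebase frequency, for illustration only. */
#define DEMO_TB_FREQ_HZ		512000000ULL
#define DEMO_NSEC_PER_SEC	1000000000ULL

/* Decode a big-endian 64-bit counter as it would sit in the VPA. */
static uint64_t demo_be64_to_cpu(const uint8_t raw[8])
{
	uint64_t v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | raw[i];
	return v;
}

/* Convert timebase ticks to ns; split to avoid 64-bit overflow. */
static uint64_t demo_tb_to_ns(uint64_t ticks)
{
	uint64_t secs = ticks / DEMO_TB_FREQ_HZ;
	uint64_t rem = ticks % DEMO_TB_FREQ_HZ;

	return secs * DEMO_NSEC_PER_SEC +
	       rem * DEMO_NSEC_PER_SEC / DEMO_TB_FREQ_HZ;
}

/* Hypothetical per-vcpu running totals, in ns. */
struct demo_cs_totals {
	uint64_t l1_to_l2_ns;
	uint64_t l2_to_l1_ns;
	uint64_t l2_runtime_ns;
};

/* Add one guest run's counters to the running totals. */
static void demo_accumulate(struct demo_cs_totals *t,
			    const uint8_t l1_to_l2[8],
			    const uint8_t l2_to_l1[8],
			    const uint8_t l2_runtime[8])
{
	t->l1_to_l2_ns += demo_tb_to_ns(demo_be64_to_cpu(l1_to_l2));
	t->l2_to_l1_ns += demo_tb_to_ns(demo_be64_to_cpu(l2_to_l1));
	t->l2_runtime_ns += demo_tb_to_ns(demo_be64_to_cpu(l2_runtime));
}

/* Usage example: two runs of 512 ticks each add up to 2000 ns. */
static void demo_vpa_selfcheck(void)
{
	const uint8_t ticks_512[8] = { 0, 0, 0, 0, 0, 0, 0x02, 0x00 };
	struct demo_cs_totals t = { 0, 0, 0 };

	demo_accumulate(&t, ticks_512, ticks_512, ticks_512);
	demo_accumulate(&t, ticks_512, ticks_512, ticks_512);
	assert(t.l1_to_l2_ns == 2000);
	assert(t.l2_runtime_ns == 2000);
}
```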
Re: [PATCH v5 14/15] kprobes: remove dependency on CONFIG_MODULES
On Mon, 22 Apr 2024 12:44:35 +0300
Mike Rapoport wrote:

> From: "Mike Rapoport (IBM)"
>
> kprobes depended on CONFIG_MODULES because it has to allocate memory for
> code.
>
> Since code allocations are now implemented with execmem, kprobes can be
> enabled in non-modular kernels.
>
> Add #ifdef CONFIG_MODULES guards for the code dealing with kprobes inside
> modules, make CONFIG_KPROBES select CONFIG_EXECMEM and drop the
> dependency of CONFIG_KPROBES on CONFIG_MODULES.

Looks good to me.

Acked-by: Masami Hiramatsu (Google)

Thank you!

>
> Signed-off-by: Mike Rapoport (IBM)
> ---
>  arch/Kconfig                |  2 +-
>  include/linux/module.h      |  9 ++
>  kernel/kprobes.c            | 55 +++--
>  kernel/trace/trace_kprobe.c | 20 +-
>  4 files changed, 63 insertions(+), 23 deletions(-)
>
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 7006f71f0110..a48ce6a488b3 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -52,9 +52,9 @@ config GENERIC_ENTRY
>
>  config KPROBES
>  	bool "Kprobes"
> -	depends on MODULES
>  	depends on HAVE_KPROBES
>  	select KALLSYMS
> +	select EXECMEM
>  	select TASKS_RCU if PREEMPTION
>  	help
>  	  Kprobes allows you to trap at almost any kernel address and
> diff --git a/include/linux/module.h b/include/linux/module.h
> index 1153b0d99a80..ffa1c603163c 100644
> --- a/include/linux/module.h
> +++ b/include/linux/module.h
> @@ -605,6 +605,11 @@ static inline bool module_is_live(struct module *mod)
>  	return mod->state != MODULE_STATE_GOING;
>  }
>
> +static inline bool module_is_coming(struct module *mod)
> +{
> +	return mod->state == MODULE_STATE_COMING;
> +}
> +
>  struct module *__module_text_address(unsigned long addr);
>  struct module *__module_address(unsigned long addr);
>  bool is_module_address(unsigned long addr);
> @@ -857,6 +862,10 @@ void *dereference_module_function_descriptor(struct module *mod, void *ptr)
>  	return ptr;
>  }
>
> +static inline bool module_is_coming(struct module *mod)
> +{
> +	return false;
> +}
>  #endif /* CONFIG_MODULES */
>
>  #ifdef CONFIG_SYSFS
> diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> index ddd7cdc16edf..ca2c6cbd42d2 100644
> --- a/kernel/kprobes.c
> +++ b/kernel/kprobes.c
> @@ -1588,7 +1588,7 @@ static int check_kprobe_address_safe(struct kprobe *p,
>  	}
>
>  	/* Get module refcount and reject __init functions for loaded modules. */
> -	if (*probed_mod) {
> +	if (IS_ENABLED(CONFIG_MODULES) && *probed_mod) {
>  		/*
>  		 * We must hold a refcount of the probed module while updating
>  		 * its code to prohibit unexpected unloading.
> @@ -1603,12 +1603,13 @@ static int check_kprobe_address_safe(struct kprobe *p,
>  		 * kprobes in there.
>  		 */
>  		if (within_module_init((unsigned long)p->addr, *probed_mod) &&
> -		    (*probed_mod)->state != MODULE_STATE_COMING) {
> +		    !module_is_coming(*probed_mod)) {
>  			module_put(*probed_mod);
>  			*probed_mod = NULL;
>  			ret = -ENOENT;
>  		}
>  	}
> +
>  out:
>  	preempt_enable();
>  	jump_label_unlock();
> @@ -2488,24 +2489,6 @@ int kprobe_add_area_blacklist(unsigned long start, unsigned long end)
>  	return 0;
>  }
>
> -/* Remove all symbols in given area from kprobe blacklist */
> -static void kprobe_remove_area_blacklist(unsigned long start, unsigned long end)
> -{
> -	struct kprobe_blacklist_entry *ent, *n;
> -
> -	list_for_each_entry_safe(ent, n, &kprobe_blacklist, list) {
> -		if (ent->start_addr < start || ent->start_addr >= end)
> -			continue;
> -		list_del(&ent->list);
> -		kfree(ent);
> -	}
> -}
> -
> -static void kprobe_remove_ksym_blacklist(unsigned long entry)
> -{
> -	kprobe_remove_area_blacklist(entry, entry + 1);
> -}
> -
>  int __weak arch_kprobe_get_kallsym(unsigned int *symnum, unsigned long *value,
>  				   char *type, char *sym)
>  {
> @@ -2570,6 +2553,25 @@ static int __init populate_kprobe_blacklist(unsigned long *start,
>  	return ret ? : arch_populate_kprobe_blacklist();
>  }
>
> +#ifdef CONFIG_MODULES
> +/* Remove all symbols in given area from kprobe blacklist */
> +static void kprobe_remove_area_blacklist(unsigned long start, unsigned long end)
> +{
> +	struct kprobe_blacklist_entry *ent, *n;
> +
> +	list_for_each_entry_safe(ent, n, &kprobe_blacklist, list) {
> +		if (ent->start_addr < start || ent->start_addr >= end)
> +			continue;
> +		list_del(&ent->list);
> +		kfree(ent);
> +	}
> +}
> +
> +static void kprobe_remove_ksym_blacklist(unsigned long entry)
> +{
> +	kprobe_remove_area_blacklist(entry, entry + 1);
> +}
> +
>  static void add_module_kprobe_blacklist(struct module
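
The stub-plus-IS_ENABLED() pattern used by the patch can be tried out in userspace with the sketch below. Everything prefixed demo_/DEMO_ is hypothetical; DEMO_CONFIG_MODULES stands in for the Kconfig symbol, and demo_reject_init_probe() only mirrors the shape of the check in check_kprobe_address_safe(), not its refcounting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for CONFIG_MODULES; remove this line to build the stub path. */
#define DEMO_CONFIG_MODULES

enum demo_module_state { DEMO_STATE_LIVE, DEMO_STATE_COMING, DEMO_STATE_GOING };

struct demo_module {
	enum demo_module_state state;
};

#ifdef DEMO_CONFIG_MODULES
#define DEMO_MODULES_ENABLED 1
static inline bool demo_module_is_coming(const struct demo_module *mod)
{
	return mod->state == DEMO_STATE_COMING;
}
#else
#define DEMO_MODULES_ENABLED 0
/* Stub: callers compile unchanged when module support is off. */
static inline bool demo_module_is_coming(const struct demo_module *mod)
{
	(void)mod;
	return false;
}
#endif

/*
 * Reject a probe on a module's init text unless the module is still
 * being loaded. With modules disabled the whole branch folds away.
 */
static bool demo_reject_init_probe(const struct demo_module *probed_mod,
				   bool within_module_init)
{
	if (DEMO_MODULES_ENABLED && probed_mod)
		return within_module_init &&
		       !demo_module_is_coming(probed_mod);
	return false;
}

/* Usage example for the "modules enabled" build. */
static void demo_module_selfcheck(void)
{
	struct demo_module coming = { DEMO_STATE_COMING };
	struct demo_module live = { DEMO_STATE_LIVE };

	assert(!demo_reject_init_probe(&coming, true));
	assert(demo_reject_init_probe(&live, true));
	assert(!demo_reject_init_probe(NULL, true));
}
```

The design point is that the caller never needs its own #ifdef: the constant condition lets the compiler drop the module-only branch, while the stub keeps the call site type-checked in both configurations.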
Re: [PATCH] selftests/powerpc: Install tests in sub-directories
Michael Ellerman writes: > The sources for the powerpc selftests are arranged into sub-directories. > However when the tests are built and installed, the sub-directories are > squashed, losing the structure. This is missing a preparatory patch, new version coming. cheers
[PATCH v2 2/2] selftests/powerpc: Install tests in sub-directories
The sources for the powerpc selftests are arranged into sub-directories.
However when the tests are built and installed, the sub-directories are
squashed, losing the structure.

For example, with the current code the result of installing the selftests
is:

  $ tree tools/testing/selftests/kselftest_install
  tools/testing/selftests/kselftest_install
  ├── kselftest
  │   ├── ktap_helpers.sh
  │   ├── module.sh
  │   ├── prefix.pl
  │   └── runner.sh
  ├── kselftest-list.txt
  ├── powerpc
  │   ├── alignment_handler
  │   ├── attr_test
  │   ├── back_to_back_ebbs_test
  │   ├── bad_accesses
  │   ├── bhrb_filter_map_test
  │   ├── bhrb_no_crash_wo_pmu_test
  │   ├── blacklisted_events_test
  │   ├── cache_shape
  │   ├── close_clears_pmcc_test
  │   ├── context_switch
  │   ├── copy_first_unaligned
  ...
  │   ├── settings
  ...
  │   └── wild_bctr
  └── run_kselftest.sh

All the powerpc tests are squashed into the single powerpc directory. In
particular, note that there is a single `settings` file, even though
there are multiple settings files in the powerpc selftest sources. One
of the settings files ends up installed, depending on install order,
even if they have different contents. Similarly if there were two tests
with the same name in different sub-directories they would clobber each
other.

Fix it by replicating the directory structure of the source tree into
the install directory. The result being for example:

  $ tree tools/testing/selftests/kselftest_install
  tools/testing/selftests/kselftest_install
  ├── kselftest
  │   ├── ktap_helpers.sh
  │   ├── module.sh
  │   ├── prefix.pl
  │   └── runner.sh
  ├── kselftest-list.txt
  ├── powerpc
  │   ├── alignment
  │   │   ├── alignment_handler
  │   │   └── copy_first_unaligned
  │   ├── benchmarks
  │   │   ├── context_switch
  │   │   ├── exec_target
  │   │   ├── fork
  │   │   ├── futex_bench
  │   │   ├── gettimeofday
  │   │   ├── mmap_bench
  │   │   ├── null_syscall
  │   │   └── settings
  ...
  │   ├── eeh
  │   │   ├── eeh-basic.sh
  │   │   ├── eeh-functions.sh
  │   │   └── settings
  ...
  │   └── vphn
  │       └── test-vphn
  └── run_kselftest.sh

Note multiple settings files in different sub-directories.

This change also has the effect of changing the names of the tests from
the point of view of the kselftest runner. Before the tests are named
eg:

  powerpc:copy_first_unaligned
  powerpc:cache_shape
  powerpc:reg_access_test

After, the test collection names include the sub-directory:

  powerpc/alignment:copy_first_unaligned
  powerpc/cache_shape:cache_shape
  powerpc/pmu/ebb:reg_access_test

That means whereas previously all powerpc tests could be run with:

  $ ./run_kselftest.sh -c powerpc

After the change it's necessary to pass a regex that matches all powerpc
entries, eg:

  $ ./run_kselftest.sh -c "powerpc.*"

The latter form also works before and after the change.

Signed-off-by: Michael Ellerman
---
 tools/testing/selftests/powerpc/Makefile     | 4 ++--
 tools/testing/selftests/powerpc/pmu/Makefile | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

v2: Unchanged.

diff --git a/tools/testing/selftests/powerpc/Makefile b/tools/testing/selftests/powerpc/Makefile
index 2f299fd04d2d..b175e94e1901 100644
--- a/tools/testing/selftests/powerpc/Makefile
+++ b/tools/testing/selftests/powerpc/Makefile
@@ -52,14 +52,14 @@ endef
 override define INSTALL_RULE
 	+@for TARGET in $(SUB_DIRS); do \
 		BUILD_TARGET=$(OUTPUT)/$$TARGET;\
-		$(MAKE) OUTPUT=$$BUILD_TARGET -C $$TARGET install;\
+		$(MAKE) OUTPUT=$$BUILD_TARGET INSTALL_PATH=$$INSTALL_PATH/$$TARGET -C $$TARGET install;\
 	done;
 endef

 emit_tests:
 	+@for TARGET in $(SUB_DIRS); do \
 		BUILD_TARGET=$(OUTPUT)/$$TARGET;\
-		$(MAKE) OUTPUT=$$BUILD_TARGET -s -C $$TARGET $@;\
+		$(MAKE) OUTPUT=$$BUILD_TARGET COLLECTION=$(COLLECTION)/$$TARGET -s -C $$TARGET $@;\
 	done;

 override define CLEAN
diff --git a/tools/testing/selftests/powerpc/pmu/Makefile b/tools/testing/selftests/powerpc/pmu/Makefile
index 773933e5180e..7e9dbf3d0d09 100644
--- a/tools/testing/selftests/powerpc/pmu/Makefile
+++ b/tools/testing/selftests/powerpc/pmu/Makefile
@@ -44,7 +44,7 @@ emit_tests:
 	done
 	+@for TARGET in $(SUB_DIRS); do \
 		BUILD_TARGET=$(OUTPUT)/$$TARGET; \
-		$(MAKE) OUTPUT=$$BUILD_TARGET -s -C $$TARGET emit_tests; \
+		$(MAKE) OUTPUT=$$BUILD_TARGET COLLECTION=$(COLLECTION)/$$TARGET -s -C $$TARGET emit_tests; \
 	done;

 DEFAULT_INSTALL_RULE := $(INSTALL_RULE)
@@ -52,7 +52,7 @@ override define INSTALL_RULE
 	$(DEFAULT_INSTALL_RULE)
 	+@for TARGET in $(SUB_DIRS); do \
 		BUILD_TARGET=$(OUTPUT)/$$TARGET; \
-		$(MAKE) OUTPUT=$$BUILD_TARGET -C $$TARGET install; \
+		$(MAKE) OUTPUT=$$BUILD_TARGET INSTALL_PATH=$$INSTALL_PATH/$$TARGET -C $$TARGET install; \
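
Why "-c powerpc" stops matching after the rename while "-c powerpc.*" keeps working can be illustrated with a small POSIX regex sketch. It assumes the runner selects entries by matching the pattern "^<collection>:" against each "collection:test" line; that is my reading of how run_kselftest.sh behaves and should be checked against the script. demo_collection_selects() is a made-up helper.

```c
#include <assert.h>
#include <regex.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical model of collection selection: an entry from
 * kselftest-list.txt is selected when it matches "^<collection>:".
 */
static bool demo_collection_selects(const char *collection, const char *entry)
{
	char pattern[256];
	regex_t re;
	bool hit;

	if (snprintf(pattern, sizeof(pattern), "^%s:", collection) >=
	    (int)sizeof(pattern))
		return false;	/* pattern would not fit */
	if (regcomp(&re, pattern, REG_NOSUB) != 0)
		return false;	/* not a valid basic regex */

	hit = regexec(&re, entry, 0, NULL, 0) == 0;
	regfree(&re);
	return hit;
}

/* Usage example with the old and new entry names from the patch. */
static void demo_collection_selfcheck(void)
{
	const char *old_name = "powerpc:copy_first_unaligned";
	const char *new_name = "powerpc/alignment:copy_first_unaligned";

	assert(demo_collection_selects("powerpc", old_name));
	assert(!demo_collection_selects("powerpc", new_name));
	assert(demo_collection_selects("powerpc.*", old_name));
	assert(demo_collection_selects("powerpc.*", new_name));
}
```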
[PATCH v2 1/2] selftests/powerpc: Convert pmu Makefile to for loop style
The pmu Makefile has grown more sub directories over the years. Rather
than open coding the rules for each subdir, use for loops.

Signed-off-by: Michael Ellerman
---
 tools/testing/selftests/powerpc/pmu/Makefile | 43 ++--
 1 file changed, 22 insertions(+), 21 deletions(-)

v2: Actually send both patches.

diff --git a/tools/testing/selftests/powerpc/pmu/Makefile b/tools/testing/selftests/powerpc/pmu/Makefile
index 1fcacae1b188..773933e5180e 100644
--- a/tools/testing/selftests/powerpc/pmu/Makefile
+++ b/tools/testing/selftests/powerpc/pmu/Makefile
@@ -9,7 +9,9 @@ top_srcdir = ../../../../..
 include ../../lib.mk
 include ../flags.mk

-all: $(TEST_GEN_PROGS) ebb sampling_tests event_code_tests
+SUB_DIRS := ebb sampling_tests event_code_tests
+
+all: $(TEST_GEN_PROGS) $(SUB_DIRS)

 $(TEST_GEN_PROGS): $(EXTRA_SOURCES)

@@ -23,12 +25,16 @@ $(OUTPUT)/count_stcx_fail: loop.S $(EXTRA_SOURCES)

 $(OUTPUT)/per_event_excludes: ../utils.c

+$(SUB_DIRS):
+	BUILD_TARGET=$(OUTPUT)/$@; mkdir -p $$BUILD_TARGET; $(MAKE) OUTPUT=$$BUILD_TARGET -k -C $@ all
+
 DEFAULT_RUN_TESTS := $(RUN_TESTS)
 override define RUN_TESTS
 	$(DEFAULT_RUN_TESTS)
-	+TARGET=ebb; BUILD_TARGET=$$OUTPUT/$$TARGET; $(MAKE) OUTPUT=$$BUILD_TARGET -C $$TARGET run_tests
-	+TARGET=sampling_tests; BUILD_TARGET=$$OUTPUT/$$TARGET; $(MAKE) OUTPUT=$$BUILD_TARGET -C $$TARGET run_tests
-	+TARGET=event_code_tests; BUILD_TARGET=$$OUTPUT/$$TARGET; $(MAKE) OUTPUT=$$BUILD_TARGET -C $$TARGET run_tests
+	+@for TARGET in $(SUB_DIRS); do \
+		BUILD_TARGET=$(OUTPUT)/$$TARGET; \
+		$(MAKE) OUTPUT=$$BUILD_TARGET -C $$TARGET run_tests; \
+	done;
 endef

 emit_tests:
@@ -36,34 +42,29 @@ emit_tests:
 		BASENAME_TEST=`basename $$TEST`;	\
 		echo "$(COLLECTION):$$BASENAME_TEST";	\
 	done
-	+TARGET=ebb; BUILD_TARGET=$$OUTPUT/$$TARGET; $(MAKE) OUTPUT=$$BUILD_TARGET -s -C $$TARGET emit_tests
-	+TARGET=sampling_tests; BUILD_TARGET=$$OUTPUT/$$TARGET; $(MAKE) OUTPUT=$$BUILD_TARGET -s -C $$TARGET emit_tests
-	+TARGET=event_code_tests; BUILD_TARGET=$$OUTPUT/$$TARGET; $(MAKE) OUTPUT=$$BUILD_TARGET -s -C $$TARGET emit_tests
+	+@for TARGET in $(SUB_DIRS); do \
+		BUILD_TARGET=$(OUTPUT)/$$TARGET; \
+		$(MAKE) OUTPUT=$$BUILD_TARGET -s -C $$TARGET emit_tests; \
+	done;

 DEFAULT_INSTALL_RULE := $(INSTALL_RULE)
 override define INSTALL_RULE
 	$(DEFAULT_INSTALL_RULE)
-	+TARGET=ebb; BUILD_TARGET=$$OUTPUT/$$TARGET; $(MAKE) OUTPUT=$$BUILD_TARGET -C $$TARGET install
-	+TARGET=sampling_tests; BUILD_TARGET=$$OUTPUT/$$TARGET; $(MAKE) OUTPUT=$$BUILD_TARGET -C $$TARGET install
-	+TARGET=event_code_tests; BUILD_TARGET=$$OUTPUT/$$TARGET; $(MAKE) OUTPUT=$$BUILD_TARGET -C $$TARGET install
+	+@for TARGET in $(SUB_DIRS); do \
+		BUILD_TARGET=$(OUTPUT)/$$TARGET; \
+		$(MAKE) OUTPUT=$$BUILD_TARGET -C $$TARGET install; \
+	done;
 endef

 DEFAULT_CLEAN := $(CLEAN)
 override define CLEAN
 	$(DEFAULT_CLEAN)
 	$(RM) $(TEST_GEN_PROGS) $(OUTPUT)/loop.o
-	+TARGET=ebb; BUILD_TARGET=$$OUTPUT/$$TARGET; $(MAKE) OUTPUT=$$BUILD_TARGET -C $$TARGET clean
-	+TARGET=sampling_tests; BUILD_TARGET=$$OUTPUT/$$TARGET; $(MAKE) OUTPUT=$$BUILD_TARGET -C $$TARGET clean
-	+TARGET=event_code_tests; BUILD_TARGET=$$OUTPUT/$$TARGET; $(MAKE) OUTPUT=$$BUILD_TARGET -C $$TARGET clean
+	+@for TARGET in $(SUB_DIRS); do \
+		BUILD_TARGET=$(OUTPUT)/$$TARGET; \
+		$(MAKE) OUTPUT=$$BUILD_TARGET -C $$TARGET clean; \
+	done;
 endef

-ebb:
-	TARGET=$@; BUILD_TARGET=$$OUTPUT/$$TARGET; mkdir -p $$BUILD_TARGET; $(MAKE) OUTPUT=$$BUILD_TARGET -k -C $$TARGET all
-
-sampling_tests:
-	TARGET=$@; BUILD_TARGET=$$OUTPUT/$$TARGET; mkdir -p $$BUILD_TARGET; $(MAKE) OUTPUT=$$BUILD_TARGET -k -C $$TARGET all
-
-event_code_tests:
-	TARGET=$@; BUILD_TARGET=$$OUTPUT/$$TARGET; mkdir -p $$BUILD_TARGET; $(MAKE) OUTPUT=$$BUILD_TARGET -k -C $$TARGET all

 .PHONY: all run_tests ebb sampling_tests event_code_tests emit_tests
--
2.44.0
Re: [PATCH v5 11/15] arch: make execmem setup available regardless of CONFIG_MODULES
On 22/4/24 11:44, Mike Rapoport wrote:
> From: "Mike Rapoport (IBM)"
>
> execmem does not depend on modules, on the contrary modules use
> execmem.
>
> To make execmem available when CONFIG_MODULES=n, for instance for
> kprobes, split execmem_params initialization out from
> arch/*/kernel/module.c and compile it when CONFIG_EXECMEM=y
>
> Signed-off-by: Mike Rapoport (IBM)
> ---
>  arch/arm/kernel/module.c       |  43 --
>  arch/arm/mm/init.c             |  45 +++
>  arch/arm64/kernel/module.c     | 140 -
>  arch/arm64/mm/init.c           | 140 +
>  arch/loongarch/kernel/module.c |  19 -
>  arch/loongarch/mm/init.c       |  21 +
>  arch/mips/kernel/module.c      |  22 --
>  arch/mips/mm/init.c            |  23 ++
>  arch/nios2/kernel/module.c     |  20 -
>  arch/nios2/mm/init.c           |  21 +
>  arch/parisc/kernel/module.c    |  20 -
>  arch/parisc/mm/init.c          |  23 +-
>  arch/powerpc/kernel/module.c   |  63 ---
>  arch/powerpc/mm/mem.c          |  64 +++
>  arch/riscv/kernel/module.c     |  44 ---
>  arch/riscv/mm/init.c           |  45 +++
>  arch/s390/kernel/module.c      |  27 ---
>  arch/s390/mm/init.c            |  30 +++
>  arch/sparc/kernel/module.c     |  19 -
>  arch/sparc/mm/Makefile         |   2 +
>  arch/sparc/mm/execmem.c        |  21 +
>  arch/x86/kernel/module.c       |  27 ---
>  arch/x86/mm/init.c             |  29 +++
>  23 files changed, 463 insertions(+), 445 deletions(-)
>  create mode 100644 arch/sparc/mm/execmem.c

Reviewed-by: Philippe Mathieu-Daudé
Re: [RFC PATCH 1/2] objtool: Run objtool only if either of the config options are selected
On Mon, Apr 22, 2024 at 6:25 PM Sathvika Vasireddy wrote:
>
> Currently, when objtool is enabled and none of the supported options
> are triggered, kernel build errors out with the below error:
> error: objtool: At least one command required.

Then, I think CONFIG_OBJTOOL should be disabled.

>
> To address this, ensure that objtool is run only when either of the
> config options are selected.
>
> Signed-off-by: Sathvika Vasireddy
> ---
>  scripts/Makefile.lib | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/scripts/Makefile.lib b/scripts/Makefile.lib
> index 3179747cbd2c..c65bb0fbd136 100644
> --- a/scripts/Makefile.lib
> +++ b/scripts/Makefile.lib
> @@ -286,7 +286,10 @@ objtool-args = $(objtool-args-y) \
>
>  delay-objtool := $(or $(CONFIG_LTO_CLANG),$(CONFIG_X86_KERNEL_IBT))
>
> +ifneq ($(objtool-args-y),)
>  cmd_objtool = $(if $(objtool-enabled), ; $(objtool) $(objtool-args) $@)
> +endif
> +
>  cmd_gen_objtooldep = $(if $(objtool-enabled), { echo ; echo '$@: $$(wildcard $(objtool))' ; } >> $(dot-target).cmd)
>
>  endif # CONFIG_OBJTOOL
> --
> 2.34.1
>

--
Best Regards
Masahiro Yamada
[PATCH] powerpc: Mark memory_limit as initdata
The `memory_limit` variable should only be used during boot; enforce
that by marking it initdata.

Signed-off-by: Michael Ellerman
---
 arch/powerpc/mm/mem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 3a440004b97d..12316ac66e7e 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -30,7 +30,7 @@
 #include
 
-unsigned long long memory_limit;
+unsigned long long memory_limit __initdata;
 
 unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)] __page_aligned_bss;
 EXPORT_SYMBOL(empty_zero_page);
-- 
2.44.0
[PATCH v5 15/15] bpf: remove CONFIG_BPF_JIT dependency on CONFIG_MODULES of
From: "Mike Rapoport (IBM)"

BPF just-in-time compiler depended on CONFIG_MODULES because it used
module_alloc() to allocate memory for the generated code.

Since code allocations are now implemented with execmem, drop dependency of
CONFIG_BPF_JIT on CONFIG_MODULES and make it select CONFIG_EXECMEM.

Suggested-by: Björn Töpel
Signed-off-by: Mike Rapoport (IBM)
---
 kernel/bpf/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/bpf/Kconfig b/kernel/bpf/Kconfig
index bc25f5098a25..f999e4e0b344 100644
--- a/kernel/bpf/Kconfig
+++ b/kernel/bpf/Kconfig
@@ -43,7 +43,7 @@ config BPF_JIT
 	bool "Enable BPF Just In Time compiler"
 	depends on BPF
 	depends on HAVE_CBPF_JIT || HAVE_EBPF_JIT
-	depends on MODULES
+	select EXECMEM
 	help
 	  BPF programs are normally handled by a BPF interpreter. This option
 	  allows the kernel to generate native code when a program is loaded
-- 
2.43.0
[PATCH v5 14/15] kprobes: remove dependency on CONFIG_MODULES
From: "Mike Rapoport (IBM)"

kprobes depended on CONFIG_MODULES because it has to allocate memory for
code.

Since code allocations are now implemented with execmem, kprobes can be
enabled in non-modular kernels.

Add #ifdef CONFIG_MODULES guards for the code dealing with kprobes inside
modules, make CONFIG_KPROBES select CONFIG_EXECMEM and drop the
dependency of CONFIG_KPROBES on CONFIG_MODULES.

Signed-off-by: Mike Rapoport (IBM)
---
 arch/Kconfig                |  2 +-
 include/linux/module.h      |  9 ++
 kernel/kprobes.c            | 55 +++--
 kernel/trace/trace_kprobe.c | 20 +-
 4 files changed, 63 insertions(+), 23 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 7006f71f0110..a48ce6a488b3 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -52,9 +52,9 @@ config GENERIC_ENTRY
 
 config KPROBES
 	bool "Kprobes"
-	depends on MODULES
 	depends on HAVE_KPROBES
 	select KALLSYMS
+	select EXECMEM
 	select TASKS_RCU if PREEMPTION
 	help
 	  Kprobes allows you to trap at almost any kernel address and
diff --git a/include/linux/module.h b/include/linux/module.h
index 1153b0d99a80..ffa1c603163c 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -605,6 +605,11 @@ static inline bool module_is_live(struct module *mod)
 	return mod->state != MODULE_STATE_GOING;
 }
 
+static inline bool module_is_coming(struct module *mod)
+{
+	return mod->state == MODULE_STATE_COMING;
+}
+
 struct module *__module_text_address(unsigned long addr);
 struct module *__module_address(unsigned long addr);
 bool is_module_address(unsigned long addr);
@@ -857,6 +862,10 @@ void *dereference_module_function_descriptor(struct module *mod, void *ptr)
 	return ptr;
 }
 
+static inline bool module_is_coming(struct module *mod)
+{
+	return false;
+}
 #endif /* CONFIG_MODULES */
 
 #ifdef CONFIG_SYSFS
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index ddd7cdc16edf..ca2c6cbd42d2 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -1588,7 +1588,7 @@ static int check_kprobe_address_safe(struct kprobe *p,
 	}
 
 	/* Get module refcount and reject __init functions for loaded modules. */
-	if (*probed_mod) {
+	if (IS_ENABLED(CONFIG_MODULES) && *probed_mod) {
 		/*
 		 * We must hold a refcount of the probed module while updating
 		 * its code to prohibit unexpected unloading.
@@ -1603,12 +1603,13 @@ static int check_kprobe_address_safe(struct kprobe *p,
 		 * kprobes in there.
 		 */
 		if (within_module_init((unsigned long)p->addr, *probed_mod) &&
-		    (*probed_mod)->state != MODULE_STATE_COMING) {
+		    !module_is_coming(*probed_mod)) {
 			module_put(*probed_mod);
 			*probed_mod = NULL;
 			ret = -ENOENT;
 		}
 	}
+
 out:
 	preempt_enable();
 	jump_label_unlock();
@@ -2488,24 +2489,6 @@ int kprobe_add_area_blacklist(unsigned long start, unsigned long end)
 	return 0;
 }
 
-/* Remove all symbols in given area from kprobe blacklist */
-static void kprobe_remove_area_blacklist(unsigned long start, unsigned long end)
-{
-	struct kprobe_blacklist_entry *ent, *n;
-
-	list_for_each_entry_safe(ent, n, &kprobe_blacklist, list) {
-		if (ent->start_addr < start || ent->start_addr >= end)
-			continue;
-		list_del(&ent->list);
-		kfree(ent);
-	}
-}
-
-static void kprobe_remove_ksym_blacklist(unsigned long entry)
-{
-	kprobe_remove_area_blacklist(entry, entry + 1);
-}
-
 int __weak arch_kprobe_get_kallsym(unsigned int *symnum, unsigned long *value,
 				   char *type, char *sym)
 {
@@ -2570,6 +2553,25 @@ static int __init populate_kprobe_blacklist(unsigned long *start,
 	return ret ? : arch_populate_kprobe_blacklist();
 }
 
+#ifdef CONFIG_MODULES
+/* Remove all symbols in given area from kprobe blacklist */
+static void kprobe_remove_area_blacklist(unsigned long start, unsigned long end)
+{
+	struct kprobe_blacklist_entry *ent, *n;
+
+	list_for_each_entry_safe(ent, n, &kprobe_blacklist, list) {
+		if (ent->start_addr < start || ent->start_addr >= end)
+			continue;
+		list_del(&ent->list);
+		kfree(ent);
+	}
+}
+
+static void kprobe_remove_ksym_blacklist(unsigned long entry)
+{
+	kprobe_remove_area_blacklist(entry, entry + 1);
+}
+
 static void add_module_kprobe_blacklist(struct module *mod)
 {
 	unsigned long start, end;
@@ -2672,6 +2674,17 @@ static struct notifier_block kprobe_module_nb = {
 	.priority = 0
 };
 
+static int kprobe_register_module_notifier(void)
+{
+	return register_module_notifier(&kprobe_module_nb);
+}
+#else
+static int kprobe_register_module_notifier(void)
+{
+
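For readers following along: the module_is_coming() helper in this patch uses the common pattern of a real helper paired with a stub that compiles away when the feature is off. Below is a minimal userspace sketch of that pattern. The cut-down struct module, the locally defined CONFIG_MODULES macro and probe_init_text_allowed() are stand-ins invented for illustration, not the kernel definitions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the Kconfig symbol; remove this line to build the stub variant. */
#define CONFIG_MODULES 1

/* Hypothetical, cut-down model of the kernel's module state. */
enum module_state {
	MODULE_STATE_LIVE,
	MODULE_STATE_COMING,
	MODULE_STATE_GOING,
};

struct module {
	enum module_state state;
};

#ifdef CONFIG_MODULES
static inline bool module_is_coming(const struct module *mod)
{
	return mod->state == MODULE_STATE_COMING;
}
#else
/* Without module support nothing can ever be "coming". */
static inline bool module_is_coming(const struct module *mod)
{
	(void)mod;
	return false;
}
#endif

/*
 * Hypothetical model of the check in check_kprobe_address_safe(), assuming
 * the probed address is already known to lie in module init text: such a
 * probe is only acceptable while the module is still being loaded.
 */
static int probe_init_text_allowed(const struct module *probed_mod)
{
	if (probed_mod == NULL)
		return 1;	/* not in a module, nothing to check */
	return module_is_coming(probed_mod) ? 1 : 0;
}
```

The caller never needs its own #ifdef: both variants have the same signature, so the call site reads the same in modular and non-modular builds.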
[PATCH v5 13/15] powerpc: use CONFIG_EXECMEM instead of CONFIG_MODULES where appropriate
From: "Mike Rapoport (IBM)"

There are places where CONFIG_MODULES guards the code that depends on
memory allocation being done with module_alloc().

Replace CONFIG_MODULES with CONFIG_EXECMEM in such places.

Signed-off-by: Mike Rapoport (IBM)
---
 arch/powerpc/Kconfig                 | 2 +-
 arch/powerpc/include/asm/kasan.h     | 2 +-
 arch/powerpc/kernel/head_8xx.S       | 4 ++--
 arch/powerpc/kernel/head_book3s_32.S | 6 +++---
 arch/powerpc/lib/code-patching.c     | 2 +-
 arch/powerpc/mm/book3s32/mmu.c       | 2 +-
 6 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 1c4be3373686..2e586733a464 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -285,7 +285,7 @@ config PPC
 	select IOMMU_HELPER			if PPC64
 	select IRQ_DOMAIN
 	select IRQ_FORCED_THREADING
-	select KASAN_VMALLOC			if KASAN && MODULES
+	select KASAN_VMALLOC			if KASAN && EXECMEM
 	select LOCK_MM_AND_FIND_VMA
 	select MMU_GATHER_PAGE_SIZE
 	select MMU_GATHER_RCU_TABLE_FREE
diff --git a/arch/powerpc/include/asm/kasan.h b/arch/powerpc/include/asm/kasan.h
index 365d2720097c..b5bbb94c51f6 100644
--- a/arch/powerpc/include/asm/kasan.h
+++ b/arch/powerpc/include/asm/kasan.h
@@ -19,7 +19,7 @@
 
 #define KASAN_SHADOW_SCALE_SHIFT	3
 
-#if defined(CONFIG_MODULES) && defined(CONFIG_PPC32)
+#if defined(CONFIG_EXECMEM) && defined(CONFIG_PPC32)
 #define KASAN_KERN_START	ALIGN_DOWN(PAGE_OFFSET - SZ_256M, SZ_256M)
 #else
 #define KASAN_KERN_START	PAGE_OFFSET
diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S
index 647b0b445e89..edc479a7c2bc 100644
--- a/arch/powerpc/kernel/head_8xx.S
+++ b/arch/powerpc/kernel/head_8xx.S
@@ -199,12 +199,12 @@ instruction_counter:
 	mfspr	r10, SPRN_SRR0	/* Get effective address of fault */
 	INVALIDATE_ADJACENT_PAGES_CPU15(r10, r11)
 	mtspr	SPRN_MD_EPN, r10
-#ifdef CONFIG_MODULES
+#ifdef CONFIG_EXECMEM
 	mfcr	r11
 	compare_to_kernel_boundary r10, r10
 #endif
 	mfspr	r10, SPRN_M_TWB	/* Get level 1 table */
-#ifdef CONFIG_MODULES
+#ifdef CONFIG_EXECMEM
 	blt+	3f
 	rlwinm	r10, r10, 0, 20, 31
 	oris	r10, r10, (swapper_pg_dir - PAGE_OFFSET)@ha
diff --git a/arch/powerpc/kernel/head_book3s_32.S b/arch/powerpc/kernel/head_book3s_32.S
index c1d89764dd22..57196883a00e 100644
--- a/arch/powerpc/kernel/head_book3s_32.S
+++ b/arch/powerpc/kernel/head_book3s_32.S
@@ -419,14 +419,14 @@ InstructionTLBMiss:
  */
 	/* Get PTE (linux-style) and check access */
 	mfspr	r3,SPRN_IMISS
-#ifdef CONFIG_MODULES
+#ifdef CONFIG_EXECMEM
 	lis	r1, TASK_SIZE@h		/* check if kernel address */
 	cmplw	0,r1,r3
 #endif
 	mfspr	r2, SPRN_SDR1
 	li	r1,_PAGE_PRESENT | _PAGE_ACCESSED | _PAGE_EXEC
 	rlwinm	r2, r2, 28, 0xf000
-#ifdef CONFIG_MODULES
+#ifdef CONFIG_EXECMEM
 	li	r0, 3
 	bgt-	112f
 	lis	r2, (swapper_pg_dir - PAGE_OFFSET)@ha	/* if kernel address, use */
@@ -442,7 +442,7 @@ InstructionTLBMiss:
 	andc.	r1,r1,r2		/* check access & ~permission */
 	bne-	InstructionAddressInvalid /* return if access not permitted */
 	/* Convert linux-style PTE to low word of PPC-style PTE */
-#ifdef CONFIG_MODULES
+#ifdef CONFIG_EXECMEM
 	rlwimi	r2, r0, 0, 31, 31	/* userspace ? -> PP lsb */
 #endif
 	ori	r1, r1, 0xe06		/* clear out reserved bits */
diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index c6ab46156cda..7af791446ddf 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -225,7 +225,7 @@ void __init poking_init(void)
 
 static unsigned long get_patch_pfn(void *addr)
 {
-	if (IS_ENABLED(CONFIG_MODULES) && is_vmalloc_or_module_addr(addr))
+	if (IS_ENABLED(CONFIG_EXECMEM) && is_vmalloc_or_module_addr(addr))
 		return vmalloc_to_pfn(addr);
 	else
 		return __pa_symbol(addr) >> PAGE_SHIFT;
diff --git a/arch/powerpc/mm/book3s32/mmu.c b/arch/powerpc/mm/book3s32/mmu.c
index 100f999871bc..625fe7d08e06 100644
--- a/arch/powerpc/mm/book3s32/mmu.c
+++ b/arch/powerpc/mm/book3s32/mmu.c
@@ -184,7 +184,7 @@ unsigned long __init mmu_mapin_ram(unsigned long base, unsigned long top)
 
 static bool is_module_segment(unsigned long addr)
 {
-	if (!IS_ENABLED(CONFIG_MODULES))
+	if (!IS_ENABLED(CONFIG_EXECMEM))
 		return false;
 	if (addr < ALIGN_DOWN(MODULES_VADDR, SZ_256M))
 		return false;
-- 
2.43.0
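As background for the C hunks in this patch: IS_ENABLED() turns a Kconfig symbol into a 0/1 constant, so both branches stay visible to the compiler. Below is a rough userspace sketch of how such a macro can be built, modeled on the preprocessor trick in include/linux/kconfig.h. The MY_ prefixed macro names, the two CONFIG_DEMO_ symbols and the helper functions are made up for the example, and the =m (module) case is ignored.

```c
#include <stdbool.h>

/* Pretend Kconfig output: one option set to y, the other left undefined. */
#define CONFIG_DEMO_EXECMEM 1
/* CONFIG_DEMO_MODULES is intentionally not defined. */

/*
 * If the option expands to 1, MY_ARG_PLACEHOLDER_1 inserts "0," and shifts
 * the literal 1 into second position. Otherwise the junk token stays glued
 * to the first argument and the second argument is the literal 0.
 */
#define MY_ARG_PLACEHOLDER_1 0,
#define MY_TAKE_SECOND_ARG(ignored, val, ...) val
#define MY_IS_DEFINED_3(arg1_or_junk) MY_TAKE_SECOND_ARG(arg1_or_junk 1, 0, 0)
#define MY_IS_DEFINED_2(val) MY_IS_DEFINED_3(MY_ARG_PLACEHOLDER_##val)
#define MY_IS_DEFINED(x) MY_IS_DEFINED_2(x)
#define MY_IS_ENABLED(option) MY_IS_DEFINED(option)

/*
 * Hypothetical model of the guard in get_patch_pfn(): the vmalloc path is
 * only considered when the option is enabled; otherwise the compiler can
 * drop the branch while still type-checking it.
 */
static bool uses_vmalloc_path(bool addr_is_vmalloc)
{
	if (MY_IS_ENABLED(CONFIG_DEMO_EXECMEM) && addr_is_vmalloc)
		return true;
	return false;
}

/* An undefined option evaluates to 0 instead of causing a build error. */
static bool demo_modules_enabled(void)
{
	return MY_IS_ENABLED(CONFIG_DEMO_MODULES);
}
```

Unlike #ifdef, the disabled branch here is still parsed and type-checked, which is one reason C code in the series tends to use IS_ENABLED() while the assembly files keep #ifdef.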
[PATCH v5 12/15] x86/ftrace: enable dynamic ftrace without CONFIG_MODULES
From: "Mike Rapoport (IBM)"

Dynamic ftrace must allocate memory for code and this was impossible
without CONFIG_MODULES.

With execmem separated from the modules code, execmem_alloc() is
available regardless of CONFIG_MODULES.

Remove dependency of dynamic ftrace on CONFIG_MODULES and make
CONFIG_DYNAMIC_FTRACE select CONFIG_EXECMEM in Kconfig.

Signed-off-by: Mike Rapoport (IBM)
---
 arch/x86/Kconfig         |  1 +
 arch/x86/kernel/ftrace.c | 10 --
 2 files changed, 1 insertion(+), 10 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 3f5ba72c9480..cd8addb96a0b 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -34,6 +34,7 @@ config X86_64
 	select SWIOTLB
 	select ARCH_HAS_ELFCORE_COMPAT
 	select ZONE_DMA32
+	select EXECMEM if DYNAMIC_FTRACE
 
 config FORCE_DYNAMIC_FTRACE
 	def_bool y
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index c8ddb7abda7c..8da0e66ca22d 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -261,8 +261,6 @@ void arch_ftrace_update_code(int command)
 /* Currently only x86_64 supports dynamic trampolines */
 #ifdef CONFIG_X86_64
 
-#ifdef CONFIG_MODULES
-/* Module allocation simplifies allocating memory for code */
 static inline void *alloc_tramp(unsigned long size)
 {
 	return execmem_alloc(EXECMEM_FTRACE, size);
@@ -271,14 +269,6 @@ static inline void tramp_free(void *tramp)
 {
 	execmem_free(tramp);
 }
-#else
-/* Trampolines can only be created if modules are supported */
-static inline void *alloc_tramp(unsigned long size)
-{
-	return NULL;
-}
-static inline void tramp_free(void *tramp) { }
-#endif
 
 /* Defined as markers to the end of the ftrace default trampolines */
 extern void ftrace_regs_caller_end(void);
-- 
2.43.0
[PATCH v5 11/15] arch: make execmem setup available regardless of CONFIG_MODULES
From: "Mike Rapoport (IBM)" execmem does not depend on modules, on the contrary modules use execmem. To make execmem available when CONFIG_MODULES=n, for instance for kprobes, split execmem_params initialization out from arch/*/kernel/module.c and compile it when CONFIG_EXECMEM=y Signed-off-by: Mike Rapoport (IBM) --- arch/arm/kernel/module.c | 43 -- arch/arm/mm/init.c | 45 +++ arch/arm64/kernel/module.c | 140 - arch/arm64/mm/init.c | 140 + arch/loongarch/kernel/module.c | 19 - arch/loongarch/mm/init.c | 21 + arch/mips/kernel/module.c | 22 -- arch/mips/mm/init.c| 23 ++ arch/nios2/kernel/module.c | 20 - arch/nios2/mm/init.c | 21 + arch/parisc/kernel/module.c| 20 - arch/parisc/mm/init.c | 23 +- arch/powerpc/kernel/module.c | 63 --- arch/powerpc/mm/mem.c | 64 +++ arch/riscv/kernel/module.c | 44 --- arch/riscv/mm/init.c | 45 +++ arch/s390/kernel/module.c | 27 --- arch/s390/mm/init.c| 30 +++ arch/sparc/kernel/module.c | 19 - arch/sparc/mm/Makefile | 2 + arch/sparc/mm/execmem.c| 21 + arch/x86/kernel/module.c | 27 --- arch/x86/mm/init.c | 29 +++ 23 files changed, 463 insertions(+), 445 deletions(-) create mode 100644 arch/sparc/mm/execmem.c diff --git a/arch/arm/kernel/module.c b/arch/arm/kernel/module.c index a98fdf6ff26c..677f218f7e84 100644 --- a/arch/arm/kernel/module.c +++ b/arch/arm/kernel/module.c @@ -12,57 +12,14 @@ #include #include #include -#include #include #include -#include -#include #include #include #include #include -#ifdef CONFIG_XIP_KERNEL -/* - * The XIP kernel text is mapped in the module area for modules and - * some other stuff to work without any indirect relocations. - * MODULES_VADDR is redefined here and not in asm/memory.h to avoid - * recompiling the whole kernel when CONFIG_XIP_KERNEL is turned on/off. 
- */
-#undef MODULES_VADDR
-#define MODULES_VADDR	(((unsigned long)_exiprom + ~PMD_MASK) & PMD_MASK)
-#endif
-
-#ifdef CONFIG_MMU
-static struct execmem_info execmem_info __ro_after_init;
-
-struct execmem_info __init *execmem_arch_setup(void)
-{
-	unsigned long fallback_start = 0, fallback_end = 0;
-
-	if (IS_ENABLED(CONFIG_ARM_MODULE_PLTS)) {
-		fallback_start = VMALLOC_START;
-		fallback_end = VMALLOC_END;
-	}
-
-	execmem_info = (struct execmem_info){
-		.ranges = {
-			[EXECMEM_DEFAULT] = {
-				.start	= MODULES_VADDR,
-				.end	= MODULES_END,
-				.pgprot	= PAGE_KERNEL_EXEC,
-				.alignment = 1,
-				.fallback_start	= fallback_start,
-				.fallback_end	= fallback_end,
-			},
-		},
-	};
-
-	return &execmem_info;
-}
-#endif
-
 bool module_init_section(const char *name)
 {
 	return strstarts(name, ".init") ||
diff --git a/arch/arm/mm/init.c b/arch/arm/mm/init.c
index e8c6f4be0ce1..5345d218899a 100644
--- a/arch/arm/mm/init.c
+++ b/arch/arm/mm/init.c
@@ -22,6 +22,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
@@ -486,3 +487,47 @@ void free_initrd_mem(unsigned long start, unsigned long end)
 	free_reserved_area((void *)start, (void *)end, -1, "initrd");
 }
 #endif
+
+#ifdef CONFIG_EXECMEM
+
+#ifdef CONFIG_XIP_KERNEL
+/*
+ * The XIP kernel text is mapped in the module area for modules and
+ * some other stuff to work without any indirect relocations.
+ * MODULES_VADDR is redefined here and not in asm/memory.h to avoid
+ * recompiling the whole kernel when CONFIG_XIP_KERNEL is turned on/off.
+ */ +#undef MODULES_VADDR +#define MODULES_VADDR (((unsigned long)_exiprom + ~PMD_MASK) & PMD_MASK) +#endif + +#ifdef CONFIG_MMU +static struct execmem_info execmem_info __ro_after_init; + +struct execmem_info __init *execmem_arch_setup(void) +{ + unsigned long fallback_start = 0, fallback_end = 0; + + if (IS_ENABLED(CONFIG_ARM_MODULE_PLTS)) { + fallback_start = VMALLOC_START; + fallback_end = VMALLOC_END; + } + + execmem_info = (struct execmem_info){ + .ranges = { + [EXECMEM_DEFAULT] = { + .start = MODULES_VADDR, + .end= MODULES_END, + .pgprot = PAGE_KERNEL_EXEC, + .alignment = 1, + .fallback_start = fallback_start, + .fallback_end = fallback_end, + }, + }, +
[PATCH v5 10/15] powerpc: extend execmem_params for kprobes allocations
From: "Mike Rapoport (IBM)"

powerpc overrides kprobes::alloc_insn_page() to remove writable
permissions when STRICT_MODULE_RWX is on.

Add definition of EXECMEM_KPROBES to execmem_params to allow using the
generic kprobes::alloc_insn_page() with the desired permissions.

As powerpc uses breakpoint instructions to inject kprobes, it does not
need to constrain kprobe allocations to the modules area and can use the
entire vmalloc address space.

Signed-off-by: Mike Rapoport (IBM)
---
 arch/powerpc/kernel/kprobes.c | 20
 arch/powerpc/kernel/module.c  |  7 +++
 2 files changed, 7 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
index 9fcd01bb2ce6..14c5ddec3056 100644
--- a/arch/powerpc/kernel/kprobes.c
+++ b/arch/powerpc/kernel/kprobes.c
@@ -126,26 +126,6 @@ kprobe_opcode_t *arch_adjust_kprobe_addr(unsigned long addr, unsigned long offse
 	return (kprobe_opcode_t *)(addr + offset);
 }
 
-void *alloc_insn_page(void)
-{
-	void *page;
-
-	page = execmem_alloc(EXECMEM_KPROBES, PAGE_SIZE);
-	if (!page)
-		return NULL;
-
-	if (strict_module_rwx_enabled()) {
-		int err = set_memory_rox((unsigned long)page, 1);
-
-		if (err)
-			goto error;
-	}
-	return page;
-error:
-	execmem_free(page);
-	return NULL;
-}
-
 int arch_prepare_kprobe(struct kprobe *p)
 {
 	int ret = 0;
diff --git a/arch/powerpc/kernel/module.c b/arch/powerpc/kernel/module.c
index ac80559015a3..2a23cf7e141b 100644
--- a/arch/powerpc/kernel/module.c
+++ b/arch/powerpc/kernel/module.c
@@ -94,6 +94,7 @@ static struct execmem_info execmem_info __ro_after_init;
 
 struct execmem_info __init *execmem_arch_setup(void)
 {
+	pgprot_t kprobes_prot = strict_module_rwx_enabled() ? PAGE_KERNEL_ROX : PAGE_KERNEL_EXEC;
 	pgprot_t prot = strict_module_rwx_enabled() ? PAGE_KERNEL : PAGE_KERNEL_EXEC;
 	unsigned long fallback_start = 0, fallback_end = 0;
 	unsigned long start, end;
@@ -132,6 +133,12 @@ struct execmem_info __init *execmem_arch_setup(void)
 				.fallback_start	= fallback_start,
 				.fallback_end	= fallback_end,
 			},
+			[EXECMEM_KPROBES] = {
+				.start	= VMALLOC_START,
+				.end	= VMALLOC_END,
+				.pgprot	= kprobes_prot,
+				.alignment = 1,
+			},
 			[EXECMEM_MODULE_DATA] = {
 				.start	= VMALLOC_START,
 				.end	= VMALLOC_END,
-- 
2.43.0
[PATCH v5 09/15] riscv: extend execmem_params for generated code allocations
From: "Mike Rapoport (IBM)" The memory allocations for kprobes and BPF on RISC-V are not placed in the modules area and these custom allocations are implemented with overrides of alloc_insn_page() and bpf_jit_alloc_exec(). Slightly reorder execmem_params initialization to support both 32 and 64 bit variants, define EXECMEM_KPROBES and EXECMEM_BPF ranges in riscv::execmem_params and drop overrides of alloc_insn_page() and bpf_jit_alloc_exec(). Signed-off-by: Mike Rapoport (IBM) Reviewed-by: Alexandre Ghiti --- arch/riscv/kernel/module.c | 28 +--- arch/riscv/kernel/probes/kprobes.c | 10 -- arch/riscv/net/bpf_jit_core.c | 13 - 3 files changed, 25 insertions(+), 26 deletions(-) diff --git a/arch/riscv/kernel/module.c b/arch/riscv/kernel/module.c index 182904127ba0..2ecbacbc9993 100644 --- a/arch/riscv/kernel/module.c +++ b/arch/riscv/kernel/module.c @@ -906,19 +906,41 @@ int apply_relocate_add(Elf_Shdr *sechdrs, const char *strtab, return 0; } -#if defined(CONFIG_MMU) && defined(CONFIG_64BIT) +#ifdef CONFIG_MMU static struct execmem_info execmem_info __ro_after_init; struct execmem_info __init *execmem_arch_setup(void) { + unsigned long start, end; + + if (IS_ENABLED(CONFIG_64BIT)) { + start = MODULES_VADDR; + end = MODULES_END; + } else { + start = VMALLOC_START; + end = VMALLOC_END; + } + execmem_info = (struct execmem_info){ .ranges = { [EXECMEM_DEFAULT] = { - .start = MODULES_VADDR, - .end= MODULES_END, + .start = start, + .end= end, .pgprot = PAGE_KERNEL, .alignment = 1, }, + [EXECMEM_KPROBES] = { + .start = VMALLOC_START, + .end= VMALLOC_END, + .pgprot = PAGE_KERNEL_READ_EXEC, + .alignment = 1, + }, + [EXECMEM_BPF] = { + .start = BPF_JIT_REGION_START, + .end= BPF_JIT_REGION_END, + .pgprot = PAGE_KERNEL, + .alignment = PAGE_SIZE, + }, }, }; diff --git a/arch/riscv/kernel/probes/kprobes.c b/arch/riscv/kernel/probes/kprobes.c index 2f08c14a933d..e64f2f3064eb 100644 --- a/arch/riscv/kernel/probes/kprobes.c +++ b/arch/riscv/kernel/probes/kprobes.c @@ -104,16 +104,6 @@ 
int __kprobes arch_prepare_kprobe(struct kprobe *p) return 0; } -#ifdef CONFIG_MMU -void *alloc_insn_page(void) -{ - return __vmalloc_node_range(PAGE_SIZE, 1, VMALLOC_START, VMALLOC_END, -GFP_KERNEL, PAGE_KERNEL_READ_EXEC, -VM_FLUSH_RESET_PERMS, NUMA_NO_NODE, -__builtin_return_address(0)); -} -#endif - /* install breakpoint in text */ void __kprobes arch_arm_kprobe(struct kprobe *p) { diff --git a/arch/riscv/net/bpf_jit_core.c b/arch/riscv/net/bpf_jit_core.c index 6b3acac30c06..e238fdbd5dbc 100644 --- a/arch/riscv/net/bpf_jit_core.c +++ b/arch/riscv/net/bpf_jit_core.c @@ -219,19 +219,6 @@ u64 bpf_jit_alloc_exec_limit(void) return BPF_JIT_REGION_SIZE; } -void *bpf_jit_alloc_exec(unsigned long size) -{ - return __vmalloc_node_range(size, PAGE_SIZE, BPF_JIT_REGION_START, - BPF_JIT_REGION_END, GFP_KERNEL, - PAGE_KERNEL, 0, NUMA_NO_NODE, - __builtin_return_address(0)); -} - -void bpf_jit_free_exec(void *addr) -{ - return vfree(addr); -} - void *bpf_arch_text_copy(void *dst, void *src, size_t len) { int ret; -- 2.43.0
[PATCH v5 08/15] mm/execmem, arch: convert remaining overrides of module_alloc to execmem
From: "Mike Rapoport (IBM)" Extend execmem parameters to accommodate more complex overrides of module_alloc() by architectures. This includes specification of a fallback range required by arm, arm64 and powerpc, EXECMEM_MODULE_DATA type required by powerpc, support for allocation of KASAN shadow required by s390 and x86 and support for early initialization of execmem required by x86. The core implementation of execmem_alloc() takes care of suppressing warnings when the initial allocation fails but there is a fallback range defined. Signed-off-by: Mike Rapoport (IBM) Acked-by: Will Deacon --- arch/Kconfig | 6 +++ arch/arm/kernel/module.c | 41 ++--- arch/arm64/kernel/module.c | 67 ++-- arch/arm64/kernel/probes/kprobes.c | 7 --- arch/arm64/net/bpf_jit_comp.c | 11 - arch/powerpc/kernel/module.c | 60 - arch/s390/kernel/module.c | 54 ++- arch/x86/Kconfig | 1 + arch/x86/kernel/module.c | 70 ++ include/linux/execmem.h| 34 +++ include/linux/moduleloader.h | 12 - kernel/module/main.c | 26 +++ mm/execmem.c | 70 +- mm/mm_init.c | 2 + 14 files changed, 259 insertions(+), 202 deletions(-) diff --git a/arch/Kconfig b/arch/Kconfig index 65afb1de48b3..7006f71f0110 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -960,6 +960,12 @@ config ARCH_WANTS_MODULES_DATA_IN_VMALLOC For architectures like powerpc/32 which have constraints on module allocation and need to allocate module data outside of module area. +config ARCH_WANTS_EXECMEM_EARLY + bool + help + For architectures that might allocate executable memory early on + boot, for instance ftrace on x86. 
+
 config HAVE_IRQ_EXIT_ON_IRQ_STACK
 	bool
 	help
diff --git a/arch/arm/kernel/module.c b/arch/arm/kernel/module.c
index e74d84f58b77..a98fdf6ff26c 100644
--- a/arch/arm/kernel/module.c
+++ b/arch/arm/kernel/module.c
@@ -16,6 +16,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
@@ -34,23 +35,31 @@
 #endif
 
 #ifdef CONFIG_MMU
-void *module_alloc(unsigned long size)
+static struct execmem_info execmem_info __ro_after_init;
+
+struct execmem_info __init *execmem_arch_setup(void)
 {
-	gfp_t gfp_mask = GFP_KERNEL;
-	void *p;
-
-	/* Silence the initial allocation */
-	if (IS_ENABLED(CONFIG_ARM_MODULE_PLTS))
-		gfp_mask |= __GFP_NOWARN;
-
-	p = __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
-				gfp_mask, PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
-				__builtin_return_address(0));
-	if (!IS_ENABLED(CONFIG_ARM_MODULE_PLTS) || p)
-		return p;
-	return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
-				GFP_KERNEL, PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
-				__builtin_return_address(0));
+	unsigned long fallback_start = 0, fallback_end = 0;
+
+	if (IS_ENABLED(CONFIG_ARM_MODULE_PLTS)) {
+		fallback_start = VMALLOC_START;
+		fallback_end = VMALLOC_END;
+	}
+
+	execmem_info = (struct execmem_info){
+		.ranges = {
+			[EXECMEM_DEFAULT] = {
+				.start	= MODULES_VADDR,
+				.end	= MODULES_END,
+				.pgprot	= PAGE_KERNEL_EXEC,
+				.alignment = 1,
+				.fallback_start	= fallback_start,
+				.fallback_end	= fallback_end,
+			},
+		},
+	};
+
+	return &execmem_info;
 }
 #endif
 
diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c
index e92da4da1b2a..a52240ea084b 100644
--- a/arch/arm64/kernel/module.c
+++ b/arch/arm64/kernel/module.c
@@ -20,6 +20,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
@@ -108,41 +109,59 @@ static int __init module_init_limits(void)
 	return 0;
 }
-subsys_initcall(module_init_limits);
 
-void *module_alloc(unsigned long size)
+static struct execmem_info execmem_info __ro_after_init;
+
+struct execmem_info __init *execmem_arch_setup(void)
 {
-	void *p = NULL;
+	unsigned long fallback_start = 0, fallback_end = 0;
+	unsigned long start = 0, end = 0;
+
+	module_init_limits();
 
 	/*
 	 * Where possible, prefer to allocate within direct branch range of the
 	 * kernel such that no PLTs are necessary.
 	 */
 	if (module_direct_base) {
-		p = __vmalloc_node_range(size, MODULE_ALIGN,
-					 module_direct_base,
-
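To make the fallback behaviour described in this commit message concrete, here is a small userspace sketch, assuming a trivial bump allocator in place of __vmalloc_node_range(). All names (demo_range, demo_cursor, bump_alloc, demo_execmem_alloc) and the addresses are invented for illustration; the real logic lives in the mm/execmem.c hunk, which is cut off in this message.

```c
#include <stddef.h>

/* Hypothetical model of one execmem range with an optional fallback. */
struct demo_range {
	unsigned long start, end;			/* primary range */
	unsigned long fallback_start, fallback_end;	/* 0/0 means no fallback */
};

/* Hypothetical allocator state: next free address in each range. */
struct demo_cursor {
	unsigned long next;
	unsigned long next_fallback;
};

/* Returns an address below 'end', or 0 if the range cannot fit 'size'. */
static unsigned long bump_alloc(unsigned long *next, unsigned long end,
				unsigned long size)
{
	unsigned long addr = *next;

	if (size == 0 || addr > end || end - addr < size)
		return 0;
	*next = addr + size;
	return addr;
}

static unsigned long demo_execmem_alloc(const struct demo_range *r,
					struct demo_cursor *c,
					unsigned long size)
{
	unsigned long addr = bump_alloc(&c->next, r->end, size);

	/* Retry in the fallback range only if the architecture defined one. */
	if (!addr && r->fallback_start)
		addr = bump_alloc(&c->next_fallback, r->fallback_end, size);
	return addr;
}
```

In the kernel the first attempt would additionally be made quietly (the removed arm code used __GFP_NOWARN for this) so that a failure that is about to be retried does not produce a warning.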
[PATCH v5 07/15] mm/execmem, arch: convert simple overrides of module_alloc to execmem
From: "Mike Rapoport (IBM)"

Several architectures override module_alloc() only to define an address
range for code allocations that differs from the VMALLOC address space.

Provide a generic implementation in execmem that uses the parameters for
address space ranges, required alignment and page protections provided by
architectures.

The architectures must fill the execmem_info structure and implement
execmem_arch_setup() that returns a pointer to that structure. This way the
execmem initialization won't be called from every architecture, but rather
from a central place, namely a core_initcall() in execmem.

execmem provides the execmem_alloc() API that wraps __vmalloc_node_range()
with the parameters defined by the architectures. If an architecture does
not implement execmem_arch_setup(), execmem_alloc() will fall back to
module_alloc().

Signed-off-by: Mike Rapoport (IBM)
---
 arch/loongarch/kernel/module.c | 19 +++--
 arch/mips/kernel/module.c      | 20 --
 arch/nios2/kernel/module.c     | 21 +++---
 arch/parisc/kernel/module.c    | 24 +++
 arch/riscv/kernel/module.c     | 24 +++
 arch/sparc/kernel/module.c     | 20 --
 include/linux/execmem.h        | 41 +++
 mm/execmem.c                   | 73 --
 8 files changed, 208 insertions(+), 34 deletions(-)

diff --git a/arch/loongarch/kernel/module.c b/arch/loongarch/kernel/module.c
index c7d0338d12c1..ca6dd7ea1610 100644
--- a/arch/loongarch/kernel/module.c
+++ b/arch/loongarch/kernel/module.c
@@ -18,6 +18,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -490,10 +491,22 @@ int apply_relocate_add(Elf_Shdr *sechdrs, const char *strtab,
 	return 0;
 }
 
-void *module_alloc(unsigned long size)
+static struct execmem_info execmem_info __ro_after_init;
+
+struct execmem_info __init *execmem_arch_setup(void)
 {
-	return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
-			GFP_KERNEL, PAGE_KERNEL, 0, NUMA_NO_NODE, __builtin_return_address(0));
+	execmem_info = (struct execmem_info){
+		.ranges = {
+			[EXECMEM_DEFAULT] = {
+				.start	= MODULES_VADDR,
+				.end	= MODULES_END,
+				.pgprot	= PAGE_KERNEL,
+				.alignment = 1,
+			},
+		},
+	};
+
+	return &execmem_info;
 }
 
 static void module_init_ftrace_plt(const Elf_Ehdr *hdr,
diff --git a/arch/mips/kernel/module.c b/arch/mips/kernel/module.c
index 9a6c96014904..59225a3cf918 100644
--- a/arch/mips/kernel/module.c
+++ b/arch/mips/kernel/module.c
@@ -20,6 +20,7 @@
 #include
 #include
 #include
+#include
 #include
 
 struct mips_hi16 {
@@ -32,11 +33,22 @@ static LIST_HEAD(dbe_list);
 static DEFINE_SPINLOCK(dbe_lock);
 
 #ifdef MODULES_VADDR
-void *module_alloc(unsigned long size)
+static struct execmem_info execmem_info __ro_after_init;
+
+struct execmem_info __init *execmem_arch_setup(void)
 {
-	return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
-				GFP_KERNEL, PAGE_KERNEL, 0, NUMA_NO_NODE,
-				__builtin_return_address(0));
+	execmem_info = (struct execmem_info){
+		.ranges = {
+			[EXECMEM_DEFAULT] = {
+				.start	= MODULES_VADDR,
+				.end	= MODULES_END,
+				.pgprot	= PAGE_KERNEL,
+				.alignment = 1,
+			},
+		},
+	};
+
+	return &execmem_info;
 }
 #endif
 
diff --git a/arch/nios2/kernel/module.c b/arch/nios2/kernel/module.c
index 9c97b7513853..0d1ee86631fc 100644
--- a/arch/nios2/kernel/module.c
+++ b/arch/nios2/kernel/module.c
@@ -18,15 +18,26 @@
 #include
 #include
 #include
+#include
 
 #include
 
-void *module_alloc(unsigned long size)
+static struct execmem_info execmem_info __ro_after_init;
+
+struct execmem_info __init *execmem_arch_setup(void)
 {
-	return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
-				    GFP_KERNEL, PAGE_KERNEL_EXEC,
-				    VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
-				    __builtin_return_address(0));
+	execmem_info = (struct execmem_info){
+		.ranges = {
+			[EXECMEM_DEFAULT] = {
+				.start	= MODULES_VADDR,
+				.end	= MODULES_END,
+				.pgprot	= PAGE_KERNEL_EXEC,
+				.alignment = 1,
+			},
+		},
+	};
+
+	return &execmem_info;
 }
 
 int apply_relocate_add(Elf32_Shdr *sechdrs, const char *strtab,
diff --git a/arch/parisc/kernel/module.c b/arch/parisc/kernel/module.c
index
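The split of responsibilities described in this commit message (architecture fills in a table once, core code consults it per allocation type) can be sketched in userspace roughly as follows. The enum values mirror the type names used in the series, but the addresses are arbitrary, and the rule that an unset type reuses the default range is an assumption made for this example, since the mm/execmem.c hunk is not included in the message as archived.

```c
#include <stddef.h>

enum execmem_type {
	EXECMEM_DEFAULT,
	EXECMEM_KPROBES,
	EXECMEM_FTRACE,
	EXECMEM_BPF,
	EXECMEM_TYPE_MAX,
};

/* Cut-down model: the kernel structure also carries pgprot and more. */
struct execmem_range {
	unsigned long start;
	unsigned long end;
	unsigned int alignment;
};

struct execmem_info {
	struct execmem_range ranges[EXECMEM_TYPE_MAX];
};

/* Example addresses only, not any architecture's real layout. */
#define DEMO_MODULES_VADDR 0x10000000UL
#define DEMO_MODULES_END   0x20000000UL

static struct execmem_info execmem_info;

/* Arch side: describe the ranges once and hand back a pointer. */
static struct execmem_info *execmem_arch_setup(void)
{
	execmem_info = (struct execmem_info){
		.ranges = {
			[EXECMEM_DEFAULT] = {
				.start = DEMO_MODULES_VADDR,
				.end = DEMO_MODULES_END,
				.alignment = 1,
			},
		},
	};

	return &execmem_info;
}

/*
 * Core side (assumed behaviour): a type the architecture left empty
 * reuses the default range.
 */
static const struct execmem_range *
execmem_range_for(const struct execmem_info *info, enum execmem_type type)
{
	const struct execmem_range *r = &info->ranges[type];

	if (r->start == 0)
		return &info->ranges[EXECMEM_DEFAULT];
	return r;
}
```

With a table like this, an architecture that only cares about the module area defines one entry, while one with special kprobes or BPF placement adds further entries without overriding any allocation function.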
[PATCH v5 06/15] mm: introduce execmem_alloc() and execmem_free()
From: "Mike Rapoport (IBM)"

module_alloc() is used everywhere as a means to allocate memory for code.

Besides being semantically wrong, this unnecessarily ties all subsystems
that need to allocate code, such as ftrace, kprobes and BPF, to modules and
puts the burden of code allocation on the modules code.

Several architectures override module_alloc() because of various
constraints on where the executable memory can be located, and this causes
additional obstacles for improvements of code allocation.

Start splitting code allocation from modules by introducing
execmem_alloc() and execmem_free() APIs.

Initially, execmem_alloc() is a wrapper for module_alloc() and
execmem_free() is a replacement of module_memfree() to allow updating all
call sites to use the new APIs.

Since architectures define different restrictions on placement,
permissions, alignment and other parameters for memory that can be used by
different subsystems that allocate executable memory, execmem_alloc() takes
a type argument that will be used to identify the calling subsystem and to
allow architectures to define parameters for ranges suitable for that
subsystem.

No functional changes.
Signed-off-by: Mike Rapoport (IBM) Acked-by: Masami Hiramatsu (Google) --- arch/powerpc/kernel/kprobes.c| 6 ++-- arch/s390/kernel/ftrace.c| 4 +-- arch/s390/kernel/kprobes.c | 4 +-- arch/s390/kernel/module.c| 5 +-- arch/sparc/net/bpf_jit_comp_32.c | 8 ++--- arch/x86/kernel/ftrace.c | 6 ++-- arch/x86/kernel/kprobes/core.c | 4 +-- include/linux/execmem.h | 57 include/linux/moduleloader.h | 3 -- kernel/bpf/core.c| 6 ++-- kernel/kprobes.c | 8 ++--- kernel/module/Kconfig| 1 + kernel/module/main.c | 25 +- mm/Kconfig | 3 ++ mm/Makefile | 1 + mm/execmem.c | 32 ++ 16 files changed, 128 insertions(+), 45 deletions(-) create mode 100644 include/linux/execmem.h create mode 100644 mm/execmem.c diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c index bbca90a5e2ec..9fcd01bb2ce6 100644 --- a/arch/powerpc/kernel/kprobes.c +++ b/arch/powerpc/kernel/kprobes.c @@ -19,8 +19,8 @@ #include #include #include -#include #include +#include #include #include #include @@ -130,7 +130,7 @@ void *alloc_insn_page(void) { void *page; - page = module_alloc(PAGE_SIZE); + page = execmem_alloc(EXECMEM_KPROBES, PAGE_SIZE); if (!page) return NULL; @@ -142,7 +142,7 @@ void *alloc_insn_page(void) } return page; error: - module_memfree(page); + execmem_free(page); return NULL; } diff --git a/arch/s390/kernel/ftrace.c b/arch/s390/kernel/ftrace.c index c46381ea04ec..798249ef5646 100644 --- a/arch/s390/kernel/ftrace.c +++ b/arch/s390/kernel/ftrace.c @@ -7,13 +7,13 @@ * Author(s): Martin Schwidefsky */ -#include #include #include #include #include #include #include +#include #include #include #include @@ -220,7 +220,7 @@ static int __init ftrace_plt_init(void) { const char *start, *end; - ftrace_plt = module_alloc(PAGE_SIZE); + ftrace_plt = execmem_alloc(EXECMEM_FTRACE, PAGE_SIZE); if (!ftrace_plt) panic("cannot allocate ftrace plt\n"); diff --git a/arch/s390/kernel/kprobes.c b/arch/s390/kernel/kprobes.c index f0cf20d4b3c5..3c1b1be744de 100644 --- a/arch/s390/kernel/kprobes.c +++ 
b/arch/s390/kernel/kprobes.c @@ -9,7 +9,6 @@ #define pr_fmt(fmt) "kprobes: " fmt -#include #include #include #include @@ -21,6 +20,7 @@ #include #include #include +#include #include #include #include @@ -38,7 +38,7 @@ void *alloc_insn_page(void) { void *page; - page = module_alloc(PAGE_SIZE); + page = execmem_alloc(EXECMEM_KPROBES, PAGE_SIZE); if (!page) return NULL; set_memory_rox((unsigned long)page, 1); diff --git a/arch/s390/kernel/module.c b/arch/s390/kernel/module.c index 42215f9404af..ac97a905e8cd 100644 --- a/arch/s390/kernel/module.c +++ b/arch/s390/kernel/module.c @@ -21,6 +21,7 @@ #include #include #include +#include #include #include #include @@ -76,7 +77,7 @@ void *module_alloc(unsigned long size) #ifdef CONFIG_FUNCTION_TRACER void module_arch_cleanup(struct module *mod) { - module_memfree(mod->arch.trampolines_start); + execmem_free(mod->arch.trampolines_start); } #endif @@ -510,7 +511,7 @@ static int module_alloc_ftrace_hotpatch_trampolines(struct module *me, size = FTRACE_HOTPATCH_TRAMPOLINES_SIZE(s->sh_size); numpages = DIV_ROUND_UP(size, PAGE_SIZE); - start = module_alloc(numpages * PAGE_SIZE); + start = execmem_alloc(EXECMEM_FTRACE, numpages * PAGE_SIZE); if (!start) return -ENOMEM;
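The call-site conversions above all follow one pattern: the caller names its subsystem and the allocator decides what to do with that. A minimal userspace sketch of that shape follows. It is not the kernel implementation: mock_module_alloc() stands in for module_alloc(), EXECMEM_TYPE_MAX and the execmem_calls counter are assumptions added only so the sketch has a bound and something observable, and only the two type values visible in the hunks above are listed.

```c
#include <stddef.h>
#include <stdlib.h>

/* Only the two values that appear in the hunks above; the bound is an assumption. */
enum execmem_type {
	EXECMEM_KPROBES,
	EXECMEM_FTRACE,
	EXECMEM_TYPE_MAX,
};

/* Userspace stand-in for module_alloc(); the kernel returns executable memory. */
static void *mock_module_alloc(size_t size)
{
	return calloc(1, size);
}

/* Hypothetical per-type counter, here only to make the type argument visible. */
static unsigned long execmem_calls[EXECMEM_TYPE_MAX];

/* At this stage the type does not affect placement: a plain wrapper. */
void *execmem_alloc(enum execmem_type type, size_t size)
{
	if (type >= EXECMEM_TYPE_MAX)
		return NULL;
	execmem_calls[type]++;
	return mock_module_alloc(size);
}

/* Counterpart of module_memfree() in this sketch. */
void execmem_free(void *ptr)
{
	free(ptr);
}
```

A caller such as alloc_insn_page() would then read execmem_alloc(EXECMEM_KPROBES, PAGE_SIZE) and release the page with execmem_free(page), as in the diff.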
[PATCH v5 05/15] module: make module_memory_{alloc,free} more self-contained
From: "Mike Rapoport (IBM)" Move the logic related to the memory allocation and freeing into module_memory_alloc() and module_memory_free(). Signed-off-by: Mike Rapoport (IBM) --- kernel/module/main.c | 64 +++- 1 file changed, 39 insertions(+), 25 deletions(-) diff --git a/kernel/module/main.c b/kernel/module/main.c index e1e8a7a9d6c1..5b82b069e0d3 100644 --- a/kernel/module/main.c +++ b/kernel/module/main.c @@ -1203,15 +1203,44 @@ static bool mod_mem_use_vmalloc(enum mod_mem_type type) mod_mem_type_is_core_data(type); } -static void *module_memory_alloc(unsigned int size, enum mod_mem_type type) +static int module_memory_alloc(struct module *mod, enum mod_mem_type type) { + unsigned int size = PAGE_ALIGN(mod->mem[type].size); + void *ptr; + + mod->mem[type].size = size; + if (mod_mem_use_vmalloc(type)) - return vzalloc(size); - return module_alloc(size); + ptr = vmalloc(size); + else + ptr = module_alloc(size); + + if (!ptr) + return -ENOMEM; + + /* +* The pointer to these blocks of memory are stored on the module +* structure and we keep that around so long as the module is +* around. We only free that memory when we unload the module. +* Just mark them as not being a leak then. The .init* ELF +* sections *do* get freed after boot so we *could* treat them +* slightly differently with kmemleak_ignore() and only grey +* them out as they work as typical memory allocations which +* *do* eventually get freed, but let's just keep things simple +* and avoid *any* false positives. +*/ + kmemleak_not_leak(ptr); + + memset(ptr, 0, size); + mod->mem[type].base = ptr; + + return 0; } -static void module_memory_free(void *ptr, enum mod_mem_type type) +static void module_memory_free(struct module *mod, enum mod_mem_type type) { + void *ptr = mod->mem[type].base; + if (mod_mem_use_vmalloc(type)) vfree(ptr); else @@ -1229,12 +1258,12 @@ static void free_mod_mem(struct module *mod) /* Free lock-classes; relies on the preceding sync_rcu(). 
*/ lockdep_free_key_range(mod_mem->base, mod_mem->size); if (mod_mem->size) - module_memory_free(mod_mem->base, type); + module_memory_free(mod, type); } /* MOD_DATA hosts mod, so free it at last */ lockdep_free_key_range(mod->mem[MOD_DATA].base, mod->mem[MOD_DATA].size); - module_memory_free(mod->mem[MOD_DATA].base, MOD_DATA); + module_memory_free(mod, MOD_DATA); } /* Free a module, remove from lists, etc. */ @@ -2225,7 +2254,6 @@ static int find_module_sections(struct module *mod, struct load_info *info) static int move_module(struct module *mod, struct load_info *info) { int i; - void *ptr; enum mod_mem_type t = 0; int ret = -ENOMEM; @@ -2234,26 +2262,12 @@ static int move_module(struct module *mod, struct load_info *info) mod->mem[type].base = NULL; continue; } - mod->mem[type].size = PAGE_ALIGN(mod->mem[type].size); - ptr = module_memory_alloc(mod->mem[type].size, type); - /* - * The pointer to these blocks of memory are stored on the module - * structure and we keep that around so long as the module is - * around. We only free that memory when we unload the module. - * Just mark them as not being a leak then. The .init* ELF - * sections *do* get freed after boot so we *could* treat them - * slightly differently with kmemleak_ignore() and only grey - * them out as they work as typical memory allocations which - * *do* eventually get freed, but let's just keep things simple - * and avoid *any* false positives. -*/ - kmemleak_not_leak(ptr); - if (!ptr) { + + ret = module_memory_alloc(mod, type); + if (ret) { t = type; goto out_enomem; } - memset(ptr, 0, mod->mem[type].size); - mod->mem[type].base = ptr; } /* Transfer each section which specifies SHF_ALLOC */ @@ -2296,7 +2310,7 @@ static int move_module(struct module *mod, struct load_info *info) return 0; out_enomem: for (t--; t >= 0; t--) - module_memory_free(mod->mem[t].base, t); + module_memory_free(mod, t); return ret; } -- 2.43.0
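The reworked move_module() loop is an allocate-all-or-roll-back pattern: each region is allocated through a helper that returns an error code, and on failure the regions already allocated are freed in reverse order. A userspace sketch of that control flow, under stated assumptions: every sketch_* name is hypothetical, malloc()/free() replace the kernel allocators, and the page size is fixed at 4096 for illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u
#define SKETCH_PAGE_ALIGN(x) (((x) + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1))

enum sketch_mem_type { SKETCH_TEXT, SKETCH_DATA, SKETCH_RODATA, SKETCH_MEM_NUM_TYPES };

struct sketch_mem { void *base; unsigned int size; };
struct sketch_module { struct sketch_mem mem[SKETCH_MEM_NUM_TYPES]; };

/* Page-align the size, allocate, zero, and record the base in the module. */
static int sketch_memory_alloc(struct sketch_module *mod, enum sketch_mem_type type)
{
	unsigned int size = SKETCH_PAGE_ALIGN(mod->mem[type].size);
	void *ptr;

	mod->mem[type].size = size;
	ptr = malloc(size);
	if (!ptr)
		return -ENOMEM;

	memset(ptr, 0, size);
	mod->mem[type].base = ptr;
	return 0;
}

static void sketch_memory_free(struct sketch_module *mod, enum sketch_mem_type type)
{
	free(mod->mem[type].base);
	mod->mem[type].base = NULL;
}

/* Allocate every non-empty region; on failure free the ones already done. */
int sketch_alloc_all(struct sketch_module *mod)
{
	int t, ret;

	for (t = 0; t < SKETCH_MEM_NUM_TYPES; t++) {
		if (!mod->mem[t].size) {
			mod->mem[t].base = NULL;
			continue;
		}
		ret = sketch_memory_alloc(mod, t);
		if (ret)
			goto out_enomem;
	}
	return 0;

out_enomem:
	for (t--; t >= 0; t--)
		sketch_memory_free(mod, t);
	return ret;
}
```

Passing the module and the type, rather than a raw size or pointer, is what lets the helper own the alignment, zeroing and bookkeeping that used to sit in the caller.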
[PATCH v5 04/15] sparc: simplify module_alloc()
From: "Mike Rapoport (IBM)"

Define MODULES_VADDR and MODULES_END as VMALLOC_START and VMALLOC_END for 32-bit and reduce module_alloc() to

__vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END, ...)

as with the new defines the allocations become identical for both 32 and 64 bits.

While at it, drop unused include of

Suggested-by: Sam Ravnborg
Signed-off-by: Mike Rapoport (IBM)
--- arch/sparc/include/asm/pgtable_32.h | 2 ++ arch/sparc/kernel/module.c | 25 + 2 files changed, 3 insertions(+), 24 deletions(-) diff --git a/arch/sparc/include/asm/pgtable_32.h b/arch/sparc/include/asm/pgtable_32.h index 9e85d57ac3f2..62bcafe38b1f 100644 --- a/arch/sparc/include/asm/pgtable_32.h +++ b/arch/sparc/include/asm/pgtable_32.h @@ -432,6 +432,8 @@ static inline int io_remap_pfn_range(struct vm_area_struct *vma, #define VMALLOC_START _AC(0xfe60,UL) #define VMALLOC_END _AC(0xffc0,UL) +#define MODULES_VADDR VMALLOC_START +#define MODULES_END VMALLOC_END /* We provide our own get_unmapped_area to cope with VA holes for userland */ #define HAVE_ARCH_UNMAPPED_AREA diff --git a/arch/sparc/kernel/module.c b/arch/sparc/kernel/module.c index 66c45a2764bc..d37adb2a0b54 100644 --- a/arch/sparc/kernel/module.c +++ b/arch/sparc/kernel/module.c @@ -21,35 +21,12 @@ #include "entry.h" -#ifdef CONFIG_SPARC64 - -#include - -static void *module_map(unsigned long size) +void *module_alloc(unsigned long size) { - if (PAGE_ALIGN(size) > MODULES_LEN) - return NULL; return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END, GFP_KERNEL, PAGE_KERNEL, 0, NUMA_NO_NODE, __builtin_return_address(0)); } -#else -static void *module_map(unsigned long size) -{ - return vmalloc(size); -} -#endif /* CONFIG_SPARC64 */ - -void *module_alloc(unsigned long size) -{ - void *ret; - - ret = module_map(size); - if (ret) - memset(ret, 0, size); - - return ret; -} /* Make generic code ignore STT_REGISTER dummy undefined symbols. */ int module_frob_arch_sections(Elf_Ehdr *hdr, -- 2.43.0
[PATCH v5 03/15] nios2: define virtual address space for modules
From: "Mike Rapoport (IBM)" nios2 uses kmalloc() to implement module_alloc() because CALL26/PCREL26 cannot reach all of vmalloc address space. Define module space as 32MiB below the kernel base and switch nios2 to use vmalloc for module allocations. Suggested-by: Thomas Gleixner Acked-by: Dinh Nguyen Acked-by: Song Liu Signed-off-by: Mike Rapoport (IBM) --- arch/nios2/include/asm/pgtable.h | 5 - arch/nios2/kernel/module.c | 19 --- 2 files changed, 8 insertions(+), 16 deletions(-) diff --git a/arch/nios2/include/asm/pgtable.h b/arch/nios2/include/asm/pgtable.h index d052dfcbe8d3..eab87c6beacb 100644 --- a/arch/nios2/include/asm/pgtable.h +++ b/arch/nios2/include/asm/pgtable.h @@ -25,7 +25,10 @@ #include #define VMALLOC_START CONFIG_NIOS2_KERNEL_MMU_REGION_BASE -#define VMALLOC_END(CONFIG_NIOS2_KERNEL_REGION_BASE - 1) +#define VMALLOC_END(CONFIG_NIOS2_KERNEL_REGION_BASE - SZ_32M - 1) + +#define MODULES_VADDR (CONFIG_NIOS2_KERNEL_REGION_BASE - SZ_32M) +#define MODULES_END(CONFIG_NIOS2_KERNEL_REGION_BASE - 1) struct mm_struct; diff --git a/arch/nios2/kernel/module.c b/arch/nios2/kernel/module.c index 76e0a42d6e36..9c97b7513853 100644 --- a/arch/nios2/kernel/module.c +++ b/arch/nios2/kernel/module.c @@ -21,23 +21,12 @@ #include -/* - * Modules should NOT be allocated with kmalloc for (obvious) reasons. - * But we do it for now to avoid relocation issues. CALL26/PCREL26 cannot reach - * from 0x8000 (vmalloc area) to 0xc (kernel) (kmalloc returns - * addresses in 0xc000) - */ void *module_alloc(unsigned long size) { - if (size == 0) - return NULL; - return kmalloc(size, GFP_KERNEL); -} - -/* Free memory returned from module_alloc */ -void module_memfree(void *module_region) -{ - kfree(module_region); + return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END, + GFP_KERNEL, PAGE_KERNEL_EXEC, + VM_FLUSH_RESET_PERMS, NUMA_NO_NODE, + __builtin_return_address(0)); } int apply_relocate_add(Elf32_Shdr *sechdrs, const char *strtab, -- 2.43.0
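The layout change in pgtable.h is plain address arithmetic: the top 32MiB of the old vmalloc range is carved out for modules, so the two ranges stay adjacent and do not overlap. The sketch below works that out in userspace C. The two region base values are assumptions chosen for illustration (the real ones come from Kconfig), and the in_*_space() helpers are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define SZ_32M 0x02000000u

/* Assumed values; the kernel takes these from CONFIG_NIOS2_* options. */
#define KERNEL_MMU_REGION_BASE 0x80000000u
#define KERNEL_REGION_BASE     0xc0000000u

/* Same expressions as in the patched pgtable.h */
#define VMALLOC_START KERNEL_MMU_REGION_BASE
#define VMALLOC_END   (KERNEL_REGION_BASE - SZ_32M - 1)
#define MODULES_VADDR (KERNEL_REGION_BASE - SZ_32M)
#define MODULES_END   (KERNEL_REGION_BASE - 1)

/* Hypothetical helpers: both ranges are inclusive at both ends. */
bool in_module_space(uint32_t addr)
{
	return addr >= MODULES_VADDR && addr <= MODULES_END;
}

bool in_vmalloc_space(uint32_t addr)
{
	return addr >= VMALLOC_START && addr <= VMALLOC_END;
}
```

With these assumed bases the module area starts at 0xbe000000, directly after the last vmalloc address, and ends one byte below the kernel region.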
[PATCH v5 02/15] mips: module: rename MODULE_START to MODULES_VADDR
From: "Mike Rapoport (IBM)" and MODULE_END to MODULES_END to match other architectures that define custom address space for modules. Signed-off-by: Mike Rapoport (IBM) --- arch/mips/include/asm/pgtable-64.h | 4 ++-- arch/mips/kernel/module.c | 4 ++-- arch/mips/mm/fault.c | 4 ++-- 3 files changed, 6 insertions(+), 6 deletions(-) diff --git a/arch/mips/include/asm/pgtable-64.h b/arch/mips/include/asm/pgtable-64.h index 20ca48c1b606..c0109aff223b 100644 --- a/arch/mips/include/asm/pgtable-64.h +++ b/arch/mips/include/asm/pgtable-64.h @@ -147,8 +147,8 @@ #if defined(CONFIG_MODULES) && defined(KBUILD_64BIT_SYM32) && \ VMALLOC_START != CKSSEG /* Load modules into 32bit-compatible segment. */ -#define MODULE_START CKSSEG -#define MODULE_END (FIXADDR_START-2*PAGE_SIZE) +#define MODULES_VADDR CKSSEG +#define MODULES_END(FIXADDR_START-2*PAGE_SIZE) #endif #define pte_ERROR(e) \ diff --git a/arch/mips/kernel/module.c b/arch/mips/kernel/module.c index 7b2fbaa9cac5..9a6c96014904 100644 --- a/arch/mips/kernel/module.c +++ b/arch/mips/kernel/module.c @@ -31,10 +31,10 @@ struct mips_hi16 { static LIST_HEAD(dbe_list); static DEFINE_SPINLOCK(dbe_lock); -#ifdef MODULE_START +#ifdef MODULES_VADDR void *module_alloc(unsigned long size) { - return __vmalloc_node_range(size, 1, MODULE_START, MODULE_END, + return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END, GFP_KERNEL, PAGE_KERNEL, 0, NUMA_NO_NODE, __builtin_return_address(0)); } diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c index aaa9a242ebba..37fedeaca2e9 100644 --- a/arch/mips/mm/fault.c +++ b/arch/mips/mm/fault.c @@ -83,8 +83,8 @@ static void __do_page_fault(struct pt_regs *regs, unsigned long write, if (unlikely(address >= VMALLOC_START && address <= VMALLOC_END)) goto VMALLOC_FAULT_TARGET; -#ifdef MODULE_START - if (unlikely(address >= MODULE_START && address < MODULE_END)) +#ifdef MODULES_VADDR + if (unlikely(address >= MODULES_VADDR && address < MODULES_END)) goto VMALLOC_FAULT_TARGET; #endif -- 2.43.0
[PATCH v5 01/15] arm64: module: remove unneeded call to kasan_alloc_module_shadow()
From: "Mike Rapoport (IBM)" Since commit f6f37d9320a1 ("arm64: select KASAN_VMALLOC for SW/HW_TAGS modes") KASAN_VMALLOC is always enabled when KASAN is on. This means that allocations in module_alloc() will be tracked by KASAN protection for vmalloc() and that kasan_alloc_module_shadow() will be always an empty inline and there is no point in calling it. Drop meaningless call to kasan_alloc_module_shadow() from module_alloc(). Signed-off-by: Mike Rapoport (IBM) --- arch/arm64/kernel/module.c | 5 - 1 file changed, 5 deletions(-) diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c index 47e0be610bb6..e92da4da1b2a 100644 --- a/arch/arm64/kernel/module.c +++ b/arch/arm64/kernel/module.c @@ -141,11 +141,6 @@ void *module_alloc(unsigned long size) __func__); } - if (p && (kasan_alloc_module_shadow(p, size, GFP_KERNEL) < 0)) { - vfree(p); - return NULL; - } - /* Memory is intended to be executable, reset the pointer tag. */ return kasan_reset_tag(p); } -- 2.43.0
[PATCH v5 00/15] mm: jit/text allocator
From: "Mike Rapoport (IBM)"

(something went wrong with the previous posting, sorry for the noise)

Hi,

Since v3 I looked into making execmem more of a utility toolbox, as we discussed at LPC with Mark Rutland, but it was getting hairier than having a struct describing architecture constraints and a type identifying the consumer of execmem.

And I do think that having the description of architecture constraints for allocations of executable memory in a single place is better than having it spread all over the place.

The patches are available via git:
https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=execmem/v5

v5 changes:
* rebase on v6.9-rc4 to avoid a conflict in kprobes
* add copyrights to mm/execmem.c (Luis)
* fix spelling (Ingo)
* define MODULES_VADDR for sparc (Sam)
* consistently initialize struct execmem_info (Peter)
* reduce #ifdefs in function bodies in kprobes (Masami)

v4: https://lore.kernel.org/all/20240411160051.2093261-1-r...@kernel.org
* rebase on v6.9-rc2
* rename execmem_params to execmem_info and execmem_arch_params() to execmem_arch_setup()
* use single execmem_alloc() API instead of execmem_{text,data}_alloc() (Song)
* avoid extra copy of execmem parameters (Rick)
* run execmem_init() as core_initcall() except for the architectures that may allocate text really early (currently only x86) (Will)
* add acks for some of arm64 and riscv changes, thanks Will and Alexandre
* new commits:
  - drop call to kasan_alloc_module_shadow() on arm64 because it's not needed anymore
  - rename MODULE_START to MODULES_VADDR on MIPS
  - use CONFIG_EXECMEM instead of CONFIG_MODULES on powerpc as per Christophe:
    https://lore.kernel.org/all/79062fa3-3402-47b3-8920-9231ad05e...@csgroup.eu/

v3: https://lore.kernel.org/all/20230918072955.2507221-1-r...@kernel.org
* add type parameter to execmem allocation APIs
* remove BPF dependency on modules

v2: https://lore.kernel.org/all/20230616085038.4121892-1-r...@kernel.org
* Separate "module" and "others" allocations with execmem_text_alloc() and jit_text_alloc()
* Drop ROX entailment on x86
* Add ack for nios2 changes, thanks Dinh Nguyen

v1: https://lore.kernel.org/all/20230601101257.530867-1-r...@kernel.org

= Cover letter from v1 (slightly updated) =

module_alloc() is used everywhere as a means to allocate memory for code.

Besides being semantically wrong, this unnecessarily ties all subsystems that need to allocate code, such as ftrace, kprobes and BPF, to modules and puts the burden of code allocation on the modules code.

Several architectures override module_alloc() because of various constraints on where the executable memory can be located, and this causes additional obstacles for improvements of code allocation.

A centralized infrastructure for code allocation allows allocations of executable memory as ROX, and future optimizations such as caching large pages for better iTLB performance and providing sub-page allocations for users that only need small jit code snippets.

Rick Edgecombe proposed perm_alloc extension to vmalloc [1] and Song Liu proposed execmem_alloc [2], but both these approaches were targeting BPF allocations and lacked the ground work to abstract executable allocations and split them from the modules core.

Thomas Gleixner suggested to express module allocation restrictions and requirements as struct mod_alloc_type_params [3] that would define ranges, protections and other parameters for different types of allocations used by modules, and following that suggestion Song separated allocations of different types in modules (commit ac3b43283923 ("module: replace module_layout with module_memory")) and posted "Type aware module allocator" set [4].

I liked the idea of parametrising code allocation requirements as a structure, but I believe the original proposal and Song's module allocator were too module centric, so I came up with these patches.

This set splits code allocation from modules by introducing execmem_alloc() and execmem_free() APIs, replaces call sites of module_alloc() and module_memfree() with the new APIs and implements core text and related allocations in a central place.

Instead of architecture specific overrides for module_alloc(), the architectures that require non-default behaviour for text allocation must fill execmem_info structure and implement execmem_arch_setup() that returns a pointer to that structure. If an architecture does not implement execmem_arch_setup(), the defaults compatible with the current modules::module_alloc() are used.

Since architectures define different restrictions on placement, permissions, alignment and other parameters for memory that can be used by different subsystems that allocate executable memory, execmem APIs take a type argument that will be used to identify the calling subsystem and to allow architectures to define parameters for ranges suitable for that subsystem.

The new infrastructure allows decoupling of BPF, kprobes and ftrace from
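The arrangement described in the cover letter, where an architecture fills a structure of per-type parameters and returns it from execmem_arch_setup(), can be sketched in a few lines of userspace C. This is only an illustration under assumptions: the field names in struct execmem_range, the EXECMEM_DEFAULT and EXECMEM_BPF values, the address values and the execmem_range_for() lookup are not taken from the posted patches, and the fallback rule (a type left empty uses the default range) is one plausible reading of "the defaults ... are used".

```c
#include <stddef.h>

/* Type list is illustrative; only KPROBES and FTRACE appear in the posted hunks. */
enum execmem_type {
	EXECMEM_DEFAULT,
	EXECMEM_KPROBES,
	EXECMEM_FTRACE,
	EXECMEM_BPF,
	EXECMEM_TYPE_MAX,
};

/* Hypothetical, reduced set of per-range parameters. */
struct execmem_range {
	unsigned long start;
	unsigned long end;
	unsigned int alignment;
};

struct execmem_info {
	struct execmem_range ranges[EXECMEM_TYPE_MAX];
};

/* An imaginary architecture that only describes two of the types. */
static struct execmem_info arch_info = {
	.ranges = {
		[EXECMEM_DEFAULT] = { .start = 0xf0000000UL, .end = 0xf7ffffffUL, .alignment = 1 },
		[EXECMEM_KPROBES] = { .start = 0xf8000000UL, .end = 0xf8ffffffUL, .alignment = 1 },
	},
};

struct execmem_info *execmem_arch_setup(void)
{
	return &arch_info;
}

/* Types the architecture left empty fall back to the default range. */
const struct execmem_range *execmem_range_for(enum execmem_type type)
{
	const struct execmem_info *info = execmem_arch_setup();
	const struct execmem_range *r = &info->ranges[type];

	if (r->end == 0)
		return &info->ranges[EXECMEM_DEFAULT];
	return r;
}
```

Keeping all ranges in one table is what puts the architecture constraints in a single place; the consumers only pass a type.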
Re: [PATCH] powerpc: Split PAGE_SHIFT/SIZE into vdso/page.h
Michael Ellerman writes:
> The VDSO needs PAGE_SHIFT/SIZE defined, so it includes asm/page.h.

For the archives, this was superseded by Arnd's rework:

https://lore.kernel.org/all/20240320180228.136371-1-a...@kernel.org/

cheers
[RFC PATCH 2/2] objtool/powerpc: Enhance objtool to fixup alternate feature relative addresses
Implement build-time fixup of alternate feature relative addresses for the out-of-line (else) patch code. Initial posting to achieve the same using another tool can be found at [1]. Idea is to implement this using objtool instead of introducing another tool since it already has elf parsing and processing covered.

Introduce --ftr-fixup as an option to objtool to do feature fixup at build-time.

Couple of issues and warnings encountered while implementing feature fixup using objtool are as follows:

1. libelf is creating corrupted vmlinux file after writing necessary changes to the file. Due to this, kexec is not able to load new kernel.

   It gives the following error:
      ELF Note corrupted !
      Cannot determine the file type of vmlinux

   To fix this issue, after opening vmlinux file, make a call to elf_flagelf (e, ELF_C_SET, ELF_F_LAYOUT). This instructs libelf not to touch the segment and section layout. It informs the library that the application will take responsibility for the layout of the file and that the library should not insert any padding between sections.

2. Fix can't find starting instruction warnings when run on vmlinux

   Objtool throws a lot of can't find starting instruction warnings when run on vmlinux with --ftr-fixup option.

   These warnings are seen because find_insn() function looks for instructions at offsets that are relative to the start of the section. In case of individual object files (.o), there are no can't find starting instruction warnings seen because the actual offset associated with an instruction is itself a relative offset since the sections start at offset 0x0.

   However, in case of vmlinux, find_insn() function fails to find instructions at the actual offset associated with an instruction since the sections in vmlinux do not start at offset 0x0. Due to this, find_insn() will look for absolute offset and not the relative offset. This is resulting in a lot of can't find starting instruction warnings when objtool is run on vmlinux.

   To fix this, pass offset that is relative to the start of the section to find_insn().

   find_insn() is also looking for symbols of size 0. But, objtool does not store empty STT_NOTYPE symbols in the rbtree. Due to this, for empty symbols, objtool is throwing can't find starting instruction warnings. Fix this by ignoring symbols that are of size 0 since objtool does not add them to the rbtree.

3. Objtool is throwing unannotated intra-function call warnings when run on vmlinux with --ftr-fixup option.

   One such example:

   vmlinux: warning: objtool: .text+0x3d94: unannotated intra-function call

   .text + 0x3d94 = c0008000 + 3d94 = c00081d4

   c00081d4: 45 24 02 48 bl c002a618

   c002a610 :
   c002a610: 0e 01 4c 3c addis r2,r12,270
   c002a610: R_PPC64_REL16_HA.TOC.
   c002a614: f0 6c 42 38 addi r2,r2,27888
   c002a614: R_PPC64_REL16_LO.TOC.+0x4
   c002a618: a6 02 08 7c mflr r0

   This is happening because we should be looking for destination symbols that are at absolute offsets instead of relative offsets. After fixing dest_off to point to absolute offset, there are still a lot of these warnings shown.

   In the above example, objtool is computing the destination offset to be c002a618, which points to a completely different instruction. find_call_destination() is looking for this offset and failing. Instead, we should be looking for destination offset c002a610 which points to system_reset_exception function.

   Even after fixing the way destination offset is computed, and after looking for dest_off - 0x8 in cases where the original offset is not found, there are still a lot of unannotated intra-function call warnings generated. This is due to symbols that are not properly annotated.

   So, for now, as a hack to curb these warnings, do not emit unannotated intra-function call warnings when objtool is run with --ftr-fixup option.

TODO:
This patch enables build time feature fixup only for powerpc little endian configs. There are boot failures with big endian configs.
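Items 2 and 3 above come down to two lookups that use different keys: instructions are found by section-relative offset, while call destinations are matched against symbol addresses, with a retry 8 bytes earlier and with zero-size symbols skipped. The userspace sketch below illustrates only that idea. It does not use objtool's data structures: every sketch_* name, the tables and the addresses are made up for illustration, and linear search replaces objtool's hash and rbtree lookups.

```c
#include <stdbool.h>
#include <stddef.h>

struct sketch_sym {
	const char *name;
	unsigned long addr;	/* absolute address, as symbols appear in vmlinux */
	unsigned long len;	/* 0 for empty STT_NOTYPE-style symbols */
};

/* Made-up symbol table; the empty symbol shares an address with func_b. */
static const struct sketch_sym sketch_syms[] = {
	{ "func_a",       0x12000, 0x40 },
	{ "empty_notype", 0x12040, 0    },
	{ "func_b",       0x12040, 0x80 },
};

/* Made-up instruction offsets, relative to the start of the section. */
static const unsigned long sketch_insn_offsets[] = { 0x0, 0x4, 0x2000, 0x2008 };

/* Instructions are keyed by section-relative offset, so convert first. */
bool sketch_find_insn(unsigned long sec_start, unsigned long addr)
{
	unsigned long off = addr - sec_start;
	size_t i;

	for (i = 0; i < sizeof(sketch_insn_offsets) / sizeof(sketch_insn_offsets[0]); i++)
		if (sketch_insn_offsets[i] == off)
			return true;
	return false;
}

/* Exact-address symbol lookup that ignores zero-size symbols. */
static const struct sketch_sym *sketch_sym_at(unsigned long addr)
{
	size_t i;

	for (i = 0; i < sizeof(sketch_syms) / sizeof(sketch_syms[0]); i++)
		if (sketch_syms[i].len && sketch_syms[i].addr == addr)
			return &sketch_syms[i];
	return NULL;
}

/* Try the destination itself, then 8 bytes earlier (past the TOC setup). */
const struct sketch_sym *sketch_find_call_dest(unsigned long dest_addr)
{
	const struct sketch_sym *sym = sketch_sym_at(dest_addr);

	if (!sym && dest_addr >= 8)
		sym = sketch_sym_at(dest_addr - 8);
	return sym;
}
```

The 8-byte retry corresponds to the two instructions (addis, addi) that precede the branch target in the disassembly quoted above; whether that is the right general rule is exactly what the RFC leaves open.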
Posting this as an initial RFC to get some review comments while I work on big endian issues. [1] https://lore.kernel.org/linuxppc-dev/20170521010130.13552-1-npig...@gmail.com/ Co-developed-by: Nicholas Piggin Signed-off-by: Nicholas Piggin Signed-off-by: Sathvika Vasireddy --- arch/Kconfig | 3 + arch/powerpc/Kconfig | 5 + arch/powerpc/Makefile | 5 + arch/powerpc/include/asm/feature-fixups.h | 11 +- arch/powerpc/kernel/vmlinux.lds.S | 14 +- arch/powerpc/lib/feature-fixups.c | 13 + scripts/Makefile.lib | 7 + scripts/Makefile.vmlinux | 15 +- tools/objtool/arch/powerpc/special.c | 329 ++ tools/objtool/arch/x86/special.c
[RFC PATCH 1/2] objtool: Run objtool only if either of the config options are selected
Currently, when objtool is enabled and none of the supported options are triggered, kernel build errors out with the below error: error: objtool: At least one command required. To address this, ensure that objtool is run only when either of the config options are selected. Signed-off-by: Sathvika Vasireddy --- scripts/Makefile.lib | 3 +++ 1 file changed, 3 insertions(+) diff --git a/scripts/Makefile.lib b/scripts/Makefile.lib index 3179747cbd2c..c65bb0fbd136 100644 --- a/scripts/Makefile.lib +++ b/scripts/Makefile.lib @@ -286,7 +286,10 @@ objtool-args = $(objtool-args-y) \ delay-objtool := $(or $(CONFIG_LTO_CLANG),$(CONFIG_X86_KERNEL_IBT)) +ifneq ($(objtool-args-y),) cmd_objtool = $(if $(objtool-enabled), ; $(objtool) $(objtool-args) $@) +endif + cmd_gen_objtooldep = $(if $(objtool-enabled), { echo ; echo '$@: $$(wildcard $(objtool))' ; } >> $(dot-target).cmd) endif # CONFIG_OBJTOOL -- 2.34.1
Re: [PATCH] powerpc/iommu: Refactor spapr_tce_platform_iommu_attach_dev()
On Thu, 15 Feb 2024 07:52:32 -0600, Shivaprasad G Bhat wrote: > The patch makes the iommu_group_get() call only when using it > thereby avoiding the unnecessary get & put for domain already > being set case. > > Applied to powerpc/fixes. [1/1] powerpc/iommu: Refactor spapr_tce_platform_iommu_attach_dev() https://git.kernel.org/powerpc/c/5bd31ab5f79eb6e3bdfa0ca0b57650f9d1604062 cheers
Re: [PATCH] powerpc/crypto/chacha-p10: Fix failure on non Power10
On Fri, 29 Mar 2024 00:02:00 +1100, Michael Ellerman wrote: > The chacha-p10-crypto module provides optimised chacha routines for > Power10. It also selects CRYPTO_ARCH_HAVE_LIB_CHACHA which says it > provides chacha_crypt_arch() to generic code. > > Notably the module needs to provide chacha_crypt_arch() regardless of > whether it is loaded on Power10 or an older CPU. > > [...] Applied to powerpc/fixes. [1/1] powerpc/crypto/chacha-p10: Fix failure on non Power10 https://git.kernel.org/powerpc/c/69630926011c1f7170a465b7b5c228deb66e9372 cheers
Re: [PATCH v2] selftests/powerpc/papr-vpd: Fix missing variable initialization
On Thu, 04 Apr 2024 17:02:09 -0500, Nathan Lynch wrote: > The "close handle without consuming VPD" testcase has inconsistent > results because it fails to initialize the location code object it > passes to ioctl() to create a VPD handle. Initialize the location code > to the empty string as intended. > > Applied to powerpc/fixes. [1/1] selftests/powerpc/papr-vpd: Fix missing variable initialization https://git.kernel.org/powerpc/c/210cfef579260ed6c3b700e7baeae51a5e183f43 cheers
[PATCH v5 15/15] bpf: remove CONFIG_BPF_JIT dependency on CONFIG_MODULES
From: "Mike Rapoport (IBM)" BPF just-in-time compiler depended on CONFIG_MODULES because it used module_alloc() to allocate memory for the generated code. Since code allocations are now implemented with execmem, drop dependency of CONFIG_BPF_JIT on CONFIG_MODULES and make it select CONFIG_EXECMEM. Suggested-by: Björn Töpel Signed-off-by: Mike Rapoport (IBM) --- kernel/bpf/Kconfig | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/bpf/Kconfig b/kernel/bpf/Kconfig index bc25f5098a25..f999e4e0b344 100644 --- a/kernel/bpf/Kconfig +++ b/kernel/bpf/Kconfig @@ -43,7 +43,7 @@ config BPF_JIT bool "Enable BPF Just In Time compiler" depends on BPF depends on HAVE_CBPF_JIT || HAVE_EBPF_JIT - depends on MODULES + select EXECMEM help BPF programs are normally handled by a BPF interpreter. This option allows the kernel to generate native code when a program is loaded -- 2.43.0
[PATCH v5 14/15] kprobes: remove dependency on CONFIG_MODULES
From: "Mike Rapoport (IBM)" kprobes depended on CONFIG_MODULES because it has to allocate memory for code. Since code allocations are now implemented with execmem, kprobes can be enabled in non-modular kernels. Add #ifdef CONFIG_MODULE guards for the code dealing with kprobes inside modules, make CONFIG_KPROBES select CONFIG_EXECMEM and drop the dependency of CONFIG_KPROBES on CONFIG_MODULES. Signed-off-by: Mike Rapoport (IBM) --- arch/Kconfig| 2 +- include/linux/module.h | 9 ++ kernel/kprobes.c| 55 +++-- kernel/trace/trace_kprobe.c | 20 +- 4 files changed, 63 insertions(+), 23 deletions(-) diff --git a/arch/Kconfig b/arch/Kconfig index 7006f71f0110..a48ce6a488b3 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -52,9 +52,9 @@ config GENERIC_ENTRY config KPROBES bool "Kprobes" - depends on MODULES depends on HAVE_KPROBES select KALLSYMS + select EXECMEM select TASKS_RCU if PREEMPTION help Kprobes allows you to trap at almost any kernel address and diff --git a/include/linux/module.h b/include/linux/module.h index 1153b0d99a80..ffa1c603163c 100644 --- a/include/linux/module.h +++ b/include/linux/module.h @@ -605,6 +605,11 @@ static inline bool module_is_live(struct module *mod) return mod->state != MODULE_STATE_GOING; } +static inline bool module_is_coming(struct module *mod) +{ +return mod->state == MODULE_STATE_COMING; +} + struct module *__module_text_address(unsigned long addr); struct module *__module_address(unsigned long addr); bool is_module_address(unsigned long addr); @@ -857,6 +862,10 @@ void *dereference_module_function_descriptor(struct module *mod, void *ptr) return ptr; } +static inline bool module_is_coming(struct module *mod) +{ + return false; +} #endif /* CONFIG_MODULES */ #ifdef CONFIG_SYSFS diff --git a/kernel/kprobes.c b/kernel/kprobes.c index ddd7cdc16edf..ca2c6cbd42d2 100644 --- a/kernel/kprobes.c +++ b/kernel/kprobes.c @@ -1588,7 +1588,7 @@ static int check_kprobe_address_safe(struct kprobe *p, } /* Get module refcount and reject __init 
functions for loaded modules. */ - if (*probed_mod) { + if (IS_ENABLED(CONFIG_MODULES) && *probed_mod) { /* * We must hold a refcount of the probed module while updating * its code to prohibit unexpected unloading. @@ -1603,12 +1603,13 @@ static int check_kprobe_address_safe(struct kprobe *p, * kprobes in there. */ if (within_module_init((unsigned long)p->addr, *probed_mod) && - (*probed_mod)->state != MODULE_STATE_COMING) { + !module_is_coming(*probed_mod)) { module_put(*probed_mod); *probed_mod = NULL; ret = -ENOENT; } } + out: preempt_enable(); jump_label_unlock(); @@ -2488,24 +2489,6 @@ int kprobe_add_area_blacklist(unsigned long start, unsigned long end) return 0; } -/* Remove all symbols in given area from kprobe blacklist */ -static void kprobe_remove_area_blacklist(unsigned long start, unsigned long end) -{ - struct kprobe_blacklist_entry *ent, *n; - - list_for_each_entry_safe(ent, n, &kprobe_blacklist, list) { - if (ent->start_addr < start || ent->start_addr >= end) - continue; - list_del(&ent->list); - kfree(ent); - } -} - -static void kprobe_remove_ksym_blacklist(unsigned long entry) -{ - kprobe_remove_area_blacklist(entry, entry + 1); -} - int __weak arch_kprobe_get_kallsym(unsigned int *symnum, unsigned long *value, char *type, char *sym) { @@ -2570,6 +2553,25 @@ static int __init populate_kprobe_blacklist(unsigned long *start, return ret ? 
: arch_populate_kprobe_blacklist(); } +#ifdef CONFIG_MODULES +/* Remove all symbols in given area from kprobe blacklist */ +static void kprobe_remove_area_blacklist(unsigned long start, unsigned long end) +{ + struct kprobe_blacklist_entry *ent, *n; + + list_for_each_entry_safe(ent, n, &kprobe_blacklist, list) { + if (ent->start_addr < start || ent->start_addr >= end) + continue; + list_del(&ent->list); + kfree(ent); + } +} + +static void kprobe_remove_ksym_blacklist(unsigned long entry) +{ + kprobe_remove_area_blacklist(entry, entry + 1); +} + static void add_module_kprobe_blacklist(struct module *mod) { unsigned long start, end; @@ -2672,6 +2674,17 @@ static struct notifier_block kprobe_module_nb = { .priority = 0 }; +static int kprobe_register_module_notifier(void) +{ + return register_module_notifier(&kprobe_module_nb); +} +#else +static int kprobe_register_module_notifier(void) +{ +
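The module.h hunk uses a common kernel idiom: a helper gets a real definition when the feature is configured and a constant stub otherwise, so callers need no #ifdef of their own. A reduced userspace sketch of that idiom follows. The struct and enum are cut down to what the example needs, CONFIG_MODULES is defined by hand here rather than by Kconfig, and reject_init_probe() is a hypothetical function that only mirrors the reworked condition in check_kprobe_address_safe().

```c
#include <stdbool.h>

/* Set by hand for the sketch; remove the define to get the stub variant. */
#define CONFIG_MODULES 1

enum module_state { MODULE_STATE_LIVE, MODULE_STATE_COMING, MODULE_STATE_GOING };

/* Reduced to the one field the helper looks at. */
struct module {
	enum module_state state;
};

#ifdef CONFIG_MODULES
static inline bool module_is_coming(struct module *mod)
{
	return mod->state == MODULE_STATE_COMING;
}
#else
/* Without module support no module can ever be "coming". */
static inline bool module_is_coming(struct module *mod)
{
	(void)mod;
	return false;
}
#endif

/*
 * Hypothetical mirror of the patched check: a probe on module init text is
 * rejected unless the module is still being loaded.
 */
bool reject_init_probe(struct module *mod, bool within_init)
{
	return within_init && !module_is_coming(mod);
}
```

Because the stub returns a constant, the compiler can drop the dependent code in non-modular builds without the caller changing.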
[PATCH v5 13/15] powerpc: use CONFIG_EXECMEM instead of CONFIG_MODULES where appropriate
From: "Mike Rapoport (IBM)"

There are places where CONFIG_MODULES guards the code that depends on
memory allocation being done with module_alloc().

Replace CONFIG_MODULES with CONFIG_EXECMEM in such places.

Signed-off-by: Mike Rapoport (IBM)
---
 arch/powerpc/Kconfig                 | 2 +-
 arch/powerpc/include/asm/kasan.h     | 2 +-
 arch/powerpc/kernel/head_8xx.S       | 4 ++--
 arch/powerpc/kernel/head_book3s_32.S | 6 +++---
 arch/powerpc/lib/code-patching.c     | 2 +-
 arch/powerpc/mm/book3s32/mmu.c       | 2 +-
 6 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 1c4be3373686..2e586733a464 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -285,7 +285,7 @@ config PPC
 	select IOMMU_HELPER			if PPC64
 	select IRQ_DOMAIN
 	select IRQ_FORCED_THREADING
-	select KASAN_VMALLOC			if KASAN && MODULES
+	select KASAN_VMALLOC			if KASAN && EXECMEM
 	select LOCK_MM_AND_FIND_VMA
 	select MMU_GATHER_PAGE_SIZE
 	select MMU_GATHER_RCU_TABLE_FREE
diff --git a/arch/powerpc/include/asm/kasan.h b/arch/powerpc/include/asm/kasan.h
index 365d2720097c..b5bbb94c51f6 100644
--- a/arch/powerpc/include/asm/kasan.h
+++ b/arch/powerpc/include/asm/kasan.h
@@ -19,7 +19,7 @@

 #define KASAN_SHADOW_SCALE_SHIFT	3

-#if defined(CONFIG_MODULES) && defined(CONFIG_PPC32)
+#if defined(CONFIG_EXECMEM) && defined(CONFIG_PPC32)
 #define KASAN_KERN_START	ALIGN_DOWN(PAGE_OFFSET - SZ_256M, SZ_256M)
 #else
 #define KASAN_KERN_START	PAGE_OFFSET
diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S
index 647b0b445e89..edc479a7c2bc 100644
--- a/arch/powerpc/kernel/head_8xx.S
+++ b/arch/powerpc/kernel/head_8xx.S
@@ -199,12 +199,12 @@ instruction_counter:
 	mfspr	r10, SPRN_SRR0	/* Get effective address of fault */
 	INVALIDATE_ADJACENT_PAGES_CPU15(r10, r11)
 	mtspr	SPRN_MD_EPN, r10
-#ifdef CONFIG_MODULES
+#ifdef CONFIG_EXECMEM
 	mfcr	r11
 	compare_to_kernel_boundary r10, r10
 #endif
 	mfspr	r10, SPRN_M_TWB	/* Get level 1 table */
-#ifdef CONFIG_MODULES
+#ifdef CONFIG_EXECMEM
 	blt+	3f
 	rlwinm	r10, r10, 0, 20, 31
 	oris	r10, r10, (swapper_pg_dir - PAGE_OFFSET)@ha
diff --git a/arch/powerpc/kernel/head_book3s_32.S b/arch/powerpc/kernel/head_book3s_32.S
index c1d89764dd22..57196883a00e 100644
--- a/arch/powerpc/kernel/head_book3s_32.S
+++ b/arch/powerpc/kernel/head_book3s_32.S
@@ -419,14 +419,14 @@ InstructionTLBMiss:
  */
 	/* Get PTE (linux-style) and check access */
 	mfspr	r3,SPRN_IMISS
-#ifdef CONFIG_MODULES
+#ifdef CONFIG_EXECMEM
 	lis	r1, TASK_SIZE@h		/* check if kernel address */
 	cmplw	0,r1,r3
 #endif
 	mfspr	r2, SPRN_SDR1
 	li	r1,_PAGE_PRESENT | _PAGE_ACCESSED | _PAGE_EXEC
 	rlwinm	r2, r2, 28, 0xf000
-#ifdef CONFIG_MODULES
+#ifdef CONFIG_EXECMEM
 	li	r0, 3
 	bgt-	112f
 	lis	r2, (swapper_pg_dir - PAGE_OFFSET)@ha	/* if kernel address, use */
@@ -442,7 +442,7 @@ InstructionTLBMiss:
 	andc.	r1,r1,r2		/* check access & ~permission */
 	bne-	InstructionAddressInvalid /* return if access not permitted */
 	/* Convert linux-style PTE to low word of PPC-style PTE */
-#ifdef CONFIG_MODULES
+#ifdef CONFIG_EXECMEM
 	rlwimi	r2, r0, 0, 31, 31	/* userspace ? -> PP lsb */
 #endif
 	ori	r1, r1, 0xe06		/* clear out reserved bits */
diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index c6ab46156cda..7af791446ddf 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -225,7 +225,7 @@ void __init poking_init(void)

 static unsigned long get_patch_pfn(void *addr)
 {
-	if (IS_ENABLED(CONFIG_MODULES) && is_vmalloc_or_module_addr(addr))
+	if (IS_ENABLED(CONFIG_EXECMEM) && is_vmalloc_or_module_addr(addr))
 		return vmalloc_to_pfn(addr);
 	else
 		return __pa_symbol(addr) >> PAGE_SHIFT;
diff --git a/arch/powerpc/mm/book3s32/mmu.c b/arch/powerpc/mm/book3s32/mmu.c
index 100f999871bc..625fe7d08e06 100644
--- a/arch/powerpc/mm/book3s32/mmu.c
+++ b/arch/powerpc/mm/book3s32/mmu.c
@@ -184,7 +184,7 @@ unsigned long __init mmu_mapin_ram(unsigned long base, unsigned long top)

 static bool is_module_segment(unsigned long addr)
 {
-	if (!IS_ENABLED(CONFIG_MODULES))
+	if (!IS_ENABLED(CONFIG_EXECMEM))
 		return false;
 	if (addr < ALIGN_DOWN(MODULES_VADDR, SZ_256M))
 		return false;
-- 
2.43.0
[PATCH v5 12/15] x86/ftrace: enable dynamic ftrace without CONFIG_MODULES
From: "Mike Rapoport (IBM)"

Dynamic ftrace must allocate memory for code and this was impossible
without CONFIG_MODULES.

With execmem separated from the modules code, execmem_alloc() is
available regardless of CONFIG_MODULES.

Remove the dependency of dynamic ftrace on CONFIG_MODULES and make
CONFIG_DYNAMIC_FTRACE select CONFIG_EXECMEM in Kconfig.

Signed-off-by: Mike Rapoport (IBM)
---
 arch/x86/Kconfig         |  1 +
 arch/x86/kernel/ftrace.c | 10 --
 2 files changed, 1 insertion(+), 10 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 3f5ba72c9480..cd8addb96a0b 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -34,6 +34,7 @@ config X86_64
 	select SWIOTLB
 	select ARCH_HAS_ELFCORE_COMPAT
 	select ZONE_DMA32
+	select EXECMEM if DYNAMIC_FTRACE

 config FORCE_DYNAMIC_FTRACE
 	def_bool y
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index c8ddb7abda7c..8da0e66ca22d 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -261,8 +261,6 @@ void arch_ftrace_update_code(int command)
 /* Currently only x86_64 supports dynamic trampolines */
 #ifdef CONFIG_X86_64

-#ifdef CONFIG_MODULES
-/* Module allocation simplifies allocating memory for code */
 static inline void *alloc_tramp(unsigned long size)
 {
 	return execmem_alloc(EXECMEM_FTRACE, size);
@@ -271,14 +269,6 @@ static inline void tramp_free(void *tramp)
 {
 	execmem_free(tramp);
 }
-#else
-/* Trampolines can only be created if modules are supported */
-static inline void *alloc_tramp(unsigned long size)
-{
-	return NULL;
-}
-static inline void tramp_free(void *tramp) { }
-#endif

 /* Defined as markers to the end of the ftrace default trampolines */
 extern void ftrace_regs_caller_end(void);
-- 
2.43.0
[PATCH v5 11/15] arch: make execmem setup available regardless of CONFIG_MODULES
From: "Mike Rapoport (IBM)" execmem does not depend on modules, on the contrary modules use execmem. To make execmem available when CONFIG_MODULES=n, for instance for kprobes, split execmem_params initialization out from arch/*/kernel/module.c and compile it when CONFIG_EXECMEM=y Signed-off-by: Mike Rapoport (IBM) --- arch/arm/kernel/module.c | 43 -- arch/arm/mm/init.c | 45 +++ arch/arm64/kernel/module.c | 140 - arch/arm64/mm/init.c | 140 + arch/loongarch/kernel/module.c | 19 - arch/loongarch/mm/init.c | 21 + arch/mips/kernel/module.c | 22 -- arch/mips/mm/init.c| 23 ++ arch/nios2/kernel/module.c | 20 - arch/nios2/mm/init.c | 21 + arch/parisc/kernel/module.c| 20 - arch/parisc/mm/init.c | 23 +- arch/powerpc/kernel/module.c | 63 --- arch/powerpc/mm/mem.c | 64 +++ arch/riscv/kernel/module.c | 44 --- arch/riscv/mm/init.c | 45 +++ arch/s390/kernel/module.c | 27 --- arch/s390/mm/init.c| 30 +++ arch/sparc/kernel/module.c | 19 - arch/sparc/mm/Makefile | 2 + arch/sparc/mm/execmem.c| 21 + arch/x86/kernel/module.c | 27 --- arch/x86/mm/init.c | 29 +++ 23 files changed, 463 insertions(+), 445 deletions(-) create mode 100644 arch/sparc/mm/execmem.c diff --git a/arch/arm/kernel/module.c b/arch/arm/kernel/module.c index a98fdf6ff26c..677f218f7e84 100644 --- a/arch/arm/kernel/module.c +++ b/arch/arm/kernel/module.c @@ -12,57 +12,14 @@ #include #include #include -#include #include #include -#include -#include #include #include #include #include -#ifdef CONFIG_XIP_KERNEL -/* - * The XIP kernel text is mapped in the module area for modules and - * some other stuff to work without any indirect relocations. - * MODULES_VADDR is redefined here and not in asm/memory.h to avoid - * recompiling the whole kernel when CONFIG_XIP_KERNEL is turned on/off. 
- */ -#undef MODULES_VADDR -#define MODULES_VADDR (((unsigned long)_exiprom + ~PMD_MASK) & PMD_MASK) -#endif - -#ifdef CONFIG_MMU -static struct execmem_info execmem_info __ro_after_init; - -struct execmem_info __init *execmem_arch_setup(void) -{ - unsigned long fallback_start = 0, fallback_end = 0; - - if (IS_ENABLED(CONFIG_ARM_MODULE_PLTS)) { - fallback_start = VMALLOC_START; - fallback_end = VMALLOC_END; - } - - execmem_info = (struct execmem_info){ - .ranges = { - [EXECMEM_DEFAULT] = { - .start = MODULES_VADDR, - .end= MODULES_END, - .pgprot = PAGE_KERNEL_EXEC, - .alignment = 1, - .fallback_start = fallback_start, - .fallback_end = fallback_end, - }, - }, - }; - - return _info; -} -#endif - bool module_init_section(const char *name) { return strstarts(name, ".init") || diff --git a/arch/arm/mm/init.c b/arch/arm/mm/init.c index e8c6f4be0ce1..5345d218899a 100644 --- a/arch/arm/mm/init.c +++ b/arch/arm/mm/init.c @@ -22,6 +22,7 @@ #include #include #include +#include #include #include @@ -486,3 +487,47 @@ void free_initrd_mem(unsigned long start, unsigned long end) free_reserved_area((void *)start, (void *)end, -1, "initrd"); } #endif + +#ifdef CONFIG_EXECMEM + +#ifdef CONFIG_XIP_KERNEL +/* + * The XIP kernel text is mapped in the module area for modules and + * some other stuff to work without any indirect relocations. + * MODULES_VADDR is redefined here and not in asm/memory.h to avoid + * recompiling the whole kernel when CONFIG_XIP_KERNEL is turned on/off. 
+ */ +#undef MODULES_VADDR +#define MODULES_VADDR (((unsigned long)_exiprom + ~PMD_MASK) & PMD_MASK) +#endif + +#ifdef CONFIG_MMU +static struct execmem_info execmem_info __ro_after_init; + +struct execmem_info __init *execmem_arch_setup(void) +{ + unsigned long fallback_start = 0, fallback_end = 0; + + if (IS_ENABLED(CONFIG_ARM_MODULE_PLTS)) { + fallback_start = VMALLOC_START; + fallback_end = VMALLOC_END; + } + + execmem_info = (struct execmem_info){ + .ranges = { + [EXECMEM_DEFAULT] = { + .start = MODULES_VADDR, + .end= MODULES_END, + .pgprot = PAGE_KERNEL_EXEC, + .alignment = 1, + .fallback_start = fallback_start, + .fallback_end = fallback_end, + }, + }, +
[PATCH v5 10/15] powerpc: extend execmem_params for kprobes allocations
From: "Mike Rapoport (IBM)"

powerpc overrides kprobes::alloc_insn_page() to remove writable
permissions when STRICT_MODULE_RWX is on.

Add definition of EXECMEM_KPROBES to execmem_params to allow using the
generic kprobes::alloc_insn_page() with the desired permissions.

As powerpc uses breakpoint instructions to inject kprobes, it does not
need to constrain kprobe allocations to the modules area and can use the
entire vmalloc address space.

Signed-off-by: Mike Rapoport (IBM)
---
 arch/powerpc/kernel/kprobes.c | 20
 arch/powerpc/kernel/module.c  |  7 +++
 2 files changed, 7 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
index 9fcd01bb2ce6..14c5ddec3056 100644
--- a/arch/powerpc/kernel/kprobes.c
+++ b/arch/powerpc/kernel/kprobes.c
@@ -126,26 +126,6 @@ kprobe_opcode_t *arch_adjust_kprobe_addr(unsigned long addr, unsigned long offse
 	return (kprobe_opcode_t *)(addr + offset);
 }

-void *alloc_insn_page(void)
-{
-	void *page;
-
-	page = execmem_alloc(EXECMEM_KPROBES, PAGE_SIZE);
-	if (!page)
-		return NULL;
-
-	if (strict_module_rwx_enabled()) {
-		int err = set_memory_rox((unsigned long)page, 1);
-
-		if (err)
-			goto error;
-	}
-	return page;
-error:
-	execmem_free(page);
-	return NULL;
-}
-
 int arch_prepare_kprobe(struct kprobe *p)
 {
 	int ret = 0;
diff --git a/arch/powerpc/kernel/module.c b/arch/powerpc/kernel/module.c
index ac80559015a3..2a23cf7e141b 100644
--- a/arch/powerpc/kernel/module.c
+++ b/arch/powerpc/kernel/module.c
@@ -94,6 +94,7 @@ static struct execmem_info execmem_info __ro_after_init;

 struct execmem_info __init *execmem_arch_setup(void)
 {
+	pgprot_t kprobes_prot = strict_module_rwx_enabled() ? PAGE_KERNEL_ROX : PAGE_KERNEL_EXEC;
 	pgprot_t prot = strict_module_rwx_enabled() ? PAGE_KERNEL : PAGE_KERNEL_EXEC;
 	unsigned long fallback_start = 0, fallback_end = 0;
 	unsigned long start, end;
@@ -132,6 +133,12 @@ struct execmem_info __init *execmem_arch_setup(void)
 			.fallback_start = fallback_start,
 			.fallback_end	= fallback_end,
 		},
+		[EXECMEM_KPROBES] = {
+			.start	= VMALLOC_START,
+			.end	= VMALLOC_END,
+			.pgprot	= kprobes_prot,
+			.alignment = 1,
+		},
 		[EXECMEM_MODULE_DATA] = {
 			.start	= VMALLOC_START,
 			.end	= VMALLOC_END,
-- 
2.43.0
[PATCH v5 09/15] riscv: extend execmem_params for generated code allocations
From: "Mike Rapoport (IBM)" The memory allocations for kprobes and BPF on RISC-V are not placed in the modules area and these custom allocations are implemented with overrides of alloc_insn_page() and bpf_jit_alloc_exec(). Slightly reorder execmem_params initialization to support both 32 and 64 bit variants, define EXECMEM_KPROBES and EXECMEM_BPF ranges in riscv::execmem_params and drop overrides of alloc_insn_page() and bpf_jit_alloc_exec(). Signed-off-by: Mike Rapoport (IBM) Reviewed-by: Alexandre Ghiti --- arch/riscv/kernel/module.c | 28 +--- arch/riscv/kernel/probes/kprobes.c | 10 -- arch/riscv/net/bpf_jit_core.c | 13 - 3 files changed, 25 insertions(+), 26 deletions(-) diff --git a/arch/riscv/kernel/module.c b/arch/riscv/kernel/module.c index 182904127ba0..2ecbacbc9993 100644 --- a/arch/riscv/kernel/module.c +++ b/arch/riscv/kernel/module.c @@ -906,19 +906,41 @@ int apply_relocate_add(Elf_Shdr *sechdrs, const char *strtab, return 0; } -#if defined(CONFIG_MMU) && defined(CONFIG_64BIT) +#ifdef CONFIG_MMU static struct execmem_info execmem_info __ro_after_init; struct execmem_info __init *execmem_arch_setup(void) { + unsigned long start, end; + + if (IS_ENABLED(CONFIG_64BIT)) { + start = MODULES_VADDR; + end = MODULES_END; + } else { + start = VMALLOC_START; + end = VMALLOC_END; + } + execmem_info = (struct execmem_info){ .ranges = { [EXECMEM_DEFAULT] = { - .start = MODULES_VADDR, - .end= MODULES_END, + .start = start, + .end= end, .pgprot = PAGE_KERNEL, .alignment = 1, }, + [EXECMEM_KPROBES] = { + .start = VMALLOC_START, + .end= VMALLOC_END, + .pgprot = PAGE_KERNEL_READ_EXEC, + .alignment = 1, + }, + [EXECMEM_BPF] = { + .start = BPF_JIT_REGION_START, + .end= BPF_JIT_REGION_END, + .pgprot = PAGE_KERNEL, + .alignment = PAGE_SIZE, + }, }, }; diff --git a/arch/riscv/kernel/probes/kprobes.c b/arch/riscv/kernel/probes/kprobes.c index 2f08c14a933d..e64f2f3064eb 100644 --- a/arch/riscv/kernel/probes/kprobes.c +++ b/arch/riscv/kernel/probes/kprobes.c @@ -104,16 +104,6 @@ 
int __kprobes arch_prepare_kprobe(struct kprobe *p) return 0; } -#ifdef CONFIG_MMU -void *alloc_insn_page(void) -{ - return __vmalloc_node_range(PAGE_SIZE, 1, VMALLOC_START, VMALLOC_END, -GFP_KERNEL, PAGE_KERNEL_READ_EXEC, -VM_FLUSH_RESET_PERMS, NUMA_NO_NODE, -__builtin_return_address(0)); -} -#endif - /* install breakpoint in text */ void __kprobes arch_arm_kprobe(struct kprobe *p) { diff --git a/arch/riscv/net/bpf_jit_core.c b/arch/riscv/net/bpf_jit_core.c index 6b3acac30c06..e238fdbd5dbc 100644 --- a/arch/riscv/net/bpf_jit_core.c +++ b/arch/riscv/net/bpf_jit_core.c @@ -219,19 +219,6 @@ u64 bpf_jit_alloc_exec_limit(void) return BPF_JIT_REGION_SIZE; } -void *bpf_jit_alloc_exec(unsigned long size) -{ - return __vmalloc_node_range(size, PAGE_SIZE, BPF_JIT_REGION_START, - BPF_JIT_REGION_END, GFP_KERNEL, - PAGE_KERNEL, 0, NUMA_NO_NODE, - __builtin_return_address(0)); -} - -void bpf_jit_free_exec(void *addr) -{ - return vfree(addr); -} - void *bpf_arch_text_copy(void *dst, void *src, size_t len) { int ret; -- 2.43.0
[PATCH v5 08/15] mm/execmem, arch: convert remaining overrides of module_alloc to execmem
From: "Mike Rapoport (IBM)" Extend execmem parameters to accommodate more complex overrides of module_alloc() by architectures. This includes specification of a fallback range required by arm, arm64 and powerpc, EXECMEM_MODULE_DATA type required by powerpc, support for allocation of KASAN shadow required by s390 and x86 and support for early initialization of execmem required by x86. The core implementation of execmem_alloc() takes care of suppressing warnings when the initial allocation fails but there is a fallback range defined. Signed-off-by: Mike Rapoport (IBM) Acked-by: Will Deacon --- arch/Kconfig | 6 +++ arch/arm/kernel/module.c | 41 ++--- arch/arm64/kernel/module.c | 67 ++-- arch/arm64/kernel/probes/kprobes.c | 7 --- arch/arm64/net/bpf_jit_comp.c | 11 - arch/powerpc/kernel/module.c | 60 - arch/s390/kernel/module.c | 54 ++- arch/x86/Kconfig | 1 + arch/x86/kernel/module.c | 70 ++ include/linux/execmem.h| 34 +++ include/linux/moduleloader.h | 12 - kernel/module/main.c | 26 +++ mm/execmem.c | 70 +- mm/mm_init.c | 2 + 14 files changed, 259 insertions(+), 202 deletions(-) diff --git a/arch/Kconfig b/arch/Kconfig index 65afb1de48b3..7006f71f0110 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -960,6 +960,12 @@ config ARCH_WANTS_MODULES_DATA_IN_VMALLOC For architectures like powerpc/32 which have constraints on module allocation and need to allocate module data outside of module area. +config ARCH_WANTS_EXECMEM_EARLY + bool + help + For architectures that might allocate executable memory early on + boot, for instance ftrace on x86. 
+ config HAVE_IRQ_EXIT_ON_IRQ_STACK bool help diff --git a/arch/arm/kernel/module.c b/arch/arm/kernel/module.c index e74d84f58b77..a98fdf6ff26c 100644 --- a/arch/arm/kernel/module.c +++ b/arch/arm/kernel/module.c @@ -16,6 +16,7 @@ #include #include #include +#include #include #include @@ -34,23 +35,31 @@ #endif #ifdef CONFIG_MMU -void *module_alloc(unsigned long size) +static struct execmem_info execmem_info __ro_after_init; + +struct execmem_info __init *execmem_arch_setup(void) { - gfp_t gfp_mask = GFP_KERNEL; - void *p; - - /* Silence the initial allocation */ - if (IS_ENABLED(CONFIG_ARM_MODULE_PLTS)) - gfp_mask |= __GFP_NOWARN; - - p = __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END, - gfp_mask, PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE, - __builtin_return_address(0)); - if (!IS_ENABLED(CONFIG_ARM_MODULE_PLTS) || p) - return p; - return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END, - GFP_KERNEL, PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE, - __builtin_return_address(0)); + unsigned long fallback_start = 0, fallback_end = 0; + + if (IS_ENABLED(CONFIG_ARM_MODULE_PLTS)) { + fallback_start = VMALLOC_START; + fallback_end = VMALLOC_END; + } + + execmem_info = (struct execmem_info){ + .ranges = { + [EXECMEM_DEFAULT] = { + .start = MODULES_VADDR, + .end= MODULES_END, + .pgprot = PAGE_KERNEL_EXEC, + .alignment = 1, + .fallback_start = fallback_start, + .fallback_end = fallback_end, + }, + }, + }; + + return _info; } #endif diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c index e92da4da1b2a..a52240ea084b 100644 --- a/arch/arm64/kernel/module.c +++ b/arch/arm64/kernel/module.c @@ -20,6 +20,7 @@ #include #include #include +#include #include #include @@ -108,41 +109,59 @@ static int __init module_init_limits(void) return 0; } -subsys_initcall(module_init_limits); -void *module_alloc(unsigned long size) +static struct execmem_info execmem_info __ro_after_init; + +struct execmem_info __init *execmem_arch_setup(void) { - void *p = NULL; + unsigned long 
fallback_start = 0, fallback_end = 0; + unsigned long start = 0, end = 0; + + module_init_limits(); /* * Where possible, prefer to allocate within direct branch range of the * kernel such that no PLTs are necessary. */ if (module_direct_base) { - p = __vmalloc_node_range(size, MODULE_ALIGN, -module_direct_base, -
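The fallback behaviour described in the commit message above (try the preferred range first, stay quiet if that fails and a fallback range exists, then retry in the fallback range) can be sketched in plain userspace C. This is only a model under stated assumptions: `model_range`, `model_try_window()`, `model_alloc()` and the `model_warnings` counter are hypothetical names, the "allocator" is a simple bump counter, and `end` is treated as exclusive. None of it is the kernel's mm/execmem.c code.

```c
#include <stdbool.h>

/* Hypothetical model of one range with an optional fallback window. */
struct model_range {
	unsigned long start, end;			/* preferred window, end exclusive here */
	unsigned long fallback_start, fallback_end;	/* both 0: no fallback defined */
	unsigned long used, fallback_used;		/* model-only bookkeeping */
};

/* Counts how many failed attempts would have produced a warning. */
unsigned long model_warnings;

/* Bump-allocate size bytes from [start, end); returns 0 on failure. */
unsigned long model_try_window(unsigned long start, unsigned long end,
			       unsigned long *used, unsigned long size,
			       bool warn)
{
	unsigned long room = end - start - *used;
	unsigned long addr;

	if (size > room) {
		if (warn)
			model_warnings++;
		return 0;
	}
	addr = start + *used;
	*used += size;
	return addr;
}

unsigned long model_alloc(struct model_range *r, unsigned long size)
{
	bool has_fallback = r->fallback_start != 0;
	unsigned long addr;

	/* Stay quiet on the first failure only if a second attempt follows. */
	addr = model_try_window(r->start, r->end, &r->used, size, !has_fallback);
	if (!addr && has_fallback)
		addr = model_try_window(r->fallback_start, r->fallback_end,
					&r->fallback_used, size, true);
	return addr;
}
```

With a 0x100-byte preferred window and a fallback window defined, the second 0x100-byte request in this model lands in the fallback window without incrementing `model_warnings`; without a fallback, an oversized request fails and is counted.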
[PATCH v5 07/15] mm/execmem, arch: convert simple overrides of module_alloc to execmem
From: "Mike Rapoport (IBM)"

Several architectures override module_alloc() only to define address
range for code allocations different than VMALLOC address space.

Provide a generic implementation in execmem that uses the parameters for
address space ranges, required alignment and page protections provided
by architectures.

The architectures must fill execmem_info structure and implement
execmem_arch_setup() that returns a pointer to that structure. This way
the execmem initialization won't be called from every architecture, but
rather from a central place, namely a core_initcall() in execmem.

The execmem provides execmem_alloc() API that wraps
__vmalloc_node_range() with the parameters defined by the architectures.
If an architecture does not implement execmem_arch_setup(),
execmem_alloc() will fall back to module_alloc().

Signed-off-by: Mike Rapoport (IBM)
---
 arch/loongarch/kernel/module.c | 19 +++--
 arch/mips/kernel/module.c      | 20 --
 arch/nios2/kernel/module.c     | 21 +++---
 arch/parisc/kernel/module.c    | 24 +++
 arch/riscv/kernel/module.c     | 24 +++
 arch/sparc/kernel/module.c     | 20 --
 include/linux/execmem.h        | 41 +++
 mm/execmem.c                   | 73 --
 8 files changed, 208 insertions(+), 34 deletions(-)

diff --git a/arch/loongarch/kernel/module.c b/arch/loongarch/kernel/module.c
index c7d0338d12c1..ca6dd7ea1610 100644
--- a/arch/loongarch/kernel/module.c
+++ b/arch/loongarch/kernel/module.c
@@ -18,6 +18,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -490,10 +491,22 @@ int apply_relocate_add(Elf_Shdr *sechdrs, const char *strtab,
 	return 0;
 }

-void *module_alloc(unsigned long size)
+static struct execmem_info execmem_info __ro_after_init;
+
+struct execmem_info __init *execmem_arch_setup(void)
 {
-	return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
-			GFP_KERNEL, PAGE_KERNEL, 0, NUMA_NO_NODE, __builtin_return_address(0));
+	execmem_info = (struct execmem_info){
+		.ranges = {
+			[EXECMEM_DEFAULT] = {
+				.start	= MODULES_VADDR,
+				.end	= MODULES_END,
+				.pgprot	= PAGE_KERNEL,
+				.alignment = 1,
+			},
+		},
+	};
+
+	return &execmem_info;
 }

 static void module_init_ftrace_plt(const Elf_Ehdr *hdr,
diff --git a/arch/mips/kernel/module.c b/arch/mips/kernel/module.c
index 9a6c96014904..59225a3cf918 100644
--- a/arch/mips/kernel/module.c
+++ b/arch/mips/kernel/module.c
@@ -20,6 +20,7 @@
 #include
 #include
 #include
+#include
 #include

 struct mips_hi16 {
@@ -32,11 +33,22 @@ static LIST_HEAD(dbe_list);
 static DEFINE_SPINLOCK(dbe_lock);

 #ifdef MODULES_VADDR
-void *module_alloc(unsigned long size)
+static struct execmem_info execmem_info __ro_after_init;
+
+struct execmem_info __init *execmem_arch_setup(void)
 {
-	return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
-				GFP_KERNEL, PAGE_KERNEL, 0, NUMA_NO_NODE,
-				__builtin_return_address(0));
+	execmem_info = (struct execmem_info){
+		.ranges = {
+			[EXECMEM_DEFAULT] = {
+				.start	= MODULES_VADDR,
+				.end	= MODULES_END,
+				.pgprot	= PAGE_KERNEL,
+				.alignment = 1,
+			},
+		},
+	};
+
+	return &execmem_info;
 }
 #endif

diff --git a/arch/nios2/kernel/module.c b/arch/nios2/kernel/module.c
index 9c97b7513853..0d1ee86631fc 100644
--- a/arch/nios2/kernel/module.c
+++ b/arch/nios2/kernel/module.c
@@ -18,15 +18,26 @@
 #include
 #include
 #include
+#include

 #include

-void *module_alloc(unsigned long size)
+static struct execmem_info execmem_info __ro_after_init;
+
+struct execmem_info __init *execmem_arch_setup(void)
 {
-	return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
-				    GFP_KERNEL, PAGE_KERNEL_EXEC,
-				    VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
-				    __builtin_return_address(0));
+	execmem_info = (struct execmem_info){
+		.ranges = {
+			[EXECMEM_DEFAULT] = {
+				.start	= MODULES_VADDR,
+				.end	= MODULES_END,
+				.pgprot	= PAGE_KERNEL_EXEC,
+				.alignment = 1,
+			},
+		},
+	};
+
+	return &execmem_info;
 }

 int apply_relocate_add(Elf32_Shdr *sechdrs, const char *strtab,
diff --git a/arch/parisc/kernel/module.c b/arch/parisc/kernel/module.c
index
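The division of labour this patch describes (the architecture fills a table and hands back a pointer; a central init step fetches it once; allocations are served from the per-type range with its alignment) can be modelled in userspace roughly as follows. Everything here is an assumption made for illustration: the `model_*` names, the numeric addresses, the exclusive `end`, and the bump allocator standing in for `__vmalloc_node_range()`. How the real code treats types the architecture leaves unset is not shown in this excerpt, so the sketch simply fails such requests.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical allocation types; only two are modelled. */
enum model_type { MODEL_DEFAULT, MODEL_KPROBES, MODEL_TYPE_MAX };

struct model_range {
	uintptr_t start;	/* first usable address */
	uintptr_t end;		/* one past the last usable address (model choice) */
	uintptr_t alignment;	/* must be a non-zero power of two */
	uintptr_t next;		/* bump pointer, model-only */
};

struct model_info {
	struct model_range ranges[MODEL_TYPE_MAX];
};

/* "Architecture" side: describe the ranges and return the table. */
struct model_info model_arch_info;

struct model_info *model_arch_setup(void)
{
	model_arch_info = (struct model_info){
		.ranges = {
			[MODEL_DEFAULT] = {
				.start = 0x1000,	/* made-up window */
				.end = 0x3000,
				.alignment = 0x100,
			},
		},
	};
	return &model_arch_info;
}

/* "Core" side: fetch the table once, from a single central place. */
struct model_info *model_core_info;

void model_execmem_init(void)
{
	model_core_info = model_arch_setup();
	for (int i = 0; i < MODEL_TYPE_MAX; i++)
		model_core_info->ranges[i].next = model_core_info->ranges[i].start;
}

/* Serve a request from the range registered for the given type. */
uintptr_t model_execmem_alloc(enum model_type type, size_t size)
{
	struct model_range *r;
	uintptr_t addr;

	if (!model_core_info)
		return 0;
	r = &model_core_info->ranges[type];
	if (r->start == r->end || !r->alignment)
		return 0;	/* range not described by the architecture */

	/* Round the bump pointer up to the requested alignment. */
	addr = (r->next + r->alignment - 1) & ~(r->alignment - 1);
	if (addr > r->end || size > r->end - addr)
		return 0;
	r->next = addr + size;
	return addr;
}
```

In this model two 0x80-byte requests come back 0x100 apart because of the alignment, which is the kind of per-architecture parameter the table is meant to carry.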
[PATCH v5 06/15] mm: introduce execmem_alloc() and execmem_free()
From: "Mike Rapoport (IBM)"

module_alloc() is used everywhere as a means to allocate memory for
code.

Besides being semantically wrong, this unnecessarily ties all subsystems
that need to allocate code, such as ftrace, kprobes and BPF, to modules
and puts the burden of code allocation on the modules code.

Several architectures override module_alloc() because of various
constraints on where the executable memory can be located, and this
causes additional obstacles for improvements of code allocation.

Start splitting code allocation from modules by introducing
execmem_alloc() and execmem_free() APIs.

Initially, execmem_alloc() is a wrapper for module_alloc() and
execmem_free() is a replacement of module_memfree() to allow updating
all call sites to use the new APIs.

Since architectures define different restrictions on placement,
permissions, alignment and other parameters for memory that can be used
by different subsystems that allocate executable memory, execmem_alloc()
takes a type argument that will be used to identify the calling
subsystem and to allow architectures to define parameters for ranges
suitable for that subsystem.

No functional changes.
Signed-off-by: Mike Rapoport (IBM) Acked-by: Masami Hiramatsu (Google) --- arch/powerpc/kernel/kprobes.c| 6 ++-- arch/s390/kernel/ftrace.c| 4 +-- arch/s390/kernel/kprobes.c | 4 +-- arch/s390/kernel/module.c| 5 +-- arch/sparc/net/bpf_jit_comp_32.c | 8 ++--- arch/x86/kernel/ftrace.c | 6 ++-- arch/x86/kernel/kprobes/core.c | 4 +-- include/linux/execmem.h | 57 include/linux/moduleloader.h | 3 -- kernel/bpf/core.c| 6 ++-- kernel/kprobes.c | 8 ++--- kernel/module/Kconfig| 1 + kernel/module/main.c | 25 +- mm/Kconfig | 3 ++ mm/Makefile | 1 + mm/execmem.c | 32 ++ 16 files changed, 128 insertions(+), 45 deletions(-) create mode 100644 include/linux/execmem.h create mode 100644 mm/execmem.c diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c index bbca90a5e2ec..9fcd01bb2ce6 100644 --- a/arch/powerpc/kernel/kprobes.c +++ b/arch/powerpc/kernel/kprobes.c @@ -19,8 +19,8 @@ #include #include #include -#include #include +#include #include #include #include @@ -130,7 +130,7 @@ void *alloc_insn_page(void) { void *page; - page = module_alloc(PAGE_SIZE); + page = execmem_alloc(EXECMEM_KPROBES, PAGE_SIZE); if (!page) return NULL; @@ -142,7 +142,7 @@ void *alloc_insn_page(void) } return page; error: - module_memfree(page); + execmem_free(page); return NULL; } diff --git a/arch/s390/kernel/ftrace.c b/arch/s390/kernel/ftrace.c index c46381ea04ec..798249ef5646 100644 --- a/arch/s390/kernel/ftrace.c +++ b/arch/s390/kernel/ftrace.c @@ -7,13 +7,13 @@ * Author(s): Martin Schwidefsky */ -#include #include #include #include #include #include #include +#include #include #include #include @@ -220,7 +220,7 @@ static int __init ftrace_plt_init(void) { const char *start, *end; - ftrace_plt = module_alloc(PAGE_SIZE); + ftrace_plt = execmem_alloc(EXECMEM_FTRACE, PAGE_SIZE); if (!ftrace_plt) panic("cannot allocate ftrace plt\n"); diff --git a/arch/s390/kernel/kprobes.c b/arch/s390/kernel/kprobes.c index f0cf20d4b3c5..3c1b1be744de 100644 --- a/arch/s390/kernel/kprobes.c +++ 
b/arch/s390/kernel/kprobes.c @@ -9,7 +9,6 @@ #define pr_fmt(fmt) "kprobes: " fmt -#include #include #include #include @@ -21,6 +20,7 @@ #include #include #include +#include #include #include #include @@ -38,7 +38,7 @@ void *alloc_insn_page(void) { void *page; - page = module_alloc(PAGE_SIZE); + page = execmem_alloc(EXECMEM_KPROBES, PAGE_SIZE); if (!page) return NULL; set_memory_rox((unsigned long)page, 1); diff --git a/arch/s390/kernel/module.c b/arch/s390/kernel/module.c index 42215f9404af..ac97a905e8cd 100644 --- a/arch/s390/kernel/module.c +++ b/arch/s390/kernel/module.c @@ -21,6 +21,7 @@ #include #include #include +#include #include #include #include @@ -76,7 +77,7 @@ void *module_alloc(unsigned long size) #ifdef CONFIG_FUNCTION_TRACER void module_arch_cleanup(struct module *mod) { - module_memfree(mod->arch.trampolines_start); + execmem_free(mod->arch.trampolines_start); } #endif @@ -510,7 +511,7 @@ static int module_alloc_ftrace_hotpatch_trampolines(struct module *me, size = FTRACE_HOTPATCH_TRAMPOLINES_SIZE(s->sh_size); numpages = DIV_ROUND_UP(size, PAGE_SIZE); - start = module_alloc(numpages * PAGE_SIZE); + start = execmem_alloc(EXECMEM_FTRACE, numpages * PAGE_SIZE); if (!start) return -ENOMEM;
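The "no functional changes" step in this patch amounts to a thin wrapper: callers start passing a type, but the allocation is still delegated to the old allocator. A minimal userspace sketch of that shape follows, assuming `malloc()`/`free()` as stand-ins for `module_alloc()`/`module_memfree()`. The `*_model` names, the `MODEL_TYPE_MAX` sentinel and the call counter are hypothetical and exist only to make the delegation observable.

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Type tags named after the ones used at the converted call sites;
 * MODEL_TYPE_MAX is a made-up sentinel for this sketch.
 */
enum execmem_type {
	EXECMEM_DEFAULT,
	EXECMEM_KPROBES,
	EXECMEM_FTRACE,
	EXECMEM_BPF,
	MODEL_TYPE_MAX,
};

/* Counts calls into the stand-in for module_alloc(). */
unsigned long module_alloc_calls;

void *module_alloc_model(size_t size)
{
	module_alloc_calls++;
	return malloc(size);
}

/*
 * At this stage the type only documents who is asking; it is accepted
 * but not consulted, so behaviour matches the old allocator exactly.
 */
void *execmem_alloc_model(enum execmem_type type, size_t size)
{
	(void)type;
	return module_alloc_model(size);
}

void execmem_free_model(void *ptr)
{
	free(ptr);
}
```

A caller such as the kprobes page allocator would then read `execmem_alloc_model(EXECMEM_KPROBES, size)`, which is the call-site change the diffs above make throughout.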
[PATCH v5 05/15] module: make module_memory_{alloc,free} more self-contained
From: "Mike Rapoport (IBM)" Move the logic related to the memory allocation and freeing into module_memory_alloc() and module_memory_free(). Signed-off-by: Mike Rapoport (IBM) --- kernel/module/main.c | 64 +++- 1 file changed, 39 insertions(+), 25 deletions(-) diff --git a/kernel/module/main.c b/kernel/module/main.c index e1e8a7a9d6c1..5b82b069e0d3 100644 --- a/kernel/module/main.c +++ b/kernel/module/main.c @@ -1203,15 +1203,44 @@ static bool mod_mem_use_vmalloc(enum mod_mem_type type) mod_mem_type_is_core_data(type); } -static void *module_memory_alloc(unsigned int size, enum mod_mem_type type) +static int module_memory_alloc(struct module *mod, enum mod_mem_type type) { + unsigned int size = PAGE_ALIGN(mod->mem[type].size); + void *ptr; + + mod->mem[type].size = size; + if (mod_mem_use_vmalloc(type)) - return vzalloc(size); - return module_alloc(size); + ptr = vmalloc(size); + else + ptr = module_alloc(size); + + if (!ptr) + return -ENOMEM; + + /* +* The pointer to these blocks of memory are stored on the module +* structure and we keep that around so long as the module is +* around. We only free that memory when we unload the module. +* Just mark them as not being a leak then. The .init* ELF +* sections *do* get freed after boot so we *could* treat them +* slightly differently with kmemleak_ignore() and only grey +* them out as they work as typical memory allocations which +* *do* eventually get freed, but let's just keep things simple +* and avoid *any* false positives. +*/ + kmemleak_not_leak(ptr); + + memset(ptr, 0, size); + mod->mem[type].base = ptr; + + return 0; } -static void module_memory_free(void *ptr, enum mod_mem_type type) +static void module_memory_free(struct module *mod, enum mod_mem_type type) { + void *ptr = mod->mem[type].base; + if (mod_mem_use_vmalloc(type)) vfree(ptr); else @@ -1229,12 +1258,12 @@ static void free_mod_mem(struct module *mod) /* Free lock-classes; relies on the preceding sync_rcu(). 
*/ lockdep_free_key_range(mod_mem->base, mod_mem->size); if (mod_mem->size) - module_memory_free(mod_mem->base, type); + module_memory_free(mod, type); } /* MOD_DATA hosts mod, so free it at last */ lockdep_free_key_range(mod->mem[MOD_DATA].base, mod->mem[MOD_DATA].size); - module_memory_free(mod->mem[MOD_DATA].base, MOD_DATA); + module_memory_free(mod, MOD_DATA); } /* Free a module, remove from lists, etc. */ @@ -2225,7 +2254,6 @@ static int find_module_sections(struct module *mod, struct load_info *info) static int move_module(struct module *mod, struct load_info *info) { int i; - void *ptr; enum mod_mem_type t = 0; int ret = -ENOMEM; @@ -2234,26 +2262,12 @@ static int move_module(struct module *mod, struct load_info *info) mod->mem[type].base = NULL; continue; } - mod->mem[type].size = PAGE_ALIGN(mod->mem[type].size); - ptr = module_memory_alloc(mod->mem[type].size, type); - /* - * The pointer to these blocks of memory are stored on the module - * structure and we keep that around so long as the module is - * around. We only free that memory when we unload the module. - * Just mark them as not being a leak then. The .init* ELF - * sections *do* get freed after boot so we *could* treat them - * slightly differently with kmemleak_ignore() and only grey - * them out as they work as typical memory allocations which - * *do* eventually get freed, but let's just keep things simple - * and avoid *any* false positives. -*/ - kmemleak_not_leak(ptr); - if (!ptr) { + + ret = module_memory_alloc(mod, type); + if (ret) { t = type; goto out_enomem; } - memset(ptr, 0, mod->mem[type].size); - mod->mem[type].base = ptr; } /* Transfer each section which specifies SHF_ALLOC */ @@ -2296,7 +2310,7 @@ static int move_module(struct module *mod, struct load_info *info) return 0; out_enomem: for (t--; t >= 0; t--) - module_memory_free(mod->mem[t].base, t); + module_memory_free(mod, t); return ret; } -- 2.43.0
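The refactoring above moves size rounding, zeroing and bookkeeping into the allocation helper, so the caller only checks a return code and unwinds on failure. A userspace sketch of that control flow, under loud assumptions: `malloc()` replaces both `vmalloc()` and `module_alloc()`, the page size is fixed at 4096, and all `model_*`/`MODEL_*` names plus the `model_fail_type` test hook are hypothetical.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096UL
#define MODEL_PAGE_ALIGN(x) (((x) + MODEL_PAGE_SIZE - 1) & ~(MODEL_PAGE_SIZE - 1))

enum model_mem_type { MODEL_TEXT, MODEL_DATA, MODEL_RODATA, MODEL_MEM_NUM };

struct model_mem {
	void *base;
	size_t size;
};

struct model_module {
	struct model_mem mem[MODEL_MEM_NUM];
};

/* Test hook: allocation for this type fails; MODEL_MEM_NUM disables it. */
int model_fail_type = MODEL_MEM_NUM;

/* Round the size up, allocate, zero, and record the result in the module. */
int model_memory_alloc(struct model_module *mod, int type)
{
	size_t size = MODEL_PAGE_ALIGN(mod->mem[type].size);
	void *ptr;

	mod->mem[type].size = size;

	ptr = (type == model_fail_type) ? NULL : malloc(size);
	if (!ptr)
		return -ENOMEM;

	memset(ptr, 0, size);
	mod->mem[type].base = ptr;
	return 0;
}

void model_memory_free(struct model_module *mod, int type)
{
	free(mod->mem[type].base);
	mod->mem[type].base = NULL;
}

/* Allocate every non-empty region; on failure free the earlier ones. */
int model_move_module(struct model_module *mod)
{
	int t, ret = 0;

	for (t = 0; t < MODEL_MEM_NUM; t++) {
		if (!mod->mem[t].size) {
			mod->mem[t].base = NULL;
			continue;
		}
		ret = model_memory_alloc(mod, t);
		if (ret)
			goto out_enomem;
	}
	return 0;

out_enomem:
	for (t--; t >= 0; t--)
		model_memory_free(mod, t);
	return ret;
}
```

The design point mirrored here is that the free helper takes the module and the type, so the unwind loop no longer needs to know where each base pointer is stored.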
[PATCH v5 04/15] sparc: simplify module_alloc()
From: "Mike Rapoport (IBM)"

Define MODULES_VADDR and MODULES_END as VMALLOC_START and VMALLOC_END
for 32-bit and reduce module_alloc() to

	__vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END, ...)

as with the new defines the allocations become identical for both 32
and 64 bits.

While at it, drop an unused include.

Suggested-by: Sam Ravnborg
Signed-off-by: Mike Rapoport (IBM)
---
 arch/sparc/include/asm/pgtable_32.h |  2 ++
 arch/sparc/kernel/module.c          | 25 +
 2 files changed, 3 insertions(+), 24 deletions(-)

diff --git a/arch/sparc/include/asm/pgtable_32.h b/arch/sparc/include/asm/pgtable_32.h
index 9e85d57ac3f2..62bcafe38b1f 100644
--- a/arch/sparc/include/asm/pgtable_32.h
+++ b/arch/sparc/include/asm/pgtable_32.h
@@ -432,6 +432,8 @@ static inline int io_remap_pfn_range(struct vm_area_struct *vma,

 #define VMALLOC_START	_AC(0xfe60,UL)
 #define VMALLOC_END	_AC(0xffc0,UL)
+#define MODULES_VADDR	VMALLOC_START
+#define MODULES_END	VMALLOC_END

 /* We provide our own get_unmapped_area to cope with VA holes for userland */
 #define HAVE_ARCH_UNMAPPED_AREA
diff --git a/arch/sparc/kernel/module.c b/arch/sparc/kernel/module.c
index 66c45a2764bc..d37adb2a0b54 100644
--- a/arch/sparc/kernel/module.c
+++ b/arch/sparc/kernel/module.c
@@ -21,35 +21,12 @@

 #include "entry.h"

-#ifdef CONFIG_SPARC64
-
-#include
-
-static void *module_map(unsigned long size)
+void *module_alloc(unsigned long size)
 {
-	if (PAGE_ALIGN(size) > MODULES_LEN)
-		return NULL;
 	return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
 				GFP_KERNEL, PAGE_KERNEL, 0, NUMA_NO_NODE,
 				__builtin_return_address(0));
 }
-#else
-static void *module_map(unsigned long size)
-{
-	return vmalloc(size);
-}
-#endif /* CONFIG_SPARC64 */
-
-void *module_alloc(unsigned long size)
-{
-	void *ret;
-
-	ret = module_map(size);
-	if (ret)
-		memset(ret, 0, size);
-
-	return ret;
-}

 /* Make generic code ignore STT_REGISTER dummy undefined symbols. */
 int module_frob_arch_sections(Elf_Ehdr *hdr,
-- 
2.43.0
[PATCH v5 03/15] nios2: define virtual address space for modules
From: "Mike Rapoport (IBM)"

nios2 uses kmalloc() to implement module_alloc() because CALL26/PCREL26
cannot reach all of vmalloc address space.

Define module space as 32MiB below the kernel base and switch nios2 to
use vmalloc for module allocations.

Suggested-by: Thomas Gleixner
Acked-by: Dinh Nguyen
Acked-by: Song Liu
Signed-off-by: Mike Rapoport (IBM)
---
 arch/nios2/include/asm/pgtable.h |  5 -
 arch/nios2/kernel/module.c       | 19 ---
 2 files changed, 8 insertions(+), 16 deletions(-)

diff --git a/arch/nios2/include/asm/pgtable.h b/arch/nios2/include/asm/pgtable.h
index d052dfcbe8d3..eab87c6beacb 100644
--- a/arch/nios2/include/asm/pgtable.h
+++ b/arch/nios2/include/asm/pgtable.h
@@ -25,7 +25,10 @@
 #include

 #define VMALLOC_START		CONFIG_NIOS2_KERNEL_MMU_REGION_BASE
-#define VMALLOC_END		(CONFIG_NIOS2_KERNEL_REGION_BASE - 1)
+#define VMALLOC_END		(CONFIG_NIOS2_KERNEL_REGION_BASE - SZ_32M - 1)
+
+#define MODULES_VADDR		(CONFIG_NIOS2_KERNEL_REGION_BASE - SZ_32M)
+#define MODULES_END		(CONFIG_NIOS2_KERNEL_REGION_BASE - 1)

 struct mm_struct;

diff --git a/arch/nios2/kernel/module.c b/arch/nios2/kernel/module.c
index 76e0a42d6e36..9c97b7513853 100644
--- a/arch/nios2/kernel/module.c
+++ b/arch/nios2/kernel/module.c
@@ -21,23 +21,12 @@

 #include

-/*
- * Modules should NOT be allocated with kmalloc for (obvious) reasons.
- * But we do it for now to avoid relocation issues. CALL26/PCREL26 cannot reach
- * from 0x8000 (vmalloc area) to 0xc (kernel) (kmalloc returns
- * addresses in 0xc000)
- */
 void *module_alloc(unsigned long size)
 {
-	if (size == 0)
-		return NULL;
-	return kmalloc(size, GFP_KERNEL);
-}
-
-/* Free memory returned from module_alloc */
-void module_memfree(void *module_region)
-{
-	kfree(module_region);
+	return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
+				    GFP_KERNEL, PAGE_KERNEL_EXEC,
+				    VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
+				    __builtin_return_address(0));
 }

 int apply_relocate_add(Elf32_Shdr *sechdrs, const char *strtab,
-- 
2.43.0
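The address arithmetic in the nios2 change can be checked with a few lines of userspace C: the top 32MiB below the kernel region base is carved out of the vmalloc area and becomes the module window, so the two ranges are adjacent and do not overlap. The two base addresses below are assumptions chosen for illustration (the real values come from the kernel configuration), and the `MODEL_*` macros and helper functions are hypothetical names. As in the patch, the `*_END` values are inclusive.

```c
#include <stdbool.h>

#define SZ_32M 0x02000000UL

/* Assumed values for illustration only; the real ones are Kconfig options. */
#define MODEL_KERNEL_MMU_REGION_BASE	0x80000000UL
#define MODEL_KERNEL_REGION_BASE	0xc0000000UL

/* Same arithmetic as the patched pgtable.h, applied to the assumed bases. */
#define MODEL_VMALLOC_START	MODEL_KERNEL_MMU_REGION_BASE
#define MODEL_VMALLOC_END	(MODEL_KERNEL_REGION_BASE - SZ_32M - 1)

#define MODEL_MODULES_VADDR	(MODEL_KERNEL_REGION_BASE - SZ_32M)
#define MODEL_MODULES_END	(MODEL_KERNEL_REGION_BASE - 1)

/* True if addr lies in the 32MiB module window (inclusive end). */
bool model_is_module_addr(unsigned long addr)
{
	return addr >= MODEL_MODULES_VADDR && addr <= MODEL_MODULES_END;
}

/* True if addr lies in the shrunken vmalloc area (inclusive end). */
bool model_is_vmalloc_addr(unsigned long addr)
{
	return addr >= MODEL_VMALLOC_START && addr <= MODEL_VMALLOC_END;
}
```

With these assumed bases the module window starts at 0xbe000000, one byte past the new end of the vmalloc area.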
[PATCH v5 02/15] mips: module: rename MODULE_START to MODULES_VADDR
From: "Mike Rapoport (IBM)" and MODULE_END to MODULES_END to match other architectures that define custom address space for modules. Signed-off-by: Mike Rapoport (IBM) --- arch/mips/include/asm/pgtable-64.h | 4 ++-- arch/mips/kernel/module.c | 4 ++-- arch/mips/mm/fault.c | 4 ++-- 3 files changed, 6 insertions(+), 6 deletions(-) diff --git a/arch/mips/include/asm/pgtable-64.h b/arch/mips/include/asm/pgtable-64.h index 20ca48c1b606..c0109aff223b 100644 --- a/arch/mips/include/asm/pgtable-64.h +++ b/arch/mips/include/asm/pgtable-64.h @@ -147,8 +147,8 @@ #if defined(CONFIG_MODULES) && defined(KBUILD_64BIT_SYM32) && \ VMALLOC_START != CKSSEG /* Load modules into 32bit-compatible segment. */ -#define MODULE_START CKSSEG -#define MODULE_END (FIXADDR_START-2*PAGE_SIZE) +#define MODULES_VADDR CKSSEG +#define MODULES_END(FIXADDR_START-2*PAGE_SIZE) #endif #define pte_ERROR(e) \ diff --git a/arch/mips/kernel/module.c b/arch/mips/kernel/module.c index 7b2fbaa9cac5..9a6c96014904 100644 --- a/arch/mips/kernel/module.c +++ b/arch/mips/kernel/module.c @@ -31,10 +31,10 @@ struct mips_hi16 { static LIST_HEAD(dbe_list); static DEFINE_SPINLOCK(dbe_lock); -#ifdef MODULE_START +#ifdef MODULES_VADDR void *module_alloc(unsigned long size) { - return __vmalloc_node_range(size, 1, MODULE_START, MODULE_END, + return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END, GFP_KERNEL, PAGE_KERNEL, 0, NUMA_NO_NODE, __builtin_return_address(0)); } diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c index aaa9a242ebba..37fedeaca2e9 100644 --- a/arch/mips/mm/fault.c +++ b/arch/mips/mm/fault.c @@ -83,8 +83,8 @@ static void __do_page_fault(struct pt_regs *regs, unsigned long write, if (unlikely(address >= VMALLOC_START && address <= VMALLOC_END)) goto VMALLOC_FAULT_TARGET; -#ifdef MODULE_START - if (unlikely(address >= MODULE_START && address < MODULE_END)) +#ifdef MODULES_VADDR + if (unlikely(address >= MODULES_VADDR && address < MODULES_END)) goto VMALLOC_FAULT_TARGET; #endif -- 2.43.0
[PATCH v5 01/15] arm64: module: remove unneeded call to kasan_alloc_module_shadow()
From: "Mike Rapoport (IBM)" Since commit f6f37d9320a1 ("arm64: select KASAN_VMALLOC for SW/HW_TAGS modes") KASAN_VMALLOC is always enabled when KASAN is on. This means that allocations in module_alloc() will be tracked by KASAN protection for vmalloc() and that kasan_alloc_module_shadow() will be always an empty inline and there is no point in calling it. Drop meaningless call to kasan_alloc_module_shadow() from module_alloc(). Signed-off-by: Mike Rapoport (IBM) --- arch/arm64/kernel/module.c | 5 - 1 file changed, 5 deletions(-) diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c index 47e0be610bb6..e92da4da1b2a 100644 --- a/arch/arm64/kernel/module.c +++ b/arch/arm64/kernel/module.c @@ -141,11 +141,6 @@ void *module_alloc(unsigned long size) __func__); } - if (p && (kasan_alloc_module_shadow(p, size, GFP_KERNEL) < 0)) { - vfree(p); - return NULL; - } - /* Memory is intended to be executable, reset the pointer tag. */ return kasan_reset_tag(p); } -- 2.43.0
Re: (subset) [PATCH 00/34] address all -Wunused-const warnings
On Wed, 03 Apr 2024 10:06:18 +0200, Arnd Bergmann wrote: > Compilers traditionally warn for unused 'static' variables, but not > if they are constant. The reason here is a custom for C++ programmers > to define named constants as 'static const' variables in header files > instead of using macros or enums. > > In W=1 builds, we get warnings only for static const variables in C > files, but not in headers, which is a good compromise, but this still > produces warning output in at least 30 files. These warnings are > almost all harmless, but also trivial to fix, and there is no > good reason to warn only about the non-const variables being unused. > > [...] Applied to powerpc/next. [01/34] powerpc/fsl-soc: hide unused const variable https://git.kernel.org/powerpc/c/01acaf3aa75e1641442cc23d8fe0a7bb4226efb1 cheers
[PATCH v5 00/15] mm: jit/text allocator
From: "Mike Rapoport (IBM)" Hi, Since v3 I looked into making execmem more of a utility toolbox, as we discussed at LPC with Mark Rutland, but it was getting hairier than having a struct describing architecture constraints and a type identifying the consumer of execmem. And I do think that having the description of architecture constraints for allocations of executable memory in a single place is better than having it spread all over the place. The patches are available via git: https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=execmem/v5 v5 changes: * rebase on v6.9-rc4 to avoid a conflict in kprobes * add copyrights to mm/execmem.c (Luis) * fix spelling (Ingo) * define MODULES_VADDR for sparc (Sam) * consistently initialize struct execmem_info (Peter) * reduce #ifdefs in function bodies in kprobes (Masami) v4: https://lore.kernel.org/all/20240411160051.2093261-1-r...@kernel.org * rebase on v6.9-rc2 * rename execmem_params to execmem_info and execmem_arch_params() to execmem_arch_setup() * use single execmem_alloc() API instead of execmem_{text,data}_alloc() (Song) * avoid extra copy of execmem parameters (Rick) * run execmem_init() as core_initcall() except for the architectures that may allocate text really early (currently only x86) (Will) * add acks for some of arm64 and riscv changes, thanks Will and Alexandre * new commits: - drop call to kasan_alloc_module_shadow() on arm64 because it's not needed anymore - rename MODULE_START to MODULES_VADDR on MIPS - use CONFIG_EXECMEM instead of CONFIG_MODULES on powerpc as per Christophe: https://lore.kernel.org/all/79062fa3-3402-47b3-8920-9231ad05e...@csgroup.eu/ v3: https://lore.kernel.org/all/20230918072955.2507221-1-r...@kernel.org * add type parameter to execmem allocation APIs * remove BPF dependency on modules v2: https://lore.kernel.org/all/20230616085038.4121892-1-r...@kernel.org * Separate "module" and "others" allocations with execmem_text_alloc() and jit_text_alloc() * Drop ROX
entailment on x86 * Add ack for nios2 changes, thanks Dinh Nguyen v1: https://lore.kernel.org/all/20230601101257.530867-1-r...@kernel.org = Cover letter from v1 (slightly updated) = module_alloc() is used everywhere as a means to allocate memory for code. Besides being semantically wrong, this unnecessarily ties all subsystems that need to allocate code, such as ftrace, kprobes and BPF to modules and puts the burden of code allocation to the modules code. Several architectures override module_alloc() because of various constraints where the executable memory can be located and this causes additional obstacles for improvements of code allocation. A centralized infrastructure for code allocation allows allocations of executable memory as ROX, and future optimizations such as caching large pages for better iTLB performance and providing sub-page allocations for users that only need small jit code snippets. Rick Edgecombe proposed perm_alloc extension to vmalloc [1] and Song Liu proposed execmem_alloc [2], but both these approaches were targeting BPF allocations and lacked the ground work to abstract executable allocations and split them from the modules core. Thomas Gleixner suggested to express module allocation restrictions and requirements as struct mod_alloc_type_params [3] that would define ranges, protections and other parameters for different types of allocations used by modules and following that suggestion Song separated allocations of different types in modules (commit ac3b43283923 ("module: replace module_layout with module_memory")) and posted "Type aware module allocator" set [4]. I liked the idea of parametrising code allocation requirements as a structure, but I believe the original proposal and Song's module allocator was too module centric, so I came up with these patches.
This set splits code allocation from modules by introducing execmem_alloc() and execmem_free() APIs, replaces call sites of module_alloc() and module_memfree() with the new APIs and implements core text and related allocations in a central place. Instead of architecture specific overrides for module_alloc(), the architectures that require non-default behaviour for text allocation must fill execmem_info structure and implement execmem_arch_setup() that returns a pointer to that structure. If an architecture does not implement execmem_arch_setup(), the defaults compatible with the current modules::module_alloc() are used. Since architectures define different restrictions on placement, permissions, alignment and other parameters for memory that can be used by different subsystems that allocate executable memory, execmem APIs take a type argument, that will be used to identify the calling subsystem and to allow architectures to define parameters for ranges suitable for that subsystem. The new infrastructure allows decoupling of BPF, kprobes and ftrace from modules, and most importantly it paves the way for ROX allocations for
Re: [PATCH] powerpc: Use str_plural() to fix Coccinelle warning
On Mon, 01 Apr 2024 00:22:50 +0200, Thorsten Blum wrote: > Fixes the following Coccinelle/coccicheck warning reported by > string_choices.cocci: > > opportunity for str_plural(tpc) > > Applied to powerpc/next. [1/1] powerpc: Use str_plural() to fix Coccinelle warning https://git.kernel.org/powerpc/c/3e42e72796d8991fecad78d61a180e24a4853427 cheers
Re: [PATCH] powerpc/ptdump: Fix walk_vmemmap to also print first vmemmap entry
On Wed, 17 Apr 2024 20:37:40 +0530, Ritesh Harjani (IBM) wrote: > walk_vmemmap() was skipping the first vmemmap entry pointed by > vmemmap_list pointer itself. This patch fixes that. > > With this we should see the vmemmap entry at 0xc00c for hash > which wasn't getting printed on doing > > "cat /sys/kernel/debug/kernel_hash_pagetable" > > [...] Applied to powerpc/next. [1/1] powerpc/ptdump: Fix walk_vmemmap to also print first vmemmap entry https://git.kernel.org/powerpc/c/f318c8be797f8572629d5386a88cde7d753457a8 cheers
Re: [PATCH v2] Add static_key_feature_checks_initialized flag
On Mon, 08 Apr 2024 05:23:58 +, Nicholas Miehlbradt wrote: > JUMP_LABEL_FEATURE_CHECK_DEBUG used static_key_initialized to determine > whether {cpu,mmu}_has_feature() is used before static keys were > initialized. However, {cpu,mmu}_has_feature() should not be used before > setup_feature_keys() is called but static_key_initialized is set well > before this by the call to jump_label_init() in early_init_devtree(). > This creates a window in which JUMP_LABEL_FEATURE_CHECK_DEBUG will not > detect misuse and report errors. Add a flag specifically to indicate > when {cpu,mmu}_has_feature() is safe to use. > > [...] Applied to powerpc/next. [1/1] Add static_key_feature_checks_initialized flag https://git.kernel.org/powerpc/c/676b2f99b0f6cd11193eeae13c976565c3fc7545 cheers
Re: [PATCH] powerpc: Fix fatal warnings flag for LLVM's integrated assembler
On Fri, 05 Apr 2024 12:31:22 -0700, Nathan Chancellor wrote: > When building with LLVM_IAS=1, there is an error because > '-fatal-warnings' is not recognized as a valid flag: > > clang: error: unsupported argument '-fatal-warnings' to option '-Wa,' > > Use the double hyphen version of the flag, '--fatal-warnings', which > works with both the GNU assembler and LLVM's integrated assembler. > > [...] Applied to powerpc/next. [1/1] powerpc: Fix fatal warnings flag for LLVM's integrated assembler https://git.kernel.org/powerpc/c/8884fc918f6aee220f9b41806974508bd0213aca cheers
Re: [PATCH v5] powerpc: Avoid nmi_enter/nmi_exit in real mode interrupt.
On Wed, 10 Apr 2024 10:00:06 +0530, Mahesh Salgaonkar wrote: > nmi_enter()/nmi_exit() touches per cpu variables which can lead to kernel > crash when invoked during real mode interrupt handling (e.g. early HMI/MCE > interrupt handler) if percpu allocation comes from vmalloc area. > > Early HMI/MCE handlers are called through DEFINE_INTERRUPT_HANDLER_NMI() > wrapper which invokes nmi_enter/nmi_exit calls. We don't see any issue when > percpu allocation is from the embedded first chunk. However with > CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK enabled there are chances where percpu > allocation can come from the vmalloc area. > > [...] Applied to powerpc/next. [1/1] powerpc: Avoid nmi_enter/nmi_exit in real mode interrupt. https://git.kernel.org/powerpc/c/0db880fc865ffb522141ced4bfa66c12ab1fbb70 cheers
Re: [PATCH v2] powerpc: Fix PS3 allmodconfig warning
On Mon, 01 Apr 2024 16:08:31 +0900, Geoff Levand wrote: > The struct ps3_notification_device in the ps3_probe_thread routine > is too large to be on the stack, causing a warning for an > allmodconfig build with clang. > > Change the struct ps3_notification_device from a variable on the stack > to a dynamically allocated variable. > > [...] Applied to powerpc/next. [1/1] powerpc: Fix PS3 allmodconfig warning https://git.kernel.org/powerpc/c/bfe51886ca544956eb4ff924d1937ac01d0ca9c8 cheers
Re: [PATCH v1] powerpc: Error on assembly warnings
On Tue, 26 Mar 2024 15:44:20 +1100, Benjamin Gray wrote: > We currently enable -Werror on the arch/powerpc subtree. However this > only catches C warnings. Assembly warnings are logged, but the make > invocation will still succeed. This can allow incorrect syntax such as > > ori r3, r4, r5 > > to be compiled without catching that the assembler is treating r5 > as the immediate value 5. > > [...] Applied to powerpc/next. [1/1] powerpc: Error on assembly warnings https://git.kernel.org/powerpc/c/608d4a5ca56302181e669cea0aa571cbec6680eb cheers
Re: [PATCH v1 1/1] powerpc/52xx: Replace of_gpio.h by proper one
On Wed, 13 Mar 2024 15:56:45 +0200, Andy Shevchenko wrote: > of_gpio.h is deprecated and subject to remove. > The driver doesn't use it directly, replace it > with what is really being used. > > Applied to powerpc/next. [1/1] powerpc/52xx: Replace of_gpio.h by proper one https://git.kernel.org/powerpc/c/676abf7c39267080ab81597c6d4f372a10c0fc21 cheers
Re: [PATCH v4] powerpc/pseries: make max polling consistent for longer H_CALLs
On Wed, 2024-04-17 at 23:12 -0400, Nayna Jain wrote: > Currently, plpks_confirm_object_flushed() function polls for 5msec in > total instead of 5sec. > > Keep max polling time consistent for all the H_CALLs, which take > longer > than expected, to be 5sec. Also, make use of fsleep() everywhere to > insert delay. > > Reported-by: Nageswara R Sastry > Fixes: 2454a7af0f2a ("powerpc/pseries: define driver for Platform > KeyStore") > Signed-off-by: Nayna Jain > Tested-by: Nageswara R Sastry Reviewed-by: Andrew Donnellan > --- > v4: > * As per Andrew's feedback, squashed Patch 2 with Patch 1. > Now it is single patch. > > v3: > * Addition to Patch 1 timeout patch based on Andrew's feedback. > > v2: > * Updated based on feedback from Michael Ellerman > Replaced usleep_range with fsleep. > Since there is no more need to specify range, sleep time is > reverted back to 10 msec. > > arch/powerpc/include/asm/plpks.h | 5 ++--- > arch/powerpc/platforms/pseries/plpks.c | 10 +- > 2 files changed, 7 insertions(+), 8 deletions(-) > > diff --git a/arch/powerpc/include/asm/plpks.h > b/arch/powerpc/include/asm/plpks.h > index 23b77027c916..7a84069759b0 100644 > --- a/arch/powerpc/include/asm/plpks.h > +++ b/arch/powerpc/include/asm/plpks.h > @@ -44,9 +44,8 @@ > #define PLPKS_MAX_DATA_SIZE 4000 > > // Timeouts for PLPKS operations > -#define PLPKS_MAX_TIMEOUT5000 // msec > -#define PLPKS_FLUSH_SLEEP10 // msec > -#define PLPKS_FLUSH_SLEEP_RANGE 400 > +#define PLPKS_MAX_TIMEOUT(5 * USEC_PER_SEC) > +#define PLPKS_FLUSH_SLEEP1 // usec > > struct plpks_var { > char *component; > diff --git a/arch/powerpc/platforms/pseries/plpks.c > b/arch/powerpc/platforms/pseries/plpks.c > index febe18f251d0..4a595493d28a 100644 > --- a/arch/powerpc/platforms/pseries/plpks.c > +++ b/arch/powerpc/platforms/pseries/plpks.c > @@ -415,8 +415,7 @@ static int plpks_confirm_object_flushed(struct > label *label, > break; > } > > - usleep_range(PLPKS_FLUSH_SLEEP, > - PLPKS_FLUSH_SLEEP + > PLPKS_FLUSH_SLEEP_RANGE); > 
+ fsleep(PLPKS_FLUSH_SLEEP); > timeout = timeout + PLPKS_FLUSH_SLEEP; > } while (timeout < PLPKS_MAX_TIMEOUT); > > @@ -464,9 +463,10 @@ int plpks_signed_update_var(struct plpks_var > *var, u64 flags) > > continuetoken = retbuf[0]; > if (pseries_status_to_err(rc) == -EBUSY) { > - int delay_ms = get_longbusy_msecs(rc); > - mdelay(delay_ms); > - timeout += delay_ms; > + int delay_us = get_longbusy_msecs(rc) * > 1000; > + > + fsleep(delay_us); > + timeout += delay_us; > } > rc = pseries_status_to_err(rc); > } while (rc == -EBUSY && timeout < PLPKS_MAX_TIMEOUT); -- Andrew DonnellanOzLabs, ADL Canberra a...@linux.ibm.com IBM Australia Limited
Re: [PATCH 1/1] Replace macro "ARCH_HAVE_EXTRA_ELF_NOTES" with kconfig
* Vignesh Balasubramanian: > diff --git a/include/linux/elf.h b/include/linux/elf.h > index c9a46c4e183b..5c402788da19 100644 > --- a/include/linux/elf.h > +++ b/include/linux/elf.h > @@ -65,7 +65,7 @@ extern Elf64_Dyn _DYNAMIC []; > struct file; > struct coredump_params; > > -#ifndef ARCH_HAVE_EXTRA_ELF_NOTES > +#ifndef CONFIG_ARCH_HAVE_EXTRA_ELF_NOTES You could add #pragma GCC poison ARCH_HAVE_EXTRA_ELF_NOTES to a central header, to let GCC and Clang flag uses that have not been converted. Thanks, Florian
[PATCH v2] powerpc/eeh: Permanently disable the removed device
When a device is hot removed on powernv, the hotplug driver clears the device's state. However, on pseries, if a device is removed by phyp after reaching the error threshold, the kernel remains unaware, leading to the device not being torn down. This prevents necessary remediation actions like failover. Permanently disable the device if the presence check fails. Also, in eeh_dev_check_failure() we may consider the error a false positive if the device is hotplugged out, as the get_state call returns EEH_STATE_NOT_SUPPORT, and we may end up not clearing the device state, so log the event if the state is not moved to permanent failure state. Signed-off-by: Ganesh Goudar --- V2: * Elaborate the commit message. * Fix formatting issues in commit message and comments. --- arch/powerpc/kernel/eeh.c| 11 ++- arch/powerpc/kernel/eeh_driver.c | 13 +++-- 2 files changed, 21 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c index ab316e155ea9..6670063a7a6c 100644 --- a/arch/powerpc/kernel/eeh.c +++ b/arch/powerpc/kernel/eeh.c @@ -506,9 +506,18 @@ int eeh_dev_check_failure(struct eeh_dev *edev) * We will punt with the following conditions: Failure to get * PE's state, EEH not support and Permanently unavailable * state, PE is in good state. +* +* On the pSeries, after reaching the threshold, get_state might +* return EEH_STATE_NOT_SUPPORT. However, it's possible that the +* device state remains uncleared if the device is not marked +* pci_channel_io_perm_failure. Therefore, consider logging the +* event to let device removal happen.
+* */ if ((ret < 0) || - (ret == EEH_STATE_NOT_SUPPORT) || eeh_state_active(ret)) { + (ret == EEH_STATE_NOT_SUPPORT && +dev->error_state == pci_channel_io_perm_failure) || + eeh_state_active(ret)) { eeh_stats.false_positives++; pe->false_positives++; rc = 0; diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c index 48773d2d9be3..7efe04c68f0f 100644 --- a/arch/powerpc/kernel/eeh_driver.c +++ b/arch/powerpc/kernel/eeh_driver.c @@ -865,9 +865,18 @@ void eeh_handle_normal_event(struct eeh_pe *pe) devices++; if (!devices) { - pr_debug("EEH: Frozen PHB#%x-PE#%x is empty!\n", + pr_warn("EEH: Frozen PHB#%x-PE#%x is empty!\n", pe->phb->global_number, pe->addr); - goto out; /* nothing to recover */ + /* +* The device is removed, tear down its state, on powernv +* hotplug driver would take care of it but not on pseries, +* permanently disable the card as it is hot removed. +* +* In the case of powernv, note that the removal of device +* is covered by pci rescan lock, so no problem even if hotplug +* driver attempts to remove the device. +*/ + goto recover_failed; } /* Log the event */ -- 2.44.0
[PATCH v2] ASoC: dt-bindings: fsl,ssi: Convert to YAML
Convert the fsl,ssi binding to YAML. Add below compatible strings which were not listed in document: fsl,imx50-ssi fsl,imx53-ssi fsl,imx25-ssi fsl,imx27-ssi fsl,imx6q-ssi fsl,imx6sl-ssi fsl,imx6sx-ssi Add below fsl,mode strings which were not listed. i2s-slave i2s-master lj-slave lj-master rj-slave rj-master Add 'ac97-gpios' property which was not listed. Then dtbs_check can pass. And remove the 'codec' description which should be in the 'codec' binding doc. Signed-off-by: Shengjiu Wang --- changes in v2: - change fallback string to const. - add dai-common.yaml - add ac97-gpios property .../devicetree/bindings/sound/fsl,ssi.txt | 87 .../devicetree/bindings/sound/fsl,ssi.yaml| 192 ++ 2 files changed, 192 insertions(+), 87 deletions(-) delete mode 100644 Documentation/devicetree/bindings/sound/fsl,ssi.txt create mode 100644 Documentation/devicetree/bindings/sound/fsl,ssi.yaml diff --git a/Documentation/devicetree/bindings/sound/fsl,ssi.txt b/Documentation/devicetree/bindings/sound/fsl,ssi.txt deleted file mode 100644 index 7e15a85cecd2.. --- a/Documentation/devicetree/bindings/sound/fsl,ssi.txt +++ /dev/null @@ -1,87 +0,0 @@ -Freescale Synchronous Serial Interface - -The SSI is a serial device that communicates with audio codecs. It can -be programmed in AC97, I2S, left-justified, or right-justified modes. - -Required properties: -- compatible: Compatible list, should contain one of the following -compatibles: - fsl,mpc8610-ssi - fsl,imx51-ssi - fsl,imx35-ssi - fsl,imx21-ssi -- cell-index: The SSI, <0> = SSI1, <1> = SSI2, and so on. -- reg: Offset and length of the register set for the device. -- interrupts: <a b> where a is the interrupt number and b is a -field that represents an encoding of the sense and -level information for the interrupt. This should be -encoded based on the information in section 2) -depending on the type of interrupt controller you -have. -- fsl,fifo-depth: The number of elements in the transmit and receive FIFOs.
-This number is the maximum allowed value for SFCSR[TFWM0]. - - clocks: "ipg" - Required clock for the SSI unit -"baud" - Required clock for SSI master mode. Otherwise this - clock is not used - -Required are also ac97 link bindings if ac97 is used. See -Documentation/devicetree/bindings/sound/soc-ac97link.txt for the necessary -bindings. - -Optional properties: -- codec-handle: Phandle to a 'codec' node that defines an audio -codec connected to this SSI. This node is typically -a child of an I2C or other control node. -- fsl,fiq-stream-filter: Bool property. Disabled DMA and use FIQ instead to - filter the codec stream. This is necessary for some boards - where an incompatible codec is connected to this SSI, e.g. - on pca100 and pcm043. -- dmas:Generic dma devicetree binding as described in - Documentation/devicetree/bindings/dma/dma.txt. -- dma-names: Two dmas have to be defined, "tx" and "rx", if fsl,imx-fiq - is not defined. -- fsl,mode: The operating mode for the AC97 interface only. -"ac97-slave" - AC97 mode, SSI is clock slave -"ac97-master" - AC97 mode, SSI is clock master -- fsl,ssi-asynchronous: -If specified, the SSI is to be programmed in asynchronous -mode. In this mode, pins SRCK, STCK, SRFS, and STFS must -all be connected to valid signals. In synchronous mode, -SRCK and SRFS are ignored. Asynchronous mode allows -playback and capture to use different sample sizes and -sample rates. Some drivers may require that SRCK and STCK -be connected together, and SRFS and STFS be connected -together. This would still allow different sample sizes, -but not different sample rates. -- fsl,playback-dma: Phandle to a node for the DMA channel to use for -playback of audio. This is typically dictated by SOC -design. See the notes below. -Only used on Power Architecture. -- fsl,capture-dma: Phandle to a node for the DMA channel to use for -capture (recording) of audio. This is typically dictated -by SOC design. See the notes below. -Only used on Power Architecture. 
- -Child 'codec' node required properties: -- compatible: Compatible list, contains the name of the codec - -Child 'codec' node optional