Re: [PATCH 1/2] kexec: fix KEXEC_FILE dependencies
On 11/30/23 10:56, Andrew Morton wrote:
> On Thu, 2 Nov 2023 16:03:18 +0800 Baoquan He wrote:
>
>>>> CONFIG_KEXEC_FILE, but still get purgatory code built in which is
>>>> totally useless. Not sure if I think too much over this.
>>>
>>> I see your point here, and I would suggest changing the
>>> CONFIG_ARCH_SUPPORTS_KEXEC_PURGATORY symbol to just indicate
>>> the availability of the purgatory code for the arch, rather
>>> than actually controlling the code itself. I already mentioned
>>> this for s390, but riscv would need the same thing on top.
>>>
>>> I think the change below should address your concern.
>>
>> Since no new comment, do you mind spinning v2 to wrap all these up?
>
> This patchset remains in mm-hotfixes-unstable from the previous -rc
> cycle. Eric, do you have any comments?
>
> Arnd, do you plan on a v2? If not, should I merge v1? If so, should
> I now add cc:stable?

My apologies, I lost this.

I've looked at these changes, and I am in favor of these changes.
Furthermore, I ran the following thru the Kconfig regression script,
and did not find anything!

I believe the following patch represents the current discussion threads
around Kconfig and KEXEC/CRASH.

Reviewed-by: Eric DeVolder
Tested-by: Eric DeVolder

Thanks!
eric

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 6f105ee4f3cf..1f11a62809f2 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -608,10 +608,10 @@ config ARCH_SUPPORTS_KEXEC
 	def_bool PPC_BOOK3S || PPC_E500 || (44x && !SMP)
 
 config ARCH_SUPPORTS_KEXEC_FILE
-	def_bool PPC64 && CRYPTO=y && CRYPTO_SHA256=y
+	def_bool PPC64
 
 config ARCH_SUPPORTS_KEXEC_PURGATORY
-	def_bool KEXEC_FILE
+	def_bool y
 
 config ARCH_SELECTS_KEXEC_FILE
 	def_bool y
diff --git a/arch/riscv/Kbuild b/arch/riscv/Kbuild
index d25ad1c19f88..ab181d187c23 100644
--- a/arch/riscv/Kbuild
+++ b/arch/riscv/Kbuild
@@ -5,7 +5,7 @@ obj-$(CONFIG_BUILTIN_DTB) += boot/dts/
 obj-y += errata/
 obj-$(CONFIG_KVM) += kvm/
 
-obj-$(CONFIG_ARCH_SUPPORTS_KEXEC_PURGATORY) += purgatory/
+obj-$(CONFIG_KEXEC_FILE) += purgatory/
 
 # for cleaning
 subdir- += boot
diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 95a2a06acc6a..98857d76e458 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -702,9 +702,7 @@ config ARCH_SELECTS_KEXEC_FILE
 	select KEXEC_ELF
 
 config ARCH_SUPPORTS_KEXEC_PURGATORY
-	def_bool KEXEC_FILE
-	depends on CRYPTO=y
-	depends on CRYPTO_SHA256=y
+	def_bool y
 
 config ARCH_SUPPORTS_CRASH_DUMP
 	def_bool y
diff --git a/arch/riscv/kernel/elf_kexec.c b/arch/riscv/kernel/elf_kexec.c
index e60fbd8660c4..3ac341d296db 100644
--- a/arch/riscv/kernel/elf_kexec.c
+++ b/arch/riscv/kernel/elf_kexec.c
@@ -266,7 +266,7 @@ static void *elf_kexec_load(struct kimage *image, char *kernel_buf,
 		cmdline = modified_cmdline;
 	}
 
-#ifdef CONFIG_ARCH_SUPPORTS_KEXEC_PURGATORY
+#ifdef CONFIG_KEXEC_FILE
 	/* Add purgatory to the image */
 	kbuf.top_down = true;
 	kbuf.mem = KEXEC_BUF_MEM_UNKNOWN;
@@ -280,7 +280,7 @@ static void *elf_kexec_load(struct kimage *image, char *kernel_buf,
 					     sizeof(kernel_start), 0);
 	if (ret)
 		pr_err("Error update purgatory ret=%d\n", ret);
-#endif /* CONFIG_ARCH_SUPPORTS_KEXEC_PURGATORY */
+#endif /* CONFIG_KEXEC_FILE */
 
 	/* Add the initrd to the image */
 	if (initrd != NULL) {
diff --git a/arch/s390/Kbuild b/arch/s390/Kbuild
index a5d3503b353c..f2ce80b65551 100644
--- a/arch/s390/Kbuild
+++ b/arch/s390/Kbuild
@@ -7,7 +7,7 @@ obj-$(CONFIG_S390_HYPFS) += hypfs/
 obj-$(CONFIG_APPLDATA_BASE) += appldata/
 obj-y += net/
 obj-$(CONFIG_PCI) += pci/
-obj-$(CONFIG_ARCH_SUPPORTS_KEXEC_PURGATORY) += purgatory/
+obj-$(CONFIG_KEXEC_FILE) += purgatory/
 
 # for cleaning
 subdir- += boot tools
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 3bec98d20283..d5d8f99d1f25 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -254,13 +254,13 @@ config ARCH_SUPPORTS_KEXEC
 	def_bool y
 
 config ARCH_SUPPORTS_KEXEC_FILE
-	def_bool CRYPTO && CRYPTO_SHA256 && CRYPTO_SHA256_S390
+	def_bool y
 
 config ARCH_SUPPORTS_KEXEC_SIG
 	def_bool MODULE_SIG_FORMAT
 
 config ARCH_SUPPORTS_KEXEC_PURGATORY
-	def_bool KEXEC_FILE
+	def_bool y
 
 config ARCH_SUPPORTS_CRASH_DUMP
 	def_bool y
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 3762f41bb092..1566748f16c4 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2072,7 +2072,7 @@ config ARCH_SUPPORTS_KEXEC
 	def_bool y
 
 config ARCH_SUPPORTS_KEXEC_FILE
-	def_bool X86_64 && CRYPTO && CRYPTO_SHA256
+	def_bool X86_64
 
 config ARCH_SELECTS_KEXEC_FILE
 	def_bool y
@@ -2080,7 +2080,7 @@ config ARCH_SELECTS_KEXEC_FILE
 	select HAVE_IMA_KEXEC if IMA
 
 config ARCH_SUPPORTS_KEXEC_PURGATORY
-	def_bool KEXEC_FILE
+	def_bool y
 
 config ARCH_SUPPORTS_KEXEC_SIG
 	def_bool y
diff --git a/kernel/Kconfig.kexec b/kernel/Kconfig.kexec
index 7aff28d
Re: [PATCH v2] kernel/Kconfig.kexec: drop select of KEXEC for CRASH_DUMP
On 11/27/23 23:44, Baoquan He wrote:
> Ignat Korchagin complained that a potential config regression was
> introduced by commit 89cde455915f ("kexec: consolidate kexec and crash
> options into kernel/Kconfig.kexec"). Before the commit,
> CONFIG_CRASH_DUMP had no dependency on CONFIG_KEXEC. After the commit,
> CRASH_DUMP selects KEXEC. That forces the system to have CONFIG_KEXEC=y
> whenever CONFIG_CRASH_DUMP=y, which people may not want.
>
> In Ignat's case, he sets CONFIG_CRASH_DUMP=y, CONFIG_KEXEC_FILE=y and
> CONFIG_KEXEC=n because the kexec_load interface could be a security
> issue if the kernel/initrd has no chance to be signed and verified.
>
> CRASH_DUMP selects KEXEC because Eric, the author of the above commit,
> hit an LKP build-failure report when posting an earlier version of the
> patch. Please see the link below for the details of the LKP report:
>
> https://lore.kernel.org/all/3e8eecd1-a277-2cfb-690e-5de2eb7b9...@oracle.com/T/#u
>
> In fact, that LKP report was triggered because arm's <asm/kexec.h> is
> wrapped in a CONFIG_KEXEC ifdeffery scope. That is wrong. CONFIG_KEXEC
> controls the enabling/disabling of the kexec_load interface, not the
> kexec feature as a whole. Removing the wrongly added CONFIG_KEXEC
> ifdeffery scope in <asm/kexec.h> of arm allows us to drop the select
> of KEXEC for CRASH_DUMP. Meanwhile, change arch/arm/kernel/Makefile to
> let machine_kexec.o and relocate_kernel.o depend on KEXEC_CORE.
>
> Fixes: commit 89cde455915f ("kexec: consolidate kexec and crash options into kernel/Kconfig.kexec")
> Reported-by: Ignat Korchagin
> Signed-off-by: Baoquan He
> ---
>  arch/arm/include/asm/kexec.h | 4 ----
>  arch/arm/kernel/Makefile     | 2 +-
>  kernel/Kconfig.kexec         | 1 -
>  3 files changed, 1 insertion(+), 6 deletions(-)
>
> diff --git a/arch/arm/include/asm/kexec.h b/arch/arm/include/asm/kexec.h
> index e62832dcba76..a8287e7ab9d4 100644
> --- a/arch/arm/include/asm/kexec.h
> +++ b/arch/arm/include/asm/kexec.h
> @@ -2,8 +2,6 @@
>  #ifndef _ARM_KEXEC_H
>  #define _ARM_KEXEC_H
>
> -#ifdef CONFIG_KEXEC
> -
>  /* Maximum physical address we can use pages from */
>  #define KEXEC_SOURCE_MEMORY_LIMIT (-1UL)
>  /* Maximum address we can reach in physical address mode */
> @@ -82,6 +80,4 @@ static inline struct page *boot_pfn_to_page(unsigned long boot_pfn)
>
>  #endif /* __ASSEMBLY__ */
>
> -#endif /* CONFIG_KEXEC */
> -
>  #endif /* _ARM_KEXEC_H */
> diff --git a/arch/arm/kernel/Makefile b/arch/arm/kernel/Makefile
> index d53f56d6f840..771264d4726a 100644
> --- a/arch/arm/kernel/Makefile
> +++ b/arch/arm/kernel/Makefile
> @@ -59,7 +59,7 @@ obj-$(CONFIG_FUNCTION_TRACER) += entry-ftrace.o
>  obj-$(CONFIG_DYNAMIC_FTRACE) += ftrace.o insn.o patch.o
>  obj-$(CONFIG_FUNCTION_GRAPH_TRACER) += ftrace.o insn.o patch.o
>  obj-$(CONFIG_JUMP_LABEL) += jump_label.o insn.o patch.o
> -obj-$(CONFIG_KEXEC) += machine_kexec.o relocate_kernel.o
> +obj-$(CONFIG_KEXEC_CORE) += machine_kexec.o relocate_kernel.o
>  # Main staffs in KPROBES are in arch/arm/probes/ .
>  obj-$(CONFIG_KPROBES) += patch.o insn.o
>  obj-$(CONFIG_OABI_COMPAT) += sys_oabi-compat.o
> diff --git a/kernel/Kconfig.kexec b/kernel/Kconfig.kexec
> index 7aff28ded2f4..1cc3b1c595d7 100644
> --- a/kernel/Kconfig.kexec
> +++ b/kernel/Kconfig.kexec
> @@ -97,7 +97,6 @@ config CRASH_DUMP
>  	depends on ARCH_SUPPORTS_KEXEC
>  	select CRASH_CORE
>  	select KEXEC_CORE
> -	select KEXEC
>  	help
>  	  Generate crash dump after being started by kexec.
>  	  This should be normally only set in special crash dump kernels

I have run this change against the kconfig regression script, and it did
not find any differences!

Reviewed-by: Eric DeVolder

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH v2] drivers/base/cpu: crash data showing should depends on KEXEC_CORE
On 11/27/23 23:52, Baoquan He wrote:
> After commit 88a6f8994421 ("crash: memory and CPU hotplug sysfs
> attributes"), on x86_64, if only the kdump-related kernel configs below
> are set, a compile error is triggered.
>
> CONFIG_CRASH_CORE=y
> CONFIG_KEXEC_CORE=y
> CONFIG_CRASH_DUMP=y
> CONFIG_CRASH_HOTPLUG=y
>
> ------
> drivers/base/cpu.c: In function ‘crash_hotplug_show’:
> drivers/base/cpu.c:309:40: error: implicit declaration of function ‘crash_hotplug_cpu_support’; did you mean ‘crash_hotplug_show’? [-Werror=implicit-function-declaration]
>   309 |         return sysfs_emit(buf, "%d\n", crash_hotplug_cpu_support());
>       |                                        ^
>       |                                        crash_hotplug_show
> cc1: some warnings being treated as errors
> ------
>
> CONFIG_KEXEC is used to enable the kexec_load interface, so making the
> crash_notes/crash_notes_size/crash_hotplug attributes depend on
> CONFIG_KEXEC is incorrect. They should depend on KEXEC_CORE instead.
>
> Fix it now.
>
> Fixes: commit 88a6f8994421 ("crash: memory and CPU hotplug sysfs attributes")
> Signed-off-by: Baoquan He

Reviewed-by: Eric DeVolder

> ---
>  drivers/base/cpu.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
> index 9ea22e165acd..548491de818e 100644
> --- a/drivers/base/cpu.c
> +++ b/drivers/base/cpu.c
> @@ -144,7 +144,7 @@ static DEVICE_ATTR(release, S_IWUSR, NULL, cpu_release_store);
>  #endif /* CONFIG_ARCH_CPU_PROBE_RELEASE */
>  #endif /* CONFIG_HOTPLUG_CPU */
>
> -#ifdef CONFIG_KEXEC
> +#ifdef CONFIG_KEXEC_CORE
>  #include <linux/kexec.h>
>
>  static ssize_t crash_notes_show(struct device *dev,
> @@ -189,14 +189,14 @@ static const struct attribute_group crash_note_cpu_attr_group = {
>  #endif
>
>  static const struct attribute_group *common_cpu_attr_groups[] = {
> -#ifdef CONFIG_KEXEC
> +#ifdef CONFIG_KEXEC_CORE
>  	&crash_note_cpu_attr_group,
>  #endif
>  	NULL
>  };
>
>  static const struct attribute_group *hotplugable_cpu_attr_groups[] = {
> -#ifdef CONFIG_KEXEC
> +#ifdef CONFIG_KEXEC_CORE
>  	&crash_note_cpu_attr_group,
>  #endif
>  	NULL
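The effect of guarding the attribute list on the right symbol can be shown outside the kernel. The following is only a userspace sketch, assuming hand-written stand-ins for the Kconfig symbols (in a real build they come from the generated config header); `common_attrs` and `count_attrs()` are hypothetical names, and the two strings are the attribute names mentioned in the commit message.

```c
#include <stddef.h>

/* Stand-in for Kconfig output: kexec core code is built ... */
#define CONFIG_KEXEC_CORE 1
/* ... while CONFIG_KEXEC (the kexec_load syscall) is left undefined. */

/* Hypothetical attribute table, gated the way the patch gates it. */
static const char *common_attrs[] = {
#ifdef CONFIG_KEXEC_CORE
	"crash_notes",
	"crash_notes_size",
#endif
	NULL
};

/* Count the entries before the terminating NULL. */
static size_t count_attrs(void)
{
	size_t n = 0;

	while (common_attrs[n] != NULL)
		n++;
	return n;
}
```

With the guard switched back to `#ifdef CONFIG_KEXEC`, the same configuration would leave the table empty, which is the mismatch the patch removes.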
Re: [PATCH 0/3] kernel/Kconfig.kexec: drop select of KEXEC for CRASH_DUMP
On 11/23/23 01:36, Baoquan He wrote:
> Ignat reported that a potential config regression was introduced by
> commit 89cde455915f ("kexec: consolidate kexec and crash options into
> kernel/Kconfig.kexec"). Please see the link below for more details:
>
> https://lore.kernel.org/all/CALrw=nhprqqaqtp_jzfregrqemps8jbf8jqcv4ygqxyce-s...@mail.gmail.com/T/#u
>
> Patch 1 fixes the regression by removing the incorrect CONFIG_KEXEC
> ifdeffery scope added in arm's <asm/kexec.h>, then dropping the select
> of KEXEC for CRASH_DUMP. This was tested and passed cross-compiling
> for arm.
>
> Patch 2 fixes a build failure I hit when testing patch 1 on x86_64:
> the wrong CONFIG_KEXEC ifdeffery is replaced with CONFIG_KEXEC_CORE.
> Test passed on x86_64.
>
> Patch 3 removes an unnecessary 'select KEXEC' in the s390 ARCH.
> Removing the select won't impact anything. Test passed on an ibm-z
> system.

I apologize for my delay in responding; I did not have a computer with me
during my holiday travel.

I was able to re-run my Kconfig test script with this patch series (now
that I'm running this on private resources, it takes half a day 8( ). The
script only performs comparisons of the .config before (LHSB) and after
(RHSB) the patch series; it does NOT do any building.

At any rate, what that revealed was only differences in s390. That means
that all other arches do not have any unintended side effects.

The differences with patch 3 applied look like:

FAIL: allnoconfig arch/s390/configs/kasan.config
LHSB {'CONFIG_CRASH_CORE': 'y', 'CONFIG_KEXEC_CORE': 'y', 'CONFIG_KEXEC': 'y'}
RHSB {'CONFIG_KEXEC': 'n'}

The 'allnoconfig' and 'olddefconfig' targets failed for all s390
defconfigs. The LHSB is the pre-patch values, and the RHSB is the
post-patch values. So this states that CRASH_CORE and KEXEC_CORE were set
previously, but now they are not. KEXEC obviously is being turned off
intentionally.

Hope this helps some.

Regards,
eric

> Baoquan He (3):
>   kernel/Kconfig.kexec: drop select of KEXEC for CRASH_DUMP
>   drivers/base/cpu: crash data showing should depends on KEXEC_CORE
>   s390/Kconfig: drop select of KEXEC
>
>  arch/arm/include/asm/kexec.h | 4 ----
>  arch/s390/Kconfig            | 1 -
>  drivers/base/cpu.c           | 6 +++---
>  kernel/Kconfig.kexec         | 1 -
>  4 files changed, 3 insertions(+), 9 deletions(-)
Re: [PATCH v3 0/6] crashdump: Kernel handling of CPU and memory hot un/plug
On 10/4/23 07:08, Simon Horman wrote:
> On Wed, Sep 27, 2023 at 02:11:30PM -0400, Eric DeVolder wrote:
>> When the kdump service is loaded, if a CPU or memory is hot
>> un/plugged, the crash elfcorehdr, which describes the CPUs
>> and memory in the system, must also be updated, else the resulting
>> vmcore is inaccurate (eg. missing either CPU context or memory
>> regions).
>>
>> The current solution utilizes udev (eg. RHEL
>> /usr/lib/udev/rules.d/98-kexec.rules) to initiate an
>> unload-then-reload of the *entire* kdump image (eg. kernel, initrd,
>> boot_params, purgatory and elfcorehdr) by the userspace kexec
>> utility. This occurs just so the elfcorehdr can be updated with the
>> latest list of CPUs and memory regions. In a previous post I have
>> outlined the significant performance problems related to offloading
>> this activity to userspace.
>>
>> With the Linux kernel 6.6 commit below, the kernel now has the
>> ability to directly modify the elfcorehdr, eliminating the need to
>> unload-then-reload the entire kdump image when CPU or memory is hot
>> un/plugged or on/offlined.
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d68b4b6f307d155475cce541f2aee938032ed22e
>>
>> This kexec-tools patch series is for supporting hotplug with the
>> kexec_load() syscall; the kernel directly supports hotplug for the
>> kexec_file_load() syscall, requiring no userspace help.
>>
>> There are two basic obstacles/requirements for the kexec-tools to
>> overcome in order to support kernel hotplug rewriting of the
>> elfcorehdr.
>>
>> First, the buffer containing the elfcorehdr must be excluded from the
>> purgatory checksum/digest, which is computed at load time. Otherwise
>> kernel run-time changes to the elfcorehdr, as a result of hot
>> un/plug, would result in the checksum failing (specifically in
>> purgatory at panic kernel boot time), and kdump capture kernel
>> failing to start. To let the kernel know it is okay to modify the
>> elfcorehdr, kexec sets the KEXEC_UPDATE_ELFCOREHDR flag.
>>
>> NOTE: The kernel specifically does *NOT* attempt to recompute the
>> checksum/digest as that would ultimately require patching the
>> in-memory purgatory image with the updated checksum. As that
>> purgatory image is already fully linked, it is a binary blob
>> containing no ELF information which would allow it to be re-linked or
>> patched. Thus excluding the elfcorehdr from the checksum/digests
>> avoids all these problems.
>>
>> Second, the size of the elfcorehdr buffer must be large enough
>> to accommodate growth of the number of CPUs and/or memory regions.
>>
>> To satisfy the first requirement, this patch series introduces the
>> --hotplug option to indicate to kexec-tools that kexec should exclude
>> the elfcorehdr buffer from the purgatory checksum/digest calculation
>> and set the KEXEC_UPDATE_ELFCOREHDR flag.
>>
>> To satisfy the second requirement, the size is obtained from the
>> /sys/kernel/crash_elfcorehdr_size node (new with the kernel series
>> cited above).
>>
>> To use this feature with kexec_load() syscall, invoke kexec with:
>>
>>   kexec -c --hotplug ...
>>
>> Thanks!
>> eric
>
> Thanks Eric, applied.

Excellent, thank you!
eric
[PATCH v3 6/6] crashdump/x86: set the elfcorehdr segment size for hotplug
For hotplug, the elfcorehdr segment must be sized appropriately
to allow a growing number of CPUs or memory regions. Use the size
reported by the kernel via /sys/kernel/crash_elfcorehdr_size.

Signed-off-by: Eric DeVolder
---
 kexec/arch/i386/crashdump-x86.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/kexec/arch/i386/crashdump-x86.c b/kexec/arch/i386/crashdump-x86.c
index cb86ca7..a01031e 100644
--- a/kexec/arch/i386/crashdump-x86.c
+++ b/kexec/arch/i386/crashdump-x86.c
@@ -957,6 +957,14 @@ int load_crashdump_segments(struct kexec_info *info, char* mod_cmdline,
 		memsz = bufsz;
 	}
 
+	/* For hotplug support, override the minimum necessary size just
+	 * computed with the value from /sys/kernel/crash_elfcorehdr_size.
+	 * Properly align the size as well.
+	 */
+	if (do_hotplug) {
+		memsz = _ALIGN(elfcorehdrsz, align);
+	}
+
 	/* Record the location of the elfcorehdr for hotplug handling */
 	info->elfcorehdr =
 	elfcorehdr = add_buffer(info, tmp, bufsz, memsz, align, min_base,
-- 
2.39.3
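The sizing rule in this patch can be sketched in isolation. The snippet below is a minimal illustration, not kexec-tools code: `align_up()` and `elfcorehdr_memsz()` are hypothetical names, `align_up()` assumes the alignment is a power of two, and the non-hotplug branch is simplified to "use the buffer size" (kexec-tools itself uses its own `_ALIGN` macro at this point).

```c
#include <stddef.h>

/* Hypothetical helper: round value up to a multiple of align.
 * Assumes align is a power of two. */
static unsigned long align_up(unsigned long value, unsigned long align)
{
	return (value + align - 1) & ~(align - 1);
}

/* Hypothetical helper: choose the memory size of the elfcorehdr segment.
 * Without hotplug the segment only needs to hold the current buffer;
 * with hotplug it is sized from the kernel-reported value so the kernel
 * has room to rewrite the header as CPUs and memory come and go. */
static unsigned long elfcorehdr_memsz(unsigned long bufsz,
				      unsigned long sysfs_size,
				      unsigned long align,
				      int do_hotplug)
{
	unsigned long memsz = bufsz;

	if (do_hotplug)
		memsz = align_up(sysfs_size, align);
	return memsz;
}
```

For example, a kernel-reported size of 0x5010 with a 0x1000 alignment gives a 0x6000-byte segment, regardless of how small the initially generated header is.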
[PATCH v3 4/6] crashdump: exclude elfcorehdr segment from digest for hotplug
To allow direct modification of the elfcorehdr by the kernel, in
response to CPU and memory hot un/plug and/or online/offline events,
the buffer containing the elfcorehdr must be excluded from the
purgatory checksum/digest.

If the elfcorehdr is not excluded from the purgatory checksum/digest,
then at panic time, the checksum/digest check fails (due to the
elfcorehdr having been modified), and the kdump capture kernel does
not start.

Signed-off-by: Eric DeVolder
---
 kexec/kexec.c | 8 ++++++++
 kexec/kexec.h | 1 +
 2 files changed, 9 insertions(+)

diff --git a/kexec/kexec.c b/kexec/kexec.c
index 0207608..fdb4c98 100644
--- a/kexec/kexec.c
+++ b/kexec/kexec.c
@@ -689,6 +689,14 @@ static void update_purgatory(struct kexec_info *info)
 		if (info->segment[i].mem == (void *)info->rhdr.rel_addr) {
 			continue;
 		}
+
+		/* Don't include elfcorehdr in the checksum, if hotplug
+		 * support enabled.
+		 */
+		if (do_hotplug && (info->segment[i].mem == (void *)info->elfcorehdr)) {
+			continue;
+		}
+
 		sha256_update(&ctx, info->segment[i].buf,
 			      info->segment[i].bufsz);
 		nullsz = info->segment[i].memsz - info->segment[i].bufsz;
diff --git a/kexec/kexec.h b/kexec/kexec.h
index 487f707..1004aff 100644
--- a/kexec/kexec.h
+++ b/kexec/kexec.h
@@ -170,6 +170,7 @@ struct kexec_info {
 	int command_line_len;
 
 	int skip_checks;
+	unsigned long elfcorehdr;
 };
 
 struct arch_map_entry {
-- 
2.39.3
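Why skipping one segment keeps the digest stable can be demonstrated with a toy model. This is a sketch under loud assumptions: it uses a 64-bit FNV-1a hash as a stand-in for the SHA-256 digest that purgatory actually verifies, and `struct segment` and `digest_segments()` are hypothetical simplifications of the kexec-tools structures.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a kexec segment. */
struct segment {
	const unsigned char *buf;   /* segment contents */
	size_t bufsz;               /* number of bytes in buf */
	unsigned long mem;          /* destination address, used as identity */
};

/* Digest all segments, optionally skipping the one located at elfcorehdr.
 * FNV-1a is used here only for brevity; it is NOT what purgatory uses. */
static uint64_t digest_segments(const struct segment *seg, size_t nr,
				int do_hotplug, unsigned long elfcorehdr)
{
	uint64_t h = 14695981039346656037ULL;   /* FNV-1a offset basis */

	for (size_t i = 0; i < nr; i++) {
		/* Mirror of the check added by the patch. */
		if (do_hotplug && seg[i].mem == elfcorehdr)
			continue;
		for (size_t j = 0; j < seg[i].bufsz; j++) {
			h ^= seg[i].buf[j];
			h *= 1099511628211ULL;   /* FNV-1a prime */
		}
	}
	return h;
}
```

If the bytes of the excluded segment change after the digest is taken, a recomputation with the exclusion still matches, while a digest that covers every segment no longer does. That mismatch is what would stop the capture kernel from starting.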
[PATCH v3 3/6] crashdump: setup general hotplug support
To allow direct modification of the elfcorehdr by the kernel, in
response to CPU and memory hot un/plug and/or online/offline events,
the following conditions must occur:

 - the elfcorehdr buffer must be excluded from the purgatory
   checksum/digest, and
 - the elfcorehdr segment must be large enough, and
 - the kernel must be notified that it can modify the elfcorehdr

Excluding the elfcorehdr buffer from the digest occurs in patch
"crashdump: exclude elfcorehdr segment from digest for hotplug".
If this is not done, a change to the elfcorehdr will cause the
purgatory check at panic time to fail, and the kdump capture kernel
does not start.

For hotplug, the size of the elfcorehdr segment is obtained from the
kernel via the /sys/kernel/crash_elfcorehdr_size node.

The KEXEC_UPDATE_ELFCOREHDR flag indicates to the kernel that it can
make direct modifications to the elfcorehdr.

Signed-off-by: Eric DeVolder
---
 kexec/kexec.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/kexec/kexec.c b/kexec/kexec.c
index d790748..0207608 100644
--- a/kexec/kexec.c
+++ b/kexec/kexec.c
@@ -1631,6 +1631,24 @@ int main(int argc, char *argv[])
 		die("--load-live-update can only be used with xen\n");
 	}
 
+	/* NOTE: Xen KEXEC_LIVE_UPDATE and KEXEC_UPDATE_ELFCOREHDR collide */
+	if (do_hotplug) {
+		const char *ces = "/sys/kernel/crash_elfcorehdr_size";
+		char *buf, *endptr = NULL;
+		off_t nread = 0;
+		buf = slurp_file_len(ces, sizeof(buf)-1, &nread);
+		if (buf) {
+			if (buf[nread-1] == '\n')
+				buf[nread-1] = '\0';
+			elfcorehdrsz = strtoul(buf, &endptr, 0);
+		}
+		if (!elfcorehdrsz || (endptr && *endptr != '\0'))
+			die("Path %s does not exist, the kernel needs CONFIG_CRASH_HOTPLUG\n", ces);
+		dbgprintf("ELFCOREHDR_SIZE %lu\n", elfcorehdrsz);
+		/* Indicate to the kernel it is ok to modify the elfcorehdr */
+		kexec_flags |= KEXEC_UPDATE_ELFCOREHDR;
+	}
+
 	fileind = optind;
 	/* Reset getopt for the next pass; called in other source modules */
 	opterr = 1;
-- 
2.39.3
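The parsing step (strip one trailing newline, convert with strtoul(), reject trailing junk) can be exercised without a sysfs node. The helper below is a standalone sketch; `parse_size_text()` is a hypothetical name and is not part of kexec-tools, and it works on a string already read from the file rather than calling slurp_file_len().

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: parse the text read from a sysfs size node.
 * Returns 0 and stores the value in *out on success; returns -1 if the
 * text is empty, too long, not a number, or followed by trailing junk.
 */
static int parse_size_text(const char *text, unsigned long *out)
{
	char buf[32];
	char *endptr = NULL;
	size_t len = strlen(text);

	if (len == 0 || len >= sizeof(buf))
		return -1;
	memcpy(buf, text, len + 1);

	/* sysfs values normally end in a single newline; drop it. */
	if (buf[len - 1] == '\n')
		buf[--len] = '\0';
	if (len == 0)
		return -1;

	errno = 0;
	*out = strtoul(buf, &endptr, 0);   /* base 0 accepts 0x... as well */
	if (errno != 0 || endptr == buf || *endptr != '\0')
		return -1;
	return 0;
}
```

A caller would treat a -1 return the same way the patch treats a missing node: as a sign that the running kernel does not offer crash hotplug support.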
[PATCH v3 5/6] crashdump/x86: identify elfcorehdr segment for hotplug
Identify the segment containing the elfcorehdr buffer so that it
can be excluded from the purgatory checksum/digest, if hotplug
support is in effect.

Signed-off-by: Eric DeVolder
---
 kexec/arch/i386/crashdump-x86.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kexec/arch/i386/crashdump-x86.c b/kexec/arch/i386/crashdump-x86.c
index df1f24c..cb86ca7 100644
--- a/kexec/arch/i386/crashdump-x86.c
+++ b/kexec/arch/i386/crashdump-x86.c
@@ -956,6 +956,9 @@ int load_crashdump_segments(struct kexec_info *info, char* mod_cmdline,
 	} else {
 		memsz = bufsz;
 	}
+
+	/* Record the location of the elfcorehdr for hotplug handling */
+	info->elfcorehdr =
 	elfcorehdr = add_buffer(info, tmp, bufsz, memsz, align, min_base,
 							max_addr, -1);
 	dbgprintf("Created elf header segment at 0x%lx\n", elfcorehdr);
-- 
2.39.3
[PATCH v3 0/6] crashdump: Kernel handling of CPU and memory hot un/plug
When the kdump service is loaded, if a CPU or memory is hot
un/plugged, the crash elfcorehdr, which describes the CPUs
and memory in the system, must also be updated, else the resulting
vmcore is inaccurate (eg. missing either CPU context or memory
regions).

The current solution utilizes udev (eg. RHEL
/usr/lib/udev/rules.d/98-kexec.rules) to initiate an
unload-then-reload of the *entire* kdump image (eg. kernel, initrd,
boot_params, purgatory and elfcorehdr) by the userspace kexec utility.
This occurs just so the elfcorehdr can be updated with the latest
list of CPUs and memory regions. In a previous post I have outlined
the significant performance problems related to offloading this
activity to userspace.

With the Linux kernel 6.6 commit below, the kernel now has the
ability to directly modify the elfcorehdr, eliminating the need to
unload-then-reload the entire kdump image when CPU or memory is hot
un/plugged or on/offlined.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d68b4b6f307d155475cce541f2aee938032ed22e

This kexec-tools patch series is for supporting hotplug with the
kexec_load() syscall; the kernel directly supports hotplug for the
kexec_file_load() syscall, requiring no userspace help.

There are two basic obstacles/requirements for the kexec-tools to
overcome in order to support kernel hotplug rewriting of the
elfcorehdr.

First, the buffer containing the elfcorehdr must be excluded from the
purgatory checksum/digest, which is computed at load time. Otherwise
kernel run-time changes to the elfcorehdr, as a result of hot un/plug,
would result in the checksum failing (specifically in purgatory at
panic kernel boot time), and kdump capture kernel failing to start.
To let the kernel know it is okay to modify the elfcorehdr, kexec sets
the KEXEC_UPDATE_ELFCOREHDR flag.

NOTE: The kernel specifically does *NOT* attempt to recompute the
checksum/digest as that would ultimately require patching the
in-memory purgatory image with the updated checksum. As that purgatory
image is already fully linked, it is a binary blob containing no ELF
information which would allow it to be re-linked or patched. Thus
excluding the elfcorehdr from the checksum/digests avoids all these
problems.

Second, the size of the elfcorehdr buffer must be large enough
to accommodate growth of the number of CPUs and/or memory regions.

To satisfy the first requirement, this patch series introduces the
--hotplug option to indicate to kexec-tools that kexec should exclude
the elfcorehdr buffer from the purgatory checksum/digest calculation
and set the KEXEC_UPDATE_ELFCOREHDR flag.

To satisfy the second requirement, the size is obtained from the
/sys/kernel/crash_elfcorehdr_size node (new with the kernel series
cited above).

To use this feature with kexec_load() syscall, invoke kexec with:

  kexec -c --hotplug ...

Thanks!
eric
---
v3: 27sep2023
 - Cite the merged Linux 6.6 commit that supports crash hotplug.
 - Removed the --elfcorehdrsz option, instead using the
   /sys/kernel/crash_elfcorehdr_size node from the new kernel
   crash hotplug feature.

v2: 3may2023
 http://lists.infradead.org/pipermail/kexec/2023-May/027049.html
 - Setting KEXEC_UPDATE_ELFCOREHDR flag
 - Utilizing /sys/kernel/crash_elfcorehdr_size info.

v1: 20oct2022
 http://lists.infradead.org/pipermail/kexec/2022-October/026032.html
 - Initial patch series

RFC:
 https://lore.kernel.org/lkml/b04ed259-dc5f-7f30-6661-c26f92d90...@oracle.com/
 s/vmcoreinfo/elfcorehdr/g

---
Eric DeVolder (6):
  kexec: define KEXEC_UPDATE_ELFCOREHDR
  crashdump: introduce the hotplug command line options
  crashdump: setup general hotplug support
  crashdump: exclude elfcorehdr segment from digest for hotplug
  crashdump/x86: identify elfcorehdr segment for hotplug
  crashdump/x86: set the elfcorehdr segment size for hotplug

 kexec/arch/i386/crashdump-x86.c | 11 +++++++++++
 kexec/kexec-syscall.h           |  1 +
 kexec/kexec.8                   |  6 ++++++
 kexec/kexec.c                   | 32 ++++++++++++++++++++++++++++++++
 kexec/kexec.h                   |  8 +++++++-
 5 files changed, 57 insertions(+), 1 deletion(-)

-- 
2.39.3
[PATCH v3 2/6] crashdump: introduce the hotplug command line options
Introduce the --hotplug command line option, which is used to
indicate to the kernel that the kdump image is set up to permit the
kernel to directly modify the elfcorehdr in response to CPU and memory
hotplug and/or online/offline events.

This option is only meaningful for the kexec_load() syscall. For the
kexec_file_load() syscall, this option is a no-op as the kernel
handles all aspects of loading the kdump image.

This patch adds the command line processing and documentation.

Signed-off-by: Eric DeVolder
---
 kexec/kexec.8 | 6 ++++++
 kexec/kexec.c | 6 ++++++
 kexec/kexec.h | 7 ++++++-
 3 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/kexec/kexec.8 b/kexec/kexec.8
index 3a344c5..4400baf 100644
--- a/kexec/kexec.8
+++ b/kexec/kexec.8
@@ -132,6 +132,12 @@ in one call.
 Open a help file for
 .BR kexec .
 .TP
+.B \-\-hotplug
+Setup for kernel modification of the elfcorehdr. This option performs
+the steps needed to support kernel updates to the elfcorehdr in the
+presence of hot un/plug and/or on/offline events. This option only
+useful for KEXEC_LOAD syscall.
+.TP
 .B \-i\ (\-\-no-checks)
 Fast reboot, no memory integrity checks.
 .TP
diff --git a/kexec/kexec.c b/kexec/kexec.c
index 1edbd34..d790748 100644
--- a/kexec/kexec.c
+++ b/kexec/kexec.c
@@ -58,6 +58,8 @@
 
 unsigned long long mem_min = 0;
 unsigned long long mem_max = ULONG_MAX;
+unsigned long elfcorehdrsz = 0;
+int do_hotplug = 0;
 static unsigned long kexec_flags = 0;
 /* Flags for kexec file (fd) based syscall */
 static unsigned long kexec_file_flags = 0;
@@ -1069,6 +1071,7 @@ void usage(void)
 	       "                      back to the compatibility syscall when file based\n"
 	       "                      syscall is not supported or the kernel did not\n"
 	       "                      understand the image (default)\n"
+	       "     --hotplug        Setup for kernel modification of elfcorehdr.\n"
 	       " -d, --debug          Enable debugging to help spot a failure.\n"
 	       " -S, --status         Return 1 if the type (by default crash) is loaded,\n"
 	       "                      0 if not.\n"
@@ -1579,6 +1582,9 @@ int main(int argc, char *argv[])
 		case OPT_PRINT_CKR_SIZE:
 			print_crashkernel_region_size();
 			return 0;
+		case OPT_HOTPLUG:
+			do_hotplug = 1;
+			break;
 		default:
 			break;
 		}
diff --git a/kexec/kexec.h b/kexec/kexec.h
index 0933389..487f707 100644
--- a/kexec/kexec.h
+++ b/kexec/kexec.h
@@ -232,7 +232,8 @@ extern int file_types;
 #define OPT_PRINT_CKR_SIZE	262
 #define OPT_LOAD_LIVE_UPDATE	263
 #define OPT_EXEC_LIVE_UPDATE	264
-#define OPT_MAX			265
+#define OPT_HOTPLUG		265
+#define OPT_MAX			266
 #define KEXEC_OPTIONS \
 	{ "help",		0, 0, OPT_HELP }, \
 	{ "version",		0, 0, OPT_VERSION }, \
@@ -259,6 +260,7 @@ extern int file_types;
 	{ "debug",		0, 0, OPT_DEBUG }, \
 	{ "status",		0, 0, OPT_STATUS }, \
 	{ "print-ckr-size",	0, 0, OPT_PRINT_CKR_SIZE }, \
+	{ "hotplug",		0, 0, OPT_HOTPLUG }, \
 
 #define KEXEC_OPT_STR "h?vdfixyluet:pscaS"
 
@@ -297,6 +299,9 @@ extern int ifdown(void);
 extern char purgatory[];
 extern size_t purgatory_size;
 
+extern unsigned long elfcorehdrsz;
+extern int do_hotplug;
+
 #define BOOTLOADER "kexec"
 #define BOOTLOADER_VERSION PACKAGE_VERSION
 
-- 
2.39.3
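How a long-only option such as --hotplug ends up setting a flag can be seen in a reduced form. This is a sketch assuming only the standard getopt_long() interface; `parse_hotplug()` is a hypothetical helper, and the table holds just the one entry rather than the full KEXEC_OPTIONS list.

```c
#include <getopt.h>
#include <stddef.h>

#define OPT_HOTPLUG 265   /* same numeric value the patch assigns */

/* Hypothetical helper: return 1 if --hotplug appears in argv, else 0. */
static int parse_hotplug(int argc, char *argv[])
{
	static const struct option opts[] = {
		{ "hotplug", no_argument, NULL, OPT_HOTPLUG },
		{ NULL, 0, NULL, 0 },
	};
	int opt;
	int do_hotplug = 0;

	optind = 1;   /* restart scanning in case getopt ran earlier */
	opterr = 0;   /* stay quiet about options this sketch ignores */
	while ((opt = getopt_long(argc, argv, "", opts, NULL)) != -1) {
		if (opt == OPT_HOTPLUG)
			do_hotplug = 1;
	}
	return do_hotplug;
}
```

Values above 255 are used for long-only options so that they cannot collide with any single-character short option returned by the same call.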
[PATCH v3 1/6] kexec: define KEXEC_UPDATE_ELFCOREHDR
The Linux kernel defines this flag to indicate that the
kexec_load()'ed image is set up so that the kernel may directly
modify the elfcorehdr (and not cause the purgatory digest checksum
to fail) in response to CPU or memory hot un/plug and/or on/offline
events.

Define this flag to match/mirror the kernel flag.

Signed-off-by: Eric DeVolder
---
 kexec/kexec-syscall.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kexec/kexec-syscall.h b/kexec/kexec-syscall.h
index 1e2d12f..2559bff 100644
--- a/kexec/kexec-syscall.h
+++ b/kexec/kexec-syscall.h
@@ -112,6 +112,7 @@ static inline long kexec_file_load(int kernel_fd, int initrd_fd,
 
 #define KEXEC_ON_CRASH		0x0001
 #define KEXEC_PRESERVE_CONTEXT	0x0002
+#define KEXEC_UPDATE_ELFCOREHDR	0x0004
 #define KEXEC_ARCH_MASK		0x
 
 /* Flags for kexec file based system call */
-- 
2.39.3
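For orientation, the flag is a single bit that is OR-ed into the flags word passed to kexec_load(). The sketch below assumes the numeric values from the kernel's include/uapi/linux/kexec.h; `crash_load_flags()` is a hypothetical helper and does not exist in kexec-tools.

```c
/* Flag bits, with values as in the kernel's include/uapi/linux/kexec.h. */
#define KEXEC_ON_CRASH          0x00000001UL
#define KEXEC_PRESERVE_CONTEXT  0x00000002UL
#define KEXEC_UPDATE_ELFCOREHDR 0x00000004UL

/* Hypothetical helper: build the flags word for loading a crash kernel.
 * The hotplug bit is added only when the image was prepared for it. */
static unsigned long crash_load_flags(int do_hotplug)
{
	unsigned long flags = KEXEC_ON_CRASH;

	if (do_hotplug)
		flags |= KEXEC_UPDATE_ELFCOREHDR;
	return flags;
}
```

The bit should only be set when the elfcorehdr has also been left out of the purgatory digest, since the flag is a promise to the kernel that in-place updates will not break the checksum.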
Re: [PATCH v3] Crash: add lock to serialize crash hotplug handling
On 9/26/23 15:50, Andrew Morton wrote:
> On Tue, 26 Sep 2023 20:09:05 +0800 Baoquan He wrote:
>
>> Eric reported that handling of the corresponding crash hotplug events
>> can easily fail when many memory hotplug events are notified in a
>> short period. They fail because they cannot take __kexec_lock.
>
> I'm assuming that this failure is sufficiently likely so as to justify
> a -stable backport of the fix. Please let me know if this is
> incorrect.

Andrew,
Correct, this is sufficiently likely to happen.
Thanks,
eric
Re: [PATCH v3] Crash: add lock to serialize crash hotplug handling
On 9/26/23 07:09, Baoquan He wrote: Eric reported that handling corresponding crash hotplug event can be failed easily when many memory hotplug event are notified in a short period. They failed because failing to take __kexec_lock. === [ 78.714569] Fallback order for Node 0: 0 [ 78.714575] Built 1 zonelists, mobility grouping on. Total pages: 1817886 [ 78.717133] Policy zone: Normal [ 78.724423] crash hp: kexec_trylock() failed, elfcorehdr may be inaccurate [ 78.727207] crash hp: kexec_trylock() failed, elfcorehdr may be inaccurate [ 80.056643] PEFILE: Unsigned PE binary === The memory hotplug events are notified very quickly and very many, while the handling of crash hotplug is much slower relatively. So the atomic variable __kexec_lock and kexec_trylock() can't guarantee the serialization of crash hotplug handling. Here, add a new mutex lock __crash_hotplug_lock to serialize crash hotplug handling specifically. This doesn't impact the usage of __kexec_lock. Signed-off-by: Baoquan He I've run this patch in my regression environment and I do not see any lock failures! And I've done this with a variety of DIMM sizes up to 8GiB in order to vary the "size of the swarm". Both with kexec_load and kexec_file_load. Tested-by: Eric DeVolder Reviewed-by: Eric DeVolder --- v2->v3: - crash_check_update_elfcorehdr() need take __crash_hotplug_lock too because there's tiny racing window when kexec_load interface is taken. Eric pointed out this. v1->v2: - Move mutex lock definition into CONFIG_CRASH_HOTPLUG ifdeffery scope in kernel/crash_core.c because the lock is only needed and used in that scope. Suggested by Eric. 
kernel/crash_core.c | 17 + 1 file changed, 17 insertions(+) diff --git a/kernel/crash_core.c b/kernel/crash_core.c index 03a7932cde0a..2f675ef045d4 100644 --- a/kernel/crash_core.c +++ b/kernel/crash_core.c @@ -739,6 +739,17 @@ subsys_initcall(crash_notes_memory_init); #undef pr_fmt #define pr_fmt(fmt) "crash hp: " fmt +/* + * Different than kexec/kdump loading/unloading/jumping/shrinking which + * usually rarely happen, there will be many crash hotplug events notified + * during one short period, e.g one memory board is hot added and memory + * regions are online. So mutex lock __crash_hotplug_lock is used to + * serialize the crash hotplug handling specifically. + */ +DEFINE_MUTEX(__crash_hotplug_lock); +#define crash_hotplug_lock() mutex_lock(&__crash_hotplug_lock) +#define crash_hotplug_unlock() mutex_unlock(&__crash_hotplug_lock) + /* * This routine utilized when the crash_hotplug sysfs node is read. * It reflects the kernel's ability/permission to update the crash @@ -748,9 +759,11 @@ int crash_check_update_elfcorehdr(void) { int rc = 0; + crash_hotplug_lock(); /* Obtain lock while reading crash information */ if (!kexec_trylock()) { pr_info("kexec_trylock() failed, elfcorehdr may be inaccurate\n"); + crash_hotplug_unlock(); return 0; } if (kexec_crash_image) { @@ -761,6 +774,7 @@ int crash_check_update_elfcorehdr(void) } /* Release lock now that update complete */ kexec_unlock(); + crash_hotplug_unlock(); return rc; } @@ -783,9 +797,11 @@ static void crash_handle_hotplug_event(unsigned int hp_action, unsigned int cpu) { struct kimage *image; + crash_hotplug_lock(); /* Obtain lock while changing crash information */ if (!kexec_trylock()) { pr_info("kexec_trylock() failed, elfcorehdr may be inaccurate\n"); + crash_hotplug_unlock(); return; } @@ -852,6 +868,7 @@ static void crash_handle_hotplug_event(unsigned int hp_action, unsigned int cpu) out: /* Release lock now that update complete */ kexec_unlock(); + crash_hotplug_unlock(); } static int 
crash_memhp_notifier(struct notifier_block *nb, unsigned long val, void *v)
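As a rough userspace illustration of the two-level locking in this patch, and not the kernel code itself, the sketch below assumes a pthread mutex standing in for __crash_hotplug_lock and a C11 atomic standing in for __kexec_lock. The demo_ names, MAX_SWARM and run_swarm() are hypothetical. Each handler waits on the outer mutex before attempting the inner trylock, so within the swarm the trylock is never contended.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_SWARM 128	/* hypothetical cap on simulated handlers */

/* Inner lock: stand-in for the atomic __kexec_lock (assumption). */
static atomic_int demo_kexec_lock;
/* Outer lock: stand-in for the new __crash_hotplug_lock mutex (assumption). */
static pthread_mutex_t demo_hotplug_mutex = PTHREAD_MUTEX_INITIALIZER;
/* How many handlers failed the inner trylock. */
static atomic_int demo_trylock_failures;

static bool demo_kexec_trylock(void)
{
	int expected = 0;

	return atomic_compare_exchange_strong_explicit(&demo_kexec_lock,
			&expected, 1,
			memory_order_acquire, memory_order_relaxed);
}

static void demo_kexec_unlock(void)
{
	atomic_store_explicit(&demo_kexec_lock, 0, memory_order_release);
}

/* Models crash_handle_hotplug_event(): wait on the mutex, then trylock. */
static void *demo_hotplug_handler(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&demo_hotplug_mutex);
	if (!demo_kexec_trylock()) {
		atomic_fetch_add(&demo_trylock_failures, 1);
		pthread_mutex_unlock(&demo_hotplug_mutex);
		return NULL;
	}
	/* The elfcorehdr update would happen here. */
	demo_kexec_unlock();
	pthread_mutex_unlock(&demo_hotplug_mutex);
	return NULL;
}

/* Start nthreads handlers at once; return the number of trylock failures. */
static int run_swarm(int nthreads)
{
	pthread_t tid[MAX_SWARM];
	int started, i;

	if (nthreads > MAX_SWARM)
		nthreads = MAX_SWARM;
	atomic_store(&demo_trylock_failures, 0);
	for (started = 0; started < nthreads; started++)
		if (pthread_create(&tid[started], NULL, demo_hotplug_handler, NULL) != 0)
			break;
	for (i = 0; i < started; i++)
		pthread_join(tid[i], NULL);
	assert(atomic_load(&demo_kexec_lock) == 0);	/* inner lock is free again */
	return atomic_load(&demo_trylock_failures);
}
```

Calling run_swarm(16) models the 16-thread swarm from a 1GiB DIMM. Under these assumptions it reports zero trylock failures, because only the holder of the outer mutex ever attempts the inner lock. A kexec syscall holding the inner lock at the same time could still make the trylock fail, as in the patch.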
Re: [PATCH v2] Crash: add lock to serialize crash hotplug handling
On 9/24/23 22:07, Baoquan He wrote: Eric reported that handling corresponding crash hotplug event can be failed easily when many memory hotplug event are notified in a short period. They failed because failing to take __kexec_lock. === [ 78.714569] Fallback order for Node 0: 0 [ 78.714575] Built 1 zonelists, mobility grouping on. Total pages: 1817886 [ 78.717133] Policy zone: Normal [ 78.724423] crash hp: kexec_trylock() failed, elfcorehdr may be inaccurate [ 78.727207] crash hp: kexec_trylock() failed, elfcorehdr may be inaccurate [ 80.056643] PEFILE: Unsigned PE binary === The memory hotplug events are notified very quickly and very many, while the handling of crash hotplug is much slower relatively. So the atomic variable __kexec_lock and kexec_trylock() can't guarantee the serialization of crash hotplug handling. Here, add a new mutex lock __crash_hotplug_lock to serialize crash hotplug handling specifically. This doesn't impact the usage of __kexec_lock. Signed-off-by: Baoquan He --- v1->v2: - Move mutex lock definition into CONFIG_CRASH_HOTPLUG ifdeffery scope in kernel/crash_core.c because the lock is only needed and used in that scope. Suggested by Eric. kernel/crash_core.c | 14 ++ 1 file changed, 14 insertions(+) diff --git a/kernel/crash_core.c b/kernel/crash_core.c index 03a7932cde0a..5951d6366b72 100644 --- a/kernel/crash_core.c +++ b/kernel/crash_core.c @@ -739,6 +739,17 @@ subsys_initcall(crash_notes_memory_init); #undef pr_fmt #define pr_fmt(fmt) "crash hp: " fmt +/* + * Different than kexec/kdump loading/unloading/jumping/shrinking which + * usually rarely happen, there will be many crash hotplug events notified + * during one short period, e.g one memory board is hot added and memory + * regions are online. So mutex lock __crash_hotplug_lock is used to + * serialize the crash hotplug handling specifically. 
+ */
+DEFINE_MUTEX(__crash_hotplug_lock);
+#define crash_hotplug_lock() mutex_lock(&__crash_hotplug_lock)
+#define crash_hotplug_unlock() mutex_unlock(&__crash_hotplug_lock)
+
 /*
  * This routine utilized when the crash_hotplug sysfs node is read.
  * It reflects the kernel's ability/permission to update the crash
@@ -783,9 +794,11 @@ static void crash_handle_hotplug_event(unsigned int hp_action, unsigned int cpu)
 {
 	struct kimage *image;
 
+	crash_hotplug_lock();
 	/* Obtain lock while changing crash information */
 	if (!kexec_trylock()) {
 		pr_info("kexec_trylock() failed, elfcorehdr may be inaccurate\n");
+		crash_hotplug_unlock();
 		return;
 	}
 
@@ -852,6 +865,7 @@ static void crash_handle_hotplug_event(unsigned int hp_action, unsigned int cpu)
 out:
 	/* Release lock now that update complete */
 	kexec_unlock();
+	crash_hotplug_unlock();
 }
 
 static int crash_memhp_notifier(struct notifier_block *nb, unsigned long val, void *v)

The crash_check_update_elfcorehdr() also has kexec_trylock() and needs similar treatment. Userspace (i.e. udev rule processing) and kernel (crash hotplug infrastructure) need to be protected/serialized from one another.

Eric
Re: [PATCH] Crash: add lock to serialize crash hotplug handling
On 9/22/23 18:54, Baoquan He wrote: Eric reported that handling corresponding crash hotplug event can be failed easily when many momery hotplug event are notified in a short period. They failed because failing to take __kexec_lock. === [ 78.714569] Fallback order for Node 0: 0 [ 78.714575] Built 1 zonelists, mobility grouping on. Total pages: 1817886 [ 78.717133] Policy zone: Normal [ 78.724423] crash hp: kexec_trylock() failed, elfcorehdr may be inaccurate [ 78.727207] crash hp: kexec_trylock() failed, elfcorehdr may be inaccurate [ 80.056643] PEFILE: Unsigned PE binary === The memory hotplug events are notified very quickly and very many, while the handling of crash hotplug is much slower relatively. So the atomic variable __kexec_lock and kexec_trylock() can't guarantee the serialization of crash hotplug handling. Here, add a new mutex lock __crash_hotplug_lock to serialize crash hotplug handling specifically. This doesn't impact the usage of __kexec_lock. Signed-off-by: Baoquan He --- kernel/crash_core.c | 3 +++ kernel/kexec_core.c | 1 + kernel/kexec_internal.h | 11 +++ 3 files changed, 15 insertions(+) diff --git a/kernel/crash_core.c b/kernel/crash_core.c index 03a7932cde0a..e8851724a530 100644 --- a/kernel/crash_core.c +++ b/kernel/crash_core.c @@ -783,9 +783,11 @@ static void crash_handle_hotplug_event(unsigned int hp_action, unsigned int cpu) { struct kimage *image; + crash_hotplug_lock(); /* Obtain lock while changing crash information */ if (!kexec_trylock()) { pr_info("kexec_trylock() failed, elfcorehdr may be inaccurate\n"); + crash_hotplug_unlock(); return; } @@ -852,6 +854,7 @@ static void crash_handle_hotplug_event(unsigned int hp_action, unsigned int cpu) out: /* Release lock now that update complete */ kexec_unlock(); + crash_hotplug_unlock(); } The crash_check_update_elfcorehdr() also has kexec_trylock() and needs similar treatment. 
 static int crash_memhp_notifier(struct notifier_block *nb, unsigned long val, void *v)
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index 9dc728982d79..b95a73f35d9a 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -48,6 +48,7 @@
 #include "kexec_internal.h"
 
 atomic_t __kexec_lock = ATOMIC_INIT(0);
+DEFINE_MUTEX(__crash_hotplug_lock);
 
 /* Flag to indicate we are going to kexec a new kernel */
 bool kexec_in_progress = false;
diff --git a/kernel/kexec_internal.h b/kernel/kexec_internal.h
index 74da1409cd14..1db31625ef20 100644
--- a/kernel/kexec_internal.h
+++ b/kernel/kexec_internal.h
@@ -28,6 +28,17 @@ static inline void kexec_unlock(void)
 	atomic_set_release(&__kexec_lock, 0);
 }
 
+/*
+ * Different than kexec/kdump loading/unloading/crash or kexec jumping/shrinking
+ * which usually rarely happen, there will be many crash hotplug events notified
+ * during one short period, e.g one memory board is hot added and memory regions
+ * are online. So mutex lock __crash_hotplug_lock is used to serialize the crash
+ * hotplug handling specifically.
+ * */
+extern struct mutex __crash_hotplug_lock;
+#define crash_hotplug_lock() mutex_lock(&__crash_hotplug_lock)
+#define crash_hotplug_unlock() mutex_unlock(&__crash_hotplug_lock)
+
 #ifdef CONFIG_KEXEC_FILE
 #include
 void kimage_file_post_load_cleanup(struct kimage *image);

The new content for kexec_internal.h and kexec_core.c could/should probably be moved into crash_core.c, within the CONFIG_CRASH_HOTPLUG ifdef scope?

eric
Re: [PATCH] kexec: change locking mechanism to a mutex
On 9/22/23 11:28, Valentin Schneider wrote:
On 21/09/23 17:59, Eric DeVolder wrote:

The design decision to use the atomic lock is described in the comment from kexec_internal.h, cited above. However, examining the code of __crash_kexec():

	if (kexec_trylock()) {
		if (kexec_crash_image) {
			...
		}
		kexec_unlock();
	}

reveals that the use of kexec_trylock() here is actually a "best effort" due to the atomic lock. This atomic lock, prior to crash hotplug, would almost always be assured (another kexec syscall could hold the lock and prevent this, but that is about it).

So at the point where the capture kernel would be invoked, if the lock is not obtained, then kdump doesn't occur.

It is possible to instead use a mutex with proper waiting, and utilize mutex_trylock() as the "best effort" in __crash_kexec(). The use of a mutex then avoids all the lock acquisition problems that were revealed by the crash hotplug activity.

@Dave thanks for the Cc, I'd have missed this otherwise.

Prior to the atomic thingie, we actually had a mutex and did mutex_trylock() in __crash_kexec(). I'm a bit confused as this looks like a revert of 05c6257433b7 ("panic, kexec: make __crash_kexec() NMI safe") with just the helpers kept in - this doesn't seem to address any of the original issues regarding NMIs?

Sebastian raised some good points in [1] regarding these issues. The main hurdle pointed out there is, if we end up in the slowpath during the unlock, then we can end up acquiring the ->wait_lock which isn't NMI safe. This is even worse on PREEMPT_RT, as both trylock and the unlock can end up acquiring the ->wait_lock.

[1]: https://lore.kernel.org/all/yqyz%2fuf14qkyt...@linutronix.de/

Having reviewed the references, it would seem that Baoquan's approach of a new lock to handle the hotplug activity is the way to go?

Eric
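To make the NMI-safety argument above concrete, here is a minimal userspace sketch of a cmpxchg-based trylock, assuming C11 atomics as a stand-in for the kernel's atomic helpers. The demo_ names are hypothetical and this is not the kernel implementation. The point it illustrates: both acquire and release are single atomic operations with no wait queue and no slow path, so nothing in them can block or take a secondary lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the kernel's atomic __kexec_lock. */
static atomic_int demo_kexec_lock;

/* Try once to take the lock; never sleeps and never spins. */
static bool demo_kexec_trylock(void)
{
	int expected = 0;

	/* Succeeds only if the lock word was 0, and sets it to 1. */
	return atomic_compare_exchange_strong_explicit(&demo_kexec_lock,
			&expected, 1,
			memory_order_acquire, memory_order_relaxed);
}

/* Release with a single store; there are no waiters to wake. */
static void demo_kexec_unlock(void)
{
	assert(atomic_load(&demo_kexec_lock) == 1);	/* only a holder unlocks */
	atomic_store_explicit(&demo_kexec_lock, 0, memory_order_release);
}
```

A second demo_kexec_trylock() while the lock is held simply returns false. That is the behaviour the hotplug swarm runs into, and it is also why this style of lock stays usable from a panic or NMI path, where a sleeping mutex's ->wait_lock is not.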
Re: [PATCH] kexec: change locking mechanism to a mutex
On 9/22/23 03:06, Baoquan He wrote: On 09/22/23 at 11:36am, Dave Young wrote: [Cced Valentin Schneider as he added the trylocks] On Fri, 22 Sept 2023 at 06:04, Eric DeVolder wrote: Scaled up testing has revealed that the kexec_trylock() implementation leads to failures within the crash hotplug infrastructure due to the inability to acquire the lock, specifically the message: crash hp: kexec_trylock() failed, elfcorehdr may be inaccurate When hotplug events occur, the crash hotplug infrastructure first attempts to obtain the lock via the kexec_trylock(). However, the implementation either acquires the lock, or fails and returns; there is no waiting on the lock. Here is the comment/explanation from kernel/kexec_internal.h:kexec_trylock(): * Whatever is used to serialize accesses to the kexec_crash_image needs to be * NMI safe, as __crash_kexec() can happen during nmi_panic(), so here we use a * "simple" atomic variable that is acquired with a cmpxchg(). While this in theory can happen for either CPU or memory hoptlug, this problem is most prone to occur for memory hotplug. When memory is hot plugged, the memory is converted into smaller 128MiB memblocks (typically). As each memblock is processed, a kernel thread and a udev event thread are created. The udev thread tries for the lock via the reading of the sysfs node /sys/devices/system/memory/crash_hotplug node, and the kernel worker thread tries for the lock upon entering the crash hotplug infrastructure. These threads then compete for the kexec lock. For example, a 1GiB DIMM is converted into 8 memblocks, each spawning two threads for a total of 16 threads that create a small "swarm" all trying to acquire the lock. The larger the DIMM, the more the memblocks and the larger the swarm. At the root of the problem is the atomic lock behind kexec_trylock(); it works well for low lock traffic; ie loading/unloading a capture kernel, things that happen basically once. 
But with the introduction of crash hotplug, the traffic through the lock increases significantly, and more importantly in bursts occurring at roughly the same time. Thus there is a need to wait on the lock.

Yeah, the atomic __kexec_lock is used to lock the door of operation on kimage. Among kexec/kdump kernel load/unload/shrink/jumping, once any one is in progress, the later attempt doesn't make sense. And these events are rare. Crash hotplug event is different, there will be many during one period. The main problem you are encountering is the concurrent handling of hotplug events, right? Wondering if we can define another mutex lock to serialize the handling of hotplug events like below. Just a prototype to state my thought.

I've tried this patch (with slight change) against my regression setup and it works as well.

Eric

diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index 03a7932cde0a..39b9a57a4177 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -783,6 +783,7 @@ static void crash_handle_hotplug_event(unsigned int hp_action, unsigned int cpu)
 {
 	struct kimage *image;
 
+	crash_hotplug_lock();
 	/* Obtain lock while changing crash information */
 	if (!kexec_trylock()) {
 		pr_info("kexec_trylock() failed, elfcorehdr may be inaccurate\n");
@@ -852,6 +853,7 @@ static void crash_handle_hotplug_event(unsigned int hp_action, unsigned int cpu)
 out:
 	/* Release lock now that update complete */
 	kexec_unlock();
+	crash_hotplug_unlock();
 }
 
 static int crash_memhp_notifier(struct notifier_block *nb, unsigned long val, void *v)
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index 9dc728982d79..b95a73f35d9a 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -48,6 +48,7 @@
 #include "kexec_internal.h"
 
 atomic_t __kexec_lock = ATOMIC_INIT(0);
+DEFINE_MUTEX(__crash_hotplug_lock);
 
 /* Flag to indicate we are going to kexec a new kernel */
 bool kexec_in_progress = false;
diff --git a/kernel/kexec_internal.h b/kernel/kexec_internal.h
index 74da1409cd14..32cb890bb059 100644
--- a/kernel/kexec_internal.h
+++ b/kernel/kexec_internal.h
@@ -28,6 +28,10 @@ static inline void kexec_unlock(void)
 	atomic_set_release(&__kexec_lock, 0);
 }
 
+extern struct mutex __crash_hotplug_lock;
+#define crash_hotplug_lock() mutex_lock(&__crash_hotplug_lock)
+#define crash_hotplug_unlock() mutex_unlock(&__crash_hotplug_lock)
+
 #ifdef CONFIG_KEXEC_FILE
 #include
 void kimage_file_post_load_cleanup(struct kimage *image);
Re: [PATCH] kexec: change locking mechanism to a mutex
On 9/21/23 19:26, Andrew Morton wrote: On Thu, 21 Sep 2023 17:59:38 -0400 Eric DeVolder wrote: Scaled up testing has revealed that the kexec_trylock() implementation leads to failures within the crash hotplug infrastructure due to the inability to acquire the lock, specifically the message: crash hp: kexec_trylock() failed, elfcorehdr may be inaccurate When hotplug events occur, the crash hotplug infrastructure first attempts to obtain the lock via the kexec_trylock(). However, the implementation either acquires the lock, or fails and returns; there is no waiting on the lock. Here is the comment/explanation from kernel/kexec_internal.h:kexec_trylock(): * Whatever is used to serialize accesses to the kexec_crash_image needs to be * NMI safe, as __crash_kexec() can happen during nmi_panic(), so here we use a * "simple" atomic variable that is acquired with a cmpxchg(). While this in theory can happen for either CPU or memory hoptlug, this problem is most prone to occur for memory hotplug. When memory is hot plugged, the memory is converted into smaller 128MiB memblocks (typically). As each memblock is processed, a kernel thread and a udev event thread are created. The udev thread tries for the lock via the reading of the sysfs node /sys/devices/system/memory/crash_hotplug node, and the kernel worker thread tries for the lock upon entering the crash hotplug infrastructure. These threads then compete for the kexec lock. For example, a 1GiB DIMM is converted into 8 memblocks, each spawning two threads for a total of 16 threads that create a small "swarm" all trying to acquire the lock. The larger the DIMM, the more the memblocks and the larger the swarm. At the root of the problem is the atomic lock behind kexec_trylock(); it works well for low lock traffic; ie loading/unloading a capture kernel, things that happen basically once. 
But with the introduction of crash hotplug, the traffic through the lock increases significantly, and more importantly in bursts occurring at roughly the same time. Thus there is a need to wait on the lock. A possible workaround is to simply retry the lock, say up to N times. There is, of course, the problem of determining a value of N that works for all implementations, and for all the other call sites of kexec_trylock(). Not ideal. The design decision to use the atomic lock is described in the comment from kexec_internal.h, cited above. However, examining the code of __crash_kexec(): if (kexec_trylock()) { if (kexec_crash_image) { ... } kexec_unlock(); } reveals that the use of kexec_trylock() here is actually a "best effort" due to the atomic lock. This atomic lock, prior to crash hotplug, would almost always be assured (another kexec syscall could hold the lock and prevent this, but that is about it). So at the point where the capture kernel would be invoked, if the lock is not obtained, then kdump doesn't occur. It is possible to instead use a mutex with proper waiting, and utilize mutex_trylock() as the "best effort" in __crash_kexec(). The use of a mutex then avoids all the lock acquisition problems that were revealed by the crash hotplug activity. Convert the atomic lock to a mutex. ... --- a/kernel/kexec_core.c +++ b/kernel/kexec_core.c @@ -47,7 +47,7 @@ #include #include "kexec_internal.h" -atomic_t __kexec_lock = ATOMIC_INIT(0); +DEFINE_MUTEX(__kexec_lock); /* Flag to indicate we are going to kexec a new kernel */ bool kexec_in_progress = false; @@ -1057,7 +1057,7 @@ void __noclone __crash_kexec(struct pt_regs *regs) * of memory the xchg(_crash_image) would be * sufficient. But since I reuse the memory... */ - if (kexec_trylock()) { + if (mutex_trylock(&__kexec_lock)) { if (kexec_crash_image) { struct pt_regs fixed_regs; What's happening here? If someone else held the lock we silently fail to run the kexec? 
Shouldn't we at least alert the user to what just happened?

Yes, I believe it would silently "fail" and not run the kexec kernel. I do not have a good feel for whether logging would be functional and reliable at this point in time (on a panic path)...

eric
Re: [PATCH] kexec: change locking mechanism to a mutex
On 9/21/23 19:22, Andrew Morton wrote:
On Thu, 21 Sep 2023 17:59:38 -0400 Eric DeVolder wrote:

Scaled up testing has revealed that the kexec_trylock() implementation leads to failures within the crash hotplug infrastructure due to the inability to acquire the lock, specifically the message:
...
Convert the atomic lock to a mutex.

Do you think this problem is serious enough to warrant a backport into -stable kernels?

I do not, since it is the lock traffic created by the crash hotplug infrastructure that reveals the weak locking mechanism. Until crash hotplug shows up in a stable kernel, it should not be an issue; there isn't anything else that easily exercises the lock to reveal the problem.

eric
[PATCH] kexec: change locking mechanism to a mutex
Scaled up testing has revealed that the kexec_trylock() implementation leads to failures within the crash hotplug infrastructure due to the inability to acquire the lock, specifically the message:

  crash hp: kexec_trylock() failed, elfcorehdr may be inaccurate

When hotplug events occur, the crash hotplug infrastructure first attempts to obtain the lock via kexec_trylock(). However, the implementation either acquires the lock, or fails and returns; there is no waiting on the lock. Here is the comment/explanation from kernel/kexec_internal.h:kexec_trylock():

 * Whatever is used to serialize accesses to the kexec_crash_image needs to be
 * NMI safe, as __crash_kexec() can happen during nmi_panic(), so here we use a
 * "simple" atomic variable that is acquired with a cmpxchg().

While this in theory can happen for either CPU or memory hotplug, this problem is most prone to occur for memory hotplug.

When memory is hot plugged, the memory is converted into smaller 128MiB memblocks (typically). As each memblock is processed, a kernel thread and a udev event thread are created. The udev thread tries for the lock via the reading of the sysfs node /sys/devices/system/memory/crash_hotplug, and the kernel worker thread tries for the lock upon entering the crash hotplug infrastructure.

These threads then compete for the kexec lock.

For example, a 1GiB DIMM is converted into 8 memblocks, each spawning two threads for a total of 16 threads that create a small "swarm" all trying to acquire the lock. The larger the DIMM, the more the memblocks and the larger the swarm.

At the root of the problem is the atomic lock behind kexec_trylock(); it works well for low lock traffic, i.e. loading/unloading a capture kernel, things that happen basically once. But with the introduction of crash hotplug, the traffic through the lock increases significantly, and more importantly in bursts occurring at roughly the same time. Thus there is a need to wait on the lock.
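The swarm arithmetic above can be written out as a small calculation. This is only a worked example under the figures given in the description (128MiB memblocks, two contending threads per memblock); the helper names memblock_count() and swarm_threads() are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define MIB (1024ULL * 1024ULL)
#define GIB (1024ULL * MIB)

/* Number of memblocks a hot-added DIMM is split into. */
static uint64_t memblock_count(uint64_t dimm_bytes, uint64_t memblock_bytes)
{
	assert(memblock_bytes != 0);
	return dimm_bytes / memblock_bytes;
}

/* Two contenders per memblock: a kernel worker and a udev event thread. */
static uint64_t swarm_threads(uint64_t dimm_bytes, uint64_t memblock_bytes)
{
	return 2 * memblock_count(dimm_bytes, memblock_bytes);
}
```

With these assumptions, swarm_threads(1 * GIB, 128 * MIB) gives the 16 threads quoted above, and an 8GiB DIMM would give 128, which is why larger DIMMs hit the trylock failure more readily.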
A possible workaround is to simply retry the lock, say up to N times. There is, of course, the problem of determining a value of N that works for all implementations, and for all the other call sites of kexec_trylock(). Not ideal.

The design decision to use the atomic lock is described in the comment from kexec_internal.h, cited above. However, examining the code of __crash_kexec():

	if (kexec_trylock()) {
		if (kexec_crash_image) {
			...
		}
		kexec_unlock();
	}

reveals that the use of kexec_trylock() here is actually a "best effort" due to the atomic lock. This atomic lock, prior to crash hotplug, would almost always be assured (another kexec syscall could hold the lock and prevent this, but that is about it).

So at the point where the capture kernel would be invoked, if the lock is not obtained, then kdump doesn't occur.

It is possible to instead use a mutex with proper waiting, and utilize mutex_trylock() as the "best effort" in __crash_kexec(). The use of a mutex then avoids all the lock acquisition problems that were revealed by the crash hotplug activity.

Convert the atomic lock to a mutex.
Signed-off-by: Eric DeVolder --- kernel/crash_core.c | 10 ++ kernel/kexec.c | 3 +-- kernel/kexec_core.c | 13 + kernel/kexec_file.c | 3 +-- kernel/kexec_internal.h | 12 +++- 5 files changed, 12 insertions(+), 29 deletions(-) diff --git a/kernel/crash_core.c b/kernel/crash_core.c index 03a7932cde0a..9a8378fbdafa 100644 --- a/kernel/crash_core.c +++ b/kernel/crash_core.c @@ -749,10 +749,7 @@ int crash_check_update_elfcorehdr(void) int rc = 0; /* Obtain lock while reading crash information */ - if (!kexec_trylock()) { - pr_info("kexec_trylock() failed, elfcorehdr may be inaccurate\n"); - return 0; - } + kexec_lock(); if (kexec_crash_image) { if (kexec_crash_image->file_mode) rc = 1; @@ -784,10 +781,7 @@ static void crash_handle_hotplug_event(unsigned int hp_action, unsigned int cpu) struct kimage *image; /* Obtain lock while changing crash information */ - if (!kexec_trylock()) { - pr_info("kexec_trylock() failed, elfcorehdr may be inaccurate\n"); - return; - } + kexec_lock(); /* Check kdump is not loaded */ if (!kexec_crash_image) diff --git a/kernel/kexec.c b/kernel/kexec.c index 107f355eac10..a2f687900bb5 100644 --- a/kernel/kexec.c +++ b/kernel/kexec.c @@ -96,8 +96,7 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments, * crash kernels we need a serialization here to prevent multiple crash * kernels from attempting to load simultaneously. */ - if (!kexec_trylock()) - return -EB
Re: [PATCH v28 0/8] crash: Kernel handling of CPU and memory hot un/plug
On 8/14/23 17:33, Andrew Morton wrote:
On Mon, 14 Aug 2023 17:44:38 -0400 Eric DeVolder wrote:

This series is dependent upon "refactor Kconfig to consolidate KEXEC and CRASH options".
https://lore.kernel.org/lkml/20230712161545.87870-1-eric.devol...@oracle.com/

Once the kdump service is loaded, if changes to CPUs or memory occur, either by hot un/plug or off/onlining, the crash elfcorehdr must also be updated.

Thanks, I updated branch mm-nonmm-unstable to this version.

Andrew,
So far only one issue has popped up. I've posted the following patch to akpm to solve that issue. Please apply this patch on-top/with this v28 series.

[PATCH] x86/crash: correct unused function build error

The thread on this issue is here:
https://lore.kernel.org/lkml/08fc20ef-854d-404a-b2f2-75941eeeccf8@paulmck-laptop/

If you'd rather I post a v29, I'll happily do so.
Thank you!
eric
Re: [PATCH v27 2/8] crash: add generic infrastructure for crash hotplug support
On 8/12/23 05:47, Sourabh Jain wrote: Hello Eric, On 11/08/23 22:36, Eric DeVolder wrote: To support crash hotplug, a mechanism is needed to update the crash elfcorehdr upon CPU or memory changes (eg. hot un/plug or off/ onlining). The crash elfcorehdr describes the CPUs and memory to be written into the vmcore. To track CPU changes, callbacks are registered with the cpuhp mechanism via cpuhp_setup_state_nocalls(CPUHP_BP_PREPARE_DYN). The crash hotplug elfcorehdr update has no explicit ordering requirement (relative to other cpuhp states), so meets the criteria for utilizing CPUHP_BP_PREPARE_DYN. CPUHP_BP_PREPARE_DYN is a dynamic state and avoids the need to introduce a new state for crash hotplug. Also, CPUHP_BP_PREPARE_DYN is the last state in the PREPARE group, just prior to the STARTING group, which is very close to the CPU starting up in a plug/online situation, or stopping in a unplug/ offline situation. This minimizes the window of time during an actual plug/online or unplug/offline situation in which the elfcorehdr would be inaccurate. Note that for a CPU being unplugged or offlined, the CPU will still be present in the list of CPUs generated by crash_prepare_elf64_headers(). However, there is no need to explicitly omit the CPU, see justification in 'crash: change crash_prepare_elf64_headers() to for_each_possible_cpu()'. To track memory changes, a notifier is registered to capture the memblock MEM_ONLINE and MEM_OFFLINE events via register_memory_notifier(). The CPU callbacks and memory notifiers invoke crash_handle_hotplug_event() which performs needed tasks and then dispatches the event to the architecture specific arch_crash_handle_hotplug_event() to update the elfcorehdr with the current state of CPUs and memory. During the process, the kexec_lock is held. 
Signed-off-by: Eric DeVolder Reviewed-by: Sourabh Jain Acked-by: Hari Bathini Acked-by: Baoquan He --- include/linux/crash_core.h | 9 +++ include/linux/kexec.h | 11 +++ kernel/Kconfig.kexec | 31 kernel/crash_core.c | 142 + kernel/kexec_core.c | 6 ++ 5 files changed, 199 insertions(+) diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h index de62a722431e..e14345cc7a22 100644 --- a/include/linux/crash_core.h +++ b/include/linux/crash_core.h @@ -84,4 +84,13 @@ int parse_crashkernel_high(char *cmdline, unsigned long long system_ram, int parse_crashkernel_low(char *cmdline, unsigned long long system_ram, unsigned long long *crash_size, unsigned long long *crash_base); +#define KEXEC_CRASH_HP_NONE 0 +#define KEXEC_CRASH_HP_ADD_CPU 1 +#define KEXEC_CRASH_HP_REMOVE_CPU 2 +#define KEXEC_CRASH_HP_ADD_MEMORY 3 +#define KEXEC_CRASH_HP_REMOVE_MEMORY 4 +#define KEXEC_CRASH_HP_INVALID_CPU -1U + +struct kimage; + #endif /* LINUX_CRASH_CORE_H */ diff --git a/include/linux/kexec.h b/include/linux/kexec.h index 811a90e09698..b9903dd48e24 100644 --- a/include/linux/kexec.h +++ b/include/linux/kexec.h @@ -33,6 +33,7 @@ extern note_buf_t __percpu *crash_notes; #include #include #include +#include #include /* Verify architecture specific macros are defined */ @@ -360,6 +361,12 @@ struct kimage { struct purgatory_info purgatory_info; #endif +#ifdef CONFIG_CRASH_HOTPLUG + int hp_action; + int elfcorehdr_index; + bool elfcorehdr_updated; +#endif + #ifdef CONFIG_IMA_KEXEC /* Virtual address of IMA measurement buffer for kexec syscall */ void *ima_buffer; @@ -490,6 +497,10 @@ static inline int arch_kexec_post_alloc_pages(void *vaddr, unsigned int pages, g static inline void arch_kexec_pre_free_pages(void *vaddr, unsigned int pages) { } #endif +#ifndef arch_crash_handle_hotplug_event +static inline void arch_crash_handle_hotplug_event(struct kimage *image) { } +#endif + Isn't the above function should be declare under CONFIG_CRASH_HOTPLUG? 
Thanks, Sourabh There are no compiler warnings/errors, due to the nature of being declared static inline. And most of the other functions defined in a similar way in this file are not guard banded by CONFIG ifdefs. I'm inclined to leave it this way. Thanks! eric #else /* !CONFIG_KEXEC_CORE */ struct pt_regs; struct task_struct; diff --git a/kernel/Kconfig.kexec b/kernel/Kconfig.kexec index ff72e45cfaef..d0a9a5392035 100644 --- a/kernel/Kconfig.kexec +++ b/kernel/Kconfig.kexec @@ -113,4 +113,35 @@ config CRASH_DUMP For s390, this option also enables zfcpdump. See also +config CRASH_HOTPLUG + bool "Update the crash elfcorehdr on system configuration changes" + default y + depends on CRASH_DUMP && (HOTPLUG_CPU || MEMORY_HOTPLUG) + depends on ARCH_SUPPORTS_CRASH_HOTPLUG + help + Enable direct update to the crash elfcorehdr (which contains + the list of CPUs and memory regions to be dumped upon a crash) + in response to hot plu
[PATCH v28 6/8] crash: hotplug support for kexec_load()
The hotplug support for kexec_load() requires changes to the userspace kexec-tools and a little extra help from the kernel.

Given a kdump capture kernel loaded via kexec_load(), and a subsequent hotplug event, the crash hotplug handler finds the elfcorehdr and rewrites it to reflect the hotplug change. That is the desired outcome; however, at kernel panic time, the purgatory integrity check fails (because the elfcorehdr changed), and the capture kernel does not boot and no vmcore is generated.

Therefore, the userspace kexec-tools/kexec must indicate to the kernel that the elfcorehdr can be modified (because the kexec excluded the elfcorehdr from the digest, and sized the elfcorehdr memory buffer appropriately).

To facilitate hotplug support with kexec_load():
 - a new kexec flag KEXEC_UPDATE_ELFCOREHDR indicates that it is safe for the kernel to modify the kexec_load()'d elfcorehdr
 - the /sys/kernel/crash_elfcorehdr_size node communicates the preferred size of the elfcorehdr memory buffer
 - the sysfs crash_hotplug nodes (i.e. /sys/devices/system/[cpu|memory]/crash_hotplug) dynamically take into account kexec_file_load() vs kexec_load() and KEXEC_UPDATE_ELFCOREHDR. This is critical so that the udev rule processing of crash_hotplug is all that is needed to determine if the userspace unload-then-load of the kdump image is to be skipped, or not.
The proposed udev rule change looks like:

 # The kernel updates the crash elfcorehdr for CPU and memory changes
 SUBSYSTEM=="cpu", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"
 SUBSYSTEM=="memory", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"

The table below indicates the behavior of kexec_load()'d kdump image updates (with the new udev crash_hotplug rule in place):

  Kernel | Kexec Old | Kexec New
  -------+-----------+----------
  Old    |     a     |     a
  New    |     a     |     b

where kexec 'old' and 'new' indicate whether kexec-tools has the needed modifications for the crash hotplug feature, and kernel 'old' and 'new' indicate whether the kernel supports this crash hotplug feature.

Behavior 'a' indicates the unload-then-reload of the entire kdump image. For the kexec 'old' column, the unload-then-reload occurs due to the missing flag KEXEC_UPDATE_ELFCOREHDR. An 'old' kernel (with 'new' kexec) does not present the crash_hotplug sysfs node, which leads to the unload-then-reload of the kdump image.

Behavior 'b' indicates the desired optimized behavior of the kernel directly modifying the elfcorehdr and avoiding the unload-then-reload of the kdump image.

If the udev rule is not updated with the crash_hotplug node check, then regardless of whether the kernel or kexec is new or old, the kdump image continues to be unloaded and then reloaded on hotplug changes.

To fully support the crash hotplug feature, there needs to be a rollout of kernel, kexec-tools and udev rule changes. However, the order of the rollout of these pieces does not matter; kexec_load()'d kdump images still function for hotplug as-is.
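The compatibility matrix described above can be written down as a tiny decision function. This is only a sketch for checking the reasoning; kdump_update_behavior() and its parameters are made-up names, and the udev rule state is added as a third input to cover the "rule not updated" case.

```c
#include <stdbool.h>

/*
 * Returns 'a' for the userspace unload-then-reload of the kdump image,
 * or 'b' for the kernel updating the elfcorehdr in place.
 */
static char kdump_update_behavior(bool kernel_new, bool kexec_new,
				  bool udev_rule_updated)
{
	/* Without the crash_hotplug check in the udev rule, userspace always reloads. */
	if (!udev_rule_updated)
		return 'a';

	/*
	 * An old kernel presents no crash_hotplug node, and an old kexec
	 * never sets KEXEC_UPDATE_ELFCOREHDR; either one forces a reload.
	 */
	if (!kernel_new || !kexec_new)
		return 'a';

	return 'b';
}
```

Only the combination of new kernel, new kexec and updated udev rule yields 'b', which is why the rollout order of the three pieces does not matter.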
Suggested-by: Hari Bathini Signed-off-by: Eric DeVolder Acked-by: Hari Bathini Acked-by: Baoquan He --- arch/x86/include/asm/kexec.h | 11 +++ arch/x86/kernel/crash.c | 27 +++ include/linux/kexec.h| 14 -- include/uapi/linux/kexec.h | 1 + kernel/Kconfig.kexec | 4 kernel/crash_core.c | 31 +++ kernel/kexec.c | 5 + kernel/ksysfs.c | 15 +++ 8 files changed, 102 insertions(+), 6 deletions(-) diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h index 9143100ea3ea..3be6a98751f0 100644 --- a/arch/x86/include/asm/kexec.h +++ b/arch/x86/include/asm/kexec.h @@ -214,14 +214,17 @@ void arch_crash_handle_hotplug_event(struct kimage *image); #define arch_crash_handle_hotplug_event arch_crash_handle_hotplug_event #ifdef CONFIG_HOTPLUG_CPU -static inline int crash_hotplug_cpu_support(void) { return 1; } -#define crash_hotplug_cpu_support crash_hotplug_cpu_support +int arch_crash_hotplug_cpu_support(void); +#define crash_hotplug_cpu_support arch_crash_hotplug_cpu_support #endif #ifdef CONFIG_MEMORY_HOTPLUG -static inline int crash_hotplug_memory_support(void) { return 1; } -#define crash_hotplug_memory_support crash_hotplug_memory_support +int arch_crash_hotplug_memory_support(void); +#define crash_hotplug_memory_support arch_crash_hotplug_memory_support #endif + +unsigned int arch_crash_get_elfcorehdr_size(void); +#define crash_get_elfcorehdr_size arch_crash_get_elfcorehdr_size #endif #endif /* __ASSEMBLY__ */ diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c index 4b6cebceec68..1900efcdf1bc 100644 --- a/arch/x86/kernel/crash.c +++ b/arch/x86/kernel/crash.c @@ -429,6 +429,33 @@ int crash_load_segments(struct kimage *image) #undef pr_fmt #define pr_fmt(fmt) "cras
[PATCH v28 5/8] x86/crash: add x86 crash hotplug support
When CPU or memory is hot un/plugged, or off/onlined, the crash elfcorehdr, which describes the CPUs and memory in the system, must also be updated. A new elfcorehdr is generated from the available CPUs and memory and replaces the existing elfcorehdr. The segment containing the elfcorehdr is identified at run-time in crash_core:crash_handle_hotplug_event(). No modifications to purgatory (see 'kexec: exclude elfcorehdr from the segment digest') or boot_params (as the elfcorehdr= capture kernel command line parameter pointer remains unchanged and correct) are needed, just elfcorehdr. For kexec_file_load(), the elfcorehdr segment size is based on NR_CPUS and CRASH_MAX_MEMORY_RANGES in order to accommodate a growing number of CPU and memory resources. For kexec_load(), the userspace kexec utility needs to size the elfcorehdr segment in the same/similar manner. To accommodate kexec_load() syscall in the absence of kexec_file_load() syscall support, prepare_elf_headers() and dependents are moved outside of CONFIG_KEXEC_FILE. 
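As a rough illustration of the sizing rule above, the elfcorehdr buffer can be estimated from the maximum CPU and memory range counts. This is a sketch, not the code from this patch: elfcorehdr_buffer_size() and its parameters are hypothetical, the two fixed program headers (vmcoreinfo note and kernel text mapping) are taken from crash_prepare_elf64_headers(), and any alignment of the result is left out.

```c
/* Sizes of the ELF64 structures, as fixed by the ELF-64 format. */
#define ELF64_EHDR_SIZE 64UL /* sizeof(Elf64_Ehdr) */
#define ELF64_PHDR_SIZE 56UL /* sizeof(Elf64_Phdr) */

/*
 * Estimate the buffer needed for an elfcorehdr that can describe up to
 * max_cpus CPUs and max_mem_ranges memory ranges (e.g. NR_CPUS and
 * CRASH_MAX_MEMORY_RANGES for kexec_file_load()).
 */
static unsigned long elfcorehdr_buffer_size(unsigned long max_cpus,
					    unsigned long max_mem_ranges)
{
	/* Two fixed program headers: the vmcoreinfo note and the kernel text mapping. */
	unsigned long nr_phdr = 2 + max_cpus + max_mem_ranges;

	return ELF64_EHDR_SIZE + nr_phdr * ELF64_PHDR_SIZE;
}
```

For example, 8 CPUs and 8192 memory ranges come to 64 + 8202 * 56 = 459376 bytes in this model. A kexec_load() userspace utility would need to reserve a comparably sized segment.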
Signed-off-by: Eric DeVolder Reviewed-by: Sourabh Jain Acked-by: Hari Bathini Acked-by: Baoquan He --- arch/x86/Kconfig | 3 + arch/x86/include/asm/kexec.h | 15 + arch/x86/kernel/crash.c | 105 --- 3 files changed, 116 insertions(+), 7 deletions(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 7082fc10b346..ffc95c3d6abd 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -2069,6 +2069,9 @@ config ARCH_SUPPORTS_KEXEC_JUMP config ARCH_SUPPORTS_CRASH_DUMP def_bool X86_64 || (X86_32 && HIGHMEM) +config ARCH_SUPPORTS_CRASH_HOTPLUG + def_bool y + config PHYSICAL_START hex "Physical address where the kernel is loaded" if (EXPERT || CRASH_DUMP) default "0x100" diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h index 5b77bbc28f96..9143100ea3ea 100644 --- a/arch/x86/include/asm/kexec.h +++ b/arch/x86/include/asm/kexec.h @@ -209,6 +209,21 @@ typedef void crash_vmclear_fn(void); extern crash_vmclear_fn __rcu *crash_vmclear_loaded_vmcss; extern void kdump_nmi_shootdown_cpus(void); +#ifdef CONFIG_CRASH_HOTPLUG +void arch_crash_handle_hotplug_event(struct kimage *image); +#define arch_crash_handle_hotplug_event arch_crash_handle_hotplug_event + +#ifdef CONFIG_HOTPLUG_CPU +static inline int crash_hotplug_cpu_support(void) { return 1; } +#define crash_hotplug_cpu_support crash_hotplug_cpu_support +#endif + +#ifdef CONFIG_MEMORY_HOTPLUG +static inline int crash_hotplug_memory_support(void) { return 1; } +#define crash_hotplug_memory_support crash_hotplug_memory_support +#endif +#endif + #endif /* __ASSEMBLY__ */ #endif /* _ASM_X86_KEXEC_H */ diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c index cdd92ab43cda..4b6cebceec68 100644 --- a/arch/x86/kernel/crash.c +++ b/arch/x86/kernel/crash.c @@ -158,8 +158,7 @@ void native_machine_crash_shutdown(struct pt_regs *regs) crash_save_cpu(regs, safe_smp_processor_id()); } -#ifdef CONFIG_KEXEC_FILE - +#if defined(CONFIG_KEXEC_FILE) || defined(CONFIG_CRASH_DUMP) static int 
get_nr_ram_ranges_callback(struct resource *res, void *arg)
 {
 	unsigned int *nr_ranges = arg;
@@ -231,7 +230,7 @@ static int prepare_elf64_ram_headers_callback(struct resource *res, void *arg)
 /* Prepare elf headers. Return addr and size */
 static int prepare_elf_headers(struct kimage *image, void **addr,
-					unsigned long *sz)
+					unsigned long *sz, unsigned long *nr_mem_ranges)
 {
 	struct crash_mem *cmem;
 	int ret;
@@ -249,6 +248,9 @@ static int prepare_elf_headers(struct kimage *image, void **addr,
 	if (ret)
 		goto out;
 
+	/* Return the computed number of memory ranges, for hotplug usage */
+	*nr_mem_ranges = cmem->nr_ranges;
+
 	/* By default prepare 64bit headers */
 	ret = crash_prepare_elf64_headers(cmem, IS_ENABLED(CONFIG_X86_64), addr, sz);
 
@@ -256,7 +258,9 @@ static int prepare_elf_headers(struct kimage *image, void **addr,
 	vfree(cmem);
 	return ret;
 }
+#endif
 
+#ifdef CONFIG_KEXEC_FILE
 static int add_e820_entry(struct boot_params *params, struct e820_entry *entry)
 {
 	unsigned int nr_e820_entries;
@@ -371,18 +375,42 @@ int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params)
 int crash_load_segments(struct kimage *image)
 {
 	int ret;
+	unsigned long pnum = 0;
 	struct kexec_buf kbuf = { .image = image, .buf_min = 0,
 				  .buf_max = ULONG_MAX, .top_down = false };
 
 	/* Prepare elf headers and add a segment */
-	ret = prepare_elf_headers(image, &kbuf.buffer, &kbuf.bufsz);
+	ret = prepare_elf_headers(image, &kbuf.buffer, &kbuf.bufsz, &pnum);
 	if (ret)
 		return ret;
 
-	image->elf_headers = kbuf.buffer;
-	image->elf_headers_sz = k
[PATCH v28 0/8] crash: Kernel handling of CPU and memory hot un/plug
 - Per David Hansen, converted to use of kmap_local_page().
 - Per Baoquan He, replaced use of __weak with the kexec technique.

v9: 13jun2022
 https://lkml.org/lkml/2022/6/13/3382
 https://lore.kernel.org/lkml/20220613224240.79400-1-eric.devol...@oracle.com/
 - Rebased to 5.18.0
 - Per Sourabh, moved crash_prepare_elf64_headers() into common crash_core.c to avoid compile issues with kexec_load only path.
 - Per David Hildenbrand, replaced mutex_trylock() with mutex_lock().
 - Changed the __weak arch_crash_handle_hotplug_event() to utilize WARN_ONCE() instead of WARN(). Fix some formatting issues.
 - Per Sourabh, introduced sysfs attribute crash_hotplug for memory and CPUs; for use by userspace (udev) to determine if the kernel performs crash hot un/plug support.
 - Per Sourabh, moved the code detecting the elfcorehdr segment from arch/x86 into crash_core:handle_hotplug_event() so both kexec_load and kexec_file_load can benefit.
 - Updated userspace kexec-tools kexec utility to reflect change to using CRASH_MAX_MEMORY_RANGES and get_nr_cpus().
 - Updated the new proposed udev rules to reflect using the sysfs attributes crash_hotplug.

v8: 5may2022
 https://lkml.org/lkml/2022/5/5/1133
 https://lore.kernel.org/lkml/20220505184603.1548-1-eric.devol...@oracle.com/
 - Per Borislav Petkov, eliminated CONFIG_CRASH_HOTPLUG in favor of CONFIG_HOTPLUG_CPU || CONFIG_MEMORY_HOTPLUG, ie a new define is not needed. Also use of IS_ENABLED() rather than #ifdef's. Renamed crash_hotplug_handler() to handle_hotplug_event(). And other corrections.
 - Per Baoquan, minimized the parameters to the arch_crash_handle_hotplug_event() to hp_action and cpu.
 - Introduce KEXEC_CRASH_HP_INVALID_CPU definition, per Baoquan.
 - Per Sourabh Jain, renamed and repurposed CRASH_HOTPLUG_ELFCOREHDR_SZ to CONFIG_CRASH_MAX_MEMORY_RANGES, mirroring kexec-tools change by David Hildenbrand. Folded this patch into the x86 kexec_file_load support patch.

v7: 13apr2022
 https://lkml.org/lkml/2022/4/13/850
 https://lore.kernel.org/lkml/20220413164237.20845-1-eric.devol...@oracle.com/
 - Resolved parameter usage to crash_hotplug_handler(), per Baoquan.

v6: 1apr2022
 https://lkml.org/lkml/2022/4/1/1203
 https://lore.kernel.org/lkml/20220401183040.1624-1-eric.devol...@oracle.com/
 - Reword commit messages and some comment cleanup per Baoquan.
 - Changed elf_index to elfcorehdr_index for clarity.
 - Minor code changes per Baoquan.

v5: 3mar2022
 https://lkml.org/lkml/2022/3/3/674
 https://lore.kernel.org/lkml/20220303162725.49640-1-eric.devol...@oracle.com/
 - Reworded description of CRASH_HOTPLUG_ELFCOREHDR_SZ, per David Hildenbrand.
 - Refactored slightly a few patches per Baoquan recommendation.

v4: 9feb2022
 https://lkml.org/lkml/2022/2/9/1406
 https://lore.kernel.org/lkml/20220209195706.51522-1-eric.devol...@oracle.com/
 - Refactored patches per Baoquan suggestions.
 - A few corrections, per Baoquan.

v3: 10jan2022
 https://lkml.org/lkml/2022/1/10/1212
 https://lore.kernel.org/lkml/20220110195727.1682-1-eric.devol...@oracle.com/
 - Rebasing per Baoquan He request.
 - Changed memory notifier per David Hildenbrand.
 - Providing example kexec userspace change in cover letter.

RFC v2: 7dec2021
 https://lkml.org/lkml/2021/12/7/1088
 https://lore.kernel.org/lkml/20211207195204.1582-1-eric.devol...@oracle.com/
 - Acting upon Baoquan He suggestion of removing elfcorehdr from the purgatory list of segments, removed purgatory code from patchset, and it is significantly simpler now.

RFC v1: 18nov2021
 https://lkml.org/lkml/2021/11/18/845
 https://lore.kernel.org/lkml/2028174948.37435-1-eric.devol...@oracle.com/
 - working patchset demonstrating kernel handling of hotplug updates to x86 elfcorehdr for kexec_file_load

RFC: 14dec2020
 https://lkml.org/lkml/2020/12/14/532
 https://lore.kernel.org/lkml/b04ed259-dc5f-7f30-6661-c26f92d90...@oracle.com/
 NOTE: s/vmcoreinfo/elfcorehdr/g
 - proposed concept of allowing kernel to handle hotplug update of elfcorehdr
---

Eric DeVolder (8):
  crash: move a few code bits to setup support of crash hotplug
  crash: add generic infrastructure for crash hotplug support
  kexec: exclude elfcorehdr from the segment digest
  crash: memory and CPU hotplug sysfs attributes
  x86/crash: add x86 crash hotplug support
  crash: hotplug support for kexec_load()
  crash: change crash_prepare_elf64_headers() to for_each_possible_cpu()
  x86/crash: optimize CPU changes

 .../ABI/testing/sysfs-devices-memory | 8 +
 .../ABI/testing/sysfs-devices-system-cpu | 8 +
 .../admin-guide/mm/memory-hotplug.rst | 8 +
 Documentation/core-api/cpu_hotplug.rst | 18 +
 arch/x86/Kconfig | 3 +
 arch/x86/include/asm/kexec.h | 18 +
 arch/x86/kernel/crash.c | 142 ++-
 drivers/base/cpu.c | 13 +
 drivers/base/memory.c
[PATCH v28 1/8] crash: move a few code bits to setup support of crash hotplug
The crash hotplug support leans on the work for the kexec_file_load() syscall. To also support the kexec_load() syscall, a few bits of code need to be moved outside of CONFIG_KEXEC_FILE. As such, these bits are moved out of kexec_file.c and into a common location, crash_core.c.

In addition, struct crash_mem and crash_notes were moved to new locales so that PROC_KCORE, which sets CRASH_CORE alone, builds correctly.

No functionality change intended.

Signed-off-by: Eric DeVolder
Reviewed-by: Sourabh Jain
Acked-by: Hari Bathini
Acked-by: Baoquan He
---
 include/linux/crash_core.h | 20
 include/linux/kexec.h | 15 ---
 kernel/crash_core.c | 218 +
 kernel/kexec_core.c | 37 ---
 kernel/kexec_file.c | 181 --
 5 files changed, 238 insertions(+), 233 deletions(-)

diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h
index de62a722431e..1e48b1d96404 100644
--- a/include/linux/crash_core.h
+++ b/include/linux/crash_core.h
@@ -28,6 +28,8 @@ VMCOREINFO_BYTES)
 typedef u32 note_buf_t[CRASH_CORE_NOTE_BYTES/4];
 
+/* Per cpu memory for storing cpu states in case of system crash.
*/ +extern note_buf_t __percpu *crash_notes; void crash_update_vmcoreinfo_safecopy(void *ptr); void crash_save_vmcoreinfo(void); @@ -84,4 +86,22 @@ int parse_crashkernel_high(char *cmdline, unsigned long long system_ram, int parse_crashkernel_low(char *cmdline, unsigned long long system_ram, unsigned long long *crash_size, unsigned long long *crash_base); +/* Alignment required for elf header segment */ +#define ELF_CORE_HEADER_ALIGN 4096 + +struct crash_mem { + unsigned int max_nr_ranges; + unsigned int nr_ranges; + struct range ranges[]; +}; + +extern int crash_exclude_mem_range(struct crash_mem *mem, + unsigned long long mstart, + unsigned long long mend); +extern int crash_prepare_elf64_headers(struct crash_mem *mem, int need_kernel_map, + void **addr, unsigned long *sz); + +struct kimage; +struct kexec_segment; + #endif /* LINUX_CRASH_CORE_H */ diff --git a/include/linux/kexec.h b/include/linux/kexec.h index 22b5cd24f581..fb4350db33ff 100644 --- a/include/linux/kexec.h +++ b/include/linux/kexec.h @@ -230,21 +230,6 @@ static inline int arch_kexec_locate_mem_hole(struct kexec_buf *kbuf) } #endif -/* Alignment required for elf header segment */ -#define ELF_CORE_HEADER_ALIGN 4096 - -struct crash_mem { - unsigned int max_nr_ranges; - unsigned int nr_ranges; - struct range ranges[]; -}; - -extern int crash_exclude_mem_range(struct crash_mem *mem, - unsigned long long mstart, - unsigned long long mend); -extern int crash_prepare_elf64_headers(struct crash_mem *mem, int need_kernel_map, - void **addr, unsigned long *sz); - #ifndef arch_kexec_apply_relocations_add /* * arch_kexec_apply_relocations_add - apply relocations of type RELA diff --git a/kernel/crash_core.c b/kernel/crash_core.c index 90ce1dfd591c..336083fba623 100644 --- a/kernel/crash_core.c +++ b/kernel/crash_core.c @@ -10,6 +10,7 @@ #include #include #include +#include #include #include @@ -18,6 +19,9 @@ #include "kallsyms_internal.h" +/* Per cpu memory for storing cpu states in case of system crash. 
*/ +note_buf_t __percpu *crash_notes; + /* vmcoreinfo stuff */ unsigned char *vmcoreinfo_data; size_t vmcoreinfo_size; @@ -314,6 +318,187 @@ static int __init parse_crashkernel_dummy(char *arg) } early_param("crashkernel", parse_crashkernel_dummy); +int crash_prepare_elf64_headers(struct crash_mem *mem, int need_kernel_map, + void **addr, unsigned long *sz) +{ + Elf64_Ehdr *ehdr; + Elf64_Phdr *phdr; + unsigned long nr_cpus = num_possible_cpus(), nr_phdr, elf_sz; + unsigned char *buf; + unsigned int cpu, i; + unsigned long long notes_addr; + unsigned long mstart, mend; + + /* extra phdr for vmcoreinfo ELF note */ + nr_phdr = nr_cpus + 1; + nr_phdr += mem->nr_ranges; + + /* +* kexec-tools creates an extra PT_LOAD phdr for kernel text mapping +* area (for example, 8000 - a000 on x86_64). +* I think this is required by tools like gdb. So same physical +* memory will be mapped in two ELF headers. One will contain kernel +* text virtual addresses and other will have __va(physical) addresses. +*/ + + nr_phdr++; + elf_sz = sizeof(Elf64_Ehdr) + nr_phdr * sizeof(Elf64_Phdr); + elf_sz = ALIGN(elf_sz, ELF_CORE_HEADER_ALIGN); + + buf = vzalloc(elf_sz); + if (!buf) + return -ENOMEM; + + ehdr = (Elf64_Ehdr *)buf; +
[PATCH v28 8/8] x86/crash: optimize CPU changes
crash_prepare_elf64_headers() writes into the elfcorehdr an ELF PT_NOTE for all possible CPUs. As such, subsequent changes to CPUs (ie. hot un/plug, online/offline) do not need to rewrite the elfcorehdr.

The kimage->file_mode term covers kdump images loaded via the kexec_file_load() syscall. Since crash_prepare_elf64_headers() wrote the initial elfcorehdr, no update to the elfcorehdr is needed for CPU changes.

The kimage->elfcorehdr_updated term covers kdump images loaded via the kexec_load() syscall. At least one memory or CPU change must occur to cause crash_prepare_elf64_headers() to rewrite the elfcorehdr. Afterwards, no update to the elfcorehdr is needed for CPU changes.

This code is intentionally *NOT* hoisted into crash_handle_hotplug_event() as it would prevent the arch-specific handler from running for CPU changes. This would break PPC, for example, which needs to update other information besides the elfcorehdr, on CPU changes.

Signed-off-by: Eric DeVolder
Reviewed-by: Sourabh Jain
Acked-by: Hari Bathini
Acked-by: Baoquan He
---
 arch/x86/kernel/crash.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index 1900efcdf1bc..86d2ca80b9b2 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -469,6 +469,16 @@ void arch_crash_handle_hotplug_event(struct kimage *image)
 	unsigned long mem, memsz;
 	unsigned long elfsz = 0;
 
+	/*
+	 * As crash_prepare_elf64_headers() has already described all
+	 * possible CPUs, there is no need to update the elfcorehdr
+	 * for additional CPU changes.
+	 */
+	if ((image->file_mode || image->elfcorehdr_updated) &&
+		((image->hp_action == KEXEC_CRASH_HP_ADD_CPU) ||
+		(image->hp_action == KEXEC_CRASH_HP_REMOVE_CPU)))
+		return;
+
 	/*
 	 * Create the new elfcorehdr reflecting the changes to CPU and/or
 	 * memory resources.
-- 
2.31.1

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
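The early-return condition in the hunk above can be exercised in isolation with a small userspace model. This is a sketch only: struct image_model and can_skip_elfcorehdr_rewrite() are hypothetical names, while the action values mirror the KEXEC_CRASH_HP_* definitions from patch 2/8.

```c
#include <stdbool.h>

/* Action values as defined by KEXEC_CRASH_HP_* in patch 2/8. */
enum hp_action_model {
	HP_NONE = 0,
	HP_ADD_CPU = 1,
	HP_REMOVE_CPU = 2,
	HP_ADD_MEMORY = 3,
	HP_REMOVE_MEMORY = 4,
};

/* Hypothetical subset of the kimage fields the check looks at. */
struct image_model {
	bool file_mode;          /* loaded via kexec_file_load() */
	bool elfcorehdr_updated; /* kernel has rewritten the elfcorehdr at least once */
	int hp_action;
};

/* True when a CPU event needs no elfcorehdr rewrite. */
static bool can_skip_elfcorehdr_rewrite(const struct image_model *image)
{
	bool all_cpus_described = image->file_mode || image->elfcorehdr_updated;
	bool cpu_event = image->hp_action == HP_ADD_CPU ||
			 image->hp_action == HP_REMOVE_CPU;

	return all_cpus_described && cpu_event;
}
```

Memory events never take the shortcut, and for a kexec_load()'d image the first CPU event still triggers a rewrite because elfcorehdr_updated is not yet set.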
[PATCH v28 3/8] kexec: exclude elfcorehdr from the segment digest
When a crash kernel is loaded via the kexec_file_load() syscall, the kernel places the various segments (ie crash kernel, crash initrd, boot_params, elfcorehdr, purgatory, etc) in memory. For those architectures that utilize purgatory, a hash digest of the segments is calculated for integrity checking. The digest is embedded into the purgatory image prior to placing it in memory.

Updates to the elfcorehdr in response to CPU and memory changes would cause the purgatory integrity checking to fail (at crash time, and no vmcore created). Therefore, the elfcorehdr segment is explicitly excluded from the purgatory digest, enabling updates to the elfcorehdr while also avoiding the need to recompute the hash digest and reload purgatory.

Suggested-by: Baoquan He
Signed-off-by: Eric DeVolder
Reviewed-by: Sourabh Jain
Acked-by: Hari Bathini
Acked-by: Baoquan He
---
 kernel/kexec_file.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index 453b7a513540..e2ec9d7b9a1f 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -726,6 +726,12 @@ static int kexec_calculate_store_digests(struct kimage *image)
 	for (j = i = 0; i < image->nr_segments; i++) {
 		struct kexec_segment *ksegment;
 
+#ifdef CONFIG_CRASH_HOTPLUG
+		/* Exclude elfcorehdr segment to allow future changes via hotplug */
+		if (i == image->elfcorehdr_index)
+			continue;
+#endif
+
 		ksegment = &image->segment[i];
 		/*
 		 * Skip purgatory as it will be modified once we put digest
-- 
2.31.1
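The effect of excluding one segment from the digest can be demonstrated with a small userspace model. This is a sketch under loud assumptions: it uses a 64-bit FNV-1a hash as a stand-in for the SHA-256 digest the kernel computes, and struct segment_model and digest_excluding() are hypothetical names. The skip is keyed on the segment index.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a kexec segment: just a buffer and its length. */
struct segment_model {
	const unsigned char *buf;
	size_t len;
};

/*
 * Hash all segments except the one at skip_index.
 * FNV-1a is used here only for brevity; the kernel uses SHA-256.
 */
static uint64_t digest_excluding(const struct segment_model *segs, size_t nr,
				 size_t skip_index)
{
	uint64_t h = 14695981039346656037ULL; /* FNV-1a 64-bit offset basis */

	for (size_t i = 0; i < nr; i++) {
		if (i == skip_index)
			continue; /* e.g. the elfcorehdr segment */
		for (size_t k = 0; k < segs[i].len; k++) {
			h ^= segs[i].buf[k];
			h *= 1099511628211ULL; /* FNV-1a 64-bit prime */
		}
	}
	return h;
}
```

In this model, rewriting the skipped segment leaves the digest untouched, while a change to any other segment is still detected. That is the property the hotplug handler relies on.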
[PATCH v28 4/8] crash: memory and CPU hotplug sysfs attributes
Introduce the crash_hotplug attribute for memory and CPUs for use by userspace. These attributes directly facilitate the udev rule for managing userspace re-loading of the crash kernel upon hot un/plug changes.

For memory, expose the crash_hotplug attribute to the /sys/devices/system/memory directory. For example:

 # udevadm info --attribute-walk /sys/devices/system/memory/memory81
  looking at device '/devices/system/memory/memory81':
    KERNEL=="memory81"
    SUBSYSTEM=="memory"
    DRIVER==""
    ATTR{online}=="1"
    ATTR{phys_device}=="0"
    ATTR{phys_index}=="0051"
    ATTR{removable}=="1"
    ATTR{state}=="online"
    ATTR{valid_zones}=="Movable"

  looking at parent device '/devices/system/memory':
    KERNELS=="memory"
    SUBSYSTEMS==""
    DRIVERS==""
    ATTRS{auto_online_blocks}=="offline"
    ATTRS{block_size_bytes}=="800"
    ATTRS{crash_hotplug}=="1"

For CPUs, expose the crash_hotplug attribute to the /sys/devices/system/cpu directory. For example:

 # udevadm info --attribute-walk /sys/devices/system/cpu/cpu0
  looking at device '/devices/system/cpu/cpu0':
    KERNEL=="cpu0"
    SUBSYSTEM=="cpu"
    DRIVER=="processor"
    ATTR{crash_notes}=="277c38600"
    ATTR{crash_notes_size}=="368"
    ATTR{online}=="1"

  looking at parent device '/devices/system/cpu':
    KERNELS=="cpu"
    SUBSYSTEMS==""
    DRIVERS==""
    ATTRS{crash_hotplug}=="1"
    ATTRS{isolated}==""
    ATTRS{kernel_max}=="8191"
    ATTRS{nohz_full}==" (null)"
    ATTRS{offline}=="4-7"
    ATTRS{online}=="0-3"
    ATTRS{possible}=="0-7"
    ATTRS{present}=="0-3"

With these sysfs attributes in place, it is possible to efficiently instruct the udev rule to skip crash kernel reloading for kernels configured with crash hotplug support.

For example, the following is the proposed udev rule change for RHEL system 98-kexec.rules (as the first lines of the rule file):

 # The kernel updates the crash elfcorehdr for CPU and memory changes
 SUBSYSTEM=="cpu", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"
 SUBSYSTEM=="memory", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"

When examined in the context of 98-kexec.rules, the above rules test if crash_hotplug is set, and if so, the userspace initiated unload-then-reload of the crash kernel is skipped.

CPU and memory checks are separated in accordance with CONFIG_HOTPLUG_CPU and CONFIG_MEMORY_HOTPLUG kernel config options. If an architecture supports, for example, memory hotplug but not CPU hotplug, then the /sys/devices/system/memory/crash_hotplug attribute file is present, but the /sys/devices/system/cpu/crash_hotplug attribute file will NOT be present. Thus the udev rule skips userspace processing of memory hot un/plug events, but the udev rule will evaluate false for CPU events, thus allowing userspace to process CPU hot un/plug events (ie the unload-then-reload of the kdump capture kernel).

Signed-off-by: Eric DeVolder
Reviewed-by: Sourabh Jain
Acked-by: Hari Bathini
Acked-by: Baoquan He
---
 Documentation/ABI/testing/sysfs-devices-memory | 8
 .../ABI/testing/sysfs-devices-system-cpu | 8
 .../admin-guide/mm/memory-hotplug.rst | 8
 Documentation/core-api/cpu_hotplug.rst | 18 ++
 drivers/base/cpu.c | 13 +
 drivers/base/memory.c | 13 +
 include/linux/kexec.h | 8
 7 files changed, 76 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-devices-memory b/Documentation/ABI/testing/sysfs-devices-memory
index d8b0f80b9e33..a95e0f17c35a 100644
--- a/Documentation/ABI/testing/sysfs-devices-memory
+++ b/Documentation/ABI/testing/sysfs-devices-memory
@@ -110,3 +110,11 @@ Description:
 		link is created for memory section 9 on node0.
/sys/devices/system/node/node0/memory9 -> ../../memory/memory9 + +What: /sys/devices/system/memory/crash_hotplug +Date: Aug 2023 +Contact: Linux kernel mailing list +Description: + (RO) indicates whether or not the kernel directly supports + modifying the crash elfcorehdr for memory hot un/plug and/or + on/offline changes. diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu index 77942eedf4f6..b52564de2b18 100644 --- a/Documentation/ABI/testing/sysfs-devices-system-cpu +++ b/Documentation/ABI/testing/sysfs-devices-system-cpu @@ -687,3 +687,11 @@ Description: (RO) the list of CPUs that are isolated
[PATCH v28 7/8] crash: change crash_prepare_elf64_headers() to for_each_possible_cpu()
The function crash_prepare_elf64_headers() generates the elfcorehdr which describes the CPUs and memory in the system for the crash kernel. In particular, it writes out ELF PT_LOADs for memory regions and ELF PT_NOTEs for the CPUs in the system.

With respect to the CPUs, the current implementation utilizes for_each_present_cpu() which means that as CPUs are added and removed, the elfcorehdr must again be updated to reflect the new set of CPUs.

The reasoning behind the move to use for_each_possible_cpu() is:

- At kernel boot time, all percpu crash_notes are allocated for all possible CPUs; that is, crash_notes are not allocated dynamically when CPUs are plugged/unplugged. Thus the crash_notes for each possible CPU are always available.

- The crash_prepare_elf64_headers() creates an ELF PT_NOTE per CPU. Changing to for_each_possible_cpu() is valid as the crash_notes pointed to by each CPU PT_NOTE are present and always valid.

Furthermore, examining a common crash processing path of:

 kernel panic -> crash kernel -> makedumpfile -> 'crash' analyzer
           elfcorehdr      /proc/vmcore     vmcore

reveals how the ELF CPU PT_NOTEs are utilized:

- Upon panic, each CPU is sent an IPI and shuts itself down, recording its state in its crash_notes. When all CPUs are shut down, the crash kernel is launched with a pointer to the elfcorehdr.

- The crash kernel via linux/fs/proc/vmcore.c does not examine or use the contents of the PT_NOTEs, it exposes them via /proc/vmcore.

- The makedumpfile utility uses /proc/vmcore and reads the CPU PT_NOTEs to craft a nr_cpus variable, which is reported in a header but otherwise generally unused. Makedumpfile creates the vmcore.

- The 'crash' dump analyzer does not appear to reference the CPU PT_NOTEs. Instead it looks up the cpu_[possible|present|online]_mask symbols and directly examines those structure contents from vmcore memory. From that information it is able to determine which CPUs are present and online, and locate the corresponding crash_notes.
Said differently, it appears that the 'crash' analyzer does not rely on the ELF PT_NOTEs for CPUs; rather it obtains the information directly via kernel symbols and the memory within the vmcore.

(There may be other vmcore generating and analysis tools that do use these PT_NOTEs, but 'makedumpfile' and 'crash' seem to be the most common solution.)

This results in the benefit of having all CPUs described in the elfcorehdr, and therefore reducing the need to re-generate the elfcorehdr on CPU changes, at the small expense of an additional 56 bytes per PT_NOTE for not-present-but-possible CPUs.

On systems where the kexec_file_load() syscall is utilized, all the above is valid. On systems where the kexec_load() syscall is utilized, there may be the need for the elfcorehdr to be regenerated once. The reason is that some archs only populate the 'present' CPUs from the /sys/devices/system/cpu entries, which the userspace 'kexec' utility uses to generate the userspace-supplied elfcorehdr. In this situation, one memory or CPU change will rewrite the elfcorehdr via the crash_prepare_elf64_headers() function and now all possible CPUs will be described, just as with the kexec_file_load() syscall.
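The size cost mentioned above is easy to put a number on. A minimal sketch, assuming the 56 bytes quoted in the text is sizeof(Elf64_Phdr); extra_note_bytes() is a hypothetical helper, not part of the patch.

```c
#define ELF64_PHDR_SIZE 56UL /* sizeof(Elf64_Phdr), one per CPU PT_NOTE */

/*
 * Extra elfcorehdr bytes spent on describing CPUs that are possible
 * but not currently present.
 */
static unsigned long extra_note_bytes(unsigned int possible_cpus,
				      unsigned int present_cpus)
{
	if (present_cpus >= possible_cpus)
		return 0;
	return (unsigned long)(possible_cpus - present_cpus) * ELF64_PHDR_SIZE;
}
```

For a system with possible CPUs 0-7 and present CPUs 0-3, that is 4 * 56 = 224 additional bytes.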

Suggested-by: Sourabh Jain
Signed-off-by: Eric DeVolder
Reviewed-by: Sourabh Jain
Acked-by: Hari Bathini
Acked-by: Baoquan He
---
 kernel/crash_core.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index 34dc7bddfd77..7b87db9973a5 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -367,8 +367,8 @@ int crash_prepare_elf64_headers(struct crash_mem *mem, int need_kernel_map,
 	ehdr->e_ehsize = sizeof(Elf64_Ehdr);
 	ehdr->e_phentsize = sizeof(Elf64_Phdr);
 
-	/* Prepare one phdr of type PT_NOTE for each present CPU */
-	for_each_present_cpu(cpu) {
+	/* Prepare one phdr of type PT_NOTE for each possible CPU */
+	for_each_possible_cpu(cpu) {
 		phdr->p_type = PT_NOTE;
 		notes_addr = per_cpu_ptr_to_phys(per_cpu_ptr(crash_notes, cpu));
 		phdr->p_offset = phdr->p_paddr = notes_addr;
-- 
2.31.1
[PATCH v28 2/8] crash: add generic infrastructure for crash hotplug support
To support crash hotplug, a mechanism is needed to update the crash elfcorehdr upon CPU or memory changes (eg. hot un/plug or off/onlining). The crash elfcorehdr describes the CPUs and memory to be written into the vmcore.

To track CPU changes, callbacks are registered with the cpuhp mechanism via cpuhp_setup_state_nocalls(CPUHP_BP_PREPARE_DYN). The crash hotplug elfcorehdr update has no explicit ordering requirement (relative to other cpuhp states), so it meets the criteria for utilizing CPUHP_BP_PREPARE_DYN. CPUHP_BP_PREPARE_DYN is a dynamic state and avoids the need to introduce a new state for crash hotplug. Also, CPUHP_BP_PREPARE_DYN is the last state in the PREPARE group, just prior to the STARTING group, which is very close to the CPU starting up in a plug/online situation, or stopping in an unplug/offline situation. This minimizes the window of time during an actual plug/online or unplug/offline situation in which the elfcorehdr would be inaccurate. Note that for a CPU being unplugged or offlined, the CPU will still be present in the list of CPUs generated by crash_prepare_elf64_headers(). However, there is no need to explicitly omit the CPU; see the justification in 'crash: change crash_prepare_elf64_headers() to for_each_possible_cpu()'.

To track memory changes, a notifier is registered to capture the memblock MEM_ONLINE and MEM_OFFLINE events via register_memory_notifier().

The CPU callbacks and memory notifiers invoke crash_handle_hotplug_event() which performs needed tasks and then dispatches the event to the architecture specific arch_crash_handle_hotplug_event() to update the elfcorehdr with the current state of CPUs and memory. During the process, the kexec_lock is held.
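The dispatch flow described above (CPU callback or memory notifier, then the generic handler, then the arch handler) can be modeled in userspace as follows. This is a simplified sketch and not the patch's code: every *_model name is hypothetical, the kexec_lock is omitted, and the arch handler just records what it was asked to do. Only the KEXEC_CRASH_HP_* values are taken from the patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Action values as defined in include/linux/crash_core.h by this patch. */
#define KEXEC_CRASH_HP_NONE          0
#define KEXEC_CRASH_HP_ADD_CPU       1
#define KEXEC_CRASH_HP_REMOVE_CPU    2
#define KEXEC_CRASH_HP_ADD_MEMORY    3
#define KEXEC_CRASH_HP_REMOVE_MEMORY 4

/* Hypothetical stand-in for the loaded kdump image. */
struct image_model {
	int hp_action;
};

/* Records the action the arch handler saw, for inspection. */
static int last_action_seen = KEXEC_CRASH_HP_NONE;

/* Stand-in for arch_crash_handle_hotplug_event(): would rebuild the elfcorehdr. */
static void arch_handler_model(struct image_model *image)
{
	last_action_seen = image->hp_action;
}

/* Stand-in for crash_handle_hotplug_event(); the real one holds the kexec_lock. */
static void handle_hotplug_event_model(struct image_model *image, int hp_action)
{
	if (image == NULL)
		return; /* no kdump image loaded, nothing to update */

	image->hp_action = hp_action;
	arch_handler_model(image);
	image->hp_action = KEXEC_CRASH_HP_NONE; /* event fully processed */
}

/* Stand-in for the memory notifier (MEM_ONLINE / MEM_OFFLINE). */
static void memory_event_model(struct image_model *image, bool online)
{
	handle_hotplug_event_model(image, online ? KEXEC_CRASH_HP_ADD_MEMORY
						  : KEXEC_CRASH_HP_REMOVE_MEMORY);
}

/* Stand-in for the cpuhp online/offline callbacks. */
static void cpu_event_model(struct image_model *image, bool online)
{
	handle_hotplug_event_model(image, online ? KEXEC_CRASH_HP_ADD_CPU
						  : KEXEC_CRASH_HP_REMOVE_CPU);
}
```

Keeping the event-to-action mapping in the generic layer means an architecture only has to supply the one handler that regenerates its elfcorehdr.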
Signed-off-by: Eric DeVolder
Reviewed-by: Sourabh Jain
Acked-by: Hari Bathini
Acked-by: Baoquan He
---
 include/linux/crash_core.h |   7 ++
 include/linux/kexec.h      |  11 +++
 kernel/Kconfig.kexec       |  31
 kernel/crash_core.c        | 142 +
 kernel/kexec_core.c        |   6 ++
 5 files changed, 197 insertions(+)

diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h
index 1e48b1d96404..0c06561bf5ff 100644
--- a/include/linux/crash_core.h
+++ b/include/linux/crash_core.h
@@ -104,4 +104,11 @@ extern int crash_prepare_elf64_headers(struct crash_mem *mem, int need_kernel_ma
 struct kimage;
 struct kexec_segment;
 
+#define KEXEC_CRASH_HP_NONE			0
+#define KEXEC_CRASH_HP_ADD_CPU			1
+#define KEXEC_CRASH_HP_REMOVE_CPU		2
+#define KEXEC_CRASH_HP_ADD_MEMORY		3
+#define KEXEC_CRASH_HP_REMOVE_MEMORY		4
+#define KEXEC_CRASH_HP_INVALID_CPU		-1U
+
 #endif /* LINUX_CRASH_CORE_H */
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index fb4350db33ff..df395f888915 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -33,6 +33,7 @@ extern note_buf_t __percpu *crash_notes;
 #include
 #include
 #include
+#include
 #include
 
 /* Verify architecture specific macros are defined */
@@ -345,6 +346,12 @@ struct kimage {
 	struct purgatory_info purgatory_info;
 #endif
 
+#ifdef CONFIG_CRASH_HOTPLUG
+	int hp_action;
+	int elfcorehdr_index;
+	bool elfcorehdr_updated;
+#endif
+
 #ifdef CONFIG_IMA_KEXEC
 	/* Virtual address of IMA measurement buffer for kexec syscall */
 	void *ima_buffer;
@@ -475,6 +482,10 @@ static inline int arch_kexec_post_alloc_pages(void *vaddr, unsigned int pages, g
 static inline void arch_kexec_pre_free_pages(void *vaddr, unsigned int pages) { }
 #endif
 
+#ifndef arch_crash_handle_hotplug_event
+static inline void arch_crash_handle_hotplug_event(struct kimage *image) { }
+#endif
+
 #else /* !CONFIG_KEXEC_CORE */
 struct pt_regs;
 struct task_struct;
diff --git a/kernel/Kconfig.kexec b/kernel/Kconfig.kexec
index ff72e45cfaef..d0a9a5392035 100644
--- a/kernel/Kconfig.kexec
+++ b/kernel/Kconfig.kexec
@@ -113,4 +113,35 @@ config CRASH_DUMP
 	  For s390, this option also enables zfcpdump.
 	  See also
 
+config CRASH_HOTPLUG
+	bool "Update the crash elfcorehdr on system configuration changes"
+	default y
+	depends on CRASH_DUMP && (HOTPLUG_CPU || MEMORY_HOTPLUG)
+	depends on ARCH_SUPPORTS_CRASH_HOTPLUG
+	help
+	  Enable direct update to the crash elfcorehdr (which contains
+	  the list of CPUs and memory regions to be dumped upon a crash)
+	  in response to hot plug/unplug or online/offline of CPUs or
+	  memory. This is a much more advanced approach than userspace
+	  attempting that.
+
+	  If unsure, say Y.
+
+config CRASH_MAX_MEMORY_RANGES
+	int "Specify the maximum number of memory regions for the elfcorehdr"
+	default 8192
+	depends on CRASH_HOTPLUG
+	help
+	  For the kexec_file_load() syscall path, specify the maximum number of
+	  memory regions that the elfcorehdr bu
[PATCH v27 2/8] crash: add generic infrastructure for crash hotplug support
To support crash hotplug, a mechanism is needed to update the crash
elfcorehdr upon CPU or memory changes (e.g. hot un/plug or off/onlining).
The crash elfcorehdr describes the CPUs and memory to be written into
the vmcore.

To track CPU changes, callbacks are registered with the cpuhp mechanism
via cpuhp_setup_state_nocalls(CPUHP_BP_PREPARE_DYN). The crash hotplug
elfcorehdr update has no explicit ordering requirement (relative to
other cpuhp states), so it meets the criteria for utilizing
CPUHP_BP_PREPARE_DYN. CPUHP_BP_PREPARE_DYN is a dynamic state and avoids
the need to introduce a new state for crash hotplug. Also,
CPUHP_BP_PREPARE_DYN is the last state in the PREPARE group, just prior
to the STARTING group, which is very close to the CPU starting up in a
plug/online situation, or stopping in an unplug/offline situation. This
minimizes the window of time during an actual plug/online or
unplug/offline situation in which the elfcorehdr would be inaccurate.
Note that for a CPU being unplugged or offlined, the CPU will still be
present in the list of CPUs generated by crash_prepare_elf64_headers().
However, there is no need to explicitly omit the CPU, see justification
in 'crash: change crash_prepare_elf64_headers() to
for_each_possible_cpu()'.

To track memory changes, a notifier is registered to capture the
memblock MEM_ONLINE and MEM_OFFLINE events via
register_memory_notifier().

The CPU callbacks and memory notifiers invoke
crash_handle_hotplug_event(), which performs the needed tasks and then
dispatches the event to the architecture specific
arch_crash_handle_hotplug_event() to update the elfcorehdr with the
current state of CPUs and memory. During the process, the kexec_lock is
held.
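The dispatch step described above can be illustrated with a small
userspace model. This is only a sketch, not the kernel code: the
model_* structures and functions are hypothetical stand-ins for
struct kimage, struct kexec_segment and the two handlers, and it
assumes that the elfcorehdr segment is recognised by the ELF magic at
the start of its buffer and that hp_action is reset once the arch
handler returns. Only the KEXEC_CRASH_HP_* values and the three kimage
field names are taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Action codes, as defined by this patch in crash_core.h. */
#define KEXEC_CRASH_HP_NONE          0
#define KEXEC_CRASH_HP_ADD_CPU       1
#define KEXEC_CRASH_HP_REMOVE_CPU    2
#define KEXEC_CRASH_HP_ADD_MEMORY    3
#define KEXEC_CRASH_HP_REMOVE_MEMORY 4

#define MODEL_MAX_SEGMENTS 8

/* Hypothetical userspace stand-ins for kexec_segment and kimage. */
struct model_segment {
	const unsigned char *buf;
	size_t len;
};

struct model_image {
	struct model_segment segment[MODEL_MAX_SEGMENTS];
	size_t nr_segments;
	int hp_action;
	int elfcorehdr_index;
	bool elfcorehdr_updated;
	int arch_saw_action;	/* hook: action visible to the arch handler */
};

/* Stand-in for arch_crash_handle_hotplug_event(): records what it saw. */
static void model_arch_handle_hotplug_event(struct model_image *image)
{
	image->arch_saw_action = image->hp_action;
}

/*
 * Model of the generic handler. Assumption: the elfcorehdr segment is
 * the one whose buffer starts with the ELF magic.
 * Returns 0 on success, -1 if no elfcorehdr segment was found.
 */
static int model_crash_handle_hotplug_event(struct model_image *image,
					    int hp_action)
{
	static const unsigned char elf_magic[4] = { 0x7f, 'E', 'L', 'F' };
	size_t n;

	assert(image != NULL && image->nr_segments <= MODEL_MAX_SEGMENTS);

	/* In the kernel, the kexec_lock would be held from here on. */
	image->elfcorehdr_index = -1;
	for (n = 0; n < image->nr_segments; n++) {
		const struct model_segment *seg = &image->segment[n];

		if (seg->len >= sizeof(elf_magic) &&
		    memcmp(seg->buf, elf_magic, sizeof(elf_magic)) == 0)
			image->elfcorehdr_index = (int)n;
	}
	if (image->elfcorehdr_index < 0)
		return -1;

	/* The action is exposed to the arch handler only during the call. */
	image->hp_action = hp_action;
	model_arch_handle_hotplug_event(image);
	image->hp_action = KEXEC_CRASH_HP_NONE;
	image->elfcorehdr_updated = true;
	return 0;
}
```

Passing the action through the image rather than as a function
argument keeps the arch handler signature down to a single
struct kimage pointer, which matches the prototype in this patch.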
Signed-off-by: Eric DeVolder
Reviewed-by: Sourabh Jain
Acked-by: Hari Bathini
Acked-by: Baoquan He
---
 include/linux/crash_core.h |   9 +++
 include/linux/kexec.h      |  11 +++
 kernel/Kconfig.kexec       |  31
 kernel/crash_core.c        | 142 +
 kernel/kexec_core.c        |   6 ++
 5 files changed, 199 insertions(+)

diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h
index de62a722431e..e14345cc7a22 100644
--- a/include/linux/crash_core.h
+++ b/include/linux/crash_core.h
@@ -84,4 +84,13 @@ int parse_crashkernel_high(char *cmdline, unsigned long long system_ram,
 int parse_crashkernel_low(char *cmdline, unsigned long long system_ram,
 		unsigned long long *crash_size, unsigned long long *crash_base);
 
+#define KEXEC_CRASH_HP_NONE			0
+#define KEXEC_CRASH_HP_ADD_CPU			1
+#define KEXEC_CRASH_HP_REMOVE_CPU		2
+#define KEXEC_CRASH_HP_ADD_MEMORY		3
+#define KEXEC_CRASH_HP_REMOVE_MEMORY		4
+#define KEXEC_CRASH_HP_INVALID_CPU		-1U
+
+struct kimage;
+
 #endif /* LINUX_CRASH_CORE_H */
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index 811a90e09698..b9903dd48e24 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -33,6 +33,7 @@ extern note_buf_t __percpu *crash_notes;
 #include
 #include
 #include
+#include
 #include
 
 /* Verify architecture specific macros are defined */
@@ -360,6 +361,12 @@ struct kimage {
 	struct purgatory_info purgatory_info;
 #endif
 
+#ifdef CONFIG_CRASH_HOTPLUG
+	int hp_action;
+	int elfcorehdr_index;
+	bool elfcorehdr_updated;
+#endif
+
 #ifdef CONFIG_IMA_KEXEC
 	/* Virtual address of IMA measurement buffer for kexec syscall */
 	void *ima_buffer;
@@ -490,6 +497,10 @@ static inline int arch_kexec_post_alloc_pages(void *vaddr, unsigned int pages, g
 static inline void arch_kexec_pre_free_pages(void *vaddr, unsigned int pages) { }
 #endif
 
+#ifndef arch_crash_handle_hotplug_event
+static inline void arch_crash_handle_hotplug_event(struct kimage *image) { }
+#endif
+
 #else /* !CONFIG_KEXEC_CORE */
 struct pt_regs;
 struct task_struct;
diff --git a/kernel/Kconfig.kexec b/kernel/Kconfig.kexec
index ff72e45cfaef..d0a9a5392035 100644
--- a/kernel/Kconfig.kexec
+++ b/kernel/Kconfig.kexec
@@ -113,4 +113,35 @@ config CRASH_DUMP
 	  For s390, this option also enables zfcpdump.
 	  See also
 
+config CRASH_HOTPLUG
+	bool "Update the crash elfcorehdr on system configuration changes"
+	default y
+	depends on CRASH_DUMP && (HOTPLUG_CPU || MEMORY_HOTPLUG)
+	depends on ARCH_SUPPORTS_CRASH_HOTPLUG
+	help
+	  Enable direct update to the crash elfcorehdr (which contains
+	  the list of CPUs and memory regions to be dumped upon a crash)
+	  in response to hot plug/unplug or online/offline of CPUs or
+	  memory. This is a much more advanced approach than userspace
+	  attempting that.
+
+	  If unsure, say Y.
+
+config CRASH_MAX_MEMORY_RANGES
+	int "Specify the maximum number of memory regions for the elfcorehdr"
+	default 8192
+	depends on CRASH_HOTPLUG
+	help
+
[PATCH v27 0/8] crash: Kernel handling of CPU and memory hot un/plug
382
https://lore.kernel.org/lkml/20220613224240.79400-1-eric.devol...@oracle.com/
 - Rebased to 5.18.0
 - Per Sourabh, moved crash_prepare_elf64_headers() into common
   crash_core.c to avoid compile issues with kexec_load only path.
 - Per David Hildenbrand, replaced mutex_trylock() with mutex_lock().
 - Changed the __weak arch_crash_handle_hotplug_event() to utilize
   WARN_ONCE() instead of WARN(). Fix some formatting issues.
 - Per Sourabh, introduced sysfs attribute crash_hotplug for memory and
   CPUs; for use by userspace (udev) to determine if the kernel performs
   crash hot un/plug support.
 - Per Sourabh, moved the code detecting the elfcorehdr segment from
   arch/x86 into crash_core:handle_hotplug_event() so both kexec_load
   and kexec_file_load can benefit.
 - Updated userspace kexec-tools kexec utility to reflect change to
   using CRASH_MAX_MEMORY_RANGES and get_nr_cpus().
 - Updated the new proposed udev rules to reflect using the sysfs
   attributes crash_hotplug.

v8: 5may2022
https://lkml.org/lkml/2022/5/5/1133
https://lore.kernel.org/lkml/20220505184603.1548-1-eric.devol...@oracle.com/
 - Per Borislav Petkov, eliminated CONFIG_CRASH_HOTPLUG in favor of
   CONFIG_HOTPLUG_CPU || CONFIG_MEMORY_HOTPLUG, ie a new define is not
   needed. Also use of IS_ENABLED() rather than #ifdef's. Renamed
   crash_hotplug_handler() to handle_hotplug_event(). And other
   corrections.
 - Per Baoquan, minimized the parameters to the
   arch_crash_handle_hotplug_event() to hp_action and cpu.
 - Introduce KEXEC_CRASH_HP_INVALID_CPU definition, per Baoquan.
 - Per Sourabh Jain, renamed and repurposed CRASH_HOTPLUG_ELFCOREHDR_SZ
   to CONFIG_CRASH_MAX_MEMORY_RANGES, mirroring kexec-tools change by
   David Hildenbrand. Folded this patch into the x86 kexec_file_load
   support patch.

v7: 13apr2022
https://lkml.org/lkml/2022/4/13/850
https://lore.kernel.org/lkml/20220413164237.20845-1-eric.devol...@oracle.com/
 - Resolved parameter usage to crash_hotplug_handler(), per Baoquan.

v6: 1apr2022
https://lkml.org/lkml/2022/4/1/1203
https://lore.kernel.org/lkml/20220401183040.1624-1-eric.devol...@oracle.com/
 - Reword commit messages and some comment cleanup per Baoquan.
 - Changed elf_index to elfcorehdr_index for clarity.
 - Minor code changes per Baoquan.

v5: 3mar2022
https://lkml.org/lkml/2022/3/3/674
https://lore.kernel.org/lkml/20220303162725.49640-1-eric.devol...@oracle.com/
 - Reworded description of CRASH_HOTPLUG_ELFCOREHDR_SZ, per
   David Hildenbrand.
 - Refactored slightly a few patches per Baoquan recommendation.

v4: 9feb2022
https://lkml.org/lkml/2022/2/9/1406
https://lore.kernel.org/lkml/20220209195706.51522-1-eric.devol...@oracle.com/
 - Refactored patches per Baoquan suggestions.
 - A few corrections, per Baoquan.

v3: 10jan2022
https://lkml.org/lkml/2022/1/10/1212
https://lore.kernel.org/lkml/20220110195727.1682-1-eric.devol...@oracle.com/
 - Rebasing per Baoquan He request.
 - Changed memory notifier per David Hildenbrand.
 - Providing example kexec userspace change in cover letter.

RFC v2: 7dec2021
https://lkml.org/lkml/2021/12/7/1088
https://lore.kernel.org/lkml/20211207195204.1582-1-eric.devol...@oracle.com/
 - Acting upon Baoquan He suggestion of removing elfcorehdr from the
   purgatory list of segments, removed purgatory code from patchset,
   and it is significantly simpler now.

RFC v1: 18nov2021
https://lkml.org/lkml/2021/11/18/845
https://lore.kernel.org/lkml/2028174948.37435-1-eric.devol...@oracle.com/
 - working patchset demonstrating kernel handling of hotplug updates to
   x86 elfcorehdr for kexec_file_load

RFC: 14dec2020
https://lkml.org/lkml/2020/12/14/532
https://lore.kernel.org/lkml/b04ed259-dc5f-7f30-6661-c26f92d90...@oracle.com/
NOTE: s/vmcoreinfo/elfcorehdr/g
 - proposed concept of allowing kernel to handle hotplug update of
   elfcorehdr
---

Eric DeVolder (8):
  crash: move a few code bits to setup support of crash hotplug
  crash: add generic infrastructure for crash hotplug support
  kexec: exclude elfcorehdr from the segment digest
  crash: memory and CPU hotplug sysfs attributes
  x86/crash: add x86 crash hotplug support
  crash: hotplug support for kexec_load()
  crash: change crash_prepare_elf64_headers() to for_each_possible_cpu()
  x86/crash: optimize CPU changes

 .../ABI/testing/sysfs-devices-memory      |   8 +
 .../ABI/testing/sysfs-devices-system-cpu  |   8 +
 .../admin-guide/mm/memory-hotplug.rst     |   8 +
 Documentation/core-api/cpu_hotplug.rst    |  18 +
 arch/x86/Kconfig                          |   3 +
 arch/x86/include/asm/kexec.h              |  18 +
 arch/x86/kernel/crash.c                   | 140 ++-
 drivers/base/cpu.c                        |  13 +
 drivers/base/memory.c                     |  13 +
 include/linux/crash_core.h                |   9 +
 include/linux/kexec.h                     |  63 +++-
 include/uapi/linux/kexec.h                |   1 +
 kernel/Kconfig
[PATCH v27 8/8] x86/crash: optimize CPU changes
crash_prepare_elf64_headers() writes into the elfcorehdr an ELF PT_NOTE
for all possible CPUs. As such, subsequent changes to CPUs (ie. hot
un/plug, online/offline) do not need to rewrite the elfcorehdr.

The kimage->file_mode term covers kdump images loaded via the
kexec_file_load() syscall. Since crash_prepare_elf64_headers() wrote
the initial elfcorehdr, no update to the elfcorehdr is needed for CPU
changes.

The kimage->elfcorehdr_updated term covers kdump images loaded via the
kexec_load() syscall. At least one memory or CPU change must occur to
cause crash_prepare_elf64_headers() to rewrite the elfcorehdr.
Afterwards, no update to the elfcorehdr is needed for CPU changes.

This code is intentionally *NOT* hoisted into
crash_handle_hotplug_event() as it would prevent the arch-specific
handler from running for CPU changes. This would break PPC, for
example, which needs to update other information besides the
elfcorehdr, on CPU changes.

Signed-off-by: Eric DeVolder
Reviewed-by: Sourabh Jain
Acked-by: Hari Bathini
Acked-by: Baoquan He
---
 arch/x86/kernel/crash.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index caf22bcb61af..18d2a18d1073 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -467,6 +467,16 @@ void arch_crash_handle_hotplug_event(struct kimage *image)
 	unsigned long mem, memsz;
 	unsigned long elfsz = 0;
 
+	/*
+	 * As crash_prepare_elf64_headers() has already described all
+	 * possible CPUs, there is no need to update the elfcorehdr
+	 * for additional CPU changes.
+	 */
+	if ((image->file_mode || image->elfcorehdr_updated) &&
+		((image->hp_action == KEXEC_CRASH_HP_ADD_CPU) ||
+		(image->hp_action == KEXEC_CRASH_HP_REMOVE_CPU)))
+		return;
+
 	/*
 	 * Create the new elfcorehdr reflecting the changes to CPU and/or
 	 * memory resources.
-- 
2.31.1

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
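The early-return condition added by this patch can be checked in
isolation. The following is a minimal userspace sketch, assuming the
KEXEC_CRASH_HP_* values from patch 2/8; the function name
cpu_update_can_be_skipped() is hypothetical and takes the two kimage
terms as plain booleans instead of reading them from struct kimage.

```c
#include <assert.h>
#include <stdbool.h>

/* Action codes, as defined in patch 2/8. */
#define KEXEC_CRASH_HP_NONE          0
#define KEXEC_CRASH_HP_ADD_CPU       1
#define KEXEC_CRASH_HP_REMOVE_CPU    2
#define KEXEC_CRASH_HP_ADD_MEMORY    3
#define KEXEC_CRASH_HP_REMOVE_MEMORY 4

/*
 * Hypothetical helper mirroring the condition in the x86 handler:
 * a CPU event needs no elfcorehdr rewrite once the elfcorehdr is known
 * to describe all possible CPUs, i.e. the image was loaded via
 * kexec_file_load() (file_mode) or has already been rewritten once
 * (elfcorehdr_updated). Memory events always need the rewrite.
 */
static bool cpu_update_can_be_skipped(bool file_mode, bool elfcorehdr_updated,
				      int hp_action)
{
	bool is_cpu_event;

	/* Sanity check: only the action codes listed above are expected. */
	assert(hp_action >= KEXEC_CRASH_HP_NONE &&
	       hp_action <= KEXEC_CRASH_HP_REMOVE_MEMORY);

	is_cpu_event = hp_action == KEXEC_CRASH_HP_ADD_CPU ||
		       hp_action == KEXEC_CRASH_HP_REMOVE_CPU;

	return (file_mode || elfcorehdr_updated) && is_cpu_event;
}
```

For a kexec_load()'d image that has not yet seen any hotplug event,
both terms are false, so the first CPU event still goes through the
full rewrite.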
[PATCH v27 7/8] crash: change crash_prepare_elf64_headers() to for_each_possible_cpu()
The function crash_prepare_elf64_headers() generates the elfcorehdr
which describes the CPUs and memory in the system for the crash kernel.
In particular, it writes out ELF PT_NOTEs for memory regions and the
CPUs in the system.

With respect to the CPUs, the current implementation utilizes
for_each_present_cpu() which means that as CPUs are added and removed,
the elfcorehdr must again be updated to reflect the new set of CPUs.

The reasoning behind the move to use for_each_possible_cpu() is:

- At kernel boot time, all percpu crash_notes are allocated for all
  possible CPUs; that is, crash_notes are not allocated dynamically
  when CPUs are plugged/unplugged. Thus the crash_notes for each
  possible CPU are always available.

- The crash_prepare_elf64_headers() creates an ELF PT_NOTE per CPU.
  Changing to for_each_possible_cpu() is valid as the crash_notes
  pointed to by each CPU PT_NOTE are present and always valid.

Furthermore, examining a common crash processing path of:

 kernel panic -> crash kernel -> makedumpfile -> 'crash' analyzer
           elfcorehdr      /proc/vmcore     vmcore

reveals how the ELF CPU PT_NOTEs are utilized:

- Upon panic, each CPU is sent an IPI and shuts itself down, recording
  its state in its crash_notes. When all CPUs are shut down, the crash
  kernel is launched with a pointer to the elfcorehdr.

- The crash kernel via linux/fs/proc/vmcore.c does not examine or
  use the contents of the PT_NOTEs, it exposes them via /proc/vmcore.

- The makedumpfile utility uses /proc/vmcore and reads the CPU
  PT_NOTEs to craft a nr_cpus variable, which is reported in a
  header but otherwise generally unused. Makedumpfile creates the
  vmcore.

- The 'crash' dump analyzer does not appear to reference the CPU
  PT_NOTEs. Instead it looks up the cpu_[possible|present|online]_mask
  symbols and directly examines those structure contents from vmcore
  memory. From that information it is able to determine which CPUs
  are present and online, and locate the corresponding crash_notes.
  Said differently, it appears that the 'crash' analyzer does not rely
  on the ELF PT_NOTEs for CPUs; rather it obtains the information
  directly via kernel symbols and the memory within the vmcore.

(There may be other vmcore generating and analysis tools that do use
these PT_NOTEs, but 'makedumpfile' and 'crash' seem to be the most
common solution.)

This results in the benefit of having all CPUs described in the
elfcorehdr, and therefore reducing the need to re-generate the
elfcorehdr on CPU changes, at the small expense of an additional
56 bytes per PT_NOTE for not-present-but-possible CPUs.

On systems where the kexec_file_load() syscall is utilized, all the
above is valid. On systems where the kexec_load() syscall is utilized,
there may be the need for the elfcorehdr to be regenerated once. The
reason being that some archs only populate the 'present' CPUs from the
/sys/devices/system/cpu entries, which the userspace 'kexec' utility
uses to generate the userspace-supplied elfcorehdr. In this situation,
one memory or CPU change will rewrite the elfcorehdr via the
crash_prepare_elf64_headers() function and now all possible CPUs will
be described, just as with the kexec_file_load() syscall.
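The 56-byte figure quoted above is the size of one 64-bit ELF program
header. A minimal sketch of the arithmetic follows; struct
model_elf64_phdr is a local stand-in that is assumed to mirror the
field layout of Elf64_Phdr, and extra_pt_note_bytes() is a hypothetical
helper, not something provided by the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Local stand-in assumed to mirror the Elf64_Phdr layout:
 * two 32-bit fields followed by six 64-bit fields, 56 bytes in total.
 */
struct model_elf64_phdr {
	uint32_t p_type;
	uint32_t p_flags;
	uint64_t p_offset;
	uint64_t p_vaddr;
	uint64_t p_paddr;
	uint64_t p_filesz;
	uint64_t p_memsz;
	uint64_t p_align;
};

/*
 * Extra elfcorehdr bytes spent on describing CPUs that are possible
 * but not present, compared with describing present CPUs only.
 */
static size_t extra_pt_note_bytes(unsigned int possible, unsigned int present)
{
	assert(possible >= present);
	return (size_t)(possible - present) * sizeof(struct model_elf64_phdr);
}
```

For a system with 8 possible and 4 present CPUs this amounts to
4 * 56 = 224 bytes.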
Suggested-by: Sourabh Jain
Signed-off-by: Eric DeVolder
Reviewed-by: Sourabh Jain
Acked-by: Hari Bathini
Acked-by: Baoquan He
---
 kernel/crash_core.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index fa918176d46d..7378b501fada 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -364,8 +364,8 @@ int crash_prepare_elf64_headers(struct crash_mem *mem, int need_kernel_map,
 	ehdr->e_ehsize = sizeof(Elf64_Ehdr);
 	ehdr->e_phentsize = sizeof(Elf64_Phdr);
 
-	/* Prepare one phdr of type PT_NOTE for each present CPU */
-	for_each_present_cpu(cpu) {
+	/* Prepare one phdr of type PT_NOTE for each possible CPU */
+	for_each_possible_cpu(cpu) {
 		phdr->p_type = PT_NOTE;
 		notes_addr = per_cpu_ptr_to_phys(per_cpu_ptr(crash_notes, cpu));
 		phdr->p_offset = phdr->p_paddr = notes_addr;
-- 
2.31.1
[PATCH v27 5/8] x86/crash: add x86 crash hotplug support
When CPU or memory is hot un/plugged, or off/onlined, the crash elfcorehdr, which describes the CPUs and memory in the system, must also be updated. A new elfcorehdr is generated from the available CPUs and memory and replaces the existing elfcorehdr. The segment containing the elfcorehdr is identified at run-time in crash_core:crash_handle_hotplug_event(). No modifications to purgatory (see 'kexec: exclude elfcorehdr from the segment digest') or boot_params (as the elfcorehdr= capture kernel command line parameter pointer remains unchanged and correct) are needed, just elfcorehdr. For kexec_file_load(), the elfcorehdr segment size is based on NR_CPUS and CRASH_MAX_MEMORY_RANGES in order to accommodate a growing number of CPU and memory resources. For kexec_load(), the userspace kexec utility needs to size the elfcorehdr segment in the same/similar manner. To accommodate kexec_load() syscall in the absence of kexec_file_load() syscall support, prepare_elf_headers() and dependents are moved outside of CONFIG_KEXEC_FILE. 
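The sizing rule for the kexec_file_load() case can be sketched as
follows. This is an illustration only: elfcorehdr_buf_size() is a
hypothetical helper, the two size constants assume the standard 64-bit
ELF header (64 bytes) and program header (56 bytes), and the two extra
program headers are assumed to be the vmcoreinfo note and the kernel
text mapping that crash_prepare_elf64_headers() accounts for.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed sizes of Elf64_Ehdr and Elf64_Phdr. */
#define MODEL_ELF64_EHDR_SIZE 64u
#define MODEL_ELF64_PHDR_SIZE 56u

/*
 * Hypothetical sizing helper: one program header per CPU, one per
 * memory range, plus two more (vmcoreinfo note, kernel text mapping),
 * all following a single ELF header.
 */
static size_t elfcorehdr_buf_size(size_t nr_cpus, size_t max_mem_ranges)
{
	size_t nr_phdr;

	assert(nr_cpus > 0);
	nr_phdr = 2 + nr_cpus + max_mem_ranges;
	return MODEL_ELF64_EHDR_SIZE + nr_phdr * MODEL_ELF64_PHDR_SIZE;
}
```

With 64 CPUs and the CRASH_MAX_MEMORY_RANGES default of 8192 this gives
64 + 8258 * 56 = 462512 bytes, so reserving the worst case up front
costs well under half a megabyte.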
Signed-off-by: Eric DeVolder Reviewed-by: Sourabh Jain Acked-by: Hari Bathini Acked-by: Baoquan He --- arch/x86/Kconfig | 3 + arch/x86/include/asm/kexec.h | 15 + arch/x86/kernel/crash.c | 103 --- 3 files changed, 114 insertions(+), 7 deletions(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 7082fc10b346..ffc95c3d6abd 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -2069,6 +2069,9 @@ config ARCH_SUPPORTS_KEXEC_JUMP config ARCH_SUPPORTS_CRASH_DUMP def_bool X86_64 || (X86_32 && HIGHMEM) +config ARCH_SUPPORTS_CRASH_HOTPLUG + def_bool y + config PHYSICAL_START hex "Physical address where the kernel is loaded" if (EXPERT || CRASH_DUMP) default "0x100" diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h index 5b77bbc28f96..9143100ea3ea 100644 --- a/arch/x86/include/asm/kexec.h +++ b/arch/x86/include/asm/kexec.h @@ -209,6 +209,21 @@ typedef void crash_vmclear_fn(void); extern crash_vmclear_fn __rcu *crash_vmclear_loaded_vmcss; extern void kdump_nmi_shootdown_cpus(void); +#ifdef CONFIG_CRASH_HOTPLUG +void arch_crash_handle_hotplug_event(struct kimage *image); +#define arch_crash_handle_hotplug_event arch_crash_handle_hotplug_event + +#ifdef CONFIG_HOTPLUG_CPU +static inline int crash_hotplug_cpu_support(void) { return 1; } +#define crash_hotplug_cpu_support crash_hotplug_cpu_support +#endif + +#ifdef CONFIG_MEMORY_HOTPLUG +static inline int crash_hotplug_memory_support(void) { return 1; } +#define crash_hotplug_memory_support crash_hotplug_memory_support +#endif +#endif + #endif /* __ASSEMBLY__ */ #endif /* _ASM_X86_KEXEC_H */ diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c index cdd92ab43cda..c70a111c44fa 100644 --- a/arch/x86/kernel/crash.c +++ b/arch/x86/kernel/crash.c @@ -158,8 +158,6 @@ void native_machine_crash_shutdown(struct pt_regs *regs) crash_save_cpu(regs, safe_smp_processor_id()); } -#ifdef CONFIG_KEXEC_FILE - static int get_nr_ram_ranges_callback(struct resource *res, void *arg) { unsigned int *nr_ranges = 
arg; @@ -231,7 +229,7 @@ static int prepare_elf64_ram_headers_callback(struct resource *res, void *arg) /* Prepare elf headers. Return addr and size */ static int prepare_elf_headers(struct kimage *image, void **addr, - unsigned long *sz) + unsigned long *sz, unsigned long *nr_mem_ranges) { struct crash_mem *cmem; int ret; @@ -249,6 +247,9 @@ static int prepare_elf_headers(struct kimage *image, void **addr, if (ret) goto out; + /* Return the computed number of memory ranges, for hotplug usage */ + *nr_mem_ranges = cmem->nr_ranges; + /* By default prepare 64bit headers */ ret = crash_prepare_elf64_headers(cmem, IS_ENABLED(CONFIG_X86_64), addr, sz); @@ -257,6 +258,7 @@ static int prepare_elf_headers(struct kimage *image, void **addr, return ret; } +#ifdef CONFIG_KEXEC_FILE static int add_e820_entry(struct boot_params *params, struct e820_entry *entry) { unsigned int nr_e820_entries; @@ -371,18 +373,42 @@ int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params) int crash_load_segments(struct kimage *image) { int ret; + unsigned long pnum = 0; struct kexec_buf kbuf = { .image = image, .buf_min = 0, .buf_max = ULONG_MAX, .top_down = false }; /* Prepare elf headers and add a segment */ - ret = prepare_elf_headers(image, &kbuf.buffer, &kbuf.bufsz); + ret = prepare_elf_headers(image, &kbuf.buffer, &kbuf.bufsz, &pnum); if (ret) return ret; - image->elf_headers = kbuf.buffer; - image->elf_headers_sz = kbuf.bufsz; + image->elf_headers = kbuf.buffer; + image->elf_he
[PATCH v27 3/8] kexec: exclude elfcorehdr from the segment digest
When a crash kernel is loaded via the kexec_file_load() syscall, the
kernel places the various segments (ie crash kernel, crash initrd,
boot_params, elfcorehdr, purgatory, etc) in memory. For those
architectures that utilize purgatory, a hash digest of the segments
is calculated for integrity checking. The digest is embedded into
the purgatory image prior to placing in memory.

Updates to the elfcorehdr in response to CPU and memory changes
would cause the purgatory integrity checking to fail (at crash time,
and no vmcore created). Therefore, the elfcorehdr segment is
explicitly excluded from the purgatory digest, enabling updates to
the elfcorehdr while also avoiding the need to recompute the hash
digest and reload purgatory.

Suggested-by: Baoquan He
Signed-off-by: Eric DeVolder
Reviewed-by: Sourabh Jain
Acked-by: Hari Bathini
Acked-by: Baoquan He
---
 kernel/kexec_file.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index 453b7a513540..e2ec9d7b9a1f 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -726,6 +726,12 @@ static int kexec_calculate_store_digests(struct kimage *image)
 	for (j = i = 0; i < image->nr_segments; i++) {
 		struct kexec_segment *ksegment;
 
+#ifdef CONFIG_CRASH_HOTPLUG
+		/* Exclude elfcorehdr segment to allow future changes via hotplug */
+		if (j == image->elfcorehdr_index)
+			continue;
+#endif
+
 		ksegment = &image->segment[i];
 		/*
 		 * Skip purgatory as it will be modified once we put digest
-- 
2.31.1
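Why excluding one segment keeps the integrity check stable can be shown
with a small userspace model. This is a sketch under stated
assumptions: a 64-bit FNV-1a hash stands in for the SHA-256 digest that
purgatory actually verifies, struct model_segment and model_digest()
are hypothetical, and the model compares the segment loop index
against the elfcorehdr index, which is an assumption about the intended
behaviour rather than a copy of the hunk above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct kexec_segment. */
struct model_segment {
	const unsigned char *buf;
	size_t len;
};

/*
 * Digest all segments except the one at elfcorehdr_index.
 * FNV-1a (64-bit) is used here purely as a stand-in for SHA-256.
 * Pass -1 as elfcorehdr_index to include every segment.
 */
static uint64_t model_digest(const struct model_segment *seg, size_t nr,
			     int elfcorehdr_index)
{
	uint64_t h = 0xcbf29ce484222325ULL;	/* FNV offset basis */
	size_t i, k;

	assert(seg != NULL || nr == 0);

	for (i = 0; i < nr; i++) {
		/* Skip the elfcorehdr so later hotplug updates are allowed. */
		if ((int)i == elfcorehdr_index)
			continue;
		for (k = 0; k < seg[i].len; k++) {
			h ^= seg[i].buf[k];
			h *= 0x100000001b3ULL;	/* FNV prime */
		}
	}
	return h;
}
```

Rewriting the excluded segment leaves the digest unchanged, while any
change to a covered segment is still detected.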
[PATCH v27 4/8] crash: memory and CPU hotplug sysfs attributes
Introduce the crash_hotplug attribute for memory and CPUs for
use by userspace. These attributes directly facilitate the udev
rule for managing userspace re-loading of the crash kernel upon
hot un/plug changes.

For memory, expose the crash_hotplug attribute to the
/sys/devices/system/memory directory. For example:

 # udevadm info --attribute-walk /sys/devices/system/memory/memory81
  looking at device '/devices/system/memory/memory81':
    KERNEL=="memory81"
    SUBSYSTEM=="memory"
    DRIVER==""
    ATTR{online}=="1"
    ATTR{phys_device}=="0"
    ATTR{phys_index}=="0051"
    ATTR{removable}=="1"
    ATTR{state}=="online"
    ATTR{valid_zones}=="Movable"

  looking at parent device '/devices/system/memory':
    KERNELS=="memory"
    SUBSYSTEMS==""
    DRIVERS==""
    ATTRS{auto_online_blocks}=="offline"
    ATTRS{block_size_bytes}=="800"
    ATTRS{crash_hotplug}=="1"

For CPUs, expose the crash_hotplug attribute to the
/sys/devices/system/cpu directory. For example:

 # udevadm info --attribute-walk /sys/devices/system/cpu/cpu0
  looking at device '/devices/system/cpu/cpu0':
    KERNEL=="cpu0"
    SUBSYSTEM=="cpu"
    DRIVER=="processor"
    ATTR{crash_notes}=="277c38600"
    ATTR{crash_notes_size}=="368"
    ATTR{online}=="1"

  looking at parent device '/devices/system/cpu':
    KERNELS=="cpu"
    SUBSYSTEMS==""
    DRIVERS==""
    ATTRS{crash_hotplug}=="1"
    ATTRS{isolated}==""
    ATTRS{kernel_max}=="8191"
    ATTRS{nohz_full}==" (null)"
    ATTRS{offline}=="4-7"
    ATTRS{online}=="0-3"
    ATTRS{possible}=="0-7"
    ATTRS{present}=="0-3"

With these sysfs attributes in place, it is possible to efficiently
instruct the udev rule to skip crash kernel reloading for kernels
configured with crash hotplug support.

For example, the following is the proposed udev rule change for RHEL
system 98-kexec.rules (as the first lines of the rule file):

 # The kernel updates the crash elfcorehdr for CPU and memory changes
 SUBSYSTEM=="cpu", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"
 SUBSYSTEM=="memory", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"

When examined in the context of 98-kexec.rules, the above rules
test if crash_hotplug is set, and if so, the userspace initiated
unload-then-reload of the crash kernel is skipped.

CPU and memory checks are separated in accordance with
CONFIG_HOTPLUG_CPU and CONFIG_MEMORY_HOTPLUG kernel config options.
If an architecture supports, for example, memory hotplug but not
CPU hotplug, then the /sys/devices/system/memory/crash_hotplug
attribute file is present, but the /sys/devices/system/cpu/crash_hotplug
attribute file will NOT be present. Thus the udev rule skips
userspace processing of memory hot un/plug events, but the udev
rule will evaluate false for CPU events, thus allowing userspace to
process CPU hot un/plug events (ie the unload-then-reload of the kdump
capture kernel).

Signed-off-by: Eric DeVolder
Reviewed-by: Sourabh Jain
Acked-by: Hari Bathini
Acked-by: Baoquan He
---
 Documentation/ABI/testing/sysfs-devices-memory    |  8
 .../ABI/testing/sysfs-devices-system-cpu          |  8
 .../admin-guide/mm/memory-hotplug.rst             |  8
 Documentation/core-api/cpu_hotplug.rst            | 18 ++
 drivers/base/cpu.c                                | 13 +
 drivers/base/memory.c                             | 13 +
 include/linux/kexec.h                             |  8
 7 files changed, 76 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-devices-memory b/Documentation/ABI/testing/sysfs-devices-memory
index d8b0f80b9e33..a95e0f17c35a 100644
--- a/Documentation/ABI/testing/sysfs-devices-memory
+++ b/Documentation/ABI/testing/sysfs-devices-memory
@@ -110,3 +110,11 @@ Description:
 		link is created for memory section 9 on node0.
 
 		/sys/devices/system/node/node0/memory9 -> ../../memory/memory9
+
+What:		/sys/devices/system/memory/crash_hotplug
+Date:		Aug 2023
+Contact:	Linux kernel mailing list
+Description:
+		(RO) indicates whether or not the kernel directly supports
+		modifying the crash elfcorehdr for memory hot un/plug and/or
+		on/offline changes.
diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
index 77942eedf4f6..b52564de2b18 100644
--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
+++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -687,3 +687,11 @@ Description:
 		(RO) the list of CPUs that are isolated
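The check that the udev rule performs on the crash_hotplug attribute
can also be made from a userspace program. The following is a minimal
sketch, assuming the attribute reads as the single character "1" when
the kernel updates the elfcorehdr itself; both function names are
hypothetical, and a missing attribute file is treated the same as "0",
matching the description above of kernels or configurations that do
not expose the attribute.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Interpret the text read from a crash_hotplug attribute. */
static bool crash_hotplug_text_enabled(const char *text)
{
	return text != NULL && text[0] == '1';
}

/*
 * Return true if the attribute at attr_path (for example
 * /sys/devices/system/cpu/crash_hotplug) exists and reads as "1".
 * A missing or unreadable file means userspace must keep doing the
 * unload-then-reload of the kdump image.
 */
static bool kernel_handles_crash_hotplug(const char *attr_path)
{
	char buf[8] = { 0 };
	bool enabled = false;
	FILE *f;

	assert(attr_path != NULL);

	f = fopen(attr_path, "r");
	if (f == NULL)
		return false;
	if (fgets(buf, sizeof(buf), f) != NULL)
		enabled = crash_hotplug_text_enabled(buf);
	fclose(f);
	return enabled;
}
```

Checking the cpu and memory attributes separately mirrors the two udev
rule lines, since either file may be absent depending on
CONFIG_HOTPLUG_CPU and CONFIG_MEMORY_HOTPLUG.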
[PATCH v27 1/8] crash: move a few code bits to setup support of crash hotplug
The crash hotplug support leans on the work for the kexec_file_load() syscall. To also support the kexec_load() syscall, a few bits of code need to be moved outside of CONFIG_KEXEC_FILE. As such, these bits are moved out of kexec_file.c and into a common location crash_core.c. No functionality change intended. Signed-off-by: Eric DeVolder Reviewed-by: Sourabh Jain Acked-by: Hari Bathini Acked-by: Baoquan He --- include/linux/kexec.h | 30 +++ kernel/crash_core.c | 182 ++ kernel/kexec_file.c | 181 - 3 files changed, 197 insertions(+), 196 deletions(-) diff --git a/include/linux/kexec.h b/include/linux/kexec.h index 22b5cd24f581..811a90e09698 100644 --- a/include/linux/kexec.h +++ b/include/linux/kexec.h @@ -105,6 +105,21 @@ struct compat_kexec_segment { }; #endif +/* Alignment required for elf header segment */ +#define ELF_CORE_HEADER_ALIGN 4096 + +struct crash_mem { + unsigned int max_nr_ranges; + unsigned int nr_ranges; + struct range ranges[]; +}; + +extern int crash_exclude_mem_range(struct crash_mem *mem, + unsigned long long mstart, + unsigned long long mend); +extern int crash_prepare_elf64_headers(struct crash_mem *mem, int need_kernel_map, + void **addr, unsigned long *sz); + #ifdef CONFIG_KEXEC_FILE struct purgatory_info { /* @@ -230,21 +245,6 @@ static inline int arch_kexec_locate_mem_hole(struct kexec_buf *kbuf) } #endif -/* Alignment required for elf header segment */ -#define ELF_CORE_HEADER_ALIGN 4096 - -struct crash_mem { - unsigned int max_nr_ranges; - unsigned int nr_ranges; - struct range ranges[]; -}; - -extern int crash_exclude_mem_range(struct crash_mem *mem, - unsigned long long mstart, - unsigned long long mend); -extern int crash_prepare_elf64_headers(struct crash_mem *mem, int need_kernel_map, - void **addr, unsigned long *sz); - #ifndef arch_kexec_apply_relocations_add /* * arch_kexec_apply_relocations_add - apply relocations of type RELA diff --git a/kernel/crash_core.c b/kernel/crash_core.c index 90ce1dfd591c..b7c30b748a16 100644 --- 
a/kernel/crash_core.c +++ b/kernel/crash_core.c @@ -10,6 +10,7 @@ #include #include #include +#include #include #include @@ -314,6 +315,187 @@ static int __init parse_crashkernel_dummy(char *arg) } early_param("crashkernel", parse_crashkernel_dummy); +int crash_prepare_elf64_headers(struct crash_mem *mem, int need_kernel_map, + void **addr, unsigned long *sz) +{ + Elf64_Ehdr *ehdr; + Elf64_Phdr *phdr; + unsigned long nr_cpus = num_possible_cpus(), nr_phdr, elf_sz; + unsigned char *buf; + unsigned int cpu, i; + unsigned long long notes_addr; + unsigned long mstart, mend; + + /* extra phdr for vmcoreinfo ELF note */ + nr_phdr = nr_cpus + 1; + nr_phdr += mem->nr_ranges; + + /* +* kexec-tools creates an extra PT_LOAD phdr for kernel text mapping +* area (for example, 8000 - a000 on x86_64). +* I think this is required by tools like gdb. So same physical +* memory will be mapped in two ELF headers. One will contain kernel +* text virtual addresses and other will have __va(physical) addresses. 
+*/ + + nr_phdr++; + elf_sz = sizeof(Elf64_Ehdr) + nr_phdr * sizeof(Elf64_Phdr); + elf_sz = ALIGN(elf_sz, ELF_CORE_HEADER_ALIGN); + + buf = vzalloc(elf_sz); + if (!buf) + return -ENOMEM; + + ehdr = (Elf64_Ehdr *)buf; + phdr = (Elf64_Phdr *)(ehdr + 1); + memcpy(ehdr->e_ident, ELFMAG, SELFMAG); + ehdr->e_ident[EI_CLASS] = ELFCLASS64; + ehdr->e_ident[EI_DATA] = ELFDATA2LSB; + ehdr->e_ident[EI_VERSION] = EV_CURRENT; + ehdr->e_ident[EI_OSABI] = ELF_OSABI; + memset(ehdr->e_ident + EI_PAD, 0, EI_NIDENT - EI_PAD); + ehdr->e_type = ET_CORE; + ehdr->e_machine = ELF_ARCH; + ehdr->e_version = EV_CURRENT; + ehdr->e_phoff = sizeof(Elf64_Ehdr); + ehdr->e_ehsize = sizeof(Elf64_Ehdr); + ehdr->e_phentsize = sizeof(Elf64_Phdr); + + /* Prepare one phdr of type PT_NOTE for each present CPU */ + for_each_present_cpu(cpu) { + phdr->p_type = PT_NOTE; + notes_addr = per_cpu_ptr_to_phys(per_cpu_ptr(crash_notes, cpu)); + phdr->p_offset = phdr->p_paddr = notes_addr; + phdr->p_filesz = phdr->p_memsz = sizeof(note_buf_t); + (ehdr->e_phnum)++; + phdr++; + } + + /* Prepare one PT_NOTE header for vmcoreinfo */ + phdr->p_type = PT_NOTE; +
[PATCH v27 6/8] crash: hotplug support for kexec_load()
The hotplug support for kexec_load() requires changes to the userspace kexec-tools and a little extra help from the kernel. Given a kdump capture kernel loaded via kexec_load(), and a subsequent hotplug event, the crash hotplug handler finds the elfcorehdr and rewrites it to reflect the hotplug change. That is the desired outcome, however, at kernel panic time, the purgatory integrity check fails (because the elfcorehdr changed), and the capture kernel does not boot and no vmcore is generated. Therefore, the userspace kexec-tools/kexec must indicate to the kernel that the elfcorehdr can be modified (because the kexec excluded the elfcorehdr from the digest, and sized the elfcorehdr memory buffer appropriately). To facilitate hotplug support with kexec_load(): - a new kexec flag KEXEC_UPATE_ELFCOREHDR indicates that it is safe for the kernel to modify the kexec_load()'d elfcorehdr - the /sys/kernel/crash_elfcorehdr_size node communicates the preferred size of the elfcorehdr memory buffer - The sysfs crash_hotplug nodes (ie. /sys/devices/system/[cpu|memory]/crash_hotplug) dynamically take into account kexec_file_load() vs kexec_load() and KEXEC_UPDATE_ELFCOREHDR. This is critical so that the udev rule processing of crash_hotplug is all that is needed to determine if the userspace unload-then-load of the kdump image is to be skipped, or not. 

The proposed udev rule change looks like:
 # The kernel updates the crash elfcorehdr for CPU and memory changes
 SUBSYSTEM=="cpu", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"
 SUBSYSTEM=="memory", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"

The table below indicates the behavior of kexec_load()'d kdump image
updates (with the new udev crash_hotplug rule in place):

        |    Kexec
 Kernel | Old | New
 -------+-----+-----
 Old    |  a  |  a
 -------+-----+-----
 New    |  a  |  b
 -------+-----+-----

where kexec 'old' and 'new' indicate whether kexec-tools has the
needed modifications for the crash hotplug feature, and kernel 'old'
and 'new' indicate whether the kernel supports this crash hotplug
feature.

Behavior 'a' indicates the unload-then-reload of the entire kdump
image. For the kexec 'old' column, the unload-then-reload occurs due
to the missing flag KEXEC_UPDATE_ELFCOREHDR. An 'old' kernel (with
'new' kexec) does not present the crash_hotplug sysfs node, which
leads to the unload-then-reload of the kdump image.

Behavior 'b' indicates the desired optimized behavior of the kernel
directly modifying the elfcorehdr and avoiding the unload-then-reload
of the kdump image.

If the udev rule is not updated with the crash_hotplug node check, then
regardless of whether the kernel or kexec is new or old, the kdump
image continues to be unloaded and then reloaded on hotplug changes.

To fully support the crash hotplug feature, there needs to be a rollout
of kernel, kexec-tools and udev rule changes. However, the order of
the rollout of these pieces does not matter; kexec_load()'d kdump
images still function for hotplug as-is.
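
The behavior table above can be restated as a small predicate. What follows is only an illustrative userspace sketch, not code from this series; the enum, the function name and the udev_rule_new parameter are hypothetical and exist only to make the three rollout pieces explicit.

```c
#include <stdbool.h>

/* Hypothetical names for the two outcomes in the behavior table. */
enum kdump_update {
	KDUMP_RELOAD,		/* behavior 'a': userspace unload-then-reload */
	KDUMP_IN_KERNEL		/* behavior 'b': kernel rewrites the elfcorehdr */
};

/*
 * Model of the rollout matrix. The optimized path needs all three
 * pieces: a kernel exposing crash_hotplug, a kexec that passes
 * KEXEC_UPDATE_ELFCOREHDR, and a udev rule that checks crash_hotplug.
 * Any missing piece falls back to the full reload.
 */
static enum kdump_update kdump_behavior(bool kernel_new, bool kexec_new,
					bool udev_rule_new)
{
	if (kernel_new && kexec_new && udev_rule_new)
		return KDUMP_IN_KERNEL;
	return KDUMP_RELOAD;
}
```

Because every other combination falls back to behavior 'a', the order in which the three pieces are rolled out does not affect correctness, only whether the optimization is active.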

Suggested-by: Hari Bathini
Signed-off-by: Eric DeVolder
Acked-by: Hari Bathini
Acked-by: Baoquan He
---
 arch/x86/include/asm/kexec.h | 11 +++
 arch/x86/kernel/crash.c      | 27 +++
 include/linux/kexec.h        | 14 --
 include/uapi/linux/kexec.h   |  1 +
 kernel/Kconfig.kexec         |  4
 kernel/crash_core.c          | 31 +++
 kernel/kexec.c               |  5 +
 kernel/ksysfs.c              | 15 +++
 8 files changed, 102 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
index 9143100ea3ea..3be6a98751f0 100644
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -214,14 +214,17 @@ void arch_crash_handle_hotplug_event(struct kimage *image);
 #define arch_crash_handle_hotplug_event arch_crash_handle_hotplug_event
 
 #ifdef CONFIG_HOTPLUG_CPU
-static inline int crash_hotplug_cpu_support(void) { return 1; }
-#define crash_hotplug_cpu_support crash_hotplug_cpu_support
+int arch_crash_hotplug_cpu_support(void);
+#define crash_hotplug_cpu_support arch_crash_hotplug_cpu_support
 #endif
 
 #ifdef CONFIG_MEMORY_HOTPLUG
-static inline int crash_hotplug_memory_support(void) { return 1; }
-#define crash_hotplug_memory_support crash_hotplug_memory_support
+int arch_crash_hotplug_memory_support(void);
+#define crash_hotplug_memory_support arch_crash_hotplug_memory_support
 #endif
+
+unsigned int arch_crash_get_elfcorehdr_size(void);
+#define crash_get_elfcorehdr_size arch_crash_get_elfcorehdr_size
 #endif
 
 #endif /* __ASSEMBLY__ */
diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index c70a111c44fa..caf22bcb61af 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -427,6 +427,33 @@ int crash_load_segments(struct kimage *image)
 #undef pr_fmt
 #define pr_fmt(fmt) "cras
[PATCH v26 7/8] crash: change crash_prepare_elf64_headers() to for_each_possible_cpu()
The function crash_prepare_elf64_headers() generates the elfcorehdr
which describes the CPUs and memory in the system for the crash kernel.
In particular, it writes out ELF PT_NOTEs for memory regions and the
CPUs in the system.

With respect to the CPUs, the current implementation utilizes
for_each_present_cpu() which means that as CPUs are added and removed,
the elfcorehdr must again be updated to reflect the new set of CPUs.

The reasoning behind the move to use for_each_possible_cpu() is:

- At kernel boot time, all percpu crash_notes are allocated for all
  possible CPUs; that is, crash_notes are not allocated dynamically
  when CPUs are plugged/unplugged. Thus the crash_notes for each
  possible CPU are always available.

- The crash_prepare_elf64_headers() creates an ELF PT_NOTE per CPU.
  Changing to for_each_possible_cpu() is valid as the crash_notes
  pointed to by each CPU PT_NOTE are present and always valid.

Furthermore, examining a common crash processing path of:

 kernel panic -> crash kernel -> makedumpfile -> 'crash' analyzer
           elfcorehdr      /proc/vmcore     vmcore

reveals how the ELF CPU PT_NOTEs are utilized:

- Upon panic, each CPU is sent an IPI and shuts itself down, recording
  its state in its crash_notes. When all CPUs are shut down, the
  crash kernel is launched with a pointer to the elfcorehdr.

- The crash kernel via linux/fs/proc/vmcore.c does not examine or
  use the contents of the PT_NOTEs; it exposes them via /proc/vmcore.

- The makedumpfile utility uses /proc/vmcore and reads the CPU
  PT_NOTEs to craft a nr_cpus variable, which is reported in a
  header but otherwise generally unused. Makedumpfile creates the
  vmcore.

- The 'crash' dump analyzer does not appear to reference the CPU
  PT_NOTEs. Instead it looks up the cpu_[possible|present|online]_mask
  symbols and directly examines those structure contents from vmcore
  memory. From that information it is able to determine which CPUs
  are present and online, and locate the corresponding crash_notes.
  Said differently, it appears that the 'crash' analyzer does not rely
  on the ELF PT_NOTEs for CPUs; rather it obtains the information
  directly via kernel symbols and the memory within the vmcore.

(There may be other vmcore generating and analysis tools that do use
these PT_NOTEs, but 'makedumpfile' and 'crash' seem to be the most
common solution.)

This results in the benefit of having all CPUs described in the
elfcorehdr, and therefore reducing the need to re-generate the
elfcorehdr on CPU changes, at the small expense of an additional
56 bytes per PT_NOTE for not-present-but-possible CPUs.

On systems where the kexec_file_load() syscall is utilized, all the
above is valid. On systems where the kexec_load() syscall is utilized,
there may be the need for the elfcorehdr to be regenerated once. The
reason is that some archs only populate the 'present' CPUs from the
/sys/devices/system/cpu entries, which the userspace 'kexec' utility
uses to generate the userspace-supplied elfcorehdr. In this situation,
one memory or CPU change will rewrite the elfcorehdr via the
crash_prepare_elf64_headers() function and now all possible CPUs will
be described, just as with the kexec_file_load() syscall.
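
The 56-byte figure is the size of one Elf64_Phdr. As a rough illustration only (a userspace sketch, not part of the patch), the extra elfcorehdr space can be computed as below. The helper name is hypothetical, and the sketch assumes a Linux userspace where <elf.h> provides Elf64_Phdr.

```c
#include <elf.h>
#include <stddef.h>

/*
 * Extra elfcorehdr bytes spent on PT_NOTE program headers for CPUs
 * that are possible but not present. Each such CPU costs exactly one
 * Elf64_Phdr (56 bytes on a 64-bit ELF layout).
 */
static size_t extra_note_phdr_bytes(unsigned int possible_cpus,
				    unsigned int present_cpus)
{
	if (possible_cpus <= present_cpus)
		return 0;
	return (size_t)(possible_cpus - present_cpus) * sizeof(Elf64_Phdr);
}
```

For a machine with 8 possible and 4 present CPUs, this comes to 4 * 56 = 224 bytes, which is small compared with the 4096-byte alignment already applied to the elfcorehdr.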

Suggested-by: Sourabh Jain
Signed-off-by: Eric DeVolder
Reviewed-by: Sourabh Jain
Acked-by: Hari Bathini
Acked-by: Baoquan He
---
 kernel/crash_core.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index fa918176d46d..7378b501fada 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -364,8 +364,8 @@ int crash_prepare_elf64_headers(struct crash_mem *mem, int need_kernel_map,
 	ehdr->e_ehsize = sizeof(Elf64_Ehdr);
 	ehdr->e_phentsize = sizeof(Elf64_Phdr);
 
-	/* Prepare one phdr of type PT_NOTE for each present CPU */
-	for_each_present_cpu(cpu) {
+	/* Prepare one phdr of type PT_NOTE for each possible CPU */
+	for_each_possible_cpu(cpu) {
 		phdr->p_type = PT_NOTE;
 		notes_addr = per_cpu_ptr_to_phys(per_cpu_ptr(crash_notes, cpu));
 		phdr->p_offset = phdr->p_paddr = notes_addr;
-- 
2.31.1

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
[PATCH v26 8/8] x86/crash: optimize CPU changes
crash_prepare_elf64_headers() writes into the elfcorehdr an ELF
PT_NOTE for all possible CPUs. As such, subsequent changes to CPUs
(i.e. hot un/plug, online/offline) do not need to rewrite the elfcorehdr.

The kimage->file_mode term covers kdump images loaded via the
kexec_file_load() syscall. Since crash_prepare_elf64_headers()
wrote the initial elfcorehdr, no update to the elfcorehdr is
needed for CPU changes.

The kimage->elfcorehdr_updated term covers kdump images loaded via
the kexec_load() syscall. At least one memory or CPU change must occur
to cause crash_prepare_elf64_headers() to rewrite the elfcorehdr.
Afterwards, no update to the elfcorehdr is needed for CPU changes.

This code is intentionally *NOT* hoisted into
crash_handle_hotplug_event() as it would prevent the arch-specific
handler from running for CPU changes. This would break PPC, for
example, which needs to update other information besides the
elfcorehdr, on CPU changes.

Signed-off-by: Eric DeVolder
Reviewed-by: Sourabh Jain
Acked-by: Hari Bathini
Acked-by: Baoquan He
---
 arch/x86/kernel/crash.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index caf22bcb61af..18d2a18d1073 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -467,6 +467,16 @@ void arch_crash_handle_hotplug_event(struct kimage *image)
 	unsigned long mem, memsz;
 	unsigned long elfsz = 0;
 
+	/*
+	 * As crash_prepare_elf64_headers() has already described all
+	 * possible CPUs, there is no need to update the elfcorehdr
+	 * for additional CPU changes.
+	 */
+	if ((image->file_mode || image->elfcorehdr_updated) &&
+		((image->hp_action == KEXEC_CRASH_HP_ADD_CPU) ||
+		(image->hp_action == KEXEC_CRASH_HP_REMOVE_CPU)))
+		return;
+
 	/*
 	 * Create the new elfcorehdr reflecting the changes to CPU and/or
 	 * memory resources.
-- 
2.31.1
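
The early-return condition in this patch can be exercised outside the kernel with a small model. This is a sketch under assumptions: struct kimage_model and elfcorehdr_needs_rewrite() are hypothetical stand-ins, and only the KEXEC_CRASH_HP_* values are taken from the defines introduced earlier in the series.

```c
#include <stdbool.h>

/* Values as defined by the KEXEC_CRASH_HP_* macros in this series. */
#define KEXEC_CRASH_HP_NONE		0
#define KEXEC_CRASH_HP_ADD_CPU		1
#define KEXEC_CRASH_HP_REMOVE_CPU	2
#define KEXEC_CRASH_HP_ADD_MEMORY	3
#define KEXEC_CRASH_HP_REMOVE_MEMORY	4

/* Hypothetical stand-in for the struct kimage fields used by the check. */
struct kimage_model {
	bool file_mode;			/* loaded via kexec_file_load() */
	bool elfcorehdr_updated;	/* kernel already rewrote the elfcorehdr */
	int hp_action;
};

/* Returns false exactly when the x86 handler would return early. */
static bool elfcorehdr_needs_rewrite(const struct kimage_model *image)
{
	bool cpu_event = image->hp_action == KEXEC_CRASH_HP_ADD_CPU ||
			 image->hp_action == KEXEC_CRASH_HP_REMOVE_CPU;
	bool all_cpus_described = image->file_mode ||
				  image->elfcorehdr_updated;

	return !(cpu_event && all_cpus_described);
}
```

Memory events always report true here, and so does the first CPU event after a kexec_load() with a userspace-built elfcorehdr, which matches the two terms described in the commit message.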
[PATCH v26 3/8] kexec: exclude elfcorehdr from the segment digest
When a crash kernel is loaded via the kexec_file_load() syscall, the
kernel places the various segments (i.e. crash kernel, crash initrd,
boot_params, elfcorehdr, purgatory, etc) in memory. For those
architectures that utilize purgatory, a hash digest of the segments
is calculated for integrity checking. The digest is embedded into
the purgatory image prior to placing in memory.

Updates to the elfcorehdr in response to CPU and memory changes
would cause the purgatory integrity checking to fail (at crash time,
and no vmcore created). Therefore, the elfcorehdr segment is
explicitly excluded from the purgatory digest, enabling updates to
the elfcorehdr while also avoiding the need to recompute the hash
digest and reload purgatory.

Suggested-by: Baoquan He
Signed-off-by: Eric DeVolder
Reviewed-by: Sourabh Jain
Acked-by: Hari Bathini
Acked-by: Baoquan He
---
 kernel/kexec_file.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index 453b7a513540..e2ec9d7b9a1f 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -726,6 +726,12 @@ static int kexec_calculate_store_digests(struct kimage *image)
 	for (j = i = 0; i < image->nr_segments; i++) {
 		struct kexec_segment *ksegment;
 
+#ifdef CONFIG_CRASH_HOTPLUG
+		/* Exclude elfcorehdr segment to allow future changes via hotplug */
+		if (j == image->elfcorehdr_index)
+			continue;
+#endif
+
 		ksegment = &image->segment[i];
 		/*
 		 * Skip purgatory as it will be modified once we put digest
-- 
2.31.1
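
To illustrate why excluding one segment makes later rewrites of that segment harmless, here is a minimal userspace sketch. It is not the kernel implementation: it swaps SHA-256 for a simple FNV-1a hash to stay self-contained, and struct segment_model and digest_segments() are hypothetical names. The skipped segment is identified here by its position in the segment array.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct kexec_segment. */
struct segment_model {
	const unsigned char *buf;
	size_t bufsz;
};

/*
 * Hash every segment except the one at skip_index. FNV-1a is used
 * purely for illustration; the kernel digest is SHA-256.
 */
static uint64_t digest_segments(const struct segment_model *seg, size_t nr,
				size_t skip_index)
{
	uint64_t h = 0xcbf29ce484222325ULL;	/* FNV-1a 64-bit offset basis */

	for (size_t i = 0; i < nr; i++) {
		if (i == skip_index)
			continue;	/* excluded segment may change freely */
		for (size_t k = 0; k < seg[i].bufsz; k++) {
			h ^= seg[i].buf[k];
			h *= 0x100000001b3ULL;	/* FNV-1a 64-bit prime */
		}
	}
	return h;
}
```

With the elfcorehdr at the skipped index, rewriting its bytes leaves the digest unchanged, while a change to any other segment still alters it.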
[PATCH v26 1/8] crash: move a few code bits to setup support of crash hotplug
The crash hotplug support leans on the work for the kexec_file_load()
syscall. To also support the kexec_load() syscall, a few bits of code
need to be moved outside of CONFIG_KEXEC_FILE. As such, these bits are
moved out of kexec_file.c and into a common location crash_core.c.

No functionality change intended.

Signed-off-by: Eric DeVolder
Reviewed-by: Sourabh Jain
Acked-by: Hari Bathini
Acked-by: Baoquan He
---
 include/linux/kexec.h |  30 +++
 kernel/crash_core.c   | 182 ++
 kernel/kexec_file.c   | 181 -
 3 files changed, 197 insertions(+), 196 deletions(-)

diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index 22b5cd24f581..811a90e09698 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -105,6 +105,21 @@ struct compat_kexec_segment {
 };
 #endif
 
+/* Alignment required for elf header segment */
+#define ELF_CORE_HEADER_ALIGN   4096
+
+struct crash_mem {
+	unsigned int max_nr_ranges;
+	unsigned int nr_ranges;
+	struct range ranges[];
+};
+
+extern int crash_exclude_mem_range(struct crash_mem *mem,
+				   unsigned long long mstart,
+				   unsigned long long mend);
+extern int crash_prepare_elf64_headers(struct crash_mem *mem, int need_kernel_map,
+				       void **addr, unsigned long *sz);
+
 #ifdef CONFIG_KEXEC_FILE
 struct purgatory_info {
 	/*
@@ -230,21 +245,6 @@ static inline int arch_kexec_locate_mem_hole(struct kexec_buf *kbuf)
 }
 #endif
 
-/* Alignment required for elf header segment */
-#define ELF_CORE_HEADER_ALIGN   4096
-
-struct crash_mem {
-	unsigned int max_nr_ranges;
-	unsigned int nr_ranges;
-	struct range ranges[];
-};
-
-extern int crash_exclude_mem_range(struct crash_mem *mem,
-				   unsigned long long mstart,
-				   unsigned long long mend);
-extern int crash_prepare_elf64_headers(struct crash_mem *mem, int need_kernel_map,
-				       void **addr, unsigned long *sz);
-
 #ifndef arch_kexec_apply_relocations_add
 /*
  * arch_kexec_apply_relocations_add - apply relocations of type RELA
diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index 90ce1dfd591c..b7c30b748a16 100644
--- 
a/kernel/crash_core.c +++ b/kernel/crash_core.c @@ -10,6 +10,7 @@ #include #include #include +#include #include #include @@ -314,6 +315,187 @@ static int __init parse_crashkernel_dummy(char *arg) } early_param("crashkernel", parse_crashkernel_dummy); +int crash_prepare_elf64_headers(struct crash_mem *mem, int need_kernel_map, + void **addr, unsigned long *sz) +{ + Elf64_Ehdr *ehdr; + Elf64_Phdr *phdr; + unsigned long nr_cpus = num_possible_cpus(), nr_phdr, elf_sz; + unsigned char *buf; + unsigned int cpu, i; + unsigned long long notes_addr; + unsigned long mstart, mend; + + /* extra phdr for vmcoreinfo ELF note */ + nr_phdr = nr_cpus + 1; + nr_phdr += mem->nr_ranges; + + /* +* kexec-tools creates an extra PT_LOAD phdr for kernel text mapping +* area (for example, 8000 - a000 on x86_64). +* I think this is required by tools like gdb. So same physical +* memory will be mapped in two ELF headers. One will contain kernel +* text virtual addresses and other will have __va(physical) addresses. 
+*/ + + nr_phdr++; + elf_sz = sizeof(Elf64_Ehdr) + nr_phdr * sizeof(Elf64_Phdr); + elf_sz = ALIGN(elf_sz, ELF_CORE_HEADER_ALIGN); + + buf = vzalloc(elf_sz); + if (!buf) + return -ENOMEM; + + ehdr = (Elf64_Ehdr *)buf; + phdr = (Elf64_Phdr *)(ehdr + 1); + memcpy(ehdr->e_ident, ELFMAG, SELFMAG); + ehdr->e_ident[EI_CLASS] = ELFCLASS64; + ehdr->e_ident[EI_DATA] = ELFDATA2LSB; + ehdr->e_ident[EI_VERSION] = EV_CURRENT; + ehdr->e_ident[EI_OSABI] = ELF_OSABI; + memset(ehdr->e_ident + EI_PAD, 0, EI_NIDENT - EI_PAD); + ehdr->e_type = ET_CORE; + ehdr->e_machine = ELF_ARCH; + ehdr->e_version = EV_CURRENT; + ehdr->e_phoff = sizeof(Elf64_Ehdr); + ehdr->e_ehsize = sizeof(Elf64_Ehdr); + ehdr->e_phentsize = sizeof(Elf64_Phdr); + + /* Prepare one phdr of type PT_NOTE for each present CPU */ + for_each_present_cpu(cpu) { + phdr->p_type = PT_NOTE; + notes_addr = per_cpu_ptr_to_phys(per_cpu_ptr(crash_notes, cpu)); + phdr->p_offset = phdr->p_paddr = notes_addr; + phdr->p_filesz = phdr->p_memsz = sizeof(note_buf_t); + (ehdr->e_phnum)++; + phdr++; + } + + /* Prepare one PT_NOTE header for vmcoreinfo */ + phdr->p_type = PT_NOTE; +
[PATCH v26 0/8] crash: Kernel handling of CPU and memory hot un/plug
   kexec_load and kexec_file_load can benefit.
 - Updated userspace kexec-tools kexec utility to reflect change to
   using CRASH_MAX_MEMORY_RANGES and get_nr_cpus().
 - Updated the new proposed udev rules to reflect using the sysfs
   attributes crash_hotplug.

v8: 5may2022
 https://lkml.org/lkml/2022/5/5/1133
 https://lore.kernel.org/lkml/20220505184603.1548-1-eric.devol...@oracle.com/
 - Per Borislav Petkov, eliminated CONFIG_CRASH_HOTPLUG in favor of
   CONFIG_HOTPLUG_CPU || CONFIG_MEMORY_HOTPLUG, i.e. a new define
   is not needed. Also use of IS_ENABLED() rather than #ifdef's.
   Renamed crash_hotplug_handler() to handle_hotplug_event().
   And other corrections.
 - Per Baoquan, minimized the parameters to the
   arch_crash_handle_hotplug_event() to hp_action and cpu.
 - Introduce KEXEC_CRASH_HP_INVALID_CPU definition, per Baoquan.
 - Per Sourabh Jain, renamed and repurposed CRASH_HOTPLUG_ELFCOREHDR_SZ
   to CONFIG_CRASH_MAX_MEMORY_RANGES, mirroring kexec-tools change
   by David Hildenbrand. Folded this patch into the x86
   kexec_file_load support patch.

v7: 13apr2022
 https://lkml.org/lkml/2022/4/13/850
 https://lore.kernel.org/lkml/20220413164237.20845-1-eric.devol...@oracle.com/
 - Resolved parameter usage to crash_hotplug_handler(), per Baoquan.

v6: 1apr2022
 https://lkml.org/lkml/2022/4/1/1203
 https://lore.kernel.org/lkml/20220401183040.1624-1-eric.devol...@oracle.com/
 - Reword commit messages and some comment cleanup per Baoquan.
 - Changed elf_index to elfcorehdr_index for clarity.
 - Minor code changes per Baoquan.

v5: 3mar2022
 https://lkml.org/lkml/2022/3/3/674
 https://lore.kernel.org/lkml/20220303162725.49640-1-eric.devol...@oracle.com/
 - Reworded description of CRASH_HOTPLUG_ELFCOREHDR_SZ, per
   David Hildenbrand.
 - Refactored slightly a few patches per Baoquan recommendation.

v4: 9feb2022
 https://lkml.org/lkml/2022/2/9/1406
 https://lore.kernel.org/lkml/20220209195706.51522-1-eric.devol...@oracle.com/
 - Refactored patches per Baoquan suggestions.
 - A few corrections, per Baoquan.

v3: 10jan2022
 https://lkml.org/lkml/2022/1/10/1212
 https://lore.kernel.org/lkml/20220110195727.1682-1-eric.devol...@oracle.com/
 - Rebasing per Baoquan He request.
 - Changed memory notifier per David Hildenbrand.
 - Providing example kexec userspace change in cover letter.

RFC v2: 7dec2021
 https://lkml.org/lkml/2021/12/7/1088
 https://lore.kernel.org/lkml/20211207195204.1582-1-eric.devol...@oracle.com/
 - Acting upon Baoquan He suggestion of removing elfcorehdr from
   the purgatory list of segments, removed purgatory code from
   patchset, and it is significantly simpler now.

RFC v1: 18nov2021
 https://lkml.org/lkml/2021/11/18/845
 https://lore.kernel.org/lkml/2028174948.37435-1-eric.devol...@oracle.com/
 - working patchset demonstrating kernel handling of hotplug
   updates to x86 elfcorehdr for kexec_file_load

RFC: 14dec2020
 https://lkml.org/lkml/2020/12/14/532
 https://lore.kernel.org/lkml/b04ed259-dc5f-7f30-6661-c26f92d90...@oracle.com/
 NOTE: s/vmcoreinfo/elfcorehdr/g
 - proposed concept of allowing kernel to handle hotplug update
   of elfcorehdr
---

Eric DeVolder (8):
  crash: move a few code bits to setup support of crash hotplug
  crash: add generic infrastructure for crash hotplug support
  kexec: exclude elfcorehdr from the segment digest
  crash: memory and CPU hotplug sysfs attributes
  x86/crash: add x86 crash hotplug support
  crash: hotplug support for kexec_load()
  crash: change crash_prepare_elf64_headers() to for_each_possible_cpu()
  x86/crash: optimize CPU changes

 .../ABI/testing/sysfs-devices-memory      |   8 +
 .../ABI/testing/sysfs-devices-system-cpu  |   8 +
 .../admin-guide/mm/memory-hotplug.rst     |   8 +
 Documentation/core-api/cpu_hotplug.rst    |  18 +
 arch/x86/Kconfig                          |   3 +
 arch/x86/include/asm/kexec.h              |  18 +
 arch/x86/kernel/crash.c                   | 140 ++-
 drivers/base/cpu.c                        |  13 +
 drivers/base/memory.c                     |  13 +
 include/linux/crash_core.h                |   9 +
 include/linux/kexec.h                     |  63 +++-
 include/uapi/linux/kexec.h                |   1 +
 kernel/Kconfig.kexec                      |  35 ++
 kernel/crash_core.c                       | 355 ++
 kernel/kexec.c                            |   5 +
 kernel/kexec_core.c                       |   6 +
 kernel/kexec_file.c                       | 187 +
 kernel/ksysfs.c                           |  15 +
 18 files changed, 700 insertions(+), 205 deletions(-)

-- 
2.31.1
[PATCH v26 6/8] crash: hotplug support for kexec_load()
The hotplug support for kexec_load() requires changes to the
userspace kexec-tools and a little extra help from the kernel.

Given a kdump capture kernel loaded via kexec_load(), and a
subsequent hotplug event, the crash hotplug handler finds the
elfcorehdr and rewrites it to reflect the hotplug change.
That is the desired outcome; however, at kernel panic time, the
purgatory integrity check fails (because the elfcorehdr changed),
the capture kernel does not boot, and no vmcore is generated.

Therefore, the userspace kexec-tools/kexec must indicate to the
kernel that the elfcorehdr can be modified (because kexec
excluded the elfcorehdr from the digest, and sized the elfcorehdr
memory buffer appropriately).

To facilitate hotplug support with kexec_load():
 - a new kexec flag KEXEC_UPDATE_ELFCOREHDR indicates that it is
   safe for the kernel to modify the kexec_load()'d elfcorehdr
 - the /sys/kernel/crash_elfcorehdr_size node communicates the
   preferred size of the elfcorehdr memory buffer
 - the sysfs crash_hotplug nodes (i.e.
   /sys/devices/system/[cpu|memory]/crash_hotplug) dynamically
   take into account kexec_file_load() vs kexec_load() and
   KEXEC_UPDATE_ELFCOREHDR.
   This is critical so that the udev rule processing of crash_hotplug
   is all that is needed to determine if the userspace unload-then-load
   of the kdump image is to be skipped, or not.

The proposed udev rule change looks like:
 # The kernel updates the crash elfcorehdr for CPU and memory changes
 SUBSYSTEM=="cpu", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"
 SUBSYSTEM=="memory", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"

The table below indicates the behavior of kexec_load()'d kdump image
updates (with the new udev crash_hotplug rule in place):

        |    Kexec
 Kernel | Old | New
 -------+-----+-----
 Old    |  a  |  a
 -------+-----+-----
 New    |  a  |  b
 -------+-----+-----

where kexec 'old' and 'new' indicate whether kexec-tools has the
needed modifications for the crash hotplug feature, and kernel 'old'
and 'new' indicate whether the kernel supports this crash hotplug
feature.

Behavior 'a' indicates the unload-then-reload of the entire kdump
image. For the kexec 'old' column, the unload-then-reload occurs due
to the missing flag KEXEC_UPDATE_ELFCOREHDR. An 'old' kernel (with
'new' kexec) does not present the crash_hotplug sysfs node, which
leads to the unload-then-reload of the kdump image.

Behavior 'b' indicates the desired optimized behavior of the kernel
directly modifying the elfcorehdr and avoiding the unload-then-reload
of the kdump image.

If the udev rule is not updated with the crash_hotplug node check, then
regardless of whether the kernel or kexec is new or old, the kdump
image continues to be unloaded and then reloaded on hotplug changes.

To fully support the crash hotplug feature, there needs to be a rollout
of kernel, kexec-tools and udev rule changes. However, the order of
the rollout of these pieces does not matter; kexec_load()'d kdump
images still function for hotplug as-is.

Suggested-by: Hari Bathini
Signed-off-by: Eric DeVolder
Acked-by: Hari Bathini
Acked-by: Baoquan He
---
 arch/x86/include/asm/kexec.h | 11 +++
 arch/x86/kernel/crash.c      | 27 +++
 include/linux/kexec.h        | 14 --
 include/uapi/linux/kexec.h   |  1 +
 kernel/Kconfig.kexec         |  4
 kernel/crash_core.c          | 31 +++
 kernel/kexec.c               |  5 +
 kernel/ksysfs.c              | 15 +++
 8 files changed, 102 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
index 9143100ea3ea..3be6a98751f0 100644
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -214,14 +214,17 @@ void arch_crash_handle_hotplug_event(struct kimage *image);
 #define arch_crash_handle_hotplug_event arch_crash_handle_hotplug_event
 
 #ifdef CONFIG_HOTPLUG_CPU
-static inline int crash_hotplug_cpu_support(void) { return 1; }
-#define crash_hotplug_cpu_support crash_hotplug_cpu_support
+int arch_crash_hotplug_cpu_support(void);
+#define crash_hotplug_cpu_support arch_crash_hotplug_cpu_support
 #endif
 
 #ifdef CONFIG_MEMORY_HOTPLUG
-static inline int crash_hotplug_memory_support(void) { return 1; }
-#define crash_hotplug_memory_support crash_hotplug_memory_support
+int arch_crash_hotplug_memory_support(void);
+#define crash_hotplug_memory_support arch_crash_hotplug_memory_support
 #endif
+
+unsigned int arch_crash_get_elfcorehdr_size(void);
+#define crash_get_elfcorehdr_size arch_crash_get_elfcorehdr_size
 #endif
 
 #endif /* __ASSEMBLY__ */
diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index c70a111c44fa..caf22bcb61af 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -427,6 +427,33 @@ int crash_load_segments(struct kimage *image)
 #undef pr_fmt
 #define pr_fmt(fmt) "cras
[PATCH v26 2/8] crash: add generic infrastructure for crash hotplug support
To support crash hotplug, a mechanism is needed to update the crash
elfcorehdr upon CPU or memory changes (e.g. hot un/plug or off/onlining).
The crash elfcorehdr describes the CPUs and memory to be written into
the vmcore.

To track CPU changes, callbacks are registered with the cpuhp
mechanism via cpuhp_setup_state_nocalls(CPUHP_BP_PREPARE_DYN). The
crash hotplug elfcorehdr update has no explicit ordering requirement
(relative to other cpuhp states), so it meets the criteria for
utilizing CPUHP_BP_PREPARE_DYN. CPUHP_BP_PREPARE_DYN is a dynamic
state and avoids the need to introduce a new state for crash
hotplug. Also, CPUHP_BP_PREPARE_DYN is the last state in the PREPARE
group, just prior to the STARTING group, which is very close to the
CPU starting up in a plug/online situation, or stopping in an
unplug/offline situation. This minimizes the window of time during
an actual plug/online or unplug/offline situation in which the
elfcorehdr would be inaccurate. Note that for a CPU being unplugged
or offlined, the CPU will still be present in the list of CPUs
generated by crash_prepare_elf64_headers(). However, there is no
need to explicitly omit the CPU; see the justification in
'crash: change crash_prepare_elf64_headers() to for_each_possible_cpu()'.

To track memory changes, a notifier is registered to capture the
memblock MEM_ONLINE and MEM_OFFLINE events via register_memory_notifier().

The CPU callbacks and memory notifiers invoke crash_handle_hotplug_event()
which performs needed tasks and then dispatches the event to the
architecture specific arch_crash_handle_hotplug_event() to update the
elfcorehdr with the current state of CPUs and memory. During the
process, the kexec_lock is held.
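
The dispatch flow described above (take the lock, record the action, call the arch handler, release the lock) can be modeled in userspace roughly as follows. This is a sketch under assumptions, not the kernel code: every *_model name is hypothetical, a C11 atomic_flag stands in for kexec_lock, and the try-lock-and-skip behavior on contention is an assumption of this sketch rather than something stated in the commit message.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Values as defined by the KEXEC_CRASH_HP_* macros in this patch. */
#define KEXEC_CRASH_HP_NONE		0
#define KEXEC_CRASH_HP_ADD_CPU		1
#define KEXEC_CRASH_HP_REMOVE_CPU	2
#define KEXEC_CRASH_HP_ADD_MEMORY	3
#define KEXEC_CRASH_HP_REMOVE_MEMORY	4

/* Hypothetical stand-in for the hotplug-related struct kimage fields. */
struct kimage_model {
	int hp_action;
	bool elfcorehdr_updated;
	unsigned int rewrites;	/* counts arch handler invocations */
};

static atomic_flag kexec_lock_model = ATOMIC_FLAG_INIT;
static struct kimage_model *crash_image_model;	/* NULL: no kdump image */

/* Stand-in for arch_crash_handle_hotplug_event(). */
static void arch_handler_model(struct kimage_model *image)
{
	image->rewrites++;
}

/* Returns true if the event reached the arch handler. */
static bool handle_hotplug_event_model(int hp_action)
{
	bool dispatched = false;

	if (atomic_flag_test_and_set(&kexec_lock_model))
		return false;	/* lock busy: skip this event */

	if (crash_image_model) {
		crash_image_model->hp_action = hp_action;
		arch_handler_model(crash_image_model);
		crash_image_model->hp_action = KEXEC_CRASH_HP_NONE;
		crash_image_model->elfcorehdr_updated = true;
		dispatched = true;
	}

	atomic_flag_clear(&kexec_lock_model);
	return dispatched;
}
```

In the kernel, the CPU callbacks and the memory notifier would both funnel into the equivalent of handle_hotplug_event_model(), differing only in the action they pass.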
Signed-off-by: Eric DeVolder Reviewed-by: Sourabh Jain Acked-by: Hari Bathini Acked-by: Baoquan He --- include/linux/crash_core.h | 9 +++ include/linux/kexec.h | 11 +++ kernel/Kconfig.kexec | 31 kernel/crash_core.c| 142 + kernel/kexec_core.c| 6 ++ 5 files changed, 199 insertions(+) diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h index de62a722431e..e14345cc7a22 100644 --- a/include/linux/crash_core.h +++ b/include/linux/crash_core.h @@ -84,4 +84,13 @@ int parse_crashkernel_high(char *cmdline, unsigned long long system_ram, int parse_crashkernel_low(char *cmdline, unsigned long long system_ram, unsigned long long *crash_size, unsigned long long *crash_base); +#define KEXEC_CRASH_HP_NONE0 +#define KEXEC_CRASH_HP_ADD_CPU 1 +#define KEXEC_CRASH_HP_REMOVE_CPU 2 +#define KEXEC_CRASH_HP_ADD_MEMORY 3 +#define KEXEC_CRASH_HP_REMOVE_MEMORY 4 +#define KEXEC_CRASH_HP_INVALID_CPU -1U + +struct kimage; + #endif /* LINUX_CRASH_CORE_H */ diff --git a/include/linux/kexec.h b/include/linux/kexec.h index 811a90e09698..b9903dd48e24 100644 --- a/include/linux/kexec.h +++ b/include/linux/kexec.h @@ -33,6 +33,7 @@ extern note_buf_t __percpu *crash_notes; #include #include #include +#include #include /* Verify architecture specific macros are defined */ @@ -360,6 +361,12 @@ struct kimage { struct purgatory_info purgatory_info; #endif +#ifdef CONFIG_CRASH_HOTPLUG + int hp_action; + int elfcorehdr_index; + bool elfcorehdr_updated; +#endif + #ifdef CONFIG_IMA_KEXEC /* Virtual address of IMA measurement buffer for kexec syscall */ void *ima_buffer; @@ -490,6 +497,10 @@ static inline int arch_kexec_post_alloc_pages(void *vaddr, unsigned int pages, g static inline void arch_kexec_pre_free_pages(void *vaddr, unsigned int pages) { } #endif +#ifndef arch_crash_handle_hotplug_event +static inline void arch_crash_handle_hotplug_event(struct kimage *image) { } +#endif + #else /* !CONFIG_KEXEC_CORE */ struct pt_regs; struct task_struct; diff --git a/kernel/Kconfig.kexec 
b/kernel/Kconfig.kexec index ff72e45cfaef..d0a9a5392035 100644 --- a/kernel/Kconfig.kexec +++ b/kernel/Kconfig.kexec @@ -113,4 +113,35 @@ config CRASH_DUMP For s390, this option also enables zfcpdump. See also +config CRASH_HOTPLUG + bool "Update the crash elfcorehdr on system configuration changes" + default y + depends on CRASH_DUMP && (HOTPLUG_CPU || MEMORY_HOTPLUG) + depends on ARCH_SUPPORTS_CRASH_HOTPLUG + help + Enable direct update to the crash elfcorehdr (which contains + the list of CPUs and memory regions to be dumped upon a crash) + in response to hot plug/unplug or online/offline of CPUs or + memory. This is a much more advanced approach than userspace + attempting that. + + If unsure, say Y. + +config CRASH_MAX_MEMORY_RANGES + int "Specify the maximum number of memory regions for the elfcorehdr" + default 8192 + depends on CRASH_HOTPLUG + help +
[PATCH v26 5/8] x86/crash: add x86 crash hotplug support
When CPU or memory is hot un/plugged, or off/onlined, the crash elfcorehdr, which describes the CPUs and memory in the system, must also be updated. A new elfcorehdr is generated from the available CPUs and memory and replaces the existing elfcorehdr. The segment containing the elfcorehdr is identified at run-time in crash_core:crash_handle_hotplug_event(). No modifications to purgatory (see 'kexec: exclude elfcorehdr from the segment digest') or boot_params (as the elfcorehdr= capture kernel command line parameter pointer remains unchanged and correct) are needed, just elfcorehdr. For kexec_file_load(), the elfcorehdr segment size is based on NR_CPUS and CRASH_MAX_MEMORY_RANGES in order to accommodate a growing number of CPU and memory resources. For kexec_load(), the userspace kexec utility needs to size the elfcorehdr segment in the same/similar manner. To accommodate kexec_load() syscall in the absence of kexec_file_load() syscall support, prepare_elf_headers() and dependents are moved outside of CONFIG_KEXEC_FILE. 
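
As a rough illustration of the sizing rule in the commit message, the worst-case elfcorehdr size can be estimated from the program header count that crash_prepare_elf64_headers() uses (one PT_NOTE per CPU, one entry per memory range, plus the vmcoreinfo note and the kernel text mapping). This is a userspace sketch with a hypothetical helper name; it assumes <elf.h> from a Linux userspace and leaves out the ELF_CORE_HEADER_ALIGN rounding.

```c
#include <elf.h>
#include <stddef.h>

/*
 * Worst-case elfcorehdr size before alignment: one ELF header, one
 * program header per CPU, one per memory range, plus two extra
 * program headers (vmcoreinfo note and kernel text mapping).
 */
static size_t elfcorehdr_max_size(unsigned long max_cpus,
				  unsigned long max_mem_ranges)
{
	unsigned long nr_phdr = 2 + max_cpus + max_mem_ranges;

	return sizeof(Elf64_Ehdr) + nr_phdr * sizeof(Elf64_Phdr);
}
```

With 8192 CPUs and the CRASH_MAX_MEMORY_RANGES default of 8192, this gives 64 + 16386 * 56 = 917680 bytes, so a userspace kexec utility following the same rule would reserve a little under 1 MiB.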
Signed-off-by: Eric DeVolder
Reviewed-by: Sourabh Jain
Acked-by: Hari Bathini
Acked-by: Baoquan He
---
 arch/x86/Kconfig             |   3
 arch/x86/include/asm/kexec.h |  15
 arch/x86/kernel/crash.c      | 103
 3 files changed, 114 insertions(+), 7 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index fedc6743..d9fc80b9ef84 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2069,6 +2069,9 @@ config ARCH_SUPPORTS_KEXEC_JUMP
 config ARCH_SUPPORTS_CRASH_DUMP
         def_bool X86_64 || (X86_32 && HIGHMEM)

+config ARCH_SUPPORTS_CRASH_HOTPLUG
+        def_bool y
+
 config PHYSICAL_START
         hex "Physical address where the kernel is loaded" if (EXPERT || CRASH_DUMP)
         default "0x1000000"
diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
index 5b77bbc28f96..9143100ea3ea 100644
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -209,6 +209,21 @@ typedef void crash_vmclear_fn(void);
 extern crash_vmclear_fn __rcu *crash_vmclear_loaded_vmcss;
 extern void kdump_nmi_shootdown_cpus(void);

+#ifdef CONFIG_CRASH_HOTPLUG
+void arch_crash_handle_hotplug_event(struct kimage *image);
+#define arch_crash_handle_hotplug_event arch_crash_handle_hotplug_event
+
+#ifdef CONFIG_HOTPLUG_CPU
+static inline int crash_hotplug_cpu_support(void) { return 1; }
+#define crash_hotplug_cpu_support crash_hotplug_cpu_support
+#endif
+
+#ifdef CONFIG_MEMORY_HOTPLUG
+static inline int crash_hotplug_memory_support(void) { return 1; }
+#define crash_hotplug_memory_support crash_hotplug_memory_support
+#endif
+#endif
+
 #endif /* __ASSEMBLY__ */

 #endif /* _ASM_X86_KEXEC_H */
diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index cdd92ab43cda..c70a111c44fa 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -158,8 +158,6 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
         crash_save_cpu(regs, safe_smp_processor_id());
 }

-#ifdef CONFIG_KEXEC_FILE
-
 static int get_nr_ram_ranges_callback(struct resource *res, void *arg)
 {
         unsigned int *nr_ranges = arg;
@@ -231,7 +229,7 @@ static int prepare_elf64_ram_headers_callback(struct resource *res, void *arg)

 /* Prepare elf headers. Return addr and size */
 static int prepare_elf_headers(struct kimage *image, void **addr,
-                                        unsigned long *sz)
+                                        unsigned long *sz, unsigned long *nr_mem_ranges)
 {
         struct crash_mem *cmem;
         int ret;
@@ -249,6 +247,9 @@ static int prepare_elf_headers(struct kimage *image, void **addr,
         if (ret)
                 goto out;

+        /* Return the computed number of memory ranges, for hotplug usage */
+        *nr_mem_ranges = cmem->nr_ranges;
+
         /* By default prepare 64bit headers */
         ret = crash_prepare_elf64_headers(cmem, IS_ENABLED(CONFIG_X86_64), addr, sz);

@@ -257,6 +258,7 @@ static int prepare_elf_headers(struct kimage *image, void **addr,
         return ret;
 }

+#ifdef CONFIG_KEXEC_FILE
 static int add_e820_entry(struct boot_params *params, struct e820_entry *entry)
 {
         unsigned int nr_e820_entries;
@@ -371,18 +373,42 @@ int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params)
 int crash_load_segments(struct kimage *image)
 {
         int ret;
+        unsigned long pnum = 0;
         struct kexec_buf kbuf = { .image = image, .buf_min = 0,
                                   .buf_max = ULONG_MAX, .top_down = false };

         /* Prepare elf headers and add a segment */
-        ret = prepare_elf_headers(image, &kbuf.buffer, &kbuf.bufsz);
+        ret = prepare_elf_headers(image, &kbuf.buffer, &kbuf.bufsz, &pnum);
         if (ret)
                 return ret;

-        image->elf_headers = kbuf.buffer;
-        image->elf_headers_sz = kbuf.bufsz;
+        image->elf_headers = kbuf.buffer;
+        image->elf_he
[PATCH v26 4/8] crash: memory and CPU hotplug sysfs attributes
Introduce the crash_hotplug attribute for memory and CPUs for
use by userspace. These attributes directly facilitate the udev
rule for managing userspace re-loading of the crash kernel upon
hot un/plug changes.

For memory, expose the crash_hotplug attribute to the
/sys/devices/system/memory directory. For example:

 # udevadm info --attribute-walk /sys/devices/system/memory/memory81
  looking at device '/devices/system/memory/memory81':
    KERNEL=="memory81"
    SUBSYSTEM=="memory"
    DRIVER==""
    ATTR{online}=="1"
    ATTR{phys_device}=="0"
    ATTR{phys_index}=="0051"
    ATTR{removable}=="1"
    ATTR{state}=="online"
    ATTR{valid_zones}=="Movable"

  looking at parent device '/devices/system/memory':
    KERNELS=="memory"
    SUBSYSTEMS==""
    DRIVERS==""
    ATTRS{auto_online_blocks}=="offline"
    ATTRS{block_size_bytes}=="800"
    ATTRS{crash_hotplug}=="1"

For CPUs, expose the crash_hotplug attribute to the
/sys/devices/system/cpu directory. For example:

 # udevadm info --attribute-walk /sys/devices/system/cpu/cpu0
  looking at device '/devices/system/cpu/cpu0':
    KERNEL=="cpu0"
    SUBSYSTEM=="cpu"
    DRIVER=="processor"
    ATTR{crash_notes}=="277c38600"
    ATTR{crash_notes_size}=="368"
    ATTR{online}=="1"

  looking at parent device '/devices/system/cpu':
    KERNELS=="cpu"
    SUBSYSTEMS==""
    DRIVERS==""
    ATTRS{crash_hotplug}=="1"
    ATTRS{isolated}==""
    ATTRS{kernel_max}=="8191"
    ATTRS{nohz_full}==" (null)"
    ATTRS{offline}=="4-7"
    ATTRS{online}=="0-3"
    ATTRS{possible}=="0-7"
    ATTRS{present}=="0-3"

With these sysfs attributes in place, it is possible to efficiently
instruct the udev rule to skip crash kernel reloading for kernels
configured with crash hotplug support.
For example, the following is the proposed udev rule change for
RHEL system 98-kexec.rules (as the first lines of the rule file):

 # The kernel updates the crash elfcorehdr for CPU and memory changes
 SUBSYSTEM=="cpu", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"
 SUBSYSTEM=="memory", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"

When examined in the context of 98-kexec.rules, the above rules
test if crash_hotplug is set, and if so, the userspace initiated
unload-then-reload of the crash kernel is skipped.

CPU and memory checks are separated in accordance with
CONFIG_HOTPLUG_CPU and CONFIG_MEMORY_HOTPLUG kernel config options.
If an architecture supports, for example, memory hotplug but not
CPU hotplug, then the /sys/devices/system/memory/crash_hotplug
attribute file is present, but the /sys/devices/system/cpu/crash_hotplug
attribute file will NOT be present. Thus the udev rule skips
userspace processing of memory hot un/plug events, but the udev
rule will evaluate false for CPU events, thus allowing userspace to
process CPU hot un/plug events (i.e. the unload-then-reload of the
kdump capture kernel).

Signed-off-by: Eric DeVolder
Reviewed-by: Sourabh Jain
Acked-by: Hari Bathini
Acked-by: Baoquan He
---
 Documentation/ABI/testing/sysfs-devices-memory |  8
 .../ABI/testing/sysfs-devices-system-cpu       |  8
 .../admin-guide/mm/memory-hotplug.rst          |  8
 Documentation/core-api/cpu_hotplug.rst         | 18
 drivers/base/cpu.c                             | 13
 drivers/base/memory.c                          | 13
 include/linux/kexec.h                          |  8
 7 files changed, 76 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-devices-memory b/Documentation/ABI/testing/sysfs-devices-memory
index d8b0f80b9e33..a95e0f17c35a 100644
--- a/Documentation/ABI/testing/sysfs-devices-memory
+++ b/Documentation/ABI/testing/sysfs-devices-memory
@@ -110,3 +110,11 @@ Description:
                 link is created for memory section 9 on node0.

                 /sys/devices/system/node/node0/memory9 -> ../../memory/memory9
+
+What:           /sys/devices/system/memory/crash_hotplug
+Date:           Aug 2023
+Contact:        Linux kernel mailing list
+Description:
+                (RO) indicates whether or not the kernel directly supports
+                modifying the crash elfcorehdr for memory hot un/plug and/or
+                on/offline changes.
diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
index ecd585ca2d50..31189da7ef57 100644
--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
+++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -686,3 +686,11 @@ Description:
                 (RO) the list of CPUs that are isolated
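
The crash_hotplug attribute introduced by this patch is a plain read-only sysfs file, so the same test the udev rule performs can be made from any userspace program. The following is a minimal sketch only: crash_hotplug_supported() is a hypothetical helper name, not something provided by the kernel, udev or kexec-tools, and it assumes the file holds '1' or '0' as described in the patch. A missing file is treated as "not supported", matching the case where the attribute is not present.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper (not part of the kernel, udev or kexec-tools).
 * Returns 1 if the attribute file at attr_path starts with '1',
 * and 0 if it starts with anything else or cannot be opened.
 */
int crash_hotplug_supported(const char *attr_path)
{
    char buf[8] = { 0 };
    FILE *fp;

    assert(attr_path != NULL);

    fp = fopen(attr_path, "r");
    if (!fp)
        return 0;               /* attribute absent: userspace must reload */

    if (!fgets(buf, sizeof(buf), fp))
        buf[0] = '\0';          /* empty or unreadable file */
    fclose(fp);

    return buf[0] == '1';
}
```

A caller would pass, for example, "/sys/devices/system/cpu/crash_hotplug" and skip the unload-then-reload of the kdump image when the function returns 1.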
Re: [PATCH v25 01/10] drivers/base: refactor cpu.c to use .is_visible()
On 7/21/23 11:32, Eric DeVolder wrote:
> On 7/3/23 11:53, Eric DeVolder wrote:
>> On 7/3/23 08:05, Greg KH wrote:
>>> On Thu, Jun 29, 2023 at 03:21:10PM -0400, Eric DeVolder wrote:
>>>> - the function body of the callback functions are now wrapped with
>>>>   IS_ENABLED(); as the callback function must exist now that the
>>>>   attribute is always compiled-in (though not necessarily visible).
>>>
>>> Why do you need to do this last thing? Is it a code savings goal? Or
>>> something else? The file will not be present in the system if the
>>> option is not enabled, so it should be safe to not do this unless you
>>> feel it's necessary for some reason?
>>
>> To accommodate the request, all DEVICE_ATTR() must be unconditionally
>> present in this file. The DEVICE_ATTR() requires the .show() callback.
>> As the callback is referenced from a data structure, the callback has
>> to be present for link. All the callbacks for these attributes are in
>> this file.
>>
>> I have two basic choices for gutting the function body if the config
>> feature is not enabled. I can either use #ifdef or IS_ENABLED(). Thomas
>> has made it clear I need to use IS_ENABLED(). I can certainly use
>> #ifdef (which is what I did in v24).
>>
>>> Not doing this would make the diff easier to read :)
>>
>> I agree this is messy. I'm not really sure what this request/effort
>> achieves as these attributes are not strongly related (unlike
>> cacheinfo) and the way the file was before results in less code.
>>
>> At any rate, please indicate if you'd rather I use #ifdef.
>>
>> Thanks for your time!
>> eric
>>
>>> thanks,
>>>
>>> greg k-h
>
> Hi Greg,
> I was wondering if you might weigh in so that I can proceed.
>
> I think there are three options on the table:
> - use #ifdef to comment out these function bodies, which keeps the
>   diff much more readable
> - use IS_ENABLED() as Thomas has requested I do, but makes the diff
>   more difficult to read
> - remove this refactor altogether, perhaps postponing until after
>   this crash hotplug series merges, as this refactor is largely
>   unrelated to crash hotplug.
>
> Thank you for your time on this topic!
> eric

Hi Greg,
If you have an opinion on how to proceed, please provide.
Thanks,
eric

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH v25 01/10] drivers/base: refactor cpu.c to use .is_visible()
On 7/3/23 11:53, Eric DeVolder wrote:
> On 7/3/23 08:05, Greg KH wrote:
>> On Thu, Jun 29, 2023 at 03:21:10PM -0400, Eric DeVolder wrote:
>>> - the function body of the callback functions are now wrapped with
>>>   IS_ENABLED(); as the callback function must exist now that the
>>>   attribute is always compiled-in (though not necessarily visible).
>>
>> Why do you need to do this last thing? Is it a code savings goal? Or
>> something else? The file will not be present in the system if the
>> option is not enabled, so it should be safe to not do this unless you
>> feel it's necessary for some reason?
>
> To accommodate the request, all DEVICE_ATTR() must be unconditionally
> present in this file. The DEVICE_ATTR() requires the .show() callback.
> As the callback is referenced from a data structure, the callback has
> to be present for link. All the callbacks for these attributes are in
> this file.
>
> I have two basic choices for gutting the function body if the config
> feature is not enabled. I can either use #ifdef or IS_ENABLED(). Thomas
> has made it clear I need to use IS_ENABLED(). I can certainly use
> #ifdef (which is what I did in v24).
>
>> Not doing this would make the diff easier to read :)
>
> I agree this is messy. I'm not really sure what this request/effort
> achieves as these attributes are not strongly related (unlike
> cacheinfo) and the way the file was before results in less code.
>
> At any rate, please indicate if you'd rather I use #ifdef.
>
> Thanks for your time!
> eric
>
>> thanks,
>>
>> greg k-h

Hi Greg,
I was wondering if you might weigh in so that I can proceed.

I think there are three options on the table:
- use #ifdef to comment out these function bodies, which keeps the
  diff much more readable
- use IS_ENABLED() as Thomas has requested I do, but makes the diff
  more difficult to read
- remove this refactor altogether, perhaps postponing until after
  this crash hotplug series merges, as this refactor is largely
  unrelated to crash hotplug.

Thank you for your time on this topic!
eric
Re: [PATCH v25 06/10] crash: memory and CPU hotplug sysfs attributes
On 7/3/23 08:07, Greg KH wrote:
> On Thu, Jun 29, 2023 at 03:21:15PM -0400, Eric DeVolder wrote:
>> +What:           /sys/devices/system/cpu/crash_hotplug
>> +Date:           Jun 2023
>
> It's not "Jun" anymore :(
>
>> +Contact:        Linux kernel mailing list
>
> Why are you not going to maintain this? Why is this up to me?
>
> thanks,
>
> greg k-h

My apologies, I'll correct both in the next posting.
Thanks!
eric
Re: [PATCH v25 01/10] drivers/base: refactor cpu.c to use .is_visible()
On 7/3/23 08:05, Greg KH wrote:
> On Thu, Jun 29, 2023 at 03:21:10PM -0400, Eric DeVolder wrote:
>> - the function body of the callback functions are now wrapped with
>>   IS_ENABLED(); as the callback function must exist now that the
>>   attribute is always compiled-in (though not necessarily visible).
>
> Why do you need to do this last thing? Is it a code savings goal? Or
> something else? The file will not be present in the system if the
> option is not enabled, so it should be safe to not do this unless you
> feel it's necessary for some reason?

To accommodate the request, all DEVICE_ATTR() must be unconditionally
present in this file. The DEVICE_ATTR() requires the .show() callback.
As the callback is referenced from a data structure, the callback has
to be present for link. All the callbacks for these attributes are in
this file.

I have two basic choices for gutting the function body if the config
feature is not enabled. I can either use #ifdef or IS_ENABLED(). Thomas
has made it clear I need to use IS_ENABLED(). I can certainly use
#ifdef (which is what I did in v24).

> Not doing this would make the diff easier to read :)

I agree this is messy. I'm not really sure what this request/effort
achieves as these attributes are not strongly related (unlike
cacheinfo) and the way the file was before results in less code.

At any rate, please indicate if you'd rather I use #ifdef.

Thanks for your time!
eric

> thanks,
>
> greg k-h
Re: [PATCH v25 06/10] crash: memory and CPU hotplug sysfs attributes
Randy,
Thanks for looking at this! Inline comments below.
eric

On 6/29/23 15:59, Randy Dunlap wrote:
> Hi--
>
> On 6/29/23 12:21, Eric DeVolder wrote:
>> Signed-off-by: Eric DeVolder
>> Reviewed-by: Sourabh Jain
>> Acked-by: Hari Bathini
>> Acked-by: Baoquan He
>> ---
>>  Documentation/ABI/testing/sysfs-devices-memory |  8
>>  .../ABI/testing/sysfs-devices-system-cpu       |  8
>>  .../admin-guide/mm/memory-hotplug.rst          |  8
>>  Documentation/core-api/cpu_hotplug.rst         | 18
>>  drivers/base/cpu.c                             | 16
>>  drivers/base/memory.c                          | 13
>>  include/linux/kexec.h                          |  8
>>  7 files changed, 77 insertions(+), 2 deletions(-)
>>
>> diff --git a/Documentation/admin-guide/mm/memory-hotplug.rst b/Documentation/admin-guide/mm/memory-hotplug.rst
>> index 1b02fe5807cc..eb99d79223a3 100644
>> --- a/Documentation/admin-guide/mm/memory-hotplug.rst
>> +++ b/Documentation/admin-guide/mm/memory-hotplug.rst
>> @@ -291,6 +291,14 @@ The following files are currently defined:
>>                         Availability depends on the CONFIG_ARCH_MEMORY_PROBE
>>                         kernel configuration option.
>>  ``uevent``             read-write: generic udev file for device subsystems.
>> +``crash_hotplug``      read-only: when changes to the system memory map
>> +                       occur due to hot un/plug of memory, this file contains
>> +                       '1' if the kernel updates the kdump capture kernel memory
>> +                       map itself (via elfcorehdr), or '0' if userspace must update
>> +                       the kdump capture kernel memory map.
>> +
>> +                       Availability depends on the CONFIG_MEMORY_HOTPLUG kernel
>> +                       configuration option.
>>  == =
>
> Did you test build the documentation?
> It looks to me like the end-of-table '=' signs line needs 3 more ===
> to be long enough for the text above it.

Hmm, the 'make htmldocs' renders and views ok. Is there perhaps another
method I should use?

>>  .. note::
>> diff --git a/Documentation/core-api/cpu_hotplug.rst b/Documentation/core-api/cpu_hotplug.rst
>> index e6f5bc39cf5c..54581c501562 100644
>> --- a/Documentation/core-api/cpu_hotplug.rst
>> +++ b/Documentation/core-api/cpu_hotplug.rst
>> @@ -741,6 +741,24 @@ will receive all events. A script like::
>>
>>  can process the event further.
>>
>> +When changes to the CPUs in the system occur, the sysfs file
>> +/sys/devices/system/cpu/crash_hotplug contains '1' if the kernel
>> +updates the kdump capture kernel list of CPUs itself (via elfcorehdr),
>> +or '0' if userspace must update the kdump capture kernel list of CPUs.
>> +
>> +The availability depends on the CONFIG_HOTPLUG_CPU kernel configuration
>> +option.
>> +
>> +To skip userspace processing of CPU hot un/plug events for kdump
>> +(ie the unload-then-reload to obtain a current list of CPUs), this sysfs
>
> i.e.

got it, thanks.

>> +file can be used in a udev rule as follows:
>> +
>> + SUBSYSTEM=="cpu", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"
>> +
>> +For a cpu hot un/plug event, if the architecture supports kernel updates
>
> CPU for consistency

got it, thanks.

>> +of the elfcorehdr (which contains the list of CPUs), then the rule skips
>> +the unload-then-reload of the kdump capture kernel.
>> +
>>  Kernel Inline Documentations Reference
>>  ==
>
> Thanks.
[PATCH v25 08/10] crash: hotplug support for kexec_load()
The hotplug support for kexec_load() requires changes to the
userspace kexec-tools and a little extra help from the kernel.

Given a kdump capture kernel loaded via kexec_load(), and a
subsequent hotplug event, the crash hotplug handler finds the
elfcorehdr and rewrites it to reflect the hotplug change.
That is the desired outcome; however, at kernel panic time,
the purgatory integrity check fails (because the elfcorehdr
changed), so the capture kernel does not boot and no vmcore
is generated.

Therefore, the userspace kexec-tools/kexec must indicate to the
kernel that the elfcorehdr can be modified (because the kexec
excluded the elfcorehdr from the digest, and sized the elfcorehdr
memory buffer appropriately).

To facilitate hotplug support with kexec_load():
 - a new kexec flag KEXEC_UPDATE_ELFCOREHDR indicates that it is
   safe for the kernel to modify the kexec_load()'d elfcorehdr
 - the /sys/kernel/crash_elfcorehdr_size node communicates the
   preferred size of the elfcorehdr memory buffer
 - the sysfs crash_hotplug nodes (i.e.
   /sys/devices/system/[cpu|memory]/crash_hotplug) dynamically
   take into account kexec_file_load() vs kexec_load() and
   KEXEC_UPDATE_ELFCOREHDR.
   This is critical so that the udev rule processing of crash_hotplug
   is all that is needed to determine if the userspace unload-then-load
   of the kdump image is to be skipped, or not.
The proposed udev rule change looks like:

 # The kernel updates the crash elfcorehdr for CPU and memory changes
 SUBSYSTEM=="cpu", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"
 SUBSYSTEM=="memory", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"

The table below indicates the behavior of kexec_load()'d kdump image
updates (with the new udev crash_hotplug rule in place):

  Kernel | Kexec
         | Old | New
  -------+-----+-----
  Old    |  a  |  a
  -------+-----+-----
  New    |  a  |  b
  -------+-----+-----

where kexec 'old' and 'new' indicate whether kexec-tools has the
needed modifications for the crash hotplug feature, and kernel 'old'
and 'new' indicate whether the kernel supports this crash hotplug
feature.

Behavior 'a' indicates the unload-then-reload of the entire kdump
image. For the kexec 'old' column, the unload-then-reload occurs due
to the missing flag KEXEC_UPDATE_ELFCOREHDR. An 'old' kernel (with
'new' kexec) does not present the crash_hotplug sysfs node, which
leads to the unload-then-reload of the kdump image.

Behavior 'b' indicates the desired optimized behavior of the kernel
directly modifying the elfcorehdr and avoiding the unload-then-reload
of the kdump image.

If the udev rule is not updated with the crash_hotplug node check, then
regardless of whether the kernel or kexec is new or old, the kdump
image continues to be unloaded-then-reloaded on hotplug changes.

To fully support the crash hotplug feature, there needs to be a rollout
of kernel, kexec-tools and udev rule changes. However, the order of
the rollout of these pieces does not matter; kexec_load()'d kdump
images still function for hotplug as-is.
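
The rollout table above can be restated as a small decision function. This is only an illustrative sketch of the behavior the commit message describes; struct rollout_state and kdump_hotplug_behavior() are hypothetical names that exist nowhere in the kernel or kexec-tools.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of which pieces have been rolled out. */
struct rollout_state {
    bool kernel_new;         /* kernel supports crash hotplug */
    bool kexec_new;          /* kexec-tools passes KEXEC_UPDATE_ELFCOREHDR */
    bool udev_rule_updated;  /* udev rule checks the crash_hotplug node */
};

/*
 * Returns 'a' for the unload-then-reload of the whole kdump image,
 * or 'b' when the kernel updates the elfcorehdr in place.
 */
char kdump_hotplug_behavior(const struct rollout_state *s)
{
    assert(s != NULL);

    /* Without the updated udev rule, userspace always reloads. */
    if (!s->udev_rule_updated)
        return 'a';

    /* Only new kernel plus new kexec-tools reaches behavior 'b'. */
    return (s->kernel_new && s->kexec_new) ? 'b' : 'a';
}
```

Only the combination of all three pieces yields 'b'; every other combination falls back to 'a', which is why the order of the rollout does not matter.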
Suggested-by: Hari Bathini Signed-off-by: Eric DeVolder Acked-by: Hari Bathini Acked-by: Baoquan He --- arch/x86/include/asm/kexec.h | 11 +++ arch/x86/kernel/crash.c | 27 +++ include/linux/kexec.h| 14 -- include/uapi/linux/kexec.h | 1 + kernel/Kconfig.kexec | 4 kernel/crash_core.c | 31 +++ kernel/kexec.c | 5 + kernel/ksysfs.c | 15 +++ 8 files changed, 102 insertions(+), 6 deletions(-) diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h index 9143100ea3ea..3be6a98751f0 100644 --- a/arch/x86/include/asm/kexec.h +++ b/arch/x86/include/asm/kexec.h @@ -214,14 +214,17 @@ void arch_crash_handle_hotplug_event(struct kimage *image); #define arch_crash_handle_hotplug_event arch_crash_handle_hotplug_event #ifdef CONFIG_HOTPLUG_CPU -static inline int crash_hotplug_cpu_support(void) { return 1; } -#define crash_hotplug_cpu_support crash_hotplug_cpu_support +int arch_crash_hotplug_cpu_support(void); +#define crash_hotplug_cpu_support arch_crash_hotplug_cpu_support #endif #ifdef CONFIG_MEMORY_HOTPLUG -static inline int crash_hotplug_memory_support(void) { return 1; } -#define crash_hotplug_memory_support crash_hotplug_memory_support +int arch_crash_hotplug_memory_support(void); +#define crash_hotplug_memory_support arch_crash_hotplug_memory_support #endif + +unsigned int arch_crash_get_elfcorehdr_size(void); +#define crash_get_elfcorehdr_size arch_crash_get_elfcorehdr_size #endif #endif /* __ASSEMBLY__ */ diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c index c70a111c44fa..caf22bcb61af 100644 --- a/arch/x86/kernel/crash.c +++ b/arch/x86/kernel/crash.c @@ -427,6 +427,33 @@ int crash_load_segments(struct kimage *image) #undef pr_fmt #define pr_fmt(fmt) "cras
[PATCH v25 06/10] crash: memory and CPU hotplug sysfs attributes
Introduce the crash_hotplug attribute for memory and CPUs for
use by userspace. These attributes directly facilitate the udev
rule for managing userspace re-loading of the crash kernel upon
hot un/plug changes.

For memory, expose the crash_hotplug attribute to the
/sys/devices/system/memory directory. For example:

 # udevadm info --attribute-walk /sys/devices/system/memory/memory81
  looking at device '/devices/system/memory/memory81':
    KERNEL=="memory81"
    SUBSYSTEM=="memory"
    DRIVER==""
    ATTR{online}=="1"
    ATTR{phys_device}=="0"
    ATTR{phys_index}=="0051"
    ATTR{removable}=="1"
    ATTR{state}=="online"
    ATTR{valid_zones}=="Movable"

  looking at parent device '/devices/system/memory':
    KERNELS=="memory"
    SUBSYSTEMS==""
    DRIVERS==""
    ATTRS{auto_online_blocks}=="offline"
    ATTRS{block_size_bytes}=="800"
    ATTRS{crash_hotplug}=="1"

For CPUs, expose the crash_hotplug attribute to the
/sys/devices/system/cpu directory. For example:

 # udevadm info --attribute-walk /sys/devices/system/cpu/cpu0
  looking at device '/devices/system/cpu/cpu0':
    KERNEL=="cpu0"
    SUBSYSTEM=="cpu"
    DRIVER=="processor"
    ATTR{crash_notes}=="277c38600"
    ATTR{crash_notes_size}=="368"
    ATTR{online}=="1"

  looking at parent device '/devices/system/cpu':
    KERNELS=="cpu"
    SUBSYSTEMS==""
    DRIVERS==""
    ATTRS{crash_hotplug}=="1"
    ATTRS{isolated}==""
    ATTRS{kernel_max}=="8191"
    ATTRS{nohz_full}==" (null)"
    ATTRS{offline}=="4-7"
    ATTRS{online}=="0-3"
    ATTRS{possible}=="0-7"
    ATTRS{present}=="0-3"

With these sysfs attributes in place, it is possible to efficiently
instruct the udev rule to skip crash kernel reloading for kernels
configured with crash hotplug support.
For example, the following is the proposed udev rule change for
RHEL system 98-kexec.rules (as the first lines of the rule file):

 # The kernel updates the crash elfcorehdr for CPU and memory changes
 SUBSYSTEM=="cpu", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"
 SUBSYSTEM=="memory", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"

When examined in the context of 98-kexec.rules, the above rules
test if crash_hotplug is set, and if so, the userspace initiated
unload-then-reload of the crash kernel is skipped.

CPU and memory checks are separated in accordance with
CONFIG_HOTPLUG_CPU and CONFIG_MEMORY_HOTPLUG kernel config options.
If an architecture supports, for example, memory hotplug but not
CPU hotplug, then the /sys/devices/system/memory/crash_hotplug
attribute file is present, but the /sys/devices/system/cpu/crash_hotplug
attribute file will NOT be present. Thus the udev rule skips
userspace processing of memory hot un/plug events, but the udev
rule will evaluate false for CPU events, thus allowing userspace to
process CPU hot un/plug events (i.e. the unload-then-reload of the
kdump capture kernel).

Signed-off-by: Eric DeVolder
Reviewed-by: Sourabh Jain
Acked-by: Hari Bathini
Acked-by: Baoquan He
---
 Documentation/ABI/testing/sysfs-devices-memory |  8
 .../ABI/testing/sysfs-devices-system-cpu       |  8
 .../admin-guide/mm/memory-hotplug.rst          |  8
 Documentation/core-api/cpu_hotplug.rst         | 18
 drivers/base/cpu.c                             | 16
 drivers/base/memory.c                          | 13
 include/linux/kexec.h                          |  8
 7 files changed, 77 insertions(+), 2 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-devices-memory b/Documentation/ABI/testing/sysfs-devices-memory
index d8b0f80b9e33..c50725ebebb7 100644
--- a/Documentation/ABI/testing/sysfs-devices-memory
+++ b/Documentation/ABI/testing/sysfs-devices-memory
@@ -110,3 +110,11 @@ Description:
                 link is created for memory section 9 on node0.

                 /sys/devices/system/node/node0/memory9 -> ../../memory/memory9
+
+What:           /sys/devices/system/cpu/crash_hotplug
+Date:           Jun 2023
+Contact:        Linux kernel mailing list
+Description:
+                (RO) indicates whether or not the kernel directly supports
+                modifying the crash elfcorehdr for memory hot un/plug and/or
+                on/offline changes.
diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
index ecd585ca2d50..598b0fa67481 100644
--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
+++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -686,3 +686,11 @@ Description:
                 (RO) the list of C
[PATCH v25 01/10] drivers/base: refactor cpu.c to use .is_visible()
Greg Kroah-Hartman requested that this file use the .is_visible()
method instead of #ifdefs for the attributes in cpu.c.

 static struct attribute *cpu_root_attrs[] = {
 #ifdef CONFIG_ARCH_CPU_PROBE_RELEASE
        &dev_attr_probe.attr,
        &dev_attr_release.attr,
 #endif
        &cpu_attrs[0].attr.attr,
        &cpu_attrs[1].attr.attr,
        &cpu_attrs[2].attr.attr,
        &dev_attr_kernel_max.attr,
        &dev_attr_offline.attr,
        &dev_attr_isolated.attr,
 #ifdef CONFIG_NO_HZ_FULL
        &dev_attr_nohz_full.attr,
 #endif
 #ifdef CONFIG_GENERIC_CPU_AUTOPROBE
        &dev_attr_modalias.attr,
 #endif
        NULL
 };

To that end:
 - the .is_visible() method is implemented, and IS_ENABLED(), rather
   than #ifdef, is used to determine the visibility of the attribute.
 - the DEVICE_ATTR() attributes are moved outside of #ifdefs, so that
   those structs are always present for the cpu_root_attrs[].
 - the function bodies of the callback functions are now wrapped with
   IS_ENABLED(), as the callback functions must exist now that the
   attributes are always compiled-in (though not necessarily visible).

No functionality change intended.
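
The pattern described above can be illustrated outside the kernel with a small userspace mock. Everything here is a stand-in chosen for illustration: struct mock_attr, FEATURE_NOHZ_FULL, mock_is_visible() and count_visible() are hypothetical and are not the kernel's struct attribute, IS_ENABLED() or the actual cpu.c code. The sketch only shows the idea that the attribute object always exists, while a visibility callback decides whether it is exposed.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for IS_ENABLED(CONFIG_NO_HZ_FULL); 0 means "feature disabled". */
#define FEATURE_NOHZ_FULL 0

/* Userspace stand-in for an attribute; NOT the kernel's struct attribute. */
struct mock_attr {
    const char *name;
    unsigned short mode;
};

/* Both attributes are always compiled in, with no #ifdef around them. */
static struct mock_attr attr_kernel_max = { "kernel_max", 0444 };
static struct mock_attr attr_nohz_full  = { "nohz_full", 0444 };

static struct mock_attr *root_attrs[] = {
    &attr_kernel_max,
    &attr_nohz_full,
    NULL
};

/* Returns the mode to expose the attribute with, or 0 to hide it. */
unsigned short mock_is_visible(const struct mock_attr *a)
{
    assert(a != NULL);

    if (a == &attr_nohz_full && !FEATURE_NOHZ_FULL)
        return 0;
    return a->mode;
}

/* Counts the attributes that would actually appear. */
size_t count_visible(void)
{
    size_t n = 0;

    for (size_t i = 0; root_attrs[i] != NULL; i++)
        if (mock_is_visible(root_attrs[i]) != 0)
            n++;
    return n;
}
```

With the feature constant set to 0, only kernel_max is reported as visible, even though nohz_full is still present in the array.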
Signed-off-by: Eric DeVolder --- drivers/base/cpu.c | 125 +++ include/linux/tick.h | 2 +- 2 files changed, 81 insertions(+), 46 deletions(-) diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c index c1815b9dae68..2455cbcebc87 100644 --- a/drivers/base/cpu.c +++ b/drivers/base/cpu.c @@ -82,24 +82,27 @@ void unregister_cpu(struct cpu *cpu) per_cpu(cpu_sys_devices, logical_cpu) = NULL; return; } +#endif /* CONFIG_HOTPLUG_CPU */ -#ifdef CONFIG_ARCH_CPU_PROBE_RELEASE static ssize_t cpu_probe_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count) { - ssize_t cnt; - int ret; + if (IS_ENABLED(CONFIG_ARCH_CPU_PROBE_RELEASE)) { + ssize_t cnt; + int ret; - ret = lock_device_hotplug_sysfs(); - if (ret) - return ret; + ret = lock_device_hotplug_sysfs(); + if (ret) + return ret; - cnt = arch_cpu_probe(buf, count); + cnt = arch_cpu_probe(buf, count); - unlock_device_hotplug(); - return cnt; + unlock_device_hotplug(); + return cnt; + } + return 0; } static ssize_t cpu_release_store(struct device *dev, @@ -107,23 +110,24 @@ static ssize_t cpu_release_store(struct device *dev, const char *buf, size_t count) { - ssize_t cnt; - int ret; + if (IS_ENABLED(CONFIG_ARCH_CPU_PROBE_RELEASE)) { + ssize_t cnt; + int ret; - ret = lock_device_hotplug_sysfs(); - if (ret) - return ret; + ret = lock_device_hotplug_sysfs(); + if (ret) + return ret; - cnt = arch_cpu_release(buf, count); + cnt = arch_cpu_release(buf, count); - unlock_device_hotplug(); - return cnt; + unlock_device_hotplug(); + return cnt; + } + return 0; } static DEVICE_ATTR(probe, S_IWUSR, NULL, cpu_probe_store); static DEVICE_ATTR(release, S_IWUSR, NULL, cpu_release_store); -#endif /* CONFIG_ARCH_CPU_PROBE_RELEASE */ -#endif /* CONFIG_HOTPLUG_CPU */ #ifdef CONFIG_KEXEC #include @@ -273,14 +277,14 @@ static ssize_t print_cpus_isolated(struct device *dev, } static DEVICE_ATTR(isolated, 0444, print_cpus_isolated, NULL); -#ifdef CONFIG_NO_HZ_FULL static ssize_t print_cpus_nohz_full(struct device *dev, 
struct device_attribute *attr, char *buf) { - return sysfs_emit(buf, "%*pbl\n", cpumask_pr_args(tick_nohz_full_mask)); + if (IS_ENABLED(CONFIG_NO_HZ_FULL)) + return sysfs_emit(buf, "%*pbl\n", cpumask_pr_args(tick_nohz_full_mask)); + return 0; } static DEVICE_ATTR(nohz_full, 0444, print_cpus_nohz_full, NULL); -#endif static void cpu_device_release(struct device *dev) { @@ -301,30 +305,32 @@ static void cpu_device_release(struct device *dev) */ } -#ifdef CONFIG_GENERIC_CPU_AUTOPROBE static ssize_t print_cpu_modalias(struct device *dev, struct device_attribute *attr, char *buf) { int len = 0; - u32 i; - - len += sysfs_emit_at(buf, len, -"cpu:type:" CPU_FEATURE_TYPEFMT ":feature:", -CPU_FEATURE_TYPEVAL); - - for (i = 0; i < MAX_CPU_FEATURES; i++) - if (cpu_have_feature(i)) { - if (len + sizeof(",\n") >= PAGE_SIZE) { - WARN(1, "CPU features overflow page\n"); -
[PATCH v25 07/10] x86/crash: add x86 crash hotplug support
When CPU or memory is hot un/plugged, or off/onlined, the crash
elfcorehdr, which describes the CPUs and memory in the system,
must also be updated.

A new elfcorehdr is generated from the available CPUs and memory
and replaces the existing elfcorehdr. The segment containing the
elfcorehdr is identified at run-time in
crash_core:crash_handle_hotplug_event().

No modifications to purgatory (see 'kexec: exclude elfcorehdr
from the segment digest') or boot_params (as the elfcorehdr=
capture kernel command line parameter pointer remains unchanged
and correct) are needed, just elfcorehdr.

For kexec_file_load(), the elfcorehdr segment size is based on
NR_CPUS and CRASH_MAX_MEMORY_RANGES in order to accommodate a
growing number of CPU and memory resources.

For kexec_load(), the userspace kexec utility needs to size the
elfcorehdr segment in the same/similar manner.

To accommodate kexec_load() syscall in the absence of
kexec_file_load() syscall support, prepare_elf_headers() and
dependents are moved outside of CONFIG_KEXEC_FILE.
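
As a rough illustration of sizing the elfcorehdr segment from a maximum CPU count and a maximum number of memory ranges, a userspace tool might compute something like the following. This is a sketch under stated assumptions, not the patch's code: elfcorehdr_size() is a hypothetical name, and the formula (one ELF header, one program header per CPU and per memory range, plus two extra program headers) is an assumption made for the example. The commit message itself only says that the size is based on NR_CPUS and CRASH_MAX_MEMORY_RANGES.

```c
#include <assert.h>
#include <elf.h>
#include <stddef.h>

/*
 * Hypothetical sizing helper. Assumes one program header per possible
 * CPU and per memory range, plus two additional program headers.
 */
size_t elfcorehdr_size(size_t max_cpus, size_t max_mem_ranges)
{
    size_t pnum = 2 + max_cpus + max_mem_ranges;

    /* An ELF header's e_phnum field cannot hold PN_XNUM or more entries. */
    assert(pnum < PN_XNUM);

    return sizeof(Elf64_Ehdr) + pnum * sizeof(Elf64_Phdr);
}
```

Sizing for the maximum rather than the current resource count is what lets a later hot-add be described in the same buffer without reallocating the segment.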
Signed-off-by: Eric DeVolder
Reviewed-by: Sourabh Jain
Acked-by: Hari Bathini
Acked-by: Baoquan He
---
 arch/x86/Kconfig             |   3
 arch/x86/include/asm/kexec.h |  15
 arch/x86/kernel/crash.c      | 103
 3 files changed, 114 insertions(+), 7 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 06a4472d0fc0..42c083da7ce4 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2058,6 +2058,9 @@ config ARCH_SUPPORTS_KEXEC_JUMP
 config ARCH_SUPPORTS_CRASH_DUMP
         def_bool X86_64 || (X86_32 && HIGHMEM)

+config ARCH_SUPPORTS_CRASH_HOTPLUG
+        def_bool y
+
 config PHYSICAL_START
         hex "Physical address where the kernel is loaded" if (EXPERT || CRASH_DUMP)
         default "0x1000000"
diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
index 5b77bbc28f96..9143100ea3ea 100644
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -209,6 +209,21 @@ typedef void crash_vmclear_fn(void);
 extern crash_vmclear_fn __rcu *crash_vmclear_loaded_vmcss;
 extern void kdump_nmi_shootdown_cpus(void);

+#ifdef CONFIG_CRASH_HOTPLUG
+void arch_crash_handle_hotplug_event(struct kimage *image);
+#define arch_crash_handle_hotplug_event arch_crash_handle_hotplug_event
+
+#ifdef CONFIG_HOTPLUG_CPU
+static inline int crash_hotplug_cpu_support(void) { return 1; }
+#define crash_hotplug_cpu_support crash_hotplug_cpu_support
+#endif
+
+#ifdef CONFIG_MEMORY_HOTPLUG
+static inline int crash_hotplug_memory_support(void) { return 1; }
+#define crash_hotplug_memory_support crash_hotplug_memory_support
+#endif
+#endif
+
 #endif /* __ASSEMBLY__ */

 #endif /* _ASM_X86_KEXEC_H */
diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index cdd92ab43cda..c70a111c44fa 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -158,8 +158,6 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
         crash_save_cpu(regs, safe_smp_processor_id());
 }

-#ifdef CONFIG_KEXEC_FILE
-
 static int get_nr_ram_ranges_callback(struct resource *res, void *arg)
 {
         unsigned int *nr_ranges = arg;
@@ -231,7 +229,7 @@ static int prepare_elf64_ram_headers_callback(struct resource *res, void *arg)

 /* Prepare elf headers. Return addr and size */
 static int prepare_elf_headers(struct kimage *image, void **addr,
-                                        unsigned long *sz)
+                                        unsigned long *sz, unsigned long *nr_mem_ranges)
 {
         struct crash_mem *cmem;
         int ret;
@@ -249,6 +247,9 @@ static int prepare_elf_headers(struct kimage *image, void **addr,
         if (ret)
                 goto out;

+        /* Return the computed number of memory ranges, for hotplug usage */
+        *nr_mem_ranges = cmem->nr_ranges;
+
         /* By default prepare 64bit headers */
         ret = crash_prepare_elf64_headers(cmem, IS_ENABLED(CONFIG_X86_64), addr, sz);

@@ -257,6 +258,7 @@ static int prepare_elf_headers(struct kimage *image, void **addr,
         return ret;
 }

+#ifdef CONFIG_KEXEC_FILE
 static int add_e820_entry(struct boot_params *params, struct e820_entry *entry)
 {
         unsigned int nr_e820_entries;
@@ -371,18 +373,42 @@ int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params)
 int crash_load_segments(struct kimage *image)
 {
         int ret;
+        unsigned long pnum = 0;
         struct kexec_buf kbuf = { .image = image, .buf_min = 0,
                                   .buf_max = ULONG_MAX, .top_down = false };

         /* Prepare elf headers and add a segment */
-        ret = prepare_elf_headers(image, &kbuf.buffer, &kbuf.bufsz);
+        ret = prepare_elf_headers(image, &kbuf.buffer, &kbuf.bufsz, &pnum);
         if (ret)
                 return ret;

-        image->elf_headers = kbuf.buffer;
-        image->elf_headers_sz = kbuf.bufsz;
+        image->elf_headers = kbuf.buffer;
+        image->elf_he
[PATCH v25 02/10] drivers/base: refactor memory.c to use .is_visible()
Greg Kroah-Hartman requested that this file use the .is_visible()
method instead of #ifdefs for the attributes in memory.c.

 static struct attribute *memory_memblk_attrs[] = {
     &dev_attr_phys_index.attr,
     &dev_attr_state.attr,
     &dev_attr_phys_device.attr,
     &dev_attr_removable.attr,
 #ifdef CONFIG_MEMORY_HOTREMOVE
     &dev_attr_valid_zones.attr,
 #endif
     NULL
 };

and

 static struct attribute *memory_root_attrs[] = {
 #ifdef CONFIG_ARCH_MEMORY_PROBE
     &dev_attr_probe.attr,
 #endif
 #ifdef CONFIG_MEMORY_FAILURE
     &dev_attr_soft_offline_page.attr,
     &dev_attr_hard_offline_page.attr,
 #endif
     &dev_attr_block_size_bytes.attr,
     &dev_attr_auto_online_blocks.attr,
     NULL
 };

To that end:
 - the .is_visible() method is implemented, and IS_ENABLED(), rather
   than #ifdef, is used to determine the visibility of the attribute.
 - the DEVICE_ATTR_xx() attributes are moved outside of #ifdefs, so that
   those structs are always present for the memory_memblk_attrs[] and
   memory_root_attrs[].
 - the function bodies of the callback functions are now wrapped with
   IS_ENABLED(), as the callback function must exist now that the
   attribute is always compiled-in (though not necessarily visible).

No functionality change intended.
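For readers unfamiliar with the pattern, the idea described above can be sketched in plain userspace C. This is only an illustration, not the code from the patch: struct demo_attr, the DEMO_* switches and the demo_* functions are hypothetical stand-ins for struct attribute, the CONFIG_* symbols tested with IS_ENABLED(), and the .is_visible() callback.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for CONFIG_* symbols: 1 = enabled, 0 = disabled. */
#define DEMO_MEMORY_HOTREMOVE  1
#define DEMO_ARCH_MEMORY_PROBE 0

/* Hypothetical, simplified stand-in for the kernel's struct attribute. */
struct demo_attr {
	const char *name;
	unsigned short mode;
};

static struct demo_attr attr_state       = { "state",       0644 };
static struct demo_attr attr_valid_zones = { "valid_zones", 0444 };
static struct demo_attr attr_probe       = { "probe",       0200 };

/* Every attribute is always listed; there is no #ifdef inside the array. */
static struct demo_attr *demo_attrs[] = {
	&attr_state,
	&attr_valid_zones,
	&attr_probe,
	NULL
};

/* Return the mode to expose for an attribute, or 0 to hide it. */
static unsigned short demo_is_visible(const struct demo_attr *attr)
{
	if (attr == &attr_valid_zones)
		return DEMO_MEMORY_HOTREMOVE ? attr->mode : 0;
	if (attr == &attr_probe)
		return DEMO_ARCH_MEMORY_PROBE ? attr->mode : 0;
	return attr->mode;
}

/* Count how many of the listed attributes would be visible. */
static int demo_count_visible(void)
{
	int n = 0;

	for (size_t i = 0; demo_attrs[i]; i++)
		if (demo_is_visible(demo_attrs[i]))
			n++;
	return n;
}
```

The visibility decision lives in one function, so the attribute array and the attribute definitions compile identically for every configuration.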
Signed-off-by: Eric DeVolder --- drivers/base/memory.c | 229 ++ 1 file changed, 140 insertions(+), 89 deletions(-) diff --git a/drivers/base/memory.c b/drivers/base/memory.c index b456ac213610..7294112fe646 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -391,62 +391,66 @@ static ssize_t phys_device_show(struct device *dev, arch_get_memory_phys_device(start_pfn)); } -#ifdef CONFIG_MEMORY_HOTREMOVE static int print_allowed_zone(char *buf, int len, int nid, struct memory_group *group, unsigned long start_pfn, unsigned long nr_pages, int online_type, struct zone *default_zone) { - struct zone *zone; + if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE)) { + struct zone *zone; - zone = zone_for_pfn_range(online_type, nid, group, start_pfn, nr_pages); - if (zone == default_zone) - return 0; + zone = zone_for_pfn_range(online_type, nid, group, start_pfn, nr_pages); + if (zone == default_zone) + return 0; - return sysfs_emit_at(buf, len, " %s", zone->name); + return sysfs_emit_at(buf, len, " %s", zone->name); + } + return 0; } static ssize_t valid_zones_show(struct device *dev, struct device_attribute *attr, char *buf) { - struct memory_block *mem = to_memory_block(dev); - unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr); - unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block; - struct memory_group *group = mem->group; - struct zone *default_zone; - int nid = mem->nid; - int len = 0; + if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE)) { + struct memory_block *mem = to_memory_block(dev); + unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr); + unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block; + struct memory_group *group = mem->group; + struct zone *default_zone; + int nid = mem->nid; + int len = 0; - /* -* Check the existing zone. 
Make sure that we do that only on the -* online nodes otherwise the page_zone is not reliable -*/ - if (mem->state == MEM_ONLINE) { /* -* If !mem->zone, the memory block spans multiple zones and -* cannot get offlined. -*/ - default_zone = mem->zone; - if (!default_zone) - return sysfs_emit(buf, "%s\n", "none"); - len += sysfs_emit_at(buf, len, "%s", default_zone->name); - goto out; - } + * Check the existing zone. Make sure that we do that only on the + * online nodes otherwise the page_zone is not reliable + */ + if (mem->state == MEM_ONLINE) { + /* +* If !mem->zone, the memory block spans multiple zones and +* cannot get offlined. +*/ + default_zone = mem->zone; + if (!default_zone) + return sysfs_emit(buf, "%s\n", "none"); + len += sysfs_emit_at(buf, len, "%s", default_zone->name); + goto out; + } - default_zone = zone_for
[PATCH v25 00/10] crash: Kernel handling of CPU and memory hot un/plug
://lkml.org/lkml/2022/5/5/1133
 https://lore.kernel.org/lkml/20220505184603.1548-1-eric.devol...@oracle.com/
 - Per Borislav Petkov, eliminated CONFIG_CRASH_HOTPLUG in favor of
   CONFIG_HOTPLUG_CPU || CONFIG_MEMORY_HOTPLUG, i.e. a new define is not
   needed. Also use of IS_ENABLED() rather than #ifdef's. Renamed
   crash_hotplug_handler() to handle_hotplug_event(). And other
   corrections.
 - Per Baoquan, minimized the parameters to the
   arch_crash_handle_hotplug_event() to hp_action and cpu.
 - Introduce KEXEC_CRASH_HP_INVALID_CPU definition, per Baoquan.
 - Per Sourabh Jain, renamed and repurposed CRASH_HOTPLUG_ELFCOREHDR_SZ
   to CONFIG_CRASH_MAX_MEMORY_RANGES, mirroring kexec-tools change
   by David Hildenbrand. Folded this patch into the x86
   kexec_file_load support patch.

v7: 13apr2022
 https://lkml.org/lkml/2022/4/13/850
 https://lore.kernel.org/lkml/20220413164237.20845-1-eric.devol...@oracle.com/
 - Resolved parameter usage to crash_hotplug_handler(), per Baoquan.

v6: 1apr2022
 https://lkml.org/lkml/2022/4/1/1203
 https://lore.kernel.org/lkml/20220401183040.1624-1-eric.devol...@oracle.com/
 - Reword commit messages and some comment cleanup per Baoquan.
 - Changed elf_index to elfcorehdr_index for clarity.
 - Minor code changes per Baoquan.

v5: 3mar2022
 https://lkml.org/lkml/2022/3/3/674
 https://lore.kernel.org/lkml/20220303162725.49640-1-eric.devol...@oracle.com/
 - Reworded description of CRASH_HOTPLUG_ELFCOREHDR_SZ, per
   David Hildenbrand.
 - Refactored slightly a few patches per Baoquan recommendation.

v4: 9feb2022
 https://lkml.org/lkml/2022/2/9/1406
 https://lore.kernel.org/lkml/20220209195706.51522-1-eric.devol...@oracle.com/
 - Refactored patches per Baoquan suggestions.
 - A few corrections, per Baoquan.

v3: 10jan2022
 https://lkml.org/lkml/2022/1/10/1212
 https://lore.kernel.org/lkml/20220110195727.1682-1-eric.devol...@oracle.com/
 - Rebasing per Baoquan He request.
 - Changed memory notifier per David Hildenbrand.
 - Providing example kexec userspace change in cover letter.
RFC v2: 7dec2021
 https://lkml.org/lkml/2021/12/7/1088
 https://lore.kernel.org/lkml/20211207195204.1582-1-eric.devol...@oracle.com/
 - Acting upon Baoquan He suggestion of removing elfcorehdr from
   the purgatory list of segments, removed purgatory code from
   patchset, and it is significantly simpler now.

RFC v1: 18nov2021
 https://lkml.org/lkml/2021/11/18/845
 https://lore.kernel.org/lkml/2028174948.37435-1-eric.devol...@oracle.com/
 - working patchset demonstrating kernel handling of hotplug
   updates to x86 elfcorehdr for kexec_file_load

RFC: 14dec2020
 https://lkml.org/lkml/2020/12/14/532
 https://lore.kernel.org/lkml/b04ed259-dc5f-7f30-6661-c26f92d90...@oracle.com/
 - proposed concept of allowing kernel to handle hotplug update
   of elfcorehdr
---

Eric DeVolder (10):
  drivers/base: refactor cpu.c to use .is_visible()
  drivers/base: refactor memory.c to use .is_visible()
  crash: move a few code bits to setup support of crash hotplug
  crash: add generic infrastructure for crash hotplug support
  kexec: exclude elfcorehdr from the segment digest
  crash: memory and CPU hotplug sysfs attributes
  x86/crash: add x86 crash hotplug support
  crash: hotplug support for kexec_load()
  crash: change crash_prepare_elf64_headers() to for_each_possible_cpu()
  x86/crash: optimize CPU changes

 .../ABI/testing/sysfs-devices-memory      |   8 +
 .../ABI/testing/sysfs-devices-system-cpu  |   8 +
 .../admin-guide/mm/memory-hotplug.rst     |   8 +
 Documentation/core-api/cpu_hotplug.rst    |  18 +
 arch/x86/Kconfig                          |   3 +
 arch/x86/include/asm/kexec.h              |  18 +
 arch/x86/kernel/crash.c                   | 140 ++-
 drivers/base/cpu.c                        | 141 ---
 drivers/base/memory.c                     | 242 +++-
 include/linux/crash_core.h                |   9 +
 include/linux/kexec.h                     |  63 +++-
 include/linux/tick.h                      |   2 +-
 include/uapi/linux/kexec.h                |   1 +
 kernel/Kconfig.kexec                      |  35 ++
 kernel/crash_core.c                       | 355 ++
 kernel/kexec.c                            |   5 +
 kernel/kexec_core.c                       |   6 +
 kernel/kexec_file.c                       | 187 +
 kernel/ksysfs.c                           |  15 +
 19 files changed, 922 insertions(+), 342 deletions(-)

--
2.31.1

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
[PATCH v25 04/10] crash: add generic infrastructure for crash hotplug support
To support crash hotplug, a mechanism is needed to update the crash
elfcorehdr upon CPU or memory changes (e.g. hot un/plug or off/onlining).
The crash elfcorehdr describes the CPUs and memory to be written into
the vmcore.

To track CPU changes, callbacks are registered with the cpuhp mechanism
via cpuhp_setup_state_nocalls(CPUHP_BP_PREPARE_DYN). The crash hotplug
elfcorehdr update has no explicit ordering requirement (relative to
other cpuhp states), so it meets the criteria for utilizing
CPUHP_BP_PREPARE_DYN. CPUHP_BP_PREPARE_DYN is a dynamic state and avoids
the need to introduce a new state for crash hotplug. Also,
CPUHP_BP_PREPARE_DYN is the last state in the PREPARE group, just prior
to the STARTING group, which is very close to the CPU starting up in a
plug/online situation, or stopping in an unplug/offline situation. This
minimizes the window of time during an actual plug/online or
unplug/offline situation in which the elfcorehdr would be inaccurate.
Note that for a CPU being unplugged or offlined, the CPU will still be
present in the list of CPUs generated by crash_prepare_elf64_headers().
However, there is no need to explicitly omit the CPU, see justification
in 'crash: change crash_prepare_elf64_headers() to
for_each_possible_cpu()'.

To track memory changes, a notifier is registered to capture the
memblock MEM_ONLINE and MEM_OFFLINE events via
register_memory_notifier().

The CPU callbacks and memory notifiers invoke
crash_handle_hotplug_event(), which performs needed tasks and then
dispatches the event to the architecture specific
arch_crash_handle_hotplug_event() to update the elfcorehdr with the
current state of CPUs and memory. During the process, the kexec_lock
is held.
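The event flow described above can be pictured with a small userspace mock. It is a sketch under stated assumptions rather than the kernel implementation: the KEXEC_CRASH_HP_* values are copied from this patch, while struct demo_kimage and every demo_* function are hypothetical stand-ins for the cpuhp callbacks, the memory notifier, crash_handle_hotplug_event() and the arch handler. Locking (kexec_lock) is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Action codes as defined by this patch in include/linux/crash_core.h. */
#define KEXEC_CRASH_HP_NONE          0
#define KEXEC_CRASH_HP_ADD_CPU       1
#define KEXEC_CRASH_HP_REMOVE_CPU    2
#define KEXEC_CRASH_HP_ADD_MEMORY    3
#define KEXEC_CRASH_HP_REMOVE_MEMORY 4
#define KEXEC_CRASH_HP_INVALID_CPU   -1U

/* Hypothetical, reduced stand-in for struct kimage. */
struct demo_kimage {
	int hp_action;
	bool elfcorehdr_updated;
	unsigned int arch_calls;   /* mock-only counter for the sketch */
};

static struct demo_kimage demo_image;
static bool demo_image_loaded = true;

/* Stand-in for arch_crash_handle_hotplug_event(): rewrite the elfcorehdr. */
static void demo_arch_handle_hotplug_event(struct demo_kimage *image)
{
	image->arch_calls++;
	image->elfcorehdr_updated = true;
}

/* Stand-in for the generic handler: record the action, then dispatch. */
static void demo_handle_hotplug_event(int hp_action, unsigned int cpu)
{
	(void)cpu;                 /* unused in this mock */
	if (!demo_image_loaded)    /* nothing to update without a crash image */
		return;
	demo_image.hp_action = hp_action;
	demo_arch_handle_hotplug_event(&demo_image);
}

/* Stand-ins for the cpuhp online/offline callbacks. */
static int demo_cpu_online(unsigned int cpu)
{
	demo_handle_hotplug_event(KEXEC_CRASH_HP_ADD_CPU, cpu);
	return 0;
}

static int demo_cpu_offline(unsigned int cpu)
{
	demo_handle_hotplug_event(KEXEC_CRASH_HP_REMOVE_CPU, cpu);
	return 0;
}

/* Stand-in for the memory notifier (MEM_ONLINE / MEM_OFFLINE). */
static int demo_memory_notify(bool online)
{
	demo_handle_hotplug_event(online ? KEXEC_CRASH_HP_ADD_MEMORY
					 : KEXEC_CRASH_HP_REMOVE_MEMORY,
				  KEXEC_CRASH_HP_INVALID_CPU);
	return 0;
}
```

Both event sources funnel into one generic entry point, which is what lets an architecture supply a single handler for CPU and memory changes.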
Signed-off-by: Eric DeVolder Reviewed-by: Sourabh Jain Acked-by: Hari Bathini Acked-by: Baoquan He --- include/linux/crash_core.h | 9 +++ include/linux/kexec.h | 11 +++ kernel/Kconfig.kexec | 31 kernel/crash_core.c| 142 + kernel/kexec_core.c| 6 ++ 5 files changed, 199 insertions(+) diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h index de62a722431e..e14345cc7a22 100644 --- a/include/linux/crash_core.h +++ b/include/linux/crash_core.h @@ -84,4 +84,13 @@ int parse_crashkernel_high(char *cmdline, unsigned long long system_ram, int parse_crashkernel_low(char *cmdline, unsigned long long system_ram, unsigned long long *crash_size, unsigned long long *crash_base); +#define KEXEC_CRASH_HP_NONE0 +#define KEXEC_CRASH_HP_ADD_CPU 1 +#define KEXEC_CRASH_HP_REMOVE_CPU 2 +#define KEXEC_CRASH_HP_ADD_MEMORY 3 +#define KEXEC_CRASH_HP_REMOVE_MEMORY 4 +#define KEXEC_CRASH_HP_INVALID_CPU -1U + +struct kimage; + #endif /* LINUX_CRASH_CORE_H */ diff --git a/include/linux/kexec.h b/include/linux/kexec.h index 811a90e09698..b9903dd48e24 100644 --- a/include/linux/kexec.h +++ b/include/linux/kexec.h @@ -33,6 +33,7 @@ extern note_buf_t __percpu *crash_notes; #include #include #include +#include #include /* Verify architecture specific macros are defined */ @@ -360,6 +361,12 @@ struct kimage { struct purgatory_info purgatory_info; #endif +#ifdef CONFIG_CRASH_HOTPLUG + int hp_action; + int elfcorehdr_index; + bool elfcorehdr_updated; +#endif + #ifdef CONFIG_IMA_KEXEC /* Virtual address of IMA measurement buffer for kexec syscall */ void *ima_buffer; @@ -490,6 +497,10 @@ static inline int arch_kexec_post_alloc_pages(void *vaddr, unsigned int pages, g static inline void arch_kexec_pre_free_pages(void *vaddr, unsigned int pages) { } #endif +#ifndef arch_crash_handle_hotplug_event +static inline void arch_crash_handle_hotplug_event(struct kimage *image) { } +#endif + #else /* !CONFIG_KEXEC_CORE */ struct pt_regs; struct task_struct; diff --git a/kernel/Kconfig.kexec 
b/kernel/Kconfig.kexec index 5d576ddfd999..7eb42a795176 100644 --- a/kernel/Kconfig.kexec +++ b/kernel/Kconfig.kexec @@ -107,4 +107,35 @@ config CRASH_DUMP For s390, this option also enables zfcpdump. See also +config CRASH_HOTPLUG + bool "Update the crash elfcorehdr on system configuration changes" + default y + depends on CRASH_DUMP && (HOTPLUG_CPU || MEMORY_HOTPLUG) + depends on ARCH_SUPPORTS_CRASH_HOTPLUG + help + Enable direct update to the crash elfcorehdr (which contains + the list of CPUs and memory regions to be dumped upon a crash) + in response to hot plug/unplug or online/offline of CPUs or + memory. This is a much more advanced approach than userspace + attempting that. + + If unsure, say Y. + +config CRASH_MAX_MEMORY_RANGES + int "Specify the maximum number of memory regions for the elfcorehdr" + default 8192 + depends on CRASH_HOTPLUG + help +
[PATCH v25 09/10] crash: change crash_prepare_elf64_headers() to for_each_possible_cpu()
The function crash_prepare_elf64_headers() generates the elfcorehdr
which describes the CPUs and memory in the system for the crash kernel.
In particular, it writes out ELF PT_LOADs for memory regions and
PT_NOTEs for the CPUs in the system.

With respect to the CPUs, the current implementation utilizes
for_each_present_cpu() which means that as CPUs are added and removed,
the elfcorehdr must again be updated to reflect the new set of CPUs.

The reasoning behind the move to use for_each_possible_cpu() is:

- At kernel boot time, all percpu crash_notes are allocated for all
  possible CPUs; that is, crash_notes are not allocated dynamically
  when CPUs are plugged/unplugged. Thus the crash_notes for each
  possible CPU are always available.

- The crash_prepare_elf64_headers() creates an ELF PT_NOTE per CPU.
  Changing to for_each_possible_cpu() is valid as the crash_notes
  pointed to by each CPU PT_NOTE are present and always valid.

Furthermore, examining a common crash processing path of:

 kernel panic -> crash kernel -> makedumpfile -> 'crash' analyzer
                 elfcorehdr      /proc/vmcore     vmcore

reveals how the ELF CPU PT_NOTEs are utilized:

- Upon panic, each CPU is sent an IPI and shuts itself down, recording
  its state in its crash_notes. When all CPUs are shut down, the
  crash kernel is launched with a pointer to the elfcorehdr.

- The crash kernel via linux/fs/proc/vmcore.c does not examine or
  use the contents of the PT_NOTEs, it exposes them via /proc/vmcore.

- The makedumpfile utility uses /proc/vmcore and reads the CPU
  PT_NOTEs to craft a nr_cpus variable, which is reported in a
  header but otherwise generally unused. Makedumpfile creates the
  vmcore.

- The 'crash' dump analyzer does not appear to reference the CPU
  PT_NOTEs. Instead it looks up the cpu_[possible|present|online]_mask
  symbols and directly examines those structure contents from vmcore
  memory. From that information it is able to determine which CPUs
  are present and online, and locate the corresponding crash_notes.
  Said differently, it appears that the 'crash' analyzer does not rely
  on the ELF PT_NOTEs for CPUs; rather it obtains the information
  directly via kernel symbols and the memory within the vmcore.

(There may be other vmcore generating and analysis tools that do use
these PT_NOTEs, but 'makedumpfile' and 'crash' seem to be the most
common solution.)

This results in the benefit of having all CPUs described in the
elfcorehdr, and therefore reducing the need to re-generate the
elfcorehdr on CPU changes, at the small expense of an additional
56 bytes per PT_NOTE for not-present-but-possible CPUs.

On systems where the kexec_file_load() syscall is utilized, all the
above is valid. On systems where the kexec_load() syscall is utilized,
there may be the need for the elfcorehdr to be regenerated once. The
reason is that some archs only populate the 'present' CPUs from the
/sys/devices/system/cpu entries, which the userspace 'kexec' utility
uses to generate the userspace-supplied elfcorehdr. In this situation,
one memory or CPU change will rewrite the elfcorehdr via the
crash_prepare_elf64_headers() function and now all possible CPUs will
be described, just as with the kexec_file_load() syscall.
Suggested-by: Sourabh Jain
Signed-off-by: Eric DeVolder
Reviewed-by: Sourabh Jain
Acked-by: Hari Bathini
Acked-by: Baoquan He
---
 kernel/crash_core.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index fa918176d46d..7378b501fada 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -364,8 +364,8 @@ int crash_prepare_elf64_headers(struct crash_mem *mem, int need_kernel_map,
 	ehdr->e_ehsize = sizeof(Elf64_Ehdr);
 	ehdr->e_phentsize = sizeof(Elf64_Phdr);
 
-	/* Prepare one phdr of type PT_NOTE for each present CPU */
-	for_each_present_cpu(cpu) {
+	/* Prepare one phdr of type PT_NOTE for each possible CPU */
+	for_each_possible_cpu(cpu) {
 		phdr->p_type = PT_NOTE;
 		notes_addr = per_cpu_ptr_to_phys(per_cpu_ptr(crash_notes, cpu));
 		phdr->p_offset = phdr->p_paddr = notes_addr;
--
2.31.1
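The "56 bytes per PT_NOTE" cost mentioned in the commit message is the size of one 64-bit ELF program header. The sketch below reproduces that arithmetic in userspace C, assuming the standard ELF64 layout; the demo_* structs and functions are hypothetical mirrors of Elf64_Ehdr, Elf64_Phdr and the header sizing done in crash_prepare_elf64_headers(), not kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the standard Elf64_Ehdr layout (64 bytes). */
struct demo_elf64_ehdr {
	unsigned char e_ident[16];
	uint16_t e_type, e_machine;
	uint32_t e_version;
	uint64_t e_entry, e_phoff, e_shoff;
	uint32_t e_flags;
	uint16_t e_ehsize, e_phentsize, e_phnum;
	uint16_t e_shentsize, e_shnum, e_shstrndx;
};

/* Mirrors the standard Elf64_Phdr layout (56 bytes). */
struct demo_elf64_phdr {
	uint32_t p_type, p_flags;
	uint64_t p_offset, p_vaddr, p_paddr;
	uint64_t p_filesz, p_memsz, p_align;
};

#define DEMO_ELF_CORE_HEADER_ALIGN 4096UL

/*
 * Size of the elfcorehdr buffer: one PT_NOTE per CPU, one PT_NOTE for
 * vmcoreinfo, one PT_LOAD per memory range, and one extra PT_LOAD for the
 * kernel text mapping, rounded up to the header alignment.
 */
static size_t demo_elfcorehdr_size(size_t nr_cpus, size_t nr_mem_ranges)
{
	size_t nr_phdr = nr_cpus + 1 + nr_mem_ranges + 1;
	size_t sz = sizeof(struct demo_elf64_ehdr) +
		    nr_phdr * sizeof(struct demo_elf64_phdr);

	return (sz + DEMO_ELF_CORE_HEADER_ALIGN - 1) &
	       ~(DEMO_ELF_CORE_HEADER_ALIGN - 1);
}

/* Extra bytes spent on CPUs that are possible but not present. */
static size_t demo_extra_note_bytes(size_t possible, size_t present)
{
	return (possible - present) * sizeof(struct demo_elf64_phdr);
}
```

As a worked case, a machine with 64 possible but only 16 present CPUs spends 48 * 56 = 2688 additional bytes on program headers; because the buffer is rounded up to 4096 bytes, that may or may not cross a page boundary depending on the number of memory ranges.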
[PATCH v25 03/10] crash: move a few code bits to setup support of crash hotplug
The crash hotplug support leans on the work for the kexec_file_load()
syscall. To also support the kexec_load() syscall, a few bits of code
need to be moved outside of CONFIG_KEXEC_FILE. As such, these bits are
moved out of kexec_file.c and into a common location, crash_core.c.

No functionality change intended.

Signed-off-by: Eric DeVolder
Reviewed-by: Sourabh Jain
Acked-by: Hari Bathini
Acked-by: Baoquan He
---
 include/linux/kexec.h |  30 +++
 kernel/crash_core.c   | 182 ++
 kernel/kexec_file.c   | 181 -
 3 files changed, 197 insertions(+), 196 deletions(-)

diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index 22b5cd24f581..811a90e09698 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -105,6 +105,21 @@ struct compat_kexec_segment {
 };
 #endif
 
+/* Alignment required for elf header segment */
+#define ELF_CORE_HEADER_ALIGN   4096
+
+struct crash_mem {
+	unsigned int max_nr_ranges;
+	unsigned int nr_ranges;
+	struct range ranges[];
+};
+
+extern int crash_exclude_mem_range(struct crash_mem *mem,
+				   unsigned long long mstart,
+				   unsigned long long mend);
+extern int crash_prepare_elf64_headers(struct crash_mem *mem, int need_kernel_map,
+				       void **addr, unsigned long *sz);
+
 #ifdef CONFIG_KEXEC_FILE
 struct purgatory_info {
 	/*
@@ -230,21 +245,6 @@ static inline int arch_kexec_locate_mem_hole(struct kexec_buf *kbuf)
 }
 #endif
 
-/* Alignment required for elf header segment */
-#define ELF_CORE_HEADER_ALIGN   4096
-
-struct crash_mem {
-	unsigned int max_nr_ranges;
-	unsigned int nr_ranges;
-	struct range ranges[];
-};
-
-extern int crash_exclude_mem_range(struct crash_mem *mem,
-				   unsigned long long mstart,
-				   unsigned long long mend);
-extern int crash_prepare_elf64_headers(struct crash_mem *mem, int need_kernel_map,
-				       void **addr, unsigned long *sz);
-
 #ifndef arch_kexec_apply_relocations_add
 /*
  * arch_kexec_apply_relocations_add - apply relocations of type RELA
diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index 90ce1dfd591c..b7c30b748a16 100644
---
a/kernel/crash_core.c +++ b/kernel/crash_core.c @@ -10,6 +10,7 @@ #include #include #include +#include #include #include @@ -314,6 +315,187 @@ static int __init parse_crashkernel_dummy(char *arg) } early_param("crashkernel", parse_crashkernel_dummy); +int crash_prepare_elf64_headers(struct crash_mem *mem, int need_kernel_map, + void **addr, unsigned long *sz) +{ + Elf64_Ehdr *ehdr; + Elf64_Phdr *phdr; + unsigned long nr_cpus = num_possible_cpus(), nr_phdr, elf_sz; + unsigned char *buf; + unsigned int cpu, i; + unsigned long long notes_addr; + unsigned long mstart, mend; + + /* extra phdr for vmcoreinfo ELF note */ + nr_phdr = nr_cpus + 1; + nr_phdr += mem->nr_ranges; + + /* +* kexec-tools creates an extra PT_LOAD phdr for kernel text mapping +* area (for example, 8000 - a000 on x86_64). +* I think this is required by tools like gdb. So same physical +* memory will be mapped in two ELF headers. One will contain kernel +* text virtual addresses and other will have __va(physical) addresses. 
+*/ + + nr_phdr++; + elf_sz = sizeof(Elf64_Ehdr) + nr_phdr * sizeof(Elf64_Phdr); + elf_sz = ALIGN(elf_sz, ELF_CORE_HEADER_ALIGN); + + buf = vzalloc(elf_sz); + if (!buf) + return -ENOMEM; + + ehdr = (Elf64_Ehdr *)buf; + phdr = (Elf64_Phdr *)(ehdr + 1); + memcpy(ehdr->e_ident, ELFMAG, SELFMAG); + ehdr->e_ident[EI_CLASS] = ELFCLASS64; + ehdr->e_ident[EI_DATA] = ELFDATA2LSB; + ehdr->e_ident[EI_VERSION] = EV_CURRENT; + ehdr->e_ident[EI_OSABI] = ELF_OSABI; + memset(ehdr->e_ident + EI_PAD, 0, EI_NIDENT - EI_PAD); + ehdr->e_type = ET_CORE; + ehdr->e_machine = ELF_ARCH; + ehdr->e_version = EV_CURRENT; + ehdr->e_phoff = sizeof(Elf64_Ehdr); + ehdr->e_ehsize = sizeof(Elf64_Ehdr); + ehdr->e_phentsize = sizeof(Elf64_Phdr); + + /* Prepare one phdr of type PT_NOTE for each present CPU */ + for_each_present_cpu(cpu) { + phdr->p_type = PT_NOTE; + notes_addr = per_cpu_ptr_to_phys(per_cpu_ptr(crash_notes, cpu)); + phdr->p_offset = phdr->p_paddr = notes_addr; + phdr->p_filesz = phdr->p_memsz = sizeof(note_buf_t); + (ehdr->e_phnum)++; + phdr++; + } + + /* Prepare one PT_NOTE header for vmcoreinfo */ + phdr->p_type = PT_NOTE; +
[PATCH v25 05/10] kexec: exclude elfcorehdr from the segment digest
When a crash kernel is loaded via the kexec_file_load() syscall, the
kernel places the various segments (i.e. crash kernel, crash initrd,
boot_params, elfcorehdr, purgatory, etc) in memory. For those
architectures that utilize purgatory, a hash digest of the segments
is calculated for integrity checking. The digest is embedded into
the purgatory image prior to placing it in memory.

Updates to the elfcorehdr in response to CPU and memory changes
would cause the purgatory integrity checking to fail (at crash time,
with no vmcore created). Therefore, the elfcorehdr segment is
explicitly excluded from the purgatory digest, enabling updates to
the elfcorehdr while also avoiding the need to recompute the hash
digest and reload purgatory.

Signed-off-by: Eric DeVolder
Reviewed-by: Sourabh Jain
Acked-by: Hari Bathini
Acked-by: Baoquan He
---
 kernel/kexec_file.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index e9cf9e8d8f01..824ffc5282f4 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -726,6 +726,12 @@ static int kexec_calculate_store_digests(struct kimage *image)
 	for (j = i = 0; i < image->nr_segments; i++) {
 		struct kexec_segment *ksegment;
 
+#ifdef CONFIG_CRASH_HOTPLUG
+		/* Exclude elfcorehdr segment to allow future changes via hotplug */
+		if (j == image->elfcorehdr_index)
+			continue;
+#endif
+
 		ksegment = &image->segment[i];
 		/*
 		 * Skip purgatory as it will be modified once we put digest
--
2.31.1
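The effect of leaving one segment out of the digest can be demonstrated with a toy example. This is a sketch only: it uses the FNV-1a hash as a stand-in for the SHA-256 digest the kernel computes, it selects the excluded segment by its position in the segment array, and struct demo_segment and the demo_* functions are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced stand-in for struct kexec_segment. */
struct demo_segment {
	const unsigned char *buf;
	size_t len;
};

/*
 * Toy digest over all segments except skip_index (pass -1 to skip none).
 * FNV-1a is used purely for illustration; it is not the kernel's digest.
 */
static uint64_t demo_digest(const struct demo_segment *segs, size_t nr,
			    int skip_index)
{
	uint64_t h = 14695981039346656037ULL;   /* FNV-1a offset basis */

	for (size_t i = 0; i < nr; i++) {
		if ((int)i == skip_index)
			continue;                /* excluded from the digest */
		for (size_t k = 0; k < segs[i].len; k++) {
			h ^= segs[i].buf[k];
			h *= 1099511628211ULL;   /* FNV-1a prime */
		}
	}
	return h;
}

/*
 * Modify the mock elfcorehdr after the digest was taken and report
 * whether a fresh digest still matches (1) or not (0).
 */
static int demo_digest_survives_update(int skip_index)
{
	unsigned char kernel[] = "kernel";
	unsigned char elfcorehdr[] = "cpus:0-3";
	struct demo_segment segs[] = {
		{ kernel, sizeof(kernel) },
		{ elfcorehdr, sizeof(elfcorehdr) },
	};
	uint64_t before = demo_digest(segs, 2, skip_index);

	elfcorehdr[7] = '7';                     /* simulated hotplug update */
	return demo_digest(segs, 2, skip_index) == before;
}
```

With segment 1 excluded the digest is unaffected by the update; with nothing excluded the same update changes the digest, which is the integrity failure the commit message describes.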
[PATCH v25 10/10] x86/crash: optimize CPU changes
crash_prepare_elf64_headers() writes into the elfcorehdr an ELF
PT_NOTE for all possible CPUs. As such, subsequent changes to CPUs
(i.e. hot un/plug, online/offline) do not need to rewrite the
elfcorehdr.

The kimage->file_mode term covers kdump images loaded via the
kexec_file_load() syscall. Since crash_prepare_elf64_headers()
wrote the initial elfcorehdr, no update to the elfcorehdr is
needed for CPU changes.

The kimage->elfcorehdr_updated term covers kdump images loaded via
the kexec_load() syscall. At least one memory or CPU change must occur
to cause crash_prepare_elf64_headers() to rewrite the elfcorehdr.
Afterwards, no update to the elfcorehdr is needed for CPU changes.

This code is intentionally *NOT* hoisted into
crash_handle_hotplug_event() as it would prevent the arch-specific
handler from running for CPU changes. This would break PPC, for
example, which needs to update other information besides the
elfcorehdr on CPU changes.

Signed-off-by: Eric DeVolder
Reviewed-by: Sourabh Jain
Acked-by: Hari Bathini
Acked-by: Baoquan He
---
 arch/x86/kernel/crash.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index caf22bcb61af..18d2a18d1073 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -467,6 +467,16 @@ void arch_crash_handle_hotplug_event(struct kimage *image)
 	unsigned long mem, memsz;
 	unsigned long elfsz = 0;
 
+	/*
+	 * As crash_prepare_elf64_headers() has already described all
+	 * possible CPUs, there is no need to update the elfcorehdr
+	 * for additional CPU changes.
+	 */
+	if ((image->file_mode || image->elfcorehdr_updated) &&
+		((image->hp_action == KEXEC_CRASH_HP_ADD_CPU) ||
+		(image->hp_action == KEXEC_CRASH_HP_REMOVE_CPU)))
+		return;
+
 	/*
 	 * Create the new elfcorehdr reflecting the changes to CPU and/or
 	 * memory resources.
--
2.31.1
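The early-return condition added by this patch is easy to check as a truth table. The sketch below isolates it in a standalone function; the action codes are the values defined earlier in this series, and demo_skip_update() is a hypothetical helper, not a function in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Action codes as defined earlier in this series. */
#define KEXEC_CRASH_HP_ADD_CPU       1
#define KEXEC_CRASH_HP_REMOVE_CPU    2
#define KEXEC_CRASH_HP_ADD_MEMORY    3
#define KEXEC_CRASH_HP_REMOVE_MEMORY 4

/*
 * Returns true when the x86 handler would return early, i.e. the
 * elfcorehdr already describes all possible CPUs and the event is a
 * CPU change.
 */
static bool demo_skip_update(bool file_mode, bool elfcorehdr_updated,
			     int hp_action)
{
	bool cpu_event = hp_action == KEXEC_CRASH_HP_ADD_CPU ||
			 hp_action == KEXEC_CRASH_HP_REMOVE_CPU;

	return (file_mode || elfcorehdr_updated) && cpu_event;
}
```

A kexec_load() image that has not yet been rewritten (both flags false) still takes the full update path on its first CPU event, matching the "regenerated once" behaviour described in patch 09/10; memory events are never skipped.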
Re: [PATCH v24 01/10] drivers/base: refactor cpu.c to use .is_visible()
I still need to convert the ifdefs within the functions to IS_ENABLED(),
my apologies.
eric

On 6/28/23 13:52, Eric DeVolder wrote:

Greg Kroah-Hartman requested that this file use the .is_visible()
method instead of #ifdefs for the attributes in cpu.c.

 static struct attribute *cpu_root_attrs[] = {
 #ifdef CONFIG_ARCH_CPU_PROBE_RELEASE
     &dev_attr_probe.attr,
     &dev_attr_release.attr,
 #endif
     &cpu_attrs[0].attr.attr,
     &cpu_attrs[1].attr.attr,
     &cpu_attrs[2].attr.attr,
     &dev_attr_kernel_max.attr,
     &dev_attr_offline.attr,
     &dev_attr_isolated.attr,
 #ifdef CONFIG_NO_HZ_FULL
     &dev_attr_nohz_full.attr,
 #endif
 #ifdef CONFIG_GENERIC_CPU_AUTOPROBE
     &dev_attr_modalias.attr,
 #endif
     NULL
 };

To that end:
 - the .is_visible() method is implemented, and IS_ENABLED(), rather
   than #ifdef, is used to determine the visibility of the attribute.
 - the DEVICE_ATTR() attributes are moved outside of #ifdefs, so that
   those structs are always present for the cpu_root_attrs[].
 - the #ifdefs guarding the attributes in the cpu_root_attrs[] are moved
   to the corresponding callback function; as the callback function must
   exist now that the attribute is always compiled-in (though not
   necessarily visible).

No functionality change intended.
Signed-off-by: Eric DeVolder --- drivers/base/cpu.c | 67 -- 1 file changed, 53 insertions(+), 14 deletions(-) diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c index c1815b9dae68..75fa46a567a1 100644 --- a/drivers/base/cpu.c +++ b/drivers/base/cpu.c @@ -82,13 +82,14 @@ void unregister_cpu(struct cpu *cpu) per_cpu(cpu_sys_devices, logical_cpu) = NULL; return; } +#endif /* CONFIG_HOTPLUG_CPU */ -#ifdef CONFIG_ARCH_CPU_PROBE_RELEASE static ssize_t cpu_probe_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count) { +#ifdef CONFIG_ARCH_CPU_PROBE_RELEASE ssize_t cnt; int ret; @@ -100,6 +101,9 @@ static ssize_t cpu_probe_store(struct device *dev, unlock_device_hotplug(); return cnt; +#else + return 0; +#endif } static ssize_t cpu_release_store(struct device *dev, @@ -107,6 +111,7 @@ static ssize_t cpu_release_store(struct device *dev, const char *buf, size_t count) { +#ifdef CONFIG_ARCH_CPU_PROBE_RELEASE ssize_t cnt; int ret; @@ -118,12 +123,13 @@ static ssize_t cpu_release_store(struct device *dev, unlock_device_hotplug(); return cnt; +#else + return 0; +#endif } static DEVICE_ATTR(probe, S_IWUSR, NULL, cpu_probe_store); static DEVICE_ATTR(release, S_IWUSR, NULL, cpu_release_store); -#endif /* CONFIG_ARCH_CPU_PROBE_RELEASE */ -#endif /* CONFIG_HOTPLUG_CPU */ #ifdef CONFIG_KEXEC #include @@ -273,14 +279,16 @@ static ssize_t print_cpus_isolated(struct device *dev, } static DEVICE_ATTR(isolated, 0444, print_cpus_isolated, NULL); -#ifdef CONFIG_NO_HZ_FULL static ssize_t print_cpus_nohz_full(struct device *dev, struct device_attribute *attr, char *buf) { +#ifdef CONFIG_NO_HZ_FULL return sysfs_emit(buf, "%*pbl\n", cpumask_pr_args(tick_nohz_full_mask)); +#else + return 0; +#endif } static DEVICE_ATTR(nohz_full, 0444, print_cpus_nohz_full, NULL); -#endif static void cpu_device_release(struct device *dev) { @@ -301,12 +309,12 @@ static void cpu_device_release(struct device *dev) */ } -#ifdef CONFIG_GENERIC_CPU_AUTOPROBE static ssize_t 
print_cpu_modalias(struct device *dev, struct device_attribute *attr, char *buf) { int len = 0; +#ifdef CONFIG_GENERIC_CPU_AUTOPROBE u32 i; len += sysfs_emit_at(buf, len, @@ -322,9 +330,11 @@ static ssize_t print_cpu_modalias(struct device *dev, len += sysfs_emit_at(buf, len, ",%04X", i); } len += sysfs_emit_at(buf, len, "\n"); +#endif return len; } +#ifdef CONFIG_GENERIC_CPU_AUTOPROBE static int cpu_uevent(const struct device *dev, struct kobj_uevent_env *env) { char *buf = kzalloc(PAGE_SIZE, GFP_KERNEL); @@ -451,32 +461,61 @@ struct device *cpu_device_create(struct device *parent, void *drvdata, } EXPORT_SYMBOL_GPL(cpu_device_create); -#ifdef CONFIG_GENERIC_CPU_AUTOPROBE static DEVICE_ATTR(modalias, 0444, print_cpu_modalias, NULL); -#endif static struct attribute *cpu_root_attrs[] = { -#ifdef CONFIG_ARCH_CPU_PROBE_RELEASE _attr_probe.attr, _attr_release.attr, -#endif _attrs[0].attr.attr, _attrs[1].attr.attr, _attrs[2].attr.attr, _attr_kernel_max.attr, _attr_offline.attr, _attr_isolated.a
Re: [PATCH v24 02/10] drivers/base: refactor memory.c to use .is_visible()
I still need to convert the ifdefs within the functions to IS_ENABLED(),
my apologies.
eric

On 6/28/23 13:52, Eric DeVolder wrote:

Greg Kroah-Hartman requested that this file use the .is_visible()
method instead of #ifdefs for the attributes in memory.c.

 static struct attribute *memory_memblk_attrs[] = {
     &dev_attr_phys_index.attr,
     &dev_attr_state.attr,
     &dev_attr_phys_device.attr,
     &dev_attr_removable.attr,
 #ifdef CONFIG_MEMORY_HOTREMOVE
     &dev_attr_valid_zones.attr,
 #endif
     NULL
 };

and

 static struct attribute *memory_root_attrs[] = {
 #ifdef CONFIG_ARCH_MEMORY_PROBE
     &dev_attr_probe.attr,
 #endif
 #ifdef CONFIG_MEMORY_FAILURE
     &dev_attr_soft_offline_page.attr,
     &dev_attr_hard_offline_page.attr,
 #endif
     &dev_attr_block_size_bytes.attr,
     &dev_attr_auto_online_blocks.attr,
     NULL
 };

To that end:
 - the .is_visible() method is implemented, and IS_ENABLED(), rather
   than #ifdef, is used to determine the visibility of the attribute.
 - the DEVICE_ATTR_xx() attributes are moved outside of #ifdefs, so that
   those structs are always present for the memory_memblk_attrs[] and
   memory_root_attrs[].
 - the #ifdefs guarding the attributes in the memory_memblk_attrs[] and
   memory_root_attrs[] are moved to the corresponding callback function;
   as the callback function must exist now that the attribute is always
   compiled-in (though not necessarily visible).

No functionality change intended.
Signed-off-by: Eric DeVolder --- drivers/base/memory.c | 78 +++ 1 file changed, 65 insertions(+), 13 deletions(-) diff --git a/drivers/base/memory.c b/drivers/base/memory.c index b456ac213610..f03eda7e1c9c 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -405,10 +405,12 @@ static int print_allowed_zone(char *buf, int len, int nid, return sysfs_emit_at(buf, len, " %s", zone->name); } +#endif static ssize_t valid_zones_show(struct device *dev, struct device_attribute *attr, char *buf) { +#ifdef CONFIG_MEMORY_HOTREMOVE struct memory_block *mem = to_memory_block(dev); unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr); unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block; @@ -444,9 +446,11 @@ static ssize_t valid_zones_show(struct device *dev, out: len += sysfs_emit_at(buf, len, "\n"); return len; +#else + return 0; +#endif } static DEVICE_ATTR_RO(valid_zones); -#endif static DEVICE_ATTR_RO(phys_index); static DEVICE_ATTR_RW(state); @@ -496,10 +500,10 @@ static DEVICE_ATTR_RW(auto_online_blocks); * as well as ppc64 will do all of their discovery in userspace * and will require this interface. 
*/ -#ifdef CONFIG_ARCH_MEMORY_PROBE static ssize_t probe_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count) { +#ifdef CONFIG_ARCH_MEMORY_PROBE u64 phys_addr; int nid, ret; unsigned long pages_per_block = PAGES_PER_SECTION * sections_per_block; @@ -527,12 +531,13 @@ static ssize_t probe_store(struct device *dev, struct device_attribute *attr, out: unlock_device_hotplug(); return ret; +#else + return 0; +#endif } static DEVICE_ATTR_WO(probe); -#endif -#ifdef CONFIG_MEMORY_FAILURE /* * Support for offlining pages of memory */ @@ -542,6 +547,7 @@ static ssize_t soft_offline_page_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count) { +#ifdef CONFIG_MEMORY_FAILURE int ret; u64 pfn; if (!capable(CAP_SYS_ADMIN)) @@ -551,6 +557,9 @@ static ssize_t soft_offline_page_store(struct device *dev, pfn >>= PAGE_SHIFT; ret = soft_offline_page(pfn, 0); return ret == 0 ? count : ret; +#else + return 0; +#endif } /* Forcibly offline a page, including killing processes. */ @@ -558,6 +567,7 @@ static ssize_t hard_offline_page_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count) { +#ifdef CONFIG_MEMORY_FAILURE int ret; u64 pfn; if (!capable(CAP_SYS_ADMIN)) @@ -569,11 +579,13 @@ static ssize_t hard_offline_page_store(struct device *dev, if (ret == -EOPNOTSUPP) ret = 0; return ret ? ret : count; +#else + return 0; +#endif } static DEVICE_ATTR_WO(soft_offline_page); static DEVICE_ATTR_WO(hard_offline_page); -#endif /* See phys_device_show(). */ int __weak arch_get_memory_phys_device(unsigned long start_pfn) @@ -611,14 +623,35 @@ static struct attribute *memory_memblk_attrs[] = { _attr_state.attr, _attr_phys_device.attr, _attr_removable.attr,
[PATCH v24 00/10] crash: Kernel handling of CPU and memory hot un/plug
d crash_hotplug_handler() to handle_hotplug_event(). And other corrections.
 - Per Baoquan, minimized the parameters to the
   arch_crash_handle_hotplug_event() to hp_action and cpu.
 - Introduce KEXEC_CRASH_HP_INVALID_CPU definition, per Baoquan.
 - Per Sourabh Jain, renamed and repurposed CRASH_HOTPLUG_ELFCOREHDR_SZ
   to CONFIG_CRASH_MAX_MEMORY_RANGES, mirroring kexec-tools change
   by David Hildenbrand. Folded this patch into the x86
   kexec_file_load support patch.

v7: 13apr2022
 https://lkml.org/lkml/2022/4/13/850
 https://lore.kernel.org/lkml/20220413164237.20845-1-eric.devol...@oracle.com/
 - Resolved parameter usage to crash_hotplug_handler(), per Baoquan.

v6: 1apr2022
 https://lkml.org/lkml/2022/4/1/1203
 https://lore.kernel.org/lkml/20220401183040.1624-1-eric.devol...@oracle.com/
 - Reword commit messages and some comment cleanup per Baoquan.
 - Changed elf_index to elfcorehdr_index for clarity.
 - Minor code changes per Baoquan.

v5: 3mar2022
 https://lkml.org/lkml/2022/3/3/674
 https://lore.kernel.org/lkml/20220303162725.49640-1-eric.devol...@oracle.com/
 - Reworded description of CRASH_HOTPLUG_ELFCOREHDR_SZ, per
   David Hildenbrand.
 - Refactored slightly a few patches per Baoquan recommendation.

v4: 9feb2022
 https://lkml.org/lkml/2022/2/9/1406
 https://lore.kernel.org/lkml/20220209195706.51522-1-eric.devol...@oracle.com/
 - Refactored patches per Baoquan suggestions.
 - A few corrections, per Baoquan.

v3: 10jan2022
 https://lkml.org/lkml/2022/1/10/1212
 https://lore.kernel.org/lkml/20220110195727.1682-1-eric.devol...@oracle.com/
 - Rebasing per Baoquan He request.
 - Changed memory notifier per David Hildenbrand.
 - Providing example kexec userspace change in cover letter.

RFC v2: 7dec2021
 https://lkml.org/lkml/2021/12/7/1088
 https://lore.kernel.org/lkml/20211207195204.1582-1-eric.devol...@oracle.com/
 - Acting upon Baoquan He suggestion of removing elfcorehdr from
   the purgatory list of segments, removed purgatory code from
   patchset, and it is significantly simpler now.

RFC v1: 18nov2021
 https://lkml.org/lkml/2021/11/18/845
 https://lore.kernel.org/lkml/2028174948.37435-1-eric.devol...@oracle.com/
 - working patchset demonstrating kernel handling of hotplug updates
   to x86 elfcorehdr for kexec_file_load

RFC: 14dec2020
 https://lkml.org/lkml/2020/12/14/532
 https://lore.kernel.org/lkml/b04ed259-dc5f-7f30-6661-c26f92d90...@oracle.com/
 - proposed concept of allowing kernel to handle hotplug update
   of elfcorehdr
---

Eric DeVolder (10):
  drivers/base: refactor cpu.c to use .is_visible()
  drivers/base: refactor memory.c to use .is_visible()
  crash: move a few code bits to setup support of crash hotplug
  crash: add generic infrastructure for crash hotplug support
  kexec: exclude elfcorehdr from the segment digest
  crash: memory and CPU hotplug sysfs attributes
  x86/crash: add x86 crash hotplug support
  crash: hotplug support for kexec_load()
  crash: change crash_prepare_elf64_headers() to for_each_possible_cpu()
  x86/crash: optimize CPU changes

 .../ABI/testing/sysfs-devices-memory      |   8 +
 .../ABI/testing/sysfs-devices-system-cpu  |   8 +
 .../admin-guide/mm/memory-hotplug.rst     |   8 +
 Documentation/core-api/cpu_hotplug.rst    |  18 +
 arch/x86/Kconfig                          |   3 +
 arch/x86/include/asm/kexec.h              |  18 +
 arch/x86/kernel/crash.c                   | 140 ++-
 drivers/base/cpu.c                        |  83 +++-
 drivers/base/memory.c                     |  91 -
 include/linux/crash_core.h                |   9 +
 include/linux/kexec.h                     |  63 +++-
 include/uapi/linux/kexec.h                |   1 +
 kernel/Kconfig.kexec                      |  35 ++
 kernel/crash_core.c                       | 355 ++
 kernel/kexec.c                            |   5 +
 kernel/kexec_core.c                       |   6 +
 kernel/kexec_file.c                       | 187 +
 kernel/ksysfs.c                           |  15 +
 18 files changed, 819 insertions(+), 234 deletions(-)

-- 
2.31.1

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
[PATCH v24 01/10] drivers/base: refactor cpu.c to use .is_visible()
Greg Kroah-Hartman requested that this file use the .is_visible()
method instead of #ifdefs for the attributes in cpu.c.

 static struct attribute *cpu_root_attrs[] = {
 #ifdef CONFIG_ARCH_CPU_PROBE_RELEASE
    &dev_attr_probe.attr,
    &dev_attr_release.attr,
 #endif
    &cpu_attrs[0].attr.attr,
    &cpu_attrs[1].attr.attr,
    &cpu_attrs[2].attr.attr,
    &dev_attr_kernel_max.attr,
    &dev_attr_offline.attr,
    &dev_attr_isolated.attr,
 #ifdef CONFIG_NO_HZ_FULL
    &dev_attr_nohz_full.attr,
 #endif
 #ifdef CONFIG_GENERIC_CPU_AUTOPROBE
    &dev_attr_modalias.attr,
 #endif
    NULL
 };

To that end:
 - the .is_visible() method is implemented, and IS_ENABLED(), rather
   than #ifdef, is used to determine the visibility of the attribute.
 - the DEVICE_ATTR() attributes are moved outside of #ifdefs, so that
   those structs are always present for the cpu_root_attrs[].
 - the #ifdefs guarding the attributes in the cpu_root_attrs[] are
   moved to the corresponding callback function; as the callback
   function must exist now that the attribute is always compiled-in
   (though not necessarily visible).

No functionality change intended.
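The .is_visible() approach described above can be illustrated with a
small userspace sketch. This is not the code from the patch: the
struct, the SKETCH_* switches (standing in for IS_ENABLED() on the
real config options) and the callback signature are all simplified,
hypothetical stand-ins. The kernel callback takes a kobject, an
attribute and an index and returns a umode_t.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for IS_ENABLED(CONFIG_...): 1 = on, 0 = off. */
#define SKETCH_ARCH_CPU_PROBE_RELEASE 0
#define SKETCH_NO_HZ_FULL             1

/* Simplified stand-in for the kernel's struct attribute. */
struct sketch_attr {
	const char *name;
	unsigned int mode;	/* permission bits, e.g. 0444 */
};

/* Always compiled in, as in the patch: no #ifdef around the definitions. */
static struct sketch_attr attr_probe     = { "probe",     0200 };
static struct sketch_attr attr_nohz_full = { "nohz_full", 0444 };
static struct sketch_attr attr_offline   = { "offline",   0444 };

/* Return 0 to hide an attribute, or its mode to expose it. */
unsigned int sketch_is_visible(const struct sketch_attr *a)
{
	assert(a != NULL);

	if (a == &attr_probe && !SKETCH_ARCH_CPU_PROBE_RELEASE)
		return 0;
	if (a == &attr_nohz_full && !SKETCH_NO_HZ_FULL)
		return 0;

	/* Attributes without a config dependency are always visible. */
	return a->mode;
}
```

The point of the pattern is that the attribute array has one fixed
layout for every configuration; only the callback result changes.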
Signed-off-by: Eric DeVolder --- drivers/base/cpu.c | 67 -- 1 file changed, 53 insertions(+), 14 deletions(-) diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c index c1815b9dae68..75fa46a567a1 100644 --- a/drivers/base/cpu.c +++ b/drivers/base/cpu.c @@ -82,13 +82,14 @@ void unregister_cpu(struct cpu *cpu) per_cpu(cpu_sys_devices, logical_cpu) = NULL; return; } +#endif /* CONFIG_HOTPLUG_CPU */ -#ifdef CONFIG_ARCH_CPU_PROBE_RELEASE static ssize_t cpu_probe_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count) { +#ifdef CONFIG_ARCH_CPU_PROBE_RELEASE ssize_t cnt; int ret; @@ -100,6 +101,9 @@ static ssize_t cpu_probe_store(struct device *dev, unlock_device_hotplug(); return cnt; +#else + return 0; +#endif } static ssize_t cpu_release_store(struct device *dev, @@ -107,6 +111,7 @@ static ssize_t cpu_release_store(struct device *dev, const char *buf, size_t count) { +#ifdef CONFIG_ARCH_CPU_PROBE_RELEASE ssize_t cnt; int ret; @@ -118,12 +123,13 @@ static ssize_t cpu_release_store(struct device *dev, unlock_device_hotplug(); return cnt; +#else + return 0; +#endif } static DEVICE_ATTR(probe, S_IWUSR, NULL, cpu_probe_store); static DEVICE_ATTR(release, S_IWUSR, NULL, cpu_release_store); -#endif /* CONFIG_ARCH_CPU_PROBE_RELEASE */ -#endif /* CONFIG_HOTPLUG_CPU */ #ifdef CONFIG_KEXEC #include @@ -273,14 +279,16 @@ static ssize_t print_cpus_isolated(struct device *dev, } static DEVICE_ATTR(isolated, 0444, print_cpus_isolated, NULL); -#ifdef CONFIG_NO_HZ_FULL static ssize_t print_cpus_nohz_full(struct device *dev, struct device_attribute *attr, char *buf) { +#ifdef CONFIG_NO_HZ_FULL return sysfs_emit(buf, "%*pbl\n", cpumask_pr_args(tick_nohz_full_mask)); +#else + return 0; +#endif } static DEVICE_ATTR(nohz_full, 0444, print_cpus_nohz_full, NULL); -#endif static void cpu_device_release(struct device *dev) { @@ -301,12 +309,12 @@ static void cpu_device_release(struct device *dev) */ } -#ifdef CONFIG_GENERIC_CPU_AUTOPROBE static ssize_t 
print_cpu_modalias(struct device *dev, struct device_attribute *attr, char *buf) { int len = 0; +#ifdef CONFIG_GENERIC_CPU_AUTOPROBE u32 i; len += sysfs_emit_at(buf, len, @@ -322,9 +330,11 @@ static ssize_t print_cpu_modalias(struct device *dev, len += sysfs_emit_at(buf, len, ",%04X", i); } len += sysfs_emit_at(buf, len, "\n"); +#endif return len; } +#ifdef CONFIG_GENERIC_CPU_AUTOPROBE static int cpu_uevent(const struct device *dev, struct kobj_uevent_env *env) { char *buf = kzalloc(PAGE_SIZE, GFP_KERNEL); @@ -451,32 +461,61 @@ struct device *cpu_device_create(struct device *parent, void *drvdata, } EXPORT_SYMBOL_GPL(cpu_device_create); -#ifdef CONFIG_GENERIC_CPU_AUTOPROBE static DEVICE_ATTR(modalias, 0444, print_cpu_modalias, NULL); -#endif static struct attribute *cpu_root_attrs[] = { -#ifdef CONFIG_ARCH_CPU_PROBE_RELEASE _attr_probe.attr, _attr_release.attr, -#endif _attrs[0].attr.attr, _attrs[1].attr.attr, _attrs[2].attr.attr, _attr_kernel_max.attr, _attr_offline.attr, _attr_isolated.attr, -#ifdef CONFIG_NO_HZ_FULL _attr_nohz_full.attr, -#endif -#ifdef CONFIG_GENERIC_CPU_AUTOPROBE _attr_modalias.attr, -#endif NULL }; +static umode_t +cpu_root_attr_i
[PATCH v24 09/10] crash: change crash_prepare_elf64_headers() to for_each_possible_cpu()
The function crash_prepare_elf64_headers() generates the elfcorehdr
which describes the CPUs and memory in the system for the crash kernel.
In particular, it writes out ELF PT_NOTEs for memory regions and the
CPUs in the system.

With respect to the CPUs, the current implementation utilizes
for_each_present_cpu() which means that as CPUs are added and removed,
the elfcorehdr must again be updated to reflect the new set of CPUs.

The reasoning behind the move to use for_each_possible_cpu(), is:

- At kernel boot time, all percpu crash_notes are allocated for all
  possible CPUs; that is, crash_notes are not allocated dynamically
  when CPUs are plugged/unplugged. Thus the crash_notes for each
  possible CPU are always available.

- The crash_prepare_elf64_headers() creates an ELF PT_NOTE per CPU.
  Changing to for_each_possible_cpu() is valid as the crash_notes
  pointed to by each CPU PT_NOTE are present and always valid.

Furthermore, examining a common crash processing path of:

 kernel panic -> crash kernel -> makedumpfile -> 'crash' analyzer
                 elfcorehdr      /proc/vmcore     vmcore

reveals how the ELF CPU PT_NOTEs are utilized:

- Upon panic, each CPU is sent an IPI and shuts itself down, recording
  its state in its crash_notes. When all CPUs are shut down, the
  crash kernel is launched with a pointer to the elfcorehdr.

- The crash kernel via linux/fs/proc/vmcore.c does not examine or
  use the contents of the PT_NOTEs, it exposes them via /proc/vmcore.

- The makedumpfile utility uses /proc/vmcore and reads the CPU
  PT_NOTEs to craft a nr_cpus variable, which is reported in a
  header but otherwise generally unused. Makedumpfile creates the
  vmcore.

- The 'crash' dump analyzer does not appear to reference the CPU
  PT_NOTEs. Instead it looks up the cpu_[possible|present|online]_mask
  symbols and directly examines those structure contents from vmcore
  memory. From that information it is able to determine which CPUs
  are present and online, and locate the corresponding crash_notes.
  Said differently, it appears that 'crash' analyzer does not rely
  on the ELF PT_NOTEs for CPUs; rather it obtains the information
  directly via kernel symbols and the memory within the vmcore.

(There may be other vmcore generating and analysis tools that do use
these PT_NOTEs, but 'makedumpfile' and 'crash' seem to be the most
common solution.)

This results in the benefit of having all CPUs described in the
elfcorehdr, and therefore reducing the need to re-generate the
elfcorehdr on CPU changes, at the small expense of an additional
56 bytes per PT_NOTE for not-present-but-possible CPUs.

On systems where kexec_file_load() syscall is utilized, all the above
is valid. On systems where kexec_load() syscall is utilized, there
may be the need for the elfcorehdr to be regenerated once. The reason
being that some archs only populate the 'present' CPUs from the
/sys/devices/system/cpus entries, which the userspace 'kexec' utility
uses to generate the userspace-supplied elfcorehdr. In this situation,
one memory or CPU change will rewrite the elfcorehdr via the
crash_prepare_elf64_headers() function and now all possible CPUs will
be described, just as with kexec_file_load() syscall.
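The 56-byte figure above is the size of one ELF64 program header. As
a rough illustration (a userspace sketch, not code from the patch;
the struct and function names are hypothetical, and the struct merely
mirrors the standard Elf64_Phdr layout), the extra space spent on
possible-but-not-present CPUs can be estimated as:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Field-for-field mirror of the ELF64 program header (Elf64_Phdr). */
struct sketch_elf64_phdr {
	uint32_t p_type;
	uint32_t p_flags;
	uint64_t p_offset;
	uint64_t p_vaddr;
	uint64_t p_paddr;
	uint64_t p_filesz;
	uint64_t p_memsz;
	uint64_t p_align;
};

/* Extra elfcorehdr bytes used by PT_NOTEs for CPUs that are not present. */
size_t sketch_extra_note_bytes(unsigned int possible, unsigned int present)
{
	assert(possible >= present);
	return (size_t)(possible - present) * sizeof(struct sketch_elf64_phdr);
}
```

For example, a system with 8 possible and 4 present CPUs would carry
4 extra program headers, i.e. 224 additional bytes in the elfcorehdr.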
Suggested-by: Sourabh Jain
Signed-off-by: Eric DeVolder
Reviewed-by: Sourabh Jain
Acked-by: Hari Bathini
Acked-by: Baoquan He
---
 kernel/crash_core.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index fa918176d46d..7378b501fada 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -364,8 +364,8 @@ int crash_prepare_elf64_headers(struct crash_mem *mem, int need_kernel_map,
 	ehdr->e_ehsize = sizeof(Elf64_Ehdr);
 	ehdr->e_phentsize = sizeof(Elf64_Phdr);
 
-	/* Prepare one phdr of type PT_NOTE for each present CPU */
-	for_each_present_cpu(cpu) {
+	/* Prepare one phdr of type PT_NOTE for each possible CPU */
+	for_each_possible_cpu(cpu) {
 		phdr->p_type = PT_NOTE;
 		notes_addr = per_cpu_ptr_to_phys(per_cpu_ptr(crash_notes, cpu));
 		phdr->p_offset = phdr->p_paddr = notes_addr;
-- 
2.31.1
[PATCH v24 07/10] x86/crash: add x86 crash hotplug support
When CPU or memory is hot un/plugged, or off/onlined, the crash
elfcorehdr, which describes the CPUs and memory in the system,
must also be updated.

A new elfcorehdr is generated from the available CPUs and memory
and replaces the existing elfcorehdr. The segment containing the
elfcorehdr is identified at run-time in
crash_core:crash_handle_hotplug_event().

No modifications to purgatory (see 'kexec: exclude elfcorehdr
from the segment digest') or boot_params (as the elfcorehdr=
capture kernel command line parameter pointer remains unchanged
and correct) are needed, just elfcorehdr.

For kexec_file_load(), the elfcorehdr segment size is based on
NR_CPUS and CRASH_MAX_MEMORY_RANGES in order to accommodate a
growing number of CPU and memory resources.

For kexec_load(), the userspace kexec utility needs to size the
elfcorehdr segment in the same/similar manner.

To accommodate kexec_load() syscall in the absence of
kexec_file_load() syscall support, prepare_elf_headers() and
dependents are moved outside of CONFIG_KEXEC_FILE.
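To give a feel for the sizing described above, here is a userspace
sketch of one way such a worst-case size could be computed. The exact
formula is an assumption made for illustration, not taken from the
text above: it assumes one ELF64 file header plus one program header
per CPU, per memory range, and two fixed headers (assumed here to be
VMCOREINFO and the kernel mapping). The function and macro names are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_ELF64_EHDR_SIZE 64u	/* sizeof(Elf64_Ehdr) */
#define SKETCH_ELF64_PHDR_SIZE 56u	/* sizeof(Elf64_Phdr) */

/* Worst-case elfcorehdr size for a given CPU and memory-range budget. */
size_t sketch_elfcorehdr_size(unsigned int nr_cpus, unsigned int max_mem_ranges)
{
	size_t nr_phdrs;

	assert(nr_cpus > 0);

	/* Assumption: two fixed headers in addition to CPUs and memory. */
	nr_phdrs = 2 + (size_t)nr_cpus + (size_t)max_mem_ranges;

	return SKETCH_ELF64_EHDR_SIZE + nr_phdrs * SKETCH_ELF64_PHDR_SIZE;
}
```

Under these assumptions, 8 CPUs and 8192 memory ranges need 459376
bytes, which shows that the memory-range budget dominates the size.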
Signed-off-by: Eric DeVolder Reviewed-by: Sourabh Jain Acked-by: Hari Bathini Acked-by: Baoquan He --- arch/x86/Kconfig | 3 + arch/x86/include/asm/kexec.h | 15 + arch/x86/kernel/crash.c | 103 --- 3 files changed, 114 insertions(+), 7 deletions(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 06a4472d0fc0..42c083da7ce4 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -2058,6 +2058,9 @@ config ARCH_SUPPORTS_KEXEC_JUMP config ARCH_SUPPORTS_CRASH_DUMP def_bool X86_64 || (X86_32 && HIGHMEM) +config ARCH_SUPPORTS_CRASH_HOTPLUG + def_bool y + config PHYSICAL_START hex "Physical address where the kernel is loaded" if (EXPERT || CRASH_DUMP) default "0x100" diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h index 5b77bbc28f96..9143100ea3ea 100644 --- a/arch/x86/include/asm/kexec.h +++ b/arch/x86/include/asm/kexec.h @@ -209,6 +209,21 @@ typedef void crash_vmclear_fn(void); extern crash_vmclear_fn __rcu *crash_vmclear_loaded_vmcss; extern void kdump_nmi_shootdown_cpus(void); +#ifdef CONFIG_CRASH_HOTPLUG +void arch_crash_handle_hotplug_event(struct kimage *image); +#define arch_crash_handle_hotplug_event arch_crash_handle_hotplug_event + +#ifdef CONFIG_HOTPLUG_CPU +static inline int crash_hotplug_cpu_support(void) { return 1; } +#define crash_hotplug_cpu_support crash_hotplug_cpu_support +#endif + +#ifdef CONFIG_MEMORY_HOTPLUG +static inline int crash_hotplug_memory_support(void) { return 1; } +#define crash_hotplug_memory_support crash_hotplug_memory_support +#endif +#endif + #endif /* __ASSEMBLY__ */ #endif /* _ASM_X86_KEXEC_H */ diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c index cdd92ab43cda..c70a111c44fa 100644 --- a/arch/x86/kernel/crash.c +++ b/arch/x86/kernel/crash.c @@ -158,8 +158,6 @@ void native_machine_crash_shutdown(struct pt_regs *regs) crash_save_cpu(regs, safe_smp_processor_id()); } -#ifdef CONFIG_KEXEC_FILE - static int get_nr_ram_ranges_callback(struct resource *res, void *arg) { unsigned int *nr_ranges = 
arg; @@ -231,7 +229,7 @@ static int prepare_elf64_ram_headers_callback(struct resource *res, void *arg) /* Prepare elf headers. Return addr and size */ static int prepare_elf_headers(struct kimage *image, void **addr, - unsigned long *sz) + unsigned long *sz, unsigned long *nr_mem_ranges) { struct crash_mem *cmem; int ret; @@ -249,6 +247,9 @@ static int prepare_elf_headers(struct kimage *image, void **addr, if (ret) goto out; + /* Return the computed number of memory ranges, for hotplug usage */ + *nr_mem_ranges = cmem->nr_ranges; + /* By default prepare 64bit headers */ ret = crash_prepare_elf64_headers(cmem, IS_ENABLED(CONFIG_X86_64), addr, sz); @@ -257,6 +258,7 @@ static int prepare_elf_headers(struct kimage *image, void **addr, return ret; } +#ifdef CONFIG_KEXEC_FILE static int add_e820_entry(struct boot_params *params, struct e820_entry *entry) { unsigned int nr_e820_entries; @@ -371,18 +373,42 @@ int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params) int crash_load_segments(struct kimage *image) { int ret; + unsigned long pnum = 0; struct kexec_buf kbuf = { .image = image, .buf_min = 0, .buf_max = ULONG_MAX, .top_down = false }; /* Prepare elf headers and add a segment */ - ret = prepare_elf_headers(image, , ); + ret = prepare_elf_headers(image, , , ); if (ret) return ret; - image->elf_headers = kbuf.buffer; - image->elf_headers_sz = kbuf.bufsz; + image->elf_headers = kbuf.buffer; + image->elf_he
[PATCH v24 10/10] x86/crash: optimize CPU changes
crash_prepare_elf64_headers() writes into the elfcorehdr an ELF
PT_NOTE for all possible CPUs. As such, subsequent changes to CPUs
(ie. hot un/plug, online/offline) do not need to rewrite the elfcorehdr.

The kimage->file_mode term covers kdump images loaded via the
kexec_file_load() syscall. Since crash_prepare_elf64_headers()
wrote the initial elfcorehdr, no update to the elfcorehdr is
needed for CPU changes.

The kimage->elfcorehdr_updated term covers kdump images loaded via
the kexec_load() syscall. At least one memory or CPU change must occur
to cause crash_prepare_elf64_headers() to rewrite the elfcorehdr.
Afterwards, no update to the elfcorehdr is needed for CPU changes.

This code is intentionally *NOT* hoisted into
crash_handle_hotplug_event() as it would prevent the arch-specific
handler from running for CPU changes. This would break PPC, for
example, which needs to update other information besides the
elfcorehdr, on CPU changes.

Signed-off-by: Eric DeVolder
Reviewed-by: Sourabh Jain
Acked-by: Hari Bathini
Acked-by: Baoquan He
---
 arch/x86/kernel/crash.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index caf22bcb61af..18d2a18d1073 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -467,6 +467,16 @@ void arch_crash_handle_hotplug_event(struct kimage *image)
 	unsigned long mem, memsz;
 	unsigned long elfsz = 0;
 
+	/*
+	 * As crash_prepare_elf64_headers() has already described all
+	 * possible CPUs, there is no need to update the elfcorehdr
+	 * for additional CPU changes.
+	 */
+	if ((image->file_mode || image->elfcorehdr_updated) &&
+		((image->hp_action == KEXEC_CRASH_HP_ADD_CPU) ||
+		(image->hp_action == KEXEC_CRASH_HP_REMOVE_CPU)))
+		return;
+
 	/*
 	 * Create the new elfcorehdr reflecting the changes to CPU and/or
 	 * memory resources.
-- 
2.31.1
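The early-return condition in this patch can be exercised outside the
kernel with a small sketch. The struct below is a hypothetical,
cut-down stand-in holding only the three kimage fields the check
reads, with plain bools for simplicity; the action codes are the ones
the series defines in include/linux/crash_core.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define KEXEC_CRASH_HP_NONE          0
#define KEXEC_CRASH_HP_ADD_CPU       1
#define KEXEC_CRASH_HP_REMOVE_CPU    2
#define KEXEC_CRASH_HP_ADD_MEMORY    3
#define KEXEC_CRASH_HP_REMOVE_MEMORY 4

/* Only the fields the check reads; the real struct kimage is larger. */
struct sketch_kimage {
	bool file_mode;			/* loaded via kexec_file_load() */
	bool elfcorehdr_updated;	/* kernel already rewrote elfcorehdr */
	int hp_action;
};

/* True when a CPU event needs no elfcorehdr rewrite. */
bool sketch_can_skip_update(const struct sketch_kimage *image)
{
	bool all_cpus_described;
	bool is_cpu_event;

	assert(image != NULL);

	all_cpus_described = image->file_mode || image->elfcorehdr_updated;
	is_cpu_event = image->hp_action == KEXEC_CRASH_HP_ADD_CPU ||
		       image->hp_action == KEXEC_CRASH_HP_REMOVE_CPU;

	return all_cpus_described && is_cpu_event;
}
```

Memory events never take the shortcut, and a kexec_load()'d image
only takes it once the kernel has rewritten the elfcorehdr at least
once.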
[PATCH v24 08/10] crash: hotplug support for kexec_load()
The hotplug support for kexec_load() requires changes to the
userspace kexec-tools and a little extra help from the kernel.

Given a kdump capture kernel loaded via kexec_load(), and a
subsequent hotplug event, the crash hotplug handler finds the
elfcorehdr and rewrites it to reflect the hotplug change.
That is the desired outcome, however, at kernel panic time,
the purgatory integrity check fails (because the elfcorehdr
changed), and the capture kernel does not boot and no vmcore
is generated.

Therefore, the userspace kexec-tools/kexec must indicate to the
kernel that the elfcorehdr can be modified (because the kexec
excluded the elfcorehdr from the digest, and sized the elfcorehdr
memory buffer appropriately).

To facilitate hotplug support with kexec_load():
 - a new kexec flag KEXEC_UPDATE_ELFCOREHDR indicates that it is
   safe for the kernel to modify the kexec_load()'d elfcorehdr
 - the /sys/kernel/crash_elfcorehdr_size node communicates the
   preferred size of the elfcorehdr memory buffer
 - The sysfs crash_hotplug nodes (ie.
   /sys/devices/system/[cpu|memory]/crash_hotplug) dynamically
   take into account kexec_file_load() vs kexec_load() and
   KEXEC_UPDATE_ELFCOREHDR.
   This is critical so that the udev rule processing of crash_hotplug
   is all that is needed to determine if the userspace unload-then-load
   of the kdump image is to be skipped, or not. The proposed udev
   rule change looks like:
   # The kernel updates the crash elfcorehdr for CPU and memory changes
   SUBSYSTEM=="cpu", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"
   SUBSYSTEM=="memory", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"

The table below indicates the behavior of kexec_load()'d kdump image
updates (with the new udev crash_hotplug rule in place):

 Kernel |Kexec
 -------+-----+----
 Old    |Old  |New
        |  a  | a
 -------+-----+----
 New    |  a  | b
 -------+-----+----

where kexec 'old' and 'new' indicate whether kexec-tools has the
needed modifications for the crash hotplug feature, and kernel 'old'
and 'new' indicate whether the kernel supports this crash hotplug
feature.

Behavior 'a' indicates the unload-then-reload of the entire kdump
image. For the kexec 'old' column, the unload-then-reload occurs due
to the missing flag KEXEC_UPDATE_ELFCOREHDR. An 'old' kernel (with
'new' kexec) does not present the crash_hotplug sysfs node, which
leads to the unload-then-reload of the kdump image.

Behavior 'b' indicates the desired optimized behavior of the kernel
directly modifying the elfcorehdr and avoiding the unload-then-reload
of the kdump image.

If the udev rule is not updated with the crash_hotplug node check,
then no matter whether the kernel or kexec is new or old, the kdump
image continues to be unloaded-then-reloaded on hotplug changes.

To fully support crash hotplug feature, there needs to be a rollout
of kernel, kexec-tools and udev rule changes. However, the order of
the rollout of these pieces does not matter; kexec_load()'d kdump
images still function for hotplug as-is.
Suggested-by: Hari Bathini Signed-off-by: Eric DeVolder Acked-by: Hari Bathini Acked-by: Baoquan He --- arch/x86/include/asm/kexec.h | 11 +++ arch/x86/kernel/crash.c | 27 +++ include/linux/kexec.h| 14 -- include/uapi/linux/kexec.h | 1 + kernel/Kconfig.kexec | 4 kernel/crash_core.c | 31 +++ kernel/kexec.c | 5 + kernel/ksysfs.c | 15 +++ 8 files changed, 102 insertions(+), 6 deletions(-) diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h index 9143100ea3ea..3be6a98751f0 100644 --- a/arch/x86/include/asm/kexec.h +++ b/arch/x86/include/asm/kexec.h @@ -214,14 +214,17 @@ void arch_crash_handle_hotplug_event(struct kimage *image); #define arch_crash_handle_hotplug_event arch_crash_handle_hotplug_event #ifdef CONFIG_HOTPLUG_CPU -static inline int crash_hotplug_cpu_support(void) { return 1; } -#define crash_hotplug_cpu_support crash_hotplug_cpu_support +int arch_crash_hotplug_cpu_support(void); +#define crash_hotplug_cpu_support arch_crash_hotplug_cpu_support #endif #ifdef CONFIG_MEMORY_HOTPLUG -static inline int crash_hotplug_memory_support(void) { return 1; } -#define crash_hotplug_memory_support crash_hotplug_memory_support +int arch_crash_hotplug_memory_support(void); +#define crash_hotplug_memory_support arch_crash_hotplug_memory_support #endif + +unsigned int arch_crash_get_elfcorehdr_size(void); +#define crash_get_elfcorehdr_size arch_crash_get_elfcorehdr_size #endif #endif /* __ASSEMBLY__ */ diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c index c70a111c44fa..caf22bcb61af 100644 --- a/arch/x86/kernel/crash.c +++ b/arch/x86/kernel/crash.c @@ -427,6 +427,33 @@ int crash_load_segments(struct kimage *image) #undef pr_fmt #define pr_fmt(fmt) "cras
[PATCH v24 06/10] crash: memory and CPU hotplug sysfs attributes
Introduce the crash_hotplug attribute for memory and CPUs for
use by userspace. These attributes directly facilitate the udev
rule for managing userspace re-loading of the crash kernel upon
hot un/plug changes.

For memory, expose the crash_hotplug attribute to the
/sys/devices/system/memory directory. For example:

 # udevadm info --attribute-walk /sys/devices/system/memory/memory81
  looking at device '/devices/system/memory/memory81':
    KERNEL=="memory81"
    SUBSYSTEM=="memory"
    DRIVER==""
    ATTR{online}=="1"
    ATTR{phys_device}=="0"
    ATTR{phys_index}=="0051"
    ATTR{removable}=="1"
    ATTR{state}=="online"
    ATTR{valid_zones}=="Movable"

  looking at parent device '/devices/system/memory':
    KERNELS=="memory"
    SUBSYSTEMS==""
    DRIVERS==""
    ATTRS{auto_online_blocks}=="offline"
    ATTRS{block_size_bytes}=="800"
    ATTRS{crash_hotplug}=="1"

For CPUs, expose the crash_hotplug attribute to the
/sys/devices/system/cpu directory. For example:

 # udevadm info --attribute-walk /sys/devices/system/cpu/cpu0
  looking at device '/devices/system/cpu/cpu0':
    KERNEL=="cpu0"
    SUBSYSTEM=="cpu"
    DRIVER=="processor"
    ATTR{crash_notes}=="277c38600"
    ATTR{crash_notes_size}=="368"
    ATTR{online}=="1"

  looking at parent device '/devices/system/cpu':
    KERNELS=="cpu"
    SUBSYSTEMS==""
    DRIVERS==""
    ATTRS{crash_hotplug}=="1"
    ATTRS{isolated}==""
    ATTRS{kernel_max}=="8191"
    ATTRS{nohz_full}=="  (null)"
    ATTRS{offline}=="4-7"
    ATTRS{online}=="0-3"
    ATTRS{possible}=="0-7"
    ATTRS{present}=="0-3"

With these sysfs attributes in place, it is possible to efficiently
instruct the udev rule to skip crash kernel reloading for kernels
configured with crash hotplug support.
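Besides udev, any userspace tool could consult these attributes
directly. The following is a minimal sketch of such a reader, not
part of the patch: the function name is hypothetical, and it treats a
missing or unparsable file as "no kernel support", which matches the
description that the attribute is simply absent on kernels or
configurations without the feature.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Read an integer sysfs attribute such as
 * /sys/devices/system/cpu/crash_hotplug.
 * Returns the value read, or -1 if the file is missing or unparsable.
 */
int sketch_read_crash_hotplug(const char *path)
{
	FILE *f;
	int value = -1;

	assert(path != NULL);

	f = fopen(path, "r");
	if (!f)
		return -1;

	if (fscanf(f, "%d", &value) != 1)
		value = -1;

	fclose(f);
	return value;
}
```

A caller would skip the unload-then-reload of the kdump image only
when the function returns 1, and fall back to reloading otherwise.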
For example, the following is the proposed udev rule change for RHEL
system 98-kexec.rules (as the first lines of the rule file):

 # The kernel updates the crash elfcorehdr for CPU and memory changes
 SUBSYSTEM=="cpu", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"
 SUBSYSTEM=="memory", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"

When examined in the context of 98-kexec.rules, the above rules
test if crash_hotplug is set, and if so, the userspace initiated
unload-then-reload of the crash kernel is skipped.

CPU and memory checks are separated in accordance with
CONFIG_HOTPLUG_CPU and CONFIG_MEMORY_HOTPLUG kernel config options.
If an architecture supports, for example, memory hotplug but not
CPU hotplug, then the /sys/devices/system/memory/crash_hotplug
attribute file is present, but the /sys/devices/system/cpu/crash_hotplug
attribute file will NOT be present. Thus the udev rule skips
userspace processing of memory hot un/plug events, but the udev
rule will evaluate false for CPU events, thus allowing userspace to
process CPU hot un/plug events (ie the unload-then-reload of the kdump
capture kernel).

Signed-off-by: Eric DeVolder
Reviewed-by: Sourabh Jain
Acked-by: Hari Bathini
Acked-by: Baoquan He
---
 Documentation/ABI/testing/sysfs-devices-memory |  8
 .../ABI/testing/sysfs-devices-system-cpu       |  8
 .../admin-guide/mm/memory-hotplug.rst          |  8
 Documentation/core-api/cpu_hotplug.rst         | 18 ++
 drivers/base/cpu.c                             | 16 ++--
 drivers/base/memory.c                          | 13 +
 include/linux/kexec.h                          |  8
 7 files changed, 77 insertions(+), 2 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-devices-memory b/Documentation/ABI/testing/sysfs-devices-memory
index d8b0f80b9e33..c50725ebebb7 100644
--- a/Documentation/ABI/testing/sysfs-devices-memory
+++ b/Documentation/ABI/testing/sysfs-devices-memory
@@ -110,3 +110,11 @@ Description:
 		link is created for memory section 9 on node0.
/sys/devices/system/node/node0/memory9 -> ../../memory/memory9 + +What: /sys/devices/system/cpu/crash_hotplug +Date: Jun 2023 +Contact: Linux kernel mailing list +Description: + (RO) indicates whether or not the kernel directly supports + modifying the crash elfcorehdr for memory hot un/plug and/or + on/offline changes. diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu index ecd585ca2d50..598b0fa67481 100644 --- a/Documentation/ABI/testing/sysfs-devices-system-cpu +++ b/Documentation/ABI/testing/sysfs-devices-system-cpu @@ -686,3 +686,11 @@ Description: (RO) the list of C
[PATCH v24 04/10] crash: add generic infrastructure for crash hotplug support
To support crash hotplug, a mechanism is needed to update the crash
elfcorehdr upon CPU or memory changes (eg. hot un/plug or off/onlining).
The crash elfcorehdr describes the CPUs and memory to be written into
the vmcore.

To track CPU changes, callbacks are registered with the cpuhp
mechanism via cpuhp_setup_state_nocalls(CPUHP_BP_PREPARE_DYN). The
crash hotplug elfcorehdr update has no explicit ordering requirement
(relative to other cpuhp states), so meets the criteria for
utilizing CPUHP_BP_PREPARE_DYN. CPUHP_BP_PREPARE_DYN is a dynamic
state and avoids the need to introduce a new state for crash
hotplug. Also, CPUHP_BP_PREPARE_DYN is the last state in the PREPARE
group, just prior to the STARTING group, which is very close to the
CPU starting up in a plug/online situation, or stopping in a
unplug/offline situation. This minimizes the window of time during
an actual plug/online or unplug/offline situation in which the
elfcorehdr would be inaccurate. Note that for a CPU being unplugged
or offlined, the CPU will still be present in the list of CPUs
generated by crash_prepare_elf64_headers(). However, there is no
need to explicitly omit the CPU, see justification in
'crash: change crash_prepare_elf64_headers() to for_each_possible_cpu()'.

To track memory changes, a notifier is registered to capture the
memblock MEM_ONLINE and MEM_OFFLINE events via register_memory_notifier().

The CPU callbacks and memory notifiers invoke crash_handle_hotplug_event()
which performs needed tasks and then dispatches the event to the
architecture specific arch_crash_handle_hotplug_event() to update the
elfcorehdr with the current state of CPUs and memory. During the
process, the kexec_lock is held.
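The dispatch flow described above can be modelled in userspace as
follows. This is only a sketch under stated assumptions: the event
enum, the cut-down image struct and all sketch_* names are
hypothetical, locking is omitted, and resetting hp_action once the
arch handler returns is an assumption of this sketch. Only the
KEXEC_CRASH_HP_* codes come from the series.

```c
#include <assert.h>
#include <stddef.h>

#define KEXEC_CRASH_HP_NONE          0
#define KEXEC_CRASH_HP_ADD_MEMORY    3
#define KEXEC_CRASH_HP_REMOVE_MEMORY 4

/* Hypothetical stand-ins for memory notifier events. */
enum sketch_mem_event {
	SKETCH_MEM_GOING_ONLINE,
	SKETCH_MEM_ONLINE,
	SKETCH_MEM_OFFLINE,
};

/* Cut-down image: current action, last action seen by arch code, count. */
struct sketch_kimage {
	int hp_action;
	int last_action;
	unsigned int updates;
};

/* Stand-in for the arch handler that would rewrite the elfcorehdr. */
static void sketch_arch_handler(struct sketch_kimage *image)
{
	image->last_action = image->hp_action;
	image->updates++;
}

/* Generic layer: record the action, call the arch handler, then reset. */
void sketch_handle_hotplug_event(struct sketch_kimage *image, int hp_action)
{
	assert(image != NULL);

	image->hp_action = hp_action;
	sketch_arch_handler(image);
	image->hp_action = KEXEC_CRASH_HP_NONE;
}

/* Notifier: only completed online/offline events are dispatched. */
int sketch_memory_notifier(struct sketch_kimage *image, enum sketch_mem_event ev)
{
	switch (ev) {
	case SKETCH_MEM_ONLINE:
		sketch_handle_hotplug_event(image, KEXEC_CRASH_HP_ADD_MEMORY);
		return 1;
	case SKETCH_MEM_OFFLINE:
		sketch_handle_hotplug_event(image, KEXEC_CRASH_HP_REMOVE_MEMORY);
		return 1;
	default:
		return 0;	/* event ignored */
	}
}
```

Keeping the translation from event to action code in the generic
layer means an arch handler only has to inspect hp_action.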
Signed-off-by: Eric DeVolder Reviewed-by: Sourabh Jain Acked-by: Hari Bathini Acked-by: Baoquan He --- include/linux/crash_core.h | 9 +++ include/linux/kexec.h | 11 +++ kernel/Kconfig.kexec | 31 kernel/crash_core.c| 142 + kernel/kexec_core.c| 6 ++ 5 files changed, 199 insertions(+) diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h index de62a722431e..e14345cc7a22 100644 --- a/include/linux/crash_core.h +++ b/include/linux/crash_core.h @@ -84,4 +84,13 @@ int parse_crashkernel_high(char *cmdline, unsigned long long system_ram, int parse_crashkernel_low(char *cmdline, unsigned long long system_ram, unsigned long long *crash_size, unsigned long long *crash_base); +#define KEXEC_CRASH_HP_NONE0 +#define KEXEC_CRASH_HP_ADD_CPU 1 +#define KEXEC_CRASH_HP_REMOVE_CPU 2 +#define KEXEC_CRASH_HP_ADD_MEMORY 3 +#define KEXEC_CRASH_HP_REMOVE_MEMORY 4 +#define KEXEC_CRASH_HP_INVALID_CPU -1U + +struct kimage; + #endif /* LINUX_CRASH_CORE_H */ diff --git a/include/linux/kexec.h b/include/linux/kexec.h index 811a90e09698..b9903dd48e24 100644 --- a/include/linux/kexec.h +++ b/include/linux/kexec.h @@ -33,6 +33,7 @@ extern note_buf_t __percpu *crash_notes; #include #include #include +#include #include /* Verify architecture specific macros are defined */ @@ -360,6 +361,12 @@ struct kimage { struct purgatory_info purgatory_info; #endif +#ifdef CONFIG_CRASH_HOTPLUG + int hp_action; + int elfcorehdr_index; + bool elfcorehdr_updated; +#endif + #ifdef CONFIG_IMA_KEXEC /* Virtual address of IMA measurement buffer for kexec syscall */ void *ima_buffer; @@ -490,6 +497,10 @@ static inline int arch_kexec_post_alloc_pages(void *vaddr, unsigned int pages, g static inline void arch_kexec_pre_free_pages(void *vaddr, unsigned int pages) { } #endif +#ifndef arch_crash_handle_hotplug_event +static inline void arch_crash_handle_hotplug_event(struct kimage *image) { } +#endif + #else /* !CONFIG_KEXEC_CORE */ struct pt_regs; struct task_struct; diff --git a/kernel/Kconfig.kexec 
b/kernel/Kconfig.kexec index 5d576ddfd999..7eb42a795176 100644 --- a/kernel/Kconfig.kexec +++ b/kernel/Kconfig.kexec @@ -107,4 +107,35 @@ config CRASH_DUMP For s390, this option also enables zfcpdump. See also +config CRASH_HOTPLUG + bool "Update the crash elfcorehdr on system configuration changes" + default y + depends on CRASH_DUMP && (HOTPLUG_CPU || MEMORY_HOTPLUG) + depends on ARCH_SUPPORTS_CRASH_HOTPLUG + help + Enable direct update to the crash elfcorehdr (which contains + the list of CPUs and memory regions to be dumped upon a crash) + in response to hot plug/unplug or online/offline of CPUs or + memory. This is a much more advanced approach than userspace + attempting that. + + If unsure, say Y. + +config CRASH_MAX_MEMORY_RANGES + int "Specify the maximum number of memory regions for the elfcorehdr" + default 8192 + depends on CRASH_HOTPLUG + help +
[PATCH v24 03/10] crash: move a few code bits to setup support of crash hotplug
The crash hotplug support leans on the work for the kexec_file_load() syscall. To also support the kexec_load() syscall, a few bits of code need to be moved outside of CONFIG_KEXEC_FILE. As such, these bits are moved out of kexec_file.c and into a common location, crash_core.c. No functionality change intended. Signed-off-by: Eric DeVolder Reviewed-by: Sourabh Jain Acked-by: Hari Bathini Acked-by: Baoquan He --- include/linux/kexec.h | 30 +++ kernel/crash_core.c | 182 ++ kernel/kexec_file.c | 181 - 3 files changed, 197 insertions(+), 196 deletions(-) diff --git a/include/linux/kexec.h b/include/linux/kexec.h index 22b5cd24f581..811a90e09698 100644 --- a/include/linux/kexec.h +++ b/include/linux/kexec.h @@ -105,6 +105,21 @@ struct compat_kexec_segment { }; #endif +/* Alignment required for elf header segment */ +#define ELF_CORE_HEADER_ALIGN 4096 + +struct crash_mem { + unsigned int max_nr_ranges; + unsigned int nr_ranges; + struct range ranges[]; +}; + +extern int crash_exclude_mem_range(struct crash_mem *mem, + unsigned long long mstart, + unsigned long long mend); +extern int crash_prepare_elf64_headers(struct crash_mem *mem, int need_kernel_map, + void **addr, unsigned long *sz); + #ifdef CONFIG_KEXEC_FILE struct purgatory_info { /* @@ -230,21 +245,6 @@ static inline int arch_kexec_locate_mem_hole(struct kexec_buf *kbuf) } #endif -/* Alignment required for elf header segment */ -#define ELF_CORE_HEADER_ALIGN 4096 - -struct crash_mem { - unsigned int max_nr_ranges; - unsigned int nr_ranges; - struct range ranges[]; -}; - -extern int crash_exclude_mem_range(struct crash_mem *mem, - unsigned long long mstart, - unsigned long long mend); -extern int crash_prepare_elf64_headers(struct crash_mem *mem, int need_kernel_map, - void **addr, unsigned long *sz); - #ifndef arch_kexec_apply_relocations_add /* * arch_kexec_apply_relocations_add - apply relocations of type RELA diff --git a/kernel/crash_core.c b/kernel/crash_core.c index 90ce1dfd591c..b7c30b748a16 100644 --- 
a/kernel/crash_core.c +++ b/kernel/crash_core.c @@ -10,6 +10,7 @@ #include #include #include +#include #include #include @@ -314,6 +315,187 @@ static int __init parse_crashkernel_dummy(char *arg) } early_param("crashkernel", parse_crashkernel_dummy); +int crash_prepare_elf64_headers(struct crash_mem *mem, int need_kernel_map, + void **addr, unsigned long *sz) +{ + Elf64_Ehdr *ehdr; + Elf64_Phdr *phdr; + unsigned long nr_cpus = num_possible_cpus(), nr_phdr, elf_sz; + unsigned char *buf; + unsigned int cpu, i; + unsigned long long notes_addr; + unsigned long mstart, mend; + + /* extra phdr for vmcoreinfo ELF note */ + nr_phdr = nr_cpus + 1; + nr_phdr += mem->nr_ranges; + + /* +* kexec-tools creates an extra PT_LOAD phdr for kernel text mapping +* area (for example, ffffffff80000000 - ffffffffa0000000 on x86_64). +* I think this is required by tools like gdb. So same physical +* memory will be mapped in two ELF headers. One will contain kernel +* text virtual addresses and other will have __va(physical) addresses. 
+*/ + + nr_phdr++; + elf_sz = sizeof(Elf64_Ehdr) + nr_phdr * sizeof(Elf64_Phdr); + elf_sz = ALIGN(elf_sz, ELF_CORE_HEADER_ALIGN); + + buf = vzalloc(elf_sz); + if (!buf) + return -ENOMEM; + + ehdr = (Elf64_Ehdr *)buf; + phdr = (Elf64_Phdr *)(ehdr + 1); + memcpy(ehdr->e_ident, ELFMAG, SELFMAG); + ehdr->e_ident[EI_CLASS] = ELFCLASS64; + ehdr->e_ident[EI_DATA] = ELFDATA2LSB; + ehdr->e_ident[EI_VERSION] = EV_CURRENT; + ehdr->e_ident[EI_OSABI] = ELF_OSABI; + memset(ehdr->e_ident + EI_PAD, 0, EI_NIDENT - EI_PAD); + ehdr->e_type = ET_CORE; + ehdr->e_machine = ELF_ARCH; + ehdr->e_version = EV_CURRENT; + ehdr->e_phoff = sizeof(Elf64_Ehdr); + ehdr->e_ehsize = sizeof(Elf64_Ehdr); + ehdr->e_phentsize = sizeof(Elf64_Phdr); + + /* Prepare one phdr of type PT_NOTE for each present CPU */ + for_each_present_cpu(cpu) { + phdr->p_type = PT_NOTE; + notes_addr = per_cpu_ptr_to_phys(per_cpu_ptr(crash_notes, cpu)); + phdr->p_offset = phdr->p_paddr = notes_addr; + phdr->p_filesz = phdr->p_memsz = sizeof(note_buf_t); + (ehdr->e_phnum)++; + phdr++; + } + + /* Prepare one PT_NOTE header for vmcoreinfo */ + phdr->p_type = PT_NOTE; +
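The `struct crash_mem` moved by this patch ends in a flexible array member, so the caller sizes the allocation for the maximum number of ranges it expects. The following is a minimal user-space sketch of that layout, assuming a stand-in `struct range` with inclusive bounds. The helpers `crash_mem_alloc()` and `crash_mem_add()` are hypothetical and are not part of the patch.

```c
#include <stdlib.h>

/* User-space stand-in for the kernel's struct range (assumption). */
struct range {
	unsigned long long start;
	unsigned long long end;
};

/* Same member layout as the struct crash_mem moved by this patch. */
struct crash_mem {
	unsigned int max_nr_ranges;
	unsigned int nr_ranges;
	struct range ranges[];
};

/* Hypothetical helper: allocate room for up to max_nr_ranges entries. */
static struct crash_mem *crash_mem_alloc(unsigned int max_nr_ranges)
{
	struct crash_mem *mem;

	mem = calloc(1, sizeof(*mem) +
			(size_t)max_nr_ranges * sizeof(mem->ranges[0]));
	if (!mem)
		return NULL;
	mem->max_nr_ranges = max_nr_ranges;
	return mem;
}

/* Hypothetical helper: append a range, refusing to overflow the array. */
static int crash_mem_add(struct crash_mem *mem,
			 unsigned long long start, unsigned long long end)
{
	if (mem->nr_ranges >= mem->max_nr_ranges)
		return -1;
	mem->ranges[mem->nr_ranges].start = start;
	mem->ranges[mem->nr_ranges].end = end;
	mem->nr_ranges++;
	return 0;
}
```

Keeping `max_nr_ranges` next to `nr_ranges` lets any function that splits or appends ranges check for room before writing.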
[PATCH v24 05/10] kexec: exclude elfcorehdr from the segment digest
When a crash kernel is loaded via the kexec_file_load() syscall, the kernel places the various segments (i.e., crash kernel, crash initrd, boot_params, elfcorehdr, purgatory, etc.) in memory. For those architectures that utilize purgatory, a hash digest of the segments is calculated for integrity checking. The digest is embedded into the purgatory image prior to placing it in memory. Updates to the elfcorehdr in response to CPU and memory changes would cause the purgatory integrity check to fail (at crash time, so no vmcore would be created). Therefore, the elfcorehdr segment is explicitly excluded from the purgatory digest, enabling updates to the elfcorehdr while also avoiding the need to recompute the hash digest and reload purgatory. Signed-off-by: Eric DeVolder Reviewed-by: Sourabh Jain Acked-by: Hari Bathini Acked-by: Baoquan He --- kernel/kexec_file.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c index e9cf9e8d8f01..824ffc5282f4 100644 --- a/kernel/kexec_file.c +++ b/kernel/kexec_file.c @@ -726,6 +726,12 @@ static int kexec_calculate_store_digests(struct kimage *image) for (j = i = 0; i < image->nr_segments; i++) { struct kexec_segment *ksegment; +#ifdef CONFIG_CRASH_HOTPLUG + /* Exclude elfcorehdr segment to allow future changes via hotplug */ + if (j == image->elfcorehdr_index) + continue; +#endif + ksegment = &image->segment[i]; /* * Skip purgatory as it will be modified once we put digest -- 2.31.1 ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
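To illustrate why excluding one segment from the digest makes it safe to rewrite that segment later, here is a small user-space sketch. It is not the kernel implementation: FNV-1a stands in for the SHA-256 digest, `struct segment` is a reduced stand-in, and `digest_segments()` is a hypothetical name. The sketch compares `skip_index` against the loop index over all segments, on the assumption that the elfcorehdr index is a position in the segment array.

```c
#include <stddef.h>
#include <stdint.h>

/* Reduced stand-in for a loaded kexec segment (assumption). */
struct segment {
	const unsigned char *buf;
	size_t len;
};

/*
 * FNV-1a, used here only as a stand-in for the SHA-256 digest that
 * kexec_calculate_store_digests() computes in the kernel.
 */
static uint64_t fnv1a_update(uint64_t h, const unsigned char *p, size_t n)
{
	while (n--) {
		h ^= *p++;
		h *= 0x100000001b3ULL;	/* FNV-1a 64-bit prime */
	}
	return h;
}

/* Digest every segment except the one at skip_index (the elfcorehdr). */
static uint64_t digest_segments(const struct segment *seg, size_t nr,
				size_t skip_index)
{
	uint64_t h = 0xcbf29ce484222325ULL;	/* FNV-1a offset basis */
	size_t i;

	for (i = 0; i < nr; i++) {
		if (i == skip_index)
			continue;
		h = fnv1a_update(h, seg[i].buf, seg[i].len);
	}
	return h;
}
```

With this shape, rewriting the bytes of the skipped segment leaves the digest unchanged, while a change to any other segment alters it. That is the property the commit message relies on to avoid reloading purgatory.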
[PATCH v24 02/10] drivers/base: refactor memory.c to use .is_visible()
Greg Kroah-Hartman requested that this file use the .is_visible() method instead of #ifdefs for the attributes in memory.c:

 static struct attribute *memory_memblk_attrs[] = {
     &dev_attr_phys_index.attr,
     &dev_attr_state.attr,
     &dev_attr_phys_device.attr,
     &dev_attr_removable.attr,
 #ifdef CONFIG_MEMORY_HOTREMOVE
     &dev_attr_valid_zones.attr,
 #endif
     NULL
 };

and

 static struct attribute *memory_root_attrs[] = {
 #ifdef CONFIG_ARCH_MEMORY_PROBE
     &dev_attr_probe.attr,
 #endif
 #ifdef CONFIG_MEMORY_FAILURE
     &dev_attr_soft_offline_page.attr,
     &dev_attr_hard_offline_page.attr,
 #endif
     &dev_attr_block_size_bytes.attr,
     &dev_attr_auto_online_blocks.attr,
     NULL
 };

To that end:
- the .is_visible() method is implemented, and IS_ENABLED(), rather than #ifdef, is used to determine the visibility of the attribute.
- the DEVICE_ATTR_xx() attributes are moved outside of #ifdefs, so that those structs are always present for the memory_memblk_attrs[] and memory_root_attrs[].
- the #ifdefs guarding the attributes in the memory_memblk_attrs[] and memory_root_attrs[] are moved to the corresponding callback function, as the callback function must exist now that the attribute is always compiled-in (though not necessarily visible).

No functionality change intended.
Signed-off-by: Eric DeVolder --- drivers/base/memory.c | 78 +++ 1 file changed, 65 insertions(+), 13 deletions(-) diff --git a/drivers/base/memory.c b/drivers/base/memory.c index b456ac213610..f03eda7e1c9c 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -405,10 +405,12 @@ static int print_allowed_zone(char *buf, int len, int nid, return sysfs_emit_at(buf, len, " %s", zone->name); } +#endif static ssize_t valid_zones_show(struct device *dev, struct device_attribute *attr, char *buf) { +#ifdef CONFIG_MEMORY_HOTREMOVE struct memory_block *mem = to_memory_block(dev); unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr); unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block; @@ -444,9 +446,11 @@ static ssize_t valid_zones_show(struct device *dev, out: len += sysfs_emit_at(buf, len, "\n"); return len; +#else + return 0; +#endif } static DEVICE_ATTR_RO(valid_zones); -#endif static DEVICE_ATTR_RO(phys_index); static DEVICE_ATTR_RW(state); @@ -496,10 +500,10 @@ static DEVICE_ATTR_RW(auto_online_blocks); * as well as ppc64 will do all of their discovery in userspace * and will require this interface. 
*/ -#ifdef CONFIG_ARCH_MEMORY_PROBE static ssize_t probe_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count) { +#ifdef CONFIG_ARCH_MEMORY_PROBE u64 phys_addr; int nid, ret; unsigned long pages_per_block = PAGES_PER_SECTION * sections_per_block; @@ -527,12 +531,13 @@ static ssize_t probe_store(struct device *dev, struct device_attribute *attr, out: unlock_device_hotplug(); return ret; +#else + return 0; +#endif } static DEVICE_ATTR_WO(probe); -#endif -#ifdef CONFIG_MEMORY_FAILURE /* * Support for offlining pages of memory */ @@ -542,6 +547,7 @@ static ssize_t soft_offline_page_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count) { +#ifdef CONFIG_MEMORY_FAILURE int ret; u64 pfn; if (!capable(CAP_SYS_ADMIN)) @@ -551,6 +557,9 @@ static ssize_t soft_offline_page_store(struct device *dev, pfn >>= PAGE_SHIFT; ret = soft_offline_page(pfn, 0); return ret == 0 ? count : ret; +#else + return 0; +#endif } /* Forcibly offline a page, including killing processes. */ @@ -558,6 +567,7 @@ static ssize_t hard_offline_page_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count) { +#ifdef CONFIG_MEMORY_FAILURE int ret; u64 pfn; if (!capable(CAP_SYS_ADMIN)) @@ -569,11 +579,13 @@ static ssize_t hard_offline_page_store(struct device *dev, if (ret == -EOPNOTSUPP) ret = 0; return ret ? ret : count; +#else + return 0; +#endif } static DEVICE_ATTR_WO(soft_offline_page); static DEVICE_ATTR_WO(hard_offline_page); -#endif /* See phys_device_show(). */ int __weak arch_get_memory_phys_device(unsigned long start_pfn) @@ -611,14 +623,35 @@ static struct attribute *memory_memblk_attrs[] = { _attr_state.attr, _attr_phys_device.attr, _attr_removable.attr, -#ifdef CONFIG_MEMORY_HOTREMOVE _attr_valid_zones.attr, -#endif NULL }; +static umode_t +memory_memblk_attr_is_visible(struct kobject *kobj, + struct attribute *attr
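The visibility pattern this patch moves to can be sketched in user space as follows. This is only an illustration: `struct attribute` is a reduced stand-in, the `SKETCH_*` macros play the role of `IS_ENABLED(CONFIG_...)`, and the callback drops the `struct kobject *` and index parameters that the kernel's `.is_visible()` takes. All `sketch_*` names are hypothetical.

```c
#include <stddef.h>

/* Hypothetical build-time switches standing in for IS_ENABLED(CONFIG_...). */
#define SKETCH_MEMORY_HOTREMOVE 0
#define SKETCH_MEMORY_FAILURE   1

/* Reduced stand-in for the kernel's struct attribute (assumption). */
struct attribute {
	const char *name;
	unsigned short mode;
};

static struct attribute attr_state = { "state", 0644 };
static struct attribute attr_valid_zones = { "valid_zones", 0444 };
static struct attribute attr_soft_offline_page = { "soft_offline_page", 0200 };

/* Always compiled in; whether an entry is shown is decided by the callback. */
static struct attribute *sketch_attrs[] = {
	&attr_state,
	&attr_valid_zones,
	&attr_soft_offline_page,
	NULL
};

/* Return the file mode to expose the attribute, or 0 to hide it. */
static unsigned short sketch_attr_is_visible(const struct attribute *attr)
{
	if (attr == &attr_valid_zones)
		return SKETCH_MEMORY_HOTREMOVE ? attr->mode : 0;
	if (attr == &attr_soft_offline_page)
		return SKETCH_MEMORY_FAILURE ? attr->mode : 0;
	return attr->mode;
}

/* Count the attributes that would be exposed. */
static size_t sketch_count_visible(void)
{
	size_t n = 0;
	struct attribute **a;

	for (a = sketch_attrs; *a; a++)
		if (sketch_attr_is_visible(*a))
			n++;
	return n;
}
```

The attribute array no longer needs any preprocessor conditionals. The configuration test lives in one function, and a hidden attribute costs only its static definition.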
Re: [PATCH v23 4/8] crash: memory and CPU hotplug sysfs attributes
On 6/13/23 03:03, Greg KH wrote: On Mon, Jun 12, 2023 at 05:07:08PM -0400, Eric DeVolder wrote: Introduce the crash_hotplug attribute for memory and CPUs for use by userspace. These attributes directly facilitate the udev rule for managing userspace re-loading of the crash kernel upon hot un/plug changes. For memory, expose the crash_hotplug attribute to the /sys/devices/system/memory directory. For example: # udevadm info --attribute-walk /sys/devices/system/memory/memory81 looking at device '/devices/system/memory/memory81': KERNEL=="memory81" SUBSYSTEM=="memory" DRIVER=="" ATTR{online}=="1" ATTR{phys_device}=="0" ATTR{phys_index}=="0051" ATTR{removable}=="1" ATTR{state}=="online" ATTR{valid_zones}=="Movable" looking at parent device '/devices/system/memory': KERNELS=="memory" SUBSYSTEMS=="" DRIVERS=="" ATTRS{auto_online_blocks}=="offline" ATTRS{block_size_bytes}=="800" ATTRS{crash_hotplug}=="1" For CPUs, expose the crash_hotplug attribute to the /sys/devices/system/cpu directory. For example: # udevadm info --attribute-walk /sys/devices/system/cpu/cpu0 looking at device '/devices/system/cpu/cpu0': KERNEL=="cpu0" SUBSYSTEM=="cpu" DRIVER=="processor" ATTR{crash_notes}=="277c38600" ATTR{crash_notes_size}=="368" ATTR{online}=="1" looking at parent device '/devices/system/cpu': KERNELS=="cpu" SUBSYSTEMS=="" DRIVERS=="" ATTRS{crash_hotplug}=="1" ATTRS{isolated}=="" ATTRS{kernel_max}=="8191" ATTRS{nohz_full}==" (null)" ATTRS{offline}=="4-7" ATTRS{online}=="0-3" ATTRS{possible}=="0-7" ATTRS{present}=="0-3" With these sysfs attributes in place, it is possible to efficiently instruct the udev rule to skip crash kernel reloading for kernels configured with crash hotplug support. 
For example, the following is the proposed udev rule change for RHEL system 98-kexec.rules (as the first lines of the rule file): # The kernel updates the crash elfcorehdr for CPU and memory changes SUBSYSTEM=="cpu", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end" SUBSYSTEM=="memory", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end" When examined in the context of 98-kexec.rules, the above rules test if crash_hotplug is set, and if so, the userspace initiated unload-then-reload of the crash kernel is skipped. CPU and memory checks are separated in accordance with CONFIG_HOTPLUG_CPU and CONFIG_MEMORY_HOTPLUG kernel config options. If an architecture supports, for example, memory hotplug but not CPU hotplug, then the /sys/devices/system/memory/crash_hotplug attribute file is present, but the /sys/devices/system/cpu/crash_hotplug attribute file will NOT be present. Thus the udev rule skips userspace processing of memory hot un/plug events, but the udev rule will evaluate false for CPU events, thus allowing userspace to process CPU hot un/plug events (ie the unload-then-reload of the kdump capture kernel). Signed-off-by: Eric DeVolder Reviewed-by: Sourabh Jain Acked-by: Hari Bathini Acked-by: Baoquan He --- .../admin-guide/mm/memory-hotplug.rst | 8 Documentation/core-api/cpu_hotplug.rst | 18 ++ drivers/base/cpu.c | 14 ++ drivers/base/memory.c | 13 + include/linux/kexec.h | 8 5 files changed, 61 insertions(+) diff --git a/Documentation/admin-guide/mm/memory-hotplug.rst b/Documentation/admin-guide/mm/memory-hotplug.rst index 1b02fe5807cc..eb99d79223a3 100644 --- a/Documentation/admin-guide/mm/memory-hotplug.rst +++ b/Documentation/admin-guide/mm/memory-hotplug.rst @@ -291,6 +291,14 @@ The following files are currently defined: Availability depends on the CONFIG_ARCH_MEMORY_PROBE kernel configuration option. ``uevent`` read-write: generic udev file for device subsystems. 
+``crash_hotplug`` read-only: when changes to the system memory map + occur due to hot un/plug of memory, this file contains + '1' if the kernel updates the kdump capture kernel memory + map itself (via elfcorehdr), or '0' if userspace must update + the kdump capture kernel memory map. + + Availability depends on the CONFIG_MEMORY_HOTPLUG kernel + configuration option. ==
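For tooling that does not go through udev, the same decision as the udev rules above could be made by reading the attribute file directly. The following is a minimal sketch under that assumption. The helper names are hypothetical, and the path is passed in by the caller, for example /sys/devices/system/memory/crash_hotplug or /sys/devices/system/cpu/crash_hotplug as described above.

```c
#include <stdio.h>

/*
 * Return 1 if the file at path starts with '1', 0 if it holds anything
 * else, and -1 if it cannot be opened (for example, attribute absent).
 */
static int crash_hotplug_enabled(const char *path)
{
	FILE *f = fopen(path, "r");
	int c;

	if (!f)
		return -1;
	c = fgetc(f);
	fclose(f);
	return c == '1';
}

/*
 * Userspace keeps doing the unload-then-reload of the capture kernel
 * unless the kernel reports that it updates the elfcorehdr itself.
 */
static int needs_userspace_reload(const char *path)
{
	return crash_hotplug_enabled(path) != 1;
}
```

Treating a missing file the same as '0' mirrors the behavior described in the commit message: when the attribute is absent, the udev match evaluates false and userspace processing continues.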
Re: [PATCH v23 4/8] crash: memory and CPU hotplug sysfs attributes
On 6/13/23 10:24, Eric DeVolder wrote: On 6/13/23 03:03, Greg KH wrote: On Mon, Jun 12, 2023 at 05:07:08PM -0400, Eric DeVolder wrote: Introduce the crash_hotplug attribute for memory and CPUs for use by userspace. These attributes directly facilitate the udev rule for managing userspace re-loading of the crash kernel upon hot un/plug changes. For memory, expose the crash_hotplug attribute to the /sys/devices/system/memory directory. For example: # udevadm info --attribute-walk /sys/devices/system/memory/memory81 looking at device '/devices/system/memory/memory81': KERNEL=="memory81" SUBSYSTEM=="memory" DRIVER=="" ATTR{online}=="1" ATTR{phys_device}=="0" ATTR{phys_index}=="0051" ATTR{removable}=="1" ATTR{state}=="online" ATTR{valid_zones}=="Movable" looking at parent device '/devices/system/memory': KERNELS=="memory" SUBSYSTEMS=="" DRIVERS=="" ATTRS{auto_online_blocks}=="offline" ATTRS{block_size_bytes}=="800" ATTRS{crash_hotplug}=="1" For CPUs, expose the crash_hotplug attribute to the /sys/devices/system/cpu directory. For example: # udevadm info --attribute-walk /sys/devices/system/cpu/cpu0 looking at device '/devices/system/cpu/cpu0': KERNEL=="cpu0" SUBSYSTEM=="cpu" DRIVER=="processor" ATTR{crash_notes}=="277c38600" ATTR{crash_notes_size}=="368" ATTR{online}=="1" looking at parent device '/devices/system/cpu': KERNELS=="cpu" SUBSYSTEMS=="" DRIVERS=="" ATTRS{crash_hotplug}=="1" ATTRS{isolated}=="" ATTRS{kernel_max}=="8191" ATTRS{nohz_full}==" (null)" ATTRS{offline}=="4-7" ATTRS{online}=="0-3" ATTRS{possible}=="0-7" ATTRS{present}=="0-3" With these sysfs attributes in place, it is possible to efficiently instruct the udev rule to skip crash kernel reloading for kernels configured with crash hotplug support. 
For example, the following is the proposed udev rule change for RHEL system 98-kexec.rules (as the first lines of the rule file): # The kernel updates the crash elfcorehdr for CPU and memory changes SUBSYSTEM=="cpu", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end" SUBSYSTEM=="memory", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end" When examined in the context of 98-kexec.rules, the above rules test if crash_hotplug is set, and if so, the userspace initiated unload-then-reload of the crash kernel is skipped. CPU and memory checks are separated in accordance with CONFIG_HOTPLUG_CPU and CONFIG_MEMORY_HOTPLUG kernel config options. If an architecture supports, for example, memory hotplug but not CPU hotplug, then the /sys/devices/system/memory/crash_hotplug attribute file is present, but the /sys/devices/system/cpu/crash_hotplug attribute file will NOT be present. Thus the udev rule skips userspace processing of memory hot un/plug events, but the udev rule will evaluate false for CPU events, thus allowing userspace to process CPU hot un/plug events (ie the unload-then-reload of the kdump capture kernel). Signed-off-by: Eric DeVolder Reviewed-by: Sourabh Jain Acked-by: Hari Bathini Acked-by: Baoquan He --- .../admin-guide/mm/memory-hotplug.rst | 8 Documentation/core-api/cpu_hotplug.rst | 18 ++ drivers/base/cpu.c | 14 ++ drivers/base/memory.c | 13 + include/linux/kexec.h | 8 5 files changed, 61 insertions(+) diff --git a/Documentation/admin-guide/mm/memory-hotplug.rst b/Documentation/admin-guide/mm/memory-hotplug.rst index 1b02fe5807cc..eb99d79223a3 100644 --- a/Documentation/admin-guide/mm/memory-hotplug.rst +++ b/Documentation/admin-guide/mm/memory-hotplug.rst @@ -291,6 +291,14 @@ The following files are currently defined: Availability depends on the CONFIG_ARCH_MEMORY_PROBE kernel configuration option. ``uevent`` read-write: generic udev file for device subsystems. 
+``crash_hotplug`` read-only: when changes to the system memory map + occur due to hot un/plug of memory, this file contains + '1' if the kernel updates the kdump capture kernel memory + map itself (via elfcorehdr), or '0' if userspace must update + the kdump capture kernel memory map. + + Availability depends on the CONFIG_MEMORY_HOTPLUG kernel + configuration option. == ==
[PATCH v23 5/8] x86/crash: add x86 crash hotplug support
When CPU or memory is hot un/plugged, or off/onlined, the crash elfcorehdr, which describes the CPUs and memory in the system, must also be updated. A new elfcorehdr is generated from the available CPUs and memory and replaces the existing elfcorehdr. The segment containing the elfcorehdr is identified at run-time in crash_core:crash_handle_hotplug_event(). No modifications to purgatory (see 'kexec: exclude elfcorehdr from the segment digest') or boot_params (as the elfcorehdr= capture kernel command line parameter pointer remains unchanged and correct) are needed, just elfcorehdr. For kexec_file_load(), the elfcorehdr segment size is based on NR_CPUS and CRASH_MAX_MEMORY_RANGES in order to accommodate a growing number of CPU and memory resources. For kexec_load(), the userspace kexec utility needs to size the elfcorehdr segment in the same/similar manner. To accommodate kexec_load() syscall in the absence of kexec_file_load() syscall support, prepare_elf_headers() and dependents are moved outside of CONFIG_KEXEC_FILE. 
Signed-off-by: Eric DeVolder Reviewed-by: Sourabh Jain Acked-by: Hari Bathini Acked-by: Baoquan He --- arch/x86/Kconfig | 3 + arch/x86/include/asm/kexec.h | 15 + arch/x86/kernel/crash.c | 103 --- 3 files changed, 114 insertions(+), 7 deletions(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 7dff2481abe0..4b39f4059876 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -2063,6 +2063,9 @@ config ARCH_HAS_KEXEC_JUMP config ARCH_HAS_CRASH_DUMP def_bool X86_64 || (X86_32 && HIGHMEM) +config ARCH_HAS_CRASH_HOTPLUG + def_bool y + config PHYSICAL_START hex "Physical address where the kernel is loaded" if (EXPERT || CRASH_DUMP) default "0x100" diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h index 5b77bbc28f96..9143100ea3ea 100644 --- a/arch/x86/include/asm/kexec.h +++ b/arch/x86/include/asm/kexec.h @@ -209,6 +209,21 @@ typedef void crash_vmclear_fn(void); extern crash_vmclear_fn __rcu *crash_vmclear_loaded_vmcss; extern void kdump_nmi_shootdown_cpus(void); +#ifdef CONFIG_CRASH_HOTPLUG +void arch_crash_handle_hotplug_event(struct kimage *image); +#define arch_crash_handle_hotplug_event arch_crash_handle_hotplug_event + +#ifdef CONFIG_HOTPLUG_CPU +static inline int crash_hotplug_cpu_support(void) { return 1; } +#define crash_hotplug_cpu_support crash_hotplug_cpu_support +#endif + +#ifdef CONFIG_MEMORY_HOTPLUG +static inline int crash_hotplug_memory_support(void) { return 1; } +#define crash_hotplug_memory_support crash_hotplug_memory_support +#endif +#endif + #endif /* __ASSEMBLY__ */ #endif /* _ASM_X86_KEXEC_H */ diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c index cdd92ab43cda..c70a111c44fa 100644 --- a/arch/x86/kernel/crash.c +++ b/arch/x86/kernel/crash.c @@ -158,8 +158,6 @@ void native_machine_crash_shutdown(struct pt_regs *regs) crash_save_cpu(regs, safe_smp_processor_id()); } -#ifdef CONFIG_KEXEC_FILE - static int get_nr_ram_ranges_callback(struct resource *res, void *arg) { unsigned int *nr_ranges = arg; @@ -231,7 
+229,7 @@ static int prepare_elf64_ram_headers_callback(struct resource *res, void *arg) /* Prepare elf headers. Return addr and size */ static int prepare_elf_headers(struct kimage *image, void **addr, - unsigned long *sz) + unsigned long *sz, unsigned long *nr_mem_ranges) { struct crash_mem *cmem; int ret; @@ -249,6 +247,9 @@ static int prepare_elf_headers(struct kimage *image, void **addr, if (ret) goto out; + /* Return the computed number of memory ranges, for hotplug usage */ + *nr_mem_ranges = cmem->nr_ranges; + /* By default prepare 64bit headers */ ret = crash_prepare_elf64_headers(cmem, IS_ENABLED(CONFIG_X86_64), addr, sz); @@ -257,6 +258,7 @@ static int prepare_elf_headers(struct kimage *image, void **addr, return ret; } +#ifdef CONFIG_KEXEC_FILE static int add_e820_entry(struct boot_params *params, struct e820_entry *entry) { unsigned int nr_e820_entries; @@ -371,18 +373,42 @@ int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params) int crash_load_segments(struct kimage *image) { int ret; + unsigned long pnum = 0; struct kexec_buf kbuf = { .image = image, .buf_min = 0, .buf_max = ULONG_MAX, .top_down = false }; /* Prepare elf headers and add a segment */ - ret = prepare_elf_headers(image, &kbuf.buffer, &kbuf.bufsz); + ret = prepare_elf_headers(image, &kbuf.buffer, &kbuf.bufsz, &pnum); if (ret) return ret; - image->elf_headers = kbuf.buffer; - image->elf_headers_sz = kbuf.bufsz; + image->elf_headers = kbuf.buffer; + image->elf_headers_sz = kbuf.bufsz; +
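The commit message says the elfcorehdr segment is sized from NR_CPUS and CRASH_MAX_MEMORY_RANGES so it can absorb later growth. A rough sketch of that arithmetic follows, using the header count from crash_prepare_elf64_headers() (one PT_NOTE per CPU, one for vmcoreinfo, one PT_LOAD per memory range, one for the kernel text mapping). The function name `elfcorehdr_size()` and the hard-coded structure sizes are assumptions of this sketch. The patch's own sizing code is not shown in this excerpt and may differ.

```c
#include <stddef.h>

#define ELF_CORE_HEADER_ALIGN 4096	/* as in include/linux/kexec.h */
#define ELF64_EHDR_SIZE 64		/* sizeof(Elf64_Ehdr) */
#define ELF64_PHDR_SIZE 56		/* sizeof(Elf64_Phdr) */

/* Round x up to a multiple of a, where a is a power of two. */
static size_t align_up(size_t x, size_t a)
{
	return (x + a - 1) & ~(a - 1);
}

/*
 * Worst-case elfcorehdr size for nr_cpus CPUs and nr_mem_ranges memory
 * ranges: one program header each, plus one for vmcoreinfo and one for
 * the kernel text mapping.
 */
static size_t elfcorehdr_size(size_t nr_cpus, size_t nr_mem_ranges)
{
	size_t nr_phdr = nr_cpus + 1 + nr_mem_ranges + 1;

	return align_up(ELF64_EHDR_SIZE + nr_phdr * ELF64_PHDR_SIZE,
			ELF_CORE_HEADER_ALIGN);
}
```

For example, with 8192 CPUs (an assumed value) and the CRASH_MAX_MEMORY_RANGES default of 8192, this comes to 921600 bytes, i.e. 900 KiB reserved for the segment regardless of how many CPUs and memory ranges exist at load time.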
[PATCH v23 8/8] x86/crash: optimize CPU changes
crash_prepare_elf64_headers() writes into the elfcorehdr an ELF PT_NOTE for all possible CPUs. As such, subsequent changes to CPUs (i.e., hot un/plug, online/offline) do not need to rewrite the elfcorehdr. The kimage->file_mode term covers kdump images loaded via the kexec_file_load() syscall. Since crash_prepare_elf64_headers() wrote the initial elfcorehdr, no update to the elfcorehdr is needed for CPU changes. The kimage->elfcorehdr_updated term covers kdump images loaded via the kexec_load() syscall. At least one memory or CPU change must occur to cause crash_prepare_elf64_headers() to rewrite the elfcorehdr. Afterwards, no update to the elfcorehdr is needed for CPU changes. This code is intentionally *NOT* hoisted into crash_handle_hotplug_event() as it would prevent the arch-specific handler from running for CPU changes. This would break PPC, for example, which needs to update other information besides the elfcorehdr, on CPU changes. Signed-off-by: Eric DeVolder Reviewed-by: Sourabh Jain Acked-by: Hari Bathini Acked-by: Baoquan He --- arch/x86/kernel/crash.c | 10 ++ 1 file changed, 10 insertions(+) diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c index caf22bcb61af..18d2a18d1073 100644 --- a/arch/x86/kernel/crash.c +++ b/arch/x86/kernel/crash.c @@ -467,6 +467,16 @@ void arch_crash_handle_hotplug_event(struct kimage *image) unsigned long mem, memsz; unsigned long elfsz = 0; + /* +* As crash_prepare_elf64_headers() has already described all +* possible CPUs, there is no need to update the elfcorehdr +* for additional CPU changes. +*/ + if ((image->file_mode || image->elfcorehdr_updated) && + ((image->hp_action == KEXEC_CRASH_HP_ADD_CPU) || + (image->hp_action == KEXEC_CRASH_HP_REMOVE_CPU))) + return; + /* * Create the new elfcorehdr reflecting the changes to CPU and/or * memory resources. -- 2.31.1
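The early-return condition in this patch can be read as a small truth table over the load method and the hotplug action. The sketch below isolates that predicate in user space. The reduced `struct kimage` uses plain `bool` members, and `cpu_change_needs_no_update()` is a hypothetical name introduced for this illustration.

```c
#include <stdbool.h>

#define KEXEC_CRASH_HP_NONE          0
#define KEXEC_CRASH_HP_ADD_CPU       1
#define KEXEC_CRASH_HP_REMOVE_CPU    2
#define KEXEC_CRASH_HP_ADD_MEMORY    3
#define KEXEC_CRASH_HP_REMOVE_MEMORY 4

/* Reduced stand-in for the kernel's struct kimage (assumption). */
struct kimage {
	bool file_mode;			/* loaded via kexec_file_load() */
	bool elfcorehdr_updated;	/* kernel already rewrote the elfcorehdr */
	int hp_action;
};

/*
 * True when the elfcorehdr already describes all possible CPUs and the
 * event is a CPU add/remove, so the rewrite can be skipped.
 */
static bool cpu_change_needs_no_update(const struct kimage *image)
{
	bool hdr_covers_all_cpus = image->file_mode ||
				   image->elfcorehdr_updated;
	bool is_cpu_action = image->hp_action == KEXEC_CRASH_HP_ADD_CPU ||
			     image->hp_action == KEXEC_CRASH_HP_REMOVE_CPU;

	return hdr_covers_all_cpus && is_cpu_action;
}
```

A kexec_load() image that has not yet been rewritten by the kernel returns false even for a CPU event, which matches the statement that at least one change must first trigger a rewrite. Memory events always return false.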
[PATCH v23 1/8] crash: move a few code bits to setup support of crash hotplug
The crash hotplug support leans on the work for the kexec_file_load() syscall. To also support the kexec_load() syscall, a few bits of code need to be moved outside of CONFIG_KEXEC_FILE. As such, these bits are moved out of kexec_file.c and into a common location, crash_core.c. No functionality change intended. Signed-off-by: Eric DeVolder Reviewed-by: Sourabh Jain Acked-by: Hari Bathini Acked-by: Baoquan He --- include/linux/kexec.h | 30 +++ kernel/crash_core.c | 182 ++ kernel/kexec_file.c | 181 - 3 files changed, 197 insertions(+), 196 deletions(-) diff --git a/include/linux/kexec.h b/include/linux/kexec.h index 22b5cd24f581..811a90e09698 100644 --- a/include/linux/kexec.h +++ b/include/linux/kexec.h @@ -105,6 +105,21 @@ struct compat_kexec_segment { }; #endif +/* Alignment required for elf header segment */ +#define ELF_CORE_HEADER_ALIGN 4096 + +struct crash_mem { + unsigned int max_nr_ranges; + unsigned int nr_ranges; + struct range ranges[]; +}; + +extern int crash_exclude_mem_range(struct crash_mem *mem, + unsigned long long mstart, + unsigned long long mend); +extern int crash_prepare_elf64_headers(struct crash_mem *mem, int need_kernel_map, + void **addr, unsigned long *sz); + #ifdef CONFIG_KEXEC_FILE struct purgatory_info { /* @@ -230,21 +245,6 @@ static inline int arch_kexec_locate_mem_hole(struct kexec_buf *kbuf) } #endif -/* Alignment required for elf header segment */ -#define ELF_CORE_HEADER_ALIGN 4096 - -struct crash_mem { - unsigned int max_nr_ranges; - unsigned int nr_ranges; - struct range ranges[]; -}; - -extern int crash_exclude_mem_range(struct crash_mem *mem, - unsigned long long mstart, - unsigned long long mend); -extern int crash_prepare_elf64_headers(struct crash_mem *mem, int need_kernel_map, - void **addr, unsigned long *sz); - #ifndef arch_kexec_apply_relocations_add /* * arch_kexec_apply_relocations_add - apply relocations of type RELA diff --git a/kernel/crash_core.c b/kernel/crash_core.c index 90ce1dfd591c..b7c30b748a16 100644 --- 
a/kernel/crash_core.c +++ b/kernel/crash_core.c @@ -10,6 +10,7 @@ #include #include #include +#include #include #include @@ -314,6 +315,187 @@ static int __init parse_crashkernel_dummy(char *arg) } early_param("crashkernel", parse_crashkernel_dummy); +int crash_prepare_elf64_headers(struct crash_mem *mem, int need_kernel_map, + void **addr, unsigned long *sz) +{ + Elf64_Ehdr *ehdr; + Elf64_Phdr *phdr; + unsigned long nr_cpus = num_possible_cpus(), nr_phdr, elf_sz; + unsigned char *buf; + unsigned int cpu, i; + unsigned long long notes_addr; + unsigned long mstart, mend; + + /* extra phdr for vmcoreinfo ELF note */ + nr_phdr = nr_cpus + 1; + nr_phdr += mem->nr_ranges; + + /* +* kexec-tools creates an extra PT_LOAD phdr for kernel text mapping +* area (for example, ffffffff80000000 - ffffffffa0000000 on x86_64). +* I think this is required by tools like gdb. So same physical +* memory will be mapped in two ELF headers. One will contain kernel +* text virtual addresses and other will have __va(physical) addresses. 
+*/ + + nr_phdr++; + elf_sz = sizeof(Elf64_Ehdr) + nr_phdr * sizeof(Elf64_Phdr); + elf_sz = ALIGN(elf_sz, ELF_CORE_HEADER_ALIGN); + + buf = vzalloc(elf_sz); + if (!buf) + return -ENOMEM; + + ehdr = (Elf64_Ehdr *)buf; + phdr = (Elf64_Phdr *)(ehdr + 1); + memcpy(ehdr->e_ident, ELFMAG, SELFMAG); + ehdr->e_ident[EI_CLASS] = ELFCLASS64; + ehdr->e_ident[EI_DATA] = ELFDATA2LSB; + ehdr->e_ident[EI_VERSION] = EV_CURRENT; + ehdr->e_ident[EI_OSABI] = ELF_OSABI; + memset(ehdr->e_ident + EI_PAD, 0, EI_NIDENT - EI_PAD); + ehdr->e_type = ET_CORE; + ehdr->e_machine = ELF_ARCH; + ehdr->e_version = EV_CURRENT; + ehdr->e_phoff = sizeof(Elf64_Ehdr); + ehdr->e_ehsize = sizeof(Elf64_Ehdr); + ehdr->e_phentsize = sizeof(Elf64_Phdr); + + /* Prepare one phdr of type PT_NOTE for each present CPU */ + for_each_present_cpu(cpu) { + phdr->p_type = PT_NOTE; + notes_addr = per_cpu_ptr_to_phys(per_cpu_ptr(crash_notes, cpu)); + phdr->p_offset = phdr->p_paddr = notes_addr; + phdr->p_filesz = phdr->p_memsz = sizeof(note_buf_t); + (ehdr->e_phnum)++; + phdr++; + } + + /* Prepare one PT_NOTE header for vmcoreinfo */ + phdr->p_type = PT_NOTE; +
[PATCH v23 7/8] crash: change crash_prepare_elf64_headers() to for_each_possible_cpu()
The function crash_prepare_elf64_headers() generates the elfcorehdr
which describes the CPUs and memory in the system for the crash kernel.
In particular, it writes out ELF PT_LOADs for memory regions and ELF
PT_NOTEs for the CPUs in the system.

With respect to the CPUs, the current implementation utilizes
for_each_present_cpu() which means that as CPUs are added and removed,
the elfcorehdr must again be updated to reflect the new set of CPUs.

The reasoning behind the move to use for_each_possible_cpu(), is:

- At kernel boot time, all percpu crash_notes are allocated for all
  possible CPUs; that is, crash_notes are not allocated dynamically
  when CPUs are plugged/unplugged. Thus the crash_notes for each
  possible CPU are always available.

- The crash_prepare_elf64_headers() creates an ELF PT_NOTE per CPU.
  Changing to for_each_possible_cpu() is valid as the crash_notes
  pointed to by each CPU PT_NOTE are present and always valid.

Furthermore, examining a common crash processing path of:

 kernel panic -> crash kernel -> makedumpfile -> 'crash' analyzer
           elfcorehdr      /proc/vmcore     vmcore

reveals how the ELF CPU PT_NOTEs are utilized:

- Upon panic, each CPU is sent an IPI and shuts itself down, recording
  its state in its crash_notes. When all CPUs are shut down, the
  crash kernel is launched with a pointer to the elfcorehdr.

- The crash kernel via linux/fs/proc/vmcore.c does not examine or
  use the contents of the PT_NOTEs, it exposes them via /proc/vmcore.

- The makedumpfile utility uses /proc/vmcore and reads the CPU
  PT_NOTEs to craft a nr_cpus variable, which is reported in a
  header but otherwise generally unused. Makedumpfile creates the
  vmcore.

- The 'crash' dump analyzer does not appear to reference the CPU
  PT_NOTEs. Instead it looks up the cpu_[possible|present|online]_mask
  symbols and directly examines those structure contents from vmcore
  memory. From that information it is able to determine which CPUs
  are present and online, and locate the corresponding crash_notes.
  Said differently, it appears that the 'crash' analyzer does not rely
  on the ELF PT_NOTEs for CPUs; rather it obtains the information
  directly via kernel symbols and the memory within the vmcore.

(There may be other vmcore generating and analysis tools that do use
these PT_NOTEs, but 'makedumpfile' and 'crash' seem to be the most
common solution.)

This results in the benefit of having all CPUs described in the
elfcorehdr, and therefore reducing the need to re-generate the
elfcorehdr on CPU changes, at the small expense of an additional
56 bytes per PT_NOTE for not-present-but-possible CPUs.

On systems where kexec_file_load() syscall is utilized, all the above
is valid. On systems where kexec_load() syscall is utilized, there
may be the need for the elfcorehdr to be regenerated once. The reason
being that some archs only populate the 'present' CPUs from the
/sys/devices/system/cpu entries, which the userspace 'kexec' utility
uses to generate the userspace-supplied elfcorehdr. In this situation,
one memory or CPU change will rewrite the elfcorehdr via the
crash_prepare_elf64_headers() function and now all possible CPUs will
be described, just as with kexec_file_load() syscall.
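As a rough cross-check of the 56-byte figure, here is a small userspace
sketch (not kernel code). It assumes the glibc <elf.h> definition of
Elf64_Phdr; the helper name extra_note_bytes() is made up for
illustration and does not exist in the kernel.

```c
#include <assert.h>
#include <elf.h>
#include <stddef.h>

/*
 * Hypothetical helper: bytes of program headers spent on CPUs that are
 * possible but not present, at one Elf64_Phdr (PT_NOTE) per such CPU.
 */
static size_t extra_note_bytes(unsigned int nr_possible,
			       unsigned int nr_present)
{
	/* Present CPUs are always a subset of possible CPUs. */
	assert(nr_possible >= nr_present);

	return (size_t)(nr_possible - nr_present) * sizeof(Elf64_Phdr);
}
```

For the sysfs example elsewhere in this series (possible 0-7, present
0-3), that would be four extra program headers.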
Suggested-by: Sourabh Jain
Signed-off-by: Eric DeVolder
Reviewed-by: Sourabh Jain
Acked-by: Hari Bathini
Acked-by: Baoquan He
---
 kernel/crash_core.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index fa918176d46d..7378b501fada 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -364,8 +364,8 @@ int crash_prepare_elf64_headers(struct crash_mem *mem, int need_kernel_map,
 	ehdr->e_ehsize = sizeof(Elf64_Ehdr);
 	ehdr->e_phentsize = sizeof(Elf64_Phdr);
 
-	/* Prepare one phdr of type PT_NOTE for each present CPU */
-	for_each_present_cpu(cpu) {
+	/* Prepare one phdr of type PT_NOTE for each possible CPU */
+	for_each_possible_cpu(cpu) {
 		phdr->p_type = PT_NOTE;
 		notes_addr = per_cpu_ptr_to_phys(per_cpu_ptr(crash_notes, cpu));
 		phdr->p_offset = phdr->p_paddr = notes_addr;
-- 
2.31.1

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
[PATCH v23 2/8] crash: add generic infrastructure for crash hotplug support
To support crash hotplug, a mechanism is needed to update the crash
elfcorehdr upon CPU or memory changes (e.g. hot un/plug or
off/onlining). The crash elfcorehdr describes the CPUs and memory to
be written into the vmcore.

To track CPU changes, callbacks are registered with the cpuhp
mechanism via cpuhp_setup_state_nocalls(CPUHP_BP_PREPARE_DYN). The
crash hotplug elfcorehdr update has no explicit ordering requirement
(relative to other cpuhp states), so it meets the criteria for
utilizing CPUHP_BP_PREPARE_DYN. CPUHP_BP_PREPARE_DYN is a dynamic
state and avoids the need to introduce a new state for crash
hotplug. Also, CPUHP_BP_PREPARE_DYN is the last state in the PREPARE
group, just prior to the STARTING group, which is very close to the
CPU starting up in a plug/online situation, or stopping in an
unplug/offline situation. This minimizes the window of time during
an actual plug/online or unplug/offline situation in which the
elfcorehdr would be inaccurate. Note that for a CPU being unplugged
or offlined, the CPU will still be present in the list of CPUs
generated by crash_prepare_elf64_headers(). However, there is no
need to explicitly omit the CPU, see justification in
'crash: change crash_prepare_elf64_headers() to for_each_possible_cpu()'.

To track memory changes, a notifier is registered to capture the
memblock MEM_ONLINE and MEM_OFFLINE events via register_memory_notifier().

The CPU callbacks and memory notifiers invoke crash_handle_hotplug_event()
which performs needed tasks and then dispatches the event to the
architecture specific arch_crash_handle_hotplug_event() to update the
elfcorehdr with the current state of CPUs and memory. During the
process, the kexec_lock is held.
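The flow described above (take the lock, record the action, hand off to
the arch handler, release the lock) can be modelled in userspace roughly
as follows. This is only a toy sketch under stated assumptions, not the
kernel implementation: struct toy_kimage, toy_arch_handle() and
toy_crash_handle_hotplug_event() are invented names, the lock is a plain
bool, and only the action codes are taken from this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Action codes as added to include/linux/crash_core.h by this patch. */
#define KEXEC_CRASH_HP_NONE		0
#define KEXEC_CRASH_HP_ADD_CPU		1
#define KEXEC_CRASH_HP_REMOVE_CPU	2
#define KEXEC_CRASH_HP_ADD_MEMORY	3
#define KEXEC_CRASH_HP_REMOVE_MEMORY	4

/* Toy stand-in for struct kimage; only what the sketch needs. */
struct toy_kimage {
	int hp_action;
	int arch_updates;	/* counts calls into the "arch" handler */
};

/* Hypothetical stand-in for arch_crash_handle_hotplug_event(). */
static void toy_arch_handle(struct toy_kimage *image)
{
	image->arch_updates++;
}

/*
 * Toy model of the generic handler. Returns true if the arch handler
 * was invoked, false if the lock was busy or no crash image is loaded.
 */
static bool toy_crash_handle_hotplug_event(struct toy_kimage *image,
					   bool *lock_held, int hp_action)
{
	assert(lock_held != NULL);

	if (*lock_held)		/* models a failed trylock */
		return false;
	*lock_held = true;

	if (image != NULL) {	/* nothing to do without a crash image */
		image->hp_action = hp_action;
		toy_arch_handle(image);
		image->hp_action = KEXEC_CRASH_HP_NONE;
	}

	*lock_held = false;
	return image != NULL;
}
```

The point of the model is that a busy lock or a missing crash image is
not treated as a hotplug failure; the event is simply not dispatched.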
Signed-off-by: Eric DeVolder Reviewed-by: Sourabh Jain Acked-by: Hari Bathini Acked-by: Baoquan He --- include/linux/crash_core.h | 9 +++ include/linux/kexec.h | 11 +++ kernel/Kconfig.kexec | 31 kernel/crash_core.c| 142 + kernel/kexec_core.c| 6 ++ 5 files changed, 199 insertions(+) diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h index de62a722431e..e14345cc7a22 100644 --- a/include/linux/crash_core.h +++ b/include/linux/crash_core.h @@ -84,4 +84,13 @@ int parse_crashkernel_high(char *cmdline, unsigned long long system_ram, int parse_crashkernel_low(char *cmdline, unsigned long long system_ram, unsigned long long *crash_size, unsigned long long *crash_base); +#define KEXEC_CRASH_HP_NONE0 +#define KEXEC_CRASH_HP_ADD_CPU 1 +#define KEXEC_CRASH_HP_REMOVE_CPU 2 +#define KEXEC_CRASH_HP_ADD_MEMORY 3 +#define KEXEC_CRASH_HP_REMOVE_MEMORY 4 +#define KEXEC_CRASH_HP_INVALID_CPU -1U + +struct kimage; + #endif /* LINUX_CRASH_CORE_H */ diff --git a/include/linux/kexec.h b/include/linux/kexec.h index 811a90e09698..b9903dd48e24 100644 --- a/include/linux/kexec.h +++ b/include/linux/kexec.h @@ -33,6 +33,7 @@ extern note_buf_t __percpu *crash_notes; #include #include #include +#include #include /* Verify architecture specific macros are defined */ @@ -360,6 +361,12 @@ struct kimage { struct purgatory_info purgatory_info; #endif +#ifdef CONFIG_CRASH_HOTPLUG + int hp_action; + int elfcorehdr_index; + bool elfcorehdr_updated; +#endif + #ifdef CONFIG_IMA_KEXEC /* Virtual address of IMA measurement buffer for kexec syscall */ void *ima_buffer; @@ -490,6 +497,10 @@ static inline int arch_kexec_post_alloc_pages(void *vaddr, unsigned int pages, g static inline void arch_kexec_pre_free_pages(void *vaddr, unsigned int pages) { } #endif +#ifndef arch_crash_handle_hotplug_event +static inline void arch_crash_handle_hotplug_event(struct kimage *image) { } +#endif + #else /* !CONFIG_KEXEC_CORE */ struct pt_regs; struct task_struct; diff --git a/kernel/Kconfig.kexec 
b/kernel/Kconfig.kexec index 660048099865..a117163fde45 100644 --- a/kernel/Kconfig.kexec +++ b/kernel/Kconfig.kexec @@ -100,4 +100,35 @@ config CRASH_DUMP For s390, this option also enables zfcpdump. See also +config CRASH_HOTPLUG + bool "Update the crash elfcorehdr on system configuration changes" + default y + depends on CRASH_DUMP && (HOTPLUG_CPU || MEMORY_HOTPLUG) + depends on ARCH_HAS_CRASH_HOTPLUG + help + Enable direct update to the crash elfcorehdr (which contains + the list of CPUs and memory regions to be dumped upon a crash) + in response to hot plug/unplug or online/offline of CPUs or + memory. This is a much more advanced approach than userspace + attempting that. + + If unsure, say Y. + +config CRASH_MAX_MEMORY_RANGES + int "Specify the maximum number of memory regions for the elfcorehdr" + default 8192 + depends on CRASH_HOTPLUG + help + For t
[PATCH v23 3/8] kexec: exclude elfcorehdr from the segment digest
When a crash kernel is loaded via the kexec_file_load() syscall, the
kernel places the various segments (i.e. crash kernel, crash initrd,
boot_params, elfcorehdr, purgatory, etc) in memory. For those
architectures that utilize purgatory, a hash digest of the segments
is calculated for integrity checking. The digest is embedded into
the purgatory image prior to placing in memory.

Updates to the elfcorehdr in response to CPU and memory changes
would cause the purgatory integrity checking to fail (at crash time,
and no vmcore created). Therefore, the elfcorehdr segment is
explicitly excluded from the purgatory digest, enabling updates to
the elfcorehdr while also avoiding the need to recompute the hash
digest and reload purgatory.

Signed-off-by: Eric DeVolder
Reviewed-by: Sourabh Jain
Acked-by: Hari Bathini
Acked-by: Baoquan He
---
 kernel/kexec_file.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index f8b1797b3ec9..1d2cfc869a75 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -726,6 +726,12 @@ static int kexec_calculate_store_digests(struct kimage *image)
 	for (j = i = 0; i < image->nr_segments; i++) {
 		struct kexec_segment *ksegment;
 
+#ifdef CONFIG_CRASH_HOTPLUG
+		/* Exclude elfcorehdr segment to allow future changes via hotplug */
+		if (j == image->elfcorehdr_index)
+			continue;
+#endif
+
 		ksegment = &image->segment[i];
 		/*
 		 * Skip purgatory as it will be modified once we put digest
-- 
2.31.1
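To illustrate the intent of the exclusion, here is a toy userspace model
of the segment walk. It is a sketch under assumptions, not the kernel
code: toy_digest_segments() is an invented name, segments are identified
purely by index (the kernel identifies purgatory by its buffer), and the
sketch compares the segment index itself against the elfcorehdr index,
which is how the commit message describes the exclusion.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model of choosing which segments feed the purgatory digest.
 * Writes the indices of the segments that would be hashed into out[]
 * (which must have room for nr_segments entries) and returns the count.
 */
static size_t toy_digest_segments(size_t nr_segments, size_t purgatory_index,
				  size_t elfcorehdr_index, size_t *out)
{
	size_t i, j = 0;

	assert(out != NULL);

	for (i = 0; i < nr_segments; i++) {
		if (i == elfcorehdr_index)	/* may be rewritten on hotplug */
			continue;
		if (i == purgatory_index)	/* modified when digest is stored */
			continue;
		out[j++] = i;
	}
	return j;
}
```

With four segments where index 2 is the elfcorehdr and index 3 is
purgatory, only segments 0 and 1 contribute to the digest, so a later
rewrite of segment 2 cannot invalidate it.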
[PATCH v23 6/8] crash: hotplug support for kexec_load()
The hotplug support for kexec_load() requires changes to the
userspace kexec-tools and a little extra help from the kernel.

Given a kdump capture kernel loaded via kexec_load(), and a
subsequent hotplug event, the crash hotplug handler finds the
elfcorehdr and rewrites it to reflect the hotplug change.
That is the desired outcome, however, at kernel panic time,
the purgatory integrity check fails (because the elfcorehdr
changed), and the capture kernel does not boot and no vmcore
is generated.

Therefore, the userspace kexec-tools/kexec must indicate to the
kernel that the elfcorehdr can be modified (because the kexec
excluded the elfcorehdr from the digest, and sized the elfcorehdr
memory buffer appropriately).

To facilitate hotplug support with kexec_load():
 - a new kexec flag KEXEC_UPDATE_ELFCOREHDR indicates that it is
   safe for the kernel to modify the kexec_load()'d elfcorehdr
 - the /sys/kernel/crash_elfcorehdr_size node communicates the
   preferred size of the elfcorehdr memory buffer
 - The sysfs crash_hotplug nodes (i.e.
   /sys/devices/system/[cpu|memory]/crash_hotplug) dynamically
   take into account kexec_file_load() vs kexec_load() and
   KEXEC_UPDATE_ELFCOREHDR.
   This is critical so that the udev rule processing of crash_hotplug
   is all that is needed to determine if the userspace unload-then-load
   of the kdump image is to be skipped, or not.

The proposed udev rule change looks like:
 # The kernel updates the crash elfcorehdr for CPU and memory changes
 SUBSYSTEM=="cpu", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"
 SUBSYSTEM=="memory", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"

The table below indicates the behavior of kexec_load()'d kdump image
updates (with the new udev crash_hotplug rule in place):

            |      kexec-tools
 Kernel     |   old    |   new
 -----------+----------+----------
 old        |    a     |    a
 -----------+----------+----------
 new        |    a     |    b
 -----------+----------+----------

where kexec 'old' and 'new' indicate whether kexec-tools has the
needed modifications for the crash hotplug feature, and kernel 'old'
and 'new' indicate whether the kernel supports this crash hotplug
feature.

Behavior 'a' indicates the unload-then-reload of the entire kdump
image. For the kexec 'old' column, the unload-then-reload occurs due
to the missing flag KEXEC_UPDATE_ELFCOREHDR. An 'old' kernel (with
'new' kexec) does not present the crash_hotplug sysfs node, which
leads to the unload-then-reload of the kdump image.

Behavior 'b' indicates the desired optimized behavior of the kernel
directly modifying the elfcorehdr and avoiding the unload-then-reload
of the kdump image.

If the udev rule is not updated with the crash_hotplug node check,
then regardless of whether the kernel and kexec are new or old, the
kdump image continues to be unloaded and then reloaded on hotplug
changes.

To fully support the crash hotplug feature, there needs to be a
rollout of kernel, kexec-tools and udev rule changes. However, the
order of the rollout of these pieces does not matter; kexec_load()'d
kdump images still function for hotplug as-is.
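The behavior table, together with the note about the udev rule, reduces
to a single condition. The sketch below is only an illustration of that
reasoning in userspace C; struct rollout and toy_kexec_load_behavior()
are invented names, not part of the kernel or kexec-tools.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum kdump_behavior {
	KDUMP_RELOAD = 'a',		/* userspace unload-then-reload */
	KDUMP_KERNEL_UPDATE = 'b',	/* kernel rewrites the elfcorehdr */
};

/* Which of the three pieces have been rolled out on a system. */
struct rollout {
	bool kernel_new;		/* kernel has crash hotplug support */
	bool kexec_new;			/* kexec sets KEXEC_UPDATE_ELFCOREHDR */
	bool udev_rule_updated;		/* rule checks crash_hotplug */
};

/* Behavior for a kexec_load()'d kdump image on a hotplug event. */
static enum kdump_behavior toy_kexec_load_behavior(const struct rollout *r)
{
	assert(r != NULL);

	if (r->kernel_new && r->kexec_new && r->udev_rule_updated)
		return KDUMP_KERNEL_UPDATE;
	return KDUMP_RELOAD;
}
```

Any missing piece falls back to behavior 'a', which is why the rollout
order of kernel, kexec-tools and udev rule does not matter.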
Suggested-by: Hari Bathini Signed-off-by: Eric DeVolder Acked-by: Hari Bathini Acked-by: Baoquan He --- arch/x86/include/asm/kexec.h | 11 +++ arch/x86/kernel/crash.c | 27 +++ include/linux/kexec.h| 14 -- include/uapi/linux/kexec.h | 1 + kernel/crash_core.c | 31 +++ kernel/kexec.c | 5 + kernel/ksysfs.c | 15 +++ 7 files changed, 98 insertions(+), 6 deletions(-) diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h index 9143100ea3ea..3be6a98751f0 100644 --- a/arch/x86/include/asm/kexec.h +++ b/arch/x86/include/asm/kexec.h @@ -214,14 +214,17 @@ void arch_crash_handle_hotplug_event(struct kimage *image); #define arch_crash_handle_hotplug_event arch_crash_handle_hotplug_event #ifdef CONFIG_HOTPLUG_CPU -static inline int crash_hotplug_cpu_support(void) { return 1; } -#define crash_hotplug_cpu_support crash_hotplug_cpu_support +int arch_crash_hotplug_cpu_support(void); +#define crash_hotplug_cpu_support arch_crash_hotplug_cpu_support #endif #ifdef CONFIG_MEMORY_HOTPLUG -static inline int crash_hotplug_memory_support(void) { return 1; } -#define crash_hotplug_memory_support crash_hotplug_memory_support +int arch_crash_hotplug_memory_support(void); +#define crash_hotplug_memory_support arch_crash_hotplug_memory_support #endif + +unsigned int arch_crash_get_elfcorehdr_size(void); +#define crash_get_elfcorehdr_size arch_crash_get_elfcorehdr_size #endif #endif /* __ASSEMBLY__ */ diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c index c70a111c44fa..caf22bcb61af 100644 --- a/arch/x86/kernel/crash.c +++ b/arch/x86/kernel/crash.c @@ -427,6 +427,33 @@ int crash_load_segments(struct kimage *image) #undef pr_fmt #define pr_fmt(fmt) "crash hp: " fmt +/* These functions provide the
[PATCH v23 0/8] crash: Kernel handling of CPU and memory hot un/plug
kexec_file_load support patch.

v7: 13apr2022
 https://lkml.org/lkml/2022/4/13/850
 https://lore.kernel.org/lkml/20220413164237.20845-1-eric.devol...@oracle.com/
 - Resolved parameter usage to crash_hotplug_handler(), per Baoquan.

v6: 1apr2022
 https://lkml.org/lkml/2022/4/1/1203
 https://lore.kernel.org/lkml/20220401183040.1624-1-eric.devol...@oracle.com/
 - Reword commit messages and some comment cleanup per Baoquan.
 - Changed elf_index to elfcorehdr_index for clarity.
 - Minor code changes per Baoquan.

v5: 3mar2022
 https://lkml.org/lkml/2022/3/3/674
 https://lore.kernel.org/lkml/20220303162725.49640-1-eric.devol...@oracle.com/
 - Reworded description of CRASH_HOTPLUG_ELFCOREHDR_SZ, per
   David Hildenbrand.
 - Refactored slightly a few patches per Baoquan recommendation.

v4: 9feb2022
 https://lkml.org/lkml/2022/2/9/1406
 https://lore.kernel.org/lkml/20220209195706.51522-1-eric.devol...@oracle.com/
 - Refactored patches per Baoquan suggestions.
 - A few corrections, per Baoquan.

v3: 10jan2022
 https://lkml.org/lkml/2022/1/10/1212
 https://lore.kernel.org/lkml/20220110195727.1682-1-eric.devol...@oracle.com/
 - Rebasing per Baoquan He request.
 - Changed memory notifier per David Hildenbrand.
 - Providing example kexec userspace change in cover letter.

RFC v2: 7dec2021
 https://lkml.org/lkml/2021/12/7/1088
 https://lore.kernel.org/lkml/20211207195204.1582-1-eric.devol...@oracle.com/
 - Acting upon Baoquan He suggestion of removing elfcorehdr from
   the purgatory list of segments, removed purgatory code from
   patchset, and it is significantly simpler now.

RFC v1: 18nov2021
 https://lkml.org/lkml/2021/11/18/845
 https://lore.kernel.org/lkml/2028174948.37435-1-eric.devol...@oracle.com/
 - working patchset demonstrating kernel handling of hotplug
   updates to x86 elfcorehdr for kexec_file_load

RFC: 14dec2020
 https://lkml.org/lkml/2020/12/14/532
 https://lore.kernel.org/lkml/b04ed259-dc5f-7f30-6661-c26f92d90...@oracle.com/
 - proposed concept of allowing kernel to handle hotplug update
   of elfcorehdr
---

Eric DeVolder (8):
  crash: move a few code bits to setup support of crash hotplug
  crash: add generic infrastructure for crash hotplug support
  kexec: exclude elfcorehdr from the segment digest
  crash: memory and CPU hotplug sysfs attributes
  x86/crash: add x86 crash hotplug support
  crash: hotplug support for kexec_load()
  crash: change crash_prepare_elf64_headers() to for_each_possible_cpu()
  x86/crash: optimize CPU changes

 .../admin-guide/mm/memory-hotplug.rst  |   8 +
 Documentation/core-api/cpu_hotplug.rst |  18 +
 arch/x86/Kconfig                       |   3 +
 arch/x86/include/asm/kexec.h           |  18 +
 arch/x86/kernel/crash.c                | 140 ++-
 drivers/base/cpu.c                     |  14 +
 drivers/base/memory.c                  |  13 +
 include/linux/crash_core.h             |   9 +
 include/linux/kexec.h                  |  63 +++-
 include/uapi/linux/kexec.h             |   1 +
 kernel/Kconfig.kexec                   |  31 ++
 kernel/crash_core.c                    | 355 ++
 kernel/kexec.c                         |   5 +
 kernel/kexec_core.c                    |   6 +
 kernel/kexec_file.c                    | 187 +
 kernel/ksysfs.c                        |  15 +
 16 files changed, 681 insertions(+), 205 deletions(-)

-- 
2.31.1
[PATCH v23 4/8] crash: memory and CPU hotplug sysfs attributes
Introduce the crash_hotplug attribute for memory and CPUs for
use by userspace. These attributes directly facilitate the udev
rule for managing userspace re-loading of the crash kernel upon
hot un/plug changes.

For memory, expose the crash_hotplug attribute to the
/sys/devices/system/memory directory. For example:

 # udevadm info --attribute-walk /sys/devices/system/memory/memory81
  looking at device '/devices/system/memory/memory81':
    KERNEL=="memory81"
    SUBSYSTEM=="memory"
    DRIVER==""
    ATTR{online}=="1"
    ATTR{phys_device}=="0"
    ATTR{phys_index}=="0051"
    ATTR{removable}=="1"
    ATTR{state}=="online"
    ATTR{valid_zones}=="Movable"

  looking at parent device '/devices/system/memory':
    KERNELS=="memory"
    SUBSYSTEMS==""
    DRIVERS==""
    ATTRS{auto_online_blocks}=="offline"
    ATTRS{block_size_bytes}=="800"
    ATTRS{crash_hotplug}=="1"

For CPUs, expose the crash_hotplug attribute to the
/sys/devices/system/cpu directory. For example:

 # udevadm info --attribute-walk /sys/devices/system/cpu/cpu0
  looking at device '/devices/system/cpu/cpu0':
    KERNEL=="cpu0"
    SUBSYSTEM=="cpu"
    DRIVER=="processor"
    ATTR{crash_notes}=="277c38600"
    ATTR{crash_notes_size}=="368"
    ATTR{online}=="1"

  looking at parent device '/devices/system/cpu':
    KERNELS=="cpu"
    SUBSYSTEMS==""
    DRIVERS==""
    ATTRS{crash_hotplug}=="1"
    ATTRS{isolated}==""
    ATTRS{kernel_max}=="8191"
    ATTRS{nohz_full}==" (null)"
    ATTRS{offline}=="4-7"
    ATTRS{online}=="0-3"
    ATTRS{possible}=="0-7"
    ATTRS{present}=="0-3"

With these sysfs attributes in place, it is possible to efficiently
instruct the udev rule to skip crash kernel reloading for kernels
configured with crash hotplug support.
For example, the following is the proposed udev rule change for RHEL
system 98-kexec.rules (as the first lines of the rule file):

 # The kernel updates the crash elfcorehdr for CPU and memory changes
 SUBSYSTEM=="cpu", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"
 SUBSYSTEM=="memory", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"

When examined in the context of 98-kexec.rules, the above rules
test if crash_hotplug is set, and if so, the userspace initiated
unload-then-reload of the crash kernel is skipped.

CPU and memory checks are separated in accordance with
CONFIG_HOTPLUG_CPU and CONFIG_MEMORY_HOTPLUG kernel config options.
If an architecture supports, for example, memory hotplug but not
CPU hotplug, then the /sys/devices/system/memory/crash_hotplug
attribute file is present, but the /sys/devices/system/cpu/crash_hotplug
attribute file will NOT be present. Thus the udev rule skips
userspace processing of memory hot un/plug events, but the udev
rule will evaluate false for CPU events, thus allowing userspace to
process CPU hot un/plug events (i.e. the unload-then-reload of the
kdump capture kernel).

Signed-off-by: Eric DeVolder
Reviewed-by: Sourabh Jain
Acked-by: Hari Bathini
Acked-by: Baoquan He
---
 .../admin-guide/mm/memory-hotplug.rst  |  8 ++++++++
 Documentation/core-api/cpu_hotplug.rst | 18 ++++++++++++++++++
 drivers/base/cpu.c                     | 14 ++++++++++++++
 drivers/base/memory.c                  | 13 +++++++++++++
 include/linux/kexec.h                  |  8 ++++++++
 5 files changed, 61 insertions(+)

diff --git a/Documentation/admin-guide/mm/memory-hotplug.rst b/Documentation/admin-guide/mm/memory-hotplug.rst
index 1b02fe5807cc..eb99d79223a3 100644
--- a/Documentation/admin-guide/mm/memory-hotplug.rst
+++ b/Documentation/admin-guide/mm/memory-hotplug.rst
@@ -291,6 +291,14 @@ The following files are currently defined:
                        Availability depends on the CONFIG_ARCH_MEMORY_PROBE
                        kernel configuration option.
 ``uevent``             read-write: generic udev file for device subsystems.
+``crash_hotplug``      read-only: when changes to the system memory map
+                       occur due to hot un/plug of memory, this file contains
+                       '1' if the kernel updates the kdump capture kernel memory
+                       map itself (via elfcorehdr), or '0' if userspace must update
+                       the kdump capture kernel memory map.
+
+                       Availability depends on the CONFIG_MEMORY_HOTPLUG kernel
+                       configuration option.
 ====================== =========================================================
 
 .. note::
diff --git a/Documentation/core-api/cpu_hotplug.rst b/Documentation/core-api/cpu_hotplug.rst
index f75778d37488..0c8dc3fe5
Re: [PATCH v22 6/8] crash: hotplug support for kexec_load()
On 5/9/23 01:15, Sourabh Jain wrote:
> On 04/05/23 04:11, Eric DeVolder wrote:
>> The hotplug support for kexec_load() requires coordination with
>> userspace, and therefore a little extra help from the kernel to
>> facilitate the coordination.
>>
>> In the absence of the solution contained within this particular
>> patch, if a kdump capture kernel is loaded via kexec_load() syscall,
>> then the crash hotplug logic would find the segment containing the
>> elfcorehdr, and upon a hotplug event, rewrite the elfcorehdr. While
>> generally speaking that is the desired behavior and outcome, a
>> problem arises from the fact that if the kdump image includes a
>> purgatory that performs a digest checksum, then that check would
>> fail (because the elfcorehdr was changed), and the capture kernel
>> would fail to boot and no kdump occur.
>>
>> Therefore, what is needed is for the userspace kexec-tools to
>> indicate to the kernel whether or not the supplied kdump image/
>> elfcorehdr can be modified (because the kexec-tools excludes the
>> elfcorehdr from the digest, and sizes the elfcorehdr memory buffer
>> appropriately).
>>
>> To solve these problems, this patch introduces:
>>  - a new kexec flag KEXEC_UPDATE_ELFCOREHDR to indicate that it is
>
> Architectures may need to update kexec segments other than elfcorehdr.
> How about changing the flag name to KEXEC_UPDATE_SEGMENTS?
>
> - Sourabh

This seems almost too generic and vague. I get that for PPC this flag
will drive updating elfcorehdr as well as FDT, so the flag is
over-loaded in a sense. Another idea for the name?

eric

>>    safe for the kernel to modify the elfcorehdr (because kexec-tools
>>    has excluded the elfcorehdr from the digest).
>>  - the /sys/kernel/crash_elfcorehdr_size node to communicate to
>>    kexec-tools what the preferred size of the elfcorehdr memory buffer
>>    should be in order to accommodate hotplug changes.
>>  - The sysfs crash_hotplug nodes (i.e.
>>    /sys/devices/system/[cpu|memory]/crash_hotplug) are now dynamic in
>>    that they examine kexec_file_load() vs kexec_load(), and when
>>    kexec_load(), whether or not KEXEC_UPDATE_ELFCOREHDR is in effect.
>>    This is critical so that the udev rule processing of crash_hotplug
>>    indicates correctly (i.e. the userspace unload-then-load of the
>>    kdump image can be skipped, or not).
>>
>> With this patch in place, I believe the following statements to be true
>> (with local testing to verify):
>>
>>  - For systems which have these kernel changes in place, but not the
>>    corresponding changes to the crash hot plug udev rules and
>>    kexec-tools, (i.e. "older" systems) those systems will continue to
>>    unload-then-load the kdump image, as has always been done. The
>>    kexec-tools will not set KEXEC_UPDATE_ELFCOREHDR.
>>  - For systems which have these kernel changes in place and the proposed
>>    udev rule changes in place, but not the kexec-tools changes in place:
>>     - the use of kexec_load() will not set KEXEC_UPDATE_ELFCOREHDR and
>>       so the unload-then-reload of kdump image will occur (the sysfs
>>       crash_hotplug nodes will show 0).
>>     - the use of kexec_file_load() will permit sysfs crash_hotplug nodes
>>       to show 1, and the kernel will modify the elfcorehdr directly. And
>>       with the udev changes in place, the unload-then-load will not occur!
>>  - For systems which have these kernel changes as well as the udev and
>>    kexec-tools changes in place, then the user/admin has full authority
>>    over the enablement and support of crash hotplug support, whether via
>>    kexec_file_load() or kexec_load().
>>
>> Said differently, as kexec_load() was/is widely in use, these changes
>> permit it to continue to be used as-is (retaining the current
>> unload-then-reload behavior) until such time as the udev and
>> kexec-tools changes can be rolled out as well.
>>
>> I've intentionally kept the changes related to userspace coordination
>> for kexec_load() separate as this need was identified late; the
>> rest of this series has been generally reviewed and accepted. Once
>> this support has been vetted, I can refactor if needed.
>>
>> Suggested-by: Hari Bathini
>> Signed-off-by: Eric DeVolder
>> ---
>>  arch/x86/include/asm/kexec.h | 11 +++
>>  arch/x86/kernel/crash.c      | 27 +++
>>  include/linux/kexec.h        | 14 --
>>  include/uapi/linux/kexec.h   |  1 +
>>  kernel/crash_core.c          | 31 +++
>>  kernel/kexec.c               |  3 +++
>>  kernel/ksysfs.c              | 15 +++
>>  7 files changed, 96 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
>> index 9143100ea3ea..3be6a98751f0 100644
>> --- a/arch/x86/include/asm/kexec.h
>> +++ b/arch/x86/include/asm/kexec.h
>> @@ -214,14 +214,17 @@ void arch_crash_handle_hotplug_event(struct kimage *image);
>>  #define arch_crash_handle_hotplug_event arch_crash_handle_hotplug_event
>>
>>  #ifdef CONFIG_HOTPLUG_CPU
>> -static inline int crash_hotplug_cpu_support(void) { return 1; }
>> -#define crash_hotplug_cpu_support crash_hotplug_c
Re: [PATCH v22 5/8] x86/crash: add x86 crash hotplug support
On 5/9/23 17:52, Thomas Gleixner wrote:
> On Wed, May 03 2023 at 18:41, Eric DeVolder wrote:
>> In the patch 'kexec: exclude elfcorehdr from the segment digest'
>
> See reply to 8/8

yep

>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> index 53bab123a8ee..80538524c494 100644
>> --- a/arch/x86/Kconfig
>> +++ b/arch/x86/Kconfig
>> @@ -2119,6 +2119,19 @@ config CRASH_DUMP
>>  	  (CONFIG_RELOCATABLE=y).
>>  	  For more details see Documentation/admin-guide/kdump/kdump.rst
>>
>> +config CRASH_HOTPLUG
>> +	bool "Update the crash elfcorehdr on system configuration changes"
>> +	default y
>> +	depends on CRASH_DUMP && (HOTPLUG_CPU || MEMORY_HOTPLUG)
>> +	help
>> +	  Enable direct update to the crash elfcorehdr (which contains
>> +	  the list of CPUs and memory regions to be dumped upon a crash)
>> +	  in response to hot plug/unplug or online/offline of CPUs or
>> +	  memory. This is a much more advanced approach than userspace
>> +	  attempting that.
>> +
>> +	  If unsure, say Y.
>
> Why is this config an X86 specific thing?
>
> Neither CRASH_DUMP nor HOTPLUG_CPU nor MEMORY_HOTPLUG are in any way X86
> specific at all. So why can't you stick that into a place where it can
> be reused by other architectures?
>
> It's not rocket science to do
>
> +	depends on WANTS_CRASH_HOTPLUG && CRASH_DUMP && (HOTPLUG_CPU || MEMORY_HOTPLUG)
>
> or something like that. It's so tiring to have x86 Kconfig be the dump
> ground for the initial implementation, then having the sh*t copied to
> every other architecture and the cleanup is left to the maintainers.
>
> It's not rocket science to differentiate between a real architecture
> specific option and a generally useful option in the first place, right?

Right. To your point, CRASH_DUMP has been copied in all the archs:

  arch/arm/Kconfig:config CRASH_DUMP
  arch/arm64/Kconfig:config CRASH_DUMP
  arch/ia64/Kconfig:config CRASH_DUMP
  arch/mips/Kconfig:config CRASH_DUMP
  arch/powerpc/Kconfig:config CRASH_DUMP
  arch/riscv/Kconfig:config CRASH_DUMP
  arch/s390/Kconfig:config CRASH_DUMP
  arch/sh/Kconfig:config CRASH_DUMP
  arch/x86/Kconfig:config CRASH_DUMP
  arch/loongarch/Kconfig:config CRASH_DUMP

Likewise for KEXEC and KEXEC_FILE.

I've looked into this in the past, and looking again today, I don't see
a natural place to put the option. Perhaps starting a kernel/Kconfig.kexec?

>> +#ifdef CONFIG_CRASH_HOTPLUG
>> +	/*
>> +	 * Ensure the elfcorehdr segment large enough for hotplug changes.
>> +	 * Account for VMCOREINFO and kernel_map and maximum CPUs.
>
> Neither the first line nor the second one qualifies as parseable
> sentences.

What about:
  Ensure the elfcorehdr segment is large enough for hotplug changes.
  The segment size accounts for VMCOREINFO, kernel_map, maximum CPUs
  and maximum memory ranges.

>> +/**
>> + * arch_crash_handle_hotplug_event() - Handle hotplug elfcorehdr changes
>> + * @image: the active struct kimage
>
> What is an active struct kimage?

How about this:
  @image: a pointer to kexec_crash_image

>> + *
>> + * The new elfcorehdr is prepared in a kernel buffer, and then it is
>> + * written on top of the existing/old elfcorehdr.
>
> -ENOPARSE

How about:
  Prepare the new elfcorehdr and replace the existing elfcorehdr.

>> + */
>> +void arch_crash_handle_hotplug_event(struct kimage *image)
>> +{
>> +	void *elfbuf = NULL, *old_elfcorehdr;
>> +	unsigned long nr_mem_ranges;
>> +	unsigned long mem, memsz;
>> +	unsigned long elfsz = 0;
>> +
>> +	/*
>> +	 * Create the new elfcorehdr reflecting the changes to CPU and/or
>> +	 * memory resources.
>> +	 */
>> +	if (prepare_elf_headers(image, &elfbuf, &elfsz, &nr_mem_ranges)) {
>> +		pr_err("unable to prepare elfcore headers");
>> +		goto out;
>
> So this can fail. Why is there just a pr_err() and no return value which
> tells the caller that this failed?

An error in the crash elfcorehdr infrastructure introduced in this
series is not a reason to roll back state. The cpuhp and memory
notifier callbacks always return an OK. The primary errors that might
occur are failure to obtain the kexec_lock, and failure to obtain a
temporary kernel buffer to stage the new elfcorehdr.

How about:
  pr_err("prepare_elf_headers() failed");

>> +	/*
>> +	 * Copy new elfcorehdr over the old elfcorehdr at destination.
>> +	 */
>> +	old_elfcorehdr = kmap_local_page(pfn_to_page(mem >> PAGE_SHIFT));
>> +	if (!old_elfcorehdr) {
>> +		pr_err("updating elfcorehdr failed\n");
>
> How hard is it to write an error message which is clearly describing the
> problem?

How about:
  pr_err("mapping elfcorehdr segment failed");

> Thanks,
>
>         tglx

Again, thanks for the fresh eyes!
eric
Re: [PATCH v22 8/8] x86/crash: optimize CPU changes
On 5/9/23 17:39, Thomas Gleixner wrote: On Wed, May 03 2023 at 18:41, Eric DeVolder wrote: This patch is dependent upon the patch 'crash: change Seriously? You send a patch series which is ordered in itself and then tell in the changelog of patch 8/8 that it depends on patch 7/8? This information is complete garbage once the patches are applied and ends up in the git logs and even for the submission it's useless information. Patch series are usually ordered by dependecy, no? Aside of that please do: # git grep 'This patch' Documentation/process/ I'll remove, and re-examine the messages to use imperative tone. crash_prepare_elf64_headers() to for_each_possible_cpu()'. With that patch, crash_prepare_elf64_headers() writes out an ELF CPU PT_NOTE for all possible CPUs, thus further CPU changes to the elfcorehdr are not needed. I'm having a hard time to decode this word salad. crash_prepare_elf64_headers() is writing out an ELF CPU PT_NOTE for all possible CPUs, thus further changes to the ELF core header are not required. Makes some sense to me. How about this? crash_prepare_elf64_headers() writes into the elfcorehdr an ELF PT_NOTE for all possible CPUs. As such, subsequent changes to CPUs (ie. hot un/plug, online/offline) do not need to rewrite the elfcorehdr. This change works for kexec_file_load() and kexec_load() syscalls. For kexec_file_load(), crash_prepare_elf64_headers() is utilized directly and thus all ELF CPU PT_NOTEs are in the elfcorehdr already. This is the kimage->file_mode term. For kexec_load() syscall, one CPU or memory change will cause the elfcorehdr to be updated via crash_prepare_elf64_headers() and at that point all ELF CPU PT_NOTEs are in the elfcorehdr. This is the kimage->elfcorehdr_updated term. Sorry. I tried hard, but this is completely incomprehensible. How about this? The kimage->file_mode term covers kdump images loaded via the kexec_file_load() syscall. 
Since crash_prepare_elf64_headers() wrote the initial elfcorehdr, no update to the elfcorehdr is needed for CPU changes. The kimage->elfcorehdr_updated term covers kdump images loaded via the kexec_load() syscall. At least one memory or CPU change must occur to cause crash_prepare_elf64_headers() to rewrite the elfcorehdr. Afterwards, no update to the elfcorehdr is needed for CPU changes.

diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index 8064e65de6c0..3157e6068747 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -483,6 +483,16 @@ void arch_crash_handle_hotplug_event(struct kimage *image)
 	unsigned long mem, memsz;
 	unsigned long elfsz = 0;
+	/* As crash_prepare_elf64_headers() has already described all

This is not a proper multiline comment. Please read and follow the tip tree documentation along with all other things which are documented there: https://www.kernel.org/doc/html/latest/process/maintainer-tip.html This documentation is not there for entertainment value or exists just because we are bored to death.

I'll fix it; unintentional. Should checkpatch.pl catch this (it did not)?

+	 * possible CPUs, there is no need to update the elfcorehdr
+	 * for additional CPU changes. This works for both kexec_load()
+	 * and kexec_file_load() syscalls.

And it does not work for what?

I'll remove this. I keep using phrases like this since kexec_file_load() is wholly controlled by the kernel code, whereas kexec_load() has userspace dependencies. In this case, the sentence isn't warranted; it will work; no exceptional cases.

You cannot expect that anyone who reads this code is a kexec/crash* wizard who might be able to deduce the meaning of this. Thanks, tglx

Yes, thanks for the fresh eyes!
eric
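The two kimage terms discussed in this reply can be read as a single predicate: a CPU event needs no elfcorehdr rewrite once the header already describes all possible CPUs. The following is a minimal userspace sketch of that reading, not the kernel code. The struct kimage_sketch type, the hp_event enum and the function names are hypothetical and exist only for illustration; only the file_mode and elfcorehdr_updated field names come from the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical event kinds; these names are illustrative, not the kernel's. */
enum hp_event { HP_ADD_CPU, HP_REMOVE_CPU, HP_ADD_MEMORY, HP_REMOVE_MEMORY };

/* Stand-in for the two struct kimage fields discussed in the thread. */
struct kimage_sketch {
	bool file_mode;          /* image was loaded via kexec_file_load() */
	bool elfcorehdr_updated; /* kernel has rewritten the elfcorehdr at least once */
};

static bool is_cpu_event(enum hp_event ev)
{
	return ev == HP_ADD_CPU || ev == HP_REMOVE_CPU;
}

/* Returns true when the elfcorehdr has to be regenerated for this event. */
bool elfcorehdr_needs_rewrite(const struct kimage_sketch *image, enum hp_event ev)
{
	assert(image != NULL);

	/*
	 * Either term means crash_prepare_elf64_headers() has already
	 * described every possible CPU, so CPU events can be skipped.
	 */
	if (is_cpu_event(ev) && (image->file_mode || image->elfcorehdr_updated))
		return false;

	/* Memory events, and the first event after kexec_load(), still rewrite. */
	return true;
}
```

Under this sketch a kexec_load() image takes exactly one rewrite (for whichever event arrives first), after which CPU events are skipped while memory events continue to trigger rewrites.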
Re: [PATCH v22 6/8] crash: hotplug support for kexec_load()
On 5/9/23 01:56, Sourabh Jain wrote: > > On 04/05/23 04:11, Eric DeVolder wrote: >> The hotplug support for kexec_load() requires coordination with >> userspace, and therefore a little extra help from the kernel to >> facilitate the coordination. >> >> In the absence of the solution contained within this particular >> patch, if a kdump capture kernel is loaded via kexec_load() syscall, >> then the crash hotplug logic would find the segment containing the >> elfcorehdr, and upon a hotplug event, rewrite the elfcorehdr. While >> generally speaking that is the desired behavior and outcome, a >> problem arises from the fact that if the kdump image includes a >> purgatory that performs a digest checksum, then that check would >> fail (because the elfcorehdr was changed), and the capture kernel >> would fail to boot and no kdump occur. >> >> Therefore, what is needed is for the userspace kexec-tools to >> indicate to the kernel whether or not the supplied kdump image/ >> elfcorehdr can be modified (because the kexec-tools excludes the >> elfcorehdr from the digest, and sizes the elfcorehdr memory buffer >> appropriately). >> >> To solve these problems, this patch introduces: >> - a new kexec flag KEXEC_UPDATE_ELFCOREHDR to indicate that it is >> safe for the kernel to modify the elfcorehdr (because kexec-tools >> has excluded the elfcorehdr from the digest). >> - the /sys/kernel/crash_elfcorehdr_size node to communicate to >> kexec-tools what the preferred size of the elfcorehdr memory buffer >> should be in order to accommodate hotplug changes. >> - The sysfs crash_hotplug nodes (ie. >> /sys/devices/system/[cpu|memory]/crash_hotplug) are now dynamic in >> that they examine kexec_file_load() vs kexec_load(), and when >> kexec_load(), whether or not KEXEC_UPDATE_ELFCOREHDR is in effect. >> This is critical so that the udev rule processing of crash_hotplug >> indicates correctly (ie. the userspace unload-then-load of the >> kdump image can be skipped, or not).
>> >> With this patch in place, I believe the following statements to be true >> (with local testing to verify): >> >> - For systems which have these kernel changes in place, but not the >> corresponding changes to the crash hot plug udev rules and >> kexec-tools, (ie "older" systems) those systems will continue to >> unload-then-load the kdump image, as has always been done. The >> kexec-tools will not set KEXEC_UPDATE_ELFCOREHDR. >> - For systems which have these kernel changes in place and the proposed >> udev rule changes in place, but not the kexec-tools changes in place: >> - the use of kexec_load() will not set KEXEC_UPDATE_ELFCOREHDR and >> so the unload-then-reload of kdump image will occur (the sysfs >> crash_hotplug nodes will show 0). >> - the use of kexec_file_load() will permit sysfs crash_hotplug nodes >> to show 1, and the kernel will modify the elfcorehdr directly. And >> with the udev changes in place, the unload-then-load will not occur! >> - For systems which have these kernel changes as well as the udev and >> kexec-tools changes in place, then the user/admin has full authority >> over the enablement and support of crash hotplug support, whether via >> kexec_file_load() or kexec_load(). >> >> Said differently, as kexec_load() was/is widely in use, these changes >> permit it to continue to be used as-is (retaining the current unload-then- >> reload behavior) until such time as the udev and kexec-tools changes can >> be rolled out as well. >> >> I've intentionally kept the changes related to userspace coordination >> for kexec_load() separate as this need was identified late; the >> rest of this series has been generally reviewed and accepted. Once >> this support has been vetted, I can refactor if needed. 
>> >> Suggested-by: Hari Bathini >> Signed-off-by: Eric DeVolder >> --- >> arch/x86/include/asm/kexec.h | 11 +++ >> arch/x86/kernel/crash.c | 27 +++ >> include/linux/kexec.h | 14 -- >> include/uapi/linux/kexec.h | 1 + >> kernel/crash_core.c | 31 +++ >> kernel/kexec.c | 3 +++ >> kernel/ksysfs.c | 15 +++ >> 7 files changed, 96 insertions(+), 6 deletions(-) >> >> diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h >> index
[PATCH v22 6/8] crash: hotplug support for kexec_load()
The hotplug support for kexec_load() requires coordination with userspace, and therefore a little extra help from the kernel to facilitate the coordination.

In the absence of the solution contained within this particular patch, if a kdump capture kernel is loaded via kexec_load() syscall, then the crash hotplug logic would find the segment containing the elfcorehdr, and upon a hotplug event, rewrite the elfcorehdr. While generally speaking that is the desired behavior and outcome, a problem arises from the fact that if the kdump image includes a purgatory that performs a digest checksum, then that check would fail (because the elfcorehdr was changed), and the capture kernel would fail to boot and no kdump occur.

Therefore, what is needed is for the userspace kexec-tools to indicate to the kernel whether or not the supplied kdump image/elfcorehdr can be modified (because the kexec-tools excludes the elfcorehdr from the digest, and sizes the elfcorehdr memory buffer appropriately).

To solve these problems, this patch introduces:
 - a new kexec flag KEXEC_UPDATE_ELFCOREHDR to indicate that it is safe for the kernel to modify the elfcorehdr (because kexec-tools has excluded the elfcorehdr from the digest).
 - the /sys/kernel/crash_elfcorehdr_size node to communicate to kexec-tools what the preferred size of the elfcorehdr memory buffer should be in order to accommodate hotplug changes.
 - The sysfs crash_hotplug nodes (ie. /sys/devices/system/[cpu|memory]/crash_hotplug) are now dynamic in that they examine kexec_file_load() vs kexec_load(), and when kexec_load(), whether or not KEXEC_UPDATE_ELFCOREHDR is in effect. This is critical so that the udev rule processing of crash_hotplug indicates correctly (ie. the userspace unload-then-load of the kdump image can be skipped, or not).
With this patch in place, I believe the following statements to be true (with local testing to verify):

 - For systems which have these kernel changes in place, but not the corresponding changes to the crash hot plug udev rules and kexec-tools, (ie "older" systems) those systems will continue to unload-then-load the kdump image, as has always been done. The kexec-tools will not set KEXEC_UPDATE_ELFCOREHDR.
 - For systems which have these kernel changes in place and the proposed udev rule changes in place, but not the kexec-tools changes in place:
   - the use of kexec_load() will not set KEXEC_UPDATE_ELFCOREHDR and so the unload-then-reload of kdump image will occur (the sysfs crash_hotplug nodes will show 0).
   - the use of kexec_file_load() will permit sysfs crash_hotplug nodes to show 1, and the kernel will modify the elfcorehdr directly. And with the udev changes in place, the unload-then-load will not occur!
 - For systems which have these kernel changes as well as the udev and kexec-tools changes in place, then the user/admin has full authority over the enablement and support of crash hotplug support, whether via kexec_file_load() or kexec_load().

Said differently, as kexec_load() was/is widely in use, these changes permit it to continue to be used as-is (retaining the current unload-then-reload behavior) until such time as the udev and kexec-tools changes can be rolled out as well.

I've intentionally kept the changes related to userspace coordination for kexec_load() separate as this need was identified late; the rest of this series has been generally reviewed and accepted. Once this support has been vetted, I can refactor if needed.
Suggested-by: Hari Bathini Signed-off-by: Eric DeVolder --- arch/x86/include/asm/kexec.h | 11 +++ arch/x86/kernel/crash.c | 27 +++ include/linux/kexec.h| 14 -- include/uapi/linux/kexec.h | 1 + kernel/crash_core.c | 31 +++ kernel/kexec.c | 3 +++ kernel/ksysfs.c | 15 +++ 7 files changed, 96 insertions(+), 6 deletions(-) diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h index 9143100ea3ea..3be6a98751f0 100644 --- a/arch/x86/include/asm/kexec.h +++ b/arch/x86/include/asm/kexec.h @@ -214,14 +214,17 @@ void arch_crash_handle_hotplug_event(struct kimage *image); #define arch_crash_handle_hotplug_event arch_crash_handle_hotplug_event #ifdef CONFIG_HOTPLUG_CPU -static inline int crash_hotplug_cpu_support(void) { return 1; } -#define crash_hotplug_cpu_support crash_hotplug_cpu_support +int arch_crash_hotplug_cpu_support(void); +#define crash_hotplug_cpu_support arch_crash_hotplug_cpu_support #endif #ifdef CONFIG_MEMORY_HOTPLUG -static inline int crash_hotplug_memory_support(void) { return 1; } -#define crash_hotplug_memory_support crash_hotplug_memory_support +int arch_crash_hotplug_memory_support(void); +#define crash_hotplug_memory_support arch_crash_hotplug_memory_support #endif + +unsigned int arch_crash_get_elfco
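The reporting rule this changelog describes for the sysfs crash_hotplug nodes can be sketched as a small decision function. This is a hedged illustration under stated assumptions, not the code from the patch: struct kdump_load_sketch and both function names are hypothetical, and the sketch only covers the cases the changelog enumerates (it does not model the case where no kdump image is loaded).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of how the kdump image was loaded. */
struct kdump_load_sketch {
	bool file_mode;         /* loaded via kexec_file_load() */
	bool update_elfcorehdr; /* kexec_load() caller set KEXEC_UPDATE_ELFCOREHDR */
};

/* Value a crash_hotplug sysfs node would show under the changelog's rules. */
int crash_hotplug_value(const struct kdump_load_sketch *load)
{
	assert(load != NULL);

	/* kexec_file_load(): the kernel built the image and excluded the
	 * elfcorehdr from the digest, so it can update the header itself. */
	if (load->file_mode)
		return 1;

	/* kexec_load(): only safe when userspace opted in via the flag. */
	return load->update_elfcorehdr ? 1 : 0;
}

/* The udev rule skips the unload-then-reload only when the node shows 1. */
bool userspace_must_reload(const struct kdump_load_sketch *load)
{
	return crash_hotplug_value(load) == 0;
}
```

With older kexec-tools, a kexec_load() image never carries the flag, so the node shows 0 and the existing unload-then-reload behavior is retained, which matches the compatibility cases listed in the changelog.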
[PATCH v22 4/8] crash: memory and CPU hotplug sysfs attributes
This introduces the crash_hotplug attribute for memory and CPUs for use by userspace. This change directly facilitates the udev rule for managing userspace re-loading of the crash kernel upon hot un/plug changes. For memory, this changeset introduces the crash_hotplug attribute to the /sys/devices/system/memory directory. For example: # udevadm info --attribute-walk /sys/devices/system/memory/memory81 looking at device '/devices/system/memory/memory81': KERNEL=="memory81" SUBSYSTEM=="memory" DRIVER=="" ATTR{online}=="1" ATTR{phys_device}=="0" ATTR{phys_index}=="0051" ATTR{removable}=="1" ATTR{state}=="online" ATTR{valid_zones}=="Movable" looking at parent device '/devices/system/memory': KERNELS=="memory" SUBSYSTEMS=="" DRIVERS=="" ATTRS{auto_online_blocks}=="offline" ATTRS{block_size_bytes}=="800" ATTRS{crash_hotplug}=="1" For CPUs, this changeset introduces the crash_hotplug attribute to the /sys/devices/system/cpu directory. For example: # udevadm info --attribute-walk /sys/devices/system/cpu/cpu0 looking at device '/devices/system/cpu/cpu0': KERNEL=="cpu0" SUBSYSTEM=="cpu" DRIVER=="processor" ATTR{crash_notes}=="277c38600" ATTR{crash_notes_size}=="368" ATTR{online}=="1" looking at parent device '/devices/system/cpu': KERNELS=="cpu" SUBSYSTEMS=="" DRIVERS=="" ATTRS{crash_hotplug}=="1" ATTRS{isolated}=="" ATTRS{kernel_max}=="8191" ATTRS{nohz_full}==" (null)" ATTRS{offline}=="4-7" ATTRS{online}=="0-3" ATTRS{possible}=="0-7" ATTRS{present}=="0-3" With these sysfs attributes in place, it is possible to efficiently instruct the udev rule to skip crash kernel reloading for kernels configured with crash hotplug support. 
For example, the following is the proposed udev rule change for RHEL system 98-kexec.rules (as the first lines of the rule file): # The kernel updates the crash elfcorehdr for CPU and memory changes SUBSYSTEM=="cpu", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end" SUBSYSTEM=="memory", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end" When examined in the context of 98-kexec.rules, the above change tests if crash_hotplug is set, and if so, it skips the userspace initiated unload-then-reload of the crash kernel. CPU and memory checks are separated in accordance with CONFIG_HOTPLUG_CPU and CONFIG_MEMORY_HOTPLUG kernel config options. If an architecture supports, for example, memory hotplug but not CPU hotplug, then the /sys/devices/system/memory/crash_hotplug attribute file is present, but the /sys/devices/system/cpu/crash_hotplug attribute file will NOT be present. Thus the udev rule will skip userspace processing of memory hot un/plug events, but the udev rule will evaluate false for CPU events, thus allowing userspace to process CPU hot un/plug events (ie the unload-then-reload of the kdump capture kernel). Signed-off-by: Eric DeVolder Reviewed-by: Sourabh Jain Acked-by: Baoquan He --- .../admin-guide/mm/memory-hotplug.rst | 8 Documentation/core-api/cpu_hotplug.rst | 18 ++ drivers/base/cpu.c | 14 ++ drivers/base/memory.c | 13 + include/linux/kexec.h | 8 5 files changed, 61 insertions(+) diff --git a/Documentation/admin-guide/mm/memory-hotplug.rst b/Documentation/admin-guide/mm/memory-hotplug.rst index 1b02fe5807cc..eb99d79223a3 100644 --- a/Documentation/admin-guide/mm/memory-hotplug.rst +++ b/Documentation/admin-guide/mm/memory-hotplug.rst @@ -291,6 +291,14 @@ The following files are currently defined: Availability depends on the CONFIG_ARCH_MEMORY_PROBE kernel configuration option. ``uevent``read-write: generic udev file for device subsystems. 
+``crash_hotplug`` read-only: when changes to the system memory map + occur due to hot un/plug of memory, this file contains + '1' if the kernel updates the kdump capture kernel memory + map itself (via elfcorehdr), or '0' if userspace must update + the kdump capture kernel memory map. + + Availability depends on the CONFIG_MEMORY_HOTPLUG kernel + configuration option. == = .. note:: diff --git a/Documentation/core-api/cpu_hotplug.rst b/Documentation/core-api/cpu_hotplug.rst index f75
[PATCH v22 1/8] crash: move a few code bits to setup support of crash hotplug
The crash hotplug support leans on the work for the kexec_file_load() syscall. To also support the kexec_load() syscall, a few bits of code need to be moved outside of CONFIG_KEXEC_FILE. As such, these bits are moved out of kexec_file.c and into a common location crash_core.c. No functionality change intended. Signed-off-by: Eric DeVolder Reviewed-by: Sourabh Jain Acked-by: Baoquan He --- include/linux/kexec.h | 30 +++ kernel/crash_core.c | 182 ++ kernel/kexec_file.c | 181 - 3 files changed, 197 insertions(+), 196 deletions(-) diff --git a/include/linux/kexec.h b/include/linux/kexec.h index 22b5cd24f581..811a90e09698 100644 --- a/include/linux/kexec.h +++ b/include/linux/kexec.h @@ -105,6 +105,21 @@ struct compat_kexec_segment { }; #endif +/* Alignment required for elf header segment */ +#define ELF_CORE_HEADER_ALIGN 4096 + +struct crash_mem { + unsigned int max_nr_ranges; + unsigned int nr_ranges; + struct range ranges[]; +}; + +extern int crash_exclude_mem_range(struct crash_mem *mem, + unsigned long long mstart, + unsigned long long mend); +extern int crash_prepare_elf64_headers(struct crash_mem *mem, int need_kernel_map, + void **addr, unsigned long *sz); + #ifdef CONFIG_KEXEC_FILE struct purgatory_info { /* @@ -230,21 +245,6 @@ static inline int arch_kexec_locate_mem_hole(struct kexec_buf *kbuf) } #endif -/* Alignment required for elf header segment */ -#define ELF_CORE_HEADER_ALIGN 4096 - -struct crash_mem { - unsigned int max_nr_ranges; - unsigned int nr_ranges; - struct range ranges[]; -}; - -extern int crash_exclude_mem_range(struct crash_mem *mem, - unsigned long long mstart, - unsigned long long mend); -extern int crash_prepare_elf64_headers(struct crash_mem *mem, int need_kernel_map, - void **addr, unsigned long *sz); - #ifndef arch_kexec_apply_relocations_add /* * arch_kexec_apply_relocations_add - apply relocations of type RELA diff --git a/kernel/crash_core.c b/kernel/crash_core.c index 90ce1dfd591c..b7c30b748a16 100644 --- a/kernel/crash_core.c +++ 
b/kernel/crash_core.c @@ -10,6 +10,7 @@ #include #include #include +#include #include #include @@ -314,6 +315,187 @@ static int __init parse_crashkernel_dummy(char *arg) } early_param("crashkernel", parse_crashkernel_dummy); +int crash_prepare_elf64_headers(struct crash_mem *mem, int need_kernel_map, + void **addr, unsigned long *sz) +{ + Elf64_Ehdr *ehdr; + Elf64_Phdr *phdr; + unsigned long nr_cpus = num_possible_cpus(), nr_phdr, elf_sz; + unsigned char *buf; + unsigned int cpu, i; + unsigned long long notes_addr; + unsigned long mstart, mend; + + /* extra phdr for vmcoreinfo ELF note */ + nr_phdr = nr_cpus + 1; + nr_phdr += mem->nr_ranges; + + /* +* kexec-tools creates an extra PT_LOAD phdr for kernel text mapping +* area (for example, 8000 - a000 on x86_64). +* I think this is required by tools like gdb. So same physical +* memory will be mapped in two ELF headers. One will contain kernel +* text virtual addresses and other will have __va(physical) addresses. +*/ + + nr_phdr++; + elf_sz = sizeof(Elf64_Ehdr) + nr_phdr * sizeof(Elf64_Phdr); + elf_sz = ALIGN(elf_sz, ELF_CORE_HEADER_ALIGN); + + buf = vzalloc(elf_sz); + if (!buf) + return -ENOMEM; + + ehdr = (Elf64_Ehdr *)buf; + phdr = (Elf64_Phdr *)(ehdr + 1); + memcpy(ehdr->e_ident, ELFMAG, SELFMAG); + ehdr->e_ident[EI_CLASS] = ELFCLASS64; + ehdr->e_ident[EI_DATA] = ELFDATA2LSB; + ehdr->e_ident[EI_VERSION] = EV_CURRENT; + ehdr->e_ident[EI_OSABI] = ELF_OSABI; + memset(ehdr->e_ident + EI_PAD, 0, EI_NIDENT - EI_PAD); + ehdr->e_type = ET_CORE; + ehdr->e_machine = ELF_ARCH; + ehdr->e_version = EV_CURRENT; + ehdr->e_phoff = sizeof(Elf64_Ehdr); + ehdr->e_ehsize = sizeof(Elf64_Ehdr); + ehdr->e_phentsize = sizeof(Elf64_Phdr); + + /* Prepare one phdr of type PT_NOTE for each present CPU */ + for_each_present_cpu(cpu) { + phdr->p_type = PT_NOTE; + notes_addr = per_cpu_ptr_to_phys(per_cpu_ptr(crash_notes, cpu)); + phdr->p_offset = phdr->p_paddr = notes_addr; + phdr->p_filesz = phdr->p_memsz = sizeof(note_buf_t); + 
(ehdr->e_phnum)++; + phdr++; + } + + /* Prepare one PT_NOTE header for vmcoreinfo */ + phdr->p_type = PT_NOTE; + phdr->p_offset = phdr-&
[PATCH v22 3/8] kexec: exclude elfcorehdr from the segment digest
When a crash kernel is loaded via the kexec_file_load() syscall, the kernel places the various segments (ie crash kernel, crash initrd, boot_params, elfcorehdr, purgatory, etc) in memory. For those architectures that utilize purgatory, a hash digest of the segments is calculated for integrity checking. This digest is embedded into the purgatory image prior to placing purgatory in memory.

This patchset updates the elfcorehdr on CPU or memory changes. However, changes to the elfcorehdr in turn cause purgatory integrity checking to fail (at crash time, and no vmcore created). Therefore, this patch explicitly excludes the elfcorehdr segment from the list of segments used to create the digest. By doing so, this permits updates to the elfcorehdr in response to CPU or memory changes, and avoids the need to also recompute the hash digest and reload purgatory.

Signed-off-by: Eric DeVolder
Reviewed-by: Sourabh Jain
Acked-by: Baoquan He
---
 kernel/kexec_file.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index f8b1797b3ec9..1d2cfc869a75 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -726,6 +726,12 @@ static int kexec_calculate_store_digests(struct kimage *image)
 	for (j = i = 0; i < image->nr_segments; i++) {
 		struct kexec_segment *ksegment;
 
+#ifdef CONFIG_CRASH_HOTPLUG
+		/* Exclude elfcorehdr segment to allow future changes via hotplug */
+		if (j == image->elfcorehdr_index)
+			continue;
+#endif
+
 		ksegment = &image->segment[i];
 		/*
 		 * Skip purgatory as it will be modified once we put digest
-- 
2.31.1
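The effect of excluding the elfcorehdr from the digest can be demonstrated outside the kernel. The sketch below is only an illustration under several assumptions: FNV-1a stands in for the SHA-256 digest purely to keep the example dependency-free, struct segment_sketch and digest_segments() are hypothetical names, and purgatory is identified by an index rather than by its buffer. The sketch tests the input segment index i against the elfcorehdr index, which is an assumption about the intended check; j only counts the regions that were actually hashed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a loaded kexec segment. */
struct segment_sketch {
	const uint8_t *buf;
	size_t len;
};

/* FNV-1a, used here instead of SHA-256 only to avoid a crypto dependency. */
static uint64_t fnv1a_update(uint64_t h, const uint8_t *p, size_t n)
{
	for (size_t k = 0; k < n; k++) {
		h ^= p[k];
		h *= 0x100000001b3ULL;
	}
	return h;
}

/*
 * Digest every segment except the elfcorehdr and purgatory.
 * *nr_regions receives the number of segments that were hashed.
 */
uint64_t digest_segments(const struct segment_sketch *seg, size_t nr,
			 size_t elfcorehdr_index, size_t purgatory_index,
			 size_t *nr_regions)
{
	uint64_t h = 0xcbf29ce484222325ULL; /* FNV-1a 64-bit offset basis */
	size_t i, j;

	assert(seg != NULL && nr_regions != NULL);

	for (j = i = 0; i < nr; i++) {
		/* Excluded so later hotplug rewrites do not break the check. */
		if (i == elfcorehdr_index)
			continue;
		/* Purgatory is modified when the digest is stored into it. */
		if (i == purgatory_index)
			continue;
		h = fnv1a_update(h, seg[i].buf, seg[i].len);
		j++;
	}

	*nr_regions = j;
	return h;
}
```

Two images that differ only in their elfcorehdr contents produce the same digest under this scheme, which is the property that lets the kernel rewrite the header without reloading purgatory; a change to any other hashed segment still alters the digest.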
[PATCH v22 7/8] crash: change crash_prepare_elf64_headers() to for_each_possible_cpu()
The function crash_prepare_elf64_headers() generates the elfcorehdr which describes the CPUs and memory in the system for the crash kernel. In particular, it writes out ELF PT_NOTEs for memory regions and the CPUs in the system.

With respect to the CPUs, the current implementation utilizes for_each_present_cpu() which means that as CPUs are added and removed, the elfcorehdr must again be updated to reflect the new set of CPUs.

The reasoning behind the change to use for_each_possible_cpu(), is:

- At kernel boot time, all percpu crash_notes are allocated for all possible CPUs; that is, crash_notes are not allocated dynamically when CPUs are plugged/unplugged. Thus the crash_notes for each possible CPU are always available.
- The crash_prepare_elf64_headers() creates an ELF PT_NOTE per CPU. Changing to for_each_possible_cpu() is valid as the crash_notes pointed to by each CPU PT_NOTE are present and always valid.

Furthermore, examining a common crash processing path of:

 kernel panic -> crash kernel -> makedumpfile -> 'crash' analyzer
           elfcorehdr      /proc/vmcore     vmcore

reveals how the ELF CPU PT_NOTEs are utilized:

- Upon panic, each CPU is sent an IPI and shuts itself down, recording its state in its crash_notes. When all CPUs are shut down, the crash kernel is launched with a pointer to the elfcorehdr.
- The crash kernel via linux/fs/proc/vmcore.c does not examine or use the contents of the PT_NOTEs, it exposes them via /proc/vmcore.
- The makedumpfile utility uses /proc/vmcore and reads the CPU PT_NOTEs to craft a nr_cpus variable, which is reported in a header but otherwise generally unused. Makedumpfile creates the vmcore.
- The 'crash' dump analyzer does not appear to reference the CPU PT_NOTEs. Instead it looks-up the cpu_[possible|present|online]_mask symbols and directly examines those structure contents from vmcore memory. From that information it is able to determine which CPUs are present and online, and locate the corresponding crash_notes.
Said differently, it appears that 'crash' analyzer does not rely on the ELF PT_NOTEs for CPUs; rather it obtains the information directly via kernel symbols and the memory within the vmcore. (There may be other vmcore generating and analysis tools that do use these PT_NOTEs, but 'makedumpfile' and 'crash' seem to be the most common solution.)

This change results in the benefit of having all CPUs described in the elfcorehdr, and therefore reducing the need to re-generate the elfcorehdr on CPU changes, at the small expense of an additional 56 bytes per PT_NOTE for not-present-but-possible CPUs.

On systems where kexec_file_load() syscall is utilized, all the above is valid. On systems where kexec_load() syscall is utilized, there may be the need for the elfcorehdr to be regenerated once. The reason being that some archs only populate the 'present' CPUs in the /sys/devices/system/cpu entries, which the userspace 'kexec' utility uses to generate the userspace-supplied elfcorehdr. In this situation, one memory or CPU change will rewrite the elfcorehdr via the crash_prepare_elf64_headers() function and now all possible CPUs will be described, just as with kexec_file_load() syscall.
Suggested-by: Sourabh Jain
Signed-off-by: Eric DeVolder
Reviewed-by: Sourabh Jain
Acked-by: Baoquan He
---
 kernel/crash_core.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index e05bfdb7eaed..26262789baf6 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -364,8 +364,8 @@ int crash_prepare_elf64_headers(struct crash_mem *mem, int need_kernel_map,
 	ehdr->e_ehsize = sizeof(Elf64_Ehdr);
 	ehdr->e_phentsize = sizeof(Elf64_Phdr);
 
-	/* Prepare one phdr of type PT_NOTE for each present CPU */
-	for_each_present_cpu(cpu) {
+	/* Prepare one phdr of type PT_NOTE for each possible CPU */
+	for_each_possible_cpu(cpu) {
 		phdr->p_type = PT_NOTE;
 		notes_addr = per_cpu_ptr_to_phys(per_cpu_ptr(crash_notes, cpu));
 		phdr->p_offset = phdr->p_paddr = notes_addr;
-- 
2.31.1
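The size cost quoted in the changelog (56 bytes per additional PT_NOTE) can be worked through with the sizing arithmetic that crash_prepare_elf64_headers() uses in patch 1/8 of this series: one phdr per CPU, one for vmcoreinfo, one per memory range, and one for the kernel text mapping, with the total rounded up to ELF_CORE_HEADER_ALIGN. The sketch below is a userspace restatement of that arithmetic, not kernel code. The function names are hypothetical, and the ELF64 header sizes are written out as constants (64 bytes for Elf64_Ehdr, 56 bytes for Elf64_Phdr) rather than taken from a system header.

```c
#include <assert.h>
#include <stddef.h>

#define ELF64_EHDR_SIZE        64UL   /* sizeof(Elf64_Ehdr) */
#define ELF64_PHDR_SIZE        56UL   /* sizeof(Elf64_Phdr) */
#define ELF_CORE_HEADER_ALIGN  4096UL /* from include/linux/kexec.h */

/* Header bytes before alignment, mirroring the nr_phdr computation. */
unsigned long elfcorehdr_raw_size(unsigned long nr_cpus, unsigned long nr_ranges)
{
	unsigned long nr_phdr;

	nr_phdr = nr_cpus + 1;  /* per-CPU PT_NOTEs plus the vmcoreinfo note */
	nr_phdr += nr_ranges;   /* one PT_LOAD per memory range */
	nr_phdr++;              /* extra PT_LOAD for the kernel text mapping */

	return ELF64_EHDR_SIZE + nr_phdr * ELF64_PHDR_SIZE;
}

/* Allocation size after rounding up to ELF_CORE_HEADER_ALIGN. */
unsigned long elfcorehdr_alloc_size(unsigned long nr_cpus, unsigned long nr_ranges)
{
	unsigned long sz = elfcorehdr_raw_size(nr_cpus, nr_ranges);

	return (sz + ELF_CORE_HEADER_ALIGN - 1) & ~(ELF_CORE_HEADER_ALIGN - 1);
}
```

Taking the 4-present, 8-possible CPU system from the udevadm example in patch 4/8 and assuming 4 memory ranges (an illustrative figure), the raw size grows from 624 to 848 bytes, a difference of 4 x 56 = 224 bytes, and both cases round up to the same single 4096-byte page. The extra notes therefore only change the allocation when they push the header across a page boundary.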