Re: nvdimm,pmem: makedumpfile: __vtop4_x86_64: Can't get a valid pte.
Glad you found it. Any thoughts/reactions? EM

Sent from my iPhone

> On Nov 29, 2022, at 12:17 AM, lizhij...@fujitsu.com wrote:
>
>> On 28/11/2022 23:03, Eliot Moss wrote:
>>> On 11/28/2022 9:46 AM, lizhij...@fujitsu.com wrote:
>>>
>>> On 28/11/2022 20:53, Eliot Moss wrote:
>>> On 11/28/2022 7:04 AM, lizhij...@fujitsu.com wrote:
> Hi folks,
>
> I'm going to make crash coredump support pmem region. So
> I have modified kexec-tools to add pmem region to PT_LOAD of vmcore.
>
> But it failed at makedumpfile, log are as following:
>
> In my environment, i found the last 512 pages in pmem region will cause
> the error.

I wonder if an issue I reported is related: when set up to map 2Mb (huge)
pages, the last 2Mb of a large region got mapped as 4Kb pages, and then
later, half of a large region was treated that way.

>>> Could you share the url/link ? I'd like to take a look
>>
>> It was in a previous email to the nvdimm list. the title was:
>>
>> "Possible PMD (huge pages) bug in fs dax"
>>
>> And here is the body. I just sent directly to the list so there
>> is no URL (if I should be engaging in a different way, please let me know):
>
> I found it :) at
> https://www.mail-archive.com/nvdimm@lists.linux.dev/msg02743.html
>
>>
>> Folks - I posted already on nvdimm, but perhaps the topic did not quite grab
>> anyone's attention. I had had some trouble figuring all the details to get
>> dax mapping of files from an xfs file system with underlying Optane DC memory
>> going, but now have that working reliably. But there is an odd behavior:
>>
>> When first mapping a file, I request mapping a 32 Gb range, aligned on a 1 Gb
>> (and thus clearly on a 2 Mb) boundary.
>>
>> For each group of 8 Gb, the first 4095 entries map with a 2 Mb huge (PMD)
>> page. The 4096th one does FALLBACK. I suspect some problem in
>> dax.c:grab_mapping_entry or its callees, but am not personally well enough
>> versed in either the dax code or the xarray implementation to dig further.
>>
>> If you'd like a second puzzle ... after completing this mapping, another
>> thread accesses the whole range sequentially. This results in NOPAGE fault
>> handling for the first 4095+4095 2M regions that previously resulted in
>> NOPAGE -- so far so good. But it gives FALLBACK for the upper 16 Gb (except
>> the two PMD regions it already gave FALLBACK for).
>>
>> I can provide trace output from a run if you'd like and all the ndctl, gdisk
>> -l, fdisk -l, and xfs_info details if you like.
>>
>> In my application, it would be nice if dax.c could deliver 1 Gb PUD size
>> mappings as well, though it would appear that that would require more surgery
>> on dax.c. It would be somewhat analogous to what's already there, of course,
>> but I don't mean to minimize the possible trickiness of it. I realize I
>> should submit that request as a separate thread which I intend to do
>> later.
>>
>> Regards - Eliot Moss
Re: nvdimm,pmem: makedumpfile: __vtop4_x86_64: Can't get a valid pte.
On 28/11/2022 23:03, Eliot Moss wrote:
> On 11/28/2022 9:46 AM, lizhij...@fujitsu.com wrote:
>>
>> On 28/11/2022 20:53, Eliot Moss wrote:
>>> On 11/28/2022 7:04 AM, lizhij...@fujitsu.com wrote:
Hi folks,

I'm going to make crash coredump support pmem region. So
I have modified kexec-tools to add pmem region to PT_LOAD of vmcore.

But it failed at makedumpfile, log are as following:

In my environment, i found the last 512 pages in pmem region will cause
the error.
>>>
>>> I wonder if an issue I reported is related: when set up to map
>>> 2Mb (huge) pages, the last 2Mb of a large region got mapped as
>>> 4Kb pages, and then later, half of a large region was treated
>>> that way.
>>>
>> Could you share the url/link ? I'd like to take a look
>
> It was in a previous email to the nvdimm list. the title was:
>
> "Possible PMD (huge pages) bug in fs dax"
>
> And here is the body. I just sent directly to the list so there
> is no URL (if I should be engaging in a different way, please let me know):

I found it :) at
https://www.mail-archive.com/nvdimm@lists.linux.dev/msg02743.html

>
> Folks - I posted already on nvdimm, but perhaps the topic did not quite grab
> anyone's attention. I had had some trouble figuring all the details to get
> dax mapping of files from an xfs file system with underlying Optane DC memory
> going, but now have that working reliably. But there is an odd behavior:
>
> When first mapping a file, I request mapping a 32 Gb range, aligned on a 1 Gb
> (and thus clearly on a 2 Mb) boundary.
>
> For each group of 8 Gb, the first 4095 entries map with a 2 Mb huge (PMD)
> page. The 4096th one does FALLBACK. I suspect some problem in
> dax.c:grab_mapping_entry or its callees, but am not personally well enough
> versed in either the dax code or the xarray implementation to dig further.
> This results in NOPAGE fault
> handling for the first 4095+4095 2M regions that previously resulted in
> NOPAGE -- so far so good. But it gives FALLBACK for the upper 16 Gb (except
> the two PMD regions it already gave FALLBACK for).
>
> I can provide trace output from a run if you'd like and all the ndctl, gdisk
> -l, fdisk -l, and xfs_info details if you like.
>
> In my application, it would be nice if dax.c could deliver 1 Gb PUD size
> mappings as well, though it would appear that that would require more surgery
> on dax.c. It would be somewhat analogous to what's already there, of course,
> but I don't mean to minimize the possible trickiness of it. I realize I
> should submit that request as a separate thread which I intend to do
> later.
>
> Regards - Eliot Moss
Re: [PATCH v14 3/7] crash: add generic infrastructure for crash hotplug support
On 11/28/22 at 09:46am, Eric DeVolder wrote:
>
> On 11/24/22 21:26, Baoquan He wrote:
> > On 11/16/22 at 04:46pm, Eric DeVolder wrote:
> > ..
> > > diff --git a/include/linux/kexec.h b/include/linux/kexec.h
> > > index ebf46c3b8f8b..b4dbc21f9081 100644
> > > --- a/include/linux/kexec.h
> > > +++ b/include/linux/kexec.h
> > > @@ -32,6 +32,7 @@ extern note_buf_t __percpu *crash_notes;
> > > #include
> > > #include
> > > #include
> > > +#include
> > > #include
> > > /* Verify architecture specific macros are defined */
> > > @@ -374,6 +375,13 @@ struct kimage {
> > > struct purgatory_info purgatory_info;
> > > #endif
> > > +#ifdef CONFIG_CRASH_HOTPLUG
> >
> > This kernel config CRASH_HOTPLUG is added in patch 7, but we have used
> > it in the previous patch, not sure if this is acceptable.
>
> I wasn't sure what to do here either. Patch 7 is the x86 arch-specific
> support patch, and CRASH_HOTPLUG is introduced in arch/x86/Kconfig. I did
> look at introducing CRASH_HOTPLUG as a generic/non-arch-specific option, but
> no location seemed appropriate given HOTPLUG_CPU is arch-specific and
> MEMORY_HOTPLUG is in mm/Kconfig.

arch/Kconfig? Because CRASH_CORE/KEXEC_CORE are defined there.

> This doesn't break bisect, but as you point out, not sure if the location in
> patch 7 is acceptable. I'm not really sure how to resolve the question.

Hmm, since it's bisect-able, it seems it doesn't break the rule. I could be
too sensitive. Do we have a precedent like this, to strengthen our
confidence? If no concern from other people, it's also fine to me.

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
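For illustration, a generic option along the lines Baoquan suggests (next to CRASH_CORE/KEXEC_CORE in arch/Kconfig) might look like the sketch below. Only the option name CRASH_HOTPLUG comes from the thread; the dependencies and help text are assumptions, not the actual patch:

```
# Hypothetical placement in arch/Kconfig, near CRASH_CORE/KEXEC_CORE.
config CRASH_HOTPLUG
	bool "Update the crash elfcorehdr on system configuration changes"
	default y
	depends on CRASH_DUMP && (HOTPLUG_CPU || MEMORY_HOTPLUG)
	help
	  Enable direct update of the crash elfcorehdr (the list of CPUs
	  and memory regions passed to the crash kernel) on CPU or memory
	  hotplug events, instead of unloading and reloading the whole
	  kdump image from userspace.
```

Placing it in arch/Kconfig would let each architecture opt in with a `select`-style dependency while keeping the option itself non-arch-specific, which is the concern raised above.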
Re: [PATCH v2] x86/bugs: Explicitly clear speculative MSR bits
On Mon, Nov 28, 2022 at 03:02:21PM -0800, Pawan Gupta wrote:

On Mon, Nov 28, 2022 at 11:40:19PM +0100, Borislav Petkov wrote:

On Mon, Nov 28, 2022 at 02:03:58PM -0800, Pawan Gupta wrote:

diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 3e3230cccaa7..cfc2ed2661fc 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -66,7 +66,7 @@ static DEFINE_MUTEX(spec_ctrl_mutex);
  */
 void write_spec_ctrl_current(u64 val, bool force)
 {
-	if (this_cpu_read(x86_spec_ctrl_current) == val)
+	if (!force && this_cpu_read(x86_spec_ctrl_current) == val)
 		return;
 	this_cpu_write(x86_spec_ctrl_current, val);

Still looks hacky to me. I think it would be a lot cleaner if
MSR_IA32_SPEC_CTRL gets cleaned of the speculation bits in
init_speculation_control() which gets run on *every* CPU. So by the time
check_bugs() gets to setup stuff, the MSR will be ready to go regardless.

I.e., something like this (not supposed to work - just to show what I mean):

diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 73cc546e024d..367732c92942 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -993,9 +993,19 @@ static void init_speculation_control(struct cpuinfo_x86 *c)
 	 * Intel CPUs, for finer-grained selection of what's available.
 	 */
 	if (cpu_has(c, X86_FEATURE_SPEC_CTRL)) {
+		u64 msr;
+
 		set_cpu_cap(c, X86_FEATURE_IBRS);
 		set_cpu_cap(c, X86_FEATURE_IBPB);
 		set_cpu_cap(c, X86_FEATURE_MSR_SPEC_CTRL);
+
+		/*
+		 * Clear speculation control settings from a previous kernel
+		 * run, i.e., kexec.
+		 */
+		rdmsrl(MSR_IA32_SPEC_CTRL, msr);
+		if (msr & SPEC_CTRL_MASK)
+			wrmsrl(MSR_IA32_SPEC_CTRL, msr & ~SPEC_CTRL_MASK);

Yes that's a cleaner approach, except that the late microcode load will
ruin the MSR:

Root of the original problem is that x86_spec_ctrl_current is not the
current value of the MSR at bootup.
How about we update x86_spec_ctrl_current before any writes to the MSR?:

diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 3e3230cccaa7..68ed52394fd9 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -137,8 +137,18 @@ void __init check_bugs(void)
 	 * have unknown values. AMD64_LS_CFG MSR is cached in the early AMD
 	 * init code as it is not enumerated and depends on the family.
 	 */
-	if (boot_cpu_has(X86_FEATURE_MSR_SPEC_CTRL))
+	if (boot_cpu_has(X86_FEATURE_MSR_SPEC_CTRL)) {
 		rdmsrl(MSR_IA32_SPEC_CTRL, x86_spec_ctrl_base);
+		/*
+		 * Previously running software, like kexec for example, may
+		 * have some controls turned ON.
+		 * Clear them and let the mitigations setup below set them
+		 * based on configuration.
+		 */
+		this_cpu_write(x86_spec_ctrl_current, x86_spec_ctrl_base);
+		x86_spec_ctrl_base &= ~SPEC_CTRL_MITIGATIONS_MASK;
+		write_spec_ctrl_current(x86_spec_ctrl_base, true);
+	}

 	/* Select the proper CPU mitigations before patching alternatives: */
 	spectre_v1_select_mitigation();
@@ -2047,8 +2057,13 @@ int arch_prctl_spec_ctrl_get(struct task_struct *task, unsigned long which)

 void x86_spec_ctrl_setup_ap(void)
 {
-	if (boot_cpu_has(X86_FEATURE_MSR_SPEC_CTRL))
+	if (boot_cpu_has(X86_FEATURE_MSR_SPEC_CTRL)) {
+		u64 msr;
+
+		rdmsrl(MSR_IA32_SPEC_CTRL, msr);
+		this_cpu_write(x86_spec_ctrl_current, msr);
 		write_spec_ctrl_current(x86_spec_ctrl_base, true);
+	}

 	if (ssb_mode == SPEC_STORE_BYPASS_DISABLE)
 		x86_amd_ssb_disable();
Re: [PATCH v1 2/2] kexec: Introduce kexec_reboot_disabled
On Mon, 28 Nov 2022 17:28:55 +0100 Philipp Rudo wrote:

> To be honest I don't think we make progress here at the moment. I
> would like to hear from others what they think about this.

Not sure if you missed my reply.

https://lore.kernel.org/all/20221128114200.72b3e...@gandalf.local.home/

-- Steve
Re: [PATCH v2] x86/bugs: Explicitly clear speculative MSR bits
On Mon, Nov 28, 2022 at 03:02:19PM -0800, Pawan Gupta wrote:
> Yes that's a cleaner approach, except that the late microcode load will
> ruin the MSR:
>
>   microcode_reload_late()
>     microcode_check()
>       get_cpu_cap()
>         init_speculation_control()

Microcode late loading ruins a lot of crap already anyway.

Then I guess we'll have to look for a better place for this on the init
path - I don't want any more hackery in the hw vuln nightmare code.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette
Re: [PATCH v2] x86/bugs: Explicitly clear speculative MSR bits
On Mon, Nov 28, 2022 at 11:40:19PM +0100, Borislav Petkov wrote:

On Mon, Nov 28, 2022 at 02:03:58PM -0800, Pawan Gupta wrote:

diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 3e3230cccaa7..cfc2ed2661fc 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -66,7 +66,7 @@ static DEFINE_MUTEX(spec_ctrl_mutex);
  */
 void write_spec_ctrl_current(u64 val, bool force)
 {
-	if (this_cpu_read(x86_spec_ctrl_current) == val)
+	if (!force && this_cpu_read(x86_spec_ctrl_current) == val)
 		return;
 	this_cpu_write(x86_spec_ctrl_current, val);

Still looks hacky to me. I think it would be a lot cleaner if
MSR_IA32_SPEC_CTRL gets cleaned of the speculation bits in
init_speculation_control() which gets run on *every* CPU. So by the time
check_bugs() gets to setup stuff, the MSR will be ready to go regardless.

I.e., something like this (not supposed to work - just to show what I mean):

diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 73cc546e024d..367732c92942 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -993,9 +993,19 @@ static void init_speculation_control(struct cpuinfo_x86 *c)
 	 * Intel CPUs, for finer-grained selection of what's available.
 	 */
 	if (cpu_has(c, X86_FEATURE_SPEC_CTRL)) {
+		u64 msr;
+
 		set_cpu_cap(c, X86_FEATURE_IBRS);
 		set_cpu_cap(c, X86_FEATURE_IBPB);
 		set_cpu_cap(c, X86_FEATURE_MSR_SPEC_CTRL);
+
+		/*
+		 * Clear speculation control settings from a previous kernel
+		 * run, i.e., kexec.
+		 */
+		rdmsrl(MSR_IA32_SPEC_CTRL, msr);
+		if (msr & SPEC_CTRL_MASK)
+			wrmsrl(MSR_IA32_SPEC_CTRL, msr & ~SPEC_CTRL_MASK);

Yes that's a cleaner approach, except that the late microcode load will
ruin the MSR:

  microcode_reload_late()
    microcode_check()
      get_cpu_cap()
        init_speculation_control()
Re: [PATCH v2] x86/bugs: Explicitly clear speculative MSR bits
On Mon, Nov 28, 2022 at 02:03:58PM -0800, Pawan Gupta wrote:
> diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
> index 3e3230cccaa7..cfc2ed2661fc 100644
> --- a/arch/x86/kernel/cpu/bugs.c
> +++ b/arch/x86/kernel/cpu/bugs.c
> @@ -66,7 +66,7 @@ static DEFINE_MUTEX(spec_ctrl_mutex);
>   */
>  void write_spec_ctrl_current(u64 val, bool force)
>  {
> -	if (this_cpu_read(x86_spec_ctrl_current) == val)
> +	if (!force && this_cpu_read(x86_spec_ctrl_current) == val)
>  		return;
>  	this_cpu_write(x86_spec_ctrl_current, val);

Still looks hacky to me.

I think it would be a lot cleaner if MSR_IA32_SPEC_CTRL gets cleaned of the
speculation bits in init_speculation_control() which gets run on *every*
CPU. So by the time check_bugs() gets to setup stuff, the MSR will be ready
to go regardless.

I.e., something like this (not supposed to work - just to show what I mean):

diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 73cc546e024d..367732c92942 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -993,9 +993,19 @@ static void init_speculation_control(struct cpuinfo_x86 *c)
 	 * Intel CPUs, for finer-grained selection of what's available.
 	 */
 	if (cpu_has(c, X86_FEATURE_SPEC_CTRL)) {
+		u64 msr;
+
 		set_cpu_cap(c, X86_FEATURE_IBRS);
 		set_cpu_cap(c, X86_FEATURE_IBPB);
 		set_cpu_cap(c, X86_FEATURE_MSR_SPEC_CTRL);
+
+		/*
+		 * Clear speculation control settings from a previous kernel
+		 * run, i.e., kexec.
+		 */
+		rdmsrl(MSR_IA32_SPEC_CTRL, msr);
+		if (msr & SPEC_CTRL_MASK)
+			wrmsrl(MSR_IA32_SPEC_CTRL, msr & ~SPEC_CTRL_MASK);
 	}

 	if (cpu_has(c, X86_FEATURE_INTEL_STIBP))

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette
Re: [PATCH v2] x86/bugs: Explicitly clear speculative MSR bits
On Mon, Nov 28, 2022 at 01:42:26AM +0100, Borislav Petkov wrote:

On Thu, Nov 24, 2022 at 02:46:50AM -0800, Breno Leitao wrote:

Currently x86_spec_ctrl_base is read at boot time, and speculative bits
are set if configs are enabled, such as MSR[SPEC_CTRL_IBRS] is enabled if
CONFIG_CPU_IBRS_ENTRY is configured. These MSR bits are not cleared if
the mitigations are disabled.

This is a problem when kexec-ing a kernel that has the mitigation
disabled, from a kernel that has the mitigation enabled. In this case,
the MSR bits are carried forward and not cleared at the boot of the new
kernel. This might have some performance degradation that is hard to
find. This problem does not happen if the machine is (hard) rebooted,
because the bit will be cleared by default.

This patch also defines a SPEC_CTRL_MASK macro, so we can easily track
and clear it if eventually some new mitigation shows up.

Just remove that sentence - the macro's function is kinda obvious from
the diff itself.

Suggested-by: Pawan Gupta
Signed-off-by: Breno Leitao
---
 arch/x86/include/asm/msr-index.h | 3 +++
 arch/x86/kernel/cpu/bugs.c | 9 -
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 4a2af82553e4..704f49580ee1 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -54,6 +54,9 @@
 #define SPEC_CTRL_RRSBA_DIS_S_SHIFT	6 /* Disable RRSBA behavior */
 #define SPEC_CTRL_RRSBA_DIS_S	BIT(SPEC_CTRL_RRSBA_DIS_S_SHIFT)

+#define SPEC_CTRL_MASK	(SPEC_CTRL_IBRS | SPEC_CTRL_STIBP | SPEC_CTRL_SSBD \
+			| SPEC_CTRL_RRSBA_DIS_S)

Call that SPEC_CTRL_MITIGATIONS_MASK or so to denote what it is - a mask
of the SPEC_CTRL bits which the kernel toggles when controlling
mitigations. A comment above it wouldn't hurt either.

+
 #define MSR_IA32_PRED_CMD	0x0049 /* Prediction Command */
 #define PRED_CMD_IBPB	BIT(0) /* Indirect Branch Prediction Barrier */

diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 3e3230cccaa7..88957da1029b 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -137,8 +137,15 @@ void __init check_bugs(void)
 	 * have unknown values. AMD64_LS_CFG MSR is cached in the early AMD
 	 * init code as it is not enumerated and depends on the family.
 	 */
-	if (boot_cpu_has(X86_FEATURE_MSR_SPEC_CTRL))
+	if (boot_cpu_has(X86_FEATURE_MSR_SPEC_CTRL)) {
 		rdmsrl(MSR_IA32_SPEC_CTRL, x86_spec_ctrl_base);
+		/*
+		 * Previously running software may have some controls turned ON.

"Previously running software, like kexec for example, ..."

+		 * Clear them and let kernel decide which controls to use.

"Clear them and let the mitigations setup below set them based on
configuration."

+		 */
+		x86_spec_ctrl_base &= ~SPEC_CTRL_MASK;
+		wrmsrl(MSR_IA32_SPEC_CTRL, x86_spec_ctrl_base);

So this WRMSR will happen on the BSP only but the SPEC_CTRL MSR is
per-CPU. As is x86_spec_ctrl_current which tracks it.

So I'd say you don't need that WRMSR here - the proper value will get
replicated eventually everywhere...

This patch is particularly for the case when the user intends to turn off
the mitigations, like with mitigations=off. In that case we need the WRMSR
because mitigation selection will simply return without writing to the MSR
on the BSP.

As part of AP init x86_spec_ctrl_setup_ap() writes to the MSR even when the
mitigation is turned off, so APs should have been fine, but I think there
is a subtle bug there as well. For the below call:

	x86_spec_ctrl_setup_ap(void)
	{
		write_spec_ctrl_current(x86_spec_ctrl_base, true);

When x86_spec_ctrl_base is 0 the MSR won't be written because of a check in
write_spec_ctrl_current() that doesn't write the MSR when the new value (0)
is the same as x86_spec_ctrl_current (initialized to 0).

Below should fix the problem with APs:

---
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 3e3230cccaa7..cfc2ed2661fc 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -66,7 +66,7 @@ static DEFINE_MUTEX(spec_ctrl_mutex);
  */
 void write_spec_ctrl_current(u64 val, bool force)
 {
-	if (this_cpu_read(x86_spec_ctrl_current) == val)
+	if (!force && this_cpu_read(x86_spec_ctrl_current) == val)
 		return;
 	this_cpu_write(x86_spec_ctrl_current, val);
Re: [PATCH] lockdown: kexec_file: prevent unsigned kernel image when KEXEC_SIG not enabled
On Tue, 2022-11-22 at 10:36 +0800, Coiby Xu wrote:
> Hi Mimi,
>
> On Mon, Nov 21, 2022 at 01:23:57PM -0500, Mimi Zohar wrote:
> >Hi Coiby,
> >
> >On Mon, 2022-11-21 at 15:29 +0800, Coiby Xu wrote:
> >> A kernel builder may not enable KEXEC_SIG and some architectures like
> >> ppc64 simply don't have KEXEC_SIG. In these cases, unless both
> >> IMA_ARCH_POLICY and secure boot are also enabled, lockdown doesn't
> >> prevent an unsigned kernel image from being kexec'ed via the
> >> kexec_file_load syscall whereas it could prevent one via the
> >> kexec_load syscall. Mandate signature verification for those cases.
> >>
> >> Fixes: 155bdd30af17 ("kexec_file: Restrict at runtime if the kernel is
> >> locked down")
> >> Cc: Matthew Garrett
> >> Cc: Jiri Bohac
> >> Cc: David Howells
> >> Cc: kexec@lists.infradead.org
> >> Cc: linux-integr...@vger.kernel.org
> >> Signed-off-by: Coiby Xu
> >
> >Other than correcting the function name to
> >mandate_signature_verification(),
>
> Applied to v2, thanks for correcting me! Btw, I realize I overwrote the
> return code of kexec_image_verify_sig with
> mandate_signature_verification's. v2 has fixed this issue as well.
>
> >Reviewed-by: Mimi Zohar
>
> And thanks for the review!

You're welcome. Without either IMA_ARCH or KEXEC_SIG enabled, the kexec
selftest test_kexec_file_load.sh properly failed with "kexec_file_load
failed [PASS]", but from the informational messages output, it isn't clear
why it failed. This should be corrected.

-- 
thanks,

Mimi
Re: [PATCH] kexec: Enable runtime allocation of crash_image
Hi Philipp

Thanks for your review.

On Mon, 28 Nov 2022 at 18:00, Philipp Rudo wrote:
>
> Hi Ricardo,
>
> On Thu, 24 Nov 2022 23:23:36 +0100
> Ricardo Ribalda wrote:
>
> > Usually crash_image is defined statically via the crashkernel parameter
> > or DT.
> >
> > But if the crash kernel is not used, or is smaller than the
> > area pre-allocated, that memory is wasted.
> >
> > Also, if the crash kernel was not defined at boot time, there is no way
> > to use the crash kernel.
> >
> > Enable runtime allocation of the crash_image if the crash_image is not
> > defined statically, following the same memory allocation/validation
> > path as for the reboot kexec kernel.
> >
> > Signed-off-by: Ricardo Ribalda
>
> I don't think this patch will work as intended. For one you omit
> setting the image->type to KEXEC_TYPE_CRASH. But when you grep for that
> type you will find that there is a lot of special handling done for it.
> I don't believe that this can simply be skipped without causing
> problems.
>
> Furthermore I think you have missed one important detail. The memory
> reserved for the crash kernel is not just a buffer for the image but
> the memory it runs in! For that it has to be a continuous piece of
> physical memory with usually some additional arch specific limitations.
> When allocated dynamically all those limitations need to be considered.
> But a standard kexec doesn't care about those limitations as it doesn't
> care about the os running before itself. It can simply overwrite the
> memory when booting. But if the crash kernel does the same it will
> corrupt the dump it is supposed to generate.

Right now, I do not intend to use it to fetch a kdump; I am using it as the
image that will run when the system crashes. It seems to work fine on the
two devices that I am using for tests.
> Thanks
> Philipp
>
> > ---
> > kexec: Enable runtime allocation of crash_image
> >
> > To: Eric Biederman
> > Cc: kexec@lists.infradead.org
> > Cc: linux-ker...@vger.kernel.org
> > Cc: Sergey Senozhatsky
> > Cc: linux-ker...@vger.kernel.org
> > Cc: Ross Zwisler
> > Cc: Philipp Rudo
> > Cc: Baoquan He
> > ---
> > include/linux/kexec.h | 1 +
> > kernel/kexec.c | 9 +
> > kernel/kexec_core.c | 5 +
> > kernel/kexec_file.c | 7 ---
> > 4 files changed, 15 insertions(+), 7 deletions(-)
> >
> > diff --git a/include/linux/kexec.h b/include/linux/kexec.h
> > index 41a686996aaa..98ca9a32bc8e 100644
> > --- a/include/linux/kexec.h
> > +++ b/include/linux/kexec.h
> > @@ -427,6 +427,7 @@ extern int kexec_load_disabled;
> > extern bool kexec_in_progress;
> >
> > int crash_shrink_memory(unsigned long new_size);
> > +bool __crash_memory_valid(void);
> > ssize_t crash_get_memory_size(void);
> >
> > #ifndef arch_kexec_protect_crashkres
> > diff --git a/kernel/kexec.c b/kernel/kexec.c
> > index cb8e6e6f983c..b5c17db25e88 100644
> > --- a/kernel/kexec.c
> > +++ b/kernel/kexec.c
> > @@ -28,7 +28,7 @@ static int kimage_alloc_init(struct kimage **rimage, unsigned long entry,
> > 	struct kimage *image;
> > 	bool kexec_on_panic = flags & KEXEC_ON_CRASH;
> >
> > -	if (kexec_on_panic) {
> > +	if (kexec_on_panic && __crash_memory_valid()) {
> > 		/* Verify we have a valid entry point */
> > 		if ((entry < phys_to_boot_phys(crashk_res.start)) ||
> > 		    (entry > phys_to_boot_phys(crashk_res.end)))
> > @@ -44,7 +44,7 @@ static int kimage_alloc_init(struct kimage **rimage, unsigned long entry,
> > 	image->nr_segments = nr_segments;
> > 	memcpy(image->segment, segments, nr_segments * sizeof(*segments));
> >
> > -	if (kexec_on_panic) {
> > +	if (kexec_on_panic && __crash_memory_valid()) {
> > 		/* Enable special crash kernel control page alloc policy. */
> > 		image->control_page = crashk_res.start;
> > 		image->type = KEXEC_TYPE_CRASH;
> > @@ -101,7 +101,7 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments,
> >
> > 	if (flags & KEXEC_ON_CRASH) {
> > 		dest_image = _crash_image;
> > -		if (kexec_crash_image)
> > +		if (kexec_crash_image && __crash_memory_valid())
> > 			arch_kexec_unprotect_crashkres();
> > 	} else {
> > 		dest_image = _image;
> > @@ -157,7 +157,8 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments,
> > 	image = xchg(dest_image, image);
> >
> > out:
> > -	if ((flags & KEXEC_ON_CRASH) && kexec_crash_image)
> > +	if ((flags & KEXEC_ON_CRASH) && kexec_crash_image &&
> > +	    __crash_memory_valid())
> > 		arch_kexec_protect_crashkres();
> >
> > 	kimage_free(image);
> > diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
> > index ca2743f9c634..77083c9760fb 100644
> > --- a/kernel/kexec_core.c
> > +++ b/kernel/kexec_core.c
> > @@ -1004,6 +1004,11 @@ void crash_kexec(struct pt_regs *regs)
> > 	}
> > }
>
Re: [PATCH] kexec: Enable runtime allocation of crash_image
Hi Ricardo,

On Thu, 24 Nov 2022 23:23:36 +0100 Ricardo Ribalda wrote:

> Usually crash_image is defined statically via the crashkernel parameter
> or DT.
>
> But if the crash kernel is not used, or is smaller than the
> area pre-allocated, that memory is wasted.
>
> Also, if the crash kernel was not defined at boot time, there is no way
> to use the crash kernel.
>
> Enable runtime allocation of the crash_image if the crash_image is not
> defined statically, following the same memory allocation/validation path
> as for the reboot kexec kernel.
>
> Signed-off-by: Ricardo Ribalda

I don't think this patch will work as intended. For one you omit setting
the image->type to KEXEC_TYPE_CRASH. But when you grep for that type you
will find that there is a lot of special handling done for it. I don't
believe that this can simply be skipped without causing problems.

Furthermore I think you have missed one important detail. The memory
reserved for the crash kernel is not just a buffer for the image but the
memory it runs in! For that it has to be a continuous piece of physical
memory with usually some additional arch specific limitations. When
allocated dynamically all those limitations need to be considered. But a
standard kexec doesn't care about those limitations as it doesn't care
about the os running before itself. It can simply overwrite the memory
when booting. But if the crash kernel does the same it will corrupt the
dump it is supposed to generate.
Thanks
Philipp

> ---
> kexec: Enable runtime allocation of crash_image
>
> To: Eric Biederman
> Cc: kexec@lists.infradead.org
> Cc: linux-ker...@vger.kernel.org
> Cc: Sergey Senozhatsky
> Cc: linux-ker...@vger.kernel.org
> Cc: Ross Zwisler
> Cc: Philipp Rudo
> Cc: Baoquan He
> ---
> include/linux/kexec.h | 1 +
> kernel/kexec.c | 9 +
> kernel/kexec_core.c | 5 +
> kernel/kexec_file.c | 7 ---
> 4 files changed, 15 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/kexec.h b/include/linux/kexec.h
> index 41a686996aaa..98ca9a32bc8e 100644
> --- a/include/linux/kexec.h
> +++ b/include/linux/kexec.h
> @@ -427,6 +427,7 @@ extern int kexec_load_disabled;
> extern bool kexec_in_progress;
>
> int crash_shrink_memory(unsigned long new_size);
> +bool __crash_memory_valid(void);
> ssize_t crash_get_memory_size(void);
>
> #ifndef arch_kexec_protect_crashkres
> diff --git a/kernel/kexec.c b/kernel/kexec.c
> index cb8e6e6f983c..b5c17db25e88 100644
> --- a/kernel/kexec.c
> +++ b/kernel/kexec.c
> @@ -28,7 +28,7 @@ static int kimage_alloc_init(struct kimage **rimage, unsigned long entry,
> 	struct kimage *image;
> 	bool kexec_on_panic = flags & KEXEC_ON_CRASH;
>
> -	if (kexec_on_panic) {
> +	if (kexec_on_panic && __crash_memory_valid()) {
> 		/* Verify we have a valid entry point */
> 		if ((entry < phys_to_boot_phys(crashk_res.start)) ||
> 		    (entry > phys_to_boot_phys(crashk_res.end)))
> @@ -44,7 +44,7 @@ static int kimage_alloc_init(struct kimage **rimage, unsigned long entry,
> 	image->nr_segments = nr_segments;
> 	memcpy(image->segment, segments, nr_segments * sizeof(*segments));
>
> -	if (kexec_on_panic) {
> +	if (kexec_on_panic && __crash_memory_valid()) {
> 		/* Enable special crash kernel control page alloc policy. */
> 		image->control_page = crashk_res.start;
> 		image->type = KEXEC_TYPE_CRASH;
> @@ -101,7 +101,7 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments,
>
> 	if (flags & KEXEC_ON_CRASH) {
> 		dest_image = _crash_image;
> -		if (kexec_crash_image)
> +		if (kexec_crash_image && __crash_memory_valid())
> 			arch_kexec_unprotect_crashkres();
> 	} else {
> 		dest_image = _image;
> @@ -157,7 +157,8 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments,
> 	image = xchg(dest_image, image);
>
> out:
> -	if ((flags & KEXEC_ON_CRASH) && kexec_crash_image)
> +	if ((flags & KEXEC_ON_CRASH) && kexec_crash_image &&
> +	    __crash_memory_valid())
> 		arch_kexec_protect_crashkres();
>
> 	kimage_free(image);
> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
> index ca2743f9c634..77083c9760fb 100644
> --- a/kernel/kexec_core.c
> +++ b/kernel/kexec_core.c
> @@ -1004,6 +1004,11 @@ void crash_kexec(struct pt_regs *regs)
> 	}
> }
>
> +bool __crash_memory_valid(void)
> +{
> +	return crashk_res.end != crashk_res.start;
> +}
> +
> ssize_t crash_get_memory_size(void)
> {
> 	ssize_t size = 0;
> diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
> index 45637511e0de..0671f4f370ff 100644
> --- a/kernel/kexec_file.c
> --- b/kernel/kexec_file.c
> @@ -280,7 +280,7 @@ kimage_file_alloc_init(struct kimage **rimage, int kernel_fd,
>
> 	image->file_mode = 1;
>
> -	if (kexec_on_panic) {
> +	if
Re: [PATCH v1 2/2] kexec: Introduce kexec_reboot_disabled
On Thu, 24 Nov 2022 16:01:15 +0100 Philipp Rudo wrote:
> No, I think the implementation is fine. I'm currently only struggling
> to understand what problem kexec_reboot_disabled solves that cannot be
> solved by kexec_load_disabled.

Hi Philipp,

Thanks for working with us on this. Let me try to explain our use case.

We want kexec/kdump enabled, but we really do not want kexec used for any
other purpose. We must have the kexec kernel loaded at boot up and not
afterward. Your recommendation of:

  kexec -p dump_kernel
  echo 1 > /proc/sys/kernel/kexec_load_disabled

can work, and we will probably add it. But we are taking the paranoid
approach, and what I learned in security 101 ;-) and that is, to open up
as minimal an attack surface as possible.

Yes, it's highly unlikely that the above would crash. But as with most
security vulnerabilities, it's not going to be an attacker that creates a
new gadget here, but probably another script in the future that causes
this to be delayed or something, and a new window of opportunity will
arise for an attacker. Maybe that new window only works for non-panic
kernels. Yes, this is a contrived scenario, but the work vs risk is very
low in adding this feature.

Perhaps the attack surface that a reboot kexec could be, is that the
attacker gets the ability at boot up to load the kexec for reboot and not
panic. Then the attack must wait for the victim to reboot their machine
before they have access to the new kernel. Again, I admit this is
contrived, but just because I can't think of a real situation that this
could be a problem doesn't mean that one doesn't exist.

In other words, if we never want to allow a kexec reboot, why allow it at
all from the beginning? The above allows it, until we don't. That alone
makes us nervous. Whereas this patch is rather trivial and doesn't add
complexity.

Thanks for your time, we appreciate it.

-- Steve
Re: [PATCH v1 2/2] kexec: Introduce kexec_reboot_disabled
Hi Ricardo,

On Thu, 24 Nov 2022 23:32:34 +0100 Ricardo Ribalda wrote:

> Hi Philipp
>
> On Thu, 24 Nov 2022 at 16:01, Philipp Rudo wrote:
> >
> > On Thu, 24 Nov 2022 13:52:58 +0100
> > Ricardo Ribalda wrote:
> >
> > > On Thu, 24 Nov 2022 at 12:40, Philipp Rudo wrote:
> > > >
> > > > Hi Ricardo,
> > > >
> > > > On Wed, 23 Nov 2022 09:58:08 +0100
> > > > Ricardo Ribalda wrote:
> > > >
> > > > > Hi Philipp
> > > > >
> > > > > Thanks for your review.
> > > > >
> > > > > My scenario is a trusted system, where even if you are root, your
> > > > > access to the system is very limited.
> > > > >
> > > > > Let's assume LOADPIN and verity are enabled.
> > > >
> > > > My point is that on such systems I expect that a sysadmin also wants to
> > > > control the crash kernel including its initramfs (which also has to be
> > > > part of the signed kernel?). But if that's the case a sysadmin can simply
> > > > arm kdump early during boot and then toggle kexec_load_disabled. With
> > > > that LINUX_REBOOT_CMD_KEXEC also gets disabled as no kexec kernel can be
> > > > loaded while kdump works. Thus there is no need to add the new interface.
> > > > Or am I missing anything?
> > >
> > > Let's say that you have a script that does something like this
> > >
> > > kexec -p dump_kernel
> > > echo 1 > /proc/sys/kernel/kexec_load_disabled
> > >
> > > If an attacker can DDoS the system and make that script crash... then
> > > kexec is still accessible
> > >
> > > On the other hand, if you load the kernel with the commandline
> > >
> > > sysctl.kernel.kexec_load_disabled=1
> >
> > reboot?
>
> yes :) thanks!
>
> > Otherwise you shouldn't be able to load the crash kernel at all.
> >
> > > Then even if the script crashes, the only way to abuse kexec is by
> > > panicking the running kernel
> >
> > True. But when an attacker can DDoS the system the final workload is
> > already running.
> > So wouldn't it be enough to make sure that the script
> > above has finished before starting your workload? E.g. by setting an
> > appropriate Before=/After= in the systemd.unit?
>
> What if the kexec binary crashes and the unit will never succeed?

Then there are options like OnFailure= or FailureAction= the sysadmin can use to
do whatever he deems appropriate.

> Or worse, your distro does not use systemd !!!

In that case there are other ways to achieve the same. The two options are only
examples. Why can't this be used as a replacement?

> > Furthermore, I don't think that restricting kexec reboot alone is
> > sufficient when the attacker can still control the crash kernel. At
> > least my assumption is that triggering a panic instead of just
> > rebooting is just a mild inconvenience for somebody who is able to pull
> > off an attack like that.
>
> The attacker does not control the crash kernel completely. loadpin is
> still in place.
> Yes, they can downgrade the whole system to a vulnerable kernel image.
> But the choices are limited :)
>
> With physical access to the device panicking a kernel is easily doable
> (but not trivial). But remotely, it is more challenging.

Well, the same holds for kexec. So the only difference is triggering the panic,
where I'm still not convinced it's a huge obstacle for someone who is able to
pull off all the steps needed for such an attack.

To be honest, I don't think we are making progress here at the moment. I would
like to hear from others what they think about this.

Thanks
Philipp

> > > Would it make you more comfortable if I model this as a kernel config
> > > instead of a runtime option?
> >
> > No, I think the implementation is fine. I'm currently only struggling
> > to understand what problem kexec_reboot_disabled solves that cannot be
> > solved by kexec_load_disabled.
>
> Thanks!

Happy to help.
> > > > Thanks
> > > > Philipp
> >
> > > > >
> > > > > On Mon, 21 Nov 2022 at 15:10, Philipp Rudo wrote:
> > > > > >
> > > > > > Hi Ricardo,
> > > > > >
> > > > > > On Thu, 17 Nov 2022 16:15:07 +0100
> > > > > > Ricardo Ribalda wrote:
> > > > > >
> > > > > > > Hi Philipp
> > > > > > >
> > > > > > > Thanks for your review!
> > > > > >
> > > > > > happy to help.
> > > > > >
> > > > > > > On Thu, 17 Nov 2022 at 16:07, Philipp Rudo wrote:
> > > > > > > >
> > > > > > > > Hi Ricardo,
> > > > > > > >
> > > > > > > > all in all I think this patch makes sense. However, there is
> > > > > > > > one point I don't like...
> > > > > > > >
> > > > > > > > On Mon, 14 Nov 2022 14:18:39 +0100
> > > > > > > > Ricardo Ribalda wrote:
> > > > > > > >
> > > > > > > > > Create a new toggle that disables LINUX_REBOOT_CMD_KEXEC,
> > > > > > > > > reducing the attack surface to a system.
> > > > > > > > >
> > > > > > > > > Without this toggle, an attacker can only
Re: [PATCH v14 3/7] crash: add generic infrastructure for crash hotplug support
On 11/24/22 21:26, Baoquan He wrote:

On 11/16/22 at 04:46pm, Eric DeVolder wrote:
..
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index ebf46c3b8f8b..b4dbc21f9081 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -32,6 +32,7 @@ extern note_buf_t __percpu *crash_notes;
 #include
 #include
 #include
+#include
 #include

 /* Verify architecture specific macros are defined */
@@ -374,6 +375,13 @@ struct kimage {
 	struct purgatory_info purgatory_info;
 #endif

+#ifdef CONFIG_CRASH_HOTPLUG

This kernel config CRASH_HOTPLUG is added in patch 7, but we have used it in the
previous patch, not sure if this is acceptable.

I wasn't sure what to do here either. Patch 7 is the x86 arch-specific support
patch, and CRASH_HOTPLUG is introduced in arch/x86/Kconfig. I did look at
introducing CRASH_HOTPLUG as a generic/non-arch-specific option, but no location
seemed appropriate given HOTPLUG_CPU is arch-specific and MEMORY_HOTPLUG is in
mm/Kconfig. This doesn't break bisect, but as you point out, I am not sure if
the location in patch 7 is acceptable. I'm not really sure how to resolve the
question.
eric

+	bool hotplug_event;
+	unsigned int offlinecpu;
+	bool elfcorehdr_index_valid;
+	int elfcorehdr_index;
+#endif
+
 #ifdef CONFIG_IMA_KEXEC
 	/* Virtual address of IMA measurement buffer for kexec syscall */
 	void *ima_buffer;
@@ -503,6 +511,34 @@ static inline int arch_kexec_post_alloc_pages(void *vaddr, unsigned int pages, g
 static inline void arch_kexec_pre_free_pages(void *vaddr, unsigned int pages) { }
 #endif

+#ifndef arch_map_crash_pages
+/*
+ * NOTE: The addresses and sizes passed to this routine have
+ * already been fully aligned on page boundaries. There is no
+ * need for massaging the address or size.
+ */
+static inline void *arch_map_crash_pages(unsigned long paddr,
+		unsigned long size)
+{
+	if (size > 0)
+		return kmap_local_page(pfn_to_page(paddr >> PAGE_SHIFT));
+	else
+		return NULL;
+}
+#endif
+
+#ifndef arch_unmap_crash_pages
+static inline void arch_unmap_crash_pages(void *ptr)
+{
+	if (ptr)
+		kunmap_local(ptr);
+}
+#endif
+
+#ifndef arch_crash_handle_hotplug_event
+static inline void arch_crash_handle_hotplug_event(struct kimage *image) { }
+#endif
+
 #else /* !CONFIG_KEXEC_CORE */
 struct pt_regs;
 struct task_struct;
diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index 8c648fd5897a..4e7221226976 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -11,6 +11,8 @@
 #include
 #include
 #include
+#include
+#include
 #include
 #include
@@ -18,6 +20,7 @@
 #include
 #include "kallsyms_internal.h"
+#include "kexec_internal.h"

 /* vmcoreinfo stuff */
 unsigned char *vmcoreinfo_data;
@@ -612,3 +615,139 @@ static int __init crash_save_vmcoreinfo_init(void)
 }
 subsys_initcall(crash_save_vmcoreinfo_init);
+
+#ifdef CONFIG_CRASH_HOTPLUG
+#undef pr_fmt
+#define pr_fmt(fmt) "crash hp: " fmt
+/*
+ * To accurately reflect hot un/plug changes, the elfcorehdr (which
+ * is passed to the crash kernel via the elfcorehdr= parameter)
+ * must be updated with the new list of CPUs and memories.
+ *
+ * In order to make changes to elfcorehdr, two conditions are needed:
+ * First, the segment containing the elfcorehdr must be large enough
+ * to permit a growing number of resources; the elfcorehdr memory size
+ * is based on NR_CPUS_DEFAULT and CRASH_MAX_MEMORY_RANGES.
+ * Second, purgatory must explicitly exclude the elfcorehdr from the
+ * list of segments it checks (since the elfcorehdr changes and thus
+ * would require an update to purgatory itself to update the digest).
+ */
+static void handle_hotplug_event(unsigned int hp_action, unsigned int cpu)
+{
+	/* Obtain lock while changing crash information */
+	if (kexec_trylock()) {
+
+		/* Check kdump is loaded */
+		if (kexec_crash_image) {
+			struct kimage *image = kexec_crash_image;
+
+			if (hp_action == KEXEC_CRASH_HP_ADD_CPU ||
+				hp_action == KEXEC_CRASH_HP_REMOVE_CPU)
+				pr_debug("hp_action %u, cpu %u\n", hp_action, cpu);
+			else
+				pr_debug("hp_action %u\n", hp_action);
+
+			/*
+			 * When the struct kimage is allocated, it is wiped to zero, so
+			 * the elfcorehdr_index_valid defaults to false. Find the
+			 * segment containing the elfcorehdr, if not already found.
+			 * This works for both the kexec_load and kexec_file_load paths.
+			 */
+			if (!image->elfcorehdr_index_valid) {
+				unsigned long mem, memsz;
+
[PATCH v3] x86/bugs: Explicitly clear speculative MSR bits
Currently x86_spec_ctrl_base is read at boot time, and speculative bits are set
if the corresponding configs are enabled; for example, MSR bit SPEC_CTRL_IBRS
is set if CONFIG_CPU_IBRS_ENTRY is configured. These MSR bits are not cleared
if the mitigations are disabled.

This is a problem when kexec-ing into a kernel that has the mitigation
disabled from a kernel that has the mitigation enabled. In this case, the MSR
bits are carried forward and not cleared at the boot of the new kernel. This
can cause performance degradation that is hard to track down. The problem does
not happen if the machine is (hard) rebooted, because the bits are cleared by
default.

Suggested-by: Pawan Gupta
Signed-off-by: Breno Leitao
---
 arch/x86/include/asm/msr-index.h |  4 ++++
 arch/x86/kernel/cpu/bugs.c       | 10 +++++++++-
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 4a2af82553e4..22986a8f18bc 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -54,6 +54,10 @@
 #define SPEC_CTRL_RRSBA_DIS_S_SHIFT	6 /* Disable RRSBA behavior */
 #define SPEC_CTRL_RRSBA_DIS_S		BIT(SPEC_CTRL_RRSBA_DIS_S_SHIFT)

+/* A mask for bits which the kernel toggles when controlling mitigations */
+#define SPEC_CTRL_MITIGATIONS_MASK	(SPEC_CTRL_IBRS | SPEC_CTRL_STIBP | SPEC_CTRL_SSBD \
+					| SPEC_CTRL_RRSBA_DIS_S)
+
 #define MSR_IA32_PRED_CMD		0x0049 /* Prediction Command */
 #define PRED_CMD_IBPB			BIT(0) /* Indirect Branch Prediction Barrier */

diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 3e3230cccaa7..4030358216c8 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -137,8 +137,16 @@ void __init check_bugs(void)
 	 * have unknown values. AMD64_LS_CFG MSR is cached in the early AMD
 	 * init code as it is not enumerated and depends on the family.
 	 */
-	if (boot_cpu_has(X86_FEATURE_MSR_SPEC_CTRL))
+	if (boot_cpu_has(X86_FEATURE_MSR_SPEC_CTRL)) {
 		rdmsrl(MSR_IA32_SPEC_CTRL, x86_spec_ctrl_base);
+		/*
+		 * Previously running software, like kexec for example, may
+		 * have some controls turned ON.
+		 * Clear them and let the mitigations setup below set them
+		 * based on configuration.
+		 */
+		x86_spec_ctrl_base &= ~SPEC_CTRL_MITIGATIONS_MASK;
+	}

 	/* Select the proper CPU mitigations before patching alternatives: */
 	spectre_v1_select_mitigation();
--
2.30.2
Re: nvdimm,pmem: makedumpfile: __vtop4_x86_64: Can't get a valid pte.
On 11/28/2022 9:46 AM, lizhij...@fujitsu.com wrote:

> On 28/11/2022 20:53, Eliot Moss wrote:
>> On 11/28/2022 7:04 AM, lizhij...@fujitsu.com wrote:
>>> Hi folks,
>>>
>>> I'm going to make crash coredump support pmem region. So
>>> I have modified kexec-tools to add pmem region to PT_LOAD of vmcore.
>>>
>>> But it failed at makedumpfile, log are as following:
>>>
>>> In my environment, i found the last 512 pages in pmem region will cause
>>> the error.
>>
>> I wonder if an issue I reported is related: when set up to map
>> 2Mb (huge) pages, the last 2Mb of a large region got mapped as
>> 4Kb pages, and then later, half of a large region was treated
>> that way.
>
> Could you share the url/link? I'd like to take a look

It was in a previous email to the nvdimm list. The title was:

"Possible PMD (huge pages) bug in fs dax"

And here is the body. I just sent it directly to the list, so there is no URL
(if I should be engaging in a different way, please let me know):

Folks - I posted already on nvdimm, but perhaps the topic did not quite grab
anyone's attention. I had had some trouble figuring out all the details to get
dax mapping of files from an xfs file system with underlying Optane DC memory
going, but now have that working reliably. But there is an odd behavior:

When first mapping a file, I request mapping a 32 Gb range, aligned on a 1 Gb
(and thus clearly on a 2 Mb) boundary.

For each group of 8 Gb, the first 4095 entries map with a 2 Mb huge (PMD)
page. The 4096th one does FALLBACK. I suspect some problem in
dax.c:grab_mapping_entry or its callees, but am not personally well enough
versed in either the dax code or the xarray implementation to dig further.

If you'd like a second puzzle ... after completing this mapping, another
thread accesses the whole range sequentially. This results in NOPAGE fault
handling for the first 4095+4095 2M regions that previously resulted in
NOPAGE -- so far so good. But it gives FALLBACK for the upper 16 Gb (except
the two PMD regions it already gave FALLBACK for).

I can provide trace output from a run if you'd like, and all the ndctl,
gdisk -l, fdisk -l, and xfs_info details if you like.

In my application, it would be nice if dax.c could deliver 1 Gb PUD size
mappings as well, though it would appear that that would require more surgery
on dax.c. It would be somewhat analogous to what's already there, of course,
but I don't mean to minimize the possible trickiness of it. I realize I should
submit that request as a separate thread, which I intend to do later.

Regards - Eliot Moss
Re: nvdimm,pmem: makedumpfile: __vtop4_x86_64: Can't get a valid pte.
On 28/11/2022 20:53, Eliot Moss wrote:

> On 11/28/2022 7:04 AM, lizhij...@fujitsu.com wrote:
>> Hi folks,
>>
>> I'm going to make crash coredump support pmem region. So
>> I have modified kexec-tools to add pmem region to PT_LOAD of vmcore.
>>
>> But it failed at makedumpfile, log are as following:
>>
>> In my environment, i found the last 512 pages in pmem region will cause the
>> error.
>
> I wonder if an issue I reported is related: when set up to map
> 2Mb (huge) pages, the last 2Mb of a large region got mapped as
> 4Kb pages, and then later, half of a large region was treated
> that way.

Could you share the url/link? I'd like to take a look

> I've seen no response to the report, but assume folks have
> been busy with other things or perhaps giving this lower
> priority since it does not exactly *fail*, just not work as
> a user might think it should.
>
> Regards - Eliot Moss
Re: nvdimm,pmem: makedumpfile: __vtop4_x86_64: Can't get a valid pte.
On 11/28/2022 7:04 AM, lizhij...@fujitsu.com wrote:
> Hi folks,
>
> I'm going to make crash coredump support pmem region. So
> I have modified kexec-tools to add pmem region to PT_LOAD of vmcore.
>
> But it failed at makedumpfile, log are as following:
>
> In my environment, i found the last 512 pages in pmem region will cause
> the error.

I wonder if an issue I reported is related: when set up to map 2Mb (huge)
pages, the last 2Mb of a large region got mapped as 4Kb pages, and then later,
half of a large region was treated that way.

I've seen no response to the report, but assume folks have been busy with
other things or perhaps giving this lower priority since it does not exactly
*fail*, just not work as a user might think it should.

Regards - Eliot Moss
nvdimm,pmem: makedumpfile: __vtop4_x86_64: Can't get a valid pte.
Hi folks,

I'm going to make crash coredump support the pmem region. So I have modified
kexec-tools to add the pmem region to PT_LOAD of vmcore.

But it failed at makedumpfile; the log is as follows:

In my environment, I found the last 512 pages in the pmem region will cause
the error.

qemu commandline:
-object memory-backend-file,id=memnvdimm0,prealloc=yes,mem-path=/root/qemu-dax.img,share=yes,size=4267704320,align=2097152
-device nvdimm,node=0,label-size=4194304,memdev=memnvdimm0,id=nvdimm0,slot=0

ndctl info:
[root@rdma-server ~]# ndctl list
[
  {
    "dev":"namespace0.0",
    "mode":"devdax",
    "map":"dev",
    "size":4127195136,
    "uuid":"f6fc1e86-ac5b-48d8-9cda-4888a33158f9",
    "chardev":"dax0.0",
    "align":4096
  }
]
[root@rdma-server ~]# ndctl list -iRD
{
  "dimms":[
    {
      "dev":"nmem0",
      "id":"8680-56341200",
      "handle":1,
      "phys_id":0
    }
  ],
  "regions":[
    {
      "dev":"region0",
      "size":4263510016,
      "align":16777216,
      "available_size":0,
      "max_available_extent":0,
      "type":"pmem",
      "iset_id":10248187106440278,
      "mappings":[
        {
          "dimm":"nmem0",
          "offset":0,
          "length":4263510016,
          "position":0
        }
      ],
      "persistence_domain":"unknown"
    }
  ]
}

iomem info:
[root@rdma-server ~]# cat /proc/iomem | grep Persi
14000-23e1f : Persistent Memory

makedumpfile info:
[   57.229110] kdump.sh[240]: mem_map[  71] ea0008e0 238000 23e200

First, I wonder whether:
1) makedumpfile reads the whole range of iomem (same as the PT_LOAD of pmem)
2) the 1st kernel side only sets up the mem_map (vmemmap) for this namespace,
   whose size is 512 pages smaller than iomem, for some reason.
3) Since there is an alignment in the nvdimm region (16MiB above), I also
   guess the maximum size of the pmem usable by the user should be
   ALIGN(iomem, 16MiB); after this alignment, the last 512 pages will be
   dropped, and then the kernel only sets up the vmemmap for this range. But I
   didn't see any code doing such things on the kernel side.

So if you guys know the reasons, please let me know :), any hint/feedback is
very welcome.
[   56.380802] kdump.sh[240]: OFFSET(atomic_long_t.counter)=0
[   56.385976] kdump.sh[240]: SIZE(latched_seq)=64
[   56.390217] kdump.sh[240]: OFFSET(latched_seq.val)=48
[   56.395750] kdump.sh[240]: LENGTH(free_area.free_list)=5
[   56.401295] kdump.sh[240]: NUMBER(NR_FREE_PAGES)=0
[   56.406772] kdump.sh[240]: NUMBER(PG_lru)=4
[   56.408783] kdump.sh[240]: NUMBER(PG_private)=13
[   56.415401] kdump.sh[240]: NUMBER(PG_swapcache)=10
[   56.421269] kdump.sh[240]: NUMBER(PG_swapbacked)=19
[   56.426797] kdump.sh[240]: NUMBER(PG_slab)=9
[   56.428911] kdump.sh[240]: NUMBER(PG_head_mask)=65536
[   56.435175] kdump.sh[240]: NUMBER(PAGE_BUDDY_MAPCOUNT_VALUE)=-129
[   56.437522] kdump.sh[240]: NUMBER(HUGETLB_PAGE_DTOR)=2
[   56.442233] kdump.sh[240]: NUMBER(PAGE_OFFLINE_MAPCOUNT_VALUE)=-257
[   56.446943] kdump.sh[240]: SYMBOL(kallsyms_names)=9e5eb9b0
[   56.452486] kdump.sh[240]: SYMBOL(kallsyms_num_syms)=9e5eb9a8
[   56.458891] kdump.sh[240]: SYMBOL(kallsyms_token_table)=9e76e038
Excluding unnecessary pages : [ 2.7 %] \
__vtop4_x86_64: Can't get a valid pte.
[   56.476355] kdump.sh[240]: readmem: Can't convert a virtual address(ea0008f8) to physical address.
[   56.483192] kdump.sh[240]: readmem: type_addr: 0, addr:ea0008f8, size:32768
[   56.489350] kdump.sh[240]: __exclude_unnecessary_pages: Can't read the buffer of struct page.
[   56.494516] kdump.sh[240]: create_2nd_bitmap: Can't exclude unnecessary pages.
[   56.501427] kdump[242]: saving vmcore failed, _exitcode:1
[   56.506871] kdump.sh[240]: SYMBOL(kallsyms_token_index)=9e76e3e8
[   56.511820] kdump.sh[240]: SYMBOL(kallsyms_offsets)=9e574240
[   56.516816] kdump.sh[240]: SYMBOL(kallsyms_relative_base)=9e5eb9a0
[   56.522270] kdump.sh[240]: NUMBER(phys_base)=4330618880
[   56.526372] kdump.sh[240]: SYMBOL(init_top_pgt)=9ea26000
[   56.531827] kdump.sh[240]: NUMBER(pgtable_l5_enabled)=0
[   56.535773] kdump.sh[240]: SYMBOL(node_data)=9f2829a0
[   56.540830] kdump.sh[240]: LENGTH(node_data)=64
[   56.545405] kdump.sh[240]: KERNELOFFSET=1b40
[   56.549841] kdump.sh[240]: NUMBER(KERNEL_IMAGE_SIZE)=1073741824
[   56.554862] kdump.sh[240]: NUMBER(sme_mask)=0
[   56.558820] kdump.sh[240]: CRASHTIME=1669635006
[   56.562812] kdump.sh[240]: phys_base: 10220 (vmcoreinfo)
[   56.567343] kdump.sh[240]: max_mapnr: 23e200
[   56.572837] kdump.sh[240]: There is enough free memory to be done in one cycle.
[   56.577843] kdump.sh[240]: Buffer size for the cyclic mode: 587904
[   56.582390] kdump.sh[240]: The kernel version is not supported.
[   56.590812] kdump.sh[240]: The makedumpfile