Re: nvdimm,pmem: makedumpfile: __vtop4_x86_64: Can't get a valid pte.

2022-11-28 Thread Eliot Moss
Glad you found it. Any thoughts/reactions?  EM

Sent from my iPhone

> On Nov 29, 2022, at 12:17 AM, lizhij...@fujitsu.com wrote:
> 
> 
> 
>> On 28/11/2022 23:03, Eliot Moss wrote:
>>> On 11/28/2022 9:46 AM, lizhij...@fujitsu.com wrote:
>>> 
>>> 
>>> On 28/11/2022 20:53, Eliot Moss wrote:
 On 11/28/2022 7:04 AM, lizhij...@fujitsu.com wrote:
> Hi folks,
> 
> I'm going to make crash coredump support pmem region. So
> I have modified kexec-tools to add pmem region to PT_LOAD of vmcore.
> 
> But it failed at makedumpfile; the log is as follows:
> 
> In my environment, I found that the last 512 pages in the pmem region cause
> the error.
 
I wonder if an issue I reported is related: when set up to map
2 MB (huge) pages, the last 2 MB of a large region got mapped as
4 KB pages, and then later, half of a large region was treated
that way.
 
>>> Could you share the url/link ? I'd like to take a look
>> 
>> It was in a previous email to the nvdimm list.  The title was:
>> 
>> "Possible PMD (huge pages) bug in fs dax"
>> 
>> And here is the body.  I just sent directly to the list so there
>> is no URL (if I should be engaging in a different way, please let me know):
> 
> I found it :) at
> https://www.mail-archive.com/nvdimm@lists.linux.dev/msg02743.html
> 
> 
>> 
>> Folks - I posted already on nvdimm, but perhaps the topic did not quite grab
>> anyone's attention.  I had some trouble figuring out all the details to get
>> dax mapping of files from an xfs file system with underlying Optane DC memory
>> going, but now have that working reliably.  But there is an odd behavior:
>> 
>> When first mapping a file, I request mapping a 32 GB range, aligned on a 1 GB
>> (and thus clearly on a 2 MB) boundary.
>> 
>> For each group of 8 GB, the first 4095 entries map with a 2 MB huge (PMD)
>> page.  The 4096th one does FALLBACK.  I suspect some problem in
>> dax.c:grab_mapping_entry or its callees, but am not personally well enough
>> versed in either the dax code or the xarray implementation to dig further.
>> 
>> 
>> If you'd like a second puzzle ... after completing this mapping, another
>> thread accesses the whole range sequentially.  This results in NOPAGE fault
>> handling for the first 4095+4095 2 MB regions that previously resulted in
>> NOPAGE -- so far so good.  But it gives FALLBACK for the upper 16 GB (except
>> the two PMD regions it already gave FALLBACK for).
>> 
>> 
>> I can provide trace output from a run, and all the ndctl, gdisk -l,
>> fdisk -l, and xfs_info details, if you like.
>> 
>> 
>> In my application, it would be nice if dax.c could deliver 1 GB PUD-size
>> mappings as well, though it appears that would require more surgery
>> on dax.c.  It would be somewhat analogous to what's already there, of course,
>> but I don't mean to minimize the possible trickiness of it.  I realize I
>> should submit that request as a separate thread, which I intend to do
>> later.
>> 
>> 
>> Regards - Eliot Moss




Re: nvdimm,pmem: makedumpfile: __vtop4_x86_64: Can't get a valid pte.

2022-11-28 Thread lizhij...@fujitsu.com


On 28/11/2022 23:03, Eliot Moss wrote:
> On 11/28/2022 9:46 AM, lizhij...@fujitsu.com wrote:
>>
>>
>> On 28/11/2022 20:53, Eliot Moss wrote:
>>> On 11/28/2022 7:04 AM, lizhij...@fujitsu.com wrote:
 Hi folks,

 I'm going to make crash coredump support pmem region. So
 I have modified kexec-tools to add pmem region to PT_LOAD of vmcore.

But it failed at makedumpfile; the log is as follows:

In my environment, I found that the last 512 pages in the pmem region cause
the error.
>>>
>>> I wonder if an issue I reported is related: when set up to map
>>> 2 MB (huge) pages, the last 2 MB of a large region got mapped as
>>> 4 KB pages, and then later, half of a large region was treated
>>> that way.
>>>
>> Could you share the url/link ? I'd like to take a look
> 
> It was in a previous email to the nvdimm list.  The title was:
> 
> "Possible PMD (huge pages) bug in fs dax"
> 
> And here is the body.  I just sent directly to the list so there
> is no URL (if I should be engaging in a different way, please let me know):

I found it :) at
https://www.mail-archive.com/nvdimm@lists.linux.dev/msg02743.html


> 
> Folks - I posted already on nvdimm, but perhaps the topic did not quite grab
> anyone's attention.  I had some trouble figuring out all the details to get
> dax mapping of files from an xfs file system with underlying Optane DC memory
> going, but now have that working reliably.  But there is an odd behavior:
> 
> When first mapping a file, I request mapping a 32 GB range, aligned on a 1 GB
> (and thus clearly on a 2 MB) boundary.
> 
> For each group of 8 GB, the first 4095 entries map with a 2 MB huge (PMD)
> page.  The 4096th one does FALLBACK.  I suspect some problem in
> dax.c:grab_mapping_entry or its callees, but am not personally well enough
> versed in either the dax code or the xarray implementation to dig further.
> 
> 
> If you'd like a second puzzle ... after completing this mapping, another
> thread accesses the whole range sequentially.  This results in NOPAGE fault
> handling for the first 4095+4095 2 MB regions that previously resulted in
> NOPAGE -- so far so good.  But it gives FALLBACK for the upper 16 GB (except
> the two PMD regions it already gave FALLBACK for).
> 
> 
> I can provide trace output from a run, and all the ndctl, gdisk -l,
> fdisk -l, and xfs_info details, if you like.
> 
> 
> In my application, it would be nice if dax.c could deliver 1 GB PUD-size
> mappings as well, though it appears that would require more surgery
> on dax.c.  It would be somewhat analogous to what's already there, of course,
> but I don't mean to minimize the possible trickiness of it.  I realize I
> should submit that request as a separate thread, which I intend to do
> later.
> 
> 
> Regards - Eliot Moss

Re: [PATCH v14 3/7] crash: add generic infrastructure for crash hotplug support

2022-11-28 Thread Baoquan He
On 11/28/22 at 09:46am, Eric DeVolder wrote:
> 
> 
> On 11/24/22 21:26, Baoquan He wrote:
> > On 11/16/22 at 04:46pm, Eric DeVolder wrote:
> > ..
> > > diff --git a/include/linux/kexec.h b/include/linux/kexec.h
> > > index ebf46c3b8f8b..b4dbc21f9081 100644
> > > --- a/include/linux/kexec.h
> > > +++ b/include/linux/kexec.h
> > > @@ -32,6 +32,7 @@ extern note_buf_t __percpu *crash_notes;
> > >   #include 
> > >   #include 
> > >   #include 
> > > +#include 
> > >   #include 
> > >   /* Verify architecture specific macros are defined */
> > > @@ -374,6 +375,13 @@ struct kimage {
> > >   struct purgatory_info purgatory_info;
> > >   #endif
> > > +#ifdef CONFIG_CRASH_HOTPLUG
> > 
> > This kernel config CRASH_HOTPLUG is added in patch 7, but we have used
> > it in the previous patch, not sure if this is acceptable.
> > 
> I wasn't sure what to do here either. Patch 7 is the x86 arch-specific
> support patch, and CRASH_HOTPLUG is introduced in arch/x86/Kconfig. I did
> look at introducing CRASH_HOTPLUG as a generic/non-arch-specific option, but
> no location seemed appropriate given HOTPLUG_CPU is arch-specific and
> MEMORY_HOTPLUG is in mm/Kconfig.

arch/Kconfig?

Because CRASH_CORE/KEXEC_CORE are defined there.

> 
> This doesn't break bisect, but as you point out, not sure if the location in 
> patch 7 is acceptable.
> I'm not really sure how to resolve the question.

Hmm, since it's bisect-able, it seems it doesn't break the rule. I could be
too sensitive. Do we have a precedent like this, to strengthen our
confidence?

If there's no concern from other people, it's also fine with me.
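For reference, a generic option placed in arch/Kconfig, next to CRASH_CORE/KEXEC_CORE as suggested, might look like the sketch below. The option name matches the patch series, but the dependencies and help text are illustrative guesses, not the actual patch:

```kconfig
config CRASH_HOTPLUG
	bool "Update the crash elfcorehdr on CPU/memory hotplug"
	depends on CRASH_CORE && (HOTPLUG_CPU || MEMORY_HOTPLUG)
	help
	  Regenerate the crash kernel's elfcorehdr when CPUs or memory are
	  hot-added or hot-removed, so that an already-loaded kdump kernel
	  stays usable without reloading it from userspace.
```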


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v2] x86/bugs: Explicitly clear speculative MSR bits

2022-11-28 Thread Pawan Gupta

On Mon, Nov 28, 2022 at 03:02:21PM -0800, Pawan Gupta wrote:

On Mon, Nov 28, 2022 at 11:40:19PM +0100, Borislav Petkov wrote:

On Mon, Nov 28, 2022 at 02:03:58PM -0800, Pawan Gupta wrote:

diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 3e3230cccaa7..cfc2ed2661fc 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -66,7 +66,7 @@ static DEFINE_MUTEX(spec_ctrl_mutex);
 */
void write_spec_ctrl_current(u64 val, bool force)
{
-   if (this_cpu_read(x86_spec_ctrl_current) == val)
+   if (!force && this_cpu_read(x86_spec_ctrl_current) == val)
return;
this_cpu_write(x86_spec_ctrl_current, val);


Still looks hacky to me.

I think it would be a lot cleaner if MSR_IA32_SPEC_CTRL gets cleaned of
the speculation bits in init_speculation_control() which gets run on
*every* CPU.

So by the time check_bugs() gets to setup stuff, the MSR will be ready
to go regardless.

I.e., something like this (not supposed to work - just to show what I
mean):

diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 73cc546e024d..367732c92942 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -993,9 +993,19 @@ static void init_speculation_control(struct cpuinfo_x86 *c)
 * Intel CPUs, for finer-grained selection of what's available.
 */
if (cpu_has(c, X86_FEATURE_SPEC_CTRL)) {
+   u64 msr;
+
set_cpu_cap(c, X86_FEATURE_IBRS);
set_cpu_cap(c, X86_FEATURE_IBPB);
set_cpu_cap(c, X86_FEATURE_MSR_SPEC_CTRL);
+
+   /*
+* Clear speculation control settings from a previous kernel
+* run, i.e., kexec.
+*/
+   rdmsrl(MSR_IA32_SPEC_CTRL, msr);
+   if (msr & SPEC_CTRL_MASK)
+   wrmsrl(MSR_IA32_SPEC_CTRL, msr & ~SPEC_CTRL_MASK);


Yes, that's a cleaner approach, except that the late microcode load will
ruin the MSR:


Root of the original problem is x86_spec_ctrl_current is not the current
value of MSR at bootup.

How about we update x86_spec_ctrl_current before any writes to the MSR?:

diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 3e3230cccaa7..68ed52394fd9 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -137,8 +137,18 @@ void __init check_bugs(void)
 * have unknown values. AMD64_LS_CFG MSR is cached in the early AMD
 * init code as it is not enumerated and depends on the family.
 */
-   if (boot_cpu_has(X86_FEATURE_MSR_SPEC_CTRL))
+   if (boot_cpu_has(X86_FEATURE_MSR_SPEC_CTRL)) {
rdmsrl(MSR_IA32_SPEC_CTRL, x86_spec_ctrl_base);
+   /*
+* Previously running software, like kexec for example, may
+* have some controls turned ON.
+* Clear them and let the mitigations setup below set them
+* based on configuration.
+*/
+   this_cpu_write(x86_spec_ctrl_current, x86_spec_ctrl_base);
+   x86_spec_ctrl_base &= ~SPEC_CTRL_MITIGATIONS_MASK;
+   write_spec_ctrl_current(x86_spec_ctrl_base, true);
+   }
 
 	/* Select the proper CPU mitigations before patching alternatives: */

spectre_v1_select_mitigation();
@@ -2047,8 +2057,13 @@ int arch_prctl_spec_ctrl_get(struct task_struct *task, unsigned long which)
 
 void x86_spec_ctrl_setup_ap(void)

 {
-   if (boot_cpu_has(X86_FEATURE_MSR_SPEC_CTRL))
+   if (boot_cpu_has(X86_FEATURE_MSR_SPEC_CTRL)) {
+   u64 msr;
+
+   rdmsrl(MSR_IA32_SPEC_CTRL, msr);
+   this_cpu_write(x86_spec_ctrl_current, msr);
write_spec_ctrl_current(x86_spec_ctrl_base, true);
+   }
 
 	if (ssb_mode == SPEC_STORE_BYPASS_DISABLE)

x86_amd_ssb_disable();



Re: [PATCH v1 2/2] kexec: Introduce kexec_reboot_disabled

2022-11-28 Thread Steven Rostedt
On Mon, 28 Nov 2022 17:28:55 +0100
Philipp Rudo  wrote:

> To be honest I don't think we make a progress here at the moment. I
> would like to hear from others what they think about this.

Not sure if you missed my reply.

  https://lore.kernel.org/all/20221128114200.72b3e...@gandalf.local.home/

-- Steve



Re: [PATCH v2] x86/bugs: Explicitly clear speculative MSR bits

2022-11-28 Thread Borislav Petkov
On Mon, Nov 28, 2022 at 03:02:19PM -0800, Pawan Gupta wrote:
> Yes, that's a cleaner approach, except that the late microcode load will
> ruin the MSR:
> 
> microcode_reload_late()
>   microcode_check()
> get_cpu_cap()
>   init_speculation_control()

Microcode late loading ruins a lot of crap already anyway.

Then I guess we'll have to look for a better place for this on the init
path - I don't want any more hackery in the hw vuln nightmare code.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette



Re: [PATCH v2] x86/bugs: Explicitly clear speculative MSR bits

2022-11-28 Thread Pawan Gupta

On Mon, Nov 28, 2022 at 11:40:19PM +0100, Borislav Petkov wrote:

On Mon, Nov 28, 2022 at 02:03:58PM -0800, Pawan Gupta wrote:

diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 3e3230cccaa7..cfc2ed2661fc 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -66,7 +66,7 @@ static DEFINE_MUTEX(spec_ctrl_mutex);
  */
 void write_spec_ctrl_current(u64 val, bool force)
 {
-   if (this_cpu_read(x86_spec_ctrl_current) == val)
+   if (!force && this_cpu_read(x86_spec_ctrl_current) == val)
return;
this_cpu_write(x86_spec_ctrl_current, val);


Still looks hacky to me.

I think it would be a lot cleaner if MSR_IA32_SPEC_CTRL gets cleaned of
the speculation bits in init_speculation_control() which gets run on
*every* CPU.

So by the time check_bugs() gets to setup stuff, the MSR will be ready
to go regardless.

I.e., something like this (not supposed to work - just to show what I
mean):

diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 73cc546e024d..367732c92942 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -993,9 +993,19 @@ static void init_speculation_control(struct cpuinfo_x86 *c)
 * Intel CPUs, for finer-grained selection of what's available.
 */
if (cpu_has(c, X86_FEATURE_SPEC_CTRL)) {
+   u64 msr;
+
set_cpu_cap(c, X86_FEATURE_IBRS);
set_cpu_cap(c, X86_FEATURE_IBPB);
set_cpu_cap(c, X86_FEATURE_MSR_SPEC_CTRL);
+
+   /*
+* Clear speculation control settings from a previous kernel
+* run, i.e., kexec.
+*/
+   rdmsrl(MSR_IA32_SPEC_CTRL, msr);
+   if (msr & SPEC_CTRL_MASK)
+   wrmsrl(MSR_IA32_SPEC_CTRL, msr & ~SPEC_CTRL_MASK);


Yes, that's a cleaner approach, except that the late microcode load will
ruin the MSR:

microcode_reload_late()
  microcode_check()
get_cpu_cap()
  init_speculation_control()



Re: [PATCH v2] x86/bugs: Explicitly clear speculative MSR bits

2022-11-28 Thread Borislav Petkov
On Mon, Nov 28, 2022 at 02:03:58PM -0800, Pawan Gupta wrote:
> diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
> index 3e3230cccaa7..cfc2ed2661fc 100644
> --- a/arch/x86/kernel/cpu/bugs.c
> +++ b/arch/x86/kernel/cpu/bugs.c
> @@ -66,7 +66,7 @@ static DEFINE_MUTEX(spec_ctrl_mutex);
>   */
>  void write_spec_ctrl_current(u64 val, bool force)
>  {
> - if (this_cpu_read(x86_spec_ctrl_current) == val)
> + if (!force && this_cpu_read(x86_spec_ctrl_current) == val)
>   return;
>   this_cpu_write(x86_spec_ctrl_current, val);

Still looks hacky to me.

I think it would be a lot cleaner if MSR_IA32_SPEC_CTRL gets cleaned of
the speculation bits in init_speculation_control() which gets run on
*every* CPU.

So by the time check_bugs() gets to setup stuff, the MSR will be ready
to go regardless.

I.e., something like this (not supposed to work - just to show what I
mean):

diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 73cc546e024d..367732c92942 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -993,9 +993,19 @@ static void init_speculation_control(struct cpuinfo_x86 *c)
 * Intel CPUs, for finer-grained selection of what's available.
 */
if (cpu_has(c, X86_FEATURE_SPEC_CTRL)) {
+   u64 msr;
+
set_cpu_cap(c, X86_FEATURE_IBRS);
set_cpu_cap(c, X86_FEATURE_IBPB);
set_cpu_cap(c, X86_FEATURE_MSR_SPEC_CTRL);
+
+   /*
+* Clear speculation control settings from a previous kernel
+* run, i.e., kexec.
+*/
+   rdmsrl(MSR_IA32_SPEC_CTRL, msr);
+   if (msr & SPEC_CTRL_MASK)
+   wrmsrl(MSR_IA32_SPEC_CTRL, msr & ~SPEC_CTRL_MASK);
}
 
if (cpu_has(c, X86_FEATURE_INTEL_STIBP))



-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette



Re: [PATCH v2] x86/bugs: Explicitly clear speculative MSR bits

2022-11-28 Thread Pawan Gupta

On Mon, Nov 28, 2022 at 01:42:26AM +0100, Borislav Petkov wrote:

On Thu, Nov 24, 2022 at 02:46:50AM -0800, Breno Leitao wrote:

Currently x86_spec_ctrl_base is read at boot time, and speculative bits
are set if the corresponding configs are enabled; e.g., MSR[SPEC_CTRL_IBRS]
is set if CONFIG_CPU_IBRS_ENTRY is configured. These MSR bits are not
cleared if the mitigations are disabled.

This is a problem when kexec-ing a kernel that has the mitigation
disabled, from a kernel that has the mitigation enabled. In this case,
the MSR bits are carried forward and not cleared at the boot of the new
kernel. This can cause performance degradation that is hard to
track down.

This problem does not happen if the machine is (hard) rebooted, because
the bit will be cleared by default.

This patch also defines a SPEC_CTRL_MASK macro so we can easily track
and clear the bits if some new mitigation shows up.


Just remove that sentence - the macro's function is kinda obvious from
the diff itself.


Suggested-by: Pawan Gupta 
Signed-off-by: Breno Leitao 
---
 arch/x86/include/asm/msr-index.h | 3 +++
 arch/x86/kernel/cpu/bugs.c   | 9 -
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 4a2af82553e4..704f49580ee1 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -54,6 +54,9 @@
 #define SPEC_CTRL_RRSBA_DIS_S_SHIFT6  /* Disable RRSBA behavior */
 #define SPEC_CTRL_RRSBA_DIS_S  BIT(SPEC_CTRL_RRSBA_DIS_S_SHIFT)

+#define SPEC_CTRL_MASK (SPEC_CTRL_IBRS | SPEC_CTRL_STIBP | SPEC_CTRL_SSBD \
+   | SPEC_CTRL_RRSBA_DIS_S)


Call that SPEC_CTRL_MITIGATIONS_MASK or so to denote what it is - a
mask of the SPEC_CTRL bits which the kernel toggles when controlling
mitigations.

A comment above it wouldn't hurt either.


+
 #define MSR_IA32_PRED_CMD  0x00000049 /* Prediction Command */
 #define PRED_CMD_IBPB  BIT(0) /* Indirect Branch Prediction Barrier */

diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 3e3230cccaa7..88957da1029b 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -137,8 +137,15 @@ void __init check_bugs(void)
 * have unknown values. AMD64_LS_CFG MSR is cached in the early AMD
 * init code as it is not enumerated and depends on the family.
 */
-   if (boot_cpu_has(X86_FEATURE_MSR_SPEC_CTRL))
+   if (boot_cpu_has(X86_FEATURE_MSR_SPEC_CTRL)) {
rdmsrl(MSR_IA32_SPEC_CTRL, x86_spec_ctrl_base);
+   /*
+* Previously running software may have some controls turned ON.


"Previously running software, like kexec for example, ..."


+* Clear them and let kernel decide which controls to use.


"Clear them and let the mitigations setup below set them based on 
configuration."


+*/
+   x86_spec_ctrl_base &= ~SPEC_CTRL_MASK;
+   wrmsrl(MSR_IA32_SPEC_CTRL, x86_spec_ctrl_base);


So this WRMSR will happen on the BSP only but the SPEC_CTRL MSR is
per-CPU. As is x86_spec_ctrl_current which tracks it.

So I'd say you don't need that WRMSR here - the proper value will get
replicated eventually everywhere...


This patch is particularly for the case when the user intends to turn off
the mitigations, e.g. with mitigations=off. In that case we need the
WRMSR because mitigation selection will simply return without writing to
the MSR on the BSP.

As part of AP init x86_spec_ctrl_setup_ap() writes to the MSR even
when the mitigation is turned off, so APs should have been fine, but I
think there is a subtle bug there as well, in the call below:

x86_spec_ctrl_setup_ap(void)
{
write_spec_ctrl_current(x86_spec_ctrl_base, true);

When x86_spec_ctrl_base is 0, the MSR won't be written because of a check in
write_spec_ctrl_current() that skips the write when the new value
(0) is the same as x86_spec_ctrl_current (initialized to 0).

Below should fix the problem with APs:

---
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 3e3230cccaa7..cfc2ed2661fc 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -66,7 +66,7 @@ static DEFINE_MUTEX(spec_ctrl_mutex);
  */
 void write_spec_ctrl_current(u64 val, bool force)
 {
-   if (this_cpu_read(x86_spec_ctrl_current) == val)
+   if (!force && this_cpu_read(x86_spec_ctrl_current) == val)
return;
 
 	this_cpu_write(x86_spec_ctrl_current, val);




Re: [PATCH] lockdown: kexec_file: prevent unsigned kernel image when KEXEC_SIG not enabled

2022-11-28 Thread Mimi Zohar
On Tue, 2022-11-22 at 10:36 +0800, Coiby Xu wrote:
> Hi Mimi,
> 
> On Mon, Nov 21, 2022 at 01:23:57PM -0500, Mimi Zohar wrote:
> >Hi Coiby,
> >
> >On Mon, 2022-11-21 at 15:29 +0800, Coiby Xu wrote:
> >> A kernel builder may not enable KEXEC_SIG and some architectures like
> >> ppc64 simply don't have KEXEC_SIG. In these cases, unless both
> >> IMA_ARCH_POLICY and secure boot also enabled, lockdown doesn't prevent
> >> unsigned kernel image from being kexec'ed via the kexec_file_load
> >> syscall whereas it could prevent one via the kexec_load syscall. Mandate
> >> signature verification for those cases.
> >>
> >> Fixes: 155bdd30af17 ("kexec_file: Restrict at runtime if the kernel is locked down")
> >> Cc: Matthew Garrett 
> >> Cc: Jiri Bohac 
> >> Cc: David Howells 
> >> Cc: kexec@lists.infradead.org
> >> Cc: linux-integr...@vger.kernel.org
> >> Signed-off-by: Coiby Xu 
> >
> >Other than correcting the function name to mandate_signature_verification(),
> 
> Applied to v2, thanks for correcting me! Btw, I realize I overwrote the
> return code of kexec_image_verify_sig with
> mandate_signature_verification's. v2 has fixed this issue as well.
> 
> >
> >Reviewed-by: Mimi Zohar 
> 
> And thanks for the review!

You're welcome.

Without either IMA_ARCH or KEXEC_SIG enabled, the kexec selftest
test_kexec_file_load.sh properly failed with "kexec_file_load failed
[PASS]", but the informational messages don't make it clear why it
failed.  This should be corrected.

-- 
thanks,

Mimi




Re: [PATCH] kexec: Enable runtime allocation of crash_image

2022-11-28 Thread Ricardo Ribalda
Hi Philipp


Thanks for your review.


On Mon, 28 Nov 2022 at 18:00, Philipp Rudo  wrote:
>
> Hi Ricardo,
>
> On Thu, 24 Nov 2022 23:23:36 +0100
> Ricardo Ribalda  wrote:
>
> > Usually crash_image is defined statically via the crashkernel parameter
> > or DT.
> >
> > But if the crash kernel is not used, or is smaller than the
> > pre-allocated area, that memory is wasted.
> >
> > Also, if the crash kernel was not defined at boot time, there is no way to
> > use a crash kernel.
> >
> > Enable runtime allocation of the crash_image if the crash_image is not
> > defined statically, following the same memory allocation/validation path
> > as for the reboot kexec kernel.
> >
> > Signed-off-by: Ricardo Ribalda 
>
> I don't think this patch will work as intended. For one you omit
> setting the image->type to KEXEC_TYPE_CRASH. But when you grep for that
> type you will find that there is a lot of special handling done for it.
> I don't believe that this can simply be skipped without causing
> problems.
>
> Furthermore I think you have missed one important detail. The memory
> reserved for the crash kernel is not just a buffer for the image but
> the memory it runs in! For that it has to be a contiguous piece of
> physical memory, usually with some additional arch-specific limitations.
> When allocated dynamically all those limitations need to be considered.
> But a standard kexec doesn't care about those limitations, as it doesn't
> care about the OS running before itself. It can simply overwrite the
> memory when booting. But if the crash kernel does the same it will
> corrupt the dump it is supposed to generate.

Right now, I do not intend to use it to fetch a kdump, I am using it
as the image that will run when the system crashes.

It seems to work fine on the two devices that I am using for tests.

>
> Thanks
> Philipp
>
> > ---
> > kexec: Enable runtime allocation of crash_image
> >
> > To: Eric Biederman 
> > Cc: kexec@lists.infradead.org
> > Cc: linux-ker...@vger.kernel.org
> > Cc: Sergey Senozhatsky 
> > Cc: linux-ker...@vger.kernel.org
> > Cc: Ross Zwisler 
> > Cc: Philipp Rudo 
> > Cc: Baoquan He 
> > ---
> >  include/linux/kexec.h | 1 +
> >  kernel/kexec.c| 9 +
> >  kernel/kexec_core.c   | 5 +
> >  kernel/kexec_file.c   | 7 ---
> >  4 files changed, 15 insertions(+), 7 deletions(-)
> >
> > diff --git a/include/linux/kexec.h b/include/linux/kexec.h
> > index 41a686996aaa..98ca9a32bc8e 100644
> > --- a/include/linux/kexec.h
> > +++ b/include/linux/kexec.h
> > @@ -427,6 +427,7 @@ extern int kexec_load_disabled;
> >  extern bool kexec_in_progress;
> >
> >  int crash_shrink_memory(unsigned long new_size);
> > +bool __crash_memory_valid(void);
> >  ssize_t crash_get_memory_size(void);
> >
> >  #ifndef arch_kexec_protect_crashkres
> > diff --git a/kernel/kexec.c b/kernel/kexec.c
> > index cb8e6e6f983c..b5c17db25e88 100644
> > --- a/kernel/kexec.c
> > +++ b/kernel/kexec.c
> > @@ -28,7 +28,7 @@ static int kimage_alloc_init(struct kimage **rimage, unsigned long entry,
> >   struct kimage *image;
> >   bool kexec_on_panic = flags & KEXEC_ON_CRASH;
> >
> > - if (kexec_on_panic) {
> > + if (kexec_on_panic && __crash_memory_valid()) {
> >   /* Verify we have a valid entry point */
> >   if ((entry < phys_to_boot_phys(crashk_res.start)) ||
> >   (entry > phys_to_boot_phys(crashk_res.end)))
> > @@ -44,7 +44,7 @@ static int kimage_alloc_init(struct kimage **rimage, unsigned long entry,
> >   image->nr_segments = nr_segments;
> >   memcpy(image->segment, segments, nr_segments * sizeof(*segments));
> >
> > - if (kexec_on_panic) {
> > + if (kexec_on_panic && __crash_memory_valid()) {
> >   /* Enable special crash kernel control page alloc policy. */
> >   image->control_page = crashk_res.start;
> >   image->type = KEXEC_TYPE_CRASH;
> > @@ -101,7 +101,7 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments,
> >
> >   if (flags & KEXEC_ON_CRASH) {
> >   dest_image = _crash_image;
> > - if (kexec_crash_image)
> > + if (kexec_crash_image && __crash_memory_valid())
> >   arch_kexec_unprotect_crashkres();
> >   } else {
> >   dest_image = _image;
> > @@ -157,7 +157,8 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments,
> >   image = xchg(dest_image, image);
> >
> >  out:
> > - if ((flags & KEXEC_ON_CRASH) && kexec_crash_image)
> > + if ((flags & KEXEC_ON_CRASH) && kexec_crash_image &&
> > + __crash_memory_valid())
> >   arch_kexec_protect_crashkres();
> >
> >   kimage_free(image);
> > diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
> > index ca2743f9c634..77083c9760fb 100644
> > --- a/kernel/kexec_core.c
> > +++ b/kernel/kexec_core.c
> > @@ -1004,6 +1004,11 @@ void crash_kexec(struct pt_regs *regs)
> >   }
> >  }
> 

Re: [PATCH] kexec: Enable runtime allocation of crash_image

2022-11-28 Thread Philipp Rudo
Hi Ricardo,

On Thu, 24 Nov 2022 23:23:36 +0100
Ricardo Ribalda  wrote:

> Usually crash_image is defined statically via the crashkernel parameter
> or DT.
> 
> But if the crash kernel is not used, or is smaller than the
> pre-allocated area, that memory is wasted.
> 
> Also, if the crash kernel was not defined at boot time, there is no way to
> use a crash kernel.
> 
> Enable runtime allocation of the crash_image if the crash_image is not
> defined statically, following the same memory allocation/validation path
> as for the reboot kexec kernel.
> 
> Signed-off-by: Ricardo Ribalda 

I don't think this patch will work as intended. For one you omit
setting the image->type to KEXEC_TYPE_CRASH. But when you grep for that
type you will find that there is a lot of special handling done for it.
I don't believe that this can simply be skipped without causing
problems.

Furthermore I think you have missed one important detail. The memory
reserved for the crash kernel is not just a buffer for the image but
the memory it runs in! For that it has to be a contiguous piece of
physical memory, usually with some additional arch-specific limitations.
When allocated dynamically all those limitations need to be considered.
But a standard kexec doesn't care about those limitations, as it doesn't
care about the OS running before itself. It can simply overwrite the
memory when booting. But if the crash kernel does the same it will
corrupt the dump it is supposed to generate.
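For readers following along, the reservation Philipp refers to can be inspected through standard kernel interfaces (the paths are the stock procfs/sysfs ones; the values shown are system-specific):

```shell
# Standard procfs/sysfs interfaces (not thread-specific) for inspecting the
# boot-time crash-kernel reservation described above; output varies per system.
grep -i "crash kernel" /proc/iomem \
    || echo "no crashkernel reservation"       # the reserved physical range
cat /sys/kernel/kexec_crash_size 2>/dev/null \
    || echo "kexec_crash_size not available"   # current size of the reservation
```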

Thanks
Philipp

> ---
> kexec: Enable runtime allocation of crash_image
> 
> To: Eric Biederman 
> Cc: kexec@lists.infradead.org
> Cc: linux-ker...@vger.kernel.org
> Cc: Sergey Senozhatsky 
> Cc: linux-ker...@vger.kernel.org
> Cc: Ross Zwisler 
> Cc: Philipp Rudo 
> Cc: Baoquan He 
> ---
>  include/linux/kexec.h | 1 +
>  kernel/kexec.c| 9 +
>  kernel/kexec_core.c   | 5 +
>  kernel/kexec_file.c   | 7 ---
>  4 files changed, 15 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/kexec.h b/include/linux/kexec.h
> index 41a686996aaa..98ca9a32bc8e 100644
> --- a/include/linux/kexec.h
> +++ b/include/linux/kexec.h
> @@ -427,6 +427,7 @@ extern int kexec_load_disabled;
>  extern bool kexec_in_progress;
>  
>  int crash_shrink_memory(unsigned long new_size);
> +bool __crash_memory_valid(void);
>  ssize_t crash_get_memory_size(void);
>  
>  #ifndef arch_kexec_protect_crashkres
> diff --git a/kernel/kexec.c b/kernel/kexec.c
> index cb8e6e6f983c..b5c17db25e88 100644
> --- a/kernel/kexec.c
> +++ b/kernel/kexec.c
> @@ -28,7 +28,7 @@ static int kimage_alloc_init(struct kimage **rimage, unsigned long entry,
>   struct kimage *image;
>   bool kexec_on_panic = flags & KEXEC_ON_CRASH;
>  
> - if (kexec_on_panic) {
> + if (kexec_on_panic && __crash_memory_valid()) {
>   /* Verify we have a valid entry point */
>   if ((entry < phys_to_boot_phys(crashk_res.start)) ||
>   (entry > phys_to_boot_phys(crashk_res.end)))
> @@ -44,7 +44,7 @@ static int kimage_alloc_init(struct kimage **rimage, unsigned long entry,
>   image->nr_segments = nr_segments;
>   memcpy(image->segment, segments, nr_segments * sizeof(*segments));
>  
> - if (kexec_on_panic) {
> + if (kexec_on_panic && __crash_memory_valid()) {
>   /* Enable special crash kernel control page alloc policy. */
>   image->control_page = crashk_res.start;
>   image->type = KEXEC_TYPE_CRASH;
> @@ -101,7 +101,7 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments,
>  
>   if (flags & KEXEC_ON_CRASH) {
>   dest_image = _crash_image;
> - if (kexec_crash_image)
> + if (kexec_crash_image && __crash_memory_valid())
>   arch_kexec_unprotect_crashkres();
>   } else {
>   dest_image = _image;
> @@ -157,7 +157,8 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments,
>   image = xchg(dest_image, image);
>  
>  out:
> - if ((flags & KEXEC_ON_CRASH) && kexec_crash_image)
> + if ((flags & KEXEC_ON_CRASH) && kexec_crash_image &&
> + __crash_memory_valid())
>   arch_kexec_protect_crashkres();
>  
>   kimage_free(image);
> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
> index ca2743f9c634..77083c9760fb 100644
> --- a/kernel/kexec_core.c
> +++ b/kernel/kexec_core.c
> @@ -1004,6 +1004,11 @@ void crash_kexec(struct pt_regs *regs)
>   }
>  }
>  
> +bool __crash_memory_valid(void)
> +{
> + return crashk_res.end != crashk_res.start;
> +}
> +
>  ssize_t crash_get_memory_size(void)
>  {
>   ssize_t size = 0;
> diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
> index 45637511e0de..0671f4f370ff 100644
> --- a/kernel/kexec_file.c
> +++ b/kernel/kexec_file.c
> @@ -280,7 +280,7 @@ kimage_file_alloc_init(struct kimage **rimage, int kernel_fd,
>  
>   image->file_mode = 1;
>  
> - if (kexec_on_panic) {
> + if 

Re: [PATCH v1 2/2] kexec: Introduce kexec_reboot_disabled

2022-11-28 Thread Steven Rostedt
On Thu, 24 Nov 2022 16:01:15 +0100
Philipp Rudo  wrote:

> No, I think the implementation is fine. I'm currently only struggling
> to understand what problem kexec_reboot_disabled solves that cannot be
> solved by kexec_load_disabled.

Hi Philipp,

Thanks for working with us on this.

Let me try to explain our use case. We want kexec/kdump enabled, but we
really do not want kexec used for any other purpose. We must have the kexec
kernel loaded at boot up and not afterward.

Your recommendation of:

  kexec -p dump_kernel
  echo 1 > /proc/sys/kernel/kexec_load_disabled

can work, and we will probably add it. But we are taking the paranoid
approach, following what I learned in security 101 ;-) namely: expose only
the minimal attack surface possible.

Yes, it's highly unlikely that the above would crash. But as with most
security vulnerabilities, it's not going to be an attacker that creates a
new gadget here, but probably another script in the future that causes this
to be delayed or something, and a new window of opportunity will arise for
an attacker. Maybe that new window only works for non-panic kernels. Yes,
this is a contrived scenario, but the work required to add this feature is
very low compared to the risk it removes.

Perhaps the attack surface of a reboot kexec is this: the attacker gains the
ability at boot up to load a kexec kernel for reboot rather than panic.
The attacker must then wait for the victim to reboot their machine before
they have access to the new kernel. Again, I admit this is contrived, but
just because I can't think of a real situation where this could be a problem
doesn't mean that one doesn't exist.

In other words, if we never want to allow a kexec reboot, why allow it at
all from the beginning? The above allows it, until we don't. That alone
makes us nervous. Whereas this patch is rather trivial and doesn't add
complexity.

Thanks for your time, we appreciate it.

-- Steve

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v1 2/2] kexec: Introduce kexec_reboot_disabled

2022-11-28 Thread Philipp Rudo
Hi Ricardo,

On Thu, 24 Nov 2022 23:32:34 +0100
Ricardo Ribalda  wrote:

> Hi Philipp
> 
> 
> On Thu, 24 Nov 2022 at 16:01, Philipp Rudo  wrote:
> >
> > On Thu, 24 Nov 2022 13:52:58 +0100
> > Ricardo Ribalda  wrote:
> >  
> > > On Thu, 24 Nov 2022 at 12:40, Philipp Rudo  wrote:  
> > > >
> > > > Hi Ricardo,
> > > >
> > > > On Wed, 23 Nov 2022 09:58:08 +0100
> > > > Ricardo Ribalda  wrote:
> > > >  
> > > > > Hi Philipp
> > > > >
> > > > > Thanks for your review.
> > > > >
> > > > > My scenario is a trusted system, where even if you are root, your
> > > > > access to the system is very limited.
> > > > >
> > > > > Let's assume LOADPIN and verity are enabled.  
> > > >
> > > > My point is that on such systems I expect that a sysadmin also wants to
> > > > control the crash kernel including its initramfs (which also has to be part
> > > > of the signed kernel?). But if that's the case a sysadmin can simply arm
> > > > kdump early during boot and then toggle kexec_load_disabled. With that
> > > > LINUX_REBOOT_CMD_KEXEC also gets disabled as no kexec kernel can be loaded
> > > > while kdump works. Thus there is no need to add the new interface. Or am
> > > > I missing anything?  
> > >
> > > Let's say that you have a script that does something like this
> > >
> > >
> > > kexec -p dump_kernel
> > > echo 1 > /proc/sys/kernel/kexec_load_disabled
> > >
> > > If an attacker can DDoS the system and make that script crash... then
> > > kexec is still accessible
> > >
> > > On the other hand, if you load the kernel with the commandline
> > >
> > > sysctl.kernel.kexec_load_disabled=1  
> >   
> >   reboot?  
> 
> yes :)  thanks!
> 
> > Otherwise you shouldn't be able to load the crash kernel at all.
> >  
> > > Then even if the script crashes, the only way to abuse kexec is by
> > > panicing the running kernel  
> >
> > True. But when an attacker can DDoS the system, the final workload is
> > already running. So wouldn't it be enough to make sure that the script
> > above has finished before starting your workload? E.g. by setting an
> > appropriate Before=/After= in the systemd.unit?  
> 
> What if the kexec binary crashes and the unit will never succeed?

Then there are options like OnFailure= or FailureAction= the sysadmin
can use to do whatever he deems appropriate.

> Or worse, your distro does not use systemd !!!

In that case there are other ways to achieve the same. The two options
are only examples. Why can't one of them be used as a replacement?

> > Furthermore, I don't think that restricting kexec reboot alone is
> > sufficient when the attacker can still control the crash kernel. At
> > least my assumption is that triggering a panic instead of just
> > rebooting is just a mild inconvenience for somebody who is able to pull
> > off an attack like that.  
> 
> The attacker does not control the crash kernel completely. loadpin is
> still in place.
> Yes, they can downgrade the whole system to a vulnerable kernel image.
> But the choices are limited :)
> 
> With physical access to the device panicing a kernel is easily doable
> (but not trivial). But remotely, it is more challenging.


Well the same holds for kexec. So the only difference is triggering the
panic, and I'm still not convinced that is a huge obstacle for someone
who is able to pull off all the preceding steps of such an attack.

To be honest I don't think we are making progress here at the moment. I
would like to hear from others what they think about this.

Thanks
Philipp

> 
> >  
> > > Would it make you more comfortable if I model this as a kernel config
> > > instead of a runtime option?  
> >
> > No, I think the implementation is fine. I'm currently only struggling
> > to understand what problem kexec_reboot_disabled solves that cannot be
> > solved by kexec_load_disabled.
> >  
> > > Thanks!  
> >
> > Happy to help.
> >
> > Thanks
> > Philipp
> >  
> > >
> > >  
> > > >
> > > > Thanks
> > > > Philipp
> > > >  
> > > > >
> > > > > On Mon, 21 Nov 2022 at 15:10, Philipp Rudo  wrote:  
> > > > > >
> > > > > > Hi Ricardo,
> > > > > >
> > > > > > On Thu, 17 Nov 2022 16:15:07 +0100
> > > > > > Ricardo Ribalda  wrote:
> > > > > >  
> > > > > > > Hi Philipp
> > > > > > >
> > > > > > > Thanks for your review!  
> > > > > >
> > > > > > happy to help.
> > > > > >  
> > > > > > >
> > > > > > > On Thu, 17 Nov 2022 at 16:07, Philipp Rudo  wrote:  
> > > > > > > >
> > > > > > > > Hi Ricardo,
> > > > > > > >
> > > > > > > > all in all I think this patch makes sense. However, there is one point
> > > > > > > > I don't like...
> > > > > > > >
> > > > > > > > On Mon, 14 Nov 2022 14:18:39 +0100
> > > > > > > > Ricardo Ribalda  wrote:
> > > > > > > >  
> > > > > > > > > Create a new toggle that disables LINUX_REBOOT_CMD_KEXEC, reducing
> > > > > > > > > the attack surface of a system.
> > > > > > > > >
> > > > > > > > > Without this toggle, an attacker can only 

Re: [PATCH v14 3/7] crash: add generic infrastructure for crash hotplug support

2022-11-28 Thread Eric DeVolder




On 11/24/22 21:26, Baoquan He wrote:

On 11/16/22 at 04:46pm, Eric DeVolder wrote:
..

diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index ebf46c3b8f8b..b4dbc21f9081 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -32,6 +32,7 @@ extern note_buf_t __percpu *crash_notes;
  #include 
  #include 
  #include 
+#include 
  #include 
  
  /* Verify architecture specific macros are defined */

@@ -374,6 +375,13 @@ struct kimage {
struct purgatory_info purgatory_info;
  #endif
  
+#ifdef CONFIG_CRASH_HOTPLUG


This kernel config CRASH_HOTPLUG is added in patch 7, but we have used
it in the previous patch, not sure if this is acceptable.

I wasn't sure what to do here either. Patch 7 is the x86 arch-specific support patch, and 
CRASH_HOTPLUG is introduced in arch/x86/Kconfig. I did look at introducing CRASH_HOTPLUG as a 
generic/non-arch-specific option, but no location seemed appropriate given HOTPLUG_CPU is 
arch-specific and MEMORY_HOTPLUG is in mm/Kconfig.


This doesn't break bisect, but as you point out, not sure if the location in 
patch 7 is acceptable.
I'm not really sure how to resolve the question.

eric


+   bool hotplug_event;
+   unsigned int offlinecpu;
+   bool elfcorehdr_index_valid;
+   int elfcorehdr_index;
+#endif
+
  #ifdef CONFIG_IMA_KEXEC
/* Virtual address of IMA measurement buffer for kexec syscall */
void *ima_buffer;
@@ -503,6 +511,34 @@ static inline int arch_kexec_post_alloc_pages(void *vaddr, unsigned int pages, g
  static inline void arch_kexec_pre_free_pages(void *vaddr, unsigned int pages) { }
  #endif
  
+#ifndef arch_map_crash_pages

+/*
+ * NOTE: The addresses and sizes passed to this routine have
+ * already been fully aligned on page boundaries. There is no
+ * need for massaging the address or size.
+ */
+static inline void *arch_map_crash_pages(unsigned long paddr,
+   unsigned long size)
+{
+   if (size > 0)
+   return kmap_local_page(pfn_to_page(paddr >> PAGE_SHIFT));
+   else
+   return NULL;
+}
+#endif
+
+#ifndef arch_unmap_crash_pages
+static inline void arch_unmap_crash_pages(void *ptr)
+{
+   if (ptr)
+   kunmap_local(ptr);
+}
+#endif
+
+#ifndef arch_crash_handle_hotplug_event
+static inline void arch_crash_handle_hotplug_event(struct kimage *image) { }
+#endif
+
  #else /* !CONFIG_KEXEC_CORE */
  struct pt_regs;
  struct task_struct;
diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index 8c648fd5897a..4e7221226976 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -11,6 +11,8 @@
  #include 
  #include 
  #include 
+#include 
+#include 
  
  #include 

  #include 
@@ -18,6 +20,7 @@
  #include 
  
  #include "kallsyms_internal.h"

+#include "kexec_internal.h"
  
  /* vmcoreinfo stuff */

  unsigned char *vmcoreinfo_data;
@@ -612,3 +615,139 @@ static int __init crash_save_vmcoreinfo_init(void)
  }
  
  subsys_initcall(crash_save_vmcoreinfo_init);

+
+#ifdef CONFIG_CRASH_HOTPLUG
+#undef pr_fmt
+#define pr_fmt(fmt) "crash hp: " fmt
+/*
+ * To accurately reflect hot un/plug changes, the elfcorehdr (which
+ * is passed to the crash kernel via the elfcorehdr= parameter)
+ * must be updated with the new list of CPUs and memories.
+ *
+ * In order to make changes to elfcorehdr, two conditions are needed:
+ * First, the segment containing the elfcorehdr must be large enough
+ * to permit a growing number of resources; the elfcorehdr memory size
+ * is based on NR_CPUS_DEFAULT and CRASH_MAX_MEMORY_RANGES.
+ * Second, purgatory must explicitly exclude the elfcorehdr from the
+ * list of segments it checks (since the elfcorehdr changes and thus
+ * would require an update to purgatory itself to update the digest).
+ */
+static void handle_hotplug_event(unsigned int hp_action, unsigned int cpu)
+{
+   /* Obtain lock while changing crash information */
+   if (kexec_trylock()) {
+
+   /* Check kdump is loaded */
+   if (kexec_crash_image) {
+   struct kimage *image = kexec_crash_image;
+
+   if (hp_action == KEXEC_CRASH_HP_ADD_CPU ||
+   hp_action == KEXEC_CRASH_HP_REMOVE_CPU)
+   pr_debug("hp_action %u, cpu %u\n", hp_action, cpu);
+   else
+   pr_debug("hp_action %u\n", hp_action);
+
+   /*
+* When the struct kimage is allocated, it is wiped to zero, so
+* the elfcorehdr_index_valid defaults to false. Find the
+* segment containing the elfcorehdr, if not already found.
+* This works for both the kexec_load and kexec_file_load paths.
+*/
+   if (!image->elfcorehdr_index_valid) {
+   unsigned long mem, memsz;
+   

[PATCH v3] x86/bugs: Explicitly clear speculative MSR bits

2022-11-28 Thread Breno Leitao
Currently x86_spec_ctrl_base is read at boot time, and speculative bits
are set if the corresponding configs are enabled; for example, the
SPEC_CTRL_IBRS bit of MSR_IA32_SPEC_CTRL is set if CONFIG_CPU_IBRS_ENTRY
is configured. These MSR bits are not cleared if the mitigations are
disabled.

This is a problem when kexec-ing into a kernel that has the mitigation
disabled from a kernel that has the mitigation enabled. In this case,
the MSR bits are carried forward and not cleared at the boot of the new
kernel. This can cause performance degradation that is hard to track
down.

This problem does not happen if the machine is (hard) rebooted, because
the bits will be cleared by default.

Suggested-by: Pawan Gupta 
Signed-off-by: Breno Leitao 
---
 arch/x86/include/asm/msr-index.h |  4 
 arch/x86/kernel/cpu/bugs.c   | 10 +-
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 4a2af82553e4..22986a8f18bc 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -54,6 +54,10 @@
 #define SPEC_CTRL_RRSBA_DIS_S_SHIFT6  /* Disable RRSBA behavior */
 #define SPEC_CTRL_RRSBA_DIS_S  BIT(SPEC_CTRL_RRSBA_DIS_S_SHIFT)
 
+/* A mask for bits which the kernel toggles when controlling mitigations */
+#define SPEC_CTRL_MITIGATIONS_MASK (SPEC_CTRL_IBRS | SPEC_CTRL_STIBP | SPEC_CTRL_SSBD \
+   | SPEC_CTRL_RRSBA_DIS_S)
+
 #define MSR_IA32_PRED_CMD  0x0049 /* Prediction Command */
 #define PRED_CMD_IBPB  BIT(0) /* Indirect Branch Prediction Barrier */
 
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 3e3230cccaa7..4030358216c8 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -137,8 +137,16 @@ void __init check_bugs(void)
 * have unknown values. AMD64_LS_CFG MSR is cached in the early AMD
 * init code as it is not enumerated and depends on the family.
 */
-   if (boot_cpu_has(X86_FEATURE_MSR_SPEC_CTRL))
+   if (boot_cpu_has(X86_FEATURE_MSR_SPEC_CTRL)) {
rdmsrl(MSR_IA32_SPEC_CTRL, x86_spec_ctrl_base);
+   /*
+* Previously running software, like kexec for example, may
+* have some controls turned ON.
+* Clear them and let the mitigations setup below set them
+* based on configuration.
+*/
+   x86_spec_ctrl_base &= ~SPEC_CTRL_MITIGATIONS_MASK;
+   }
 
/* Select the proper CPU mitigations before patching alternatives: */
spectre_v1_select_mitigation();
-- 
2.30.2




Re: nvdimm,pmem: makedumpfile: __vtop4_x86_64: Can't get a valid pte.

2022-11-28 Thread Eliot Moss

On 11/28/2022 9:46 AM, lizhij...@fujitsu.com wrote:



On 28/11/2022 20:53, Eliot Moss wrote:

On 11/28/2022 7:04 AM, lizhij...@fujitsu.com wrote:

Hi folks,

I'm going to make crash coredump support pmem regions, so I have modified
kexec-tools to add the pmem region to the PT_LOAD segments of vmcore.

But it failed at makedumpfile; the logs are as follows.

In my environment, I found that the last 512 pages in the pmem region cause
the error.


I wonder if an issue I reported is related: when set up to map
2Mb (huge) pages, the last 2Mb of a large region got mapped as
4Kb pages, and then later, half of a large region was treated
that way.


Could you share the url/link ? I'd like to take a look


It was in a previous email to the nvdimm list.  The title was:

"Possible PMD (huge pages) bug in fs dax"

And here is the body.  I just sent directly to the list so there
is no URL (if I should be engaging in a different way, please let me know):

Folks - I posted already on nvdimm, but perhaps the topic did not quite grab
anyone's attention.  I had had some trouble figuring all the details to get
dax mapping of files from an xfs file system with underlying Optane DC memory
going, but now have that working reliably.  But there is an odd behavior:

When first mapping a file, I request mapping a 32 Gb range, aligned on a 1 Gb
(and thus clearly on a 2 Mb) boundary.

For each group of 8 Gb, the first 4095 entries map with a 2 Mb huge (PMD)
page.  The 4096th one does FALLBACK.  I suspect some problem in
dax.c:grab_mapping_entry or its callees, but am not personally well enough
versed in either the dax code or the xarray implementation to dig further.


If you'd like a second puzzle  ... after completing this mapping, another
thread accesses the whole range sequentially.  This results in NOPAGE fault
handling for the first 4095+4095 2M regions that previously resulted in
NOPAGE -- so far so good.  But it gives FALLBACK for the upper 16 Gb (except
the two PMD regions it already gave FALLBACK for).


I can provide trace output from a run if you'd like and all the ndctl, gdisk
-l, fdisk -l, and xfs_info details if you like.


In my application, it would be nice if dax.c could deliver 1 Gb PUD size
mappings as well, though it would appear that that would require more surgery
on dax.c.  It would be somewhat analogous to what's already there, of course,
but I don't mean to minimize the possible trickiness of it.  I realize I
should submit that request as a separate thread, which I intend to do
later.


Regards - Eliot Moss



Re: nvdimm,pmem: makedumpfile: __vtop4_x86_64: Can't get a valid pte.

2022-11-28 Thread lizhij...@fujitsu.com


On 28/11/2022 20:53, Eliot Moss wrote:
> On 11/28/2022 7:04 AM, lizhij...@fujitsu.com wrote:
>> Hi folks,
>>
>> I'm going to make crash coredump support pmem regions, so I have modified
>> kexec-tools to add the pmem region to the PT_LOAD segments of vmcore.
>>
>> But it failed at makedumpfile; the logs are as follows.
>>
>> In my environment, I found that the last 512 pages in the pmem region cause
>> the error.
> 
> I wonder if an issue I reported is related: when set up to map
> 2Mb (huge) pages, the last 2Mb of a large region got mapped as
> 4Kb pages, and then later, half of a large region was treated
> that way.
> 
Could you share the url/link ? I'd like to take a look



> I've seen no response to the report, but assume folks have
> been busy with other things or perhaps giving this lower
> priority since it does not exactly *fail*; it just does not work as
> a user might think it should.
> 
> Regards - Eliot Moss

Re: nvdimm,pmem: makedumpfile: __vtop4_x86_64: Can't get a valid pte.

2022-11-28 Thread Eliot Moss

On 11/28/2022 7:04 AM, lizhij...@fujitsu.com wrote:

Hi folks,

I'm going to make crash coredump support pmem regions, so I have modified
kexec-tools to add the pmem region to the PT_LOAD segments of vmcore.

But it failed at makedumpfile; the logs are as follows.

In my environment, I found that the last 512 pages in the pmem region cause
the error.


I wonder if an issue I reported is related: when set up to map
2Mb (huge) pages, the last 2Mb of a large region got mapped as
4Kb pages, and then later, half of a large region was treated
that way.

I've seen no response to the report, but assume folks have
been busy with other things or perhaps giving this lower
priority since it does not exactly *fail*; it just does not work as
a user might think it should.

Regards - Eliot Moss



nvdimm,pmem: makedumpfile: __vtop4_x86_64: Can't get a valid pte.

2022-11-28 Thread lizhij...@fujitsu.com
Hi folks,

I'm going to make crash coredump support pmem regions, so I have modified
kexec-tools to add the pmem region to the PT_LOAD segments of vmcore.

But it failed at makedumpfile; the logs are as follows.

In my environment, I found that the last 512 pages in the pmem region cause
the error.

qemu commandline:
  -object memory-backend-file,id=memnvdimm0,prealloc=yes,mem-path=/root/qemu-dax.img,share=yes,size=4267704320,align=2097152
-device nvdimm,node=0,label-size=4194304,memdev=memnvdimm0,id=nvdimm0,slot=0

ndctl info:
[root@rdma-server ~]# ndctl list
[
   {
 "dev":"namespace0.0",
 "mode":"devdax",
 "map":"dev",
 "size":4127195136,
 "uuid":"f6fc1e86-ac5b-48d8-9cda-4888a33158f9",
 "chardev":"dax0.0",
 "align":4096
   }
]
[root@rdma-server ~]# ndctl list -iRD
{
   "dimms":[
 {
   "dev":"nmem0",
   "id":"8680-56341200",
   "handle":1,
   "phys_id":0
 }
   ],
   "regions":[
 {
   "dev":"region0",
   "size":4263510016,
   "align":16777216,
   "available_size":0,
   "max_available_extent":0,
   "type":"pmem",
   "iset_id":10248187106440278,
   "mappings":[
 {
   "dimm":"nmem0",
   "offset":0,
   "length":4263510016,
   "position":0
 }
   ],
   "persistence_domain":"unknown"
 }
   ]
}

iomem info:
[root@rdma-server ~]# cat /proc/iomem  | grep Persi
14000-23e1f : Persistent Memory

makedumpfile info:
[   57.229110] kdump.sh[240]: mem_map[  71] ea0008e0   238000   23e200


First, some observations and guesses:
1) makedumpfile reads the whole range of the iomem region (the same as the
PT_LOAD of pmem).
2) The 1st kernel only sets up the mem_map (vmemmap) for this namespace,
whose size is 512 pages smaller than the iomem range, for some reason.
3) Since the nvdimm region has an alignment (16 MiB above), I also guess
the maximum size of the pmem usable by the user should be
ALIGN_DOWN(iomem, 16 MiB); after this alignment, the last 512 pages are
dropped, and the kernel only sets up the vmemmap for this range. But I
didn't see any code doing such things on the kernel side.

So if you guys know the reason, please let me know :), any hint/feedback is
very welcome.


[   56.380802] kdump.sh[240]:   OFFSET(atomic_long_t.counter)=0
[   56.385976] kdump.sh[240]:   SIZE(latched_seq)=64
[   56.390217] kdump.sh[240]:   OFFSET(latched_seq.val)=48
[   56.395750] kdump.sh[240]:   LENGTH(free_area.free_list)=5
[   56.401295] kdump.sh[240]:   NUMBER(NR_FREE_PAGES)=0
[   56.406772] kdump.sh[240]:   NUMBER(PG_lru)=4
[   56.408783] kdump.sh[240]:   NUMBER(PG_private)=13
[   56.415401] kdump.sh[240]:   NUMBER(PG_swapcache)=10
[   56.421269] kdump.sh[240]:   NUMBER(PG_swapbacked)=19
[   56.426797] kdump.sh[240]:   NUMBER(PG_slab)=9
[   56.428911] kdump.sh[240]:   NUMBER(PG_head_mask)=65536
[   56.435175] kdump.sh[240]:   NUMBER(PAGE_BUDDY_MAPCOUNT_VALUE)=-129
[   56.437522] kdump.sh[240]:   NUMBER(HUGETLB_PAGE_DTOR)=2
[   56.442233] kdump.sh[240]:   NUMBER(PAGE_OFFLINE_MAPCOUNT_VALUE)=-257
[   56.446943] kdump.sh[240]:   SYMBOL(kallsyms_names)=9e5eb9b0
[   56.452486] kdump.sh[240]:   SYMBOL(kallsyms_num_syms)=9e5eb9a8
[   56.458891] kdump.sh[240]:   SYMBOL(kallsyms_token_table)=9e76e038
Excluding unnecessary pages   : [  2.7 %] \ 
 __vtop4_x86_64: Can't get a valid pte.
[   56.476355] kdump.sh[240]: readmem: Can't convert a virtual address(ea0008f8) to physical address.
[   56.483192] kdump.sh[240]: readmem: type_addr: 0, addr:ea0008f8, size:32768
[   56.489350] kdump.sh[240]: __exclude_unnecessary_pages: Can't read the buffer of struct page.
[   56.494516] kdump.sh[240]: create_2nd_bitmap: Can't exclude unnecessary pages.
[   56.501427] kdump[242]: saving vmcore failed, _exitcode:1
[   56.506871] kdump.sh[240]:   SYMBOL(kallsyms_token_index)=9e76e3e8
[   56.511820] kdump.sh[240]:   SYMBOL(kallsyms_offsets)=9e574240
[   56.516816] kdump.sh[240]:   SYMBOL(kallsyms_relative_base)=9e5eb9a0
[   56.522270] kdump.sh[240]:   NUMBER(phys_base)=4330618880
[   56.526372] kdump.sh[240]:   SYMBOL(init_top_pgt)=9ea26000
[   56.531827] kdump.sh[240]:   NUMBER(pgtable_l5_enabled)=0
[   56.535773] kdump.sh[240]:   SYMBOL(node_data)=9f2829a0
[   56.540830] kdump.sh[240]:   LENGTH(node_data)=64
[   56.545405] kdump.sh[240]:   KERNELOFFSET=1b40
[   56.549841] kdump.sh[240]:   NUMBER(KERNEL_IMAGE_SIZE)=1073741824
[   56.554862] kdump.sh[240]:   NUMBER(sme_mask)=0
[   56.558820] kdump.sh[240]:   CRASHTIME=1669635006
[   56.562812] kdump.sh[240]: phys_base: 10220 (vmcoreinfo)
[   56.567343] kdump.sh[240]: max_mapnr: 23e200
[   56.572837] kdump.sh[240]: There is enough free memory to be done in one cycle.
[   56.577843] kdump.sh[240]: Buffer size for the cyclic mode: 587904
[   56.582390] kdump.sh[240]: The kernel version is not supported.
[   56.590812] kdump.sh[240]: The makedumpfile