Re: [PATCH v5 1/3] powerpc: make fadump resilient with memory add/remove events

2023-11-23 Thread Sourabh Jain

Hello Michael,

On 22/11/23 18:22, Michael Ellerman wrote:

Sourabh Jain  writes:

On 22/11/23 10:47, Michael Ellerman wrote:

Aneesh Kumar K V  writes:
...

I am not sure whether we need to add all the complexity to enable supporting 
different fadump kernel
version. Is that even a possible use case with fadump? Can't we always assume 
that with fadump the
crash kernel and fadump kernel will be same version?

How sure are we of that?

Don't we go through grub when we boot into the 2nd kernel. And so
couldn't it choose to boot a different kernel, for whatever reason.

I don't think we need to support different pt_reg / cpumask sizes, but
requiring the exact same kernel version is too strict I think.

Agree.

But maybe I'm wrong. Would be good to hear what distro folks think.

How about checking fadump crash info header compatibility in the
following way?

static bool is_fadump_header_compatible(struct fadump_crash_info_header
*fdh)
{
      if (fdh->magic_number == FADUMP_CRASH_INFO_MAGIC_OLD) {
      pr_err("Old magic number, can't process the dump.");
      return false;
      }

      if (fdh->magic_number != FADUMP_CRASH_INFO_MAGIC) {
      pr_err("Fadump header is corrupted.");
      return false;
      }

      /*
       * If the kernel version of the first/crashed kernel and the
second/fadump
       * kernel is not same, then only collect the dump if the size of all
       * non-primitive type members of the fadump header is the same
across kernels.
       */
      if (strcmp(fdh->kernel_version, init_uts_ns.name.release)) {
  
I don't think the kernel version check is necessary?


I didn't find a place where pt_regs members are accessed to take
a decision in fadump kernel. we just copy the pt_regs in fadump kernel.

So I think as long as size is same across kernels, we are good.




      if (fdh->pt_regs_sz != sizeof(struct pt_regs) || fdh->cpu_mask_sz != 
sizeof(struct cpumask)) {
          pr_err("Fadump header size mismatch.\n")
          return false;

Yeah I think that works.


      } else
          pr_warn("Kernel version mismatch; dump data is unreliable.\n");
      }

      return true;
}

And the new fadump crash info header will be: As suggested by Hari.

/* fadump crash info structure */
struct fadump_crash_info_header {
      u64        magic_number;
+  u32        version;
  
Do we need version? We're effectively using the magic_number as a

version already. Having an actual version would allow us to make
backward compatible changes in future, but it's not clear we'll need or
want to do that.
Agree that currently version field is not used but I added a version 
field to

make the fadump header structure compatible with future changes without
changing the magic number.

I will add a comment on how version field should be utilized if one 
changes fadump

header in future.




      u32        crashing_cpu;
      u64        elfcorehdr_addr;
+  u64        elfcorehdr_size;
+  u64        vmcoreinfo_raddr;
+  u64        vmcoreinfo_size;
+  u8          kernel_version[__NEW_UTS_LEN + 1];
+  u32        pt_regs_sz;
      struct pt_regs    regs;
+  u32        cpu_mask_sz;
  
Typically you would put all the size fields before the variable sized

fields, because otherwise the later size fields may not be where you
expect them to be. But because we're bailing out if any of the sizes
don't match it doesn't actually matter.


Yeah, but I will reorganize fadump header and put the size fields before the
variable sized fields.




      struct cpumask    cpu_mask;
};

The other issue is endian. I assume we're just declaring that the
1st/2nd kernel must be the same endian? We should still make that
explicit though.


A comment is fine or should we add a explicit check and error out
with relevant error message if endianness is not same across the
kernels?

Something like:

if (fdh->magic_number != FADUMP_CRASH_INFO_MAGIC) {
    if (fdh->magic_number == swab64(FADUMP_CRASH_INFO_MAGIC)) {
        pr_err("Endianness mismatch");
    } else {
        pr_err("Fadump header is corrupted.");
    }
    return false;
}




To make it clearer that this struct is somewhat an ABI, I think we
should move the definition into arch/powerpc/include/uapi/asm/fadump.h

Sure


We don't expect userspace to actually use the header, but it will
hopefully remind everyone that the struct needs to be changed with care :)

Agree

Thanks,
Sourabh Jain


Re: [PATCH v5 1/3] powerpc: make fadump resilient with memory add/remove events

2023-11-23 Thread Sourabh Jain

Hello Aneesh,

On 22/11/23 17:50, Aneesh Kumar K V wrote:

On 11/22/23 4:05 PM, Sourabh Jain wrote:

Hello Michael,


On 22/11/23 10:47, Michael Ellerman wrote:

Aneesh Kumar K V  writes:
...

I am not sure whether we need to add all the complexity to enable supporting 
different fadump kernel
version. Is that even a possible use case with fadump? Can't we always assume 
that with fadump the
crash kernel and fadump kernel will be same version?

How sure are we of that?

Don't we go through grub when we boot into the 2nd kernel. And so
couldn't it choose to boot a different kernel, for whatever reason.

I don't think we need to support different pt_reg / cpumask sizes, but
requiring the exact same kernel version is too strict I think.

Agree.

But maybe I'm wrong. Would be good to hear what distro folks think.

How about checking fadump crash info header compatibility in the following way?

static bool is_fadump_header_compatible(struct fadump_crash_info_header *fdh)
{
     if (fdh->magic_number == FADUMP_CRASH_INFO_MAGIC_OLD) {
     pr_err("Old magic number, can't process the dump.");
     return false;
     }

     if (fdh->magic_number != FADUMP_CRASH_INFO_MAGIC) {
     pr_err("Fadump header is corrupted.");
     return false;
     }

     /*
      * If the kernel version of the first/crashed kernel and the second/fadump
      * kernel is not same, then only collect the dump if the size of all
      * non-primitive type members of the fadump header is the same across 
kernels.
      */
     if (strcmp(fdh->kernel_version, init_uts_ns.name.release)) {
     if (fdh->pt_regs_sz != sizeof(struct pt_regs) || fdh->cpu_mask_sz != 
sizeof(struct cpumask)) {
         pr_err("Fadump header size mismatch.\n")
         return false;
     } else
         pr_warn("Kernel version mismatch; dump data is unreliable.\n");
     }


You also want a fdh->version check?


Even though we don't have any action against an fdh->version right now, 
I think I should
check the fadump header version. Currently, if the version doesn't 
match, it means the

header is corrupted.


I am still not sure you need the kernel_version check. IMHO that is too strict
and can hit that check where you have kexec kernel which is not installed in 
/boot crashing?


If the kernel versions mismatch, we still collect the dump if the 
`pt_regs` and `cpu_mask`
sizes are the same across the kernels. The kernel version check is just 
to warn users that

the collected dump may be unreliable.

Should I remove the kernel_version filed from fadump crash info header 
and remove the

the kernel version check while processing the kernel dump?

Thanks,
Sourabh Jain


Re: [PATCH v5 1/3] powerpc: make fadump resilient with memory add/remove events

2023-11-22 Thread Michael Ellerman
Sourabh Jain  writes:
> On 22/11/23 10:47, Michael Ellerman wrote:
>> Aneesh Kumar K V  writes:
>> ...
>>> I am not sure whether we need to add all the complexity to enable 
>>> supporting different fadump kernel
>>> version. Is that even a possible use case with fadump? Can't we always 
>>> assume that with fadump the
>>> crash kernel and fadump kernel will be same version?
>> How sure are we of that?
>>
>> Don't we go through grub when we boot into the 2nd kernel. And so
>> couldn't it choose to boot a different kernel, for whatever reason.
>>
>> I don't think we need to support different pt_reg / cpumask sizes, but
>> requiring the exact same kernel version is too strict I think.
> Agree.
>>
>> But maybe I'm wrong. Would be good to hear what distro folks think.
>
> How about checking fadump crash info header compatibility in the 
> following way?
>
> static bool is_fadump_header_compatible(struct fadump_crash_info_header 
> *fdh)
> {
>      if (fdh->magic_number == FADUMP_CRASH_INFO_MAGIC_OLD) {
>      pr_err("Old magic number, can't process the dump.");
>      return false;
>      }
>
>      if (fdh->magic_number != FADUMP_CRASH_INFO_MAGIC) {
>      pr_err("Fadump header is corrupted.");
>      return false;
>      }
>
>      /*
>       * If the kernel version of the first/crashed kernel and the 
> second/fadump
>       * kernel is not same, then only collect the dump if the size of all
>       * non-primitive type members of the fadump header is the same 
> across kernels.
>       */
>      if (strcmp(fdh->kernel_version, init_uts_ns.name.release)) {
 
I don't think the kernel version check is necessary?

>      if (fdh->pt_regs_sz != sizeof(struct pt_regs) || fdh->cpu_mask_sz != 
> sizeof(struct cpumask)) {
>          pr_err("Fadump header size mismatch.\n")
>          return false;

Yeah I think that works.

>      } else
>          pr_warn("Kernel version mismatch; dump data is unreliable.\n");
>      }
>
>      return true;
> }
>
> And the new fadump crash info header will be: As suggested by Hari.
>
> /* fadump crash info structure */
> struct fadump_crash_info_header {
>      u64        magic_number;
> +  u32        version;
 
Do we need version? We're effectively using the magic_number as a
version already. Having an actual version would allow us to make
backward compatible changes in future, but it's not clear we'll need or
want to do that.

>      u32        crashing_cpu;
>      u64        elfcorehdr_addr;
> +  u64        elfcorehdr_size;
> +  u64        vmcoreinfo_raddr;
> +  u64        vmcoreinfo_size;
> +  u8          kernel_version[__NEW_UTS_LEN + 1];
> +  u32        pt_regs_sz;
>      struct pt_regs    regs;
> +  u32        cpu_mask_sz;
 
Typically you would put all the size fields before the variable sized
fields, because otherwise the later size fields may not be where you
expect them to be. But because we're bailing out if any of the sizes
don't match it doesn't actually matter.

>      struct cpumask    cpu_mask;
> };

The other issue is endian. I assume we're just declaring that the
1st/2nd kernel must be the same endian? We should still make that
explicit though.

To make it clearer that this struct is somewhat an ABI, I think we
should move the definition into arch/powerpc/include/uapi/asm/fadump.h

We don't expect userspace to actually use the header, but it will
hopefully remind everyone that the struct needs to be changed with care :)

cheers


Re: [PATCH v5 1/3] powerpc: make fadump resilient with memory add/remove events

2023-11-22 Thread Aneesh Kumar K V
On 11/22/23 4:05 PM, Sourabh Jain wrote:
> Hello Michael,
> 
> 
> On 22/11/23 10:47, Michael Ellerman wrote:
>> Aneesh Kumar K V  writes:
>> ...
>>> I am not sure whether we need to add all the complexity to enable 
>>> supporting different fadump kernel
>>> version. Is that even a possible use case with fadump? Can't we always 
>>> assume that with fadump the
>>> crash kernel and fadump kernel will be same version?
>> How sure are we of that?
>>
>> Don't we go through grub when we boot into the 2nd kernel. And so
>> couldn't it choose to boot a different kernel, for whatever reason.
>>
>> I don't think we need to support different pt_reg / cpumask sizes, but
>> requiring the exact same kernel version is too strict I think.
> Agree.
>>
>> But maybe I'm wrong. Would be good to hear what distro folks think.
> 
> How about checking fadump crash info header compatibility in the following 
> way?
> 
> static bool is_fadump_header_compatible(struct fadump_crash_info_header *fdh)
> {
>     if (fdh->magic_number == FADUMP_CRASH_INFO_MAGIC_OLD) {
>     pr_err("Old magic number, can't process the dump.");
>     return false;
>     }
> 
>     if (fdh->magic_number != FADUMP_CRASH_INFO_MAGIC) {
>     pr_err("Fadump header is corrupted.");
>     return false;
>     }
> 
>     /*
>      * If the kernel version of the first/crashed kernel and the second/fadump
>      * kernel is not same, then only collect the dump if the size of all
>      * non-primitive type members of the fadump header is the same across 
> kernels.
>      */
>     if (strcmp(fdh->kernel_version, init_uts_ns.name.release)) {
>     if (fdh->pt_regs_sz != sizeof(struct pt_regs) || fdh->cpu_mask_sz != 
> sizeof(struct cpumask)) {
>         pr_err("Fadump header size mismatch.\n")
>         return false;
>     } else
>         pr_warn("Kernel version mismatch; dump data is unreliable.\n");
>     }
> 

You also want a fdh->version check? I am still not sure you need the 
kernel_version check. IMHO that is too strict
and can hit that check where you have kexec kernel which is not installed in 
/boot crashing? 


>     return true;
> }
> 
> And the new fadump crash info header will be: As suggested by Hari.
> 
> /* fadump crash info structure */
> struct fadump_crash_info_header {
>     u64        magic_number;
> +  u32        version;
>     u32        crashing_cpu;
>     u64        elfcorehdr_addr;
> +  u64        elfcorehdr_size;
> +  u64        vmcoreinfo_raddr;
> +  u64        vmcoreinfo_size;
> +  u8          kernel_version[__NEW_UTS_LEN + 1];
> +  u32        pt_regs_sz;
>     struct pt_regs    regs;
> +  u32        cpu_mask_sz;
>     struct cpumask    cpu_mask;
> };
> 
> Thanks,
> Sourabh Jain



Re: [PATCH v5 1/3] powerpc: make fadump resilient with memory add/remove events

2023-11-22 Thread Sourabh Jain

Hello Michael,


On 22/11/23 10:47, Michael Ellerman wrote:

Aneesh Kumar K V  writes:
...

I am not sure whether we need to add all the complexity to enable supporting 
different fadump kernel
version. Is that even a possible use case with fadump? Can't we always assume 
that with fadump the
crash kernel and fadump kernel will be same version?

How sure are we of that?

Don't we go through grub when we boot into the 2nd kernel. And so
couldn't it choose to boot a different kernel, for whatever reason.

I don't think we need to support different pt_reg / cpumask sizes, but
requiring the exact same kernel version is too strict I think.

Agree.


But maybe I'm wrong. Would be good to hear what distro folks think.


How about checking fadump crash info header compatibility in the 
following way?


static bool is_fadump_header_compatible(struct fadump_crash_info_header 
*fdh)

{
    if (fdh->magic_number == FADUMP_CRASH_INFO_MAGIC_OLD) {
    pr_err("Old magic number, can't process the dump.");
    return false;
    }

    if (fdh->magic_number != FADUMP_CRASH_INFO_MAGIC) {
    pr_err("Fadump header is corrupted.");
    return false;
    }

    /*
     * If the kernel version of the first/crashed kernel and the 
second/fadump

     * kernel is not same, then only collect the dump if the size of all
     * non-primitive type members of the fadump header is the same 
across kernels.

     */
    if (strcmp(fdh->kernel_version, init_uts_ns.name.release)) {
    if (fdh->pt_regs_sz != sizeof(struct pt_regs) || 
fdh->cpu_mask_sz != sizeof(struct cpumask)) {

        pr_err("Fadump header size mismatch.\n")
        return false;
    } else
        pr_warn("Kernel version mismatch; dump data is unreliable.\n");
    }

    return true;
}

And the new fadump crash info header will be: As suggested by Hari.

/* fadump crash info structure */
struct fadump_crash_info_header {
    u64        magic_number;
+  u32        version;
    u32        crashing_cpu;
    u64        elfcorehdr_addr;
+  u64        elfcorehdr_size;
+  u64        vmcoreinfo_raddr;
+  u64        vmcoreinfo_size;
+  u8          kernel_version[__NEW_UTS_LEN + 1];
+  u32        pt_regs_sz;
    struct pt_regs    regs;
+  u32        cpu_mask_sz;
    struct cpumask    cpu_mask;
};

Thanks,
Sourabh Jain


Re: [PATCH v5 1/3] powerpc: make fadump resilient with memory add/remove events

2023-11-21 Thread Michael Ellerman
Aneesh Kumar K V  writes:
...
>
> I am not sure whether we need to add all the complexity to enable supporting 
> different fadump kernel
> version. Is that even a possible use case with fadump? Can't we always assume 
> that with fadump the
> crash kernel and fadump kernel will be same version?

How sure are we of that?

Don't we go through grub when we boot into the 2nd kernel. And so
couldn't it choose to boot a different kernel, for whatever reason.

I don't think we need to support different pt_reg / cpumask sizes, but
requiring the exact same kernel version is too strict I think.

But maybe I'm wrong. Would be good to hear what distro folks think.

cheers


Re: [PATCH v5 1/3] powerpc: make fadump resilient with memory add/remove events

2023-11-16 Thread Hari Bathini




On 17/11/23 11:01 am, Aneesh Kumar K V wrote:

On 11/17/23 10:03 AM, Sourabh Jain wrote:

Hi Aneesh,

Thanks for reviewing the patch.

On 15/11/23 10:14, Aneesh Kumar K.V wrote:

Sourabh Jain  writes:




diff --git a/arch/powerpc/include/asm/fadump-internal.h 
b/arch/powerpc/include/asm/fadump-internal.h
index 27f9e11eda28..7be3d8894520 100644
--- a/arch/powerpc/include/asm/fadump-internal.h
+++ b/arch/powerpc/include/asm/fadump-internal.h
@@ -42,7 +42,25 @@ static inline u64 fadump_str_to_u64(const char *str)
     #define FADUMP_CPU_UNKNOWN    (~((u32)0))
   -#define FADUMP_CRASH_INFO_MAGIC    fadump_str_to_u64("FADMPINF")
+/*
+ * The introduction of new fields in the fadump crash info header has
+ * led to a change in the magic key, from `FADMPINF` to `FADMPSIG`.
+ * This alteration ensures backward compatibility, enabling the kernel
+ * with the updated fadump crash info to handle kernel dumps from older
+ * kernels.
+ *
+ * To prevent the need for further changes to the magic number in the
+ * event of future modifications to the fadump header, a version field
+ * has been introduced to track the fadump crash info header version.
+ *
+ * Historically, there was no connection between the magic number and
+ * the fadump crash info header version. However, moving forward, the
+ * `FADMPINF` magic number in header will be treated as version 0, while
+ * the `FADMPSIG` magic number in header will include a version field to
+ * determine its version.
+ */
+#define FADUMP_CRASH_INFO_MAGIC    fadump_str_to_u64("FADMPSIG")
+#define FADUMP_VERSION    1


Can we keep the old magic details as

#define FADUMP_CRASH_INFO_MAGIC_OLD    fadump_str_to_u64("FADMPINF")
#define FADUMP_CRASH_INFO_MAGIC    fadump_str_to_u64("FADMPSIG")


Sure.


Also considering the struct need not be backward compatible, can we just
do

struct fadump_crash_info_header {
 u64    magic_number;
 u32    crashing_cpu;
 u64    elfcorehdr_addr;
 u64    elfcorehdr_size;
 u64    vmcoreinfo_raddr;
 u64    vmcoreinfo_size;
 struct pt_regs    regs;
 struct cpumask    cpu_mask;
};
static inline bool fadump_compatible(struct fadump_crash_info_header
*fdh)
{
 return (fdh->magic_number == FADUMP_CRASH_INFO_MAGIC)
}

and fail fadump if we find it not compatible?


Agree that it is unsafe to collect a dump with an incompatible fadump crash 
info header.

Given that I am updating the fadump crash info header, we can make a few 
arrangements
like adding a size filed for the dynamic size attribute like pt_regs and 
cpumask to ensure
better compatibility in the future.

Additionally, let's introduce a version field to the fadump crash info header 
to avoid changing
the magic number in the future.



I am not sure whether we need to add all the complexity to enable supporting 
different fadump kernel
version. Is that even a possible use case with fadump? Can't we always assume 
that with fadump the
crash kernel and fadump kernel will be same version? if yes we can simply fail 
with a magic number
mismatch because that indicates an user config error?


If we decide not to support different kernel versions for production
kernel and capture kernel, We can make that implicit by adding kernel
version info of production kernel in the header and bailing out if
there is kernel version mismatch as magic could still match for two
different kernel versions.

I would personally prefer something like the below though:

struct fadump_crash_info_header {
u64 magic_number;
u32 version
u32 crashing_cpu;
u64 elfcorehdr_addr;
u64 elfcorehdr_size;
u64 vmcoreinfo_raddr;
u64 vmcoreinfo_size;
u8  kernel_version[];
u32 pt_regs_sz;
struct pt_regs  regs;
u32 cpu_mask_sz;
struct cpumask  cpu_mask;
};

if (magic_number != new_magic)
goto err;   /* Error out */

if (kernel_version != capture_kernel_version)
{
		if (cpu_mask_sz == sizeof(struct pt_regs) && cpu_mask_sz == 
sizeof(struct cpumask))

/*
 * Warn about the kernel version mismatch and how data 
can be different
 * across kernel versions and proceed anyway!
 */
else
goto err;   /* Error out */
}

This ensures we warn and proceed in cases where it is less likely to
have issues capturing kernel dump. This helps in dev environment where
we are trying to debug an early boot crash - in which case capture
kernel can't be the same kernel as it would likely hit the same problem
while booting..

Thanks
Hari


Re: [PATCH v5 1/3] powerpc: make fadump resilient with memory add/remove events

2023-11-16 Thread Aneesh Kumar K V
On 11/17/23 10:03 AM, Sourabh Jain wrote:
> Hi Aneesh,
> 
> Thanks for reviewing the patch.
> 
> On 15/11/23 10:14, Aneesh Kumar K.V wrote:
>> Sourabh Jain  writes:
>>
>> 
>>
>>> diff --git a/arch/powerpc/include/asm/fadump-internal.h 
>>> b/arch/powerpc/include/asm/fadump-internal.h
>>> index 27f9e11eda28..7be3d8894520 100644
>>> --- a/arch/powerpc/include/asm/fadump-internal.h
>>> +++ b/arch/powerpc/include/asm/fadump-internal.h
>>> @@ -42,7 +42,25 @@ static inline u64 fadump_str_to_u64(const char *str)
>>>     #define FADUMP_CPU_UNKNOWN    (~((u32)0))
>>>   -#define FADUMP_CRASH_INFO_MAGIC    fadump_str_to_u64("FADMPINF")
>>> +/*
>>> + * The introduction of new fields in the fadump crash info header has
>>> + * led to a change in the magic key, from `FADMPINF` to `FADMPSIG`.
>>> + * This alteration ensures backward compatibility, enabling the kernel
>>> + * with the updated fadump crash info to handle kernel dumps from older
>>> + * kernels.
>>> + *
>>> + * To prevent the need for further changes to the magic number in the
>>> + * event of future modifications to the fadump header, a version field
>>> + * has been introduced to track the fadump crash info header version.
>>> + *
>>> + * Historically, there was no connection between the magic number and
>>> + * the fadump crash info header version. However, moving forward, the
>>> + * `FADMPINF` magic number in header will be treated as version 0, while
>>> + * the `FADMPSIG` magic number in header will include a version field to
>>> + * determine its version.
>>> + */
>>> +#define FADUMP_CRASH_INFO_MAGIC    fadump_str_to_u64("FADMPSIG")
>>> +#define FADUMP_VERSION    1
>>>
>> Can we keep the old magic details as
>>
>> #define FADUMP_CRASH_INFO_MAGIC_OLD    fadump_str_to_u64("FADMPINF")
>> #define FADUMP_CRASH_INFO_MAGIC    fadump_str_to_u64("FADMPSIG")
> 
> Sure.
> 
>> Also considering the struct need not be backward compatible, can we just
>> do
>>
>> struct fadump_crash_info_header {
>> u64    magic_number;
>> u32    crashing_cpu;
>> u64    elfcorehdr_addr;
>> u64    elfcorehdr_size;
>> u64    vmcoreinfo_raddr;
>> u64    vmcoreinfo_size;
>> struct pt_regs    regs;
>> struct cpumask    cpu_mask;
>> };
>> static inline bool fadump_compatible(struct fadump_crash_info_header
>> *fdh)
>> {
>> return (fdh->magic_number == FADUMP_CRASH_INFO_MAGIC)
>> }
>>
>> and fail fadump if we find it not compatible?
> 
> Agree that it is unsafe to collect a dump with an incompatible fadump crash 
> info header.
> 
> Given that I am updating the fadump crash info header, we can make a few 
> arrangements
> like adding a size filed for the dynamic size attribute like pt_regs and 
> cpumask to ensure
> better compatibility in the future.
> 
> Additionally, let's introduce a version field to the fadump crash info header 
> to avoid changing
> the magic number in the future.
>

I am not sure whether we need to add all the complexity to enable supporting 
different fadump kernel
version. Is that even a possible use case with fadump? Can't we always assume 
that with fadump the
crash kernel and fadump kernel will be same version? if yes we can simply fail 
with a magic number
mismatch because that indicates an user config error? 

-aneesh



Re: [PATCH v5 1/3] powerpc: make fadump resilient with memory add/remove events

2023-11-16 Thread Sourabh Jain

Hi Aneesh,

Thanks for reviewing the patch.

On 15/11/23 10:14, Aneesh Kumar K.V wrote:

Sourabh Jain  writes:




diff --git a/arch/powerpc/include/asm/fadump-internal.h 
b/arch/powerpc/include/asm/fadump-internal.h
index 27f9e11eda28..7be3d8894520 100644
--- a/arch/powerpc/include/asm/fadump-internal.h
+++ b/arch/powerpc/include/asm/fadump-internal.h
@@ -42,7 +42,25 @@ static inline u64 fadump_str_to_u64(const char *str)
  
  #define FADUMP_CPU_UNKNOWN		(~((u32)0))
  
-#define FADUMP_CRASH_INFO_MAGIC		fadump_str_to_u64("FADMPINF")

+/*
+ * The introduction of new fields in the fadump crash info header has
+ * led to a change in the magic key, from `FADMPINF` to `FADMPSIG`.
+ * This alteration ensures backward compatibility, enabling the kernel
+ * with the updated fadump crash info to handle kernel dumps from older
+ * kernels.
+ *
+ * To prevent the need for further changes to the magic number in the
+ * event of future modifications to the fadump header, a version field
+ * has been introduced to track the fadump crash info header version.
+ *
+ * Historically, there was no connection between the magic number and
+ * the fadump crash info header version. However, moving forward, the
+ * `FADMPINF` magic number in header will be treated as version 0, while
+ * the `FADMPSIG` magic number in header will include a version field to
+ * determine its version.
+ */
+#define FADUMP_CRASH_INFO_MAGICfadump_str_to_u64("FADMPSIG")
+#define FADUMP_VERSION 1


Can we keep the old magic details as

#define FADUMP_CRASH_INFO_MAGIC_OLD fadump_str_to_u64("FADMPINF")
#define FADUMP_CRASH_INFO_MAGIC fadump_str_to_u64("FADMPSIG")


Sure.


Also considering the struct need not be backward compatible, can we just
do

struct fadump_crash_info_header {
u64 magic_number;
u32 crashing_cpu;
u64 elfcorehdr_addr;
u64 elfcorehdr_size;
u64 vmcoreinfo_raddr;
u64 vmcoreinfo_size;
struct pt_regs  regs;
struct cpumask  cpu_mask;
};
static inline bool fadump_compatible(struct fadump_crash_info_header
*fdh)
{
return (fdh->magic_number == FADUMP_CRASH_INFO_MAGIC)
}

and fail fadump if we find it not compatible?


Agree that it is unsafe to collect a dump with an incompatible fadump 
crash info header.


Given that I am updating the fadump crash info header, we can make a few 
arrangements
like adding a size filed for the dynamic size attribute like pt_regs and 
cpumask to ensure

better compatibility in the future.

Additionally, let's introduce a version field to the fadump crash info 
header to avoid changing

the magic number in the future.

Thanks,
Sourabh


Re: [PATCH v5 1/3] powerpc: make fadump resilient with memory add/remove events

2023-11-14 Thread Aneesh Kumar K.V
Sourabh Jain  writes:



> diff --git a/arch/powerpc/include/asm/fadump-internal.h 
> b/arch/powerpc/include/asm/fadump-internal.h
> index 27f9e11eda28..7be3d8894520 100644
> --- a/arch/powerpc/include/asm/fadump-internal.h
> +++ b/arch/powerpc/include/asm/fadump-internal.h
> @@ -42,7 +42,25 @@ static inline u64 fadump_str_to_u64(const char *str)
>  
>  #define FADUMP_CPU_UNKNOWN   (~((u32)0))
>  
> -#define FADUMP_CRASH_INFO_MAGIC  fadump_str_to_u64("FADMPINF")
> +/*
> + * The introduction of new fields in the fadump crash info header has
> + * led to a change in the magic key, from `FADMPINF` to `FADMPSIG`.
> + * This alteration ensures backward compatibility, enabling the kernel
> + * with the updated fadump crash info to handle kernel dumps from older
> + * kernels.
> + *
> + * To prevent the need for further changes to the magic number in the
> + * event of future modifications to the fadump header, a version field
> + * has been introduced to track the fadump crash info header version.
> + *
> + * Historically, there was no connection between the magic number and
> + * the fadump crash info header version. However, moving forward, the
> + * `FADMPINF` magic number in header will be treated as version 0, while
> + * the `FADMPSIG` magic number in header will include a version field to
> + * determine its version.
> + */
> +#define FADUMP_CRASH_INFO_MAGIC  fadump_str_to_u64("FADMPSIG")
> +#define FADUMP_VERSION   1
>

Can we keep the old magic details as

#define FADUMP_CRASH_INFO_MAGIC_OLD fadump_str_to_u64("FADMPINF")
#define FADUMP_CRASH_INFO_MAGIC fadump_str_to_u64("FADMPSIG")

Also considering the struct need not be backward compatible, can we just
do

struct fadump_crash_info_header {
u64 magic_number;
u32 crashing_cpu;
u64 elfcorehdr_addr;
u64 elfcorehdr_size;
u64 vmcoreinfo_raddr;
u64 vmcoreinfo_size;
struct pt_regs  regs;
struct cpumask  cpu_mask;
};

static inline bool fadump_compatible(struct fadump_crash_info_header
*fdh)
{
return (fdh->magic_number == FADUMP_CRASH_INFO_MAGIC)
}

and fail fadump if we find it not compatible?

-aneesh


Re: [PATCH v5 1/3] powerpc: make fadump resilient with memory add/remove events

2023-11-12 Thread Sourabh Jain

Hello Michael,


On 09/11/23 17:44, Michael Ellerman wrote:

Hi Sourabh,

This seems like a good change to the design, but I'm confused by
some things, more below ...


Thanks.


Sourabh Jain  writes:
...

Table 1 below illustrates kernel's ability to collect dump if either the
first/crashed kernel or the second/fadump kernel does not have the
changes introduced here. Consider the 'old kernel' as the kernel without
this patch, and the 'new kernel' as the kernel with this patch included.

+--+++---+
| scenario |  first/crashed kernel  |  second/fadump kernel  |  Dump |
+--+++---+
|1 |   old kernel   |new kernel  |  Yes  |
+--+++---+
|2 |   new kernel   |old kernel  |  No   |
+--+++---+

  Table 1

Scenario 1:
---
Since the magic number of fadump header is updated, the second kernel
can differentiate the crashed kernel is of type 'new kernel' or
'old kernel' and act accordingly. In this scenario, since the crashed
kernel is of type 'old kernel,' the fadump kernel skips elfcorehdr
creation and uses the one prepared in the first kernel itself to collect
the dump.

Scenario 2:
---
Since 'old kernel' as the fadump kernel is NOT capable of processing
fadump header with updated magic number from 'new kernel' hence it
gracefully fails with the below error and dump collection fails in this
scenario.

[0.007365] rtas fadump: Crash info header is not valid.

Add a version field to the fadump_crash_info_header structure to avoid
the need to change its magic number in the future. Adding a version
field to the fadump header was one of the TODO items listed in
Documentation/powerpc/firmware-assisted-dump.rst.

This is a good analysis of the issues with different kernel versions,
and seems like an OK trade off, ie. that old kernels can't process dumps
from new kernels.

But do we actually support using different kernel versions for the
crash/dump kernel?

Because AFAICS the fadump_crash_info_header is not safe across kernel
versions, in its current form or with your changes.


Yeah, I was also under the impression that it is not supported, but I 
was not aware

that the size of pt_regs and cpumask can change based on the configuration.




diff --git a/arch/powerpc/include/asm/fadump-internal.h 
b/arch/powerpc/include/asm/fadump-internal.h
index 27f9e11eda28..7be3d8894520 100644
--- a/arch/powerpc/include/asm/fadump-internal.h
+++ b/arch/powerpc/include/asm/fadump-internal.h
@@ -42,7 +42,25 @@ static inline u64 fadump_str_to_u64(const char *str)
  
  #define FADUMP_CPU_UNKNOWN		(~((u32)0))
  
-#define FADUMP_CRASH_INFO_MAGIC		fadump_str_to_u64("FADMPINF")

+/*
+ * The introduction of new fields in the fadump crash info header has
+ * led to a change in the magic key, from `FADMPINF` to `FADMPSIG`.
+ * This alteration ensures backward compatibility, enabling the kernel
+ * with the updated fadump crash info to handle kernel dumps from older
+ * kernels.
+ *
+ * To prevent the need for further changes to the magic number in the
+ * event of future modifications to the fadump header, a version field
+ * has been introduced to track the fadump crash info header version.
+ *
+ * Historically, there was no connection between the magic number and
+ * the fadump crash info header version. However, moving forward, the
+ * `FADMPINF` magic number in header will be treated as version 0, while
+ * the `FADMPSIG` magic number in header will include a version field to
+ * determine its version.
+ */
+#define FADUMP_CRASH_INFO_MAGICfadump_str_to_u64("FADMPSIG")
+#define FADUMP_VERSION 1
  
  /* fadump crash info structure */

  struct fadump_crash_info_header {
@@ -51,6 +69,10 @@ struct fadump_crash_info_header {


struct fadump_crash_info_header {
u64 magic_number;
u64 elfcorehdr_addr;


u32 crashing_cpu;
struct pt_regs  regs;
struct cpumask  cpu_mask;
+   u32 version;
+   u64 elfcorehdr_size;
+   u64 vmcoreinfo_raddr;
+   u64 vmcoreinfo_size;
  };

The reason I say it's not safe is because pt_regs and especially cpumask
can change size depending on the kernel configuration.

pt_regs probably doesn't change size in practice for common kernel
configurations, but some of the fields are under #ifdef.

cpumask on the other hand is directly controlled by CONFIG_NR_CPUS. So
if the first and second kernel have a different value for NR_CPUS they
will disagree on the size of the struct.

That has presumably worked OK so far because folks tend to use the same, or
similar kernels for the first/second kernel. And also the cpumask is the

Re: [PATCH v5 1/3] powerpc: make fadump resilient with memory add/remove events

2023-11-09 Thread Michael Ellerman
Hi Sourabh,

This seems like a good change to the design, but I'm confused by
some things, more below ...

Sourabh Jain  writes:
>
...
>
> Table 1 below illustrates kernel's ability to collect dump if either the
> first/crashed kernel or the second/fadump kernel does not have the
> changes introduced here. Consider the 'old kernel' as the kernel without
> this patch, and the 'new kernel' as the kernel with this patch included.
>
> +--+++---+
> | scenario |  first/crashed kernel  |  second/fadump kernel  |  Dump |
> +--+++---+
> |1 |   old kernel   |new kernel  |  Yes  |
> +--+++---+
> |2 |   new kernel   |old kernel  |  No   |
> +--+++---+
>
> Table 1
>
> Scenario 1:
> ---
> Since the magic number of fadump header is updated, the second kernel
> can differentiate the crashed kernel is of type 'new kernel' or
> 'old kernel' and act accordingly. In this scenario, since the crashed
> kernel is of type 'old kernel,' the fadump kernel skips elfcorehdr
> creation and uses the one prepared in the first kernel itself to collect
> the dump.
>
> Scenario 2:
> ---
> Since 'old kernel' as the fadump kernel is NOT capable of processing
> fadump header with updated magic number from 'new kernel' hence it
> gracefully fails with the below error and dump collection fails in this
> scenario.
>
> [0.007365] rtas fadump: Crash info header is not valid.
>
> Add a version field to the fadump_crash_info_header structure to avoid
> the need to change its magic number in the future. Adding a version
> field to the fadump header was one of the TODO items listed in
> Documentation/powerpc/firmware-assisted-dump.rst.

This is a good analysis of the issues with different kernel versions,
and seems like an OK trade off, ie. that old kernels can't process dumps
from new kernels.

But do we actually support using different kernel versions for the
crash/dump kernel?

Because AFAICS the fadump_crash_info_header is not safe across kernel
versions, in its current form or with your changes.

> diff --git a/arch/powerpc/include/asm/fadump-internal.h 
> b/arch/powerpc/include/asm/fadump-internal.h
> index 27f9e11eda28..7be3d8894520 100644
> --- a/arch/powerpc/include/asm/fadump-internal.h
> +++ b/arch/powerpc/include/asm/fadump-internal.h
> @@ -42,7 +42,25 @@ static inline u64 fadump_str_to_u64(const char *str)
>  
>  #define FADUMP_CPU_UNKNOWN   (~((u32)0))
>  
> -#define FADUMP_CRASH_INFO_MAGIC  fadump_str_to_u64("FADMPINF")
> +/*
> + * The introduction of new fields in the fadump crash info header has
> + * led to a change in the magic key, from `FADMPINF` to `FADMPSIG`.
> + * This alteration ensures backward compatibility, enabling the kernel
> + * with the updated fadump crash info to handle kernel dumps from older
> + * kernels.
> + *
> + * To prevent the need for further changes to the magic number in the
> + * event of future modifications to the fadump header, a version field
> + * has been introduced to track the fadump crash info header version.
> + *
> + * Historically, there was no connection between the magic number and
> + * the fadump crash info header version. However, moving forward, the
> + * `FADMPINF` magic number in header will be treated as version 0, while
> + * the `FADMPSIG` magic number in header will include a version field to
> + * determine its version.
> + */
> +#define FADUMP_CRASH_INFO_MAGIC  fadump_str_to_u64("FADMPSIG")
> +#define FADUMP_VERSION   1
>  
>  /* fadump crash info structure */
>  struct fadump_crash_info_header {
> @@ -51,6 +69,10 @@ struct fadump_crash_info_header {
> 
struct fadump_crash_info_header {
u64 magic_number;
u64 elfcorehdr_addr;

>   u32 crashing_cpu;
>   struct pt_regs  regs;
>   struct cpumask  cpu_mask;
> + u32 version;
> + u64 elfcorehdr_size;
> + u64 vmcoreinfo_raddr;
> + u64 vmcoreinfo_size;
>  };

The reason I say it's not safe is because pt_regs and especially cpumask
can change size depending on the kernel configuration.

pt_regs probably doesn't change size in practice for common kernel
configurations, but some of the fields are under #ifdef.

cpumask on the other hand is directly controlled by CONFIG_NR_CPUS. So
if the first and second kernel have a different value for NR_CPUS they
will disagree on the size of the struct.

That has presumably worked OK so far because folks tend to use the same, or
similar kernels for the first/second kernel. And also the cpumask is the
last element of the struct, so a disagreement about it size doesn't
affect the location of