Re: [PATCH v5 21/31] powernv/fadump: process architected register state data provided by firmware

2019-09-10 Thread Hari Bathini



On 10/09/19 7:35 PM, Michael Ellerman wrote:
> Hari Bathini  writes:
>> On 09/09/19 9:03 PM, Oliver O'Halloran wrote:
>>> On Mon, Sep 9, 2019 at 11:23 PM Hari Bathini  wrote:
 On 04/09/19 5:50 PM, Michael Ellerman wrote:
> Hari Bathini  writes:
 [...]

>> +/*
>> + * CPU state data is provided by f/w. Below are the definitions
>> + * provided in HDAT spec. Refer to latest HDAT specification for
>> + * any update to this format.
>> + */
>
> How is this meant to work? If HDAT ever changes the format they will
> break all existing kernels in the field.
>
>> +#define HDAT_FADUMP_CPU_DATA_VERSION1

 Changes are not expected here. But this is just to cover for such scenario,
 if that ever happens.
>>>
>>> The HDAT spec doesn't define the SPR numbers for NIA, MSR and the CR.
>>> As far as I can tell the values you've assumed here are chip-specific,
>>> non-architected SPR numbers that come from an array buried somewhere
>>> in the SBE codebase. I don't believe you for a second when you say
>>> that this will never change.
>>
>> At least, the understanding is that this numbers not change across processor
>> generations. If something changes, it is supposed to be handled in SBE. Also,
>> I am told this numbers would be listed in the HDAT Spec. Not sure if that
>> happened yet though. Vasant, you have anything to add?
> 
> That doesn't help much because the HDAT spec is not public.
> 
> The point is with the code written the way it is, these values *must
> not* change, or else all existing kernels will be broken, which is not
> acceptable.

Yeah. It is absurd to error out just by looking at version number...

> 
 Also, I think it is a bit far-fetched to error out if versions mismatch.
 Warning and proceeding sounds worthier because the changes are usually
 backward compatible, if and when there are any. Will update accordingly...
>>>
>>> Literally the only reason I didn't drop the CPU DATA parts of the OPAL
>>> MPIPL series was because I assumed the kernel would do the sensible
>>> thing and reject or ignore the structure if it did not know how to
>>> parse the data.
>>
>> I think, the changes if any, would have to be backward compatible for the 
>> sake
>> of sanity.
> 
> People need to understand that this is an ABI between firmware and
> in-the-field distribution kernels which are only updated at customer
> discretion, or possibly never.
> 
> Any changes *must be* backward compatible.
> 
> Looking at the header struct:
> 
> +struct hdat_fadump_thread_hdr {
> + __be32  pir;
> + /* 0x00 - 0x0F - The corresponding stop state of the core */
> + u8  core_state;
> + u8  reserved[3];
> 
> You have those 3 reserved bytes, so a future revision could repurpose
> one of those as a flag to indicate a new format. And/or the hdr could be
> made bigger and new kernels could be taught to look for new things in
> the space after the hdr but before the reg entries.
> 
> So I think there is a reasonable mechanism for extending the format in
> future, but my point is people must understand that this is an ABI and
> changes must be made accordingly.

True. The folks who make the changes to this format should be aware that
breaking kernel ABI is not going to be pretty and I think they are :)

> 
>> Even if they are not, we are better off exporting the /proc/vmcore
>> with a warning and some crazy CPU register data (if parsing goes alright) 
>> than
>> no dump at all? 
> 
> If it's just a case of reg entries that we don't recognise then yes I
> think it would be OK to just skip them and continue exporting. But if
> there's any deeper misunderstanding of the format then we should bail
> out.

Sure. Will try and fix that by first trying to do a sanity check on the
fields that are needed for parsing the data and proceed with a warning if
nothing weird is detected and fallback to just appending crashing cpu as
done in patch 16/31, if anything weird is observed. That should hopefully
take care of all cases in the best possible way..

> 
> I notice now that you don't do anything in opal_fadump_set_regval_regnum()
> if you are passed a register we don't understand, so that probably needs
> fixing.

f/w provides about 100 odd registers in the CPU state data. Most of them
pt_regs doesn't care about. So, opal_fadump_set_regval_regnum is happy as
long as it find the registers listed in it. Unless, pt_regs changes, we
can stick to this and ignore the rest of them?

- Hari



Re: [PATCH v5 21/31] powernv/fadump: process architected register state data provided by firmware

2019-09-10 Thread Michael Ellerman
Hari Bathini  writes:
> On 09/09/19 9:03 PM, Oliver O'Halloran wrote:
>> On Mon, Sep 9, 2019 at 11:23 PM Hari Bathini  wrote:
>>> On 04/09/19 5:50 PM, Michael Ellerman wrote:
 Hari Bathini  writes:
>>> [...]
>>>
> +/*
> + * CPU state data is provided by f/w. Below are the definitions
> + * provided in HDAT spec. Refer to latest HDAT specification for
> + * any update to this format.
> + */

 How is this meant to work? If HDAT ever changes the format they will
 break all existing kernels in the field.

> +#define HDAT_FADUMP_CPU_DATA_VERSION1
>>>
>>> Changes are not expected here. But this is just to cover for such scenario,
>>> if that ever happens.
>> 
>> The HDAT spec doesn't define the SPR numbers for NIA, MSR and the CR.
>> As far as I can tell the values you've assumed here are chip-specific,
>> non-architected SPR numbers that come from an array buried somewhere
>> in the SBE codebase. I don't believe you for a second when you say
>> that this will never change.
>
> At least, the understanding is that this numbers not change across processor
> generations. If something changes, it is supposed to be handled in SBE. Also,
> I am told this numbers would be listed in the HDAT Spec. Not sure if that
> happened yet though. Vasant, you have anything to add?

That doesn't help much because the HDAT spec is not public.

The point is with the code written the way it is, these values *must
not* change, or else all existing kernels will be broken, which is not
acceptable.

>>> Also, I think it is a bit far-fetched to error out if versions mismatch.
>>> Warning and proceeding sounds worthier because the changes are usually
>>> backward compatible, if and when there are any. Will update accordingly...
>> 
>> Literally the only reason I didn't drop the CPU DATA parts of the OPAL
>> MPIPL series was because I assumed the kernel would do the sensible
>> thing and reject or ignore the structure if it did not know how to
>> parse the data.
>
> I think, the changes if any, would have to be backward compatible for the sake
> of sanity.

People need to understand that this is an ABI between firmware and
in-the-field distribution kernels which are only updated at customer
discretion, or possibly never.

Any changes *must be* backward compatible.

Looking at the header struct:

+struct hdat_fadump_thread_hdr {
+   __be32  pir;
+   /* 0x00 - 0x0F - The corresponding stop state of the core */
+   u8  core_state;
+   u8  reserved[3];

You have those 3 reserved bytes, so a future revision could repurpose
one of those as a flag to indicate a new format. And/or the hdr could be
made bigger and new kernels could be taught to look for new things in
the space after the hdr but before the reg entries.

So I think there is a reasonable mechanism for extending the format in
future, but my point is people must understand that this is an ABI and
changes must be made accordingly.

> Even if they are not, we are better off exporting the /proc/vmcore
> with a warning and some crazy CPU register data (if parsing goes alright) than
> no dump at all? 

If it's just a case of reg entries that we don't recognise then yes I
think it would be OK to just skip them and continue exporting. But if
there's any deeper misunderstanding of the format then we should bail
out.

I notice now that you don't do anything in opal_fadump_set_regval_regnum()
if you are passed a register we don't understand, so that probably needs
fixing.

cheers


Re: [PATCH v5 21/31] powernv/fadump: process architected register state data provided by firmware

2019-09-10 Thread Hari Bathini



On 09/09/19 9:03 PM, Oliver O'Halloran wrote:
> On Mon, Sep 9, 2019 at 11:23 PM Hari Bathini  wrote:
>>
>> On 04/09/19 5:50 PM, Michael Ellerman wrote:
>>> Hari Bathini  writes:
>>>
>>
>> [...]
>>
 +/*
 + * CPU state data is provided by f/w. Below are the definitions
 + * provided in HDAT spec. Refer to latest HDAT specification for
 + * any update to this format.
 + */
>>>
>>> How is this meant to work? If HDAT ever changes the format they will
>>> break all existing kernels in the field.
>>>
 +#define HDAT_FADUMP_CPU_DATA_VERSION1
>>
>> Changes are not expected here. But this is just to cover for such scenario,
>> if that ever happens.
> 
> The HDAT spec doesn't define the SPR numbers for NIA, MSR and the CR.
> As far as I can tell the values you've assumed here are chip-specific,
> non-architected SPR numbers that come from an array buried somewhere
> in the SBE codebase. I don't believe you for a second when you say
> that this will never change.

At least, the understanding is that this numbers not change across processor
generations. If something changes, it is supposed to be handled in SBE. Also,
I am told this numbers would be listed in the HDAT Spec. Not sure if that
happened yet though. Vasant, you have anything to add?

>> Also, I think it is a bit far-fetched to error out if versions mismatch.
>> Warning and proceeding sounds worthier because the changes are usually
>> backward compatible, if and when there are any. Will update accordingly...
> 
> Literally the only reason I didn't drop the CPU DATA parts of the OPAL
> MPIPL series was because I assumed the kernel would do the sensible
> thing and reject or ignore the structure if it did not know how to
> parse the data.

I think, the changes if any, would have to be backward compatible for the sake
of sanity. Even if they are not, we are better off exporting the /proc/vmcore
with a warning and some crazy CPU register data (if parsing goes alright) than
no dump at all? 

- Hari



Re: [PATCH v5 21/31] powernv/fadump: process architected register state data provided by firmware

2019-09-09 Thread Oliver O'Halloran
On Mon, Sep 9, 2019 at 11:23 PM Hari Bathini  wrote:
>
> On 04/09/19 5:50 PM, Michael Ellerman wrote:
> > Hari Bathini  writes:
> >
>
> [...]
>
> >> +/*
> >> + * CPU state data is provided by f/w. Below are the definitions
> >> + * provided in HDAT spec. Refer to latest HDAT specification for
> >> + * any update to this format.
> >> + */
> >
> > How is this meant to work? If HDAT ever changes the format they will
> > break all existing kernels in the field.
> >
> >> +#define HDAT_FADUMP_CPU_DATA_VERSION1
>
> Changes are not expected here. But this is just to cover for such scenario,
> if that ever happens.

The HDAT spec doesn't define the SPR numbers for NIA, MSR and the CR.
As far as I can tell the values you've assumed here are chip-specific,
non-architected SPR numbers that come from an array buried somewhere
in the SBE codebase. I don't believe you for a second when you say
that this will never change.

> Also, I think it is a bit far-fetched to error out if versions mismatch.
> Warning and proceeding sounds worthier because the changes are usually
> backward compatible, if and when there are any. Will update accordingly...

Literally the only reason I didn't drop the CPU DATA parts of the OPAL
MPIPL series was because I assumed the kernel would do the sensible
thing and reject or ignore the structure if it did not know how to
parse the data.

Oliver


Re: [PATCH v5 21/31] powernv/fadump: process architected register state data provided by firmware

2019-09-09 Thread Hari Bathini



On 04/09/19 5:50 PM, Michael Ellerman wrote:
> Hari Bathini  writes:
>

[...]

>> +/*
>> + * CPU state data is provided by f/w. Below are the definitions
>> + * provided in HDAT spec. Refer to latest HDAT specification for
>> + * any update to this format.
>> + */
> 
> How is this meant to work? If HDAT ever changes the format they will
> break all existing kernels in the field.
> 
>> +#define HDAT_FADUMP_CPU_DATA_VERSION1

Changes are not expected here. But this is just to cover for such scenario,
if that ever happens.

Also, I think it is a bit far-fetched to error out if versions mismatch.
Warning and proceeding sounds worthier because the changes are usually
backward compatible, if and when there are any. Will update accordingly... 

- Hari



Re: [PATCH v5 21/31] powernv/fadump: process architected register state data provided by firmware

2019-09-04 Thread Michael Ellerman
Hari Bathini  writes:

> diff --git a/arch/powerpc/kernel/fadump-common.h 
> b/arch/powerpc/kernel/fadump-common.h
> index 7107cf2..fc408b0 100644
> --- a/arch/powerpc/kernel/fadump-common.h
> +++ b/arch/powerpc/kernel/fadump-common.h
> @@ -98,7 +98,11 @@ struct fw_dump {
>   /* cmd line option during boot */
>   unsigned long   reserve_bootvar;
>  
> + unsigned long   cpu_state_destination_addr;

AFAICS that is only used in two places, and both of them have to call
__va() on it, so why don't we store the virtual address to start with?

> diff --git a/arch/powerpc/platforms/powernv/opal-fadump.c 
> b/arch/powerpc/platforms/powernv/opal-fadump.c
> index f75b861..9a32a7f 100644
> --- a/arch/powerpc/platforms/powernv/opal-fadump.c
> +++ b/arch/powerpc/platforms/powernv/opal-fadump.c
> @@ -282,15 +283,122 @@ static void opal_fadump_cleanup(struct fw_dump 
> *fadump_conf)
>   pr_warn("Could not reset (%llu) kernel metadata tag!\n", ret);
>  }
>  
> +static inline void opal_fadump_set_regval_regnum(struct pt_regs *regs,
> +  u32 reg_type, u32 reg_num,
> +  u64 reg_val)
> +{
> + if (reg_type == HDAT_FADUMP_REG_TYPE_GPR) {
> + if (reg_num < 32)
> + regs->gpr[reg_num] = reg_val;
> + return;
> + }
> +
> + switch (reg_num) {
> + case SPRN_CTR:
> + regs->ctr = reg_val;
> + break;
> + case SPRN_LR:
> + regs->link = reg_val;
> + break;
> + case SPRN_XER:
> + regs->xer = reg_val;
> + break;
> + case SPRN_DAR:
> + regs->dar = reg_val;
> + break;
> + case SPRN_DSISR:
> + regs->dsisr = reg_val;
> + break;
> + case HDAT_FADUMP_REG_ID_NIP:
> + regs->nip = reg_val;
> + break;
> + case HDAT_FADUMP_REG_ID_MSR:
> + regs->msr = reg_val;
> + break;
> + case HDAT_FADUMP_REG_ID_CCR:
> + regs->ccr = reg_val;
> + break;
> + }
> +}
> +
> +static inline void opal_fadump_read_regs(char *bufp, unsigned int regs_cnt,
> +  unsigned int reg_entry_size,
> +  struct pt_regs *regs)
> +{
> + int i;
> + struct hdat_fadump_reg_entry *reg_entry;

Where's my christmas tree :)

> +
> + memset(regs, 0, sizeof(struct pt_regs));
> +
> + for (i = 0; i < regs_cnt; i++, bufp += reg_entry_size) {
> + reg_entry = (struct hdat_fadump_reg_entry *)bufp;
> + opal_fadump_set_regval_regnum(regs,
> +   be32_to_cpu(reg_entry->reg_type),
> +   be32_to_cpu(reg_entry->reg_num),
> +   be64_to_cpu(reg_entry->reg_val));
> + }
> +}
> +
> +static inline bool __init is_thread_core_inactive(u8 core_state)
> +{
> + bool is_inactive = false;
> +
> + if (core_state == HDAT_FADUMP_CORE_INACTIVE)
> + is_inactive = true;
> +
> + return is_inactive;

return core_state == HDAT_FADUMP_CORE_INACTIVE;

??

In fact there's only one caller, so just drop the inline entirely.

> +}
> +
>  /*
>   * Convert CPU state data saved at the time of crash into ELF notes.
> + *
> + * Each register entry is of 16 bytes, A numerical identifier along with
> + * a GPR/SPR flag in the first 8 bytes and the register value in the next
> + * 8 bytes. For more details refer to F/W documentation.
>   */
>  static int __init opal_fadump_build_cpu_notes(struct fw_dump *fadump_conf)
>  {
>   u32 num_cpus, *note_buf;
>   struct fadump_crash_info_header *fdh = NULL;
> + struct hdat_fadump_thread_hdr *thdr;
> + unsigned long addr;
> + u32 thread_pir;
> + char *bufp;
> + struct pt_regs regs;
> + unsigned int size_of_each_thread;
> + unsigned int regs_offset, regs_cnt, reg_esize;
> + int i;

unsigned int size_of_each_thread, regs_offset, regs_cnt, reg_esize;
struct fadump_crash_info_header *fdh = NULL;
u32 num_cpus, thread_pir, *note_buf;
struct hdat_fadump_thread_hdr *thdr;
struct pt_regs regs;
unsigned long addr;
char *bufp;
int i;

Ah much better :)

Though the number of variables might be an indication that this function
could be split into smaller parts.

> @@ -473,6 +627,26 @@ int __init opal_fadump_dt_scan(struct fw_dump 
> *fadump_conf, ulong node)
>   return 1;
>   }
>  
> + ret = opal_mpipl_query_tag(OPAL_MPIPL_TAG_CPU, );
> + if ((ret != OPAL_SUCCESS) || !addr) {
> + pr_err("Failed to get CPU metadata (%lld)\n", ret);
> + return 1;
> + }
> +
> + addr = be64_to_cpu(addr);
> + pr_debug("CPU metadata addr: %llx\n", addr);
> +
> 

[PATCH v5 21/31] powernv/fadump: process architected register state data provided by firmware

2019-08-20 Thread Hari Bathini
From: Hari Bathini 

Firmware provides architected register state data at the time of crash.
Process this data and build CPU notes to append to ELF core.

Signed-off-by: Hari Bathini 
Signed-off-by: Vasant Hegde 
---
 arch/powerpc/kernel/fadump-common.h  |4 +
 arch/powerpc/platforms/powernv/opal-fadump.c |  198 --
 arch/powerpc/platforms/powernv/opal-fadump.h |   39 +
 3 files changed, 229 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/kernel/fadump-common.h 
b/arch/powerpc/kernel/fadump-common.h
index 7107cf2..fc408b0 100644
--- a/arch/powerpc/kernel/fadump-common.h
+++ b/arch/powerpc/kernel/fadump-common.h
@@ -98,7 +98,11 @@ struct fw_dump {
/* cmd line option during boot */
unsigned long   reserve_bootvar;
 
+   unsigned long   cpu_state_destination_addr;
+   unsigned long   cpu_state_data_version;
+   unsigned long   cpu_state_entry_size;
unsigned long   cpu_state_data_size;
+
unsigned long   hpte_region_size;
 
unsigned long   boot_mem_dest_addr;
diff --git a/arch/powerpc/platforms/powernv/opal-fadump.c 
b/arch/powerpc/platforms/powernv/opal-fadump.c
index f75b861..9a32a7f 100644
--- a/arch/powerpc/platforms/powernv/opal-fadump.c
+++ b/arch/powerpc/platforms/powernv/opal-fadump.c
@@ -23,6 +23,7 @@
 #include "opal-fadump.h"
 
 static const struct opal_fadump_mem_struct *opal_fdm_active;
+static const struct opal_mpipl_fadump *opal_cpu_metadata;
 static struct opal_fadump_mem_struct *opal_fdm;
 
 static int opal_fadump_unregister(struct fw_dump *fadump_conf);
@@ -282,15 +283,122 @@ static void opal_fadump_cleanup(struct fw_dump 
*fadump_conf)
pr_warn("Could not reset (%llu) kernel metadata tag!\n", ret);
 }
 
+static inline void opal_fadump_set_regval_regnum(struct pt_regs *regs,
+u32 reg_type, u32 reg_num,
+u64 reg_val)
+{
+   if (reg_type == HDAT_FADUMP_REG_TYPE_GPR) {
+   if (reg_num < 32)
+   regs->gpr[reg_num] = reg_val;
+   return;
+   }
+
+   switch (reg_num) {
+   case SPRN_CTR:
+   regs->ctr = reg_val;
+   break;
+   case SPRN_LR:
+   regs->link = reg_val;
+   break;
+   case SPRN_XER:
+   regs->xer = reg_val;
+   break;
+   case SPRN_DAR:
+   regs->dar = reg_val;
+   break;
+   case SPRN_DSISR:
+   regs->dsisr = reg_val;
+   break;
+   case HDAT_FADUMP_REG_ID_NIP:
+   regs->nip = reg_val;
+   break;
+   case HDAT_FADUMP_REG_ID_MSR:
+   regs->msr = reg_val;
+   break;
+   case HDAT_FADUMP_REG_ID_CCR:
+   regs->ccr = reg_val;
+   break;
+   }
+}
+
+static inline void opal_fadump_read_regs(char *bufp, unsigned int regs_cnt,
+unsigned int reg_entry_size,
+struct pt_regs *regs)
+{
+   int i;
+   struct hdat_fadump_reg_entry *reg_entry;
+
+   memset(regs, 0, sizeof(struct pt_regs));
+
+   for (i = 0; i < regs_cnt; i++, bufp += reg_entry_size) {
+   reg_entry = (struct hdat_fadump_reg_entry *)bufp;
+   opal_fadump_set_regval_regnum(regs,
+ be32_to_cpu(reg_entry->reg_type),
+ be32_to_cpu(reg_entry->reg_num),
+ be64_to_cpu(reg_entry->reg_val));
+   }
+}
+
+static inline bool __init is_thread_core_inactive(u8 core_state)
+{
+   bool is_inactive = false;
+
+   if (core_state == HDAT_FADUMP_CORE_INACTIVE)
+   is_inactive = true;
+
+   return is_inactive;
+}
+
 /*
  * Convert CPU state data saved at the time of crash into ELF notes.
+ *
+ * Each register entry is of 16 bytes, A numerical identifier along with
+ * a GPR/SPR flag in the first 8 bytes and the register value in the next
+ * 8 bytes. For more details refer to F/W documentation.
  */
 static int __init opal_fadump_build_cpu_notes(struct fw_dump *fadump_conf)
 {
u32 num_cpus, *note_buf;
struct fadump_crash_info_header *fdh = NULL;
+   struct hdat_fadump_thread_hdr *thdr;
+   unsigned long addr;
+   u32 thread_pir;
+   char *bufp;
+   struct pt_regs regs;
+   unsigned int size_of_each_thread;
+   unsigned int regs_offset, regs_cnt, reg_esize;
+   int i;
+
+   fadump_conf->cpu_state_entry_size =
+   be32_to_cpu(opal_cpu_metadata->cpu_data_size);
+   fadump_conf->cpu_state_destination_addr =
+   be64_to_cpu(opal_cpu_metadata->region[0].dest);
+   fadump_conf->cpu_state_data_size =
+   be64_to_cpu(opal_cpu_metadata->region[0].size);
+
+   if