Re: [PATCH] powerpc/mce: Remove per cpu variables from MCE handlers

2020-12-08 Thread Mahesh Jagannath Salgaonkar
On 12/8/20 4:16 PM, Ganesh wrote:
> 
> On 12/8/20 4:01 PM, Michael Ellerman wrote:
>> Ganesh Goudar  writes:
>>> diff --git a/arch/powerpc/include/asm/paca.h
>>> b/arch/powerpc/include/asm/paca.h
>>> index 9454d29ff4b4..4769954efa7d 100644
>>> --- a/arch/powerpc/include/asm/paca.h
>>> +++ b/arch/powerpc/include/asm/paca.h
>>> @@ -273,6 +274,17 @@ struct paca_struct {
>>>   #ifdef CONFIG_MMIOWB
>>>   struct mmiowb_state mmiowb_state;
>>>   #endif
>>> +#ifdef CONFIG_PPC_BOOK3S_64
>>> +    int mce_nest_count;
>>> +    struct machine_check_event mce_event[MAX_MC_EVT];
>>> +    /* Queue for delayed MCE events. */
>>> +    int mce_queue_count;
>>> +    struct machine_check_event mce_event_queue[MAX_MC_EVT];
>>> +
>>> +    /* Queue for delayed MCE UE events. */
>>> +    int mce_ue_count;
>>> +    struct machine_check_event  mce_ue_event_queue[MAX_MC_EVT];
>>> +#endif /* CONFIG_PPC_BOOK3S_64 */
>>>   } cacheline_aligned;
>> How much does this expand the paca by?
> 
Size of the paca is 4480 bytes; these add another 2160 bytes, expanding
it by ~48%.
> 

Should we dynamically allocate these arrays early, similar to
paca->mce_faulty_slbs, so that we don't bump up the paca size?

Thanks,
-Mahesh.


Re: [PATCH v2] powernv/elog: Fix the race while processing OPAL error log event.

2020-10-05 Thread Mahesh Jagannath Salgaonkar
On 10/5/20 4:17 PM, Ananth N Mavinakayanahalli wrote:
> On 10/5/20 9:42 AM, Mahesh Salgaonkar wrote:
>> Every error log reported by OPAL is exported to userspace through a sysfs
>> interface and notified using kobject_uevent(). The userspace daemon
>> (opal_errd) then reads the error log and acknowledges it once the error
>> log is safely saved to disk. Once acknowledged, the kernel removes the
>> respective sysfs file entry, causing the respective resources, including
>> the kobject, to be released.
>>
>> However, the user daemon may already be scanning elog entries while a new
>> sysfs elog entry is being created by the kernel. The daemon may read this
>> new entry and ack it even before the kernel can notify userspace about it
>> through the kobject_uevent() call. If that happens then we have a
>> potential race between elog_ack_store->kobject_put() and kobject_uevent()
>> which can lead to a use-after-free of a kernfs object, resulting in a
>> kernel crash. This patch fixes the race by protecting the sysfs file
>> creation/notification with an additional reference count on the kobject
>> until kobject_uevent() is safely sent.
>>
>> Reported-by: Oliver O'Halloran 
>> Signed-off-by: Mahesh Salgaonkar 
>> Signed-off-by: Aneesh Kumar K.V 
> 
> cc stable?
> 

Will add it in v3.

Thanks,
-Mahesh.


Re: [PATCH v2] powernv/elog: Fix the race while processing OPAL error log event.

2020-10-05 Thread Mahesh Jagannath Salgaonkar
On 10/6/20 5:55 AM, Oliver O'Halloran wrote:
> On Mon, Oct 5, 2020 at 3:12 PM Mahesh Salgaonkar  wrote:
>>
>> Every error log reported by OPAL is exported to userspace through a sysfs
>> interface and notified using kobject_uevent(). The userspace daemon
>> (opal_errd) then reads the error log and acknowledges it once the error
>> log is safely saved to disk. Once acknowledged, the kernel removes the
>> respective sysfs file entry, causing the respective resources, including
>> the kobject, to be released.
>>
>> However, the user daemon may already be scanning elog entries while a new
>> sysfs elog entry is being created by the kernel. The daemon may read this
>> new entry and ack it even before the kernel can notify userspace about it
>> through the kobject_uevent() call. If that happens then we have a
>> potential race between elog_ack_store->kobject_put() and kobject_uevent()
>> which can lead to a use-after-free of a kernfs object, resulting in a
>> kernel crash. This patch fixes the race by protecting the sysfs file
>> creation/notification with an additional reference count on the kobject
>> until kobject_uevent() is safely sent.
>>
>> Reported-by: Oliver O'Halloran 
>> Signed-off-by: Mahesh Salgaonkar 
>> Signed-off-by: Aneesh Kumar K.V 
>> ---
>> Change in v2:
>> - Instead of a mutex, use an extra reference count on the kobject to
>>   avoid the race.
>> ---
>>  arch/powerpc/platforms/powernv/opal-elog.c |   15 +++
>>  1 file changed, 15 insertions(+)
>>
>> diff --git a/arch/powerpc/platforms/powernv/opal-elog.c 
>> b/arch/powerpc/platforms/powernv/opal-elog.c
>> index 62ef7ad995da..230f102e87c0 100644
>> --- a/arch/powerpc/platforms/powernv/opal-elog.c
>> +++ b/arch/powerpc/platforms/powernv/opal-elog.c
>> @@ -222,13 +222,28 @@ static struct elog_obj *create_elog_obj(uint64_t id, 
>> size_t size, uint64_t type)
>> return NULL;
>> }
>>
>> +   /*
>> +    * As soon as sysfs file for this elog is created/activated there is
>> +    * chance opal_errd daemon might read and acknowledge this elog before
>> +    * kobject_uevent() is called. If that happens then we have a potential
>> +    * race between elog_ack_store->kobject_put() and kobject_uevent which
>> +    * leads to use-after-free issue of a kernfs object resulting into
>> +    * kernel crash. To avoid this race take an additional reference count
>> +    * on kobject until we safely send kobject_uevent().
>> +    */
>> +
>> +   kobject_get(&elog->kobj);  /* extra reference count */
>> rc = sysfs_create_bin_file(&elog->kobj, &elog->raw_attr);
>> if (rc) {
>> +   kobject_put(&elog->kobj);
>> +   /* Drop the extra reference count */
>> kobject_put(&elog->kobj);
>> return NULL;
>> }
>>
>> kobject_uevent(&elog->kobj, KOBJ_ADD);
>> +   /* Drop the extra reference count */
>> +   kobject_put(&elog->kobj);
> 
> Makes sense,
> 
> Reviewed-by: Oliver O'Halloran 
> 
>>
>> return elog;
> 
> Does the returned value actually get used anywhere? We'd have a
> similar use-after-free problem if it does. This should probably return
> an error code rather than the object itself.
> 

Nope. It isn't being used. I can make the function void and send v3.

Thanks,
-Mahesh.


Re: Injecting SLB multihit crashes kernel 5.9.0-rc5

2020-09-16 Thread Mahesh Jagannath Salgaonkar
On 9/15/20 2:13 PM, Michal Suchánek wrote:
> Hello,
> 
> Using the SLB multihit injection test module (which I did not write so I
> do not want to post it here) to verify updates on my 5.3 frankenkernel,
> I found that the kernel crashes with an Oops: kernel bad access.
> 
> I tested on the latest upstream kernel build that I have at hand and the
> result is the same (minus the message - nothing was logged and the kernel
> simply rebooted).


Yes, SLB multihit recovery is broken upstream. Fix is on the way.


> 
> Since the whole effort to write a real mode MCE handler was supposed to
> prevent this maybe the SLB injection module should be added to the
> kernel selftests?

Yes. We are working on adding an SLB injection selftest; patches will be
posted soon.

Thanks,
-Mahesh.

> 
> Thanks
> 
> Michal
> 



Re: [PATCH v5 02/31] powerpc/fadump: move internal code to a new file

2019-09-04 Thread Mahesh Jagannath Salgaonkar
On 9/3/19 9:35 PM, Hari Bathini wrote:
> 
> 
> On 03/09/19 4:39 PM, Michael Ellerman wrote:
>> Hari Bathini  writes:
>>> Make way for refactoring platform specific FADump code by moving code
>>> that could be referenced from multiple places to fadump-common.c file.
>>>
>>> Signed-off-by: Hari Bathini 
>>> ---
>>>  arch/powerpc/kernel/Makefile|2 
>>>  arch/powerpc/kernel/fadump-common.c |  140 
>>> ++
>>>  arch/powerpc/kernel/fadump-common.h |8 ++
>>>  arch/powerpc/kernel/fadump.c|  146 
>>> ++-
>>>  4 files changed, 158 insertions(+), 138 deletions(-)
>>>  create mode 100644 arch/powerpc/kernel/fadump-common.c
>>
>> I don't understand why we need fadump.c and fadump-common.c? They're
>> both common/shared across pseries & powernv aren't they?
> 
> The convention I tried to follow was to have fadump-common.c shared between
> fadump.c and the pseries & powernv code, with the pseries & powernv code
> taking callback requests from fadump.c and using fadump-common.c (shared by
> both platforms) if necessary to fulfil those requests...
> 
>> By the end of the series we end up with 149 lines in fadump-common.c
>> which seems like a waste of time. Just put it all in fadump.c.
> 
> Yeah. Probably not worth a new C file. Will just have two separate headers. 
> One for
> internal code and one for interfacing with other modules...
> 
> [...]
> 
>>> + * Copyright 2019, IBM Corp.
>>> + * Author: Hari Bathini 
>>
>> These can just be:
>>
>>  * Copyright 2011, Mahesh Salgaonkar, IBM Corporation.
>>  * Copyright 2019, Hari Bathini, IBM Corporation.
>>
> 
> Sure.
> 
>>> + */
>>> +
>>> +#undef DEBUG
>>
>> Don't undef DEBUG please.
>>
> 
> Sorry! Seeing such a thing in most files, I thought this was the convention.
> Will drop this change in all the new files I added.
> 
>>> +#define pr_fmt(fmt) "fadump: " fmt
>>> +
>>> +#include 
>>> +#include 
>>> +#include 
>>> +#include 
>>> +
>>> +#include "fadump-common.h"
>>> +
>>> +void *fadump_cpu_notes_buf_alloc(unsigned long size)
>>> +{
>>> +   void *vaddr;
>>> +   struct page *page;
>>> +   unsigned long order, count, i;
>>> +
>>> +   order = get_order(size);
>>> +   vaddr = (void *)__get_free_pages(GFP_KERNEL|__GFP_ZERO, order);
>>> +   if (!vaddr)
>>> +   return NULL;
>>> +
>>> +   count = 1 << order;
>>> +   page = virt_to_page(vaddr);
>>> +   for (i = 0; i < count; i++)
>>> +   SetPageReserved(page + i);
>>> +   return vaddr;
>>> +}
>>
>> I realise you're just moving this code, but why do we need all this hand
>> rolled allocation stuff?
> 
> Yeah, I think alloc_pages_exact() may be better here. Mahesh, am I missing 
> something?

We hook up the physical address of this buffer to the ELF core header as
a PT_NOTE section. Hence we don't want these pages to be moved around or
reclaimed.

Thanks,
-Mahesh.



Re: [PATCH v4 11/25] powernv/fadump: register kernel metadata address with opal

2019-08-14 Thread Mahesh Jagannath Salgaonkar
On 8/14/19 12:36 PM, Hari Bathini wrote:
> 
> 
> On 13/08/19 4:11 PM, Mahesh J Salgaonkar wrote:
>> On 2019-07-16 17:03:15 Tue, Hari Bathini wrote:
>>> OPAL allows registering address with it in the first kernel and
>>> retrieving it after MPIPL. Setup kernel metadata and register its
>>> address with OPAL to use it for processing the crash dump.
>>>
>>> Signed-off-by: Hari Bathini 
>>> ---
>>>  arch/powerpc/kernel/fadump-common.h  |4 +
>>>  arch/powerpc/kernel/fadump.c |   65 ++-
>>>  arch/powerpc/platforms/powernv/opal-fadump.c |   73 
>>> ++
>>>  arch/powerpc/platforms/powernv/opal-fadump.h |   37 +
>>>  arch/powerpc/platforms/pseries/rtas-fadump.c |   32 +--
>>>  5 files changed, 177 insertions(+), 34 deletions(-)
>>>  create mode 100644 arch/powerpc/platforms/powernv/opal-fadump.h
>>>
>> [...]
>>> @@ -346,30 +349,42 @@ int __init fadump_reserve_mem(void)
>>>  * use memblock_find_in_range() here since it doesn't allocate
>>>  * from bottom to top.
>>>  */
>>> -   for (base = fw_dump.boot_memory_size;
>>> -base <= (memory_boundary - size);
>>> -base += size) {
>>> +   while (base <= (memory_boundary - size)) {
>>> if (memblock_is_region_memory(base, size) &&
>>> !memblock_is_region_reserved(base, size))
>>> break;
>>> +
>>> +   base += size;
>>> }
>>> -   if ((base > (memory_boundary - size)) ||
>>> -   memblock_reserve(base, size)) {
>>> +
>>> +   if (base > (memory_boundary - size)) {
>>> +   pr_err("Failed to find memory chunk for reservation\n");
>>> +   goto error_out;
>>> +   }
>>> +   fw_dump.reserve_dump_area_start = base;
>>> +
>>> +   /*
>>> +* Calculate the kernel metadata address and register it with
>>> +* f/w if the platform supports.
>>> +*/
>>> +   if (fw_dump.ops->setup_kernel_metadata(&fw_dump) < 0)
>>> +   goto error_out;
>>
>> I see setup_kernel_metadata() registers the metadata address with opal
>> without having any minimum data initialized in it. Secondly, why can't this
>> wait until registration? I think we should defer this until fadump
>> registration.
> 
> If setting up the metadata address fails (it should ideally not fail,
> but...), everything else is useless.

That's less likely... the same is true with opal_mpipl_update() as well.

> So, we might as well try that early and fall back to KDump in case of an 
> error..

ok. Yeah but not uninitialized metadata.

> 
>> What if kernel crashes before metadata area is initialized ?
> 
> registered_regions would be '0', so it is treated as the "fadump is not
> registered" case. Let me initialize the metadata explicitly before
> registering the address with f/w to avoid any assumption...

Do you want to do that before memblock reservation ? Should we move this
to setup_fadump() ?

Thanks,
-Mahesh.

> 
>>
>>> +
>>> +   if (memblock_reserve(base, size)) {
>>> pr_err("Failed to reserve memory\n");
>>> -   return 0;
>>> +   goto error_out;
>>> }
>> [...]
>>> -
>>>  static struct fadump_ops rtas_fadump_ops = {
>>> -   .init_fadump_mem_struct = rtas_fadump_init_mem_struct,
>>> -   .register_fadump= rtas_fadump_register_fadump,
>>> -   .unregister_fadump  = rtas_fadump_unregister_fadump,
>>> -   .invalidate_fadump  = rtas_fadump_invalidate_fadump,
>>> -   .process_fadump = rtas_fadump_process_fadump,
>>> -   .fadump_region_show = rtas_fadump_region_show,
>>> -   .fadump_trigger = rtas_fadump_trigger,
>>> +   .init_fadump_mem_struct = rtas_fadump_init_mem_struct,
>>> +   .get_kernel_metadata_size   = rtas_fadump_get_kernel_metadata_size,
>>> +   .setup_kernel_metadata  = rtas_fadump_setup_kernel_metadata,
>>> +   .register_fadump= rtas_fadump_register_fadump,
>>> +   .unregister_fadump  = rtas_fadump_unregister_fadump,
>>> +   .invalidate_fadump  = rtas_fadump_invalidate_fadump,
>>> +   .process_fadump = rtas_fadump_process_fadump,
>>> +   .fadump_region_show = rtas_fadump_region_show,
>>> +   .fadump_trigger = rtas_fadump_trigger,
>>
>> Can you make the tab space changes in your previous patch where these
>> were initially introduced ? So that this patch can only show new members
>> that are added.
> 
> done.
> 
> Thanks
> Hari
> 



Re: [PATCH v9 6/7] powerpc/mce: Handle UE event for memcpy_mcsafe

2019-08-14 Thread Mahesh Jagannath Salgaonkar
On 8/12/19 2:52 PM, Santosh Sivaraj wrote:
> If we take a UE on one of the instructions with a fixup entry, set nip
> to continue execution at the fixup entry. Stop processing the event
> further or printing it.
> 
> Co-developed-by: Reza Arbab 
> Signed-off-by: Reza Arbab 
> Cc: Mahesh Salgaonkar 
> Signed-off-by: Santosh Sivaraj 

Looks good to me.

Reviewed-by: Mahesh Salgaonkar 

Thanks,
-Mahesh.

> ---
>  arch/powerpc/include/asm/mce.h  |  4 +++-
>  arch/powerpc/kernel/mce.c   | 16 
>  arch/powerpc/kernel/mce_power.c | 15 +--
>  3 files changed, 32 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/mce.h b/arch/powerpc/include/asm/mce.h
> index f3a6036b6bc0..e1931c8c2743 100644
> --- a/arch/powerpc/include/asm/mce.h
> +++ b/arch/powerpc/include/asm/mce.h
> @@ -122,7 +122,8 @@ struct machine_check_event {
>   enum MCE_UeErrorType ue_error_type:8;
>   u8  effective_address_provided;
>   u8  physical_address_provided;
> - u8  reserved_1[5];
> + u8  ignore_event;
> + u8  reserved_1[4];
>   u64 effective_address;
>   u64 physical_address;
>   u8  reserved_2[8];
> @@ -193,6 +194,7 @@ struct mce_error_info {
>   enum MCE_Initiator  initiator:8;
>   enum MCE_ErrorClass error_class:8;
>   boolsync_error;
> + boolignore_event;
>  };
>  
>  #define MAX_MC_EVT   100
> diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
> index a3b122a685a5..ec4b3e1087be 100644
> --- a/arch/powerpc/kernel/mce.c
> +++ b/arch/powerpc/kernel/mce.c
> @@ -149,6 +149,7 @@ void save_mce_event(struct pt_regs *regs, long handled,
>   if (phys_addr != ULONG_MAX) {
>   mce->u.ue_error.physical_address_provided = true;
>   mce->u.ue_error.physical_address = phys_addr;
> + mce->u.ue_error.ignore_event = mce_err->ignore_event;
>   machine_check_ue_event(mce);
>   }
>   }
> @@ -266,8 +267,17 @@ static void machine_process_ue_event(struct work_struct 
> *work)
>   /*
>* This should probably queued elsewhere, but
>* oh! well
> +  *
> +  * Don't report this machine check because the caller has
> +  * asked us to ignore the event, it has a fixup handler which
> +  * will do the appropriate error handling and reporting.
>*/
>   if (evt->error_type == MCE_ERROR_TYPE_UE) {
> + if (evt->u.ue_error.ignore_event) {
> + __this_cpu_dec(mce_ue_count);
> + continue;
> + }
> +
>   if (evt->u.ue_error.physical_address_provided) {
>   unsigned long pfn;
>  
> @@ -301,6 +311,12 @@ static void machine_check_process_queued_event(struct 
> irq_work *work)
>   while (__this_cpu_read(mce_queue_count) > 0) {
>   index = __this_cpu_read(mce_queue_count) - 1;
> evt = this_cpu_ptr(&mce_event_queue[index]);
> +
> + if (evt->error_type == MCE_ERROR_TYPE_UE &&
> + evt->u.ue_error.ignore_event) {
> + __this_cpu_dec(mce_queue_count);
> + continue;
> + }
>   machine_check_print_event_info(evt, false, false);
>   __this_cpu_dec(mce_queue_count);
>   }
> diff --git a/arch/powerpc/kernel/mce_power.c b/arch/powerpc/kernel/mce_power.c
> index e74816f045f8..1dd87f6f5186 100644
> --- a/arch/powerpc/kernel/mce_power.c
> +++ b/arch/powerpc/kernel/mce_power.c
> @@ -11,6 +11,7 @@
>  
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -18,6 +19,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  /*
>   * Convert an address related to an mm to a physical address.
> @@ -559,9 +561,18 @@ static int mce_handle_derror(struct pt_regs *regs,
>   return 0;
>  }
>  
> -static long mce_handle_ue_error(struct pt_regs *regs)
> +static long mce_handle_ue_error(struct pt_regs *regs,
> + struct mce_error_info *mce_err)
>  {
>   long handled = 0;
> + const struct exception_table_entry *entry;
> +
> + entry = search_kernel_exception_table(regs->nip);
> + if (entry) {
> + mce_err->ignore_event = true;
> + regs->nip = extable_fixup(entry);
> + return 1;
> + }
>  
>   /*
>* On specific SCOM read via MMIO we may get a machine check
> @@ -594,7 +605,7 @@ static long mce_handle_error(struct pt_regs *regs,
>   _addr);
>  
>   if (!handled && 

Re: [PATCH v9 1/7] powerpc/mce: Schedule work from irq_work

2019-08-12 Thread Mahesh Jagannath Salgaonkar
On 8/12/19 2:52 PM, Santosh Sivaraj wrote:
> schedule_work() cannot be called from MCE exception context as MCE can
> interrupt even in interrupt disabled context.
> 
> fixes: 733e4a4c ("powerpc/mce: hookup memory_failure for UE errors")
> Suggested-by: Mahesh Salgaonkar 
> Signed-off-by: Santosh Sivaraj 
> Cc: sta...@vger.kernel.org # v4.15+
> ---
>  arch/powerpc/kernel/mce.c | 11 ++-
>  1 file changed, 10 insertions(+), 1 deletion(-)

Reviewed-by: Mahesh Salgaonkar 

Thanks,
-Mahesh.

> 
> diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
> index b18df633eae9..cff31d4a501f 100644
> --- a/arch/powerpc/kernel/mce.c
> +++ b/arch/powerpc/kernel/mce.c
> @@ -33,6 +33,7 @@ static DEFINE_PER_CPU(struct 
> machine_check_event[MAX_MC_EVT],
>   mce_ue_event_queue);
>  
>  static void machine_check_process_queued_event(struct irq_work *work);
> +static void machine_check_ue_irq_work(struct irq_work *work);
>  void machine_check_ue_event(struct machine_check_event *evt);
>  static void machine_process_ue_event(struct work_struct *work);
>  
> @@ -40,6 +41,10 @@ static struct irq_work mce_event_process_work = {
>  .func = machine_check_process_queued_event,
>  };
>  
> +static struct irq_work mce_ue_event_irq_work = {
> + .func = machine_check_ue_irq_work,
> +};
> +
>  DECLARE_WORK(mce_ue_event_work, machine_process_ue_event);
>  
>  static void mce_set_error_info(struct machine_check_event *mce,
> @@ -199,6 +204,10 @@ void release_mce_event(void)
>   get_mce_event(NULL, true);
>  }
>  
> +static void machine_check_ue_irq_work(struct irq_work *work)
> +{
> + schedule_work(&mce_ue_event_work);
> +}
>  
>  /*
>   * Queue up the MCE event which then can be handled later.
> @@ -216,7 +225,7 @@ void machine_check_ue_event(struct machine_check_event 
> *evt)
> memcpy(this_cpu_ptr(&mce_ue_event_queue[index]), evt, sizeof(*evt));
>  
>   /* Queue work to process this event later. */
> - schedule_work(&mce_ue_event_work);
> + irq_work_queue(&mce_ue_event_irq_work);
>  }
>  
>  /*
> 



Re: [PATCH v4 03/25] powerpc/fadump: Improve fadump documentation

2019-08-12 Thread Mahesh Jagannath Salgaonkar
On 7/16/19 5:02 PM, Hari Bathini wrote:
> The figures depicting FADump's (Firmware-Assisted Dump) memory layout
> are missing some finer details like different memory regions and what
> they represent. Improve the documentation by updating those details.
> 
> Signed-off-by: Hari Bathini 
> ---
>  Documentation/powerpc/firmware-assisted-dump.txt |   65 
> --
>  1 file changed, 35 insertions(+), 30 deletions(-)
> 
> diff --git a/Documentation/powerpc/firmware-assisted-dump.txt 
> b/Documentation/powerpc/firmware-assisted-dump.txt
> index 0c41d6d..e9b4e3c 100644
> --- a/Documentation/powerpc/firmware-assisted-dump.txt
> +++ b/Documentation/powerpc/firmware-assisted-dump.txt

This will have to be rebased now onto firmware-assisted-dump.rst. However,
the changes look good to me.

Thanks,
-Mahesh.

> @@ -74,8 +74,9 @@ as follows:
> there is crash data available from a previous boot. During
> the early boot OS will reserve rest of the memory above
> boot memory size effectively booting with restricted memory
> -   size. This will make sure that the second kernel will not
> -   touch any of the dump memory area.
> +   size. This will make sure that this kernel (also, referred
> +   to as second kernel or capture kernel) will not touch any
> +   of the dump memory area.
>  
>  -- User-space tools will read /proc/vmcore to obtain the contents
> of memory, which holds the previous crashed kernel dump in ELF
> @@ -125,48 +126,52 @@ space memory except the user pages that were present in 
> CMA region.
>  
>o Memory Reservation during first kernel
>  
> -  Low memory Top of memory
> -  0  boot memory size   |
> -  |   ||<--Reserved dump area -->|  |
> -  V   V|   Permanent Reservation |  V
> -  +---+--/ /---+---++---++--+
> -  |   ||CPU|HPTE|  DUMP |ELF |  |
> -  +---+--/ /---+---++---++--+
> -|   ^
> -|   |
> -\   /
> - ---
> -  Boot memory content gets transferred to
> -  reserved area by firmware at the time of
> -  crash
> +  Low memoryTop of memory
> +  0  boot memory size  |<--Reserved dump area --->|  |
> +  |   ||   Permanent Reservation  |  |
> +  V   V|   (Preserve area)|  V
> +  +---+--/ /---+---+++---++--+
> +  |   ||CPU|HPTE|  DUMP  |HDR|ELF |  |
> +  +---+--/ /---+---+++---++--+
> +|   ^  ^
> +|   |  |
> +\   /  |
> + --- FADump Header
> +  Boot memory content gets transferred   (meta area)
> +  to reserved area by firmware at the
> +  time of crash
> +
> Fig. 1
>  
> +
>o Memory Reservation during second kernel after crash
>  
> -  Low memoryTop of memory
> -  0  boot memory size   |
> -  |   |<- Reserved dump area --- -->|
> -  V   V V
> -  +---+--/ /---+---++---++--+
> -  |   ||CPU|HPTE|  DUMP |ELF |  |
> -  +---+--/ /---+---++---++--+
> +  Low memoryTop of memory
> +  0  boot memory size|
> +  |   |<- Reserved dump area --->|
> +  V   V|< Preserve area ->|  V
> +  +---+--/ /---+---+++---++--+
> +  |   ||CPU|HPTE|  DUMP  |HDR|ELF |  |
> +  +---+--/ /---+---+++---++--+
>  |  |
>  V  V
> Used by second/proc/vmcore
> kernel to boot
> Fig. 2
>  
> -Currently the dump will be copied from /proc/vmcore to a
> -a new file upon user intervention. The dump data available through
> -/proc/vmcore will be in ELF format. Hence the existing kdump
> -infrastructure (kdump scripts) to save the dump works fine with
> -minor modifications.
> +Currently the dump will be copied from /proc/vmcore to a new file upon
> +user 

Re: [PATCH v8 3/7] powerpc/mce: Fix MCE handling for huge pages

2019-08-09 Thread Mahesh Jagannath Salgaonkar
On 8/7/19 8:26 PM, Santosh Sivaraj wrote:
> From: Balbir Singh 
> 
> The current code would fail on huge pages addresses, since the shift would
> be incorrect. Use the correct page shift value returned by
> __find_linux_pte() to get the correct physical address. The code is more
> generic and can handle both regular and compound pages.
> 
> Fixes: ba41e1e1ccb9 ("powerpc/mce: Hookup derror (load/store) UE errors")
> Signed-off-by: Balbir Singh 
> [ar...@linux.ibm.com: Fixup pseries_do_memory_failure()]
> Signed-off-by: Reza Arbab 
> Co-developed-by: Santosh Sivaraj 
> Signed-off-by: Santosh Sivaraj 
> ---
>  arch/powerpc/include/asm/mce.h   |  2 +-
>  arch/powerpc/kernel/mce_power.c  | 50 ++--
>  arch/powerpc/platforms/pseries/ras.c |  9 ++---
>  3 files changed, 29 insertions(+), 32 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/mce.h b/arch/powerpc/include/asm/mce.h
> index a4c6a74ad2fb..f3a6036b6bc0 100644
> --- a/arch/powerpc/include/asm/mce.h
> +++ b/arch/powerpc/include/asm/mce.h
> @@ -209,7 +209,7 @@ extern void release_mce_event(void);
>  extern void machine_check_queue_event(void);
>  extern void machine_check_print_event_info(struct machine_check_event *evt,
>  bool user_mode, bool in_guest);
> -unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr);
> +unsigned long addr_to_phys(struct pt_regs *regs, unsigned long addr);
>  #ifdef CONFIG_PPC_BOOK3S_64
>  void flush_and_reload_slb(void);
>  #endif /* CONFIG_PPC_BOOK3S_64 */
> diff --git a/arch/powerpc/kernel/mce_power.c b/arch/powerpc/kernel/mce_power.c
> index a814d2dfb5b0..bed38a8e2e50 100644
> --- a/arch/powerpc/kernel/mce_power.c
> +++ b/arch/powerpc/kernel/mce_power.c
> @@ -20,13 +20,14 @@
>  #include 
>  
>  /*
> - * Convert an address related to an mm to a PFN. NOTE: we are in real
> - * mode, we could potentially race with page table updates.
> + * Convert an address related to an mm to a physical address.
> + * NOTE: we are in real mode, we could potentially race with page table 
> updates.
>   */
> -unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr)
> +unsigned long addr_to_phys(struct pt_regs *regs, unsigned long addr)
>  {
> - pte_t *ptep;
> - unsigned long flags;
> + pte_t *ptep, pte;
> + unsigned int shift;
> + unsigned long flags, phys_addr;
>   struct mm_struct *mm;
>  
>   if (user_mode(regs))
> @@ -35,14 +36,21 @@ unsigned long addr_to_pfn(struct pt_regs *regs, unsigned 
> long addr)
> mm = &init_mm;
>  
>   local_irq_save(flags);
> - if (mm == current->mm)
> - ptep = find_current_mm_pte(mm->pgd, addr, NULL, NULL);
> - else
> - ptep = find_init_mm_pte(addr, NULL);
> + ptep = __find_linux_pte(mm->pgd, addr, NULL, &shift);
>   local_irq_restore(flags);
> +
>   if (!ptep || pte_special(*ptep))
>   return ULONG_MAX;
> - return pte_pfn(*ptep);
> +
> + pte = *ptep;
> + if (shift > PAGE_SHIFT) {
> + unsigned long rpnmask = (1ul << shift) - PAGE_SIZE;
> +
> + pte = __pte(pte_val(pte) | (addr & rpnmask));
> + }
> + phys_addr = pte_pfn(pte) << PAGE_SHIFT;
> +
> + return phys_addr;
>  }
>  
>  /* flush SLBs and reload */
> @@ -354,18 +362,16 @@ static int mce_find_instr_ea_and_pfn(struct pt_regs 
> *regs, uint64_t *addr,

Now that we have addr_to_phys(), can we change this function name as well,
to mce_find_instr_ea_and_phys()?

Tested-by: Mahesh Salgaonkar 

This should go to stable tree. Can you move this patch to 2nd position ?

Thanks,
-Mahesh.



Re: [PATCH v8 1/7] powerpc/mce: Schedule work from irq_work

2019-08-09 Thread Mahesh Jagannath Salgaonkar
On 8/7/19 8:26 PM, Santosh Sivaraj wrote:
> schedule_work() cannot be called from MCE exception context as MCE can
> interrupt even in interrupt disabled context.
> 
> fixes: 733e4a4c ("powerpc/mce: hookup memory_failure for UE errors")
> Signed-off-by: Santosh Sivaraj 
> ---
>  arch/powerpc/kernel/mce.c | 8 +---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
> index b18df633eae9..0ab6fa7c 100644
> --- a/arch/powerpc/kernel/mce.c
> +++ b/arch/powerpc/kernel/mce.c
> @@ -144,7 +144,6 @@ void save_mce_event(struct pt_regs *regs, long handled,
>   if (phys_addr != ULONG_MAX) {
>   mce->u.ue_error.physical_address_provided = true;
>   mce->u.ue_error.physical_address = phys_addr;
> - machine_check_ue_event(mce);
>   }
>   }
>   return;
> @@ -275,8 +274,7 @@ static void machine_process_ue_event(struct work_struct 
> *work)
>   }
>  }
>  /*
> - * process pending MCE event from the mce event queue. This function will be
> - * called during syscall exit.
> + * process pending MCE event from the mce event queue.
>   */
>  static void machine_check_process_queued_event(struct irq_work *work)
>  {
> @@ -292,6 +290,10 @@ static void machine_check_process_queued_event(struct 
> irq_work *work)
>   while (__this_cpu_read(mce_queue_count) > 0) {
>   index = __this_cpu_read(mce_queue_count) - 1;
> evt = this_cpu_ptr(&mce_event_queue[index]);
> +
> + if (evt->error_type == MCE_ERROR_TYPE_UE)
> + machine_check_ue_event(evt);

This will work only for the events that are queued by the MCE handler;
others will get ignored. I think you should introduce a separate irq_work
queue for schedule_work().

Thanks,
-Mahesh.



Re: [PATCH] powerpc/fadump: sysfs for fadump memory reservation

2019-08-06 Thread Mahesh Jagannath Salgaonkar
On 8/6/19 8:42 AM, Sourabh Jain wrote:
> Add a sysfs interface to allow querying the memory reserved by fadump
> for saving the crash dump.
> 
> Signed-off-by: Sourabh Jain 

Looks good to me.

Reviewed-by: Mahesh Salgaonkar 

Thanks,
-Mahesh.

> ---
>  Documentation/powerpc/firmware-assisted-dump.rst |  5 +
>  arch/powerpc/kernel/fadump.c | 14 ++
>  2 files changed, 19 insertions(+)
> 
> diff --git a/Documentation/powerpc/firmware-assisted-dump.rst 
> b/Documentation/powerpc/firmware-assisted-dump.rst
> index 9ca12830a48e..4a7f6dc556f5 100644
> --- a/Documentation/powerpc/firmware-assisted-dump.rst
> +++ b/Documentation/powerpc/firmware-assisted-dump.rst
> @@ -222,6 +222,11 @@ Here is the list of files under kernel sysfs:
>  be handled and vmcore will not be captured. This interface can be
>  easily integrated with kdump service start/stop.
> 
> +/sys/kernel/fadump_mem_reserved
> +
> +   This is used to display the memory reserved by fadump for saving the
> +   crash dump.
> +
>   /sys/kernel/fadump_release_mem
>  This file is available only when fadump is active during
>  second kernel. This is used to release the reserved memory
> diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
> index 4eab97292cc2..70d49013ebec 100644
> --- a/arch/powerpc/kernel/fadump.c
> +++ b/arch/powerpc/kernel/fadump.c
> @@ -1514,6 +1514,13 @@ static ssize_t fadump_enabled_show(struct kobject 
> *kobj,
>   return sprintf(buf, "%d\n", fw_dump.fadump_enabled);
>  }
> 
> +static ssize_t fadump_mem_reserved_show(struct kobject *kobj,
> + struct kobj_attribute *attr,
> + char *buf)
> +{
> + return sprintf(buf, "%ld\n", fw_dump.reserve_dump_area_size);
> +}
> +
>  static ssize_t fadump_register_show(struct kobject *kobj,
>   struct kobj_attribute *attr,
>   char *buf)
> @@ -1632,6 +1639,9 @@ static struct kobj_attribute fadump_attr = 
> __ATTR(fadump_enabled,
>  static struct kobj_attribute fadump_register_attr = __ATTR(fadump_registered,
>   0644, fadump_register_show,
>   fadump_register_store);
> +static struct kobj_attribute fadump_mem_reserved_attr =
> + __ATTR(fadump_mem_reserved, 0444,
> + fadump_mem_reserved_show, NULL);
> 
>  DEFINE_SHOW_ATTRIBUTE(fadump_region);
> 
> @@ -1663,6 +1673,10 @@ static void fadump_init_files(void)
>   printk(KERN_ERR "fadump: unable to create sysfs file"
>   " fadump_release_mem (%d)\n", rc);
>   }
> + rc = sysfs_create_file(kernel_kobj, &fadump_mem_reserved_attr.attr);
> + if (rc)
> + pr_err("unable to create sysfs file fadump_mem_reserved (%d)\n",
> + rc);
>   return;
>  }
> 



Re: [PATCH 2/2] powerpc: avoid adjusting memory_limit for capture kernel memory reservation

2019-07-23 Thread Mahesh Jagannath Salgaonkar
On 7/22/19 11:19 PM, Michal Suchánek wrote:
> On Fri, 28 Jun 2019 00:51:19 +0530
> Hari Bathini  wrote:
> 
>> Currently, if memory_limit is specified and it overlaps with memory to
>> be reserved for the capture kernel, memory_limit is adjusted to
>> accommodate the capture kernel. With memory reservation for the capture
>> kernel moved later (after enforcing the memory limit), this adjustment
>> no longer holds water. So, avoid adjusting memory_limit and error out
>> instead.
> 
> Can you split out the memory limit adjustment out of memory reservation
> so it can still be adjusted?

Do you mean adjusting the memory limit before we do the actual reservation?

> 
> Thanks
> 
> Michal
>>
>> Signed-off-by: Hari Bathini 
>> ---
>>  arch/powerpc/kernel/fadump.c|   16 
>>  arch/powerpc/kernel/machine_kexec.c |   22 +++---
>>  2 files changed, 11 insertions(+), 27 deletions(-)
>>
>> diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
>> index 4eab972..a784695 100644
>> --- a/arch/powerpc/kernel/fadump.c
>> +++ b/arch/powerpc/kernel/fadump.c
>> @@ -476,22 +476,6 @@ int __init fadump_reserve_mem(void)
>>  #endif
>>  }
>>  
>> -/*
>> - * Calculate the memory boundary.
>> - * If memory_limit is less than actual memory boundary then reserve
>> - * the memory for fadump beyond the memory_limit and adjust the
>> - * memory_limit accordingly, so that the running kernel can run with
>> - * specified memory_limit.
>> - */
>> -if (memory_limit && memory_limit < memblock_end_of_DRAM()) {
>> -size = get_fadump_area_size();
>> -if ((memory_limit + size) < memblock_end_of_DRAM())
>> -memory_limit += size;
>> -else
>> -memory_limit = memblock_end_of_DRAM();
>> -printk(KERN_INFO "Adjusted memory_limit for firmware-assisted"
>> -" dump, now %#016llx\n", memory_limit);
>> -}
>>  if (memory_limit)
>>  memory_boundary = memory_limit;
>>  else
>> diff --git a/arch/powerpc/kernel/machine_kexec.c 
>> b/arch/powerpc/kernel/machine_kexec.c
>> index c4ed328..fc5533b 100644
>> --- a/arch/powerpc/kernel/machine_kexec.c
>> +++ b/arch/powerpc/kernel/machine_kexec.c
>> @@ -125,10 +125,8 @@ void __init reserve_crashkernel(void)
>>  crashk_res.end = crash_base + crash_size - 1;
>>  }
>>  
>> -if (crashk_res.end == crashk_res.start) {
>> -crashk_res.start = crashk_res.end = 0;
>> -return;
>> -}
>> +if (crashk_res.end == crashk_res.start)
>> +goto error_out;
>>  
>>  /* We might have got these values via the command line or the
>>   * device tree, either way sanitise them now. */
>> @@ -170,15 +168,13 @@ void __init reserve_crashkernel(void)
>>  if (overlaps_crashkernel(__pa(_stext), _end - _stext)) {
>>  printk(KERN_WARNING
>>  "Crash kernel can not overlap current kernel\n");
>> -crashk_res.start = crashk_res.end = 0;
>> -return;
>> +goto error_out;
>>  }
>>  
>>  /* Crash kernel trumps memory limit */
>>  if (memory_limit && memory_limit <= crashk_res.end) {
>> -memory_limit = crashk_res.end + 1;
>> -printk("Adjusted memory limit for crashkernel, now 0x%llx\n",
>> -   memory_limit);
>> +pr_err("Crash kernel size can't exceed memory_limit\n");
>> +goto error_out;
>>  }
>>  
>>  printk(KERN_INFO "Reserving %ldMB of memory at %ldMB "
>> @@ -190,9 +186,13 @@ void __init reserve_crashkernel(void)
>>  if (!memblock_is_region_memory(crashk_res.start, crash_size) ||
>>  memblock_reserve(crashk_res.start, crash_size)) {
>>  pr_err("Failed to reserve memory for crashkernel!\n");
>> -crashk_res.start = crashk_res.end = 0;
>> -return;
>> +goto error_out;
>>  }
>> +
>> +return;
>> +error_out:
>> +crashk_res.start = crashk_res.end = 0;
>> +return;
>>  }
>>  
>>  int overlaps_crashkernel(unsigned long start, unsigned long size)
>>
> 



Re: [v3 4/7] powerpc/mce: Handle UE event for memcpy_mcsafe

2019-07-08 Thread Mahesh Jagannath Salgaonkar
On 7/6/19 3:23 PM, Nicholas Piggin wrote:
> Santosh Sivaraj's on July 6, 2019 7:26 am:
>> If we take a UE on one of the instructions with a fixup entry, set nip
>> to continue execution at the fixup entry. Stop processing the event
>> further or print it.
> 
> Minor nit, but can you instead add a field in the mce data structure that
> describes the property of the event, and then the code that intends to
> ignore such events can test for it (with an appropriate comment).
> 
> So it would be has_fixup_handler or similar. Then queue_event can have
> the logic
> 
> /*
>  * Don't report this machine check because the caller has a fixup 
>  * handler which will do the appropriate error handling and reporting.
>  */
> 
> 
>> @@ -565,9 +567,18 @@ static int mce_handle_derror(struct pt_regs *regs,
>>  return 0;
>>  }
>>  
>> -static long mce_handle_ue_error(struct pt_regs *regs)
>> +static long mce_handle_ue_error(struct pt_regs *regs,
>> +struct mce_error_info *mce_err)
>>  {
>>  long handled = 0;
>> +const struct exception_table_entry *entry;
>> +
>> +entry = search_exception_tables(regs->nip);
> 
> Uh oh, this searches module exception tables too... we can't do that
> in real mode, can we?

Yeah, we cannot do that in real mode.  Should we directly call
search_extable() on the kernel exception table?

> 
> Thanks,
> Nick
> 



Re: [v2 09/12] powerpc/mce: Enable MCE notifiers in external modules

2019-07-02 Thread Mahesh Jagannath Salgaonkar
On 7/2/19 11:47 AM, Nicholas Piggin wrote:
> Santosh Sivaraj's on July 2, 2019 3:19 pm:
>> From: Reza Arbab 
>>
>> Signed-off-by: Reza Arbab 
>> ---
>>  arch/powerpc/kernel/exceptions-64s.S | 6 ++
>>  arch/powerpc/kernel/mce.c| 2 ++
>>  2 files changed, 8 insertions(+)
>>
>> diff --git a/arch/powerpc/kernel/exceptions-64s.S 
>> b/arch/powerpc/kernel/exceptions-64s.S
>> index c83e38a403fd..311f1392a2ec 100644
>> --- a/arch/powerpc/kernel/exceptions-64s.S
>> +++ b/arch/powerpc/kernel/exceptions-64s.S
>> @@ -458,6 +458,12 @@ EXC_COMMON_BEGIN(machine_check_handle_early)
>>  bl  machine_check_early
>>  std r3,RESULT(r1)   /* Save result */
>>  
>> +/* Notifiers may be in a module, so enable virtual addressing. */
>> +mfmsr   r11
>> +ori r11,r11,MSR_IR
>> +ori r11,r11,MSR_DR
>> +mtmsr   r11
> 
> Can't do this, we could take a machine check somewhere the MMU is
> not sane (in fact the guest early mce handling that was added recently
> should not be enabling virtual mode either, which needs to be fixed).

Looks like they need this to be able to run the notifier chain, which may
fail in real mode.

> 
> Thanks,
> Nick
> 



Re: [PATCH 05/13] powerpc/mce: Allow notifier callback to handle MCE

2019-06-23 Thread Mahesh Jagannath Salgaonkar
On 6/23/19 7:44 AM, Reza Arbab wrote:
> Hi Mahesh,
> 
> On Fri, Jun 21, 2019 at 12:35:08PM +0530, Mahesh Jagannath Salgaonkar
> wrote:
>> On 6/21/19 6:27 AM, Santosh Sivaraj wrote:
>>> -    blocking_notifier_call_chain(&mce_notifier_list, 0, &evt);
>>> +    rc = blocking_notifier_call_chain(&mce_notifier_list, 0, evt);
>>> +    if (rc & NOTIFY_STOP_MASK) {
>>> +    evt->disposition = MCE_DISPOSITION_RECOVERED;
>>> +    regs->msr |= MSR_RI;
>>
>> What is the reason for setting MSR_RI ? I don't think this is a good
>> idea. MSR_RI = 0 means system got MCE interrupt when SRR0 and SRR1
>> contents were live and was overwritten by MCE interrupt. Hence this
>> interrupt is unrecoverable irrespective of whether machine check handler
>> recovers from it or not.
> 
> Good catch! I think this is an artifact from when I was first trying to
> get all this working.
> 
> Instead of setting MSR_RI, we should probably just check for it. Ie,
> 
> if ((rc & NOTIFY_STOP_MASK) && (regs->msr & MSR_RI)) {
>     evt->disposition = MCE_DISPOSITION_RECOVERED;

Yup, looks good to me.

Thanks,
-Mahesh.



Re: [PATCH 05/13] powerpc/mce: Allow notifier callback to handle MCE

2019-06-21 Thread Mahesh Jagannath Salgaonkar
On 6/21/19 6:27 AM, Santosh Sivaraj wrote:
> From: Reza Arbab 
> 
> If a notifier returns NOTIFY_STOP, consider the MCE handled, just as we
> do when machine_check_early() returns 1.
> 
> Signed-off-by: Reza Arbab 
> ---
>  arch/powerpc/include/asm/asm-prototypes.h |  2 +-
>  arch/powerpc/kernel/exceptions-64s.S  |  3 +++
>  arch/powerpc/kernel/mce.c | 28 ---
>  3 files changed, 24 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/asm-prototypes.h 
> b/arch/powerpc/include/asm/asm-prototypes.h
> index f66f26ef3ce0..49ee8f08de2a 100644
> --- a/arch/powerpc/include/asm/asm-prototypes.h
> +++ b/arch/powerpc/include/asm/asm-prototypes.h
> @@ -72,7 +72,7 @@ void machine_check_exception(struct pt_regs *regs);
>  void emulation_assist_interrupt(struct pt_regs *regs);
>  long do_slb_fault(struct pt_regs *regs, unsigned long ea);
>  void do_bad_slb_fault(struct pt_regs *regs, unsigned long ea, long err);
> -void machine_check_notify(struct pt_regs *regs);
> +long machine_check_notify(struct pt_regs *regs);
>  
>  /* signals, syscalls and interrupts */
>  long sys_swapcontext(struct ucontext __user *old_ctx,
> diff --git a/arch/powerpc/kernel/exceptions-64s.S 
> b/arch/powerpc/kernel/exceptions-64s.S
> index 2e56014fca21..c83e38a403fd 100644
> --- a/arch/powerpc/kernel/exceptions-64s.S
> +++ b/arch/powerpc/kernel/exceptions-64s.S
> @@ -460,6 +460,9 @@ EXC_COMMON_BEGIN(machine_check_handle_early)
>  
>   addir3,r1,STACK_FRAME_OVERHEAD
>   bl  machine_check_notify
> + ld  r11,RESULT(r1)
> + or  r3,r3,r11
> + std r3,RESULT(r1)
>  
>   ld  r12,_MSR(r1)
>  BEGIN_FTR_SECTION
> diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
> index 0ab171b41ede..912efe58e0b1 100644
> --- a/arch/powerpc/kernel/mce.c
> +++ b/arch/powerpc/kernel/mce.c
> @@ -647,16 +647,28 @@ long hmi_exception_realmode(struct pt_regs *regs)
>   return 1;
>  }
>  
> -void machine_check_notify(struct pt_regs *regs)
> +long machine_check_notify(struct pt_regs *regs)
>  {
> - struct machine_check_event evt;
> + int index = __this_cpu_read(mce_nest_count) - 1;
> + struct machine_check_event *evt;
> + int rc;
>  
> - if (!get_mce_event(&evt, MCE_EVENT_DONTRELEASE))
> - return;
> + if (index < 0 || index >= MAX_MC_EVT)
> + return 0;
> +
> + evt = this_cpu_ptr(&mce_event[index]);
>  
> - blocking_notifier_call_chain(&mce_notifier_list, 0, &evt);
> + rc = blocking_notifier_call_chain(&mce_notifier_list, 0, evt);
> + if (rc & NOTIFY_STOP_MASK) {
> + evt->disposition = MCE_DISPOSITION_RECOVERED;
> + regs->msr |= MSR_RI;

What is the reason for setting MSR_RI ? I don't think this is a good
idea. MSR_RI = 0 means system got MCE interrupt when SRR0 and SRR1
contents were live and was overwritten by MCE interrupt. Hence this
interrupt is unrecoverable irrespective of whether machine check handler
recovers from it or not.

Thanks,
-Mahesh.



Re: [PATCH v2 43/52] powerpc/64s/exception: machine check early only runs in HV mode

2019-06-20 Thread Mahesh Jagannath Salgaonkar
On 6/20/19 3:46 PM, Nicholas Piggin wrote:
> Mahesh J Salgaonkar's on June 20, 2019 7:53 pm:
>> On 2019-06-20 15:14:50 Thu, Nicholas Piggin wrote:
>>> machine_check_common_early and machine_check_handle_early only run in
>>> HVMODE. Remove dead code.
>>
>> That's not true. For pseries guest with FWNMI enabled hypervisor,
>> machine_check_common_early gets called in non-HV mode as well.
>>
>>machine_check_fwnmi
>>  machine_check_common_early
>>machine_check_handle_early
>>  machine_check_early
>>pseries_machine_check_realmode
> 
> Yep, yep I was confused by the earlier patch. So we're only doing the
> early machine check path for the FWNMI case?

yes.

> 
> Thanks,
> Nick
> 



Re: [PATCH v2 42/52] powerpc/64s/exception: machine check fwnmi does not trigger when in HV mode

2019-06-20 Thread Mahesh Jagannath Salgaonkar
On 6/20/19 10:44 AM, Nicholas Piggin wrote:
> Remove dead code.
> 
> Signed-off-by: Nicholas Piggin 
> ---
>  arch/powerpc/kernel/exceptions-64s.S | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/exceptions-64s.S 
> b/arch/powerpc/kernel/exceptions-64s.S
> index 286bd5670d60..b12755a4f884 100644
> --- a/arch/powerpc/kernel/exceptions-64s.S
> +++ b/arch/powerpc/kernel/exceptions-64s.S
> @@ -1040,9 +1040,6 @@ TRAMP_REAL_BEGIN(machine_check_pSeries)
>   .globl machine_check_fwnmi
>  machine_check_fwnmi:
>   EXCEPTION_PROLOG_0 PACA_EXMC
> -BEGIN_FTR_SECTION
> - b   machine_check_common_early
> -END_FTR_SECTION_IFCLR(CPU_FTR_HVMODE)

Didn't we add that to handle SLB/ERAT errors in real mode for pseries?
Are we taking that off?

>  machine_check_pSeries_0:
>   EXCEPTION_PROLOG_1 EXC_STD, PACA_EXMC, 1, 0x200, 1, 1, 0
>   /*
> 




Re: [RFC PATCH 2/3] powernv/mce: Print correct severity for mce error.

2019-03-29 Thread Mahesh Jagannath Salgaonkar
On 3/29/19 5:53 AM, Michael Ellerman wrote:
> Mahesh J Salgaonkar  writes:
>> diff --git a/arch/powerpc/include/asm/mce.h b/arch/powerpc/include/asm/mce.h
>> index 8d0b1c24c636..314ed3f13d59 100644
>> --- a/arch/powerpc/include/asm/mce.h
>> +++ b/arch/powerpc/include/asm/mce.h
>> @@ -110,17 +110,18 @@ enum MCE_LinkErrorType {
>>  };
>>  
>>  struct machine_check_event {
>> -enum MCE_Versionversion:8;  /* 0x00 */
>> -uint8_t in_use; /* 0x01 */
>> -enum MCE_Severity   severity:8; /* 0x02 */
>> -enum MCE_Initiator  initiator:8;/* 0x03 */
>> -enum MCE_ErrorType  error_type:8;   /* 0x04 */
>> -enum MCE_Dispositiondisposition:8;  /* 0x05 */
>> -uint16_tcpu;/* 0x06 */
>> -uint64_tgpr3;   /* 0x08 */
>> -uint64_tsrr0;   /* 0x10 */
>> -uint64_tsrr1;   /* 0x18 */
>> -union { /* 0x20 */
>> +enum MCE_Versionversion:8;
>> +uint8_t in_use;
>> +enum MCE_Severity   severity:8;
>> +enum MCE_Initiator  initiator:8;
>> +enum MCE_ErrorType  error_type:8;
>> +enum MCE_Dispositiondisposition:8;
>> +uint8_t sync_error;
>> +uint16_tcpu;
>> +uint64_tgpr3;
>> +uint64_tsrr0;
>> +uint64_tsrr1;
> 
> Can you switch these to use kernel types while you're at it, ie. u8, u64 etc.

sure.

> 
>> @@ -194,6 +195,7 @@ struct mce_error_info {
>>  } u;
>>  enum MCE_Severity   severity:8;
>>  enum MCE_Initiator  initiator:8;
>> +uint8_t sync_error;
> 
> u8 here but bool later?

Will make it bool everywhere.

Thanks,
-Mahesh.



Re: [RFC PATCH 1/3] powernv/mce: reduce mce console logs to lesser lines.

2019-03-29 Thread Mahesh Jagannath Salgaonkar
On 3/29/19 5:50 AM, Michael Ellerman wrote:
> Hi Mahesh,
> 
> Thanks for doing this series.
> 
> Mahesh J Salgaonkar  writes:
>> From: Mahesh Salgaonkar 
>>
>> Also add cpu number while displaying mce log. This will help cleaner logs
>> when mce hits on multiple cpus simultaneously.
> 
> Can you include some examples of the output before and after, so it's
> easier to compare what the changes are.

Sure, will add that in the next revision.

> 
> I think you have an example in patch 3, but it would be good to have it here.
> 
>> diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
>> index b5fec1f9751a..44614462cb34 100644
>> --- a/arch/powerpc/kernel/mce.c
>> +++ b/arch/powerpc/kernel/mce.c
>> @@ -384,101 +387,100 @@ void machine_check_print_event_info(struct 
>> machine_check_event *evt,
>>  break;
>>  }
>>  
>> -printk("%s%s Machine check interrupt [%s]\n", level, sevstr,
> 
> I think I'd still like the first line at least to include "machine
> check" somewhere, I'm not sure everyone will understand what "MCE" means.

I reduced it to MCE so that I can pack in more stuff in 80 columns. But
you are right, let me see.

> 
> ...
>> +
>> +if (ea && evt->srr0 != ea)
>> +sprintf(dar_str, "DAR: %016llx ", ea);
>> +else
>> +memset(dar_str, 0, sizeof(dar_str));
> 
> Just dar_str[0] = '\0' would work wouldn't it?

Yeah, that also should be enough.

> 
>> +if (in_guest || user_mode) {
>> +printk("%sMCE: CPU%d: (%s) %s %s %s at %016llx %s[%s]\n",
>> +level, evt->cpu, sevstr,
>> +in_guest ? "Guest" : "Host",
>> +err_type, subtype, evt->srr0, dar_str,
>> +evt->disposition == MCE_DISPOSITION_RECOVERED ?
>> +"Recovered" : "Not recovered");
>> +printk("%sMCE: CPU%d: PID: %d Comm: %s\n",
>> +level, evt->cpu, current->pid, current->comm);
>> +} else {
>> +printk("%sMCE: CPU%d: (%s) Host %s %s at %016llx %s[%s]\n",
>> +level, evt->cpu, sevstr, err_type, subtype, evt->srr0,
>> +dar_str,
>> +evt->disposition == MCE_DISPOSITION_RECOVERED ?
>> +"Recovered" : "Not recovered");
>> +printk("%sMCE: CPU%d: NIP: [%016llx] %pS\n",
>> +level, evt->cpu, evt->srr0, (void *)evt->srr0);
>> +}
> 
> The first printf in the two cases is quite similar, seems like they
> could be consolidated.
> 
> I also think it'd be clearer to print the NIP on the 2nd line in all
> cases, rather than the first.

Sure, and I will then put "machine check" on 1st line like below ?

printk("%sMCE: CPU%d: machine check (%s) %s %s %s %s[%s]\n",


> 
> What about (untested) ?
> 
>   printk("%sMCE: CPU%d: (%s) %s %s %s %s[%s]\n",
>level, evt->cpu, sevstr,
>in_guest ? "Guest" : "Host",
>err_type, subtype, dar_str,
>evt->disposition == MCE_DISPOSITION_RECOVERED ?
>"Recovered" : "Not recovered");
>   
>   if (in_guest || user_mode) {
>   printk("%sMCE: CPU%d: PID: %d Comm: %s %sNIP: [%016llx]\n",
>  level, evt->cpu, current->pid, current->comm,
>  in_guest ? "Guest " : "", evt->srr0);
>   } else {
>   printk("%sMCE: CPU%d: NIP: [%016llx] %pS\n",
>   level, evt->cpu, evt->srr0, (void *)evt->srr0);
>   }

Sure, will make these changes.

Thanks,
-Mahesh.



Re: Disable kcov for slb routines.

2019-03-14 Thread Mahesh Jagannath Salgaonkar
On 3/14/19 5:13 PM, Michael Ellerman wrote:
> On Mon, 2019-03-04 at 08:25:51 UTC, Mahesh J Salgaonkar wrote:
>> From: Mahesh Salgaonkar 
>>
>> The kcov instrumentation inside SLB routines causes duplicate SLB entries
>> to be added resulting into SLB multihit machine checks.
>> Disable kcov instrumentation on slb.o
>>
>> Signed-off-by: Mahesh Salgaonkar 
>> Acked-by: Andrew Donnellan 
>> Tested-by: Satheesh Rajendran 
> 
> Applied to powerpc next, thanks.
> 
> https://git.kernel.org/powerpc/c/19d6907521b04206676741b26e05a152
> 
> cheers
> 

There was a v2 at http://patchwork.ozlabs.org/patch/1051718/, looks like
v1 got picked up. But I see the applied commit does address Andrew's
comments.

Thanks,
-Mahesh.



Re: [PATCH] powerpc/fadump: re-register firmware-assisted dump if already registered

2018-09-18 Thread Mahesh Jagannath Salgaonkar
On 09/14/2018 07:36 PM, Hari Bathini wrote:
> Firmware-Assisted Dump (FADump) needs to be registered again after any
> memory hot add/remove operation to update the crash memory ranges. But
> currently, the kernel returns '-EEXIST' if we try to register without
> uregistering it first. This could expose the system to racing issues
> while unregistering and registering FADump from userspace during udev
> events. Spare the userspace of this and let it be taken care of in the
> kernel space for a simpler interface.
> 
> Since this change, running 'echo 1 > /sys/kernel/fadump_registered'
> would result in re-registering (unregistering and registering) FADump,
> if it was already registered.
> 
> Signed-off-by: Hari Bathini 

Looks good to me.

Acked-by: Mahesh Salgaonkar 

Thanks,
-Mahesh.

> ---
>  arch/powerpc/kernel/fadump.c |4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
> index a711d22..761b28b 100644
> --- a/arch/powerpc/kernel/fadump.c
> +++ b/arch/powerpc/kernel/fadump.c
> @@ -1444,8 +1444,8 @@ static ssize_t fadump_register_store(struct kobject 
> *kobj,
>   break;
>   case 1:
>   if (fw_dump.dump_registered == 1) {
> - ret = -EEXIST;
> - goto unlock_out;
> + /* Un-register Firmware-assisted dump */
> + fadump_unregister_dump();
>   }
>   /* Register Firmware-assisted dump */
>   ret = register_fadump();
> 



Re: [PATCH v8 5/5] powernv/pseries: consolidate code for mce early handling.

2018-08-27 Thread Mahesh Jagannath Salgaonkar
On 08/23/2018 02:32 PM, Nicholas Piggin wrote:
> On Thu, 23 Aug 2018 14:13:13 +0530
> Mahesh Jagannath Salgaonkar  wrote:
> 
>> On 08/20/2018 05:04 PM, Nicholas Piggin wrote:
>>> On Sun, 19 Aug 2018 22:38:39 +0530
>>> Mahesh J Salgaonkar  wrote:
>>>   
>>>> From: Mahesh Salgaonkar 
>>>>
>>>> Now that other platforms also implements real mode mce handler,
>>>> lets consolidate the code by sharing existing powernv machine check
>>>> early code. Rename machine_check_powernv_early to
>>>> machine_check_common_early and reuse the code.
>>>>
>>>> Signed-off-by: Mahesh Salgaonkar 
>>>> ---
>>>>  arch/powerpc/kernel/exceptions-64s.S |  155 
>>>> ++
>>>>  1 file changed, 28 insertions(+), 127 deletions(-)
>>>>
>>>> diff --git a/arch/powerpc/kernel/exceptions-64s.S 
>>>> b/arch/powerpc/kernel/exceptions-64s.S
>>>> index 12f056179112..2f85a7baf026 100644
>>>> --- a/arch/powerpc/kernel/exceptions-64s.S
>>>> +++ b/arch/powerpc/kernel/exceptions-64s.S
>>>> @@ -243,14 +243,13 @@ EXC_REAL_BEGIN(machine_check, 0x200, 0x100)
>>>>SET_SCRATCH0(r13)   /* save r13 */
>>>>EXCEPTION_PROLOG_0(PACA_EXMC)
>>>>  BEGIN_FTR_SECTION
>>>> -  b   machine_check_powernv_early
>>>> +  b   machine_check_common_early
>>>>  FTR_SECTION_ELSE
>>>>b   machine_check_pSeries_0
>>>>  ALT_FTR_SECTION_END_IFSET(CPU_FTR_HVMODE)
>>>>  EXC_REAL_END(machine_check, 0x200, 0x100)
>>>>  EXC_VIRT_NONE(0x4200, 0x100)
>>>> -TRAMP_REAL_BEGIN(machine_check_powernv_early)
>>>> -BEGIN_FTR_SECTION
>>>> +TRAMP_REAL_BEGIN(machine_check_common_early)
>>>>EXCEPTION_PROLOG_1(PACA_EXMC, NOTEST, 0x200)
>>>>/*
>>>> * Register contents:
>>>> @@ -306,7 +305,9 @@ BEGIN_FTR_SECTION
>>>>/* Save r9 through r13 from EXMC save area to stack frame. */
>>>>EXCEPTION_PROLOG_COMMON_2(PACA_EXMC)
>>>>mfmsr   r11 /* get MSR value */
>>>> +BEGIN_FTR_SECTION
>>>>ori r11,r11,MSR_ME  /* turn on ME bit */
>>>> +END_FTR_SECTION_IFSET(CPU_FTR_HVMODE)
>>>>ori r11,r11,MSR_RI  /* turn on RI bit */
>>>>LOAD_HANDLER(r12, machine_check_handle_early)
>>>>  1:mtspr   SPRN_SRR0,r12
>>>> @@ -325,7 +326,6 @@ BEGIN_FTR_SECTION
>>>>andcr11,r11,r10 /* Turn off MSR_ME */
>>>>b   1b
>>>>b   .   /* prevent speculative execution */
>>>> -END_FTR_SECTION_IFSET(CPU_FTR_HVMODE)
>>>>  
>>>>  TRAMP_REAL_BEGIN(machine_check_pSeries)
>>>>.globl machine_check_fwnmi
>>>> @@ -333,7 +333,7 @@ machine_check_fwnmi:
>>>>SET_SCRATCH0(r13)   /* save r13 */
>>>>EXCEPTION_PROLOG_0(PACA_EXMC)
>>>>  BEGIN_FTR_SECTION
>>>> -  b   machine_check_pSeries_early
>>>> +  b   machine_check_common_early
>>>>  END_FTR_SECTION_IFCLR(CPU_FTR_HVMODE)
>>>>  machine_check_pSeries_0:
>>>>EXCEPTION_PROLOG_1(PACA_EXMC, KVMTEST_PR, 0x200)
>>>> @@ -346,103 +346,6 @@ machine_check_pSeries_0:
>>>>  
>>>>  TRAMP_KVM_SKIP(PACA_EXMC, 0x200)
>>>>  
>>>> -TRAMP_REAL_BEGIN(machine_check_pSeries_early)
>>>> -BEGIN_FTR_SECTION
>>>> -  EXCEPTION_PROLOG_1(PACA_EXMC, NOTEST, 0x200)
>>>> -  mr  r10,r1  /* Save r1 */
>>>> -  lhz r11,PACA_IN_MCE(r13)
>>>> -  cmpwi   r11,0   /* Are we in nested machine check */
>>>> -  bne 0f  /* Yes, we are. */
>>>> -  /* First machine check entry */
>>>> -  ld  r1,PACAMCEMERGSP(r13)   /* Use MC emergency stack */
>>>> -0:subir1,r1,INT_FRAME_SIZE/* alloc stack frame */
>>>> -  addir11,r11,1   /* increment paca->in_mce */
>>>> -  sth r11,PACA_IN_MCE(r13)
>>>> -  /* Limit nested MCE to level 4 to avoid stack overflow */
>>>> -  cmpwi   r11,MAX_MCE_DEPTH
>>>> -  bgt 1f  /* Check if we hit limit of 4 */
>>>> -  mfspr   r11,SPRN_SRR0   /* Save SRR0 */
>>>> -  mfspr   r12,SPRN_SRR1   /* Save SRR1 */
>>>> -  EXCEPTION_PROLOG_COMMON_1()
&g

Re: [RESEND PATCH v2] powerpc/mce: Fix SLB rebolting during MCE recovery path.

2018-08-23 Thread Mahesh Jagannath Salgaonkar
On 08/23/2018 05:35 PM, Michael Ellerman wrote:
> Mahesh Jagannath Salgaonkar  writes:
> 
>> On 08/23/2018 12:14 PM, Michael Ellerman wrote:
>>> Mahesh J Salgaonkar  writes:
>>>
>>>> From: Mahesh Salgaonkar 
>>>>
>>>> With the powerpc next commit e7e81847478 (powerpc/mce: Fix SLB rebolting
>>>> during MCE recovery path.),
>>>
>>> That commit description is wrong, I'll fix it up.
>>
>> Ouch.. My bad.. :-(
> 
> To make it easier to get right, if you don't already, add these to your
> ~/.gitconfig:
> 
> [pretty]
>   fixes = Fixes: %h (\"%s\")
>   quote = %h (\"%s\")
> 
> 
> And then you can do:
> 
> $ git log -1 --pretty=quote e7e81847478 
> e7e81847478b ("powerpc/64s: move machine check SLB flushing to mm/slb.c")
> 
> $ git log -1 --pretty=fixes e7e81847478 
> Fixes: e7e81847478b ("powerpc/64s: move machine check SLB flushing to 
> mm/slb.c")

Thank you very much :-) This is going to be very handy...

-Mahesh.



Re: [PATCH v8 5/5] powernv/pseries: consolidate code for mce early handling.

2018-08-23 Thread Mahesh Jagannath Salgaonkar
On 08/20/2018 05:04 PM, Nicholas Piggin wrote:
> On Sun, 19 Aug 2018 22:38:39 +0530
> Mahesh J Salgaonkar  wrote:
> 
>> From: Mahesh Salgaonkar 
>>
>> Now that other platforms also implements real mode mce handler,
>> lets consolidate the code by sharing existing powernv machine check
>> early code. Rename machine_check_powernv_early to
>> machine_check_common_early and reuse the code.
>>
>> Signed-off-by: Mahesh Salgaonkar 
>> ---
>>  arch/powerpc/kernel/exceptions-64s.S |  155 
>> ++
>>  1 file changed, 28 insertions(+), 127 deletions(-)
>>
>> diff --git a/arch/powerpc/kernel/exceptions-64s.S 
>> b/arch/powerpc/kernel/exceptions-64s.S
>> index 12f056179112..2f85a7baf026 100644
>> --- a/arch/powerpc/kernel/exceptions-64s.S
>> +++ b/arch/powerpc/kernel/exceptions-64s.S
>> @@ -243,14 +243,13 @@ EXC_REAL_BEGIN(machine_check, 0x200, 0x100)
>>  SET_SCRATCH0(r13)   /* save r13 */
>>  EXCEPTION_PROLOG_0(PACA_EXMC)
>>  BEGIN_FTR_SECTION
>> -b   machine_check_powernv_early
>> +b   machine_check_common_early
>>  FTR_SECTION_ELSE
>>  b   machine_check_pSeries_0
>>  ALT_FTR_SECTION_END_IFSET(CPU_FTR_HVMODE)
>>  EXC_REAL_END(machine_check, 0x200, 0x100)
>>  EXC_VIRT_NONE(0x4200, 0x100)
>> -TRAMP_REAL_BEGIN(machine_check_powernv_early)
>> -BEGIN_FTR_SECTION
>> +TRAMP_REAL_BEGIN(machine_check_common_early)
>>  EXCEPTION_PROLOG_1(PACA_EXMC, NOTEST, 0x200)
>>  /*
>>   * Register contents:
>> @@ -306,7 +305,9 @@ BEGIN_FTR_SECTION
>>  /* Save r9 through r13 from EXMC save area to stack frame. */
>>  EXCEPTION_PROLOG_COMMON_2(PACA_EXMC)
>>  mfmsr   r11 /* get MSR value */
>> +BEGIN_FTR_SECTION
>>  ori r11,r11,MSR_ME  /* turn on ME bit */
>> +END_FTR_SECTION_IFSET(CPU_FTR_HVMODE)
>>  ori r11,r11,MSR_RI  /* turn on RI bit */
>>  LOAD_HANDLER(r12, machine_check_handle_early)
>>  1:  mtspr   SPRN_SRR0,r12
>> @@ -325,7 +326,6 @@ BEGIN_FTR_SECTION
>>  andcr11,r11,r10 /* Turn off MSR_ME */
>>  b   1b
>>  b   .   /* prevent speculative execution */
>> -END_FTR_SECTION_IFSET(CPU_FTR_HVMODE)
>>  
>>  TRAMP_REAL_BEGIN(machine_check_pSeries)
>>  .globl machine_check_fwnmi
>> @@ -333,7 +333,7 @@ machine_check_fwnmi:
>>  SET_SCRATCH0(r13)   /* save r13 */
>>  EXCEPTION_PROLOG_0(PACA_EXMC)
>>  BEGIN_FTR_SECTION
>> -b   machine_check_pSeries_early
>> +b   machine_check_common_early
>>  END_FTR_SECTION_IFCLR(CPU_FTR_HVMODE)
>>  machine_check_pSeries_0:
>>  EXCEPTION_PROLOG_1(PACA_EXMC, KVMTEST_PR, 0x200)
>> @@ -346,103 +346,6 @@ machine_check_pSeries_0:
>>  
>>  TRAMP_KVM_SKIP(PACA_EXMC, 0x200)
>>  
>> -TRAMP_REAL_BEGIN(machine_check_pSeries_early)
>> -BEGIN_FTR_SECTION
>> -EXCEPTION_PROLOG_1(PACA_EXMC, NOTEST, 0x200)
>> -mr  r10,r1  /* Save r1 */
>> -lhz r11,PACA_IN_MCE(r13)
>> -cmpwi   r11,0   /* Are we in nested machine check */
>> -bne 0f  /* Yes, we are. */
>> -/* First machine check entry */
>> -ld  r1,PACAMCEMERGSP(r13)   /* Use MC emergency stack */
>> -0:  subir1,r1,INT_FRAME_SIZE/* alloc stack frame */
>> -addir11,r11,1   /* increment paca->in_mce */
>> -sth r11,PACA_IN_MCE(r13)
>> -/* Limit nested MCE to level 4 to avoid stack overflow */
>> -cmpwi   r11,MAX_MCE_DEPTH
>> -bgt 1f  /* Check if we hit limit of 4 */
>> -mfspr   r11,SPRN_SRR0   /* Save SRR0 */
>> -mfspr   r12,SPRN_SRR1   /* Save SRR1 */
>> -EXCEPTION_PROLOG_COMMON_1()
>> -EXCEPTION_PROLOG_COMMON_2(PACA_EXMC)
>> -EXCEPTION_PROLOG_COMMON_3(0x200)
>> -addir3,r1,STACK_FRAME_OVERHEAD
>> -BRANCH_LINK_TO_FAR(machine_check_early) /* Function call ABI */
>> -ld  r12,_MSR(r1)
>> -andi.   r11,r12,MSR_PR  /* See if coming from user. */
>> -bne 2f  /* continue in V mode if we are. */
>> -
>> -/*
>> - * At this point we are not sure about what context we come from.
>> - * We may be in the middle of swithing stack. r1 may not be valid.
>> - * Hence stay on emergency stack, call machine_check_exception and
>> - * return from the interrupt.
>> - * But before that, check if this is an un-recoverable exception.
>> - * If yes, then stay on emergency stack and panic.
>> - */
>> -andi.   r11,r12,MSR_RI
>> -beq 1f
>> -
>> -/*
>> - * Check if we have successfully handled/recovered from error, if not
>> - * then stay on emergency stack and panic.
>> - */
>> -cmpdi   r3,0/* see if we handled MCE successfully */
>> -beq 1f  /* if !handled then panic */
>> -
>> -/* Stay on emergency stack and return from interrupt. */
>> -LOAD_HANDLER(r10,mce_return)
>> -mtspr   SPRN_SRR0,r10
>> -ld  

Re: [RESEND PATCH v2] powerpc/mce: Fix SLB rebolting during MCE recovery path.

2018-08-23 Thread Mahesh Jagannath Salgaonkar
On 08/23/2018 12:14 PM, Michael Ellerman wrote:
> Mahesh J Salgaonkar  writes:
> 
>> From: Mahesh Salgaonkar 
>>
>> With the powerpc next commit e7e81847478 (powerpc/mce: Fix SLB rebolting
>> during MCE recovery path.),
> 
> That commit description is wrong, I'll fix it up.

Ouch.. My bad.. :-(

> 
> cheers
> 
>> the SLB error recovery is broken. The new
>> change now does not add the index value to RB[52-63] that selects the SLB
>> entry while rebolting; instead it assumes that the shadow save area
>> already has the index embedded correctly in the esid field. While all valid
>> bolted save areas do contain the index value set correctly, there is a case
>> where the 3rd (KSTACK_INDEX) entry for the kernel stack does not embed the
>> index for the NULL esid entry. This patch fixes that.
>>
>> Without this patch the SLB rebolt code overwrites the 1st entry of kernel
>> linear mapping and causes SLB recovery to fail.
>>
>> Signed-off-by: Mahesh Salgaonkar 
>> Signed-off-by: Nicholas Piggin 
>> Reviewed-by: Nicholas Piggin 
>> ---
>>  arch/powerpc/mm/slb.c |2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/mm/slb.c b/arch/powerpc/mm/slb.c
>> index 0b095fa54049..9f574e59d178 100644
>> --- a/arch/powerpc/mm/slb.c
>> +++ b/arch/powerpc/mm/slb.c
>> @@ -70,7 +70,7 @@ static inline void slb_shadow_update(unsigned long ea, int 
>> ssize,
>>  
>>  static inline void slb_shadow_clear(enum slb_index index)
>>  {
>> -WRITE_ONCE(get_slb_shadow()->save_area[index].esid, 0);
>> +WRITE_ONCE(get_slb_shadow()->save_area[index].esid, cpu_to_be64(index));
>>  }
>>  
>>  static inline void create_shadowed_slbe(unsigned long ea, int ssize,
> 



Re: [PATCH v2] poewrpc/mce: Fix SLB rebolting during MCE recovery path.

2018-08-23 Thread Mahesh Jagannath Salgaonkar
On 08/23/2018 10:26 AM, Mahesh J Salgaonkar wrote:
> From: Mahesh Salgaonkar 
> 
> With the powrpc next commit e7e81847478 (poewrpc/mce: Fix SLB rebolting
> during MCE recovery path.), the SLB error recovery is broken. The new
> change now does not add index value to RB[52-63] that selects the SLB
> entry while rebolting, instead it assumes that the shadow save area
> already have index embeded correctly in esid field. While all valid bolted
> save areas do contain index value set correctly, there is a case where
> 3rd (KSTACK_INDEX) entry for kernel stack does not embed index for NULL
> esid entry. This patch fixes that.
> 
> Without this patch the SLB rebolt code overwirtes the 1st entry of kernel
> linear mapping and causes SLB recovery to fail.
> 
> Signed-off-by: Mahesh Salgaonkar 
> Signed-off-by: Nicholas Piggin 
> Reviewed-by: Nicholas Piggin 

Ignore this patch.. There are a few spelling mistakes in this patch.. will
resend v2 again after fixing those.

Thanks,
-Mahesh.



Re: [PATCH] poewrpc/mce: Fix SLB rebolting during MCE recovery path.

2018-08-22 Thread Mahesh Jagannath Salgaonkar
On 08/21/2018 03:57 PM, Nicholas Piggin wrote:
> On Fri, 17 Aug 2018 14:51:47 +0530
> Mahesh J Salgaonkar  wrote:
> 
>> From: Mahesh Salgaonkar 
>>
>> With the powrpc next commit e7e81847478 (poewrpc/mce: Fix SLB rebolting
>> during MCE recovery path.), the SLB error recovery is broken. The
>> commit missed a crucial change of OR-ing index value to RB[52-63] which
>> selects the SLB entry while rebolting. This patch fixes that.
>>
>> Signed-off-by: Mahesh Salgaonkar 
>> Reviewed-by: Nicholas Piggin 
>> ---
>>  arch/powerpc/mm/slb.c |5 -
>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/mm/slb.c b/arch/powerpc/mm/slb.c
>> index 0b095fa54049..6dd9913425bc 100644
>> --- a/arch/powerpc/mm/slb.c
>> +++ b/arch/powerpc/mm/slb.c
>> @@ -101,9 +101,12 @@ void __slb_restore_bolted_realmode(void)
>>  
>>   /* No isync needed because realmode. */
>>  for (index = 0; index < SLB_NUM_BOLTED; index++) {
>> +unsigned long rb = be64_to_cpu(p->save_area[index].esid);
>> +
>> +rb = (rb & ~0xFFFul) | index;
>>  asm volatile("slbmte  %0,%1" :
>>   : "r" (be64_to_cpu(p->save_area[index].vsid)),
>> -   "r" (be64_to_cpu(p->save_area[index].esid)));
>> +   "r" (rb));
>>  }
>>  }
>>  
>>
> 
> I'm just looking at this again. The bolted save areas do have the
> index field set. So for the OS, your patch should be equivalent to
> this, right?
> 
>  static inline void slb_shadow_clear(enum slb_index index)
>  {
> -   WRITE_ONCE(get_slb_shadow()->save_area[index].esid, 0);
> +   WRITE_ONCE(get_slb_shadow()->save_area[index].esid, index);
>  }
> 
> Which seems like a better fix.

Yeah, this also fixes the issue. The only additional change required is
cpu_to_be64(index). As long as we maintain the index in the bolted save
areas (for both valid and invalid entries) we should be OK. Will respin v2
with this change.

Thanks,
-Mahesh.



Re: [PATCH v7 4/9] powerpc/pseries: Define MCE error event section.

2018-08-17 Thread Mahesh Jagannath Salgaonkar
On 08/16/2018 09:44 AM, Michael Ellerman wrote:
> Mahesh Jagannath Salgaonkar  writes:
>> On 08/08/2018 08:12 PM, Michael Ellerman wrote:
> ...
>>>
>>>> +  union {
>>>> +  struct {
>>>> +  uint8_t ue_err_type;
>>>> +  /* 
>>>> +   * X        1: Permanent or Transient UE.
>>>> +   *  X       1: Effective address provided.
>>>> +   *   X      1: Logical address provided.
>>>> +   *    XX    2: Reserved.
>>>> +   *      XXX 3: Type of UE error.
>>>> +   */
>>>
>>> But which bit is bit 0? And is that the LSB or MSB?
>>
>> RTAS error log data is in BE format; the leftmost bit is MSB 0 (1: Permanent
>> or Transient UE). I will update the comment above so that it properly
>> points out which bit is MSB 0.
>>
>>>
>>>
>>>> +  uint8_t reserved_1[6];
>>>> +  __be64  effective_address;
>>>> +  __be64  logical_address;
>>>> +  } ue_error;
>>>> +  struct {
>>>> +  uint8_t soft_err_type;
>>>> +  /* 
>>>> +   * X        1: Effective address provided.
>>>> +   *  X       5: Reserved.
>>>> +   *       XX 2: Type of SLB/ERAT/TLB error.
>>>> +   */
>>>> +  uint8_t reserved_1[6];
>>>> +  __be64  effective_address;
>>>> +  uint8_t reserved_2[8];
>>>> +  } soft_error;
>>>> +  } u;
>>>> +};
>>>> +#pragma pack(pop)
>>>
>>> Why not __packed ?
>>
>> Because when __packed was used, it added 1 extra byte of padding between
>> reserved_1[6] and effective_address. That caused the wrong effective address
>> to be printed on the console. Hence I switched to #pragma pack to force
>> 1-byte alignment for this structure alone.
> 
> OK, that's weird.
> 
> Do we really need to bother with all the union stuff? The only
> difference is the field names, and whether logical address has a value

Also, the bit fields for UE and the other sub-errors differ. Yeah, but we
can do away with the union stuff.

> or not. What about:
> 
> struct pseries_mc_errorlog {
>   __be32  fru_id;
>   __be32  proc_id;
>   u8  error_type;
>   u8  sub_error_type;
>   u8  reserved_1[6];
>   __be64  effective_address;
>   __be64  logical_address;
> } __packed;

Sure will do.

Thanks
-Mahesh.

> 
> cheers
> 



Re: [PATCH v7 7/9] powerpc/pseries: Dump the SLB contents on SLB MCE errors.

2018-08-14 Thread Mahesh Jagannath Salgaonkar
On 08/13/2018 07:57 PM, Nicholas Piggin wrote:
> On Mon, 13 Aug 2018 09:47:04 +0530
> Mahesh Jagannath Salgaonkar  wrote:
> 
>> On 08/11/2018 10:03 AM, Nicholas Piggin wrote:
>>> On Tue, 07 Aug 2018 19:47:39 +0530
>>> Mahesh J Salgaonkar  wrote:
>>>   
>>>> From: Mahesh Salgaonkar 
>>>>
>>>> If we get a machine check exceptions due to SLB errors then dump the
>>>> current SLB contents which will be very much helpful in debugging the
>>>> root cause of SLB errors. Introduce an exclusive buffer per cpu to hold
>>>> faulty SLB entries. In real mode mce handler saves the old SLB contents
>>>> into this buffer accessible through paca and print it out later in virtual
>>>> mode.
>>>>
>>>> With this patch the console will log SLB contents like below on SLB MCE
>>>> errors:
>>>>
>>>> [  507.297236] SLB contents of cpu 0x1
>>>> [  507.297237] Last SLB entry inserted at slot 16
>>>> [  507.297238] 00 c800 400ea1b217000500
>>>> [  507.297239]   1T  ESID=   c0  VSID=  ea1b217 LLP:100
>>>> [  507.297240] 01 d800 400d43642f000510
>>>> [  507.297242]   1T  ESID=   d0  VSID=  d43642f LLP:110
>>>> [  507.297243] 11 f800 400a86c85f000500
>>>> [  507.297244]   1T  ESID=   f0  VSID=  a86c85f LLP:100
>>>> [  507.297245] 12 7f000800 4008119624000d90
>>>> [  507.297246]   1T  ESID=   7f  VSID=  8119624 LLP:110
>>>> [  507.297247] 13 1800 00092885f5150d90
>>>> [  507.297247]  256M ESID=1  VSID=   92885f5150 LLP:110
>>>> [  507.297248] 14 01000800 4009e7cb5d90
>>>> [  507.297249]   1T  ESID=1  VSID=  9e7cb50 LLP:110
>>>> [  507.297250] 15 d800 400d43642f000510
>>>> [  507.297251]   1T  ESID=   d0  VSID=  d43642f LLP:110
>>>> [  507.297252] 16 d800 400d43642f000510
>>>> [  507.297253]   1T  ESID=   d0  VSID=  d43642f LLP:110
>>>> [  507.297253] --
>>>> [  507.297254] SLB cache ptr value = 3
>>>> [  507.297254] Valid SLB cache entries:
>>>> [  507.297255] 00 EA[0-35]=7f000
>>>> [  507.297256] 01 EA[0-35]=1
>>>> [  507.297257] 02 EA[0-35]= 1000
>>>> [  507.297257] Rest of SLB cache entries:
>>>> [  507.297258] 03 EA[0-35]=7f000
>>>> [  507.297258] 04 EA[0-35]=1
>>>> [  507.297259] 05 EA[0-35]= 1000
>>>> [  507.297260] 06 EA[0-35]=   12
>>>> [  507.297260] 07 EA[0-35]=7f000
>>>>
>>>> Suggested-by: Aneesh Kumar K.V 
>>>> Suggested-by: Michael Ellerman 
>>>> Signed-off-by: Mahesh Salgaonkar 
>>>> ---
>>>>
>>>> Changes in V7:
>>>> - Print slb cache ptr value and slb cache data
>>>> ---
>>>>  arch/powerpc/include/asm/book3s/64/mmu-hash.h |    7 ++
>>>>  arch/powerpc/include/asm/paca.h               |    4 +
>>>>  arch/powerpc/mm/slb.c                         |   73 +++++++++++++++++
>>>>  arch/powerpc/platforms/pseries/ras.c          |   10 +++
>>>>  arch/powerpc/platforms/pseries/setup.c        |   10 +++
>>>>  5 files changed, 103 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h 
>>>> b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
>>>> index cc00a7088cf3..5a3fe282076d 100644
>>>> --- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
>>>> +++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
>>>> @@ -485,9 +485,16 @@ static inline void hpte_init_pseries(void) { }
>>>>  
>>>>  extern void hpte_init_native(void);
>>>>  
>>>> +struct slb_entry {
>>>> +  u64 esid;
>>>> +  u64 vsid;
>>>> +};
>>>> +
>>>>  extern void slb_initialize(void);
>>>>  extern void slb_flush_and_rebolt(void);
>>>>  extern void slb_flush_and_rebolt_realmode(void);
>>>> +extern void slb_save_contents(struct slb_entry *slb_ptr);
>>>> +extern void slb_dump_contents(struct slb_entry *slb_ptr);
>>>>  
>>>>  extern void slb_vmalloc_update(void);
>>>>  extern void slb_set_size(u16 size);
>>>> diff --git a/arch/powerpc/include/asm/paca.h 
>>>> b/arch/powerpc/inclu

Re: [PATCH v2 1/2] powerpc/64s: move machine check SLB flushing to mm/slb.c

2018-08-12 Thread Mahesh Jagannath Salgaonkar
On 08/10/2018 12:12 PM, Nicholas Piggin wrote:
> The machine check code that flushes and restores bolted segments in
> real mode belongs in mm/slb.c. This will also be used by pseries
> machine check and idle code in future changes.
> 
> Signed-off-by: Nicholas Piggin 
> 
> Since v1:
> - Restore the test for slb_shadow (mpe)
> ---
>  arch/powerpc/include/asm/book3s/64/mmu-hash.h |  3 ++
>  arch/powerpc/kernel/mce_power.c   | 26 +
>  arch/powerpc/mm/slb.c | 39 +++
>  3 files changed, 51 insertions(+), 17 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h 
> b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> index 2f74bdc805e0..d4e398185b3a 100644
> --- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> +++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> @@ -497,6 +497,9 @@ extern void hpte_init_native(void);
> 
>  extern void slb_initialize(void);
>  extern void slb_flush_and_rebolt(void);
> +extern void slb_flush_all_realmode(void);
> +extern void __slb_restore_bolted_realmode(void);
> +extern void slb_restore_bolted_realmode(void);
> 
>  extern void slb_vmalloc_update(void);
>  extern void slb_set_size(u16 size);
> diff --git a/arch/powerpc/kernel/mce_power.c b/arch/powerpc/kernel/mce_power.c
> index d6756af6ec78..3497c8329c1d 100644
> --- a/arch/powerpc/kernel/mce_power.c
> +++ b/arch/powerpc/kernel/mce_power.c
> @@ -62,11 +62,8 @@ static unsigned long addr_to_pfn(struct pt_regs *regs, 
> unsigned long addr)
>  #ifdef CONFIG_PPC_BOOK3S_64
>  static void flush_and_reload_slb(void)
>  {
> - struct slb_shadow *slb;
> - unsigned long i, n;
> -
>   /* Invalidate all SLBs */
> - asm volatile("slbmte %0,%0; slbia" : : "r" (0));
> + slb_flush_all_realmode();
> 
>  #ifdef CONFIG_KVM_BOOK3S_HANDLER
>   /*
> @@ -76,22 +73,17 @@ static void flush_and_reload_slb(void)
>   if (get_paca()->kvm_hstate.in_guest)
>   return;
>  #endif
> -
> - /* For host kernel, reload the SLBs from shadow SLB buffer. */
> - slb = get_slb_shadow();
> - if (!slb)
> + if (early_radix_enabled())
>   return;

Would we ever get an MCE for SLB errors when radix is enabled?

> 
> - n = min_t(u32, be32_to_cpu(slb->persistent), SLB_MIN_SIZE);
> -
> - /* Load up the SLB entries from shadow SLB */
> - for (i = 0; i < n; i++) {
> - unsigned long rb = be64_to_cpu(slb->save_area[i].esid);
> - unsigned long rs = be64_to_cpu(slb->save_area[i].vsid);
> + /*
> +  * This probably shouldn't happen, but it may be possible it's
> +  * called in early boot before SLB shadows are allocated.
> +  */
> + if (!get_slb_shadow())
> + return;

Any reason you added the above check here instead of in mm/slb.c? Should we
move it inside slb_restore_bolted_realmode()? I guess mm/slb.c is the right
place for this check. That would also save the pseries machine check code
from having to perform this extra check explicitly.

Thanks,
-Mahesh.

> 
> - rb = (rb & ~0xFFFul) | i;
> - asm volatile("slbmte %0,%1" : : "r" (rs), "r" (rb));
> - }
> + slb_restore_bolted_realmode();
>  }
>  #endif
> 
> diff --git a/arch/powerpc/mm/slb.c b/arch/powerpc/mm/slb.c
> index cb796724a6fc..0b095fa54049 100644
> --- a/arch/powerpc/mm/slb.c
> +++ b/arch/powerpc/mm/slb.c
> @@ -90,6 +90,45 @@ static inline void create_shadowed_slbe(unsigned long ea, 
> int ssize,
>: "memory" );
>  }
> 
> +/*
> + * Insert bolted entries into SLB (which may not be empty, so don't clear
> + * slb_cache_ptr).
> + */
> +void __slb_restore_bolted_realmode(void)
> +{
> + struct slb_shadow *p = get_slb_shadow();
> + enum slb_index index;
> +
> +  /* No isync needed because realmode. */
> + for (index = 0; index < SLB_NUM_BOLTED; index++) {
> + asm volatile("slbmte  %0,%1" :
> +  : "r" (be64_to_cpu(p->save_area[index].vsid)),
> +"r" (be64_to_cpu(p->save_area[index].esid)));
> + }
> +}
> +
> +/*
> + * Insert the bolted entries into an empty SLB.
> + * This is not the same as rebolt because the bolted segments are not
> + * changed, just loaded from the shadow area.
> + */
> +void slb_restore_bolted_realmode(void)
> +{
> + __slb_restore_bolted_realmode();
> + get_paca()->slb_cache_ptr = 0;
> +}
> +
> +/*
> + * This flushes all SLB entries including 0, so it must be realmode.
> + */
> +void slb_flush_all_realmode(void)
> +{
> + /*
> +  * This flushes all SLB entries including 0, so it must be realmode.
> +  */
> + asm volatile("slbmte %0,%0; slbia" : : "r" (0));
> +}
> +
>  static void __slb_flush_and_rebolt(void)
>  {
>   /* If you change this make sure you change SLB_NUM_BOLTED
> 



Re: [PATCH v7 7/9] powerpc/pseries: Dump the SLB contents on SLB MCE errors.

2018-08-12 Thread Mahesh Jagannath Salgaonkar
On 08/11/2018 10:03 AM, Nicholas Piggin wrote:
> On Tue, 07 Aug 2018 19:47:39 +0530
> Mahesh J Salgaonkar  wrote:
> 
>> From: Mahesh Salgaonkar 
>>
>> If we get a machine check exceptions due to SLB errors then dump the
>> current SLB contents which will be very much helpful in debugging the
>> root cause of SLB errors. Introduce an exclusive buffer per cpu to hold
>> faulty SLB entries. In real mode mce handler saves the old SLB contents
>> into this buffer accessible through paca and print it out later in virtual
>> mode.
>>
>> With this patch the console will log SLB contents like below on SLB MCE
>> errors:
>>
>> [  507.297236] SLB contents of cpu 0x1
>> [  507.297237] Last SLB entry inserted at slot 16
>> [  507.297238] 00 c800 400ea1b217000500
>> [  507.297239]   1T  ESID=   c0  VSID=  ea1b217 LLP:100
>> [  507.297240] 01 d800 400d43642f000510
>> [  507.297242]   1T  ESID=   d0  VSID=  d43642f LLP:110
>> [  507.297243] 11 f800 400a86c85f000500
>> [  507.297244]   1T  ESID=   f0  VSID=  a86c85f LLP:100
>> [  507.297245] 12 7f000800 4008119624000d90
>> [  507.297246]   1T  ESID=   7f  VSID=  8119624 LLP:110
>> [  507.297247] 13 1800 00092885f5150d90
>> [  507.297247]  256M ESID=1  VSID=   92885f5150 LLP:110
>> [  507.297248] 14 01000800 4009e7cb5d90
>> [  507.297249]   1T  ESID=1  VSID=  9e7cb50 LLP:110
>> [  507.297250] 15 d800 400d43642f000510
>> [  507.297251]   1T  ESID=   d0  VSID=  d43642f LLP:110
>> [  507.297252] 16 d800 400d43642f000510
>> [  507.297253]   1T  ESID=   d0  VSID=  d43642f LLP:110
>> [  507.297253] --
>> [  507.297254] SLB cache ptr value = 3
>> [  507.297254] Valid SLB cache entries:
>> [  507.297255] 00 EA[0-35]=7f000
>> [  507.297256] 01 EA[0-35]=1
>> [  507.297257] 02 EA[0-35]= 1000
>> [  507.297257] Rest of SLB cache entries:
>> [  507.297258] 03 EA[0-35]=7f000
>> [  507.297258] 04 EA[0-35]=1
>> [  507.297259] 05 EA[0-35]= 1000
>> [  507.297260] 06 EA[0-35]=   12
>> [  507.297260] 07 EA[0-35]=7f000
>>
>> Suggested-by: Aneesh Kumar K.V 
>> Suggested-by: Michael Ellerman 
>> Signed-off-by: Mahesh Salgaonkar 
>> ---
>>
>> Changes in V7:
>> - Print slb cache ptr value and slb cache data
>> ---
>>  arch/powerpc/include/asm/book3s/64/mmu-hash.h |    7 ++
>>  arch/powerpc/include/asm/paca.h               |    4 +
>>  arch/powerpc/mm/slb.c                         |   73 +++++++++++++++++
>>  arch/powerpc/platforms/pseries/ras.c          |   10 +++
>>  arch/powerpc/platforms/pseries/setup.c        |   10 +++
>>  5 files changed, 103 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h 
>> b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
>> index cc00a7088cf3..5a3fe282076d 100644
>> --- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
>> +++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
>> @@ -485,9 +485,16 @@ static inline void hpte_init_pseries(void) { }
>>  
>>  extern void hpte_init_native(void);
>>  
>> +struct slb_entry {
>> +u64 esid;
>> +u64 vsid;
>> +};
>> +
>>  extern void slb_initialize(void);
>>  extern void slb_flush_and_rebolt(void);
>>  extern void slb_flush_and_rebolt_realmode(void);
>> +extern void slb_save_contents(struct slb_entry *slb_ptr);
>> +extern void slb_dump_contents(struct slb_entry *slb_ptr);
>>  
>>  extern void slb_vmalloc_update(void);
>>  extern void slb_set_size(u16 size);
>> diff --git a/arch/powerpc/include/asm/paca.h 
>> b/arch/powerpc/include/asm/paca.h
>> index 7f22929ce915..233d25ff6f64 100644
>> --- a/arch/powerpc/include/asm/paca.h
>> +++ b/arch/powerpc/include/asm/paca.h
>> @@ -254,6 +254,10 @@ struct paca_struct {
>>  #endif
>>  #ifdef CONFIG_PPC_PSERIES
>>  u8 *mce_data_buf;   /* buffer to hold per cpu rtas errlog */
>> +
>> +/* Capture SLB related old contents in MCE handler. */
>> +struct slb_entry *mce_faulty_slbs;
>> +u16 slb_save_cache_ptr;
>>  #endif /* CONFIG_PPC_PSERIES */
>>  } cacheline_aligned;
>>  
>> diff --git a/arch/powerpc/mm/slb.c b/arch/powerpc/mm/slb.c
>> index e89f675f1b5e..16a53689ffd4 100644
>> --- a/arch/powerpc/mm/slb.c
>> +++ b/arch/powerpc/mm/slb.c
>> @@ -151,6 +151,79 @@ void slb_flush_and_rebolt_realmode(void)
>>  get_paca()->slb_cache_ptr = 0;
>>  }
>>  
>> +void slb_save_contents(struct slb_entry *slb_ptr)
>> +{
>> +int i;
>> +unsigned long e, v;
>> +
>> +/* Save slb_cache_ptr value. */
>> +get_paca()->slb_save_cache_ptr = get_paca()->slb_cache_ptr;
> 
> What's the point of saving this?

This is to know how many valid cache entries were present at the time of
the SLB multihit. We use this index value while dumping the SLB cache
entries.

> 
>> +
>> +if (!slb_ptr)
>> +return;
> 
> Can this ever happen?

Maybe never. We allocate the memory very early 

Re: [PATCH v7 7/9] powerpc/pseries: Dump the SLB contents on SLB MCE errors.

2018-08-10 Thread Mahesh Jagannath Salgaonkar
On 08/10/2018 04:02 PM, Mahesh Jagannath Salgaonkar wrote:
> On 08/09/2018 06:35 AM, Michael Ellerman wrote:
>> Mahesh J Salgaonkar  writes:
>>
>>> diff --git a/arch/powerpc/include/asm/paca.h 
>>> b/arch/powerpc/include/asm/paca.h
>>> index 7f22929ce915..233d25ff6f64 100644
>>> --- a/arch/powerpc/include/asm/paca.h
>>> +++ b/arch/powerpc/include/asm/paca.h
>>> @@ -254,6 +254,10 @@ struct paca_struct {
>>>  #endif
>>>  #ifdef CONFIG_PPC_PSERIES
>>> u8 *mce_data_buf;   /* buffer to hold per cpu rtas errlog */
>>> +
>>> +   /* Capture SLB related old contents in MCE handler. */
>>> +   struct slb_entry *mce_faulty_slbs;
>>> +   u16 slb_save_cache_ptr;
>>>  #endif /* CONFIG_PPC_PSERIES */
>>
>>  ^
> 
> I will pull that out of CONFIG_PPC_PSERIES.

I mean I will pull 'mce_faulty_slbs' and 'slb_save_cache_ptr' out and put
them under CONFIG_PPC_BOOK3S_64.

-Mahesh.

> 
>>
>>> diff --git a/arch/powerpc/mm/slb.c b/arch/powerpc/mm/slb.c
>>> index e89f675f1b5e..16a53689ffd4 100644
>>> --- a/arch/powerpc/mm/slb.c
>>> +++ b/arch/powerpc/mm/slb.c
>>> @@ -151,6 +151,79 @@ void slb_flush_and_rebolt_realmode(void)
>>> get_paca()->slb_cache_ptr = 0;
>>>  }
>>>  
>>> +void slb_save_contents(struct slb_entry *slb_ptr)
>>> +{
>>> +   int i;
>>> +   unsigned long e, v;
>>> +
>>> +   /* Save slb_cache_ptr value. */
>>> +   get_paca()->slb_save_cache_ptr = get_paca()->slb_cache_ptr;
>>
>> This isn't inside CONFIG_PPC_PSERIES which breaks lots of configs, eg
>> powernv.
>>
>>   arch/powerpc/mm/slb.c:160:12: error: 'struct paca_struct' has no member 
>> named 'slb_save_cache_ptr'
>>   arch/powerpc/mm/slb.c:218:27: error: 'struct paca_struct' has no member 
>> named 'slb_save_cache_ptr'
>>   arch/powerpc/mm/slb.c:216:49: error: 'struct paca_struct' has no member 
>> named 'slb_save_cache_ptr'
>>
>> http://kisskb.ozlabs.ibm.com/kisskb/head/219f20e490add009194d94fdeb480da2e385f1c6/
>>
>> cheers
>>
> 
> Ouch.. my bad. Will fix it.
> 
> Thanks,
> -Mahesh.
> 



Re: [PATCH v7 7/9] powerpc/pseries: Dump the SLB contents on SLB MCE errors.

2018-08-10 Thread Mahesh Jagannath Salgaonkar
On 08/09/2018 06:35 AM, Michael Ellerman wrote:
> Mahesh J Salgaonkar  writes:
> 
>> diff --git a/arch/powerpc/include/asm/paca.h 
>> b/arch/powerpc/include/asm/paca.h
>> index 7f22929ce915..233d25ff6f64 100644
>> --- a/arch/powerpc/include/asm/paca.h
>> +++ b/arch/powerpc/include/asm/paca.h
>> @@ -254,6 +254,10 @@ struct paca_struct {
>>  #endif
>>  #ifdef CONFIG_PPC_PSERIES
>>  u8 *mce_data_buf;   /* buffer to hold per cpu rtas errlog */
>> +
>> +/* Capture SLB related old contents in MCE handler. */
>> +struct slb_entry *mce_faulty_slbs;
>> +u16 slb_save_cache_ptr;
>>  #endif /* CONFIG_PPC_PSERIES */
> 
>  ^

I will pull that out of CONFIG_PPC_PSERIES.

> 
>> diff --git a/arch/powerpc/mm/slb.c b/arch/powerpc/mm/slb.c
>> index e89f675f1b5e..16a53689ffd4 100644
>> --- a/arch/powerpc/mm/slb.c
>> +++ b/arch/powerpc/mm/slb.c
>> @@ -151,6 +151,79 @@ void slb_flush_and_rebolt_realmode(void)
>>  get_paca()->slb_cache_ptr = 0;
>>  }
>>  
>> +void slb_save_contents(struct slb_entry *slb_ptr)
>> +{
>> +int i;
>> +unsigned long e, v;
>> +
>> +/* Save slb_cache_ptr value. */
>> +get_paca()->slb_save_cache_ptr = get_paca()->slb_cache_ptr;
> 
> This isn't inside CONFIG_PPC_PSERIES which breaks lots of configs, eg
> powernv.
> 
>   arch/powerpc/mm/slb.c:160:12: error: 'struct paca_struct' has no member 
> named 'slb_save_cache_ptr'
>   arch/powerpc/mm/slb.c:218:27: error: 'struct paca_struct' has no member 
> named 'slb_save_cache_ptr'
>   arch/powerpc/mm/slb.c:216:49: error: 'struct paca_struct' has no member 
> named 'slb_save_cache_ptr'
> 
> http://kisskb.ozlabs.ibm.com/kisskb/head/219f20e490add009194d94fdeb480da2e385f1c6/
> 
> cheers
> 

Ouch.. my bad. Will fix it.

Thanks,
-Mahesh.



Re: [PATCH v7 5/9] powerpc/pseries: flush SLB contents on SLB MCE errors.

2018-08-10 Thread Mahesh Jagannath Salgaonkar
On 08/08/2018 02:34 PM, Nicholas Piggin wrote:
> On Tue, 07 Aug 2018 19:47:14 +0530
> Mahesh J Salgaonkar  wrote:
> 
>> From: Mahesh Salgaonkar 
>>
>> On pseries, as of today system crashes if we get a machine check
>> exceptions due to SLB errors. These are soft errors and can be fixed by
>> flushing the SLBs so the kernel can continue to function instead of
>> system crash. We do this in real mode before turning on MMU. Otherwise
>> we would run into nested machine checks. This patch now fetches the
>> rtas error log in real mode and flushes the SLBs on SLB errors.
>>
>> Signed-off-by: Mahesh Salgaonkar 
>> Signed-off-by: Michal Suchanek 
>> ---
>>
>> Changes in V7:
>> - Fold Michal's patch into this patch.
>> - Handle MSR_RI=0 and evil context case in MC handler.
>> ---
> 
> 
>> diff --git a/arch/powerpc/mm/slb.c b/arch/powerpc/mm/slb.c
>> index cb796724a6fc..e89f675f1b5e 100644
>> --- a/arch/powerpc/mm/slb.c
>> +++ b/arch/powerpc/mm/slb.c
>> @@ -145,6 +145,12 @@ void slb_flush_and_rebolt(void)
>>  get_paca()->slb_cache_ptr = 0;
>>  }
>>  
>> +void slb_flush_and_rebolt_realmode(void)
>> +{
>> +__slb_flush_and_rebolt();
>> +get_paca()->slb_cache_ptr = 0;
>> +}
>> +
>>  void slb_vmalloc_update(void)
>>  {
>>  unsigned long vflags;
> 
> Can you use this patch for the SLB flush?
> 
> https://patchwork.ozlabs.org/patch/953034/

Will use your v2.

Thanks,
-Mahesh.

> 
> Thanks,
> Nick
> 



Re: [PATCH v7 5/9] powerpc/pseries: flush SLB contents on SLB MCE errors.

2018-08-10 Thread Mahesh Jagannath Salgaonkar
On 08/07/2018 10:24 PM, Michal Suchánek wrote:
> Hello,
> 
> 
> On Tue, 07 Aug 2018 19:47:14 +0530
> "Mahesh J Salgaonkar"  wrote:
> 
>> From: Mahesh Salgaonkar 
>>
>> On pseries, as of today system crashes if we get a machine check
>> exceptions due to SLB errors. These are soft errors and can be fixed
>> by flushing the SLBs so the kernel can continue to function instead of
>> system crash. We do this in real mode before turning on MMU. Otherwise
>> we would run into nested machine checks. This patch now fetches the
>> rtas error log in real mode and flushes the SLBs on SLB errors.
>>
>> Signed-off-by: Mahesh Salgaonkar 
>> Signed-off-by: Michal Suchanek 
>> ---
>>
>> Changes in V7:
>> - Fold Michal's patch into this patch.
>> - Handle MSR_RI=0 and evil context case in MC handler.
>> ---
>>  arch/powerpc/include/asm/book3s/64/mmu-hash.h |    1 +
>>  arch/powerpc/include/asm/machdep.h            |    1 +
>>  arch/powerpc/kernel/exceptions-64s.S          |  112 +++++++++++++++
>>  arch/powerpc/kernel/mce.c                     |   15 +++
>>  arch/powerpc/mm/slb.c                         |    6 +
>>  arch/powerpc/platforms/powernv/setup.c        |   11 ++
>>  arch/powerpc/platforms/pseries/pseries.h      |    1 +
>>  arch/powerpc/platforms/pseries/ras.c          |   51 +++
>>  arch/powerpc/platforms/pseries/setup.c        |    1 +
>>  9 files changed, 195 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
>> b/arch/powerpc/include/asm/book3s/64/mmu-hash.h index
>> 50ed64fba4ae..cc00a7088cf3 100644 ---
>> a/arch/powerpc/include/asm/book3s/64/mmu-hash.h +++
>> b/arch/powerpc/include/asm/book3s/64/mmu-hash.h @@ -487,6 +487,7 @@
>> extern void hpte_init_native(void); 
>>  extern void slb_initialize(void);
>>  extern void slb_flush_and_rebolt(void);
>> +extern void slb_flush_and_rebolt_realmode(void);
>>  
>>  extern void slb_vmalloc_update(void);
>>  extern void slb_set_size(u16 size);
>> diff --git a/arch/powerpc/include/asm/machdep.h
>> b/arch/powerpc/include/asm/machdep.h index a47de82fb8e2..b4831f1338db
>> 100644 --- a/arch/powerpc/include/asm/machdep.h
>> +++ b/arch/powerpc/include/asm/machdep.h
>> @@ -108,6 +108,7 @@ struct machdep_calls {
>>  
>>  /* Early exception handlers called in realmode */
>>  int (*hmi_exception_early)(struct pt_regs
>> *regs);
>> +long(*machine_check_early)(struct pt_regs
>> *regs); 
>>  /* Called during machine check exception to retrive fixup
>> address. */ bool (*mce_check_early_recovery)(struct
>> pt_regs *regs); diff --git a/arch/powerpc/kernel/exceptions-64s.S
>> b/arch/powerpc/kernel/exceptions-64s.S index
>> 285c6465324a..cb06f219570a 100644 ---
>> a/arch/powerpc/kernel/exceptions-64s.S +++
>> b/arch/powerpc/kernel/exceptions-64s.S @@ -332,6 +332,9 @@
>> TRAMP_REAL_BEGIN(machine_check_pSeries) machine_check_fwnmi:
>>  SET_SCRATCH0(r13)   /* save r13 */
>>  EXCEPTION_PROLOG_0(PACA_EXMC)
>> +BEGIN_FTR_SECTION
>> +b   machine_check_pSeries_early
>> +END_FTR_SECTION_IFCLR(CPU_FTR_HVMODE)
>>  machine_check_pSeries_0:
>>  EXCEPTION_PROLOG_1(PACA_EXMC, KVMTEST_PR, 0x200)
>>  /*
>> @@ -343,6 +346,90 @@ machine_check_pSeries_0:
>>  
>>  TRAMP_KVM_SKIP(PACA_EXMC, 0x200)
>>  
>> +TRAMP_REAL_BEGIN(machine_check_pSeries_early)
>> +BEGIN_FTR_SECTION
>> +EXCEPTION_PROLOG_1(PACA_EXMC, NOTEST, 0x200)
>> +mr  r10,r1  /* Save r1 */
>> +ld  r1,PACAMCEMERGSP(r13)   /* Use MC emergency
>> stack */
>> +subir1,r1,INT_FRAME_SIZE/* alloc stack
>> frame*/
>> +mfspr   r11,SPRN_SRR0   /* Save SRR0 */
>> +mfspr   r12,SPRN_SRR1   /* Save SRR1 */
>> +EXCEPTION_PROLOG_COMMON_1()
>> +EXCEPTION_PROLOG_COMMON_2(PACA_EXMC)
>> +EXCEPTION_PROLOG_COMMON_3(0x200)
>> +addir3,r1,STACK_FRAME_OVERHEAD
>> +BRANCH_LINK_TO_FAR(machine_check_early) /* Function call ABI
>> */
>> +ld  r12,_MSR(r1)
>> +andi.   r11,r12,MSR_PR  /* See if coming
>> from user. */
>> +bne 2f  /* continue in V mode
>> if we are. */ +
>> +/*
>> + * At this point we are not sure about what context we come
>> from.
>> + * We may be in the middle of swithing stack. r1 may not be
>> valid.
>> + * Hence stay on emergency stack, call
>> machine_check_exception and
>> + * return from the interrupt.
>> + * But before that, check if this is an un-recoverable
>> exception.
>> + * If yes, then stay on emergency stack and panic.
>> + */
>> +andi.   r11,r12,MSR_RI
>> +bne 1f
>> +
>> +/*
>> + * Check if we have successfully handled/recovered from
>> error, if not
>> + * then stay on emergency stack and panic.
>> + */
>> +cmpdi   r3,0/* see if we handled MCE
>> successfully */
>> +bne 1f  /* if handled then return from
>> interrupt */ +
>> +

Re: [PATCH v7 4/9] powerpc/pseries: Define MCE error event section.

2018-08-10 Thread Mahesh Jagannath Salgaonkar
On 08/08/2018 08:12 PM, Michael Ellerman wrote:
> Hi Mahesh,
> 
> A few nitpicks.
> 
> Mahesh J Salgaonkar  writes:
>> From: Mahesh Salgaonkar 
>>
>> On pseries, the machine check error details are part of RTAS extended
>> event log passed under Machine check exception section. This patch adds
>> the definition of rtas MCE event section and related helper
>> functions.
>>
>> Signed-off-by: Mahesh Salgaonkar 
>> ---
>>  arch/powerpc/include/asm/rtas.h |  111 +++++++++++++++++++++++++++
>>  1 file changed, 111 insertions(+)
> 
> AFIACS none of this ever gets used outside of ras.c, should it should
> just go in there.

Since it was all RTAS-specific, I thought rtas.h was the better place. But
yes, I can move this into ras.c.

> 
>> diff --git a/arch/powerpc/include/asm/rtas.h 
>> b/arch/powerpc/include/asm/rtas.h
>> index 71e393c46a49..adc677c5e3a4 100644
>> --- a/arch/powerpc/include/asm/rtas.h
>> +++ b/arch/powerpc/include/asm/rtas.h
>> @@ -326,6 +334,109 @@ struct pseries_hp_errorlog {
>>  #define PSERIES_HP_ELOG_ID_DRC_COUNT3
>>  #define PSERIES_HP_ELOG_ID_DRC_IC   4
>>  
>> +/* RTAS pseries MCE errorlog section */
>> +#pragma pack(push, 1)
>> +struct pseries_mc_errorlog {
>> +__be32  fru_id;
>> +__be32  proc_id;
>> +uint8_t error_type;
> 
> Please use kernel types, so u8.

Will do so.

> 
>> +union {
>> +struct {
>> +uint8_t ue_err_type;
>> +/* 
>> + * X        1: Permanent or Transient UE.
>> + *  X       1: Effective address provided.
>> + *   X      1: Logical address provided.
>> + *    XX    2: Reserved.
>> + *      XXX 3: Type of UE error.
>> + */
> 
> But which bit is bit 0? And is that the LSB or MSB?

RTAS error log data is in BE format; the leftmost bit is MSB 0 (1: Permanent
or Transient UE). I will update the comment above so that it properly
points out which bit is MSB 0.

> 
> 
>> +uint8_t reserved_1[6];
>> +__be64  effective_address;
>> +__be64  logical_address;
>> +} ue_error;
>> +struct {
>> +uint8_t soft_err_type;
>> +/* 
>> + * X        1: Effective address provided.
>> + *  X       5: Reserved.
>> + *       XX 2: Type of SLB/ERAT/TLB error.
>> + */
>> +uint8_t reserved_1[6];
>> +__be64  effective_address;
>> +uint8_t reserved_2[8];
>> +} soft_error;
>> +} u;
>> +};
>> +#pragma pack(pop)
> 
> Why not __packed ?

Because when __packed was used, it added 1 extra byte of padding between
reserved_1[6] and effective_address. That caused the wrong effective address
to be printed on the console. Hence I switched to #pragma pack to force
1-byte alignment for this structure alone.

> 
>> +/* RTAS pseries MCE error types */
>> +#define PSERIES_MC_ERROR_TYPE_UE0x00
>> +#define PSERIES_MC_ERROR_TYPE_SLB   0x01
>> +#define PSERIES_MC_ERROR_TYPE_ERAT  0x02
>> +#define PSERIES_MC_ERROR_TYPE_TLB   0x04
>> +#define PSERIES_MC_ERROR_TYPE_D_CACHE   0x05
>> +#define PSERIES_MC_ERROR_TYPE_I_CACHE   0x07
> 
> Once these are in ras.c they can have less unwieldy names, ie. the
> PSERIES at least can be dropped.

ok.

> 
>> +/* RTAS pseries MCE error sub types */
>> +#define PSERIES_MC_ERROR_UE_INDETERMINATE   0
>> +#define PSERIES_MC_ERROR_UE_IFETCH  1
>> +#define PSERIES_MC_ERROR_UE_PAGE_TABLE_WALK_IFETCH  2
>> +#define PSERIES_MC_ERROR_UE_LOAD_STORE  3
>> +#define PSERIES_MC_ERROR_UE_PAGE_TABLE_WALK_LOAD_STORE  4
>> +
>> +#define PSERIES_MC_ERROR_SLB_PARITY 0
>> +#define PSERIES_MC_ERROR_SLB_MULTIHIT   1
>> +#define PSERIES_MC_ERROR_SLB_INDETERMINATE  2
>> +
>> +#define PSERIES_MC_ERROR_ERAT_PARITY1
>> +#define PSERIES_MC_ERROR_ERAT_MULTIHIT  2
>> +#define PSERIES_MC_ERROR_ERAT_INDETERMINATE 3
>> +
>> +#define PSERIES_MC_ERROR_TLB_PARITY 1
>> +#define PSERIES_MC_ERROR_TLB_MULTIHIT   2
>> +#define PSERIES_MC_ERROR_TLB_INDETERMINATE  3
>> +
>> +static inline uint8_t rtas_mc_error_type(const struct pseries_mc_errorlog 
>> *mlog)
>> +{
>> +return mlog->error_type;
>> +}
> 
> Why not just access it directly?

sure.

> 
>> +static inline uint8_t rtas_mc_error_sub_type(
>> +const struct pseries_mc_errorlog *mlog)
>> +{
>> +switch (mlog->error_type) {
>> +casePSERIES_MC_ERROR_TYPE_UE:
>> +return (mlog->u.ue_error.ue_err_type & 0x07);
>> +casePSERIES_MC_ERROR_TYPE_SLB:
>> +casePSERIES_MC_ERROR_TYPE_ERAT:
>> +casePSERIES_MC_ERROR_TYPE_TLB:
>> +

Re: [PATCH v2 2/2] powerpc/fadump: merge adjacent memory ranges to reduce PT_LOAD segements

2018-08-08 Thread Mahesh Jagannath Salgaonkar
On 08/07/2018 02:12 AM, Hari Bathini wrote:
> With dynamic memory allocation support for crash memory ranges array,
> there is no hard limit on the no. of crash memory ranges kernel could
> export, but program headers count could overflow in the /proc/vmcore
> ELF file while exporting each memory range as PT_LOAD segment. Reduce
> the likelihood of a such scenario, by folding adjacent crash memory
> ranges which minimizes the total number of PT_LOAD segments.
> 
> Signed-off-by: Hari Bathini 
> ---
>  arch/powerpc/kernel/fadump.c |   45 ++++++++++++++++++++++++++++------
>  1 file changed, 36 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
> index 2ec5704..cd0c555 100644
> --- a/arch/powerpc/kernel/fadump.c
> +++ b/arch/powerpc/kernel/fadump.c
> @@ -908,22 +908,41 @@ static int allocate_crash_memory_ranges(void)
>  static inline int fadump_add_crash_memory(unsigned long long base,
> unsigned long long end)
>  {
> + u64  start, size;
> + bool is_adjacent = false;
> +
>   if (base == end)
>   return 0;
>  
> - if (crash_mem_ranges == max_crash_mem_ranges) {
> - int ret;
> + /*
> +  * Fold adjacent memory ranges to bring down the memory ranges/
> +  * PT_LOAD segments count.
> +  */
> + if (crash_mem_ranges) {
> + start = crash_memory_ranges[crash_mem_ranges-1].base;
> + size = crash_memory_ranges[crash_mem_ranges-1].size;
>  
> - ret = allocate_crash_memory_ranges();
> - if (ret)
> - return ret;
> + if ((start + size) == base)
> + is_adjacent = true;
> + }
> + if (!is_adjacent) {
> + /* resize the array on reaching the limit */
> + if (crash_mem_ranges == max_crash_mem_ranges) {
> + int ret;
> +
> + ret = allocate_crash_memory_ranges();
> + if (ret)
> + return ret;
> + }
> +
> + start = base;
> + crash_memory_ranges[crash_mem_ranges].base = start;
> + crash_mem_ranges++;
>   }
>  
> + crash_memory_ranges[crash_mem_ranges-1].size = (end - start);
>   pr_debug("crash_memory_range[%d] [%#016llx-%#016llx], %#llx bytes\n",
> - crash_mem_ranges, base, end - 1, (end - base));
> - crash_memory_ranges[crash_mem_ranges].base = base;
> - crash_memory_ranges[crash_mem_ranges].size = end - base;
> - crash_mem_ranges++;
> + (crash_mem_ranges - 1), start, end - 1, (end - start));
>   return 0;
>  }
>  
> @@ -999,6 +1018,14 @@ static int fadump_setup_crash_memory_ranges(void)
>  
>   pr_debug("Setup crash memory ranges.\n");
>   crash_mem_ranges = 0;
> +
> + /* allocate memory for crash memory ranges for the first time */
> + if (!max_crash_mem_ranges) {
> + ret = allocate_crash_memory_ranges();
> + if (ret)
> + return ret;
> + }
> +

I see that the check for (!is_adjacent) in the first hunk already handles
the first-time allocation. Do we need this?

Rest looks fine to me.

Reviewed-by: Mahesh Salgaonkar 

Thanks,
-Mahesh.

>   /*
>* add the first memory chunk (RMA_START through boot_memory_size) as
>* a separate memory chunk. The reason is, at the time crash firmware
> 



Re: [PATCH v2 1/2] powerpc/fadump: handle crash memory ranges array index overflow

2018-08-08 Thread Mahesh Jagannath Salgaonkar
On 08/07/2018 02:12 AM, Hari Bathini wrote:
> Crash memory ranges is an array of memory ranges of the crashing kernel
> to be exported as a dump via /proc/vmcore file. The size of the array
> is set based on INIT_MEMBLOCK_REGIONS, which works alright in most cases
> where memblock memory regions count is less than INIT_MEMBLOCK_REGIONS
> value. But this count can grow beyond INIT_MEMBLOCK_REGIONS value since
> commit 142b45a72e22 ("memblock: Add array resizing support").
> 
> On large memory systems with a few DLPAR operations, the memblock memory
> regions count could be larger than INIT_MEMBLOCK_REGIONS value. On such
> systems, registering fadump results in crash or other system failures
> like below:
> 
>   task: c7f39a290010 ti: cb738000 task.ti: cb738000
>   NIP: c0047df4 LR: c00f9e58 CTR: c010f180
>   REGS: cb73b570 TRAP: 0300   Tainted: G  L   X  (4.4.140+)
>   MSR: 80009033   CR: 22004484  XER: 2000
>   CFAR: c0008500 DAR: 07a45000 DSISR: 4000 SOFTE: 0
>   GPR00: c00f9e58 cb73b7f0 c0f09a00 001a
>   GPR04: c7f3bf774c90 0004 c0eb9a00 0800
>   GPR08: 0804 07a45000 c0fa9a00 c7ffb169ca20
>   GPR12: 22004482 cfa12c00 c7f3a0ea97a8 
>   GPR16: c7f3a0ea9a50 cb73bd60 0118 0001fe80
>   GPR20: 0118  c0b8c980 00d0
>   GPR24: 07ffb0b1 c7ffb169c980  c0b8c980
>   GPR28: 0004 c7ffb169c980 001a c7ffb169c980
>   NIP [c0047df4] smp_send_reschedule+0x24/0x80
>   LR [c00f9e58] resched_curr+0x138/0x160
>   Call Trace:
>   [cb73b7f0] [c00f9e58] resched_curr+0x138/0x160 (unreliable)
>   [cb73b820] [c00fb538] check_preempt_curr+0xc8/0xf0
>   [cb73b850] [c00fb598] ttwu_do_wakeup+0x38/0x150
>   [cb73b890] [c00fc9c4] try_to_wake_up+0x224/0x4d0
>   [cb73b900] [c011ef34] __wake_up_common+0x94/0x100
>   [cb73b960] [c034a78c] ep_poll_callback+0xac/0x1c0
>   [cb73b9b0] [c011ef34] __wake_up_common+0x94/0x100
>   [cb73ba10] [c011f810] __wake_up_sync_key+0x70/0xa0
>   [cb73ba60] [c067c3e8] sock_def_readable+0x58/0xa0
>   [cb73ba90] [c07848ac] unix_stream_sendmsg+0x2dc/0x4c0
>   [cb73bb70] [c0675a38] sock_sendmsg+0x68/0xa0
>   [cb73bba0] [c067673c] ___sys_sendmsg+0x2cc/0x2e0
>   [cb73bd30] [c0677dbc] __sys_sendmsg+0x5c/0xc0
>   [cb73bdd0] [c06789bc] SyS_socketcall+0x36c/0x3f0
>   [cb73be30] [c0009488] system_call+0x3c/0x100
>   Instruction dump:
>   4e800020 6000 6042 3c4c00ec 38421c30 7c0802a6 f8010010 6000
>   3d42000a e92ab420 2fa9 4dde0020  2fa9 419e0044 7c0802a6
>   ---[ end trace a6d1dd4bab5f8253 ]---
> 
> as array index overflow is not checked for while setting up crash memory
> ranges causing memory corruption. To resolve this issue, dynamically
> allocate memory for crash memory ranges and resize it incrementally,
> in units of pagesize, on hitting array size limit.
> 
> Fixes: 2df173d9e85d ("fadump: Initialize elfcore header and add PT_LOAD 
> program headers.")
> Cc: sta...@vger.kernel.org
> Cc: Mahesh Salgaonkar 
> Signed-off-by: Hari Bathini 

Looks ok to me.

Reviewed-by: Mahesh Salgaonkar 

Thanks,
-Mahesh.

> ---
> 
> Changes in v2:
> * Allocating memory for crash ranges in pagesize unit.
> * freeing memory allocated while cleaning up.
> * Moved the changes to coalesce memory ranges into patch 2/2.
> 
> 
>  arch/powerpc/include/asm/fadump.h |4 +-
>  arch/powerpc/kernel/fadump.c  |   91 
> +++--
>  2 files changed, 79 insertions(+), 16 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/fadump.h 
> b/arch/powerpc/include/asm/fadump.h
> index 5a23010..3abc738 100644
> --- a/arch/powerpc/include/asm/fadump.h
> +++ b/arch/powerpc/include/asm/fadump.h
> @@ -195,8 +195,8 @@ struct fadump_crash_info_header {
>   struct cpumask  online_mask;
>  };
>  
> -/* Crash memory ranges */
> -#define INIT_CRASHMEM_RANGES (INIT_MEMBLOCK_REGIONS + 2)
> +/* Crash memory ranges size unit (pagesize) */
> +#define CRASHMEM_RANGES_ALLOC_SIZE   PAGE_SIZE
>  
>  struct fad_crash_memory_ranges {
>   unsigned long long  base;
> diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
> index 07e8396..2ec5704 100644
> --- a/arch/powerpc/kernel/fadump.c
> +++ b/arch/powerpc/kernel/fadump.c
> @@ -47,8 +47,10 @@ static struct fadump_mem_struct fdm;
>  static const struct fadump_mem_struct *fdm_active;
>  
>  static DEFINE_MUTEX(fadump_mutex);
> -struct fad_crash_memory_ranges crash_memory_ranges[INIT_CRASHMEM_RANGES];
> +struct 

Re: [PATCH] powerpc/fadump: handle crash memory ranges array overflow

2018-08-05 Thread Mahesh Jagannath Salgaonkar
On 07/31/2018 07:26 PM, Hari Bathini wrote:
> Crash memory ranges is an array of memory ranges of the crashing kernel
> to be exported as a dump via /proc/vmcore file. The size of the array
> is set based on INIT_MEMBLOCK_REGIONS, which works alright in most cases
> where memblock memory regions count is less than INIT_MEMBLOCK_REGIONS
> value. But this count can grow beyond INIT_MEMBLOCK_REGIONS value since
> commit 142b45a72e22 ("memblock: Add array resizing support").
> 
> On large memory systems with a few DLPAR operations, the memblock memory
> regions count could be larger than INIT_MEMBLOCK_REGIONS value. On such
> systems, registering fadump results in crash or other system failures
> like below:
> 
>   task: c7f39a290010 ti: cb738000 task.ti: cb738000
>   NIP: c0047df4 LR: c00f9e58 CTR: c010f180
>   REGS: cb73b570 TRAP: 0300   Tainted: G  L   X  (4.4.140+)
>   MSR: 80009033   CR: 22004484  XER: 2000
>   CFAR: c0008500 DAR: 07a45000 DSISR: 4000 SOFTE: 0
>   GPR00: c00f9e58 cb73b7f0 c0f09a00 001a
>   GPR04: c7f3bf774c90 0004 c0eb9a00 0800
>   GPR08: 0804 07a45000 c0fa9a00 c7ffb169ca20
>   GPR12: 22004482 cfa12c00 c7f3a0ea97a8 
>   GPR16: c7f3a0ea9a50 cb73bd60 0118 0001fe80
>   GPR20: 0118  c0b8c980 00d0
>   GPR24: 07ffb0b1 c7ffb169c980  c0b8c980
>   GPR28: 0004 c7ffb169c980 001a c7ffb169c980
>   NIP [c0047df4] smp_send_reschedule+0x24/0x80
>   LR [c00f9e58] resched_curr+0x138/0x160
>   Call Trace:
>   [cb73b7f0] [c00f9e58] resched_curr+0x138/0x160 (unreliable)
>   [cb73b820] [c00fb538] check_preempt_curr+0xc8/0xf0
>   [cb73b850] [c00fb598] ttwu_do_wakeup+0x38/0x150
>   [cb73b890] [c00fc9c4] try_to_wake_up+0x224/0x4d0
>   [cb73b900] [c011ef34] __wake_up_common+0x94/0x100
>   [cb73b960] [c034a78c] ep_poll_callback+0xac/0x1c0
>   [cb73b9b0] [c011ef34] __wake_up_common+0x94/0x100
>   [cb73ba10] [c011f810] __wake_up_sync_key+0x70/0xa0
>   [cb73ba60] [c067c3e8] sock_def_readable+0x58/0xa0
>   [cb73ba90] [c07848ac] unix_stream_sendmsg+0x2dc/0x4c0
>   [cb73bb70] [c0675a38] sock_sendmsg+0x68/0xa0
>   [cb73bba0] [c067673c] ___sys_sendmsg+0x2cc/0x2e0
>   [cb73bd30] [c0677dbc] __sys_sendmsg+0x5c/0xc0
>   [cb73bdd0] [c06789bc] SyS_socketcall+0x36c/0x3f0
>   [cb73be30] [c0009488] system_call+0x3c/0x100
>   Instruction dump:
>   4e800020 6000 6042 3c4c00ec 38421c30 7c0802a6 f8010010 6000
>   3d42000a e92ab420 2fa9 4dde0020  2fa9 419e0044 7c0802a6
>   ---[ end trace a6d1dd4bab5f8253 ]---
> 
> as array index overflow is not checked for while setting up crash memory
> ranges causing memory corruption. To resolve this issue, resize crash
> memory ranges array on hitting array size limit.
> 
> But without a hard limit on the number of crash memory ranges, there is
> a possibility of program headers count overflow in the /proc/vmcore ELF
> file while exporting each of this memory ranges as PT_LOAD segments. To
> reduce the likelihood of such scenario, fold adjacent memory ranges to
> minimize the total number of crash memory ranges.
> 
> Fixes: 2df173d9e85d ("fadump: Initialize elfcore header and add PT_LOAD 
> program headers.")
> Cc: sta...@vger.kernel.org
> Cc: Mahesh Salgaonkar 
> Signed-off-by: Hari Bathini 
> ---
>  arch/powerpc/include/asm/fadump.h |2 +
>  arch/powerpc/kernel/fadump.c  |   63 
> ++---
>  2 files changed, 59 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/fadump.h 
> b/arch/powerpc/include/asm/fadump.h
> index 5a23010..ff708b3 100644
> --- a/arch/powerpc/include/asm/fadump.h
> +++ b/arch/powerpc/include/asm/fadump.h
> @@ -196,7 +196,7 @@ struct fadump_crash_info_header {
>  };
> 
>  /* Crash memory ranges */
> -#define INIT_CRASHMEM_RANGES (INIT_MEMBLOCK_REGIONS + 2)
> +#define INIT_CRASHMEM_RANGES INIT_MEMBLOCK_REGIONS
> 
>  struct fad_crash_memory_ranges {
>   unsigned long long  base;
> diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
> index 07e8396..1c1df4f 100644
> --- a/arch/powerpc/kernel/fadump.c
> +++ b/arch/powerpc/kernel/fadump.c
> @@ -47,7 +47,9 @@ static struct fadump_mem_struct fdm;
>  static const struct fadump_mem_struct *fdm_active;
> 
>  static DEFINE_MUTEX(fadump_mutex);
> -struct fad_crash_memory_ranges crash_memory_ranges[INIT_CRASHMEM_RANGES];
> +struct fad_crash_memory_ranges 
> init_crash_memory_ranges[INIT_CRASHMEM_RANGES];
> +int 

Re: [PATCH v6 5/8] powerpc/pseries: flush SLB contents on SLB MCE errors.

2018-08-01 Thread Mahesh Jagannath Salgaonkar
On 08/01/2018 11:28 AM, Nicholas Piggin wrote:
> On Wed, 04 Jul 2018 23:28:21 +0530
> Mahesh J Salgaonkar  wrote:
> 
>> From: Mahesh Salgaonkar 
>>
>> On pseries, as of today system crashes if we get a machine check
>> exceptions due to SLB errors. These are soft errors and can be fixed by
>> flushing the SLBs so the kernel can continue to function instead of
>> system crash. We do this in real mode before turning on MMU. Otherwise
>> we would run into nested machine checks. This patch now fetches the
>> rtas error log in real mode and flushes the SLBs on SLB errors.
>>
>> Signed-off-by: Mahesh Salgaonkar 
>> ---
>>  arch/powerpc/include/asm/book3s/64/mmu-hash.h |1 
>>  arch/powerpc/include/asm/machdep.h|1 
>>  arch/powerpc/kernel/exceptions-64s.S  |   42 +
>>  arch/powerpc/kernel/mce.c |   16 +++-
>>  arch/powerpc/mm/slb.c |6 +++
>>  arch/powerpc/platforms/pseries/pseries.h  |1 
>>  arch/powerpc/platforms/pseries/ras.c  |   51 
>> +
>>  arch/powerpc/platforms/pseries/setup.c|1 
>>  8 files changed, 116 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h 
>> b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
>> index 50ed64fba4ae..cc00a7088cf3 100644
>> --- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
>> +++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
>> @@ -487,6 +487,7 @@ extern void hpte_init_native(void);
>>  
>>  extern void slb_initialize(void);
>>  extern void slb_flush_and_rebolt(void);
>> +extern void slb_flush_and_rebolt_realmode(void);
>>  
>>  extern void slb_vmalloc_update(void);
>>  extern void slb_set_size(u16 size);
>> diff --git a/arch/powerpc/include/asm/machdep.h 
>> b/arch/powerpc/include/asm/machdep.h
>> index ffe7c71e1132..fe447e0d4140 100644
>> --- a/arch/powerpc/include/asm/machdep.h
>> +++ b/arch/powerpc/include/asm/machdep.h
>> @@ -108,6 +108,7 @@ struct machdep_calls {
>>  
>>  /* Early exception handlers called in realmode */
>>  int (*hmi_exception_early)(struct pt_regs *regs);
>> +int (*machine_check_early)(struct pt_regs *regs);
>>  
>>  /* Called during machine check exception to retrive fixup address. */
>>  bool(*mce_check_early_recovery)(struct pt_regs *regs);
>> diff --git a/arch/powerpc/kernel/exceptions-64s.S 
>> b/arch/powerpc/kernel/exceptions-64s.S
>> index f283958129f2..0038596b7906 100644
>> --- a/arch/powerpc/kernel/exceptions-64s.S
>> +++ b/arch/powerpc/kernel/exceptions-64s.S
>> @@ -332,6 +332,9 @@ TRAMP_REAL_BEGIN(machine_check_pSeries)
>>  machine_check_fwnmi:
>>  SET_SCRATCH0(r13)   /* save r13 */
>>  EXCEPTION_PROLOG_0(PACA_EXMC)
>> +BEGIN_FTR_SECTION
>> +b   machine_check_pSeries_early
>> +END_FTR_SECTION_IFCLR(CPU_FTR_HVMODE)
>>  machine_check_pSeries_0:
>>  EXCEPTION_PROLOG_1(PACA_EXMC, KVMTEST_PR, 0x200)
>>  /*
>> @@ -343,6 +346,45 @@ machine_check_pSeries_0:
>>  
>>  TRAMP_KVM_SKIP(PACA_EXMC, 0x200)
>>  
>> +TRAMP_REAL_BEGIN(machine_check_pSeries_early)
>> +BEGIN_FTR_SECTION
>> +EXCEPTION_PROLOG_1(PACA_EXMC, NOTEST, 0x200)
>> +mr  r10,r1  /* Save r1 */
>> +ld  r1,PACAMCEMERGSP(r13)   /* Use MC emergency stack */
>> +subir1,r1,INT_FRAME_SIZE/* alloc stack frame*/
>> +mfspr   r11,SPRN_SRR0   /* Save SRR0 */
>> +mfspr   r12,SPRN_SRR1   /* Save SRR1 */
>> +EXCEPTION_PROLOG_COMMON_1()
>> +EXCEPTION_PROLOG_COMMON_2(PACA_EXMC)
>> +EXCEPTION_PROLOG_COMMON_3(0x200)
>> +addir3,r1,STACK_FRAME_OVERHEAD
>> +BRANCH_LINK_TO_FAR(machine_check_early) /* Function call ABI */
>> +
>> +/* Move original SRR0 and SRR1 into the respective regs */
>> +ld  r9,_MSR(r1)
>> +mtspr   SPRN_SRR1,r9
>> +ld  r3,_NIP(r1)
>> +mtspr   SPRN_SRR0,r3
>> +ld  r9,_CTR(r1)
>> +mtctr   r9
>> +ld  r9,_XER(r1)
>> +mtxer   r9
>> +ld  r9,_LINK(r1)
>> +mtlrr9
>> +REST_GPR(0, r1)
>> +REST_8GPRS(2, r1)
>> +REST_GPR(10, r1)
>> +ld  r11,_CCR(r1)
>> +mtcrr11
>> +REST_GPR(11, r1)
>> +REST_2GPRS(12, r1)
>> +/* restore original r1. */
>> +ld  r1,GPR1(r1)
>> +SET_SCRATCH0(r13)   /* save r13 */
>> +EXCEPTION_PROLOG_0(PACA_EXMC)
>> +b   machine_check_pSeries_0
>> +END_FTR_SECTION_IFCLR(CPU_FTR_HVMODE)
>> +
>>  EXC_COMMON_BEGIN(machine_check_common)
>>  /*
>>   * Machine check is different because we use a different
>> diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
>> index efdd16a79075..221271c96a57 100644
>> --- a/arch/powerpc/kernel/mce.c
>> +++ b/arch/powerpc/kernel/mce.c
>> @@ -488,9 +488,21 @@ long machine_check_early(struct pt_regs *regs)
>>  {
>>  long handled = 0;
>>  
>> -__this_cpu_inc(irq_stat.mce_exceptions);
>> +/*
>> + 

Re: [RFC PATCH v6 0/4] powerpc/fadump: Improvements and fixes for firmware-assisted dump.

2018-07-18 Thread Mahesh Jagannath Salgaonkar
On 07/17/2018 05:22 PM, Michal Hocko wrote:
> On Tue 17-07-18 16:58:10, Mahesh Jagannath Salgaonkar wrote:
>> On 07/16/2018 01:56 PM, Michal Hocko wrote:
>>> On Mon 16-07-18 11:32:56, Mahesh J Salgaonkar wrote:
>>>> One of the primary issues with Firmware Assisted Dump (fadump) on Power
>>>> is that it needs a large amount of memory to be reserved. This reserved
>>>> memory is used for saving the contents of old crashed kernel's memory 
>>>> before
>>>> fadump capture kernel uses old kernel's memory area to boot. However, This
>>>> reserved memory area stays unused until system crash and isn't available
>>>> for production kernel to use.
>>>
>>> How much memory are we talking about. Regular kernel dump process needs
>>> some reserved memory as well. Why that is not a big problem?
>>
>> We reserve around 5% of total system RAM. On large systems with
>> TeraBytes of memory, this reservation can be quite significant.
>>
>> The regular kernel dump uses the kexec method to boot into capture
>> kernel and it can control the parameters that are being passed to
>> capture kernel. This allows a capability to strip down the parameters
>> that can help lowering down the memory requirement for capture kernel to
>> boot. This allows regular kdump to reserve less memory to start with.
>>
>> Where as fadump depends on power firmware (pHyp) to load the capture
>> kernel after full reset and boots like a regular kernel. It needs same
>> amount of memory to boot as the production kernel. On large systems
>> production kernel needs significant amount of memory to boot. Hence
>> fadump needs to reserve enough memory for capture kernel to boot
>> successfully and execute dump capturing operations. By default fadump
>> reserves 5% of total system RAM and in most cases this has worked
>> flawlessly on variety of system configurations. Optionally,
>> 'crashkernel=X' can also be used to specify more fine-tuned memory size
>> for reservation.
> 
> So why do we even care about fadump when regular kexec provides
> (presumably) same functionality with a smaller memory footprint? Or is
> there any reason why kexec doesn't work well on ppc?

Kexec-based kdump is loaded by the crashing kernel. When the OS crashes,
the system is in an inconsistent state, especially the devices. In some
cases, a rogue DMA or an ill-behaved device driver can cause the kdump
capture to fail.

On the Power platform, fadump solves these issues with help from the Power
firmware: it fully resets the system and loads a fresh copy of the same
kernel to capture the dump, with PCI and I/O devices reinitialized, making
it more reliable.

Fadump does a full system reset, booting the system through the regular
boot path, i.e. the dump capture kernel is booted in the same fashion and
doesn't have a specialized kernel command line. This implies we need to
give more memory for the system boot. Since the new kernel boots from the
same memory location as the crashed kernel, we reserve 5% of memory, into
which the Power firmware moves the crashed kernel's memory content. This
reserved memory is completely removed from the available memory. For
large-memory systems like 64 TB systems, this amounts to ~3 TB, which is
a significant chunk of memory the production kernel is deprived of. Hence,
this patch adds an improvement to the existing fadump feature to make the
reserved memory available to the system for use, using ZONE_MOVABLE.

Thanks,
-Mahesh.

> 
>>>> Instead of setting aside a significant chunk of memory that nobody can use,
>>>> take advantage ZONE_MOVABLE to mark a significant chunk of reserved memory
>>>> as ZONE_MOVABLE, so that the kernel is prevented from using, but
>>>> applications are free to use it.
>>>
>>> Why kernel cannot use that memory while userspace can?
>>
>> fadump needs to reserve memory to be able to save crashing kernel's
>> memory, with help from power firmware, before the capture kernel loads
>> into crashing kernel's memory area. Any contents present in this
>> reserved memory will be over-written. If kernel is allowed to use this
>> memory, then we loose that kernel data and won't be part of captured
>> dump, which could be critical to debug root cause of system crash.
> 
> But then you simply screw user memory sitting there. This might be not
> so critical as the kernel memory but still it sounds like you are
> reducing the usefulness of the dump just because of inherent limitations
> of fadump.
> 
>> Kdump and fadump both uses same infrastructure/tool (makedumpfile) to
>> capture the memory dump. While the tool provides flexibility to
>> determine what needs to be part of the dump an

Re: [RFC PATCH v6 0/4] powerpc/fadump: Improvements and fixes for firmware-assisted dump.

2018-07-17 Thread Mahesh Jagannath Salgaonkar
On 07/16/2018 01:56 PM, Michal Hocko wrote:
> On Mon 16-07-18 11:32:56, Mahesh J Salgaonkar wrote:
>> One of the primary issues with Firmware Assisted Dump (fadump) on Power
>> is that it needs a large amount of memory to be reserved. This reserved
>> memory is used for saving the contents of old crashed kernel's memory before
>> fadump capture kernel uses old kernel's memory area to boot. However, This
>> reserved memory area stays unused until system crash and isn't available
>> for production kernel to use.
> 
> How much memory are we talking about. Regular kernel dump process needs
> some reserved memory as well. Why that is not a big problem?

We reserve around 5% of total system RAM. On large systems with
terabytes of memory, this reservation can be quite significant.

The regular kernel dump uses the kexec method to boot into the capture
kernel, and it can control the parameters that are passed to the capture
kernel. This provides the ability to strip down the parameters, which
helps lower the memory requirement for the capture kernel to boot. This
allows regular kdump to reserve less memory to start with.

Whereas fadump depends on the Power firmware (pHyp) to load the capture
kernel after a full reset, and it boots like a regular kernel. It needs the
same amount of memory to boot as the production kernel. On large systems
the production kernel needs a significant amount of memory to boot. Hence
fadump needs to reserve enough memory for the capture kernel to boot
successfully and execute dump capturing operations. By default fadump
reserves 5% of total system RAM, and in most cases this has worked
flawlessly on a variety of system configurations. Optionally,
'crashkernel=X' can also be used to specify a more fine-tuned memory size
for reservation.


> 
>> Instead of setting aside a significant chunk of memory that nobody can use,
>> take advantage ZONE_MOVABLE to mark a significant chunk of reserved memory
>> as ZONE_MOVABLE, so that the kernel is prevented from using, but
>> applications are free to use it.
> 
> Why kernel cannot use that memory while userspace can?

fadump needs to reserve memory to be able to save the crashing kernel's
memory, with help from the Power firmware, before the capture kernel loads
into the crashing kernel's memory area. Any contents present in this
reserved memory will be overwritten. If the kernel is allowed to use this
memory, then we lose that kernel data and it won't be part of the captured
dump, which could be critical to debugging the root cause of the system crash.

Kdump and fadump both use the same infrastructure/tool (makedumpfile) to
capture the memory dump. While the tool provides the flexibility to
determine what needs to be part of the dump and what memory to filter
out, all supported distributions default to "capture only kernel data
and nothing else". Taking advantage of this default, we can at least make
the reserved memory available for userspace to use.

If someone wants to capture userspace data as well, the
'fadump=nonmovable' option can be used, in which case reserved pages won't
be marked zone movable.

The advantage of the movable method is that the reserved memory chunk is
also available for use.

> [...]
>>  Documentation/powerpc/firmware-assisted-dump.txt |   18 +++
>>  arch/powerpc/include/asm/fadump.h|7 +
>>  arch/powerpc/kernel/fadump.c |  123 +--
>>  arch/powerpc/platforms/pseries/hotplug-memory.c  |7 +
>>  include/linux/mmzone.h   |2 
>>  mm/page_alloc.c  |  146 
>> ++
>>  6 files changed, 290 insertions(+), 13 deletions(-)
> 
> This is quite a large change and you didn't seem to explain why we need
> it.
> 

In the fadump case, the reserved memory stays unused until the system
crashes. fadump uses a very small portion of this reserved memory, a few
KBs, for storing fadump metadata. Otherwise, this significant chunk of
memory is completely unused. Hence, instead of blocking memory that is
unutilized throughout the lifetime of the system, it's better to give it
back to the production kernel to use. But at the same time we don't want
the kernel itself to use that memory. While exploring we found two
features that suit the requirement: 1) the Linux kernel's Contiguous
Memory Allocator (CMA) and 2) ZONE_MOVABLE. The initial 5 revisions of
this patchset used the CMA feature. However, fadump does not do any CMA
allocations, hence it is more appropriate to use ZONE_MOVABLE to achieve
the same.

But unlike CMA, there is no interface available to mark a custom
reserved memory area as ZONE_MOVABLE. Hence patch 1/4 proposes one.

Thanks,
-Mahesh.




Re: [PATCH v5 2/7] powerpc/pseries: Defer the logging of rtas error to irq work queue.

2018-07-03 Thread Mahesh Jagannath Salgaonkar
On 07/03/2018 08:55 AM, Nicholas Piggin wrote:
> On Mon, 02 Jul 2018 11:16:29 +0530
> Mahesh J Salgaonkar  wrote:
> 
>> From: Mahesh Salgaonkar 
>>
>> rtas_log_buf is a buffer to hold RTAS event data that are communicated
>> to kernel by hypervisor. This buffer is then used to pass RTAS event
>> data to user through proc fs. This buffer is allocated from vmalloc
>> (non-linear mapping) area.
>>
>> On Machine check interrupt, register r3 points to RTAS extended event
>> log passed by hypervisor that contains the MCE event. The pseries
>> machine check handler then logs this error into rtas_log_buf. The
>> rtas_log_buf is a vmalloc-ed (non-linear) buffer we end up taking up a
>> page fault (vector 0x300) while accessing it. Since machine check
>> interrupt handler runs in NMI context we can not afford to take any
>> page fault. Page faults are not honored in NMI context and causes
>> kernel panic. Apart from that, as Nick pointed out, pSeries_log_error()
>> also takes a spin_lock while logging error which is not safe in NMI
>> context. It may endup in deadlock if we get another MCE before releasing
>> the lock. Fix this by deferring the logging of rtas error to irq work queue.
>>
>> Current implementation uses two different buffers to hold rtas error log
>> depending on whether extended log is provided or not. This makes bit
>> difficult to identify which buffer has valid data that needs to logged
>> later in irq work. Simplify this using single buffer, one per paca, and
>> copy rtas log to it irrespective of whether extended log is provided or
>> not. Allocate this buffer below RMA region so that it can be accessed
>> in real mode mce handler.
>>
>> Fixes: b96672dd840f ("powerpc: Machine check interrupt is a non-maskable 
>> interrupt")
>> Cc: sta...@vger.kernel.org
>> Signed-off-by: Mahesh Salgaonkar 
> 
> I think this looks reasonable. It doesn't fix that commit so much as
> fixes the problem that's apparent after it's applied. I don't know if
> we should backport this to a wider set of stable kernels? Aside from
> that,

Since commit b96672dd840f went into 4.13, I think it is good if we
backport this to v4.14 and later stable kernels.

Thanks,
-Mahesh

> 
> Reviewed-by: Nicholas Piggin 
> 
> Thanks,
> Nick
> 
>> ---
>>  arch/powerpc/include/asm/paca.h|3 ++
>>  arch/powerpc/platforms/pseries/ras.c   |   47 
>> ++--
>>  arch/powerpc/platforms/pseries/setup.c |   16 +++
>>  3 files changed, 51 insertions(+), 15 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/paca.h 
>> b/arch/powerpc/include/asm/paca.h
>> index 3f109a3e3edb..b441fef53077 100644
>> --- a/arch/powerpc/include/asm/paca.h
>> +++ b/arch/powerpc/include/asm/paca.h
>> @@ -251,6 +251,9 @@ struct paca_struct {
>>  void *rfi_flush_fallback_area;
>>  u64 l1d_flush_size;
>>  #endif
>> +#ifdef CONFIG_PPC_PSERIES
>> +u8 *mce_data_buf;   /* buffer to hold per cpu rtas errlog */
>> +#endif /* CONFIG_PPC_PSERIES */
>>  } cacheline_aligned;
>>  
>>  extern void copy_mm_to_paca(struct mm_struct *mm);
>> diff --git a/arch/powerpc/platforms/pseries/ras.c 
>> b/arch/powerpc/platforms/pseries/ras.c
>> index ef104144d4bc..14a46b07ab2f 100644
>> --- a/arch/powerpc/platforms/pseries/ras.c
>> +++ b/arch/powerpc/platforms/pseries/ras.c
>> @@ -22,6 +22,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  
>>  #include 
>>  #include 
>> @@ -32,11 +33,13 @@
>>  static unsigned char ras_log_buf[RTAS_ERROR_LOG_MAX];
>>  static DEFINE_SPINLOCK(ras_log_buf_lock);
>>  
>> -static char global_mce_data_buf[RTAS_ERROR_LOG_MAX];
>> -static DEFINE_PER_CPU(__u64, mce_data_buf);
>> -
>>  static int ras_check_exception_token;
>>  
>> +static void mce_process_errlog_event(struct irq_work *work);
>> +static struct irq_work mce_errlog_process_work = {
>> +.func = mce_process_errlog_event,
>> +};
>> +
>>  #define EPOW_SENSOR_TOKEN   9
>>  #define EPOW_SENSOR_INDEX   0
>>  
>> @@ -330,16 +333,20 @@ static irqreturn_t ras_error_interrupt(int irq, void 
>> *dev_id)
>>  A) >= 0x7000) && ((A) < 0x7ff0)) || \
>>  (((A) >= rtas.base) && ((A) < (rtas.base + rtas.size - 16
>>  
>> +static inline struct rtas_error_log *fwnmi_get_errlog(void)
>> +{
>> +return (struct rtas_error_log *)local_paca->mce_data_buf;
>> +}
>> +
>>  /*
>>   * Get the error information for errors coming through the
>>   * FWNMI vectors.  The pt_regs' r3 will be updated to reflect
>>   * the actual r3 if possible, and a ptr to the error log entry
>>   * will be returned if found.
>>   *
>> - * If the RTAS error is not of the extended type, then we put it in a per
>> - * cpu 64bit buffer. If it is the extended type we use global_mce_data_buf.
>> + * Use one buffer mce_data_buf per cpu to store RTAS error.
>>   *
>> - * The global_mce_data_buf does not have any locks or protection around it,
>> + * The mce_data_buf does not have any locks or protection around it,
>>   * if a second machine check comes 

Re: [PATCH v5 5/7] powerpc/pseries: flush SLB contents on SLB MCE errors.

2018-07-03 Thread Mahesh Jagannath Salgaonkar
On 07/03/2018 03:38 AM, Nicholas Piggin wrote:
> On Mon, 02 Jul 2018 11:17:06 +0530
> Mahesh J Salgaonkar  wrote:
> 
>> From: Mahesh Salgaonkar 
>>
>> On pseries, as of today system crashes if we get a machine check
>> exceptions due to SLB errors. These are soft errors and can be fixed by
>> flushing the SLBs so the kernel can continue to function instead of
>> system crash. We do this in real mode before turning on MMU. Otherwise
>> we would run into nested machine checks. This patch now fetches the
>> rtas error log in real mode and flushes the SLBs on SLB errors.
>>
>> Signed-off-by: Mahesh Salgaonkar 
>> ---
>>  arch/powerpc/include/asm/book3s/64/mmu-hash.h |1 
>>  arch/powerpc/include/asm/machdep.h|1 
>>  arch/powerpc/kernel/exceptions-64s.S  |   42 +
>>  arch/powerpc/kernel/mce.c |   16 +++-
>>  arch/powerpc/mm/slb.c |6 +++
>>  arch/powerpc/platforms/powernv/opal.c |1 
>>  arch/powerpc/platforms/pseries/pseries.h  |1 
>>  arch/powerpc/platforms/pseries/ras.c  |   51 
>> +
>>  arch/powerpc/platforms/pseries/setup.c|1 
>>  9 files changed, 116 insertions(+), 4 deletions(-)
>>
> 
> 
>> +TRAMP_REAL_BEGIN(machine_check_pSeries_early)
>> +BEGIN_FTR_SECTION
>> +EXCEPTION_PROLOG_1(PACA_EXMC, NOTEST, 0x200)
>> +mr  r10,r1  /* Save r1 */
>> +ld  r1,PACAMCEMERGSP(r13)   /* Use MC emergency stack */
>> +subir1,r1,INT_FRAME_SIZE/* alloc stack frame*/
>> +mfspr   r11,SPRN_SRR0   /* Save SRR0 */
>> +mfspr   r12,SPRN_SRR1   /* Save SRR1 */
>> +EXCEPTION_PROLOG_COMMON_1()
>> +EXCEPTION_PROLOG_COMMON_2(PACA_EXMC)
>> +EXCEPTION_PROLOG_COMMON_3(0x200)
>> +addir3,r1,STACK_FRAME_OVERHEAD
>> +BRANCH_LINK_TO_FAR(machine_check_early) /* Function call ABI */
> 
> Is there any reason you can't use the existing
> machine_check_powernv_early code to do all this?

I did think about that :-). But the machine_check_powernv_early code
does a bit of extra stuff that isn't required on pseries, like touching
the ME bit in the MSR, plus lots of checks done in
machine_check_handle_early() before going to virtual mode. But on second
look I see that we can bypass all that with an HVMODE FTR section. Will
rename machine_check_powernv_early to machine_check_common_early and
reuse it.

> 
>> diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
>> index efdd16a79075..221271c96a57 100644
>> --- a/arch/powerpc/kernel/mce.c
>> +++ b/arch/powerpc/kernel/mce.c
>> @@ -488,9 +488,21 @@ long machine_check_early(struct pt_regs *regs)
>>  {
>>  long handled = 0;
>>  
>> -__this_cpu_inc(irq_stat.mce_exceptions);
>> +/*
>> + * For pSeries we count mce when we go into virtual mode machine
>> + * check handler. Hence skip it. Also, we can't access per cpu
>> + * variables in real mode for LPAR.
>> + */
>> +if (early_cpu_has_feature(CPU_FTR_HVMODE))
>> +__this_cpu_inc(irq_stat.mce_exceptions);
>>  
>> -if (cur_cpu_spec && cur_cpu_spec->machine_check_early)
>> +/*
>> + * See if platform is capable of handling machine check.
>> + * Otherwise fallthrough and allow CPU to handle this machine check.
>> + */
>> +if (ppc_md.machine_check_early)
>> +handled = ppc_md.machine_check_early(regs);
>> +else if (cur_cpu_spec && cur_cpu_spec->machine_check_early)
>>  handled = cur_cpu_spec->machine_check_early(regs);
> 
> Would be good to add a powernv ppc_md handler which does the
> cur_cpu_spec->machine_check_early() call now that other platforms are
> calling this code. Because those aren't valid as a fallback call, but
> specific to powernv.
> 
>> diff --git a/arch/powerpc/platforms/powernv/opal.c 
>> b/arch/powerpc/platforms/powernv/opal.c
>> index 48fbb41af5d1..ed548d40a9e1 100644
>> --- a/arch/powerpc/platforms/powernv/opal.c
>> +++ b/arch/powerpc/platforms/powernv/opal.c
>> @@ -417,7 +417,6 @@ static int opal_recover_mce(struct pt_regs *regs,
>>  
>>  if (!(regs->msr & MSR_RI)) {
>>  /* If MSR_RI isn't set, we cannot recover */
>> -pr_err("Machine check interrupt unrecoverable: MSR(RI=0)\n");
> 
> What's the reason for this change?

Err... this is by mistake. My bad. Thanks for catching this. Will
remove this hunk in the next revision. We need a similar print for
pSeries in ras.c.

> 
>>  recovered = 0;
>>  } else if (evt->disposition == MCE_DISPOSITION_RECOVERED) {
>>  /* Platform corrected itself */
>> diff --git a/arch/powerpc/platforms/pseries/pseries.h 
>> b/arch/powerpc/platforms/pseries/pseries.h
>> index 60db2ee511fb..3611db5dd583 100644
>> --- a/arch/powerpc/platforms/pseries/pseries.h
>> +++ b/arch/powerpc/platforms/pseries/pseries.h
>> @@ -24,6 +24,7 @@ struct pt_regs;
>>  
>>  extern int pSeries_system_reset_exception(struct 

Re: [PATCH v4 1/6] powerpc/pseries: Defer the logging of rtas error to irq work queue.

2018-06-28 Thread Mahesh Jagannath Salgaonkar
On 06/29/2018 02:35 AM, kbuild test robot wrote:
> Hi Mahesh,
> 
> Thank you for the patch! Yet something to improve:
> 
> [auto build test ERROR on powerpc/next]
> [also build test ERROR on v4.18-rc2 next-20180628]
> [if your patch is applied to the wrong git tree, please drop us a note to 
> help improve the system]
> 
> url:
> https://github.com/0day-ci/linux/commits/Mahesh-J-Salgaonkar/powerpc-pseries-Defer-the-logging-of-rtas-error-to-irq-work-queue/20180628-224101
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
> config: powerpc-defconfig (attached as .config)
> compiler: powerpc64-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
> reproduce:
> wget 
> https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
> ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # save the attached .config to linux build tree
> GCC_VERSION=7.2.0 make.cross ARCH=powerpc 
> 
> Note: the 
> linux-review/Mahesh-J-Salgaonkar/powerpc-pseries-Defer-the-logging-of-rtas-error-to-irq-work-queue/20180628-224101
>  HEAD 3496ae1afd6528103d508528e25bfca82c60f4ee builds fine.
>   It only hurts bisectibility.
> 
> All errors (new ones prefixed by >>):
> 
>arch/powerpc/platforms/pseries/ras.c: In function 
> 'mce_process_errlog_event':
>>> arch/powerpc/platforms/pseries/ras.c:433:8: error: implicit declaration of 
>>> function 'fwnmi_get_errlog'; did you mean 'fwnmi_get_errinfo'? 
>>> [-Werror=implicit-function-declaration]
>  err = fwnmi_get_errlog();
>^~~~
>fwnmi_get_errinfo
>>> arch/powerpc/platforms/pseries/ras.c:433:6: error: assignment makes pointer 
>>> from integer without a cast [-Werror=int-conversion]
>  err = fwnmi_get_errlog();
>  ^
>cc1: all warnings being treated as errors

Ouch... Looks like I pushed down the function definition while
rearranging the hunks. Will fix it in the next revision. Thanks for
catching this.

Thanks,
-Mahesh.



Re: [PATCH v4 1/6] powerpc/pseries: Defer the logging of rtas error to irq work queue.

2018-06-28 Thread Mahesh Jagannath Salgaonkar
On 06/28/2018 06:49 PM, Laurent Dufour wrote:
> On 28/06/2018 13:10, Mahesh J Salgaonkar wrote:
>> From: Mahesh Salgaonkar 
>>
>> rtas_log_buf is a buffer to hold RTAS event data that are communicated
>> to kernel by hypervisor. This buffer is then used to pass RTAS event
>> data to user through proc fs. This buffer is allocated from vmalloc
>> (non-linear mapping) area.
>>
>> On Machine check interrupt, register r3 points to RTAS extended event
>> log passed by hypervisor that contains the MCE event. The pseries
>> machine check handler then logs this error into rtas_log_buf. The
>> rtas_log_buf is a vmalloc-ed (non-linear) buffer we end up taking up a
>> page fault (vector 0x300) while accessing it. Since machine check
>> interrupt handler runs in NMI context we can not afford to take any
>> page fault. Page faults are not honored in NMI context and causes
>> kernel panic. Apart from that, as Nick pointed out, pSeries_log_error()
>> also takes a spin_lock while logging error which is not safe in NMI
>> context. It may endup in deadlock if we get another MCE before releasing
>> the lock. Fix this by deferring the logging of rtas error to irq work queue.
>>
>> Current implementation uses two different buffers to hold rtas error log
>> depending on whether an extended log is provided or not. This makes it a
>> bit difficult to identify which buffer has valid data that needs to be
>> logged later in irq work. Simplify this by using a single buffer, one per
>> paca, and
>> copy rtas log to it irrespective of whether extended log is provided or
>> not. Allocate this buffer below RMA region so that it can be accessed
>> in real mode mce handler.
>>
>> Fixes: b96672dd840f ("powerpc: Machine check interrupt is a non-maskable 
>> interrupt")
>> Cc: sta...@vger.kernel.org
>> Signed-off-by: Mahesh Salgaonkar 
>> ---
>>  arch/powerpc/include/asm/paca.h|3 ++
>>  arch/powerpc/platforms/pseries/ras.c   |   39 
>> +---
>>  arch/powerpc/platforms/pseries/setup.c |   16 +
>>  3 files changed, 45 insertions(+), 13 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/paca.h 
>> b/arch/powerpc/include/asm/paca.h
>> index 3f109a3e3edb..b441fef53077 100644
>> --- a/arch/powerpc/include/asm/paca.h
>> +++ b/arch/powerpc/include/asm/paca.h
>> @@ -251,6 +251,9 @@ struct paca_struct {
>>  void *rfi_flush_fallback_area;
>>  u64 l1d_flush_size;
>>  #endif
>> +#ifdef CONFIG_PPC_PSERIES
>> +u8 *mce_data_buf;   /* buffer to hold per cpu rtas errlog */
>> +#endif /* CONFIG_PPC_PSERIES */
>>  } cacheline_aligned;
>>
>>  extern void copy_mm_to_paca(struct mm_struct *mm);
>> diff --git a/arch/powerpc/platforms/pseries/ras.c 
>> b/arch/powerpc/platforms/pseries/ras.c
>> index 5e1ef9150182..f6ba9a2a4f84 100644
>> --- a/arch/powerpc/platforms/pseries/ras.c
>> +++ b/arch/powerpc/platforms/pseries/ras.c
>> @@ -22,6 +22,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>
>>  #include 
>>  #include 
>> @@ -32,11 +33,13 @@
>>  static unsigned char ras_log_buf[RTAS_ERROR_LOG_MAX];
>>  static DEFINE_SPINLOCK(ras_log_buf_lock);
>>
>> -static char global_mce_data_buf[RTAS_ERROR_LOG_MAX];
>> -static DEFINE_PER_CPU(__u64, mce_data_buf);
>> -
>>  static int ras_check_exception_token;
>>
>> +static void mce_process_errlog_event(struct irq_work *work);
>> +static struct irq_work mce_errlog_process_work = {
>> +.func = mce_process_errlog_event,
>> +};
>> +
>>  #define EPOW_SENSOR_TOKEN   9
>>  #define EPOW_SENSOR_INDEX   0
>>
>> @@ -336,10 +339,9 @@ static irqreturn_t ras_error_interrupt(int irq, void 
>> *dev_id)
>>   * the actual r3 if possible, and a ptr to the error log entry
>>   * will be returned if found.
>>   *
>> - * If the RTAS error is not of the extended type, then we put it in a per
>> - * cpu 64bit buffer. If it is the extended type we use global_mce_data_buf.
>> + * Use one buffer mce_data_buf per cpu to store RTAS error.
>>   *
>> - * The global_mce_data_buf does not have any locks or protection around it,
>> + * The mce_data_buf does not have any locks or protection around it,
>>   * if a second machine check comes in, or a system reset is done
>>   * before we have logged the error, then we will get corruption in the
>>   * error log.  This is preferable over holding off on calling
>> @@ -362,20 +364,19 @@ static struct rtas_error_log *fwnmi_get_errinfo(struct 
>> pt_regs *regs)
>>  savep = __va(regs->gpr[3]);
>>  regs->gpr[3] = savep[0];/* restore original r3 */
>>
>> -/* If it isn't an extended log we can use the per cpu 64bit buffer */
>>  h = (struct rtas_error_log *)&savep[1];
>> +/* Use the per cpu buffer from paca to store rtas error log */
>> +memset(local_paca->mce_data_buf, 0, RTAS_ERROR_LOG_MAX);
>>  if (!rtas_error_extended(h)) {
>> -memcpy(this_cpu_ptr(&mce_data_buf), h, sizeof(__u64));
>> -errhdr = (struct rtas_error_log *)this_cpu_ptr(&mce_data_buf);
>> +memcpy(local_paca->mce_data_buf, 

Re: [v3 PATCH 4/5] powerpc/pseries: Dump and flush SLB contents on SLB MCE errors.

2018-06-12 Thread Mahesh Jagannath Salgaonkar
On 06/12/2018 07:17 PM, Michael Ellerman wrote:
> Mahesh J Salgaonkar  writes:
>> diff --git a/arch/powerpc/platforms/pseries/ras.c 
>> b/arch/powerpc/platforms/pseries/ras.c
>> index 2edc673be137..e56759d92356 100644
>> --- a/arch/powerpc/platforms/pseries/ras.c
>> +++ b/arch/powerpc/platforms/pseries/ras.c
>> @@ -422,6 +422,31 @@ int pSeries_system_reset_exception(struct pt_regs *regs)
>>  return 0; /* need to perform reset */
>>  }
>>  
>> +static int mce_handle_error(struct rtas_error_log *errp)
>> +{
>> +struct pseries_errorlog *pseries_log;
>> +struct pseries_mc_errorlog *mce_log;
>> +int disposition = rtas_error_disposition(errp);
>> +uint8_t error_type;
>> +
>> +pseries_log = get_pseries_errorlog(errp, PSERIES_ELOG_SECT_ID_MCE);
>> +if (pseries_log == NULL)
>> +goto out;
>> +
>> +mce_log = (struct pseries_mc_errorlog *)pseries_log->data;
>> +error_type = rtas_mc_error_type(mce_log);
>> +
>> +if ((disposition == RTAS_DISP_NOT_RECOVERED) &&
>> +(error_type == PSERIES_MC_ERROR_TYPE_SLB)) {
>> +slb_dump_contents();
>> +slb_flush_and_rebolt();
> 
> Aren't we back in virtual mode here?
> 
> Don't we need to do the flush in real mode before turning the MMU back
> on. Otherwise we'll just take another multi-hit?

Yeah, for duplicate entries for the kernel segment "0xc00" we will end
up with another multi-hit; for other segments we won't. I think I need
to move the fetching of the rtas error log and the handling part into
real mode to avoid a loop, and do only the printing part in virtual mode.

> 
> cheers
> 



Re: [v3 PATCH 2/5] powerpc/pseries: Fix endainness while restoring of r3 in MCE handler.

2018-06-08 Thread Mahesh Jagannath Salgaonkar
On 06/08/2018 12:20 PM, Michael Ellerman wrote:
> Mahesh J Salgaonkar  writes:
>> From: Mahesh Salgaonkar 
>>
>> During Machine Check interrupt on pseries platform, register r3 points
>> RTAS extended event log passed by hypervisor. Since hypervisor uses r3
>> to pass pointer to rtas log, it stores the original r3 value at the
>> start of the memory (first 8 bytes) pointed by r3. Since hypervisor
>> stores this info and rtas log is in BE format, linux should make
>> sure to restore r3 value in correct endian format.
> 
> Can we hit this under KVM? And if so what if the KVM/qemu is running
> little endian, does it still write the value BE?

FWNMI support for qemu is not in yet, but when it is, we can hit this.
Whenever FWNMI support gets in, it should always pass the RTAS event
data in BE format, including the original r3 value.

Thanks,
-Mahesh.
> 
> cheers
> 



Re: [v3 PATCH 5/5] powerpc/pseries: Display machine check error details.

2018-06-08 Thread Mahesh Jagannath Salgaonkar
On 06/08/2018 07:21 AM, Nicholas Piggin wrote:
> On Thu, 07 Jun 2018 22:59:04 +0530
> Mahesh J Salgaonkar  wrote:
> 
>> From: Mahesh Salgaonkar 
>>
>> Extract the MCE error details from RTAS extended log and display it to
>> console.
>>
>> With this patch you should now see mce logs like below:
>>
>> [  142.371818] Severe Machine check interrupt [Recovered]
>> [  142.371822]   NIP [dca301b8]: init_module+0x1b8/0x338 
>> [bork_kernel]
>> [  142.371822]   Initiator: CPU
>> [  142.371823]   Error type: SLB [Multihit]
>> [  142.371824] Effective address: dca7
>>
>> Signed-off-by: Mahesh Salgaonkar 
>> ---
>>  arch/powerpc/include/asm/rtas.h  |5 +
>>  arch/powerpc/platforms/pseries/ras.c |  128 
>> +-
>>  2 files changed, 131 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/rtas.h 
>> b/arch/powerpc/include/asm/rtas.h
>> index 3f2fba7ef23b..8100a95c133a 100644
>> --- a/arch/powerpc/include/asm/rtas.h
>> +++ b/arch/powerpc/include/asm/rtas.h
>> @@ -190,6 +190,11 @@ static inline uint8_t rtas_error_extended(const struct 
>> rtas_error_log *elog)
>>  return (elog->byte1 & 0x04) >> 2;
>>  }
>>  
>> +static inline uint8_t rtas_error_initiator(const struct rtas_error_log 
>> *elog)
>> +{
>> +return (elog->byte2 & 0xf0) >> 4;
>> +}
>> +
>>  #define rtas_error_type(x)  ((x)->byte3)
>>  
>>  static inline
>> diff --git a/arch/powerpc/platforms/pseries/ras.c 
>> b/arch/powerpc/platforms/pseries/ras.c
>> index e56759d92356..cd9446980092 100644
>> --- a/arch/powerpc/platforms/pseries/ras.c
>> +++ b/arch/powerpc/platforms/pseries/ras.c
>> @@ -422,7 +422,130 @@ int pSeries_system_reset_exception(struct pt_regs 
>> *regs)
>>  return 0; /* need to perform reset */
>>  }
>>  
>> -static int mce_handle_error(struct rtas_error_log *errp)
>> +#define VAL_TO_STRING(ar, val)  ((val < ARRAY_SIZE(ar)) ? ar[val] : 
>> "Unknown")
>> +
>> +static void pseries_print_mce_info(struct pt_regs *regs,
>> +struct rtas_error_log *errp, int disposition)
>> +{
>> +const char *level, *sevstr;
>> +struct pseries_errorlog *pseries_log;
>> +struct pseries_mc_errorlog *mce_log;
>> +uint8_t error_type, err_sub_type;
>> +uint8_t initiator = rtas_error_initiator(errp);
>> +uint64_t addr;
>> +
>> +static const char * const initiators[] = {
>> +"Unknown",
>> +"CPU",
>> +"PCI",
>> +"ISA",
>> +"Memory",
>> +"Power Mgmt",
>> +};
>> +static const char * const mc_err_types[] = {
>> +"UE",
>> +"SLB",
>> +"ERAT",
>> +"TLB",
>> +"D-Cache",
>> +"Unknown",
>> +"I-Cache",
>> +};
>> +static const char * const mc_ue_types[] = {
>> +"Indeterminate",
>> +"Instruction fetch",
>> +"Page table walk ifetch",
>> +"Load/Store",
>> +"Page table walk Load/Store",
>> +};
>> +
>> +/* SLB sub errors valid values are 0x0, 0x1, 0x2 */
>> +static const char * const mc_slb_types[] = {
>> +"Parity",
>> +"Multihit",
>> +"Indeterminate",
>> +};
>> +
>> +/* TLB and ERAT sub errors valid values are 0x1, 0x2, 0x3 */
>> +static const char * const mc_soft_types[] = {
>> +"Unknown",
>> +"Parity",
>> +"Multihit",
>> +"Indeterminate",
>> +};
>> +
>> +pseries_log = get_pseries_errorlog(errp, PSERIES_ELOG_SECT_ID_MCE);
>> +if (pseries_log == NULL)
>> +return;
>> +
>> +mce_log = (struct pseries_mc_errorlog *)pseries_log->data;
>> +
>> +error_type = rtas_mc_error_type(mce_log);
>> +err_sub_type = rtas_mc_error_sub_type(mce_log);
>> +
>> +switch (rtas_error_severity(errp)) {
>> +case RTAS_SEVERITY_NO_ERROR:
>> +level = KERN_INFO;
>> +sevstr = "Harmless";
>> +break;
>> +case RTAS_SEVERITY_WARNING:
>> +level = KERN_WARNING;
>> +sevstr = "";
>> +break;
>> +case RTAS_SEVERITY_ERROR:
>> +case RTAS_SEVERITY_ERROR_SYNC:
>> +level = KERN_ERR;
>> +sevstr = "Severe";
>> +break;
>> +case RTAS_SEVERITY_FATAL:
>> +default:
>> +level = KERN_ERR;
>> +sevstr = "Fatal";
>> +break;
>> +}
>> +
>> +printk("%s%s Machine check interrupt [%s]\n", level, sevstr,
>> +disposition == RTAS_DISP_FULLY_RECOVERED ?
>> +"Recovered" : "Not recovered");
>> +if (user_mode(regs)) {
>> +printk("%s  NIP: [%016lx] PID: %d Comm: %s\n", level,
>> +regs->nip, current->pid, current->comm);
>> +} else {
>> +printk("%s  NIP [%016lx]: %pS\n", level, regs->nip,
>> +(void *)regs->nip);
>> +}
> 
> I think it's probably still useful to print 

Re: [v3 PATCH 4/5] powerpc/pseries: Dump and flush SLB contents on SLB MCE errors.

2018-06-08 Thread Mahesh Jagannath Salgaonkar
On 06/08/2018 07:18 AM, Nicholas Piggin wrote:
> On Thu, 07 Jun 2018 22:58:55 +0530
> Mahesh J Salgaonkar  wrote:
> 
>> From: Mahesh Salgaonkar 
>>
>> If we get a machine check exception due to SLB errors then dump the
>> current SLB contents, which will be very helpful in debugging the
>> root cause of SLB errors. On pseries, as of today the system crashes on SLB
>> errors. These are soft errors and can be fixed by flushing the SLBs so
>> the kernel can continue to function instead of system crash. This patch
>> fixes that also.
> 
> So pseries never flushed SLB and reloaded in response to multi hit
> errors? This seems like quite a good improvement then. I like
> dumping SLB too.
> 
> It's a bit annoying we can't share the same code with xmon really,
> that's okay but I just suggest commenting them both if you take a
> copy like this with a note to keep them in synch if you re-post
> the series.
> 
>>
>> With this patch the console will log SLB contents like below on SLB MCE
>> errors:
>>
>> [  822.711728] slb contents:
> 
> Suggest keeping the same format as the xmon dump (in particular
> CPU number, even though it's probably printed elsewhere in the MCE
> message it doesn't hurt.

Sure, will do that and repost.

Thanks,
-Mahesh.

> 
> Reviewed-by: Nicholas Piggin 
> 
> Thanks,
> Nick
> 
>> [  822.711730] 00 c800 400ea1b217000500
>> [  822.711731]   1T  ESID=   c0  VSID=  ea1b217 LLP:100
>> [  822.711732] 01 d800 400d43642f000510
>> [  822.711733]   1T  ESID=   d0  VSID=  d43642f LLP:110
>> [  822.711734] 09 f800 400a86c85f000500
>> [  822.711736]   1T  ESID=   f0  VSID=  a86c85f LLP:100
>> [  822.711737] 10 7f000800 400d1f26e3000d90
>> [  822.711738]   1T  ESID=   7f  VSID=  d1f26e3 LLP:110
>> [  822.711739] 11 1800 000e3615f520fd90
>> [  822.711740]  256M ESID=1  VSID=   e3615f520f LLP:110
>> [  822.711740] 12 d800 400d43642f000510
>> [  822.711741]   1T  ESID=   d0  VSID=  d43642f LLP:110
>> [  822.711742] 13 d800 400d43642f000510
>> [  822.711743]   1T  ESID=   d0  VSID=  d43642f LLP:110
>>
>>
>> Suggested-by: Aneesh Kumar K.V 
>> Suggested-by: Michael Ellerman 
>> Signed-off-by: Mahesh Salgaonkar 
>> ---
>>  arch/powerpc/include/asm/book3s/64/mmu-hash.h |1 +
>>  arch/powerpc/mm/slb.c |   35 
>> +
>>  arch/powerpc/platforms/pseries/ras.c  |   29 -
>>  3 files changed, 64 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h 
>> b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
>> index 50ed64fba4ae..c0da68927235 100644
>> --- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
>> +++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
>> @@ -487,6 +487,7 @@ extern void hpte_init_native(void);
>>  
>>  extern void slb_initialize(void);
>>  extern void slb_flush_and_rebolt(void);
>> +extern void slb_dump_contents(void);
>>  
>>  extern void slb_vmalloc_update(void);
>>  extern void slb_set_size(u16 size);
>> diff --git a/arch/powerpc/mm/slb.c b/arch/powerpc/mm/slb.c
>> index 66577cc66dc9..799aa117cec3 100644
>> --- a/arch/powerpc/mm/slb.c
>> +++ b/arch/powerpc/mm/slb.c
>> @@ -145,6 +145,41 @@ void slb_flush_and_rebolt(void)
>>  get_paca()->slb_cache_ptr = 0;
>>  }
>>  
>> +void slb_dump_contents(void)
>> +{
>> +int i;
>> +unsigned long e, v;
>> +unsigned long llp;
>> +
>> +pr_err("slb contents:\n");
>> +for (i = 0; i < mmu_slb_size; i++) {
>> +asm volatile("slbmfee  %0,%1" : "=r" (e) : "r" (i));
>> +asm volatile("slbmfev  %0,%1" : "=r" (v) : "r" (i));
>> +
>> +if (!e && !v)
>> +continue;
>> +
>> +pr_err("%02d %016lx %016lx", i, e, v);
>> +
>> +if (!(e & SLB_ESID_V)) {
>> +pr_err("\n");
>> +continue;
>> +}
>> +llp = v & SLB_VSID_LLP;
>> +if (v & SLB_VSID_B_1T) {
>> +pr_err("  1T  ESID=%9lx  VSID=%13lx LLP:%3lx\n",
>> +GET_ESID_1T(e),
>> +(v & ~SLB_VSID_B) >> SLB_VSID_SHIFT_1T,
>> +llp);
>> +} else {
>> +pr_err(" 256M ESID=%9lx  VSID=%13lx LLP:%3lx\n",
>> +GET_ESID(e),
>> +(v & ~SLB_VSID_B) >> SLB_VSID_SHIFT,
>> +llp);
>> +}
>> +}
>> +}
>> +
>>  void slb_vmalloc_update(void)
>>  {
>>  unsigned long vflags;
>> diff --git a/arch/powerpc/platforms/pseries/ras.c 
>> b/arch/powerpc/platforms/pseries/ras.c
>> index 2edc673be137..e56759d92356 100644
>> --- a/arch/powerpc/platforms/pseries/ras.c
>> +++ b/arch/powerpc/platforms/pseries/ras.c
>> @@ -422,6 +422,31 @@ int pSeries_system_reset_exception(struct pt_regs *regs)
>>  return 0; /* need to perform 

Re: [v3 PATCH 1/5] powerpc/pseries: convert rtas_log_buf to linear allocation.

2018-06-08 Thread Mahesh Jagannath Salgaonkar
On 06/08/2018 07:01 AM, Nicholas Piggin wrote:
> On Thu, 07 Jun 2018 22:58:11 +0530
> Mahesh J Salgaonkar  wrote:
> 
>> From: Mahesh Salgaonkar 
>>
>> rtas_log_buf is a buffer to hold RTAS event data that are communicated
>> to kernel by hypervisor. This buffer is then used to pass RTAS event
>> data to user through proc fs. This buffer is allocated from vmalloc
>> (non-linear mapping) area.
>>
>> On Machine check interrupt, register r3 points to RTAS extended event
>> log passed by hypervisor that contains the MCE event. The pseries
>> machine check handler then logs this error into rtas_log_buf. The
>> rtas_log_buf is a vmalloc-ed (non-linear) buffer we end up taking up a
>> page fault (vector 0x300) while accessing it. Since machine check
>> interrupt handler runs in NMI context we can not afford to take any
>> page fault. Page faults are not honored in NMI context and causes
>> kernel panic. This patch fixes this issue by allocating rtas_log_buf
>> using kmalloc.
>>
>> Fixes: b96672dd840f ("powerpc: Machine check interrupt is a non-maskable 
>> interrupt")
>> Cc: sta...@vger.kernel.org
>> Suggested-by: Aneesh Kumar K.V 
>> Signed-off-by: Mahesh Salgaonkar 
>> ---
>>  arch/powerpc/kernel/rtasd.c |2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/kernel/rtasd.c b/arch/powerpc/kernel/rtasd.c
>> index f915db93cd42..3957d4ae2ba2 100644
>> --- a/arch/powerpc/kernel/rtasd.c
>> +++ b/arch/powerpc/kernel/rtasd.c
>> @@ -559,7 +559,7 @@ static int __init rtas_event_scan_init(void)
>>  rtas_error_log_max = rtas_get_error_log_max();
>>  rtas_error_log_buffer_max = rtas_error_log_max + sizeof(int);
>>  
>> -rtas_log_buf = vmalloc(rtas_error_log_buffer_max*LOG_NUMBER);
>> +rtas_log_buf = kmalloc(rtas_error_log_buffer_max*LOG_NUMBER, 
>> GFP_KERNEL);
> 
> Does this have to be in the RMA region if it's to be accessed with
> relocation off in the guest?

Nope, not required. It never gets accessed with relocation off.

> 
> A comment about it being accessed with relocation off might be helpful
> too.

Sure.

Thanks,
-Mahesh.



Re: [v2 PATCH 0/5] powerpc/pseries: Machien check handler improvements.

2018-06-07 Thread Mahesh Jagannath Salgaonkar
On 06/07/2018 04:15 PM, Nicholas Piggin wrote:
> On Thu, 07 Jun 2018 15:36:25 +0530
> Mahesh J Salgaonkar  wrote:
> 
>> This patch series includes some improvements to the machine check handler
>> for pseries. Patch 1 fixes an issue where the machine check handler crashes
>> the kernel while accessing a vmalloc-ed buffer while in NMI context.
>> Patch 3 dumps the SLB contents on SLB MCE errors to improve debuggability.
>> Patch 4 displays the MCE error details on the console.
>>
>> Changes in V2:
>> - patch 4: Display additional info (NIP and task info) in MCE error details.
>> - patch 5: Fix an endian bug while restoring r3 in the MCE handler.
>>
>> ---
>>
>> Mahesh Salgaonkar (5):
>>   powerpc/pseries: convert rtas_log_buf to linear allocation.
>>   powerpc/pseries: Define MCE error event section.
>>   powerpc/pseries: Dump and flush SLB contents on SLB MCE errors.
>>   powerpc/pseries: Display machine check error details.
>>   powerpc/pseries: Fix endainness while restoring of r3 in MCE handler.
> 
> These look good, should patch 5 be moved to patch 2 and the first 2
> patches marked for stable?

Yup. Will move patch 5 to 2nd position.

> 
> Do you also plan to dump SLB contents for bare metal MCEs?

Yes. That's the plan. Will do that separately.

Thanks,
-Mahesh.



Re: [PATCH v5 1/4] powerpc/fadump: un-register fadump on kexec path.

2018-04-27 Thread Mahesh Jagannath Salgaonkar
On 04/26/2018 07:10 PM, Nicholas Piggin wrote:
> On Thu, 26 Apr 2018 18:35:10 +0530
> Mahesh Jagannath Salgaonkar <mah...@linux.vnet.ibm.com> wrote:
> 
>> On 04/26/2018 06:28 PM, Nicholas Piggin wrote:
>>> On Thu, 26 Apr 2018 17:12:03 +0530
>>> Mahesh J Salgaonkar <mah...@linux.vnet.ibm.com> wrote:
>>>   
>>>> From: Mahesh Salgaonkar <mah...@linux.vnet.ibm.com>
>>>>
>>>> otherwise the fadump registration in the new kexec-ed kernel complains that
>>>> fadump is already registered. This makes the new kernel continue using the
>>>> fadump registered by the previous kernel, which may lead to invalid vmcore
>>>> generation. Hence this patch fixes the issue by un-registering fadump in
>>>> fadump_cleanup(), which is called on the kexec path, so that the new kernel
>>>> can register fadump with new valid values.  
>>>
>>> I assume this series is for 4.17, but it might be good to get this one
>>> into the 4.16 fixes branch? Should it go to stable kernels?  
>>
>> You are right. Should I send it out as a separate patch?
> 
> Yes I think so.

Done, posted it separately (http://patchwork.ozlabs.org/patch/905508/).
Ignore this patch from this series.

Thanks,
-Mahesh.



Re: [PATCH v5 1/4] powerpc/fadump: un-register fadump on kexec path.

2018-04-26 Thread Mahesh Jagannath Salgaonkar
On 04/26/2018 06:28 PM, Nicholas Piggin wrote:
> On Thu, 26 Apr 2018 17:12:03 +0530
> Mahesh J Salgaonkar  wrote:
> 
>> From: Mahesh Salgaonkar 
>>
>> otherwise the fadump registration in the new kexec-ed kernel complains that
>> fadump is already registered. This makes the new kernel continue using the
>> fadump registered by the previous kernel, which may lead to invalid vmcore
>> generation. Hence this patch fixes the issue by un-registering fadump in
>> fadump_cleanup(), which is called on the kexec path, so that the new kernel
>> can register fadump with new valid values.
> 
> I assume this series is for 4.17, but it might be good to get this one
> into the 4.16 fixes branch? Should it go to stable kernels?

You are right. Should I send it out as a separate patch?

> 
>>
>> Signed-off-by: Mahesh Salgaonkar 
>> ---
>>  arch/powerpc/kernel/fadump.c |3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
>> index 8ceabef40d3d..07e8396d472b 100644
>> --- a/arch/powerpc/kernel/fadump.c
>> +++ b/arch/powerpc/kernel/fadump.c
>> @@ -1180,6 +1180,9 @@ void fadump_cleanup(void)
>>  init_fadump_mem_struct(&fdm,
>>  
>> be64_to_cpu(fdm_active->cpu_state_data.destination_address));
>>  fadump_invalidate_dump();
>> +} else if (fw_dump.dump_registered) {
>> +/* Un-register Firmware-assisted dump if it was registered. */
>> +fadump_unregister_dump();
>>  }
>>  }
>>  
>>
> 



Re: [PATCH] powerpc/mce: Fix a bug where mce loops on memory UE.

2018-04-23 Thread Mahesh Jagannath Salgaonkar
On 04/23/2018 04:44 PM, Balbir Singh wrote:
> On Mon, Apr 23, 2018 at 8:33 PM, Mahesh Jagannath Salgaonkar
> <mah...@linux.vnet.ibm.com> wrote:
>> On 04/23/2018 12:21 PM, Balbir Singh wrote:
>>> On Mon, Apr 23, 2018 at 2:59 PM, Mahesh J Salgaonkar
>>> <mah...@linux.vnet.ibm.com> wrote:
>>>> From: Mahesh Salgaonkar <mah...@linux.vnet.ibm.com>
>>>>
>>>> The current code extracts the physical address for UE errors and then
>>>> hooks it up into memory failure infrastructure. On successful extraction
>>>> of physical address it wrongly sets "handled = 1" which means this UE error
>>>> has been recovered. Since MCE handler gets return value as handled = 1, it
>>>> assumes that error has been recovered and goes back to same NIP. This 
>>>> causes
>>>> MCE interrupt again and again in a loop leading to hard lockup.
>>>>
>>>> Also, initialize phys_addr to ULONG_MAX so that we don't end up queuing
>>>> undesired page to hwpoison.
>>>>
>>>> Without this patch we see:
>>>> [ 1476.541984] Severe Machine check interrupt [Recovered]
>>>> [ 1476.541985]   NIP: [1002588c] PID: 7109 Comm: find
>>>> [ 1476.541986]   Initiator: CPU
>>>> [ 1476.541987]   Error type: UE [Load/Store]
>>>> [ 1476.541988] Effective address: 7fffd2755940
>>>> [ 1476.541989] Physical address:  20181a08
>>>> [...]
>>>> [ 1476.542003] Severe Machine check interrupt [Recovered]
>>>> [ 1476.542004]   NIP: [1002588c] PID: 7109 Comm: find
>>>> [ 1476.542005]   Initiator: CPU
>>>> [ 1476.542006]   Error type: UE [Load/Store]
>>>> [ 1476.542006] Effective address: 7fffd2755940
>>>> [ 1476.542007] Physical address:  20181a08
>>>> [ 1476.542010] Severe Machine check interrupt [Recovered]
>>>> [ 1476.542012]   NIP: [1002588c] PID: 7109 Comm: find
>>>> [ 1476.542013]   Initiator: CPU
>>>> [ 1476.542014]   Error type: UE [Load/Store]
>>>> [ 1476.542015] Effective address: 7fffd2755940
>>>> [ 1476.542016] Physical address:  20181a08
>>>> [ 1476.542448] Memory failure: 0x20181a08: recovery action for dirty LRU 
>>>> page: Recovered
>>>> [ 1476.542452] Memory failure: 0x20181a08: already hardware poisoned
>>>> [ 1476.542453] Memory failure: 0x20181a08: already hardware poisoned
>>>> [ 1476.542454] Memory failure: 0x20181a08: already hardware poisoned
>>>> [ 1476.542455] Memory failure: 0x20181a08: already hardware poisoned
>>>> [ 1476.542456] Memory failure: 0x20181a08: already hardware poisoned
>>>> [ 1476.542457] Memory failure: 0x20181a08: already hardware poisoned
>>>> [...]
>>>> [ 1490.972174] Watchdog CPU:38 Hard LOCKUP
>>>>
>>>> After this patch we see:
>>>>
>>>> [  325.384336] Severe Machine check interrupt [Not recovered]
>>>
>>> How did you test for this?
>>
>> By injecting a cache SUE using the L2 FIR register (0x1001080c).
>>
>>> If the error was recovered, shouldn't the
>>> process have gotten
>>> a SIGBUS and we should have prevented further access as a part of the 
>>> handling
>>> (memory_failure()). Do we just need a MF_MUST_KILL in the flags?
>>
>> We hook it up to memory_failure() through a work queue, and by the time
>> the work queue kicks in, the application continues to run and hits the
>> same NIP again and again. Every MCE hooks the same address to the memory
>> failure work queue again and throws multiple recovered MCE messages for
>> the same address. Once memory_failure() hwpoisons the page, the
>> application gets a SIGBUS and then we are fine.
>>
> 
> That seems quite broken and not recovered is very confusing. So effectively
> we can never recover from a MCE UE. 

By not setting handled = 1, the recovery code falls through
machine_check_exception()->opal_machine_check(), and then either a
SIGBUS is sent to the process to recover, or we head to the panic path
for a kernel UE. We have already hooked up the physical address to
memory_failure(), which will later hwpoison the page whenever the work
queue kicks in. This patch makes sure this happens.

> I think we need a notion of delayed
> recovery then? Where we do recover, but mark is as recovered with delays?

Yeah, maybe we can set the disposition for the userspace MCE event as
recovery in progress/delayed and then print the MCE event again from the
work queue by looking at the return value from memory_failure(). What do
you think?

> We might want to revisit our recovery process and see if the recovery requires
> to turn the MMU on, but that is for later, I suppose.
> 
>> But in the case of a UE in kernel space, if the early machine check
>> handler machine_check_early() returns as recovered, then
>> machine_check_handle_early() queues up the MCE event and continues from
>> the NIP assuming it is safe, causing an MCE loop. So for a UE in the
>> kernel we end up in a hard lockup.
>>
> 
> Yeah for the kernel, we need to definitely cause a panic for now, I've got 
> other
> patches for things we need to do for pmem that would allow potential recovery.
> 
> Balbir Singh
> 



Re: [PATCH] powerpc/mce: Fix a bug where mce loops on memory UE.

2018-04-23 Thread Mahesh Jagannath Salgaonkar
On 04/23/2018 12:21 PM, Balbir Singh wrote:
> On Mon, Apr 23, 2018 at 2:59 PM, Mahesh J Salgaonkar
>  wrote:
>> From: Mahesh Salgaonkar 
>>
>> The current code extracts the physical address for UE errors and then
>> hooks it up into memory failure infrastructure. On successful extraction
>> of physical address it wrongly sets "handled = 1" which means this UE error
>> has been recovered. Since MCE handler gets return value as handled = 1, it
>> assumes that error has been recovered and goes back to same NIP. This causes
>> MCE interrupt again and again in a loop leading to hard lockup.
>>
>> Also, initialize phys_addr to ULONG_MAX so that we don't end up queuing
>> undesired page to hwpoison.
>>
>> Without this patch we see:
>> [ 1476.541984] Severe Machine check interrupt [Recovered]
>> [ 1476.541985]   NIP: [1002588c] PID: 7109 Comm: find
>> [ 1476.541986]   Initiator: CPU
>> [ 1476.541987]   Error type: UE [Load/Store]
>> [ 1476.541988] Effective address: 7fffd2755940
>> [ 1476.541989] Physical address:  20181a08
>> [...]
>> [ 1476.542003] Severe Machine check interrupt [Recovered]
>> [ 1476.542004]   NIP: [1002588c] PID: 7109 Comm: find
>> [ 1476.542005]   Initiator: CPU
>> [ 1476.542006]   Error type: UE [Load/Store]
>> [ 1476.542006] Effective address: 7fffd2755940
>> [ 1476.542007] Physical address:  20181a08
>> [ 1476.542010] Severe Machine check interrupt [Recovered]
>> [ 1476.542012]   NIP: [1002588c] PID: 7109 Comm: find
>> [ 1476.542013]   Initiator: CPU
>> [ 1476.542014]   Error type: UE [Load/Store]
>> [ 1476.542015] Effective address: 7fffd2755940
>> [ 1476.542016] Physical address:  20181a08
>> [ 1476.542448] Memory failure: 0x20181a08: recovery action for dirty LRU 
>> page: Recovered
>> [ 1476.542452] Memory failure: 0x20181a08: already hardware poisoned
>> [ 1476.542453] Memory failure: 0x20181a08: already hardware poisoned
>> [ 1476.542454] Memory failure: 0x20181a08: already hardware poisoned
>> [ 1476.542455] Memory failure: 0x20181a08: already hardware poisoned
>> [ 1476.542456] Memory failure: 0x20181a08: already hardware poisoned
>> [ 1476.542457] Memory failure: 0x20181a08: already hardware poisoned
>> [...]
>> [ 1490.972174] Watchdog CPU:38 Hard LOCKUP
>>
>> After this patch we see:
>>
>> [  325.384336] Severe Machine check interrupt [Not recovered]
> 
> How did you test for this? 

By injecting a cache SUE using the L2 FIR register (0x1001080c).

> If the error was recovered, shouldn't the
> process have gotten
> a SIGBUS and we should have prevented further access as a part of the handling
> (memory_failure()). Do we just need a MF_MUST_KILL in the flags?

We hook it up to memory_failure() through a work queue, and by the time
the work queue kicks in, the application continues to restart and hits
the same NIP again and again. Every MCE again hooks the same address to
the memory failure work queue and throws multiple recovered MCE messages
for the same address. Once memory_failure() hwpoisons the page, the
application gets SIGBUS and then we are fine.
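The restart-until-poisoned behaviour described above (and the repeated
"already hardware poisoned" lines in the earlier log) can be modelled with a
short sketch. This is illustrative Python, not kernel code; all names in it
(machine_check, memory_failure_worker, the log format) are made up:

```python
# Illustrative model of the deferred MCE-UE flow described above.
poisoned = set()      # pages hwpoisoned by memory_failure()
work_queue = []       # pfns queued by the MCE handler
log = []

def machine_check(pfn):
    """MCE handler: queue the pfn for memory_failure(), report Recovered."""
    work_queue.append(pfn)
    log.append(f"MCE at pfn {pfn:#x} [Recovered]")

def memory_failure_worker():
    """Deferred work: hwpoison each queued page, once."""
    while work_queue:
        pfn = work_queue.pop(0)
        if pfn in poisoned:
            log.append(f"{pfn:#x}: already hardware poisoned")
        else:
            poisoned.add(pfn)
            log.append(f"{pfn:#x}: recovery action: Recovered")

# The process keeps restarting at the same NIP while the work is pending,
# so the same address is queued (and logged) several times ...
for _ in range(3):
    machine_check(0x20181a08)
# ... until the work queue finally kicks in and hwpoisons the page;
# the duplicate queue entries show up as "already hardware poisoned".
memory_failure_worker()
# The next access then raises SIGBUS instead of another MCE.
```

The point of the model is only the ordering: every restart before the
deferred work runs produces another "Recovered" MCE for the same address.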

But in case of a UE in kernel space, if the early machine check handler
"machine_check_early()" returns as recovered, then
machine_check_handle_early() queues up the MCE event and continues from
the NIP assuming it is safe, causing an MCE loop. So, for a UE in the
kernel we end up in a hard lockup.

> Why shouldn't we treat it as handled if we isolate the page?

Yes we should, but I think not until the page is actually hwpoisoned or
until we send SIGBUS to the process.

> 
> Thanks,
> Balbir Singh.
> 



Re: [PATCH v4 3/7] powerpc/fadump: un-register fadump on kexec path.

2018-04-22 Thread Mahesh Jagannath Salgaonkar
On 04/22/2018 07:28 AM, Nicholas Piggin wrote:
> On Fri, 20 Apr 2018 10:34:35 +0530
> Mahesh J Salgaonkar  wrote:
> 
>> From: Mahesh Salgaonkar 
>>
>> otherwise the fadump registration in the new kexec-ed kernel complains
>> that fadump is already registered. This makes the new kernel continue
>> using the fadump registered by the previous kernel, which may lead to
>> invalid vmcore generation. Fix this by un-registering fadump in
>> fadump_cleanup(), which is called in the kexec path, so that the new
>> kernel can register fadump with new, valid values.
> 
> Is this a bug fix that should go to previous kernels as well?

Yes.

> 
>>
>> Signed-off-by: Mahesh Salgaonkar 
>> ---
>>  arch/powerpc/kernel/fadump.c |3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
>> index 43bfa535d0ea..16b3e8c5cae0 100644
>> --- a/arch/powerpc/kernel/fadump.c
>> +++ b/arch/powerpc/kernel/fadump.c
>> @@ -1276,6 +1276,9 @@ void fadump_cleanup(void)
>>  /* Invalidate the registration only if dump is active. */
>>  if (fw_dump.dump_active) {
>>  fadump_invalidate_dump(fdm_active);
>> +} else if (fw_dump.dump_registered) {
>> +/* Un-register Firmware-assisted dump if it was registered. */
>> +fadump_unregister_dump();
>>  }
>>  }
>>  
>>
> 



Re: [PATCH v4 1/7] powerpc/fadump: Move the metadata region to start of the reserved area.

2018-04-22 Thread Mahesh Jagannath Salgaonkar
On 04/22/2018 07:28 AM, Nicholas Piggin wrote:
> On Fri, 20 Apr 2018 10:34:18 +0530
> Mahesh J Salgaonkar  wrote:
> 
>> From: Mahesh Salgaonkar 
>>
>> Currently the metadata region that holds crash info structure and ELF core
>> header is placed towards the end of reserved memory area. This patch places
>> it at the beginning of the reserved memory area. It also introduces
>> additional dump section called metadata section to communicate location
>> of metadata region to 2nd kernel. This patch also maintains the
>> compatibility between production/capture kernels irrespective of their
>> kernel versions. Both combination older/newer and newer/older works fine.
> 
> Trying to look at the patches it might help me if you document reasons
> for why this change is made changelog, even if it may be obvious to
> someone who knows the code better.

Yeah, I should have mentioned that this patch provides the foundation
for the CMA patch (patch 4). With CMA reservation we now allocate the
metadata region using cma_alloc(), which always allocates the metadata
region at the start of the CMA reserved region. Earlier, in v1, I had this
change included along with the CMA reservation patch, but then, to make
things simpler for review, I did a logical split of the metadata region
movement and the CMA reservation patch. I think I should order patches 1,
2 and 4 in a sequence and move patch 3 to patch 1.

> 
> I thought you could include the documentation change in this patch as
> well, but maybe that's a matter of preference.

Yeah, that's how I prefer it :-), but that's just me. But if it helps the
review I can fold it into patch 1.

Thanks,
-Mahesh.



Re: [PATCH v2 2/2] powerpc/fadump: Do not use hugepages when fadump is active

2018-04-11 Thread Mahesh Jagannath Salgaonkar
On 04/10/2018 07:11 PM, Hari Bathini wrote:
> FADump capture kernel boots in restricted memory environment preserving
> the context of previous kernel to save vmcore. Supporting hugepages in
> such environment makes things unnecessarily complicated, as hugepages
> need memory set aside for them. This means most of the capture kernel's
> memory is used in supporting hugepages. In most cases, this results in
> out-of-memory issues while booting FADump capture kernel. But hugepages
> are not of much use in capture kernel whose only job is to save vmcore.
> So, disabling hugepages support, when fadump is active, is a reliable
> solution for the out of memory issues. Introducing a flag variable to
> disable HugeTLB support when fadump is active.
> 
> Signed-off-by: Hari Bathini 
> ---
> 
> Changes in v2:
> * Introduce a hugetlb_disabled flag to enable/disable hugepage support &
>   use that flag to disable hugepage support when fadump is active.

Looks good to me.

Reviewed-by: Mahesh Salgaonkar 

> 
> 
>  arch/powerpc/include/asm/page.h |1 +
>  arch/powerpc/kernel/fadump.c|8 
>  arch/powerpc/mm/hash_utils_64.c |6 --
>  arch/powerpc/mm/hugetlbpage.c   |7 +++
>  4 files changed, 20 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/page.h b/arch/powerpc/include/asm/page.h
> index 8da5d4c..40aee93 100644
> --- a/arch/powerpc/include/asm/page.h
> +++ b/arch/powerpc/include/asm/page.h
> @@ -39,6 +39,7 @@
> 
>  #ifndef __ASSEMBLY__
>  #ifdef CONFIG_HUGETLB_PAGE
> +extern bool hugetlb_disabled;
>  extern unsigned int HPAGE_SHIFT;
>  #else
>  #define HPAGE_SHIFT PAGE_SHIFT
> diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
> index bea8d5f..8ceabef4 100644
> --- a/arch/powerpc/kernel/fadump.c
> +++ b/arch/powerpc/kernel/fadump.c
> @@ -402,6 +402,14 @@ int __init fadump_reserve_mem(void)
>   if (fw_dump.dump_active) {
>   pr_info("Firmware-assisted dump is active.\n");
> 
> +#ifdef CONFIG_HUGETLB_PAGE
> + /*
> +  * FADump capture kernel doesn't care much about hugepages.
> +  * In fact, handling hugepages in capture kernel is asking for
> +  * trouble. So, disable HugeTLB support when fadump is active.
> +  */
> + hugetlb_disabled = true;
> +#endif
>   /*
>* If last boot has crashed then reserve all the memory
>* above boot_memory_size so that we don't touch it until
> diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
> index cf290d41..eab8f1d 100644
> --- a/arch/powerpc/mm/hash_utils_64.c
> +++ b/arch/powerpc/mm/hash_utils_64.c
> @@ -571,8 +571,10 @@ static void __init htab_scan_page_sizes(void)
>   }
> 
>  #ifdef CONFIG_HUGETLB_PAGE
> - /* Reserve 16G huge page memory sections for huge pages */
> - of_scan_flat_dt(htab_dt_scan_hugepage_blocks, NULL);
> + if (!hugetlb_disabled) {
> + /* Reserve 16G huge page memory sections for huge pages */
> + of_scan_flat_dt(htab_dt_scan_hugepage_blocks, NULL);
> + }
>  #endif /* CONFIG_HUGETLB_PAGE */
>  }
> 
> diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
> index 876da2b..18c080a 100644
> --- a/arch/powerpc/mm/hugetlbpage.c
> +++ b/arch/powerpc/mm/hugetlbpage.c
> @@ -35,6 +35,8 @@
>  #define PAGE_SHIFT_16M   24
>  #define PAGE_SHIFT_16G   34
> 
> +bool hugetlb_disabled = false;
> +
>  unsigned int HPAGE_SHIFT;
>  EXPORT_SYMBOL(HPAGE_SHIFT);
> 
> @@ -653,6 +655,11 @@ static int __init hugetlbpage_init(void)
>  {
>   int psize;
> 
> + if (hugetlb_disabled) {
> + pr_info("HugeTLB support is disabled!\n");
> + return 0;
> + }
> +
>  #if !defined(CONFIG_PPC_FSL_BOOK3E) && !defined(CONFIG_PPC_8xx)
>   if (!radix_enabled() && !mmu_has_feature(MMU_FTR_16M_PAGE))
>   return -ENODEV;
> 



Re: [PATCH v3 1/7] powerpc/fadump: Move the metadata region to start of the reserved area.

2018-04-04 Thread Mahesh Jagannath Salgaonkar
On 04/04/2018 12:56 AM, Hari Bathini wrote:
> Mahesh, I think we should explicitly document that production and
> capture kernel
> versions should be same. For changes like below, older/newer production
> kernel vs
> capture kernel is bound to fail. Of course, production and capture
> kernel versions
> would be the same case usually but for the uninitiated, mentioning that
> explicitly
> in documentation would help..?

Yeah, we could do that. In the ideal case the production and capture
kernels will be the same. But yes, there can be cases where we end up in an
older/newer kernel scenario. My earlier versions had backward compatibility,
which I broke in v3. I can fix that in v4. Thanks for catching it. Will also
update the documentation mentioning the working kernel combinations.

Thanks,
-Mahesh.

> 
> Thanks
> Hari
> 
> 
> On Monday 02 April 2018 11:59 AM, Mahesh J Salgaonkar wrote:
>> From: Mahesh Salgaonkar 
>>
>> Currently the metadata region that holds crash info structure and ELF
>> core
>> header is placed towards the end of reserved memory area. This patch
>> places
>> it at the beginning of the reserved memory area.
>>
>> Signed-off-by: Mahesh Salgaonkar 
>> ---
>>   arch/powerpc/include/asm/fadump.h |    4 ++
>>   arch/powerpc/kernel/fadump.c  |   92
>> -
>>   2 files changed, 83 insertions(+), 13 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/fadump.h
>> b/arch/powerpc/include/asm/fadump.h
>> index 5a23010af600..44b6ebfe9be6 100644
>> --- a/arch/powerpc/include/asm/fadump.h
>> +++ b/arch/powerpc/include/asm/fadump.h
>> @@ -61,6 +61,9 @@
>>   #define FADUMP_UNREGISTER    2
>>   #define FADUMP_INVALIDATE    3
>>
>> +/* Number of dump sections requested by kernel */
>> +#define FADUMP_NUM_SECTIONS    4
>> +
>>   /* Dump status flag */
>>   #define FADUMP_ERROR_FLAG    0x2000
>>
>> @@ -119,6 +122,7 @@ struct fadump_mem_struct {
>>   struct fadump_section    cpu_state_data;
>>   struct fadump_section    hpte_region;
>>   struct fadump_section    rmr_region;
>> +    struct fadump_section    metadata_region;
>>   };
>>
>>   /* Firmware-assisted dump configuration details. */
>> diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
>> index 3c2c2688918f..80552fd7d5f8 100644
>> --- a/arch/powerpc/kernel/fadump.c
>> +++ b/arch/powerpc/kernel/fadump.c
>> @@ -188,17 +188,48 @@ static void fadump_show_config(void)
>>   pr_debug("Boot memory size  : %lx\n", fw_dump.boot_memory_size);
>>   }
>>
>> +static unsigned long get_fadump_metadata_size(void)
>> +{
>> +    unsigned long size = 0;
>> +
>> +    /*
>> + * If fadump is active then look into fdm_active to get metadata
>> + * size set by previous kernel.
>> + */
>> +    if (fw_dump.dump_active) {
>> +    size = fdm_active->cpu_state_data.destination_address -
>> +    fdm_active->metadata_region.source_address;
>> +    goto out;
>> +    }
>> +    size += sizeof(struct fadump_crash_info_header);
>> +    size += sizeof(struct elfhdr); /* ELF core header.*/
>> +    size += sizeof(struct elf_phdr); /* place holder for cpu notes */
>> +    /* Program headers for crash memory regions. */
>> +    size += sizeof(struct elf_phdr) * (memblock_num_regions(memory) +
>> 2);
>> +
>> +    size = PAGE_ALIGN(size);
>> +out:
>> +    pr_debug("fadump Metadata size is %ld\n", size);
>> +    return size;
>> +}
>> +
>>   static unsigned long init_fadump_mem_struct(struct fadump_mem_struct
>> *fdm,
>>   unsigned long addr)
>>   {
>> +    uint16_t num_sections = 0;
>> +    unsigned long metadata_base = 0;
>> +
>>   if (!fdm)
>>   return 0;
>>
>> +    /* Skip the fadump metadata area. */
>> +    metadata_base = addr;
>> +    addr += get_fadump_metadata_size();
>> +
>>   memset(fdm, 0, sizeof(struct fadump_mem_struct));
>>   addr = addr & PAGE_MASK;
>>
>>   fdm->header.dump_format_version = cpu_to_be32(0x0001);
>> -    fdm->header.dump_num_sections = cpu_to_be16(3);
>>   fdm->header.dump_status_flag = 0;
>>   fdm->header.offset_first_dump_section =
>>   cpu_to_be32((u32)offsetof(struct fadump_mem_struct,
>> cpu_state_data));
>> @@ -222,6 +253,7 @@ static unsigned long init_fadump_mem_struct(struct
>> fadump_mem_struct *fdm,
>>   fdm->cpu_state_data.source_address = 0;
>>   fdm->cpu_state_data.source_len =
>> cpu_to_be64(fw_dump.cpu_state_data_size);
>>   fdm->cpu_state_data.destination_address = cpu_to_be64(addr);
>> +    num_sections++;
>>   addr += fw_dump.cpu_state_data_size;
>>
>>   /* hpte region section */
>> @@ -230,6 +262,7 @@ static unsigned long init_fadump_mem_struct(struct
>> fadump_mem_struct *fdm,
>>   fdm->hpte_region.source_address = 0;
>>   fdm->hpte_region.source_len =
>> cpu_to_be64(fw_dump.hpte_region_size);
>>   fdm->hpte_region.destination_address = cpu_to_be64(addr);
>> +    num_sections++;
>>   addr += 

Re: [PATCH v3 4/7] powerpc/fadump: exclude memory holes while reserving memory in second kernel.

2018-04-03 Thread Mahesh Jagannath Salgaonkar
On 04/03/2018 03:21 PM, Hari Bathini wrote:
> 
> 
> On Monday 02 April 2018 12:00 PM, Mahesh J Salgaonkar wrote:
>> From: Mahesh Salgaonkar 
>>
>> The second kernel, during early boot after the crash, reserves rest of
>> the memory above boot memory size to make sure it does not touch any
>> of the
>> dump memory area. It uses memblock_reserve() that reserves the specified
>> memory region irrespective of memory holes present within that region.
>> There are chances where previous kernel would have hot removed some of
>> its memory leaving memory holes behind. In such cases fadump kernel
>> reports
>> incorrect number of reserved pages through arch_reserved_kernel_pages()
>> hook causing kernel to hang or panic.
>>
>> Fix this by excluding memory holes while reserving rest of the memory
>> above boot memory size during second kernel boot after crash.
>>
>> Signed-off-by: Mahesh Salgaonkar 
>> ---
>>   arch/powerpc/kernel/fadump.c |   17 -
>>   1 file changed, 16 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
>> index 011f1aa7abab..a497e9fb93fe 100644
>> --- a/arch/powerpc/kernel/fadump.c
>> +++ b/arch/powerpc/kernel/fadump.c
>> @@ -433,6 +433,21 @@ static inline unsigned long
>> get_fadump_metadata_base(
>>   return be64_to_cpu(fdm->metadata_region.source_address);
>>   }
>>
>> +static void fadump_memblock_reserve(unsigned long base, unsigned long
>> size)
>> +{
>> +    struct memblock_region *reg;
>> +    unsigned long start, end;
>> +
>> +    for_each_memblock(memory, reg) {
>> +    start = max(base, (unsigned long)reg->base);
>> +    end = reg->base + reg->size;
>> +    end = min(base + size, end);
>> +
>> +    if (start < end)
>> +    memblock_reserve(start, end - start);
>> +    }
>> +}
>> +
>>   int __init fadump_reserve_mem(void)
>>   {
>>   unsigned long base, size, memory_boundary;
>> @@ -487,7 +502,7 @@ int __init fadump_reserve_mem(void)
>>    */
>>   base = fw_dump.boot_memory_size;
>>   size = memory_boundary - base;
>> -    memblock_reserve(base, size);
>> +    fadump_memblock_reserve(base, size);
>>   printk(KERN_INFO "Reserved %ldMB of memory at %ldMB "
> 
> Mahesh, you may want to change this print as well as it would be
> misleading in case of
> holes in the memory.

Yeah, maybe we can just move that printk inside
fadump_memblock_reserve() as well.

Thanks,
-Mahesh.

> 
> Thanks
> Hari
> 
>>   "for saving crash dump\n",
>>   (unsigned long)(size >> 20),
>>
> 
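As a side note, the loop in fadump_memblock_reserve() above is just a clip
of the requested range against each present memblock region. A small
illustrative sketch (Python, hypothetical names; the per-range message is
moved inside the loop, as suggested in the review above):

```python
# Illustrative sketch of the clipping loop in fadump_memblock_reserve():
# intersect the requested range with each present memblock region so
# that memory holes are skipped.

def fadump_memblock_reserve(regions, base, size):
    """regions: list of (start, size) of present memory; returns the
    sub-ranges that would actually be memblock_reserve()d."""
    reserved = []
    for reg_base, reg_size in regions:
        start = max(base, reg_base)
        end = min(base + size, reg_base + reg_size)
        if start < end:
            reserved.append((start, end - start))
            # per-range message, printed inside the loop as suggested
            print(f"Reserved {(end - start) >> 20}MB at {start >> 20}MB "
                  "for saving crash dump")
    return reserved

# memory with a hot-removed hole at [0x4000_0000, 0x6000_0000)
present = [(0x0, 0x4000_0000), (0x6000_0000, 0x4000_0000)]
ranges = fadump_memblock_reserve(present, 0x1000_0000, 0x8000_0000)
```

With the hole excluded, two sub-ranges are reserved instead of one
contiguous range, so arch_reserved_kernel_pages() counts only pages that
actually exist.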



Re: [PATCH v3 6/7] powerpc/fadump: Do not allow hot-remove memory from fadump reserved area.

2018-04-03 Thread Mahesh Jagannath Salgaonkar
On 04/03/2018 08:48 AM, Pingfan Liu wrote:
> I think CMA has protected us from hot-remove, so this patch is not necessary.

Yes, but only if the memory from the declared CMA region is allocated using
cma_alloc(). The rest of the memory inside the CMA region that hasn't been
cma_alloc-ed can still be hot-removed. Hence this patch is necessary.

fadump allocates only 1 page from the fadump CMA region, for the metadata
region. All the remaining memory is free for applications to use and is
vulnerable to hot-remove.

Thanks,
-Mahesh.

> 
> Regards,
> Pingfan
> 
> On Mon, Apr 2, 2018 at 2:30 PM, Mahesh J Salgaonkar
>  wrote:
>> From: Mahesh Salgaonkar 
>>
>> For fadump to work successfully there should not be any holes in reserved
>> memory ranges where kernel has asked firmware to move the content of old
>> kernel memory in event of crash. But this memory area is currently not
>> protected from hot-remove operations. Hence, fadump service can fail to
>> re-register after the hot-remove operation, if hot-removed memory belongs
>> to fadump reserved region. To avoid this make sure that memory from fadump
>> reserved area is not hot-removable if fadump is registered.
>>
>> However, if user still wants to remove that memory, he can do so by
>> manually stopping fadump service before hot-remove operation.
>>
>> Signed-off-by: Mahesh Salgaonkar 
>> ---
>>  arch/powerpc/include/asm/fadump.h   |2 +-
>>  arch/powerpc/kernel/fadump.c|   10 --
>>  arch/powerpc/platforms/pseries/hotplug-memory.c |7 +--
>>  3 files changed, 14 insertions(+), 5 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/fadump.h 
>> b/arch/powerpc/include/asm/fadump.h
>> index 44b6ebfe9be6..d16dc77107a8 100644
>> --- a/arch/powerpc/include/asm/fadump.h
>> +++ b/arch/powerpc/include/asm/fadump.h
>> @@ -207,7 +207,7 @@ struct fad_crash_memory_ranges {
>> unsigned long long  size;
>>  };
>>
>> -extern int is_fadump_boot_memory_area(u64 addr, ulong size);
>> +extern int is_fadump_memory_area(u64 addr, ulong size);
>>  extern int early_init_dt_scan_fw_dump(unsigned long node,
>> const char *uname, int depth, void *data);
>>  extern int fadump_reserve_mem(void);
>> diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
>> index 59aaf0357a52..2c3c7e655eec 100644
>> --- a/arch/powerpc/kernel/fadump.c
>> +++ b/arch/powerpc/kernel/fadump.c
>> @@ -162,13 +162,19 @@ int __init early_init_dt_scan_fw_dump(unsigned long 
>> node,
>>
>>  /*
>>   * If fadump is registered, check if the memory provided
>> - * falls within boot memory area.
>> + * falls within boot memory area and reserved memory area.
>>   */
>> -int is_fadump_boot_memory_area(u64 addr, ulong size)
>> +int is_fadump_memory_area(u64 addr, ulong size)
>>  {
>> +   u64 d_start = fw_dump.reserve_dump_area_start;
>> +   u64 d_end = d_start + fw_dump.reserve_dump_area_size;
>> +
>> if (!fw_dump.dump_registered)
>> return 0;
>>
>> +   if (((addr + size) > d_start) && (addr <= d_end))
>> +   return 1;
>> +
>> return (addr + size) > RMA_START && addr <= fw_dump.boot_memory_size;
>>  }
>>
>> diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
>> b/arch/powerpc/platforms/pseries/hotplug-memory.c
>> index c1578f54c626..e4c658cda3a7 100644
>> --- a/arch/powerpc/platforms/pseries/hotplug-memory.c
>> +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
>> @@ -389,8 +389,11 @@ static bool lmb_is_removable(struct drmem_lmb *lmb)
>> phys_addr = lmb->base_addr;
>>
>>  #ifdef CONFIG_FA_DUMP
>> -   /* Don't hot-remove memory that falls in fadump boot memory area */
>> -   if (is_fadump_boot_memory_area(phys_addr, block_sz))
>> +   /*
>> +* Don't hot-remove memory that falls in fadump boot memory area
>> +* and memory that is reserved for capturing old kernel memory.
>> +*/
>> +   if (is_fadump_memory_area(phys_addr, block_sz))
>> return false;
>>  #endif
>>
>>
> 
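The reserved-area check added in is_fadump_memory_area() above is an
interval-overlap test on [addr, addr + size) against the dump area
[d_start, d_end]. A tiny model of it (illustrative Python only, not kernel
code):

```python
# Model of the range check in is_fadump_memory_area() above:
# mirrors ((addr + size) > d_start) && (addr <= d_end).

def overlaps_dump_area(addr, size, d_start, d_end):
    return (addr + size) > d_start and addr <= d_end

d_start, d_end = 0x1000_0000, 0x2000_0000
assert not overlaps_dump_area(0x0, 0x1000_0000, d_start, d_end)  # ends at d_start
assert overlaps_dump_area(0x1800_0000, 0x100, d_start, d_end)    # fully inside
assert overlaps_dump_area(0x0, 0x1000_0001, d_start, d_end)      # crosses d_start
```

A block that overlaps the reserved dump area by even one byte is reported
as fadump memory and therefore not hot-removable while fadump is registered.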



Re: [PATCH] powernv: Avoid calling trace tlbie in kexec path.

2017-11-23 Thread Mahesh Jagannath Salgaonkar
On 11/23/2017 04:26 AM, Balbir Singh wrote:
> On Thu, Nov 23, 2017 at 4:32 AM, Mahesh J Salgaonkar
>  wrote:
>> From: Mahesh Salgaonkar 
>>
>> Rebooting into a new kernel with kexec fails in trace_tlbie() which is
>> called from native_hpte_clear(). This happens if the running kernel has
>> CONFIG_LOCKDEP enabled. With lockdep enabled, the tracepoints always
>> execute few RCU checks regardless of whether tracing is on or off.
>> We are already in the last phase of kexec sequence in real mode with
>> HILE_BE set. At this point the RCU check ends up in RCU_LOCKDEP_WARN and
>> causes kexec to fail.
>>
> 
> Effectively we can't enter the trace point code after we've set
> HILE_BE.  Do we need
> a fixes tag? Or is this a side-effect of a new generic change?

Yup. I missed it. Will resend the patch with the Fixes tag:

Fixes: 0428491cba92 ("powerpc/mm: Trace tlbie(l) instructions")

> 
> I think the right thing in the longer run might be to do a 
> TRACE_EVENT_CONDITION
> and have the condition do the right thing, but what you have for now is good.
> 
> Balbir Singh.
> 



Re: [PATCH] powernv: Avoid calling trace tlbie in kexec path.

2017-11-23 Thread Mahesh Jagannath Salgaonkar
On 11/23/2017 12:37 AM, Naveen N. Rao wrote:
> Mahesh J Salgaonkar wrote:
>> From: Mahesh Salgaonkar 
>>
>> Rebooting into a new kernel with kexec fails in trace_tlbie() which is
>> called from native_hpte_clear(). This happens if the running kernel has
>> CONFIG_LOCKDEP enabled. With lockdep enabled, the tracepoints always
>> execute few RCU checks regardless of whether tracing is on or off.
>> We are already in the last phase of kexec sequence in real mode with
>> HILE_BE set. At this point the RCU check ends up in RCU_LOCKDEP_WARN and
>> causes kexec to fail.
>>
>> Fix this by not calling trace_tlbie() from native_hpte_clear().
>>
>> Signed-off-by: Mahesh Salgaonkar 
>> Reported-by: Aneesh Kumar K.V 
>> Suggested-by: Michael Ellerman 
>> ---
>>  arch/powerpc/mm/hash_native_64.c |   15 ---
>>  1 file changed, 12 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/powerpc/mm/hash_native_64.c
>> b/arch/powerpc/mm/hash_native_64.c
>> index 3848af1..640cf56 100644
>> --- a/arch/powerpc/mm/hash_native_64.c
>> +++ b/arch/powerpc/mm/hash_native_64.c
>> @@ -47,7 +47,8 @@
>>
>>  DEFINE_RAW_SPINLOCK(native_tlbie_lock);
>>
>> -static inline void __tlbie(unsigned long vpn, int psize, int apsize,
>> int ssize)
>> +static inline unsigned long  ___tlbie(unsigned long vpn, int psize,
>> +    int apsize, int ssize)
>>  {
>>  unsigned long va;
>>  unsigned int penc;
>> @@ -100,7 +101,15 @@ static inline void __tlbie(unsigned long vpn, int
>> psize, int apsize, int ssize)
>>   : "memory");
>>  break;
>>  }
>> -    trace_tlbie(0, 0, va, 0, 0, 0, 0);
> 
> Does it help if you use the _rcuidle variant instead, to turn off all
> rcu checks for tracing __tlbie()?
> trace_tlbie_rcuidle(0, 0, va, 0, 0, 0, 0);

It helps if the tracepoint is not enabled. But with the tracepoint enabled,
kexec still fails. I think we should not have a tracepoint in the kexec
path at all. If someone enables it, kexec will definitely fail regardless
of CONFIG_LOCKDEP.

Thanks,
-Mahesh.



Re: [rfc 2/3] powerpc/mce: Extract physical_address for UE errors

2017-09-07 Thread Mahesh Jagannath Salgaonkar
On 09/05/2017 09:45 AM, Balbir Singh wrote:
> Walk the page table for NIP and extract the instruction. Then
> use the instruction to find the effective address via analyse_instr().
> 
> We might have page table walking races, but we expect them to
> be rare, the physical address extraction is best effort. The idea
> is to then hook up this infrastructure to memory failure eventually.
> 
> Signed-off-by: Balbir Singh 
> ---
>  arch/powerpc/include/asm/mce.h  |  2 +-
>  arch/powerpc/kernel/mce.c   |  6 -
>  arch/powerpc/kernel/mce_power.c | 60 
> +
>  3 files changed, 61 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/mce.h b/arch/powerpc/include/asm/mce.h
> index 75292c7..3a1226e 100644
> --- a/arch/powerpc/include/asm/mce.h
> +++ b/arch/powerpc/include/asm/mce.h
> @@ -204,7 +204,7 @@ struct mce_error_info {
> 
>  extern void save_mce_event(struct pt_regs *regs, long handled,
>  struct mce_error_info *mce_err, uint64_t nip,
> -uint64_t addr);
> +uint64_t addr, uint64_t phys_addr);
>  extern int get_mce_event(struct machine_check_event *mce, bool release);
>  extern void release_mce_event(void);
>  extern void machine_check_queue_event(void);
> diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
> index e254399..f41a75d 100644
> --- a/arch/powerpc/kernel/mce.c
> +++ b/arch/powerpc/kernel/mce.c
> @@ -82,7 +82,7 @@ static void mce_set_error_info(struct machine_check_event 
> *mce,
>   */
>  void save_mce_event(struct pt_regs *regs, long handled,
>   struct mce_error_info *mce_err,
> - uint64_t nip, uint64_t addr)
> + uint64_t nip, uint64_t addr, uint64_t phys_addr)
>  {
>   int index = __this_cpu_inc_return(mce_nest_count) - 1;
> struct machine_check_event *mce = this_cpu_ptr(&mce_event[index]);
> @@ -140,6 +140,10 @@ void save_mce_event(struct pt_regs *regs, long handled,
>   } else if (mce->error_type == MCE_ERROR_TYPE_UE) {
>   mce->u.ue_error.effective_address_provided = true;
>   mce->u.ue_error.effective_address = addr;
> + if (phys_addr != ULONG_MAX) {
> + mce->u.ue_error.physical_address_provided = true;
> + mce->u.ue_error.physical_address = phys_addr;
> + }
>   }
>   return;
>  }
> diff --git a/arch/powerpc/kernel/mce_power.c b/arch/powerpc/kernel/mce_power.c
> index b76ca19..b77a698 100644
> --- a/arch/powerpc/kernel/mce_power.c
> +++ b/arch/powerpc/kernel/mce_power.c
> @@ -27,6 +27,25 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
> +#include 
> +
> +static unsigned long addr_to_pfn(struct mm_struct *mm, unsigned long addr)
> +{
> + pte_t *ptep;
> + unsigned long flags;
> +
> + local_irq_save(flags);
> + if (mm == current->mm)
> + ptep = find_current_mm_pte(mm->pgd, addr, NULL, NULL);
> + else
> + ptep = find_init_mm_pte(addr, NULL);
> + local_irq_restore(flags);
> + if (!ptep)
> + return ULONG_MAX;
> + return pte_pfn(*ptep);
> +}
> 
>  static void flush_tlb_206(unsigned int num_sets, unsigned int action)
>  {
> @@ -489,7 +508,8 @@ static int mce_handle_ierror(struct pt_regs *regs,
> 
>  static int mce_handle_derror(struct pt_regs *regs,
>   const struct mce_derror_table table[],
> - struct mce_error_info *mce_err, uint64_t *addr)
> + struct mce_error_info *mce_err, uint64_t *addr,
> + uint64_t *phys_addr)
>  {
>   uint64_t dsisr = regs->dsisr;
>   int handled = 0;
> @@ -555,7 +575,37 @@ static int mce_handle_derror(struct pt_regs *regs,
>   mce_err->initiator = table[i].initiator;
>   if (table[i].dar_valid)
>   *addr = regs->dar;
> -
> + else if (mce_err->severity == MCE_SEV_ERROR_SYNC &&
> + table[i].error_type == MCE_ERROR_TYPE_UE) {
> + /*
> +  * Carefully look at the NIP to determine
> +  * the instruction to analyse. Reading the NIP
> +  * in real-mode is tricky and can lead to recursive
> +  * faults
> +  */
> + int instr;
> + struct mm_struct *mm;
> + unsigned long nip = regs->nip;
> + unsigned long pfn = 0, instr_addr;
> + struct instruction_op op;
> + struct pt_regs tmp = *regs;
> +
> + if (user_mode(regs))
> + mm = current->mm;
> + else
> + mm = &init_mm;
> +
> + pfn = addr_to_pfn(mm, nip);
> + if (pfn != ULONG_MAX) {
> + instr_addr = (pfn 

Re: [PATCH v2 2/3] powerpc/powernv: machine check use kernel crash path

2017-07-20 Thread Mahesh Jagannath Salgaonkar
On 07/19/2017 12:29 PM, Nicholas Piggin wrote:
> There are quite a few machine check exceptions that can be caused by
> kernel bugs. To make debugging easier, use the kernel crash path in
> cases of synchronous machine checks that occur in kernel mode, if that
> would not result in the machine going straight to panic or crash dump.
> 
> There is a downside here that die()ing the process in kernel mode can
> still leave the system unstable. panic_on_oops will always force the
> system to fail-stop, so systems where that behaviour is important will
> still do the right thing.
> 
> As a test, when triggering an i-side 0111b error (ifetch from foreign
> address) in kernel mode process context on POWER9, the kernel currently
> dies quickly like this:
> 
> Severe Machine check interrupt [Not recovered]
>   NIP []: 0x
>   Initiator: CPU
>   Error type: Real address [Instruction fetch (foreign)]
> [  127.426651616,0] OPAL: Reboot requested due to Platform error.
> Effective[  127.426693712,3] OPAL: Reboot requested due to Platform 
> error. address: 
> opal: Reboot type 1 not supported
> Kernel panic - not syncing: PowerNV Unrecovered Machine Check
> CPU: 56 PID: 4425 Comm: syscall Tainted: G   M
> 4.12.0-rc1-13857-ga4700a261072-dirty #35
> Call Trace:
> [  128.017988928,4] IPMI: BUG: Dropping ESEL on the floor due to buggy/mising 
> code in OPAL for this BMCRebooting in 10 seconds..
> Trying to free IRQ 496 from IRQ context!
> 
> 
> After this patch, the process is killed and the kernel continues with
> this message, which gives enough information to identify the offending
> branch (i.e., with CFAR):
> 
> Severe Machine check interrupt [Not recovered]
>   NIP []: 0x
>   Initiator: CPU
>   Error type: Real address [Instruction fetch (foreign)]
> Effective address: 
> Oops: Machine check, sig: 7 [#1]
> SMP NR_CPUS=2048
> NUMA
> PowerNV
> Modules linked in: iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 
> iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack 
> nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp tun bridge stp llc kvm_hv 
> kvm iptable_filter binfmt_misc vmx_crypto ip_tables x_tables autofs4 
> crc32c_vpmsum
> CPU: 22 PID: 4436 Comm: syscall Tainted: G   M
> 4.12.0-rc1-13857-ga4700a261072-dirty #36
> task: c0093230 task.stack: c0093238
> NIP:  LR: 217706a4 CTR: 
> REGS: cfc8fd80 TRAP: 0200   Tainted: G   M 
> (4.12.0-rc1-13857-ga4700a261072-dirty)
> MSR: 901c1003 
>   CR: 24000484  XER: 2000
> CFAR: c0004c80 DAR: 21770a90 DSISR: 0a00 SOFTE: 1
> GPR00: 1ebe 7fffce4818b0 21797f00 
> GPR04: 7fff8007ac24 44000484 4000 7fff801405e8
> GPR08: 9280f033 24000484  0030
> GPR12: 90001003 7fff801bc370  
> GPR16:    
> GPR20:    
> GPR24:    
> GPR28: 7fff801b  217707a0 7fffce481918
> NIP [] 0x
> LR [217706a4] 0x217706a4
> Call Trace:
> Instruction dump:
>        
>        
> ---[ end trace 32ae1dabb4f8dae6 ]---
> 
> Signed-off-by: Nicholas Piggin 

Reviewed-by: Mahesh Salgaonkar 

Thanks,
-Mahesh.

> ---
>  arch/powerpc/include/asm/bug.h|  1 +
>  arch/powerpc/include/asm/fadump.h |  2 ++
>  arch/powerpc/kernel/fadump.c  |  9 -
>  arch/powerpc/kernel/traps.c   | 22 ++
>  arch/powerpc/platforms/powernv/opal.c | 32 ++--
>  5 files changed, 59 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/bug.h b/arch/powerpc/include/asm/bug.h
> index 0151af6c2a50..9a918b3ca5ee 100644
> --- a/arch/powerpc/include/asm/bug.h
> +++ b/arch/powerpc/include/asm/bug.h
> @@ -133,6 +133,7 @@ extern int do_page_fault(struct pt_regs *, unsigned long, 
> unsigned long);
>  extern void bad_page_fault(struct pt_regs *, unsigned long, int);
>  extern void _exception(int, struct pt_regs *, int, unsigned long);
>  extern void die(const char *, struct pt_regs *, long);
> +extern bool die_will_crash(void);
> 
>  #endif /* !__ASSEMBLY__ */
> 
> diff --git a/arch/powerpc/include/asm/fadump.h 
> b/arch/powerpc/include/asm/fadump.h
> index ce88bbe1d809..5a23010af600 100644
> --- a/arch/powerpc/include/asm/fadump.h
> +++ b/arch/powerpc/include/asm/fadump.h
> @@ -209,11 +209,13 @@ 

Re: [PATCH v2 1/3] powerpc/powernv: handle the platform error reboot in ppc_md.restart

2017-07-19 Thread Mahesh Jagannath Salgaonkar
On 07/19/2017 12:29 PM, Nicholas Piggin wrote:
> Unrecovered MCE and HMI errors are sent through a special restart OPAL
> call to log the platform error. The downside is that they don't go
> through normal Linux crash paths, so they don't give much information
> to the Linux console.
> 
> Change this by providing a special crash function which does some of
> the console flushing from the panic() path before calling firmware to
> reboot.
> 
> The downside of this is a little more code to execute before reaching
> the firmware reboot. However in practice, it's critical to get the
> Linux console messages output in order to debug a problem. So this is
> a desirable tradeoff.
> 
> Note on the implementation: It is difficult to plumb a custom reboot
> handler into the panic path, because panic does a little bit too much
> work. For example, it will try to delay with the timebase, but that
> may be corrupted in some cases resulting in a hang without reaching
> the platform reboot. Another problem is that panic can invoke the
> crash dump code which is not what we want in the case of a hardware
> platform error. Long-term the best solution will be to rework the
> panic path so it can be suitable for this kind of panic, but for now
> we just duplicate a bit of the code.
> 
> Signed-off-by: Nicholas Piggin 

Reviewed-by: Mahesh Salgaonkar 

Thanks,
-Mahesh.

> ---
>  arch/powerpc/include/asm/opal.h   |  2 +-
>  arch/powerpc/platforms/powernv/opal-hmi.c | 22 ++--
>  arch/powerpc/platforms/powernv/opal.c | 89 
> ++-
>  arch/powerpc/platforms/powernv/powernv.h  |  2 +
>  4 files changed, 57 insertions(+), 58 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
> index 588fb1c23af9..182dab435aad 100644
> --- a/arch/powerpc/include/asm/opal.h
> +++ b/arch/powerpc/include/asm/opal.h
> @@ -50,7 +50,7 @@ int64_t opal_tpo_write(uint64_t token, uint32_t 
> year_mon_day,
>  uint32_t hour_min);
>  int64_t opal_cec_power_down(uint64_t request);
>  int64_t opal_cec_reboot(void);
> -int64_t opal_cec_reboot2(uint32_t reboot_type, char *diag);
> +int64_t opal_cec_reboot2(uint32_t reboot_type, const char *diag);
>  int64_t opal_read_nvram(uint64_t buffer, uint64_t size, uint64_t offset);
>  int64_t opal_write_nvram(uint64_t buffer, uint64_t size, uint64_t offset);
>  int64_t opal_handle_interrupt(uint64_t isn, __be64 *outstanding_event_mask);
> diff --git a/arch/powerpc/platforms/powernv/opal-hmi.c 
> b/arch/powerpc/platforms/powernv/opal-hmi.c
> index 88f3c61eec95..d78fed728cdf 100644
> --- a/arch/powerpc/platforms/powernv/opal-hmi.c
> +++ b/arch/powerpc/platforms/powernv/opal-hmi.c
> @@ -30,6 +30,8 @@
>  #include 
>  #include 
> 
> +#include "powernv.h"
> +
>  static int opal_hmi_handler_nb_init;
>  struct OpalHmiEvtNode {
>   struct list_head list;
> @@ -267,8 +269,6 @@ static void hmi_event_handler(struct work_struct *work)
>   spin_unlock_irqrestore(_hmi_evt_lock, flags);
> 
>   if (unrecoverable) {
> - int ret;
> -
>   /* Pull all HMI events from OPAL before we panic. */
>   while (opal_get_msg(__pa(), sizeof(msg)) == OPAL_SUCCESS) {
>   u32 type;
> @@ -284,23 +284,7 @@ static void hmi_event_handler(struct work_struct *work)
>   print_hmi_event_info(hmi_evt);
>   }
> 
> - /*
> -  * Unrecoverable HMI exception. We need to inform BMC/OCC
> -  * about this error so that it can collect relevant data
> -  * for error analysis before rebooting.
> -  */
> - ret = opal_cec_reboot2(OPAL_REBOOT_PLATFORM_ERROR,
> - "Unrecoverable HMI exception");
> - if (ret == OPAL_UNSUPPORTED) {
> - pr_emerg("Reboot type %d not supported\n",
> - OPAL_REBOOT_PLATFORM_ERROR);
> - }
> -
> - /*
> -  * Fall through and panic if opal_cec_reboot2() returns
> -  * OPAL_UNSUPPORTED.
> -  */
> - panic("Unrecoverable HMI exception");
> + pnv_platform_error_reboot(NULL, "Unrecoverable HMI exception");
>   }
>  }
> 
> diff --git a/arch/powerpc/platforms/powernv/opal.c 
> b/arch/powerpc/platforms/powernv/opal.c
> index 9b87abb178f0..96436d129684 100644
> --- a/arch/powerpc/platforms/powernv/opal.c
> +++ b/arch/powerpc/platforms/powernv/opal.c
> @@ -25,6 +25,10 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
> +#include 
> +#include 
> 
>  #include 
>  #include 
> @@ -436,10 +440,55 @@ static int opal_recover_mce(struct pt_regs *regs,
>   return recovered;
>  }
> 
> +void pnv_platform_error_reboot(struct pt_regs *regs, const char *msg)
> +{
> + /*
> +  * This is mostly taken from kernel/panic.c, but tries to do
> +  * relatively 

Re: [PATCH 1/4] powerpc/powernv: handle the platform error reboot in ppc_md.restart

2017-07-09 Thread Mahesh Jagannath Salgaonkar
On 07/06/2017 11:26 PM, Nicholas Piggin wrote:
> On Wed,  5 Jul 2017 14:04:19 +1000
> Nicholas Piggin  wrote:
> 
>> Unrecovered MCE and HMI errors are sent through a special restart
>> OPAL call to log the platform error. The downside is that they don't
>> go through normal crash paths, so they don't give much information
>> to the Linux console.
>>
>> Change this by allowing them to set an error which then causes the
>> normal restart handler to use the platform error call. Have MCE and HMI
>> handlers set this and then use the normal panic path for unrecoverable
>> cases.
>>
>> Signed-off-by: Nicholas Piggin 
> 
> Mahesh brought up a couple of good points about this offline.
> Firstly that some HMI erorrs will stop the TB, second that if
> crash dumps are registered then we will not get to the platform
> reboot code from panic.
> 
> So it was a nice idea, but it seems to be just a bit too hard to
> do exactly what we want in the panic path. So the other option is
> put some of the printk and console flushing into the opal platform
> error handler.
> 
> It's not really ideal to duplicate this code here... but it's better
> than not printing anything.
> 
> Patch 2 won't be able to just call die() for kernel context now, but
> it will have to check in_interrupt(), panic_on_oops, etc. to make
> sure die() doesn't panic. But that should be okay.
> 
> This is what I have. If there are no great objections I'll repost
> a v2 series with this.

Looks good to me. Just one minor suggestion/comment below.

> 
> ---
> diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
> index 588fb1c23af9..182dab435aad 100644
> --- a/arch/powerpc/include/asm/opal.h
> +++ b/arch/powerpc/include/asm/opal.h
> @@ -50,7 +50,7 @@ int64_t opal_tpo_write(uint64_t token, uint32_t 
> year_mon_day,
>  uint32_t hour_min);
>  int64_t opal_cec_power_down(uint64_t request);
>  int64_t opal_cec_reboot(void);
> -int64_t opal_cec_reboot2(uint32_t reboot_type, char *diag);
> +int64_t opal_cec_reboot2(uint32_t reboot_type, const char *diag);
>  int64_t opal_read_nvram(uint64_t buffer, uint64_t size, uint64_t offset);
>  int64_t opal_write_nvram(uint64_t buffer, uint64_t size, uint64_t offset);
>  int64_t opal_handle_interrupt(uint64_t isn, __be64 *outstanding_event_mask);
> diff --git a/arch/powerpc/platforms/powernv/opal-hmi.c 
> b/arch/powerpc/platforms/powernv/opal-hmi.c
> index 88f3c61eec95..d78fed728cdf 100644
> --- a/arch/powerpc/platforms/powernv/opal-hmi.c
> +++ b/arch/powerpc/platforms/powernv/opal-hmi.c
> @@ -30,6 +30,8 @@
>  #include 
>  #include 
> 
> +#include "powernv.h"
> +
>  static int opal_hmi_handler_nb_init;
>  struct OpalHmiEvtNode {
>   struct list_head list;
> @@ -267,8 +269,6 @@ static void hmi_event_handler(struct work_struct *work)
>   spin_unlock_irqrestore(_hmi_evt_lock, flags);
> 
>   if (unrecoverable) {
> - int ret;
> -
>   /* Pull all HMI events from OPAL before we panic. */
>   while (opal_get_msg(__pa(), sizeof(msg)) == OPAL_SUCCESS) {
>   u32 type;
> @@ -284,23 +284,7 @@ static void hmi_event_handler(struct work_struct *work)
>   print_hmi_event_info(hmi_evt);
>   }
> 
> - /*
> -  * Unrecoverable HMI exception. We need to inform BMC/OCC
> -  * about this error so that it can collect relevant data
> -  * for error analysis before rebooting.
> -  */
> - ret = opal_cec_reboot2(OPAL_REBOOT_PLATFORM_ERROR,
> - "Unrecoverable HMI exception");
> - if (ret == OPAL_UNSUPPORTED) {
> - pr_emerg("Reboot type %d not supported\n",
> - OPAL_REBOOT_PLATFORM_ERROR);
> - }
> -
> - /*
> -  * Fall through and panic if opal_cec_reboot2() returns
> -  * OPAL_UNSUPPORTED.
> -  */
> - panic("Unrecoverable HMI exception");
> + pnv_platform_error_reboot(NULL, "Unrecoverable HMI exception");
>   }
>  }
> 
> diff --git a/arch/powerpc/platforms/powernv/opal.c 
> b/arch/powerpc/platforms/powernv/opal.c
> index 59684b4af4d1..ccbcfa22bacf 100644
> --- a/arch/powerpc/platforms/powernv/opal.c
> +++ b/arch/powerpc/platforms/powernv/opal.c
> @@ -25,6 +25,10 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
> +#include 
> +#include 
> 
>  #include 
>  #include 
> @@ -421,10 +425,57 @@ static int opal_recover_mce(struct pt_regs *regs,
>   return recovered;
>  }
> 
> +void pnv_platform_error_reboot(struct pt_regs *regs, const char *msg)
> +{
> + /*
> +  * This is mostly taken from kernel/panic.c, but tries to do
> +  * relatively minimal work. Don't use delay functions (TB may
> +  * be broken), don't crash dump (need to set a firmware log),
> +  * don't run notifiers. We do want 

Re: [PATCH v2 1/3] powerpc: do not call ppc_md.panic in fadump panic notifier

2017-07-04 Thread Mahesh Jagannath Salgaonkar
On 07/05/2017 09:26 AM, Nicholas Piggin wrote:
> If fadump is not registered, and no other crash or debug handlers are
> registered, the powerpc panic handler stops the guest before the generic
> panic code can push out debug information to the console.
> 
> Currently, system reset injection causes the guest to silently
> stop.
> 
> Stop calling ppc_md.panic in the panic notifier. crash_fadump already
> does rtas_os_term() to terminate the guest if fadump is registered.
> 
> Remove ppc_md.panic. Move fadump panic notifier into fadump code.
> 
> Signed-off-by: Nicholas Piggin 

Reviewed-by: Mahesh Salgaonkar 

Thanks,
-Mahesh.

> ---
>  arch/powerpc/include/asm/machdep.h |  1 -
>  arch/powerpc/include/asm/setup.h   |  1 -
>  arch/powerpc/kernel/fadump.c   | 22 ++
>  arch/powerpc/kernel/setup-common.c | 27 ---
>  arch/powerpc/platforms/ps3/setup.c | 15 ---
>  arch/powerpc/platforms/pseries/setup.c |  1 -
>  6 files changed, 22 insertions(+), 45 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/machdep.h 
> b/arch/powerpc/include/asm/machdep.h
> index cd2fc1cc1cc7..73b92017b6d7 100644
> --- a/arch/powerpc/include/asm/machdep.h
> +++ b/arch/powerpc/include/asm/machdep.h
> @@ -76,7 +76,6 @@ struct machdep_calls {
> 
>   void __noreturn (*restart)(char *cmd);
>   void __noreturn (*halt)(void);
> - void(*panic)(char *str);
>   void(*cpu_die)(void);
> 
>   long(*time_init)(void); /* Optional, may be NULL */
> diff --git a/arch/powerpc/include/asm/setup.h 
> b/arch/powerpc/include/asm/setup.h
> index 654d64c9f3ac..3a3fb0ca68f5 100644
> --- a/arch/powerpc/include/asm/setup.h
> +++ b/arch/powerpc/include/asm/setup.h
> @@ -23,7 +23,6 @@ extern void reloc_got2(unsigned long);
> 
>  void check_for_initrd(void);
>  void initmem_init(void);
> -void setup_panic(void);
>  #define ARCH_PANIC_TIMEOUT 180
> 
>  #ifdef CONFIG_PPC_PSERIES
> diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
> index 3079518f2245..8f1a8bd8433c 100644
> --- a/arch/powerpc/kernel/fadump.c
> +++ b/arch/powerpc/kernel/fadump.c
> @@ -1447,6 +1447,25 @@ static void fadump_init_files(void)
>   return;
>  }
> 
> +static int fadump_panic_event(struct notifier_block *this,
> + unsigned long event, void *ptr)
> +{
> + /*
> +  * If firmware-assisted dump has been registered then trigger
> +  * firmware-assisted dump and let firmware handle everything
> +  * else. If this returns, then fadump was not registered, so
> +  * go through the rest of the panic path.
> +  */
> + crash_fadump(NULL, ptr);
> +
> + return NOTIFY_DONE;
> +}
> +
> +static struct notifier_block fadump_panic_block = {
> + .notifier_call = fadump_panic_event,
> + .priority = INT_MIN /* may not return; must be done last */
> +};
> +
>  /*
>   * Prepare for firmware-assisted dump.
>   */
> @@ -1479,6 +1498,9 @@ int __init setup_fadump(void)
>   init_fadump_mem_struct(, fw_dump.reserve_dump_area_start);
>   fadump_init_files();
> 
> + atomic_notifier_chain_register(_notifier_list,
> + _panic_block);
> +
>   return 1;
>  }
>  subsys_initcall(setup_fadump);
> diff --git a/arch/powerpc/kernel/setup-common.c 
> b/arch/powerpc/kernel/setup-common.c
> index 94a948207cd2..b697530d9bdc 100644
> --- a/arch/powerpc/kernel/setup-common.c
> +++ b/arch/powerpc/kernel/setup-common.c
> @@ -704,30 +704,6 @@ int check_legacy_ioport(unsigned long base_port)
>  }
>  EXPORT_SYMBOL(check_legacy_ioport);
> 
> -static int ppc_panic_event(struct notifier_block *this,
> - unsigned long event, void *ptr)
> -{
> - /*
> -  * If firmware-assisted dump has been registered then trigger
> -  * firmware-assisted dump and let firmware handle everything else.
> -  */
> - crash_fadump(NULL, ptr);
> - ppc_md.panic(ptr);  /* May not return */
> - return NOTIFY_DONE;
> -}
> -
> -static struct notifier_block ppc_panic_block = {
> - .notifier_call = ppc_panic_event,
> - .priority = INT_MIN /* may not return; must be done last */
> -};
> -
> -void __init setup_panic(void)
> -{
> - if (!ppc_md.panic)
> - return;
> - atomic_notifier_chain_register(_notifier_list, _panic_block);
> -}
> -
>  #ifdef CONFIG_CHECK_CACHE_COHERENCY
>  /*
>   * For platforms that have configurable cache-coherency.  This function
> @@ -872,9 +848,6 @@ void __init setup_arch(char **cmdline_p)
>   /* Probe the machine type, establish ppc_md. */
>   probe_machine();
> 
> - /* Setup panic notifier if requested by the platform. */
> - setup_panic();
> -
>   /*
>* Configure ppc_md.power_save (ppc32 only, 64-bit machines do
>* it from their respective probe() function.
> diff --git 

Re: [PATCH 1/3] powerpc: do not call ppc_md.panic in panic notifier if fadump not used

2017-07-04 Thread Mahesh Jagannath Salgaonkar
On 07/04/2017 03:39 PM, Nicholas Piggin wrote:
> If fadump is not registered, and no other crash or debug handlers are
> registered, the powerpc panic handler stops the guest before the generic
> panic code can push out debug information to the console.
> 
> Without this patch, system reset injection to a guest causes the guest to
> silently stop. Afterwards, we get the expected oops trace.
> 
> Signed-off-by: Nicholas Piggin 
> ---
>  arch/powerpc/kernel/setup-common.c | 15 +--
>  1 file changed, 9 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/setup-common.c 
> b/arch/powerpc/kernel/setup-common.c
> index 94a948207cd2..39ba09965b04 100644
> --- a/arch/powerpc/kernel/setup-common.c
> +++ b/arch/powerpc/kernel/setup-common.c
> @@ -707,12 +707,15 @@ EXPORT_SYMBOL(check_legacy_ioport);
>  static int ppc_panic_event(struct notifier_block *this,
>   unsigned long event, void *ptr)
>  {
> - /*
> -  * If firmware-assisted dump has been registered then trigger
> -  * firmware-assisted dump and let firmware handle everything else.
> -  */
> - crash_fadump(NULL, ptr);
> - ppc_md.panic(ptr);  /* May not return */
> + if (is_fadump_active()) {

Should it be a !fw_dump.dump_registered check? The function crash_fadump()
already checks for !fw_dump.dump_registered before proceeding. If fadump
has not been registered then crash_fadump() will return immediately
without doing anything, and we then fall through to the next line,
ppc_md.panic(ptr).

fadump active is always false in the first kernel (production kernel).
Hence the is_fadump_active() check here would prevent fadump from being
triggered even when it is registered.

fadump active is true only during the second kernel (dump capture kernel)
that boots after a fadump crash, when firmware exports the
'ibm,kernel-dump' device tree node indicating dump data is available.

Thanks,
-Mahesh.

> + /*
> +  * If firmware-assisted dump has been registered then trigger
> +  * firmware-assisted dump and let firmware handle everything
> +  * else.
> +  */
> + crash_fadump(NULL, ptr);
> + ppc_md.panic(ptr);  /* May not return */
> + }
>   return NOTIFY_DONE;
>  }
> 



Re: [PATCH v2] powerpc/fadump: return error when fadump registration fails

2017-05-29 Thread Mahesh Jagannath Salgaonkar
On 05/27/2017 09:16 PM, Michal Suchanek wrote:
>  - log an error message when registration fails and no error code listed
>  in the switch is returned
>  - translate the hv error code to posix error code and return it from
>  fw_register
>  - return the posix error code from fw_register to the process writing
>  to sysfs
>  - return EEXIST on re-registration
>  - return success on deregistration when fadump is not registered
>  - return ENODEV when no memory is reserved for fadump

Why do we need this? Userspace can always read back the fadump
registration status from /sys/kernel/fadump_registered (after echoing 1 to
it) to find out whether fadump registration succeeded or not.

 /sys/kernel/fadump_registered

This is used to display the fadump registration status as well
as to control (start/stop) the fadump registration.
0 = fadump is not registered.
1 = fadump is registered and ready to handle system crash.

-Mahesh.



Re: [PATCH 2/2] powerpc/fadump: avoid holes in boot memory area when fadump is registered

2017-05-05 Thread Mahesh Jagannath Salgaonkar
On 05/04/2017 11:24 PM, Hari Bathini wrote:
> To register fadump, boot memory area - the size of low memory chunk that
> is required for a kernel to boot successfully when booted with restricted
> memory, is assumed to have no holes. But this memory area is currently
> not protected from hot-remove operations. So, fadump could fail to
> re-register after a memory hot-remove operation, if memory is removed
> from boot memory area. To avoid this, ensure that memory from boot
> memory area is not hot-removed when fadump is registered.
> 
> Signed-off-by: Hari Bathini 

Reviewed-by: Mahesh J Salgaonkar 

> ---
>  arch/powerpc/include/asm/fadump.h   |1 +
>  arch/powerpc/kernel/fadump.c|   12 
>  arch/powerpc/platforms/pseries/hotplug-memory.c |7 +++
>  3 files changed, 20 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/fadump.h 
> b/arch/powerpc/include/asm/fadump.h
> index 0031806..609fccc 100644
> --- a/arch/powerpc/include/asm/fadump.h
> +++ b/arch/powerpc/include/asm/fadump.h
> @@ -198,6 +198,7 @@ struct fad_crash_memory_ranges {
>   unsigned long long  size;
>  };
> 
> +extern int is_fadump_boot_memory_area(u64 addr, ulong size);
>  extern int early_init_dt_scan_fw_dump(unsigned long node,
>   const char *uname, int depth, void *data);
>  extern int fadump_reserve_mem(void);
> diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
> index 03563c6..ea7dfdc 100644
> --- a/arch/powerpc/kernel/fadump.c
> +++ b/arch/powerpc/kernel/fadump.c
> @@ -114,6 +114,18 @@ int __init early_init_dt_scan_fw_dump(unsigned long node,
>   return 1;
>  }
> 
> +/*
> + * If fadump is registered, check if the memory provided
> + * falls within boot memory area.
> + */
> +int is_fadump_boot_memory_area(u64 addr, ulong size)
> +{
> + if (!fw_dump.dump_registered)
> + return 0;
> +
> + return (addr + size) > RMA_START && addr <= fw_dump.boot_memory_size;
> +}
> +
>  int is_fadump_active(void)
>  {
>   return fw_dump.dump_active;
> diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
> b/arch/powerpc/platforms/pseries/hotplug-memory.c
> index e104c71..a186b8e 100644
> --- a/arch/powerpc/platforms/pseries/hotplug-memory.c
> +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
> @@ -22,6 +22,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include "pseries.h"
> 
>  static bool rtas_hp_event;
> @@ -406,6 +407,12 @@ static bool lmb_is_removable(struct of_drconf_cell *lmb)
>   scns_per_block = block_sz / MIN_MEMORY_BLOCK_SIZE;
>   phys_addr = lmb->base_addr;
> 
> +#ifdef CONFIG_FA_DUMP
> + /* Don't hot-remove memory that falls in fadump boot memory area */
> + if (is_fadump_boot_memory_area(phys_addr, block_sz))
> + return false;
> +#endif
> +
>   for (i = 0; i < scns_per_block; i++) {
>   pfn = PFN_DOWN(phys_addr);
>   if (!pfn_present(pfn))
> 



Re: [PATCH 1/2] powerpc/fadump: avoid duplicates in crash memory ranges

2017-05-05 Thread Mahesh Jagannath Salgaonkar
On 05/04/2017 11:23 PM, Hari Bathini wrote:
> fadump sets up crash memory ranges to be used for creating PT_LOAD
> program headers in elfcore header. Memory chunk RMA_START through
> boot memory area size is added as the first memory range because
> firmware, at the time of crash, moves this memory chunk to different
> location specified during fadump registration making it necessary to
> create a separate program header for it with the correct offset.
> This memory chunk is skipped while setting up the remaining memory
> ranges. But currently, there is possibility that some of this memory
> may have duplicate entries like when it is hot-removed and added
> again. Ensure that no two memory ranges represent the same memory.
> 
> When 5 lmbs are hot-removed and then hot-plugged before registering
> fadump, here is how the program headers in /proc/vmcore exported by
> fadump look like

We should also make sure that fadump registration fails with a proper
error if the user doesn't put back those LMBs, creating holes below the
boot memory size.

But for this patch

Reviewed-by: Mahesh J Salgaonkar 

Thanks,
-Mahesh.

> 
> without this change:
> 
>   Program Headers:
> Type   Offset VirtAddr   PhysAddr
>FileSizMemSiz  Flags  Align
> NOTE   0x0001 0x 0x
>0x1894 0x1894 0
> LOAD   0x00021020 0xc000 0x
>0x4000 0x4000  RWE0
> LOAD   0x40031020 0xc000 0x
>0x1000 0x1000  RWE0
> LOAD   0x5004 0xc0001000 0x1000
>0x5000 0x5000  RWE0
> LOAD   0xa004 0xc0006000 0x6000
>0x00019ffe 0x00019ffe  RWE0
> 
> and with this change:
> 
>   Program Headers:
> Type   Offset VirtAddr   PhysAddr
>FileSizMemSiz  Flags  Align
> NOTE   0x0001 0x 0x
>0x1894 0x1894 0
> LOAD   0x00021020 0xc000 0x
>0x4000 0x4000  RWE0
> LOAD   0x4003 0xc0004000 0x4000
>0x2000 0x2000  RWE0
> LOAD   0x6003 0xc0006000 0x6000
>0x00019ffe 0x00019ffe  RWE0
> 
> Signed-off-by: Hari Bathini 
> ---
>  arch/powerpc/kernel/fadump.c |   13 +++--
>  1 file changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
> index 8ff0dd4..03563c6 100644
> --- a/arch/powerpc/kernel/fadump.c
> +++ b/arch/powerpc/kernel/fadump.c
> @@ -844,8 +844,17 @@ static void fadump_setup_crash_memory_ranges(void)
>   for_each_memblock(memory, reg) {
>   start = (unsigned long long)reg->base;
>   end = start + (unsigned long long)reg->size;
> - if (start == RMA_START && end >= fw_dump.boot_memory_size)
> - start = fw_dump.boot_memory_size;
> +
> + /*
> +  * skip the first memory chunk (RMA_START through
> +  * boot_memory_size) that is already added.
> +  */
> + if (start < fw_dump.boot_memory_size && start >= RMA_START) {
> + if (end > fw_dump.boot_memory_size)
> + start = fw_dump.boot_memory_size;
> + else
> + continue;
> + }
> 
>   /* add this range excluding the reserved dump area. */
>   fadump_exclude_reserved_area(start, end);
> 



Re: [PATCH v4 2/3] powerpc/fadump: Use the correct VMCOREINFO_NOTE_SIZE for phdr

2017-04-27 Thread Mahesh Jagannath Salgaonkar
On 04/26/2017 12:41 PM, Dave Young wrote:
> Ccing ppc list
> On 04/20/17 at 07:39pm, Xunlei Pang wrote:
>> vmcoreinfo_max_size stands for the vmcoreinfo_data, the
>> correct one we should use is vmcoreinfo_note whose total
>> size is VMCOREINFO_NOTE_SIZE.
>>
>> Like explained in commit 77019967f06b ("kdump: fix exported
>> size of vmcoreinfo note"), it should not affect the actual
>> function, but we better fix it, also this change should be
>> safe and backward compatible.
>>
>> After this, we can get rid of variable vmcoreinfo_max_size,
>> let's use the corresponding macros directly, fewer variables
>> means more safety for vmcoreinfo operation.
>>
>> Cc: Mahesh Salgaonkar 
>> Cc: Hari Bathini 
>> Signed-off-by: Xunlei Pang 

Reviewed-by: Mahesh Salgaonkar 

Thanks,
-Mahesh.

>> ---
>> v3->v4:
>> -Rebased on the latest linux-next
>>
>>  arch/powerpc/kernel/fadump.c | 3 +--
>>  include/linux/crash_core.h   | 1 -
>>  kernel/crash_core.c  | 3 +--
>>  3 files changed, 2 insertions(+), 5 deletions(-)
>>
>> diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
>> index 466569e..7bd6cd0 100644
>> --- a/arch/powerpc/kernel/fadump.c
>> +++ b/arch/powerpc/kernel/fadump.c
>> @@ -893,8 +893,7 @@ static int fadump_create_elfcore_headers(char *bufp)
>>  
>>  phdr->p_paddr   = fadump_relocate(paddr_vmcoreinfo_note());
>>  phdr->p_offset  = phdr->p_paddr;
>> -phdr->p_memsz   = vmcoreinfo_max_size;
>> -phdr->p_filesz  = vmcoreinfo_max_size;
>> +phdr->p_memsz   = phdr->p_filesz = VMCOREINFO_NOTE_SIZE;
>>  
>>  /* Increment number of program headers. */
>>  (elf->e_phnum)++;
>> diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h
>> index ba283a2..7d6bc7b 100644
>> --- a/include/linux/crash_core.h
>> +++ b/include/linux/crash_core.h
>> @@ -55,7 +55,6 @@
>>  
>>  extern u32 *vmcoreinfo_note;
>>  extern size_t vmcoreinfo_size;
>> -extern size_t vmcoreinfo_max_size;
>>  
>>  Elf_Word *append_elf_note(Elf_Word *buf, char *name, unsigned int type,
>>void *data, size_t data_len);
>> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
>> index 0321f04..43cdb00 100644
>> --- a/kernel/crash_core.c
>> +++ b/kernel/crash_core.c
>> @@ -16,7 +16,6 @@
>>  /* vmcoreinfo stuff */
>>  static unsigned char *vmcoreinfo_data;
>>  size_t vmcoreinfo_size;
>> -size_t vmcoreinfo_max_size = VMCOREINFO_BYTES;
>>  u32 *vmcoreinfo_note;
>>  
>>  /*
>> @@ -343,7 +342,7 @@ void vmcoreinfo_append_str(const char *fmt, ...)
>>  r = vscnprintf(buf, sizeof(buf), fmt, args);
>>  va_end(args);
>>  
>> -r = min(r, vmcoreinfo_max_size - vmcoreinfo_size);
>> +r = min(r, VMCOREINFO_BYTES - vmcoreinfo_size);
>>  
>>  memcpy(_data[vmcoreinfo_size], buf, r);
>>  
>> -- 
>> 1.8.3.1
>>
>>
>> ___
>> kexec mailing list
>> ke...@lists.infradead.org
>> http://lists.infradead.org/mailman/listinfo/kexec
> 
> Reviewed-by: Dave Young 
> 
> Thanks
> Dave
> 



Re: [PATCH v2] powerpc/book3s: mce: Move add_taint() later in virtual mode.

2017-04-24 Thread Mahesh Jagannath Salgaonkar
On 04/21/2017 09:37 AM, Michael Ellerman wrote:
> Daniel Axtens  writes:
>>> diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
>>> index a1475e6..b23b323 100644
>>> --- a/arch/powerpc/kernel/mce.c
>>> +++ b/arch/powerpc/kernel/mce.c
>>> @@ -221,6 +221,8 @@ static void machine_check_process_queued_event(struct 
>>> irq_work *work)
>>>  {
>>> int index;
>>>  
>>> +   add_taint(TAINT_MACHINE_CHECK, LOCKDEP_NOW_UNRELIABLE);
>>> +
>> This bit makes sense...
>>
>>> diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
>>> index ff365f9..af97e81 100644
>>> --- a/arch/powerpc/kernel/traps.c
>>> +++ b/arch/powerpc/kernel/traps.c
>>> @@ -741,6 +739,8 @@ void machine_check_exception(struct pt_regs *regs)
>>>  
>>> __this_cpu_inc(irq_stat.mce_exceptions);
>>>  
>>> +   add_taint(TAINT_MACHINE_CHECK, LOCKDEP_NOW_UNRELIABLE);
>>> +
>>
>> But this bit I'm not sure about.
>>
>> Isn't machine_check_exception called from asm in
>> kernel/exceptions-64s.S? As in, it's called really early/in real mode?
> 
> It is called from there, in asm, but not from real mode AFAICS.
> 
> There's a call from machine_check_common(), we're already in virtual
> mode there.
> 
> The other call is from unrecover_mce(), and both places that call that
> do so via rfid, using PACAKMSR, which should turn on virtual mode.
> 
> 
> But none of that really matters. The fundamental issue here is we can't
> recursively call OPAL, that's what matters.
> 
> So if we were in OPAL and take an MCE, then we must not call OPAL again
> from the MCE handler.
> 
> This fixes one case where we know that can happen, but AFAICS we are not
> protected in general from it.
> 
> For example if we take an MCE in OPAL, decide it's not recoverable and
> go to unrecover_mce(), that will call machine_check_exception() which
> can then call OPAL via printk.
> 
> Or maybe there's a check in there somewhere that makes it OK, but it's
> not clear to me.

There is no check, but for a non-recoverable MCE in OPAL we print the MCE
event, go down the panic path and reboot. Hence we are fine. For a
recoverable MCE error in OPAL we would never end up in
machine_check_exception().

Thanks,
-Mahesh.



Re: [PATCH 2/2] powerpc/book3s: mce: Use add_taint_no_warn() in machine_check_early().

2017-04-17 Thread Mahesh Jagannath Salgaonkar
On 04/17/2017 04:09 PM, Daniel Axtens wrote:
> Hi Mahesh,
> 
>> Fixes: 27ea2c420cad powerpc: Set the correct kernel taint on machine check 
>> errors.
> 
> I notice this Fixes a commit I introduced. Please could you cc me when
> you do this? I am likely to miss it otherwise, especially since I have
> now left IBM.

Sure will do. :-)

> 
> Being cced allows me to provide an Ack or a review. And getting feedback
> on my changes is very helpful in becoming a better programmer.
> 
> In this case, as per Michael's comment, why don't we just move the
> add_taint from machine_check_early to
> machine_check_process_queued_event - the other side of the work queue.

Yes. That is what my plan is. Also, that is not the only place:
add_taint() needs to be called from machine_check_exception() as well, so
it will be called from two places.

Thanks,
-Mahesh.

> 
> The work queue system is supposed to provide us with a safe place to do
> printing, etc., so it's an appropriate place. Also, we already do
> machine_check_print_event_info there, and adding the taint doesn't need
> to be done synchronously.
> 
> Regards,
> Daniel
> 
> Mahesh J Salgaonkar  writes:
> 
>> From: Mahesh Salgaonkar 
>>
>> machine_check_early() gets called in real mode. The very first time when
>> add_taint() is called, it prints a warning which ends up calling opal
>> call (that uses OPAL_CALL wrapper) for writing it to console. If we get a
>> very first machine check while we are in opal we are doomed. OPAL_CALL
>> overwrites the PACASAVEDMSR in r13 and in this case when we are done with
>> MCE handling the original opal call will use this new MSR on it's way
>> back to opal_return. This usually leads unexpected behaviour or kernel
>> to panic. Instead use the add_taint_no_warn() that does not call printk.
>>
>> This is broken with current FW level. We got lucky so far for not getting
>> very first MCE hit while in OPAL. But easily reproducible on Mambo.
>> This should go to stable as well alongwith patch 1/2.
>>
>> Signed-off-by: Mahesh Salgaonkar 
>> ---
>>  arch/powerpc/kernel/traps.c |2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
>> index 62b587f..4a048dc 100644
>> --- a/arch/powerpc/kernel/traps.c
>> +++ b/arch/powerpc/kernel/traps.c
>> @@ -306,7 +306,7 @@ long machine_check_early(struct pt_regs *regs)
>>  
>>  __this_cpu_inc(irq_stat.mce_exceptions);
>>  
>> -add_taint(TAINT_MACHINE_CHECK, LOCKDEP_NOW_UNRELIABLE);
>> +add_taint_no_warn(TAINT_MACHINE_CHECK, LOCKDEP_NOW_UNRELIABLE);
>>  
>>  /*
>>   * See if platform is capable of handling machine check. (e.g. PowerNV
> 



Re: [PATCH v7 2/3] powerpc/powernv: Introduce a machine check hook for Guest MCEs.

2017-04-06 Thread Mahesh Jagannath Salgaonkar
On 04/06/2017 10:52 AM, David Gibson wrote:
> On Thu, Apr 06, 2017 at 02:17:22AM +0530, Mahesh J Salgaonkar wrote:
>> From: Mahesh Salgaonkar 
>>
>> This patch introduces a mce hook which is invoked at the time of guest
>> exit to facilitate the host-side handling of machine check exception
>> before the exception is passed on to the guest. This hook will be invoked
>> from host virtual mode from KVM (before exiting the guest with
>> KVM_EXIT_NMI reason) for machine check exception that occurs in the guest.
>>
>> Signed-off-by: Mahesh Salgaonkar 
> 
> Um.. this introduces the hook, and puts in an implementation of it,
> but AFAICT, nothing calls it, either here or in the next patch.  That
> seems a bit pointless.

It gets called in the next patch [3/3] through
ppc_md.machine_check_exception_guest(). See the hunk for
arch/powerpc/kvm/book3s_hv.c in the next patch.

> 
>> ---
>>  arch/powerpc/include/asm/machdep.h |7 +++
>>  arch/powerpc/include/asm/opal.h|4 
>>  arch/powerpc/platforms/powernv/opal.c  |   26 ++
>>  arch/powerpc/platforms/powernv/setup.c |3 +++
>>  4 files changed, 40 insertions(+)
>>
>> diff --git a/arch/powerpc/include/asm/machdep.h 
>> b/arch/powerpc/include/asm/machdep.h
>> index 5011b69..9d74e7a 100644
>> --- a/arch/powerpc/include/asm/machdep.h
>> +++ b/arch/powerpc/include/asm/machdep.h
>> @@ -15,6 +15,7 @@
>>  #include 
>>  
>>  #include 
>> +#include 
>>  
>>  /* We export this macro for external modules like Alsa to know if
>>   * ppc_md.feature_call is implemented or not
>> @@ -112,6 +113,12 @@ struct machdep_calls {
>>  /* Called during machine check exception to retrive fixup address. */
>>  bool(*mce_check_early_recovery)(struct pt_regs *regs);
>>  
>> +#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
>> +/* Called after KVM interrupt handler finishes handling MCE for guest */
>> +int (*machine_check_exception_guest)
>> +(struct machine_check_event *evt);
>> +#endif
>> +
>>  /* Motherboard/chipset features. This is a kind of general purpose
>>   * hook used to control some machine specific features (like reset
>>   * lines, chip power control, etc...).
>> diff --git a/arch/powerpc/include/asm/opal.h 
>> b/arch/powerpc/include/asm/opal.h
>> index 1ff03a6..9b1fcbf 100644
>> --- a/arch/powerpc/include/asm/opal.h
>> +++ b/arch/powerpc/include/asm/opal.h
>> @@ -17,6 +17,7 @@
>>  #ifndef __ASSEMBLY__
>>  
>>  #include 
>> +#include 
>>  
>>  /* We calculate number of sg entries based on PAGE_SIZE */
>>  #define SG_ENTRIES_PER_NODE ((PAGE_SIZE - 16) / sizeof(struct 
>> opal_sg_entry))
>> @@ -273,6 +274,9 @@ extern int opal_hmi_handler_init(void);
>>  extern int opal_event_init(void);
>>  
>>  extern int opal_machine_check(struct pt_regs *regs);
>> +#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
>> +extern int opal_machine_check_guest(struct machine_check_event *evt);
>> +#endif
>>  extern bool opal_mce_check_early_recovery(struct pt_regs *regs);
>>  extern int opal_hmi_exception_early(struct pt_regs *regs);
>>  extern int opal_handle_hmi_exception(struct pt_regs *regs);
>> diff --git a/arch/powerpc/platforms/powernv/opal.c 
>> b/arch/powerpc/platforms/powernv/opal.c
>> index e0f856b..5e633a4 100644
>> --- a/arch/powerpc/platforms/powernv/opal.c
>> +++ b/arch/powerpc/platforms/powernv/opal.c
>> @@ -479,6 +479,32 @@ int opal_machine_check(struct pt_regs *regs)
>>  return 0;
>>  }
>>  
>> +#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
>> +/*
>> + * opal_machine_check_guest() is a hook which is invoked at the time
>> + * of guest exit to facilitate the host-side handling of machine check
>> + * exception before the exception is passed on to the guest. This hook
>> + * is invoked from host virtual mode from KVM (before exiting the guest
>> + * with KVM_EXIT_NMI reason) for machine check exception that occurs in
>> + * the guest.
>> + *
>> + * Currently no action is performed in the host other than printing the
>> + * event information. The machine check exception is passed on to the
>> + * guest kernel and the guest kernel will attempt for recovery.
>> + */
>> +int opal_machine_check_guest(struct machine_check_event *evt)
>> +{
>> +/* Print things out */
>> +if (evt->version != MCE_V1) {
>> +pr_err("Machine Check Exception, Unknown event version %d !\n",
>> +   evt->version);
>> +return 0;
>> +}
>> +machine_check_print_event_info(evt);
>> +return 0;
>> +}
>> +#endif
>> +
>>  /* Early hmi handler called in real mode. */
>>  int opal_hmi_exception_early(struct pt_regs *regs)
>>  {
>> diff --git a/arch/powerpc/platforms/powernv/setup.c 
>> b/arch/powerpc/platforms/powernv/setup.c
>> index d50c7d9..333ee09 100644
>> --- a/arch/powerpc/platforms/powernv/setup.c
>> +++ b/arch/powerpc/platforms/powernv/setup.c
>> @@ -264,6 +264,9 @@ static void __init 

Re: [PATCH 2/2] powerpc/book3s: Display task info for MCE error in user mode.

2017-03-30 Thread Mahesh Jagannath Salgaonkar
On 03/30/2017 05:39 AM, Nicholas Piggin wrote:
> On Tue, 28 Mar 2017 19:15:28 +0530
> Mahesh J Salgaonkar  wrote:
> 
>> From: Mahesh Salgaonkar 
>>
>> For MCE that hit while in use mode MSR(HV=1,PR=1), print the task info on the
>> console MCE error log. This will help to identify application that stumbled
>> upon MCE error.
> 
> I think you may still want these details for a task currently in
> kernel. How about something like if (!in_interrupt()) {

We queue up the MCE event to delay the printing of recovered MCEs in the
kernel. We may have to hook the task details into the MCE event.

> 
> 
>> @@ -311,8 +312,13 @@ void machine_check_print_event_info(struct 
>> machine_check_event *evt)
>>  printk("%s%s Machine check interrupt [%s]\n", level, sevstr,
>> evt->disposition == MCE_DISPOSITION_RECOVERED ?
>> "Recovered" : "Not recovered");
>> -printk("%s  NIP [%016llx]: %pS\n", level, evt->srr0,
>> +if (user_mode) {
>> +printk("%s  NIP: [%016llx] PID: %d Comm: %s\n", level,
>> +evt->srr0, current->pid, current->comm);
>> +} else {
>> +printk("%s  NIP [%016llx]: %pS\n", level, evt->srr0,
>>  (void *)evt->srr0);
>> +}
>>  printk("%s  Initiator: %s\n", level,
>> evt->initiator == MCE_INITIATOR_CPU ? "CPU" : "Unknown");
>>  switch (evt->error_type) {
> 



Re: [PATCH 4/8] powerpc/64s: fix POWER9 machine check handler from stop state

2017-03-19 Thread Mahesh Jagannath Salgaonkar
On 03/16/2017 06:49 PM, Gautham R Shenoy wrote:
> Hi,
> 
> On Thu, Mar 16, 2017 at 11:05:20PM +1000, Nicholas Piggin wrote:
>> On Thu, 16 Mar 2017 18:10:48 +0530
>> Mahesh Jagannath Salgaonkar <mah...@linux.vnet.ibm.com> wrote:
>>
>>> On 03/14/2017 02:53 PM, Nicholas Piggin wrote:
>>>> The ISA specifies power save wakeup can cause a machine check interrupt.
>>>> The machine check handler currently has code to handle that for POWER8,
>>>> but POWER9 crashes when trying to execute the P8 style sleep
>>>> instructions.
>>>>
>>>> So queue up the machine check, then call into the idle code to wake up
>>>> as the system reset interrupt does, rather than attempting to sleep
>>>> again without going through the main idle path.
>>>>
>>>> Reviewed-by: Gautham R. Shenoy <e...@linux.vnet.ibm.com>
>>>> Signed-off-by: Nicholas Piggin <npig...@gmail.com>
>>>> ---
>>>>  arch/powerpc/include/asm/reg.h   |  1 +
>>>>  arch/powerpc/kernel/exceptions-64s.S | 69 
>>>> ++--
>>>>  2 files changed, 35 insertions(+), 35 deletions(-)
>>>>
>>>> diff --git a/arch/powerpc/include/asm/reg.h 
>>>> b/arch/powerpc/include/asm/reg.h
>>>> index fc879fd6bdae..8bbdfacce970 100644
>>>> --- a/arch/powerpc/include/asm/reg.h
>>>> +++ b/arch/powerpc/include/asm/reg.h
>>>> @@ -656,6 +656,7 @@
>>>> #define   SRR1_ISI_PROT   0x08000000 /* ISI: Other protection 
>>>> fault */
>>>> #define   SRR1_WAKEMASK   0x00380000 /* reason for wakeup */
>>>> #define   SRR1_WAKEMASK_P8   0x003c0000 /* reason for wakeup on 
>>>> POWER8 and 9 */
>>>> +#define   SRR1_WAKEMCE_RESVD  0x003c0000 /* Unused/reserved value 
>>>> used by MCE wakeup to indicate cause to idle wakeup handler */
>>>> #define   SRR1_WAKESYSERR 0x00300000 /* System error */
>>>> #define   SRR1_WAKEEE 0x00200000 /* External interrupt */
>>>> #define   SRR1_WAKEHVI   0x00240000 /* Hypervisor Virtualization 
>>>> Interrupt (P9) */
>>>> diff --git a/arch/powerpc/kernel/exceptions-64s.S 
>>>> b/arch/powerpc/kernel/exceptions-64s.S
>>>> index e390fcd04bcb..5779d2d6a192 100644
>>>> --- a/arch/powerpc/kernel/exceptions-64s.S
>>>> +++ b/arch/powerpc/kernel/exceptions-64s.S
>>>> @@ -306,6 +306,33 @@ EXC_COMMON_BEGIN(machine_check_common)
>>>>/* restore original r1. */  \
>>>>ld  r1,GPR1(r1)
>>>>
>>>> +#ifdef CONFIG_PPC_P7_NAP
>>>> +EXC_COMMON_BEGIN(machine_check_idle_common)
>>>> +  bl  machine_check_queue_event
>>>> +  /*
>>>> +   * Queue the machine check, then reload SRR1 and use it to set
>>>> +   * CR3 according to pnv_powersave_wakeup convention.
>>>> +   */
>>>> +  ld  r12,_MSR(r1)
>>>> +  rlwinm  r11,r12,47-31,30,31
>>>> +  cmpwi   cr3,r11,2
>>>> +
>>>> +  /*
>>>> +   * Now put SRR1_WAKEMCE_RESVD into SRR1, allows it to follow the
>>>> +   * system reset wakeup code.
>>>> +   */
>>>> +  orisr12,r12,SRR1_WAKEMCE_RESVD@h
>>>> +  mtspr   SPRN_SRR1,r12
>>>> +  std r12,_MSR(r1)
>>>> +
>>>> +  /*
>>>> +   * Decrement MCE nesting after finishing with the stack.
>>>> +   */
>>>> +  lhz r11,PACA_IN_MCE(r13)
>>>> +  subir11,r11,1
>>>> +  sth r11,PACA_IN_MCE(r13)  
>>>
>>> Looks like we are not winding up.. Shouldn't we ? What if we may end up
>>> in pnv_wakeup_noloss() which assumes that no GPRs are lost. Am I missing
>>> anything ?
> 
> Nice catch! This can occur if SRR1[46:47] == 0b01.
> 
>>
>> Hmm, no I think you're right. Thanks, good catch. But can we do it with
>> just setting PACA_NAPSTATELOST?
> 
> Unconditionally setting PACA_NAPSTATELOST should be sufficient.

Agreed, that should take care of it.

> 
>>
>>>
>>>> +  b   pnv_powersave_wakeup
>>>> +#endif
>>>>/*  
>>>
>>> [...]
>>>
>>> Rest looks good to me.
>>>
>>> Reviewed-by: Mahesh J Salgaonkar <mah...@linux.vnet.ibm.com>
>>
>> Thanks,
>> Nick
>>



Re: [PATCH 4/8] powerpc/64s: fix POWER9 machine check handler from stop state

2017-03-16 Thread Mahesh Jagannath Salgaonkar
On 03/14/2017 02:53 PM, Nicholas Piggin wrote:
> The ISA specifies power save wakeup can cause a machine check interrupt.
> The machine check handler currently has code to handle that for POWER8,
> but POWER9 crashes when trying to execute the P8 style sleep
> instructions.
> 
> So queue up the machine check, then call into the idle code to wake up
> as the system reset interrupt does, rather than attempting to sleep
> again without going through the main idle path.
> 
> Reviewed-by: Gautham R. Shenoy 
> Signed-off-by: Nicholas Piggin 
> ---
>  arch/powerpc/include/asm/reg.h   |  1 +
>  arch/powerpc/kernel/exceptions-64s.S | 69 
> ++--
>  2 files changed, 35 insertions(+), 35 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
> index fc879fd6bdae..8bbdfacce970 100644
> --- a/arch/powerpc/include/asm/reg.h
> +++ b/arch/powerpc/include/asm/reg.h
> @@ -656,6 +656,7 @@
>  #define   SRR1_ISI_PROT  0x08000000 /* ISI: Other protection 
> fault */
>  #define   SRR1_WAKEMASK  0x00380000 /* reason for wakeup */
>  #define   SRR1_WAKEMASK_P8   0x003c0000 /* reason for wakeup on POWER8 and 9 
> */
> +#define   SRR1_WAKEMCE_RESVD 0x003c0000 /* Unused/reserved value used by MCE 
> wakeup to indicate cause to idle wakeup handler */
>  #define   SRR1_WAKESYSERR   0x00300000 /* System error */
>  #define   SRR1_WAKEEE   0x00200000 /* External interrupt */
>  #define   SRR1_WAKEHVI   0x00240000 /* Hypervisor Virtualization 
> Interrupt (P9) */
> diff --git a/arch/powerpc/kernel/exceptions-64s.S 
> b/arch/powerpc/kernel/exceptions-64s.S
> index e390fcd04bcb..5779d2d6a192 100644
> --- a/arch/powerpc/kernel/exceptions-64s.S
> +++ b/arch/powerpc/kernel/exceptions-64s.S
> @@ -306,6 +306,33 @@ EXC_COMMON_BEGIN(machine_check_common)
>   /* restore original r1. */  \
>   ld  r1,GPR1(r1)
> 
> +#ifdef CONFIG_PPC_P7_NAP
> +EXC_COMMON_BEGIN(machine_check_idle_common)
> + bl  machine_check_queue_event
> + /*
> +  * Queue the machine check, then reload SRR1 and use it to set
> +  * CR3 according to pnv_powersave_wakeup convention.
> +  */
> + ld  r12,_MSR(r1)
> + rlwinm  r11,r12,47-31,30,31
> + cmpwi   cr3,r11,2
> +
> + /*
> +  * Now put SRR1_WAKEMCE_RESVD into SRR1, allows it to follow the
> +  * system reset wakeup code.
> +  */
> + orisr12,r12,SRR1_WAKEMCE_RESVD@h
> + mtspr   SPRN_SRR1,r12
> + std r12,_MSR(r1)
> +
> + /*
> +  * Decrement MCE nesting after finishing with the stack.
> +  */
> + lhz r11,PACA_IN_MCE(r13)
> + subir11,r11,1
> + sth r11,PACA_IN_MCE(r13)

Looks like we are not winding up. Shouldn't we? What if we end up
in pnv_wakeup_noloss(), which assumes that no GPRs are lost? Am I missing
anything?

> + b   pnv_powersave_wakeup
> +#endif
>   /*

[...]

Rest looks good to me.

Reviewed-by: Mahesh J Salgaonkar 

Thanks,
-Mahesh.



Re: [PATCH 3/3] powerpc/64s: POWER9 machine check handler

2017-02-27 Thread Mahesh Jagannath Salgaonkar
On 02/28/2017 07:30 AM, Nicholas Piggin wrote:
> Add POWER9 machine check handler. There are several new types of errors
> added, so logging messages for those are also added.
> 
> This doesn't attempt to reuse any of the P7/8 defines or functions,
> because that becomes too complex. The better option in future is to use
> a table driven approach.
> 
> Signed-off-by: Nicholas Piggin 
> ---
>  arch/powerpc/include/asm/bitops.h |   4 +
>  arch/powerpc/include/asm/mce.h| 105 +
>  arch/powerpc/kernel/cputable.c|   3 +
>  arch/powerpc/kernel/mce.c |  83 ++
>  arch/powerpc/kernel/mce_power.c   | 231 
> ++
>  5 files changed, 426 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/bitops.h 
> b/arch/powerpc/include/asm/bitops.h
> index 59abc620f8e8..5f057c74bf21 100644
> --- a/arch/powerpc/include/asm/bitops.h
> +++ b/arch/powerpc/include/asm/bitops.h
> @@ -51,6 +51,10 @@
>  #define PPC_BIT(bit) (1UL << PPC_BITLSHIFT(bit))
>  #define PPC_BITMASK(bs, be)  ((PPC_BIT(bs) - PPC_BIT(be)) | PPC_BIT(bs))
> 
> +/* Put a PPC bit into a "normal" bit position */
> +#define PPC_BITEXTRACT(bits, ppc_bit, dst_bit)   \
> +	((((bits) >> PPC_BITLSHIFT(ppc_bit)) & 1) << (dst_bit))
> +
>  #include 
> 
>  /* Macro for generating the ***_bits() functions */
> diff --git a/arch/powerpc/include/asm/mce.h b/arch/powerpc/include/asm/mce.h
> index b2a5865ccd87..ed62efe01e49 100644
> --- a/arch/powerpc/include/asm/mce.h
> +++ b/arch/powerpc/include/asm/mce.h
> @@ -66,6 +66,55 @@
> 
>  #define P8_DSISR_MC_SLB_ERRORS   (P7_DSISR_MC_SLB_ERRORS | \
>P8_DSISR_MC_ERAT_MULTIHIT_SEC)
> +
> +/*
> + * Machine Check bits on power9
> + */
> +#define P9_SRR1_MC_LOADSTORE(srr1)   (((srr1) >> PPC_BITLSHIFT(42)) & 1)
> +
> +#define P9_SRR1_MC_IFETCH(srr1)  (   \
> + PPC_BITEXTRACT(srr1, 45, 0) |   \
> + PPC_BITEXTRACT(srr1, 44, 1) |   \
> + PPC_BITEXTRACT(srr1, 43, 2) |   \
> + PPC_BITEXTRACT(srr1, 36, 3) )
> +
> +/* 0 is reserved */
> +#define P9_SRR1_MC_IFETCH_UE 1
> +#define P9_SRR1_MC_IFETCH_SLB_PARITY 2
> +#define P9_SRR1_MC_IFETCH_SLB_MULTIHIT   3
> +#define P9_SRR1_MC_IFETCH_ERAT_MULTIHIT  4
> +#define P9_SRR1_MC_IFETCH_TLB_MULTIHIT   5
> +#define P9_SRR1_MC_IFETCH_UE_TLB_RELOAD  6
> +/* 7 is reserved */
> +#define P9_SRR1_MC_IFETCH_LINK_TIMEOUT   8
> +#define P9_SRR1_MC_IFETCH_LINK_TABLEWALK_TIMEOUT 9
> +/* 10 ? */
> +#define P9_SRR1_MC_IFETCH_RA 11
> +#define P9_SRR1_MC_IFETCH_RA_TABLEWALK   12
> +#define P9_SRR1_MC_IFETCH_RA_ASYNC_STORE 13
> +#define P9_SRR1_MC_IFETCH_LINK_ASYNC_STORE_TIMEOUT   14
> +#define P9_SRR1_MC_IFETCH_RA_TABLEWALK_FOREIGN   15
> +
> +/* DSISR bits for machine check (On Power9) */
> +#define P9_DSISR_MC_UE   (PPC_BIT(48))
> +#define P9_DSISR_MC_UE_TABLEWALK (PPC_BIT(49))
> +#define P9_DSISR_MC_LINK_LOAD_TIMEOUT(PPC_BIT(50))
> +#define P9_DSISR_MC_LINK_TABLEWALK_TIMEOUT   (PPC_BIT(51))
> +#define P9_DSISR_MC_ERAT_MULTIHIT(PPC_BIT(52))
> +#define P9_DSISR_MC_TLB_MULTIHIT_MFTLB   (PPC_BIT(53))
> +#define P9_DSISR_MC_USER_TLBIE   (PPC_BIT(54))
> +#define P9_DSISR_MC_SLB_PARITY_MFSLB (PPC_BIT(55))
> +#define P9_DSISR_MC_SLB_MULTIHIT_MFSLB   (PPC_BIT(56))
> +#define P9_DSISR_MC_RA_LOAD  (PPC_BIT(57))
> +#define P9_DSISR_MC_RA_TABLEWALK (PPC_BIT(58))
> +#define P9_DSISR_MC_RA_TABLEWALK_FOREIGN (PPC_BIT(59))
> +#define P9_DSISR_MC_RA_FOREIGN   (PPC_BIT(60))
> +
> +/* SLB error bits */
> +#define P9_DSISR_MC_SLB_ERRORS   (P9_DSISR_MC_ERAT_MULTIHIT | \
> +  P9_DSISR_MC_SLB_PARITY_MFSLB | \
> +  P9_DSISR_MC_SLB_MULTIHIT_MFSLB)
> +
>  enum MCE_Version {
>   MCE_V1 = 1,
>  };
> @@ -93,6 +142,9 @@ enum MCE_ErrorType {
>   MCE_ERROR_TYPE_SLB = 2,
>   MCE_ERROR_TYPE_ERAT = 3,
>   MCE_ERROR_TYPE_TLB = 4,
> + MCE_ERROR_TYPE_USER = 5,
> + MCE_ERROR_TYPE_RA = 6,
> + MCE_ERROR_TYPE_LINK = 7,
>  };
> 
>  enum MCE_UeErrorType {
> @@ -121,6 +173,32 @@ enum MCE_TlbErrorType {
>   MCE_TLB_ERROR_MULTIHIT = 2,
>  };
> 
> +enum MCE_UserErrorType {
> + MCE_USER_ERROR_INDETERMINATE = 0,
> + MCE_USER_ERROR_TLBIE = 1,
> +};
> +
> +enum MCE_RaErrorType {
> + MCE_RA_ERROR_INDETERMINATE = 0,
> + MCE_RA_ERROR_IFETCH = 1,
> + MCE_RA_ERROR_PAGE_TABLE_WALK_IFETCH = 2,
> + MCE_RA_ERROR_PAGE_TABLE_WALK_IFETCH_FOREIGN = 3,
> + MCE_RA_ERROR_LOAD = 4,
> + 

Re: [PATCH 1/3] powerpc/64s: fix handling of non-synchronous machine checks

2017-02-27 Thread Mahesh Jagannath Salgaonkar
On 02/28/2017 07:30 AM, Nicholas Piggin wrote:
> A synchronous machine check is an exception raised by the attempt to
> execute the current instruction. If the error can't be corrected, it
> can make sense to SIGBUS the currently running process.
> 
> In other cases, the error condition is not related to the current
> instruction, so killing the current process is not the right thing to
> do.
> 
> Today, all machine checks are MCE_SEV_ERROR_SYNC, so this has no
> practical change. It will be used to handle POWER9 asynchronous
> machine checks.
> 
> Signed-off-by: Nicholas Piggin 
> ---
>  arch/powerpc/platforms/powernv/opal.c | 21 ++---
>  1 file changed, 6 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/opal.c 
> b/arch/powerpc/platforms/powernv/opal.c
> index 86d9fde93c17..e0f856bfbfe8 100644
> --- a/arch/powerpc/platforms/powernv/opal.c
> +++ b/arch/powerpc/platforms/powernv/opal.c
> @@ -395,7 +395,6 @@ static int opal_recover_mce(struct pt_regs *regs,
>   struct machine_check_event *evt)
>  {
>   int recovered = 0;
> - uint64_t ea = get_mce_fault_addr(evt);
> 
>   if (!(regs->msr & MSR_RI)) {
>   /* If MSR_RI isn't set, we cannot recover */
> @@ -404,26 +403,18 @@ static int opal_recover_mce(struct pt_regs *regs,
>   } else if (evt->disposition == MCE_DISPOSITION_RECOVERED) {
>   /* Platform corrected itself */
>   recovered = 1;
> - } else if (ea && !is_kernel_addr(ea)) {
> + } else if (evt->severity == MCE_SEV_FATAL) {
> + /* Fatal machine check */
> + pr_err("Machine check interrupt is fatal\n");
> + recovered = 0;

Setting recovered = 0 would trigger a kernel panic. Should we panic the
kernel for asynchronous errors?
> + } else if ((evt->severity == MCE_SEV_ERROR_SYNC) &&
> + (user_mode(regs) && !is_global_init(current))) {
>   /*
> -  * Faulting address is not in kernel text. We should be fine.
> -  * We need to find which process uses this address.
>* For now, kill the task if we have received exception when
>* in userspace.
>*
>* TODO: Queue up this address for hwpoisioning later.
>*/
> - if (user_mode(regs) && !is_global_init(current)) {
> - _exception(SIGBUS, regs, BUS_MCEERR_AR, regs->nip);
> - recovered = 1;
> - } else
> - recovered = 0;
> - } else if (user_mode(regs) && !is_global_init(current) &&
> - evt->severity == MCE_SEV_ERROR_SYNC) {
> - /*
> -  * If we have received a synchronous error when in userspace
> -  * kill the task.
> -  */
>   _exception(SIGBUS, regs, BUS_MCEERR_AR, regs->nip);
>   recovered = 1;
>   }
> 



Re: [RFC PATCH 1/7] powerpc/book3s: Move machine check event structure to opal-api.h

2017-02-20 Thread Mahesh Jagannath Salgaonkar
On 02/21/2017 08:05 AM, Nicholas Piggin wrote:
> On Tue, 21 Feb 2017 07:21:56 +0530
> Mahesh J Salgaonkar  wrote:
> 
>> +enum MCE_TlbErrorType {
>> +MCE_TLB_ERROR_INDETERMINATE = 0,
>> +MCE_TLB_ERROR_PARITY = 1,
>> +MCE_TLB_ERROR_MULTIHIT = 2,
>> +MCE_TLB_ERROR_TLBIEL_PROG_ERROR = 3,
>> +};
> 
> The new TLBIE error isn't really a TLB error as such. Not a hardware error.
> I added a new "user" type for it.
> 
> I don't think we can handle it just by flushing TLB because it can also be
> raised in response to invalid non-local tlbie. We could flush all TLBs maybe
> but I think also have to advance nip to return to.

ok got it.

> 
>> +
>> +enum MCE_NestErrorType {
>> +MCE_NEST_ERROR_ABRT_IFETCH = 0,
>> +MCE_NEST_ERROR_ABRT_IFETCH_TABLEWALK = 1,
>> +MCE_NEST_ERROR_ABRT_LOAD = 2,
>> +MCE_NEST_ERROR_ABRT_LOAD_TABLEWALK = 3,
>> +};
>> +
>> +enum MCE_CrespErrorType {
>> +MCE_CRESP_ERROR_BAD_RADDR_IFETCH = 0,
>> +MCE_CRESP_ERROR_BAD_RADDR_IFETCH_TABLEWALK = 1,
>> +MCE_CRESP_ERROR_BAD_RADDR_LOAD = 2,
>> +MCE_CRESP_ERROR_BAD_RADDR_LOAD_TABLEWALK = 3,
>> +};
>> +
>> +enum MCE_FspaceErrorType {
>> +MCE_FSPACE_ERROR_IFETCH = 0,
>> +MCE_FSPACE_ERROR_IFETCH_TABLEWALK = 1,
>> +MCE_FSPACE_ERROR_RADDR_TRANSLATION = 2,
>> +MCE_FSPACE_ERROR_RADDR_LOAD = 3,
>> +};
>> +
>> +enum MCE_AsyncErrorType {
>> +MCE_ASYNC_ERROR_REAL_ADDR_STORE = 0,
>> +MCE_ASYNC_ERROR_NEST_ABRT_STORE = 1,
>> +};
>> +
>> +struct OpalMachineCheckEvent {
> 
> Can we have more of a think about this structure and error types
> before making it an OPAL API?

Agreed. I was just thinking: how about we replace the entire
union as below:

uint8_t specific_error_type;/* 0x20 */
uint8_t effective_address_provided; /* 0x21 */
uint8_t physical_address_provided;  /* 0x22 */
uint8_t reserved_1[5];  /* 0x23 */
uint64_teffective_address;  /* 0x28 */
uint64_tphysical_address;   /* 0x30 */
uint8_t reserved_2[8];  /* 0x38 */
};

What do you say? We may add a few more bytes as reserved for the future.

> 
> Errors don't always fit neatly into a simple classification like
> this. For example "async" is not really an error. It's a property
> of how the error is reported. The error is a timeout or real
> address error. And it's caused by a store. And initiated by nest
> or cResp... Other errors are caused by a table walk that was
> caused by a store, etc.
> 
> I shoehorned these async errors into realaddr/link types in my
> patch along with a different severity (i.e., not SYNC). But I
> think we can do a lot better with a clean slate for OPAL.

I see.

> 
> More general thing is, I wonder how much we need to know of the
> implementation details in this API? This still seems like it's
> unnecessarily split between OS and FW. I think it would be much
> nicer if we just return a set of things that the OS can usefully
> respond to and have firmware construct the detailed messages for
> logging.
> 
> That way we'll have much fewer new types of errors we don't know
> how to handle, and never have to report unknown error.

Makes sense. That would make Linux MCE error printing much simpler, and
we may never have to modify it to add new strings. We can probably add a
char buffer to the machine check struct, or send it as a separate string buffer.

Thanks,
-Mahesh.



Re: [RFC PATCH 5/7] powerpc/book3s: Don't turn on the MSR[ME] bit until opal processes the reason.

2017-02-20 Thread Mahesh Jagannath Salgaonkar
On 02/21/2017 08:17 AM, Nicholas Piggin wrote:
> On Tue, 21 Feb 2017 07:22:56 +0530
> Mahesh J Salgaonkar  wrote:
> 
>> From: Mahesh Salgaonkar 
>>
>> Delay it until we are done with machine_check_early() call. Turn on MSR[ME]
>> once opal is done with processing MCE.
> 
> Why? This seems like quite a regression -- the MCE handler today
> has about 60 instructions and 30 l/st with ME clear.

I understand that this is a fairly long window. But we are in MCE handling
code, and if we hit an MCE while doing that, we may end up with
recursive MCE interrupts anyway, without really being able to recover.
Instead, let's risk a checkstop, which would get us rebooted with hostboot
throwing a proper error callout.

-Mahesh.



Re: [PATCH v1 1/2] fadump: reduce memory consumption for capture kernel

2017-01-30 Thread Mahesh Jagannath Salgaonkar
On 01/30/2017 10:14 PM, Hari Bathini wrote:
> In case of fadump, capture (fadump) kernel boots like a normal kernel.
> While this has its advantages, the capture kernel would initialize all
> the components like normal kernel, which may not necessarily be needed
> for a typical dump capture kernel. So, fadump capture kernel ends up
> needing more memory than a typical (read kdump) capture kernel to boot.
> 
> This can be overcome by introducing parameters like fadump_nr_cpus=1,
> similar to nr_cpus=1 parameter, applicable only when fadump is active.
> But this approach needs introduction of special parameters applicable
> only when fadump is active (capture kernel), for every parameter that
> reduces memory/resource consumption.
> 
> A better approach would be to pass extra parameters to fadump capture
> kernel. As firmware leaves the memory contents intact from the time of
> crash till the new kernel is booted up, parameters to append to capture
> kernel can be saved in real memory region and retrieved later when the
> capture kernel is in its early boot process for appending to command
> line parameters.
> 
> This patch introduces a new node /sys/kernel/fadump_cmdline_append to
> specify the parameters to pass to fadump capture kernel, saves them in
> real memory region and appends these parameters to capture kernel early
> in its boot process.
> 
> Signed-off-by: Hari Bathini 
> ---
>  arch/powerpc/include/asm/fadump.h |   28 
>  arch/powerpc/kernel/fadump.c  |  125 
> -
>  arch/powerpc/kernel/prom.c|   19 ++
>  3 files changed, 170 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/fadump.h 
> b/arch/powerpc/include/asm/fadump.h
> index 0031806..484083a 100644
> --- a/arch/powerpc/include/asm/fadump.h
> +++ b/arch/powerpc/include/asm/fadump.h
> @@ -24,6 +24,8 @@
> 
>  #ifdef CONFIG_FA_DUMP
> 
> +#include 
> +
>  /*
>   * The RMA region will be saved for later dumping when kernel crashes.
>   * RMA is Real Mode Area, the first block of logical memory address owned
> @@ -45,6 +47,8 @@
> 
>  #define memblock_num_regions(memblock_type)  (memblock.memblock_type.cnt)
> 
> +#define FADUMP_FORMAT_VERSION0x0002

Why 0x0002? Does PHYP now support a new version of the dump format? We
should be careful not to break backward compatibility. Maybe it is now
time to look up the minimum/maximum kernel dump versions supported by the
firmware in
"/proc/device-tree/rtas/ibm,configure-kernel-dump-version" and then use
whichever is supported.

> +
>  /* Firmware provided dump sections */
>  #define FADUMP_CPU_STATE_DATA0x0001
>  #define FADUMP_HPTE_REGION   0x0002
> @@ -126,6 +130,13 @@ struct fw_dump {
>   /* cmd line option during boot */
>   unsigned long   reserve_bootvar;
> 
> + /*
> +  * Area to pass info to capture (fadump) kernel. For now,
> +  * we are only passing parameters to append.
> +  */
> + unsigned long   handover_area_start;
> + unsigned long   handover_area_size;
> +
>   unsigned long   fadumphdr_addr;
>   unsigned long   cpu_notes_buf;
>   unsigned long   cpu_notes_buf_size;
> @@ -159,6 +170,22 @@ static inline u64 str_to_u64(const char *str)
>  #define FADUMP_CRASH_INFO_MAGIC  STR_TO_HEX("FADMPINF")
>  #define REGSAVE_AREA_MAGIC   STR_TO_HEX("REGSAVE")
> 
> +/*
> + * The start address for an area to pass off certain configuration details
> + * like parameters to append to the commandline for a capture (fadump) 
> kernel.
> + * Setting it to 128MB as this needs to be accessed in realmode.
> + */
> +#define FADUMP_HANDOVER_AREA_START   (1UL << 27)
> +
> +#define FADUMP_PARAMS_AREA_MARKERSTR_TO_HEX("FADMPCMD")
> +#define FADUMP_PARAMS_INFO_SIZE  sizeof(struct 
> fadump_params_info)
> +
> +/* fadump parameters info */
> +struct fadump_params_info {
> + u64 params_area_marker;
> + charparams[COMMAND_LINE_SIZE/2];
> +};
> +
>  /* The firmware-assisted dump format.
>   *
>   * The register save area is an area in the partition's memory used to 
> preserve
> @@ -200,6 +227,7 @@ struct fad_crash_memory_ranges {
> 
>  extern int early_init_dt_scan_fw_dump(unsigned long node,
>   const char *uname, int depth, void *data);
> +extern char *get_fadump_parameters_realmode(void);
>  extern int fadump_reserve_mem(void);
>  extern int setup_fadump(void);
>  extern int is_fadump_active(void);
> diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
> index 8f0c7c5..bc82d22 100644
> --- a/arch/powerpc/kernel/fadump.c
> +++ b/arch/powerpc/kernel/fadump.c
> @@ -41,7 +41,6 @@
>  #include 
>  #include 
>  #include 
> -#include 
> 
>  static struct fw_dump fw_dump;
>  static struct fadump_mem_struct fdm;
> @@ -74,6 +73,9 @@ int __init early_init_dt_scan_fw_dump(unsigned long node,
>   fw_dump.fadump_supported = 1;
>   

Re: [PATCH v5 2/2] KVM: PPC: Exit guest upon MCE when FWNMI capability is enabled

2017-01-17 Thread Mahesh Jagannath Salgaonkar
On 01/16/2017 10:05 AM, Paul Mackerras wrote:
> On Fri, Jan 13, 2017 at 04:51:45PM +0530, Aravinda Prasad wrote:
>> Enhance KVM to cause a guest exit with KVM_EXIT_NMI
>> exit reason upon a machine check exception (MCE) in
>> the guest address space if the KVM_CAP_PPC_FWNMI
>> capability is enabled (instead of delivering a 0x200
>> interrupt to guest). This enables QEMU to build error
>> log and deliver machine check exception to guest via
>> guest registered machine check handler.
>>
>> This approach simplifies the delivery of machine
>> check exception to guest OS compared to the earlier
>> approach of KVM directly invoking 0x200 guest interrupt
>> vector.
>>
>> This design/approach is based on the feedback for the
>> QEMU patches to handle machine check exception. Details
>> of earlier approach of handling machine check exception
>> in QEMU and related discussions can be found at:
>>
>> https://lists.nongnu.org/archive/html/qemu-devel/2014-11/msg00813.html
>>
>> Note:
>>
>> This patch introduces a hook which is invoked at the time
>> of guest exit to facilitate the host-side handling of
>> machine check exception before the exception is passed
>> on to the guest. Hence, the host-side handling which was
>> performed earlier via machine_check_fwnmi is removed.
>>
>> The reasons for this approach is (i) it is not possible
>> to distinguish whether the exception occurred in the
>> guest or the host from the pt_regs passed on the
>> machine_check_exception(). Hence machine_check_exception()
>> calls panic, instead of passing on the exception to
>> the guest, if the machine check exception is not
>> recoverable. (ii) the approach introduced in this
>> patch gives opportunity to the host kernel to perform
>> actions in virtual mode before passing on the exception
>> to the guest. This approach does not require complex
>> tweaks to machine_check_fwnmi and friends.
>>
>> Signed-off-by: Aravinda Prasad 
>> Reviewed-by: David Gibson 
> 
> This patch mostly looks OK.  I have a few relatively minor comments
> below.
> 
>> ---
>>  arch/powerpc/kvm/book3s_hv.c|   27 +-
>>  arch/powerpc/kvm/book3s_hv_rmhandlers.S |   47 
>> ---
>>  arch/powerpc/platforms/powernv/opal.c   |   10 +++
>>  3 files changed, 54 insertions(+), 30 deletions(-)
>>
>> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
>> index 3686471..cae4921 100644
>> --- a/arch/powerpc/kvm/book3s_hv.c
>> +++ b/arch/powerpc/kvm/book3s_hv.c
>> @@ -123,6 +123,7 @@ MODULE_PARM_DESC(halt_poll_ns_shrink, "Factor halt poll 
>> time is shrunk by");
>>  
>>  static void kvmppc_end_cede(struct kvm_vcpu *vcpu);
>>  static int kvmppc_hv_setup_htab_rma(struct kvm_vcpu *vcpu);
>> +static void kvmppc_machine_check_hook(void);
>>  
>>  static inline struct kvm_vcpu *next_runnable_thread(struct kvmppc_vcore *vc,
>>  int *ip)
>> @@ -954,15 +955,14 @@ static int kvmppc_handle_exit_hv(struct kvm_run *run, 
>> struct kvm_vcpu *vcpu,
>>  r = RESUME_GUEST;
>>  break;
>>  case BOOK3S_INTERRUPT_MACHINE_CHECK:
>> +/* Exit to guest with KVM_EXIT_NMI as exit reason */
>> +run->exit_reason = KVM_EXIT_NMI;
>> +r = RESUME_HOST;
>>  /*
>> - * Deliver a machine check interrupt to the guest.
>> - * We have to do this, even if the host has handled the
>> - * machine check, because machine checks use SRR0/1 and
>> - * the interrupt might have trashed guest state in them.
>> + * Invoke host-kernel handler to perform any host-side
>> + * handling before exiting the guest.
>>   */
>> -kvmppc_book3s_queue_irqprio(vcpu,
>> -BOOK3S_INTERRUPT_MACHINE_CHECK);
>> -r = RESUME_GUEST;
>> +kvmppc_machine_check_hook();
> 
> Note that this won't necessarily be called on the same CPU that
> received the machine check.  This will be called on thread 0 of the
> core (or subcore), whereas the machine check could have occurred on
> some other thread.  Are you sure that the machine check handling code
> will be OK with that?

That will have only one problem. get_mce_event() from
opal_machine_check() may not be able to pull the MCE event for an error
on a non-zero thread. We should hook the MCE event into the vcpu
structure during kvmppc_realmode_machine_check() and then pass it to
ppc_md.machine_check_exception() as an additional argument.
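A minimal userspace sketch of that flow (the types, fields and function
names below are invented for illustration, not the real KVM/powerpc API):
the realmode handler saves the event into the vcpu, and the virtual-mode
exit path hands that saved event to the platform handler instead of
relying on the per-cpu get_mce_event(), which may run on a different
thread than the one that took the machine check.

```c
#include <assert.h>

/* Invented stand-ins for the real kernel structures. */
struct machine_check_event { int disposition; /* 0 = recovered, illustrative */ };
struct kvm_vcpu { struct machine_check_event mce_evt; };

/* Stand-in for ppc_md.machine_check_exception() taking the event as an
 * extra argument rather than pulling it from per-cpu state. */
static int platform_mce_handler(const struct machine_check_event *evt)
{
    return evt->disposition == 0;  /* 1 if the event was recovered */
}

static int handle_guest_mce_exit(struct kvm_vcpu *vcpu)
{
    /* Event was captured earlier (by kvmppc_realmode_machine_check()
     * in the real flow), possibly on a different hardware thread. */
    return platform_mce_handler(&vcpu->mce_evt);
}
```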

> 
>>  break;
>>  case BOOK3S_INTERRUPT_PROGRAM:
>>  {
>> @@ -3491,6 +3491,19 @@ static void kvmppc_irq_bypass_del_producer_hv(struct 
>> irq_bypass_consumer *cons,
>>  }
>>  #endif
>>  
>> +/*
>> + * Hook to handle machine check exceptions occurred inside a guest.
>> + * This hook is invoked from host virtual mode from KVM before exiting
>> + * the guest with KVM_EXIT_NMI exit reason. This 

Re: [PATCH v4 4/5] powerpc/fadump: reuse crashkernel parameter for fadump memory reservation

2017-01-13 Thread Mahesh Jagannath Salgaonkar
On 01/05/2017 11:02 PM, Hari Bathini wrote:
> fadump supports specifying memory to reserve for fadump's crash kernel
> with fadump_reserve_mem kernel parameter. This parameter currently
> supports passing a fixed memory size, like fadump_reserve_mem=
> only. This patch aims to add support for other syntaxes like range-based
> memory size :[,:,:,...]
> which allows using the same parameter to boot the kernel with different
> system RAM sizes.
> 
> As crashkernel parameter already supports the above mentioned syntaxes,
> this patch deprecates fadump_reserve_mem parameter and reuses crashkernel
> parameter instead, to specify memory for fadump's crash kernel memory
> reservation as well. If any offset is provided in crashkernel parameter,
> it will be ignored in case of fadump, as fadump reserves memory at end
> of RAM.
> 
> Advantages using crashkernel parameter instead of fadump_reserve_mem
> parameter are one less kernel parameter overall, code reuse and support
> for multiple syntaxes to specify memory.
> 
> Suggested-by: Dave Young 
> Signed-off-by: Hari Bathini 

Reviewed-by: Mahesh Salgaonkar 

> ---
>  arch/powerpc/kernel/fadump.c |   23 ++-
>  1 file changed, 10 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
> index db0b339..de7d39a 100644
> --- a/arch/powerpc/kernel/fadump.c
> +++ b/arch/powerpc/kernel/fadump.c
> @@ -210,14 +210,20 @@ static unsigned long init_fadump_mem_struct(struct 
> fadump_mem_struct *fdm,
>   */
>  static inline unsigned long fadump_calculate_reserve_size(void)
>  {
> - unsigned long size;
> + int ret;
> + unsigned long long base, size;
> 
>   /*
> -  * Check if the size is specified through fadump_reserve_mem= cmdline
> -  * option. If yes, then use that.
> +  * Check if the size is specified through crashkernel= cmdline
> +  * option. If yes, then use that but ignore base as fadump
> +  * reserves memory at end of RAM.
>*/
> - if (fw_dump.reserve_bootvar)
> + ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
> + &size, &base);
> + if (ret == 0 && size > 0) {
> + fw_dump.reserve_bootvar = (unsigned long)size;
>   return fw_dump.reserve_bootvar;
> + }
> 
>   /* divide by 20 to get 5% of value */
>   size = memblock_end_of_DRAM() / 20;
> @@ -353,15 +359,6 @@ static int __init early_fadump_param(char *p)
>  }
>  early_param("fadump", early_fadump_param);
> 
> -/* Look for fadump_reserve_mem= cmdline option */
> -static int __init early_fadump_reserve_mem(char *p)
> -{
> - if (p)
> - fw_dump.reserve_bootvar = memparse(p, &p);
> - return 0;
> -}
> -early_param("fadump_reserve_mem", early_fadump_reserve_mem);
> -
>  static void register_fw_dump(struct fadump_mem_struct *fdm)
>  {
>   int rc;
> 



Re: [PATCH v4 3/5] powerpc/fadump: remove dependency with CONFIG_KEXEC

2017-01-13 Thread Mahesh Jagannath Salgaonkar
On 01/05/2017 11:02 PM, Hari Bathini wrote:
> Now that crashkernel parameter parsing and vmcoreinfo related code is
> moved under CONFIG_CRASH_CORE instead of CONFIG_KEXEC_CORE, remove
> dependency with CONFIG_KEXEC for CONFIG_FA_DUMP. While here, get rid
> of definitions of fadump_append_elf_note() & fadump_final_note()
> functions to reuse similar functions compiled under CONFIG_CRASH_CORE.
> 
> Signed-off-by: Hari Bathini 

Reviewed-by: Mahesh Salgaonkar 

> ---
>  arch/powerpc/Kconfig   |   10 ++
>  arch/powerpc/include/asm/fadump.h  |2 ++
>  arch/powerpc/kernel/crash.c|2 --
>  arch/powerpc/kernel/fadump.c   |   34 +++---
>  arch/powerpc/kernel/setup-common.c |5 +
>  5 files changed, 16 insertions(+), 37 deletions(-)
> 
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index a8ee573..b9726be 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -513,21 +513,23 @@ config RELOCATABLE_TEST
> relocation code.
> 
>  config CRASH_DUMP
> - bool "Build a kdump crash kernel"
> + bool "Build a dump capture kernel"
>   depends on PPC64 || 6xx || FSL_BOOKE || (44x && !SMP)
>   select RELOCATABLE if (PPC64 && !COMPILE_TEST) || 44x || FSL_BOOKE
>   help
> -   Build a kernel suitable for use as a kdump capture kernel.
> +   Build a kernel suitable for use as a dump capture kernel.
> The same kernel binary can be used as production kernel and dump
> capture kernel.
> 
>  config FA_DUMP
>   bool "Firmware-assisted dump"
> - depends on PPC64 && PPC_RTAS && CRASH_DUMP && KEXEC_CORE
> + depends on PPC64 && PPC_RTAS
> + select CRASH_CORE
> + select CRASH_DUMP
>   help
> A robust mechanism to get reliable kernel crash dump with
> assistance from firmware. This approach does not use kexec,
> -   instead firmware assists in booting the kdump kernel
> +   instead firmware assists in booting the capture kernel
> while preserving memory contents. Firmware-assisted dump
> is meant to be a kdump replacement offering robustness and
> speed not possible without system firmware assistance.
> diff --git a/arch/powerpc/include/asm/fadump.h 
> b/arch/powerpc/include/asm/fadump.h
> index 0031806..60b9108 100644
> --- a/arch/powerpc/include/asm/fadump.h
> +++ b/arch/powerpc/include/asm/fadump.h
> @@ -73,6 +73,8 @@
>   reg_entry++;\
>  })
> 
> +extern int crashing_cpu;
> +
>  /* Kernel Dump section info */
>  struct fadump_section {
>   __be32  request_flag;
> diff --git a/arch/powerpc/kernel/crash.c b/arch/powerpc/kernel/crash.c
> index 47b63de..cbabb5a 100644
> --- a/arch/powerpc/kernel/crash.c
> +++ b/arch/powerpc/kernel/crash.c
> @@ -43,8 +43,6 @@
>  #define IPI_TIMEOUT  1
>  #define REAL_MODE_TIMEOUT1
> 
> -/* This keeps a track of which one is the crashing cpu. */
> -int crashing_cpu = -1;
>  static int time_to_dump;
> 
>  #define CRASH_HANDLER_MAX 3
> diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
> index 8f0c7c5..db0b339 100644
> --- a/arch/powerpc/kernel/fadump.c
> +++ b/arch/powerpc/kernel/fadump.c
> @@ -486,34 +486,6 @@ fadump_read_registers(struct fadump_reg_entry 
> *reg_entry, struct pt_regs *regs)
>   return reg_entry;
>  }
> 
> -static u32 *fadump_append_elf_note(u32 *buf, char *name, unsigned type,
> - void *data, size_t data_len)
> -{
> - struct elf_note note;
> -
> - note.n_namesz = strlen(name) + 1;
> - note.n_descsz = data_len;
> - note.n_type   = type;
> - memcpy(buf, &note, sizeof(note));
> - buf += (sizeof(note) + 3)/4;
> - memcpy(buf, name, note.n_namesz);
> - buf += (note.n_namesz + 3)/4;
> - memcpy(buf, data, note.n_descsz);
> - buf += (note.n_descsz + 3)/4;
> -
> - return buf;
> -}
> -
> -static void fadump_final_note(u32 *buf)
> -{
> - struct elf_note note;
> -
> - note.n_namesz = 0;
> - note.n_descsz = 0;
> - note.n_type   = 0;
> - memcpy(buf, &note, sizeof(note));
> -}
> -
>  static u32 *fadump_regs_to_elf_notes(u32 *buf, struct pt_regs *regs)
>  {
>   struct elf_prstatus prstatus;
> @@ -524,8 +496,8 @@ static u32 *fadump_regs_to_elf_notes(u32 *buf, struct 
> pt_regs *regs)
>* prstatus.pr_pid = 
>*/
>   elf_core_copy_kernel_regs(&prstatus.pr_reg, regs);
> - buf = fadump_append_elf_note(buf, KEXEC_CORE_NOTE_NAME, NT_PRSTATUS,
> - &prstatus, sizeof(prstatus));
> + buf = append_elf_note(buf, CRASH_CORE_NOTE_NAME, NT_PRSTATUS,
> +   &prstatus, sizeof(prstatus));
>   return buf;
>  }
> 
> @@ -666,7 +638,7 @@ static int __init fadump_build_cpu_notes(const struct 
> fadump_mem_struct *fdm)
>   note_buf = fadump_regs_to_elf_notes(note_buf, &regs);
>   

Re: [PATCH] powerpc/64s: relocation, register save fixes for system reset interrupt

2016-11-02 Thread Mahesh Jagannath Salgaonkar
On 10/13/2016 07:47 AM, Nicholas Piggin wrote:
> This patch does a couple of things. First of all, powernv immediately
> explodes when running a relocated kernel, because the system reset
> exception for handling sleeps does not do correct relocated branches.
> 
> Secondly, the sleep handling code trashes the condition and cfar
> registers, which we would like to preserve for debugging purposes (for
> non-sleep case exception).
> 
> This patch changes the exception to use the standard format that saves
> registers before any tests or branches are made. It adds the test for
> idle-wakeup as an "extra" to break out of the normal exception path.
> Then it branches to a relocated idle handler that calls the various
> idle handling functions.
> 
> After this patch, POWER8 CPU simulator now boots powernv kernel that is
> running at non-zero.
> 
> Cc: Balbir Singh 
> Cc: Shreyas B. Prabhu 
> Cc: Gautham R. Shenoy 
> Signed-off-by: Nicholas Piggin 
> ---
>  arch/powerpc/include/asm/exception-64s.h | 16 ++
>  arch/powerpc/kernel/exceptions-64s.S | 50 
> ++--
>  2 files changed, 45 insertions(+), 21 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/exception-64s.h 
> b/arch/powerpc/include/asm/exception-64s.h
> index 2e4e7d8..84d49b1 100644
> --- a/arch/powerpc/include/asm/exception-64s.h
> +++ b/arch/powerpc/include/asm/exception-64s.h
> @@ -93,6 +93,10 @@
>   ld  reg,PACAKBASE(r13); /* get high part of &label */   \
>   ori reg,reg,(FIXED_SYMBOL_ABS_ADDR(label))@l;
> 
> +#define __LOAD_HANDLER(reg, label)   \
> + ld  reg,PACAKBASE(r13); \
> + ori reg,reg,(ABS_ADDR(label))@l;
> +
>  /* Exception register prefixes */
>  #define EXC_HV   H
>  #define EXC_STD
> @@ -208,6 +212,18 @@ END_FTR_SECTION_NESTED(ftr,ftr,943)
>  #define kvmppc_interrupt kvmppc_interrupt_pr
>  #endif
> 
> +#ifdef CONFIG_RELOCATABLE
> +#define BRANCH_TO_COMMON(reg, label) \
> + __LOAD_HANDLER(reg, label); \
> + mtctr   reg;\
> + bctr
> +
> +#else
> +#define BRANCH_TO_COMMON(reg, label) \
> + b   label
> +
> +#endif
> +
>  #define __KVM_HANDLER_PROLOG(area, n)
> \
>   BEGIN_FTR_SECTION_NESTED(947)   \
>   ld  r10,area+EX_CFAR(r13);  \
> diff --git a/arch/powerpc/kernel/exceptions-64s.S 
> b/arch/powerpc/kernel/exceptions-64s.S
> index 08992f8..e680e84 100644
> --- a/arch/powerpc/kernel/exceptions-64s.S
> +++ b/arch/powerpc/kernel/exceptions-64s.S
> @@ -95,19 +95,35 @@ __start_interrupts:
>  /* No virt vectors corresponding with 0x0..0x100 */
>  EXC_VIRT_NONE(0x4000, 0x4100)
> 
> -EXC_REAL_BEGIN(system_reset, 0x100, 0x200)
> - SET_SCRATCH0(r13)
> +
>  #ifdef CONFIG_PPC_P7_NAP
> -BEGIN_FTR_SECTION
> - /* Running native on arch 2.06 or later, check if we are
> -  * waking up from nap/sleep/winkle.
> + /*
> +  * If running native on arch 2.06 or later, check if we are waking up
> +  * from nap/sleep/winkle, and branch to idle handler.
>*/
> - mfspr   r13,SPRN_SRR1
> - rlwinm. r13,r13,47-31,30,31
> - beq 9f
> +#define IDLETEST(n)  \
> + BEGIN_FTR_SECTION ; \
> + mfspr   r10,SPRN_SRR1 ; \
> + rlwinm. r10,r10,47-31,30,31 ;   \
> + beq-1f ;\
> + cmpwi   cr3,r10,2 ; \
> + BRANCH_TO_COMMON(r10, system_reset_idle_common) ;   \
> +1:   \
> + END_FTR_SECTION_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206)
> +#else
> +#define IDLETEST NOTEST
> +#endif
> 
> - cmpwi   cr3,r13,2
> - GET_PACA(r13)
> +EXC_REAL_BEGIN(system_reset, 0x100, 0x200)
> + SET_SCRATCH0(r13)
> + EXCEPTION_PROLOG_PSERIES(PACA_EXGEN, system_reset_common, EXC_STD,
> +  IDLETEST, 0x100)

Very sorry for the late review. On arch 2.07 and earlier, if we wake up
from winkle then the last bit of HSPRG0 would be set to 1. Hence, before
we access the paca we need to fix it by clearing that bit, which is done
in pnv_restore_hyp_resource(). But with this patch, we would end up
there after going through EXCEPTION_PROLOG_PSERIES(). This macro gets
the paca using GET_PACA(r13), and all the EXCEPTION_PROLOG_* macros
start using/accessing r13/paca without fixing it. Wouldn't this break
things badly on arch 2.07 and earlier? Am I missing anything?
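For reference, a small userspace sketch of that HSPRG0 convention (not
kernel code; the address value is made up): on arch 2.07 and earlier,
wakeup from winkle sets the last bit of the PACA pointer held in
HSPRG0/r13, and that bit must be cleared before the register can be
dereferenced as a PACA pointer.

```c
#include <assert.h>
#include <stdint.h>

/* Extract and clear the winkle-wakeup flag carried in the low bit of
 * the PACA pointer, mirroring the asm sequence in the idle code. */
static uint64_t clear_winkle_bit(uint64_t r13, int *woke_from_winkle)
{
    *woke_from_winkle = (int)(r13 & 1); /* like: clrldi r5,r13,63 */
    return r13 & ~1ULL;                 /* like: clrrdi r13,r13,1 */
}
```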

Thanks,
-Mahesh.



Re: [PATCH] powerpc/pseries: Use H_CLEAR_HPT to clear MMU hash table during kexec

2016-10-27 Thread Mahesh Jagannath Salgaonkar
On 10/01/2016 04:11 PM, Anton Blanchard wrote:
> From: Anton Blanchard 
> 
> An hcall was recently added that does exactly what we need
> during kexec - it clears the entire MMU hash table, ignoring any
> VRMA mappings.
> 
> Try it and fall back to the old method if we get a failure.
> 
> On a POWER8 box with 5TB of memory, this reduces the time it takes to
> kexec a new kernel from from 4 minutes to 1 minute.
> 
> Signed-off-by: Anton Blanchard 

Tested-by: Mahesh Salgaonkar 

> ---
>  arch/powerpc/include/asm/hvcall.h |  3 ++-
>  arch/powerpc/platforms/pseries/lpar.c | 14 +-
>  2 files changed, 15 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/hvcall.h 
> b/arch/powerpc/include/asm/hvcall.h
> index 708edeb..489748e 100644
> --- a/arch/powerpc/include/asm/hvcall.h
> +++ b/arch/powerpc/include/asm/hvcall.h
> @@ -275,7 +275,8 @@
>  #define H_COP0x304
>  #define H_GET_MPP_X  0x314
>  #define H_SET_MODE   0x31C
> -#define MAX_HCALL_OPCODE H_SET_MODE
> +#define H_CLEAR_HPT  0x358
> +#define MAX_HCALL_OPCODE H_CLEAR_HPT
> 
>  /* H_VIOCTL functions */
>  #define H_GET_VIOA_DUMP_SIZE 0x01
> diff --git a/arch/powerpc/platforms/pseries/lpar.c 
> b/arch/powerpc/platforms/pseries/lpar.c
> index 86707e6..03884a8 100644
> --- a/arch/powerpc/platforms/pseries/lpar.c
> +++ b/arch/powerpc/platforms/pseries/lpar.c
> @@ -221,7 +221,7 @@ static long pSeries_lpar_hpte_remove(unsigned long 
> hpte_group)
>   return -1;
>  }
> 
> -static void pSeries_lpar_hptab_clear(void)
> +static void __pSeries_lpar_clear_hpt(void)
>  {
>   unsigned long size_bytes = 1UL << ppc64_pft_size;
>   unsigned long hpte_count = size_bytes >> 4;
> @@ -249,6 +249,18 @@ static void pSeries_lpar_hptab_clear(void)
>   &(ptes[j].pteh), &(ptes[j].ptel));
>   }
>   }
> +}
> +
> +static void pSeries_lpar_hptab_clear(void)
> +{
> + int rc;
> +
> + do {
> + rc = plpar_hcall_norets(H_CLEAR_HPT);
> + } while (rc == H_CONTINUE);
> +
> + if (rc != H_SUCCESS)
> + __pSeries_lpar_clear_hpt();
> 
>  #ifdef __LITTLE_ENDIAN__
>   /*
> 



Re: [PATCH] powerpc/fadump: Fix the race in crash_fadump().

2016-10-12 Thread Mahesh Jagannath Salgaonkar
On 10/10/2016 04:22 PM, Michael Ellerman wrote:
> Mahesh J Salgaonkar  writes:
> 
>> From: Mahesh Salgaonkar 
>>
>> There are chances that multiple CPUs can call crash_fadump() simultaneously
>> and would start duplicating same info to vmcoreinfo ELF note section. This
>> causes makedumpfile to fail during kdump capture. One example is,
>> triggering dumprestart from HMC which sends system reset to all the CPUs at
>> once.
> ...
>> diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
>> index b3a6633..2ed9d1c 100644
>> --- a/arch/powerpc/kernel/fadump.c
>> +++ b/arch/powerpc/kernel/fadump.c
>> @@ -402,8 +402,14 @@ void crash_fadump(struct pt_regs *regs, const char *str)
>>  {
>>  struct fadump_crash_info_header *fdh = NULL;
>>  
>> -if (!fw_dump.dump_registered || !fw_dump.fadumphdr_addr)
>> +mutex_lock(_mutex);
> 
> What happens when a crashing CPU can't get the mutex and goes to sleep?

Got your point. I think I should use mutex_trylock() here. There are
only two reasons the crashing CPU can't get the mutex: 1) another CPU is
also crashing, got the mutex first, and is on its way to trigger fadump,
or 2) we are in the middle of a fadump register/un-register, in which
case we can just return and go down the normal panic path.
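A userspace sketch of that fallback pattern (this is not the kernel
code; the actual fix would use mutex_trylock() on fadump_mutex, while
the flag and function names below are invented): a crashing CPU must
never sleep, so it tries to take ownership once and, on failure, falls
back to the normal panic path instead of blocking.

```c
#include <assert.h>
#include <stdatomic.h>

/* One-shot ownership flag standing in for fadump_mutex. */
static atomic_flag fadump_in_progress = ATOMIC_FLAG_INIT;

/* Returns 1 if this caller should trigger fadump, 0 if it should just
 * return and let the normal panic path run. */
static int try_crash_fadump(void)
{
    if (atomic_flag_test_and_set(&fadump_in_progress))
        return 0;   /* someone else is dumping or (un)registering */
    /* ... update crash header and trigger fadump here ... */
    return 1;
}
```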

Thanks,
-Mahesh.



Re: [PATCH v2 4/5] powerpc/fadump: Make ELF eflags depend on endian

2016-09-08 Thread Mahesh Jagannath Salgaonkar
On 09/08/2016 12:30 PM, Mahesh Jagannath Salgaonkar wrote:
> On 09/06/2016 11:02 AM, Daniel Axtens wrote:
>> Firmware Assisted Dump is a facility to dump kernel core with assistance
>> from firmware.  As part of this process the kernel ELF version is
>> stored.
>>
>> Currently, fadump.h defines this to 0 if it is not already defined. This
>> clashes with a define in elf.h which sets it based on the current task -
>> not based on the kernel.
>>
>> When the kernel is compiled on LE, the kernel will always be version
>> 2. Otherwise it will be version 0. So the correct behaviour is to set
>> the ELF eflags based on the endianness of the kernel. Do that.
>>
>> Remove the definition in fadump.h, which becomes unused.
>>
>> Cc: Mahesh Salgaonkar <mah...@linux.vnet.ibm.com>
>> Cc: Hari Bathini <hbath...@linux.vnet.ibm.com>
>> Signed-off-by: Daniel Axtens <d...@axtens.net>
>>
>> ---
>>
>> Mahesh or Hari: I'm not familiar with this code at all, so if either of
>> you could check that this makes sense I would really appreciate that.
>> Thanks!
>> ---
>>  arch/powerpc/include/asm/fadump.h | 4 
>>  arch/powerpc/kernel/fadump.c  | 6 +-
>>  2 files changed, 5 insertions(+), 5 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/fadump.h 
>> b/arch/powerpc/include/asm/fadump.h
>> index b4407d0add27..0031806475f0 100644
>> --- a/arch/powerpc/include/asm/fadump.h
>> +++ b/arch/powerpc/include/asm/fadump.h
>> @@ -45,10 +45,6 @@
>>
>>  #define memblock_num_regions(memblock_type) (memblock.memblock_type.cnt)
>>
>> -#ifndef ELF_CORE_EFLAGS
>> -#define ELF_CORE_EFLAGS 0
>> -#endif
>> -
>>  /* Firmware provided dump sections */
>>  #define FADUMP_CPU_STATE_DATA   0x0001
>>  #define FADUMP_HPTE_REGION  0x0002
>> diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
>> index 0638b82ce294..457f08e544c6 100644
>> --- a/arch/powerpc/kernel/fadump.c
>> +++ b/arch/powerpc/kernel/fadump.c
>> @@ -779,7 +779,11 @@ static int fadump_init_elfcore_header(char *bufp)
>>  elf->e_entry = 0;
>>  elf->e_phoff = sizeof(struct elfhdr);
>>  elf->e_shoff = 0;
>> -elf->e_flags = ELF_CORE_EFLAGS;
>> +#ifdef __LITTLE_ENDIAN__
> 
> Wouldn't '#ifdef PPC64_ELF_ABI_v2' be more appropriate here ?

Hari just pointed out to me that the upstream commit
[https://git.kernel.org/cgit/linux/kernel/git/powerpc/linux.git/commit/?id=918d0355]
introduces ELF_CORE_EFLAGS with correct values.

Maybe we are just fine by including <asm/elf.h> in fadump.h along with
your first hunk. What do you say?

Thanks,
-Mahesh.

> 
>> +elf->e_flags = 2;
>> +#else
>> +elf->e_flags = 0;
>> +#endif
>>  elf->e_ehsize = sizeof(struct elfhdr);
>>  elf->e_phentsize = sizeof(struct elf_phdr);
>>  elf->e_phnum = 0;
>>
> 
> Reviewed-by: Mahesh Salgaonkar <mah...@linux.vnet.ibm.com>
> 
> Thanks,
> -Mahesh.
> 
> 



Re: [PATCH v2 4/5] powerpc/fadump: Make ELF eflags depend on endian

2016-09-08 Thread Mahesh Jagannath Salgaonkar
On 09/06/2016 11:02 AM, Daniel Axtens wrote:
> Firmware Assisted Dump is a facility to dump kernel core with assistance
> from firmware.  As part of this process the kernel ELF version is
> stored.
> 
> Currently, fadump.h defines this to 0 if it is not already defined. This
> clashes with a define in elf.h which sets it based on the current task -
> not based on the kernel.
> 
> When the kernel is compiled on LE, the kernel will always be version
> 2. Otherwise it will be version 0. So the correct behaviour is to set
> the ELF eflags based on the endianness of the kernel. Do that.
> 
> Remove the definition in fadump.h, which becomes unused.
> 
> Cc: Mahesh Salgaonkar 
> Cc: Hari Bathini 
> Signed-off-by: Daniel Axtens 
> 
> ---
> 
> Mahesh or Hari: I'm not familiar with this code at all, so if either of
> you could check that this makes sense I would really appreciate that.
> Thanks!
> ---
>  arch/powerpc/include/asm/fadump.h | 4 
>  arch/powerpc/kernel/fadump.c  | 6 +-
>  2 files changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/fadump.h 
> b/arch/powerpc/include/asm/fadump.h
> index b4407d0add27..0031806475f0 100644
> --- a/arch/powerpc/include/asm/fadump.h
> +++ b/arch/powerpc/include/asm/fadump.h
> @@ -45,10 +45,6 @@
> 
>  #define memblock_num_regions(memblock_type)  (memblock.memblock_type.cnt)
> 
> -#ifndef ELF_CORE_EFLAGS
> -#define ELF_CORE_EFLAGS 0
> -#endif
> -
>  /* Firmware provided dump sections */
>  #define FADUMP_CPU_STATE_DATA0x0001
>  #define FADUMP_HPTE_REGION   0x0002
> diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
> index 0638b82ce294..457f08e544c6 100644
> --- a/arch/powerpc/kernel/fadump.c
> +++ b/arch/powerpc/kernel/fadump.c
> @@ -779,7 +779,11 @@ static int fadump_init_elfcore_header(char *bufp)
>   elf->e_entry = 0;
>   elf->e_phoff = sizeof(struct elfhdr);
>   elf->e_shoff = 0;
> - elf->e_flags = ELF_CORE_EFLAGS;
> +#ifdef __LITTLE_ENDIAN__

Wouldn't '#ifdef PPC64_ELF_ABI_v2' be more appropriate here ?

> + elf->e_flags = 2;
> +#else
> + elf->e_flags = 0;
> +#endif
>   elf->e_ehsize = sizeof(struct elfhdr);
>   elf->e_phentsize = sizeof(struct elf_phdr);
>   elf->e_phnum = 0;
> 

Reviewed-by: Mahesh Salgaonkar 

Thanks,
-Mahesh.




Re: [PATCH 1/2] powerpc/pseries: PACA save area fix for general exception vs MCE

2016-08-11 Thread Mahesh Jagannath Salgaonkar
On 08/10/2016 04:18 PM, Nicholas Piggin wrote:
> MCE must not use PACA_EXGEN. When a general exception enables MSR_RI,
> that means SPRN_SRR[01] and SPRN_SPRG are no longer used. However the
> PACA save area is still in use.
> ---
>  arch/powerpc/kernel/exceptions-64s.S | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/exceptions-64s.S 
> b/arch/powerpc/kernel/exceptions-64s.S
> index 694def6..4174c4e 100644
> --- a/arch/powerpc/kernel/exceptions-64s.S
> +++ b/arch/powerpc/kernel/exceptions-64s.S
> @@ -984,14 +984,14 @@ ALT_MMU_FTR_SECTION_END_IFCLR(MMU_FTR_RADIX)
>  machine_check_common:
> 
>   mfspr   r10,SPRN_DAR
> - std r10,PACA_EXGEN+EX_DAR(r13)
> + std r10,PACA_EXMC+EX_DAR(r13)
>   mfspr   r10,SPRN_DSISR
> - stw r10,PACA_EXGEN+EX_DSISR(r13)
> + stw r10,PACA_EXMC+EX_DSISR(r13)
>   EXCEPTION_PROLOG_COMMON(0x200, PACA_EXMC)
>   FINISH_NAP
>   RECONCILE_IRQ_STATE(r10, r11)
> - ld  r3,PACA_EXGEN+EX_DAR(r13)
> - lwz r4,PACA_EXGEN+EX_DSISR(r13)
> + ld  r3,PACA_EXMC+EX_DAR(r13)
> + lwz r4,PACA_EXMC+EX_DSISR(r13)
>   std r3,_DAR(r1)
>   std r4,_DSISR(r1)
>   bl  save_nvgprs
> 

Yup agree. Looks like copy-paste was the culprit.

Acked-by: Mahesh Salgaonkar 




Re: [RESEND PATCH v3 2/2] powernv: Fix MCE handler to avoid trashing CR0/CR1 registers.

2016-08-08 Thread Mahesh Jagannath Salgaonkar
On 08/08/2016 02:28 PM, Michael Ellerman wrote:
> Mahesh J Salgaonkar  writes:
> 
>> From: Mahesh Salgaonkar 
>>
>> The current implementation of MCE early handling modifies CR0/1 registers
>> without saving its old values. Fix this by moving early check for
>> powersaving mode to machine_check_handle_early().
>>
>> The power architecture 2.06 or later allows the possibility of getting
>> machine check while in nap/sleep/winkle. The last bit of HSPRG0 is set
>> to 1, if thread is woken up from winkle. Hence, clear the last bit of
>> HSPRG0 (r13) before MCE handler starts using it as paca pointer.
>>
>> Also, the current code always puts the thread into nap state irrespective
>> of whatever idle state it woke up from. Fix that by looking at
>> paca->thread_idle_state and put the thread back into same state where it
>> came from.
>>
>> Cc: sta...@vger.kernel.org
> 
> The information I need is "which commit introduced the bug".

It fixes commit 1c51089: powerpc/book3s: Return from interrupt if coming
from evil context.

> Given that I can work out which stable releases we should backport the
> patch to.

It will need a backport to stable once it hits upstream.

-Mahesh.

> 
> cheers
> 



Re: [PATCH] powernv: Load correct TOC pointer while waking up from winkle.

2016-08-05 Thread Mahesh Jagannath Salgaonkar
On 08/06/2016 04:08 AM, Benjamin Herrenschmidt wrote:
> On Fri, 2016-08-05 at 19:13 +0530, Mahesh J Salgaonkar wrote:
>> From: Mahesh Salgaonkar 
>>
>> The function pnv_restore_hyp_resource() loads the TOC into r2 from
>> the invalid PACA pointer before fixing r13 value. This do not affect
>> POWER ISA 3.0 but it does have an impact on POWER ISA 2.07 or less
>> leading CPU to get stuck forever.
> 
> When was this broken ? Should this get backported to stable ?

This is broken by the recent Power9 CPU idle changes (commit bcef83a00)
that went into Linus' master after v4.7. We are fine with v4.7.

-Mahesh.

> 
>>  login: [  471.830433] Processor 120 is stuck.
>>
>>
>> This can be easily reproducible using following steps:
>> - Turn off SMT
>>  $ ppc64_cpu --smt=off
>> - offline/online any online cpu (Thread 0 of any core which is
>> online)
>>  $ echo 0 > /sys/devices/system/cpu/cpu/online
>>  $ echo 1 > /sys/devices/system/cpu/cpu/online
>>
>> For POWER ISA 2.07 or less, the last bit of HSPRG0 is set indicating
>> that thread is waking up from winkle. Hence, the last bit of
>> HSPRG0(r13)
>> needs to be clear before accessing it as PACA to avoid loading
>> invalid
>> values from invalid PACA pointer.
>>
>> Fix this by loading TOC after r13 register is corrected.
>>
>> Signed-off-by: Mahesh Salgaonkar 
>> ---
>>  arch/powerpc/kernel/idle_book3s.S |5 -
>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/kernel/idle_book3s.S
>> b/arch/powerpc/kernel/idle_book3s.S
>> index 8a56a51..45784ec 100644
>> --- a/arch/powerpc/kernel/idle_book3s.S
>> +++ b/arch/powerpc/kernel/idle_book3s.S
>> @@ -363,8 +363,8 @@ _GLOBAL(power9_idle_stop)
>>   * cr3 - set to gt if waking up with partial/complete hypervisor
>> state loss
>>   */
>>  _GLOBAL(pnv_restore_hyp_resource)
>> -ld  r2,PACATOC(r13);
>>  BEGIN_FTR_SECTION
>> +ld  r2,PACATOC(r13);
>>  /*
>>   * POWER ISA 3. Use PSSCR to determine if we
>>   * are waking up from deep idle state
>> @@ -395,6 +395,9 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
>>   */
>>  clrldi  r5,r13,63
>>  clrrdi  r13,r13,1
>> +
>> +/* Now that we are sure r13 is corrected, load TOC */
>> +ld  r2,PACATOC(r13);
>>  cmpwi   cr4,r5,1
>>  mtspr   SPRN_HSPRG0,r13
>>  
> 



Re: [PATCH] powerpc/book3s: Fix MCE console messages for unrecoverable MCE.

2016-08-04 Thread Mahesh Jagannath Salgaonkar
On 08/04/2016 03:27 PM, Michael Ellerman wrote:
> Mahesh J Salgaonkar  writes:
> 
>> From: Mahesh Salgaonkar 
>>
>> When machine check occurs with MSR(RI=0), it means MC interrupt is
>> unrecoverable and kernel goes down to panic path. But the console
>> message still shows it as recovered. This patch fixes the MCE console
>> messages.
>>
>> Signed-off-by: Mahesh Salgaonkar 
>> ---
>>  arch/powerpc/kernel/mce.c |3 ++-
>>  arch/powerpc/platforms/powernv/opal.c |2 ++
>>  2 files changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
>> index ef267fd..5e7ece0 100644
>> --- a/arch/powerpc/kernel/mce.c
>> +++ b/arch/powerpc/kernel/mce.c
>> @@ -92,7 +92,8 @@ void save_mce_event(struct pt_regs *regs, long handled,
>>  mce->in_use = 1;
>>  
>>  mce->initiator = MCE_INITIATOR_CPU;
>> -if (handled)
>> +/* Mark it recovered if we have handled it and MSR(RI=1). */
>> +if (handled && (regs->msr & MSR_RI))
>>  mce->disposition = MCE_DISPOSITION_RECOVERED;
> 
> This seems like it has bigger implications than just changing the
> printk output? We're now (correctly) marking any MC where RI=0 as
> unrecoverable.
> 
> Or is the only place that uses this the code below which *also* checks
> MSR_RI?

We always check MSR_RI in the code below and panic correctly. It was
just that we were always printing it as recovered and then panicking.

> 
>> diff --git a/arch/powerpc/platforms/powernv/opal.c 
>> b/arch/powerpc/platforms/powernv/opal.c
>> index 5385434..8154171 100644
>> --- a/arch/powerpc/platforms/powernv/opal.c
>> +++ b/arch/powerpc/platforms/powernv/opal.c
>> @@ -401,6 +401,8 @@ static int opal_recover_mce(struct pt_regs *regs,
>>  
>>  if (!(regs->msr & MSR_RI)) {
>>  /* If MSR_RI isn't set, we cannot recover */
> 
> Why do we check MSR_RI again here? Shouldn't we just be looking at the 
> evt->disposition?

When MSR_RI=0, the SRR0/SRR1 register values have been trashed, so the
kernel cannot continue reliably if we return from the interrupt. It
should definitely go down the panic path. Hence we check for RI=0 and
return 0. Whereas, if MSR_RI=1 and the disposition is "unrecovered", we
can minimise the damage to the user process if this MCE was hit in a
user-space context.

The print is just to tell that the kernel panicked because the MCE
occurred during a rare window where the MSR RI bit was set to zero, not
that the handler could not fix the error.
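The decision described above can be sketched as a small pure function
(constant values and names here are illustrative, not the kernel's
definitions): with RI=0 the SRR0/SRR1 contents are unreliable, so the
interrupt is unrecoverable no matter what the handler fixed; only with
RI=1 does the event disposition decide.

```c
#include <assert.h>
#include <stdbool.h>

#define MSR_RI (1UL << 1)   /* illustrative bit position */

enum mce_disposition {
    MCE_DISPOSITION_NOT_RECOVERED = 0,  /* illustrative values */
    MCE_DISPOSITION_RECOVERED = 1,
};

static bool mce_recovered(unsigned long msr, enum mce_disposition d)
{
    if (!(msr & MSR_RI))
        return false;   /* cannot return from the interrupt reliably */
    return d == MCE_DISPOSITION_RECOVERED;
}
```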

> 
>> +printk(KERN_ERR "Machine check interrupt unrecoverable:"
>> +" MSR(RI=0)\n");
> 
> Are we sure it's safe to call printk() there?

Yes, we had just printed MCE event info before we came here.

> 
> Please don't split the message across lines, and use pr_err() like the
> rest of the code in this file. So it would be:
> 
>   pr_err("Machine check interrupt unrecoverable: MSR(RI=0)\n");

Sure. Will make the change.

> 
>>  recovered = 0;
>>  } else if (evt->disposition == MCE_DISPOSITION_RECOVERED) {
>>  /* Platform corrected itself */
> 
> cheers
> 



Re: [PATCH] powerpc/book3s: Fix MCE console messages for unrecoverable MCE.

2016-08-04 Thread Mahesh Jagannath Salgaonkar
On 08/04/2016 01:35 PM, Greg KH wrote:
> On Thu, Aug 04, 2016 at 10:16:48AM +0530, Mahesh J Salgaonkar wrote:
>> From: Mahesh Salgaonkar 
>>
>> When machine check occurs with MSR(RI=0), it means MC interrupt is
>> unrecoverable and kernel goes down to panic path. But the console
>> message still shows it as recovered. This patch fixes the MCE console
>> messages.
>>
>> Signed-off-by: Mahesh Salgaonkar 
>> ---
>>  arch/powerpc/kernel/mce.c |3 ++-
>>  arch/powerpc/platforms/powernv/opal.c |2 ++
>>  2 files changed, 4 insertions(+), 1 deletion(-)
> 
> 
> 
> 
> This is not the correct way to submit patches for inclusion in the
> stable kernel tree.  Please read Documentation/stable_kernel_rules.txt
> for how to do this properly.
> 
> 
> 

Ouch. My mistake. Will follow Documentation/stable_kernel_rules.txt

Thanks,
-Mahesh.


