Re: [PATCH 11/15] powerpc: convert to setup_initial_init_mm()

2021-05-29 Thread Santosh Sivaraj
Kefeng Wang  writes:

> Use setup_initial_init_mm() helper to simplify code.
>
> Cc: Michael Ellerman 
> Cc: Benjamin Herrenschmidt 
> Cc: linuxppc-dev@lists.ozlabs.org
> Signed-off-by: Kefeng Wang 
> ---
>  arch/powerpc/kernel/setup-common.c | 5 +
>  1 file changed, 1 insertion(+), 4 deletions(-)
>
> diff --git a/arch/powerpc/kernel/setup-common.c 
> b/arch/powerpc/kernel/setup-common.c
> index 046fe21b5c3b..c046d99efd18 100644
> --- a/arch/powerpc/kernel/setup-common.c
> +++ b/arch/powerpc/kernel/setup-common.c
> @@ -928,10 +928,7 @@ void __init setup_arch(char **cmdline_p)
>  
>   klp_init_thread_info(&init_task);
>  
> - init_mm.start_code = (unsigned long)_stext;
> - init_mm.end_code = (unsigned long) _etext;
> - init_mm.end_data = (unsigned long) _edata;
> - init_mm.brk = klimit;
> + setup_initial_init_mm(_stext, _etext, _edata, _end);

This function's definition is not visible to those who are subscribed only to
the linuxppc-dev mailing list; I had to do a web search with the message ID to
find it.
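For reference, the helper comes from the generic mm patch earlier in this
series; reconstructed from memory (treat it as a sketch rather than the exact
patch), it boils down to:

	void __init setup_initial_init_mm(void *start_code, void *end_code,
					  void *end_data, void *brk)
	{
		init_mm.start_code = (unsigned long)start_code;
		init_mm.end_code = (unsigned long)end_code;
		init_mm.end_data = (unsigned long)end_data;
		init_mm.brk = (unsigned long)brk;
	}

so the powerpc conversion above is a one-to-one replacement, with init_mm.brk
now taken from _end instead of klimit.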

Thanks,
Santosh

>  
>   mm_iommu_init(&init_mm);
>   irqstack_early_init();
> -- 
> 2.26.2


Re: [PATCH] powerpc/papr_scm: Add support for reporting dirty-shutdown-count

2021-05-26 Thread Santosh Sivaraj


Hi Vaibhav,

Vaibhav Jain  writes:

> Persistent memory devices like NVDIMMs can lose cached writes in case
> something prevents a flush on power failure. Such situations are termed
> dirty shutdowns and are exposed to applications as a
> last-shutdown-state (LSS) flag and a dirty-shutdown-counter (DSC), as
> described at [1]. The latter is useful in conditions where multiple
> applications want to detect a dirty shutdown event without racing with
> one another.
>
> PAPR-NVDIMMs have so far only exposed LSS-style flags to indicate a
> dirty shutdown state. This patch further adds support for DSC via the
> "ibm,persistence-failed-count" device tree property of an NVDIMM. This
> property is a monotonically increasing 64-bit counter that indicates
> how many times an NVDIMM has encountered a dirty-shutdown event causing
> persistence loss.
>
> Since this value is not expected to change after system boot,
> papr_scm reads & caches it during NVDIMM probe and exposes it as a
> PAPR sysfs attribute named 'dirty_shutdown', matching the name of the
> similarly named NFIT sysfs attribute. The value is also available to
> libnvdimm via the PAPR_PDSM_HEALTH payload: 'struct nd_papr_pdsm_health'
> has been extended with a new member, 'dimm_dsc', whose presence is
> indicated by the newly introduced PDSM_DIMM_DSC_VALID flag.
>
> References:
> [1] https://pmem.io/documents/Dirty_Shutdown_Handling-V1.0.pdf
>
> Signed-off-by: Vaibhav Jain 
> ---
>  arch/powerpc/include/uapi/asm/papr_pdsm.h |  6 +
>  arch/powerpc/platforms/pseries/papr_scm.c | 30 +++
>  2 files changed, 36 insertions(+)
>
> diff --git a/arch/powerpc/include/uapi/asm/papr_pdsm.h 
> b/arch/powerpc/include/uapi/asm/papr_pdsm.h
> index 50ef95e2f5b1..82488b1e7276 100644
> --- a/arch/powerpc/include/uapi/asm/papr_pdsm.h
> +++ b/arch/powerpc/include/uapi/asm/papr_pdsm.h
> @@ -77,6 +77,9 @@
>  /* Indicate that the 'dimm_fuel_gauge' field is valid */
>  #define PDSM_DIMM_HEALTH_RUN_GAUGE_VALID 1
>  
> +/* Indicate that the 'dimm_dsc' field is valid */
> +#define PDSM_DIMM_DSC_VALID 2
> +
>  /*
>   * Struct exchanged between kernel & ndctl in for PAPR_PDSM_HEALTH
>   * Various flags indicate the health status of the dimm.
> @@ -105,6 +108,9 @@ struct nd_papr_pdsm_health {
>  
>   /* Extension flag PDSM_DIMM_HEALTH_RUN_GAUGE_VALID */
>   __u16 dimm_fuel_gauge;
> +
> + /* Extension flag PDSM_DIMM_DSC_VALID */
> + __u64 dimm_dsc;
>   };
>   __u8 buf[ND_PDSM_PAYLOAD_MAX_SIZE];
>   };
> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
> b/arch/powerpc/platforms/pseries/papr_scm.c
> index 11e7b90a3360..68f0d3d5e899 100644
> --- a/arch/powerpc/platforms/pseries/papr_scm.c
> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> @@ -114,6 +114,9 @@ struct papr_scm_priv {
>   /* Health information for the dimm */
>   u64 health_bitmap;
>  
> + /* Holds the last known dirty shutdown counter value */
> + u64 dirty_shutdown_counter;
> +
>   /* length of the stat buffer as expected by phyp */
>   size_t stat_buffer_len;
>  };
> @@ -603,6 +606,16 @@ static int papr_pdsm_fuel_gauge(struct papr_scm_priv *p,
>   return rc;
>  }
>  
> +/* Add the dirty-shutdown-counter value to the pdsm */
> +static int papr_psdm_dsc(struct papr_scm_priv *p,
    should be pdsm
> +  union nd_pdsm_payload *payload)
> +{
> + payload->health.extension_flags |= PDSM_DIMM_DSC_VALID;
> + payload->health.dimm_dsc = p->dirty_shutdown_counter;
> +
> + return sizeof(struct nd_papr_pdsm_health);
> +}
> +
>  /* Fetch the DIMM health info and populate it in provided package. */
>  static int papr_pdsm_health(struct papr_scm_priv *p,
>   union nd_pdsm_payload *payload)
> @@ -646,6 +659,8 @@ static int papr_pdsm_health(struct papr_scm_priv *p,
>  
>   /* Populate the fuel gauge meter in the payload */
>   papr_pdsm_fuel_gauge(p, payload);
> + /* Populate the dirty-shutdown-counter field */
> + papr_psdm_dsc(p, payload);
  same typo
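Side note: the probe-time caching that the changelog describes is not in the
quoted hunks; presumably it is a device-tree read along these lines
(hypothetical sketch, property name taken from the changelog, 'dn' being the
OF node already used in papr_scm_probe()):

	if (of_property_read_u64(dn, "ibm,persistence-failed-count",
				 &p->dirty_shutdown_counter))
		p->dirty_shutdown_counter = 0;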

Thanks,
Santosh

>  
>   rc = sizeof(struct nd_papr_pdsm_health);
>  
> @@ -907,6 +922,16 @@ static ssize_t flags_show(struct device *dev,
>  }
>  DEVICE_ATTR_RO(flags);
>  
> +static ssize_t dirty_shutdown_show(struct device *dev,
> +   struct device_attribute *attr, char *buf)
> +{
> + struct nvdimm *dimm = to_nvdimm(dev);
> + struct papr_scm_priv *p = nvdimm_provider_data(dimm);
> +
> + return sysfs_emit(buf, "%llu\n", p->dirty_shutdown_counter);
> +}
> +DEVICE_ATTR_RO(dirty_shutdown);
> +
>  static umode_t papr_nd_attribute_visible(struct kobject *kobj,
>struct attribute *attr, int n)
>  {
> @@ -925,6 +950,7 @@ static umode_t papr_nd_attribute_visible(struct kobject 
> *kobj,
>  static struct attribute *papr_nd_attributes[] = 

Re: [PATCH 1/2] powerpc: Free fdt on error in elf64_load()

2021-04-21 Thread Santosh Sivaraj
Lakshmi Ramasubramanian  writes:

> On 4/20/21 10:35 PM, Santosh Sivaraj wrote:
> Hi Santosh,
>
>> 
>>> There are a few "goto out;" statements before the local variable "fdt"
>>> is initialized through the call to of_kexec_alloc_and_setup_fdt() in
>>> elf64_load().  This will result in an uninitialized "fdt" being passed
>>> to kvfree() in this function if there is an error before the call to
>>> of_kexec_alloc_and_setup_fdt().
>>>
>>> If there is any error after fdt is allocated, but before it is
>>> saved in the arch specific kimage struct, free the fdt.
>>>
>>> Signed-off-by: Lakshmi Ramasubramanian 
>>> Reported-by: kernel test robot 
>>> Reported-by: Dan Carpenter 
>>> Suggested-by: Michael Ellerman 
>>> ---
>>>   arch/powerpc/kexec/elf_64.c | 16 ++--
>>>   1 file changed, 6 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/arch/powerpc/kexec/elf_64.c b/arch/powerpc/kexec/elf_64.c
>>> index 5a569bb51349..02662e72c53d 100644
>>> --- a/arch/powerpc/kexec/elf_64.c
>>> +++ b/arch/powerpc/kexec/elf_64.c
>>> @@ -114,7 +114,7 @@ static void *elf64_load(struct kimage *image, char 
>>> *kernel_buf,
>>> ret = setup_new_fdt_ppc64(image, fdt, initrd_load_addr,
>>>   initrd_len, cmdline);
>>> if (ret)
>>> -   goto out;
>>> +   goto out_free_fdt;
>> 
>> Shouldn't there be a goto out_free_fdt if fdt_open_into fails?
>
> You are likely looking at elf_64.c in the mainline branch. The patch I 
> have submitted is based on Rob's device-tree for-next branch. Please see 
> the link below:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/robh/linux.git/tree/arch/powerpc/kexec/elf_64.c?h=for-next

That's right, I was indeed looking at the mainline. Sorry for the noise.

Thanks,
Santosh

>
>> 
>>>   
>>> fdt_pack(fdt);
>>>   
>>> @@ -125,7 +125,7 @@ static void *elf64_load(struct kimage *image, char 
>>> *kernel_buf,
>>> kbuf.mem = KEXEC_BUF_MEM_UNKNOWN;
>>> ret = kexec_add_buffer(&kbuf);
>>> if (ret)
>>> -   goto out;
>>> +   goto out_free_fdt;
>>>   
>>> /* FDT will be freed in arch_kimage_file_post_load_cleanup */
>>> image->arch.fdt = fdt;
>>> @@ -140,18 +140,14 @@ static void *elf64_load(struct kimage *image, char 
>>> *kernel_buf,
>>> if (ret)
>>> pr_err("Error setting up the purgatory.\n");
>>>   
>>> +   goto out;
>>> +
>>> +out_free_fdt:
>>> +   kvfree(fdt);
>> 
>> Can just use kfree here?
> "fdt" is allocated through kvmalloc(). So it is freed using kvfree.
>
> thanks,
>   -lakshmi
>
>>>   out:
>>> kfree(modified_cmdline);
>>> kexec_free_elf_info(&elf_info);
>>>   
>>> -   /*
>>> -* Once FDT buffer has been successfully passed to kexec_add_buffer(),
>>> -* the FDT buffer address is saved in image->arch.fdt. In that case,
>>> -* the memory cannot be freed here in case of any other error.
>>> -*/
>>> -   if (ret && !image->arch.fdt)
>>> -   kvfree(fdt);
>>> -
>>> return ret ? ERR_PTR(ret) : NULL;
>>>   }
>>>   
>>> -- 
>>> 2.31.0


Re: [PATCH 1/2] powerpc: Free fdt on error in elf64_load()

2021-04-20 Thread Santosh Sivaraj


Hi Lakshmi,

Lakshmi Ramasubramanian  writes:

> There are a few "goto out;" statements before the local variable "fdt"
> is initialized through the call to of_kexec_alloc_and_setup_fdt() in
> elf64_load().  This will result in an uninitialized "fdt" being passed
> to kvfree() in this function if there is an error before the call to
> of_kexec_alloc_and_setup_fdt().
>
> If there is any error after fdt is allocated, but before it is
> saved in the arch specific kimage struct, free the fdt.
>
> Signed-off-by: Lakshmi Ramasubramanian 
> Reported-by: kernel test robot 
> Reported-by: Dan Carpenter 
> Suggested-by: Michael Ellerman 
> ---
>  arch/powerpc/kexec/elf_64.c | 16 ++--
>  1 file changed, 6 insertions(+), 10 deletions(-)
>
> diff --git a/arch/powerpc/kexec/elf_64.c b/arch/powerpc/kexec/elf_64.c
> index 5a569bb51349..02662e72c53d 100644
> --- a/arch/powerpc/kexec/elf_64.c
> +++ b/arch/powerpc/kexec/elf_64.c
> @@ -114,7 +114,7 @@ static void *elf64_load(struct kimage *image, char 
> *kernel_buf,
>   ret = setup_new_fdt_ppc64(image, fdt, initrd_load_addr,
> initrd_len, cmdline);
>   if (ret)
> - goto out;
> + goto out_free_fdt;

Shouldn't there be a goto out_free_fdt if fdt_open_into fails?

>  
>   fdt_pack(fdt);
>  
> @@ -125,7 +125,7 @@ static void *elf64_load(struct kimage *image, char 
> *kernel_buf,
>   kbuf.mem = KEXEC_BUF_MEM_UNKNOWN;
>   ret = kexec_add_buffer(&kbuf);
>   if (ret)
> - goto out;
> + goto out_free_fdt;
>  
>   /* FDT will be freed in arch_kimage_file_post_load_cleanup */
>   image->arch.fdt = fdt;
> @@ -140,18 +140,14 @@ static void *elf64_load(struct kimage *image, char 
> *kernel_buf,
>   if (ret)
>   pr_err("Error setting up the purgatory.\n");
>  
> + goto out;
> +
> +out_free_fdt:
> + kvfree(fdt);

Can just use kfree here?

Thanks,
Santosh
>  out:
>   kfree(modified_cmdline);
>   kexec_free_elf_info(&elf_info);
>  
> - /*
> -  * Once FDT buffer has been successfully passed to kexec_add_buffer(),
> -  * the FDT buffer address is saved in image->arch.fdt. In that case,
> -  * the memory cannot be freed here in case of any other error.
> -  */
> - if (ret && !image->arch.fdt)
> - kvfree(fdt);
> -
>   return ret ? ERR_PTR(ret) : NULL;
>  }
>  
> -- 
> 2.31.0
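To make the problem described in the changelog concrete, the pre-patch shape of
elf64_load() is roughly the following (illustrative sketch, argument lists and
intermediate steps abridged):

	void *fdt;		/* local, never initialised on early errors */
	int ret;

	ret = kexec_build_elf_info(kernel_buf, kernel_len, &ehdr, &elf_info);
	if (ret)
		goto out;	/* jumps past the first assignment of fdt */

	/* ... fdt = of_kexec_alloc_and_setup_fdt(...); further setup ... */

out:
	kfree(modified_cmdline);
	kexec_free_elf_info(&elf_info);

	/* pre-patch cleanup: runs with an uninitialised fdt on early errors */
	if (ret && !image->arch.fdt)
		kvfree(fdt);

	return ret ? ERR_PTR(ret) : NULL;

The patch replaces that conditional kvfree() with an explicit out_free_fdt
label that is only reachable after fdt has been assigned.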


Re: [PATCH] powerpc/mce: save ignore_event flag unconditionally for UE

2021-04-20 Thread Santosh Sivaraj
Ganesh  writes:

> On 4/20/21 12:54 PM, Santosh Sivaraj wrote:
>
>> Hi Ganesh,
>>
>> Ganesh Goudar  writes:
>>
>>> When we hit a UE while using the machine check safe copy routines,
>>> the ignore_event flag is set and the event is ignored by the MCE
>>> handler. The flag is also saved for deferred handling and printing of
>>> the MCE event information. But as of now the flag is only saved after
>>> checking that the effective address is provided and the physical
>>> address is calculated, which is not right.
>>>
>>> Save ignore_event flag regardless of whether the effective address is
>>> provided or physical address is calculated.
>>>
>>> Without this change following log is seen, when the event is to be
>>> ignored.
>>>
>>> [  512.971365] MCE: CPU1: machine check (Severe)  UE Load/Store [Recovered]
>>> [  512.971509] MCE: CPU1: NIP: [c00b67c0] memcpy+0x40/0x90
>>> [  512.971655] MCE: CPU1: Initiator CPU
>>> [  512.971739] MCE: CPU1: Unknown
>>> [  512.972209] MCE: CPU1: machine check (Severe)  UE Load/Store [Recovered]
>>> [  512.972334] MCE: CPU1: NIP: [c00b6808] memcpy+0x88/0x90
>>> [  512.972456] MCE: CPU1: Initiator CPU
>>> [  512.972534] MCE: CPU1: Unknown
>>>
>>> Signed-off-by: Ganesh Goudar 
>>> ---
>>>   arch/powerpc/kernel/mce.c | 3 ++-
>>>   1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
>>> index 11f0cae086ed..db9363e131ce 100644
>>> --- a/arch/powerpc/kernel/mce.c
>>> +++ b/arch/powerpc/kernel/mce.c
>>> @@ -131,6 +131,8 @@ void save_mce_event(struct pt_regs *regs, long handled,
>>>  * Populate the mce error_type and type-specific error_type.
>>>  */
>>> mce_set_error_info(mce, mce_err);
>>> +   if (mce->error_type == MCE_ERROR_TYPE_UE)
>>> +   mce->u.ue_error.ignore_event = mce_err->ignore_event;
>>>   
>>> if (!addr)
>>> return;
>>> @@ -159,7 +161,6 @@ void save_mce_event(struct pt_regs *regs, long handled,
>>> if (phys_addr != ULONG_MAX) {
>>> mce->u.ue_error.physical_address_provided = true;
>>> mce->u.ue_error.physical_address = phys_addr;
>>> -   mce->u.ue_error.ignore_event = mce_err->ignore_event;
>>> machine_check_ue_event(mce);
>>>     }
>>> }
>> Small nit:
>> Setting ignore event can happen before the phys_addr check, under the 
>> existing
>> check for MCE_ERROR_TYPE_UE, instead of repeating the same condition again.
>
> In some cases we may not get the effective address either, so it is placed
> before the effective address check.

Yes, I forgot the last two lines in the changelog after I applied the patch :-)

Thanks,
Santosh
>
>>
>> Except for the above nit
>>
>> Reviewed-by: Santosh Sivaraj 
>>
>> Thanks,
>> Santosh
>>> -- 
>>> 2.26.2
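For anyone following along, the abridged shape of save_mce_event() (approximate
mainline code, other error types elided) shows why the new assignment has to
sit before the early return rather than inside the existing UE branch:

	mce_set_error_info(mce, mce_err);
	if (mce->error_type == MCE_ERROR_TYPE_UE)		/* new hunk */
		mce->u.ue_error.ignore_event = mce_err->ignore_event;

	if (!addr)
		return;		/* a UE may come without an effective address */

	if (mce->error_type == MCE_ERROR_TYPE_UE) {		/* existing check */
		mce->u.ue_error.effective_address_provided = true;
		mce->u.ue_error.effective_address = addr;
		if (phys_addr != ULONG_MAX) {
			mce->u.ue_error.physical_address_provided = true;
			mce->u.ue_error.physical_address = phys_addr;
			machine_check_ue_event(mce);
		}
	}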


Re: [PATCH] powerpc/mce: save ignore_event flag unconditionally for UE

2021-04-20 Thread Santosh Sivaraj


Hi Ganesh,

Ganesh Goudar  writes:

> When we hit a UE while using the machine check safe copy routines,
> the ignore_event flag is set and the event is ignored by the MCE
> handler. The flag is also saved for deferred handling and printing of
> the MCE event information. But as of now the flag is only saved after
> checking that the effective address is provided and the physical
> address is calculated, which is not right.
>
> Save ignore_event flag regardless of whether the effective address is
> provided or physical address is calculated.
>
> Without this change following log is seen, when the event is to be
> ignored.
>
> [  512.971365] MCE: CPU1: machine check (Severe)  UE Load/Store [Recovered]
> [  512.971509] MCE: CPU1: NIP: [c00b67c0] memcpy+0x40/0x90
> [  512.971655] MCE: CPU1: Initiator CPU
> [  512.971739] MCE: CPU1: Unknown
> [  512.972209] MCE: CPU1: machine check (Severe)  UE Load/Store [Recovered]
> [  512.972334] MCE: CPU1: NIP: [c00b6808] memcpy+0x88/0x90
> [  512.972456] MCE: CPU1: Initiator CPU
> [  512.972534] MCE: CPU1: Unknown
>
> Signed-off-by: Ganesh Goudar 
> ---
>  arch/powerpc/kernel/mce.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
> index 11f0cae086ed..db9363e131ce 100644
> --- a/arch/powerpc/kernel/mce.c
> +++ b/arch/powerpc/kernel/mce.c
> @@ -131,6 +131,8 @@ void save_mce_event(struct pt_regs *regs, long handled,
>* Populate the mce error_type and type-specific error_type.
>*/
>   mce_set_error_info(mce, mce_err);
> + if (mce->error_type == MCE_ERROR_TYPE_UE)
> + mce->u.ue_error.ignore_event = mce_err->ignore_event;
>  
>   if (!addr)
>   return;
> @@ -159,7 +161,6 @@ void save_mce_event(struct pt_regs *regs, long handled,
>   if (phys_addr != ULONG_MAX) {
>   mce->u.ue_error.physical_address_provided = true;
>   mce->u.ue_error.physical_address = phys_addr;
> - mce->u.ue_error.ignore_event = mce_err->ignore_event;
>   machine_check_ue_event(mce);
>   }
>   }

Small nit:
Setting ignore event can happen before the phys_addr check, under the existing
check for MCE_ERROR_TYPE_UE, instead of repeating the same condition again.

Except for the above nit

Reviewed-by: Santosh Sivaraj 

Thanks,
Santosh
> -- 
> 2.26.2


[PATCH v2] kernel/watchdog: Fix watchdog_allowed_mask not used warning

2020-11-05 Thread Santosh Sivaraj
Define watchdog_allowed_mask only when SOFTLOCKUP_DETECTOR is enabled.

Fixes: 7feeb9cd4f5b ("watchdog/sysctl: Clean up sysctl variable name space")
Cc: Thomas Gleixner 
Cc: Andrew Morton 
Reviewed-by: Petr Mladek 
Signed-off-by: Santosh Sivaraj 
---
v2:
Added Petr's Reviewed-by from [1] and added a Fixes tag as suggested by Christophe.

[1]: https://lkml.org/lkml/2020/8/20/1030
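For context, the warning appears because the mask was defined near the top of
kernel/watchdog.c but is only referenced by the softlockup code; after the move
it lives inside the existing CONFIG_SOFTLOCKUP_DETECTOR section, roughly
(abridged sketch, not the full file):

/* kernel/watchdog.c, after the patch */
struct cpumask watchdog_cpumask __read_mostly;		/* stays global */

#ifdef CONFIG_SOFTLOCKUP_DETECTOR
/* ... softlockup-only globals ... */

/* only the softlockup detector references this mask */
static struct cpumask watchdog_allowed_mask __read_mostly;

/* Global variables, exported for sysctl */
unsigned int __read_mostly softlockup_panic =
			CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC_VALUE;
/* ... */
#endif /* CONFIG_SOFTLOCKUP_DETECTOR */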

 kernel/watchdog.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 5abb5b22ad13..71109065bd8e 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -44,8 +44,6 @@ int __read_mostly soft_watchdog_user_enabled = 1;
 int __read_mostly watchdog_thresh = 10;
 static int __read_mostly nmi_watchdog_available;
 
-static struct cpumask watchdog_allowed_mask __read_mostly;
-
 struct cpumask watchdog_cpumask __read_mostly;
unsigned long *watchdog_cpumask_bits = cpumask_bits(&watchdog_cpumask);
 
@@ -162,6 +160,8 @@ static void lockup_detector_update_enable(void)
 int __read_mostly sysctl_softlockup_all_cpu_backtrace;
 #endif
 
+static struct cpumask watchdog_allowed_mask __read_mostly;
+
 /* Global variables, exported for sysctl */
 unsigned int __read_mostly softlockup_panic =
CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC_VALUE;
-- 
2.26.2



[RESEND PATCH] kernel/watchdog: Fix watchdog_allowed_mask not used warning

2020-11-03 Thread Santosh Sivaraj
Define watchdog_allowed_mask only when SOFTLOCKUP_DETECTOR is enabled.

Signed-off-by: Santosh Sivaraj 
---

Original patch is here:
https://lore.kernel.org/lkml/20190807014417.9418-1-sant...@fossix.org/

A similar patch was also sent by Balamuruhan and reviewed by Petr.
https://lkml.org/lkml/2020/8/20/1030

 kernel/watchdog.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 5abb5b22ad13..71109065bd8e 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -44,8 +44,6 @@ int __read_mostly soft_watchdog_user_enabled = 1;
 int __read_mostly watchdog_thresh = 10;
 static int __read_mostly nmi_watchdog_available;
 
-static struct cpumask watchdog_allowed_mask __read_mostly;
-
 struct cpumask watchdog_cpumask __read_mostly;
unsigned long *watchdog_cpumask_bits = cpumask_bits(&watchdog_cpumask);
 
@@ -162,6 +160,8 @@ static void lockup_detector_update_enable(void)
 int __read_mostly sysctl_softlockup_all_cpu_backtrace;
 #endif
 
+static struct cpumask watchdog_allowed_mask __read_mostly;
+
 /* Global variables, exported for sysctl */
 unsigned int __read_mostly softlockup_panic =
CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC_VALUE;
-- 
2.26.2



[PATCH trivial] ppc64/mm: remove comment that is no longer valid

2020-07-21 Thread Santosh Sivaraj
hash_low_64.S was removed in [1], and since then flush_hash_page() is no
longer called from any assembly routine.

[1]: commit a43c0eb8364c0 ("powerpc/mm: Convert 4k insert from asm to C")

Signed-off-by: Santosh Sivaraj 
---
 arch/powerpc/mm/book3s64/hash_utils.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/hash_utils.c 
b/arch/powerpc/mm/book3s64/hash_utils.c
index 468169e33c86f..90ee0be3281a9 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -1706,10 +1706,6 @@ unsigned long pte_get_hash_gslot(unsigned long vpn, 
unsigned long shift,
return gslot;
 }
 
-/*
- * WARNING: This is called from hash_low_64.S, if you change this prototype,
- *  do not forget to update the assembly call site !
- */
 void flush_hash_page(unsigned long vpn, real_pte_t pte, int psize, int ssize,
 unsigned long flags)
 {
-- 
2.26.2



[PATCH v2 2/2] papr/scm: Add bad memory ranges to nvdimm bad ranges

2020-07-09 Thread Santosh Sivaraj
Subscribe to the MCE notification chain and add the physical address which
generated a memory error to the nvdimm bad range.

Reviewed-by: Mahesh Salgaonkar 
Signed-off-by: Santosh Sivaraj 
---
 arch/powerpc/platforms/pseries/papr_scm.c | 96 ++-
 1 file changed, 95 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index 9c569078a09fd..90729029ca010 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -13,9 +13,11 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
+#include 
 
 #define BIND_ANY_ADDR (~0ul)
 
@@ -80,6 +82,7 @@ struct papr_scm_priv {
struct resource res;
struct nd_region *region;
struct nd_interleave_set nd_set;
+   struct list_head region_list;
 
/* Protect dimm health data from concurrent read/writes */
struct mutex health_mutex;
@@ -91,6 +94,9 @@ struct papr_scm_priv {
u64 health_bitmap;
 };
 
+LIST_HEAD(papr_nd_regions);
+DEFINE_MUTEX(papr_ndr_lock);
+
 static int drc_pmem_bind(struct papr_scm_priv *p)
 {
unsigned long ret[PLPAR_HCALL_BUFSIZE];
@@ -759,6 +765,10 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
dev_info(dev, "Region registered with target node %d and online 
node %d",
 target_nid, online_nid);
 
+   mutex_lock(&papr_ndr_lock);
+   list_add_tail(&p->region_list, &papr_nd_regions);
+   mutex_unlock(&papr_ndr_lock);
+
return 0;
 
 err:   nvdimm_bus_unregister(p->bus);
@@ -766,6 +776,68 @@ err:   nvdimm_bus_unregister(p->bus);
return -ENXIO;
 }
 
+static void papr_scm_add_badblock(struct nd_region *region,
+ struct nvdimm_bus *bus, u64 phys_addr)
+{
+   u64 aligned_addr = ALIGN_DOWN(phys_addr, L1_CACHE_BYTES);
+
+   if (nvdimm_bus_add_badrange(bus, aligned_addr, L1_CACHE_BYTES)) {
+   pr_err("Bad block registration for 0x%llx failed\n", phys_addr);
+   return;
+   }
+
+   pr_debug("Add memory range (0x%llx - 0x%llx) as bad range\n",
+aligned_addr, aligned_addr + L1_CACHE_BYTES);
+
+   nvdimm_region_notify(region, NVDIMM_REVALIDATE_POISON);
+}
+
+static int handle_mce_ue(struct notifier_block *nb, unsigned long val,
+void *data)
+{
+   struct machine_check_event *evt = data;
+   struct papr_scm_priv *p;
+   u64 phys_addr;
+   bool found = false;
+
+   if (evt->error_type != MCE_ERROR_TYPE_UE)
+   return NOTIFY_DONE;
+
+   if (list_empty(&papr_nd_regions))
+   return NOTIFY_DONE;
+
+   /*
+* The physical address obtained here is PAGE_SIZE aligned, so get the
+* exact address from the effective address
+*/
+   phys_addr = evt->u.ue_error.physical_address +
+   (evt->u.ue_error.effective_address & ~PAGE_MASK);
+
+   if (!evt->u.ue_error.physical_address_provided ||
+   !is_zone_device_page(pfn_to_page(phys_addr >> PAGE_SHIFT)))
+   return NOTIFY_DONE;
+
+   /* mce notifier is called from a process context, so mutex is safe */
+   mutex_lock(&papr_ndr_lock);
+   list_for_each_entry(p, &papr_nd_regions, region_list) {
+   if (phys_addr >= p->res.start && phys_addr <= p->res.end) {
+   found = true;
+   break;
+   }
+   }
+
+   if (found)
+   papr_scm_add_badblock(p->region, p->bus, phys_addr);
+
+   mutex_unlock(&papr_ndr_lock);
+
+   return found ? NOTIFY_OK : NOTIFY_DONE;
+}
+
+static struct notifier_block mce_ue_nb = {
+   .notifier_call = handle_mce_ue
+};
+
 static int papr_scm_probe(struct platform_device *pdev)
 {
struct device_node *dn = pdev->dev.of_node;
@@ -866,6 +938,10 @@ static int papr_scm_remove(struct platform_device *pdev)
 {
struct papr_scm_priv *p = platform_get_drvdata(pdev);
 
+   mutex_lock(&papr_ndr_lock);
+   list_del(&p->region_list);
+   mutex_unlock(&papr_ndr_lock);
+
nvdimm_bus_unregister(p->bus);
drc_pmem_unbind(p);
kfree(p->bus_desc.provider_name);
@@ -888,7 +964,25 @@ static struct platform_driver papr_scm_driver = {
},
 };
 
-module_platform_driver(papr_scm_driver);
+static int __init papr_scm_init(void)
+{
+   int ret;
+
+   ret = platform_driver_register(&papr_scm_driver);
+   if (!ret)
+   mce_register_notifier(&mce_ue_nb);
+
+   return ret;
+}
+module_init(papr_scm_init);
+
+static void __exit papr_scm_exit(void)
+{
+   mce_unregister_notifier(&mce_ue_nb);
+   platform_driver_unregister(&papr_scm_driver);
+}
+module_exit(papr_scm_exit);
+
 MODULE_DEVICE_TABLE(of, papr_scm_match);
 MODULE_LICENSE("GPL");
 MODULE_AUTHOR("IBM Corporation");
-- 
2.26.2



[PATCH v2 1/2] powerpc/mce: Add MCE notification chain

2020-07-09 Thread Santosh Sivaraj
Introduce a notification chain which lets us know about uncorrected memory
errors (UE). This would help prospective users in the pmem or nvdimm
subsystems to track bad blocks for better handling of persistent memory
allocations.

Signed-off-by: Santosh Sivaraj 
Signed-off-by: Ganesh Goudar 
---
 arch/powerpc/include/asm/mce.h |  2 ++
 arch/powerpc/kernel/mce.c  | 15 +++
 2 files changed, 17 insertions(+)

v2: Address comments from Christophe.

RESEND: Sending the two patches together so the dependencies are clear. The
earlier patch reviews are here [1]; rebased the patches on top of 5.8-rc4.

[1]: 
https://lore.kernel.org/linuxppc-dev/20200330071219.12284-1-ganes...@linux.ibm.com/
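For reference, a minimal consumer of this chain looks like the sketch below
(patch 2/2 registers the real one in papr_scm; the handler and notifier names
here are made up for illustration):

static int example_ue_handler(struct notifier_block *nb, unsigned long val,
			      void *data)
{
	struct machine_check_event *evt = data;

	if (evt->error_type != MCE_ERROR_TYPE_UE)
		return NOTIFY_DONE;

	/* react to the UE, e.g. record the failing physical address */
	return NOTIFY_OK;
}

static struct notifier_block example_ue_nb = {
	.notifier_call = example_ue_handler,
};

/* ... and in driver init code: mce_register_notifier(&example_ue_nb); */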

diff --git a/arch/powerpc/include/asm/mce.h b/arch/powerpc/include/asm/mce.h
index 376a395daf329..7bdd0cd4f2de0 100644
--- a/arch/powerpc/include/asm/mce.h
+++ b/arch/powerpc/include/asm/mce.h
@@ -220,6 +220,8 @@ extern void machine_check_print_event_info(struct 
machine_check_event *evt,
 unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr);
 extern void mce_common_process_ue(struct pt_regs *regs,
  struct mce_error_info *mce_err);
+int mce_register_notifier(struct notifier_block *nb);
+int mce_unregister_notifier(struct notifier_block *nb);
 #ifdef CONFIG_PPC_BOOK3S_64
 void flush_and_reload_slb(void);
 #endif /* CONFIG_PPC_BOOK3S_64 */
diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
index fd90c0eda2290..b7b3ed4e61937 100644
--- a/arch/powerpc/kernel/mce.c
+++ b/arch/powerpc/kernel/mce.c
@@ -49,6 +49,20 @@ static struct irq_work mce_ue_event_irq_work = {
 
 DECLARE_WORK(mce_ue_event_work, machine_process_ue_event);
 
+static BLOCKING_NOTIFIER_HEAD(mce_notifier_list);
+
+int mce_register_notifier(struct notifier_block *nb)
+{
+   return blocking_notifier_chain_register(&mce_notifier_list, nb);
+}
+EXPORT_SYMBOL_GPL(mce_register_notifier);
+
+int mce_unregister_notifier(struct notifier_block *nb)
+{
+   return blocking_notifier_chain_unregister(&mce_notifier_list, nb);
+}
+EXPORT_SYMBOL_GPL(mce_unregister_notifier);
+
 static void mce_set_error_info(struct machine_check_event *mce,
   struct mce_error_info *mce_err)
 {
@@ -278,6 +292,7 @@ static void machine_process_ue_event(struct work_struct 
*work)
while (__this_cpu_read(mce_ue_count) > 0) {
index = __this_cpu_read(mce_ue_count) - 1;
   evt = this_cpu_ptr(&mce_ue_event_queue[index]);
+   blocking_notifier_call_chain(&mce_notifier_list, 0, evt);
 #ifdef CONFIG_MEMORY_FAILURE
/*
 * This should probably queued elsewhere, but
-- 
2.26.2



Re: [PATCH RESEND 1/2] powerpc/mce: Add MCE notification chain

2020-07-09 Thread Santosh Sivaraj
Christophe Leroy  writes:

> On 09/07/2020 at 09:56, Santosh Sivaraj wrote:
>> Introduce notification chain which lets know about uncorrected memory
>> errors(UE). This would help prospective users in pmem or nvdimm subsystem
>> to track bad blocks for better handling of persistent memory allocations.
>> 
>> Signed-off-by: Santosh S 
>> Signed-off-by: Ganesh Goudar 
>> ---
>>   arch/powerpc/include/asm/mce.h |  2 ++
>>   arch/powerpc/kernel/mce.c  | 15 +++
>>   2 files changed, 17 insertions(+)
>> 
>> Send the two patches together, so the dependencies are clear. The earlier 
>> patch reviews are
>> here: 
>> https://lore.kernel.org/linuxppc-dev/20200330071219.12284-1-ganes...@linux.ibm.com/
>> 
>> Rebase the patches on top on 5.8-rc4
>> 
>> diff --git a/arch/powerpc/include/asm/mce.h b/arch/powerpc/include/asm/mce.h
>> index 376a395daf329..a57b0772702a9 100644
>> --- a/arch/powerpc/include/asm/mce.h
>> +++ b/arch/powerpc/include/asm/mce.h
>> @@ -220,6 +220,8 @@ extern void machine_check_print_event_info(struct 
>> machine_check_event *evt,
>>   unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr);
>>   extern void mce_common_process_ue(struct pt_regs *regs,
>>struct mce_error_info *mce_err);
>> +extern int mce_register_notifier(struct notifier_block *nb);
>> +extern int mce_unregister_notifier(struct notifier_block *nb);
>
> Using the 'extern' keyword on function declaration is pointless and 
> should be avoided in new patches. (checkpatch.pl --strict usually 
> complains about it).

I will remove that in the v2 which I will be sending for your comments for
the other patch.

Thanks,
Santosh

>
>>   #ifdef CONFIG_PPC_BOOK3S_64
>>   void flush_and_reload_slb(void);
>>   #endif /* CONFIG_PPC_BOOK3S_64 */
>> diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
>> index fd90c0eda2290..b7b3ed4e61937 100644
>> --- a/arch/powerpc/kernel/mce.c
>> +++ b/arch/powerpc/kernel/mce.c
>> @@ -49,6 +49,20 @@ static struct irq_work mce_ue_event_irq_work = {
>>   
>>   DECLARE_WORK(mce_ue_event_work, machine_process_ue_event);
>>   
>> +static BLOCKING_NOTIFIER_HEAD(mce_notifier_list);
>> +
>> +int mce_register_notifier(struct notifier_block *nb)
>> +{
>> +return blocking_notifier_chain_register(&mce_notifier_list, nb);
>> +}
>> +EXPORT_SYMBOL_GPL(mce_register_notifier);
>> +
>> +int mce_unregister_notifier(struct notifier_block *nb)
>> +{
>> +return blocking_notifier_chain_unregister(&mce_notifier_list, nb);
>> +}
>> +EXPORT_SYMBOL_GPL(mce_unregister_notifier);
>> +
>>   static void mce_set_error_info(struct machine_check_event *mce,
>> struct mce_error_info *mce_err)
>>   {
>> @@ -278,6 +292,7 @@ static void machine_process_ue_event(struct work_struct 
>> *work)
>>  while (__this_cpu_read(mce_ue_count) > 0) {
>>  index = __this_cpu_read(mce_ue_count) - 1;
>>  evt = this_cpu_ptr(&mce_ue_event_queue[index]);
>> +blocking_notifier_call_chain(&mce_notifier_list, 0, evt);
>>   #ifdef CONFIG_MEMORY_FAILURE
>>  /*
>>   * This should probably queued elsewhere, but
>> 
>
> Christophe


Re: [PATCH RESEND 2/2] papr/scm: Add bad memory ranges to nvdimm bad ranges

2020-07-09 Thread Santosh Sivaraj
Christophe Leroy  writes:

> On 09/07/2020 at 09:56, Santosh Sivaraj wrote:
>> Subscribe to the MCE notification and add the physical address which
>> generated a memory error to nvdimm bad range.
>> 
>> Reviewed-by: Mahesh Salgaonkar 
>> Signed-off-by: Santosh Sivaraj 
>> ---
>>   arch/powerpc/platforms/pseries/papr_scm.c | 98 ++-
>>   1 file changed, 97 insertions(+), 1 deletion(-)
>> 
>> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
>> b/arch/powerpc/platforms/pseries/papr_scm.c
>> index 9c569078a09fd..5ebb1c797795d 100644
>> --- a/arch/powerpc/platforms/pseries/papr_scm.c
>> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
>> @@ -13,9 +13,11 @@
>>   #include 
>>   #include 
>>   #include 
>> +#include 
>>   
>>   #include 
>>   #include 
>> +#include 
>>   
>>   #define BIND_ANY_ADDR (~0ul)
>>   
>> @@ -80,6 +82,7 @@ struct papr_scm_priv {
>>  struct resource res;
>>  struct nd_region *region;
>>  struct nd_interleave_set nd_set;
>> +struct list_head region_list;
>>   
>>  /* Protect dimm health data from concurrent read/writes */
>>  struct mutex health_mutex;
>> @@ -91,6 +94,9 @@ struct papr_scm_priv {
>>  u64 health_bitmap;
>>   };
>>   
>> +LIST_HEAD(papr_nd_regions);
>> +DEFINE_MUTEX(papr_ndr_lock);
>> +
>>   static int drc_pmem_bind(struct papr_scm_priv *p)
>>   {
>>  unsigned long ret[PLPAR_HCALL_BUFSIZE];
>> @@ -759,6 +765,10 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
>>  dev_info(dev, "Region registered with target node %d and online 
>> node %d",
>>   target_nid, online_nid);
>>   
>> +mutex_lock(&papr_ndr_lock);
>> +list_add_tail(&p->region_list, &papr_nd_regions);
>> +mutex_unlock(&papr_ndr_lock);
>> +
>>  return 0;
>>   
>>   err:   nvdimm_bus_unregister(p->bus);
>> @@ -766,6 +776,70 @@ err:nvdimm_bus_unregister(p->bus);
>>  return -ENXIO;
>>   }
>>   
>> +static void papr_scm_add_badblock(struct nd_region *region,
>> +  struct nvdimm_bus *bus, u64 phys_addr)
>> +{
>> +u64 aligned_addr = ALIGN_DOWN(phys_addr, L1_CACHE_BYTES);
>> +
>> +if (nvdimm_bus_add_badrange(bus, aligned_addr, L1_CACHE_BYTES)) {
>> +pr_err("Bad block registration for 0x%llx failed\n", phys_addr);
>> +return;
>> +}
>> +
>> +pr_debug("Add memory range (0x%llx - 0x%llx) as bad range\n",
>> + aligned_addr, aligned_addr + L1_CACHE_BYTES);
>> +
>> +nvdimm_region_notify(region, NVDIMM_REVALIDATE_POISON);
>> +}
>> +
>> +static int handle_mce_ue(struct notifier_block *nb, unsigned long val,
>> + void *data)
>> +{
>> +struct machine_check_event *evt = data;
>> +struct papr_scm_priv *p;
>> +u64 phys_addr;
>> +bool found = false;
>> +
>> +if (evt->error_type != MCE_ERROR_TYPE_UE)
>> +return NOTIFY_DONE;
>> +
>> +if (list_empty(&papr_nd_regions))
>> +return NOTIFY_DONE;
>> +
>> +/*
>> + * The physical address obtained here is PAGE_SIZE aligned, so get the
>> + * exact address from the effective address
>> + */
>> +phys_addr = evt->u.ue_error.physical_address +
>> +(evt->u.ue_error.effective_address & ~PAGE_MASK);
>
> Not properly aligned

Will fix it.

>
>> +
>> +if (!evt->u.ue_error.physical_address_provided ||
>> +!is_zone_device_page(pfn_to_page(phys_addr >> PAGE_SHIFT)))
>> +return NOTIFY_DONE;
>> +
>> +/* mce notifier is called from a process context, so mutex is safe */
>> +mutex_lock(&papr_ndr_lock);
>> +list_for_each_entry(p, &papr_nd_regions, region_list) {
>> +struct resource res = p->res;
>
> Is this local struct really worth it ? Why not use p->res below directly ?
>

Right, not really needed. I can fix that in v2.

>> +
>> +if (phys_addr >= res.start && phys_addr <= res.end) {
>> +found = true;
>> +break;
>> +}
>> +}
>> +
>> +if (found)
>> +papr_scm_add_badblock(p->region, p->bus, phys_addr);
>> +
>> +mutex_unlock(&papr_ndr_lock);
>> +
>> +return found ? NOTIFY_OK : NOTIFY_DONE;
>> 

[PATCH RESEND 2/2] papr/scm: Add bad memory ranges to nvdimm bad ranges

2020-07-09 Thread Santosh Sivaraj
Subscribe to the MCE notification chain and add the physical address which
generated a memory error to the nvdimm bad range.

Reviewed-by: Mahesh Salgaonkar 
Signed-off-by: Santosh Sivaraj 
---
 arch/powerpc/platforms/pseries/papr_scm.c | 98 ++-
 1 file changed, 97 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index 9c569078a09fd..5ebb1c797795d 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -13,9 +13,11 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
+#include 
 
 #define BIND_ANY_ADDR (~0ul)
 
@@ -80,6 +82,7 @@ struct papr_scm_priv {
struct resource res;
struct nd_region *region;
struct nd_interleave_set nd_set;
+   struct list_head region_list;
 
/* Protect dimm health data from concurrent read/writes */
struct mutex health_mutex;
@@ -91,6 +94,9 @@ struct papr_scm_priv {
u64 health_bitmap;
 };
 
+LIST_HEAD(papr_nd_regions);
+DEFINE_MUTEX(papr_ndr_lock);
+
 static int drc_pmem_bind(struct papr_scm_priv *p)
 {
unsigned long ret[PLPAR_HCALL_BUFSIZE];
@@ -759,6 +765,10 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
dev_info(dev, "Region registered with target node %d and online 
node %d",
 target_nid, online_nid);
 
+   mutex_lock(&papr_ndr_lock);
+   list_add_tail(&p->region_list, &papr_nd_regions);
+   mutex_unlock(&papr_ndr_lock);
+
return 0;
 
 err:   nvdimm_bus_unregister(p->bus);
@@ -766,6 +776,70 @@ err:   nvdimm_bus_unregister(p->bus);
return -ENXIO;
 }
 
+static void papr_scm_add_badblock(struct nd_region *region,
+ struct nvdimm_bus *bus, u64 phys_addr)
+{
+   u64 aligned_addr = ALIGN_DOWN(phys_addr, L1_CACHE_BYTES);
+
+   if (nvdimm_bus_add_badrange(bus, aligned_addr, L1_CACHE_BYTES)) {
+   pr_err("Bad block registration for 0x%llx failed\n", phys_addr);
+   return;
+   }
+
+   pr_debug("Add memory range (0x%llx - 0x%llx) as bad range\n",
+aligned_addr, aligned_addr + L1_CACHE_BYTES);
+
+   nvdimm_region_notify(region, NVDIMM_REVALIDATE_POISON);
+}
+
+static int handle_mce_ue(struct notifier_block *nb, unsigned long val,
+void *data)
+{
+   struct machine_check_event *evt = data;
+   struct papr_scm_priv *p;
+   u64 phys_addr;
+   bool found = false;
+
+   if (evt->error_type != MCE_ERROR_TYPE_UE)
+   return NOTIFY_DONE;
+
+   if (list_empty(&papr_nd_regions))
+   return NOTIFY_DONE;
+
+   /*
+* The physical address obtained here is PAGE_SIZE aligned, so get the
+* exact address from the effective address
+*/
+   phys_addr = evt->u.ue_error.physical_address +
+   (evt->u.ue_error.effective_address & ~PAGE_MASK);
+
+   if (!evt->u.ue_error.physical_address_provided ||
+   !is_zone_device_page(pfn_to_page(phys_addr >> PAGE_SHIFT)))
+   return NOTIFY_DONE;
+
+   /* mce notifier is called from a process context, so mutex is safe */
+   mutex_lock(&papr_ndr_lock);
+   list_for_each_entry(p, &papr_nd_regions, region_list) {
+   struct resource res = p->res;
+
+   if (phys_addr >= res.start && phys_addr <= res.end) {
+   found = true;
+   break;
+   }
+   }
+
+   if (found)
+   papr_scm_add_badblock(p->region, p->bus, phys_addr);
+
+   mutex_unlock(&papr_ndr_lock);
+
+   return found ? NOTIFY_OK : NOTIFY_DONE;
+}
+
+static struct notifier_block mce_ue_nb = {
+   .notifier_call = handle_mce_ue
+};
+
 static int papr_scm_probe(struct platform_device *pdev)
 {
struct device_node *dn = pdev->dev.of_node;
@@ -866,6 +940,10 @@ static int papr_scm_remove(struct platform_device *pdev)
 {
struct papr_scm_priv *p = platform_get_drvdata(pdev);
 
+   mutex_lock(&papr_ndr_lock);
+   list_del(&(p->region_list));
+   mutex_unlock(&papr_ndr_lock);
+
nvdimm_bus_unregister(p->bus);
drc_pmem_unbind(p);
kfree(p->bus_desc.provider_name);
@@ -888,7 +966,25 @@ static struct platform_driver papr_scm_driver = {
},
 };
 
-module_platform_driver(papr_scm_driver);
+static int __init papr_scm_init(void)
+{
+   int ret;
+
+   ret = platform_driver_register(&papr_scm_driver);
+   if (!ret)
+   mce_register_notifier(&mce_ue_nb);
+
+return ret;
+}
+module_init(papr_scm_init);
+
+static void __exit papr_scm_exit(void)
+{
+   mce_unregister_notifier(&mce_ue_nb);
+   platform_driver_unregister(&papr_scm_driver);
+}
+module_exit(papr_scm_exit);
+
 MODULE_DEVICE_TABLE(of, papr_scm_match);
 MODULE_LICENSE("GPL");
 MODULE_AUTHOR("IBM Corporation");
-- 
2.26.2



[PATCH RESEND 1/2] powerpc/mce: Add MCE notification chain

2020-07-09 Thread Santosh Sivaraj
Introduce a notification chain which lets us know about uncorrected memory
errors (UE). This would help prospective users in the pmem or nvdimm
subsystems to track bad blocks for better handling of persistent memory
allocations.

Signed-off-by: Santosh S 
Signed-off-by: Ganesh Goudar 
---
 arch/powerpc/include/asm/mce.h |  2 ++
 arch/powerpc/kernel/mce.c  | 15 +++
 2 files changed, 17 insertions(+)

Send the two patches together, so the dependencies are clear. The earlier patch 
reviews are
here: 
https://lore.kernel.org/linuxppc-dev/20200330071219.12284-1-ganes...@linux.ibm.com/

Rebased the patches on top of 5.8-rc4

diff --git a/arch/powerpc/include/asm/mce.h b/arch/powerpc/include/asm/mce.h
index 376a395daf329..a57b0772702a9 100644
--- a/arch/powerpc/include/asm/mce.h
+++ b/arch/powerpc/include/asm/mce.h
@@ -220,6 +220,8 @@ extern void machine_check_print_event_info(struct 
machine_check_event *evt,
 unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr);
 extern void mce_common_process_ue(struct pt_regs *regs,
  struct mce_error_info *mce_err);
+extern int mce_register_notifier(struct notifier_block *nb);
+extern int mce_unregister_notifier(struct notifier_block *nb);
 #ifdef CONFIG_PPC_BOOK3S_64
 void flush_and_reload_slb(void);
 #endif /* CONFIG_PPC_BOOK3S_64 */
diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
index fd90c0eda2290..b7b3ed4e61937 100644
--- a/arch/powerpc/kernel/mce.c
+++ b/arch/powerpc/kernel/mce.c
@@ -49,6 +49,20 @@ static struct irq_work mce_ue_event_irq_work = {
 
 DECLARE_WORK(mce_ue_event_work, machine_process_ue_event);
 
+static BLOCKING_NOTIFIER_HEAD(mce_notifier_list);
+
+int mce_register_notifier(struct notifier_block *nb)
+{
+   return blocking_notifier_chain_register(&mce_notifier_list, nb);
+}
+EXPORT_SYMBOL_GPL(mce_register_notifier);
+
+int mce_unregister_notifier(struct notifier_block *nb)
+{
+   return blocking_notifier_chain_unregister(&mce_notifier_list, nb);
+}
+EXPORT_SYMBOL_GPL(mce_unregister_notifier);
+
 static void mce_set_error_info(struct machine_check_event *mce,
   struct mce_error_info *mce_err)
 {
@@ -278,6 +292,7 @@ static void machine_process_ue_event(struct work_struct 
*work)
while (__this_cpu_read(mce_ue_count) > 0) {
index = __this_cpu_read(mce_ue_count) - 1;
   evt = this_cpu_ptr(&mce_ue_event_queue[index]);
+   blocking_notifier_call_chain(&mce_notifier_list, 0, evt);
 #ifdef CONFIG_MEMORY_FAILURE
/*
 * This should probably queued elsewhere, but
-- 
2.26.2



Re: [PATCH v3 1/3] powerpc/mm: Enable radix GTSE only if supported.

2020-07-05 Thread Santosh Sivaraj


Hi Bharata,

Bharata B Rao  writes:

> Make GTSE an MMU feature and enable it by default for radix.
> However for guest, conditionally enable it if hypervisor supports
> it via OV5 vector. Let prom_init ask for radix GTSE only if the
> support exists.
>
> Having GTSE as an MMU feature will make it easy to enable radix
> without GTSE. Currently radix assumes GTSE is enabled by default.
>
> Signed-off-by: Bharata B Rao 
> Reviewed-by: Aneesh Kumar K.V 
> ---
>  arch/powerpc/include/asm/mmu.h|  4 
>  arch/powerpc/kernel/dt_cpu_ftrs.c |  1 +
>  arch/powerpc/kernel/prom_init.c   | 13 -
>  arch/powerpc/mm/init_64.c |  5 -
>  4 files changed, 17 insertions(+), 6 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/mmu.h b/arch/powerpc/include/asm/mmu.h
> index f4ac25d4df05..884d51995934 100644
> --- a/arch/powerpc/include/asm/mmu.h
> +++ b/arch/powerpc/include/asm/mmu.h
> @@ -28,6 +28,9 @@
>   * Individual features below.
>   */
>  
> +/* Guest Translation Shootdown Enable */
> +#define MMU_FTR_GTSE ASM_CONST(0x1000)
> +
>  /*
>   * Support for 68 bit VA space. We added that from ISA 2.05
>   */
> @@ -173,6 +176,7 @@ enum {
>  #endif
>  #ifdef CONFIG_PPC_RADIX_MMU
>   MMU_FTR_TYPE_RADIX |
> + MMU_FTR_GTSE |
>  #ifdef CONFIG_PPC_KUAP
>   MMU_FTR_RADIX_KUAP |
>  #endif /* CONFIG_PPC_KUAP */
> diff --git a/arch/powerpc/kernel/dt_cpu_ftrs.c 
> b/arch/powerpc/kernel/dt_cpu_ftrs.c
> index a0edeb391e3e..ac650c233cd9 100644
> --- a/arch/powerpc/kernel/dt_cpu_ftrs.c
> +++ b/arch/powerpc/kernel/dt_cpu_ftrs.c
> @@ -336,6 +336,7 @@ static int __init feat_enable_mmu_radix(struct 
> dt_cpu_feature *f)
>  #ifdef CONFIG_PPC_RADIX_MMU
>   cur_cpu_spec->mmu_features |= MMU_FTR_TYPE_RADIX;
>   cur_cpu_spec->mmu_features |= MMU_FTRS_HASH_BASE;
> + cur_cpu_spec->mmu_features |= MMU_FTR_GTSE;
>   cur_cpu_spec->cpu_user_features |= PPC_FEATURE_HAS_MMU;
>  
>   return 1;
> diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
> index 90c604d00b7d..cbc605cfdec0 100644
> --- a/arch/powerpc/kernel/prom_init.c
> +++ b/arch/powerpc/kernel/prom_init.c
> @@ -1336,12 +1336,15 @@ static void __init prom_check_platform_support(void)
>   }
>   }
>  
> - if (supported.radix_mmu && supported.radix_gtse &&
> - IS_ENABLED(CONFIG_PPC_RADIX_MMU)) {
> - /* Radix preferred - but we require GTSE for now */
> - prom_debug("Asking for radix with GTSE\n");
> + if (supported.radix_mmu && IS_ENABLED(CONFIG_PPC_RADIX_MMU)) {
> + /* Radix preferred - Check if GTSE is also supported */
> + prom_debug("Asking for radix\n");
>   ibm_architecture_vec.vec5.mmu = OV5_FEAT(OV5_MMU_RADIX);
> - ibm_architecture_vec.vec5.radix_ext = OV5_FEAT(OV5_RADIX_GTSE);
> + if (supported.radix_gtse)
> + ibm_architecture_vec.vec5.radix_ext =
> + OV5_FEAT(OV5_RADIX_GTSE);
> + else
> + prom_debug("Radix GTSE isn't supported\n");
>   } else if (supported.hash_mmu) {
>   /* Default to hash mmu (if we can) */
>   prom_debug("Asking for hash\n");
> diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
> index bc73abf0bc25..152aa0200cef 100644
> --- a/arch/powerpc/mm/init_64.c
> +++ b/arch/powerpc/mm/init_64.c
> @@ -407,12 +407,15 @@ static void __init early_check_vec5(void)
>   if (!(vec5[OV5_INDX(OV5_RADIX_GTSE)] &
>   OV5_FEAT(OV5_RADIX_GTSE))) {
>   pr_warn("WARNING: Hypervisor doesn't support RADIX with 
> GTSE\n");
> - }
> + cur_cpu_spec->mmu_features &= ~MMU_FTR_GTSE;
> + } else
> + cur_cpu_spec->mmu_features |= MMU_FTR_GTSE;

The GTSE flag is already set by default in feat_enable_mmu_radix(); should it
be set again here?

Thanks,
Santosh
>   /* Do radix anyway - the hypervisor said we had to */
>   cur_cpu_spec->mmu_features |= MMU_FTR_TYPE_RADIX;
>   } else if (mmu_supported == OV5_FEAT(OV5_MMU_HASH)) {
>   /* Hypervisor only supports hash - disable radix */
>   cur_cpu_spec->mmu_features &= ~MMU_FTR_TYPE_RADIX;
> + cur_cpu_spec->mmu_features &= ~MMU_FTR_GTSE;
>   }
>  }
>  
> -- 
> 2.21.3


[PATCH 2/2] mm/mmu_gather: invalidate TLB correctly on batch allocation failure and flush

2020-06-30 Thread Santosh Sivaraj
From: Peter Zijlstra 

commit 0ed1325967ab5f7a4549a2641c6ebe115f76e228 upstream

Architectures for which we have hardware walkers of Linux page table
should flush TLB on mmu gather batch allocation failures and batch flush.
Some architectures like POWER supports multiple translation modes (hash
and radix) and in the case of POWER only radix translation mode needs the
above TLBI.  This is because for hash translation mode kernel wants to
avoid this extra flush since there are no hardware walkers of linux page
table.  With radix translation, the hardware also walks linux page table
and with that, kernel needs to make sure to TLB invalidate page walk cache
before page table pages are freed.

More details in commit d86564a2f085 ("mm/tlb, x86/mm: Support invalidating
TLB caches for RCU_TABLE_FREE")

The changes to sparc are to make sure we keep the old behavior since we
are now removing HAVE_RCU_TABLE_INVALIDATE.  The default value for
tlb_needs_table_invalidate is to always force an invalidate and sparc can
avoid the table invalidate.  Hence we define tlb_needs_table_invalidate to
false for sparc architecture.

Link: 
http://lkml.kernel.org/r/20200116064531.483522-3-aneesh.ku...@linux.ibm.com
Fixes: a46cc7a90fd8 ("powerpc/mm/radix: Improve TLB/PWC flushes")
Signed-off-by: Peter Zijlstra (Intel) 
Cc:   # 4.19
Signed-off-by: Santosh Sivaraj 
---
 arch/Kconfig|  3 ---
 arch/powerpc/include/asm/tlb.h  | 11 +++
 arch/sparc/include/asm/tlb_64.h |  9 +
 include/asm-generic/tlb.h   | 15 +++
 mm/memory.c | 16 
 5 files changed, 43 insertions(+), 11 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index a336548487e69..3abbdb0cea447 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -363,9 +363,6 @@ config HAVE_ARCH_JUMP_LABEL
 config HAVE_RCU_TABLE_FREE
bool
 
-config HAVE_RCU_TABLE_INVALIDATE
-   bool
-
 config ARCH_HAVE_NMI_SAFE_CMPXCHG
bool
 
diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
index f0e571b2dc7c8..63418275f402e 100644
--- a/arch/powerpc/include/asm/tlb.h
+++ b/arch/powerpc/include/asm/tlb.h
@@ -30,6 +30,17 @@
 #define tlb_remove_check_page_size_change tlb_remove_check_page_size_change
 
 extern void tlb_flush(struct mmu_gather *tlb);
+/*
+ * book3s:
+ * Hash does not use the linux page-tables, so we can avoid
+ * the TLB invalidate for page-table freeing, Radix otoh does use the
+ * page-tables and needs the TLBI.
+ *
+ * nohash:
+ * We still do TLB invalidate in the __pte_free_tlb routine before we
+ * add the page table pages to mmu gather table batch.
+ */
+#define tlb_needs_table_invalidate()   radix_enabled()
 
 /* Get the generic bits... */
 #include 
diff --git a/arch/sparc/include/asm/tlb_64.h b/arch/sparc/include/asm/tlb_64.h
index a2f3fa61ee36a..8cb8f3833239a 100644
--- a/arch/sparc/include/asm/tlb_64.h
+++ b/arch/sparc/include/asm/tlb_64.h
@@ -28,6 +28,15 @@ void flush_tlb_pending(void);
 #define __tlb_remove_tlb_entry(tlb, ptep, address) do { } while (0)
 #define tlb_flush(tlb) flush_tlb_pending()
 
+/*
+ * SPARC64's hardware TLB fill does not use the Linux page-tables
+ * and therefore we don't need a TLBI when freeing page-table pages.
+ */
+
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+#define tlb_needs_table_invalidate()   (false)
+#endif
+
 #include 
 
 #endif /* _SPARC64_TLB_H */
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index b3353e21f3b3e..92dcfd01e0ee4 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -61,8 +61,23 @@ struct mmu_table_batch {
 extern void tlb_table_flush(struct mmu_gather *tlb);
 extern void tlb_remove_table(struct mmu_gather *tlb, void *table);
 
+/*
+ * This allows an architecture that does not use the linux page-tables for
+ * hardware to skip the TLBI when freeing page tables.
+ */
+#ifndef tlb_needs_table_invalidate
+#define tlb_needs_table_invalidate() (true)
 #endif
 
+#else
+
+#ifdef tlb_needs_table_invalidate
+#error tlb_needs_table_invalidate() requires HAVE_RCU_TABLE_FREE
+#endif
+
+#endif /* CONFIG_HAVE_RCU_TABLE_FREE */
+
+
 /*
  * If we can't allocate a page to make a big batch of page pointers
  * to work on, then just handle a few from the on-stack structure.
diff --git a/mm/memory.c b/mm/memory.c
index bbf0cc4066c84..7656714c9b7c4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -325,14 +325,14 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, 
struct page *page, int page_
  */
 static inline void tlb_table_invalidate(struct mmu_gather *tlb)
 {
-#ifdef CONFIG_HAVE_RCU_TABLE_INVALIDATE
-   /*
-* Invalidate page-table caches used by hardware walkers. Then we still
-* need to RCU-sched wait while freeing the pages because software
-* walkers can still be in-flight.
-*/
-   tlb_flush_mmu_tlbonly(tlb);
-#endif
+   if (tlb_needs_table_invalidate()) {
+   /*
+* Invalidate page-table cac
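(The hunk above is truncated by the archive; reconstructed from the removed
comment and the upstream commit, the resulting helper in mm/memory.c is
roughly:)

static inline void tlb_table_invalidate(struct mmu_gather *tlb)
{
	if (tlb_needs_table_invalidate()) {
		/*
		 * Invalidate page-table caches used by hardware walkers. Then
		 * we still need to RCU-sched wait while freeing the pages
		 * because software walkers can still be in-flight.
		 */
		tlb_flush_mmu_tlbonly(tlb);
	}
}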

[PATCH 1/2] powerpc/mmu_gather: enable RCU_TABLE_FREE even for !SMP case

2020-06-30 Thread Santosh Sivaraj
From: "Aneesh Kumar K.V" 

commit 12e4d53f3f04e81f9e83d6fc10edc7314ab9f6b9 upstream

The TLB flush optimisation (a46cc7a90f: powerpc/mm/radix: Improve TLB/PWC
flushes) may result in random memory corruption.

On any SMP system, freeing page directories should observe the exact same
order as normal page freeing:

 1) unhook page/directory
 2) TLB invalidate
 3) free page/directory

Without this, any concurrent page-table walk could end up with a
Use-after-Free.  This is esp.  trivial for anything that has software
page-table walkers (HAVE_FAST_GUP / software TLB fill) or the hardware
caches partial page-walks (ie.  caches page directories).

Even on UP this might give issues since mmu_gather is preemptible these
days.  An interrupt or preempted task accessing user pages might stumble
into the free page if the hardware caches page directories.

!SMP case is right now broken for radix translation w.r.t page walk
cache flush.  We can get interrupted in between page table free and
that would imply we have page walk cache entries pointing to tables
which got freed already.  Michael said "both our platforms that run on
Power9 force SMP on in Kconfig, so the !SMP case is unlikely to be a
problem for anyone in practice, unless they've hacked their kernel to
build it !SMP."

Link: 
http://lkml.kernel.org/r/20200116064531.483522-2-aneesh.ku...@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V 
Cc:  # 4.19
Signed-off-by: Santosh Sivaraj 
---
 arch/powerpc/Kconfig | 2 +-
 arch/powerpc/include/asm/book3s/32/pgalloc.h | 8 
 arch/powerpc/include/asm/book3s/64/pgalloc.h | 2 --
 arch/powerpc/include/asm/nohash/32/pgalloc.h | 8 
 arch/powerpc/mm/pgtable-book3s64.c   | 7 ---
 5 files changed, 1 insertion(+), 26 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index f38d153d25861..4863fc0dd945a 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -215,7 +215,7 @@ config PPC
select HAVE_HARDLOCKUP_DETECTOR_PERFif PERF_EVENTS && 
HAVE_PERF_EVENTS_NMI && !HAVE_HARDLOCKUP_DETECTOR_ARCH
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
-   select HAVE_RCU_TABLE_FREE  if SMP
+   select HAVE_RCU_TABLE_FREE
select HAVE_REGS_AND_STACK_ACCESS_API
select HAVE_RELIABLE_STACKTRACE if PPC64 && CPU_LITTLE_ENDIAN
select HAVE_SYSCALL_TRACEPOINTS
diff --git a/arch/powerpc/include/asm/book3s/32/pgalloc.h 
b/arch/powerpc/include/asm/book3s/32/pgalloc.h
index 82e44b1a00ae9..79ba3fbb512e3 100644
--- a/arch/powerpc/include/asm/book3s/32/pgalloc.h
+++ b/arch/powerpc/include/asm/book3s/32/pgalloc.h
@@ -110,7 +110,6 @@ static inline void pgtable_free(void *table, unsigned 
index_size)
 #define check_pgt_cache()  do { } while (0)
 #define get_hugepd_cache_index(x)  (x)
 
-#ifdef CONFIG_SMP
 static inline void pgtable_free_tlb(struct mmu_gather *tlb,
void *table, int shift)
 {
@@ -127,13 +126,6 @@ static inline void __tlb_remove_table(void *_table)
 
pgtable_free(table, shift);
 }
-#else
-static inline void pgtable_free_tlb(struct mmu_gather *tlb,
-   void *table, int shift)
-{
-   pgtable_free(table, shift);
-}
-#endif
 
 static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t table,
  unsigned long address)
diff --git a/arch/powerpc/include/asm/book3s/64/pgalloc.h 
b/arch/powerpc/include/asm/book3s/64/pgalloc.h
index f9019b579903a..1013c02142139 100644
--- a/arch/powerpc/include/asm/book3s/64/pgalloc.h
+++ b/arch/powerpc/include/asm/book3s/64/pgalloc.h
@@ -47,9 +47,7 @@ extern pmd_t *pmd_fragment_alloc(struct mm_struct *, unsigned 
long);
 extern void pte_fragment_free(unsigned long *, int);
 extern void pmd_fragment_free(unsigned long *);
 extern void pgtable_free_tlb(struct mmu_gather *tlb, void *table, int shift);
-#ifdef CONFIG_SMP
 extern void __tlb_remove_table(void *_table);
-#endif
 
 static inline pgd_t *radix__pgd_alloc(struct mm_struct *mm)
 {
diff --git a/arch/powerpc/include/asm/nohash/32/pgalloc.h 
b/arch/powerpc/include/asm/nohash/32/pgalloc.h
index 8825953c225b2..96eed46d56842 100644
--- a/arch/powerpc/include/asm/nohash/32/pgalloc.h
+++ b/arch/powerpc/include/asm/nohash/32/pgalloc.h
@@ -111,7 +111,6 @@ static inline void pgtable_free(void *table, unsigned 
index_size)
 #define check_pgt_cache()  do { } while (0)
 #define get_hugepd_cache_index(x)  (x)
 
-#ifdef CONFIG_SMP
 static inline void pgtable_free_tlb(struct mmu_gather *tlb,
void *table, int shift)
 {
@@ -128,13 +127,6 @@ static inline void __tlb_remove_table(void *_table)
 
pgtable_free(table, shift);
 }
-#else
-static inline void pgtable_free_tlb(struct mmu_gather *tlb,
-   void *table, int shift)
-{
-   pgtable_free(table, shift);
-}
-#endif
 
 static inlin

[PATCH v4 6/6] asm-generic/tlb: avoid potential double flush

2020-05-20 Thread Santosh Sivaraj
From: Peter Zijlstra 

commit 0758cd8304942292e95a0f750c374533db378b32 upstream

Aneesh reported that:

tlb_flush_mmu()
  tlb_flush_mmu_tlbonly()
tlb_flush() <-- #1
  tlb_flush_mmu_free()
tlb_table_flush()
  tlb_table_invalidate()
tlb_flush_mmu_tlbonly()
  tlb_flush()   <-- #2

does two TLBIs when tlb->fullmm, because __tlb_reset_range() will not
clear tlb->end in that case.

Observe that any caller to __tlb_adjust_range() also sets at least one of
the tlb->freed_tables || tlb->cleared_p* bits, and those are
unconditionally cleared by __tlb_reset_range().

Change the condition for actually issuing TLBI to having one of those bits
set, as opposed to having tlb->end != 0.
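
For reference, a simplified sketch of __tlb_reset_range() with the earlier patches in
this series applied (assuming the 4.19 include/asm-generic/tlb.h layout): the fullmm
branch leaves tlb->end non-zero, while the freed_tables/cleared_* bits are cleared
unconditionally, which is what makes them the reliable "work pending" signal:

static inline void __tlb_reset_range(struct mmu_gather *tlb)
{
	if (tlb->fullmm) {
		tlb->start = tlb->end = ~0;	/* end stays non-zero for a full-mm flush */
	} else {
		tlb->start = TASK_SIZE;
		tlb->end = 0;
	}
	tlb->freed_tables = 0;		/* cleared unconditionally ... */
	tlb->cleared_ptes = 0;
	tlb->cleared_pmds = 0;
	tlb->cleared_puds = 0;
	tlb->cleared_p4ds = 0;		/* ... so these bits track pending work reliably */
}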

Link: 
http://lkml.kernel.org/r/20200116064531.483522-4-aneesh.ku...@linux.ibm.com
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Aneesh Kumar K.V 
Reported-by: "Aneesh Kumar K.V" 
Cc:   # 4.19
Signed-off-by: Santosh Sivaraj 
[santosh: backported to 4.19 stable]
---
 include/asm-generic/tlb.h | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 19934cdd143e..427a70c56ddd 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -179,7 +179,12 @@ static inline void __tlb_reset_range(struct mmu_gather 
*tlb)
 
 static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
 {
-   if (!tlb->end)
+   /*
+* Anything calling __tlb_adjust_range() also sets at least one of
+* these bits.
+*/
+   if (!(tlb->freed_tables || tlb->cleared_ptes || tlb->cleared_pmds ||
+ tlb->cleared_puds || tlb->cleared_p4ds))
return;
 
tlb_flush(tlb);
-- 
2.25.4



[PATCH v4 5/6] mm/mmu_gather: invalidate TLB correctly on batch allocation failure and flush

2020-05-20 Thread Santosh Sivaraj
From: Peter Zijlstra 

commit 0ed1325967ab5f7a4549a2641c6ebe115f76e228 upstream

Architectures for which we have hardware walkers of Linux page table
should flush TLB on mmu gather batch allocation failures and batch flush.
Some architectures like POWER supports multiple translation modes (hash
and radix) and in the case of POWER only radix translation mode needs the
above TLBI.  This is because for hash translation mode kernel wants to
avoid this extra flush since there are no hardware walkers of linux page
table.  With radix translation, the hardware also walks linux page table
and with that, kernel needs to make sure to TLB invalidate page walk cache
before page table pages are freed.

More details in commit d86564a2f085 ("mm/tlb, x86/mm: Support invalidating
TLB caches for RCU_TABLE_FREE")

The changes to sparc are to make sure we keep the old behavior since we
are now removing HAVE_RCU_TABLE_NO_INVALIDATE.  The default value for
tlb_needs_table_invalidate is to always force an invalidate and sparc can
avoid the table invalidate.  Hence we define tlb_needs_table_invalidate to
false for sparc architecture.
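
The asm-generic side of this is roughly the following sketch: an opt-out macro guarded
by HAVE_RCU_TABLE_FREE that defaults to forcing the invalidate (sparc64 overrides it to
false; powerpc keys it off radix_enabled(), as in the hunks below):

#ifdef CONFIG_HAVE_RCU_TABLE_FREE
/*
 * Architectures whose hardware does not walk the linux page-tables
 * can override this to skip the TLBI before freeing page-table pages.
 */
#ifndef tlb_needs_table_invalidate
#define tlb_needs_table_invalidate() (true)	/* safe default: always invalidate */
#endif
#else
#ifdef tlb_needs_table_invalidate
#error tlb_needs_table_invalidate() requires HAVE_RCU_TABLE_FREE
#endif
#endif /* CONFIG_HAVE_RCU_TABLE_FREE */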

Link: 
http://lkml.kernel.org/r/20200116064531.483522-3-aneesh.ku...@linux.ibm.com
Fixes: a46cc7a90fd8 ("powerpc/mm/radix: Improve TLB/PWC flushes")
Signed-off-by: Peter Zijlstra (Intel) 
Cc:   # 4.19
Signed-off-by: Santosh Sivaraj 
[santosh: backported to 4.19 stable]
---
 arch/Kconfig|  3 ---
 arch/powerpc/Kconfig|  1 -
 arch/powerpc/include/asm/tlb.h  | 11 +++
 arch/sparc/Kconfig  |  1 -
 arch/sparc/include/asm/tlb_64.h |  9 +
 include/asm-generic/tlb.h   | 15 +++
 mm/memory.c | 16 
 7 files changed, 43 insertions(+), 13 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 061a12b8140e..3abbdb0cea44 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -363,9 +363,6 @@ config HAVE_ARCH_JUMP_LABEL
 config HAVE_RCU_TABLE_FREE
bool
 
-config HAVE_RCU_TABLE_NO_INVALIDATE
-   bool
-
 config ARCH_HAVE_NMI_SAFE_CMPXCHG
bool
 
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 1a00ce4b0040..e5bc0cfea2b1 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -216,7 +216,6 @@ config PPC
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
select HAVE_RCU_TABLE_FREE
-   select HAVE_RCU_TABLE_NO_INVALIDATE if HAVE_RCU_TABLE_FREE
select HAVE_REGS_AND_STACK_ACCESS_API
select HAVE_RELIABLE_STACKTRACE if PPC64 && CPU_LITTLE_ENDIAN
select HAVE_SYSCALL_TRACEPOINTS
diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
index f0e571b2dc7c..63418275f402 100644
--- a/arch/powerpc/include/asm/tlb.h
+++ b/arch/powerpc/include/asm/tlb.h
@@ -30,6 +30,17 @@
 #define tlb_remove_check_page_size_change tlb_remove_check_page_size_change
 
 extern void tlb_flush(struct mmu_gather *tlb);
+/*
+ * book3s:
+ * Hash does not use the linux page-tables, so we can avoid
+ * the TLB invalidate for page-table freeing, Radix otoh does use the
+ * page-tables and needs the TLBI.
+ *
+ * nohash:
+ * We still do TLB invalidate in the __pte_free_tlb routine before we
+ * add the page table pages to mmu gather table batch.
+ */
+#define tlb_needs_table_invalidate()   radix_enabled()
 
 /* Get the generic bits... */
 #include 
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index d90d632868aa..e6f2a38d2e61 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -64,7 +64,6 @@ config SPARC64
select HAVE_KRETPROBES
select HAVE_KPROBES
select HAVE_RCU_TABLE_FREE if SMP
-   select HAVE_RCU_TABLE_NO_INVALIDATE if HAVE_RCU_TABLE_FREE
select HAVE_MEMBLOCK_NODE_MAP
select HAVE_ARCH_TRANSPARENT_HUGEPAGE
select HAVE_DYNAMIC_FTRACE
diff --git a/arch/sparc/include/asm/tlb_64.h b/arch/sparc/include/asm/tlb_64.h
index a2f3fa61ee36..8cb8f3833239 100644
--- a/arch/sparc/include/asm/tlb_64.h
+++ b/arch/sparc/include/asm/tlb_64.h
@@ -28,6 +28,15 @@ void flush_tlb_pending(void);
 #define __tlb_remove_tlb_entry(tlb, ptep, address) do { } while (0)
 #define tlb_flush(tlb) flush_tlb_pending()
 
+/*
+ * SPARC64's hardware TLB fill does not use the Linux page-tables
+ * and therefore we don't need a TLBI when freeing page-table pages.
+ */
+
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+#define tlb_needs_table_invalidate()   (false)
+#endif
+
 #include 
 
 #endif /* _SPARC64_TLB_H */
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index f2b9dc9cbaf8..19934cdd143e 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -61,8 +61,23 @@ struct mmu_table_batch {
 extern void tlb_table_flush(struct mmu_gather *tlb);
 extern void tlb_remove_table(struct mmu_gather *tlb, void *table);
 
+/*
+ * This allows an architecture that does not use the linux page-tables for
+ * hardware to skip the TLBI when freein

[PATCH v4 4/6] powerpc/mmu_gather: enable RCU_TABLE_FREE even for !SMP case

2020-05-20 Thread Santosh Sivaraj
From: "Aneesh Kumar K.V" 

commit 12e4d53f3f04e81f9e83d6fc10edc7314ab9f6b9 upstream

Patch series "Fixup page directory freeing", v4.

This is a repost of patch series from Peter with the arch specific changes
except ppc64 dropped.  ppc64 changes are added here because we are redoing
the patch series on top of ppc64 changes.  This makes it easy to backport
these changes.  Only the first 2 patches need to be backported to stable.

The thing is, on anything SMP, freeing page directories should observe the
exact same order as normal page freeing:

 1) unhook page/directory
 2) TLB invalidate
 3) free page/directory

Without this, any concurrent page-table walk could end up with a
Use-after-Free.  This is esp.  trivial for anything that has software
page-table walkers (HAVE_FAST_GUP / software TLB fill) or the hardware
caches partial page-walks (ie.  caches page directories).

Even on UP this might give issues since mmu_gather is preemptible these
days.  An interrupt or preempted task accessing user pages might stumble
into the free page if the hardware caches page directories.

This patch series fixes ppc64 and add generic MMU_GATHER changes to
support the conversion of other architectures.  I haven't added patches
w.r.t. other architectures because they are yet to be acked.

This patch (of 9):

A followup patch is going to make sure we correctly invalidate page walk
cache before we free page table pages.  In order to keep things simple
enable RCU_TABLE_FREE even for !SMP so that we don't have to fixup the
!SMP case differently in the followup patch

!SMP case is right now broken for radix translation w.r.t page walk
cache flush.  We can get interrupted in between page table free and
that would imply we have page walk cache entries pointing to tables
which got freed already.  Michael said "both our platforms that run on
Power9 force SMP on in Kconfig, so the !SMP case is unlikely to be a
problem for anyone in practice, unless they've hacked their kernel to
build it !SMP."
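
For context, a simplified sketch of the path the !SMP build now shares with SMP, based
on the powerpc pgalloc helpers touched below: the page-table page is queued through
tlb_remove_table() so the invalidate can happen before the free, instead of being handed
straight to pgtable_free():

static inline void pgtable_free_tlb(struct mmu_gather *tlb,
				    void *table, int shift)
{
	unsigned long pgf = (unsigned long)table;

	pgf |= shift;				/* stash the index size in the low bits */
	tlb_remove_table(tlb, (void *)pgf);	/* queue: invalidate first, free later */
}

/*
 * The removed !SMP-only variant freed the page immediately:
 *	pgtable_free(table, shift);
 * which is exactly what lets a concurrent walker stumble into a freed table.
 */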

Link: 
http://lkml.kernel.org/r/20200116064531.483522-2-aneesh.ku...@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V 
Cc:  # 4.19
Signed-off-by: Santosh Sivaraj 
[santosh: backported for 4.19 stable]
---
 arch/powerpc/Kconfig | 2 +-
 arch/powerpc/include/asm/book3s/32/pgalloc.h | 8 
 arch/powerpc/include/asm/book3s/64/pgalloc.h | 2 --
 arch/powerpc/include/asm/nohash/32/pgalloc.h | 8 
 arch/powerpc/mm/pgtable-book3s64.c   | 7 ---
 5 files changed, 1 insertion(+), 26 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index e09cfb109b8c..1a00ce4b0040 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -215,7 +215,7 @@ config PPC
select HAVE_HARDLOCKUP_DETECTOR_PERF if PERF_EVENTS && 
HAVE_PERF_EVENTS_NMI && !HAVE_HARDLOCKUP_DETECTOR_ARCH
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
-   select HAVE_RCU_TABLE_FREE  if SMP
+   select HAVE_RCU_TABLE_FREE
select HAVE_RCU_TABLE_NO_INVALIDATE if HAVE_RCU_TABLE_FREE
select HAVE_REGS_AND_STACK_ACCESS_API
select HAVE_RELIABLE_STACKTRACE if PPC64 && CPU_LITTLE_ENDIAN
diff --git a/arch/powerpc/include/asm/book3s/32/pgalloc.h 
b/arch/powerpc/include/asm/book3s/32/pgalloc.h
index 82e44b1a00ae..79ba3fbb512e 100644
--- a/arch/powerpc/include/asm/book3s/32/pgalloc.h
+++ b/arch/powerpc/include/asm/book3s/32/pgalloc.h
@@ -110,7 +110,6 @@ static inline void pgtable_free(void *table, unsigned 
index_size)
 #define check_pgt_cache()  do { } while (0)
 #define get_hugepd_cache_index(x)  (x)
 
-#ifdef CONFIG_SMP
 static inline void pgtable_free_tlb(struct mmu_gather *tlb,
void *table, int shift)
 {
@@ -127,13 +126,6 @@ static inline void __tlb_remove_table(void *_table)
 
pgtable_free(table, shift);
 }
-#else
-static inline void pgtable_free_tlb(struct mmu_gather *tlb,
-   void *table, int shift)
-{
-   pgtable_free(table, shift);
-}
-#endif
 
 static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t table,
  unsigned long address)
diff --git a/arch/powerpc/include/asm/book3s/64/pgalloc.h 
b/arch/powerpc/include/asm/book3s/64/pgalloc.h
index f9019b579903..1013c0214213 100644
--- a/arch/powerpc/include/asm/book3s/64/pgalloc.h
+++ b/arch/powerpc/include/asm/book3s/64/pgalloc.h
@@ -47,9 +47,7 @@ extern pmd_t *pmd_fragment_alloc(struct mm_struct *, unsigned 
long);
 extern void pte_fragment_free(unsigned long *, int);
 extern void pmd_fragment_free(unsigned long *);
 extern void pgtable_free_tlb(struct mmu_gather *tlb, void *table, int shift);
-#ifdef CONFIG_SMP
 extern void __tlb_remove_table(void *_table);
-#endif
 
 static inline pgd_t *radix__pgd_alloc(struct mm_struct *mm)
 {
diff --git a/arch/powerpc/include/asm/nohash/32/pgalloc.h 
b/arch/powerpc/include/asm/nohas

[PATCH v4 3/6] asm-generic/tlb, arch: Invert CONFIG_HAVE_RCU_TABLE_INVALIDATE

2020-05-20 Thread Santosh Sivaraj
From: Peter Zijlstra 

commit 96bc9567cbe112e9320250f01b9c060c882e8619 upstream

Make issuing a TLB invalidate for page-table pages the normal case.

The reason is twofold:

 - too many invalidates is safer than too few,
 - most architectures use the linux page-tables natively
   and would thus require this.

Make it an opt-out, instead of an opt-in.

No change in behavior intended.

Signed-off-by: Peter Zijlstra (Intel) 
Cc:  # 4.19
Signed-off-by: Santosh Sivaraj 
[santosh: prerequisite for upcoming tlbflush backports]
---
 arch/Kconfig | 2 +-
 arch/powerpc/Kconfig | 1 +
 arch/sparc/Kconfig   | 1 +
 arch/x86/Kconfig | 1 -
 mm/memory.c  | 2 +-
 5 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index a336548487e6..061a12b8140e 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -363,7 +363,7 @@ config HAVE_ARCH_JUMP_LABEL
 config HAVE_RCU_TABLE_FREE
bool
 
-config HAVE_RCU_TABLE_INVALIDATE
+config HAVE_RCU_TABLE_NO_INVALIDATE
bool
 
 config ARCH_HAVE_NMI_SAFE_CMPXCHG
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 6f475dc5829b..e09cfb109b8c 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -216,6 +216,7 @@ config PPC
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
select HAVE_RCU_TABLE_FREE  if SMP
+   select HAVE_RCU_TABLE_NO_INVALIDATE if HAVE_RCU_TABLE_FREE
select HAVE_REGS_AND_STACK_ACCESS_API
select HAVE_RELIABLE_STACKTRACE if PPC64 && CPU_LITTLE_ENDIAN
select HAVE_SYSCALL_TRACEPOINTS
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index e6f2a38d2e61..d90d632868aa 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -64,6 +64,7 @@ config SPARC64
select HAVE_KRETPROBES
select HAVE_KPROBES
select HAVE_RCU_TABLE_FREE if SMP
+   select HAVE_RCU_TABLE_NO_INVALIDATE if HAVE_RCU_TABLE_FREE
select HAVE_MEMBLOCK_NODE_MAP
select HAVE_ARCH_TRANSPARENT_HUGEPAGE
select HAVE_DYNAMIC_FTRACE
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index af35f5caadbe..181d0d522977 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -181,7 +181,6 @@ config X86
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
select HAVE_RCU_TABLE_FREE  if PARAVIRT
-   select HAVE_RCU_TABLE_INVALIDATEif HAVE_RCU_TABLE_FREE
select HAVE_REGS_AND_STACK_ACCESS_API
select HAVE_RELIABLE_STACKTRACE if X86_64 && 
(UNWINDER_FRAME_POINTER || UNWINDER_ORC) && STACK_VALIDATION
select HAVE_STACKPROTECTOR  if CC_HAS_SANE_STACKPROTECTOR
diff --git a/mm/memory.c b/mm/memory.c
index 1832c5ed6ac0..ba5689610c04 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -327,7 +327,7 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct 
page *page, int page_
  */
 static inline void tlb_table_invalidate(struct mmu_gather *tlb)
 {
-#ifdef CONFIG_HAVE_RCU_TABLE_INVALIDATE
+#ifndef CONFIG_HAVE_RCU_TABLE_NO_INVALIDATE
/*
 * Invalidate page-table caches used by hardware walkers. Then we still
 * need to RCU-sched wait while freeing the pages because software
-- 
2.25.4



[PATCH v4 2/6] asm-generic/tlb: Track which levels of the page tables have been cleared

2020-05-20 Thread Santosh Sivaraj
From: Will Deacon 

commit a6d60245d6d9b1caf66b0d94419988c4836980af upstream

It is common for architectures with hugepage support to require only a
single TLB invalidation operation per hugepage during unmap(), rather than
iterating through the mapping at a PAGE_SIZE increment. Currently,
however, the level in the page table where the unmap() operation occurs
is not stored in the mmu_gather structure, therefore forcing
architectures to issue additional TLB invalidation operations or to give
up and over-invalidate by e.g. invalidating the entire TLB.

Ideally, we could add an interval rbtree to the mmu_gather structure,
which would allow us to associate the correct mapping granule with the
various sub-mappings within the range being invalidated. However, this
is costly in terms of book-keeping and memory management, so instead we
approximate by keeping track of the page table levels that are cleared
and provide a means to query the smallest granule required for invalidation.
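
As an illustration (not taken from any in-tree architecture), a tlb_flush()
implementation can use the recorded levels to pick the invalidation stride;
flush_range_by_stride() below is a made-up helper standing in for the arch-specific
range invalidate:

static inline void tlb_flush(struct mmu_gather *tlb)
{
	/* one invalidation per mapped granule instead of per PAGE_SIZE */
	unsigned long stride = tlb_get_unmap_size(tlb);

	/* hypothetical arch helper: walks [start, end) in 'stride' steps */
	flush_range_by_stride(tlb->start, tlb->end, stride);
}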

Signed-off-by: Will Deacon 
Cc:  # 4.19
Signed-off-by: Santosh Sivaraj 
[santosh: prerequisite for upcoming tlbflush backports]
---
 include/asm-generic/tlb.h | 58 +--
 mm/memory.c   |  4 ++-
 2 files changed, 53 insertions(+), 9 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 97306b32d8d2..f2b9dc9cbaf8 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -114,6 +114,14 @@ struct mmu_gather {
 */
unsigned intfreed_tables : 1;
 
+   /*
+* at which levels have we cleared entries?
+*/
+   unsigned intcleared_ptes : 1;
+   unsigned intcleared_pmds : 1;
+   unsigned intcleared_puds : 1;
+   unsigned intcleared_p4ds : 1;
+
struct mmu_gather_batch *active;
struct mmu_gather_batch local;
struct page *__pages[MMU_GATHER_BUNDLE];
@@ -148,6 +156,10 @@ static inline void __tlb_reset_range(struct mmu_gather 
*tlb)
tlb->end = 0;
}
tlb->freed_tables = 0;
+   tlb->cleared_ptes = 0;
+   tlb->cleared_pmds = 0;
+   tlb->cleared_puds = 0;
+   tlb->cleared_p4ds = 0;
 }
 
 static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
@@ -197,6 +209,25 @@ static inline void 
tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 }
 #endif
 
+static inline unsigned long tlb_get_unmap_shift(struct mmu_gather *tlb)
+{
+   if (tlb->cleared_ptes)
+   return PAGE_SHIFT;
+   if (tlb->cleared_pmds)
+   return PMD_SHIFT;
+   if (tlb->cleared_puds)
+   return PUD_SHIFT;
+   if (tlb->cleared_p4ds)
+   return P4D_SHIFT;
+
+   return PAGE_SHIFT;
+}
+
+static inline unsigned long tlb_get_unmap_size(struct mmu_gather *tlb)
+{
+   return 1UL << tlb_get_unmap_shift(tlb);
+}
+
 /*
  * In the case of tlb vma handling, we can optimise these away in the
  * case where we're doing a full MM flush.  When we're doing a munmap,
@@ -230,13 +261,19 @@ static inline void 
tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 #define tlb_remove_tlb_entry(tlb, ptep, address)   \
do {\
__tlb_adjust_range(tlb, address, PAGE_SIZE);\
+   tlb->cleared_ptes = 1;  \
__tlb_remove_tlb_entry(tlb, ptep, address); \
} while (0)
 
-#define tlb_remove_huge_tlb_entry(h, tlb, ptep, address)\
-   do { \
-   __tlb_adjust_range(tlb, address, huge_page_size(h)); \
-   __tlb_remove_tlb_entry(tlb, ptep, address);  \
+#define tlb_remove_huge_tlb_entry(h, tlb, ptep, address)   \
+   do {\
+   unsigned long _sz = huge_page_size(h);  \
+   __tlb_adjust_range(tlb, address, _sz);  \
+   if (_sz == PMD_SIZE)\
+   tlb->cleared_pmds = 1;  \
+   else if (_sz == PUD_SIZE)   \
+   tlb->cleared_puds = 1;  \
+   __tlb_remove_tlb_entry(tlb, ptep, address); \
} while (0)
 
 /**
@@ -250,6 +287,7 @@ static inline void tlb_remove_check_page_size_change(struct 
mmu_gather *tlb,
 #define tlb_remove_pmd_tlb_entry(tlb, pmdp, address)   \
do {\
__tlb_adjust_range(tlb, address, HPAGE_PMD_SIZE);   \
+   tlb->cleared_pmds = 1;  \
__tlb_remove_pmd_tlb_entry(tlb, pmdp, address); \
} wh

[PATCH v4 1/6] asm-generic/tlb: Track freeing of page-table directories in struct mmu_gather

2020-05-20 Thread Santosh Sivaraj
From: Peter Zijlstra 

commit 22a61c3c4f1379ef8b0ce0d5cb78baf3178950e2 upstream

Some architectures require different TLB invalidation instructions
depending on whether it is only the last-level of page table being
changed, or whether there are also changes to the intermediate
(directory) entries higher up the tree.

Add a new bit to the flags bitfield in struct mmu_gather so that the
architecture code can operate accordingly if it's the intermediate
levels being invalidated.
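
For example (hypothetical, not lifted from an in-tree architecture), an arch tlb_flush()
can now tell leaf-only unmaps apart from ones that also freed directories, and only pay
for the heavier page-walk-cache invalidate in the latter case:

static inline void tlb_flush(struct mmu_gather *tlb)
{
	if (tlb->freed_tables)
		flush_tlb_and_pwc(tlb->start, tlb->end);	/* hypothetical: TLB + page-walk cache */
	else
		flush_tlb_range_only(tlb->start, tlb->end);	/* hypothetical: leaf entries only */
}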

Signed-off-by: Peter Zijlstra 
Signed-off-by: Will Deacon 
Cc:  # 4.19
Signed-off-by: Santosh Sivaraj 
[santosh: prerequisite for tlbflush backports]
---
 include/asm-generic/tlb.h | 31 +++
 1 file changed, 23 insertions(+), 8 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index b3353e21f3b3..97306b32d8d2 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -97,12 +97,22 @@ struct mmu_gather {
 #endif
unsigned long   start;
unsigned long   end;
-   /* we are in the middle of an operation to clear
-* a full mm and can make some optimizations */
-   unsigned intfullmm : 1,
-   /* we have performed an operation which
-* requires a complete flush of the tlb */
-   need_flush_all : 1;
+   /*
+* we are in the middle of an operation to clear
+* a full mm and can make some optimizations
+*/
+   unsigned intfullmm : 1;
+
+   /*
+* we have performed an operation which
+* requires a complete flush of the tlb
+*/
+   unsigned intneed_flush_all : 1;
+
+   /*
+* we have removed page directories
+*/
+   unsigned intfreed_tables : 1;
 
struct mmu_gather_batch *active;
struct mmu_gather_batch local;
@@ -137,6 +147,7 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb)
tlb->start = TASK_SIZE;
tlb->end = 0;
}
+   tlb->freed_tables = 0;
 }
 
 static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
@@ -278,6 +289,7 @@ static inline void tlb_remove_check_page_size_change(struct 
mmu_gather *tlb,
 #define pte_free_tlb(tlb, ptep, address)   \
do {\
__tlb_adjust_range(tlb, address, PAGE_SIZE);\
+   tlb->freed_tables = 1;  \
__pte_free_tlb(tlb, ptep, address); \
} while (0)
 #endif
@@ -285,7 +297,8 @@ static inline void tlb_remove_check_page_size_change(struct 
mmu_gather *tlb,
 #ifndef pmd_free_tlb
 #define pmd_free_tlb(tlb, pmdp, address)   \
do {\
-   __tlb_adjust_range(tlb, address, PAGE_SIZE);\
+   __tlb_adjust_range(tlb, address, PAGE_SIZE);\
+   tlb->freed_tables = 1;  \
__pmd_free_tlb(tlb, pmdp, address); \
} while (0)
 #endif
@@ -295,6 +308,7 @@ static inline void tlb_remove_check_page_size_change(struct 
mmu_gather *tlb,
 #define pud_free_tlb(tlb, pudp, address)   \
do {\
__tlb_adjust_range(tlb, address, PAGE_SIZE);\
+   tlb->freed_tables = 1;  \
__pud_free_tlb(tlb, pudp, address); \
} while (0)
 #endif
@@ -304,7 +318,8 @@ static inline void tlb_remove_check_page_size_change(struct 
mmu_gather *tlb,
 #ifndef p4d_free_tlb
 #define p4d_free_tlb(tlb, pudp, address)   \
do {\
-   __tlb_adjust_range(tlb, address, PAGE_SIZE);\
+   __tlb_adjust_range(tlb, address, PAGE_SIZE);\
+   tlb->freed_tables = 1;  \
__p4d_free_tlb(tlb, pudp, address); \
} while (0)
 #endif
-- 
2.25.4



[PATCH v4 0/6] Memory corruption may occur due to incorrent tlb flush

2020-05-20 Thread Santosh Sivaraj
The TLB flush optimisation (a46cc7a90f: powerpc/mm/radix: Improve TLB/PWC
flushes) may result in random memory corruption. Any concurrent page-table walk
could end up with a Use-after-Free. Even on UP this might give issues, since
mmu_gather is preemptible these days. An interrupt or preempted task accessing
user pages might stumble into the free page if the hardware caches page
directories.
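
To make the race concrete, a rough timeline (illustrative only) of the use-after-free
the series closes:

/*
 * CPU0: munmap()/exit path                 CPU1: concurrent walker
 * -------------------------                -----------------------
 * unhook pmd from its pud                  page-walk cache still holds the
 * free the pmd page   <-- too early        old pmd page
 * page gets reused                         hardware walk dereferences the
 *                                          stale entry -> touches freed memory
 *
 * Required order (what the series enforces, even for !SMP):
 *   1) unhook page/directory   2) TLB/PWC invalidate   3) free page/directory
 */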

The series is a backport of the fix sent by Peter [1].

The first three patches are dependencies for the last patch (avoid potential
double flush). If the performance impact due to double flush is considered
trivial then the first three patches and last patch may be dropped.

This is only for v4.19 stable.
--
Changelog:
 v2: Send the patches with the correct format (commit sha1 upstream) for stable
 v3: Fix compilation for ppc44x_defconfig and mpc885_ads_defconfig
 v4: No change, Resend.

--
Aneesh Kumar K.V (1):
  powerpc/mmu_gather: enable RCU_TABLE_FREE even for !SMP case

Peter Zijlstra (4):
  asm-generic/tlb: Track freeing of page-table directories in struct
mmu_gather
  asm-generic/tlb, arch: Invert CONFIG_HAVE_RCU_TABLE_INVALIDATE
  mm/mmu_gather: invalidate TLB correctly on batch allocation failure
and flush
  asm-generic/tlb: avoid potential double flush

Will Deacon (1):
  asm-generic/tlb: Track which levels of the page tables have been
cleared

 arch/Kconfig |   3 -
 arch/powerpc/Kconfig |   2 +-
 arch/powerpc/include/asm/book3s/32/pgalloc.h |   8 --
 arch/powerpc/include/asm/book3s/64/pgalloc.h |   2 -
 arch/powerpc/include/asm/nohash/32/pgalloc.h |   8 --
 arch/powerpc/include/asm/tlb.h   |  11 ++
 arch/powerpc/mm/pgtable-book3s64.c   |   7 --
 arch/sparc/include/asm/tlb_64.h  |   9 ++
 arch/x86/Kconfig |   1 -
 include/asm-generic/tlb.h| 103 ---
 mm/memory.c  |  20 ++--
 11 files changed, 122 insertions(+), 52 deletions(-)

-- 
2.25.4



[PATCH v2] papr/scm: Add bad memory ranges to nvdimm bad ranges

2020-04-16 Thread Santosh Sivaraj
Subscribe to the MCE notification and add the physical address which
generated a memory error to nvdimm bad range.

Reviewed-by: Mahesh Salgaonkar 
Signed-off-by: Santosh Sivaraj 
---
 arch/powerpc/platforms/pseries/papr_scm.c | 98 ++-
 1 file changed, 97 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index f35592423380..e23fd1399d5b 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -12,6 +12,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include 
 
@@ -39,8 +41,12 @@ struct papr_scm_priv {
struct resource res;
struct nd_region *region;
struct nd_interleave_set nd_set;
+   struct list_head region_list;
 };
 
+LIST_HEAD(papr_nd_regions);
+DEFINE_MUTEX(papr_ndr_lock);
+
 static int drc_pmem_bind(struct papr_scm_priv *p)
 {
unsigned long ret[PLPAR_HCALL_BUFSIZE];
@@ -356,6 +362,10 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
dev_info(dev, "Region registered with target node %d and online 
node %d",
 target_nid, online_nid);
 
+   mutex_lock(&papr_ndr_lock);
+   list_add_tail(&p->region_list, &papr_nd_regions);
+   mutex_unlock(&papr_ndr_lock);
+
return 0;
 
 err:   nvdimm_bus_unregister(p->bus);
@@ -363,6 +373,70 @@ err:   nvdimm_bus_unregister(p->bus);
return -ENXIO;
 }
 
+void papr_scm_add_badblock(struct nd_region *region, struct nvdimm_bus *bus,
+  u64 phys_addr)
+{
+   u64 aligned_addr = ALIGN_DOWN(phys_addr, L1_CACHE_BYTES);
+
+   if (nvdimm_bus_add_badrange(bus, aligned_addr, L1_CACHE_BYTES)) {
+   pr_err("Bad block registration for 0x%llx failed\n", phys_addr);
+   return;
+   }
+
+   pr_debug("Add memory range (0x%llx - 0x%llx) as bad range\n",
+aligned_addr, aligned_addr + L1_CACHE_BYTES);
+
+   nvdimm_region_notify(region, NVDIMM_REVALIDATE_POISON);
+}
+
+static int handle_mce_ue(struct notifier_block *nb, unsigned long val,
+void *data)
+{
+   struct machine_check_event *evt = data;
+   struct papr_scm_priv *p;
+   u64 phys_addr;
+   bool found = false;
+
+   if (evt->error_type != MCE_ERROR_TYPE_UE)
+   return NOTIFY_DONE;
+
+   if (list_empty(&papr_nd_regions))
+   return NOTIFY_DONE;
+
+   /*
+* The physical address obtained here is PAGE_SIZE aligned, so get the
+* exact address from the effective address
+*/
+   phys_addr = evt->u.ue_error.physical_address +
+   (evt->u.ue_error.effective_address & ~PAGE_MASK);
+
+   if (!evt->u.ue_error.physical_address_provided ||
+   !is_zone_device_page(pfn_to_page(phys_addr >> PAGE_SHIFT)))
+   return NOTIFY_DONE;
+
+   /* mce notifier is called from a process context, so mutex is safe */
+   mutex_lock(&papr_ndr_lock);
+   list_for_each_entry(p, &papr_nd_regions, region_list) {
+   struct resource res = p->res;
+
+   if (phys_addr >= res.start && phys_addr <= res.end) {
+   found = true;
+   break;
+   }
+   }
+
+   if (found)
+   papr_scm_add_badblock(p->region, p->bus, phys_addr);
+
+   mutex_unlock(&papr_ndr_lock);
+
+   return found ? NOTIFY_OK : NOTIFY_DONE;
+}
+
+static struct notifier_block mce_ue_nb = {
+   .notifier_call = handle_mce_ue
+};
+
 static int papr_scm_probe(struct platform_device *pdev)
 {
struct device_node *dn = pdev->dev.of_node;
@@ -460,6 +534,10 @@ static int papr_scm_remove(struct platform_device *pdev)
 {
struct papr_scm_priv *p = platform_get_drvdata(pdev);
 
+   mutex_lock(&papr_ndr_lock);
+   list_del(&(p->region_list));
+   mutex_unlock(&papr_ndr_lock);
+
nvdimm_bus_unregister(p->bus);
drc_pmem_unbind(p);
kfree(p->bus_desc.provider_name);
@@ -482,7 +560,25 @@ static struct platform_driver papr_scm_driver = {
},
 };
 
-module_platform_driver(papr_scm_driver);
+static int __init papr_scm_init(void)
+{
+   int ret;
+
+   ret = platform_driver_register(&papr_scm_driver);
+   if (!ret)
+   mce_register_notifier(&mce_ue_nb);
+
+   return ret;
+}
+module_init(papr_scm_init);
+
+static void __exit papr_scm_exit(void)
+{
+   mce_unregister_notifier(&mce_ue_nb);
+   platform_driver_unregister(&papr_scm_driver);
+}
+module_exit(papr_scm_exit);
+
 MODULE_DEVICE_TABLE(of, papr_scm_match);
 MODULE_LICENSE("GPL");
 MODULE_AUTHOR("IBM Corporation");
-- 
2.25.2



Re: [PATCH] papr/scm: Add bad memory ranges to nvdimm bad ranges

2020-04-13 Thread Santosh Sivaraj
Mahesh J Salgaonkar  writes:

> On 2020-04-01 13:17:41 Wed, Santosh Sivaraj wrote:
>> Subscribe to the MCE notification and add the physical address which
>> generated a memory error to nvdimm bad range.
>> 
>> Signed-off-by: Santosh Sivaraj 
>> ---
>> 
>> This patch depends on "powerpc/mce: Add MCE notification chain" [1].
>> 
>> Unlike the previous series[2], the patch adds badblock registration only for
>> pseries scm driver. Handling badblocks for baremetal (powernv) PMEM will be 
>> done
>> later and if possible get the badblock handling as a common code.
>> 
>> [1] 
>> https://lore.kernel.org/linuxppc-dev/20200330071219.12284-1-ganes...@linux.ibm.com/
>> [2] 
>> https://lore.kernel.org/linuxppc-dev/20190820023030.18232-1-sant...@fossix.org/
>> 
>> arch/powerpc/platforms/pseries/papr_scm.c | 96 ++-
>>  1 file changed, 95 insertions(+), 1 deletion(-)
>> 
>> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
>> b/arch/powerpc/platforms/pseries/papr_scm.c
>> index 0b4467e378e5..5012cbf4606e 100644
>> --- a/arch/powerpc/platforms/pseries/papr_scm.c
>> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> [...]
>> +static int handle_mce_ue(struct notifier_block *nb, unsigned long val,
>> + void *data)
>> +{
>> +struct machine_check_event *evt = data;
>> +struct papr_scm_priv *p;
>> +u64 phys_addr;
>> +bool found = false;
>> +
>> +if (evt->error_type != MCE_ERROR_TYPE_UE)
>> +return NOTIFY_DONE;
>> +
>> +if (list_empty(&papr_nd_regions))
>> +return NOTIFY_DONE;
>
> Do you really need this check ?

Quite harmless I guess, at least it saves a branch and mutex_lock/unlock.

>
>> +
>> +phys_addr = evt->u.ue_error.physical_address +
>> +(evt->u.ue_error.effective_address & ~PAGE_MASK);
>> +
>> +if (!evt->u.ue_error.physical_address_provided ||
>> +!is_zone_device_page(pfn_to_page(phys_addr >> PAGE_SHIFT)))
>> +return NOTIFY_DONE;
>> +
>> +/* mce notifier is called from a process context, so mutex is safe */
>> +mutex_lock(&papr_ndr_lock);
>> +list_for_each_entry(p, &papr_nd_regions, region_list) {
>> +struct resource res = p->res;
>> +
>> +if (phys_addr >= res.start && phys_addr <= res.end) {
>> +found = true;
>> +break;
>> +}
>> +}
>> +
>> +mutex_unlock(&papr_ndr_lock);
>> +
>> +if (!found)
>> +return NOTIFY_DONE;
>> +
>> +papr_scm_add_badblock(p->region, p->bus, phys_addr);
>> +
>> +return NOTIFY_OK;
>> +}
>> +
>> +static struct notifier_block mce_ue_nb = {
>> +.notifier_call = handle_mce_ue
>> +};
>> +
> [...]
>> -module_platform_driver(papr_scm_driver);
>> +static int __init papr_scm_init(void)
>> +{
>> +int ret;
>> +
>> +ret = platform_driver_register(&papr_scm_driver);
>> +if (!ret)
>> +mce_register_notifier(&mce_ue_nb);
>> +
>> +return ret;
>> +}
>> +module_init(papr_scm_init);
>> +
>> +static void __exit papr_scm_exit(void)
>> +{
>> +mce_unregister_notifier(&mce_ue_nb);
>> +platform_driver_unregister(&papr_scm_driver);
>> +}
>> +module_exit(papr_scm_exit);
>
> Rest Looks good to me.
>
> Reviewed-by: Mahesh Salgaonkar 

Thanks for the review.

Santosh
>
> Thanks,
> -Mahesh.
>
>> +
>>  MODULE_DEVICE_TABLE(of, papr_scm_match);
>>  MODULE_LICENSE("GPL");
>>  MODULE_AUTHOR("IBM Corporation");
>> -- 
>> 2.25.1
>> 
>
> -- 
> Mahesh J Salgaonkar


Re: [PATCH] papr/scm: Add bad memory ranges to nvdimm bad ranges

2020-04-13 Thread Santosh Sivaraj
kbuild test robot  writes:

> Hi Santosh,
>
> Thank you for the patch! Yet something to improve:
>
> [auto build test ERROR on powerpc/next]
> [also build test ERROR on v5.7-rc1 next-20200412]
> [if your patch is applied to the wrong git tree, please drop us a note to help
> improve the system. BTW, we also suggest to use '--base' option to specify the
> base tree in git format-patch, please see
> https://stackoverflow.com/a/37406982]

This patch depends on "powerpc/mce: Add MCE notification chain" [1].

[1]: 
https://lore.kernel.org/linuxppc-dev/20200330071219.12284-1-ganes...@linux.ibm.com/

Thanks,
Santosh

>
> url:    
> https://github.com/0day-ci/linux/commits/Santosh-Sivaraj/papr-scm-Add-bad-memory-ranges-to-nvdimm-bad-ranges/20200401-171233
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
> config: powerpc-allyesconfig (attached as .config)
> compiler: powerpc64-linux-gcc (GCC) 9.3.0
> reproduce:
> wget 
> https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
> ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # save the attached .config to linux build tree
> GCC_VERSION=9.3.0 make.cross ARCH=powerpc 
>
> If you fix the issue, kindly add following tag as appropriate
> Reported-by: kbuild test robot 
>
> All errors (new ones prefixed by >>):
>
>arch/powerpc/platforms/pseries/papr_scm.c: In function 'papr_scm_init':
>>> arch/powerpc/platforms/pseries/papr_scm.c:584:3: error: implicit 
>>> declaration of function 'mce_register_notifier'; did you mean 
>>> 'bus_register_notifier'? [-Werror=implicit-function-declaration]
>  584 |   mce_register_notifier(_ue_nb);
>  |   ^
>  |   bus_register_notifier
>arch/powerpc/platforms/pseries/papr_scm.c: In function 'papr_scm_exit':
>>> arch/powerpc/platforms/pseries/papr_scm.c:592:2: error: implicit 
>>> declaration of function 'mce_unregister_notifier'; did you mean 
>>> 'bus_unregister_notifier'? [-Werror=implicit-function-declaration]
>  592 |  mce_unregister_notifier(_ue_nb);
>  |  ^~~
>  |  bus_unregister_notifier
>cc1: some warnings being treated as errors
>
> vim +584 arch/powerpc/platforms/pseries/papr_scm.c
>
>577
>578static int __init papr_scm_init(void)
>579{
>580int ret;
>581
>582ret = platform_driver_register(_scm_driver);
>583if (!ret)
>  > 584mce_register_notifier(_ue_nb);
>585
>586return ret;
>587}
>588module_init(papr_scm_init);
>589
>590static void __exit papr_scm_exit(void)
>591{
>  > 592mce_unregister_notifier(_ue_nb);
>593platform_driver_unregister(_scm_driver);
>594}
>595module_exit(papr_scm_exit);
>596
>
> ---
> 0-DAY CI Kernel Test Service, Intel Corporation
> https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org


Re: [PATCH] papr/scm: Add bad memory ranges to nvdimm bad ranges

2020-04-09 Thread Santosh Sivaraj
On Wed, Apr 1, 2020 at 1:18 PM Santosh Sivaraj  wrote:

> Subscribe to the MCE notification and add the physical address which
> generated a memory error to nvdimm bad range.
>
> Signed-off-by: Santosh Sivaraj 
> ---
>

Any comments on this?

Thanks,
Santosh


> This patch depends on "powerpc/mce: Add MCE notification chain" [1].
>
> Unlike the previous series[2], the patch adds badblock registration only
> for
> pseries scm driver. Handling badblocks for baremetal (powernv) PMEM will
> be done
> later and if possible get the badblock handling as a common code.
>
> [1]
> https://lore.kernel.org/linuxppc-dev/20200330071219.12284-1-ganes...@linux.ibm.com/
> [2]
> https://lore.kernel.org/linuxppc-dev/20190820023030.18232-1-sant...@fossix.org/
>
> arch/powerpc/platforms/pseries/papr_scm.c | 96 ++-
>  1 file changed, 95 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c
> b/arch/powerpc/platforms/pseries/papr_scm.c
> index 0b4467e378e5..5012cbf4606e 100644
> --- a/arch/powerpc/platforms/pseries/papr_scm.c
> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> @@ -12,6 +12,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>
>  #include 
>
> @@ -39,8 +41,12 @@ struct papr_scm_priv {
> struct resource res;
> struct nd_region *region;
> struct nd_interleave_set nd_set;
> +   struct list_head region_list;
>  };
>
> +LIST_HEAD(papr_nd_regions);
> +DEFINE_MUTEX(papr_ndr_lock);
> +
>  static int drc_pmem_bind(struct papr_scm_priv *p)
>  {
> unsigned long ret[PLPAR_HCALL_BUFSIZE];
> @@ -372,6 +378,10 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv
> *p)
> dev_info(dev, "Region registered with target node %d and
> online node %d",
>  target_nid, online_nid);
>
> +   mutex_lock(&papr_ndr_lock);
> +   list_add_tail(&p->region_list, &papr_nd_regions);
> +   mutex_unlock(&papr_ndr_lock);
> +
> return 0;
>
>  err:   nvdimm_bus_unregister(p->bus);
> @@ -379,6 +389,68 @@ err:   nvdimm_bus_unregister(p->bus);
> return -ENXIO;
>  }
>
> +void papr_scm_add_badblock(struct nd_region *region, struct nvdimm_bus
> *bus,
> +  u64 phys_addr)
> +{
> +   u64 aligned_addr = ALIGN_DOWN(phys_addr, L1_CACHE_BYTES);
> +
> +   if (nvdimm_bus_add_badrange(bus, aligned_addr, L1_CACHE_BYTES)) {
> +   pr_err("Bad block registration for 0x%llx failed\n",
> phys_addr);
> +   return;
> +   }
> +
> +   pr_debug("Add memory range (0x%llx - 0x%llx) as bad range\n",
> +aligned_addr, aligned_addr + L1_CACHE_BYTES);
> +
> +   nvdimm_region_notify(region, NVDIMM_REVALIDATE_POISON);
> +}
> +
> +static int handle_mce_ue(struct notifier_block *nb, unsigned long val,
> +void *data)
> +{
> +   struct machine_check_event *evt = data;
> +   struct papr_scm_priv *p;
> +   u64 phys_addr;
> +   bool found = false;
> +
> +   if (evt->error_type != MCE_ERROR_TYPE_UE)
> +   return NOTIFY_DONE;
> +
> +   if (list_empty(&papr_nd_regions))
> +   return NOTIFY_DONE;
> +
> +   phys_addr = evt->u.ue_error.physical_address +
> +   (evt->u.ue_error.effective_address & ~PAGE_MASK);
> +
> +   if (!evt->u.ue_error.physical_address_provided ||
> +   !is_zone_device_page(pfn_to_page(phys_addr >> PAGE_SHIFT)))
> +   return NOTIFY_DONE;
> +
> +   /* mce notifier is called from a process context, so mutex is safe
> */
> +   mutex_lock(&papr_ndr_lock);
> +   list_for_each_entry(p, &papr_nd_regions, region_list) {
> +   struct resource res = p->res;
> +
> +   if (phys_addr >= res.start && phys_addr <= res.end) {
> +   found = true;
> +   break;
> +   }
> +   }
> +
> +   mutex_unlock(&papr_ndr_lock);
> +
> +   if (!found)
> +   return NOTIFY_DONE;
> +
> +   papr_scm_add_badblock(p->region, p->bus, phys_addr);
> +
> +   return NOTIFY_OK;
> +}
> +
> +static struct notifier_block mce_ue_nb = {
> +   .notifier_call = handle_mce_ue
> +};
> +
>  static int papr_scm_probe(struct platform_device *pdev)
>  {
> struct device_node *dn = pdev->dev.of_node;
> @@ -476,6 +548,10 @@ static int papr_scm_remove(struct platform_device
> *pdev)
>  {
> struct papr_scm_priv *p = platform_get_drvdata(pdev

[PATCH] papr/scm: Add bad memory ranges to nvdimm bad ranges

2020-04-01 Thread Santosh Sivaraj
Subscribe to the MCE notification and add the physical address which
generated a memory error to nvdimm bad range.

Signed-off-by: Santosh Sivaraj 
---

This patch depends on "powerpc/mce: Add MCE notification chain" [1].

Unlike the previous series[2], the patch adds badblock registration only for
pseries scm driver. Handling badblocks for baremetal (powernv) PMEM will be done
later and if possible get the badblock handling as a common code.

[1] 
https://lore.kernel.org/linuxppc-dev/20200330071219.12284-1-ganes...@linux.ibm.com/
[2] 
https://lore.kernel.org/linuxppc-dev/20190820023030.18232-1-sant...@fossix.org/

arch/powerpc/platforms/pseries/papr_scm.c | 96 ++-
 1 file changed, 95 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index 0b4467e378e5..5012cbf4606e 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -12,6 +12,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include 
 
@@ -39,8 +41,12 @@ struct papr_scm_priv {
struct resource res;
struct nd_region *region;
struct nd_interleave_set nd_set;
+   struct list_head region_list;
 };
 
+LIST_HEAD(papr_nd_regions);
+DEFINE_MUTEX(papr_ndr_lock);
+
 static int drc_pmem_bind(struct papr_scm_priv *p)
 {
unsigned long ret[PLPAR_HCALL_BUFSIZE];
@@ -372,6 +378,10 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
dev_info(dev, "Region registered with target node %d and online 
node %d",
 target_nid, online_nid);
 
+   mutex_lock(&papr_ndr_lock);
+   list_add_tail(&p->region_list, &papr_nd_regions);
+   mutex_unlock(&papr_ndr_lock);
+
return 0;
 
 err:   nvdimm_bus_unregister(p->bus);
@@ -379,6 +389,68 @@ err:   nvdimm_bus_unregister(p->bus);
return -ENXIO;
 }
 
+void papr_scm_add_badblock(struct nd_region *region, struct nvdimm_bus *bus,
+  u64 phys_addr)
+{
+   u64 aligned_addr = ALIGN_DOWN(phys_addr, L1_CACHE_BYTES);
+
+   if (nvdimm_bus_add_badrange(bus, aligned_addr, L1_CACHE_BYTES)) {
+   pr_err("Bad block registration for 0x%llx failed\n", phys_addr);
+   return;
+   }
+
+   pr_debug("Add memory range (0x%llx - 0x%llx) as bad range\n",
+aligned_addr, aligned_addr + L1_CACHE_BYTES);
+
+   nvdimm_region_notify(region, NVDIMM_REVALIDATE_POISON);
+}
+
+static int handle_mce_ue(struct notifier_block *nb, unsigned long val,
+void *data)
+{
+   struct machine_check_event *evt = data;
+   struct papr_scm_priv *p;
+   u64 phys_addr;
+   bool found = false;
+
+   if (evt->error_type != MCE_ERROR_TYPE_UE)
+   return NOTIFY_DONE;
+
+   if (list_empty(&papr_nd_regions))
+   return NOTIFY_DONE;
+
+   phys_addr = evt->u.ue_error.physical_address +
+   (evt->u.ue_error.effective_address & ~PAGE_MASK);
+
+   if (!evt->u.ue_error.physical_address_provided ||
+   !is_zone_device_page(pfn_to_page(phys_addr >> PAGE_SHIFT)))
+   return NOTIFY_DONE;
+
+   /* mce notifier is called from a process context, so mutex is safe */
+   mutex_lock(&papr_ndr_lock);
+   list_for_each_entry(p, &papr_nd_regions, region_list) {
+   struct resource res = p->res;
+
+   if (phys_addr >= res.start && phys_addr <= res.end) {
+   found = true;
+   break;
+   }
+   }
+
+   mutex_unlock(&papr_ndr_lock);
+
+   if (!found)
+   return NOTIFY_DONE;
+
+   papr_scm_add_badblock(p->region, p->bus, phys_addr);
+
+   return NOTIFY_OK;
+}
+
+static struct notifier_block mce_ue_nb = {
+   .notifier_call = handle_mce_ue
+};
+
 static int papr_scm_probe(struct platform_device *pdev)
 {
struct device_node *dn = pdev->dev.of_node;
@@ -476,6 +548,10 @@ static int papr_scm_remove(struct platform_device *pdev)
 {
struct papr_scm_priv *p = platform_get_drvdata(pdev);
 
+   mutex_lock(&papr_ndr_lock);
+   list_del(&(p->region_list));
+   mutex_unlock(&papr_ndr_lock);
+
nvdimm_bus_unregister(p->bus);
drc_pmem_unbind(p);
kfree(p->bus_desc.provider_name);
@@ -498,7 +574,25 @@ static struct platform_driver papr_scm_driver = {
},
 };
 
-module_platform_driver(papr_scm_driver);
+static int __init papr_scm_init(void)
+{
+   int ret;
+
+   ret = platform_driver_register(&papr_scm_driver);
+   if (!ret)
+   mce_register_notifier(&mce_ue_nb);
+
+   return ret;
+}
+module_init(papr_scm_init);
+
+static void __exit papr_scm_exit(void)
+{
+   mce_unregister_notifier(&mce_ue_nb);
+   platform_driver_unregister(&papr_scm_driver);
+}
+module_exit(papr_scm_exit);
+
 MODULE_DEVICE_TABLE(of, papr_scm_match);
 MODULE_LICENSE("GPL");
 MODULE_AUTHOR("IBM Corporation");
-- 
2.25.1



[PATCH v3 6/6] asm-generic/tlb: avoid potential double flush

2020-03-12 Thread Santosh Sivaraj
From: Peter Zijlstra 

commit 0758cd8304942292e95a0f750c374533db378b32 upstream.

Aneesh reported that:

tlb_flush_mmu()
  tlb_flush_mmu_tlbonly()
tlb_flush() <-- #1
  tlb_flush_mmu_free()
tlb_table_flush()
  tlb_table_invalidate()
tlb_flush_mmu_tlbonly()
  tlb_flush()   <-- #2

does two TLBIs when tlb->fullmm, because __tlb_reset_range() will not
clear tlb->end in that case.

Observe that any caller to __tlb_adjust_range() also sets at least one of
the tlb->freed_tables || tlb->cleared_p* bits, and those are
unconditionally cleared by __tlb_reset_range().

Change the condition for actually issuing TLBI to having one of those bits
set, as opposed to having tlb->end != 0.

Link: 
http://lkml.kernel.org/r/20200116064531.483522-4-aneesh.ku...@linux.ibm.com
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Aneesh Kumar K.V 
Reported-by: "Aneesh Kumar K.V" 
Cc:   # 4.19
Signed-off-by: Santosh Sivaraj 
[santosh: backported to 4.19 stable]
---
 include/asm-generic/tlb.h | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 19934cdd143e..427a70c56ddd 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -179,7 +179,12 @@ static inline void __tlb_reset_range(struct mmu_gather 
*tlb)
 
 static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
 {
-   if (!tlb->end)
+   /*
+* Anything calling __tlb_adjust_range() also sets at least one of
+* these bits.
+*/
+   if (!(tlb->freed_tables || tlb->cleared_ptes || tlb->cleared_pmds ||
+ tlb->cleared_puds || tlb->cleared_p4ds))
return;
 
tlb_flush(tlb);
-- 
2.24.1



[PATCH v3 5/6] mm/mmu_gather: invalidate TLB correctly on batch allocation failure and flush

2020-03-12 Thread Santosh Sivaraj
From: Peter Zijlstra 

commit 0ed1325967ab5f7a4549a2641c6ebe115f76e228 upstream.

Architectures for which we have hardware walkers of Linux page table
should flush TLB on mmu gather batch allocation failures and batch flush.
Some architectures like POWER supports multiple translation modes (hash
and radix) and in the case of POWER only radix translation mode needs the
above TLBI.  This is because for hash translation mode kernel wants to
avoid this extra flush since there are no hardware walkers of linux page
table.  With radix translation, the hardware also walks linux page table
and with that, kernel needs to make sure to TLB invalidate page walk cache
before page table pages are freed.

More details in commit d86564a2f085 ("mm/tlb, x86/mm: Support invalidating
TLB caches for RCU_TABLE_FREE")

The changes to sparc are to make sure we keep the old behavior since we
are now removing HAVE_RCU_TABLE_NO_INVALIDATE.  The default value for
tlb_needs_table_invalidate is to always force an invalidate and sparc can
avoid the table invalidate.  Hence we define tlb_needs_table_invalidate to
false for sparc architecture.

Link: 
http://lkml.kernel.org/r/20200116064531.483522-3-aneesh.ku...@linux.ibm.com
Fixes: a46cc7a90fd8 ("powerpc/mm/radix: Improve TLB/PWC flushes")
Signed-off-by: Peter Zijlstra (Intel) 
Cc:   # 4.19
Signed-off-by: Santosh Sivaraj 
[santosh: backported to 4.19 stable]
---
 arch/Kconfig|  3 ---
 arch/powerpc/Kconfig|  1 -
 arch/powerpc/include/asm/tlb.h  | 11 +++
 arch/sparc/Kconfig  |  1 -
 arch/sparc/include/asm/tlb_64.h |  9 +
 include/asm-generic/tlb.h   | 15 +++
 mm/memory.c | 16 
 7 files changed, 43 insertions(+), 13 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 061a12b8140e..3abbdb0cea44 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -363,9 +363,6 @@ config HAVE_ARCH_JUMP_LABEL
 config HAVE_RCU_TABLE_FREE
bool
 
-config HAVE_RCU_TABLE_NO_INVALIDATE
-   bool
-
 config ARCH_HAVE_NMI_SAFE_CMPXCHG
bool
 
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 1a00ce4b0040..e5bc0cfea2b1 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -216,7 +216,6 @@ config PPC
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
select HAVE_RCU_TABLE_FREE
-   select HAVE_RCU_TABLE_NO_INVALIDATE if HAVE_RCU_TABLE_FREE
select HAVE_REGS_AND_STACK_ACCESS_API
select HAVE_RELIABLE_STACKTRACE if PPC64 && CPU_LITTLE_ENDIAN
select HAVE_SYSCALL_TRACEPOINTS
diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
index f0e571b2dc7c..63418275f402 100644
--- a/arch/powerpc/include/asm/tlb.h
+++ b/arch/powerpc/include/asm/tlb.h
@@ -30,6 +30,17 @@
 #define tlb_remove_check_page_size_change tlb_remove_check_page_size_change
 
 extern void tlb_flush(struct mmu_gather *tlb);
+/*
+ * book3s:
+ * Hash does not use the linux page-tables, so we can avoid
+ * the TLB invalidate for page-table freeing, Radix otoh does use the
+ * page-tables and needs the TLBI.
+ *
+ * nohash:
+ * We still do TLB invalidate in the __pte_free_tlb routine before we
+ * add the page table pages to mmu gather table batch.
+ */
+#define tlb_needs_table_invalidate()   radix_enabled()
 
 /* Get the generic bits... */
 #include 
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index d90d632868aa..e6f2a38d2e61 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -64,7 +64,6 @@ config SPARC64
select HAVE_KRETPROBES
select HAVE_KPROBES
select HAVE_RCU_TABLE_FREE if SMP
-   select HAVE_RCU_TABLE_NO_INVALIDATE if HAVE_RCU_TABLE_FREE
select HAVE_MEMBLOCK_NODE_MAP
select HAVE_ARCH_TRANSPARENT_HUGEPAGE
select HAVE_DYNAMIC_FTRACE
diff --git a/arch/sparc/include/asm/tlb_64.h b/arch/sparc/include/asm/tlb_64.h
index a2f3fa61ee36..8cb8f3833239 100644
--- a/arch/sparc/include/asm/tlb_64.h
+++ b/arch/sparc/include/asm/tlb_64.h
@@ -28,6 +28,15 @@ void flush_tlb_pending(void);
 #define __tlb_remove_tlb_entry(tlb, ptep, address) do { } while (0)
 #define tlb_flush(tlb) flush_tlb_pending()
 
+/*
+ * SPARC64's hardware TLB fill does not use the Linux page-tables
+ * and therefore we don't need a TLBI when freeing page-table pages.
+ */
+
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+#define tlb_needs_table_invalidate()   (false)
+#endif
+
 #include 
 
 #endif /* _SPARC64_TLB_H */
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index f2b9dc9cbaf8..19934cdd143e 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -61,8 +61,23 @@ struct mmu_table_batch {
 extern void tlb_table_flush(struct mmu_gather *tlb);
 extern void tlb_remove_table(struct mmu_gather *tlb, void *table);
 
+/*
+ * This allows an architecture that does not use the linux page-tables for
+ * hardware to skip the TLBI when freein

[PATCH v3 4/6] powerpc/mmu_gather: enable RCU_TABLE_FREE even for !SMP case

2020-03-12 Thread Santosh Sivaraj
From: "Aneesh Kumar K.V" 

commit 12e4d53f3f04e81f9e83d6fc10edc7314ab9f6b9 upstream.

Patch series "Fixup page directory freeing", v4.

This is a repost of patch series from Peter with the arch specific changes
except ppc64 dropped.  ppc64 changes are added here because we are redoing
the patch series on top of ppc64 changes.  This makes it easy to backport
these changes.  Only the first 2 patches need to be backported to stable.

The thing is, on anything SMP, freeing page directories should observe the
exact same order as normal page freeing:

 1) unhook page/directory
 2) TLB invalidate
 3) free page/directory

Without this, any concurrent page-table walk could end up with a
Use-after-Free.  This is esp.  trivial for anything that has software
page-table walkers (HAVE_FAST_GUP / software TLB fill) or the hardware
caches partial page-walks (ie.  caches page directories).

Even on UP this might give issues since mmu_gather is preemptible these
days.  An interrupt or preempted task accessing user pages might stumble
into the free page if the hardware caches page directories.

This patch series fixes ppc64 and add generic MMU_GATHER changes to
support the conversion of other architectures.  I haven't added patches
w.r.t. other architectures because they are yet to be acked.

This patch (of 9):

A followup patch is going to make sure we correctly invalidate page walk
cache before we free page table pages.  In order to keep things simple
enable RCU_TABLE_FREE even for !SMP so that we don't have to fixup the
!SMP case differently in the followup patch

!SMP case is right now broken for radix translation w.r.t page walk
cache flush.  We can get interrupted in between page table free and
that would imply we have page walk cache entries pointing to tables
which got freed already.  Michael said "both our platforms that run on
Power9 force SMP on in Kconfig, so the !SMP case is unlikely to be a
problem for anyone in practice, unless they've hacked their kernel to
build it !SMP."

Link: 
http://lkml.kernel.org/r/20200116064531.483522-2-aneesh.ku...@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V 
Cc:  # 4.19
Signed-off-by: Santosh Sivaraj 
[santosh: backported for 4.19 stable]
---
 arch/powerpc/Kconfig | 2 +-
 arch/powerpc/include/asm/book3s/32/pgalloc.h | 8 
 arch/powerpc/include/asm/book3s/64/pgalloc.h | 2 --
 arch/powerpc/include/asm/nohash/32/pgalloc.h | 8 
 arch/powerpc/include/asm/nohash/64/pgalloc.h | 9 +
 arch/powerpc/mm/pgtable-book3s64.c   | 7 ---
 6 files changed, 2 insertions(+), 34 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index e09cfb109b8c..1a00ce4b0040 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -215,7 +215,7 @@ config PPC
select HAVE_HARDLOCKUP_DETECTOR_PERF if PERF_EVENTS && 
HAVE_PERF_EVENTS_NMI && !HAVE_HARDLOCKUP_DETECTOR_ARCH
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
-   select HAVE_RCU_TABLE_FREE  if SMP
+   select HAVE_RCU_TABLE_FREE
select HAVE_RCU_TABLE_NO_INVALIDATE if HAVE_RCU_TABLE_FREE
select HAVE_REGS_AND_STACK_ACCESS_API
select HAVE_RELIABLE_STACKTRACE if PPC64 && CPU_LITTLE_ENDIAN
diff --git a/arch/powerpc/include/asm/book3s/32/pgalloc.h 
b/arch/powerpc/include/asm/book3s/32/pgalloc.h
index 82e44b1a00ae..79ba3fbb512e 100644
--- a/arch/powerpc/include/asm/book3s/32/pgalloc.h
+++ b/arch/powerpc/include/asm/book3s/32/pgalloc.h
@@ -110,7 +110,6 @@ static inline void pgtable_free(void *table, unsigned 
index_size)
 #define check_pgt_cache()  do { } while (0)
 #define get_hugepd_cache_index(x)  (x)
 
-#ifdef CONFIG_SMP
 static inline void pgtable_free_tlb(struct mmu_gather *tlb,
void *table, int shift)
 {
@@ -127,13 +126,6 @@ static inline void __tlb_remove_table(void *_table)
 
pgtable_free(table, shift);
 }
-#else
-static inline void pgtable_free_tlb(struct mmu_gather *tlb,
-   void *table, int shift)
-{
-   pgtable_free(table, shift);
-}
-#endif
 
 static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t table,
  unsigned long address)
diff --git a/arch/powerpc/include/asm/book3s/64/pgalloc.h 
b/arch/powerpc/include/asm/book3s/64/pgalloc.h
index f9019b579903..1013c0214213 100644
--- a/arch/powerpc/include/asm/book3s/64/pgalloc.h
+++ b/arch/powerpc/include/asm/book3s/64/pgalloc.h
@@ -47,9 +47,7 @@ extern pmd_t *pmd_fragment_alloc(struct mm_struct *, unsigned 
long);
 extern void pte_fragment_free(unsigned long *, int);
 extern void pmd_fragment_free(unsigned long *);
 extern void pgtable_free_tlb(struct mmu_gather *tlb, void *table, int shift);
-#ifdef CONFIG_SMP
 extern void __tlb_remove_table(void *_table);
-#endif
 
 static inline pgd_t *radix__pgd_alloc(struct mm_struct *mm)
 {
diff --git a/arch/powerpc/i

[PATCH v3 3/6] asm-generic/tlb, arch: Invert CONFIG_HAVE_RCU_TABLE_INVALIDATE

2020-03-12 Thread Santosh Sivaraj
From: Peter Zijlstra 

commit 96bc9567cbe112e9320250f01b9c060c882e8619 upstream.

Make issuing a TLB invalidate for page-table pages the normal case.

The reason is twofold:

 - too many invalidates is safer than too few,
 - most architectures use the linux page-tables natively
   and would thus require this.

Make it an opt-out, instead of an opt-in.

No change in behavior intended.

Signed-off-by: Peter Zijlstra (Intel) 
Cc:  # 4.19
Signed-off-by: Santosh Sivaraj 
[santosh: prerequisite for upcoming tlbflush backports]
---
 arch/Kconfig | 2 +-
 arch/powerpc/Kconfig | 1 +
 arch/sparc/Kconfig   | 1 +
 arch/x86/Kconfig | 1 -
 mm/memory.c  | 2 +-
 5 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index a336548487e6..061a12b8140e 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -363,7 +363,7 @@ config HAVE_ARCH_JUMP_LABEL
 config HAVE_RCU_TABLE_FREE
bool
 
-config HAVE_RCU_TABLE_INVALIDATE
+config HAVE_RCU_TABLE_NO_INVALIDATE
bool
 
 config ARCH_HAVE_NMI_SAFE_CMPXCHG
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 6f475dc5829b..e09cfb109b8c 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -216,6 +216,7 @@ config PPC
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
select HAVE_RCU_TABLE_FREE  if SMP
+   select HAVE_RCU_TABLE_NO_INVALIDATE if HAVE_RCU_TABLE_FREE
select HAVE_REGS_AND_STACK_ACCESS_API
select HAVE_RELIABLE_STACKTRACE if PPC64 && CPU_LITTLE_ENDIAN
select HAVE_SYSCALL_TRACEPOINTS
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index e6f2a38d2e61..d90d632868aa 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -64,6 +64,7 @@ config SPARC64
select HAVE_KRETPROBES
select HAVE_KPROBES
select HAVE_RCU_TABLE_FREE if SMP
+   select HAVE_RCU_TABLE_NO_INVALIDATE if HAVE_RCU_TABLE_FREE
select HAVE_MEMBLOCK_NODE_MAP
select HAVE_ARCH_TRANSPARENT_HUGEPAGE
select HAVE_DYNAMIC_FTRACE
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index af35f5caadbe..181d0d522977 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -181,7 +181,6 @@ config X86
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
select HAVE_RCU_TABLE_FREE  if PARAVIRT
-   select HAVE_RCU_TABLE_INVALIDATEif HAVE_RCU_TABLE_FREE
select HAVE_REGS_AND_STACK_ACCESS_API
select HAVE_RELIABLE_STACKTRACE if X86_64 && 
(UNWINDER_FRAME_POINTER || UNWINDER_ORC) && STACK_VALIDATION
select HAVE_STACKPROTECTOR  if CC_HAS_SANE_STACKPROTECTOR
diff --git a/mm/memory.c b/mm/memory.c
index 1832c5ed6ac0..ba5689610c04 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -327,7 +327,7 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct 
page *page, int page_
  */
 static inline void tlb_table_invalidate(struct mmu_gather *tlb)
 {
-#ifdef CONFIG_HAVE_RCU_TABLE_INVALIDATE
+#ifndef CONFIG_HAVE_RCU_TABLE_NO_INVALIDATE
/*
 * Invalidate page-table caches used by hardware walkers. Then we still
 * need to RCU-sched wait while freeing the pages because software
-- 
2.24.1



[PATCH v3 2/6] asm-generic/tlb: Track which levels of the page tables have been cleared

2020-03-12 Thread Santosh Sivaraj
From: Will Deacon 

commit a6d60245d6d9b1caf66b0d94419988c4836980af upstream

It is common for architectures with hugepage support to require only a
single TLB invalidation operation per hugepage during unmap(), rather than
iterating through the mapping at a PAGE_SIZE increment. Currently,
however, the level in the page table where the unmap() operation occurs
is not stored in the mmu_gather structure, therefore forcing
architectures to issue additional TLB invalidation operations or to give
up and over-invalidate by e.g. invalidating the entire TLB.

Ideally, we could add an interval rbtree to the mmu_gather structure,
which would allow us to associate the correct mapping granule with the
various sub-mappings within the range being invalidated. However, this
is costly in terms of book-keeping and memory management, so instead we
approximate by keeping track of the page table levels that are cleared
and provide a means to query the smallest granule required for invalidation.
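
As an illustration of how the flush side consumes these bits, here is a small standalone C model of tlb_get_unmap_shift() (not kernel code; the shift values are only illustrative): the finest level that was cleared decides the invalidation granule.

#include <stdio.h>

#define PAGE_SHIFT 12
#define PMD_SHIFT  21
#define PUD_SHIFT  30
#define P4D_SHIFT  39

struct gather_model {
	unsigned int cleared_ptes : 1;
	unsigned int cleared_pmds : 1;
	unsigned int cleared_puds : 1;
	unsigned int cleared_p4ds : 1;
};

/* Mirrors tlb_get_unmap_shift(): the smallest cleared level wins. */
static unsigned long unmap_shift(const struct gather_model *g)
{
	if (g->cleared_ptes)
		return PAGE_SHIFT;
	if (g->cleared_pmds)
		return PMD_SHIFT;
	if (g->cleared_puds)
		return PUD_SHIFT;
	if (g->cleared_p4ds)
		return P4D_SHIFT;
	return PAGE_SHIFT;
}

int main(void)
{
	struct gather_model huge_unmap  = { .cleared_pmds = 1 };
	struct gather_model mixed_unmap = { .cleared_ptes = 1, .cleared_pmds = 1 };

	/* A PMD-sized hugepage unmap can use a single 2M-granule invalidate. */
	printf("huge unmap granule:  %lu bytes\n", 1UL << unmap_shift(&huge_unmap));
	/* Anything that also cleared PTEs falls back to page granularity. */
	printf("mixed unmap granule: %lu bytes\n", 1UL << unmap_shift(&mixed_unmap));
	return 0;
}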

Signed-off-by: Will Deacon 
Cc:  # 4.19
Signed-off-by: Santosh Sivaraj 
[santosh: prerequisite for upcoming tlbflush backports]
---
 include/asm-generic/tlb.h | 58 +--
 mm/memory.c   |  4 ++-
 2 files changed, 53 insertions(+), 9 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 97306b32d8d2..f2b9dc9cbaf8 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -114,6 +114,14 @@ struct mmu_gather {
 */
unsigned intfreed_tables : 1;
 
+   /*
+* at which levels have we cleared entries?
+*/
+   unsigned intcleared_ptes : 1;
+   unsigned intcleared_pmds : 1;
+   unsigned intcleared_puds : 1;
+   unsigned intcleared_p4ds : 1;
+
struct mmu_gather_batch *active;
struct mmu_gather_batch local;
struct page *__pages[MMU_GATHER_BUNDLE];
@@ -148,6 +156,10 @@ static inline void __tlb_reset_range(struct mmu_gather 
*tlb)
tlb->end = 0;
}
tlb->freed_tables = 0;
+   tlb->cleared_ptes = 0;
+   tlb->cleared_pmds = 0;
+   tlb->cleared_puds = 0;
+   tlb->cleared_p4ds = 0;
 }
 
 static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
@@ -197,6 +209,25 @@ static inline void 
tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 }
 #endif
 
+static inline unsigned long tlb_get_unmap_shift(struct mmu_gather *tlb)
+{
+   if (tlb->cleared_ptes)
+   return PAGE_SHIFT;
+   if (tlb->cleared_pmds)
+   return PMD_SHIFT;
+   if (tlb->cleared_puds)
+   return PUD_SHIFT;
+   if (tlb->cleared_p4ds)
+   return P4D_SHIFT;
+
+   return PAGE_SHIFT;
+}
+
+static inline unsigned long tlb_get_unmap_size(struct mmu_gather *tlb)
+{
+   return 1UL << tlb_get_unmap_shift(tlb);
+}
+
 /*
  * In the case of tlb vma handling, we can optimise these away in the
  * case where we're doing a full MM flush.  When we're doing a munmap,
@@ -230,13 +261,19 @@ static inline void 
tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 #define tlb_remove_tlb_entry(tlb, ptep, address)   \
do {\
__tlb_adjust_range(tlb, address, PAGE_SIZE);\
+   tlb->cleared_ptes = 1;  \
__tlb_remove_tlb_entry(tlb, ptep, address); \
} while (0)
 
-#define tlb_remove_huge_tlb_entry(h, tlb, ptep, address)\
-   do { \
-   __tlb_adjust_range(tlb, address, huge_page_size(h)); \
-   __tlb_remove_tlb_entry(tlb, ptep, address);  \
+#define tlb_remove_huge_tlb_entry(h, tlb, ptep, address)   \
+   do {\
+   unsigned long _sz = huge_page_size(h);  \
+   __tlb_adjust_range(tlb, address, _sz);  \
+   if (_sz == PMD_SIZE)\
+   tlb->cleared_pmds = 1;  \
+   else if (_sz == PUD_SIZE)   \
+   tlb->cleared_puds = 1;  \
+   __tlb_remove_tlb_entry(tlb, ptep, address); \
} while (0)
 
 /**
@@ -250,6 +287,7 @@ static inline void tlb_remove_check_page_size_change(struct 
mmu_gather *tlb,
 #define tlb_remove_pmd_tlb_entry(tlb, pmdp, address)   \
do {\
__tlb_adjust_range(tlb, address, HPAGE_PMD_SIZE);   \
+   tlb->cleared_pmds = 1;  \
__tlb_remove_pmd_tlb_entry(tlb, pmdp, address); \
} wh

[PATCH v3 1/6] asm-generic/tlb: Track freeing of page-table directories in struct mmu_gather

2020-03-12 Thread Santosh Sivaraj
From: Peter Zijlstra 

commit 22a61c3c4f1379ef8b0ce0d5cb78baf3178950e2 upstream

Some architectures require different TLB invalidation instructions
depending on whether it is only the last-level of page table being
changed, or whether there are also changes to the intermediate
(directory) entries higher up the tree.

Add a new bit to the flags bitfield in struct mmu_gather so that the
architecture code can operate accordingly if it's the intermediate
levels being invalidated.
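
A hypothetical sketch (plain userspace C, not taken from any real architecture) of how flush code can act on the new bit, picking a stronger invalidation only when directory levels were freed:

#include <stdio.h>

struct gather_model {
	unsigned int freed_tables : 1;	/* set by pte/pmd/pud/p4d_free_tlb() */
};

static void arch_flush(const struct gather_model *g)
{
	if (g->freed_tables)
		/* stand-in for a flush form that also drops cached page walks */
		printf("invalidate TLB and page-walk cache\n");
	else
		/* only last-level entries changed: the cheaper form suffices */
		printf("invalidate TLB entries only\n");
}

int main(void)
{
	struct gather_model leaf_only = { 0 };
	struct gather_model freed_dir = { .freed_tables = 1 };

	arch_flush(&leaf_only);
	arch_flush(&freed_dir);
	return 0;
}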

Signed-off-by: Peter Zijlstra 
Signed-off-by: Will Deacon 
Cc:  # 4.19
Signed-off-by: Santosh Sivaraj 
[santosh: prerequisite for tlbflush backports]
---
 include/asm-generic/tlb.h | 31 +++
 1 file changed, 23 insertions(+), 8 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index b3353e21f3b3..97306b32d8d2 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -97,12 +97,22 @@ struct mmu_gather {
 #endif
unsigned long   start;
unsigned long   end;
-   /* we are in the middle of an operation to clear
-* a full mm and can make some optimizations */
-   unsigned intfullmm : 1,
-   /* we have performed an operation which
-* requires a complete flush of the tlb */
-   need_flush_all : 1;
+   /*
+* we are in the middle of an operation to clear
+* a full mm and can make some optimizations
+*/
+   unsigned intfullmm : 1;
+
+   /*
+* we have performed an operation which
+* requires a complete flush of the tlb
+*/
+   unsigned intneed_flush_all : 1;
+
+   /*
+* we have removed page directories
+*/
+   unsigned intfreed_tables : 1;
 
struct mmu_gather_batch *active;
struct mmu_gather_batch local;
@@ -137,6 +147,7 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb)
tlb->start = TASK_SIZE;
tlb->end = 0;
}
+   tlb->freed_tables = 0;
 }
 
 static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
@@ -278,6 +289,7 @@ static inline void tlb_remove_check_page_size_change(struct 
mmu_gather *tlb,
 #define pte_free_tlb(tlb, ptep, address)   \
do {\
__tlb_adjust_range(tlb, address, PAGE_SIZE);\
+   tlb->freed_tables = 1;  \
__pte_free_tlb(tlb, ptep, address); \
} while (0)
 #endif
@@ -285,7 +297,8 @@ static inline void tlb_remove_check_page_size_change(struct 
mmu_gather *tlb,
 #ifndef pmd_free_tlb
 #define pmd_free_tlb(tlb, pmdp, address)   \
do {\
-   __tlb_adjust_range(tlb, address, PAGE_SIZE);\
+   __tlb_adjust_range(tlb, address, PAGE_SIZE);\
+   tlb->freed_tables = 1;  \
__pmd_free_tlb(tlb, pmdp, address); \
} while (0)
 #endif
@@ -295,6 +308,7 @@ static inline void tlb_remove_check_page_size_change(struct 
mmu_gather *tlb,
 #define pud_free_tlb(tlb, pudp, address)   \
do {\
__tlb_adjust_range(tlb, address, PAGE_SIZE);\
+   tlb->freed_tables = 1;  \
__pud_free_tlb(tlb, pudp, address); \
} while (0)
 #endif
@@ -304,7 +318,8 @@ static inline void tlb_remove_check_page_size_change(struct 
mmu_gather *tlb,
 #ifndef p4d_free_tlb
 #define p4d_free_tlb(tlb, pudp, address)   \
do {\
-   __tlb_adjust_range(tlb, address, PAGE_SIZE);\
+   __tlb_adjust_range(tlb, address, PAGE_SIZE);\
+   tlb->freed_tables = 1;  \
__p4d_free_tlb(tlb, pudp, address); \
} while (0)
 #endif
-- 
2.24.1



[PATCH v3 0/6] Memory corruption may occur due to incorrect tlb flush

2020-03-12 Thread Santosh Sivaraj
The TLB flush optimisation (a46cc7a90f: powerpc/mm/radix: Improve TLB/PWC
flushes) may result in random memory corruption. Any concurrent page-table walk
could end up with a Use-after-Free. Even on UP this might give issues, since
mmu_gather is preemptible these days. An interrupt or preempted task accessing
user pages might stumble into the free page if the hardware caches page
directories.

The series is a backport of the fix sent by Peter [1].

The first three patches are dependencies for the last patch (avoid potential
double flush). If the performance impact due to double flush is considered
trivial then the first three patches and last patch may be dropped.

This is only for v4.19 stable.

[1] https://patchwork.kernel.org/cover/11284843/

--
Changelog:
v2: Send the patches with the correct format (commit sha1 upstream) for stable
v3: Fix compilation issue on ppc40x_defconfig and ppc44x_defconfig

--
Aneesh Kumar K.V (1):
  powerpc/mmu_gather: enable RCU_TABLE_FREE even for !SMP case

Peter Zijlstra (4):
  asm-generic/tlb: Track freeing of page-table directories in struct
mmu_gather
  asm-generic/tlb, arch: Invert CONFIG_HAVE_RCU_TABLE_INVALIDATE
  mm/mmu_gather: invalidate TLB correctly on batch allocation failure
and flush
  asm-generic/tlb: avoid potential double flush

Will Deacon (1):
  asm-generic/tlb: Track which levels of the page tables have been
cleared

 arch/Kconfig |   3 -
 arch/powerpc/Kconfig |   2 +-
 arch/powerpc/include/asm/book3s/32/pgalloc.h |   8 --
 arch/powerpc/include/asm/book3s/64/pgalloc.h |   2 -
 arch/powerpc/include/asm/nohash/32/pgalloc.h |   8 --
 arch/powerpc/include/asm/nohash/64/pgalloc.h |   9 +-
 arch/powerpc/include/asm/tlb.h   |  11 ++
 arch/powerpc/mm/pgtable-book3s64.c   |   7 --
 arch/sparc/include/asm/tlb_64.h  |   9 ++
 arch/x86/Kconfig |   1 -
 include/asm-generic/tlb.h| 103 ---
 mm/memory.c  |  20 ++--
 12 files changed, 123 insertions(+), 60 deletions(-)

-- 
2.24.1



[PATCH v2 6/6] asm-generic/tlb: avoid potential double flush

2020-03-03 Thread Santosh Sivaraj
From: Peter Zijlstra 

commit 0758cd8304942292e95a0f750c374533db378b32 upstream.

Aneesh reported that:

tlb_flush_mmu()
  tlb_flush_mmu_tlbonly()
tlb_flush() <-- #1
  tlb_flush_mmu_free()
tlb_table_flush()
  tlb_table_invalidate()
tlb_flush_mmu_tlbonly()
  tlb_flush()   <-- #2

does two TLBIs when tlb->fullmm, because __tlb_reset_range() will not
clear tlb->end in that case.

Observe that any caller to __tlb_adjust_range() also sets at least one of
the tlb->freed_tables || tlb->cleared_p* bits, and those are
unconditionally cleared by __tlb_reset_range().

Change the condition for actually issuing TLBI to having one of those bits
set, as opposed to having tlb->end != 0.
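
A small standalone model of the change (not kernel code): the flush decision now keys off the freed_tables/cleared_* bits, which __tlb_reset_range() clears unconditionally, so the second tlb_flush_mmu_tlbonly() pass in the fullmm case becomes a no-op even though tlb->end stays non-zero.

#include <stdbool.h>
#include <stdio.h>

struct gather_model {
	unsigned long end;
	unsigned int freed_tables : 1;
	unsigned int cleared_ptes : 1;
	unsigned int cleared_pmds : 1;
	unsigned int cleared_puds : 1;
	unsigned int cleared_p4ds : 1;
};

/* New condition: any bit that __tlb_adjust_range() callers set. */
static bool needs_flush(const struct gather_model *g)
{
	return g->freed_tables || g->cleared_ptes || g->cleared_pmds ||
	       g->cleared_puds || g->cleared_p4ds;
}

/* Models __tlb_reset_range(): fullmm keeps end non-zero, but the
 * freed/cleared bits are always reset. */
static void reset_range(struct gather_model *g, bool fullmm)
{
	g->end = fullmm ? ~0UL : 0;
	g->freed_tables = 0;
	g->cleared_ptes = g->cleared_pmds = 0;
	g->cleared_puds = g->cleared_p4ds = 0;
}

int main(void)
{
	struct gather_model g = { .end = ~0UL, .cleared_ptes = 1 };

	printf("first pass flushes:  %d\n", needs_flush(&g));	/* 1 */
	reset_range(&g, true);					/* fullmm teardown */
	printf("second pass flushes: %d\n", needs_flush(&g));	/* 0: no double TLBI */
	return 0;
}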

Link: 
http://lkml.kernel.org/r/20200116064531.483522-4-aneesh.ku...@linux.ibm.com
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Aneesh Kumar K.V 
Reported-by: "Aneesh Kumar K.V" 
Cc:   # 4.19
Signed-off-by: Santosh Sivaraj 
[santosh: backported to 4.19 stable]
---
 include/asm-generic/tlb.h | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 19934cdd143e..427a70c56ddd 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -179,7 +179,12 @@ static inline void __tlb_reset_range(struct mmu_gather 
*tlb)
 
 static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
 {
-   if (!tlb->end)
+   /*
+* Anything calling __tlb_adjust_range() also sets at least one of
+* these bits.
+*/
+   if (!(tlb->freed_tables || tlb->cleared_ptes || tlb->cleared_pmds ||
+ tlb->cleared_puds || tlb->cleared_p4ds))
return;
 
tlb_flush(tlb);
-- 
2.24.1



[PATCH v2 5/6] mm/mmu_gather: invalidate TLB correctly on batch allocation failure and flush

2020-03-03 Thread Santosh Sivaraj
From: Peter Zijlstra 

commit 0ed1325967ab5f7a4549a2641c6ebe115f76e228 upstream.

Architectures for which we have hardware walkers of Linux page table
should flush TLB on mmu gather batch allocation failures and batch flush.
Some architectures like POWER support multiple translation modes (hash
and radix) and in the case of POWER only radix translation mode needs the
above TLBI.  This is because for hash translation mode kernel wants to
avoid this extra flush since there are no hardware walkers of linux page
table.  With radix translation, the hardware also walks linux page table
and with that, kernel needs to make sure to TLB invalidate page walk cache
before page table pages are freed.

More details in commit d86564a2f085 ("mm/tlb, x86/mm: Support invalidating
TLB caches for RCU_TABLE_FREE")

The changes to sparc are to make sure we keep the old behavior since we
are now removing HAVE_RCU_TABLE_NO_INVALIDATE.  The default value for
tlb_needs_table_invalidate is to always force an invalidate and sparc can
avoid the table invalidate.  Hence we define tlb_needs_table_invalidate to
false for sparc architecture.
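
For illustration, a userspace model of the resulting hook (the radix_active flag below is only a stand-in; on powerpc the override is the radix_enabled() check shown in the diff, and sparc64 simply defines it to false):

#include <stdbool.h>
#include <stdio.h>

static bool radix_active = true;	/* stand-in for powerpc's radix_enabled() */

/* Architecture override: invalidate only when hardware walks the
 * Linux page-tables (radix), not for hash. */
#define tlb_needs_table_invalidate()	(radix_active)

/* Models the generic tlb_table_invalidate() path. */
static void table_invalidate(void)
{
	if (tlb_needs_table_invalidate())
		printf("TLBI before the page-table page is freed\n");
	else
		printf("no TLBI: no hardware walker of the Linux page-tables\n");
}

int main(void)
{
	table_invalidate();	/* radix: invalidate */
	radix_active = false;
	table_invalidate();	/* hash: skip */
	return 0;
}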

Link: 
http://lkml.kernel.org/r/20200116064531.483522-3-aneesh.ku...@linux.ibm.com
Fixes: a46cc7a90fd8 ("powerpc/mm/radix: Improve TLB/PWC flushes")
Signed-off-by: Peter Zijlstra (Intel) 
Cc:   # 4.19
Signed-off-by: Santosh Sivaraj 
[santosh: backported to 4.19 stable]
---
 arch/Kconfig|  3 ---
 arch/powerpc/Kconfig|  1 -
 arch/powerpc/include/asm/tlb.h  | 11 +++
 arch/sparc/Kconfig  |  1 -
 arch/sparc/include/asm/tlb_64.h |  9 +
 include/asm-generic/tlb.h   | 15 +++
 mm/memory.c | 16 
 7 files changed, 43 insertions(+), 13 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 061a12b8140e..3abbdb0cea44 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -363,9 +363,6 @@ config HAVE_ARCH_JUMP_LABEL
 config HAVE_RCU_TABLE_FREE
bool
 
-config HAVE_RCU_TABLE_NO_INVALIDATE
-   bool
-
 config ARCH_HAVE_NMI_SAFE_CMPXCHG
bool
 
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 1a00ce4b0040..e5bc0cfea2b1 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -216,7 +216,6 @@ config PPC
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
select HAVE_RCU_TABLE_FREE
-   select HAVE_RCU_TABLE_NO_INVALIDATE if HAVE_RCU_TABLE_FREE
select HAVE_REGS_AND_STACK_ACCESS_API
select HAVE_RELIABLE_STACKTRACE if PPC64 && CPU_LITTLE_ENDIAN
select HAVE_SYSCALL_TRACEPOINTS
diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
index f0e571b2dc7c..63418275f402 100644
--- a/arch/powerpc/include/asm/tlb.h
+++ b/arch/powerpc/include/asm/tlb.h
@@ -30,6 +30,17 @@
 #define tlb_remove_check_page_size_change tlb_remove_check_page_size_change
 
 extern void tlb_flush(struct mmu_gather *tlb);
+/*
+ * book3s:
+ * Hash does not use the linux page-tables, so we can avoid
+ * the TLB invalidate for page-table freeing, Radix otoh does use the
+ * page-tables and needs the TLBI.
+ *
+ * nohash:
+ * We still do TLB invalidate in the __pte_free_tlb routine before we
+ * add the page table pages to mmu gather table batch.
+ */
+#define tlb_needs_table_invalidate()   radix_enabled()
 
 /* Get the generic bits... */
 #include 
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index d90d632868aa..e6f2a38d2e61 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -64,7 +64,6 @@ config SPARC64
select HAVE_KRETPROBES
select HAVE_KPROBES
select HAVE_RCU_TABLE_FREE if SMP
-   select HAVE_RCU_TABLE_NO_INVALIDATE if HAVE_RCU_TABLE_FREE
select HAVE_MEMBLOCK_NODE_MAP
select HAVE_ARCH_TRANSPARENT_HUGEPAGE
select HAVE_DYNAMIC_FTRACE
diff --git a/arch/sparc/include/asm/tlb_64.h b/arch/sparc/include/asm/tlb_64.h
index a2f3fa61ee36..8cb8f3833239 100644
--- a/arch/sparc/include/asm/tlb_64.h
+++ b/arch/sparc/include/asm/tlb_64.h
@@ -28,6 +28,15 @@ void flush_tlb_pending(void);
 #define __tlb_remove_tlb_entry(tlb, ptep, address) do { } while (0)
 #define tlb_flush(tlb) flush_tlb_pending()
 
+/*
+ * SPARC64's hardware TLB fill does not use the Linux page-tables
+ * and therefore we don't need a TLBI when freeing page-table pages.
+ */
+
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+#define tlb_needs_table_invalidate()   (false)
+#endif
+
 #include 
 
 #endif /* _SPARC64_TLB_H */
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index f2b9dc9cbaf8..19934cdd143e 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -61,8 +61,23 @@ struct mmu_table_batch {
 extern void tlb_table_flush(struct mmu_gather *tlb);
 extern void tlb_remove_table(struct mmu_gather *tlb, void *table);
 
+/*
+ * This allows an architecture that does not use the linux page-tables for
+ * hardware to skip the TLBI when freein

[PATCH v2 4/6] powerpc/mmu_gather: enable RCU_TABLE_FREE even for !SMP case

2020-03-03 Thread Santosh Sivaraj
From: "Aneesh Kumar K.V" 

commit 12e4d53f3f04e81f9e83d6fc10edc7314ab9f6b9 upstream.

Patch series "Fixup page directory freeing", v4.

This is a repost of patch series from Peter with the arch specific changes
except ppc64 dropped.  ppc64 changes are added here because we are redoing
the patch series on top of ppc64 changes.  This makes it easy to backport
these changes.  Only the first 2 patches need to be backported to stable.

The thing is, on anything SMP, freeing page directories should observe the
exact same order as normal page freeing:

 1) unhook page/directory
 2) TLB invalidate
 3) free page/directory

Without this, any concurrent page-table walk could end up with a
Use-after-Free.  This is esp.  trivial for anything that has software
page-table walkers (HAVE_FAST_GUP / software TLB fill) or the hardware
caches partial page-walks (ie.  caches page directories).

Even on UP this might give issues since mmu_gather is preemptible these
days.  An interrupt or preempted task accessing user pages might stumble
into the free page if the hardware caches page directories.
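
To make the required ordering concrete, a trivial userspace illustration (printf only, nothing architecture specific); with RCU_TABLE_FREE the final step is additionally deferred past a grace period so a concurrent or interrupted walker never touches a freed table:

#include <stdio.h>

static void unhook_directory(void)
{
	printf("1) clear the parent entry pointing at the table\n");
}

static void invalidate_walk_caches(void)
{
	printf("2) TLB / page-walk-cache invalidate\n");
}

static void free_directory(void)
{
	printf("3) free the table page (deferred past a grace period)\n");
}

int main(void)
{
	/* Any other order risks a use-after-free by a concurrent walker. */
	unhook_directory();
	invalidate_walk_caches();
	free_directory();
	return 0;
}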

This patch series fixes ppc64 and adds generic MMU_GATHER changes to
support the conversion of other architectures.  I haven't added patches
w.r.t. other architectures because they are yet to be acked.

This patch (of 9):

A followup patch is going to make sure we correctly invalidate page walk
cache before we free page table pages.  In order to keep things simple,
enable RCU_TABLE_FREE even for !SMP so that we don't have to fix up the
!SMP case differently in the followup patch.

!SMP case is right now broken for radix translation w.r.t page walk
cache flush.  We can get interrupted in between page table free and
that would imply we have page walk cache entries pointing to tables
which got freed already.  Michael said "both our platforms that run on
Power9 force SMP on in Kconfig, so the !SMP case is unlikely to be a
problem for anyone in practice, unless they've hacked their kernel to
build it !SMP."

Link: 
http://lkml.kernel.org/r/20200116064531.483522-2-aneesh.ku...@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V 
Cc:  # 4.19
Signed-off-by: Santosh Sivaraj 
[santosh: backported for 4.19 stable]
---
 arch/powerpc/Kconfig | 2 +-
 arch/powerpc/include/asm/book3s/32/pgalloc.h | 8 
 arch/powerpc/include/asm/book3s/64/pgalloc.h | 2 --
 arch/powerpc/mm/pgtable-book3s64.c   | 7 ---
 4 files changed, 1 insertion(+), 18 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index e09cfb109b8c..1a00ce4b0040 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -215,7 +215,7 @@ config PPC
select HAVE_HARDLOCKUP_DETECTOR_PERFif PERF_EVENTS && 
HAVE_PERF_EVENTS_NMI && !HAVE_HARDLOCKUP_DETECTOR_ARCH
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
-   select HAVE_RCU_TABLE_FREE  if SMP
+   select HAVE_RCU_TABLE_FREE
select HAVE_RCU_TABLE_NO_INVALIDATE if HAVE_RCU_TABLE_FREE
select HAVE_REGS_AND_STACK_ACCESS_API
select HAVE_RELIABLE_STACKTRACE if PPC64 && CPU_LITTLE_ENDIAN
diff --git a/arch/powerpc/include/asm/book3s/32/pgalloc.h 
b/arch/powerpc/include/asm/book3s/32/pgalloc.h
index 82e44b1a00ae..79ba3fbb512e 100644
--- a/arch/powerpc/include/asm/book3s/32/pgalloc.h
+++ b/arch/powerpc/include/asm/book3s/32/pgalloc.h
@@ -110,7 +110,6 @@ static inline void pgtable_free(void *table, unsigned 
index_size)
 #define check_pgt_cache()  do { } while (0)
 #define get_hugepd_cache_index(x)  (x)
 
-#ifdef CONFIG_SMP
 static inline void pgtable_free_tlb(struct mmu_gather *tlb,
void *table, int shift)
 {
@@ -127,13 +126,6 @@ static inline void __tlb_remove_table(void *_table)
 
pgtable_free(table, shift);
 }
-#else
-static inline void pgtable_free_tlb(struct mmu_gather *tlb,
-   void *table, int shift)
-{
-   pgtable_free(table, shift);
-}
-#endif
 
 static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t table,
  unsigned long address)
diff --git a/arch/powerpc/include/asm/book3s/64/pgalloc.h 
b/arch/powerpc/include/asm/book3s/64/pgalloc.h
index f9019b579903..1013c0214213 100644
--- a/arch/powerpc/include/asm/book3s/64/pgalloc.h
+++ b/arch/powerpc/include/asm/book3s/64/pgalloc.h
@@ -47,9 +47,7 @@ extern pmd_t *pmd_fragment_alloc(struct mm_struct *, unsigned 
long);
 extern void pte_fragment_free(unsigned long *, int);
 extern void pmd_fragment_free(unsigned long *);
 extern void pgtable_free_tlb(struct mmu_gather *tlb, void *table, int shift);
-#ifdef CONFIG_SMP
 extern void __tlb_remove_table(void *_table);
-#endif
 
 static inline pgd_t *radix__pgd_alloc(struct mm_struct *mm)
 {
diff --git a/arch/powerpc/mm/pgtable-book3s64.c 
b/arch/powerpc/mm/pgtable-book3s64.c
index 297db665d953..5b4e9fd8990c 100644
--- a/arch/power

[PATCH v2 3/6] asm-generic/tlb, arch: Invert CONFIG_HAVE_RCU_TABLE_INVALIDATE

2020-03-03 Thread Santosh Sivaraj
From: Peter Zijlstra 

commit 96bc9567cbe112e9320250f01b9c060c882e8619 upstream.

Make issuing a TLB invalidate for page-table pages the normal case.

The reason is twofold:

 - too many invalidates is safer than too few,
 - most architectures use the linux page-tables natively
   and would thus require this.

Make it an opt-out, instead of an opt-in.

No change in behavior intended.

Signed-off-by: Peter Zijlstra (Intel) 
Cc:  # 4.19
Signed-off-by: Santosh Sivaraj 
[santosh: prerequisite for upcoming tlbflush backports]
---
 arch/Kconfig | 2 +-
 arch/powerpc/Kconfig | 1 +
 arch/sparc/Kconfig   | 1 +
 arch/x86/Kconfig | 1 -
 mm/memory.c  | 2 +-
 5 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index a336548487e6..061a12b8140e 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -363,7 +363,7 @@ config HAVE_ARCH_JUMP_LABEL
 config HAVE_RCU_TABLE_FREE
bool
 
-config HAVE_RCU_TABLE_INVALIDATE
+config HAVE_RCU_TABLE_NO_INVALIDATE
bool
 
 config ARCH_HAVE_NMI_SAFE_CMPXCHG
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 6f475dc5829b..e09cfb109b8c 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -216,6 +216,7 @@ config PPC
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
select HAVE_RCU_TABLE_FREE  if SMP
+   select HAVE_RCU_TABLE_NO_INVALIDATE if HAVE_RCU_TABLE_FREE
select HAVE_REGS_AND_STACK_ACCESS_API
select HAVE_RELIABLE_STACKTRACE if PPC64 && CPU_LITTLE_ENDIAN
select HAVE_SYSCALL_TRACEPOINTS
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index e6f2a38d2e61..d90d632868aa 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -64,6 +64,7 @@ config SPARC64
select HAVE_KRETPROBES
select HAVE_KPROBES
select HAVE_RCU_TABLE_FREE if SMP
+   select HAVE_RCU_TABLE_NO_INVALIDATE if HAVE_RCU_TABLE_FREE
select HAVE_MEMBLOCK_NODE_MAP
select HAVE_ARCH_TRANSPARENT_HUGEPAGE
select HAVE_DYNAMIC_FTRACE
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index af35f5caadbe..181d0d522977 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -181,7 +181,6 @@ config X86
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
select HAVE_RCU_TABLE_FREE  if PARAVIRT
-   select HAVE_RCU_TABLE_INVALIDATEif HAVE_RCU_TABLE_FREE
select HAVE_REGS_AND_STACK_ACCESS_API
select HAVE_RELIABLE_STACKTRACE if X86_64 && 
(UNWINDER_FRAME_POINTER || UNWINDER_ORC) && STACK_VALIDATION
select HAVE_STACKPROTECTOR  if CC_HAS_SANE_STACKPROTECTOR
diff --git a/mm/memory.c b/mm/memory.c
index 1832c5ed6ac0..ba5689610c04 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -327,7 +327,7 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct 
page *page, int page_
  */
 static inline void tlb_table_invalidate(struct mmu_gather *tlb)
 {
-#ifdef CONFIG_HAVE_RCU_TABLE_INVALIDATE
+#ifndef CONFIG_HAVE_RCU_TABLE_NO_INVALIDATE
/*
 * Invalidate page-table caches used by hardware walkers. Then we still
 * need to RCU-sched wait while freeing the pages because software
-- 
2.24.1



[PATCH v2 2/6] asm-generic/tlb: Track which levels of the page tables have been cleared

2020-03-03 Thread Santosh Sivaraj
From: Will Deacon 

commit a6d60245d6d9b1caf66b0d94419988c4836980af upstream

It is common for architectures with hugepage support to require only a
single TLB invalidation operation per hugepage during unmap(), rather than
iterating through the mapping at a PAGE_SIZE increment. Currently,
however, the level in the page table where the unmap() operation occurs
is not stored in the mmu_gather structure, therefore forcing
architectures to issue additional TLB invalidation operations or to give
up and over-invalidate by e.g. invalidating the entire TLB.

Ideally, we could add an interval rbtree to the mmu_gather structure,
which would allow us to associate the correct mapping granule with the
various sub-mappings within the range being invalidated. However, this
is costly in terms of book-keeping and memory management, so instead we
approximate by keeping track of the page table levels that are cleared
and provide a means to query the smallest granule required for invalidation.

Signed-off-by: Will Deacon 
Cc:  # 4.19
Signed-off-by: Santosh Sivaraj 
[santosh: prerequisite for upcoming tlbflush backports]
---
 include/asm-generic/tlb.h | 58 +--
 mm/memory.c   |  4 ++-
 2 files changed, 53 insertions(+), 9 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 97306b32d8d2..f2b9dc9cbaf8 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -114,6 +114,14 @@ struct mmu_gather {
 */
unsigned intfreed_tables : 1;
 
+   /*
+* at which levels have we cleared entries?
+*/
+   unsigned intcleared_ptes : 1;
+   unsigned intcleared_pmds : 1;
+   unsigned intcleared_puds : 1;
+   unsigned intcleared_p4ds : 1;
+
struct mmu_gather_batch *active;
struct mmu_gather_batch local;
struct page *__pages[MMU_GATHER_BUNDLE];
@@ -148,6 +156,10 @@ static inline void __tlb_reset_range(struct mmu_gather 
*tlb)
tlb->end = 0;
}
tlb->freed_tables = 0;
+   tlb->cleared_ptes = 0;
+   tlb->cleared_pmds = 0;
+   tlb->cleared_puds = 0;
+   tlb->cleared_p4ds = 0;
 }
 
 static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
@@ -197,6 +209,25 @@ static inline void 
tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 }
 #endif
 
+static inline unsigned long tlb_get_unmap_shift(struct mmu_gather *tlb)
+{
+   if (tlb->cleared_ptes)
+   return PAGE_SHIFT;
+   if (tlb->cleared_pmds)
+   return PMD_SHIFT;
+   if (tlb->cleared_puds)
+   return PUD_SHIFT;
+   if (tlb->cleared_p4ds)
+   return P4D_SHIFT;
+
+   return PAGE_SHIFT;
+}
+
+static inline unsigned long tlb_get_unmap_size(struct mmu_gather *tlb)
+{
+   return 1UL << tlb_get_unmap_shift(tlb);
+}
+
 /*
  * In the case of tlb vma handling, we can optimise these away in the
  * case where we're doing a full MM flush.  When we're doing a munmap,
@@ -230,13 +261,19 @@ static inline void 
tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 #define tlb_remove_tlb_entry(tlb, ptep, address)   \
do {\
__tlb_adjust_range(tlb, address, PAGE_SIZE);\
+   tlb->cleared_ptes = 1;  \
__tlb_remove_tlb_entry(tlb, ptep, address); \
} while (0)
 
-#define tlb_remove_huge_tlb_entry(h, tlb, ptep, address)\
-   do { \
-   __tlb_adjust_range(tlb, address, huge_page_size(h)); \
-   __tlb_remove_tlb_entry(tlb, ptep, address);  \
+#define tlb_remove_huge_tlb_entry(h, tlb, ptep, address)   \
+   do {\
+   unsigned long _sz = huge_page_size(h);  \
+   __tlb_adjust_range(tlb, address, _sz);  \
+   if (_sz == PMD_SIZE)\
+   tlb->cleared_pmds = 1;  \
+   else if (_sz == PUD_SIZE)   \
+   tlb->cleared_puds = 1;  \
+   __tlb_remove_tlb_entry(tlb, ptep, address); \
} while (0)
 
 /**
@@ -250,6 +287,7 @@ static inline void tlb_remove_check_page_size_change(struct 
mmu_gather *tlb,
 #define tlb_remove_pmd_tlb_entry(tlb, pmdp, address)   \
do {\
__tlb_adjust_range(tlb, address, HPAGE_PMD_SIZE);   \
+   tlb->cleared_pmds = 1;  \
__tlb_remove_pmd_tlb_entry(tlb, pmdp, address); \
} wh

[PATCH v2 1/6] asm-generic/tlb: Track freeing of page-table directories in struct mmu_gather

2020-03-03 Thread Santosh Sivaraj
From: Peter Zijlstra 

commit 22a61c3c4f1379ef8b0ce0d5cb78baf3178950e2 upstream

Some architectures require different TLB invalidation instructions
depending on whether it is only the last-level of page table being
changed, or whether there are also changes to the intermediate
(directory) entries higher up the tree.

Add a new bit to the flags bitfield in struct mmu_gather so that the
architecture code can operate accordingly if it's the intermediate
levels being invalidated.

Signed-off-by: Peter Zijlstra 
Signed-off-by: Will Deacon 
Cc:  # 4.19
Signed-off-by: Santosh Sivaraj 
[santosh: prerequisite for tlbflush backports]
---
 include/asm-generic/tlb.h | 31 +++
 1 file changed, 23 insertions(+), 8 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index b3353e21f3b3..97306b32d8d2 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -97,12 +97,22 @@ struct mmu_gather {
 #endif
unsigned long   start;
unsigned long   end;
-   /* we are in the middle of an operation to clear
-* a full mm and can make some optimizations */
-   unsigned intfullmm : 1,
-   /* we have performed an operation which
-* requires a complete flush of the tlb */
-   need_flush_all : 1;
+   /*
+* we are in the middle of an operation to clear
+* a full mm and can make some optimizations
+*/
+   unsigned intfullmm : 1;
+
+   /*
+* we have performed an operation which
+* requires a complete flush of the tlb
+*/
+   unsigned intneed_flush_all : 1;
+
+   /*
+* we have removed page directories
+*/
+   unsigned intfreed_tables : 1;
 
struct mmu_gather_batch *active;
struct mmu_gather_batch local;
@@ -137,6 +147,7 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb)
tlb->start = TASK_SIZE;
tlb->end = 0;
}
+   tlb->freed_tables = 0;
 }
 
 static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
@@ -278,6 +289,7 @@ static inline void tlb_remove_check_page_size_change(struct 
mmu_gather *tlb,
 #define pte_free_tlb(tlb, ptep, address)   \
do {\
__tlb_adjust_range(tlb, address, PAGE_SIZE);\
+   tlb->freed_tables = 1;  \
__pte_free_tlb(tlb, ptep, address); \
} while (0)
 #endif
@@ -285,7 +297,8 @@ static inline void tlb_remove_check_page_size_change(struct 
mmu_gather *tlb,
 #ifndef pmd_free_tlb
 #define pmd_free_tlb(tlb, pmdp, address)   \
do {\
-   __tlb_adjust_range(tlb, address, PAGE_SIZE);\
+   __tlb_adjust_range(tlb, address, PAGE_SIZE);\
+   tlb->freed_tables = 1;  \
__pmd_free_tlb(tlb, pmdp, address); \
} while (0)
 #endif
@@ -295,6 +308,7 @@ static inline void tlb_remove_check_page_size_change(struct 
mmu_gather *tlb,
 #define pud_free_tlb(tlb, pudp, address)   \
do {\
__tlb_adjust_range(tlb, address, PAGE_SIZE);\
+   tlb->freed_tables = 1;  \
__pud_free_tlb(tlb, pudp, address); \
} while (0)
 #endif
@@ -304,7 +318,8 @@ static inline void tlb_remove_check_page_size_change(struct 
mmu_gather *tlb,
 #ifndef p4d_free_tlb
 #define p4d_free_tlb(tlb, pudp, address)   \
do {\
-   __tlb_adjust_range(tlb, address, PAGE_SIZE);\
+   __tlb_adjust_range(tlb, address, PAGE_SIZE);\
+   tlb->freed_tables = 1;  \
__p4d_free_tlb(tlb, pudp, address); \
} while (0)
 #endif
-- 
2.24.1



[PATCH v2 0/6] Memory corruption may occur due to incorrect tlb flush

2020-03-03 Thread Santosh Sivaraj
The TLB flush optimisation (a46cc7a90f: powerpc/mm/radix: Improve TLB/PWC
flushes) may result in random memory corruption. Any concurrent page-table walk
could end up with a Use-after-Free. Even on UP this might give issues, since
mmu_gather is preemptible these days. An interrupt or preempted task accessing
user pages might stumble into the free page if the hardware caches page
directories.

The series is a backport of the fix sent by Peter [1].

The first three patches are dependencies for the last patch (avoid potential
double flush). If the performance impact due to double flush is considered
trivial then the first three patches and last patch may be dropped.

This is only for v4.19 stable.

Changelog:
* Send the patches with the correct format (commit sha1 upstream) for stable

--
Aneesh Kumar K.V (1):
  powerpc/mmu_gather: enable RCU_TABLE_FREE even for !SMP case

Peter Zijlstra (4):
  asm-generic/tlb: Track freeing of page-table directories in struct
mmu_gather
  asm-generic/tlb, arch: Invert CONFIG_HAVE_RCU_TABLE_INVALIDATE
  mm/mmu_gather: invalidate TLB correctly on batch allocation failure
and flush
  asm-generic/tlb: avoid potential double flush

Will Deacon (1):
  asm-generic/tlb: Track which levels of the page tables have been
cleared

 arch/Kconfig |   3 -
 arch/powerpc/Kconfig |   2 +-
 arch/powerpc/include/asm/book3s/32/pgalloc.h |   8 --
 arch/powerpc/include/asm/book3s/64/pgalloc.h |   2 -
 arch/powerpc/include/asm/tlb.h   |  11 ++
 arch/powerpc/mm/pgtable-book3s64.c   |   7 --
 arch/sparc/include/asm/tlb_64.h  |   9 ++
 arch/x86/Kconfig |   1 -
 include/asm-generic/tlb.h| 103 ---
 mm/memory.c  |  20 ++--
 10 files changed, 122 insertions(+), 44 deletions(-)

-- 
2.24.1



[PATCH 5/6] mm/mmu_gather: invalidate TLB correctly on batch allocation failure and flush

2020-02-20 Thread Santosh Sivaraj
From: Peter Zijlstra 

Architectures for which we have hardware walkers of Linux page table
should flush TLB on mmu gather batch allocation failures and batch flush.
Some architectures like POWER support multiple translation modes (hash
and radix) and in the case of POWER only radix translation mode needs the
above TLBI.  This is because for hash translation mode kernel wants to
avoid this extra flush since there are no hardware walkers of linux page
table.  With radix translation, the hardware also walks linux page table
and with that, kernel needs to make sure to TLB invalidate page walk cache
before page table pages are freed.

More details in commit d86564a2f085 ("mm/tlb, x86/mm: Support invalidating
TLB caches for RCU_TABLE_FREE")

The changes to sparc are to make sure we keep the old behavior since we
are now removing HAVE_RCU_TABLE_NO_INVALIDATE.  The default value for
tlb_needs_table_invalidate is to always force an invalidate and sparc can
avoid the table invalidate.  Hence we define tlb_needs_table_invalidate to
false for sparc architecture.

0ed1325967ab5f in upstream.

Link: 
http://lkml.kernel.org/r/20200116064531.483522-3-aneesh.ku...@linux.ibm.com
Fixes: a46cc7a90fd8 ("powerpc/mm/radix: Improve TLB/PWC flushes")
Signed-off-by: Peter Zijlstra (Intel) 
Cc:   # 4.19
Signed-off-by: Santosh Sivaraj 
[santosh: backported to 4.19 stable]
---
 arch/Kconfig|  3 ---
 arch/powerpc/Kconfig|  1 -
 arch/powerpc/include/asm/tlb.h  | 11 +++
 arch/sparc/Kconfig  |  1 -
 arch/sparc/include/asm/tlb_64.h |  9 +
 include/asm-generic/tlb.h   | 15 +++
 mm/memory.c | 16 
 7 files changed, 43 insertions(+), 13 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 061a12b8140e..3abbdb0cea44 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -363,9 +363,6 @@ config HAVE_ARCH_JUMP_LABEL
 config HAVE_RCU_TABLE_FREE
bool
 
-config HAVE_RCU_TABLE_NO_INVALIDATE
-   bool
-
 config ARCH_HAVE_NMI_SAFE_CMPXCHG
bool
 
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index fa231130eee1..b6429f53835e 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -216,7 +216,6 @@ config PPC
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
select HAVE_RCU_TABLE_FREE
-   select HAVE_RCU_TABLE_NO_INVALIDATE if HAVE_RCU_TABLE_FREE
select HAVE_REGS_AND_STACK_ACCESS_API
select HAVE_RELIABLE_STACKTRACE if PPC64 && CPU_LITTLE_ENDIAN
select HAVE_SYSCALL_TRACEPOINTS
diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
index f0e571b2dc7c..63418275f402 100644
--- a/arch/powerpc/include/asm/tlb.h
+++ b/arch/powerpc/include/asm/tlb.h
@@ -30,6 +30,17 @@
 #define tlb_remove_check_page_size_change tlb_remove_check_page_size_change
 
 extern void tlb_flush(struct mmu_gather *tlb);
+/*
+ * book3s:
+ * Hash does not use the linux page-tables, so we can avoid
+ * the TLB invalidate for page-table freeing, Radix otoh does use the
+ * page-tables and needs the TLBI.
+ *
+ * nohash:
+ * We still do TLB invalidate in the __pte_free_tlb routine before we
+ * add the page table pages to mmu gather table batch.
+ */
+#define tlb_needs_table_invalidate()   radix_enabled()
 
 /* Get the generic bits... */
 #include 
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index d90d632868aa..e6f2a38d2e61 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -64,7 +64,6 @@ config SPARC64
select HAVE_KRETPROBES
select HAVE_KPROBES
select HAVE_RCU_TABLE_FREE if SMP
-   select HAVE_RCU_TABLE_NO_INVALIDATE if HAVE_RCU_TABLE_FREE
select HAVE_MEMBLOCK_NODE_MAP
select HAVE_ARCH_TRANSPARENT_HUGEPAGE
select HAVE_DYNAMIC_FTRACE
diff --git a/arch/sparc/include/asm/tlb_64.h b/arch/sparc/include/asm/tlb_64.h
index a2f3fa61ee36..8cb8f3833239 100644
--- a/arch/sparc/include/asm/tlb_64.h
+++ b/arch/sparc/include/asm/tlb_64.h
@@ -28,6 +28,15 @@ void flush_tlb_pending(void);
 #define __tlb_remove_tlb_entry(tlb, ptep, address) do { } while (0)
 #define tlb_flush(tlb) flush_tlb_pending()
 
+/*
+ * SPARC64's hardware TLB fill does not use the Linux page-tables
+ * and therefore we don't need a TLBI when freeing page-table pages.
+ */
+
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+#define tlb_needs_table_invalidate()   (false)
+#endif
+
 #include 
 
 #endif /* _SPARC64_TLB_H */
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index f2b9dc9cbaf8..19934cdd143e 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -61,8 +61,23 @@ struct mmu_table_batch {
 extern void tlb_table_flush(struct mmu_gather *tlb);
 extern void tlb_remove_table(struct mmu_gather *tlb, void *table);
 
+/*
+ * This allows an architecture that does not use the linux page-tables for
+ * hardware to skip the TLBI when freeing page tables.
+ */
+#ifndef tlb_nee

[PATCH 6/6] asm-generic/tlb: avoid potential double flush

2020-02-20 Thread Santosh Sivaraj
From: Peter Zijlstra 

Aneesh reported that:

tlb_flush_mmu()
  tlb_flush_mmu_tlbonly()
tlb_flush() <-- #1
  tlb_flush_mmu_free()
tlb_table_flush()
  tlb_table_invalidate()
tlb_flush_mmu_tlbonly()
  tlb_flush()   <-- #2

does two TLBIs when tlb->fullmm, because __tlb_reset_range() will not
clear tlb->end in that case.

Observe that any caller to __tlb_adjust_range() also sets at least one of
the tlb->freed_tables || tlb->cleared_p* bits, and those are
unconditionally cleared by __tlb_reset_range().

Change the condition for actually issuing TLBI to having one of those bits
set, as opposed to having tlb->end != 0.

0758cd830494 in upstream.

Link: 
http://lkml.kernel.org/r/20200116064531.483522-4-aneesh.ku...@linux.ibm.com
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Aneesh Kumar K.V 
Reported-by: "Aneesh Kumar K.V" 
Cc:   # 4.19
Signed-off-by: Santosh Sivaraj 
[santosh: backported to 4.19 stable]
---
 include/asm-generic/tlb.h | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 19934cdd143e..427a70c56ddd 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -179,7 +179,12 @@ static inline void __tlb_reset_range(struct mmu_gather 
*tlb)
 
 static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
 {
-   if (!tlb->end)
+   /*
+* Anything calling __tlb_adjust_range() also sets at least one of
+* these bits.
+*/
+   if (!(tlb->freed_tables || tlb->cleared_ptes || tlb->cleared_pmds ||
+ tlb->cleared_puds || tlb->cleared_p4ds))
return;
 
tlb_flush(tlb);
-- 
2.24.1



[PATCH 4/6] powerpc/mmu_gather: enable RCU_TABLE_FREE even for !SMP case

2020-02-20 Thread Santosh Sivaraj
From: "Aneesh Kumar K.V" 

Patch series "Fixup page directory freeing", v4.

This is a repost of patch series from Peter with the arch specific changes
except ppc64 dropped.  ppc64 changes are added here because we are redoing
the patch series on top of ppc64 changes.  This makes it easy to backport
these changes.  Only the first 2 patches need to be backported to stable.

The thing is, on anything SMP, freeing page directories should observe the
exact same order as normal page freeing:

 1) unhook page/directory
 2) TLB invalidate
 3) free page/directory

Without this, any concurrent page-table walk could end up with a
Use-after-Free.  This is esp.  trivial for anything that has software
page-table walkers (HAVE_FAST_GUP / software TLB fill) or the hardware
caches partial page-walks (ie.  caches page directories).

Even on UP this might give issues since mmu_gather is preemptible these
days.  An interrupt or preempted task accessing user pages might stumble
into the free page if the hardware caches page directories.

This patch series fixes ppc64 and adds generic MMU_GATHER changes to
support the conversion of other architectures.  I haven't added patches
w.r.t. other architectures because they are yet to be acked.

This patch (of 9):

A followup patch is going to make sure we correctly invalidate page walk
cache before we free page table pages.  In order to keep things simple,
enable RCU_TABLE_FREE even for !SMP so that we don't have to fix up the
!SMP case differently in the followup patch.

!SMP case is right now broken for radix translation w.r.t page walk
cache flush.  We can get interrupted in between page table free and
that would imply we have page walk cache entries pointing to tables
which got freed already.  Michael said "both our platforms that run on
Power9 force SMP on in Kconfig, so the !SMP case is unlikely to be a
problem for anyone in practice, unless they've hacked their kernel to
build it !SMP."

12e4d53f3f04e in upstream.

Link: 
http://lkml.kernel.org/r/20200116064531.483522-2-aneesh.ku...@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V 
Cc:  # 4.19
Signed-off-by: Santosh Sivaraj 
[santosh: backported for 4.19 stable]
---
 arch/powerpc/Kconfig | 2 +-
 arch/powerpc/include/asm/book3s/32/pgalloc.h | 8 
 arch/powerpc/include/asm/book3s/64/pgalloc.h | 2 --
 arch/powerpc/mm/pgtable-book3s64.c   | 7 ---
 4 files changed, 1 insertion(+), 18 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index f7f046ff6407..fa231130eee1 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -215,7 +215,7 @@ config PPC
select HAVE_HARDLOCKUP_DETECTOR_PERFif PERF_EVENTS && 
HAVE_PERF_EVENTS_NMI && !HAVE_HARDLOCKUP_DETECTOR_ARCH
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
-   select HAVE_RCU_TABLE_FREE  if SMP
+   select HAVE_RCU_TABLE_FREE
select HAVE_RCU_TABLE_NO_INVALIDATE if HAVE_RCU_TABLE_FREE
select HAVE_REGS_AND_STACK_ACCESS_API
select HAVE_RELIABLE_STACKTRACE if PPC64 && CPU_LITTLE_ENDIAN
diff --git a/arch/powerpc/include/asm/book3s/32/pgalloc.h 
b/arch/powerpc/include/asm/book3s/32/pgalloc.h
index 82e44b1a00ae..79ba3fbb512e 100644
--- a/arch/powerpc/include/asm/book3s/32/pgalloc.h
+++ b/arch/powerpc/include/asm/book3s/32/pgalloc.h
@@ -110,7 +110,6 @@ static inline void pgtable_free(void *table, unsigned 
index_size)
 #define check_pgt_cache()  do { } while (0)
 #define get_hugepd_cache_index(x)  (x)
 
-#ifdef CONFIG_SMP
 static inline void pgtable_free_tlb(struct mmu_gather *tlb,
void *table, int shift)
 {
@@ -127,13 +126,6 @@ static inline void __tlb_remove_table(void *_table)
 
pgtable_free(table, shift);
 }
-#else
-static inline void pgtable_free_tlb(struct mmu_gather *tlb,
-   void *table, int shift)
-{
-   pgtable_free(table, shift);
-}
-#endif
 
 static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t table,
  unsigned long address)
diff --git a/arch/powerpc/include/asm/book3s/64/pgalloc.h 
b/arch/powerpc/include/asm/book3s/64/pgalloc.h
index f9019b579903..1013c0214213 100644
--- a/arch/powerpc/include/asm/book3s/64/pgalloc.h
+++ b/arch/powerpc/include/asm/book3s/64/pgalloc.h
@@ -47,9 +47,7 @@ extern pmd_t *pmd_fragment_alloc(struct mm_struct *, unsigned 
long);
 extern void pte_fragment_free(unsigned long *, int);
 extern void pmd_fragment_free(unsigned long *);
 extern void pgtable_free_tlb(struct mmu_gather *tlb, void *table, int shift);
-#ifdef CONFIG_SMP
 extern void __tlb_remove_table(void *_table);
-#endif
 
 static inline pgd_t *radix__pgd_alloc(struct mm_struct *mm)
 {
diff --git a/arch/powerpc/mm/pgtable-book3s64.c 
b/arch/powerpc/mm/pgtable-book3s64.c
index 297db665d953..5b4e9fd8990c 100644
--- a/arch/powerpc/mm/pgtable-book3s64.c
+++ b/arch/po

[PATCH 3/6] asm-generic/tlb, arch: Invert CONFIG_HAVE_RCU_TABLE_INVALIDATE

2020-02-20 Thread Santosh Sivaraj
From: Peter Zijlstra 

Make issuing a TLB invalidate for page-table pages the normal case.

The reason is twofold:

 - too many invalidates is safer than too few,
 - most architectures use the linux page-tables natively
   and would thus require this.

Make it an opt-out, instead of an opt-in.

No change in behavior intended.

96bc9567cbe1 in upstream.

Signed-off-by: Peter Zijlstra (Intel) 
Cc:  # 4.19
Signed-off-by: Santosh Sivaraj 
[santosh: prerequisite for upcoming tlbflush backports]
---
 arch/Kconfig | 2 +-
 arch/powerpc/Kconfig | 1 +
 arch/sparc/Kconfig   | 1 +
 arch/x86/Kconfig | 1 -
 mm/memory.c  | 2 +-
 5 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index a336548487e6..061a12b8140e 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -363,7 +363,7 @@ config HAVE_ARCH_JUMP_LABEL
 config HAVE_RCU_TABLE_FREE
bool
 
-config HAVE_RCU_TABLE_INVALIDATE
+config HAVE_RCU_TABLE_NO_INVALIDATE
bool
 
 config ARCH_HAVE_NMI_SAFE_CMPXCHG
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index a80669209155..f7f046ff6407 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -216,6 +216,7 @@ config PPC
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
select HAVE_RCU_TABLE_FREE  if SMP
+   select HAVE_RCU_TABLE_NO_INVALIDATE if HAVE_RCU_TABLE_FREE
select HAVE_REGS_AND_STACK_ACCESS_API
select HAVE_RELIABLE_STACKTRACE if PPC64 && CPU_LITTLE_ENDIAN
select HAVE_SYSCALL_TRACEPOINTS
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index e6f2a38d2e61..d90d632868aa 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -64,6 +64,7 @@ config SPARC64
select HAVE_KRETPROBES
select HAVE_KPROBES
select HAVE_RCU_TABLE_FREE if SMP
+   select HAVE_RCU_TABLE_NO_INVALIDATE if HAVE_RCU_TABLE_FREE
select HAVE_MEMBLOCK_NODE_MAP
select HAVE_ARCH_TRANSPARENT_HUGEPAGE
select HAVE_DYNAMIC_FTRACE
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index af35f5caadbe..181d0d522977 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -181,7 +181,6 @@ config X86
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
select HAVE_RCU_TABLE_FREE  if PARAVIRT
-   select HAVE_RCU_TABLE_INVALIDATEif HAVE_RCU_TABLE_FREE
select HAVE_REGS_AND_STACK_ACCESS_API
select HAVE_RELIABLE_STACKTRACE if X86_64 && 
(UNWINDER_FRAME_POINTER || UNWINDER_ORC) && STACK_VALIDATION
select HAVE_STACKPROTECTOR  if CC_HAS_SANE_STACKPROTECTOR
diff --git a/mm/memory.c b/mm/memory.c
index 1832c5ed6ac0..ba5689610c04 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -327,7 +327,7 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct 
page *page, int page_
  */
 static inline void tlb_table_invalidate(struct mmu_gather *tlb)
 {
-#ifdef CONFIG_HAVE_RCU_TABLE_INVALIDATE
+#ifndef CONFIG_HAVE_RCU_TABLE_NO_INVALIDATE
/*
 * Invalidate page-table caches used by hardware walkers. Then we still
 * need to RCU-sched wait while freeing the pages because software
-- 
2.24.1



[PATCH 2/6] asm-generic/tlb: Track which levels of the page tables have been cleared

2020-02-20 Thread Santosh Sivaraj
From: Will Deacon 

It is common for architectures with hugepage support to require only a
single TLB invalidation operation per hugepage during unmap(), rather than
iterating through the mapping at a PAGE_SIZE increment. Currently,
however, the level in the page table where the unmap() operation occurs
is not stored in the mmu_gather structure, therefore forcing
architectures to issue additional TLB invalidation operations or to give
up and over-invalidate by e.g. invalidating the entire TLB.

Ideally, we could add an interval rbtree to the mmu_gather structure,
which would allow us to associate the correct mapping granule with the
various sub-mappings within the range being invalidated. However, this
is costly in terms of book-keeping and memory management, so instead we
approximate by keeping track of the page table levels that are cleared
and provide a means to query the smallest granule required for invalidation.

a6d60245d6d9 in upstream

Signed-off-by: Will Deacon 
Cc:  # 4.19
Signed-off-by: Santosh Sivaraj 
[santosh: prerequisite for upcoming tlbflush backports]
---
 include/asm-generic/tlb.h | 58 +--
 mm/memory.c   |  4 ++-
 2 files changed, 53 insertions(+), 9 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 97306b32d8d2..f2b9dc9cbaf8 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -114,6 +114,14 @@ struct mmu_gather {
 */
unsigned intfreed_tables : 1;
 
+   /*
+* at which levels have we cleared entries?
+*/
+   unsigned intcleared_ptes : 1;
+   unsigned intcleared_pmds : 1;
+   unsigned intcleared_puds : 1;
+   unsigned intcleared_p4ds : 1;
+
struct mmu_gather_batch *active;
struct mmu_gather_batch local;
struct page *__pages[MMU_GATHER_BUNDLE];
@@ -148,6 +156,10 @@ static inline void __tlb_reset_range(struct mmu_gather 
*tlb)
tlb->end = 0;
}
tlb->freed_tables = 0;
+   tlb->cleared_ptes = 0;
+   tlb->cleared_pmds = 0;
+   tlb->cleared_puds = 0;
+   tlb->cleared_p4ds = 0;
 }
 
 static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
@@ -197,6 +209,25 @@ static inline void 
tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 }
 #endif
 
+static inline unsigned long tlb_get_unmap_shift(struct mmu_gather *tlb)
+{
+   if (tlb->cleared_ptes)
+   return PAGE_SHIFT;
+   if (tlb->cleared_pmds)
+   return PMD_SHIFT;
+   if (tlb->cleared_puds)
+   return PUD_SHIFT;
+   if (tlb->cleared_p4ds)
+   return P4D_SHIFT;
+
+   return PAGE_SHIFT;
+}
+
+static inline unsigned long tlb_get_unmap_size(struct mmu_gather *tlb)
+{
+   return 1UL << tlb_get_unmap_shift(tlb);
+}
+
 /*
  * In the case of tlb vma handling, we can optimise these away in the
  * case where we're doing a full MM flush.  When we're doing a munmap,
@@ -230,13 +261,19 @@ static inline void 
tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 #define tlb_remove_tlb_entry(tlb, ptep, address)   \
do {\
__tlb_adjust_range(tlb, address, PAGE_SIZE);\
+   tlb->cleared_ptes = 1;  \
__tlb_remove_tlb_entry(tlb, ptep, address); \
} while (0)
 
-#define tlb_remove_huge_tlb_entry(h, tlb, ptep, address)\
-   do { \
-   __tlb_adjust_range(tlb, address, huge_page_size(h)); \
-   __tlb_remove_tlb_entry(tlb, ptep, address);  \
+#define tlb_remove_huge_tlb_entry(h, tlb, ptep, address)   \
+   do {\
+   unsigned long _sz = huge_page_size(h);  \
+   __tlb_adjust_range(tlb, address, _sz);  \
+   if (_sz == PMD_SIZE)\
+   tlb->cleared_pmds = 1;  \
+   else if (_sz == PUD_SIZE)   \
+   tlb->cleared_puds = 1;  \
+   __tlb_remove_tlb_entry(tlb, ptep, address); \
} while (0)
 
 /**
@@ -250,6 +287,7 @@ static inline void tlb_remove_check_page_size_change(struct 
mmu_gather *tlb,
 #define tlb_remove_pmd_tlb_entry(tlb, pmdp, address)   \
do {\
__tlb_adjust_range(tlb, address, HPAGE_PMD_SIZE);   \
+   tlb->cleared_pmds = 1;  \
__tlb_remove_pmd_tlb_entry(tlb, pmdp, address); \
} while (0)
 
@@ -264,6 +302,7 @@ static inline void tlb_remov

[PATCH 1/6] asm-generic/tlb: Track freeing of page-table directories in struct mmu_gather

2020-02-20 Thread Santosh Sivaraj
From: Peter Zijlstra 

Some architectures require different TLB invalidation instructions
depending on whether it is only the last-level of page table being
changed, or whether there are also changes to the intermediate
(directory) entries higher up the tree.

Add a new bit to the flags bitfield in struct mmu_gather so that the
architecture code can operate accordingly if it's the intermediate
levels being invalidated.

22a61c3c4f1379 in upstream

Signed-off-by: Peter Zijlstra 
Signed-off-by: Will Deacon 
Cc:  # 4.19
Signed-off-by: Santosh Sivaraj 
[santosh: prerequisite for tlbflush backports]
---
 include/asm-generic/tlb.h | 31 +++
 1 file changed, 23 insertions(+), 8 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index b3353e21f3b3..97306b32d8d2 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -97,12 +97,22 @@ struct mmu_gather {
 #endif
unsigned long   start;
unsigned long   end;
-   /* we are in the middle of an operation to clear
-* a full mm and can make some optimizations */
-   unsigned intfullmm : 1,
-   /* we have performed an operation which
-* requires a complete flush of the tlb */
-   need_flush_all : 1;
+   /*
+* we are in the middle of an operation to clear
+* a full mm and can make some optimizations
+*/
+   unsigned intfullmm : 1;
+
+   /*
+* we have performed an operation which
+* requires a complete flush of the tlb
+*/
+   unsigned intneed_flush_all : 1;
+
+   /*
+* we have removed page directories
+*/
+   unsigned intfreed_tables : 1;
 
struct mmu_gather_batch *active;
struct mmu_gather_batch local;
@@ -137,6 +147,7 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb)
tlb->start = TASK_SIZE;
tlb->end = 0;
}
+   tlb->freed_tables = 0;
 }
 
 static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
@@ -278,6 +289,7 @@ static inline void tlb_remove_check_page_size_change(struct 
mmu_gather *tlb,
 #define pte_free_tlb(tlb, ptep, address)   \
do {\
__tlb_adjust_range(tlb, address, PAGE_SIZE);\
+   tlb->freed_tables = 1;  \
__pte_free_tlb(tlb, ptep, address); \
} while (0)
 #endif
@@ -285,7 +297,8 @@ static inline void tlb_remove_check_page_size_change(struct 
mmu_gather *tlb,
 #ifndef pmd_free_tlb
 #define pmd_free_tlb(tlb, pmdp, address)   \
do {\
-   __tlb_adjust_range(tlb, address, PAGE_SIZE);\
+   __tlb_adjust_range(tlb, address, PAGE_SIZE);\
+   tlb->freed_tables = 1;  \
__pmd_free_tlb(tlb, pmdp, address); \
} while (0)
 #endif
@@ -295,6 +308,7 @@ static inline void tlb_remove_check_page_size_change(struct 
mmu_gather *tlb,
 #define pud_free_tlb(tlb, pudp, address)   \
do {\
__tlb_adjust_range(tlb, address, PAGE_SIZE);\
+   tlb->freed_tables = 1;  \
__pud_free_tlb(tlb, pudp, address); \
} while (0)
 #endif
@@ -304,7 +318,8 @@ static inline void tlb_remove_check_page_size_change(struct 
mmu_gather *tlb,
 #ifndef p4d_free_tlb
 #define p4d_free_tlb(tlb, pudp, address)   \
do {\
-   __tlb_adjust_range(tlb, address, PAGE_SIZE);\
+   __tlb_adjust_range(tlb, address, PAGE_SIZE);\
+   tlb->freed_tables = 1;  \
__p4d_free_tlb(tlb, pudp, address); \
} while (0)
 #endif
-- 
2.24.1



[PATCH 0/6] Memory corruption may occur due to incorrect tlb flush

2020-02-20 Thread Santosh Sivaraj
The TLB flush optimisation (a46cc7a90f: powerpc/mm/radix: Improve TLB/PWC
flushes) may result in random memory corruption. Any concurrent page-table walk
could end up with a Use-after-Free. Even on UP this might give issues, since
mmu_gather is preemptible these days. An interrupt or preempted task accessing
user pages might stumble into the free page if the hardware caches page
directories.

The series is a backport of the fix sent by Peter [1].

The first three patches are dependencies for the last patch (avoid potential
double flush). If the performance impact due to double flush is considered
trivial then the first three patches and last patch may be dropped.

This is only for v4.19 stable.

[1] https://patchwork.kernel.org/cover/11284843/

--
Aneesh Kumar K.V (1):
  powerpc/mmu_gather: enable RCU_TABLE_FREE even for !SMP case

Peter Zijlstra (4):
  asm-generic/tlb: Track freeing of page-table directories in struct
mmu_gather
  asm-generic/tlb, arch: Invert CONFIG_HAVE_RCU_TABLE_INVALIDATE
  mm/mmu_gather: invalidate TLB correctly on batch allocation failure
and flush
  asm-generic/tlb: avoid potential double flush

Will Deacon (1):
  asm-generic/tlb: Track which levels of the page tables have been
cleared

 arch/Kconfig |   3 -
 arch/powerpc/Kconfig |   2 +-
 arch/powerpc/include/asm/book3s/32/pgalloc.h |   8 --
 arch/powerpc/include/asm/book3s/64/pgalloc.h |   2 -
 arch/powerpc/include/asm/tlb.h   |  11 ++
 arch/powerpc/mm/pgtable-book3s64.c   |   7 --
 arch/sparc/include/asm/tlb_64.h  |   9 ++
 arch/x86/Kconfig |   1 -
 include/asm-generic/tlb.h| 103 ---
 mm/memory.c  |  20 ++--
 10 files changed, 122 insertions(+), 44 deletions(-)

-- 
2.24.1



[PATCH 6/6] asm-generic/tlb: avoid potential double flush

2020-02-19 Thread Santosh Sivaraj
From: Peter Zijlstra 

Aneesh reported that:

tlb_flush_mmu()
  tlb_flush_mmu_tlbonly()
tlb_flush() <-- #1
  tlb_flush_mmu_free()
tlb_table_flush()
  tlb_table_invalidate()
tlb_flush_mmu_tlbonly()
  tlb_flush()   <-- #2

does two TLBIs when tlb->fullmm, because __tlb_reset_range() will not
clear tlb->end in that case.

Observe that any caller to __tlb_adjust_range() also sets at least one of
the tlb->freed_tables || tlb->cleared_p* bits, and those are
unconditionally cleared by __tlb_reset_range().

Change the condition for actually issuing TLBI to having one of those bits
set, as opposed to having tlb->end != 0.
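
To make the fullmm corner concrete, here is a minimal sketch of the 4.19-era
__tlb_reset_range() that the description refers to (illustrative only,
reconstructed from the hunks quoted elsewhere in this series, not a change
made by this patch). With tlb->fullmm the range is left non-zero, so a
condition based on tlb->end alone fires a second time, while the flag bits
have already been cleared:

static inline void __tlb_reset_range(struct mmu_gather *tlb)
{
	if (tlb->fullmm) {
		tlb->start = tlb->end = ~0;	/* range stays "armed" for a full-mm teardown */
	} else {
		tlb->start = TASK_SIZE;
		tlb->end = 0;
	}
	tlb->freed_tables = 0;			/* the flag-based condition goes quiet here */
	tlb->cleared_ptes = 0;
	tlb->cleared_pmds = 0;
	tlb->cleared_puds = 0;
	tlb->cleared_p4ds = 0;
}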

Link: 
http://lkml.kernel.org/r/20200116064531.483522-4-aneesh.ku...@linux.ibm.com
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Aneesh Kumar K.V 
Reported-by: "Aneesh Kumar K.V" 
Cc:   # 4.19
Signed-off-by: Santosh Sivaraj 
[santosh: backported to 4.19 stable]
---
 include/asm-generic/tlb.h | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 19934cdd143e..427a70c56ddd 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -179,7 +179,12 @@ static inline void __tlb_reset_range(struct mmu_gather 
*tlb)
 
 static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
 {
-   if (!tlb->end)
+   /*
+* Anything calling __tlb_adjust_range() also sets at least one of
+* these bits.
+*/
+   if (!(tlb->freed_tables || tlb->cleared_ptes || tlb->cleared_pmds ||
+ tlb->cleared_puds || tlb->cleared_p4ds))
return;
 
tlb_flush(tlb);
-- 
2.24.1



[PATCH 5/6] mm/mmu_gather: invalidate TLB correctly on batch allocation failure and flush

2020-02-19 Thread Santosh Sivaraj
From: Peter Zijlstra 

Architectures for which we have hardware walkers of Linux page table
should flush TLB on mmu gather batch allocation failures and batch flush.
Some architectures like POWER supports multiple translation modes (hash
and radix) and in the case of POWER only radix translation mode needs the
above TLBI.  This is because for hash translation mode kernel wants to
avoid this extra flush since there are no hardware walkers of linux page
table.  With radix translation, the hardware also walks linux page table
and with that, kernel needs to make sure to TLB invalidate page walk cache
before page table pages are freed.

More details in commit d86564a2f085 ("mm/tlb, x86/mm: Support invalidating
TLB caches for RCU_TABLE_FREE")

The changes to sparc are to make sure we keep the old behavior since we
are now removing HAVE_RCU_TABLE_NO_INVALIDATE.  The default value for
tlb_needs_table_invalidate is to always force an invalidate and sparc can
avoid the table invalidate.  Hence we define tlb_needs_table_invalidate to
false for sparc architecture.
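
A rough sketch of how the default and its user fit together, reconstructed
from the description above (comment wording and exact placement may differ
from the actual hunks):

/* generic fallback: invalidate unless the architecture opts out */
#ifndef tlb_needs_table_invalidate
#define tlb_needs_table_invalidate()	(true)
#endif

/* mm/memory.c: skip the flush only when the arch says it is safe */
static inline void tlb_table_invalidate(struct mmu_gather *tlb)
{
	if (tlb_needs_table_invalidate()) {
		/*
		 * Invalidate page-table caches used by hardware walkers;
		 * the RCU-sched wait for software walkers still follows.
		 */
		tlb_flush_mmu_tlbonly(tlb);
	}
}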

Link: 
http://lkml.kernel.org/r/20200116064531.483522-3-aneesh.ku...@linux.ibm.com
Fixes: a46cc7a90fd8 ("powerpc/mm/radix: Improve TLB/PWC flushes")
Signed-off-by: Peter Zijlstra (Intel) 
Cc:   # 4.19
Signed-off-by: Santosh Sivaraj 
[santosh: backported to 4.19 stable]
---
 arch/Kconfig|  3 ---
 arch/powerpc/Kconfig|  1 -
 arch/powerpc/include/asm/tlb.h  | 11 +++
 arch/sparc/Kconfig  |  1 -
 arch/sparc/include/asm/tlb_64.h |  9 +
 include/asm-generic/tlb.h   | 15 +++
 mm/memory.c | 16 
 7 files changed, 43 insertions(+), 13 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 061a12b8140e..3abbdb0cea44 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -363,9 +363,6 @@ config HAVE_ARCH_JUMP_LABEL
 config HAVE_RCU_TABLE_FREE
bool
 
-config HAVE_RCU_TABLE_NO_INVALIDATE
-   bool
-
 config ARCH_HAVE_NMI_SAFE_CMPXCHG
bool
 
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index fa231130eee1..b6429f53835e 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -216,7 +216,6 @@ config PPC
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
select HAVE_RCU_TABLE_FREE
-   select HAVE_RCU_TABLE_NO_INVALIDATE if HAVE_RCU_TABLE_FREE
select HAVE_REGS_AND_STACK_ACCESS_API
select HAVE_RELIABLE_STACKTRACE if PPC64 && CPU_LITTLE_ENDIAN
select HAVE_SYSCALL_TRACEPOINTS
diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
index f0e571b2dc7c..63418275f402 100644
--- a/arch/powerpc/include/asm/tlb.h
+++ b/arch/powerpc/include/asm/tlb.h
@@ -30,6 +30,17 @@
 #define tlb_remove_check_page_size_change tlb_remove_check_page_size_change
 
 extern void tlb_flush(struct mmu_gather *tlb);
+/*
+ * book3s:
+ * Hash does not use the linux page-tables, so we can avoid
+ * the TLB invalidate for page-table freeing, Radix otoh does use the
+ * page-tables and needs the TLBI.
+ *
+ * nohash:
+ * We still do TLB invalidate in the __pte_free_tlb routine before we
+ * add the page table pages to mmu gather table batch.
+ */
+#define tlb_needs_table_invalidate()   radix_enabled()
 
 /* Get the generic bits... */
 #include 
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index d90d632868aa..e6f2a38d2e61 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -64,7 +64,6 @@ config SPARC64
select HAVE_KRETPROBES
select HAVE_KPROBES
select HAVE_RCU_TABLE_FREE if SMP
-   select HAVE_RCU_TABLE_NO_INVALIDATE if HAVE_RCU_TABLE_FREE
select HAVE_MEMBLOCK_NODE_MAP
select HAVE_ARCH_TRANSPARENT_HUGEPAGE
select HAVE_DYNAMIC_FTRACE
diff --git a/arch/sparc/include/asm/tlb_64.h b/arch/sparc/include/asm/tlb_64.h
index a2f3fa61ee36..8cb8f3833239 100644
--- a/arch/sparc/include/asm/tlb_64.h
+++ b/arch/sparc/include/asm/tlb_64.h
@@ -28,6 +28,15 @@ void flush_tlb_pending(void);
 #define __tlb_remove_tlb_entry(tlb, ptep, address) do { } while (0)
 #define tlb_flush(tlb) flush_tlb_pending()
 
+/*
+ * SPARC64's hardware TLB fill does not use the Linux page-tables
+ * and therefore we don't need a TLBI when freeing page-table pages.
+ */
+
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+#define tlb_needs_table_invalidate()   (false)
+#endif
+
 #include 
 
 #endif /* _SPARC64_TLB_H */
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index f2b9dc9cbaf8..19934cdd143e 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -61,8 +61,23 @@ struct mmu_table_batch {
 extern void tlb_table_flush(struct mmu_gather *tlb);
 extern void tlb_remove_table(struct mmu_gather *tlb, void *table);
 
+/*
+ * This allows an architecture that does not use the linux page-tables for
+ * hardware to skip the TLBI when freeing page tables.
+ */
+#ifndef tlb_needs_table_invalidate
+#define tlb_ne

[PATCH 4/6] powerpc/mmu_gather: enable RCU_TABLE_FREE even for !SMP case

2020-02-19 Thread Santosh Sivaraj
From: "Aneesh Kumar K.V" 

Patch series "Fixup page directory freeing", v4.

This is a repost of patch series from Peter with the arch specific changes
except ppc64 dropped.  ppc64 changes are added here because we are redoing
the patch series on top of ppc64 changes.  This makes it easy to backport
these changes.  Only the first 2 patches need to be backported to stable.

The thing is, on anything SMP, freeing page directories should observe the
exact same order as normal page freeing:

 1) unhook page/directory
 2) TLB invalidate
 3) free page/directory

Without this, any concurrent page-table walk could end up with a
Use-after-Free.  This is esp.  trivial for anything that has software
page-table walkers (HAVE_FAST_GUP / software TLB fill) or the hardware
caches partial page-walks (ie.  caches page directories).

Even on UP this might give issues since mmu_gather is preemptible these
days.  An interrupt or preempted task accessing user pages might stumble
into the free page if the hardware caches page directories.
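
As a purely illustrative sketch of that ordering (the flush helper below is
hypothetical, not code from this series):

static void free_pmd_table_safely(struct mm_struct *mm, pud_t *pud, pmd_t *pmd_table)
{
	pud_clear(pud);				/* 1) unhook the directory */
	flush_tlb_mm_pwc(mm);			/* 2) hypothetical TLB + page-walk-cache invalidate */
	free_page((unsigned long)pmd_table);	/* 3) only now is reuse of the page safe */
}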

This patch series fixes ppc64 and add generic MMU_GATHER changes to
support the conversion of other architectures.  I haven't added patches
w.r.t other architecture because they are yet to be acked.

This patch (of 9):

A followup patch is going to make sure we correctly invalidate page walk
cache before we free page table pages.  In order to keep things simple
enable RCU_TABLE_FREE even for !SMP so that we don't have to fixup the
!SMP case differently in the followup patch

!SMP case is right now broken for radix translation w.r.t page walk
cache flush.  We can get interrupted in between page table free and
that would imply we have page walk cache entries pointing to tables
which got freed already.  Michael said "both our platforms that run on
Power9 force SMP on in Kconfig, so the !SMP case is unlikely to be a
problem for anyone in practice, unless they've hacked their kernel to
build it !SMP."

Link: 
http://lkml.kernel.org/r/20200116064531.483522-2-aneesh.ku...@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V 
Cc:  # 4.19
Signed-off-by: Santosh Sivaraj 
[santosh: backported for 4.19 stable]
---
 arch/powerpc/Kconfig | 2 +-
 arch/powerpc/include/asm/book3s/32/pgalloc.h | 8 
 arch/powerpc/include/asm/book3s/64/pgalloc.h | 2 --
 arch/powerpc/mm/pgtable-book3s64.c   | 7 ---
 4 files changed, 1 insertion(+), 18 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index f7f046ff6407..fa231130eee1 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -215,7 +215,7 @@ config PPC
select HAVE_HARDLOCKUP_DETECTOR_PERFif PERF_EVENTS && 
HAVE_PERF_EVENTS_NMI && !HAVE_HARDLOCKUP_DETECTOR_ARCH
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
-   select HAVE_RCU_TABLE_FREE  if SMP
+   select HAVE_RCU_TABLE_FREE
select HAVE_RCU_TABLE_NO_INVALIDATE if HAVE_RCU_TABLE_FREE
select HAVE_REGS_AND_STACK_ACCESS_API
select HAVE_RELIABLE_STACKTRACE if PPC64 && CPU_LITTLE_ENDIAN
diff --git a/arch/powerpc/include/asm/book3s/32/pgalloc.h 
b/arch/powerpc/include/asm/book3s/32/pgalloc.h
index 82e44b1a00ae..79ba3fbb512e 100644
--- a/arch/powerpc/include/asm/book3s/32/pgalloc.h
+++ b/arch/powerpc/include/asm/book3s/32/pgalloc.h
@@ -110,7 +110,6 @@ static inline void pgtable_free(void *table, unsigned 
index_size)
 #define check_pgt_cache()  do { } while (0)
 #define get_hugepd_cache_index(x)  (x)
 
-#ifdef CONFIG_SMP
 static inline void pgtable_free_tlb(struct mmu_gather *tlb,
void *table, int shift)
 {
@@ -127,13 +126,6 @@ static inline void __tlb_remove_table(void *_table)
 
pgtable_free(table, shift);
 }
-#else
-static inline void pgtable_free_tlb(struct mmu_gather *tlb,
-   void *table, int shift)
-{
-   pgtable_free(table, shift);
-}
-#endif
 
 static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t table,
  unsigned long address)
diff --git a/arch/powerpc/include/asm/book3s/64/pgalloc.h 
b/arch/powerpc/include/asm/book3s/64/pgalloc.h
index f9019b579903..1013c0214213 100644
--- a/arch/powerpc/include/asm/book3s/64/pgalloc.h
+++ b/arch/powerpc/include/asm/book3s/64/pgalloc.h
@@ -47,9 +47,7 @@ extern pmd_t *pmd_fragment_alloc(struct mm_struct *, unsigned 
long);
 extern void pte_fragment_free(unsigned long *, int);
 extern void pmd_fragment_free(unsigned long *);
 extern void pgtable_free_tlb(struct mmu_gather *tlb, void *table, int shift);
-#ifdef CONFIG_SMP
 extern void __tlb_remove_table(void *_table);
-#endif
 
 static inline pgd_t *radix__pgd_alloc(struct mm_struct *mm)
 {
diff --git a/arch/powerpc/mm/pgtable-book3s64.c 
b/arch/powerpc/mm/pgtable-book3s64.c
index 297db665d953..5b4e9fd8990c 100644
--- a/arch/powerpc/mm/pgtable-book3s64.c
+++ b/arch/powerpc/mm/pgtable-book3s64

[PATCH 3/6] asm-generic/tlb, arch: Invert CONFIG_HAVE_RCU_TABLE_INVALIDATE

2020-02-19 Thread Santosh Sivaraj
From: Peter Zijlstra 

Make issuing a TLB invalidate for page-table pages the normal case.

The reason is twofold:

 - too many invalidates is safer than too few,
 - most architectures use the linux page-tables natively
   and would thus require this.

Make it an opt-out, instead of an opt-in.

No change in behavior intended.
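
In code terms the inversion amounts to flipping the guard in the generic
table-invalidate path; a rough sketch (not the literal hunk) of the result:

static inline void tlb_table_invalidate_sketch(struct mmu_gather *tlb)
{
	/* was: #ifdef CONFIG_HAVE_RCU_TABLE_INVALIDATE (opt-in) */
#ifndef CONFIG_HAVE_RCU_TABLE_NO_INVALIDATE	/* now: opt-out */
	tlb_flush_mmu_tlbonly(tlb);	/* flush page-walk caches before tables are freed */
#endif
}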

Signed-off-by: Peter Zijlstra (Intel) 
Cc:  # 4.19
Signed-off-by: Santosh Sivaraj 
[santosh: prerequisite for upcoming tlbflush backports]
---
 arch/Kconfig | 2 +-
 arch/powerpc/Kconfig | 1 +
 arch/sparc/Kconfig   | 1 +
 arch/x86/Kconfig | 1 -
 mm/memory.c  | 2 +-
 5 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index a336548487e6..061a12b8140e 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -363,7 +363,7 @@ config HAVE_ARCH_JUMP_LABEL
 config HAVE_RCU_TABLE_FREE
bool
 
-config HAVE_RCU_TABLE_INVALIDATE
+config HAVE_RCU_TABLE_NO_INVALIDATE
bool
 
 config ARCH_HAVE_NMI_SAFE_CMPXCHG
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index a80669209155..f7f046ff6407 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -216,6 +216,7 @@ config PPC
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
select HAVE_RCU_TABLE_FREE  if SMP
+   select HAVE_RCU_TABLE_NO_INVALIDATE if HAVE_RCU_TABLE_FREE
select HAVE_REGS_AND_STACK_ACCESS_API
select HAVE_RELIABLE_STACKTRACE if PPC64 && CPU_LITTLE_ENDIAN
select HAVE_SYSCALL_TRACEPOINTS
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index e6f2a38d2e61..d90d632868aa 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -64,6 +64,7 @@ config SPARC64
select HAVE_KRETPROBES
select HAVE_KPROBES
select HAVE_RCU_TABLE_FREE if SMP
+   select HAVE_RCU_TABLE_NO_INVALIDATE if HAVE_RCU_TABLE_FREE
select HAVE_MEMBLOCK_NODE_MAP
select HAVE_ARCH_TRANSPARENT_HUGEPAGE
select HAVE_DYNAMIC_FTRACE
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index af35f5caadbe..181d0d522977 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -181,7 +181,6 @@ config X86
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
select HAVE_RCU_TABLE_FREE  if PARAVIRT
-   select HAVE_RCU_TABLE_INVALIDATEif HAVE_RCU_TABLE_FREE
select HAVE_REGS_AND_STACK_ACCESS_API
select HAVE_RELIABLE_STACKTRACE if X86_64 && 
(UNWINDER_FRAME_POINTER || UNWINDER_ORC) && STACK_VALIDATION
select HAVE_STACKPROTECTOR  if CC_HAS_SANE_STACKPROTECTOR
diff --git a/mm/memory.c b/mm/memory.c
index 1832c5ed6ac0..ba5689610c04 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -327,7 +327,7 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct 
page *page, int page_
  */
 static inline void tlb_table_invalidate(struct mmu_gather *tlb)
 {
-#ifdef CONFIG_HAVE_RCU_TABLE_INVALIDATE
+#ifndef CONFIG_HAVE_RCU_TABLE_NO_INVALIDATE
/*
 * Invalidate page-table caches used by hardware walkers. Then we still
 * need to RCU-sched wait while freeing the pages because software
-- 
2.24.1



[PATCH 2/6] asm-generic/tlb: Track which levels of the page tables have been cleared

2020-02-19 Thread Santosh Sivaraj
From: Will Deacon 

It is common for architectures with hugepage support to require only a
single TLB invalidation operation per hugepage during unmap(), rather than
iterating through the mapping at a PAGE_SIZE increment. Currently,
however, the level in the page table where the unmap() operation occurs
is not stored in the mmu_gather structure, therefore forcing
architectures to issue additional TLB invalidation operations or to give
up and over-invalidate by e.g. invalidating the entire TLB.

Ideally, we could add an interval rbtree to the mmu_gather structure,
which would allow us to associate the correct mapping granule with the
various sub-mappings within the range being invalidated. However, this
is costly in terms of book-keeping and memory management, so instead we
approximate by keeping track of the page table levels that are cleared
and provide a means to query the smallest granule required for invalidation.
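
As a hedged example of how an architecture might consume this tracking (the
flush hook below and its per-entry helper are illustrative, not part of this
patch), the smallest cleared level becomes the invalidation stride:

static inline void arch_tlb_flush_sketch(struct mmu_gather *tlb)
{
	unsigned long stride = tlb_get_unmap_size(tlb);	/* PAGE_SIZE, PMD_SIZE, ... */
	unsigned long addr;

	for (addr = tlb->start; addr < tlb->end; addr += stride)
		arch_flush_one_entry(tlb->mm, addr);	/* hypothetical per-entry invalidate */
}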

Signed-off-by: Will Deacon 
Cc:  # 4.19
Signed-off-by: Santosh Sivaraj 
[santosh: prerequisite for upcoming tlbflush backports]
---
 include/asm-generic/tlb.h | 58 +--
 mm/memory.c   |  4 ++-
 2 files changed, 53 insertions(+), 9 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 97306b32d8d2..f2b9dc9cbaf8 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -114,6 +114,14 @@ struct mmu_gather {
 */
unsigned intfreed_tables : 1;
 
+   /*
+* at which levels have we cleared entries?
+*/
+   unsigned intcleared_ptes : 1;
+   unsigned intcleared_pmds : 1;
+   unsigned intcleared_puds : 1;
+   unsigned intcleared_p4ds : 1;
+
struct mmu_gather_batch *active;
struct mmu_gather_batch local;
struct page *__pages[MMU_GATHER_BUNDLE];
@@ -148,6 +156,10 @@ static inline void __tlb_reset_range(struct mmu_gather 
*tlb)
tlb->end = 0;
}
tlb->freed_tables = 0;
+   tlb->cleared_ptes = 0;
+   tlb->cleared_pmds = 0;
+   tlb->cleared_puds = 0;
+   tlb->cleared_p4ds = 0;
 }
 
 static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
@@ -197,6 +209,25 @@ static inline void 
tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 }
 #endif
 
+static inline unsigned long tlb_get_unmap_shift(struct mmu_gather *tlb)
+{
+   if (tlb->cleared_ptes)
+   return PAGE_SHIFT;
+   if (tlb->cleared_pmds)
+   return PMD_SHIFT;
+   if (tlb->cleared_puds)
+   return PUD_SHIFT;
+   if (tlb->cleared_p4ds)
+   return P4D_SHIFT;
+
+   return PAGE_SHIFT;
+}
+
+static inline unsigned long tlb_get_unmap_size(struct mmu_gather *tlb)
+{
+   return 1UL << tlb_get_unmap_shift(tlb);
+}
+
 /*
  * In the case of tlb vma handling, we can optimise these away in the
  * case where we're doing a full MM flush.  When we're doing a munmap,
@@ -230,13 +261,19 @@ static inline void 
tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 #define tlb_remove_tlb_entry(tlb, ptep, address)   \
do {\
__tlb_adjust_range(tlb, address, PAGE_SIZE);\
+   tlb->cleared_ptes = 1;  \
__tlb_remove_tlb_entry(tlb, ptep, address); \
} while (0)
 
-#define tlb_remove_huge_tlb_entry(h, tlb, ptep, address)\
-   do { \
-   __tlb_adjust_range(tlb, address, huge_page_size(h)); \
-   __tlb_remove_tlb_entry(tlb, ptep, address);  \
+#define tlb_remove_huge_tlb_entry(h, tlb, ptep, address)   \
+   do {\
+   unsigned long _sz = huge_page_size(h);  \
+   __tlb_adjust_range(tlb, address, _sz);  \
+   if (_sz == PMD_SIZE)\
+   tlb->cleared_pmds = 1;  \
+   else if (_sz == PUD_SIZE)   \
+   tlb->cleared_puds = 1;  \
+   __tlb_remove_tlb_entry(tlb, ptep, address); \
} while (0)
 
 /**
@@ -250,6 +287,7 @@ static inline void tlb_remove_check_page_size_change(struct 
mmu_gather *tlb,
 #define tlb_remove_pmd_tlb_entry(tlb, pmdp, address)   \
do {\
__tlb_adjust_range(tlb, address, HPAGE_PMD_SIZE);   \
+   tlb->cleared_pmds = 1;  \
__tlb_remove_pmd_tlb_entry(tlb, pmdp, address); \
} while (0)
 
@@ -264,6 +302,7 @@ static inline void tlb_remove_check_pa

[PATCH 0/6] Memory corruption may occur due to incorrect tlb flush

2020-02-19 Thread Santosh Sivaraj
The TLB flush optimisation (a46cc7a90f: powerpc/mm/radix: Improve TLB/PWC
flushes) may result in random memory corruption. Any concurrent page-table walk
could end up with a Use-after-Free. Even on UP this might give issues, since
mmu_gather is preemptible these days. An interrupt or preempted task accessing
user pages might stumble into the free page if the hardware caches page
directories.

The series is a backport of the fix sent by Peter [1].

The first three patches are dependencies for the last patch (avoid potential
double flush). If the performance impact due to double flush is considered
trivial then the first three patches and last patch may be dropped.

[1] https://patchwork.kernel.org/cover/11284843/
--
Aneesh Kumar K.V (1):
  powerpc/mmu_gather: enable RCU_TABLE_FREE even for !SMP case

Peter Zijlstra (4):
  asm-generic/tlb: Track freeing of page-table directories in struct
mmu_gather
  asm-generic/tlb, arch: Invert CONFIG_HAVE_RCU_TABLE_INVALIDATE
  mm/mmu_gather: invalidate TLB correctly on batch allocation failure
and flush
  asm-generic/tlb: avoid potential double flush

Will Deacon (1):
  asm-generic/tlb: Track which levels of the page tables have been
cleared

 arch/Kconfig |   3 -
 arch/powerpc/Kconfig |   2 +-
 arch/powerpc/include/asm/book3s/32/pgalloc.h |   8 --
 arch/powerpc/include/asm/book3s/64/pgalloc.h |   2 -
 arch/powerpc/include/asm/tlb.h   |  11 ++
 arch/powerpc/mm/pgtable-book3s64.c   |   7 --
 arch/sparc/include/asm/tlb_64.h  |   9 ++
 arch/x86/Kconfig |   1 -
 include/asm-generic/tlb.h| 103 ---
 mm/memory.c  |  20 ++--
 10 files changed, 122 insertions(+), 44 deletions(-)

-- 
2.24.1



[PATCH 1/6] asm-generic/tlb: Track freeing of page-table directories in struct mmu_gather

2020-02-19 Thread Santosh Sivaraj
From: Peter Zijlstra 

Some architectures require different TLB invalidation instructions
depending on whether it is only the last-level of page table being
changed, or whether there are also changes to the intermediate
(directory) entries higher up the tree.

Add a new bit to the flags bitfield in struct mmu_gather so that the
architecture code can operate accordingly if it's the intermediate
levels being invalidated.
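
For example (a sketch only, with hypothetical arch hooks), a flush
implementation can key off the new bit to decide whether the page-walk
cache needs invalidating as well:

static inline void pwc_aware_tlb_flush_sketch(struct mmu_gather *tlb)
{
	if (tlb->freed_tables)
		arch_flush_tlb_and_pwc(tlb->mm);	/* hypothetical: also drop cached directories */
	else
		arch_flush_tlb_range(tlb->mm, tlb->start, tlb->end);	/* hypothetical: leaf entries only */
}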

Signed-off-by: Peter Zijlstra 
Signed-off-by: Will Deacon 
Cc:  # 4.19
Signed-off-by: Santosh Sivaraj 
[santosh: prerequisite for tlbflush backports]
---
 include/asm-generic/tlb.h | 31 +++
 1 file changed, 23 insertions(+), 8 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index b3353e21f3b3..97306b32d8d2 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -97,12 +97,22 @@ struct mmu_gather {
 #endif
unsigned long   start;
unsigned long   end;
-   /* we are in the middle of an operation to clear
-* a full mm and can make some optimizations */
-   unsigned intfullmm : 1,
-   /* we have performed an operation which
-* requires a complete flush of the tlb */
-   need_flush_all : 1;
+   /*
+* we are in the middle of an operation to clear
+* a full mm and can make some optimizations
+*/
+   unsigned intfullmm : 1;
+
+   /*
+* we have performed an operation which
+* requires a complete flush of the tlb
+*/
+   unsigned intneed_flush_all : 1;
+
+   /*
+* we have removed page directories
+*/
+   unsigned intfreed_tables : 1;
 
struct mmu_gather_batch *active;
struct mmu_gather_batch local;
@@ -137,6 +147,7 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb)
tlb->start = TASK_SIZE;
tlb->end = 0;
}
+   tlb->freed_tables = 0;
 }
 
 static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
@@ -278,6 +289,7 @@ static inline void tlb_remove_check_page_size_change(struct 
mmu_gather *tlb,
 #define pte_free_tlb(tlb, ptep, address)   \
do {\
__tlb_adjust_range(tlb, address, PAGE_SIZE);\
+   tlb->freed_tables = 1;  \
__pte_free_tlb(tlb, ptep, address); \
} while (0)
 #endif
@@ -285,7 +297,8 @@ static inline void tlb_remove_check_page_size_change(struct 
mmu_gather *tlb,
 #ifndef pmd_free_tlb
 #define pmd_free_tlb(tlb, pmdp, address)   \
do {\
-   __tlb_adjust_range(tlb, address, PAGE_SIZE);\
+   __tlb_adjust_range(tlb, address, PAGE_SIZE);\
+   tlb->freed_tables = 1;  \
__pmd_free_tlb(tlb, pmdp, address); \
} while (0)
 #endif
@@ -295,6 +308,7 @@ static inline void tlb_remove_check_page_size_change(struct 
mmu_gather *tlb,
 #define pud_free_tlb(tlb, pudp, address)   \
do {\
__tlb_adjust_range(tlb, address, PAGE_SIZE);\
+   tlb->freed_tables = 1;  \
__pud_free_tlb(tlb, pudp, address); \
} while (0)
 #endif
@@ -304,7 +318,8 @@ static inline void tlb_remove_check_page_size_change(struct 
mmu_gather *tlb,
 #ifndef p4d_free_tlb
 #define p4d_free_tlb(tlb, pudp, address)   \
do {\
-   __tlb_adjust_range(tlb, address, PAGE_SIZE);\
+   __tlb_adjust_range(tlb, address, PAGE_SIZE);\
+   tlb->freed_tables = 1;  \
__p4d_free_tlb(tlb, pudp, address); \
} while (0)
 #endif
-- 
2.24.1



Re: [PATCH v2 4/8] powerpc/vdso32: inline __get_datapage()

2019-09-13 Thread Santosh Sivaraj
Christophe Leroy  writes:

> On 13/09/2019 at 15:31, Santosh Sivaraj wrote:
>> Christophe Leroy  writes:
>> 
>>> Hi Santosh,
>>>
>>> On 26/08/2019 at 07:44, Santosh Sivaraj wrote:
>>>> Hi Christophe,
>>>>
>>>> Christophe Leroy  writes:
>>>>
>>>>> __get_datapage() is only a few instructions to retrieve the
>>>>> address of the page where the kernel stores data to the VDSO.
>>>>>
>>>>> By inlining this function into its users, a bl/blr pair and
>>>>> a mflr/mtlr pair is avoided, plus a few reg moves.
>>>>>
>>>>> The improvement is noticeable (about 55 nsec/call on an 8xx)
>>>>>
>>>>> vdsotest before the patch:
>>>>> gettimeofday:vdso: 731 nsec/call
>>>>> clock-gettime-realtime-coarse:vdso: 668 nsec/call
>>>>> clock-gettime-monotonic-coarse:vdso: 745 nsec/call
>>>>>
>>>>> vdsotest after the patch:
>>>>> gettimeofday:vdso: 677 nsec/call
>>>>> clock-gettime-realtime-coarse:vdso: 613 nsec/call
>>>>> clock-gettime-monotonic-coarse:vdso: 690 nsec/call
>>>>>
>>>>> Signed-off-by: Christophe Leroy 
>>>>> ---
>>>>>arch/powerpc/kernel/vdso32/cacheflush.S   | 10 +-
>>>>>arch/powerpc/kernel/vdso32/datapage.S | 29 
>>>>> -
>>>>>arch/powerpc/kernel/vdso32/datapage.h | 11 +++
>>>>>arch/powerpc/kernel/vdso32/gettimeofday.S | 13 ++---
>>>>>4 files changed, 26 insertions(+), 37 deletions(-)
>>>>>create mode 100644 arch/powerpc/kernel/vdso32/datapage.h
>>>>
>>>> The datapage.h file should ideally be moved under include/asm, then we can 
>>>> use
>>>> the same for powerpc64 too.
>>>
>>> I have a more ambitious project indeed.
>>>
>>> Most of the VDSO code is duplicated between vdso32 and vdso64. I'm
>>> aiming at merging everything into a single source code.
>>>
>>> This means we would have to generate vdso32.so and vdso64.so out of the
>>> same source files. Any idea on how to do that ? I'm not too good at
>>> creating Makefiles. I guess we would have everything in
>>> arch/powerpc/kernel/vdso/ and would have to build the objects twice,
>>> once in arch/powerpc/kernel/vdso32/ and once in arch/powerpc/kernel/vdso64/
>> 
>> Should we need to build the objects twice? For 64 bit config it is going to 
>> be
>> a 64 bit build else a 32 bit build. It should suffice to get the single 
>> source
>> code compile for both, maybe with macros or (!)CONFIG_PPC64 conditional
>> compilation. Am I missing something when you say build twice?
>> 
>
> IIUC, on PPC64 we build vdso64 for 64bits user apps and vdso32 for 
> 32bits user apps.
>
> In arch/powerpc/kernel/Makefile, you have:
>
> obj-$(CONFIG_VDSO32)  += vdso32/
> obj-$(CONFIG_PPC64)   += vdso64/
>
> And in arch/powerpc/platforms/Kconfig.cputype, you have:
>
> config VDSO32
>   def_bool y
>   depends on PPC32 || CPU_BIG_ENDIAN
>   help
> This symbol controls whether we build the 32-bit VDSO. We obviously
> want to do that if we're building a 32-bit kernel. If we're building
> a 64-bit kernel then we only want a 32-bit VDSO if we're building for
> big endian. That is because the only little endian configuration we
> support is ppc64le which is 64-bit only.
>

I didn't know we build a 32-bit VDSO for 64-bit big endian. But I don't think
it's difficult to do, though it might be a bit tricky. We can have two targets
from the same source.

SRC = vdso/*.c
OBJS_32 = $(SRC:.c=vdso32/.o)
OBJS_64 = $(SRC:.c=vdso64/.o)

Something like this would work. Of course, this is from memory; we might have
to do something slightly different in the actual kernel Makefiles.

Thanks,
Santosh

>
>
>
> Christophe


Re: [PATCH v2 4/8] powerpc/vdso32: inline __get_datapage()

2019-09-13 Thread Santosh Sivaraj
Christophe Leroy  writes:

> Hi Santosh,
>
> On 26/08/2019 at 07:44, Santosh Sivaraj wrote:
>> Hi Christophe,
>> 
>> Christophe Leroy  writes:
>> 
>>> __get_datapage() is only a few instructions to retrieve the
>>> address of the page where the kernel stores data to the VDSO.
>>>
>>> By inlining this function into its users, a bl/blr pair and
>>> a mflr/mtlr pair is avoided, plus a few reg moves.
>>>
>>> The improvement is noticeable (about 55 nsec/call on an 8xx)
>>>
>>> vdsotest before the patch:
>>> gettimeofday:vdso: 731 nsec/call
>>> clock-gettime-realtime-coarse:vdso: 668 nsec/call
>>> clock-gettime-monotonic-coarse:vdso: 745 nsec/call
>>>
>>> vdsotest after the patch:
>>> gettimeofday:vdso: 677 nsec/call
>>> clock-gettime-realtime-coarse:vdso: 613 nsec/call
>>> clock-gettime-monotonic-coarse:vdso: 690 nsec/call
>>>
>>> Signed-off-by: Christophe Leroy 
>>> ---
>>>   arch/powerpc/kernel/vdso32/cacheflush.S   | 10 +-
>>>   arch/powerpc/kernel/vdso32/datapage.S | 29 
>>> -
>>>   arch/powerpc/kernel/vdso32/datapage.h | 11 +++
>>>   arch/powerpc/kernel/vdso32/gettimeofday.S | 13 ++---
>>>   4 files changed, 26 insertions(+), 37 deletions(-)
>>>   create mode 100644 arch/powerpc/kernel/vdso32/datapage.h
>> 
>> The datapage.h file should ideally be moved under include/asm, then we can 
>> use
>> the same for powerpc64 too.
>
> I have a more ambitious project indeed.
>
> Most of the VDSO code is duplicated between vdso32 and vdso64. I'm 
> aiming at merging everything into a single source code.
>
> This means we would have to generate vdso32.so and vdso64.so out of the 
> same source files. Any idea on how to do that ? I'm not too good at 
> creating Makefiles. I guess we would have everything in 
> arch/powerpc/kernel/vdso/ and would have to build the objects twice, 
> once in arch/powerpc/kernel/vdso32/ and once in arch/powerpc/kernel/vdso64/

Should we need to build the objects twice? For 64 bit config it is going to be
a 64 bit build else a 32 bit build. It should suffice to get the single source
code compile for both, maybe with macros or (!)CONFIG_PPC64 conditional
compilation. Am I missing something when you say build twice?

Thanks,
Santosh


Re: [PATCH 1/2] libnvdimm/altmap: Track namespace boundaries in altmap

2019-09-10 Thread Santosh Sivaraj
"Aneesh Kumar K.V"  writes:

> With PFN_MODE_PMEM namespace, the memmap area is allocated from the device
> area. Some architectures map the memmap area with large page size. On
> architectures like ppc64, 16MB page for memap mapping can map 262144 pfns.
> This maps a namespace size of 16G.
>
> When populating memmap region with 16MB page from the device area,
> make sure the allocated space is not used to map resources outside this
> namespace. Such usage of device area will prevent a namespace destroy.
>
> Add resource end pnf in altmap and use that to check if the memmap area
> allocation can map pfn outside the namespace. On ppc64 in such case we 
> fallback
> to allocation from memory.
>
> This fix kernel crash reported below:
>
> [  132.034989] WARNING: CPU: 13 PID: 13719 at mm/memremap.c:133 
> devm_memremap_pages_release+0x2d8/0x2e0
> [  133.464754] BUG: Unable to handle kernel data access at 0xc00c00010b204000
> [  133.464760] Faulting instruction address: 0xc007580c
> [  133.464766] Oops: Kernel access of bad area, sig: 11 [#1]
> [  133.464771] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
> .
> [  133.464901] NIP [c007580c] vmemmap_free+0x2ac/0x3d0
> [  133.464906] LR [c00757f8] vmemmap_free+0x298/0x3d0
> [  133.464910] Call Trace:
> [  133.464914] [c07cbfd0f7b0] [c00757f8] vmemmap_free+0x298/0x3d0 
> (unreliable)
> [  133.464921] [c07cbfd0f8d0] [c0370a44] 
> section_deactivate+0x1a4/0x240
> [  133.464928] [c07cbfd0f980] [c0386270] 
> __remove_pages+0x3a0/0x590
> [  133.464935] [c07cbfd0fa50] [c0074158] 
> arch_remove_memory+0x88/0x160
> [  133.464942] [c07cbfd0fae0] [c03be8c0] 
> devm_memremap_pages_release+0x150/0x2e0
> [  133.464949] [c07cbfd0fb70] [c0738ea0] 
> devm_action_release+0x30/0x50
> [  133.464955] [c07cbfd0fb90] [c073a5a4] release_nodes+0x344/0x400
> [  133.464961] [c07cbfd0fc40] [c073378c] 
> device_release_driver_internal+0x15c/0x250
> [  133.464968] [c07cbfd0fc80] [c072fd14] unbind_store+0x104/0x110
> [  133.464973] [c07cbfd0fcd0] [c072ee24] drv_attr_store+0x44/0x70
> [  133.464981] [c07cbfd0fcf0] [c04a32bc] sysfs_kf_write+0x6c/0xa0
> [  133.464987] [c07cbfd0fd10] [c04a1dfc] 
> kernfs_fop_write+0x17c/0x250
> [  133.464993] [c07cbfd0fd60] [c03c348c] __vfs_write+0x3c/0x70
> [  133.464999] [c07cbfd0fd80] [c03c75d0] vfs_write+0xd0/0x250
>
> Reported-by: Sachin Sant 
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  arch/powerpc/mm/init_64.c | 17 ++++-
>  drivers/nvdimm/pfn_devs.c |  2 ++
>  include/linux/memremap.h  |  1 +
>  3 files changed, 19 insertions(+), 1 deletion(-)

Tested-by: Santosh Sivaraj 

>
> diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
> index a44f6281ca3a..4e08246acd79 100644
> --- a/arch/powerpc/mm/init_64.c
> +++ b/arch/powerpc/mm/init_64.c
> @@ -172,6 +172,21 @@ static __meminit void vmemmap_list_populate(unsigned 
> long phys,
>   vmemmap_list = vmem_back;
>  }
>  
> +static bool altmap_cross_boundary(struct vmem_altmap *altmap, unsigned long 
> start,
> + unsigned long page_size)
> +{
> + unsigned long nr_pfn = page_size / sizeof(struct page);
> + unsigned long start_pfn = page_to_pfn((struct page *)start);
> +
> + if ((start_pfn + nr_pfn) > altmap->end_pfn)
> + return true;
> +
> + if (start_pfn < altmap->base_pfn)
> + return true;
> +
> + return false;
> +}
> +
>  int __meminit vmemmap_populate(unsigned long start, unsigned long end, int 
> node,
>   struct vmem_altmap *altmap)
>  {
> @@ -194,7 +209,7 @@ int __meminit vmemmap_populate(unsigned long start, 
> unsigned long end, int node,
>* fail due to alignment issues when using 16MB hugepages, so
>* fall back to system memory if the altmap allocation fail.
>*/
> - if (altmap) {
> + if (altmap && !altmap_cross_boundary(altmap, start, page_size)) 
> {
>   p = altmap_alloc_block_buf(page_size, altmap);
>   if (!p)
>   pr_debug("altmap block allocation failed, 
> falling back to system memory");
> diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
> index 3e7b11cf1aae..a616d69c8224 100644
> --- a/drivers/nvdimm/pfn_devs.c
> +++ b/drivers/nvdimm/pfn_devs.c
> @@ -618,9 +618,11 @@ static int __nvdimm_setup_pfn(struct nd_pfn *nd_pfn, 
> struct dev_pagemap *pgmap)
>   struct nd_namespace

[PATCH 1/2] powerpc/memcpy: Fix stack corruption for smaller sizes

2019-09-03 Thread Santosh Sivaraj
For sizes less than 128 bytes, the code branches out early without saving
the stack frame; when the frame is restored later, it drops the caller's frame.

Tested-by: Aneesh Kumar K.V 
Signed-off-by: Santosh Sivaraj 
---
 arch/powerpc/lib/memcpy_mcsafe_64.S | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/lib/memcpy_mcsafe_64.S 
b/arch/powerpc/lib/memcpy_mcsafe_64.S
index 949976dc115d..cb882d9a6d8a 100644
--- a/arch/powerpc/lib/memcpy_mcsafe_64.S
+++ b/arch/powerpc/lib/memcpy_mcsafe_64.S
@@ -84,7 +84,6 @@ err1; stw r0,0(r3)
 
 3: sub r5,r5,r6
cmpldi  r5,128
-   blt 5f
 
mflrr0
stdur1,-STACKFRAMESIZE(r1)
@@ -99,6 +98,7 @@ err1; stw r0,0(r3)
std r22,STK_REG(R22)(r1)
std r0,STACKFRAMESIZE+16(r1)
 
+   blt 5f
srdir6,r5,7
mtctr   r6
 
-- 
2.21.0



[PATCH 2/2] selftests/powerpc: Add a selftest for memcpy_mcsafe

2019-09-03 Thread Santosh Sivaraj
Add appropriate selftests for memcpy_mcsafe.

Suggested-by: Michael Ellerman 
Signed-off-by: Santosh Sivaraj 
---
 tools/testing/selftests/powerpc/copyloops/.gitignore   | 1 +
 tools/testing/selftests/powerpc/copyloops/Makefile | 7 ++-
 tools/testing/selftests/powerpc/copyloops/asm/export.h | 1 +
 .../testing/selftests/powerpc/copyloops/memcpy_mcsafe_64.S | 1 +
 4 files changed, 9 insertions(+), 1 deletion(-)
 create mode 120000 tools/testing/selftests/powerpc/copyloops/memcpy_mcsafe_64.S

diff --git a/tools/testing/selftests/powerpc/copyloops/.gitignore 
b/tools/testing/selftests/powerpc/copyloops/.gitignore
index de158104912a..12ef5b031974 100644
--- a/tools/testing/selftests/powerpc/copyloops/.gitignore
+++ b/tools/testing/selftests/powerpc/copyloops/.gitignore
@@ -11,3 +11,4 @@ memcpy_p7_t1
 copyuser_64_exc_t0
 copyuser_64_exc_t1
 copyuser_64_exc_t2
+memcpy_mcsafe_64
diff --git a/tools/testing/selftests/powerpc/copyloops/Makefile 
b/tools/testing/selftests/powerpc/copyloops/Makefile
index 44574f3818b3..0917983a1c78 100644
--- a/tools/testing/selftests/powerpc/copyloops/Makefile
+++ b/tools/testing/selftests/powerpc/copyloops/Makefile
@@ -12,7 +12,7 @@ ASFLAGS = $(CFLAGS) -Wa,-mpower4
 TEST_GEN_PROGS := copyuser_64_t0 copyuser_64_t1 copyuser_64_t2 \
copyuser_p7_t0 copyuser_p7_t1 \
memcpy_64_t0 memcpy_64_t1 memcpy_64_t2 \
-   memcpy_p7_t0 memcpy_p7_t1 \
+   memcpy_p7_t0 memcpy_p7_t1 memcpy_mcsafe_64 \
copyuser_64_exc_t0 copyuser_64_exc_t1 copyuser_64_exc_t2
 
 EXTRA_SOURCES := validate.c ../harness.c stubs.S
@@ -45,6 +45,11 @@ $(OUTPUT)/memcpy_p7_t%:  memcpy_power7.S $(EXTRA_SOURCES)
-D SELFTEST_CASE=$(subst memcpy_p7_t,,$(notdir $@)) \
-o $@ $^
 
+$(OUTPUT)/memcpy_mcsafe_64: memcpy_mcsafe_64.S $(EXTRA_SOURCES)
+   $(CC) $(CPPFLAGS) $(CFLAGS) \
+   -D COPY_LOOP=test_memcpy_mcsafe \
+   -o $@ $^
+
 $(OUTPUT)/copyuser_64_exc_t%: copyuser_64.S exc_validate.c ../harness.c \
copy_tofrom_user_reference.S stubs.S
$(CC) $(CPPFLAGS) $(CFLAGS) \
diff --git a/tools/testing/selftests/powerpc/copyloops/asm/export.h 
b/tools/testing/selftests/powerpc/copyloops/asm/export.h
index 05c1663c89b0..e6b80d5fbd14 100644
--- a/tools/testing/selftests/powerpc/copyloops/asm/export.h
+++ b/tools/testing/selftests/powerpc/copyloops/asm/export.h
@@ -1,3 +1,4 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #define EXPORT_SYMBOL(x)
+#define EXPORT_SYMBOL_GPL(x)
 #define EXPORT_SYMBOL_KASAN(x)
diff --git a/tools/testing/selftests/powerpc/copyloops/memcpy_mcsafe_64.S 
b/tools/testing/selftests/powerpc/copyloops/memcpy_mcsafe_64.S
new file mode 120000
index ..f0feef3062f6
--- /dev/null
+++ b/tools/testing/selftests/powerpc/copyloops/memcpy_mcsafe_64.S
@@ -0,0 +1 @@
+../../../../../arch/powerpc/lib/memcpy_mcsafe_64.S
\ No newline at end of file
-- 
2.21.0



Re: [PATCH v2 4/8] powerpc/vdso32: inline __get_datapage()

2019-08-25 Thread Santosh Sivaraj
Hi Christophe,

Christophe Leroy  writes:

> __get_datapage() is only a few instructions to retrieve the
> address of the page where the kernel stores data to the VDSO.
>
> By inlining this function into its users, a bl/blr pair and
> a mflr/mtlr pair is avoided, plus a few reg moves.
>
> The improvement is noticeable (about 55 nsec/call on an 8xx)
>
> vdsotest before the patch:
> gettimeofday:vdso: 731 nsec/call
> clock-gettime-realtime-coarse:vdso: 668 nsec/call
> clock-gettime-monotonic-coarse:vdso: 745 nsec/call
>
> vdsotest after the patch:
> gettimeofday:vdso: 677 nsec/call
> clock-gettime-realtime-coarse:vdso: 613 nsec/call
> clock-gettime-monotonic-coarse:vdso: 690 nsec/call
>
> Signed-off-by: Christophe Leroy 
> ---
>  arch/powerpc/kernel/vdso32/cacheflush.S   | 10 +-
>  arch/powerpc/kernel/vdso32/datapage.S | 29 -
>  arch/powerpc/kernel/vdso32/datapage.h | 11 +++
>  arch/powerpc/kernel/vdso32/gettimeofday.S | 13 ++---
>  4 files changed, 26 insertions(+), 37 deletions(-)
>  create mode 100644 arch/powerpc/kernel/vdso32/datapage.h

The datapage.h file should ideally be moved under include/asm, then we can use
the same for powerpc64 too.

Santosh

>
> diff --git a/arch/powerpc/kernel/vdso32/cacheflush.S 
> b/arch/powerpc/kernel/vdso32/cacheflush.S
> index 7f882e7b9f43..e9453837e4ee 100644
> --- a/arch/powerpc/kernel/vdso32/cacheflush.S
> +++ b/arch/powerpc/kernel/vdso32/cacheflush.S
> @@ -10,6 +10,8 @@
>  #include 
>  #include 
>  
> +#include "datapage.h"
> +
>   .text
>  
>  /*
> @@ -24,14 +26,12 @@ V_FUNCTION_BEGIN(__kernel_sync_dicache)
>.cfi_startproc
>   mflrr12
>.cfi_register lr,r12
> - mr  r11,r3
> - bl  __get_datapage@local
> + get_datapager10, r0
>   mtlrr12
> - mr  r10,r3
>  
>   lwz r7,CFG_DCACHE_BLOCKSZ(r10)
>   addir5,r7,-1
> - andcr6,r11,r5   /* round low to line bdy */
> + andcr6,r3,r5/* round low to line bdy */
>   subfr8,r6,r4/* compute length */
>   add r8,r8,r5/* ensure we get enough */
>   lwz r9,CFG_DCACHE_LOGBLOCKSZ(r10)
> @@ -48,7 +48,7 @@ V_FUNCTION_BEGIN(__kernel_sync_dicache)
>  
>   lwz r7,CFG_ICACHE_BLOCKSZ(r10)
>   addir5,r7,-1
> - andcr6,r11,r5   /* round low to line bdy */
> + andcr6,r3,r5/* round low to line bdy */
>   subfr8,r6,r4/* compute length */
>   add r8,r8,r5
>   lwz r9,CFG_ICACHE_LOGBLOCKSZ(r10)
> diff --git a/arch/powerpc/kernel/vdso32/datapage.S 
> b/arch/powerpc/kernel/vdso32/datapage.S
> index 6984125b9fc0..d480d2d4a3fe 100644
> --- a/arch/powerpc/kernel/vdso32/datapage.S
> +++ b/arch/powerpc/kernel/vdso32/datapage.S
> @@ -11,34 +11,13 @@
>  #include 
>  #include 
>  
> +#include "datapage.h"
> +
>   .text
>   .global __kernel_datapage_offset;
>  __kernel_datapage_offset:
>   .long   0
>  
> -V_FUNCTION_BEGIN(__get_datapage)
> -  .cfi_startproc
> - /* We don't want that exposed or overridable as we want other objects
> -  * to be able to bl directly to here
> -  */
> - .protected __get_datapage
> - .hidden __get_datapage
> -
> - mflrr0
> -  .cfi_register lr,r0
> -
> - bcl 20,31,data_page_branch
> -data_page_branch:
> - mflrr3
> - mtlrr0
> - addir3, r3, __kernel_datapage_offset-data_page_branch
> - lwz r0,0(r3)
> -  .cfi_restore lr
> - add r3,r0,r3
> - blr
> -  .cfi_endproc
> -V_FUNCTION_END(__get_datapage)
> -
>  /*
>   * void *__kernel_get_syscall_map(unsigned int *syscall_count) ;
>   *
> @@ -53,7 +32,7 @@ V_FUNCTION_BEGIN(__kernel_get_syscall_map)
>   mflrr12
>.cfi_register lr,r12
>   mr  r4,r3
> - bl  __get_datapage@local
> + get_datapager3, r0
>   mtlrr12
>   addir3,r3,CFG_SYSCALL_MAP32
>   cmpli   cr0,r4,0
> @@ -74,7 +53,7 @@ V_FUNCTION_BEGIN(__kernel_get_tbfreq)
>.cfi_startproc
>   mflrr12
>.cfi_register lr,r12
> - bl  __get_datapage@local
> + get_datapager3, r0
>   lwz r4,(CFG_TB_TICKS_PER_SEC + 4)(r3)
>   lwz r3,CFG_TB_TICKS_PER_SEC(r3)
>   mtlrr12
> diff --git a/arch/powerpc/kernel/vdso32/datapage.h 
> b/arch/powerpc/kernel/vdso32/datapage.h
> new file mode 100644
> index ..74f4f57c2da8
> --- /dev/null
> +++ b/arch/powerpc/kernel/vdso32/datapage.h
> @@ -0,0 +1,11 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> +
> +.macro get_datapage ptr, tmp
> + bcl 20,31,.+4
> + mflr\ptr
> + addi\ptr, \ptr, __kernel_datapage_offset - (.-4)
> + lwz \tmp, 0(\ptr)
> + add \ptr, \tmp, \ptr
> +.endm
> +
> +
> diff --git a/arch/powerpc/kernel/vdso32/gettimeofday.S 
> b/arch/powerpc/kernel/vdso32/gettimeofday.S
> index 355b537d327a..3e55cba19f44 

Re: [PATCH] powerpc/mm: tell if a bad page fault on data is read or write.

2019-08-25 Thread Santosh Sivaraj
Christophe Leroy  writes:

> DSISR has a bit to tell if the fault is due to a read or a write.
>
> Display it.
>
> Signed-off-by: Christophe Leroy 

Reviewed-by: Santosh Sivaraj 

> ---
>  arch/powerpc/mm/fault.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> index 8432c281de92..b5047f9b5dec 100644
> --- a/arch/powerpc/mm/fault.c
> +++ b/arch/powerpc/mm/fault.c
> @@ -645,6 +645,7 @@ NOKPROBE_SYMBOL(do_page_fault);
>  void bad_page_fault(struct pt_regs *regs, unsigned long address, int sig)
>  {
>   const struct exception_table_entry *entry;
> + int is_write = page_fault_is_write(regs->dsisr);
>  
>   /* Are we prepared to handle this fault?  */
>   if ((entry = search_exception_tables(regs->nip)) != NULL) {
> @@ -658,9 +659,10 @@ void bad_page_fault(struct pt_regs *regs, unsigned long 
> address, int sig)
>   case 0x300:
>   case 0x380:
>   case 0xe00:
> - pr_alert("BUG: %s at 0x%08lx\n",
> + pr_alert("BUG: %s on %s at 0x%08lx\n",
>regs->dar < PAGE_SIZE ? "Kernel NULL pointer 
> dereference" :
> -  "Unable to handle kernel data access", regs->dar);
> +  "Unable to handle kernel data access",
> +  is_write ? "write" : "read", regs->dar);
>   break;
>   case 0x400:
>   case 0x480:
> -- 
> 2.13.3


Re: [PATCH] powerpc/vdso64: inline __get_datapage()

2019-08-22 Thread Santosh Sivaraj
Christophe Leroy  writes:

> On 21/08/2019 at 14:15, Segher Boessenkool wrote:
>> On Wed, Aug 21, 2019 at 01:50:52PM +0200, Christophe Leroy wrote:
>>> Do you have any idea on how to avoid that bcl/mflr stuff ?
>> 
>> Do a load from some fixed address?  Maybe an absolute address, even?
>> lwz r3,-12344(0)  or similar (that address is in kernel space...)
>> 
>> There aren't many options, and certainly not many *good* options!
>> 
>
> IIUC, the VDSO is seen by apps the same way as a dynamic lib. Couldn't 
> the relocation be done only once when the app loads the VDSO as for a 
> regular .so lib ?

How does address space randomization work for .so libs?

>
> It looks like it is what others do, at least x86 and arm64, unless I 
> misunderstood their code.
>
> Christophe


Re: [PATCH] powerpc/vdso64: inline __get_datapage()

2019-08-21 Thread Santosh Sivaraj
Christophe Leroy  writes:

> On 21/08/2019 at 11:29, Santosh Sivaraj wrote:
>> __get_datapage() is only a few instructions to retrieve the
>> address of the page where the kernel stores data to the VDSO.
>> 
>> By inlining this function into its users, a bl/blr pair and
>> a mflr/mtlr pair is avoided, plus a few reg moves.
>> 
>> clock-gettime-monotonic: syscall: 514 nsec/call  396 nsec/call
>> clock-gettime-monotonic:libc: 25 nsec/call   24 nsec/call
>> clock-gettime-monotonic:vdso: 20 nsec/call   20 nsec/call
>> clock-getres-monotonic: syscall: 347 nsec/call   372 nsec/call
>> clock-getres-monotonic:libc: 19 nsec/call19 nsec/call
>> clock-getres-monotonic:vdso: 10 nsec/call10 nsec/call
>> clock-gettime-monotonic-coarse: syscall: 511 nsec/call   396 nsec/call
>> clock-gettime-monotonic-coarse:libc: 23 nsec/call21 nsec/call
>> clock-gettime-monotonic-coarse:vdso: 15 nsec/call13 nsec/call
>> clock-gettime-realtime: syscall: 526 nsec/call   405 nsec/call
>> clock-gettime-realtime:libc: 24 nsec/call23 nsec/call
>> clock-gettime-realtime:vdso: 18 nsec/call18 nsec/call
>> clock-getres-realtime: syscall: 342 nsec/call372 nsec/call
>> clock-getres-realtime:libc: 19 nsec/call 19 nsec/call
>> clock-getres-realtime:vdso: 10 nsec/call 10 nsec/call
>> clock-gettime-realtime-coarse: syscall: 515 nsec/call373 nsec/call
>> clock-gettime-realtime-coarse:libc: 23 nsec/call 22 nsec/call
>> clock-gettime-realtime-coarse:vdso: 14 nsec/call 13 nsec/call
>
> I think you should only put the measurements on vdso calls, and only the 
> ones that are impacted by the change. For exemple, getres function 
> doesn't use __get_datapage so showing it here is pointless.
>
> gettimeofday should be shown there as it uses __get_datapage()
>
>
>> 
>> Based on the patch by Christophe Leroy  for vdso32.
>> 
>> Signed-off-by: Santosh Sivaraj 
>> ---
>> 
>> except for a couple of calls (1 or 2 nsec reduction), there are no
>> improvements in the call times. Or is 10 nsec the minimum granularity??
>
> Maybe the ones that show no improvements are the ones that don't use 
> __get_datapage() at all ...

Yes makes sense.

>
>> 
>> So I don't know if its even worth updating vdso64 except to keep vdso32 and
>> vdso64 equal.
>
> 2ns on a 15ns call is 13% so it is worth it I think.

True. Since datapage.h is the same for both 32 and 64, maybe we should put
it in include/asm.

Thanks,
Santosh
>
> Christophe
>
>
>> 
>> 
>>   arch/powerpc/kernel/vdso64/cacheflush.S   | 10 
>>   arch/powerpc/kernel/vdso64/datapage.S | 29 ---
>>   arch/powerpc/kernel/vdso64/datapage.h | 10 
>>   arch/powerpc/kernel/vdso64/gettimeofday.S |  8 ---
>>   4 files changed, 24 insertions(+), 33 deletions(-)
>>   create mode 100644 arch/powerpc/kernel/vdso64/datapage.h
>> 
>> diff --git a/arch/powerpc/kernel/vdso64/cacheflush.S 
>> b/arch/powerpc/kernel/vdso64/cacheflush.S
>> index 3f92561a64c4..30e8b0d29bea 100644
>> --- a/arch/powerpc/kernel/vdso64/cacheflush.S
>> +++ b/arch/powerpc/kernel/vdso64/cacheflush.S
>> @@ -10,6 +10,8 @@
>>   #include 
>>   #include 
>>   
>> +#include "datapage.h"
>> +
>>  .text
>>   
>>   /*
>> @@ -24,14 +26,12 @@ V_FUNCTION_BEGIN(__kernel_sync_dicache)
>> .cfi_startproc
>>  mflrr12
>> .cfi_register lr,r12
>> -mr  r11,r3
>> -bl  V_LOCAL_FUNC(__get_datapage)
>> +get_datapager11, r0
>>  mtlrr12
>> -mr  r10,r3
>>   
>>  lwz r7,CFG_DCACHE_BLOCKSZ(r10)
>>  addir5,r7,-1
>> -andcr6,r11,r5   /* round low to line bdy */
>> +andcr6,r3,r5/* round low to line bdy */
>>  subfr8,r6,r4/* compute length */
>>  add r8,r8,r5/* ensure we get enough */
>>  lwz r9,CFG_DCACHE_LOGBLOCKSZ(r10)
>> @@ -48,7 +48,7 @@ V_FUNCTION_BEGIN(__kernel_sync_dicache)
>>   
>>  lwz r7,CFG_ICACHE_BLOCKSZ(r10)
>>  addir5,r7,-1
>> -andcr6,r11,r5   /* round low to line bdy */
>> +andcr6,r3,r5/* round low to line bdy */
>>  subfr8,r6,r4/* compute length */
>>  add r8,r8,r5
>>  lwz r9,CFG_ICACHE_LOGBLOCKSZ(r10)
>> diff --git a/arch/powerpc/kernel/vdso64/datapage.S 
>> b/arch/powerpc/kernel/vdso64/datapage

[PATCH] powerpc/vdso64: inline __get_datapage()

2019-08-21 Thread Santosh Sivaraj
__get_datapage() is only a few instructions to retrieve the
address of the page where the kernel stores data to the VDSO.

By inlining this function into its users, a bl/blr pair and
a mflr/mtlr pair is avoided, plus a few reg moves.

clock-gettime-monotonic: syscall: 514 nsec/call  396 nsec/call
clock-gettime-monotonic:libc: 25 nsec/call   24 nsec/call
clock-gettime-monotonic:vdso: 20 nsec/call   20 nsec/call
clock-getres-monotonic: syscall: 347 nsec/call   372 nsec/call
clock-getres-monotonic:libc: 19 nsec/call19 nsec/call
clock-getres-monotonic:vdso: 10 nsec/call10 nsec/call
clock-gettime-monotonic-coarse: syscall: 511 nsec/call   396 nsec/call
clock-gettime-monotonic-coarse:libc: 23 nsec/call21 nsec/call
clock-gettime-monotonic-coarse:vdso: 15 nsec/call13 nsec/call
clock-gettime-realtime: syscall: 526 nsec/call   405 nsec/call
clock-gettime-realtime:libc: 24 nsec/call23 nsec/call
clock-gettime-realtime:vdso: 18 nsec/call18 nsec/call
clock-getres-realtime: syscall: 342 nsec/call372 nsec/call
clock-getres-realtime:libc: 19 nsec/call 19 nsec/call
clock-getres-realtime:vdso: 10 nsec/call 10 nsec/call
clock-gettime-realtime-coarse: syscall: 515 nsec/call373 nsec/call
clock-gettime-realtime-coarse:libc: 23 nsec/call 22 nsec/call
clock-gettime-realtime-coarse:vdso: 14 nsec/call 13 nsec/call

Based on the patch by Christophe Leroy  for vdso32.

Signed-off-by: Santosh Sivaraj 
---

except for a couple of calls (1 or 2 nsec reduction), there are no
improvements in the call times. Or is 10 nsec the minimum granularity??

So I don't know if its even worth updating vdso64 except to keep vdso32 and
vdso64 equal.


 arch/powerpc/kernel/vdso64/cacheflush.S   | 10 
 arch/powerpc/kernel/vdso64/datapage.S | 29 ---
 arch/powerpc/kernel/vdso64/datapage.h | 10 
 arch/powerpc/kernel/vdso64/gettimeofday.S |  8 ---
 4 files changed, 24 insertions(+), 33 deletions(-)
 create mode 100644 arch/powerpc/kernel/vdso64/datapage.h

diff --git a/arch/powerpc/kernel/vdso64/cacheflush.S 
b/arch/powerpc/kernel/vdso64/cacheflush.S
index 3f92561a64c4..30e8b0d29bea 100644
--- a/arch/powerpc/kernel/vdso64/cacheflush.S
+++ b/arch/powerpc/kernel/vdso64/cacheflush.S
@@ -10,6 +10,8 @@
 #include 
 #include 
 
+#include "datapage.h"
+
.text
 
 /*
@@ -24,14 +26,12 @@ V_FUNCTION_BEGIN(__kernel_sync_dicache)
   .cfi_startproc
mflrr12
   .cfi_register lr,r12
-   mr  r11,r3
-   bl  V_LOCAL_FUNC(__get_datapage)
+   get_datapager11, r0
mtlrr12
-   mr  r10,r3
 
lwz r7,CFG_DCACHE_BLOCKSZ(r10)
addir5,r7,-1
-   andcr6,r11,r5   /* round low to line bdy */
+   andcr6,r3,r5/* round low to line bdy */
subfr8,r6,r4/* compute length */
add r8,r8,r5/* ensure we get enough */
lwz r9,CFG_DCACHE_LOGBLOCKSZ(r10)
@@ -48,7 +48,7 @@ V_FUNCTION_BEGIN(__kernel_sync_dicache)
 
lwz r7,CFG_ICACHE_BLOCKSZ(r10)
addir5,r7,-1
-   andcr6,r11,r5   /* round low to line bdy */
+   andcr6,r3,r5/* round low to line bdy */
subfr8,r6,r4/* compute length */
add r8,r8,r5
lwz r9,CFG_ICACHE_LOGBLOCKSZ(r10)
diff --git a/arch/powerpc/kernel/vdso64/datapage.S 
b/arch/powerpc/kernel/vdso64/datapage.S
index dc84f5ae3802..8712f57c931c 100644
--- a/arch/powerpc/kernel/vdso64/datapage.S
+++ b/arch/powerpc/kernel/vdso64/datapage.S
@@ -11,34 +11,13 @@
 #include 
 #include 
 
+#include "datapage.h"
+
.text
 .global__kernel_datapage_offset;
 __kernel_datapage_offset:
.long   0
 
-V_FUNCTION_BEGIN(__get_datapage)
-  .cfi_startproc
-   /* We don't want that exposed or overridable as we want other objects
-* to be able to bl directly to here
-*/
-   .protected __get_datapage
-   .hidden __get_datapage
-
-   mflrr0
-  .cfi_register lr,r0
-
-   bcl 20,31,data_page_branch
-data_page_branch:
-   mflrr3
-   mtlrr0
-   addir3, r3, __kernel_datapage_offset-data_page_branch
-   lwz r0,0(r3)
-  .cfi_restore lr
-   add r3,r0,r3
-   blr
-  .cfi_endproc
-V_FUNCTION_END(__get_datapage)
-
 /*
  * void *__kernel_get_syscall_map(unsigned int *syscall_count) ;
  *
@@ -53,7 +32,7 @@ V_FUNCTION_BEGIN(__kernel_get_syscall_map)
mflrr12
   .cfi_register lr,r12
mr  r4,r3
-   bl  V_LOCAL_FUNC(__get_datapage)
+   get_datapager3, r0
mtlrr12
addir3,r3,CFG_SYSCALL_MAP64
cmpldi  cr0,r4,0
@@ -75,7 +54,7 @@ V_FUNCTION_BEGIN(__kernel_get_tbfreq)
   .cfi_startproc
mflrr12
   .cfi_register lr,r12
-   bl  V_LOCAL_FUNC(__get_datapage)
+   get_datapager3, r

[PATCH v11 7/7] powerpc: add machine check safe copy_to_user

2019-08-20 Thread Santosh Sivaraj
Use the memcpy_mcsafe() implementation to define copy_to_user_mcsafe().
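
A minimal usage sketch, assuming the semantics of the helper below (the
return value is the number of bytes left uncopied); the caller is
hypothetical:

static ssize_t pmem_read_to_user_sketch(void __user *ubuf, const void *src, size_t len)
{
	size_t left = copy_to_user_mcsafe(ubuf, src, len);

	if (left)
		return -EIO;	/* a machine check (or fault) cut the copy short */
	return len;
}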

Signed-off-by: Santosh Sivaraj 
---
 arch/powerpc/Kconfig   |  1 +
 arch/powerpc/include/asm/uaccess.h | 14 ++
 2 files changed, 15 insertions(+)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index d8dcd8820369..39c738aa600a 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -136,6 +136,7 @@ config PPC
select ARCH_HAS_STRICT_KERNEL_RWX   if ((PPC_BOOK3S_64 || PPC32) && 
!RELOCATABLE && !HIBERNATION)
select ARCH_HAS_TICK_BROADCAST  if GENERIC_CLOCKEVENTS_BROADCAST
select ARCH_HAS_UACCESS_FLUSHCACHE  if PPC64
+   select ARCH_HAS_UACCESS_MCSAFE  if PPC64
select ARCH_HAS_UBSAN_SANITIZE_ALL
select ARCH_HAVE_NMI_SAFE_CMPXCHG
select ARCH_KEEP_MEMBLOCK
diff --git a/arch/powerpc/include/asm/uaccess.h 
b/arch/powerpc/include/asm/uaccess.h
index 8b03eb44e876..15002b51ff18 100644
--- a/arch/powerpc/include/asm/uaccess.h
+++ b/arch/powerpc/include/asm/uaccess.h
@@ -387,6 +387,20 @@ static inline unsigned long raw_copy_to_user(void __user 
*to,
return ret;
 }
 
+static __always_inline unsigned long __must_check
+copy_to_user_mcsafe(void __user *to, const void *from, unsigned long n)
+{
+   if (likely(check_copy_size(from, n, true))) {
+   if (access_ok(to, n)) {
+   allow_write_to_user(to, n);
+   n = memcpy_mcsafe((void *)to, from, n);
+   prevent_write_to_user(to, n);
+   }
+   }
+
+   return n;
+}
+
 extern unsigned long __clear_user(void __user *addr, unsigned long size);
 
 static inline unsigned long clear_user(void __user *addr, unsigned long size)
-- 
2.21.0



[PATCH v11 6/7] powerpc/memcpy: Add memcpy_mcsafe for pmem

2019-08-20 Thread Santosh Sivaraj
From: Balbir Singh 

The pmem infrastructure uses memcpy_mcsafe in the pmem layer so as to
convert machine check exceptions into a return value on failure in case
a machine check exception is encountered during the memcpy. The return
value is the number of bytes remaining to be copied.

This patch largely borrows from the copyuser_power7 logic and does not add
the VMX optimizations, largely to keep the patch simple. If needed those
optimizations can be folded in.
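
As a hedged illustration of that return-value contract (the wrapper is
hypothetical, not part of the patch):

static int copy_from_pmem_sketch(void *dst, const void *pmem_src, size_t len)
{
	int left = memcpy_mcsafe(dst, pmem_src, len);	/* 0 on success, bytes remaining on MCE */

	return left ? -EIO : 0;
}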

Signed-off-by: Balbir Singh 
[ar...@linux.ibm.com: Added symbol export]
Co-developed-by: Santosh Sivaraj 
Signed-off-by: Santosh Sivaraj 
---
 arch/powerpc/include/asm/string.h   |   2 +
 arch/powerpc/lib/Makefile   |   2 +-
 arch/powerpc/lib/memcpy_mcsafe_64.S | 242 
 3 files changed, 245 insertions(+), 1 deletion(-)
 create mode 100644 arch/powerpc/lib/memcpy_mcsafe_64.S

diff --git a/arch/powerpc/include/asm/string.h 
b/arch/powerpc/include/asm/string.h
index 9bf6dffb4090..b72692702f35 100644
--- a/arch/powerpc/include/asm/string.h
+++ b/arch/powerpc/include/asm/string.h
@@ -53,7 +53,9 @@ void *__memmove(void *to, const void *from, __kernel_size_t 
n);
 #ifndef CONFIG_KASAN
 #define __HAVE_ARCH_MEMSET32
 #define __HAVE_ARCH_MEMSET64
+#define __HAVE_ARCH_MEMCPY_MCSAFE
 
+extern int memcpy_mcsafe(void *dst, const void *src, __kernel_size_t sz);
 extern void *__memset16(uint16_t *, uint16_t v, __kernel_size_t);
 extern void *__memset32(uint32_t *, uint32_t v, __kernel_size_t);
 extern void *__memset64(uint64_t *, uint64_t v, __kernel_size_t);
diff --git a/arch/powerpc/lib/Makefile b/arch/powerpc/lib/Makefile
index eebc782d89a5..fa6b1b657b43 100644
--- a/arch/powerpc/lib/Makefile
+++ b/arch/powerpc/lib/Makefile
@@ -39,7 +39,7 @@ obj-$(CONFIG_PPC_BOOK3S_64) += copyuser_power7.o 
copypage_power7.o \
   memcpy_power7.o
 
 obj64-y+= copypage_64.o copyuser_64.o mem_64.o hweight_64.o \
-  memcpy_64.o pmem.o
+  memcpy_64.o pmem.o memcpy_mcsafe_64.o
 
 obj64-$(CONFIG_SMP)+= locks.o
 obj64-$(CONFIG_ALTIVEC)+= vmx-helper.o
diff --git a/arch/powerpc/lib/memcpy_mcsafe_64.S 
b/arch/powerpc/lib/memcpy_mcsafe_64.S
new file mode 100644
index ..949976dc115d
--- /dev/null
+++ b/arch/powerpc/lib/memcpy_mcsafe_64.S
@@ -0,0 +1,242 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) IBM Corporation, 2011
+ * Derived from copyuser_power7.s by Anton Blanchard 
+ * Author - Balbir Singh 
+ */
+#include 
+#include 
+#include 
+
+   .macro err1
+100:
+   EX_TABLE(100b,.Ldo_err1)
+   .endm
+
+   .macro err2
+200:
+   EX_TABLE(200b,.Ldo_err2)
+   .endm
+
+   .macro err3
+300:   EX_TABLE(300b,.Ldone)
+   .endm
+
+.Ldo_err2:
+   ld  r22,STK_REG(R22)(r1)
+   ld  r21,STK_REG(R21)(r1)
+   ld  r20,STK_REG(R20)(r1)
+   ld  r19,STK_REG(R19)(r1)
+   ld  r18,STK_REG(R18)(r1)
+   ld  r17,STK_REG(R17)(r1)
+   ld  r16,STK_REG(R16)(r1)
+   ld  r15,STK_REG(R15)(r1)
+   ld  r14,STK_REG(R14)(r1)
+   addir1,r1,STACKFRAMESIZE
+.Ldo_err1:
+   /* Do a byte by byte copy to get the exact remaining size */
+   mtctr   r7
+46:
+err3;  lbz r0,0(r4)
+   addir4,r4,1
+err3;  stb r0,0(r3)
+   addir3,r3,1
+   bdnz46b
+   li  r3,0
+   blr
+
+.Ldone:
+   mfctr   r3
+   blr
+
+
+_GLOBAL(memcpy_mcsafe)
+   mr  r7,r5
+   cmpldi  r5,16
+   blt .Lshort_copy
+
+.Lcopy:
+   /* Get the source 8B aligned */
+   neg r6,r4
+   mtocrf  0x01,r6
+   clrldi  r6,r6,(64-3)
+
+   bf  cr7*4+3,1f
+err1;  lbz r0,0(r4)
+   addir4,r4,1
+err1;  stb r0,0(r3)
+   addir3,r3,1
+   subir7,r7,1
+
+1: bf  cr7*4+2,2f
+err1;  lhz r0,0(r4)
+   addir4,r4,2
+err1;  sth r0,0(r3)
+   addir3,r3,2
+   subir7,r7,2
+
+2: bf  cr7*4+1,3f
+err1;  lwz r0,0(r4)
+   addir4,r4,4
+err1;  stw r0,0(r3)
+   addir3,r3,4
+   subir7,r7,4
+
+3: sub r5,r5,r6
+   cmpldi  r5,128
+   blt 5f
+
+   mflrr0
+   stdur1,-STACKFRAMESIZE(r1)
+   std r14,STK_REG(R14)(r1)
+   std r15,STK_REG(R15)(r1)
+   std r16,STK_REG(R16)(r1)
+   std r17,STK_REG(R17)(r1)
+   std r18,STK_REG(R18)(r1)
+   std r19,STK_REG(R19)(r1)
+   std r20,STK_REG(R20)(r1)
+   std r21,STK_REG(R21)(r1)
+   std r22,STK_REG(R22)(r1)
+   std r0,STACKFRAMESIZE+16(r1)
+
+   srdir6,r5,7
+   mtctr   r6
+
+   /* Now do cacheline (128B) sized loads and stores. */
+   .align  5
+4:
+err2;  ld  r0,0(r4)
+err2;  ld  r6,8(r4)
+err2;  ld  r8,16(r4)
+err2;  ld  r9,24(r4)
+err2;  ld  r10,32(r4)
+err2;  ld  r11,40(r4)
+err2;  ld  r12,48(r4)
+err2;  ld  r14,56(r4)
+err2;  ld  r15,64(r4)
+err2;  ld  r16,72
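
To show how the bytes-remaining return value described above is meant to
be consumed, here is an illustrative sketch (not part of the patch) of a
pmem-style read helper; the name read_pmem_sketch() and the mapping to
BLK_STS_IOERR are assumptions modelled loosely on the pmem block driver.

```
/* Sketch only: memcpy_mcsafe() is the routine added by this patch. */
#include <linux/string.h>
#include <linux/types.h>
#include <linux/blk_types.h>

static blk_status_t read_pmem_sketch(void *dst, const void *pmem_addr,
				     size_t len)
{
	int rem;

	/* returns 0 on success, or the number of bytes left uncopied */
	rem = memcpy_mcsafe(dst, pmem_addr, len);
	if (rem)
		return BLK_STS_IOERR;	/* surfaced to the caller as -EIO */

	return BLK_STS_OK;
}
```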

[PATCH v11 5/7] powerpc/mce: Handle UE event for memcpy_mcsafe

2019-08-20 Thread Santosh Sivaraj
From: Balbir Singh 

If we take a UE on one of the instructions with a fixup entry, set nip
to continue execution at the fixup entry. Stop processing the event
further or print it.

Co-developed-by: Reza Arbab 
Signed-off-by: Reza Arbab 
Signed-off-by: Balbir Singh 
Reviewed-by: Mahesh Salgaonkar 
Reviewed-by: Nicholas Piggin 
Signed-off-by: Santosh Sivaraj 
---
 arch/powerpc/include/asm/mce.h  |  4 +++-
 arch/powerpc/kernel/mce.c   | 16 
 arch/powerpc/kernel/mce_power.c | 15 +--
 3 files changed, 32 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/mce.h b/arch/powerpc/include/asm/mce.h
index a4c6a74ad2fb..19a33707d5ef 100644
--- a/arch/powerpc/include/asm/mce.h
+++ b/arch/powerpc/include/asm/mce.h
@@ -122,7 +122,8 @@ struct machine_check_event {
enum MCE_UeErrorType ue_error_type:8;
u8  effective_address_provided;
u8  physical_address_provided;
-   u8  reserved_1[5];
+   u8  ignore_event;
+   u8  reserved_1[4];
u64 effective_address;
u64 physical_address;
u8  reserved_2[8];
@@ -193,6 +194,7 @@ struct mce_error_info {
enum MCE_Initiator  initiator:8;
enum MCE_ErrorClass error_class:8;
boolsync_error;
+   boolignore_event;
 };
 
 #define MAX_MC_EVT 100
diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
index a3b122a685a5..ec4b3e1087be 100644
--- a/arch/powerpc/kernel/mce.c
+++ b/arch/powerpc/kernel/mce.c
@@ -149,6 +149,7 @@ void save_mce_event(struct pt_regs *regs, long handled,
if (phys_addr != ULONG_MAX) {
mce->u.ue_error.physical_address_provided = true;
mce->u.ue_error.physical_address = phys_addr;
+   mce->u.ue_error.ignore_event = mce_err->ignore_event;
machine_check_ue_event(mce);
}
}
@@ -266,8 +267,17 @@ static void machine_process_ue_event(struct work_struct 
*work)
/*
 * This should probably queued elsewhere, but
 * oh! well
+*
+* Don't report this machine check because the caller has
+* asked us to ignore the event; it has a fixup handler which
+* will do the appropriate error handling and reporting.
 */
if (evt->error_type == MCE_ERROR_TYPE_UE) {
+   if (evt->u.ue_error.ignore_event) {
+   __this_cpu_dec(mce_ue_count);
+   continue;
+   }
+
if (evt->u.ue_error.physical_address_provided) {
unsigned long pfn;
 
@@ -301,6 +311,12 @@ static void machine_check_process_queued_event(struct 
irq_work *work)
while (__this_cpu_read(mce_queue_count) > 0) {
index = __this_cpu_read(mce_queue_count) - 1;
evt = this_cpu_ptr(_event_queue[index]);
+
+   if (evt->error_type == MCE_ERROR_TYPE_UE &&
+   evt->u.ue_error.ignore_event) {
+   __this_cpu_dec(mce_queue_count);
+   continue;
+   }
machine_check_print_event_info(evt, false, false);
__this_cpu_dec(mce_queue_count);
}
diff --git a/arch/powerpc/kernel/mce_power.c b/arch/powerpc/kernel/mce_power.c
index 714a98e0927f..b6cbe3449358 100644
--- a/arch/powerpc/kernel/mce_power.c
+++ b/arch/powerpc/kernel/mce_power.c
@@ -11,6 +11,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -18,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * Convert an address related to an mm to a PFN. NOTE: we are in real
@@ -565,9 +567,18 @@ static int mce_handle_derror(struct pt_regs *regs,
return 0;
 }
 
-static long mce_handle_ue_error(struct pt_regs *regs)
+static long mce_handle_ue_error(struct pt_regs *regs,
+   struct mce_error_info *mce_err)
 {
long handled = 0;
+   const struct exception_table_entry *entry;
+
+   entry = search_kernel_exception_table(regs->nip);
+   if (entry) {
+   mce_err->ignore_event = true;
+   regs->nip = extable_fixup(entry);
+   return 1;
+   }
 
/*
 * On specific SCOM read via MMIO we may get a machine check
@@ -600,7 +611,7 @@ static long mce_handle_error(struct pt_regs *regs,
_addr);
 
if (!handled && mce_err.error_type == MCE_ERROR_TYPE_UE)
-   handled = mce_handle_ue_error(re

[PATCH v11 4/7] extable: Add function to search only kernel exception table

2019-08-20 Thread Santosh Sivaraj
In certain architecture-specific operating modes (e.g., the powerpc
machine check handler, which is unable to access vmalloc memory),
search_exception_tables() cannot be called because it also searches the
module exception tables if the entry is not found in the kernel
exception table.

Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Nicholas Piggin 
Signed-off-by: Santosh Sivaraj 
Reviewed-by: Nicholas Piggin 
---
 include/linux/extable.h |  2 ++
 kernel/extable.c| 11 +--
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/include/linux/extable.h b/include/linux/extable.h
index 41c5b3a25f67..81ecfaa83ad3 100644
--- a/include/linux/extable.h
+++ b/include/linux/extable.h
@@ -19,6 +19,8 @@ void trim_init_extable(struct module *m);
 
 /* Given an address, look for it in the exception tables */
 const struct exception_table_entry *search_exception_tables(unsigned long add);
+const struct exception_table_entry *
+search_kernel_exception_table(unsigned long addr);
 
 #ifdef CONFIG_MODULES
 /* For extable.c to search modules' exception tables. */
diff --git a/kernel/extable.c b/kernel/extable.c
index e23cce6e6092..f6c9406eec7d 100644
--- a/kernel/extable.c
+++ b/kernel/extable.c
@@ -40,13 +40,20 @@ void __init sort_main_extable(void)
}
 }
 
+/* Given an address, look for it in the kernel exception table */
+const
+struct exception_table_entry *search_kernel_exception_table(unsigned long addr)
+{
+   return search_extable(__start___ex_table,
+ __stop___ex_table - __start___ex_table, addr);
+}
+
 /* Given an address, look for it in the exception tables. */
 const struct exception_table_entry *search_exception_tables(unsigned long addr)
 {
const struct exception_table_entry *e;
 
-   e = search_extable(__start___ex_table,
-  __stop___ex_table - __start___ex_table, addr);
+   e = search_kernel_exception_table(addr);
if (!e)
e = search_module_extables(addr);
return e;
-- 
2.21.0



[PATCH v11 3/7] powerpc/mce: Make machine_check_ue_event() static

2019-08-20 Thread Santosh Sivaraj
From: Reza Arbab 

The function doesn't get used outside this file, so make it static.

Signed-off-by: Reza Arbab 
Signed-off-by: Santosh Sivaraj 
Reviewed-by: Nicholas Piggin 
---
 arch/powerpc/kernel/mce.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
index cff31d4a501f..a3b122a685a5 100644
--- a/arch/powerpc/kernel/mce.c
+++ b/arch/powerpc/kernel/mce.c
@@ -34,7 +34,7 @@ static DEFINE_PER_CPU(struct machine_check_event[MAX_MC_EVT],
 
 static void machine_check_process_queued_event(struct irq_work *work);
 static void machine_check_ue_irq_work(struct irq_work *work);
-void machine_check_ue_event(struct machine_check_event *evt);
+static void machine_check_ue_event(struct machine_check_event *evt);
 static void machine_process_ue_event(struct work_struct *work);
 
 static struct irq_work mce_event_process_work = {
@@ -212,7 +212,7 @@ static void machine_check_ue_irq_work(struct irq_work *work)
 /*
  * Queue up the MCE event which then can be handled later.
  */
-void machine_check_ue_event(struct machine_check_event *evt)
+static void machine_check_ue_event(struct machine_check_event *evt)
 {
int index;
 
-- 
2.21.0



[PATCH v11 2/7] powerpc/mce: Fix MCE handling for huge pages

2019-08-20 Thread Santosh Sivaraj
From: Balbir Singh 

The current code would fail on huge page addresses, since the shift would
be incorrect. Use the correct page shift value returned by
__find_linux_pte() to get the correct physical address. The code is more
generic and can handle both regular and compound pages.

Fixes: ba41e1e1ccb9 ("powerpc/mce: Hookup derror (load/store) UE errors")
Signed-off-by: Balbir Singh 
[ar...@linux.ibm.com: Fixup pseries_do_memory_failure()]
Signed-off-by: Reza Arbab 
Tested-by: Mahesh Salgaonkar 
Signed-off-by: Santosh Sivaraj 
Cc: sta...@vger.kernel.org # v4.15+
---
 arch/powerpc/kernel/mce_power.c | 19 +--
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kernel/mce_power.c b/arch/powerpc/kernel/mce_power.c
index a814d2dfb5b0..714a98e0927f 100644
--- a/arch/powerpc/kernel/mce_power.c
+++ b/arch/powerpc/kernel/mce_power.c
@@ -26,6 +26,7 @@
 unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr)
 {
pte_t *ptep;
+   unsigned int shift;
unsigned long flags;
struct mm_struct *mm;
 
@@ -35,13 +36,18 @@ unsigned long addr_to_pfn(struct pt_regs *regs, unsigned 
long addr)
mm = _mm;
 
local_irq_save(flags);
-   if (mm == current->mm)
-   ptep = find_current_mm_pte(mm->pgd, addr, NULL, NULL);
-   else
-   ptep = find_init_mm_pte(addr, NULL);
+   ptep = __find_linux_pte(mm->pgd, addr, NULL, );
local_irq_restore(flags);
+
if (!ptep || pte_special(*ptep))
return ULONG_MAX;
+
+   if (shift > PAGE_SHIFT) {
+   unsigned long rpnmask = (1ul << shift) - PAGE_SIZE;
+
+   return pte_pfn(__pte(pte_val(*ptep) | (addr & rpnmask)));
+   }
+
return pte_pfn(*ptep);
 }
 
@@ -344,7 +350,7 @@ static const struct mce_derror_table mce_p9_derror_table[] 
= {
   MCE_INITIATOR_CPU,   MCE_SEV_SEVERE, true },
 { 0, false, 0, 0, 0, 0, 0 } };
 
-static int mce_find_instr_ea_and_pfn(struct pt_regs *regs, uint64_t *addr,
+static int mce_find_instr_ea_and_phys(struct pt_regs *regs, uint64_t *addr,
uint64_t *phys_addr)
 {
/*
@@ -541,7 +547,8 @@ static int mce_handle_derror(struct pt_regs *regs,
 * kernel/exception-64s.h
 */
if (get_paca()->in_mce < MAX_MCE_DEPTH)
-   mce_find_instr_ea_and_pfn(regs, addr, 
phys_addr);
+   mce_find_instr_ea_and_phys(regs, addr,
+  phys_addr);
}
found = 1;
}
-- 
2.21.0
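
To make the offset arithmetic concrete, here is a small standalone model
(a toy under stated assumptions, not kernel code) of the rpnmask
computation in the hunk above. It treats the PTE value as if it were just
the physical base address of the huge page; real PTEs carry flag bits
that pte_pfn() strips, which this model ignores. The values assume 64K
base pages and a 16M huge page.

```
#include <stdio.h>

#define PAGE_SHIFT	16UL			/* 64K base pages */
#define PAGE_SIZE	(1UL << PAGE_SHIFT)

int main(void)
{
	unsigned long shift = 24;			/* 16M huge page */
	unsigned long huge_base = 0x2000000000UL;	/* assumed huge page phys base */
	unsigned long addr = 0x7fff12345678UL;		/* assumed faulting address */

	unsigned long rpnmask = (1UL << shift) - PAGE_SIZE;	/* 0xff0000 */
	unsigned long phys = huge_base | (addr & rpnmask);	/* 0x2000340000 */
	unsigned long pfn = phys >> PAGE_SHIFT;			/* 0x200034 */

	printf("rpnmask=%#lx offset-in-hugepage=%#lx pfn=%#lx\n",
	       rpnmask, addr & rpnmask, pfn);
	return 0;
}
```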



[PATCH v11 1/7] powerpc/mce: Schedule work from irq_work

2019-08-20 Thread Santosh Sivaraj
schedule_work() cannot be called from MCE exception context, as an MCE
can interrupt even in an interrupt-disabled context.

Fixes: 733e4a4c ("powerpc/mce: hookup memory_failure for UE errors")
Reviewed-by: Mahesh Salgaonkar 
Reviewed-by: Nicholas Piggin 
Acked-by: Balbir Singh 
Signed-off-by: Santosh Sivaraj 
Cc: sta...@vger.kernel.org # v4.15+
---
 arch/powerpc/kernel/mce.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
index b18df633eae9..cff31d4a501f 100644
--- a/arch/powerpc/kernel/mce.c
+++ b/arch/powerpc/kernel/mce.c
@@ -33,6 +33,7 @@ static DEFINE_PER_CPU(struct machine_check_event[MAX_MC_EVT],
mce_ue_event_queue);
 
 static void machine_check_process_queued_event(struct irq_work *work);
+static void machine_check_ue_irq_work(struct irq_work *work);
 void machine_check_ue_event(struct machine_check_event *evt);
 static void machine_process_ue_event(struct work_struct *work);
 
@@ -40,6 +41,10 @@ static struct irq_work mce_event_process_work = {
 .func = machine_check_process_queued_event,
 };
 
+static struct irq_work mce_ue_event_irq_work = {
+   .func = machine_check_ue_irq_work,
+};
+
 DECLARE_WORK(mce_ue_event_work, machine_process_ue_event);
 
 static void mce_set_error_info(struct machine_check_event *mce,
@@ -199,6 +204,10 @@ void release_mce_event(void)
get_mce_event(NULL, true);
 }
 
+static void machine_check_ue_irq_work(struct irq_work *work)
+{
+   schedule_work(_ue_event_work);
+}
 
 /*
  * Queue up the MCE event which then can be handled later.
@@ -216,7 +225,7 @@ void machine_check_ue_event(struct machine_check_event *evt)
memcpy(this_cpu_ptr(_ue_event_queue[index]), evt, sizeof(*evt));
 
/* Queue work to process this event later. */
-   schedule_work(_ue_event_work);
+   irq_work_queue(_ue_event_irq_work);
 }
 
 /*
-- 
2.21.0
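
For reference, the deferral pattern this patch introduces, condensed into
a sketch; the function and variable names below are illustrative
placeholders, not the kernel's. Machine check context may only queue an
irq_work, and the irq_work callback, which runs in ordinary interrupt
context, is where schedule_work() becomes legal.

```
#include <linux/irq_work.h>
#include <linux/workqueue.h>

static void stage2_work_fn(struct work_struct *work)
{
	/* process context: may sleep, e.g. call memory_failure() */
}
static DECLARE_WORK(stage2_work, stage2_work_fn);

static void stage1_irq_work_fn(struct irq_work *work)
{
	schedule_work(&stage2_work);	/* legal here, not in the MCE handler */
}
static struct irq_work stage1_irq_work = { .func = stage1_irq_work_fn };

static void from_machine_check_context(void)
{
	irq_work_queue(&stage1_irq_work);	/* NMI-safe entry point */
}
```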



[PATCH v11 0/7] powerpc: implement machine check safe memcpy

2019-08-20 Thread Santosh Sivaraj
During a memcpy from a pmem device, if a machine check exception is
generated we end up in a panic. In the case of an fsdax read, this should
only result in -EIO. Avoid the panic by implementing memcpy_mcsafe.

Before this patch series:

```
bash-4.4# mount -o dax /dev/pmem0 /mnt/pmem/
[ 7621.714094] Disabling lock debugging due to kernel taint
[ 7621.714099] MCE: CPU0: machine check (Severe) Host UE Load/Store [Not 
recovered]
[ 7621.714104] MCE: CPU0: NIP: [c0088978] memcpy_power7+0x418/0x7e0
[ 7621.714107] MCE: CPU0: Hardware error
[ 7621.714112] opal: Hardware platform error: Unrecoverable Machine Check 
exception
[ 7621.714118] CPU: 0 PID: 1368 Comm: mount Tainted: G   M  
5.2.0-rc5-00239-g241e39004581
#50
[ 7621.714123] NIP:  c0088978 LR: c08e16f8 CTR: 01de
[ 7621.714129] REGS: c000fffbfd70 TRAP: 0200   Tainted: G   M  
(5.2.0-rc5-00239-g241e39004581)
[ 7621.714131] MSR:  92209033   CR: 
24428840  XER: 0004
[ 7621.714160] CFAR: c00889a8 DAR: deadbeefdeadbeef DSISR: 8000 
IRQMASK: 0
[ 7621.714171] GPR00: 0e00 c000f0b8b1e0 c12cf100 
c000ed8e1100 
[ 7621.714186] GPR04: c2001100 0001 0200 
03fff1272000 
[ 7621.714201] GPR08: 8000 0010 0020 
0030 
[ 7621.714216] GPR12: 0040 7fffb8c6d390 0050 
0060 
[ 7621.714232] GPR16: 0070  0001 
c000f0b8b960 
[ 7621.714247] GPR20: 0001 c000f0b8b940 0001 
0001 
[ 7621.714262] GPR24: c1382560 c00c003b6380 c00c003b6380 
0001 
[ 7621.714277] GPR28:  0001 c200 
0001 
[ 7621.714294] NIP [c0088978] memcpy_power7+0x418/0x7e0
[ 7621.714298] LR [c08e16f8] pmem_do_bvec+0xf8/0x430
...  ...
```

After this patch series:

```
bash-4.4# mount -o dax /dev/pmem0 /mnt/pmem/
[25302.883978] Buffer I/O error on dev pmem0, logical block 0, async page read
[25303.020816] EXT4-fs (pmem0): DAX enabled. Warning: EXPERIMENTAL, use at your 
own risk
[25303.021236] EXT4-fs (pmem0): Can't read superblock on 2nd try
[25303.152515] EXT4-fs (pmem0): DAX enabled. Warning: EXPERIMENTAL, use at your 
own risk
[25303.284031] EXT4-fs (pmem0): DAX enabled. Warning: EXPERIMENTAL, use at your 
own risk
[25304.084100] UDF-fs: bad mount option "dax" or missing value
mount: /mnt/pmem: wrong fs type, bad option, bad superblock on /dev/pmem0, 
missing codepage or helper
program, or other error.
```

An MCE is injected on a pmem address using mambo. The last patch, which
adds a nop, is only for testing on mambo, where r13 is not restored upon
hitting vector 0x200.

The memcpy code can be optimised by adding VMX optimizations, and GAS
macros can be used to enable code reusability; I will send these as
another series.

--
v11:
* Move back to returning pfn instead of physical address [nick]
* Move patch "Handle UE event" up in the order
* Add reviewed-bys

v10: Fix authorship; add reviewed-bys and acks.

v9:
* Add a new IRQ work for UE events [mahesh]
* Reorder patches, and copy stable

v8:
* While ignoring UE events, return was used instead of continue.
* Checkpatch fixups for commit log

v7:
* Move schedule_work to be called from irq_work.

v6:
* Don't return pfn, all callers are expecting a physical address anyway [nick]
* Patch re-ordering: move exception table patch before memcpy_mcsafe patch 
[nick]
* Reword commit log for search_exception_tables patch [nick]

v5:
* Don't use search_exception_tables since it searches for module exception 
tables
  also [Nicholas]
* Fix commit message for patch 2 [Nicholas]

v4:
* Squash the return-remaining-bytes patch into the memcpy_mcsafe
  implementation patch [christophe]
* Access ok should be checked for copy_to_user_mcsafe() [christophe]

v3:
* Drop patch which enables DR/IR for external modules
* Drop notifier call chain, we don't want to do that in real mode
* Return remaining bytes from memcpy_mcsafe correctly
* We no longer restore r13 for simulator tests, rather use a nop at 
  vector 0x200 [workaround for simulator; not to be merged]

v2:
* Don't set RI bit explicitly [mahesh]
* Re-ordered series to get r13 workaround as the last patch

--
Balbir Singh (3):
  powerpc/mce: Fix MCE handling for huge pages
  powerpc/mce: Handle UE event for memcpy_mcsafe
  powerpc/memcpy: Add memcpy_mcsafe for pmem

Reza Arbab (1):
  powerpc/mce: Make machine_check_ue_event() static

Santosh Sivaraj (3):
  powerpc/mce: Schedule work from irq_work
  extable: Add function to search only kernel exception table
  powerpc: add machine check safe copy_to_user

 arch/powerpc/Kconfig|   1 +
 arch/powerpc/include/asm/mce.h  |   4 +-
 arch/powerpc/include/asm/string.h   |   2 +
 arch/powerpc/include/asm/uaccess.h  |  14 ++
 arch/powerpc/kernel/mce.c   |  31 +++-

Re: [PATCH 0/3] Add bad pmem bad blocks to bad range

2019-08-19 Thread Santosh Sivaraj
Santosh Sivaraj  writes:

> This series, which should be based on top of the still un-merged
> "powerpc: implement machine check safe memcpy" series, adds support
> to add the bad blocks which generated an MCE to the NVDIMM bad blocks.
> The next access of the same memory will be blocked by the NVDIMM layer
> itself.

This is the v2 series; I missed adding that in the subject.

>
> ---
> Santosh Sivaraj (3):
>   powerpc/mce: Add MCE notification chain
>   of_pmem: Add memory ranges which took a mce to bad range
>   papr/scm: Add bad memory ranges to nvdimm bad ranges
>
>  arch/powerpc/include/asm/mce.h|   3 +
>  arch/powerpc/kernel/mce.c |  15 +++
>  arch/powerpc/platforms/pseries/papr_scm.c |  86 +++-
>  drivers/nvdimm/of_pmem.c  | 151 +++---
>  4 files changed, 234 insertions(+), 21 deletions(-)
>
> -- 
> 2.21.0


[PATCH 2/3] of_pmem: Add memory ranges which took a mce to bad range

2019-08-19 Thread Santosh Sivaraj
Subscribe to the MCE notification chain and add the physical address
which generated a memory error to the nvdimm bad range.

Signed-off-by: Santosh Sivaraj 
---
 drivers/nvdimm/of_pmem.c | 151 +--
 1 file changed, 131 insertions(+), 20 deletions(-)

diff --git a/drivers/nvdimm/of_pmem.c b/drivers/nvdimm/of_pmem.c
index a0c8dcfa0bf9..155e56862fdf 100644
--- a/drivers/nvdimm/of_pmem.c
+++ b/drivers/nvdimm/of_pmem.c
@@ -8,6 +8,9 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
 
 static const struct attribute_group *region_attr_groups[] = {
_region_attribute_group,
@@ -25,11 +28,77 @@ struct of_pmem_private {
struct nvdimm_bus *bus;
 };
 
+struct of_pmem_region {
+   struct of_pmem_private *priv;
+   struct nd_region_desc *region_desc;
+   struct nd_region *region;
+   struct list_head region_list;
+};
+
+LIST_HEAD(pmem_regions);
+DEFINE_MUTEX(pmem_region_lock);
+
+static int handle_mce_ue(struct notifier_block *nb, unsigned long val,
+void *data)
+{
+   struct machine_check_event *evt = data;
+   struct of_pmem_region *pmem_region;
+   u64 aligned_addr, phys_addr;
+   bool found = false;
+
+   if (evt->error_type != MCE_ERROR_TYPE_UE)
+   return NOTIFY_DONE;
+
+   if (list_empty(_regions))
+   return NOTIFY_DONE;
+
+   phys_addr = evt->u.ue_error.physical_address +
+   (evt->u.ue_error.effective_address & ~PAGE_MASK);
+
+   if (!evt->u.ue_error.physical_address_provided ||
+   !is_zone_device_page(pfn_to_page(phys_addr >> PAGE_SHIFT)))
+   return NOTIFY_DONE;
+
+   mutex_lock(_region_lock);
+   list_for_each_entry(pmem_region, _regions, region_list) {
+   struct resource *res = pmem_region->region_desc->res;
+
+   if (phys_addr >= res->start && phys_addr <= res->end) {
+   found = true;
+   break;
+   }
+   }
+   mutex_unlock(_region_lock);
+
+   if (!found)
+   return NOTIFY_DONE;
+
+   aligned_addr = ALIGN_DOWN(phys_addr, L1_CACHE_BYTES);
+
+   if (nvdimm_bus_add_badrange(pmem_region->priv->bus, aligned_addr,
+   L1_CACHE_BYTES))
+   return NOTIFY_DONE;
+
+   pr_debug("Add memory range (0x%llx -- 0x%llx) as bad range\n",
+aligned_addr, aligned_addr + L1_CACHE_BYTES);
+
+
+   nvdimm_region_notify(pmem_region->region, NVDIMM_REVALIDATE_POISON);
+
+   return NOTIFY_OK;
+}
+
+static struct notifier_block mce_ue_nb = {
+   .notifier_call = handle_mce_ue
+};
+
 static int of_pmem_region_probe(struct platform_device *pdev)
 {
struct of_pmem_private *priv;
struct device_node *np;
struct nvdimm_bus *bus;
+   struct of_pmem_region *pmem_region;
+   struct nd_region_desc *ndr_desc;
bool is_volatile;
int i;
 
@@ -58,32 +127,49 @@ static int of_pmem_region_probe(struct platform_device 
*pdev)
is_volatile ? "volatile" : "non-volatile",  np);
 
for (i = 0; i < pdev->num_resources; i++) {
-   struct nd_region_desc ndr_desc;
struct nd_region *region;
 
-   /*
-* NB: libnvdimm copies the data from ndr_desc into it's own
-* structures so passing a stack pointer is fine.
-*/
-   memset(_desc, 0, sizeof(ndr_desc));
-   ndr_desc.attr_groups = region_attr_groups;
-   ndr_desc.numa_node = dev_to_node(>dev);
-   ndr_desc.target_node = ndr_desc.numa_node;
-   ndr_desc.res = >resource[i];
-   ndr_desc.of_node = np;
-   set_bit(ND_REGION_PAGEMAP, _desc.flags);
+   ndr_desc = kzalloc(sizeof(struct nd_region_desc), GFP_KERNEL);
+   if (!ndr_desc) {
+   nvdimm_bus_unregister(priv->bus);
+   kfree(priv);
+   return -ENOMEM;
+   }
+
+   ndr_desc->attr_groups = region_attr_groups;
+   ndr_desc->numa_node = dev_to_node(>dev);
+   ndr_desc->target_node = ndr_desc->numa_node;
+   ndr_desc->res = >resource[i];
+   ndr_desc->of_node = np;
+   set_bit(ND_REGION_PAGEMAP, _desc->flags);
 
if (is_volatile)
-   region = nvdimm_volatile_region_create(bus, _desc);
+   region = nvdimm_volatile_region_create(bus, ndr_desc);
else
-   region = nvdimm_pmem_region_create(bus, _desc);
+   region = nvdimm_pmem_region_create(bus, ndr_desc);
 
-   if (!region)
+   if (!region) {

[PATCH 3/3] papr/scm: Add bad memory ranges to nvdimm bad ranges

2019-08-19 Thread Santosh Sivaraj
Subscribe to the MCE notification chain and add the physical address
which generated a memory error to the nvdimm bad range.

Signed-off-by: Santosh Sivaraj 
---
 arch/powerpc/platforms/pseries/papr_scm.c | 86 ++-
 1 file changed, 85 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index a5ac371a3f06..e38f7febc5d9 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -12,6 +12,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include 
 
@@ -39,8 +41,12 @@ struct papr_scm_priv {
struct resource res;
struct nd_region *region;
struct nd_interleave_set nd_set;
+   struct list_head region_list;
 };
 
+LIST_HEAD(papr_nd_regions);
+DEFINE_MUTEX(papr_ndr_lock);
+
 static int drc_pmem_bind(struct papr_scm_priv *p)
 {
unsigned long ret[PLPAR_HCALL_BUFSIZE];
@@ -364,6 +370,10 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
dev_info(dev, "Region registered with target node %d and online 
node %d",
 target_nid, online_nid);
 
+   mutex_lock(_ndr_lock);
+   list_add_tail(>region_list, _nd_regions);
+   mutex_unlock(_ndr_lock);
+
return 0;
 
 err:   nvdimm_bus_unregister(p->bus);
@@ -371,6 +381,57 @@ err:   nvdimm_bus_unregister(p->bus);
return -ENXIO;
 }
 
+static int handle_mce_ue(struct notifier_block *nb, unsigned long val,
+void *data)
+{
+   struct machine_check_event *evt = data;
+   struct papr_scm_priv *p;
+   u64 phys_addr, aligned_addr;
+   bool found = false;
+
+   if (evt->error_type != MCE_ERROR_TYPE_UE)
+   return NOTIFY_DONE;
+
+   if (list_empty(_nd_regions))
+   return NOTIFY_DONE;
+
+   phys_addr = evt->u.ue_error.physical_address +
+   (evt->u.ue_error.effective_address & ~PAGE_MASK);
+
+   if (!evt->u.ue_error.physical_address_provided ||
+   !is_zone_device_page(pfn_to_page(phys_addr >> PAGE_SHIFT)))
+   return NOTIFY_DONE;
+
+   mutex_lock(_ndr_lock);
+   list_for_each_entry(p, _nd_regions, region_list) {
+   struct resource res = p->res;
+
+   if (phys_addr >= res.start && phys_addr <= res.end) {
+   found = true;
+   break;
+   }
+   }
+   mutex_unlock(_ndr_lock);
+
+   if (!found)
+   return NOTIFY_DONE;
+
+   aligned_addr = ALIGN_DOWN(phys_addr, L1_CACHE_BYTES);
+   if (nvdimm_bus_add_badrange(p->bus, aligned_addr, L1_CACHE_BYTES))
+   return NOTIFY_DONE;
+
+   pr_debug("Add memory range (0x%llx -- 0x%llx) as bad range\n",
+aligned_addr, aligned_addr + L1_CACHE_BYTES);
+
+   nvdimm_region_notify(p->region, NVDIMM_REVALIDATE_POISON);
+
+   return NOTIFY_OK;
+}
+
+static struct notifier_block mce_ue_nb = {
+   .notifier_call = handle_mce_ue
+};
+
 static int papr_scm_probe(struct platform_device *pdev)
 {
struct device_node *dn = pdev->dev.of_node;
@@ -456,6 +517,7 @@ static int papr_scm_probe(struct platform_device *pdev)
goto err2;
 
platform_set_drvdata(pdev, p);
+   mce_register_notifier(_ue_nb);
 
return 0;
 
@@ -468,6 +530,10 @@ static int papr_scm_remove(struct platform_device *pdev)
 {
struct papr_scm_priv *p = platform_get_drvdata(pdev);
 
+   mutex_lock(_ndr_lock);
+   list_del(&(p->region_list));
+   mutex_unlock(_ndr_lock);
+
nvdimm_bus_unregister(p->bus);
drc_pmem_unbind(p);
kfree(p);
@@ -490,7 +556,25 @@ static struct platform_driver papr_scm_driver = {
},
 };
 
-module_platform_driver(papr_scm_driver);
+static int __init papr_scm_init(void)
+{
+   int ret;
+
+   ret = platform_driver_register(_scm_driver);
+   if (!ret)
+   mce_register_notifier(_ue_nb);
+
+   return ret;
+}
+module_init(papr_scm_init);
+
+static void __exit papr_scm_exit(void)
+{
+   mce_unregister_notifier(_ue_nb);
+   platform_driver_unregister(_scm_driver);
+}
+module_exit(papr_scm_exit);
+
 MODULE_DEVICE_TABLE(of, papr_scm_match);
 MODULE_LICENSE("GPL");
 MODULE_AUTHOR("IBM Corporation");
-- 
2.21.0



[PATCH 1/3] powerpc/mce: Add MCE notification chain

2019-08-19 Thread Santosh Sivaraj
This notification chain is needed so that persistent memory drivers can
report the affected addresses as bad blocks.

Signed-off-by: Santosh Sivaraj 
---
 arch/powerpc/include/asm/mce.h |  3 +++
 arch/powerpc/kernel/mce.c  | 15 +++
 2 files changed, 18 insertions(+)

diff --git a/arch/powerpc/include/asm/mce.h b/arch/powerpc/include/asm/mce.h
index e1931c8c2743..b1c6363f924c 100644
--- a/arch/powerpc/include/asm/mce.h
+++ b/arch/powerpc/include/asm/mce.h
@@ -212,6 +212,9 @@ extern void machine_check_queue_event(void);
 extern void machine_check_print_event_info(struct machine_check_event *evt,
   bool user_mode, bool in_guest);
 unsigned long addr_to_phys(struct pt_regs *regs, unsigned long addr);
+int mce_register_notifier(struct notifier_block *nb);
+int mce_unregister_notifier(struct notifier_block *nb);
+
 #ifdef CONFIG_PPC_BOOK3S_64
 void flush_and_reload_slb(void);
 #endif /* CONFIG_PPC_BOOK3S_64 */
diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
index ec4b3e1087be..a78210ca6cd9 100644
--- a/arch/powerpc/kernel/mce.c
+++ b/arch/powerpc/kernel/mce.c
@@ -47,6 +47,20 @@ static struct irq_work mce_ue_event_irq_work = {
 
 DECLARE_WORK(mce_ue_event_work, machine_process_ue_event);
 
+static BLOCKING_NOTIFIER_HEAD(mce_notifier_list);
+
+int mce_register_notifier(struct notifier_block *nb)
+{
+   return blocking_notifier_chain_register(_notifier_list, nb);
+}
+EXPORT_SYMBOL_GPL(mce_register_notifier);
+
+int mce_unregister_notifier(struct notifier_block *nb)
+{
+   return blocking_notifier_chain_unregister(_notifier_list, nb);
+}
+EXPORT_SYMBOL_GPL(mce_unregister_notifier);
+
 static void mce_set_error_info(struct machine_check_event *mce,
   struct mce_error_info *mce_err)
 {
@@ -263,6 +277,7 @@ static void machine_process_ue_event(struct work_struct 
*work)
while (__this_cpu_read(mce_ue_count) > 0) {
index = __this_cpu_read(mce_ue_count) - 1;
evt = this_cpu_ptr(_ue_event_queue[index]);
+   blocking_notifier_call_chain(_notifier_list, 0, evt);
 #ifdef CONFIG_MEMORY_FAILURE
/*
 * This should probably queued elsewhere, but
-- 
2.21.0
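
A minimal consumer sketch, condensed from what the of_pmem and papr_scm
patches in this series do; the names my_mce_ue_cb and my_driver_init are
placeholders and not part of the patch.

```
#include <linux/notifier.h>
#include <linux/init.h>
#include <asm/mce.h>

static int my_mce_ue_cb(struct notifier_block *nb, unsigned long val,
			void *data)
{
	struct machine_check_event *evt = data;

	if (evt->error_type != MCE_ERROR_TYPE_UE ||
	    !evt->u.ue_error.physical_address_provided)
		return NOTIFY_DONE;

	/* evt->u.ue_error.physical_address would be recorded as a bad range */
	return NOTIFY_OK;
}

static struct notifier_block my_mce_nb = {
	.notifier_call = my_mce_ue_cb,
};

static int __init my_driver_init(void)
{
	return mce_register_notifier(&my_mce_nb);
}
```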



[PATCH 0/3] Add bad pmem bad blocks to bad range

2019-08-19 Thread Santosh Sivaraj
This series, which should be applied on top of the still-unmerged
"powerpc: implement machine check safe memcpy" series, adds support for
recording the bad blocks which generated an MCE in the NVDIMM bad block
list. The next access to the same memory will then be blocked by the
NVDIMM layer itself.

---
Santosh Sivaraj (3):
  powerpc/mce: Add MCE notification chain
  of_pmem: Add memory ranges which took a mce to bad range
  papr/scm: Add bad memory ranges to nvdimm bad ranges

 arch/powerpc/include/asm/mce.h|   3 +
 arch/powerpc/kernel/mce.c |  15 +++
 arch/powerpc/platforms/pseries/papr_scm.c |  86 +++-
 drivers/nvdimm/of_pmem.c  | 151 +++---
 4 files changed, 234 insertions(+), 21 deletions(-)

-- 
2.21.0



Re: [PATCH v10 2/7] powerpc/mce: Fix MCE handling for huge pages

2019-08-19 Thread Santosh Sivaraj
Hi Nick,

Nicholas Piggin  writes:

> Santosh Sivaraj's on August 15, 2019 10:39 am:
>> From: Balbir Singh 
>> 
>> The current code would fail on huge pages addresses, since the shift would
>> be incorrect. Use the correct page shift value returned by
>> __find_linux_pte() to get the correct physical address. The code is more
>> generic and can handle both regular and compound pages.
>> 
>> Fixes: ba41e1e1ccb9 ("powerpc/mce: Hookup derror (load/store) UE errors")
>> Signed-off-by: Balbir Singh 
>> [ar...@linux.ibm.com: Fixup pseries_do_memory_failure()]
>> Signed-off-by: Reza Arbab 
>> Co-developed-by: Santosh Sivaraj 
>> Signed-off-by: Santosh Sivaraj 
>> Tested-by: Mahesh Salgaonkar 
>> Cc: sta...@vger.kernel.org # v4.15+
>> ---
>>  arch/powerpc/include/asm/mce.h   |  2 +-
>>  arch/powerpc/kernel/mce_power.c  | 55 ++--
>>  arch/powerpc/platforms/pseries/ras.c |  9 ++---
>>  3 files changed, 32 insertions(+), 34 deletions(-)
>> 
>> diff --git a/arch/powerpc/include/asm/mce.h b/arch/powerpc/include/asm/mce.h
>> index a4c6a74ad2fb..f3a6036b6bc0 100644
>> --- a/arch/powerpc/include/asm/mce.h
>> +++ b/arch/powerpc/include/asm/mce.h
>> @@ -209,7 +209,7 @@ extern void release_mce_event(void);
>>  extern void machine_check_queue_event(void);
>>  extern void machine_check_print_event_info(struct machine_check_event *evt,
>> bool user_mode, bool in_guest);
>> -unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr);
>> +unsigned long addr_to_phys(struct pt_regs *regs, unsigned long addr);
>>  #ifdef CONFIG_PPC_BOOK3S_64
>>  void flush_and_reload_slb(void);
>>  #endif /* CONFIG_PPC_BOOK3S_64 */
>> diff --git a/arch/powerpc/kernel/mce_power.c 
>> b/arch/powerpc/kernel/mce_power.c
>> index a814d2dfb5b0..e74816f045f8 100644
>> --- a/arch/powerpc/kernel/mce_power.c
>> +++ b/arch/powerpc/kernel/mce_power.c
>> @@ -20,13 +20,14 @@
>>  #include 
>>  
>>  /*
>> - * Convert an address related to an mm to a PFN. NOTE: we are in real
>> - * mode, we could potentially race with page table updates.
>> + * Convert an address related to an mm to a physical address.
>> + * NOTE: we are in real mode, we could potentially race with page table 
>> updates.
>>   */
>> -unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr)
>> +unsigned long addr_to_phys(struct pt_regs *regs, unsigned long addr)
>>  {
>> -pte_t *ptep;
>> -unsigned long flags;
>> +pte_t *ptep, pte;
>> +unsigned int shift;
>> +unsigned long flags, phys_addr;
>>  struct mm_struct *mm;
>>  
>>  if (user_mode(regs))
>> @@ -35,14 +36,21 @@ unsigned long addr_to_pfn(struct pt_regs *regs, unsigned 
>> long addr)
>>  mm = _mm;
>>  
>>  local_irq_save(flags);
>> -if (mm == current->mm)
>> -ptep = find_current_mm_pte(mm->pgd, addr, NULL, NULL);
>> -else
>> -ptep = find_init_mm_pte(addr, NULL);
>> +ptep = __find_linux_pte(mm->pgd, addr, NULL, );
>>  local_irq_restore(flags);
>> +
>>  if (!ptep || pte_special(*ptep))
>>  return ULONG_MAX;
>> -return pte_pfn(*ptep);
>> +
>> +pte = *ptep;
>> +if (shift > PAGE_SHIFT) {
>> +unsigned long rpnmask = (1ul << shift) - PAGE_SIZE;
>> +
>> +pte = __pte(pte_val(pte) | (addr & rpnmask));
>> +}
>> +phys_addr = pte_pfn(pte) << PAGE_SHIFT;
>> +
>> +return phys_addr;
>>  }
>
> This should remain addr_to_pfn I think. None of the callers care what
> size page the EA was mapped with. 'pfn' is referring to the Linux pfn,
> which is the small page number.
>
>   if (shift > PAGE_SHIFT)
> return (pte_pfn(*ptep) | ((addr & ((1UL << shift) - 1)) >> PAGE_SHIFT));
>   else
> return pte_pfn(*ptep);
>
> Something roughly like that, then you don't have to change any callers
> or am I missing something?

Here[1] you asked to return the real address rather than the pfn, which
is what all callers care about. So I made the changes accordingly.

[1] https://www.spinics.net/lists/kernel/msg3187658.html

Thanks,
Santosh
>
> Thanks,
> Nick


Re: [PATCH 3/3] papr/scm: Add bad memory ranges to nvdimm bad ranges

2019-08-15 Thread Santosh Sivaraj
"Oliver O'Halloran"  writes:

> On Wed, Aug 14, 2019 at 6:25 PM Santosh Sivaraj  wrote:
>>
>> Subscribe to the MCE notification and add the physical address which
>> generated a memory error to nvdimm bad range.
>>
>> Signed-off-by: Santosh Sivaraj 
>> ---
>>  arch/powerpc/platforms/pseries/papr_scm.c | 65 +++
>>  1 file changed, 65 insertions(+)
>>
>> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
>> b/arch/powerpc/platforms/pseries/papr_scm.c
>> index a5ac371a3f06..4d25c98a9835 100644
>> --- a/arch/powerpc/platforms/pseries/papr_scm.c
>> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
>> @@ -12,6 +12,8 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>> +#include 
>>
>>  #include 
>>
>> @@ -39,8 +41,12 @@ struct papr_scm_priv {
>> struct resource res;
>> struct nd_region *region;
>> struct nd_interleave_set nd_set;
>> +   struct list_head list;
>
> list is not a meaningful name. call it something more descriptive.
>
>>  };
>>
>> +LIST_HEAD(papr_nd_regions);
>> +DEFINE_MUTEX(papr_ndr_lock);
>
> Should this be a mutex or a spinlock? I don't know what context the
> mce notifier is called from, but if it's not sleepable then a mutex
> will cause problems. Did you test this with lockdep enabled?

This should be a mutex, since we are called from a blocking notifier.

>
>> +
>>  static int drc_pmem_bind(struct papr_scm_priv *p)
>>  {
>> unsigned long ret[PLPAR_HCALL_BUFSIZE];
>> @@ -364,6 +370,10 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
>> dev_info(dev, "Region registered with target node %d and 
>> online node %d",
>>  target_nid, online_nid);
>>
>> +   mutex_lock(_ndr_lock);
>> +   list_add_tail(>list, _nd_regions);
>> +   mutex_unlock(_ndr_lock);
>> +
>
> Where's the matching remove when we unbind the driver?

Missed it completely. Will fix it.

>
>> return 0;
>>pp
>>  err:   nvdimm_bus_unregister(p->bus);
>> @@ -371,6 +381,60 @@ err:   nvdimm_bus_unregister(p->bus);
>> return -ENXIO;
>>  }
>>
>> +static int handle_mce_ue(struct notifier_block *nb, unsigned long val,
>> +void *data)
>> +{
>> +   struct machine_check_event *evt = data;
>> +   struct papr_scm_priv *p;
>> +   u64 phys_addr;
>> +
>> +   if (evt->error_type != MCE_ERROR_TYPE_UE)
>> +   return NOTIFY_DONE;
>> +
>> +   if (list_empty(_nd_regions))
>> +   return NOTIFY_DONE;
>> +
>> +   phys_addr = evt->u.ue_error.physical_address +
>> +   (evt->u.ue_error.effective_address & ~PAGE_MASK);
>
> Wait what? Why is physical_address page aligned, but effective_address
> not? Not a problem with this patch, but still, what the hell?

Not sure why, but it's the way it is now. I can see if I can update it in
a later patch if it makes sense.

>
>> +   if (!evt->u.ue_error.physical_address_provided ||
>> +   !is_zone_device_page(pfn_to_page(phys_addr >> PAGE_SHIFT)))
>> +   return NOTIFY_DONE;
>> +
>> +   mutex_lock(_ndr_lock);
>> +   list_for_each_entry(p, _nd_regions, list) {
>> +   struct resource res = p->res;
>> +   u64 aligned_addr;
>> +
>
>> +   if (res.start > phys_addr)
>> +   continue;
>> +
>> +   if (res.end < phys_addr)
>> +   continue;
>
> surely there's a helper for this
>
>> +
>> +   aligned_addr = ALIGN_DOWN(phys_addr, L1_CACHE_BYTES);
>> +   pr_debug("Add memory range (0x%llx -- 0x%llx) as bad 
>> range\n",
>> +aligned_addr, aligned_addr + L1_CACHE_BYTES);
>> +
>> +   if (nvdimm_bus_add_badrange(p->bus,
>> +   aligned_addr, L1_CACHE_BYTES))
>> +   pr_warn("Failed to add bad range (0x%llx -- 
>> 0x%llx)\n",
>> +   aligned_addr, aligned_addr + L1_CACHE_BYTES);
>> +
>> +   nvdimm_region_notify(p->region,
>> +NVDIMM_REVALIDATE_POISON);
>> +
>> +   break;
>
> nit: you can avoid stacking indetation levels by breaking out of the
> loop as soon as you've found the region 

[PATCH v10 7/7] powerpc: add machine check safe copy_to_user

2019-08-14 Thread Santosh Sivaraj
Use the memcpy_mcsafe() implementation to define copy_to_user_mcsafe().

Signed-off-by: Santosh Sivaraj 
---
 arch/powerpc/Kconfig   |  1 +
 arch/powerpc/include/asm/uaccess.h | 14 ++
 2 files changed, 15 insertions(+)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 77f6ebf97113..4316e36095a2 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -137,6 +137,7 @@ config PPC
select ARCH_HAS_STRICT_KERNEL_RWX   if ((PPC_BOOK3S_64 || PPC32) && 
!RELOCATABLE && !HIBERNATION)
select ARCH_HAS_TICK_BROADCAST  if GENERIC_CLOCKEVENTS_BROADCAST
select ARCH_HAS_UACCESS_FLUSHCACHE  if PPC64
+   select ARCH_HAS_UACCESS_MCSAFE  if PPC64
select ARCH_HAS_UBSAN_SANITIZE_ALL
select ARCH_HAVE_NMI_SAFE_CMPXCHG
select ARCH_KEEP_MEMBLOCK
diff --git a/arch/powerpc/include/asm/uaccess.h 
b/arch/powerpc/include/asm/uaccess.h
index 8b03eb44e876..15002b51ff18 100644
--- a/arch/powerpc/include/asm/uaccess.h
+++ b/arch/powerpc/include/asm/uaccess.h
@@ -387,6 +387,20 @@ static inline unsigned long raw_copy_to_user(void __user 
*to,
return ret;
 }
 
+static __always_inline unsigned long __must_check
+copy_to_user_mcsafe(void __user *to, const void *from, unsigned long n)
+{
+   if (likely(check_copy_size(from, n, true))) {
+   if (access_ok(to, n)) {
+   allow_write_to_user(to, n);
+   n = memcpy_mcsafe((void *)to, from, n);
+   prevent_write_to_user(to, n);
+   }
+   }
+
+   return n;
+}
+
 extern unsigned long __clear_user(void __user *addr, unsigned long size);
 
 static inline unsigned long clear_user(void __user *addr, unsigned long size)
-- 
2.21.0



[PATCH v10 6/7] powerpc/mce: Handle UE event for memcpy_mcsafe

2019-08-14 Thread Santosh Sivaraj
From: Balbir Singh 

If we take a UE on one of the instructions with a fixup entry, set nip
to continue execution at the fixup entry. Stop processing the event
further or print it.

Co-developed-by: Reza Arbab 
Signed-off-by: Reza Arbab 
Signed-off-by: Balbir Singh 
Signed-off-by: Santosh Sivaraj 
Reviewed-by: Mahesh Salgaonkar 
---
 arch/powerpc/include/asm/mce.h  |  4 +++-
 arch/powerpc/kernel/mce.c   | 16 
 arch/powerpc/kernel/mce_power.c | 15 +--
 3 files changed, 32 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/mce.h b/arch/powerpc/include/asm/mce.h
index f3a6036b6bc0..e1931c8c2743 100644
--- a/arch/powerpc/include/asm/mce.h
+++ b/arch/powerpc/include/asm/mce.h
@@ -122,7 +122,8 @@ struct machine_check_event {
enum MCE_UeErrorType ue_error_type:8;
u8  effective_address_provided;
u8  physical_address_provided;
-   u8  reserved_1[5];
+   u8  ignore_event;
+   u8  reserved_1[4];
u64 effective_address;
u64 physical_address;
u8  reserved_2[8];
@@ -193,6 +194,7 @@ struct mce_error_info {
enum MCE_Initiator  initiator:8;
enum MCE_ErrorClass error_class:8;
boolsync_error;
+   boolignore_event;
 };
 
 #define MAX_MC_EVT 100
diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
index a3b122a685a5..ec4b3e1087be 100644
--- a/arch/powerpc/kernel/mce.c
+++ b/arch/powerpc/kernel/mce.c
@@ -149,6 +149,7 @@ void save_mce_event(struct pt_regs *regs, long handled,
if (phys_addr != ULONG_MAX) {
mce->u.ue_error.physical_address_provided = true;
mce->u.ue_error.physical_address = phys_addr;
+   mce->u.ue_error.ignore_event = mce_err->ignore_event;
machine_check_ue_event(mce);
}
}
@@ -266,8 +267,17 @@ static void machine_process_ue_event(struct work_struct 
*work)
/*
 * This should probably queued elsewhere, but
 * oh! well
+*
+* Don't report this machine check because the caller has
+* asked us to ignore the event; it has a fixup handler which
+* will do the appropriate error handling and reporting.
 */
if (evt->error_type == MCE_ERROR_TYPE_UE) {
+   if (evt->u.ue_error.ignore_event) {
+   __this_cpu_dec(mce_ue_count);
+   continue;
+   }
+
if (evt->u.ue_error.physical_address_provided) {
unsigned long pfn;
 
@@ -301,6 +311,12 @@ static void machine_check_process_queued_event(struct 
irq_work *work)
while (__this_cpu_read(mce_queue_count) > 0) {
index = __this_cpu_read(mce_queue_count) - 1;
evt = this_cpu_ptr(_event_queue[index]);
+
+   if (evt->error_type == MCE_ERROR_TYPE_UE &&
+   evt->u.ue_error.ignore_event) {
+   __this_cpu_dec(mce_queue_count);
+   continue;
+   }
machine_check_print_event_info(evt, false, false);
__this_cpu_dec(mce_queue_count);
}
diff --git a/arch/powerpc/kernel/mce_power.c b/arch/powerpc/kernel/mce_power.c
index e74816f045f8..1dd87f6f5186 100644
--- a/arch/powerpc/kernel/mce_power.c
+++ b/arch/powerpc/kernel/mce_power.c
@@ -11,6 +11,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -18,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * Convert an address related to an mm to a physical address.
@@ -559,9 +561,18 @@ static int mce_handle_derror(struct pt_regs *regs,
return 0;
 }
 
-static long mce_handle_ue_error(struct pt_regs *regs)
+static long mce_handle_ue_error(struct pt_regs *regs,
+   struct mce_error_info *mce_err)
 {
long handled = 0;
+   const struct exception_table_entry *entry;
+
+   entry = search_kernel_exception_table(regs->nip);
+   if (entry) {
+   mce_err->ignore_event = true;
+   regs->nip = extable_fixup(entry);
+   return 1;
+   }
 
/*
 * On specific SCOM read via MMIO we may get a machine check
@@ -594,7 +605,7 @@ static long mce_handle_error(struct pt_regs *regs,
_addr);
 
if (!handled && mce_err.error_type == MCE_ERROR_TYPE_UE)
-   handled = mce_handle_ue_error(regs);
+   handled = mce_han

[PATCH v10 5/7] powerpc/memcpy: Add memcpy_mcsafe for pmem

2019-08-14 Thread Santosh Sivaraj
From: Balbir Singh 

The pmem infrastructure uses memcpy_mcsafe in the pmem layer so as to
convert machine check exceptions into a return value on failure in case
a machine check exception is encountered during the memcpy. The return
value is the number of bytes remaining to be copied.

This patch largely borrows from the copyuser_power7 logic and does not add
the VMX optimizations, largely to keep the patch simple. If needed those
optimizations can be folded in.

Signed-off-by: Balbir Singh 
[ar...@linux.ibm.com: Added symbol export]
Co-developed-by: Santosh Sivaraj 
Signed-off-by: Santosh Sivaraj 
---
 arch/powerpc/include/asm/string.h   |   2 +
 arch/powerpc/lib/Makefile   |   2 +-
 arch/powerpc/lib/memcpy_mcsafe_64.S | 242 
 3 files changed, 245 insertions(+), 1 deletion(-)
 create mode 100644 arch/powerpc/lib/memcpy_mcsafe_64.S

diff --git a/arch/powerpc/include/asm/string.h 
b/arch/powerpc/include/asm/string.h
index 9bf6dffb4090..b72692702f35 100644
--- a/arch/powerpc/include/asm/string.h
+++ b/arch/powerpc/include/asm/string.h
@@ -53,7 +53,9 @@ void *__memmove(void *to, const void *from, __kernel_size_t 
n);
 #ifndef CONFIG_KASAN
 #define __HAVE_ARCH_MEMSET32
 #define __HAVE_ARCH_MEMSET64
+#define __HAVE_ARCH_MEMCPY_MCSAFE
 
+extern int memcpy_mcsafe(void *dst, const void *src, __kernel_size_t sz);
 extern void *__memset16(uint16_t *, uint16_t v, __kernel_size_t);
 extern void *__memset32(uint32_t *, uint32_t v, __kernel_size_t);
 extern void *__memset64(uint64_t *, uint64_t v, __kernel_size_t);
diff --git a/arch/powerpc/lib/Makefile b/arch/powerpc/lib/Makefile
index eebc782d89a5..fa6b1b657b43 100644
--- a/arch/powerpc/lib/Makefile
+++ b/arch/powerpc/lib/Makefile
@@ -39,7 +39,7 @@ obj-$(CONFIG_PPC_BOOK3S_64) += copyuser_power7.o 
copypage_power7.o \
   memcpy_power7.o
 
 obj64-y+= copypage_64.o copyuser_64.o mem_64.o hweight_64.o \
-  memcpy_64.o pmem.o
+  memcpy_64.o pmem.o memcpy_mcsafe_64.o
 
 obj64-$(CONFIG_SMP)+= locks.o
 obj64-$(CONFIG_ALTIVEC)+= vmx-helper.o
diff --git a/arch/powerpc/lib/memcpy_mcsafe_64.S 
b/arch/powerpc/lib/memcpy_mcsafe_64.S
new file mode 100644
index ..949976dc115d
--- /dev/null
+++ b/arch/powerpc/lib/memcpy_mcsafe_64.S
@@ -0,0 +1,242 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) IBM Corporation, 2011
+ * Derived from copyuser_power7.s by Anton Blanchard 
+ * Author - Balbir Singh 
+ */
+#include 
+#include 
+#include 
+
+   .macro err1
+100:
+   EX_TABLE(100b,.Ldo_err1)
+   .endm
+
+   .macro err2
+200:
+   EX_TABLE(200b,.Ldo_err2)
+   .endm
+
+   .macro err3
+300:   EX_TABLE(300b,.Ldone)
+   .endm
+
+.Ldo_err2:
+   ld  r22,STK_REG(R22)(r1)
+   ld  r21,STK_REG(R21)(r1)
+   ld  r20,STK_REG(R20)(r1)
+   ld  r19,STK_REG(R19)(r1)
+   ld  r18,STK_REG(R18)(r1)
+   ld  r17,STK_REG(R17)(r1)
+   ld  r16,STK_REG(R16)(r1)
+   ld  r15,STK_REG(R15)(r1)
+   ld  r14,STK_REG(R14)(r1)
+   addir1,r1,STACKFRAMESIZE
+.Ldo_err1:
+   /* Do a byte by byte copy to get the exact remaining size */
+   mtctr   r7
+46:
+err3;  lbz r0,0(r4)
+   addir4,r4,1
+err3;  stb r0,0(r3)
+   addir3,r3,1
+   bdnz46b
+   li  r3,0
+   blr
+
+.Ldone:
+   mfctr   r3
+   blr
+
+
+_GLOBAL(memcpy_mcsafe)
+   mr  r7,r5
+   cmpldi  r5,16
+   blt .Lshort_copy
+
+.Lcopy:
+   /* Get the source 8B aligned */
+   neg r6,r4
+   mtocrf  0x01,r6
+   clrldi  r6,r6,(64-3)
+
+   bf  cr7*4+3,1f
+err1;  lbz r0,0(r4)
+   addir4,r4,1
+err1;  stb r0,0(r3)
+   addir3,r3,1
+   subir7,r7,1
+
+1: bf  cr7*4+2,2f
+err1;  lhz r0,0(r4)
+   addir4,r4,2
+err1;  sth r0,0(r3)
+   addir3,r3,2
+   subir7,r7,2
+
+2: bf  cr7*4+1,3f
+err1;  lwz r0,0(r4)
+   addir4,r4,4
+err1;  stw r0,0(r3)
+   addir3,r3,4
+   subir7,r7,4
+
+3: sub r5,r5,r6
+   cmpldi  r5,128
+   blt 5f
+
+   mflrr0
+   stdur1,-STACKFRAMESIZE(r1)
+   std r14,STK_REG(R14)(r1)
+   std r15,STK_REG(R15)(r1)
+   std r16,STK_REG(R16)(r1)
+   std r17,STK_REG(R17)(r1)
+   std r18,STK_REG(R18)(r1)
+   std r19,STK_REG(R19)(r1)
+   std r20,STK_REG(R20)(r1)
+   std r21,STK_REG(R21)(r1)
+   std r22,STK_REG(R22)(r1)
+   std r0,STACKFRAMESIZE+16(r1)
+
+   srdir6,r5,7
+   mtctr   r6
+
+   /* Now do cacheline (128B) sized loads and stores. */
+   .align  5
+4:
+err2;  ld  r0,0(r4)
+err2;  ld  r6,8(r4)
+err2;  ld  r8,16(r4)
+err2;  ld  r9,24(r4)
+err2;  ld  r10,32(r4)
+err2;  ld  r11,40(r4)
+err2;  ld  r12,48(r4)
+err2;  ld  r14,56(r4)
+err2;  ld  r15,64(r4)
+err2;  ld  r16,72

[PATCH v10 4/7] extable: Add function to search only kernel exception table

2019-08-14 Thread Santosh Sivaraj
In certain architecture-specific operating modes (e.g., the powerpc
machine check handler, which is unable to access vmalloc memory),
search_exception_tables() cannot be called because it also searches the
module exception tables if the entry is not found in the kernel
exception table.

Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Nicholas Piggin 
Signed-off-by: Santosh Sivaraj 
Reviewed-by: Nicholas Piggin 
---
 include/linux/extable.h |  2 ++
 kernel/extable.c| 11 +--
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/include/linux/extable.h b/include/linux/extable.h
index 41c5b3a25f67..81ecfaa83ad3 100644
--- a/include/linux/extable.h
+++ b/include/linux/extable.h
@@ -19,6 +19,8 @@ void trim_init_extable(struct module *m);
 
 /* Given an address, look for it in the exception tables */
 const struct exception_table_entry *search_exception_tables(unsigned long add);
+const struct exception_table_entry *
+search_kernel_exception_table(unsigned long addr);
 
 #ifdef CONFIG_MODULES
 /* For extable.c to search modules' exception tables. */
diff --git a/kernel/extable.c b/kernel/extable.c
index e23cce6e6092..f6c9406eec7d 100644
--- a/kernel/extable.c
+++ b/kernel/extable.c
@@ -40,13 +40,20 @@ void __init sort_main_extable(void)
}
 }
 
+/* Given an address, look for it in the kernel exception table */
+const
+struct exception_table_entry *search_kernel_exception_table(unsigned long addr)
+{
+   return search_extable(__start___ex_table,
+ __stop___ex_table - __start___ex_table, addr);
+}
+
 /* Given an address, look for it in the exception tables. */
 const struct exception_table_entry *search_exception_tables(unsigned long addr)
 {
const struct exception_table_entry *e;
 
-   e = search_extable(__start___ex_table,
-  __stop___ex_table - __start___ex_table, addr);
+   e = search_kernel_exception_table(addr);
if (!e)
e = search_module_extables(addr);
return e;
-- 
2.21.0



[PATCH v10 3/7] powerpc/mce: Make machine_check_ue_event() static

2019-08-14 Thread Santosh Sivaraj
From: Reza Arbab 

The function doesn't get used outside this file, so make it static.

Signed-off-by: Reza Arbab 
Signed-off-by: Santosh Sivaraj 
Reviewed-by: Nicholas Piggin 
---
 arch/powerpc/kernel/mce.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
index cff31d4a501f..a3b122a685a5 100644
--- a/arch/powerpc/kernel/mce.c
+++ b/arch/powerpc/kernel/mce.c
@@ -34,7 +34,7 @@ static DEFINE_PER_CPU(struct machine_check_event[MAX_MC_EVT],
 
 static void machine_check_process_queued_event(struct irq_work *work);
 static void machine_check_ue_irq_work(struct irq_work *work);
-void machine_check_ue_event(struct machine_check_event *evt);
+static void machine_check_ue_event(struct machine_check_event *evt);
 static void machine_process_ue_event(struct work_struct *work);
 
 static struct irq_work mce_event_process_work = {
@@ -212,7 +212,7 @@ static void machine_check_ue_irq_work(struct irq_work *work)
 /*
  * Queue up the MCE event which then can be handled later.
  */
-void machine_check_ue_event(struct machine_check_event *evt)
+static void machine_check_ue_event(struct machine_check_event *evt)
 {
int index;
 
-- 
2.21.0



[PATCH v10 2/7] powerpc/mce: Fix MCE handling for huge pages

2019-08-14 Thread Santosh Sivaraj
From: Balbir Singh 

The current code would fail on huge page addresses, since the shift would
be incorrect. Use the correct page shift value returned by
__find_linux_pte() to get the correct physical address. The code is more
generic and can handle both regular and compound pages.

Fixes: ba41e1e1ccb9 ("powerpc/mce: Hookup derror (load/store) UE errors")
Signed-off-by: Balbir Singh 
[ar...@linux.ibm.com: Fixup pseries_do_memory_failure()]
Signed-off-by: Reza Arbab 
Co-developed-by: Santosh Sivaraj 
Signed-off-by: Santosh Sivaraj 
Tested-by: Mahesh Salgaonkar 
Cc: sta...@vger.kernel.org # v4.15+
---
 arch/powerpc/include/asm/mce.h   |  2 +-
 arch/powerpc/kernel/mce_power.c  | 55 ++--
 arch/powerpc/platforms/pseries/ras.c |  9 ++---
 3 files changed, 32 insertions(+), 34 deletions(-)

diff --git a/arch/powerpc/include/asm/mce.h b/arch/powerpc/include/asm/mce.h
index a4c6a74ad2fb..f3a6036b6bc0 100644
--- a/arch/powerpc/include/asm/mce.h
+++ b/arch/powerpc/include/asm/mce.h
@@ -209,7 +209,7 @@ extern void release_mce_event(void);
 extern void machine_check_queue_event(void);
 extern void machine_check_print_event_info(struct machine_check_event *evt,
   bool user_mode, bool in_guest);
-unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr);
+unsigned long addr_to_phys(struct pt_regs *regs, unsigned long addr);
 #ifdef CONFIG_PPC_BOOK3S_64
 void flush_and_reload_slb(void);
 #endif /* CONFIG_PPC_BOOK3S_64 */
diff --git a/arch/powerpc/kernel/mce_power.c b/arch/powerpc/kernel/mce_power.c
index a814d2dfb5b0..e74816f045f8 100644
--- a/arch/powerpc/kernel/mce_power.c
+++ b/arch/powerpc/kernel/mce_power.c
@@ -20,13 +20,14 @@
 #include 
 
 /*
- * Convert an address related to an mm to a PFN. NOTE: we are in real
- * mode, we could potentially race with page table updates.
+ * Convert an address related to an mm to a physical address.
+ * NOTE: we are in real mode, we could potentially race with page table 
updates.
  */
-unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr)
+unsigned long addr_to_phys(struct pt_regs *regs, unsigned long addr)
 {
-   pte_t *ptep;
-   unsigned long flags;
+   pte_t *ptep, pte;
+   unsigned int shift;
+   unsigned long flags, phys_addr;
struct mm_struct *mm;
 
if (user_mode(regs))
@@ -35,14 +36,21 @@ unsigned long addr_to_pfn(struct pt_regs *regs, unsigned 
long addr)
mm = _mm;
 
local_irq_save(flags);
-   if (mm == current->mm)
-   ptep = find_current_mm_pte(mm->pgd, addr, NULL, NULL);
-   else
-   ptep = find_init_mm_pte(addr, NULL);
+   ptep = __find_linux_pte(mm->pgd, addr, NULL, );
local_irq_restore(flags);
+
if (!ptep || pte_special(*ptep))
return ULONG_MAX;
-   return pte_pfn(*ptep);
+
+   pte = *ptep;
+   if (shift > PAGE_SHIFT) {
+   unsigned long rpnmask = (1ul << shift) - PAGE_SIZE;
+
+   pte = __pte(pte_val(pte) | (addr & rpnmask));
+   }
+   phys_addr = pte_pfn(pte) << PAGE_SHIFT;
+
+   return phys_addr;
 }
 
 /* flush SLBs and reload */
@@ -344,7 +352,7 @@ static const struct mce_derror_table mce_p9_derror_table[] 
= {
   MCE_INITIATOR_CPU,   MCE_SEV_SEVERE, true },
 { 0, false, 0, 0, 0, 0, 0 } };
 
-static int mce_find_instr_ea_and_pfn(struct pt_regs *regs, uint64_t *addr,
+static int mce_find_instr_ea_and_phys(struct pt_regs *regs, uint64_t *addr,
uint64_t *phys_addr)
 {
/*
@@ -354,18 +362,16 @@ static int mce_find_instr_ea_and_pfn(struct pt_regs 
*regs, uint64_t *addr,
 * faults
 */
int instr;
-   unsigned long pfn, instr_addr;
+   unsigned long instr_addr;
struct instruction_op op;
struct pt_regs tmp = *regs;
 
-   pfn = addr_to_pfn(regs, regs->nip);
-   if (pfn != ULONG_MAX) {
-   instr_addr = (pfn << PAGE_SHIFT) + (regs->nip & ~PAGE_MASK);
+   instr_addr = addr_to_phys(regs, regs->nip) + (regs->nip & ~PAGE_MASK);
+   if (instr_addr != ULONG_MAX) {
instr = *(unsigned int *)(instr_addr);
if (!analyse_instr(, , instr)) {
-   pfn = addr_to_pfn(regs, op.ea);
*addr = op.ea;
-   *phys_addr = (pfn << PAGE_SHIFT);
+   *phys_addr = addr_to_phys(regs, op.ea);
return 0;
}
/*
@@ -440,15 +446,9 @@ static int mce_handle_ierror(struct pt_regs *regs,
*addr = regs->nip;
if (mce_err->sync_error &&
table[i].error_type == MCE_ERROR_TYPE_UE) {
-   unsigned long pfn;
-
-   if (get_paca()->in_mce < 

[PATCH v10 1/7] powerpc/mce: Schedule work from irq_work

2019-08-14 Thread Santosh Sivaraj
schedule_work() cannot be called from MCE exception context, as an MCE
can interrupt even in an interrupt-disabled context.

Fixes: 733e4a4c ("powerpc/mce: hookup memory_failure for UE errors")
Suggested-by: Mahesh Salgaonkar 
Signed-off-by: Santosh Sivaraj 
Reviewed-by: Mahesh Salgaonkar 
Acked-by: Balbir Singh 
Cc: sta...@vger.kernel.org # v4.15+
---
 arch/powerpc/kernel/mce.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
index b18df633eae9..cff31d4a501f 100644
--- a/arch/powerpc/kernel/mce.c
+++ b/arch/powerpc/kernel/mce.c
@@ -33,6 +33,7 @@ static DEFINE_PER_CPU(struct machine_check_event[MAX_MC_EVT],
mce_ue_event_queue);
 
 static void machine_check_process_queued_event(struct irq_work *work);
+static void machine_check_ue_irq_work(struct irq_work *work);
 void machine_check_ue_event(struct machine_check_event *evt);
 static void machine_process_ue_event(struct work_struct *work);
 
@@ -40,6 +41,10 @@ static struct irq_work mce_event_process_work = {
 .func = machine_check_process_queued_event,
 };
 
+static struct irq_work mce_ue_event_irq_work = {
+   .func = machine_check_ue_irq_work,
+};
+
 DECLARE_WORK(mce_ue_event_work, machine_process_ue_event);
 
 static void mce_set_error_info(struct machine_check_event *mce,
@@ -199,6 +204,10 @@ void release_mce_event(void)
get_mce_event(NULL, true);
 }
 
+static void machine_check_ue_irq_work(struct irq_work *work)
+{
+   schedule_work(_ue_event_work);
+}
 
 /*
  * Queue up the MCE event which then can be handled later.
@@ -216,7 +225,7 @@ void machine_check_ue_event(struct machine_check_event *evt)
memcpy(this_cpu_ptr(_ue_event_queue[index]), evt, sizeof(*evt));
 
/* Queue work to process this event later. */
-   schedule_work(_ue_event_work);
+   irq_work_queue(_ue_event_irq_work);
 }
 
 /*
-- 
2.21.0



[PATCH v10 0/7] powerpc: implement machine check safe memcpy

2019-08-14 Thread Santosh Sivaraj
During a memcpy from a pmem device, if a machine check exception is
generated we end up in a panic. In the case of an fsdax read, this should
only result in -EIO. Avoid the panic by implementing memcpy_mcsafe.

Before this patch series:

```
bash-4.4# mount -o dax /dev/pmem0 /mnt/pmem/
[ 7621.714094] Disabling lock debugging due to kernel taint
[ 7621.714099] MCE: CPU0: machine check (Severe) Host UE Load/Store [Not 
recovered]
[ 7621.714104] MCE: CPU0: NIP: [c0088978] memcpy_power7+0x418/0x7e0
[ 7621.714107] MCE: CPU0: Hardware error
[ 7621.714112] opal: Hardware platform error: Unrecoverable Machine Check 
exception
[ 7621.714118] CPU: 0 PID: 1368 Comm: mount Tainted: G   M  
5.2.0-rc5-00239-g241e39004581
#50
[ 7621.714123] NIP:  c0088978 LR: c08e16f8 CTR: 01de
[ 7621.714129] REGS: c000fffbfd70 TRAP: 0200   Tainted: G   M  
(5.2.0-rc5-00239-g241e39004581)
[ 7621.714131] MSR:  92209033   CR: 
24428840  XER: 0004
[ 7621.714160] CFAR: c00889a8 DAR: deadbeefdeadbeef DSISR: 8000 
IRQMASK: 0
[ 7621.714171] GPR00: 0e00 c000f0b8b1e0 c12cf100 
c000ed8e1100 
[ 7621.714186] GPR04: c2001100 0001 0200 
03fff1272000 
[ 7621.714201] GPR08: 8000 0010 0020 
0030 
[ 7621.714216] GPR12: 0040 7fffb8c6d390 0050 
0060 
[ 7621.714232] GPR16: 0070  0001 
c000f0b8b960 
[ 7621.714247] GPR20: 0001 c000f0b8b940 0001 
0001 
[ 7621.714262] GPR24: c1382560 c00c003b6380 c00c003b6380 
0001 
[ 7621.714277] GPR28:  0001 c200 
0001 
[ 7621.714294] NIP [c0088978] memcpy_power7+0x418/0x7e0
[ 7621.714298] LR [c08e16f8] pmem_do_bvec+0xf8/0x430
...  ...
```

After this patch series:

```
bash-4.4# mount -o dax /dev/pmem0 /mnt/pmem/
[25302.883978] Buffer I/O error on dev pmem0, logical block 0, async page read
[25303.020816] EXT4-fs (pmem0): DAX enabled. Warning: EXPERIMENTAL, use at your 
own risk
[25303.021236] EXT4-fs (pmem0): Can't read superblock on 2nd try
[25303.152515] EXT4-fs (pmem0): DAX enabled. Warning: EXPERIMENTAL, use at your 
own risk
[25303.284031] EXT4-fs (pmem0): DAX enabled. Warning: EXPERIMENTAL, use at your 
own risk
[25304.084100] UDF-fs: bad mount option "dax" or missing value
mount: /mnt/pmem: wrong fs type, bad option, bad superblock on /dev/pmem0, 
missing codepage or helper
program, or other error.
```

MCE is injected on a pmem address using mambo. The last patch which adds a
nop is only for testing on mambo, where r13 is not restored upon hitting
vector 0x200.

The memcpy code can be further optimised with VMX, and GAS macros can be
used to enable code reusability; I will send that as another series.
--
v10: Fix authorship; add reviewed-bys and acks.

v9:
* Add a new IRQ work for UE events [mahesh]
* Reorder patches, and copy stable

v8:
* While ignoring UE events, return was used instead of continue.
* Checkpatch fixups for commit log

v7:
* Move schedule_work to be called from irq_work.

v6:
* Don't return pfn; all callers expect a physical address anyway [nick]
* Patch re-ordering: move exception table patch before memcpy_mcsafe patch 
[nick]
* Reword commit log for search_exception_tables patch [nick]

v5:
* Don't use search_exception_tables since it searches for module exception
  tables also [Nicholas]
* Fix commit message for patch 2 [Nicholas]

v4:
* Squash return remaining bytes patch into the memcpy_mcsafe implementation patch
[christophe]
* access_ok() should be checked in copy_to_user_mcsafe() [christophe]

v3:
* Drop patch which enables DR/IR for external modules
* Drop notifier call chain, we don't want to do that in real mode
* Return remaining bytes from memcpy_mcsafe correctly
* We no longer restore r13 for simulator tests, rather use a nop at 
  vector 0x200 [workaround for simulator; not to be merged]

v2:
* Don't set RI bit explicitly [mahesh]
* Re-ordered series to get r13 workaround as the last patch

--
Balbir Singh (3):
  powerpc/mce: Fix MCE handling for huge pages
  powerpc/memcpy: Add memcpy_mcsafe for pmem
  powerpc/mce: Handle UE event for memcpy_mcsafe

Reza Arbab (1):
  powerpc/mce: Make machine_check_ue_event() static

Santosh Sivaraj (3):
  powerpc/mce: Schedule work from irq_work
  extable: Add function to search only kernel exception table
  powerpc: add machine check safe copy_to_user

 arch/powerpc/Kconfig |   1 +
 arch/powerpc/include/asm/mce.h   |   6 +-
 arch/powerpc/include/asm/string.h|   2 +
 arch/powerpc/include/asm/uaccess.h   |  14 ++
 arch/powerpc/kernel/mce.c|  31 +++-
 arch/powerpc/kernel/mce_power.c  |  70 
 arch/powerpc/lib/Makefile|   2 +-
 arch/powerpc/lib/memcpy_mcsafe_6

Re: [PATCH v9 6/7] powerpc/mce: Handle UE event for memcpy_mcsafe

2019-08-14 Thread Santosh Sivaraj
Hi Balbir,

Balbir Singh  writes:

> On 12/8/19 7:22 pm, Santosh Sivaraj wrote:
>> If we take a UE on one of the instructions with a fixup entry, set nip
>> to continue execution at the fixup entry. Do not process the event
>> further or print it.
>> 
>> Co-developed-by: Reza Arbab 
>> Signed-off-by: Reza Arbab 
>> Cc: Mahesh Salgaonkar 
>> Signed-off-by: Santosh Sivaraj 
>> ---
>
> Isn't this based on https://patchwork.ozlabs.org/patch/895294/? If so it
> should still have my author tag and signed-off-by

Originally, when I received the series for posting, I had Reza's authorship and
signed-off-by on it; since the patch changed significantly, I added Reza as
Co-developed-by. I will update it in the next spin.

https://lore.kernel.org/linuxppc-dev/20190702051932.511-1-sant...@fossix.org/

Santosh
>
> Balbir Singh
>
>>  arch/powerpc/include/asm/mce.h  |  4 +++-
>>  arch/powerpc/kernel/mce.c   | 16 
>>  arch/powerpc/kernel/mce_power.c | 15 +--
>>  3 files changed, 32 insertions(+), 3 deletions(-)
>> 
>> diff --git a/arch/powerpc/include/asm/mce.h b/arch/powerpc/include/asm/mce.h
>> index f3a6036b6bc0..e1931c8c2743 100644
>> --- a/arch/powerpc/include/asm/mce.h
>> +++ b/arch/powerpc/include/asm/mce.h
>> @@ -122,7 +122,8 @@ struct machine_check_event {
>>  enum MCE_UeErrorType ue_error_type:8;
>>  u8  effective_address_provided;
>>  u8  physical_address_provided;
>> -u8  reserved_1[5];
>> +u8  ignore_event;
>> +u8  reserved_1[4];
>>  u64 effective_address;
>>  u64 physical_address;
>>  u8  reserved_2[8];
>> @@ -193,6 +194,7 @@ struct mce_error_info {
>>  enum MCE_Initiator  initiator:8;
>>  enum MCE_ErrorClass error_class:8;
>>  boolsync_error;
>> +boolignore_event;
>>  };
>>  
>>  #define MAX_MC_EVT  100
>> diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
>> index a3b122a685a5..ec4b3e1087be 100644
>> --- a/arch/powerpc/kernel/mce.c
>> +++ b/arch/powerpc/kernel/mce.c
>> @@ -149,6 +149,7 @@ void save_mce_event(struct pt_regs *regs, long handled,
>>  if (phys_addr != ULONG_MAX) {
>>  mce->u.ue_error.physical_address_provided = true;
>>  mce->u.ue_error.physical_address = phys_addr;
>> +mce->u.ue_error.ignore_event = mce_err->ignore_event;
>>  machine_check_ue_event(mce);
>>  }
>>  }
>> @@ -266,8 +267,17 @@ static void machine_process_ue_event(struct work_struct 
>> *work)
>>  /*
>>   * This should probably queued elsewhere, but
>>   * oh! well
>> + *
>> + * Don't report this machine check because the caller has
>> + * asked us to ignore the event; it has a fixup handler which
>> + * will do the appropriate error handling and reporting.
>>   */
>>  if (evt->error_type == MCE_ERROR_TYPE_UE) {
>> +if (evt->u.ue_error.ignore_event) {
>> +__this_cpu_dec(mce_ue_count);
>> +continue;
>> +}
>> +
>>  if (evt->u.ue_error.physical_address_provided) {
>>  unsigned long pfn;
>>  
>> @@ -301,6 +311,12 @@ static void machine_check_process_queued_event(struct 
>> irq_work *work)
>>  while (__this_cpu_read(mce_queue_count) > 0) {
>>  index = __this_cpu_read(mce_queue_count) - 1;
>>  evt = this_cpu_ptr(_event_queue[index]);
>> +
>> +if (evt->error_type == MCE_ERROR_TYPE_UE &&
>> +evt->u.ue_error.ignore_event) {
>> +__this_cpu_dec(mce_queue_count);
>> +continue;
>> +}
>>  machine_check_print_event_info(evt, false, false);
>>  __this_cpu_dec(mce_queue_count);
>>  }
>> diff --git a/arch/powerpc/kernel/mce_power.c 
>> b/arch/powerpc/kernel/mce_power.c
>> index e74816f045f8..1dd87f6f5186 100644
>> --- a/arch/powerpc/kernel/mce_power.c
>> +++ b/arch/powerpc/kernel/mce_power.c
>> @@ -11,6 +11,7 @@

Re: [PATCH v9 7/7] powerpc: add machine check safe copy_to_user

2019-08-14 Thread Santosh Sivaraj
Hi Balbir,

Balbir Singh  writes:

> On 12/8/19 7:22 pm, Santosh Sivaraj wrote:
>> Use the memcpy_mcsafe() implementation to define copy_to_user_mcsafe().
>> 
>> Signed-off-by: Santosh Sivaraj 
>> ---
>>  arch/powerpc/Kconfig   |  1 +
>>  arch/powerpc/include/asm/uaccess.h | 14 ++
>>  2 files changed, 15 insertions(+)
>> 
>> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
>> index 77f6ebf97113..4316e36095a2 100644
>> --- a/arch/powerpc/Kconfig
>> +++ b/arch/powerpc/Kconfig
>> @@ -137,6 +137,7 @@ config PPC
>>  select ARCH_HAS_STRICT_KERNEL_RWX   if ((PPC_BOOK3S_64 || PPC32) && 
>> !RELOCATABLE && !HIBERNATION)
>>  select ARCH_HAS_TICK_BROADCAST  if GENERIC_CLOCKEVENTS_BROADCAST
>>  select ARCH_HAS_UACCESS_FLUSHCACHE  if PPC64
>> +select ARCH_HAS_UACCESS_MCSAFE  if PPC64
>>  select ARCH_HAS_UBSAN_SANITIZE_ALL
>>  select ARCH_HAVE_NMI_SAFE_CMPXCHG
>>  select ARCH_KEEP_MEMBLOCK
>> diff --git a/arch/powerpc/include/asm/uaccess.h 
>> b/arch/powerpc/include/asm/uaccess.h
>> index 8b03eb44e876..15002b51ff18 100644
>> --- a/arch/powerpc/include/asm/uaccess.h
>> +++ b/arch/powerpc/include/asm/uaccess.h
>> @@ -387,6 +387,20 @@ static inline unsigned long raw_copy_to_user(void 
>> __user *to,
>>  return ret;
>>  }
>>  
>> +static __always_inline unsigned long __must_check
>> +copy_to_user_mcsafe(void __user *to, const void *from, unsigned long n)
>> +{
>> +if (likely(check_copy_size(from, n, true))) {
>> +if (access_ok(to, n)) {
>> +allow_write_to_user(to, n);
>> +n = memcpy_mcsafe((void *)to, from, n);
>> +prevent_write_to_user(to, n);
>> +}
>> +}
>> +
>> +return n;
>
> Do we always return n independent of the check_copy_size return value and
> access_ok return values?

Yes, we always return the remaining bytes not copied, even if check_copy_size()
or access_ok() fails; in that case nothing is copied, so the full length is
returned (see the sketch below).
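
For illustration, a minimal hypothetical caller (only copy_to_user_mcsafe()
and its return-bytes-not-copied semantics are from the patch; the wrapper
below is made up):

```
/*
 * Sketch only: a full-length return means nothing was copied, either
 * because check_copy_size()/access_ok() failed or because the machine
 * check hit right at the start; anything in between is a partial copy.
 */
static ssize_t write_to_user_sketch(void __user *ubuf, const void *kbuf,
				    size_t len)
{
	unsigned long left = copy_to_user_mcsafe(ubuf, kbuf, len);

	if (left == len)
		return -EFAULT;

	return len - left;	/* short write, much like a short read(2) */
}
```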

Santosh

>
> Balbir Singh.
>
>> +}
>> +
>>  extern unsigned long __clear_user(void __user *addr, unsigned long size);
>>  
>>  static inline unsigned long clear_user(void __user *addr, unsigned long 
>> size)
>> 


Re: [PATCH v9 4/7] extable: Add function to search only kernel exception table

2019-08-14 Thread Santosh Sivaraj
Balbir Singh  writes:

> On 12/8/19 7:22 pm, Santosh Sivaraj wrote:
>> In certain architecture-specific operating modes (e.g., the powerpc machine
>> check handler, which is unable to access vmalloc memory),
>> search_exception_tables() cannot be called because it also searches the
>> module exception tables if the entry is not found in the kernel exception
>> table.
>> 
>> Cc: Thomas Gleixner 
>> Cc: Ingo Molnar 
>> Cc: Nicholas Piggin 
>> Signed-off-by: Santosh Sivaraj 
>> Reviewed-by: Nicholas Piggin 
>> ---
>>  include/linux/extable.h |  2 ++
>>  kernel/extable.c| 11 +--
>>  2 files changed, 11 insertions(+), 2 deletions(-)
>> 
>> diff --git a/include/linux/extable.h b/include/linux/extable.h
>> index 41c5b3a25f67..81ecfaa83ad3 100644
>> --- a/include/linux/extable.h
>> +++ b/include/linux/extable.h
>> @@ -19,6 +19,8 @@ void trim_init_extable(struct module *m);
>>  
>>  /* Given an address, look for it in the exception tables */
>>  const struct exception_table_entry *search_exception_tables(unsigned long 
>> add);
>> +const struct exception_table_entry *
>> +search_kernel_exception_table(unsigned long addr);
>> 
>
> Can we find a better name? search_kernel still sounds like all of the kernel.
> Can we rename it to search_kernel_linear_map_extable?

I thought search_kernel_exception_table and search_module_extables were
unambiguous enough :-) But if you think the name will be confusing, I can
change that as suggested.

Thanks,
Santosh

>
>  
>>  #ifdef CONFIG_MODULES
>>  /* For extable.c to search modules' exception tables. */
>> diff --git a/kernel/extable.c b/kernel/extable.c
>> index e23cce6e6092..f6c9406eec7d 100644
>> --- a/kernel/extable.c
>> +++ b/kernel/extable.c
>> @@ -40,13 +40,20 @@ void __init sort_main_extable(void)
>>  }
>>  }
>>  
>> +/* Given an address, look for it in the kernel exception table */
>> +const
>> +struct exception_table_entry *search_kernel_exception_table(unsigned long 
>> addr)
>> +{
>> +return search_extable(__start___ex_table,
>> +  __stop___ex_table - __start___ex_table, addr);
>> +}
>> +
>>  /* Given an address, look for it in the exception tables. */
>>  const struct exception_table_entry *search_exception_tables(unsigned long 
>> addr)
>>  {
>>  const struct exception_table_entry *e;
>>  
>> -e = search_extable(__start___ex_table,
>> -   __stop___ex_table - __start___ex_table, addr);
>> +e = search_kernel_exception_table(addr);
>>  if (!e)
>>  e = search_module_extables(addr);
>>  return e;
>> 

-- 
if (( RANDOM % 2 )); then ~/bin/cookie; else fortune -s; fi
#cat ~/notes/quotes | sort -R | head -1 | cut -f2- -d " "


[PATCH 3/3] papr/scm: Add bad memory ranges to nvdimm bad ranges

2019-08-14 Thread Santosh Sivaraj
Subscribe to the MCE notification chain and add the physical address which
generated a memory error to the nvdimm bad range.

Signed-off-by: Santosh Sivaraj 
---
 arch/powerpc/platforms/pseries/papr_scm.c | 65 +++
 1 file changed, 65 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index a5ac371a3f06..4d25c98a9835 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -12,6 +12,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include 
 
@@ -39,8 +41,12 @@ struct papr_scm_priv {
struct resource res;
struct nd_region *region;
struct nd_interleave_set nd_set;
+   struct list_head list;
 };
 
+LIST_HEAD(papr_nd_regions);
+DEFINE_MUTEX(papr_ndr_lock);
+
 static int drc_pmem_bind(struct papr_scm_priv *p)
 {
unsigned long ret[PLPAR_HCALL_BUFSIZE];
@@ -364,6 +370,10 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
dev_info(dev, "Region registered with target node %d and online 
node %d",
 target_nid, online_nid);
 
+   mutex_lock(_ndr_lock);
+   list_add_tail(>list, _nd_regions);
+   mutex_unlock(_ndr_lock);
+
return 0;
 
 err:   nvdimm_bus_unregister(p->bus);
@@ -371,6 +381,60 @@ err:   nvdimm_bus_unregister(p->bus);
return -ENXIO;
 }
 
+static int handle_mce_ue(struct notifier_block *nb, unsigned long val,
+void *data)
+{
+   struct machine_check_event *evt = data;
+   struct papr_scm_priv *p;
+   u64 phys_addr;
+
+   if (evt->error_type != MCE_ERROR_TYPE_UE)
+   return NOTIFY_DONE;
+
+   if (list_empty(_nd_regions))
+   return NOTIFY_DONE;
+
+   phys_addr = evt->u.ue_error.physical_address +
+   (evt->u.ue_error.effective_address & ~PAGE_MASK);
+
+   if (!evt->u.ue_error.physical_address_provided ||
+   !is_zone_device_page(pfn_to_page(phys_addr >> PAGE_SHIFT)))
+   return NOTIFY_DONE;
+
+   mutex_lock(_ndr_lock);
+   list_for_each_entry(p, _nd_regions, list) {
+   struct resource res = p->res;
+   u64 aligned_addr;
+
+   if (res.start > phys_addr)
+   continue;
+
+   if (res.end < phys_addr)
+   continue;
+
+   aligned_addr = ALIGN_DOWN(phys_addr, L1_CACHE_BYTES);
+   pr_debug("Add memory range (0x%llx -- 0x%llx) as bad range\n",
+aligned_addr, aligned_addr + L1_CACHE_BYTES);
+
+   if (nvdimm_bus_add_badrange(p->bus,
+   aligned_addr, L1_CACHE_BYTES))
+   pr_warn("Failed to add bad range (0x%llx -- 0x%llx)\n",
+   aligned_addr, aligned_addr + L1_CACHE_BYTES);
+
+   nvdimm_region_notify(p->region,
+NVDIMM_REVALIDATE_POISON);
+
+   break;
+   }
+   mutex_unlock(_ndr_lock);
+
+   return NOTIFY_OK;
+}
+
+static struct notifier_block mce_ue_nb = {
+   .notifier_call = handle_mce_ue
+};
+
 static int papr_scm_probe(struct platform_device *pdev)
 {
struct device_node *dn = pdev->dev.of_node;
@@ -456,6 +520,7 @@ static int papr_scm_probe(struct platform_device *pdev)
goto err2;
 
platform_set_drvdata(pdev, p);
+   mce_register_notifier(_ue_nb);
 
return 0;
 
-- 
2.21.0



[PATCH 2/3] of_pmem: Add memory ranges which took a mce to bad range

2019-08-14 Thread Santosh Sivaraj
Subscribe to the MCE notification chain and add the physical address which
generated a memory error to the nvdimm bad range.

Signed-off-by: Santosh Sivaraj 
---
 drivers/nvdimm/of_pmem.c | 122 +--
 1 file changed, 103 insertions(+), 19 deletions(-)

diff --git a/drivers/nvdimm/of_pmem.c b/drivers/nvdimm/of_pmem.c
index a0c8dcfa0bf9..828dbfe44ca6 100644
--- a/drivers/nvdimm/of_pmem.c
+++ b/drivers/nvdimm/of_pmem.c
@@ -8,6 +8,9 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
 
 static const struct attribute_group *region_attr_groups[] = {
_region_attribute_group,
@@ -25,11 +28,77 @@ struct of_pmem_private {
struct nvdimm_bus *bus;
 };
 
+struct of_pmem_region {
+   struct of_pmem_private *priv;
+   struct nd_region_desc *region_desc;
+   struct nd_region *region;
+   struct list_head list;
+};
+
+LIST_HEAD(pmem_regions);
+DEFINE_MUTEX(pmem_region_lock);
+
+static int handle_mce_ue(struct notifier_block *nb, unsigned long val,
+void *data)
+{
+   struct machine_check_event *evt = data;
+   struct of_pmem_region *pmem_region;
+   u64 phys_addr;
+
+   if (evt->error_type != MCE_ERROR_TYPE_UE)
+   return NOTIFY_DONE;
+
+   if (list_empty(_regions))
+   return NOTIFY_DONE;
+
+   phys_addr = evt->u.ue_error.physical_address +
+   (evt->u.ue_error.effective_address & ~PAGE_MASK);
+
+   if (!evt->u.ue_error.physical_address_provided ||
+   !is_zone_device_page(pfn_to_page(phys_addr >> PAGE_SHIFT)))
+   return NOTIFY_DONE;
+
+   mutex_lock(_region_lock);
+   list_for_each_entry(pmem_region, _regions, list) {
+   struct resource *res = pmem_region->region_desc->res;
+   u64 aligned_addr;
+
+   if (res->start > phys_addr)
+   continue;
+
+   if (res->end < phys_addr)
+   continue;
+
+   aligned_addr = ALIGN_DOWN(phys_addr, L1_CACHE_BYTES);
+   pr_debug("Add memory range (0x%llx -- 0x%llx) as bad range\n",
+aligned_addr, aligned_addr + L1_CACHE_BYTES);
+
+   if (nvdimm_bus_add_badrange(pmem_region->priv->bus,
+aligned_addr, L1_CACHE_BYTES))
+   pr_warn("Failed to add bad range (0x%llx -- 0x%llx)\n",
+   aligned_addr, aligned_addr + L1_CACHE_BYTES);
+
+   nvdimm_region_notify(pmem_region->region,
+NVDIMM_REVALIDATE_POISON);
+
+   break;
+   }
+   mutex_unlock(_region_lock);
+
+   return NOTIFY_OK;
+}
+
+static struct notifier_block mce_ue_nb = {
+   .notifier_call = handle_mce_ue
+};
+
 static int of_pmem_region_probe(struct platform_device *pdev)
 {
struct of_pmem_private *priv;
struct device_node *np;
struct nvdimm_bus *bus;
+   struct of_pmem_region *pmem_region;
+   struct nd_region_desc *ndr_desc;
bool is_volatile;
int i;
 
@@ -58,34 +127,49 @@ static int of_pmem_region_probe(struct platform_device 
*pdev)
is_volatile ? "volatile" : "non-volatile",  np);
 
for (i = 0; i < pdev->num_resources; i++) {
-   struct nd_region_desc ndr_desc;
struct nd_region *region;
 
-   /*
-* NB: libnvdimm copies the data from ndr_desc into it's own
-* structures so passing a stack pointer is fine.
-*/
-   memset(_desc, 0, sizeof(ndr_desc));
-   ndr_desc.attr_groups = region_attr_groups;
-   ndr_desc.numa_node = dev_to_node(>dev);
-   ndr_desc.target_node = ndr_desc.numa_node;
-   ndr_desc.res = >resource[i];
-   ndr_desc.of_node = np;
-   set_bit(ND_REGION_PAGEMAP, _desc.flags);
+   ndr_desc = kzalloc(sizeof(struct nd_region_desc), GFP_KERNEL);
+   if (!ndr_desc) {
+   nvdimm_bus_unregister(priv->bus);
+   kfree(priv);
+   return -ENOMEM;
+   }
+
+   ndr_desc->attr_groups = region_attr_groups;
+   ndr_desc->numa_node = dev_to_node(>dev);
+   ndr_desc->target_node = ndr_desc->numa_node;
+   ndr_desc->res = >resource[i];
+   ndr_desc->of_node = np;
+   set_bit(ND_REGION_PAGEMAP, _desc->flags);
 
if (is_volatile)
-   region = nvdimm_volatile_region_create(bus, _desc);
+   region = nvdimm_volatile_region_create(bus, ndr_desc);
else
-   region = nvdimm_pmem_region_create(bus, _desc);
+

[PATCH 1/3] powerpc/mce: Add MCE notification chain

2019-08-14 Thread Santosh Sivaraj
This is needed to report bad blocks for persistent memory.

Signed-off-by: Santosh Sivaraj 
---
 arch/powerpc/include/asm/mce.h |  3 +++
 arch/powerpc/kernel/mce.c  | 15 +++
 2 files changed, 18 insertions(+)

diff --git a/arch/powerpc/include/asm/mce.h b/arch/powerpc/include/asm/mce.h
index e1931c8c2743..b1c6363f924c 100644
--- a/arch/powerpc/include/asm/mce.h
+++ b/arch/powerpc/include/asm/mce.h
@@ -212,6 +212,9 @@ extern void machine_check_queue_event(void);
 extern void machine_check_print_event_info(struct machine_check_event *evt,
   bool user_mode, bool in_guest);
 unsigned long addr_to_phys(struct pt_regs *regs, unsigned long addr);
+int mce_register_notifier(struct notifier_block *nb);
+int mce_unregister_notifier(struct notifier_block *nb);
+
 #ifdef CONFIG_PPC_BOOK3S_64
 void flush_and_reload_slb(void);
 #endif /* CONFIG_PPC_BOOK3S_64 */
diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
index ec4b3e1087be..a78210ca6cd9 100644
--- a/arch/powerpc/kernel/mce.c
+++ b/arch/powerpc/kernel/mce.c
@@ -47,6 +47,20 @@ static struct irq_work mce_ue_event_irq_work = {
 
 DECLARE_WORK(mce_ue_event_work, machine_process_ue_event);
 
+static BLOCKING_NOTIFIER_HEAD(mce_notifier_list);
+
+int mce_register_notifier(struct notifier_block *nb)
+{
+   return blocking_notifier_chain_register(_notifier_list, nb);
+}
+EXPORT_SYMBOL_GPL(mce_register_notifier);
+
+int mce_unregister_notifier(struct notifier_block *nb)
+{
+   return blocking_notifier_chain_unregister(_notifier_list, nb);
+}
+EXPORT_SYMBOL_GPL(mce_unregister_notifier);
+
 static void mce_set_error_info(struct machine_check_event *mce,
   struct mce_error_info *mce_err)
 {
@@ -263,6 +277,7 @@ static void machine_process_ue_event(struct work_struct 
*work)
while (__this_cpu_read(mce_ue_count) > 0) {
index = __this_cpu_read(mce_ue_count) - 1;
evt = this_cpu_ptr(_ue_event_queue[index]);
+   blocking_notifier_call_chain(_notifier_list, 0, evt);
 #ifdef CONFIG_MEMORY_FAILURE
/*
 * This should probably queued elsewhere, but
-- 
2.21.0



[PATCH 0/3] Add bad pmem bad blocks to bad range

2019-08-14 Thread Santosh Sivaraj
This series, which should be applied on top of the still un-merged
"powerpc: implement machine check safe memcpy" series, adds the memory
ranges which generated an MCE to the NVDIMM bad blocks. The next access
to the same memory will then be blocked by the NVDIMM layer itself.
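
For reference, here is the shape of a consumer of the new notifier chain,
condensed from patches 2 and 3 (everything except mce_register_notifier(),
nvdimm_bus_add_badrange() and nvdimm_region_notify() is a placeholder for
the driver's own state):

```
/* Condensed sketch: turn a UE machine check into an NVDIMM bad range. */
static struct nvdimm_bus *sketch_bus;	/* the driver's nvdimm bus */
static struct nd_region *sketch_region;	/* region covering the address */

static int sketch_handle_mce_ue(struct notifier_block *nb, unsigned long val,
				void *data)
{
	struct machine_check_event *evt = data;
	u64 addr;

	if (evt->error_type != MCE_ERROR_TYPE_UE ||
	    !evt->u.ue_error.physical_address_provided)
		return NOTIFY_DONE;

	addr = ALIGN_DOWN(evt->u.ue_error.physical_address, L1_CACHE_BYTES);
	if (nvdimm_bus_add_badrange(sketch_bus, addr, L1_CACHE_BYTES))
		pr_warn("Failed to add bad range at 0x%llx\n", addr);

	nvdimm_region_notify(sketch_region, NVDIMM_REVALIDATE_POISON);
	return NOTIFY_OK;
}

static struct notifier_block sketch_mce_nb = {
	.notifier_call = sketch_handle_mce_ue,
};

/* The driver's probe routine would call mce_register_notifier(&sketch_mce_nb); */
```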

Santosh Sivaraj (3):
  powerpc/mce: Add MCE notification chain
  of_pmem: Add memory ranges which took a mce to bad range
  papr/scm: Add bad memory ranges to nvdimm bad ranges

 arch/powerpc/include/asm/mce.h|   3 +
 arch/powerpc/kernel/mce.c |  15 +++
 arch/powerpc/platforms/pseries/papr_scm.c |  65 
 drivers/nvdimm/of_pmem.c  | 122 ++
 4 files changed, 186 insertions(+), 19 deletions(-)

-- 
2.21.0



Re: [PATCH v9 2/7] powerpc/mce: Fix MCE handling for huge pages

2019-08-12 Thread Santosh Sivaraj
Sasha Levin  writes:

> Hi,
>
> [This is an automated email]
>
> This commit has been processed because it contains a "Fixes:" tag,
> fixing commit: ba41e1e1ccb9 powerpc/mce: Hookup derror (load/store) UE errors.
>
> The bot has tested the following trees: v5.2.8, v4.19.66.
>
> v5.2.8: Build OK!
> v4.19.66: Failed to apply! Possible dependencies:
> 360cae313702 ("KVM: PPC: Book3S HV: Nested guest entry via hypercall")
> 41f4e631daf8 ("KVM: PPC: Book3S HV: Extract PMU save/restore operations 
> as C-callable functions")
> 884dfb722db8 ("KVM: PPC: Book3S HV: Simplify machine check handling")
> 89329c0be8bd ("KVM: PPC: Book3S HV: Clear partition table entry on vm 
> teardown")
> 8e3f5fc1045d ("KVM: PPC: Book3S HV: Framework and hcall stubs for nested 
> virtualization")
> 95a6432ce903 ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on 
> P9 for radix guests")
> a43c1590426c ("powerpc/pseries: Flush SLB contents on SLB MCE errors.")
> c05772018491 ("powerpc/64s: Better printing of machine check info for 
> guest MCEs")
> d24ea8a7336a ("KVM: PPC: Book3S: Simplify external interrupt handling")
> df709a296ef7 ("KVM: PPC: Book3S HV: Simplify real-mode interrupt 
> handling")
> f7035ce9f1df ("KVM: PPC: Book3S HV: Move interrupt delivery on guest 
> entry to C code")
>
>
> NOTE: The patch will not be queued to stable trees until it is upstream.
>
> How should we proceed with this patch?

I will send a backport once this has been merged upstream.

Thanks,
Santosh

>
> --
> Thanks,
> Sasha

