Re: Re: [PATCH 5/5] powerpc/8xx: fix possible object reference leak

2019-03-22 Thread wen.yang99
Hi, Christophe,

>> The call to of_find_compatible_node() returns a node pointer with its
>> refcount incremented, so it must be explicitly decremented after the
>> last use.
>> irq_domain_add_linear() also calls of_node_get() to take its own
>> reference, so the irq_domain is not affected when the node reference
>> is released.
>
>
> Should you have a:
>
> Fixes: a8db8cf0d894 ("irq_domain: Replace irq_alloc_host() with
> revmap-specific initializers")
>
> If not, it means your change is in contradiction with commit
> b1725c9319aa ("[POWERPC] arch/powerpc/sysdev: Add missing of_node_put")

Thank you very much.
This problem existed before commit a8db8cf0d894.
The SmPL script we used is:
https://lkml.org/lkml/2019/3/14/880
We are still improving this SmPL script; it is somewhat different from the
script used by b1725c9319aa ("[POWERPC] arch/powerpc/sysdev: Add missing of_node_put").

>> Detected by coccinelle with the following warnings:
>> ./arch/powerpc/platforms/8xx/pic.c:158:1-7: ERROR: missing of_node_put; 
>> acquired a node pointer with refcount incremented on line 136, but without a 
>> corresponding object release within this function.
>>
>>   arch/powerpc/platforms/8xx/pic.c | 1 +
>>   1 file changed, 1 insertion(+)
>>
>> diff --git a/arch/powerpc/platforms/8xx/pic.c 
>> b/arch/powerpc/platforms/8xx/pic.c
>> index 8d5a25d..13d880b 100644
>> --- a/arch/powerpc/platforms/8xx/pic.c
>> +++ b/arch/powerpc/platforms/8xx/pic.c
>> @@ -155,6 +155,7 @@ int mpc8xx_pic_init(void)
>>   ret = -ENOMEM;
>>   goto out;
>>   }
>> +of_node_put(np);
>>   return 0;
>>
>>   out:
>>
>
> I guess it would be better as follows:
> 
> --- a/arch/powerpc/platforms/8xx/pic.c
> +++ b/arch/powerpc/platforms/8xx/pic.c
> @@ -153,9 +153,7 @@ int mpc8xx_pic_init(void)
> if (mpc8xx_pic_host == NULL) {
> printk(KERN_ERR "MPC8xx PIC: failed to allocate irq
> host!\n");
> ret = -ENOMEM;
> -   goto out;
> }
> -   return 0;
> 
> out:
> of_node_put(np);

OK.
Thank you.
We will fix it soon.
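
For clarity, here is a minimal sketch of the resulting tail of
mpc8xx_pic_init() with your suggestion applied (assuming the surrounding
code, including the irq_domain_add_linear() call, stays as in the current
pic.c, and that ret is already 0 on the success path):

	mpc8xx_pic_host = irq_domain_add_linear(np, 64,
						&mpc8xx_pic_host_ops, NULL);
	if (mpc8xx_pic_host == NULL) {
		printk(KERN_ERR "MPC8xx PIC: failed to allocate irq host!\n");
		ret = -ENOMEM;
	}

out:
	of_node_put(np);	/* balances of_find_compatible_node() */
	return ret;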

Thanks and regards,
Wen

[PATCH 2/2] pci: rpaphp: get/put device node reference during slot alloc/dealloc

2019-03-22 Thread Tyrel Datwyler
When allocating the slot structure we store a pointer to the associated
device_node. We really should be incrementing the reference count, so
add an of_node_get() during slot alloc and an of_node_put() during slot
dealloc.

Signed-off-by: Tyrel Datwyler 
---
 drivers/pci/hotplug/rpaphp_slot.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/hotplug/rpaphp_slot.c 
b/drivers/pci/hotplug/rpaphp_slot.c
index 5282aa3e33c5..93b4a945c55d 100644
--- a/drivers/pci/hotplug/rpaphp_slot.c
+++ b/drivers/pci/hotplug/rpaphp_slot.c
@@ -21,6 +21,7 @@
 /* free up the memory used by a slot */
 void dealloc_slot_struct(struct slot *slot)
 {
+   of_node_put(slot->dn);
kfree(slot->name);
kfree(slot);
 }
@@ -36,7 +37,7 @@ struct slot *alloc_slot_struct(struct device_node *dn,
slot->name = kstrdup(drc_name, GFP_KERNEL);
if (!slot->name)
goto error_slot;
-   slot->dn = dn;
+   slot->dn = of_node_get(dn);
slot->index = drc_index;
slot->power_domain = power_domain;
	slot->hotplug_slot.ops = &rpaphp_hotplug_slot_ops;
-- 
2.12.3



[PATCH 1/2] pci: rpadlpar: fix leaked device_node references in add/remove paths

2019-03-22 Thread Tyrel Datwyler
The find_dlpar_node() helper returns a device node with its reference
count incremented. Both the add and remove paths use this helper to find
the appropriate node, but fail to release the reference when done.

Annotate the find_dlpar_node() helper with a comment about the incremented
reference count, and call of_node_put() on the obtained device_node in the
add and remove paths. Also, fix up a reference leak in the find_vio_slot()
helper, where we fail to call of_node_put() on the vdevice node after we
iterate over its children.

Signed-off-by: Tyrel Datwyler 
---
 drivers/pci/hotplug/rpadlpar_core.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/pci/hotplug/rpadlpar_core.c 
b/drivers/pci/hotplug/rpadlpar_core.c
index e2356a9c7088..182f9e3443ee 100644
--- a/drivers/pci/hotplug/rpadlpar_core.c
+++ b/drivers/pci/hotplug/rpadlpar_core.c
@@ -51,6 +51,7 @@ static struct device_node *find_vio_slot_node(char *drc_name)
if (rc == 0)
break;
}
+   of_node_put(parent);
 
return dn;
 }
@@ -71,6 +72,7 @@ static struct device_node *find_php_slot_pci_node(char 
*drc_name,
return np;
 }
 
+/* Returns a device_node with its reference count incremented */
 static struct device_node *find_dlpar_node(char *drc_name, int *node_type)
 {
struct device_node *dn;
@@ -306,6 +308,7 @@ int dlpar_add_slot(char *drc_name)
rc = dlpar_add_phb(drc_name, dn);
break;
}
+   of_node_put(dn);
 
printk(KERN_INFO "%s: slot %s added\n", DLPAR_MODULE_NAME, drc_name);
 exit:
@@ -439,6 +442,7 @@ int dlpar_remove_slot(char *drc_name)
rc = dlpar_remove_pci_slot(drc_name, dn);
break;
}
+   of_node_put(dn);
vm_unmap_aliases();
 
printk(KERN_INFO "%s: slot %s removed\n", DLPAR_MODULE_NAME, drc_name);
-- 
2.12.3



Re: [PATCH kernel RFC 2/2] vfio-pci-nvlink2: Implement interconnect isolation

2019-03-22 Thread Alex Williamson
On Fri, 22 Mar 2019 14:08:38 +1100
David Gibson  wrote:

> On Thu, Mar 21, 2019 at 12:19:34PM -0600, Alex Williamson wrote:
> > On Thu, 21 Mar 2019 10:56:00 +1100
> > David Gibson  wrote:
> >   
> > > On Wed, Mar 20, 2019 at 01:09:08PM -0600, Alex Williamson wrote:  
> > > > On Wed, 20 Mar 2019 15:38:24 +1100
> > > > David Gibson  wrote:
> > > > 
> > > > > On Tue, Mar 19, 2019 at 10:36:19AM -0600, Alex Williamson wrote:
> > > > > > On Fri, 15 Mar 2019 19:18:35 +1100
> > > > > > Alexey Kardashevskiy  wrote:
> > > > > >   
> > > > > > > The NVIDIA V100 SXM2 GPUs are connected to the CPU via PCIe links 
> > > > > > > and
> > > > > > > (on POWER9) NVLinks. In addition to that, GPUs themselves have 
> > > > > > > direct
> > > > > > > peer to peer NVLinks in groups of 2 to 4 GPUs. At the moment the 
> > > > > > > POWERNV
> > > > > > > platform puts all interconnected GPUs to the same IOMMU group.
> > > > > > > 
> > > > > > > However the user may want to pass individual GPUs to the 
> > > > > > > userspace so
> > > > > > > in order to do so we need to put them into separate IOMMU groups 
> > > > > > > and
> > > > > > > cut off the interconnects.
> > > > > > > 
> > > > > > > Thankfully V100 GPUs implement an interface to do so, by programming
> > > > > > > a link-disabling mask into BAR0 of a GPU. Once a link is disabled in a
> > > > > > > GPU using
> > > > > > > this interface, it cannot be re-enabled until the secondary bus 
> > > > > > > reset is
> > > > > > > issued to the GPU.
> > > > > > > 
> > > > > > > This defines a reset_done() handler for V100 NVlink2 device which
> > > > > > > determines what links need to be disabled. This relies on presence
> > > > > > > of the new "ibm,nvlink-peers" device tree property of a GPU 
> > > > > > > telling which
> > > > > > > PCI peers it is connected to (which includes NVLink bridges or 
> > > > > > > peer GPUs).
> > > > > > > 
> > > > > > > This does not change the existing behaviour and instead adds
> > > > > > > a new "isolate_nvlink" kernel parameter to allow such isolation.
> > > > > > > 
> > > > > > > The alternative approaches would be:
> > > > > > > 
> > > > > > > 1. do this in the system firmware (skiboot) but for that we would 
> > > > > > > need
> > > > > > > to tell skiboot via an additional OPAL call whether or not we 
> > > > > > > want this
> > > > > > > isolation - skiboot is unaware of IOMMU groups.
> > > > > > > 
> > > > > > > 2. do this in the secondary bus reset handler in the POWERNV 
> > > > > > > platform -
> > > > > > > the problem with that is at that point the device is not enabled, 
> > > > > > > i.e.
> > > > > > > config space is not restored so we need to enable the device 
> > > > > > > (i.e. MMIO
> > > > > > > bit in CMD register + program valid address to BAR0) in order to 
> > > > > > > disable
> > > > > > > links and then perhaps undo all this initialization to bring the 
> > > > > > > device
> > > > > > > back to the state where pci_try_reset_function() expects it to 
> > > > > > > be.  
> > > > > > 
> > > > > > The trouble seems to be that this approach only maintains the 
> > > > > > isolation
> > > > > > exposed by the IOMMU group when vfio-pci is the active driver for 
> > > > > > the
> > > > > > device.  IOMMU groups can be used by any driver and the IOMMU core 
> > > > > > is
> > > > > > incorporating groups in various ways.  
> > > > > 
> > > > > I don't think that reasoning is quite right.  An IOMMU group doesn't
> > > > > necessarily represent devices which *are* isolated, just devices which
> > > > > *can be* isolated.  There are plenty of instances when we don't need
> > > > > to isolate devices in different IOMMU groups: passing both groups to
> > > > > the same guest or userspace VFIO driver for example, or indeed when
> > > > > both groups are owned by regular host kernel drivers.
> > > > > 
> > > > > In at least some of those cases we also don't want to isolate the
> > > > > devices when we don't have to, usually for performance reasons.
> > > > 
> > > > I see IOMMU groups as representing the current isolation of the device,
> > > > not just the possible isolation.  If there are ways to break down that
> > > > isolation then ideally the group would be updated to reflect it.  The
> > > > ACS disable patches seem to support this, at boot time we can choose to
> > > > disable ACS at certain points in the topology to favor peer-to-peer
> > > > performance over isolation.  This is then reflected in the group
> > > > composition, because even though ACS *can be* enabled at the given
> > > > isolation points, it's intentionally not with this option.  Whether or
> > > > not a given user who owns multiple devices needs that isolation is
> > > > really beside the point, the user can choose to connect groups via IOMMU
> > > > mappings or reconfigure the system to disable ACS and potentially more
> > > > direct routing.  The IOMMU groups are still accurately reflecting the
> > > > topology and IOMMU 

Re: [RESEND 6/7] IB/qib: Use the new FOLL_LONGTERM flag to get_user_pages_fast()

2019-03-22 Thread Dan Williams
On Sun, Mar 17, 2019 at 7:36 PM  wrote:
>
> From: Ira Weiny 
>
> Use the new FOLL_LONGTERM to get_user_pages_fast() to protect against
> FS DAX pages being mapped.
>
> Signed-off-by: Ira Weiny 

Looks good, modulo the potential __get_user_pages_fast() suggestion.


Re: [RESEND 5/7] IB/hfi1: Use the new FOLL_LONGTERM flag to get_user_pages_fast()

2019-03-22 Thread Dan Williams
On Sun, Mar 17, 2019 at 7:36 PM  wrote:
>
> From: Ira Weiny 
>
> Use the new FOLL_LONGTERM to get_user_pages_fast() to protect against
> FS DAX pages being mapped.
>
> Signed-off-by: Ira Weiny 
> ---
>  drivers/infiniband/hw/hfi1/user_pages.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/infiniband/hw/hfi1/user_pages.c 
> b/drivers/infiniband/hw/hfi1/user_pages.c
> index 78ccacaf97d0..6a7f9cd5a94e 100644
> --- a/drivers/infiniband/hw/hfi1/user_pages.c
> +++ b/drivers/infiniband/hw/hfi1/user_pages.c
> @@ -104,9 +104,11 @@ int hfi1_acquire_user_pages(struct mm_struct *mm, 
> unsigned long vaddr, size_t np
> bool writable, struct page **pages)
>  {
> int ret;
> +   unsigned int gup_flags = writable ? FOLL_WRITE : 0;

Maybe:

unsigned int gup_flags = FOLL_LONGTERM | (writable ? FOLL_WRITE : 0);

?

>
> -   ret = get_user_pages_fast(vaddr, npages, writable ? FOLL_WRITE : 0,
> - pages);
> +   gup_flags |= FOLL_LONGTERM;
> +
> +   ret = get_user_pages_fast(vaddr, npages, gup_flags, pages);
> if (ret < 0)
> return ret;
>
> --
> 2.20.1
>
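
Applied to the hunk above, the suggested change would read roughly like
this (a sketch only, using the parameter names from the quoted function):

	int ret;
	unsigned int gup_flags = FOLL_LONGTERM | (writable ? FOLL_WRITE : 0);

	ret = get_user_pages_fast(vaddr, npages, gup_flags, pages);
	if (ret < 0)
		return ret;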


Re: [RESEND 4/7] mm/gup: Add FOLL_LONGTERM capability to GUP fast

2019-03-22 Thread Dan Williams
On Sun, Mar 17, 2019 at 7:36 PM  wrote:
>
> From: Ira Weiny 
>
> DAX pages were previously unprotected from longterm pins when users
> called get_user_pages_fast().
>
> Use the new FOLL_LONGTERM flag to check for DEVMAP pages and fall
> back to regular GUP processing if a DEVMAP page is encountered.
>
> Signed-off-by: Ira Weiny 
> ---
>  mm/gup.c | 29 +
>  1 file changed, 25 insertions(+), 4 deletions(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index 0684a9536207..173db0c44678 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1600,6 +1600,9 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, 
> unsigned long end,
> goto pte_unmap;
>
> if (pte_devmap(pte)) {
> +   if (unlikely(flags & FOLL_LONGTERM))
> +   goto pte_unmap;
> +
> pgmap = get_dev_pagemap(pte_pfn(pte), pgmap);
> if (unlikely(!pgmap)) {
> undo_dev_pagemap(nr, nr_start, pages);
> @@ -1739,8 +1742,11 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, 
> unsigned long addr,
> if (!pmd_access_permitted(orig, flags & FOLL_WRITE))
> return 0;
>
> -   if (pmd_devmap(orig))
> +   if (pmd_devmap(orig)) {
> +   if (unlikely(flags & FOLL_LONGTERM))
> +   return 0;
> return __gup_device_huge_pmd(orig, pmdp, addr, end, pages, 
> nr);
> +   }
>
> refs = 0;
> page = pmd_page(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
> @@ -1777,8 +1783,11 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, 
> unsigned long addr,
> if (!pud_access_permitted(orig, flags & FOLL_WRITE))
> return 0;
>
> -   if (pud_devmap(orig))
> +   if (pud_devmap(orig)) {
> +   if (unlikely(flags & FOLL_LONGTERM))
> +   return 0;
> return __gup_device_huge_pud(orig, pudp, addr, end, pages, 
> nr);
> +   }
>
> refs = 0;
> page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
> @@ -2066,8 +2075,20 @@ int get_user_pages_fast(unsigned long start, int 
> nr_pages,
> start += nr << PAGE_SHIFT;
> pages += nr;
>
> -   ret = get_user_pages_unlocked(start, nr_pages - nr, pages,
> - gup_flags);
> +   if (gup_flags & FOLL_LONGTERM) {
> +   down_read(&current->mm->mmap_sem);
> +   ret = __gup_longterm_locked(current, current->mm,
> +   start, nr_pages - nr,
> +   pages, NULL, gup_flags);
> +   up_read(&current->mm->mmap_sem);
> +   } else {
> +   /*
> +* retain FAULT_FOLL_ALLOW_RETRY optimization if
> +* possible
> +*/
> +   ret = get_user_pages_unlocked(start, nr_pages - nr,
> + pages, gup_flags);

I couldn't immediately grok why this path needs to branch on
FOLL_LONGTERM? Won't get_user_pages_unlocked(..., FOLL_LONGTERM) do
the right thing?


Re: [RESEND 3/7] mm/gup: Change GUP fast to use flags rather than a write 'bool'

2019-03-22 Thread Dan Williams
On Sun, Mar 17, 2019 at 7:36 PM  wrote:
>
> From: Ira Weiny 
>
> To facilitate additional options to get_user_pages_fast() change the
> singular write parameter to be gup_flags.
>
> This patch does not change any functionality.  New functionality will
> follow in subsequent patches.
>
> Some of the get_user_pages_fast() call sites were unchanged because they
> already passed FOLL_WRITE or 0 for the write parameter.
>
> Signed-off-by: Ira Weiny 
>
> ---
> Changes from V1:
> Rebase to current merge tree
> arch/powerpc/mm/mmu_context_iommu.c no longer calls gup_fast
> The gup_longterm was converted in patch 1
>
>  arch/mips/mm/gup.c | 11 ++-
>  arch/powerpc/kvm/book3s_64_mmu_hv.c|  4 ++--
>  arch/powerpc/kvm/e500_mmu.c|  2 +-
>  arch/s390/kvm/interrupt.c  |  2 +-
>  arch/s390/mm/gup.c | 12 ++--
>  arch/sh/mm/gup.c   | 11 ++-
>  arch/sparc/mm/gup.c|  9 +
>  arch/x86/kvm/paging_tmpl.h |  2 +-
>  arch/x86/kvm/svm.c |  2 +-
>  drivers/fpga/dfl-afu-dma-region.c  |  2 +-
>  drivers/gpu/drm/via/via_dmablit.c  |  3 ++-
>  drivers/infiniband/hw/hfi1/user_pages.c|  3 ++-
>  drivers/misc/genwqe/card_utils.c   |  2 +-
>  drivers/misc/vmw_vmci/vmci_host.c  |  2 +-
>  drivers/misc/vmw_vmci/vmci_queue_pair.c|  6 --
>  drivers/platform/goldfish/goldfish_pipe.c  |  3 ++-
>  drivers/rapidio/devices/rio_mport_cdev.c   |  4 +++-
>  drivers/sbus/char/oradax.c |  2 +-
>  drivers/scsi/st.c  |  3 ++-
>  drivers/staging/gasket/gasket_page_table.c |  4 ++--
>  drivers/tee/tee_shm.c  |  2 +-
>  drivers/vfio/vfio_iommu_spapr_tce.c|  3 ++-
>  drivers/vhost/vhost.c  |  2 +-
>  drivers/video/fbdev/pvr2fb.c   |  2 +-
>  drivers/virt/fsl_hypervisor.c  |  2 +-
>  drivers/xen/gntdev.c   |  2 +-
>  fs/orangefs/orangefs-bufmap.c  |  2 +-
>  include/linux/mm.h |  4 ++--
>  kernel/futex.c |  2 +-
>  lib/iov_iter.c |  7 +--
>  mm/gup.c   | 10 +-
>  mm/util.c  |  8 
>  net/ceph/pagevec.c |  2 +-
>  net/rds/info.c |  2 +-
>  net/rds/rdma.c |  3 ++-
>  35 files changed, 79 insertions(+), 63 deletions(-)


>
> diff --git a/arch/mips/mm/gup.c b/arch/mips/mm/gup.c
> index 0d14e0d8eacf..4c2b4483683c 100644
> --- a/arch/mips/mm/gup.c
> +++ b/arch/mips/mm/gup.c
> @@ -235,7 +235,7 @@ int __get_user_pages_fast(unsigned long start, int 
> nr_pages, int write,
>   * get_user_pages_fast() - pin user pages in memory
>   * @start: starting user address
>   * @nr_pages:  number of pages from start to pin
> - * @write: whether pages will be written to
> + * @gup_flags: flags modifying pin behaviour
>   * @pages: array that receives pointers to the pages pinned.
>   * Should be at least nr_pages long.
>   *
> @@ -247,8 +247,8 @@ int __get_user_pages_fast(unsigned long start, int 
> nr_pages, int write,
>   * requested. If nr_pages is 0 or negative, returns 0. If no pages
>   * were pinned, returns -errno.
>   */
> -int get_user_pages_fast(unsigned long start, int nr_pages, int write,
> -   struct page **pages)
> +int get_user_pages_fast(unsigned long start, int nr_pages,
> +   unsigned int gup_flags, struct page **pages)

This looks a tad scary given all the related thrash, especially when it's
only one user that wants to do get_user_pages_fast_longterm(), right? Maybe
something like the following. Note I explicitly moved the flags to the
end so that someone only half paying attention who calls
__get_user_pages_fast will get a compile error if they specify the
args in the old order.

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 76ba638ceda8..c6c743bc2c68 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1505,8 +1505,15 @@ static inline long
get_user_pages_longterm(unsigned long start,
 }
 #endif /* CONFIG_FS_DAX */

-int get_user_pages_fast(unsigned long start, int nr_pages, int write,
-   struct page **pages);
+
+int __get_user_pages_fast(unsigned long start, int nr_pages,
+   struct page **pages, unsigned int gup_flags);
+
+static inline int get_user_pages_fast(unsigned long start, int nr_pages,
+   int write, struct page **pages)
+{
+   return __get_user_pages_fast(start, nr_pages, pages,
+   write ? FOLL_WRITE : 0);
+}

 /* Container for pinned pfns / pages */
 struct frame_vector {
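
As a usage sketch (hypothetical call sites, purely for illustration), the
wrapper keeps existing callers source-compatible while long-term users
pass the gup flags directly:

	/* existing users keep the old signature via the inline wrapper */
	nr = get_user_pages_fast(start, nr_pages, 1 /* write */, pages);

	/* a long-term pin would call the flags variant directly */
	nr = __get_user_pages_fast(start, nr_pages, pages,
				   FOLL_WRITE | FOLL_LONGTERM);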


Re: [RESEND 2/7] mm/gup: Change write parameter to flags in fast walk

2019-03-22 Thread Dan Williams
On Sun, Mar 17, 2019 at 7:36 PM  wrote:
>
> From: Ira Weiny 
>
> In order to support more options in the GUP fast walk, change
> the write parameter to flags throughout the call stack.
>
> This patch does not change functionality and passes FOLL_WRITE
> where write was previously used.
>
> Signed-off-by: Ira Weiny 

Looks good,

Reviewed-by: Dan Williams 


Re: [RESEND 1/7] mm/gup: Replace get_user_pages_longterm() with FOLL_LONGTERM

2019-03-22 Thread Dan Williams
On Sun, Mar 17, 2019 at 7:36 PM  wrote:
>
> From: Ira Weiny 
>
> Rather than have a separate get_user_pages_longterm() call,
> introduce FOLL_LONGTERM and change the longterm callers to use
> it.
>
> This patch does not change any functionality.
>
> FOLL_LONGTERM can only be supported with get_user_pages() as it
> requires vmas to determine if DAX is in use.
>
> CC: Aneesh Kumar K.V 
> CC: Andrew Morton 
> CC: Michal Hocko 
> Signed-off-by: Ira Weiny 
[..]
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 2d483dbdffc0..6831077d126c 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
[..]
> @@ -2609,6 +2596,7 @@ struct page *follow_page(struct vm_area_struct *vma, 
> unsigned long address,
>  #define FOLL_REMOTE0x2000  /* we are working on non-current tsk/mm */
>  #define FOLL_COW   0x4000  /* internal GUP flag */
>  #define FOLL_ANON  0x8000  /* don't do file mappings */
> +#define FOLL_LONGTERM  0x1 /* mapping is intended for a long term pin */

Let's change this comment to say something like /* mapping lifetime is
indefinite / at the discretion of userspace */, since "longterm" is not
well defined.

I think it should also include a /* FIXME: */ to say something about
the havoc a long term pin might wreak on fs and mm code paths.

>  static inline int vm_fault_to_errno(vm_fault_t vm_fault, int foll_flags)
>  {
> diff --git a/mm/gup.c b/mm/gup.c
> index f84e22685aaa..8cb4cff067bc 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1112,26 +1112,7 @@ long get_user_pages_remote(struct task_struct *tsk, 
> struct mm_struct *mm,
>  }
>  EXPORT_SYMBOL(get_user_pages_remote);
>
> -/*
> - * This is the same as get_user_pages_remote(), just with a
> - * less-flexible calling convention where we assume that the task
> - * and mm being operated on are the current task's and don't allow
> - * passing of a locked parameter.  We also obviously don't pass
> - * FOLL_REMOTE in here.
> - */
> -long get_user_pages(unsigned long start, unsigned long nr_pages,
> -   unsigned int gup_flags, struct page **pages,
> -   struct vm_area_struct **vmas)
> -{
> -   return __get_user_pages_locked(current, current->mm, start, nr_pages,
> -  pages, vmas, NULL,
> -  gup_flags | FOLL_TOUCH);
> -}
> -EXPORT_SYMBOL(get_user_pages);
> -
>  #if defined(CONFIG_FS_DAX) || defined (CONFIG_CMA)
> -
> -#ifdef CONFIG_FS_DAX
>  static bool check_dax_vmas(struct vm_area_struct **vmas, long nr_pages)
>  {
> long i;
> @@ -1150,12 +1131,6 @@ static bool check_dax_vmas(struct vm_area_struct 
> **vmas, long nr_pages)
> }
> return false;
>  }
> -#else
> -static inline bool check_dax_vmas(struct vm_area_struct **vmas, long 
> nr_pages)
> -{
> -   return false;
> -}
> -#endif
>
>  #ifdef CONFIG_CMA
>  static struct page *new_non_cma_page(struct page *page, unsigned long 
> private)
> @@ -1209,10 +1184,13 @@ static struct page *new_non_cma_page(struct page 
> *page, unsigned long private)
> return __alloc_pages_node(nid, gfp_mask, 0);
>  }
>
> -static long check_and_migrate_cma_pages(unsigned long start, long nr_pages,
> -   unsigned int gup_flags,
> +static long check_and_migrate_cma_pages(struct task_struct *tsk,
> +   struct mm_struct *mm,
> +   unsigned long start,
> +   unsigned long nr_pages,
> struct page **pages,
> -   struct vm_area_struct **vmas)
> +   struct vm_area_struct **vmas,
> +   unsigned int gup_flags)
>  {
> long i;
> bool drain_allow = true;
> @@ -1268,10 +1246,14 @@ static long check_and_migrate_cma_pages(unsigned long 
> start, long nr_pages,
> putback_movable_pages(&cma_page_list);
> }
> /*
> -* We did migrate all the pages, Try to get the page 
> references again
> -* migrating any new CMA pages which we failed to isolate 
> earlier.
> +* We did migrate all the pages, Try to get the page 
> references
> +* again migrating any new CMA pages which we failed to 
> isolate
> +* earlier.
>  */
> -   nr_pages = get_user_pages(start, nr_pages, gup_flags, pages, 
> vmas);
> +   nr_pages = __get_user_pages_locked(tsk, mm, start, nr_pages,
> +  pages, vmas, NULL,
> +  gup_flags);
> +

Why did this need to change to __get_user_pages_locked?

> if ((nr_pages > 0) && migrate_allow) {
> drain_allow = true;
> goto check_again;
> @@ -1281,66 

Re: [GIT PULL] Please pull powerpc/linux.git powerpc-5.1-3 tag

2019-03-22 Thread pr-tracker-bot
The pull request you sent on Fri, 22 Mar 2019 23:58:08 +1100:

> https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git 
> tags/powerpc-5.1-3

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/a5ed1e96cafde5ba48638f486bfca0685dc6ddc9

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.wiki.kernel.org/userdoc/prtracker


Re: [PATCH v5 1/3] locking/rwsem: Remove arch specific rwsem files

2019-03-22 Thread Waiman Long
On 03/22/2019 03:30 PM, Davidlohr Bueso wrote:
> On Fri, 22 Mar 2019, Linus Torvalds wrote:
>> Some of them _might_ be performance-critical. There's the one on
>> mmap_sem in the fault handling path, for example. And yes, I'd expect
>> the normal case to very much be "no other readers or writers" for that
>> one.
>
> Yeah, the mmap_sem case in the fault path is really expecting an unlocked
> state. To the point that four archs have added branch predictions, ie:
>
> 92181f190b6 (x86: optimise x86's do_page_fault (C entry point for the
> page fault path))
> b15021d994f (powerpc/mm: Add a bunch of (un)likely annotations to
> do_page_fault)
>
> And using PROFILE_ANNOTATED_BRANCHES shows pretty clearly:
> (without resetting the counters)
>
>   correct  incorrect  %  Function            File     Line
> ---------  ---------  -  ------------------  -------  ----
>   4603685         34  0  do_user_addr_fault  fault.c  1416  (bootup)
> 382327745        449  0  do_user_addr_fault  fault.c  1416  (kernel build)
> 399446159        461  0  do_user_addr_fault  fault.c  1416  (redis benchmark)
>
> It probably wouldn't hurt to do the unlikely() for all archs, or
> alternatively, to add likely() to the atomic_long_try_cmpxchg_acquire in
> patch 3 and do it implicitly, but maybe that would be less flexible(?)
>
> Thanks,
> Davidlohr

I used my lock event counting code to count the number of contended
and uncontended trylocks. I tested both bootup and kernel build. I
think I saw that less than 1% were contended; the rest were all
uncontended. That is similar to what you got. I thought I had sent the
data out previously, but I couldn't find the email. That was the main
reason why I took Linus' suggestion to optimize it for the uncontended case.

Thanks,
Longman



Re: [PATCH v5 1/3] locking/rwsem: Remove arch specific rwsem files

2019-03-22 Thread Davidlohr Bueso

On Fri, 22 Mar 2019, Linus Torvalds wrote:

Some of them _might_ be performance-critical. There's the one on
mmap_sem in the fault handling path, for example. And yes, I'd expect
the normal case to very much be "no other readers or writers" for that
one.


Yeah, the mmap_sem case in the fault path is really expecting an unlocked
state. To the point that four archs have added branch predictions, ie:

92181f190b6 (x86: optimise x86's do_page_fault (C entry point for the page 
fault path))
b15021d994f (powerpc/mm: Add a bunch of (un)likely annotations to do_page_fault)

And using PROFILE_ANNOTATED_BRANCHES shows pretty clearly:
(without resetting the counters)

  correct  incorrect  %  Function            File     Line
---------  ---------  -  ------------------  -------  ----
  4603685         34  0  do_user_addr_fault  fault.c  1416  (bootup)
382327745        449  0  do_user_addr_fault  fault.c  1416  (kernel build)
399446159        461  0  do_user_addr_fault  fault.c  1416  (redis benchmark)

It probably wouldn't hurt to do the unlikely() for all archs, or
alternatively, to add likely() to the atomic_long_try_cmpxchg_acquire in
patch 3 and do it implicitly, but maybe that would be less flexible(?)
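
For readers who haven't looked at those commits, the kind of annotation
they add is roughly the following (a sketch, not the exact hunks;
handle_fault() is just a placeholder for the rest of the fault path):

	/* page-fault fast path: the trylock is expected to succeed */
	if (likely(down_read_trylock(&mm->mmap_sem))) {
		handle_fault(mm);
		up_read(&mm->mmap_sem);
	} else {
		/* contended case */
		down_read(&mm->mmap_sem);
		handle_fault(mm);
		up_read(&mm->mmap_sem);
	}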

Thanks,
Davidlohr


Re: powerpc32 boot crash in 5.1-rc1

2019-03-22 Thread LEROY Christophe

Meelis Roos wrote:

While 5.0.0 worked fine on my PowerMac G4, 5.0 + git (unknown rev  
as of Mar 13), 5.0.0-11520-gf261c4e and today's git all fail to boot.


The problem seems to be in page fault handler in load_elf_binary()  
of init process.


The patch at https://patchwork.ozlabs.org/patch/1053385/ should fix it


Tested it - yes, it fixes the boot.


Thanks for testing. It will be merged in 5.1-rc2

Christophe



--
Meelis Roos 





Re: [PATCH v3] powerpc/64: Fix memcmp reading past the end of src/dest

2019-03-22 Thread Segher Boessenkool
On Fri, Mar 22, 2019 at 11:37:24PM +1100, Michael Ellerman wrote:
>  .Lcmp_rest_lt8bytes:
> - /* Here we have only less than 8 bytes to compare with. at least s1
> -  * Address is aligned with 8 bytes.
> -  * The next double words are load and shift right with appropriate
> -  * bits.
> + /*
> +  * Here we have less than 8 bytes to compare. At least s1 is aligned to
> +  * 8 bytes, but s2 may not be. We must make sure s2 + 8 doesn't cross a

"s2 + 7"?  The code is fine though (bgt, not bge).

> +  * page boundary, otherwise we might read past the end of the buffer and
> +  * trigger a page fault. We use 4K as the conservative minimum page
> +  * size. If we detect that case we go to the byte-by-byte loop.
> +  *
> +  * Otherwise the next double word is loaded from s1 and s2, and shifted
> +  * right to compare the appropriate bits.
>*/
> + clrldi  r6,r4,(64-12)   // r6 = r4 & 0xfff

You can just write
  rlwinm r6,r4,0,0x0fff
if that is clearer?  Or do you still want a comment with that :-)

> + cmpdi   r6,0xff8
> + bgt .Lshort
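
In C terms, the check being added here is roughly (illustrative model
only; byte_by_byte_cmp() stands in for the .Lshort byte loop):

	/*
	 * If the low 12 bits of s2 are above 0xff8, then s2 + 7 would cross
	 * a (4K) page boundary, so the unaligned 8-byte load could fault.
	 */
	if (((unsigned long)s2 & 0xfff) > 0xff8)
		return byte_by_byte_cmp(s1, s2, n);	/* hypothetical slow path */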

Reviewed-by: Segher Boessenkool 


Segher


Re: [PATCH v5 3/3] locking/rwsem: Optimize down_read_trylock()

2019-03-22 Thread Waiman Long
On 03/22/2019 01:25 PM, Russell King - ARM Linux admin wrote:
> On Fri, Mar 22, 2019 at 10:30:08AM -0400, Waiman Long wrote:
>> Modify __down_read_trylock() to optimize for an unlocked rwsem and make
>> it generate slightly better code.
>>
>> Before this patch, down_read_trylock:
>>
>>0x <+0>: callq  0x5 
>>0x0005 <+5>: jmp0x18 
>>0x0007 <+7>: lea0x1(%rdx),%rcx
>>0x000b <+11>:mov%rdx,%rax
>>0x000e <+14>:lock cmpxchg %rcx,(%rdi)
>>0x0013 <+19>:cmp%rax,%rdx
>>0x0016 <+22>:je 0x23 
>>0x0018 <+24>:mov(%rdi),%rdx
>>0x001b <+27>:test   %rdx,%rdx
>>0x001e <+30>:jns0x7 
>>0x0020 <+32>:xor%eax,%eax
>>0x0022 <+34>:retq
>>0x0023 <+35>:mov%gs:0x0,%rax
>>0x002c <+44>:or $0x3,%rax
>>0x0030 <+48>:mov%rax,0x20(%rdi)
>>0x0034 <+52>:mov$0x1,%eax
>>0x0039 <+57>:retq
>>
>> After patch, down_read_trylock:
>>
>>0x <+0>:  callq  0x5 
>>0x0005 <+5>:  xor%eax,%eax
>>0x0007 <+7>:  lea0x1(%rax),%rdx
>>0x000b <+11>: lock cmpxchg %rdx,(%rdi)
>>0x0010 <+16>: jne0x29 
>>0x0012 <+18>: mov%gs:0x0,%rax
>>0x001b <+27>: or $0x3,%rax
>>0x001f <+31>: mov%rax,0x20(%rdi)
>>0x0023 <+35>: mov$0x1,%eax
>>0x0028 <+40>: retq
>>0x0029 <+41>: test   %rax,%rax
>>0x002c <+44>: jns0x7 
>>0x002e <+46>: xor%eax,%eax
>>0x0030 <+48>: retq
>>
>> By using a rwsem microbenchmark, the down_read_trylock() rate (with a
>> load of 10 to lengthen the lock critical section) on a x86-64 system
>> before and after the patch were:
>>
>>  Before PatchAfter Patch
>># of Threads rlock   rlock
>> -   -
>> 1   14,496  14,716
>> 28,644   8,453
>>  46,799   6,983
>>  85,664   7,190
>>
>> On a ARM64 system, the performance results were:
>>
>>  Before PatchAfter Patch
>># of Threads rlock   rlock
>> -   -
>> 1   23,676  24,488
>> 27,697   9,502
>> 44,945   3,440
>> 82,641   1,603
>>
>> For the uncontended case (1 thread), the new down_read_trylock() is a
>> little bit faster. For the contended cases, the new down_read_trylock()
>> perform pretty well in x86-64, but performance degrades at high
>> contention level on ARM64.
> So, 70% for 4 threads, 61% for 8 threads - does this trend
> continue tailing off as the number of threads (and cores)
> increase?
>
I didn't try a higher number of contending threads. I won't worry too much
about contention, as a trylock is a one-off event. The chance of having
more than one trylock happening simultaneously is very small.

Cheers,
Longman



Re: [PATCH v2 3/7] ocxl: Create a clear delineation between ocxl backend & frontend

2019-03-22 Thread Frederic Barrat

Hi Alastair,

I'm still seeing problems with the handling of the info structure and 
ref counting on the AFU. To make things easier, I'm attaching a patch. 
There's also a bunch of other review comments in the patch; check for 
the comments with the "fxb" marker.


I've played a bit with driver unload and force unbinds and the free 
operations were happening as expected. We're getting there!


  Fred


Le 20/03/2019 à 06:08, Alastair D'Silva a écrit :

From: Alastair D'Silva 

The OCXL driver contains both frontend code for interacting with userspace,
as well as backend code for interacting with the hardware.

This patch separates the backend code from the frontend so that it can be
used by other device drivers that communicate via OpenCAPI.

Relocate dev, cdev & sysfs files to the frontend code to allow external
drivers to maintain their own devices.

Reference counting on the device in the backend is replaced with kref
counting.

Move file & sysfs layer initialisation from core.c (backend) to
pci.c (frontend).

Create an ocxl_function oriented interface for initing devices &
enumerating AFUs.

Signed-off-by: Alastair D'Silva 
---
  drivers/misc/ocxl/context.c   |   2 +-
  drivers/misc/ocxl/core.c  | 205 +++---
  drivers/misc/ocxl/file.c  | 125 --
  drivers/misc/ocxl/ocxl_internal.h |  39 +++---
  drivers/misc/ocxl/pci.c   |  61 -
  drivers/misc/ocxl/sysfs.c |  58 +
  include/misc/ocxl.h   | 121 --
  7 files changed, 416 insertions(+), 195 deletions(-)

diff --git a/drivers/misc/ocxl/context.c b/drivers/misc/ocxl/context.c
index 3498a0199bde..371ef17bba33 100644
--- a/drivers/misc/ocxl/context.c
+++ b/drivers/misc/ocxl/context.c
@@ -238,7 +238,7 @@ int ocxl_context_detach(struct ocxl_context *ctx)
}
rc = ocxl_link_remove_pe(ctx->afu->fn->link, ctx->pasid);
if (rc) {
-   dev_warn(&ctx->afu->dev,
+   dev_warn(>dev,
"Couldn't remove PE entry cleanly: %d\n", rc);
}
return 0;
diff --git a/drivers/misc/ocxl/core.c b/drivers/misc/ocxl/core.c
index 2fd0c700e8a0..c632ec372342 100644
--- a/drivers/misc/ocxl/core.c
+++ b/drivers/misc/ocxl/core.c
@@ -13,16 +13,6 @@ static void ocxl_fn_put(struct ocxl_fn *fn)
	put_device(&fn->dev);
  }
  
-struct ocxl_afu *ocxl_afu_get(struct ocxl_afu *afu)

-{
-   return (get_device(&afu->dev) == NULL) ? NULL : afu;
-}
-
-void ocxl_afu_put(struct ocxl_afu *afu)
-{
-   put_device(&afu->dev);
-}
-
  static struct ocxl_afu *alloc_afu(struct ocxl_fn *fn)
  {
struct ocxl_afu *afu;
@@ -31,6 +21,7 @@ static struct ocxl_afu *alloc_afu(struct ocxl_fn *fn)
if (!afu)
return NULL;
  
+	kref_init(&afu->kref);

	mutex_init(&afu->contexts_lock);
	mutex_init(&afu->afu_control_lock);
	idr_init(&afu->contexts_idr);
@@ -39,32 +30,26 @@ static struct ocxl_afu *alloc_afu(struct ocxl_fn *fn)
return afu;
  }
  
-static void free_afu(struct ocxl_afu *afu)

+static void afu_release(struct kref *kref)
  {
+   struct ocxl_afu *afu = container_of(kref, struct ocxl_afu, kref);
+
	idr_destroy(&afu->contexts_idr);
ocxl_fn_put(afu->fn);
kfree(afu);
  }
  
-static void free_afu_dev(struct device *dev)

+void ocxl_afu_get(struct ocxl_afu *afu)
  {
-   struct ocxl_afu *afu = to_ocxl_afu(dev);
-
-   ocxl_unregister_afu(afu);
-   free_afu(afu);
+   kref_get(&afu->kref);
  }
+EXPORT_SYMBOL_GPL(ocxl_afu_get);
  
-static int set_afu_device(struct ocxl_afu *afu, const char *location)

+void ocxl_afu_put(struct ocxl_afu *afu)
  {
-   struct ocxl_fn *fn = afu->fn;
-   int rc;
-
-   afu->dev.parent = &fn->dev;
-   afu->dev.release = free_afu_dev;
-   rc = dev_set_name(&afu->dev, "%s.%s.%hhu", afu->config.name, location,
-   afu->config.idx);
-   return rc;
+   kref_put(&afu->kref, afu_release);
  }
+EXPORT_SYMBOL_GPL(ocxl_afu_put);
  
  static int assign_afu_actag(struct ocxl_afu *afu)

  {
@@ -233,27 +218,25 @@ static int configure_afu(struct ocxl_afu *afu, u8 
afu_idx, struct pci_dev *dev)
if (rc)
return rc;
  
-	rc = set_afu_device(afu, dev_name(&dev->dev));

-   if (rc)
-   return rc;
-
rc = assign_afu_actag(afu);
if (rc)
return rc;
  
  	rc = assign_afu_pasid(afu);

-   if (rc) {
-   reclaim_afu_actag(afu);
-   return rc;
-   }
+   if (rc)
+   goto err_free_actag;
  
  	rc = map_mmio_areas(afu);

-   if (rc) {
-   reclaim_afu_pasid(afu);
-   reclaim_afu_actag(afu);
-   return rc;
-   }
+   if (rc)
+   goto err_free_pasid;
+
return 0;
+
+err_free_pasid:
+   reclaim_afu_pasid(afu);
+err_free_actag:
+   reclaim_afu_actag(afu);
+   return rc;
  }
  
  static void deconfigure_afu(struct ocxl_afu *afu)

@@ -265,16 +248,8 @@ static void 

Re: [PATCH v5 3/3] locking/rwsem: Optimize down_read_trylock()

2019-03-22 Thread Russell King - ARM Linux admin
On Fri, Mar 22, 2019 at 10:30:08AM -0400, Waiman Long wrote:
> Modify __down_read_trylock() to optimize for an unlocked rwsem and make
> it generate slightly better code.
> 
> Before this patch, down_read_trylock:
> 
>0x <+0>: callq  0x5 
>0x0005 <+5>: jmp0x18 
>0x0007 <+7>: lea0x1(%rdx),%rcx
>0x000b <+11>:mov%rdx,%rax
>0x000e <+14>:lock cmpxchg %rcx,(%rdi)
>0x0013 <+19>:cmp%rax,%rdx
>0x0016 <+22>:je 0x23 
>0x0018 <+24>:mov(%rdi),%rdx
>0x001b <+27>:test   %rdx,%rdx
>0x001e <+30>:jns0x7 
>0x0020 <+32>:xor%eax,%eax
>0x0022 <+34>:retq
>0x0023 <+35>:mov%gs:0x0,%rax
>0x002c <+44>:or $0x3,%rax
>0x0030 <+48>:mov%rax,0x20(%rdi)
>0x0034 <+52>:mov$0x1,%eax
>0x0039 <+57>:retq
> 
> After patch, down_read_trylock:
> 
>0x <+0>:   callq  0x5 
>0x0005 <+5>:   xor%eax,%eax
>0x0007 <+7>:   lea0x1(%rax),%rdx
>0x000b <+11>:  lock cmpxchg %rdx,(%rdi)
>0x0010 <+16>:  jne0x29 
>0x0012 <+18>:  mov%gs:0x0,%rax
>0x001b <+27>:  or $0x3,%rax
>0x001f <+31>:  mov%rax,0x20(%rdi)
>0x0023 <+35>:  mov$0x1,%eax
>0x0028 <+40>:  retq
>0x0029 <+41>:  test   %rax,%rax
>0x002c <+44>:  jns0x7 
>0x002e <+46>:  xor%eax,%eax
>0x0030 <+48>:  retq
> 
> By using a rwsem microbenchmark, the down_read_trylock() rate (with a
> load of 10 to lengthen the lock critical section) on a x86-64 system
> before and after the patch were:
> 
>  Before PatchAfter Patch
># of Threads rlock   rlock
> -   -
> 1   14,496  14,716
> 28,644   8,453
>   46,799   6,983
>   85,664   7,190
> 
> On a ARM64 system, the performance results were:
> 
>  Before PatchAfter Patch
># of Threads rlock   rlock
> -   -
> 1   23,676  24,488
> 27,697   9,502
> 44,945   3,440
> 82,641   1,603
> 
> For the uncontended case (1 thread), the new down_read_trylock() is a
> little bit faster. For the contended cases, the new down_read_trylock()
> perform pretty well in x86-64, but performance degrades at high
> contention level on ARM64.

So, 70% for 4 threads, 61% for 8 threads - does this trend
continue tailing off as the number of threads (and cores)
increase?

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up


Re: [PATCH v5 1/3] locking/rwsem: Remove arch specific rwsem files

2019-03-22 Thread Waiman Long
On 03/22/2019 01:01 PM, Linus Torvalds wrote:
> On Fri, Mar 22, 2019 at 7:30 AM Waiman Long  wrote:
>>  19 files changed, 133 insertions(+), 930 deletions(-)
> Lovely. And it all looks sane to me.
>
> So ack.
>
> The only comment I have is about __down_read_trylock(), which probably
> isn't critical enough to actually care about, but:
>
>> +static inline int __down_read_trylock(struct rw_semaphore *sem)
>> +{
>> +   long tmp;
>> +
>> +   while ((tmp = atomic_long_read(&sem->count)) >= 0) {
>> +   if (tmp == atomic_long_cmpxchg_acquire(&sem->count, tmp,
>> +  tmp + RWSEM_ACTIVE_READ_BIAS)) {
>> +   return 1;
>> +   }
>> +   }
>> +   return 0;
>> +}
> So this seems to
>
>  (a) read the line early (the whole cacheline in shared state issue)
>
>  (b) read the line again unnecessarily in the while loop
>
> Now, (a) might be explained by "well, maybe we do trylock even with
> existing readers", although I continue to think that the case we
> should optimize for is simply the uncontended one, where we don't even
> have multiple readers.
>
> But (b) just seems silly.
>
> So I wonder if it shouldn't just be
>
> long tmp = 0;
>
> do {
> long new = atomic_long_cmpxchg_acquire(&sem->count, tmp,
> tmp + RWSEM_ACTIVE_READ_BIAS);
> if (likely(new == tmp))
> return 1;
>tmp = new;
> } while (tmp >= 0);
> return 0;
>
> which would seem simpler and solve both issues. Hmm?
>
> But honestly, I didn't check what our uses of down_read_trylock() look
> like. We have more of them than I expected, and I _think_ the normal
> case is the "nobody else holds the lock", but that's just a gut
> feeling.
>
> Some of them _might_ be performance-critical. There's the one on
> mmap_sem in the fault handling path, for example. And yes, I'd expect
> the normal case to very much be "no other readers or writers" for that
> one.
>
> NOTE! The above code snippet is absolutely untested, and might be
> completely wrong. Take it as a "something like this" rather than
> anything else.
>
>Linus

As you have noticed already, this patch is just for moving code around
without changing it. I optimize __down_read_trylock() in patch 3.

Cheers,
Longman



Re: [PATCH v5 3/3] locking/rwsem: Optimize down_read_trylock()

2019-03-22 Thread Linus Torvalds
On Fri, Mar 22, 2019 at 7:30 AM Waiman Long  wrote:
>
> Modify __down_read_trylock() to optimize for an unlocked rwsem and make
> it generate slightly better code.

Oh, that should teach me to read all patches in the series before
starting to comment on them.

So ignore my comment on #1.

Linus


Re: [PATCH v5 2/3] locking/rwsem: Remove rwsem-spinlock.c & use rwsem-xadd.c for all archs

2019-03-22 Thread Linus Torvalds
On Fri, Mar 22, 2019 at 7:30 AM Waiman Long  wrote:
>
> For simplification, we are going to remove rwsem-spinlock.c and make all
> architectures use a single implementation of rwsem - rwsem-xadd.c.

Ack.

   Linus


Re: [PATCH v5 1/3] locking/rwsem: Remove arch specific rwsem files

2019-03-22 Thread Linus Torvalds
On Fri, Mar 22, 2019 at 7:30 AM Waiman Long  wrote:
>
>  19 files changed, 133 insertions(+), 930 deletions(-)

Lovely. And it all looks sane to me.

So ack.

The only comment I have is about __down_read_trylock(), which probably
isn't critical enough to actually care about, but:

> +static inline int __down_read_trylock(struct rw_semaphore *sem)
> +{
> +   long tmp;
> +
> +   while ((tmp = atomic_long_read(&sem->count)) >= 0) {
> +   if (tmp == atomic_long_cmpxchg_acquire(&sem->count, tmp,
> +  tmp + RWSEM_ACTIVE_READ_BIAS)) {
> +   return 1;
> +   }
> +   }
> +   return 0;
> +}

So this seems to

 (a) read the line early (the whole cacheline in shared state issue)

 (b) read the line again unnecessarily in the while loop

Now, (a) might be explained by "well, maybe we do trylock even with
existing readers", although I continue to think that the case we
should optimize for is simply the uncontended one, where we don't even
have multiple readers.

But (b) just seems silly.

So I wonder if it shouldn't just be

long tmp = 0;

do {
long new = atomic_long_cmpxchg_acquire(&sem->count, tmp,
tmp + RWSEM_ACTIVE_READ_BIAS);
if (likely(new == tmp))
return 1;
   tmp = new;
} while (tmp >= 0);
return 0;

which would seem simpler and solve both issues. Hmm?

But honestly, I didn't check what our uses of down_read_trylock() look
like. We have more of them than I expected, and I _think_ the normal
case is the "nobody else holds the lock", but that's just a gut
feeling.

Some of them _might_ be performance-critical. There's the one on
mmap_sem in the fault handling path, for example. And yes, I'd expect
the normal case to very much be "no other readers or writers" for that
one.

NOTE! The above code snippet is absolutely untested, and might be
completely wrong. Take it as a "something like this" rather than
anything else.
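
(For readability, the same untested idea as a self-contained helper --
same caveat as above, a sketch rather than anything final:)

	static inline int __down_read_trylock(struct rw_semaphore *sem)
	{
		long tmp = 0;

		do {
			long new = atomic_long_cmpxchg_acquire(&sem->count, tmp,
						tmp + RWSEM_ACTIVE_READ_BIAS);
			if (likely(new == tmp))
				return 1;
			tmp = new;
		} while (tmp >= 0);
		return 0;
	}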

   Linus


[PATCH v5 3/3] locking/rwsem: Optimize down_read_trylock()

2019-03-22 Thread Waiman Long
Modify __down_read_trylock() to optimize for an unlocked rwsem and make
it generate slightly better code.

Before this patch, down_read_trylock:

   0x <+0>: callq  0x5 
   0x0005 <+5>: jmp0x18 
   0x0007 <+7>: lea0x1(%rdx),%rcx
   0x000b <+11>:mov%rdx,%rax
   0x000e <+14>:lock cmpxchg %rcx,(%rdi)
   0x0013 <+19>:cmp%rax,%rdx
   0x0016 <+22>:je 0x23 
   0x0018 <+24>:mov(%rdi),%rdx
   0x001b <+27>:test   %rdx,%rdx
   0x001e <+30>:jns0x7 
   0x0020 <+32>:xor%eax,%eax
   0x0022 <+34>:retq
   0x0023 <+35>:mov%gs:0x0,%rax
   0x002c <+44>:or $0x3,%rax
   0x0030 <+48>:mov%rax,0x20(%rdi)
   0x0034 <+52>:mov$0x1,%eax
   0x0039 <+57>:retq

After patch, down_read_trylock:

   0x <+0>: callq  0x5 
   0x0005 <+5>: xor%eax,%eax
   0x0007 <+7>: lea0x1(%rax),%rdx
   0x000b <+11>:lock cmpxchg %rdx,(%rdi)
   0x0010 <+16>:jne0x29 
   0x0012 <+18>:mov%gs:0x0,%rax
   0x001b <+27>:or $0x3,%rax
   0x001f <+31>:mov%rax,0x20(%rdi)
   0x0023 <+35>:mov$0x1,%eax
   0x0028 <+40>:retq
   0x0029 <+41>:test   %rax,%rax
   0x002c <+44>:jns0x7 
   0x002e <+46>:xor%eax,%eax
   0x0030 <+48>:retq

By using a rwsem microbenchmark, the down_read_trylock() rates (with a
load of 10 to lengthen the lock critical section) on an x86-64 system
before and after the patch were:

                 Before Patch    After Patch
   # of Threads      rlock           rlock
   ------------      -----           -----
        1            14,496          14,716
        2             8,644           8,453
        4             6,799           6,983
        8             5,664           7,190

On an ARM64 system, the performance results were:

                 Before Patch    After Patch
   # of Threads      rlock           rlock
   ------------      -----           -----
        1            23,676          24,488
        2             7,697           9,502
        4             4,945           3,440
        8             2,641           1,603

For the uncontended case (1 thread), the new down_read_trylock() is a
little bit faster. For the contended cases, the new down_read_trylock()
performs pretty well on x86-64, but performance degrades at high
contention levels on ARM64.

Suggested-by: Linus Torvalds 
Signed-off-by: Waiman Long 
---
 kernel/locking/rwsem.h | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/kernel/locking/rwsem.h b/kernel/locking/rwsem.h
index 45ee00236e03..1f5775aa6a1d 100644
--- a/kernel/locking/rwsem.h
+++ b/kernel/locking/rwsem.h
@@ -174,14 +174,17 @@ static inline int __down_read_killable(struct 
rw_semaphore *sem)
 
 static inline int __down_read_trylock(struct rw_semaphore *sem)
 {
-   long tmp;
+   /*
+* Optimize for the case when the rwsem is not locked at all.
+*/
+   long tmp = RWSEM_UNLOCKED_VALUE;
 
-   while ((tmp = atomic_long_read(&sem->count)) >= 0) {
-   if (tmp == atomic_long_cmpxchg_acquire(&sem->count, tmp,
-  tmp + RWSEM_ACTIVE_READ_BIAS)) {
+   do {
+   if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
+   tmp + RWSEM_ACTIVE_READ_BIAS)) {
return 1;
}
-   }
+   } while (tmp >= 0);
return 0;
 }
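
A note on the helper used here (not part of the patch): on failure,
atomic_long_try_cmpxchg_acquire() writes the value it observed back into
its expected-value argument, which is why the loop above does not need to
re-read sem->count. A minimal model of that behaviour:

	/* illustrative model only -- not the kernel implementation */
	static inline bool model_try_cmpxchg_acquire(atomic_long_t *v,
						     long *old, long new)
	{
		long seen = atomic_long_cmpxchg_acquire(v, *old, new);

		if (seen == *old)
			return true;	/* exchange happened */
		*old = seen;		/* refresh caller's expected value */
		return false;
	}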
 
-- 
2.18.1



[PATCH v5 2/3] locking/rwsem: Remove rwsem-spinlock.c & use rwsem-xadd.c for all archs

2019-03-22 Thread Waiman Long
Currently, we have two different implementations of rwsem:
 1) CONFIG_RWSEM_GENERIC_SPINLOCK (rwsem-spinlock.c)
 2) CONFIG_RWSEM_XCHGADD_ALGORITHM (rwsem-xadd.c)

As we are going to use a single generic implementation for rwsem-xadd.c
and no architecture-specific code will be needed, there is no point
in keeping two different implementations of rwsem. In most cases, the
performance of rwsem-spinlock.c will be worse. It also doesn't get all
the performance tuning and optimizations that had been implemented in
rwsem-xadd.c over the years.

For simplification, we are going to remove rwsem-spinlock.c and make all
architectures use a single implementation of rwsem - rwsem-xadd.c.

All references to RWSEM_GENERIC_SPINLOCK and RWSEM_XCHGADD_ALGORITHM
in the code are removed.

Suggested-by: Peter Zijlstra 
Signed-off-by: Waiman Long 
---
 arch/alpha/Kconfig  |   7 -
 arch/arc/Kconfig|   3 -
 arch/arm/Kconfig|   4 -
 arch/arm64/Kconfig  |   3 -
 arch/c6x/Kconfig|   3 -
 arch/csky/Kconfig   |   3 -
 arch/h8300/Kconfig  |   3 -
 arch/hexagon/Kconfig|   6 -
 arch/ia64/Kconfig   |   4 -
 arch/m68k/Kconfig   |   7 -
 arch/microblaze/Kconfig |   6 -
 arch/mips/Kconfig   |   7 -
 arch/nds32/Kconfig  |   3 -
 arch/nios2/Kconfig  |   3 -
 arch/openrisc/Kconfig   |   6 -
 arch/parisc/Kconfig |   6 -
 arch/powerpc/Kconfig|   7 -
 arch/riscv/Kconfig  |   3 -
 arch/s390/Kconfig   |   6 -
 arch/sh/Kconfig |   6 -
 arch/sparc/Kconfig  |   8 -
 arch/unicore32/Kconfig  |   6 -
 arch/x86/Kconfig|   3 -
 arch/x86/um/Kconfig |   6 -
 arch/xtensa/Kconfig |   3 -
 include/linux/rwsem-spinlock.h  |  47 -
 include/linux/rwsem.h   |   5 -
 kernel/Kconfig.locks|   2 +-
 kernel/locking/Makefile |   4 +-
 kernel/locking/rwsem-spinlock.c | 339 
 kernel/locking/rwsem.h  |   3 -
 31 files changed, 2 insertions(+), 520 deletions(-)
 delete mode 100644 include/linux/rwsem-spinlock.h
 delete mode 100644 kernel/locking/rwsem-spinlock.c

diff --git a/arch/alpha/Kconfig b/arch/alpha/Kconfig
index 584a6e114853..27c871227eee 100644
--- a/arch/alpha/Kconfig
+++ b/arch/alpha/Kconfig
@@ -49,13 +49,6 @@ config MMU
bool
default y
 
-config RWSEM_GENERIC_SPINLOCK
-   bool
-
-config RWSEM_XCHGADD_ALGORITHM
-   bool
-   default y
-
 config ARCH_HAS_ILOG2_U32
bool
default n
diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig
index c781e45d1d99..23e063df5d2c 100644
--- a/arch/arc/Kconfig
+++ b/arch/arc/Kconfig
@@ -63,9 +63,6 @@ config SCHED_OMIT_FRAME_POINTER
 config GENERIC_CSUM
def_bool y
 
-config RWSEM_GENERIC_SPINLOCK
-   def_bool y
-
 config ARCH_DISCONTIGMEM_ENABLE
def_bool n
 
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 054ead960f98..c11c61093c6c 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -178,10 +178,6 @@ config TRACE_IRQFLAGS_SUPPORT
bool
default !CPU_V7M
 
-config RWSEM_XCHGADD_ALGORITHM
-   bool
-   default y
-
 config ARCH_HAS_ILOG2_U32
bool
 
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 7e34b9eba5de..c62b9db2b5e8 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -237,9 +237,6 @@ config LOCKDEP_SUPPORT
 config TRACE_IRQFLAGS_SUPPORT
def_bool y
 
-config RWSEM_XCHGADD_ALGORITHM
-   def_bool y
-
 config GENERIC_BUG
def_bool y
depends on BUG
diff --git a/arch/c6x/Kconfig b/arch/c6x/Kconfig
index e5cd3c5f8399..ed92b5840c0a 100644
--- a/arch/c6x/Kconfig
+++ b/arch/c6x/Kconfig
@@ -27,9 +27,6 @@ config MMU
 config FPU
def_bool n
 
-config RWSEM_GENERIC_SPINLOCK
-   def_bool y
-
 config GENERIC_CALIBRATE_DELAY
def_bool y
 
diff --git a/arch/csky/Kconfig b/arch/csky/Kconfig
index 725a115759c9..6555d1781132 100644
--- a/arch/csky/Kconfig
+++ b/arch/csky/Kconfig
@@ -92,9 +92,6 @@ config GENERIC_HWEIGHT
 config MMU
def_bool y
 
-config RWSEM_GENERIC_SPINLOCK
-   def_bool y
-
 config STACKTRACE_SUPPORT
def_bool y
 
diff --git a/arch/h8300/Kconfig b/arch/h8300/Kconfig
index c071da34e081..61c01db6c292 100644
--- a/arch/h8300/Kconfig
+++ b/arch/h8300/Kconfig
@@ -27,9 +27,6 @@ config H8300
 config CPU_BIG_ENDIAN
def_bool y
 
-config RWSEM_GENERIC_SPINLOCK
-   def_bool y
-
 config GENERIC_HWEIGHT
def_bool y
 
diff --git a/arch/hexagon/Kconfig b/arch/hexagon/Kconfig
index ac441680dcc0..3e54a53208d5 100644
--- a/arch/hexagon/Kconfig
+++ b/arch/hexagon/Kconfig
@@ -65,12 +65,6 @@ config GENERIC_CSUM
 config GENERIC_IRQ_PROBE
def_bool y
 
-config RWSEM_GENERIC_SPINLOCK
-   def_bool n
-
-config RWSEM_XCHGADD_ALGORITHM
-   def_bool y
-
 config GENERIC_HWEIGHT
 

[PATCH v5 1/3] locking/rwsem: Remove arch specific rwsem files

2019-03-22 Thread Waiman Long
As the generic rwsem-xadd code is using the appropriate acquire and
release versions of the atomic operations, the arch specific rwsem.h
files will not be that much faster than the generic code as long as the
atomic functions are properly implemented. So we can remove those
arch-specific rwsem.h files and stop building asm/rwsem.h, to reduce the
maintenance effort.

Currently, only x86, alpha and ia64 have implemented architecture
specific fast paths. I don't have access to alpha and ia64 systems for
testing, but they are legacy systems that are not likely to be updated
to the latest kernel anyway.

By using a rwsem microbenchmark, the total locking rates on a 4-socket
56-core 112-thread x86-64 system before and after the patch were as
follows (mixed means equal # of read and write locks):

                        Before Patch                After Patch
   # of Threads   wlock   rlock   mixed      wlock   rlock   mixed
   ------------   ------  ------  ------     ------  ------  ------
        1         29,201  30,143  29,458     28,615  30,172  29,201
        2          6,807  13,299   1,171      7,725  15,025   1,804
        4          6,504  12,755   1,520      7,127  14,286   1,345
        8          6,762  13,412     764      6,826  13,652     726
       16          6,693  15,408     662      6,599  15,938     626
       32          6,145  15,286     496      5,549  15,487     511
       64          5,812  15,495      60      5,858  15,572      60

There were some run-to-run variations for the multi-thread tests. For
x86-64, using the generic C code fast path seems to be a little bit
faster than the assembly version with low lock contention.  Looking at
the assembly version of the fast paths, there are assembly to/from C
code wrappers that save and restore all the callee-clobbered registers
(7 registers on x86-64). The assembly generated from the generic C
code doesn't need to do that. That may explain the slight performance
gain here.

The generic asm rwsem.h can also be merged into kernel/locking/rwsem.h
with no code change as no other code other than those under
kernel/locking needs to access the internal rwsem macros and functions.

Signed-off-by: Waiman Long 
---
 MAINTAINERS |   1 -
 arch/alpha/include/asm/rwsem.h  | 211 
 arch/arm/include/asm/Kbuild |   1 -
 arch/arm64/include/asm/Kbuild   |   1 -
 arch/hexagon/include/asm/Kbuild |   1 -
 arch/ia64/include/asm/rwsem.h   | 172 ---
 arch/powerpc/include/asm/Kbuild |   1 -
 arch/s390/include/asm/Kbuild|   1 -
 arch/sh/include/asm/Kbuild  |   1 -
 arch/sparc/include/asm/Kbuild   |   1 -
 arch/x86/include/asm/rwsem.h| 237 
 arch/x86/lib/Makefile   |   1 -
 arch/x86/lib/rwsem.S| 156 -
 arch/x86/um/Makefile|   1 -
 arch/xtensa/include/asm/Kbuild  |   1 -
 include/asm-generic/rwsem.h | 140 ---
 include/linux/rwsem.h   |   4 +-
 kernel/locking/percpu-rwsem.c   |   2 +
 kernel/locking/rwsem.h  | 130 ++
 19 files changed, 133 insertions(+), 930 deletions(-)
 delete mode 100644 arch/alpha/include/asm/rwsem.h
 delete mode 100644 arch/ia64/include/asm/rwsem.h
 delete mode 100644 arch/x86/include/asm/rwsem.h
 delete mode 100644 arch/x86/lib/rwsem.S
 delete mode 100644 include/asm-generic/rwsem.h

diff --git a/MAINTAINERS b/MAINTAINERS
index e17ebf70b548..6bfd5a94c08e 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9089,7 +9089,6 @@ F:arch/*/include/asm/spinlock*.h
 F: include/linux/rwlock*.h
 F: include/linux/mutex*.h
 F: include/linux/rwsem*.h
-F: arch/*/include/asm/rwsem.h
 F: include/linux/seqlock.h
 F: lib/locking*.[ch]
 F: kernel/locking/
diff --git a/arch/alpha/include/asm/rwsem.h b/arch/alpha/include/asm/rwsem.h
deleted file mode 100644
index cf8fc8f9a2ed..
--- a/arch/alpha/include/asm/rwsem.h
+++ /dev/null
@@ -1,211 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _ALPHA_RWSEM_H
-#define _ALPHA_RWSEM_H
-
-/*
- * Written by Ivan Kokshaysky , 2001.
- * Based on asm-alpha/semaphore.h and asm-i386/rwsem.h
- */
-
-#ifndef _LINUX_RWSEM_H
-#error "please don't include asm/rwsem.h directly, use linux/rwsem.h instead"
-#endif
-
-#ifdef __KERNEL__
-
-#include 
-
-#define RWSEM_UNLOCKED_VALUE		0x0000000000000000L
-#define RWSEM_ACTIVE_BIAS		0x0000000000000001L
-#define RWSEM_ACTIVE_MASK		0x00000000ffffffffL
-#define RWSEM_WAITING_BIAS		(-0x0000000100000000L)
-#define RWSEM_ACTIVE_READ_BIAS		RWSEM_ACTIVE_BIAS
-#define RWSEM_ACTIVE_WRITE_BIAS	(RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
-
-static inline int ___down_read(struct rw_semaphore *sem)
-{
-   long oldcount;
-#ifndef	CONFIG_SMP
-   oldcount = sem->count.counter;
-   sem->count.counter += RWSEM_ACTIVE_READ_BIAS;
-#else
-   long temp;
-   __asm__ __volatile__(
-   "1: ldq_l 

[PATCH v5 0/3] locking/rwsem: Rwsem rearchitecture part 0

2019-03-22 Thread Waiman Long
v5:
 - Rebase to the latest v5.1 tree and fix conflicts in 
   arch/{xtensa,s390}/include/asm/Kbuild.

v4:
 - Remove rwsem-spinlock.c and make all archs use rwsem-xadd.c.

v3:
 - Optimize __down_read_trylock() for the uncontended case as suggested
   by Linus.

v2:
 - Add patch 2 to optimize __down_read_trylock() as suggested by PeterZ.
 - Update performance test data in patch 1.

The goal of this patchset is to remove the architecture specific files
for rwsem-xadd to make it easier to add enhancements in the later rwsem
patches. It also removes the legacy rwsem-spinlock.c file and make all
the architectures use one single implementation of rwsem - rwsem-xadd.c.

Waiman Long (3):
  locking/rwsem: Remove arch specific rwsem files
  locking/rwsem: Remove rwsem-spinlock.c & use rwsem-xadd.c for all
archs
  locking/rwsem: Optimize down_read_trylock()

 MAINTAINERS |   1 -
 arch/alpha/Kconfig  |   7 -
 arch/alpha/include/asm/rwsem.h  | 211 
 arch/arc/Kconfig|   3 -
 arch/arm/Kconfig|   4 -
 arch/arm/include/asm/Kbuild |   1 -
 arch/arm64/Kconfig  |   3 -
 arch/arm64/include/asm/Kbuild   |   1 -
 arch/c6x/Kconfig|   3 -
 arch/csky/Kconfig   |   3 -
 arch/h8300/Kconfig  |   3 -
 arch/hexagon/Kconfig|   6 -
 arch/hexagon/include/asm/Kbuild |   1 -
 arch/ia64/Kconfig   |   4 -
 arch/ia64/include/asm/rwsem.h   | 172 
 arch/m68k/Kconfig   |   7 -
 arch/microblaze/Kconfig |   6 -
 arch/mips/Kconfig   |   7 -
 arch/nds32/Kconfig  |   3 -
 arch/nios2/Kconfig  |   3 -
 arch/openrisc/Kconfig   |   6 -
 arch/parisc/Kconfig |   6 -
 arch/powerpc/Kconfig|   7 -
 arch/powerpc/include/asm/Kbuild |   1 -
 arch/riscv/Kconfig  |   3 -
 arch/s390/Kconfig   |   6 -
 arch/s390/include/asm/Kbuild|   1 -
 arch/sh/Kconfig |   6 -
 arch/sh/include/asm/Kbuild  |   1 -
 arch/sparc/Kconfig  |   8 -
 arch/sparc/include/asm/Kbuild   |   1 -
 arch/unicore32/Kconfig  |   6 -
 arch/x86/Kconfig|   3 -
 arch/x86/include/asm/rwsem.h| 237 --
 arch/x86/lib/Makefile   |   1 -
 arch/x86/lib/rwsem.S| 156 ---
 arch/x86/um/Kconfig |   6 -
 arch/x86/um/Makefile|   1 -
 arch/xtensa/Kconfig |   3 -
 arch/xtensa/include/asm/Kbuild  |   1 -
 include/asm-generic/rwsem.h | 140 -
 include/linux/rwsem-spinlock.h  |  47 -
 include/linux/rwsem.h   |   9 +-
 kernel/Kconfig.locks|   2 +-
 kernel/locking/Makefile |   4 +-
 kernel/locking/percpu-rwsem.c   |   2 +
 kernel/locking/rwsem-spinlock.c | 339 
 kernel/locking/rwsem.h  | 130 
 48 files changed, 135 insertions(+), 1447 deletions(-)
 delete mode 100644 arch/alpha/include/asm/rwsem.h
 delete mode 100644 arch/ia64/include/asm/rwsem.h
 delete mode 100644 arch/x86/include/asm/rwsem.h
 delete mode 100644 arch/x86/lib/rwsem.S
 delete mode 100644 include/asm-generic/rwsem.h
 delete mode 100644 include/linux/rwsem-spinlock.h
 delete mode 100644 kernel/locking/rwsem-spinlock.c

-- 
2.18.1



Re: powerpc32 boot crash in 5.1-rc1

2019-03-22 Thread Meelis Roos

While 5.0.0 worked fine on my PowerMac G4, 5.0 + git (unknown rev as of Mar 
13), 5.0.0-11520-gf261c4e and today's git all fail to boot.

The problem seems to be in the page fault handler in load_elf_binary() of the 
init process.


The patch at https://patchwork.ozlabs.org/patch/1053385/ should fix it


Tested it - yes, it fixes the boot.

--
Meelis Roos 


[RFC PATCH v1 3/3] kasan: add interceptors for all string functions

2019-03-22 Thread Christophe Leroy
In the same spirit as commit 393f203f5fd5 ("x86_64: kasan: add
interceptors for memset/memmove/memcpy functions"), this patch
adds interceptors for string manipulation functions so that we
can compile lib/string.o without kasan support, hence allowing the
string functions to also be used from places where kasan has
to be disabled.

Signed-off-by: Christophe Leroy 
---
 This is the generic part. If we agree on the principle, then I'll
 go through the arches and see if adaptations need to be done there.

 include/linux/string.h |  79 
 lib/Makefile   |   2 +
 mm/kasan/string.c  | 334 +
 3 files changed, 415 insertions(+)

diff --git a/include/linux/string.h b/include/linux/string.h
index 7927b875f80c..7e7441f4c420 100644
--- a/include/linux/string.h
+++ b/include/linux/string.h
@@ -19,54 +19,117 @@ extern void *memdup_user_nul(const void __user *, size_t);
  */
 #include 
 
+#if defined(CONFIG_KASAN) && !defined(__SANITIZE_ADDRESS__)
+/*
+ * For files that are not instrumented (e.g. mm/slub.c) we
+ * should use not instrumented version of mem* functions.
+ */
+#define memset16   __memset16
+#define memset32   __memset32
+#define memset64   __memset64
+#define memzero_explicit   __memzero_explicit
+#define strcpy __strcpy
+#define strncpy__strncpy
+#define strlcpy__strlcpy
+#define strscpy__strscpy
+#define strcat __strcat
+#define strncat__strncat
+#define strlcat__strlcat
+#define strcmp __strcmp
+#define strncmp__strncmp
+#define strcasecmp __strcasecmp
+#define strncasecmp__strncasecmp
+#define strchr __strchr
+#define strchrnul  __strchrnul
+#define strrchr__strrchr
+#define strnchr__strnchr
+#define skip_spaces__skip_spaces
+#define strim  __strim
+#define strstr __strstr
+#define strnstr__strnstr
+#define strlen __strlen
+#define strnlen__strnlen
+#define strpbrk__strpbrk
+#define strsep __strsep
+#define strspn __strspn
+#define strcspn__strcspn
+#define memscan__memscan
+#define memcmp __memcmp
+#define memchr __memchr
+#define memchr_inv __memchr_inv
+#define strreplace __strreplace
+
+#ifndef __NO_FORTIFY
+#define __NO_FORTIFY /* FORTIFY_SOURCE uses __builtin_memcpy, etc. */
+#endif
+
+#endif
+
 #ifndef __HAVE_ARCH_STRCPY
 extern char * strcpy(char *,const char *);
+char *__strcpy(char *,const char *);
 #endif
 #ifndef __HAVE_ARCH_STRNCPY
 extern char * strncpy(char *,const char *, __kernel_size_t);
+char *__strncpy(char *,const char *, __kernel_size_t);
 #endif
 #ifndef __HAVE_ARCH_STRLCPY
 size_t strlcpy(char *, const char *, size_t);
+size_t __strlcpy(char *, const char *, size_t);
 #endif
 #ifndef __HAVE_ARCH_STRSCPY
 ssize_t strscpy(char *, const char *, size_t);
+ssize_t __strscpy(char *, const char *, size_t);
 #endif
 #ifndef __HAVE_ARCH_STRCAT
 extern char * strcat(char *, const char *);
+char *__strcat(char *, const char *);
 #endif
 #ifndef __HAVE_ARCH_STRNCAT
 extern char * strncat(char *, const char *, __kernel_size_t);
+char *__strncat(char *, const char *, __kernel_size_t);
 #endif
 #ifndef __HAVE_ARCH_STRLCAT
 extern size_t strlcat(char *, const char *, __kernel_size_t);
+size_t __strlcat(char *, const char *, __kernel_size_t);
 #endif
 #ifndef __HAVE_ARCH_STRCMP
 extern int strcmp(const char *,const char *);
+int __strcmp(const char *,const char *);
 #endif
 #ifndef __HAVE_ARCH_STRNCMP
 extern int strncmp(const char *,const char *,__kernel_size_t);
+int __strncmp(const char *,const char *,__kernel_size_t);
 #endif
 #ifndef __HAVE_ARCH_STRCASECMP
 extern int strcasecmp(const char *s1, const char *s2);
+int __strcasecmp(const char *s1, const char *s2);
 #endif
 #ifndef __HAVE_ARCH_STRNCASECMP
 extern int strncasecmp(const char *s1, const char *s2, size_t n);
+int __strncasecmp(const char *s1, const char *s2, size_t n);
 #endif
 #ifndef __HAVE_ARCH_STRCHR
 extern char * strchr(const char *,int);
+char *__strchr(const char *,int);
 #endif
 #ifndef __HAVE_ARCH_STRCHRNUL
 extern char * strchrnul(const char *,int);
+char *__strchrnul(const char *,int);
 #endif
 #ifndef __HAVE_ARCH_STRNCHR
 extern char * strnchr(const char *, size_t, int);
+char *__strnchr(const char *, size_t, int);
 #endif
 #ifndef __HAVE_ARCH_STRRCHR
 extern char * strrchr(const char *,int);
+char *__strrchr(const char *,int);
 #endif
 extern char * __must_check skip_spaces(const char *);
+char * __must_check __skip_spaces(const char *);
 
 extern char *strim(char *);
+char *__strim(char *);
 
 static inline __must_check char *strstrip(char *str)
 {
@@ -75,27 +138,35 @@ static inline __must_check char *strstrip(char *str)
 
 #ifndef __HAVE_ARCH_STRSTR
 extern char * strstr(const char *, const char *);
+char *__strstr(const 
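
The mm/kasan/string.c hunk is truncated above. As an illustration of the
pattern the new interceptors follow, here is a minimal sketch for one of
them, modelled on the existing memset/memmove/memcpy interceptors; it is a
reconstruction for illustration only, not necessarily the exact code from
the patch, and would sit in mm/kasan/string.c next to those interceptors
(so it shares their includes, in particular "kasan.h" for
check_memory_region()):

#undef memcmp
int memcmp(const void *cs, const void *ct, size_t count)
{
        /* report reads of poisoned memory in either buffer */
        check_memory_region((unsigned long)cs, count, false, _RET_IP_);
        check_memory_region((unsigned long)ct, count, false, _RET_IP_);

        /* delegate to the uninstrumented implementation from lib/string.o */
        return __memcmp(cs, ct, count);
}

Instrumented code keeps calling memcmp() by its normal name, while files
built without KASAN instrumentation pick up the __-prefixed variant through
the #define block added to include/linux/string.h above.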

[RFC PATCH v1 1/3] kasan: move memset/memmove/memcpy interceptors in a dedicated file

2019-03-22 Thread Christophe Leroy
In preparation for the addition of interceptors for other string functions,
this patch moves the memset/memmove/memcpy interceptors into string.c

Signed-off-by: Christophe Leroy 
---
 mm/kasan/Makefile |  5 -
 mm/kasan/common.c | 26 --
 mm/kasan/string.c | 35 +++
 3 files changed, 39 insertions(+), 27 deletions(-)
 create mode 100644 mm/kasan/string.c

diff --git a/mm/kasan/Makefile b/mm/kasan/Makefile
index 5d1065efbd47..85e91e301404 100644
--- a/mm/kasan/Makefile
+++ b/mm/kasan/Makefile
@@ -1,11 +1,13 @@
 # SPDX-License-Identifier: GPL-2.0
 KASAN_SANITIZE := n
 UBSAN_SANITIZE_common.o := n
+UBSAN_SANITIZE_string.o := n
 UBSAN_SANITIZE_generic.o := n
 UBSAN_SANITIZE_tags.o := n
 KCOV_INSTRUMENT := n
 
 CFLAGS_REMOVE_common.o = -pg
+CFLAGS_REMOVE_string.o = -pg
 CFLAGS_REMOVE_generic.o = -pg
 CFLAGS_REMOVE_tags.o = -pg
 
@@ -13,9 +15,10 @@ CFLAGS_REMOVE_tags.o = -pg
 # see: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63533
 
 CFLAGS_common.o := $(call cc-option, -fno-conserve-stack -fno-stack-protector)
+CFLAGS_string.o := $(call cc-option, -fno-conserve-stack -fno-stack-protector)
 CFLAGS_generic.o := $(call cc-option, -fno-conserve-stack -fno-stack-protector)
 CFLAGS_tags.o := $(call cc-option, -fno-conserve-stack -fno-stack-protector)
 
-obj-$(CONFIG_KASAN) := common.o init.o report.o
+obj-$(CONFIG_KASAN) := common.o init.o report.o string.o
 obj-$(CONFIG_KASAN_GENERIC) += generic.o generic_report.o quarantine.o
 obj-$(CONFIG_KASAN_SW_TAGS) += tags.o tags_report.o
diff --git a/mm/kasan/common.c b/mm/kasan/common.c
index 80bbe62b16cd..3b94f484bf78 100644
--- a/mm/kasan/common.c
+++ b/mm/kasan/common.c
@@ -109,32 +109,6 @@ void kasan_check_write(const volatile void *p, unsigned 
int size)
 }
 EXPORT_SYMBOL(kasan_check_write);
 
-#undef memset
-void *memset(void *addr, int c, size_t len)
-{
-   check_memory_region((unsigned long)addr, len, true, _RET_IP_);
-
-   return __memset(addr, c, len);
-}
-
-#undef memmove
-void *memmove(void *dest, const void *src, size_t len)
-{
-   check_memory_region((unsigned long)src, len, false, _RET_IP_);
-   check_memory_region((unsigned long)dest, len, true, _RET_IP_);
-
-   return __memmove(dest, src, len);
-}
-
-#undef memcpy
-void *memcpy(void *dest, const void *src, size_t len)
-{
-   check_memory_region((unsigned long)src, len, false, _RET_IP_);
-   check_memory_region((unsigned long)dest, len, true, _RET_IP_);
-
-   return __memcpy(dest, src, len);
-}
-
 /*
  * Poisons the shadow memory for 'size' bytes starting from 'addr'.
  * Memory addresses should be aligned to KASAN_SHADOW_SCALE_SIZE.
diff --git a/mm/kasan/string.c b/mm/kasan/string.c
new file mode 100644
index ..f23a740ff985
--- /dev/null
+++ b/mm/kasan/string.c
@@ -0,0 +1,35 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * This file contains string functions for KASAN
+ *
+ */
+
+#include 
+
+#include "kasan.h"
+
+#undef memset
+void *memset(void *addr, int c, size_t len)
+{
+   check_memory_region((unsigned long)addr, len, true, _RET_IP_);
+
+   return __memset(addr, c, len);
+}
+
+#undef memmove
+void *memmove(void *dest, const void *src, size_t len)
+{
+   check_memory_region((unsigned long)src, len, false, _RET_IP_);
+   check_memory_region((unsigned long)dest, len, true, _RET_IP_);
+
+   return __memmove(dest, src, len);
+}
+
+#undef memcpy
+void *memcpy(void *dest, const void *src, size_t len)
+{
+   check_memory_region((unsigned long)src, len, false, _RET_IP_);
+   check_memory_region((unsigned long)dest, len, true, _RET_IP_);
+
+   return __memcpy(dest, src, len);
+}
-- 
2.13.3



[RFC PATCH v1 2/3] lib/string: move sysfs string functions out of string.c

2019-03-22 Thread Christophe Leroy
In order to implement interceptors for string functions, move
higher level sysfs related string functions out of string.c

This patch creates a new file named string_sysfs.c

Signed-off-by: Christophe Leroy 
---
 lib/Makefile   |  3 ++-
 lib/string.c   | 79 --
 lib/string_sysfs.c | 61 +
 3 files changed, 63 insertions(+), 80 deletions(-)
 create mode 100644 lib/string_sysfs.c

diff --git a/lib/Makefile b/lib/Makefile
index 3b08673e8881..30b9b0bfbba9 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -12,12 +12,13 @@ endif
 # flaky coverage that is not a function of syscall inputs. For example,
 # rbtree can be global and individual rotations don't correlate with inputs.
 KCOV_INSTRUMENT_string.o := n
+KCOV_INSTRUMENT_string_sysfs.o := n
 KCOV_INSTRUMENT_rbtree.o := n
 KCOV_INSTRUMENT_list_debug.o := n
 KCOV_INSTRUMENT_debugobjects.o := n
 KCOV_INSTRUMENT_dynamic_debug.o := n
 
-lib-y := ctype.o string.o vsprintf.o cmdline.o \
+lib-y := ctype.o string.o string_sysfs.o vsprintf.o cmdline.o \
 rbtree.o radix-tree.o timerqueue.o xarray.o \
 idr.o int_sqrt.o extable.o \
 sha1.o chacha.o irq_regs.o argv_split.o \
diff --git a/lib/string.c b/lib/string.c
index 38e4ca08e757..f3886c5175ac 100644
--- a/lib/string.c
+++ b/lib/string.c
@@ -605,85 +605,6 @@ char *strsep(char **s, const char *ct)
 EXPORT_SYMBOL(strsep);
 #endif
 
-/**
- * sysfs_streq - return true if strings are equal, modulo trailing newline
- * @s1: one string
- * @s2: another string
- *
- * This routine returns true iff two strings are equal, treating both
- * NUL and newline-then-NUL as equivalent string terminations.  It's
- * geared for use with sysfs input strings, which generally terminate
- * with newlines but are compared against values without newlines.
- */
-bool sysfs_streq(const char *s1, const char *s2)
-{
-   while (*s1 && *s1 == *s2) {
-   s1++;
-   s2++;
-   }
-
-   if (*s1 == *s2)
-   return true;
-   if (!*s1 && *s2 == '\n' && !s2[1])
-   return true;
-   if (*s1 == '\n' && !s1[1] && !*s2)
-   return true;
-   return false;
-}
-EXPORT_SYMBOL(sysfs_streq);
-
-/**
- * match_string - matches given string in an array
- * @array: array of strings
- * @n: number of strings in the array or -1 for NULL terminated arrays
- * @string:string to match with
- *
- * Return:
- * index of a @string in the @array if matches, or %-EINVAL otherwise.
- */
-int match_string(const char * const *array, size_t n, const char *string)
-{
-   int index;
-   const char *item;
-
-   for (index = 0; index < n; index++) {
-   item = array[index];
-   if (!item)
-   break;
-   if (!strcmp(item, string))
-   return index;
-   }
-
-   return -EINVAL;
-}
-EXPORT_SYMBOL(match_string);
-
-/**
- * __sysfs_match_string - matches given string in an array
- * @array: array of strings
- * @n: number of strings in the array or -1 for NULL terminated arrays
- * @str: string to match with
- *
- * Returns index of @str in the @array or -EINVAL, just like match_string().
- * Uses sysfs_streq instead of strcmp for matching.
- */
-int __sysfs_match_string(const char * const *array, size_t n, const char *str)
-{
-   const char *item;
-   int index;
-
-   for (index = 0; index < n; index++) {
-   item = array[index];
-   if (!item)
-   break;
-   if (sysfs_streq(item, str))
-   return index;
-   }
-
-   return -EINVAL;
-}
-EXPORT_SYMBOL(__sysfs_match_string);
-
 #ifndef __HAVE_ARCH_MEMSET
 /**
  * memset - Fill a region of memory with the given value
diff --git a/lib/string_sysfs.c b/lib/string_sysfs.c
new file mode 100644
index ..f2dd384be20d
--- /dev/null
+++ b/lib/string_sysfs.c
@@ -0,0 +1,61 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * stupid library routines for sysfs
+ *
+ */
+
+#include 
+#include 
+#include 
+
+/**
+ * match_string - matches given string in an array
+ * @array: array of strings
+ * @n: number of strings in the array or -1 for NULL terminated arrays
+ * @string:string to match with
+ *
+ * Return:
+ * index of a @string in the @array if matches, or %-EINVAL otherwise.
+ */
+int match_string(const char * const *array, size_t n, const char *string)
+{
+   int index;
+   const char *item;
+
+   for (index = 0; index < n; index++) {
+   item = array[index];
+   if (!item)
+   break;
+   if (!strcmp(item, string))
+   return index;
+   }
+
+   return -EINVAL;
+}
+EXPORT_SYMBOL(match_string);
+
+/**
+ * __sysfs_match_string - matches given string in an array
+ * @array: array of strings
+ * @n: number of strings in the array or -1 for 

Re: [PATCH] crypto: vmx - fix copy-paste error in CTR mode

2019-03-22 Thread Herbert Xu
On Fri, Mar 15, 2019 at 01:09:01PM +1100, Daniel Axtens wrote:
> The original assembly imported from OpenSSL has two copy-paste
> errors in handling CTR mode. When dealing with a 2 or 3 block tail,
> the code branches to the CBC decryption exit path, rather than to
> the CTR exit path.
> 
> This leads to corruption of the IV, which leads to subsequent blocks
> being corrupted.
> 
> This can be detected with libkcapi test suite, which is available at
> https://github.com/smuellerDD/libkcapi
> 
> Reported-by: Ondrej Mosnáček 
> Fixes: 5c380d623ed3 ("crypto: vmx - Add support for VMS instructions by ASM")
> Cc: sta...@vger.kernel.org
> Signed-off-by: Daniel Axtens 
> ---
>  drivers/crypto/vmx/aesp8-ppc.pl | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

Patch applied.  Thanks.
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH 1/2] cpuidle : auto-promotion for cpuidle states

2019-03-22 Thread Daniel Lezcano
On 22/03/2019 10:45, Rafael J. Wysocki wrote:
> On Fri, Mar 22, 2019 at 8:31 AM Abhishek Goel
>  wrote:
>>
>> Currently, the cpuidle governors (menu /ladder) determine what idle state
>> an idling CPU should enter into based on heuristics that depend on the
>> idle history on that CPU. Given that no predictive heuristic is perfect,
>> there are cases where the governor predicts a shallow idle state, hoping
>> that the CPU will be busy soon. However, if no new workload is scheduled
>> on that CPU in the near future, the CPU will end up in the shallow state.
>>
>> In case of POWER, this is problematic, when the predicted state in the
>> aforementioned scenario is a lite stop state, as such lite states will
>> inhibit SMT folding, thereby depriving the other threads in the core from
>> using the core resources.
>>
>> To address this, such lite states need to be autopromoted. The cpuidle-
>> core can queue timer to correspond with the residency value of the next
>> available state. Thus leading to auto-promotion to a deeper idle state as
>> soon as possible.
> 
> Isn't the tick stopping avoidance sufficient for that?

I was about to ask the same :)




-- 
  Linaro.org │ Open source software for ARM SoCs

Follow Linaro:   Facebook |
 Twitter |
 Blog



[GIT PULL] Please pull powerpc/linux.git powerpc-5.1-3 tag

2019-03-22 Thread Michael Ellerman
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Hi Linus,

Please pull some powerpc fixes for 5.1:

The following changes since commit 9e98c678c2d6ae3a17cb2de55d17f69dddaa231b:

  Linux 5.1-rc1 (2019-03-17 14:22:26 -0700)

are available in the git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git 
tags/powerpc-5.1-3

for you to fetch changes up to 92edf8df0ff2ae86cc632eeca0e651fd8431d40d:

  powerpc/security: Fix spectre_v2 reporting (2019-03-21 21:09:03 +1100)

- --
powerpc fixes for 5.1 #3

One fix for a boot failure on 32-bit, introduced during the merge window.

A fix for our handling of CLOCK_MONOTONIC in the 64-bit VDSO. Changing the wall
clock across the Y2038 boundary could cause CLOCK_MONOTONIC to jump forward and
backward.

Our spectre_v2 reporting was a bit confusing due to a bug I introduced. On some
systems it was reporting that the count cache was disabled and also that we were
flushing the count cache on context switch. Only the former is true, and given
that the count cache is disabled it doesn't make any sense to flush it. No one
reported it, so presumably the presence of any mitigation is all people check
for.

Finally a small build fix for zsmalloc on 32-bit.

Thanks to:
  Ben Hutchings, Christophe Leroy, Diana Craciun, Guenter Roeck, Michael 
Neuling.

- --
Ben Hutchings (1):
  powerpc/mm: Only define MAX_PHYSMEM_BITS in SPARSEMEM configurations

Christophe Leroy (1):
  powerpc/6xx: fix setup and use of SPRN_SPRG_PGDIR for hash32

Michael Ellerman (2):
  powerpc/vdso64: Fix CLOCK_MONOTONIC inconsistencies across Y2038
  powerpc/security: Fix spectre_v2 reporting


 arch/powerpc/include/asm/mmu.h|  2 +-
 arch/powerpc/include/asm/vdso_datapage.h  |  8 
 arch/powerpc/kernel/cpu_setup_6xx.S   |  3 ---
 arch/powerpc/kernel/head_32.S |  6 ++
 arch/powerpc/kernel/security.c| 23 ---
 arch/powerpc/kernel/vdso64/gettimeofday.S |  4 ++--
 arch/powerpc/mm/hash_low_32.S |  8 
 7 files changed, 25 insertions(+), 29 deletions(-)
-BEGIN PGP SIGNATURE-

iQIcBAEBAgAGBQJclNsLAAoJEFHr6jzI4aWAQlAQAKgQgvPoMIBKObYsjQ6KK6iN
lgXCrU+WuI+7f6kMMeFWP8sOeT4nA+f7ejoXJDog96Z3oBG63sZuILTlifFetFbp
0ptJA+AC0DS0k//yv2pMweZVWq2jR7Jfr213inZxwJW6NzvI9m54QK5eUXv++dkk
Q0H/PhMxNTnP0HKBYRWKSkBSvhCZd6zex5hRZFkXVfDwe6fhpYSkObInGlt2rN4s
u3NJIZLS1zqYOyx/VPwkUCsePmdqdR0/qFBYT191iFce3lmdrKociFt9/mJKkqj6
DYVbJljxJtoZ0iIztHdStvHBpbC0kaaUHTNnKEjX2Q2xL7oitiOyOa5gT98Cs8Q1
ZHfNPidZhyhdRRwIgpDKECIE7xldhG/4icTg0a7LnufjpVrbc8idUU8Hm2oaGuvu
SlytOO0AAepPFqsTy/IeKo5cT1TNhjqcPe0twxx5nOHtaY0vhtZ9azn6B/2AFfIb
y/hX6rFoIOzBDlmdRS63EdtQr9negUilUvonCWkY5luo2ypvGcNVnxdrTQww5yPA
geahb+HRm+dJb22lVp6sONHXJZRfZijBi6jJPPhSRjAVmaibCUvxYiN9MvSFQT/U
iwsMtS4dpvX8WISijEfCuiZNBEjGSoUQEIwPWqRtgaqZbHYfBFZkwrb6yGF/YjmP
wVjdhrCcut+qz2NbbDlv
=PXK6
-END PGP SIGNATURE-


[PATCH v3] powerpc/64: Fix memcmp reading past the end of src/dest

2019-03-22 Thread Michael Ellerman
Chandan reported that fstests' generic/026 test hit a crash:

  BUG: Unable to handle kernel data access at 0xc0062ac4
  Faulting instruction address: 0xc0092240
  Oops: Kernel access of bad area, sig: 11 [#1]
  LE SMP NR_CPUS=2048 DEBUG_PAGEALLOC NUMA pSeries
  CPU: 0 PID: 27828 Comm: chacl Not tainted 
5.0.0-rc2-next-20190115-1-g6de6dba64dda #1
  NIP:  c0092240 LR: c066a55c CTR: 
  REGS: c0062c0c3430 TRAP: 0300   Not tainted  
(5.0.0-rc2-next-20190115-1-g6de6dba64dda)
  MSR:  82009033   CR: 44000842  XER: 2000
  CFAR: 7fff7f3108ac DAR: c0062ac4 DSISR: 4000 IRQMASK: 0
  GPR00:  c0062c0c36c0 c17f4c00 c121a660
  GPR04: c0062ac3fff9 0004 0020 275b19c4
  GPR08: 000c 46494c45 5347495f41434c5f c26073a0
  GPR12:  c27a  
  GPR16:    
  GPR20: c0062ea70020 c0062c0c38d0 0002 0002
  GPR24: c0062ac3ffe8 275b19c4 0001 c0062ac3
  GPR28: c0062c0c38d0 c0062ac30050 c0062ac30058 
  NIP memcmp+0x120/0x690
  LR  xfs_attr3_leaf_lookup_int+0x53c/0x5b0
  Call Trace:
xfs_attr3_leaf_lookup_int+0x78/0x5b0 (unreliable)
xfs_da3_node_lookup_int+0x32c/0x5a0
xfs_attr_node_addname+0x170/0x6b0
xfs_attr_set+0x2ac/0x340
__xfs_set_acl+0xf0/0x230
xfs_set_acl+0xd0/0x160
set_posix_acl+0xc0/0x130
posix_acl_xattr_set+0x68/0x110
__vfs_setxattr+0xa4/0x110
__vfs_setxattr_noperm+0xac/0x240
vfs_setxattr+0x128/0x130
setxattr+0x248/0x600
path_setxattr+0x108/0x120
sys_setxattr+0x28/0x40
system_call+0x5c/0x70
  Instruction dump:
  7d201c28 7d402428 7c295040 38630008 38840008 408201f0 4200ffe8 2c05
  4182ff6c 20c50008 54c61838 7d201c28 <7d402428> 7d293436 7d4a3436 7c295040

The instruction dump decodes as:
  subfic  r6,r5,8
  rlwinm  r6,r6,3,0,28
  ldbrx   r9,0,r3
  ldbrx   r10,0,r4  <-

Which shows us doing an 8 byte load from c0062ac3fff9, which
crosses the page boundary at c0062ac4 and faults.

It's not OK for memcmp to read past the end of the source or
destination buffers if that would cross a page boundary, because we
don't know that the next page is mapped.

As pointed out by Segher, we can read past the end of the source or
destination as long as we don't cross a 4K boundary, because that's
our minimum page size on all platforms.

The bug is in the code at the .Lcmp_rest_lt8bytes label. When we get
there we know that s1 is 8-byte aligned and we have at least 1 byte to
read, so a single 8-byte load won't read past the end of s1 and cross
a page boundary.

But we have to be more careful with s2. So check if it's within 8
bytes of a 4K boundary and if so go to the byte-by-byte loop.

Fixes: 2d9ee327adce ("powerpc/64: Align bytes before fall back to .Lshort in 
powerpc64 memcmp()")
Cc: sta...@vger.kernel.org # v4.19+
Reported-by: Chandan Rajendra 
Signed-off-by: Michael Ellerman 
---
 arch/powerpc/lib/memcmp_64.S | 17 +
 1 file changed, 13 insertions(+), 4 deletions(-)

v3: Just check if we're crossing a 4K boundary.

Oops, I wrote this a while back but forgot to send v3.

diff --git a/arch/powerpc/lib/memcmp_64.S b/arch/powerpc/lib/memcmp_64.S
index 844d8e774492..f6554ebebeb5 100644
--- a/arch/powerpc/lib/memcmp_64.S
+++ b/arch/powerpc/lib/memcmp_64.S
@@ -215,11 +215,20 @@ _GLOBAL_TOC(memcmp)
beq .Lzero
 
 .Lcmp_rest_lt8bytes:
-   /* Here we have only less than 8 bytes to compare with. at least s1
-* Address is aligned with 8 bytes.
-* The next double words are load and shift right with appropriate
-* bits.
+   /*
+* Here we have less than 8 bytes to compare. At least s1 is aligned to
+* 8 bytes, but s2 may not be. We must make sure s2 + 8 doesn't cross a
+* page boundary, otherwise we might read past the end of the buffer and
+* trigger a page fault. We use 4K as the conservative minimum page
+* size. If we detect that case we go to the byte-by-byte loop.
+*
+* Otherwise the next double word is loaded from s1 and s2, and shifted
+* right to compare the appropriate bits.
 */
+   clrldi  r6,r4,(64-12)   // r6 = r4 & 0xfff
+   cmpdi   r6,0xff8
+   bgt .Lshort
+
subfic  r6,r5,8
slwir6,r6,3
LD  rA,0,r3
-- 
2.20.1
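
For readers less familiar with the asm, the new guard is equivalent to the
following C sketch (illustrative only, with a made-up helper name):

#include <stdbool.h>
#include <stdint.h>

/*
 * An 8-byte load starting at s2 stays within one 4K page only if the
 * offset of s2 inside the page is at most 0xff8 (0xff8 + 7 = 0xfff).
 */
static bool load8_stays_in_4k_page(const void *s2)
{
        uintptr_t off = (uintptr_t)s2 & 0xfff;  /* clrldi r6,r4,(64-12) */

        return off <= 0xff8;                    /* cmpdi r6,0xff8 ; bgt */
}

When the check fails, the byte-by-byte .Lshort path is taken instead of the
unaligned double-word load.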



Re: powerpc/security: Fix spectre_v2 reporting

2019-03-22 Thread Michael Ellerman
On Thu, 2019-03-21 at 04:24:33 UTC, Michael Ellerman wrote:
> When I updated the spectre_v2 reporting to handle software count cache
> flush I got the logic wrong when there's no software count cache
> enabled at all.
> 
> The result is that on systems with the software count cache flush
> disabled we print:
> 
>   Mitigation: Indirect branch cache disabled, Software count cache flush
> 
> Which correctly indicates that the count cache is disabled, but
> incorrectly says the software count cache flush is enabled.
> 
> The root of the problem is that we are trying to handle all
> combinations of options. But we know now that we only expect to see
> the software count cache flush enabled if the other options are false.
> 
> So split the two cases, which simplifies the logic and fixes the bug.
> We were also missing a space before "(hardware accelerated)".
> 
> The result is we see one of:
> 
>   Mitigation: Indirect branch serialisation (kernel only)
>   Mitigation: Indirect branch cache disabled
>   Mitigation: Software count cache flush
>   Mitigation: Software count cache flush (hardware accelerated)
> 
> Fixes: ee13cb249fab ("powerpc/64s: Add support for software count cache 
> flush")
> Cc: sta...@vger.kernel.org # v4.19+
> Signed-off-by: Michael Ellerman 
> Reviewed-by: Michael Neuling 
> Reviewed-by: Diana Craciun 

Applied to powerpc fixes.

https://git.kernel.org/powerpc/c/92edf8df0ff2ae86cc632eeca0e651fd

cheers
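
For reference, the reworked logic amounts to something like the following
sketch (variable names and the helper are illustrative, not the actual hunk;
only the strings are taken from the commit message above). The software
count cache flush branch can only be reached when neither of the other two
mitigations is active, so the misleading combined string can no longer be
printed:

#include <linux/seq_buf.h>
#include <linux/types.h>

enum count_cache_flush { FLUSH_NONE, FLUSH_SW, FLUSH_HW };

static void report_spectre_v2(struct seq_buf *s, bool bcs, bool ccd,
                              enum count_cache_flush flush)
{
        if (bcs || ccd) {
                seq_buf_printf(s, "Mitigation: ");
                if (bcs)
                        seq_buf_printf(s, "Indirect branch serialisation (kernel only)");
                if (bcs && ccd)
                        seq_buf_printf(s, ", ");
                if (ccd)
                        seq_buf_printf(s, "Indirect branch cache disabled");
        } else if (flush != FLUSH_NONE) {
                seq_buf_printf(s, "Mitigation: Software count cache flush");
                if (flush == FLUSH_HW)
                        seq_buf_printf(s, " (hardware accelerated)");
        }
}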


Re: powerpc/mm: Only define MAX_PHYSMEM_BITS in SPARSEMEM configurations

2019-03-22 Thread Michael Ellerman
On Sun, 2019-03-17 at 01:17:56 UTC, Ben Hutchings wrote:
> MAX_PHYSMEM_BITS only needs to be defined if CONFIG_SPARSEMEM is
> enabled, and that was the case before commit 4ffe713b7587
> ("powerpc/mm: Increase the max addressable memory to 2PB").
> 
> On 32-bit systems, where CONFIG_SPARSEMEM is not enabled, we now
> define it as 46.  That is larger than the real number of physical
> address bits, and breaks calculations in zsmalloc:
> 
> mm/zsmalloc.c:130:49: warning: right shift count is negative 
> [-Wshift-count-negative]
>   MAX(32, (ZS_MAX_PAGES_PER_ZSPAGE << PAGE_SHIFT >> OBJ_INDEX_BITS))
>  ^~
> ...
> mm/zsmalloc.c:253:21: error: variably modified 'size_class' at file scope
>   struct size_class *size_class[ZS_SIZE_CLASSES];
>  ^~
> 
> Fixes: 4ffe713b7587 ("powerpc/mm: Increase the max addressable memory to 2PB")
> Cc: sta...@vger.kernel.org
> Signed-off-by: Ben Hutchings 

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/8bc086899816214fbc6047c9c7e15fca

cheers
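
The shape of the fix is roughly the following sketch (the exact hunk is in
the commit linked above; the value shown is the one quoted in the commit
message):

/*
 * Only SPARSEMEM code consumes MAX_PHYSMEM_BITS, so don't advertise
 * 46 address bits to 32-bit non-SPARSEMEM builds, where it breaks
 * zsmalloc's shift calculations.
 */
#ifdef CONFIG_SPARSEMEM
#define MAX_PHYSMEM_BITS        46
#endif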


Re: [PATCH 1/2] cpuidle : auto-promotion for cpuidle states

2019-03-22 Thread Rafael J. Wysocki
On Fri, Mar 22, 2019 at 8:31 AM Abhishek Goel
 wrote:
>
> Currently, the cpuidle governors (menu /ladder) determine what idle state
> an idling CPU should enter into based on heuristics that depend on the
> idle history on that CPU. Given that no predictive heuristic is perfect,
> there are cases where the governor predicts a shallow idle state, hoping
> that the CPU will be busy soon. However, if no new workload is scheduled
> on that CPU in the near future, the CPU will end up in the shallow state.
>
> In case of POWER, this is problematic, when the predicted state in the
> aforementioned scenario is a lite stop state, as such lite states will
> inhibit SMT folding, thereby depriving the other threads in the core from
> using the core resources.
>
> To address this, such lite states need to be autopromoted. The cpuidle-
> core can queue timer to correspond with the residency value of the next
> available state. Thus leading to auto-promotion to a deeper idle state as
> soon as possible.

Isn't the tick stopping avoidance sufficient for that?


[RFC PATCH 3/3] powernv/mce: print additional information about mce error.

2019-03-22 Thread Mahesh J Salgaonkar
From: Mahesh Salgaonkar 

Print more information about the mce error, i.e. whether it is a hardware
or software error.

Some of the mce errors can easily be categorized as hardware or software
errors, e.g. UEs are due to hardware errors, whereas an error triggered by
invalid usage of tlbie is a pure software bug. But not all mce errors can
easily be categorized as either software or hardware. There are errors
like multihit errors which are usually the result of a software bug, but in
some rare cases a hardware failure can cause a multihit error. In the past,
we have seen a case where, after replacing a faulty chip, multihit errors
stopped occurring. The same goes for parity errors, which are usually due to
faulty hardware, but there is a chance that a multihit can also cause a
parity error. For such errors it is difficult to determine what really
caused them. Hence this patch classifies mce errors into the following four
categories:
1. Hardware error:
UE and Link timeout failure errors.
2. Hardware error, small probability of software cause:
SLB/ERAT/TLB Parity errors.
3. Software error
Invalid tlbie form.
4. Software error, small probability of hardware failure
SLB/ERAT/TLB Multihit errors.

Sample o/p:

[ 1259.331319] MCE: CPU40: (Warning) Guest SLB Multihit at 7fff9a59dc60 
DAR: 01003d740320 [Recovered]
[ 1259.331324] MCE: CPU40: PID: 24051 Comm: qemu-system-ppc
[ 1259.331345] MCE: CPU40: Software error, small probability of hardware failure

Signed-off-by: Mahesh Salgaonkar 
---
 arch/powerpc/include/asm/mce.h  |   10 
 arch/powerpc/kernel/mce.c   |   12 
 arch/powerpc/kernel/mce_power.c |  107 +++
 3 files changed, 86 insertions(+), 43 deletions(-)

diff --git a/arch/powerpc/include/asm/mce.h b/arch/powerpc/include/asm/mce.h
index 314ed3f13d59..cef5f3c50a5c 100644
--- a/arch/powerpc/include/asm/mce.h
+++ b/arch/powerpc/include/asm/mce.h
@@ -56,6 +56,14 @@ enum MCE_ErrorType {
MCE_ERROR_TYPE_LINK = 7,
 };
 
+enum MCE_ErrorClass {
+   MCE_ECLASS_UNKNOWN = 0,
+   MCE_ECLASS_HARDWARE,
+   MCE_ECLASS_HARD_INDETERMINATE,
+   MCE_ECLASS_SOFTWARE,
+   MCE_ECLASS_SOFT_INDETERMINATE,
+};
+
 enum MCE_UeErrorType {
MCE_UE_ERROR_INDETERMINATE = 0,
MCE_UE_ERROR_IFETCH = 1,
@@ -115,6 +123,7 @@ struct machine_check_event {
enum MCE_Severity   severity:8;
enum MCE_Initiator  initiator:8;
enum MCE_ErrorType  error_type:8;
+   enum MCE_ErrorClass error_class:8;
enum MCE_Dispositiondisposition:8;
uint8_t sync_error;
uint16_tcpu;
@@ -195,6 +204,7 @@ struct mce_error_info {
} u;
enum MCE_Severity   severity:8;
enum MCE_Initiator  initiator:8;
+   enum MCE_ErrorClass error_class:8;
uint8_t sync_error;
 };
 
diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
index 588a280a8a4a..1ec7ba7c766d 100644
--- a/arch/powerpc/kernel/mce.c
+++ b/arch/powerpc/kernel/mce.c
@@ -123,6 +123,7 @@ void save_mce_event(struct pt_regs *regs, long handled,
mce->initiator = mce_err->initiator;
mce->severity = mce_err->severity;
mce->sync_error = mce_err->sync_error;
+   mce->error_class = mce_err->error_class;
 
/*
 * Populate the mce error_type and type-specific error_type.
@@ -361,6 +362,13 @@ void machine_check_print_event_info(struct 
machine_check_event *evt,
"Store (timeout)",
"Page table walk Load/Store (timeout)",
};
+   static const char *mc_error_class[] = {
+   "Unknown",
+   "Hardware error",
+   "Hardware error, small probability of software cause",
+   "Software error",
+   "Software error, small probability of hardware failure",
+   };
 
/* Print things out */
if (evt->version != MCE_V1) {
@@ -482,6 +490,10 @@ void machine_check_print_event_info(struct 
machine_check_event *evt,
printk("%sMCE: CPU%d: NIP: [%016llx] %pS\n",
level, evt->cpu, evt->srr0, (void *)evt->srr0);
}
+
+   subtype = evt->error_class < ARRAY_SIZE(mc_error_class) ?
+   mc_error_class[evt->error_class] : "Unknown";
+   printk("%sMCE: CPU%d: %s\n", level, evt->cpu, subtype);
 }
 EXPORT_SYMBOL_GPL(machine_check_print_event_info);
 
diff --git a/arch/powerpc/kernel/mce_power.c b/arch/powerpc/kernel/mce_power.c
index 06161de19060..adeed82e59c9 100644
--- a/arch/powerpc/kernel/mce_power.c
+++ b/arch/powerpc/kernel/mce_power.c
@@ -131,6 +131,7 @@ struct mce_ierror_table {
bool nip_valid; /* nip is a valid indicator of faulting address */
unsigned int error_type;
unsigned int error_subtype;
+   unsigned int error_class;
unsigned int initiator;
unsigned int 

[RFC PATCH 2/3] powernv/mce: Print correct severity for mce error.

2019-03-22 Thread Mahesh J Salgaonkar
From: Mahesh Salgaonkar 

Currently all machine check errors are printed as severe errors, which isn't
correct. Print soft errors as warnings instead of severe errors.

Signed-off-by: Mahesh Salgaonkar 
---
 arch/powerpc/include/asm/mce.h|   26 +++---
 arch/powerpc/kernel/mce.c |5 +
 arch/powerpc/kernel/mce_power.c   |  143 +
 arch/powerpc/platforms/powernv/opal.c |2 
 4 files changed, 92 insertions(+), 84 deletions(-)

diff --git a/arch/powerpc/include/asm/mce.h b/arch/powerpc/include/asm/mce.h
index 8d0b1c24c636..314ed3f13d59 100644
--- a/arch/powerpc/include/asm/mce.h
+++ b/arch/powerpc/include/asm/mce.h
@@ -31,7 +31,7 @@ enum MCE_Version {
 enum MCE_Severity {
MCE_SEV_NO_ERROR = 0,
MCE_SEV_WARNING = 1,
-   MCE_SEV_ERROR_SYNC = 2,
+   MCE_SEV_SEVERE = 2,
MCE_SEV_FATAL = 3,
 };
 
@@ -110,17 +110,18 @@ enum MCE_LinkErrorType {
 };
 
 struct machine_check_event {
-   enum MCE_Versionversion:8;  /* 0x00 */
-   uint8_t in_use; /* 0x01 */
-   enum MCE_Severity   severity:8; /* 0x02 */
-   enum MCE_Initiator  initiator:8;/* 0x03 */
-   enum MCE_ErrorType  error_type:8;   /* 0x04 */
-   enum MCE_Dispositiondisposition:8;  /* 0x05 */
-   uint16_tcpu;/* 0x06 */
-   uint64_tgpr3;   /* 0x08 */
-   uint64_tsrr0;   /* 0x10 */
-   uint64_tsrr1;   /* 0x18 */
-   union { /* 0x20 */
+   enum MCE_Versionversion:8;
+   uint8_t in_use;
+   enum MCE_Severity   severity:8;
+   enum MCE_Initiator  initiator:8;
+   enum MCE_ErrorType  error_type:8;
+   enum MCE_Dispositiondisposition:8;
+   uint8_t sync_error;
+   uint16_tcpu;
+   uint64_tgpr3;
+   uint64_tsrr0;
+   uint64_tsrr1;
+   union {
struct {
enum MCE_UeErrorType ue_error_type:8;
uint8_t effective_address_provided;
@@ -194,6 +195,7 @@ struct mce_error_info {
} u;
enum MCE_Severity   severity:8;
enum MCE_Initiator  initiator:8;
+   uint8_t sync_error;
 };
 
 #define MAX_MC_EVT 100
diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
index 44614462cb34..588a280a8a4a 100644
--- a/arch/powerpc/kernel/mce.c
+++ b/arch/powerpc/kernel/mce.c
@@ -122,6 +122,7 @@ void save_mce_event(struct pt_regs *regs, long handled,
 
mce->initiator = mce_err->initiator;
mce->severity = mce_err->severity;
+   mce->sync_error = mce_err->sync_error;
 
/*
 * Populate the mce error_type and type-specific error_type.
@@ -374,9 +375,9 @@ void machine_check_print_event_info(struct 
machine_check_event *evt,
break;
case MCE_SEV_WARNING:
level = KERN_WARNING;
-   sevstr = "";
+   sevstr = "Warning";
break;
-   case MCE_SEV_ERROR_SYNC:
+   case MCE_SEV_SEVERE:
level = KERN_ERR;
sevstr = "Severe";
break;
diff --git a/arch/powerpc/kernel/mce_power.c b/arch/powerpc/kernel/mce_power.c
index 6b800eec31f2..06161de19060 100644
--- a/arch/powerpc/kernel/mce_power.c
+++ b/arch/powerpc/kernel/mce_power.c
@@ -133,106 +133,107 @@ struct mce_ierror_table {
unsigned int error_subtype;
unsigned int initiator;
unsigned int severity;
+   bool sync_error;
 };
 
 static const struct mce_ierror_table mce_p7_ierror_table[] = {
 { 0x001c, 0x0004, true,
   MCE_ERROR_TYPE_UE,  MCE_UE_ERROR_IFETCH,
-  MCE_INITIATOR_CPU,  MCE_SEV_ERROR_SYNC, },
+  MCE_INITIATOR_CPU,  MCE_SEV_SEVERE, true },
 { 0x001c, 0x0008, true,
   MCE_ERROR_TYPE_SLB, MCE_SLB_ERROR_PARITY,
-  MCE_INITIATOR_CPU,  MCE_SEV_ERROR_SYNC, },
+  MCE_INITIATOR_CPU,  MCE_SEV_SEVERE, true },
 { 0x001c, 0x000c, true,
   MCE_ERROR_TYPE_SLB, MCE_SLB_ERROR_MULTIHIT,
-  MCE_INITIATOR_CPU,  MCE_SEV_ERROR_SYNC, },
+  MCE_INITIATOR_CPU,  MCE_SEV_WARNING, true },
 { 0x001c, 0x0010, true,
   MCE_ERROR_TYPE_SLB, MCE_SLB_ERROR_INDETERMINATE, /* BOTH */
-  MCE_INITIATOR_CPU,  MCE_SEV_ERROR_SYNC, },
+  MCE_INITIATOR_CPU,  MCE_SEV_WARNING, true },
 { 0x001c, 0x0014, true,
   MCE_ERROR_TYPE_TLB, MCE_TLB_ERROR_MULTIHIT,
-  MCE_INITIATOR_CPU,  MCE_SEV_ERROR_SYNC, },
+  MCE_INITIATOR_CPU,  MCE_SEV_WARNING, true },
 { 0x001c, 0x0018, true,
   MCE_ERROR_TYPE_UE,  MCE_UE_ERROR_PAGE_TABLE_WALK_IFETCH,
-  MCE_INITIATOR_CPU,  MCE_SEV_ERROR_SYNC, },
+  MCE_INITIATOR_CPU,  MCE_SEV_SEVERE, true },
 { 0x001c, 

[RFC PATCH 1/3] powernv/mce: reduce mce console logs to fewer lines.

2019-03-22 Thread Mahesh J Salgaonkar
From: Mahesh Salgaonkar 

Also add the cpu number while displaying the mce log. This will help produce
cleaner logs when an mce hits on multiple cpus simultaneously.

Signed-off-by: Mahesh Salgaonkar 
---
 arch/powerpc/include/asm/mce.h |2 -
 arch/powerpc/kernel/mce.c  |   86 
 2 files changed, 45 insertions(+), 43 deletions(-)

diff --git a/arch/powerpc/include/asm/mce.h b/arch/powerpc/include/asm/mce.h
index 17996bc9382b..8d0b1c24c636 100644
--- a/arch/powerpc/include/asm/mce.h
+++ b/arch/powerpc/include/asm/mce.h
@@ -116,7 +116,7 @@ struct machine_check_event {
enum MCE_Initiator  initiator:8;/* 0x03 */
enum MCE_ErrorType  error_type:8;   /* 0x04 */
enum MCE_Dispositiondisposition:8;  /* 0x05 */
-   uint8_t reserved_1[2];  /* 0x06 */
+   uint16_tcpu;/* 0x06 */
uint64_tgpr3;   /* 0x08 */
uint64_tsrr0;   /* 0x10 */
uint64_tsrr1;   /* 0x18 */
diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
index b5fec1f9751a..44614462cb34 100644
--- a/arch/powerpc/kernel/mce.c
+++ b/arch/powerpc/kernel/mce.c
@@ -112,6 +112,7 @@ void save_mce_event(struct pt_regs *regs, long handled,
mce->srr1 = regs->msr;
mce->gpr3 = regs->gpr[3];
mce->in_use = 1;
+   mce->cpu = get_paca()->paca_index;
 
/* Mark it recovered if we have handled it and MSR(RI=1). */
if (handled && (regs->msr & MSR_RI))
@@ -310,7 +311,9 @@ static void machine_check_process_queued_event(struct 
irq_work *work)
 void machine_check_print_event_info(struct machine_check_event *evt,
bool user_mode, bool in_guest)
 {
-   const char *level, *sevstr, *subtype;
+   const char *level, *sevstr, *subtype, *err_type;
+   uint64_t ea = 0;
+   char dar_str[50];
static const char *mc_ue_types[] = {
"Indeterminate",
"Instruction fetch",
@@ -384,101 +387,100 @@ void machine_check_print_event_info(struct 
machine_check_event *evt,
break;
}
 
-   printk("%s%s Machine check interrupt [%s]\n", level, sevstr,
-  evt->disposition == MCE_DISPOSITION_RECOVERED ?
-  "Recovered" : "Not recovered");
-
-   if (in_guest) {
-   printk("%s  Guest NIP: %016llx\n", level, evt->srr0);
-   } else if (user_mode) {
-   printk("%s  NIP: [%016llx] PID: %d Comm: %s\n", level,
-   evt->srr0, current->pid, current->comm);
-   } else {
-   printk("%s  NIP [%016llx]: %pS\n", level, evt->srr0,
-  (void *)evt->srr0);
-   }
-
-   printk("%s  Initiator: %s\n", level,
-  evt->initiator == MCE_INITIATOR_CPU ? "CPU" : "Unknown");
switch (evt->error_type) {
case MCE_ERROR_TYPE_UE:
+   err_type = "UE";
subtype = evt->u.ue_error.ue_error_type <
ARRAY_SIZE(mc_ue_types) ?
mc_ue_types[evt->u.ue_error.ue_error_type]
: "Unknown";
-   printk("%s  Error type: UE [%s]\n", level, subtype);
if (evt->u.ue_error.effective_address_provided)
-   printk("%sEffective address: %016llx\n",
-  level, evt->u.ue_error.effective_address);
-   if (evt->u.ue_error.physical_address_provided)
-   printk("%sPhysical address:  %016llx\n",
-  level, evt->u.ue_error.physical_address);
+   ea = evt->u.ue_error.effective_address;
break;
case MCE_ERROR_TYPE_SLB:
+   err_type = "SLB";
subtype = evt->u.slb_error.slb_error_type <
ARRAY_SIZE(mc_slb_types) ?
mc_slb_types[evt->u.slb_error.slb_error_type]
: "Unknown";
-   printk("%s  Error type: SLB [%s]\n", level, subtype);
if (evt->u.slb_error.effective_address_provided)
-   printk("%sEffective address: %016llx\n",
-  level, evt->u.slb_error.effective_address);
+   ea = evt->u.slb_error.effective_address;
break;
case MCE_ERROR_TYPE_ERAT:
+   err_type = "ERAT";
subtype = evt->u.erat_error.erat_error_type <
ARRAY_SIZE(mc_erat_types) ?
mc_erat_types[evt->u.erat_error.erat_error_type]
: "Unknown";
-   printk("%s  Error type: ERAT [%s]\n", level, subtype);
if (evt->u.erat_error.effective_address_provided)
-   printk("%sEffective address: %016llx\n",
-  level, 

[PATCH 7/7] powerpc/setup: replace ifdefs by IS_ENABLED() wherever possible.

2019-03-22 Thread Christophe Leroy
Compared to #ifdefs, IS_ENABLED() provides cleaner code and allows
compilation failures to be detected regardless of the selected options.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/setup-common.c | 39 ++
 1 file changed, 18 insertions(+), 21 deletions(-)

diff --git a/arch/powerpc/kernel/setup-common.c 
b/arch/powerpc/kernel/setup-common.c
index b6c86287085a..6a936cb98b79 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -196,14 +196,15 @@ static void show_cpuinfo_summary(struct seq_file *m)
 {
struct device_node *root;
const char *model = NULL;
-#if defined(CONFIG_SMP) && defined(CONFIG_PPC32)
unsigned long bogosum = 0;
int i;
-   for_each_online_cpu(i)
-   bogosum += loops_per_jiffy;
-   seq_printf(m, "total bogomips\t: %lu.%02lu\n",
-  bogosum/(50/HZ), bogosum/(5000/HZ) % 100);
-#endif /* CONFIG_SMP && CONFIG_PPC32 */
+
+   if (IS_ENABLED(CONFIG_SMP) && IS_ENABLED(CONFIG_PPC32)) {
+   for_each_online_cpu(i)
+   bogosum += loops_per_jiffy;
+   seq_printf(m, "total bogomips\t: %lu.%02lu\n",
+  bogosum / (50 / HZ), bogosum / (5000 / HZ) % 
100);
+   }
seq_printf(m, "timebase\t: %lu\n", ppc_tb_freq);
if (ppc_md.name)
seq_printf(m, "platform\t: %s\n", ppc_md.name);
@@ -217,11 +218,10 @@ static void show_cpuinfo_summary(struct seq_file *m)
if (ppc_md.show_cpuinfo != NULL)
ppc_md.show_cpuinfo(m);
 
-#ifdef CONFIG_PPC32
/* Display the amount of memory */
-   seq_printf(m, "Memory\t\t: %d MB\n",
-  (unsigned int)(total_memory / (1024 * 1024)));
-#endif
+   if (IS_ENABLED(CONFIG_PPC32))
+   seq_printf(m, "Memory\t\t: %d MB\n",
+  (unsigned int)(total_memory / (1024 * 1024)));
 }
 
 static int show_cpuinfo(struct seq_file *m, void *v)
@@ -329,11 +329,10 @@ static int show_cpuinfo(struct seq_file *m, void *v)
seq_printf(m, "revision\t: %hd.%hd (pvr %04x %04x)\n",
   maj, min, PVR_VER(pvr), PVR_REV(pvr));
 
-#ifdef CONFIG_PPC32
-   seq_printf(m, "bogomips\t: %lu.%02lu\n",
-  loops_per_jiffy / (50/HZ),
-  (loops_per_jiffy / (5000/HZ)) % 100);
-#endif
+   if (IS_ENABLED(CONFIG_PPC32))
+   seq_printf(m, "bogomips\t: %lu.%02lu\n", loops_per_jiffy / 
(50 / HZ),
+  (loops_per_jiffy / (5000 / HZ)) % 100);
+
seq_printf(m, "\n");
 
/* If this is the last cpu, print the summary */
@@ -957,9 +956,9 @@ void __init setup_arch(char **cmdline_p)
 
early_memtest(min_low_pfn << PAGE_SHIFT, max_low_pfn << PAGE_SHIFT);
 
-#ifdef CONFIG_DUMMY_CONSOLE
-   conswitchp = _con;
-#endif
+   if (IS_ENABLED(CONFIG_DUMMY_CONSOLE))
+   conswitchp = _con;
+
if (ppc_md.setup_arch)
ppc_md.setup_arch();
 
@@ -971,10 +970,8 @@ void __init setup_arch(char **cmdline_p)
/* Initialize the MMU context management stuff. */
mmu_context_init();
 
-#ifdef CONFIG_PPC64
/* Interrupt code needs to be 64K-aligned. */
-   if ((unsigned long)_stext & 0x)
+   if (IS_ENABLED(CONFIG_PPC64) && (unsigned long)_stext & 0x)
panic("Kernelbase not 64K-aligned (0x%lx)!\n",
  (unsigned long)_stext);
-#endif
 }
-- 
2.13.3
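
As a side note on why IS_ENABLED() is more than cosmetic, a small
illustrative example (hypothetical helper, not from the patch): the guarded
branch is still parsed and type-checked when the option is off and only
discarded as dead code afterwards, so every build catches breakage in
rarely-built configurations:

#include <linux/kconfig.h>
#include <linux/seq_file.h>

/*
 * Hypothetical helper mirroring one of the hunks above: the PPC32-only
 * branch is always compiled, but generates no code when CONFIG_PPC32=n.
 */
static void show_memory(struct seq_file *m, unsigned long total_memory)
{
        if (IS_ENABLED(CONFIG_PPC32))
                seq_printf(m, "Memory\t\t: %d MB\n",
                           (unsigned int)(total_memory / (1024 * 1024)));
}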



[PATCH 6/7] powerpc/setup: cleanup the #ifdef CONFIG_TAU block

2019-03-22 Thread Christophe Leroy
Use cpu_has_feature() instead of opencoding

Use IS_ENABLED() instead of #ifdef for CONFIG_TAU_AVERAGE

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/setup-common.c | 24 
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/kernel/setup-common.c 
b/arch/powerpc/kernel/setup-common.c
index 95d545e94c28..b6c86287085a 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -254,18 +254,18 @@ static int show_cpuinfo(struct seq_file *m, void *v)
seq_printf(m, "\n");
 
 #ifdef CONFIG_TAU
-   if (cur_cpu_spec->cpu_features & CPU_FTR_TAU) {
-#ifdef CONFIG_TAU_AVERAGE
-   /* more straightforward, but potentially misleading */
-   seq_printf(m,  "temperature \t: %u C (uncalibrated)\n",
-  cpu_temp(cpu_id));
-#else
-   /* show the actual temp sensor range */
-   u32 temp;
-   temp = cpu_temp_both(cpu_id);
-   seq_printf(m, "temperature \t: %u-%u C (uncalibrated)\n",
-  temp & 0xff, temp >> 16);
-#endif
+   if (cpu_has_feature(CPU_FTR_TAU)) {
+   if (IS_ENABLED(CONFIG_TAU_AVERAGE)) {
+   /* more straightforward, but potentially misleading */
+   seq_printf(m,  "temperature \t: %u C (uncalibrated)\n",
+  cpu_temp(cpu_id));
+   } else {
+   /* show the actual temp sensor range */
+   u32 temp;
+   temp = cpu_temp_both(cpu_id);
+   seq_printf(m, "temperature \t: %u-%u C 
(uncalibrated)\n",
+  temp & 0xff, temp >> 16);
+   }
}
 #endif /* CONFIG_TAU */
 
-- 
2.13.3



[PATCH 5/7] powerpc/setup: cleanup ifdef mess in check_cache_coherency()

2019-03-22 Thread Christophe Leroy
Use IS_ENABLED() instead of #ifdefs

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/setup-common.c | 10 +++---
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kernel/setup-common.c 
b/arch/powerpc/kernel/setup-common.c
index fa90585760c0..95d545e94c28 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -734,23 +734,19 @@ void __init setup_panic(void)
  * BUG() in that case.
  */
 
-#ifdef CONFIG_NOT_COHERENT_CACHE
-#define KERNEL_COHERENCY   0
-#else
-#define KERNEL_COHERENCY   1
-#endif
+#define KERNEL_COHERENCY   (!IS_ENABLED(CONFIG_NOT_COHERENT_CACHE))
 
 static int __init check_cache_coherency(void)
 {
struct device_node *np;
const void *prop;
-   int devtree_coherency;
+   bool devtree_coherency;
 
np = of_find_node_by_path("/");
prop = of_get_property(np, "coherency-off", NULL);
of_node_put(np);
 
-   devtree_coherency = prop ? 0 : 1;
+   devtree_coherency = prop ? false : true;
 
if (devtree_coherency != KERNEL_COHERENCY) {
printk(KERN_ERR
-- 
2.13.3



[PATCH 4/7] powerpc/setup: Remove unnecessary #ifdef CONFIG_ALTIVEC

2019-03-22 Thread Christophe Leroy
CPU_FTR_ALTIVEC is only set when CONFIG_ALTIVEC is selected, so
the ifdef is unnecessary.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/setup-common.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/powerpc/kernel/setup-common.c 
b/arch/powerpc/kernel/setup-common.c
index a4ed9301e815..fa90585760c0 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -248,10 +248,8 @@ static int show_cpuinfo(struct seq_file *m, void *v)
else
seq_printf(m, "unknown (%08x)", pvr);
 
-#ifdef CONFIG_ALTIVEC
if (cpu_has_feature(CPU_FTR_ALTIVEC))
seq_printf(m, ", altivec supported");
-#endif /* CONFIG_ALTIVEC */
 
seq_printf(m, "\n");
 
-- 
2.13.3



[PATCH 1/7] powerpc/fadump: define an empty fadump_cleanup()

2019-03-22 Thread Christophe Leroy
To avoid #ifdefs, define a static inline fadump_cleanup() function
when CONFIG_FA_DUMP is not selected

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/fadump.h  | 1 +
 arch/powerpc/kernel/setup-common.c | 2 --
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/fadump.h 
b/arch/powerpc/include/asm/fadump.h
index 188776befaf9..e2099c0a15c3 100644
--- a/arch/powerpc/include/asm/fadump.h
+++ b/arch/powerpc/include/asm/fadump.h
@@ -219,5 +219,6 @@ extern void fadump_cleanup(void);
 static inline int is_fadump_active(void) { return 0; }
 static inline int should_fadump_crash(void) { return 0; }
 static inline void crash_fadump(struct pt_regs *regs, const char *str) { }
+static inline void fadump_cleanup(void) { }
 #endif
 #endif
diff --git a/arch/powerpc/kernel/setup-common.c 
b/arch/powerpc/kernel/setup-common.c
index 2e5dfb6e0823..971f50d99d87 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -133,13 +133,11 @@ int crashing_cpu = -1;
 /* also used by kexec */
 void machine_shutdown(void)
 {
-#ifdef CONFIG_FA_DUMP
/*
 * if fadump is active, cleanup the fadump registration before we
 * shutdown.
 */
fadump_cleanup();
-#endif
 
if (ppc_md.machine_shutdown)
ppc_md.machine_shutdown();
-- 
2.13.3



[PATCH 3/7] powerpc/setup: define cpu_pvr at all time

2019-03-22 Thread Christophe Leroy
To avoid ifdefs, define cpu_pvr at all times.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/setup-common.c | 12 +---
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kernel/setup-common.c 
b/arch/powerpc/kernel/setup-common.c
index a90e8367ccde..a4ed9301e815 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -190,9 +190,7 @@ void machine_halt(void)
machine_hang();
 }
 
-#ifdef CONFIG_SMP
 DEFINE_PER_CPU(unsigned int, cpu_pvr);
-#endif
 
 static void show_cpuinfo_summary(struct seq_file *m)
 {
@@ -234,11 +232,11 @@ static int show_cpuinfo(struct seq_file *m, void *v)
unsigned short maj;
unsigned short min;
 
-#ifdef CONFIG_SMP
-   pvr = per_cpu(cpu_pvr, cpu_id);
-#else
-   pvr = mfspr(SPRN_PVR);
-#endif
+   if (IS_ENABLED(CONFIG_SMP))
+   pvr = per_cpu(cpu_pvr, cpu_id);
+   else
+   pvr = mfspr(SPRN_PVR);
+
maj = (pvr >> 8) & 0xFF;
min = pvr & 0xFF;
 
-- 
2.13.3



[PATCH 2/7] powerpc/mm: define an empty mm_iommu_init()

2019-03-22 Thread Christophe Leroy
To avoid ifdefs, define an empty static inline mm_iommu_init() function
when CONFIG_SPAPR_TCE_IOMMU is not selected.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/mmu_context.h | 1 +
 arch/powerpc/kernel/setup-common.c | 2 --
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu_context.h 
b/arch/powerpc/include/asm/mmu_context.h
index 6ee8195a2ffb..95b93ce428b7 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -52,6 +52,7 @@ static inline bool mm_iommu_is_devmem(struct mm_struct *mm, 
unsigned long hpa,
 {
return false;
 }
+static inline void mm_iommu_init(struct mm_struct *mm) { }
 #endif
 extern void switch_slb(struct task_struct *tsk, struct mm_struct *mm);
 extern void set_context(unsigned long id, pgd_t *pgd);
diff --git a/arch/powerpc/kernel/setup-common.c 
b/arch/powerpc/kernel/setup-common.c
index 971f50d99d87..a90e8367ccde 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -956,9 +956,7 @@ void __init setup_arch(char **cmdline_p)
 #endif
 #endif
 
-#ifdef CONFIG_SPAPR_TCE_IOMMU
mm_iommu_init(_mm);
-#endif
irqstack_early_init();
exc_lvl_early_init();
emergency_stack_init();
-- 
2.13.3



Re: [PATCH 13/38] vfs: Convert cxl to fs_context

2019-03-22 Thread Frederic Barrat




Le 14/03/2019 à 17:10, David Howells a écrit :

Signed-off-by: David Howells 
cc: Frederic Barrat 
cc: Andrew Donnellan 
cc: linuxppc-dev@lists.ozlabs.org
---


Acked-by: Frederic Barrat 




  drivers/misc/cxl/api.c |   10 +-
  1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/misc/cxl/api.c b/drivers/misc/cxl/api.c
index 750470ef2049..395e9a88e6ba 100644
--- a/drivers/misc/cxl/api.c
+++ b/drivers/misc/cxl/api.c
@@ -13,6 +13,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  #include 
  
@@ -41,17 +42,16 @@ static const struct dentry_operations cxl_fs_dops = {

.d_dname= simple_dname,
  };
  
-static struct dentry *cxl_fs_mount(struct file_system_type *fs_type, int flags,

-   const char *dev_name, void *data)
+static int cxl_fs_init_fs_context(struct fs_context *fc)
  {
-   return mount_pseudo(fs_type, "cxl:", NULL, _fs_dops,
-   CXL_PSEUDO_FS_MAGIC);
+   return vfs_init_pseudo_fs_context(fc, "cxl:", NULL, NULL,
+ _fs_dops, CXL_PSEUDO_FS_MAGIC);
  }
  
  static struct file_system_type cxl_fs_type = {

.name   = "cxl",
.owner  = THIS_MODULE,
-   .mount  = cxl_fs_mount,
+   .init_fs_context = cxl_fs_init_fs_context,
.kill_sb= kill_anon_super,
  };
  





Re: [PATCH 0/2] Auto-promotion logic for cpuidle states

2019-03-22 Thread Abhishek

Please ignore this set as this is incomplete. I have resent the patches.

--Abhishek

On 03/22/2019 11:55 AM, Abhishek Goel wrote:

Currently, the cpuidle governors (menu/ladder) determine what idle state an
idling CPU should enter into based on heuristics that depend on the idle
history on that CPU. Given that no predictive heuristic is perfect, there
are cases where the governor predicts a shallow idle state, hoping that
the CPU will be busy soon. However, if no new workload is scheduled on
that CPU in the near future, the CPU will end up in the shallow state.

Motivation
--
In case of POWER, this is problematic, when the predicted state in the
aforementioned scenario is a lite stop state, as such lite states will
inhibit SMT folding, thereby depriving the other threads in the core from
using the core resources.

To address this, such lite states need to be autopromoted. The cpuidle
core can queue a timer corresponding to the residency value of the next
available state, thus leading to auto-promotion to a deeper idle state as
soon as possible.

Experiment
--
Without this patch -
It was seen that for an idle system, a cpu may remain in stop0_lite for a few
seconds and then directly go to a deeper state such as stop2.

With this patch -
A cpu will not remain in stop0_lite for more than the residency of the next
available state, and thus it will go to a deeper state in a conservative
fashion. Using this, we may spend even less than 20 milliseconds if
subsequent stop states are enabled. In the worst case, we may end up
spending more than a second, as was the case without this patch. The
worst case will occur in the scenario where no other shallow states are
enabled, and only deep states are available for auto-promotion.

Abhishek Goel (2):
   cpuidle : auto-promotion for cpuidle states
   cpuidle : Add auto-promotion flag to cpuidle flags

  arch/powerpc/include/asm/opal-api.h |  1 +
  drivers/cpuidle/Kconfig |  4 
  drivers/cpuidle/cpuidle-powernv.c   | 13 +++--
  drivers/cpuidle/cpuidle.c   |  3 ---
  4 files changed, 16 insertions(+), 5 deletions(-)
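
As a rough sketch of the timer mechanism described above, assuming the
hrtimer API (illustrative only, not the code from the posted patches):

#include <linux/hrtimer.h>

/*
 * Waking the CPU is all that is needed: on wakeup the governor runs
 * again and can pick a deeper state, so the callback does no work.
 */
static enum hrtimer_restart auto_promotion_wakeup(struct hrtimer *t)
{
        return HRTIMER_NORESTART;
}

/*
 * Before entering a lite state, arm a CPU-pinned timer for the
 * residency of the next deeper state so the CPU cannot linger there.
 * Assumes @t was set up with hrtimer_init() at CPU init time.
 */
static void arm_auto_promotion(struct hrtimer *t, u64 next_residency_us)
{
        t->function = auto_promotion_wakeup;
        hrtimer_start(t, ns_to_ktime(next_residency_us * NSEC_PER_USEC),
                      HRTIMER_MODE_REL_PINNED);
}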





[PATCH 2/2] cpuidle : Add auto-promotion flag to cpuidle flags

2019-03-22 Thread Abhishek Goel
This patch sets up flags for the state which needs to be auto-promoted.
For powernv systems, lite states do not even lose user context. That
information has been used to set the flag for lite states.

Signed-off-by: Abhishek Goel 
---
 arch/powerpc/include/asm/opal-api.h |  1 +
 drivers/cpuidle/Kconfig |  4 
 drivers/cpuidle/cpuidle-powernv.c   | 13 +++--
 3 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/opal-api.h 
b/arch/powerpc/include/asm/opal-api.h
index 870fb7b23..735dec731 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -226,6 +226,7 @@
  */
 
 #define OPAL_PM_TIMEBASE_STOP  0x0002
+#define OPAL_PM_LOSE_USER_CONTEXT  0x1000
 #define OPAL_PM_LOSE_HYP_CONTEXT   0x2000
 #define OPAL_PM_LOSE_FULL_CONTEXT  0x4000
 #define OPAL_PM_NAP_ENABLED0x0001
diff --git a/drivers/cpuidle/Kconfig b/drivers/cpuidle/Kconfig
index 7e48eb5bf..0ece62684 100644
--- a/drivers/cpuidle/Kconfig
+++ b/drivers/cpuidle/Kconfig
@@ -26,6 +26,10 @@ config CPU_IDLE_GOV_MENU
 config DT_IDLE_STATES
bool
 
+config CPU_IDLE_AUTO_PROMOTION
+   bool
+   default y if PPC_POWERNV
+
 menu "ARM CPU Idle Drivers"
 depends on ARM || ARM64
 source "drivers/cpuidle/Kconfig.arm"
diff --git a/drivers/cpuidle/cpuidle-powernv.c 
b/drivers/cpuidle/cpuidle-powernv.c
index 84b1ebe21..e351f5f9c 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -299,6 +299,7 @@ static int powernv_add_idle_states(void)
for (i = 0; i < dt_idle_states; i++) {
unsigned int exit_latency, target_residency;
bool stops_timebase = false;
+   bool lose_user_context = false;
struct pnv_idle_states_t *state = _idle_states[i];
 
/*
@@ -324,6 +325,9 @@ static int powernv_add_idle_states(void)
if (has_stop_states && !(state->valid))
continue;
 
+   if (state->flags & OPAL_PM_LOSE_USER_CONTEXT)
+   lose_user_context = true;
+
if (state->flags & OPAL_PM_TIMEBASE_STOP)
stops_timebase = true;
 
@@ -332,12 +336,17 @@ static int powernv_add_idle_states(void)
add_powernv_state(nr_idle_states, "Nap",
  CPUIDLE_FLAG_NONE, nap_loop,
  target_residency, exit_latency, 0, 0);
+   } else if (has_stop_states && !lose_user_context) {
+   add_powernv_state(nr_idle_states, state->name,
+ CPUIDLE_FLAG_AUTO_PROMOTION,
+ stop_loop, target_residency,
+ exit_latency, state->psscr_val,
+ state->psscr_mask);
} else if (has_stop_states && !stops_timebase) {
add_powernv_state(nr_idle_states, state->name,
  CPUIDLE_FLAG_NONE, stop_loop,
  target_residency, exit_latency,
- state->psscr_val,
- state->psscr_mask);
+ state->psscr_val, state->psscr_mask);
}
 
/*
-- 
2.17.1



[PATCH 1/2] cpuidle : auto-promotion for cpuidle states

2019-03-22 Thread Abhishek Goel
Currently, the cpuidle governors (menu/ladder) determine what idle state
an idling CPU should enter into based on heuristics that depend on the
idle history on that CPU. Given that no predictive heuristic is perfect,
there are cases where the governor predicts a shallow idle state, hoping
that the CPU will be busy soon. However, if no new workload is scheduled
on that CPU in the near future, the CPU will end up in the shallow state.

In case of POWER, this is problematic, when the predicted state in the
aforementioned scenario is a lite stop state, as such lite states will
inhibit SMT folding, thereby depriving the other threads in the core from
using the core resources.

To address this, such lite states need to be autopromoted. The cpuidle
core can queue a timer corresponding to the residency value of the next
available state, thus leading to auto-promotion to a deeper idle state as
soon as possible.

Signed-off-by: Abhishek Goel 
---
 drivers/cpuidle/cpuidle.c  | 79 +-
 drivers/cpuidle/governors/ladder.c |  3 +-
 drivers/cpuidle/governors/menu.c   | 23 -
 include/linux/cpuidle.h| 12 +++--
 4 files changed, 111 insertions(+), 6 deletions(-)

diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index 7f108309e..c4d1c1b38 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -36,6 +36,12 @@ static int enabled_devices;
 static int off __read_mostly;
 static int initialized __read_mostly;
 
+struct auto_promotion {
+   struct hrtimer  hrtimer;
+   int timeout;
+   booltimeout_needed;
+};
+
 int cpuidle_disabled(void)
 {
return off;
@@ -188,6 +194,64 @@ int cpuidle_enter_s2idle(struct cpuidle_driver *drv, 
struct cpuidle_device *dev)
 }
 #endif /* CONFIG_SUSPEND */
 
+enum hrtimer_restart auto_promotion_hrtimer_callback(struct hrtimer *hrtimer)
+{
+   return HRTIMER_NORESTART;
+}
+
+#ifdef CONFIG_CPU_IDLE_AUTO_PROMOTION
+DEFINE_PER_CPU(struct auto_promotion, ap);
+
+static void cpuidle_auto_promotion_start(struct cpuidle_state *state, int cpu)
+{
	struct auto_promotion *this_ap = &per_cpu(ap, cpu);
+
+   if (this_ap->timeout_needed && (state->flags &
+   CPUIDLE_FLAG_AUTO_PROMOTION))
		hrtimer_start(&this_ap->hrtimer, ns_to_ktime(this_ap->timeout
+   * 1000), HRTIMER_MODE_REL_PINNED);
+}
+
+static void cpuidle_auto_promotion_cancel(int cpu)
+{
+   struct hrtimer *hrtimer;
+
	hrtimer = &per_cpu(ap, cpu).hrtimer;
+   if (hrtimer_is_queued(hrtimer))
+   hrtimer_cancel(hrtimer);
+}
+
+static void cpuidle_auto_promotion_update(int time, int cpu)
+{
+   per_cpu(ap, cpu).timeout = time;
+}
+
+static void cpuidle_auto_promotion_init(struct cpuidle_driver *drv, int cpu)
+{
+   int i;
	struct auto_promotion *this_ap = &per_cpu(ap, cpu);
+
+   this_ap->timeout_needed = 0;
+
+   for (i = 0; i < drv->state_count; i++) {
+   if (drv->states[i].flags & CPUIDLE_FLAG_AUTO_PROMOTION) {
+   this_ap->timeout_needed = 1;
+   break;
+   }
+   }
+
	hrtimer_init(&this_ap->hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+   this_ap->hrtimer.function = auto_promotion_hrtimer_callback;
+}
+#else
+static inline void cpuidle_auto_promotion_start(struct cpuidle_state *state,
+   int cpu) { }
+static inline void cpuidle_auto_promotion_cancel(int cpu) { }
+static inline void cpuidle_auto_promotion_update(int timeout, int cpu) { }
+static inline void cpuidle_auto_promotion_init(struct cpuidle_driver *drv,
+   int cpu) { }
+#endif
+
 /**
  * cpuidle_enter_state - enter the state and update stats
  * @dev: cpuidle device for this cpu
@@ -225,12 +289,17 @@ int cpuidle_enter_state(struct cpuidle_device *dev, 
struct cpuidle_driver *drv,
trace_cpu_idle_rcuidle(index, dev->cpu);
time_start = ns_to_ktime(local_clock());
 
+   cpuidle_auto_promotion_start(target_state, dev->cpu);
+
stop_critical_timings();
entered_state = target_state->enter(dev, drv, index);
start_critical_timings();
 
sched_clock_idle_wakeup_event();
time_end = ns_to_ktime(local_clock());
+
+   cpuidle_auto_promotion_cancel(dev->cpu);
+
trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu);
 
/* The cpu is no longer idle or about to enter idle. */
@@ -312,7 +381,13 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct 
cpuidle_driver *drv,
 int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
   bool *stop_tick)
 {
-   return cpuidle_curr_governor->select(drv, dev, stop_tick);
+   int timeout, ret;
+
+   timeout = INT_MAX;
	ret = cpuidle_curr_governor->select(drv, dev, stop_tick, &timeout);
	cpuidle_auto_promotion_update(timeout, dev->cpu);

[PATCH 0/2] Auto-promotion logic for cpuidle states

2019-03-22 Thread Abhishek Goel
Currently, the cpuidle governors (menu/ladder) determine what idle state an
idling CPU should enter into based on heuristics that depend on the idle
history on that CPU. Given that no predictive heuristic is perfect, there
are cases where the governor predicts a shallow idle state, hoping that
the CPU will be busy soon. However, if no new workload is scheduled on
that CPU in the near future, the CPU will end up in the shallow state.

Motivation
--
In case of POWER, this is problematic, when the predicted state in the
aforementioned scenario is a lite stop state, as such lite states will
inhibit SMT folding, thereby depriving the other threads in the core from
using the core resources.

To address this, such lite states need to be auto-promoted. The cpuidle
core can queue a timer corresponding to the residency value of the next
available state, leading to auto-promotion to a deeper idle state as
soon as possible.
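
As a rough sketch of the idea (illustrative only, not code from this
series): the timeout can simply be the target residency of the next
enabled, deeper state. The helper name auto_promotion_timeout_us below
is hypothetical and only assumes the existing cpuidle_driver and
cpuidle_device fields:

/*
 * Illustrative sketch: return the target residency (in us) of the next
 * enabled state deeper than @index, to be used as an auto-promotion
 * timeout. Returns INT_MAX (timer never needed) if no deeper state is
 * usable. "auto_promotion_timeout_us" is a hypothetical helper name.
 */
static int auto_promotion_timeout_us(struct cpuidle_driver *drv,
				     struct cpuidle_device *dev, int index)
{
	int i;

	for (i = index + 1; i < drv->state_count; i++) {
		if (drv->states[i].disabled || dev->states_usage[i].disable)
			continue;
		return drv->states[i].target_residency;
	}
	return INT_MAX;
}

In this series, the chosen timeout is passed back to the core through the
extra parameter added to the governors' ->select() callback and armed as a
pinned hrtimer in cpuidle_enter_state().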

Experiment
--
Without this patch -
It was seen that on an idle system, a CPU may remain in stop0_lite for a
few seconds and then go directly to a deeper state such as stop2.

With this patch -
A CPU will not remain in stop0_lite for longer than the residency of the
next available state, and thus it will move to a deeper state in a
conservative fashion. With this, we may spend even less than 20
milliseconds in the lite state if subsequent stop states are enabled. In
the worst case, we may still end up spending more than a second, as was
the case without this patch; that worst case occurs when no other shallow
states are enabled and only deep states are available for auto-promotion.

Abhishek Goel (2):
  cpuidle : auto-promotion for cpuidle states
  cpuidle : Add auto-promotion flag to cpuidle flags

 arch/powerpc/include/asm/opal-api.h |  1 +
 drivers/cpuidle/Kconfig |  4 ++
 drivers/cpuidle/cpuidle-powernv.c   | 13 -
 drivers/cpuidle/cpuidle.c   | 79 -
 drivers/cpuidle/governors/ladder.c  |  3 +-
 drivers/cpuidle/governors/menu.c| 23 -
 include/linux/cpuidle.h | 12 +++--
 7 files changed, 127 insertions(+), 8 deletions(-)

-- 
2.17.1



Re: [PATCH 5/5] powerpc/8xx: fix possible object reference leak

2019-03-22 Thread Christophe Leroy




On 03/22/2019 03:05 AM, Wen Yang wrote:

The call to of_find_compatible_node returns a node pointer with refcount
incremented thus it must be explicitly decremented after the last
usage.
irq_domain_add_linear also calls of_node_get to increase refcount,
so irq_domain will not be affected when it is released.



Should you have a:

Fixes: a8db8cf0d894 ("irq_domain: Replace irq_alloc_host() with 
revmap-specific initializers")


If not, it means your change is in contradiction with commit 
b1725c9319aa ("[POWERPC] arch/powerpc/sysdev: Add missing of_node_put")




Detected by coccinelle with the following warnings:
./arch/powerpc/platforms/8xx/pic.c:158:1-7: ERROR: missing of_node_put; 
acquired a node pointer with refcount incremented on line 136, but without a 
corresponding object release within this function.

Signed-off-by: Wen Yang 
Cc: Vitaly Bordug 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-ker...@vger.kernel.org
---
  arch/powerpc/platforms/8xx/pic.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/platforms/8xx/pic.c b/arch/powerpc/platforms/8xx/pic.c
index 8d5a25d..13d880b 100644
--- a/arch/powerpc/platforms/8xx/pic.c
+++ b/arch/powerpc/platforms/8xx/pic.c
@@ -155,6 +155,7 @@ int mpc8xx_pic_init(void)
ret = -ENOMEM;
goto out;
}
+   of_node_put(np);
return 0;
  
  out:




I guess it would be better as follows:

--- a/arch/powerpc/platforms/8xx/pic.c
+++ b/arch/powerpc/platforms/8xx/pic.c
@@ -153,9 +153,7 @@ int mpc8xx_pic_init(void)
if (mpc8xx_pic_host == NULL) {
printk(KERN_ERR "MPC8xx PIC: failed to allocate irq 
host!\n");

ret = -ENOMEM;
-   goto out;
}
-   return 0;

 out:
of_node_put(np);



Christophe


[PATCH 0/2] Auto-promotion logic for cpuidle states

2019-03-22 Thread Abhishek Goel
Currently, the cpuidle governors (menu/ladder) determine what idle state an
idling CPU should enter into based on heuristics that depend on the idle
history on that CPU. Given that no predictive heuristic is perfect, there
are cases where the governor predicts a shallow idle state, hoping that
the CPU will be busy soon. However, if no new workload is scheduled on
that CPU in the near future, the CPU will end up in the shallow state.

Motivation
--
In case of POWER, this is problematic, when the predicted state in the
aforementioned scenario is a lite stop state, as such lite states will
inhibit SMT folding, thereby depriving the other threads in the core from
using the core resources.

To address this, such lite states need to be auto-promoted. The cpuidle
core can queue a timer corresponding to the residency value of the next
available state, leading to auto-promotion to a deeper idle state as
soon as possible.

Experiment
--
Without this patch -
It was seen that on an idle system, a CPU may remain in stop0_lite for a
few seconds and then go directly to a deeper state such as stop2.

With this patch -
A CPU will not remain in stop0_lite for longer than the residency of the
next available state, and thus it will move to a deeper state in a
conservative fashion. With this, we may spend even less than 20
milliseconds in the lite state if subsequent stop states are enabled. In
the worst case, we may still end up spending more than a second, as was
the case without this patch; that worst case occurs when no other shallow
states are enabled and only deep states are available for auto-promotion.

Abhishek Goel (2):
  cpuidle : auto-promotion for cpuidle states
  cpuidle : Add auto-promotion flag to cpuidle flags

 arch/powerpc/include/asm/opal-api.h |  1 +
 drivers/cpuidle/Kconfig |  4 
 drivers/cpuidle/cpuidle-powernv.c   | 13 +++--
 drivers/cpuidle/cpuidle.c   |  3 ---
 4 files changed, 16 insertions(+), 5 deletions(-)

-- 
2.17.1



[PATCH 2/2] cpuidle : Add auto-promotion flag to cpuidle flags

2019-03-22 Thread Abhishek Goel
This patch sets up the flag for states which need to be auto-promoted.
On powernv systems, lite states do not even lose user context; that
information is used to set the flag for the lite states.

Signed-off-by: Abhishek Goel 
---
 arch/powerpc/include/asm/opal-api.h |  1 +
 drivers/cpuidle/Kconfig |  4 
 drivers/cpuidle/cpuidle-powernv.c   | 13 +++--
 3 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/opal-api.h 
b/arch/powerpc/include/asm/opal-api.h
index 870fb7b23..735dec731 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -226,6 +226,7 @@
  */
 
 #define OPAL_PM_TIMEBASE_STOP  0x0002
+#define OPAL_PM_LOSE_USER_CONTEXT  0x1000
 #define OPAL_PM_LOSE_HYP_CONTEXT   0x2000
 #define OPAL_PM_LOSE_FULL_CONTEXT  0x4000
 #define OPAL_PM_NAP_ENABLED0x0001
diff --git a/drivers/cpuidle/Kconfig b/drivers/cpuidle/Kconfig
index 7e48eb5bf..0ece62684 100644
--- a/drivers/cpuidle/Kconfig
+++ b/drivers/cpuidle/Kconfig
@@ -26,6 +26,10 @@ config CPU_IDLE_GOV_MENU
 config DT_IDLE_STATES
bool
 
+config CPU_IDLE_AUTO_PROMOTION
+   bool
+   default y if PPC_POWERNV
+
 menu "ARM CPU Idle Drivers"
 depends on ARM || ARM64
 source "drivers/cpuidle/Kconfig.arm"
diff --git a/drivers/cpuidle/cpuidle-powernv.c 
b/drivers/cpuidle/cpuidle-powernv.c
index 84b1ebe21..e351f5f9c 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -299,6 +299,7 @@ static int powernv_add_idle_states(void)
for (i = 0; i < dt_idle_states; i++) {
unsigned int exit_latency, target_residency;
bool stops_timebase = false;
+   bool lose_user_context = false;
		struct pnv_idle_states_t *state = &pnv_idle_states[i];
 
/*
@@ -324,6 +325,9 @@ static int powernv_add_idle_states(void)
if (has_stop_states && !(state->valid))
continue;
 
+   if (state->flags & OPAL_PM_LOSE_USER_CONTEXT)
+   lose_user_context = true;
+
if (state->flags & OPAL_PM_TIMEBASE_STOP)
stops_timebase = true;
 
@@ -332,12 +336,17 @@ static int powernv_add_idle_states(void)
add_powernv_state(nr_idle_states, "Nap",
  CPUIDLE_FLAG_NONE, nap_loop,
  target_residency, exit_latency, 0, 0);
		} else if (has_stop_states && !lose_user_context) {
+   add_powernv_state(nr_idle_states, state->name,
+ CPUIDLE_FLAG_AUTO_PROMOTION,
+ stop_loop, target_residency,
+ exit_latency, state->psscr_val,
+ state->psscr_mask);
} else if (has_stop_states && !stops_timebase) {
add_powernv_state(nr_idle_states, state->name,
  CPUIDLE_FLAG_NONE, stop_loop,
  target_residency, exit_latency,
- state->psscr_val,
- state->psscr_mask);
+ state->psscr_val, state->psscr_mask);
}
 
/*
-- 
2.17.1



[PATCH 1/2] cpuidle : auto-promotion for cpuidle states

2019-03-22 Thread Abhishek Goel
Currently, the cpuidle governors (menu/ladder) determine what idle state
an idling CPU should enter into based on heuristics that depend on the
idle history on that CPU. Given that no predictive heuristic is perfect,
there are cases where the governor predicts a shallow idle state, hoping
that the CPU will be busy soon. However, if no new workload is scheduled
on that CPU in the near future, the CPU will end up in the shallow state.

In case of POWER, this is problematic, when the predicted state in the
aforementioned scenario is a lite stop state, as such lite states will
inhibit SMT folding, thereby depriving the other threads in the core from
using the core resources.

To address this, such lite states need to be auto-promoted. The cpuidle
core can queue a timer corresponding to the residency value of the next
available state, leading to auto-promotion to a deeper idle state as
soon as possible.

Signed-off-by: Abhishek Goel 
---
 drivers/cpuidle/cpuidle.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index 2406e2655..c4d1c1b38 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -584,11 +584,8 @@ static void __cpuidle_unregister_device(struct 
cpuidle_device *dev)
 
 static void __cpuidle_device_init(struct cpuidle_device *dev)
 {
-   int i;
memset(dev->states_usage, 0, sizeof(dev->states_usage));
dev->last_residency = 0;
-   for (i = 0; i < CPUIDLE_STATE_MAX; i++)
-   dev->states_usage[i].disable = true;
 }
 
 /**
-- 
2.17.1