Re: [PATCH 2/2] smp: introduce kick_active_cpus_sync()

2018-04-03 Thread Yury Norov
Hi Mark,

Thank you for review.

On Tue, Apr 03, 2018 at 02:48:32PM +0100, Mark Rutland wrote:
> Hi Yury,
> 
> On Sun, Apr 01, 2018 at 02:11:08PM +0300, Yury Norov wrote:
> > +/*
> > + * Flush I-cache if CPU is in extended quiescent state
> > + */
> 
> This comment is misleading. An ISB doesn't touch the I-cache; it forces
> a context synchronization event.
> 
> > +   .macro  isb_if_eqs
> > +#ifndef CONFIG_TINY_RCU
> > +   bl  rcu_is_watching
> > +   tst w0, #0xff
> > +   b.ne1f
> 
> The TST+B.NE can be a CBNZ:
> 
>   bl  rcu_is_watching
>   cbnz    x0, 1f
>   isb
> 1:
> 
> > +   /* Pairs with aarch64_insn_patch_text for EQS CPUs. */
> > +   isb
> > +1:
> > +#endif
> > +   .endm
> > +
> >  el0_sync_invalid:
> > inv_entry 0, BAD_SYNC
> >  ENDPROC(el0_sync_invalid)
> > @@ -840,8 +861,10 @@ el0_svc:
> > mov wsc_nr, #__NR_syscalls
> >  el0_svc_naked: // compat entry point
> > stp x0, xscno, [sp, #S_ORIG_X0] // save the original x0 and 
> > syscall number
> > +   isb_if_eqs
> > enable_dbg_and_irq
> > -   ct_user_exit 1
> > +   ct_user_exit
> 
> I don't think this is safe. Here we issue the ISB *before* exiting a
> quiescent state, so I think we can race with another CPU that calls
> kick_all_active_cpus_sync, e.g.
> 
>   CPU0CPU1
> 
>   ISB
>   patch_some_text()
>   kick_all_active_cpus_sync()
>   ct_user_exit
> 
>   // not synchronized!
>   use_of_patched_text()
> 
> ... and therefore the ISB has no effect, which could be disastrous.
> 
> I believe we need the ISB *after* we transition into a non-quiescent
> state, so that we can't possibly miss a context synchronization event.
 
I decided to put the isb() in the entry code because there's a chance that
patched code will be executed before we exit the quiescent state. But after
some head-scratching, I think it's safe, and I'll do as you suggested here.

Thanks,
Yury


[RFC 2/2] powerpc/swiotlb: Make swiotlb_dma_ops available on POWER

2018-04-03 Thread Anshuman Khandual
The generic swiotlb_dma_ops structure is available under the
CONFIG_DMA_DIRECT_OPS configuration option. Hence select it on the POWER
platform.

Signed-off-by: Anshuman Khandual 
---
 arch/powerpc/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 73ce5dd07642..61a55f9928d9 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -235,6 +235,7 @@ config PPC
select SPARSE_IRQ
select SYSCTL_EXCEPTION_TRACE
select VIRT_TO_BUS  if !PPC64
+   select DMA_DIRECT_OPS   if PPC_BOOK3S_64
#
# Please keep this list sorted alphabetically.
#
-- 
2.14.1



[RFC 1/2] powerpc/swiotlb: Dont free up allocated SWIOTLB slab on POWER

2018-04-03 Thread Anshuman Khandual
Even though the SWIOTLB slab gets allocated and initialized on powerpc with
swiotlb_init() called during mem_init(), it gets released again on the
POWER platform because 'ppc_swiotlb_enable' never gets set. The function
swiotlb_detect_4g() checks whether memory extends beyond 4GB and then sets
the variable 'ppc_swiotlb_enable', which prevents the SWIOTLB slab from
being freed. Let's make the POWER platform call swiotlb_detect_4g() during
setup_arch(), which will keep the SWIOTLB slab throughout the runtime.

A previous commit cf5621032f ("powerpc/64: Limit ZONE_DMA32 to 4GiB in
swiotlb_detect_4g()") enforced a 4GB limit on ZONE_DMA32 which is not
applicable on the POWER (CONFIG_PPC_BOOK3S_64) platform. Let's remove this
unnecessary restriction.

After the patch, SWIOTLB slab does not get released.

[0.410992] software IO TLB [mem 0xfbff-0x] (64MB) mapped
at [767f6cb3-4a10114f]

Signed-off-by: Anshuman Khandual 
---
 arch/powerpc/kernel/dma-swiotlb.c  | 2 +-
 arch/powerpc/kernel/setup-common.c | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/dma-swiotlb.c 
b/arch/powerpc/kernel/dma-swiotlb.c
index 88f3963ca30f..2255c6dc89db 100644
--- a/arch/powerpc/kernel/dma-swiotlb.c
+++ b/arch/powerpc/kernel/dma-swiotlb.c
@@ -110,7 +110,7 @@ void __init swiotlb_detect_4g(void)
 {
	if ((memblock_end_of_DRAM() - 1) > 0xffffffff) {
ppc_swiotlb_enable = 1;
-#ifdef CONFIG_ZONE_DMA32
+#if defined(CONFIG_ZONE_DMA32) && !defined(CONFIG_PPC_BOOK3S_64)
limit_zone_pfn(ZONE_DMA32, (1ULL << 32) >> PAGE_SHIFT);
 #endif
}
diff --git a/arch/powerpc/kernel/setup-common.c 
b/arch/powerpc/kernel/setup-common.c
index d73ec518ef80..c4db844e0b0d 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -944,6 +944,7 @@ void __init setup_arch(char **cmdline_p)
/* Initialize the MMU context management stuff. */
mmu_context_init();
 
+   swiotlb_detect_4g();
 #ifdef CONFIG_PPC64
/* Interrupt code needs to be 64K-aligned. */
	if ((unsigned long)_stext & 0xffff)
-- 
2.14.1



Re: [mm] b1f0502d04: INFO:trying_to_register_non-static_key

2018-04-03 Thread David Rientjes
On Tue, 3 Apr 2018, David Rientjes wrote:

> > >> I found the root cause of this lockdep warning.
> > >>
> > >> In mmap_region(), unmap_region() may be called while vma_link() has not 
> > >> been
> > >> called. This happens during the error path if call_mmap() failed.
> > >>
> > >> The only way to fix that particular case is to call
> > >> seqcount_init(&vma->vm_sequence) when initializing the vma in
> > >> mmap_region().
> > >>
> > > 
> > > Ack, although that would require a fixup to dup_mmap() as well.
> > 
> > You're right, I'll fix that too.
> > 
> 
> I also think the following is needed:
> 
> diff --git a/fs/exec.c b/fs/exec.c
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -312,6 +312,10 @@ static int __bprm_mm_init(struct linux_binprm *bprm)
>   vma->vm_flags = VM_SOFTDIRTY | VM_STACK_FLAGS | 
> VM_STACK_INCOMPLETE_SETUP;
>   vma->vm_page_prot = vm_get_page_prot(vma->vm_flags);
>   INIT_LIST_HEAD(&vma->anon_vma_chain);
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> +	seqcount_init(&vma->vm_sequence);
> +	atomic_set(&vma->vm_ref_count, 0);
> +#endif
>  
>   err = insert_vm_struct(mm, vma);
>   if (err)
> 

Ugh, I think there are a number of other places where this is needed as 
well in mm/mmap.c.  I think it would be better to just create a new 
alloc_vma(unsigned long flags) that all vma allocators can use and which, 
for CONFIG_SPECULATIVE_PAGE_FAULT, will initialize the seqcount_t and atomic_t.
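
Something along these lines, perhaps (a rough sketch only; alloc_vma() is
hypothetical, and the vm_sequence/vm_ref_count fields are the ones added by
the SPF series):

static struct vm_area_struct *alloc_vma(unsigned long flags)
{
	struct vm_area_struct *vma;

	vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
	if (!vma)
		return NULL;

	vma->vm_flags = flags;
	INIT_LIST_HEAD(&vma->anon_vma_chain);
#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
	/* keep lockdep happy for vmas that never go through vma_link() */
	seqcount_init(&vma->vm_sequence);
	atomic_set(&vma->vm_ref_count, 0);
#endif
	return vma;
}

Callers such as mmap_region() and __bprm_mm_init() could then replace their
open-coded allocation and initialization with it (dup_mmap(), which copies
an existing vma, would still need its own seqcount_init()).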


Re: [mm] b1f0502d04: INFO:trying_to_register_non-static_key

2018-04-03 Thread David Rientjes
On Wed, 28 Mar 2018, Laurent Dufour wrote:

> On 26/03/2018 00:10, David Rientjes wrote:
> > On Wed, 21 Mar 2018, Laurent Dufour wrote:
> > 
> >> I found the root cause of this lockdep warning.
> >>
> >> In mmap_region(), unmap_region() may be called while vma_link() has not 
> >> been
> >> called. This happens during the error path if call_mmap() failed.
> >>
> >> The only way to fix that particular case is to call
> >> seqcount_init(&vma->vm_sequence) when initializing the vma in
> >> mmap_region().
> >>
> > 
> > Ack, although that would require a fixup to dup_mmap() as well.
> 
> You're right, I'll fix that too.
> 

I also think the following is needed:

diff --git a/fs/exec.c b/fs/exec.c
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -312,6 +312,10 @@ static int __bprm_mm_init(struct linux_binprm *bprm)
vma->vm_flags = VM_SOFTDIRTY | VM_STACK_FLAGS | 
VM_STACK_INCOMPLETE_SETUP;
vma->vm_page_prot = vm_get_page_prot(vma->vm_flags);
	INIT_LIST_HEAD(&vma->anon_vma_chain);
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+	seqcount_init(&vma->vm_sequence);
+	atomic_set(&vma->vm_ref_count, 0);
+#endif
 
err = insert_vm_struct(mm, vma);
if (err)


Re: [PATCH v9 04/24] mm: Prepare for FAULT_FLAG_SPECULATIVE

2018-04-03 Thread David Rientjes
On Wed, 28 Mar 2018, Laurent Dufour wrote:

> >> diff --git a/include/linux/mm.h b/include/linux/mm.h
> >> index 4d02524a7998..2f3e98edc94a 100644
> >> --- a/include/linux/mm.h
> >> +++ b/include/linux/mm.h
> >> @@ -300,6 +300,7 @@ extern pgprot_t protection_map[16];
> >>  #define FAULT_FLAG_USER   0x40/* The fault originated in 
> >> userspace */
> >>  #define FAULT_FLAG_REMOTE 0x80/* faulting for non current tsk/mm */
> >>  #define FAULT_FLAG_INSTRUCTION  0x100 /* The fault was during an 
> >> instruction fetch */
> >> +#define FAULT_FLAG_SPECULATIVE0x200   /* Speculative fault, not 
> >> holding mmap_sem */
> >>  
> >>  #define FAULT_FLAG_TRACE \
> >>{ FAULT_FLAG_WRITE, "WRITE" }, \
> > 
> > I think FAULT_FLAG_SPECULATIVE should be introduced in the patch that 
> > actually uses it.
> 
> I think you're right, I'll move down this define in the series.
> 
> >> diff --git a/mm/memory.c b/mm/memory.c
> >> index e0ae4999c824..8ac241b9f370 100644
> >> --- a/mm/memory.c
> >> +++ b/mm/memory.c
> >> @@ -2288,6 +2288,13 @@ int apply_to_page_range(struct mm_struct *mm, 
> >> unsigned long addr,
> >>  }
> >>  EXPORT_SYMBOL_GPL(apply_to_page_range);
> >>  
> >> +static bool pte_map_lock(struct vm_fault *vmf)
> > 
> > inline?
> 
> Agreed.
> 

Ignore this, the final form of the function after the full patchset 
shouldn't be inline.

> >> +{
> >> +  vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
> >> +                                 vmf->address, &vmf->ptl);
> >> +  return true;
> >> +}
> >> +
> >>  /*
> >>   * handle_pte_fault chooses page fault handler according to an entry 
> >> which was
> >>   * read non-atomically.  Before making any commitment, on those 
> >> architectures
> >> @@ -2477,6 +2484,7 @@ static int wp_page_copy(struct vm_fault *vmf)
> >>const unsigned long mmun_start = vmf->address & PAGE_MASK;
> >>const unsigned long mmun_end = mmun_start + PAGE_SIZE;
> >>struct mem_cgroup *memcg;
> >> +  int ret = VM_FAULT_OOM;
> >>  
> >>if (unlikely(anon_vma_prepare(vma)))
> >>goto oom;
> >> @@ -2504,7 +2512,11 @@ static int wp_page_copy(struct vm_fault *vmf)
> >>/*
> >> * Re-check the pte - we dropped the lock
> >> */
> >> -  vmf->pte = pte_offset_map_lock(mm, vmf->pmd, vmf->address, &vmf->ptl);
> >> +  if (!pte_map_lock(vmf)) {
> >> +  mem_cgroup_cancel_charge(new_page, memcg, false);
> >> +  ret = VM_FAULT_RETRY;
> >> +  goto oom_free_new;
> >> +  }
> > 
> > Ugh, but we aren't oom here, so maybe rename oom_free_new so that it makes 
> > sense for return values other than VM_FAULT_OOM?
> 
> You're right, this label name is no longer correct. I'll rename it to
> "out_free_new" and also rename the label "oom" to "out" since it is now
> generic too.
> 

I think it would just be better to introduce an out_uncharge label that 
handles the mem_cgroup_cancel_charge() in the exit path.

diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2645,9 +2645,8 @@ static int wp_page_copy(struct vm_fault *vmf)
 * Re-check the pte - we dropped the lock
 */
if (!pte_map_lock(vmf)) {
-   mem_cgroup_cancel_charge(new_page, memcg, false);
ret = VM_FAULT_RETRY;
-   goto oom_free_new;
+   goto out_uncharge;
}
if (likely(pte_same(*vmf->pte, vmf->orig_pte))) {
if (old_page) {
@@ -2735,6 +2734,8 @@ static int wp_page_copy(struct vm_fault *vmf)
put_page(old_page);
}
return page_copied ? VM_FAULT_WRITE : 0;
+out_uncharge:
+   mem_cgroup_cancel_charge(new_page, memcg, false);
 oom_free_new:
put_page(new_page);
 oom:


Re: [PATCH] cxl: Fix possible deadlock when processing page faults from cxllib

2018-04-03 Thread Frederic Barrat



Le 03/04/2018 à 16:40, Aneesh Kumar K.V a écrit :

On 04/03/2018 03:13 PM, Frederic Barrat wrote:

cxllib_handle_fault() is called by an external driver when it needs to
have the host process page faults for a buffer which may cover several
pages. Currently the function holds the mm->mmap_sem semaphore with
read access while iterating over the buffer, since it could spawn
several VMAs. When calling a lower-level function to handle the page
fault for a single page, the semaphore is accessed again in read
mode. That is wrong and can lead to deadlocks if a writer tries to
sneak in while a buffer of several pages is being processed.

The fix is to release the semaphore once cxllib_handle_fault() got the
information it needs from the current vma. The address space/VMAs
could evolve while we iterate over the full buffer, but in the
unlikely case where we miss a page, the driver will raise a new page
fault when retrying.

Fixes: 3ced8d730063 ("cxl: Export library to support IBM XSL")
Cc: stable@vger.kernel.org # 4.13+
Signed-off-by: Frederic Barrat 
---
  drivers/misc/cxl/cxllib.c | 85 
++-

  1 file changed, 55 insertions(+), 30 deletions(-)

diff --git a/drivers/misc/cxl/cxllib.c b/drivers/misc/cxl/cxllib.c
index 30ccba436b3b..55cd35d1a9cc 100644
--- a/drivers/misc/cxl/cxllib.c
+++ b/drivers/misc/cxl/cxllib.c
@@ -208,49 +208,74 @@ int cxllib_get_PE_attributes(struct task_struct 
*task,

  }
  EXPORT_SYMBOL_GPL(cxllib_get_PE_attributes);
-int cxllib_handle_fault(struct mm_struct *mm, u64 addr, u64 size, u64 
flags)

+static int get_vma_info(struct mm_struct *mm, u64 addr,
+    u64 *vma_start, u64 *vma_end,
+    unsigned long *page_size)
  {
-    int rc;
-    u64 dar;
  struct vm_area_struct *vma = NULL;
-    unsigned long page_size;
-
-    if (mm == NULL)
-    return -EFAULT;
+    int rc = 0;
  down_read(&mm->mmap_sem);
  vma = find_vma(mm, addr);
  if (!vma) {
-    pr_err("Can't find vma for addr %016llx\n", addr);
  rc = -EFAULT;
  goto out;
  }
-    /* get the size of the pages allocated */
-    page_size = vma_kernel_pagesize(vma);
-
-    for (dar = (addr & ~(page_size - 1)); dar < (addr + size); dar += 
page_size) {

-    if (dar < vma->vm_start || dar >= vma->vm_end) {
-    vma = find_vma(mm, addr);
-    if (!vma) {
-    pr_err("Can't find vma for addr %016llx\n", addr);
-    rc = -EFAULT;
-    goto out;
-    }
-    /* get the size of the pages allocated */
-    page_size = vma_kernel_pagesize(vma);
+    *page_size = vma_kernel_pagesize(vma);
+    *vma_start = vma->vm_start;
+    *vma_end = vma->vm_end;
+out:
+    up_read(&mm->mmap_sem);
+    return rc;
+}
+
+int cxllib_handle_fault(struct mm_struct *mm, u64 addr, u64 size, u64 
flags)

+{
+    int rc;
+    u64 dar, vma_start, vma_end;
+    unsigned long page_size;
+
+    if (mm == NULL)
+    return -EFAULT;
+
+    /*
+ * The buffer we have to process can extend over several pages
+ * and may also cover several VMAs.
+ * We iterate over all the pages. The page size could vary
+ * between VMAs.
+ */
+    rc = get_vma_info(mm, addr, &vma_start, &vma_end, &page_size);
+    if (rc)
+    return rc;
+
+    for (dar = (addr & ~(page_size - 1)); dar < (addr + size);
+ dar += page_size) {
+    if (dar < vma_start || dar >= vma_end) {



IIUC, we are fetching the vma to get just the page_size with which it is 
mapped? Can't we iterate with PAGE_SIZE? Considering hugetlb page size 
will be larger than PAGE_SIZE, we might call into cxl_handle_mm_fault 
multiple times for a hugetlb page. Does that cause any issue? Also can 
cxl be used with hugetlb mappings?


I discussed it with Aneesh, but for the record:
- huge pages could be used, cxl has no control over it.
- incrementing the loop by PAGE_SIZE if it's a huge page would be a 
waste, as only the first call to cxl_handle_mm_fault() would be useful 
(see the quick calculation below).
- having to account for several VMAs and potentially several page sizes 
makes it more complicated. An idea is to check with Mellanox if we can 
reduce the scope, in case the caller can rule out some cases. It's too 
late for coral, but something we can look into for the future/upstream.
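
For illustration of that second point, a quick stand-alone calculation,
with made-up buffer and page sizes, mirroring the loop in
cxllib_handle_fault():

#include <stdio.h>

/* How many cxl_handle_mm_fault() calls the loop makes for one buffer. */
static unsigned long fault_calls(unsigned long addr, unsigned long size,
				 unsigned long step)
{
	unsigned long dar, n = 0;

	for (dar = addr & ~(step - 1); dar < addr + size; dar += step)
		n++;
	return n;
}

int main(void)
{
	unsigned long addr = 0x10000000UL;
	unsigned long size = 16UL << 20;	/* 16 MB hugetlb-backed buffer */

	printf("stepping by 64K: %lu calls\n", fault_calls(addr, size, 64UL << 10));
	printf("stepping by 16M: %lu calls\n", fault_calls(addr, size, 16UL << 20));
	return 0;
}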


  Fred




+    /*
+ * We don't hold the mm->mmap_sem semaphore
+ * while iterating, since the semaphore is
+ * required by one of the lower-level page
+ * fault processing functions and it could
+ * create a deadlock.
+ *
+ * It means the VMAs can be altered between 2
+ * loop iterations and we could theoretically
+ * miss a page (however unlikely). But that's
+ * not really a problem, as the driver will
+ * retry access, get another page fault on the
+ * missing page and call us again.
+ */
+  

Re: [PATCH v9 06/24] mm: make pte_unmap_same compatible with SPF

2018-04-03 Thread Jerome Glisse
On Tue, Apr 03, 2018 at 01:40:18PM -0700, David Rientjes wrote:
> On Tue, 3 Apr 2018, Jerome Glisse wrote:
> 
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 21b1212a0892..4bc7b0bdcb40 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -2309,21 +2309,29 @@ static bool pte_map_lock(struct vm_fault *vmf)
> > >   * parts, do_swap_page must check under lock before unmapping the pte and
> > >   * proceeding (but do_wp_page is only called after already making such a 
> > > check;
> > >   * and do_anonymous_page can safely check later on).
> > > + *
> > > + * pte_unmap_same() returns:
> > > + *   0   if the PTE are the same
> > > + *   VM_FAULT_PTNOTSAME  if the PTE are different
> > > + *   VM_FAULT_RETRY  if the VMA has changed in our back 
> > > during
> > > + *   a speculative page fault handling.
> > >   */

[...]

> > >  
> > 
> > This changes what do_swap_page() returns, i.e. before it was returning 0
> > when the locked pte lookup was different from orig_pte. After this patch
> > it returns VM_FAULT_PTNOTSAME, but this is a new return value for
> > handle_mm_fault() (the do_swap_page() return value is what ultimately
> > gets returned by handle_mm_fault()).
> > 
> > Do we really want that? This might confuse some existing users of
> > handle_mm_fault(), and I am not sure of the value of that information
> > to the caller.
> > 
> > Note I do understand that you want to return retry if anything did
> > change underneath and thus need to differentiate from when the
> > pte values are not the same.
> > 
> 
> I think VM_FAULT_RETRY should be handled appropriately for any user of 
> handle_mm_fault() already, and would be surprised to learn differently.  
> Khugepaged has the appropriate handling.  I think the concern is whether a 
> user is handling anything other than VM_FAULT_RETRY and VM_FAULT_ERROR 
> (which VM_FAULT_PTNOTSAME is not set in)?  I haven't done a full audit of 
> the users.

I am not worried about VM_FAULT_RETRY and barely have any worry about
VM_FAULT_PTNOTSAME either, as there are other comparable new return values
(VM_FAULT_NEEDDSYNC, for instance, which is quite recent).

I wonder if adding a new value is really needed here. I don't see any
value to it for callers of handle_mm_fault() except for stats.

Note that I am not opposed, but while today we have free bits, maybe
tomorrow we will run out; I am always worried about things like that :)

Cheers,
Jérôme
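
Purely as an illustration of keeping the distinction internal (this is not
something posted in the series), the do_swap_page() hunk could translate the
status before it ever escapes to handle_mm_fault() callers:

	ret = pte_unmap_same(vmf);
	if (ret) {
		/*
		 * "pte changed" only means there is nothing left to do
		 * here; only VM_FAULT_RETRY needs to reach the caller.
		 */
		if (ret == VM_FAULT_PTNOTSAME)
			ret = 0;
		goto out;
	}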


Re: [PATCH v9 15/24] mm: Introduce __vm_normal_page()

2018-04-03 Thread David Rientjes
On Tue, 3 Apr 2018, Jerome Glisse wrote:

> > When dealing with the speculative fault path we should use the VMA's
> > cached field values stored in the vm_fault structure.
> > 
> > Currently vm_normal_page() is using the pointer to the VMA to fetch the
> > vm_flags value. This patch provides a new __vm_normal_page() which
> > receives the vm_flags value as a parameter.
> > 
> > Note: The speculative path is turned on for architectures providing
> > support for the special PTE flag. So only the first block of
> > vm_normal_page() is used during the speculative path.
> 
> Might be a good idea to explicitly have the SPECULATIVE Kconfig option
> depend on ARCH_PTE_SPECIAL, and a comment for !HAVE_PTE_SPECIAL in the
> function explaining that a speculative page fault should never reach
> that point.

Yeah, I think that's appropriate but in a follow-up patch since this is 
only propagating vma_flags.  It will require that __HAVE_ARCH_PTE_SPECIAL 
become an actual Kconfig entry, however.


Re: [PATCH 2/4] libnvdimm: Add device-tree based driver

2018-04-03 Thread kbuild test robot
Hi Oliver,

I love your patch! Yet something to improve:

[auto build test ERROR on linux-nvdimm/libnvdimm-for-next]
[also build test ERROR on v4.16 next-20180403]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Oliver-O-Halloran/libnvdimm-Add-of_node-to-region-and-bus-descriptors/20180404-015544
base:   https://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm.git 
libnvdimm-for-next
config: ia64-allmodconfig (attached as .config)
compiler: ia64-linux-gcc (GCC) 7.2.0
reproduce:
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=ia64 

All errors (new ones prefixed by >>):

   ERROR: "ia64_delay_loop" [drivers/spi/spi-thunderx.ko] undefined!
>> ERROR: "of_node_to_nid" [drivers/nvdimm/of_pmem.ko] undefined!
   ERROR: "ia64_delay_loop" [drivers/net/phy/mdio-cavium.ko] undefined!

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


Re: [PATCH v9 06/24] mm: make pte_unmap_same compatible with SPF

2018-04-03 Thread David Rientjes
On Tue, 3 Apr 2018, Jerome Glisse wrote:

> > diff --git a/mm/memory.c b/mm/memory.c
> > index 21b1212a0892..4bc7b0bdcb40 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -2309,21 +2309,29 @@ static bool pte_map_lock(struct vm_fault *vmf)
> >   * parts, do_swap_page must check under lock before unmapping the pte and
> >   * proceeding (but do_wp_page is only called after already making such a 
> > check;
> >   * and do_anonymous_page can safely check later on).
> > + *
> > + * pte_unmap_same() returns:
> > + * 0   if the PTE are the same
> > + * VM_FAULT_PTNOTSAME  if the PTE are different
> > + * VM_FAULT_RETRY  if the VMA has changed in our back during
> > + * a speculative page fault handling.
> >   */
> > -static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd,
> > -   pte_t *page_table, pte_t orig_pte)
> > +static inline int pte_unmap_same(struct vm_fault *vmf)
> >  {
> > -   int same = 1;
> > +   int ret = 0;
> > +
> >  #if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT)
> > if (sizeof(pte_t) > sizeof(unsigned long)) {
> > -   spinlock_t *ptl = pte_lockptr(mm, pmd);
> > -   spin_lock(ptl);
> > -   same = pte_same(*page_table, orig_pte);
> > -   spin_unlock(ptl);
> > +   if (pte_spinlock(vmf)) {
> > +   if (!pte_same(*vmf->pte, vmf->orig_pte))
> > +   ret = VM_FAULT_PTNOTSAME;
> > +   spin_unlock(vmf->ptl);
> > +   } else
> > +   ret = VM_FAULT_RETRY;
> > }
> >  #endif
> > -   pte_unmap(page_table);
> > -   return same;
> > +   pte_unmap(vmf->pte);
> > +   return ret;
> >  }
> >  
> >  static inline void cow_user_page(struct page *dst, struct page *src, 
> > unsigned long va, struct vm_area_struct *vma)
> > @@ -2913,7 +2921,8 @@ int do_swap_page(struct vm_fault *vmf)
> > int exclusive = 0;
> > int ret = 0;
> >  
> > -   if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte))
> > +   ret = pte_unmap_same(vmf);
> > +   if (ret)
> > goto out;
> >  
> 
> This changes what do_swap_page() returns, i.e. before it was returning 0
> when the locked pte lookup was different from orig_pte. After this patch
> it returns VM_FAULT_PTNOTSAME, but this is a new return value for
> handle_mm_fault() (the do_swap_page() return value is what ultimately
> gets returned by handle_mm_fault()).
> 
> Do we really want that? This might confuse some existing users of
> handle_mm_fault(), and I am not sure of the value of that information
> to the caller.
> 
> Note I do understand that you want to return retry if anything did
> change underneath and thus need to differentiate from when the
> pte values are not the same.
> 

I think VM_FAULT_RETRY should be handled appropriately for any user of 
handle_mm_fault() already, and would be surprised to learn differently.  
Khugepaged has the appropriate handling.  I think the concern is whether a 
user is handling anything other than VM_FAULT_RETRY and VM_FAULT_ERROR 
(which VM_FAULT_PTNOTSAME is not set in)?  I haven't done a full audit of 
the users.
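
For reference, the pattern such an audit would be looking at: typical
handle_mm_fault() callers only test specific bits, roughly like this
(simplified sketch, not copied from any particular architecture):

	fault = handle_mm_fault(vma, address, flags);

	if (fault & VM_FAULT_ERROR) {
		/* OOM / SIGBUS / SIGSEGV handling */
	}
	if (fault & VM_FAULT_RETRY) {
		/* mmap_sem was dropped, redo the bookkeeping and retry */
	}
	/*
	 * Any other bit, e.g. VM_FAULT_MAJOR or a purely informational bit
	 * like VM_FAULT_PTNOTSAME, is simply ignored by such callers.
	 */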


Re: [PATCH v9 00/24] Speculative page faults

2018-04-03 Thread Jerome Glisse
On Tue, Mar 13, 2018 at 06:59:30PM +0100, Laurent Dufour wrote:
> This is a port on kernel 4.16 of the work done by Peter Zijlstra to
> handle page faults without holding the mm semaphore [1].
> 
> The idea is to try to handle user space page faults without holding the
> mmap_sem. This should allow better concurrency for massively threaded
> processes since the page fault handler will not wait for other threads'
> memory layout changes to be done, assuming that the change is done in
> another part of the process's memory space. This type of page fault is
> named a speculative page fault. If the speculative page fault fails
> because concurrency is detected or because the underlying PMD or PTE
> tables are not yet allocated, its processing fails and a classic page
> fault is then tried.
> 
> The speculative page fault (SPF) has to look for the VMA matching the
> fault address without holding the mmap_sem; this is done by introducing
> a rwlock which protects the access to the mm_rb tree. Previously this
> was done using SRCU, but it was introducing a lot of scheduling to
> process the VMA freeing operation, which was hitting the performance by
> 20% as reported by Kemi Wang [2]. Using a rwlock to protect access to
> the mm_rb tree limits the locking contention to these operations, which
> are expected to be O(log n). In addition, to ensure that the VMA is not
> freed behind our back, a reference count is added and two services
> (get_vma() and put_vma()) are introduced to handle it. When a VMA is
> fetched from the RB tree using get_vma(), it must later be released
> using put_vma(). Furthermore, to allow the VMA to be used again by the
> classic page fault handler, a service called can_reuse_spf_vma() is
> introduced. It is expected to be called with the mmap_sem held. It
> checks that the VMA still matches the specified address and releases
> its reference count; as the mmap_sem is held, it is ensured that the
> VMA will not be freed behind our back. In general, the VMA's reference
> count could be decremented while holding the mmap_sem, but it should
> not be increased, as holding the mmap_sem already ensures that the VMA
> is stable. I can no longer see the overhead I previously observed with
> the will-it-scale benchmark.
> 
> The VMA attributes checked during the speculative page fault processing
> have to be protected against parallel changes. This is done by using a
> per-VMA sequence lock. This sequence lock allows the speculative page
> fault handler to quickly check for parallel changes in progress and to
> abort the speculative page fault in that case.
> 
> Once the VMA is found, the speculative page fault handler checks the
> VMA's attributes to verify whether the page fault can be handled this
> way or not. Thus the VMA is protected through a sequence lock which
> allows fast detection of concurrent VMA changes. If such a change is
> detected, the speculative page fault is aborted and a *classic* page
> fault is tried.  VMA sequence locking is added where the VMA attributes
> which are checked during the page fault are modified.
> 
> When the PTE is fetched, the VMA is checked to see if it has been
> changed, so once the page table is locked the VMA is valid; any other
> change touching this PTE will need to take the page table lock, so no
> parallel change is possible at this time.

What would have been nice is some pseudo highlevel code before all the
above detailed description. Something like:
  speculative_fault(addr) {
mm_lock_for_vma_snapshot()
vma_snapshot = snapshot_vma_infos(addr)
mm_unlock_for_vma_snapshot()
...
if (!vma_can_speculatively_fault(vma_snapshot, addr))
return;
...
/* Do fault ie alloc memory, read from file ... */
page = ...;

preempt_disable();
if (vma_snapshot_still_valid(vma_snapshot, addr) &&
vma_pte_map_lock(vma_snapshot, addr)) {
if (pte_same(ptep, orig_pte)) {
/* Setup new pte */
page = NULL;
}
}
preempt_enable();
if (page)
put(page)
  }

I just find pseudo code easier for grasping the high-level view of the
expected code flow.
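
In that same spirit, here is a tiny stand-alone model of the per-VMA
sequence count described above; the names are invented for illustration
(the real series uses a seqcount_t in the VMA), and it is single-threaded
on purpose, just to show the snapshot/revalidate flow:

#include <stdio.h>

struct vma_model {
	unsigned int vm_sequence;	/* even: stable, odd: write in progress */
	unsigned long vm_flags;
};

static void vma_write_begin(struct vma_model *v) { v->vm_sequence++; }
static void vma_write_end(struct vma_model *v)   { v->vm_sequence++; }

/* Try a "speculative" read: snapshot vm_flags, then revalidate. */
static int spf_read_flags(struct vma_model *v, unsigned long *flags)
{
	unsigned int seq = v->vm_sequence;

	if (seq & 1)
		return -1;	/* writer active: fall back to the classic path */
	*flags = v->vm_flags;	/* snapshot used by the fault handler */
	/* ... the speculative work would happen here ... */
	if (v->vm_sequence != seq)
		return -1;	/* VMA changed under us: abort and retry */
	return 0;
}

int main(void)
{
	struct vma_model v = { .vm_sequence = 0, .vm_flags = 0x73 };
	unsigned long flags;

	printf("quiet VMA:    %s\n", spf_read_flags(&v, &flags) ? "abort" : "ok");
	vma_write_begin(&v);
	printf("VMA changing: %s\n", spf_read_flags(&v, &flags) ? "abort" : "ok");
	vma_write_end(&v);
	return 0;
}

(A real implementation also needs the memory barriers that seqcount_t
provides; they are omitted here.)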


> 
> The locking of the PTE is done with interrupts disabled; this allows
> checking the PMD to ensure that there is no ongoing collapse operation.
> Since khugepaged first sets the PMD to pmd_none and then waits for the
> other CPUs to catch the IPI interrupt, if the PMD is valid at the time
> the PTE is locked, we have the guarantee that the collapse operation
> will have to wait on the PTE lock to move forward. This allows the SPF
> handler to map the PTE safely. If the PMD value is different from the
> one recorded at the beginning of the SPF operation, the classic page
> fault handler will be called to handle the operation while holding the
> mmap_sem. As the PTE lock is taken with interrupts disabled, the lock is
> taken using spin_trylock() to avoid deadlock when handling a 

[PATCH v4 10/10] powerpc64/ftrace: Implement support for ftrace_regs_caller()

2018-04-03 Thread Naveen N. Rao
With -mprofile-kernel, we always save the full register state in
ftrace_caller(). While this works, this is inefficient if we're not
interested in the register state, such as when we're using the function
tracer.

Rename the existing ftrace_caller() as ftrace_regs_caller() and provide
a simpler implementation for ftrace_caller() that is used when registers
are not required to be saved.

Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/include/asm/ftrace.h  |   2 -
 arch/powerpc/include/asm/module.h  |   3 +
 arch/powerpc/kernel/module_64.c|  28 +++-
 arch/powerpc/kernel/trace/ftrace.c | 184 +++--
 arch/powerpc/kernel/trace/ftrace_64_mprofile.S |  71 +-
 5 files changed, 262 insertions(+), 26 deletions(-)

diff --git a/arch/powerpc/include/asm/ftrace.h 
b/arch/powerpc/include/asm/ftrace.h
index 3b5e85a72e10..f0806a2fd451 100644
--- a/arch/powerpc/include/asm/ftrace.h
+++ b/arch/powerpc/include/asm/ftrace.h
@@ -49,8 +49,6 @@
 extern void _mcount(void);
 
 #ifdef CONFIG_DYNAMIC_FTRACE
-# define FTRACE_ADDR ((unsigned long)ftrace_caller)
-# define FTRACE_REGS_ADDR FTRACE_ADDR
 static inline unsigned long ftrace_call_adjust(unsigned long addr)
 {
/* reloction of mcount call site is the same as the address */
diff --git a/arch/powerpc/include/asm/module.h 
b/arch/powerpc/include/asm/module.h
index 7e28442827f1..2d16b6d9147d 100644
--- a/arch/powerpc/include/asm/module.h
+++ b/arch/powerpc/include/asm/module.h
@@ -43,6 +43,9 @@ struct mod_arch_specific {
 #ifdef CONFIG_DYNAMIC_FTRACE
unsigned long toc;
unsigned long tramp;
+#ifdef CONFIG_DYNAMIC_FTRACE_WITH_REGS
+   unsigned long tramp_regs;
+#endif
 #endif
 
/* For module function descriptor dereference */
diff --git a/arch/powerpc/kernel/module_64.c b/arch/powerpc/kernel/module_64.c
index 8413be31d6a4..f7667e2ebfcb 100644
--- a/arch/powerpc/kernel/module_64.c
+++ b/arch/powerpc/kernel/module_64.c
@@ -280,6 +280,10 @@ static unsigned long get_stubs_size(const Elf64_Ehdr *hdr,
 #ifdef CONFIG_DYNAMIC_FTRACE
/* make the trampoline to the ftrace_caller */
relocs++;
+#ifdef CONFIG_DYNAMIC_FTRACE_WITH_REGS
+   /* an additional one for ftrace_regs_caller */
+   relocs++;
+#endif
 #endif
 
pr_debug("Looks like a total of %lu stubs, max\n", relocs);
@@ -765,7 +769,8 @@ int apply_relocate_add(Elf64_Shdr *sechdrs,
  * via the paca (in r13). The target (ftrace_caller()) is responsible for
  * saving and restoring the toc before returning.
  */
-static unsigned long create_ftrace_stub(const Elf64_Shdr *sechdrs, struct 
module *me)
+static unsigned long create_ftrace_stub(const Elf64_Shdr *sechdrs,
+   struct module *me, unsigned long addr)
 {
struct ppc64_stub_entry *entry;
unsigned int i, num_stubs;
@@ -792,9 +797,10 @@ static unsigned long create_ftrace_stub(const Elf64_Shdr 
*sechdrs, struct module
memcpy(entry->jump, stub_insns, sizeof(stub_insns));
 
/* Stub uses address relative to kernel toc (from the paca) */
-   reladdr = (unsigned long)ftrace_caller - kernel_toc_addr();
+   reladdr = addr - kernel_toc_addr();
	if (reladdr > 0x7FFFFFFF || reladdr < -(0x80000000L)) {
-   pr_err("%s: Address of ftrace_caller out of range of 
kernel_toc.\n", me->name);
+   pr_err("%s: Address of %ps out of range of kernel_toc.\n",
+   me->name, (void *)addr);
return 0;
}
 
@@ -802,22 +808,30 @@ static unsigned long create_ftrace_stub(const Elf64_Shdr 
*sechdrs, struct module
entry->jump[2] |= PPC_LO(reladdr);
 
/* Eventhough we don't use funcdata in the stub, it's needed elsewhere. 
*/
-   entry->funcdata = func_desc((unsigned long)ftrace_caller);
+   entry->funcdata = func_desc(addr);
entry->magic = STUB_MAGIC;
 
return (unsigned long)entry;
 }
 #else
-static unsigned long create_ftrace_stub(const Elf64_Shdr *sechdrs, struct 
module *me)
+static unsigned long create_ftrace_stub(const Elf64_Shdr *sechdrs,
+   struct module *me, unsigned long addr)
 {
-   return stub_for_addr(sechdrs, (unsigned long)ftrace_caller, me);
+   return stub_for_addr(sechdrs, addr, me);
 }
 #endif
 
 int module_finalize_ftrace(struct module *mod, const Elf_Shdr *sechdrs)
 {
mod->arch.toc = my_r2(sechdrs, mod);
-   mod->arch.tramp = create_ftrace_stub(sechdrs, mod);
+   mod->arch.tramp = create_ftrace_stub(sechdrs, mod,
+   (unsigned long)ftrace_caller);
+#ifdef CONFIG_DYNAMIC_FTRACE_WITH_REGS
+   mod->arch.tramp_regs = create_ftrace_stub(sechdrs, mod,
+   (unsigned long)ftrace_regs_caller);
+   if (!mod->arch.tramp_regs)
+   return -ENOENT;
+#endif
 
if (!mod->arch.tramp)
 

[PATCH v4 09/10] powerpc64/ftrace: Use the generic version of ftrace_replace_code()

2018-04-03 Thread Naveen N. Rao
Our implementation matches that of the generic version, which also
handles FTRACE_UPDATE_MODIFY_CALL. So, remove our implementation in
favor of the generic version.

Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/kernel/trace/ftrace.c | 36 
 1 file changed, 36 deletions(-)

diff --git a/arch/powerpc/kernel/trace/ftrace.c 
b/arch/powerpc/kernel/trace/ftrace.c
index 4741fe112f05..80667128db3d 100644
--- a/arch/powerpc/kernel/trace/ftrace.c
+++ b/arch/powerpc/kernel/trace/ftrace.c
@@ -485,42 +485,6 @@ int ftrace_update_ftrace_func(ftrace_func_t func)
return ret;
 }
 
-static int __ftrace_replace_code(struct dyn_ftrace *rec, int enable)
-{
-   unsigned long ftrace_addr = (unsigned long)FTRACE_ADDR;
-   int ret;
-
-   ret = ftrace_update_record(rec, enable);
-
-   switch (ret) {
-   case FTRACE_UPDATE_IGNORE:
-   return 0;
-   case FTRACE_UPDATE_MAKE_CALL:
-   return ftrace_make_call(rec, ftrace_addr);
-   case FTRACE_UPDATE_MAKE_NOP:
-   return ftrace_make_nop(NULL, rec, ftrace_addr);
-   }
-
-   return 0;
-}
-
-void ftrace_replace_code(int enable)
-{
-   struct ftrace_rec_iter *iter;
-   struct dyn_ftrace *rec;
-   int ret;
-
-   for (iter = ftrace_rec_iter_start(); iter;
-iter = ftrace_rec_iter_next(iter)) {
-   rec = ftrace_rec_iter_record(iter);
-   ret = __ftrace_replace_code(rec, enable);
-   if (ret) {
-   ftrace_bug(ret, rec);
-   return;
-   }
-   }
-}
-
 /*
  * Use the default ftrace_modify_all_code, but without
  * stop_machine().
-- 
2.16.2



[PATCH v4 08/10] powerpc64/module: Tighten detection of mcount call sites with -mprofile-kernel

2018-04-03 Thread Naveen N. Rao
For R_PPC64_REL24 relocations, we suppress emitting instructions for TOC
load/restore in the relocation stub if the relocation is for _mcount()
call when using -mprofile-kernel ABI.

To detect this, we check if the preceding instructions are per the
standard set of instructions emitted by gcc: either the two instruction
sequence of 'mflr r0; std r0,16(r1)', or the more optimized variant of a
single 'mflr r0'. This is not sufficient since nothing prevents users
from hand coding sequences involving a 'mflr r0' followed by a 'bl'.

For removing the toc save instruction from the stub, we additionally
check if the symbol is "_mcount". Add the same check here as well.

Also rename is_early_mcount_callsite() to is_mprofile_mcount_callsite()
since that is what is being checked. The use of "early" is misleading
since there is nothing involving this function that qualifies as early.

Fixes: 153086644fd1f ("powerpc/ftrace: Add support for -mprofile-kernel ftrace 
ABI")
Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/kernel/module_64.c | 15 +--
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kernel/module_64.c b/arch/powerpc/kernel/module_64.c
index a2636c250b7b..8413be31d6a4 100644
--- a/arch/powerpc/kernel/module_64.c
+++ b/arch/powerpc/kernel/module_64.c
@@ -463,8 +463,11 @@ static unsigned long stub_for_addr(const Elf64_Shdr 
*sechdrs,
 }
 
 #ifdef CC_USING_MPROFILE_KERNEL
-static bool is_early_mcount_callsite(u32 *instruction)
+static bool is_mprofile_mcount_callsite(const char *name, u32 *instruction)
 {
+   if (strcmp("_mcount", name))
+   return false;
+
/*
 * Check if this is one of the -mprofile-kernel sequences.
 */
@@ -496,8 +499,7 @@ static void squash_toc_save_inst(const char *name, unsigned 
long addr)
 #else
 static void squash_toc_save_inst(const char *name, unsigned long addr) { }
 
-/* without -mprofile-kernel, mcount calls are never early */
-static bool is_early_mcount_callsite(u32 *instruction)
+static bool is_mprofile_mcount_callsite(const char *name, u32 *instruction)
 {
return false;
 }
@@ -505,11 +507,11 @@ static bool is_early_mcount_callsite(u32 *instruction)
 
 /* We expect a noop next: if it is, replace it with instruction to
restore r2. */
-static int restore_r2(u32 *instruction, struct module *me)
+static int restore_r2(const char *name, u32 *instruction, struct module *me)
 {
u32 *prev_insn = instruction - 1;
 
-   if (is_early_mcount_callsite(prev_insn))
+   if (is_mprofile_mcount_callsite(name, prev_insn))
return 1;
 
/*
@@ -650,7 +652,8 @@ int apply_relocate_add(Elf64_Shdr *sechdrs,
value = stub_for_addr(sechdrs, value, me);
if (!value)
return -ENOENT;
-   if (!restore_r2((u32 *)location + 1, me))
+   if (!restore_r2(strtab + sym->st_name,
+   (u32 *)location + 1, 
me))
return -ENOEXEC;
 
squash_toc_save_inst(strtab + sym->st_name, 
value);
-- 
2.16.2



[PATCH v4 07/10] powerpc64/kexec: Hard disable ftrace before switching to the new kernel

2018-04-03 Thread Naveen N. Rao
If function_graph tracer is enabled during kexec, we see the below
exception in the simulator:
root@(none):/# kexec -e
kvm: exiting hardware virtualization
kexec_core: Starting new kernel
[   19.262020070,5] OPAL: Switch to big-endian OS
kexec: Starting switchover sequence.
Interrupt to 0xC0004380 from 0xC0004380
** Execution stopped: Continuous Interrupt, Instruction caused 
exception,  **

Now that we have a more effective way to completely disable ftrace on
ppc64, let's also use that before switching to a new kernel during
kexec.

Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/kernel/machine_kexec.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/kernel/machine_kexec.c 
b/arch/powerpc/kernel/machine_kexec.c
index 2694d078741d..936c7e2d421e 100644
--- a/arch/powerpc/kernel/machine_kexec.c
+++ b/arch/powerpc/kernel/machine_kexec.c
@@ -98,12 +98,14 @@ void machine_kexec(struct kimage *image)
int save_ftrace_enabled;
 
save_ftrace_enabled = __ftrace_enabled_save();
+   this_cpu_disable_ftrace();
 
if (ppc_md.machine_kexec)
ppc_md.machine_kexec(image);
else
default_machine_kexec(image);
 
+   this_cpu_enable_ftrace();
__ftrace_enabled_restore(save_ftrace_enabled);
 
/* Fall back to normal restart if we're still alive. */
-- 
2.16.2



[PATCH v4 06/10] powerpc64/ftrace: Disable ftrace during hotplug

2018-04-03 Thread Naveen N. Rao
Disable ftrace when a cpu is about to go offline. When the cpu is woken
up, ftrace will get enabled in start_secondary().

Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/kernel/smp.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index d90195dddc57..09a7ef12ff33 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1122,6 +1122,8 @@ int __cpu_disable(void)
if (!smp_ops->cpu_disable)
return -ENOSYS;
 
+   this_cpu_disable_ftrace();
+
err = smp_ops->cpu_disable();
if (err)
return err;
@@ -1140,6 +1142,12 @@ void __cpu_die(unsigned int cpu)
 
 void cpu_die(void)
 {
+   /*
+* Disable on the down path. This will be re-enabled by
+* start_secondary() via start_secondary_resume() below
+*/
+   this_cpu_disable_ftrace();
+
if (ppc_md.cpu_die)
ppc_md.cpu_die();
 
-- 
2.16.2



[PATCH v4 05/10] powerpc64/ftrace: Delay enabling ftrace on secondary cpus

2018-04-03 Thread Naveen N. Rao
On the boot cpu, though we enable paca->ftrace_enabled in early_setup()
(via cpu_ready_for_interrupts()), we don't start tracing until much
later since ftrace is not initialized yet and since we only support
DYNAMIC_FTRACE on powerpc. However, it is possible that ftrace has been
initialized by the time some of the secondary cpus start up. In this
case, we will try to trace some of the early boot code which can cause
problems.

To address this, move setting paca->ftrace_enabled from
cpu_ready_for_interrupts() to early_setup() for the boot cpu, and towards
the end of start_secondary() for secondary cpus.

Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/kernel/setup_64.c | 10 +++---
 arch/powerpc/kernel/smp.c  |  4 
 2 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
index a96ea3bab8f4..625833f4099a 100644
--- a/arch/powerpc/kernel/setup_64.c
+++ b/arch/powerpc/kernel/setup_64.c
@@ -252,9 +252,6 @@ static void cpu_ready_for_interrupts(void)
 
/* Set IR and DR in PACA MSR */
get_paca()->kernel_msr = MSR_KERNEL;
-
-   /* We are now ok to enable ftrace */
-   get_paca()->ftrace_enabled = 1;
 }
 
 unsigned long spr_default_dscr = 0;
@@ -349,6 +346,13 @@ void __init early_setup(unsigned long dt_ptr)
 */
cpu_ready_for_interrupts();
 
+   /*
+* We enable ftrace here, but since we only support DYNAMIC_FTRACE, it
+* will only actually get enabled on the boot cpu much later once
+* ftrace itself has been initialized.
+*/
+   this_cpu_enable_ftrace();
+
DBG(" <- early_setup()\n");
 
 #ifdef CONFIG_PPC_EARLY_DEBUG_BOOTX
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index cfc08b099c49..d90195dddc57 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -59,6 +59,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #ifdef DEBUG
 #include 
@@ -1022,6 +1023,9 @@ void start_secondary(void *unused)
 
local_irq_enable();
 
+   /* We can enable ftrace for secondary cpus now */
+   this_cpu_enable_ftrace();
+
cpu_startup_entry(CPUHP_AP_ONLINE_IDLE);
 
BUG();
-- 
2.16.2



[PATCH v4 04/10] powerpc64/ftrace: Add helpers to hard disable ftrace

2018-04-03 Thread Naveen N. Rao
Add some helpers to enable/disable ftrace through paca->ftrace_enabled.

Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/include/asm/ftrace.h | 17 +
 1 file changed, 17 insertions(+)

diff --git a/arch/powerpc/include/asm/ftrace.h 
b/arch/powerpc/include/asm/ftrace.h
index ebfc0b846bb5..3b5e85a72e10 100644
--- a/arch/powerpc/include/asm/ftrace.h
+++ b/arch/powerpc/include/asm/ftrace.h
@@ -82,6 +82,23 @@ static inline bool arch_syscall_match_sym_name(const char 
*sym, const char *name
return !strcmp(sym + 4, name + 3);
 }
 #endif /* CONFIG_FTRACE_SYSCALLS && PPC64_ELF_ABI_v1 */
+
+#ifdef CONFIG_PPC64
+#include 
+
+static inline void this_cpu_disable_ftrace(void)
+{
+   get_paca()->ftrace_enabled = 0;
+}
+
+static inline void this_cpu_enable_ftrace(void)
+{
+   get_paca()->ftrace_enabled = 1;
+}
+#else /* CONFIG_PPC64 */
+static inline void this_cpu_disable_ftrace(void) { }
+static inline void this_cpu_enable_ftrace(void) { }
+#endif /* CONFIG_PPC64 */
 #endif /* !__ASSEMBLY__ */
 
 #endif /* _ASM_POWERPC_FTRACE */
-- 
2.16.2



[PATCH v4 03/10] powerpc64/ftrace: Rearrange #ifdef sections in ftrace.h

2018-04-03 Thread Naveen N. Rao
Re-arrange the last #ifdef section in preparation for a subsequent
change.

Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/include/asm/ftrace.h | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/ftrace.h 
b/arch/powerpc/include/asm/ftrace.h
index 9abddde372ab..ebfc0b846bb5 100644
--- a/arch/powerpc/include/asm/ftrace.h
+++ b/arch/powerpc/include/asm/ftrace.h
@@ -68,8 +68,8 @@ struct dyn_arch_ftrace {
 #endif
 #endif
 
-#if defined(CONFIG_FTRACE_SYSCALLS) && !defined(__ASSEMBLY__)
-#ifdef PPC64_ELF_ABI_v1
+#ifndef __ASSEMBLY__
+#if defined(CONFIG_FTRACE_SYSCALLS) && defined(PPC64_ELF_ABI_v1)
 #define ARCH_HAS_SYSCALL_MATCH_SYM_NAME
 static inline bool arch_syscall_match_sym_name(const char *sym, const char 
*name)
 {
@@ -81,7 +81,7 @@ static inline bool arch_syscall_match_sym_name(const char 
*sym, const char *name
 */
return !strcmp(sym + 4, name + 3);
 }
-#endif
-#endif /* CONFIG_FTRACE_SYSCALLS && !__ASSEMBLY__ */
+#endif /* CONFIG_FTRACE_SYSCALLS && PPC64_ELF_ABI_v1 */
+#endif /* !__ASSEMBLY__ */
 
 #endif /* _ASM_POWERPC_FTRACE */
-- 
2.16.2



[PATCH v4 02/10] powerpc64/ftrace: Disable ftrace during kvm guest entry/exit

2018-04-03 Thread Naveen N. Rao
During guest entry/exit, we switch over to/from the guest MMU context.
While doing so, we set our state to KVM_GUEST_MODE_HOST_HV to note down
the fact that we cannot take any exceptions in the hypervisor code.

Since ftrace may be enabled and since it can result in us taking a trap,
disable ftrace by setting paca->ftrace_enabled to zero. Once we exit the
guest and restore host MMU context, we re-enable ftrace.

Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 8 
 1 file changed, 8 insertions(+)

diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index bd63fa8a08b5..6f2d7206a12b 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -598,6 +598,10 @@ kvmppc_hv_entry:
/* Save R1 in the PACA */
std r1, HSTATE_HOST_R1(r13)
 
+   /* Disable ftrace since we can't take a trap any more */
+   li  r6, 0
+   stb r6, PACA_FTRACE_ENABLED(r13)
+
li  r6, KVM_GUEST_MODE_HOST_HV
stb r6, HSTATE_IN_GUEST(r13)
 
@@ -2078,6 +2082,8 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
/* Unset guest mode */
li  r0, KVM_GUEST_MODE_NONE
stb r0, HSTATE_IN_GUEST(r13)
+   li  r0, 1
+   stb r0, PACA_FTRACE_ENABLED(r13)
 
lwz r12, STACK_SLOT_TRAP(r1)/* return trap # in r12 */
ld  r0, SFS+PPC_LR_STKOFF(r1)
@@ -3547,6 +3553,8 @@ END_MMU_FTR_SECTION_IFSET(MMU_FTR_TYPE_RADIX)
ld  r8, KVM_HOST_LPCR(r10)
mtspr   SPRN_LPCR, r8
isync
+   li  r0, 1
+   stb r0, PACA_FTRACE_ENABLED(r13)
li  r0, KVM_GUEST_MODE_NONE
stb r0, HSTATE_IN_GUEST(r13)
 
-- 
2.16.2



[PATCH v4 01/10] powerpc64/ftrace: Add a field in paca to disable ftrace in unsafe code paths

2018-04-03 Thread Naveen N. Rao
We have some C code that we call into from real mode where we cannot
take any exceptions. Though the C functions themselves are mostly safe,
if these functions are traced, there is a possibility that we may take
an exception. For instance, in certain conditions, the ftrace code uses
WARN(), which uses a 'trap' to do its job.

For such scenarios, introduce a new field in paca 'ftrace_enabled',
which is checked on ftrace entry before continuing. This field can then
be set to zero to disable/pause ftrace, and set to a non-zero value to
resume ftrace.

Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/include/asm/paca.h|  1 +
 arch/powerpc/kernel/asm-offsets.c  |  1 +
 arch/powerpc/kernel/setup_64.c |  3 +++
 arch/powerpc/kernel/trace/ftrace_64_mprofile.S | 14 ++
 arch/powerpc/kernel/trace/ftrace_64_pg.S   |  4 
 5 files changed, 23 insertions(+)

diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index 4185f1c96125..163f13f31255 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -223,6 +223,7 @@ struct paca_struct {
u8 hmi_event_available; /* HMI event is available */
u8 hmi_p9_special_emu;  /* HMI P9 special emulation */
 #endif
+   u8 ftrace_enabled;  /* Hard disable ftrace */
 
/* Stuff for accurate time accounting */
struct cpu_accounting_data accounting;
diff --git a/arch/powerpc/kernel/asm-offsets.c 
b/arch/powerpc/kernel/asm-offsets.c
index 6bee65f3cfd3..262c44a90ea1 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -180,6 +180,7 @@ int main(void)
OFFSET(PACAKMSR, paca_struct, kernel_msr);
OFFSET(PACAIRQSOFTMASK, paca_struct, irq_soft_mask);
OFFSET(PACAIRQHAPPENED, paca_struct, irq_happened);
+   OFFSET(PACA_FTRACE_ENABLED, paca_struct, ftrace_enabled);
 #ifdef CONFIG_PPC_BOOK3S
OFFSET(PACACONTEXTID, paca_struct, mm_ctx_id);
 #ifdef CONFIG_PPC_MM_SLICES
diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
index 66f2b6299c40..a96ea3bab8f4 100644
--- a/arch/powerpc/kernel/setup_64.c
+++ b/arch/powerpc/kernel/setup_64.c
@@ -252,6 +252,9 @@ static void cpu_ready_for_interrupts(void)
 
/* Set IR and DR in PACA MSR */
get_paca()->kernel_msr = MSR_KERNEL;
+
+   /* We are now ok to enable ftrace */
+   get_paca()->ftrace_enabled = 1;
 }
 
 unsigned long spr_default_dscr = 0;
diff --git a/arch/powerpc/kernel/trace/ftrace_64_mprofile.S 
b/arch/powerpc/kernel/trace/ftrace_64_mprofile.S
index 3f3e81852422..ae1cbe783ab6 100644
--- a/arch/powerpc/kernel/trace/ftrace_64_mprofile.S
+++ b/arch/powerpc/kernel/trace/ftrace_64_mprofile.S
@@ -47,6 +47,12 @@ _GLOBAL(ftrace_caller)
/* Save all gprs to pt_regs */
SAVE_GPR(0, r1)
SAVE_10GPRS(2, r1)
+
+   /* Ok to continue? */
+   lbz r3, PACA_FTRACE_ENABLED(r13)
+   cmpdi   r3, 0
+   beq ftrace_no_trace
+
SAVE_10GPRS(12, r1)
SAVE_10GPRS(22, r1)
 
@@ -168,6 +174,14 @@ _GLOBAL(ftrace_graph_stub)
 _GLOBAL(ftrace_stub)
blr
 
+ftrace_no_trace:
+	mflr	r3
+	mtctr	r3
+	REST_GPR(3, r1)
+	addi	r1, r1, SWITCH_FRAME_SIZE
+	mtlr	r0
+	bctr
+
 #ifdef CONFIG_LIVEPATCH
/*
 * This function runs in the mcount context, between two functions. As
diff --git a/arch/powerpc/kernel/trace/ftrace_64_pg.S 
b/arch/powerpc/kernel/trace/ftrace_64_pg.S
index f095358da96e..b7ba51a0f3b6 100644
--- a/arch/powerpc/kernel/trace/ftrace_64_pg.S
+++ b/arch/powerpc/kernel/trace/ftrace_64_pg.S
@@ -16,6 +16,10 @@
 
 #ifdef CONFIG_DYNAMIC_FTRACE
 _GLOBAL_TOC(ftrace_caller)
+   lbz r3, PACA_FTRACE_ENABLED(r13)
+   cmpdi   r3, 0
+   beqlr
+
/* Taken from output of objdump from lib64/glibc */
	mflr	r3
ld  r11, 0(r1)
-- 
2.16.2



[PATCH v4 00/10] powerpc64/ftrace: Add support for ftrace_modify_call() and a few other fixes

2018-04-03 Thread Naveen N. Rao
This is v4 of the patches posted at:
https://www.mail-archive.com/linuxppc-dev@lists.ozlabs.org/msg130739.html

This series has been tested using mambo for p8 (hash) and p9 (radix).


v4 changes:
- flip the paca field from 'ftrace_disabled' to 'ftrace_enabled' to 
  signify a default disable policy to address early boot
- add and use helpers to disable/enable ftrace, which addresses 32-bit 
  build issues
- a new patch to address issues with function_graph tracing during cpu 
  hotplug.

The last three patches are unchanged from the previous series, for the 
most part (except for the ftrace_disabled -> ftrace_enabled adjustment).

The last two patches implement support for ftrace_caller() to 
conditionally save the register state. This speeds up the function 
tracer a bit. The existing ftrace_caller() is renamed to 
ftrace_regs_caller() since we save the entire pt_regs today. A new 
implementation of ftrace_caller() that saves the minimum register state 
is provided. We switch between the two variants through 
ftrace_modify_call(). The necessary support to call into the two 
different variants from modules is also added.
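
For readers less familiar with the core-ftrace side of this: which of the
two trampolines gets used for a call site is driven by the registered
ftrace_ops, since an ops that needs pt_regs sets FTRACE_OPS_FL_SAVE_REGS.
A minimal, untested module sketch of the two kinds of users (callback body
left empty on purpose):

#include <linux/module.h>
#include <linux/ftrace.h>

static void demo_cb(unsigned long ip, unsigned long parent_ip,
		    struct ftrace_ops *op, struct pt_regs *regs)
{
	/*
	 * Without FTRACE_OPS_FL_SAVE_REGS, 'regs' is not guaranteed to be a
	 * complete register set, so the lightweight ftrace_caller() is enough.
	 */
}

static struct ftrace_ops light_ops = {
	.func	= demo_cb,
	/* no FTRACE_OPS_FL_SAVE_REGS: the cheap ftrace_caller() suffices */
};

static struct ftrace_ops regs_ops = {
	.func	= demo_cb,
	.flags	= FTRACE_OPS_FL_SAVE_REGS,	/* needs ftrace_regs_caller() */
};

static int __init demo_init(void)
{
	/*
	 * A real user would narrow these down with ftrace_set_filter_ip()
	 * before registering; left out to keep the sketch short.
	 */
	register_ftrace_function(&light_ops);
	register_ftrace_function(&regs_ops);
	return 0;
}

static void __exit demo_exit(void)
{
	unregister_ftrace_function(&regs_ops);
	unregister_ftrace_function(&light_ops);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");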


- Naveen


Naveen N. Rao (10):
  powerpc64/ftrace: Add a field in paca to disable ftrace in unsafe code
paths
  powerpc64/ftrace: Disable ftrace during kvm guest entry/exit
  powerpc64/ftrace: Rearrange #ifdef sections in ftrace.h
  powerpc64/ftrace: Add helpers to hard disable ftrace
  powerpc64/ftrace: Delay enabling ftrace on secondary cpus
  powerpc64/ftrace: Disable ftrace during hotplug
  powerpc64/kexec: Hard disable ftrace before switching to the new
kernel
  powerpc64/module: Tighten detection of mcount call sites with
-mprofile-kernel
  powerpc64/ftrace: Use the generic version of ftrace_replace_code()
  powerpc64/ftrace: Implement support for ftrace_regs_caller()

 arch/powerpc/include/asm/ftrace.h  |  27 +++-
 arch/powerpc/include/asm/module.h  |   3 +
 arch/powerpc/include/asm/paca.h|   1 +
 arch/powerpc/kernel/asm-offsets.c  |   1 +
 arch/powerpc/kernel/machine_kexec.c|   2 +
 arch/powerpc/kernel/module_64.c|  43 +++--
 arch/powerpc/kernel/setup_64.c |   7 +
 arch/powerpc/kernel/smp.c  |  12 ++
 arch/powerpc/kernel/trace/ftrace.c | 210 -
 arch/powerpc/kernel/trace/ftrace_64_mprofile.S |  85 +-
 arch/powerpc/kernel/trace/ftrace_64_pg.S   |   4 +
 arch/powerpc/kvm/book3s_hv_rmhandlers.S|   8 +
 12 files changed, 336 insertions(+), 67 deletions(-)

-- 
2.16.2



Re: [PATCH v9 15/24] mm: Introduce __vm_normal_page()

2018-04-03 Thread Jerome Glisse
On Tue, Mar 13, 2018 at 06:59:45PM +0100, Laurent Dufour wrote:
> When dealing with the speculative fault path we should use the VMA's
> cached field values stored in the vm_fault structure.
> 
> Currently vm_normal_page() is using the pointer to the VMA to fetch the
> vm_flags value. This patch provides a new __vm_normal_page() which
> receives the vm_flags value as a parameter.
> 
> Note: The speculative path is turned on for architectures providing
> support for the special PTE flag. So only the first block of
> vm_normal_page() is used during the speculative path.

Might be a good idea to explicitly have the SPECULATIVE Kconfig option
depend on ARCH_PTE_SPECIAL, and a comment for !HAVE_PTE_SPECIAL in the
function explaining that a speculative page fault should never reach
that point.

Cheers,
Jérôme


Re: [PATCH v3 1/7] powerpc/fadump: Move the metadata region to start of the reserved area.

2018-04-03 Thread Hari Bathini
Mahesh, I think we should explicitly document that the production and
capture kernel versions should be the same. For changes like the one
below, an older/newer production kernel vs. capture kernel is bound to
fail. Of course, the production and capture kernel versions would
usually be the same anyway, but for the uninitiated, mentioning that
explicitly in the documentation would help..?

Thanks
Hari


On Monday 02 April 2018 11:59 AM, Mahesh J Salgaonkar wrote:

From: Mahesh Salgaonkar 

Currently the metadata region that holds the crash info structure and ELF
core header is placed towards the end of the reserved memory area. This
patch places it at the beginning of the reserved memory area.

Signed-off-by: Mahesh Salgaonkar 
---
  arch/powerpc/include/asm/fadump.h |4 ++
  arch/powerpc/kernel/fadump.c  |   92 -
  2 files changed, 83 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/include/asm/fadump.h 
b/arch/powerpc/include/asm/fadump.h
index 5a23010af600..44b6ebfe9be6 100644
--- a/arch/powerpc/include/asm/fadump.h
+++ b/arch/powerpc/include/asm/fadump.h
@@ -61,6 +61,9 @@
  #define FADUMP_UNREGISTER 2
  #define FADUMP_INVALIDATE 3

+/* Number of dump sections requested by kernel */
+#define FADUMP_NUM_SECTIONS4
+
  /* Dump status flag */
  #define FADUMP_ERROR_FLAG 0x2000

@@ -119,6 +122,7 @@ struct fadump_mem_struct {
struct fadump_section   cpu_state_data;
struct fadump_section   hpte_region;
struct fadump_section   rmr_region;
+   struct fadump_section   metadata_region;
  };

  /* Firmware-assisted dump configuration details. */
diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index 3c2c2688918f..80552fd7d5f8 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -188,17 +188,48 @@ static void fadump_show_config(void)
pr_debug("Boot memory size  : %lx\n", fw_dump.boot_memory_size);
  }

+static unsigned long get_fadump_metadata_size(void)
+{
+   unsigned long size = 0;
+
+   /*
+* If fadump is active then look into fdm_active to get metadata
+* size set by previous kernel.
+*/
+   if (fw_dump.dump_active) {
+   size = fdm_active->cpu_state_data.destination_address -
+   fdm_active->metadata_region.source_address;
+   goto out;
+   }
+   size += sizeof(struct fadump_crash_info_header);
+   size += sizeof(struct elfhdr); /* ELF core header.*/
+   size += sizeof(struct elf_phdr); /* place holder for cpu notes */
+   /* Program headers for crash memory regions. */
+   size += sizeof(struct elf_phdr) * (memblock_num_regions(memory) + 2);
+
+   size = PAGE_ALIGN(size);
+out:
+   pr_debug("fadump Metadata size is %ld\n", size);
+   return size;
+}
+
  static unsigned long init_fadump_mem_struct(struct fadump_mem_struct *fdm,
unsigned long addr)
  {
+   uint16_t num_sections = 0;
+   unsigned long metadata_base = 0;
+
if (!fdm)
return 0;

+   /* Skip the fadump metadata area. */
+   metadata_base = addr;
+   addr += get_fadump_metadata_size();
+
memset(fdm, 0, sizeof(struct fadump_mem_struct));
addr = addr & PAGE_MASK;

fdm->header.dump_format_version = cpu_to_be32(0x0001);
-   fdm->header.dump_num_sections = cpu_to_be16(3);
fdm->header.dump_status_flag = 0;
fdm->header.offset_first_dump_section =
cpu_to_be32((u32)offsetof(struct fadump_mem_struct, 
cpu_state_data));
@@ -222,6 +253,7 @@ static unsigned long init_fadump_mem_struct(struct 
fadump_mem_struct *fdm,
fdm->cpu_state_data.source_address = 0;
fdm->cpu_state_data.source_len = 
cpu_to_be64(fw_dump.cpu_state_data_size);
fdm->cpu_state_data.destination_address = cpu_to_be64(addr);
+   num_sections++;
addr += fw_dump.cpu_state_data_size;

/* hpte region section */
@@ -230,6 +262,7 @@ static unsigned long init_fadump_mem_struct(struct 
fadump_mem_struct *fdm,
fdm->hpte_region.source_address = 0;
fdm->hpte_region.source_len = cpu_to_be64(fw_dump.hpte_region_size);
fdm->hpte_region.destination_address = cpu_to_be64(addr);
+   num_sections++;
addr += fw_dump.hpte_region_size;

/* RMA region section */
@@ -238,8 +271,25 @@ static unsigned long init_fadump_mem_struct(struct 
fadump_mem_struct *fdm,
fdm->rmr_region.source_address = cpu_to_be64(RMA_START);
fdm->rmr_region.source_len = cpu_to_be64(fw_dump.boot_memory_size);
fdm->rmr_region.destination_address = cpu_to_be64(addr);
+   num_sections++;
addr += fw_dump.boot_memory_size;

+   /*
+* fadump metadata section.
+* Add an entry with source len as zero. We are only interested in
+* source address 

Re: [PATCH v9 06/24] mm: make pte_unmap_same compatible with SPF

2018-04-03 Thread Jerome Glisse
On Tue, Mar 13, 2018 at 06:59:36PM +0100, Laurent Dufour wrote:
> pte_unmap_same() is making the assumption that the page table are still
> around because the mmap_sem is held.
> This is no more the case when running a speculative page fault and
> additional check must be made to ensure that the final page table are still
> there.
> 
> This is now done by calling pte_spinlock() to check for the VMA's
> consistency while locking for the page tables.
> 
> This is requiring passing a vm_fault structure to pte_unmap_same() which is
> containing all the needed parameters.
> 
> As pte_spinlock() may fail in the case of a speculative page fault, if the
> VMA has been touched in our back, pte_unmap_same() should now return 3
> cases :
>   1. pte are the same (0)
>   2. pte are different (VM_FAULT_PTNOTSAME)
>   3. a VMA's changes has been detected (VM_FAULT_RETRY)
> 
> The case 2 is handled by the introduction of a new VM_FAULT flag named
> VM_FAULT_PTNOTSAME which is then trapped in cow_user_page().
> If VM_FAULT_RETRY is returned, it is passed up to the callers to retry the
> page fault while holding the mmap_sem.
> 
> Signed-off-by: Laurent Dufour 
> ---
>  include/linux/mm.h |  1 +
>  mm/memory.c| 29 +++--
>  2 files changed, 20 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 2f3e98edc94a..b6432a261e63 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1199,6 +1199,7 @@ static inline void clear_page_pfmemalloc(struct page 
> *page)
>  #define VM_FAULT_NEEDDSYNC  0x2000   /* ->fault did not modify page tables
>* and needs fsync() to complete (for
>* synchronous page faults in DAX) */
> +#define VM_FAULT_PTNOTSAME 0x4000 /* Page table entries have changed */
>  
>  #define VM_FAULT_ERROR   (VM_FAULT_OOM | VM_FAULT_SIGBUS | 
> VM_FAULT_SIGSEGV | \
>VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE | \
> diff --git a/mm/memory.c b/mm/memory.c
> index 21b1212a0892..4bc7b0bdcb40 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2309,21 +2309,29 @@ static bool pte_map_lock(struct vm_fault *vmf)
>   * parts, do_swap_page must check under lock before unmapping the pte and
>   * proceeding (but do_wp_page is only called after already making such a 
> check;
>   * and do_anonymous_page can safely check later on).
> + *
> + * pte_unmap_same() returns:
> + *   0   if the PTE are the same
> + *   VM_FAULT_PTNOTSAME  if the PTE are different
> + *   VM_FAULT_RETRY  if the VMA has changed in our back during
> + *   a speculative page fault handling.
>   */
> -static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd,
> - pte_t *page_table, pte_t orig_pte)
> +static inline int pte_unmap_same(struct vm_fault *vmf)
>  {
> - int same = 1;
> + int ret = 0;
> +
>  #if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT)
>   if (sizeof(pte_t) > sizeof(unsigned long)) {
> - spinlock_t *ptl = pte_lockptr(mm, pmd);
> - spin_lock(ptl);
> - same = pte_same(*page_table, orig_pte);
> - spin_unlock(ptl);
> + if (pte_spinlock(vmf)) {
> + if (!pte_same(*vmf->pte, vmf->orig_pte))
> + ret = VM_FAULT_PTNOTSAME;
> + spin_unlock(vmf->ptl);
> + } else
> + ret = VM_FAULT_RETRY;
>   }
>  #endif
> - pte_unmap(page_table);
> - return same;
> + pte_unmap(vmf->pte);
> + return ret;
>  }
>  
>  static inline void cow_user_page(struct page *dst, struct page *src, 
> unsigned long va, struct vm_area_struct *vma)
> @@ -2913,7 +2921,8 @@ int do_swap_page(struct vm_fault *vmf)
>   int exclusive = 0;
>   int ret = 0;
>  
> - if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte))
> + ret = pte_unmap_same(vmf);
> + if (ret)
>   goto out;
>  

This changes what do_swap_page() returns: before, it returned 0 when the
locked pte lookup differed from orig_pte. After this patch it returns
VM_FAULT_PTNOTSAME, but this is a new return value for handle_mm_fault()
(the do_swap_page() return value is what ultimately gets returned by
handle_mm_fault()).

Do we really want that? This might confuse some existing users of
handle_mm_fault(), and I am not sure of the value of that information
to the caller.

Note I do understand that you want to return retry if anything changed
underneath and thus need to differentiate that from the case where the
pte values are not the same.
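
One way to keep the new value internal to do_swap_page(), while still letting
VM_FAULT_RETRY bubble up for the speculative case, is sketched below. This is
only an illustration built on the pte_unmap_same() interface quoted above,
not a replacement for the patch:

	/*
	 * Sketch only: map VM_FAULT_PTNOTSAME back to the old "success"
	 * return so existing handle_mm_fault() callers never see it,
	 * while VM_FAULT_RETRY still propagates for the SPF case.
	 */
	ret = pte_unmap_same(vmf);
	if (ret) {
		if (ret == VM_FAULT_PTNOTSAME)
			ret = 0;	/* PTE changed under us: nothing left to do */
		goto out;
	}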

Cheers,
Jérôme


Re: [RESEND v2 3/4] doc/devicetree: Persistent memory region bindings

2018-04-03 Thread Dan Williams
On Tue, Apr 3, 2018 at 7:24 AM, Oliver O'Halloran  wrote:
> Add device-tree binding documentation for the nvdimm region driver.
>
> Cc: devicet...@vger.kernel.org
> Signed-off-by: Oliver O'Halloran 
> ---
> v2: Changed name from nvdimm-region to pmem-region.
> Cleaned up the example binding and fixed the overlapping regions.
> Added support for multiple regions in a single reg.
> ---
>  .../devicetree/bindings/pmem/pmem-region.txt   | 80 
> ++
>  MAINTAINERS|  1 +
>  2 files changed, 81 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/pmem/pmem-region.txt

Device-tree folks, does this look ok?

Oliver, is there any concept of a management interface to the
device(s) backing these regions? libnvdimm calls these "nmem" devices
and support operations like health status and namespace label
management.


Re: [RESEND v2 2/4] libnvdimm: Add device-tree based driver

2018-04-03 Thread Dan Williams
On Tue, Apr 3, 2018 at 7:24 AM, Oliver O'Halloran  wrote:
> This patch adds preliminary device-tree bindings for persistent memory
> regions. The driver registers a libnvdimm bus for each pmem-region
> node and each address range under the node is converted to a region
> within that bus.
>
> Signed-off-by: Oliver O'Halloran 
> ---
> v2: Made each bus have a separate node rather than having a shared bus.
> Renamed to of_pmem rather than of_nvdimm.
> Changed log level of happy-path messages to debug.
> ---
[..]
> +static struct platform_driver of_nd_region_driver = {
> +   .probe = of_nd_region_probe,
> +   .remove = of_nd_region_remove,
> +   .driver = {
> +   .name = "of_pmem",
> +   .owner = THIS_MODULE,
> +   .of_match_table = of_nd_region_match,
> +   },
> +};

This and the other patches look good to me. Just a nit on the
naming... since you name the regions pmem-regions in the device-tree
description shouldn't this be the "of_pmem_region" or "of_pmem_range"
driver? Otherwise, it is confusing to me that anything named
*nd_region would be creating an nvdimm_bus. In general, an nd_region is
always a child of a bus.

That said, with an ack/reviewed-by on the device-tree bindings I can
take these through the nvdimm tree. I'll reply to patch 4 with that
request for ack.


Re: [1/3] powerpc/64s/idle: POWER9 implement a separate idle stop function for hotplug

2018-04-03 Thread Nicholas Piggin
On Wed,  4 Apr 2018 02:03:18 +1000 (AEST)
Michael Ellerman  wrote:

> On Fri, 2017-11-17 at 14:08:05 UTC, Nicholas Piggin wrote:
> > Implement a new function to invoke stop, power9_offline_stop, which is
> > like power9_idle_stop but used by the cpu hotplug code.
> > 
> > Move KVM secondary state manipulation code to the offline case.
> > 
> > Signed-off-by: Nicholas Piggin 
> > Reviewed-by: Vaidyanathan Srinivasan   
> 
> Applied to powerpc next, thanks.
> 
> https://git.kernel.org/powerpc/c/3d4fbffdd703d2b968db443911f214
> 
> cheers

I sent out a new series for this which is rebased and a bit cleaner.
I probably wasn't clear enough about the change, sorry.

This patch will need to be adjusted for the force SMT4 change. The
end result should look like this,

https://patchwork.ozlabs.org/patch/893948/

Thanks,
Nick


Re: selftests/powerpc: Fix copyloops build since Power4 assembler change

2018-04-03 Thread Michael Ellerman
On Tue, 2018-04-03 at 15:53:18 UTC, Michael Ellerman wrote:
> The recent commit 15a3204d24a3 ("powerpc/64s: Set assembler machine
> type to POWER4") set the machine type in our ASFLAGS when building the
> kernel, and removed some ".machine power4" directives from various asm
> files.
> 
> This broke the selftests build on old toolchains (that don't assume
> Power4), because we build the kernel source files into the selftests
> using different ASFLAGS.
> 
> The fix is simply to add -mpower4 to the selftest ASFLAGS as well.
> 
> Fixes: 15a3204d24a3 ("powerpc/64s: Set assembler machine type to POWER4")
> Signed-off-by: Michael Ellerman 

Applied to powerpc next.

https://git.kernel.org/powerpc/c/b6f534d1a642a9b6263fd52df30806

cheers


Re: [V2] powerpc: Fix oops due to wrong access of lppaca on non virtualized platform

2018-04-03 Thread Michael Ellerman
On Mon, 2018-04-02 at 07:33:37 UTC, "Aneesh Kumar K.V" wrote:
> Commit 8e0b634b1327 ("powerpc/64s: Do not allocate lppaca if we are not
> virtualized") removed allocation of lppaca on non virtualized platform. But 
> with
> CONFIG_PPC_SPLPAR enabled, we do access lppaca non-virtualized platform in 
> some
> code paths.
> 
> Fix this by adding a runtime check for virtualized platforms.
> 
> Fixes: 8e0b634b1327 ("powerpc/64s: Do not allocate lppaca if we are not 
> virtualized")
> 
> Signed-off-by: Aneesh Kumar K.V 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/a6201da34ff9366680e97392efd06a

cheers
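
For context, the runtime check described in the commit message above is
typically a firmware-feature test around the lppaca access. A minimal sketch,
with a made-up helper name and assuming FW_FEATURE_SPLPAR is the right gate,
looks like this:

	/* Sketch only, not the applied patch: skip lppaca reads on bare metal. */
	static inline bool cpu_has_yielded_sketch(int cpu)
	{
		if (!firmware_has_feature(FW_FEATURE_SPLPAR))
			return false;	/* not virtualized: no lppaca allocated */

		return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1);
	}

The applied commit linked above has the authoritative call sites.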


Re: [1/2] KVM: PPC: Book3S HV: Fix ppc_breakpoint_available compile error

2018-04-03 Thread Michael Ellerman
On Sun, 2018-04-01 at 05:50:35 UTC, Nicholas Piggin wrote:
> arch/powerpc/kvm/book3s_hv.c: In function ‘kvmppc_h_set_mode’:
> arch/powerpc/kvm/book3s_hv.c:745:8: error: implicit declaration of function 
> ‘ppc_breakpoint_available’; did you mean ‘hw_breakpoint_disable’? 
> [-Werror=implicit-function-declaration]
>if (!ppc_breakpoint_available())
> ^~~~
> hw_breakpoint_disable
> 
> Cc: Michael Neuling 
> Fixes: 398e712c00 ("KVM: PPC: Book3S HV: Return error from 
> h_set_mode(SET_DAWR) on POWER9")
> Signed-off-by: Nicholas Piggin 

Series applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/e303c08787c4cbe1ca07912817dff2

cheers


Re: [1/2] powerpc: Move default security feature flags

2018-04-03 Thread Michael Ellerman
On Fri, 2018-03-30 at 17:28:24 UTC, Mauricio Faria de Oliveira wrote:
> This moves the definition of the default security feature flags
> (i.e., enabled by default) closer to the security feature flags.
> 
> This can be used to restore current flags to the default flags.
> 
> Signed-off-by: Mauricio Faria de Oliveira 

Series applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/e7347a86830f38dc3e40c8f7e28c04

cheers


Re: powerpc: Clear branch trap (MSR.BE) before delivering SIGTRAP

2018-04-03 Thread Michael Ellerman
On Mon, 2018-03-26 at 16:55:21 UTC, Matt Evans wrote:
> When using SIG_DBG_BRANCH_TRACING, MSR.BE is left enabled in the
> user context when single_step_exception() prepares the SIGTRAP
> delivery.  The resulting branch-trap-within-the-SIGTRAP-handler
> isn't healthy.
> 
> Commit 2538c2d08f46141550a1e68819efa8fe31c6e3dc broke this, by
> replacing an MSR mask operation of ~(MSR_SE | MSR_BE) with a call
> to clear_single_step() which only clears MSR_SE.
> 
> This patch adds a new helper, clear_br_trace(), which clears the
> debug trap before invoking the signal handler.  This helper is a
> NOP for BookE as SIG_DBG_BRANCH_TRACING isn't supported on BookE.
> 
> Signed-off-by: Matt Evans 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/0e524e761fc2157f1037e0f5d616cd

cheers
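
A clear_br_trace() helper of the kind described above can be very small; the
sketch below assumes the BookE no-op is gated on CONFIG_PPC_ADV_DEBUG_REGS,
which is an assumption here rather than a detail taken from the applied patch:

/* Sketch only: drop MSR_BE from the user regs before SIGTRAP delivery. */
#ifndef CONFIG_PPC_ADV_DEBUG_REGS
static void clear_br_trace(struct pt_regs *regs)
{
	regs->msr &= ~MSR_BE;
}
#else
/* BookE has no MSR_BE-based branch tracing, so nothing to clear. */
static void clear_br_trace(struct pt_regs *regs) { }
#endif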


Re: powerpc/64s: return more carefully from sreset NMI

2018-04-03 Thread Michael Ellerman
On Mon, 2018-03-26 at 15:01:03 UTC, Nicholas Piggin wrote:
> System Reset, being an NMI, must return more carefully than other
> interrupts. It has traditionally returned via the normal return
> from exception path, but that has a number of problems.
> 
> - r13 does not get restored if returning to kernel. This is for
>   interrupts which may cause a context switch, which sreset will
>   never do. Interrupting OPAL (which uses a different r13) is one
>   place where this causes breakage.
> 
> - It may cause several other problems returning to kernel with
>   preempt or TIF_EMULATE_STACK_STORE if it hits at the wrong time.
> 
> It's safer just to have a simple restore and return, like machine
> check which is the other NMI.
> 
> Signed-off-by: Nicholas Piggin 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/15b4dd7981496f51c5f9262a5e0761

cheers


Re: powerpc/64s: sreset panic if there is no debugger or crash dump handlers

2018-04-03 Thread Michael Ellerman
On Mon, 2018-03-26 at 15:01:16 UTC, Nicholas Piggin wrote:
> system_reset_exception does most of its own crash handling now,
> invoking the debugger or crash dumps if they are registered. If not,
> then it goes through to die() to print stack traces, and then is
> supposed to panic (according to comments).
> 
> However after die() prints oopses, it does its own handling which
> doesn't allow system_reset_exception to panic (e.g., it may just
> kill the current process). This patch causes sreset exceptions to
> return from die after it prints messages but before acting.
> 
> This also stops die from invoking the debugger on 0x100 crashes.
> system_reset_exception similarly calls the debugger. It had been
> thought this was harmless (because if the debugger was disabled,
> neither call would fire, and if it was enabled the first call
> would return). However in some cases like xmon 'X' command, the
> debugger returns 0, which currently causes it to be entered
> again (first in system_reset_exception, then in die), which is
> confusing.
> 
> Signed-off-by: Nicholas Piggin 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/d40b6768e45bd9213139b2d91d30c7

cheers


Re: powerpc/mm/radix: Fix always false comparison against MMU_NO_CONTEXT

2018-04-03 Thread Michael Ellerman
On Thu, 2018-03-22 at 21:03:18 UTC, Mathieu Malaterre wrote:
> In commit 9690c1574268 ("powerpc/mm/radix: Fix always false comparison
> against MMU_NO_CONTEXT") an issue was discovered where `mm->context.id` was
> being truncated to an `unsigned int`, while the PID is actually an
> `unsigned long`. Update the earlier patch by fixing one remaining
> occurrence. Discovered during a compilation with W=1:
> 
>   arch/powerpc/mm/tlb-radix.c:702:19: error: comparison is always false due 
> to limited range of data type [-Werror=type-limits]
> 
> Signed-off-by: Mathieu Malaterre 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/19e68b2aec3c0a2bd770d3c358a296

cheers
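
The warning comes from a plain truncation pattern that is easy to reproduce
outside the kernel; the constant below is hypothetical and merely stands in
for a MMU_NO_CONTEXT value that does not fit in an unsigned int:

#include <stdio.h>

/* Hypothetical stand-in for a context id wider than 32 bits. */
#define NO_CONTEXT (~0UL)

int main(void)
{
	unsigned long id = NO_CONTEXT;	/* full-width value, e.g. mm->context.id */
	unsigned int pid = id;		/* silently truncated */

	/* gcc -Wtype-limits: "comparison is always false due to limited
	 * range of data type" fires on the line below. */
	if (pid == NO_CONTEXT)
		printf("matched\n");
	else
		printf("truncated: 0x%x != 0x%lx\n", pid, NO_CONTEXT);
	return 0;
}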


Re: powerpc/boot: Remove duplicate typedefs from libfdt_env.h

2018-04-03 Thread Michael Ellerman
On Fri, 2018-03-16 at 21:54:43 UTC, Mark Greer wrote:
> When building a uImage or zImage using ppc6xx_defconfig and some other
> defconfigs, the following error occurs:
> 
>   BOOTCC  arch/powerpc/boot/fdt.o
>   In file included from arch/powerpc/boot/fdt.c:51:0:
>   ../arch/powerpc/boot/libfdt_env.h:10:13: error: redefinition of typedef 
> 'uint32_t'
>   ../arch/powerpc/boot/types.h:21:13: note: previous declaration of 
> 'uint32_t' was here
>   ../arch/powerpc/boot/libfdt_env.h:11:13: error: redefinition of typedef 
> 'uint64_t'
>   ../arch/powerpc/boot/types.h:22:13: note: previous declaration of 
> 'uint64_t' was here
>   ../arch/powerpc/boot/Makefile:210: recipe for target 
> 'arch/powerpc/boot/fdt.o' failed
>   make[2]: *** [arch/powerpc/boot/fdt.o] Error 1
> 
> The problem is that commit 656ad58ef19e (powerpc/boot: Add OPAL console
> to epapr wrappers) adds typedefs for uint32_t and uint64_t to type.h but
> doesn't remove the pre-existing (and now duplicate) typedefs from
> libfdt_env.h.  Fix the error by removing the duplicate typedefs from
> libfdt_env.h
> 
> CC: David Gibson 
> CC: Oliver O'Halloran 
> Signed-off-by: Mark Greer 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/147704534e2de30dd47171d55240c3

cheers


Re: [1/9] powerpc/64s: cputable add all POWER9 features to CPU_FTRS_ALWAYS

2018-04-03 Thread Michael Ellerman
On Tue, 2018-02-20 at 19:08:24 UTC, Nicholas Piggin wrote:
> Signed-off-by: Nicholas Piggin 

Series applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/b842bd0f7a61b129a672f8b038325e

cheers


Re: [2/3] powerpc/64s/idle: avoid sync for KVM state when waking from idle

2018-04-03 Thread Michael Ellerman
On Fri, 2017-11-17 at 14:08:06 UTC, Nicholas Piggin wrote:
> When waking from a CPU idle instruction (e.g., nap or stop), the sync
> for ordering the KVM secondary thread state can be avoided if there
> wakeup is coming from a kernel context rather than KVM context.
> 
> This improves performance for ping-pong benchmark with the stop0 idle
> state by 0.46% for 2 threads in the same core, and 1.02% for different
> cores.
> 
> Signed-off-by: Nicholas Piggin 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/8c1c7fb0b5ec95c392e9b585a6cf8c

cheers


Re: [1/3] powerpc/64s/idle: POWER9 implement a separate idle stop function for hotplug

2018-04-03 Thread Michael Ellerman
On Fri, 2017-11-17 at 14:08:05 UTC, Nicholas Piggin wrote:
> Implement a new function to invoke stop, power9_offline_stop, which is
> like power9_idle_stop but used by the cpu hotplug code.
> 
> Move KVM secondary state manipulation code to the offline case.
> 
> Signed-off-by: Nicholas Piggin 
> Reviewed-by: Vaidyanathan Srinivasan 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/3d4fbffdd703d2b968db443911f214

cheers


[PATCH] selftests/powerpc: Fix copyloops build since Power4 assembler change

2018-04-03 Thread Michael Ellerman
The recent commit 15a3204d24a3 ("powerpc/64s: Set assembler machine
type to POWER4") set the machine type in our ASFLAGS when building the
kernel, and removed some ".machine power4" directives from various asm
files.

This broke the selftests build on old toolchains (that don't assume
Power4), because we build the kernel source files into the selftests
using different ASFLAGS.

The fix is simply to add -mpower4 to the selftest ASFLAGS as well.

Fixes: 15a3204d24a3 ("powerpc/64s: Set assembler machine type to POWER4")
Signed-off-by: Michael Ellerman 
---
 tools/testing/selftests/powerpc/copyloops/Makefile | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/powerpc/copyloops/Makefile 
b/tools/testing/selftests/powerpc/copyloops/Makefile
index ac4a52e19e59..eedce3366f64 100644
--- a/tools/testing/selftests/powerpc/copyloops/Makefile
+++ b/tools/testing/selftests/powerpc/copyloops/Makefile
@@ -5,8 +5,8 @@ CFLAGS += -I$(CURDIR)
 CFLAGS += -D SELFTEST
 CFLAGS += -maltivec
 
-# Use our CFLAGS for the implicit .S rule
-ASFLAGS = $(CFLAGS)
+# Use our CFLAGS for the implicit .S rule & set the asm machine type
+ASFLAGS = $(CFLAGS) -Wa,-mpower4
 
 TEST_GEN_PROGS := copyuser_64 copyuser_power7 memcpy_64 memcpy_power7
 EXTRA_SOURCES := validate.c ../harness.c
-- 
2.14.1



Re: [PATCH] cxl: Fix possible deadlock when processing page faults from cxllib

2018-04-03 Thread Aneesh Kumar K.V

On 04/03/2018 08:10 PM, Aneesh Kumar K.V wrote:

On 04/03/2018 03:13 PM, Frederic Barrat wrote:

cxllib_handle_fault() is called by an external driver when it needs to
have the host process page faults for a buffer which may cover several
pages. Currently the function holds the mm->mmap_sem semaphore with
read access while iterating over the buffer, since it could span
several VMAs. When calling a lower-level function to handle the page
fault for a single page, the semaphore is accessed again in read
mode. That is wrong and can lead to deadlocks if a writer tries to
sneak in while a buffer of several pages is being processed.

The fix is to release the semaphore once cxllib_handle_fault() got the
information it needs from the current vma. The address space/VMAs
could evolve while we iterate over the full buffer, but in the
unlikely case where we miss a page, the driver will raise a new page
fault when retrying.

Fixes: 3ced8d730063 ("cxl: Export library to support IBM XSL")
Cc: sta...@vger.kernel.org # 4.13+
Signed-off-by: Frederic Barrat 
---
  drivers/misc/cxl/cxllib.c | 85 
++-

  1 file changed, 55 insertions(+), 30 deletions(-)

diff --git a/drivers/misc/cxl/cxllib.c b/drivers/misc/cxl/cxllib.c
index 30ccba436b3b..55cd35d1a9cc 100644
--- a/drivers/misc/cxl/cxllib.c
+++ b/drivers/misc/cxl/cxllib.c
@@ -208,49 +208,74 @@ int cxllib_get_PE_attributes(struct task_struct 
*task,

  }
  EXPORT_SYMBOL_GPL(cxllib_get_PE_attributes);
-int cxllib_handle_fault(struct mm_struct *mm, u64 addr, u64 size, u64 
flags)

+static int get_vma_info(struct mm_struct *mm, u64 addr,
+    u64 *vma_start, u64 *vma_end,
+    unsigned long *page_size)
  {
-    int rc;
-    u64 dar;
  struct vm_area_struct *vma = NULL;
-    unsigned long page_size;
-
-    if (mm == NULL)
-    return -EFAULT;
+    int rc = 0;
  down_read(&mm->mmap_sem);
  vma = find_vma(mm, addr);
  if (!vma) {
-    pr_err("Can't find vma for addr %016llx\n", addr);
  rc = -EFAULT;
  goto out;
  }
-    /* get the size of the pages allocated */
-    page_size = vma_kernel_pagesize(vma);
-
-    for (dar = (addr & ~(page_size - 1)); dar < (addr + size); dar += 
page_size) {

-    if (dar < vma->vm_start || dar >= vma->vm_end) {
-    vma = find_vma(mm, addr);
-    if (!vma) {
-    pr_err("Can't find vma for addr %016llx\n", addr);
-    rc = -EFAULT;
-    goto out;
-    }
-    /* get the size of the pages allocated */
-    page_size = vma_kernel_pagesize(vma);
+    *page_size = vma_kernel_pagesize(vma);
+    *vma_start = vma->vm_start;
+    *vma_end = vma->vm_end;
+out:
+    up_read(&mm->mmap_sem);
+    return rc;
+}
+
+int cxllib_handle_fault(struct mm_struct *mm, u64 addr, u64 size, u64 
flags)

+{
+    int rc;
+    u64 dar, vma_start, vma_end;
+    unsigned long page_size;
+
+    if (mm == NULL)
+    return -EFAULT;
+
+    /*
+ * The buffer we have to process can extend over several pages
+ * and may also cover several VMAs.
+ * We iterate over all the pages. The page size could vary
+ * between VMAs.
+ */
+    rc = get_vma_info(mm, addr, &vma_start, &vma_end, &page_size);
+    if (rc)
+    return rc;
+
+    for (dar = (addr & ~(page_size - 1)); dar < (addr + size);
+ dar += page_size) {
+    if (dar < vma_start || dar >= vma_end) {



IIUC, we are fetching the vma to get just the page_size with which it is 
mapped? Can't we iterate with PAGE_SIZE? Considering hugetlb page size 
will be larger than PAGE_SIZE, we might call into cxl_handle_mm_fault 
multiple times for a hugetlb page. Does that cause any issue? Also can 
cxl be used with hugetlb mappings?


Can you also try to use a helper like the one below? That will clarify the
need for find_vma there.


static int __cxllib_handle_fault(struct mm_struct *mm, unsigned long start,
				 unsigned long end,
				 unsigned long mapping_psize, u64 flags)
{
	int rc;
	unsigned long dar;

	for (dar = start; dar < end; dar += mapping_psize) {
		rc = cxl_handle_mm_fault(mm, flags, dar);
		if (rc) {
			rc = -EFAULT;
			goto out;
		}
	}
	rc = 0;
out:
	return rc;
}
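
For completeness, here is a hedged sketch of how cxllib_handle_fault() could
drive that helper together with the get_vma_info() from the patch above; the
chunking and error handling are illustrative, not the final implementation:

	int cxllib_handle_fault(struct mm_struct *mm, u64 addr, u64 size, u64 flags)
	{
		u64 dar, vma_start, vma_end;
		unsigned long page_size;
		int rc;

		if (mm == NULL)
			return -EFAULT;

		for (dar = addr; dar < addr + size; dar = vma_end) {
			/* Look up the VMA covering the next chunk of the buffer. */
			rc = get_vma_info(mm, dar, &vma_start, &vma_end, &page_size);
			if (rc)
				return rc;

			/* Fault in every page of this VMA touched by the buffer. */
			rc = __cxllib_handle_fault(mm, dar & ~(page_size - 1),
						   min(vma_end, addr + size),
						   page_size, flags);
			if (rc)
				return rc;
		}
		return 0;
	}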

-aneesh



Re: [PATCH] cxl: Fix possible deadlock when processing page faults from cxllib

2018-04-03 Thread Aneesh Kumar K.V

On 04/03/2018 03:13 PM, Frederic Barrat wrote:

cxllib_handle_fault() is called by an external driver when it needs to
have the host process page faults for a buffer which may cover several
pages. Currently the function holds the mm->mmap_sem semaphore with
read access while iterating over the buffer, since it could span
several VMAs. When calling a lower-level function to handle the page
fault for a single page, the semaphore is accessed again in read
mode. That is wrong and can lead to deadlocks if a writer tries to
sneak in while a buffer of several pages is being processed.

The fix is to release the semaphore once cxllib_handle_fault() got the
information it needs from the current vma. The address space/VMAs
could evolve while we iterate over the full buffer, but in the
unlikely case where we miss a page, the driver will raise a new page
fault when retrying.

Fixes: 3ced8d730063 ("cxl: Export library to support IBM XSL")
Cc: sta...@vger.kernel.org # 4.13+
Signed-off-by: Frederic Barrat 
---
  drivers/misc/cxl/cxllib.c | 85 ++-
  1 file changed, 55 insertions(+), 30 deletions(-)

diff --git a/drivers/misc/cxl/cxllib.c b/drivers/misc/cxl/cxllib.c
index 30ccba436b3b..55cd35d1a9cc 100644
--- a/drivers/misc/cxl/cxllib.c
+++ b/drivers/misc/cxl/cxllib.c
@@ -208,49 +208,74 @@ int cxllib_get_PE_attributes(struct task_struct *task,
  }
  EXPORT_SYMBOL_GPL(cxllib_get_PE_attributes);
  
-int cxllib_handle_fault(struct mm_struct *mm, u64 addr, u64 size, u64 flags)

+static int get_vma_info(struct mm_struct *mm, u64 addr,
+   u64 *vma_start, u64 *vma_end,
+   unsigned long *page_size)
  {
-   int rc;
-   u64 dar;
struct vm_area_struct *vma = NULL;
-   unsigned long page_size;
-
-   if (mm == NULL)
-   return -EFAULT;
+   int rc = 0;
  
  	down_read(&mm->mmap_sem);
  
  	vma = find_vma(mm, addr);

if (!vma) {
-   pr_err("Can't find vma for addr %016llx\n", addr);
rc = -EFAULT;
goto out;
}
-   /* get the size of the pages allocated */
-   page_size = vma_kernel_pagesize(vma);
-
-   for (dar = (addr & ~(page_size - 1)); dar < (addr + size); dar += 
page_size) {
-   if (dar < vma->vm_start || dar >= vma->vm_end) {
-   vma = find_vma(mm, addr);
-   if (!vma) {
-   pr_err("Can't find vma for addr %016llx\n", 
addr);
-   rc = -EFAULT;
-   goto out;
-   }
-   /* get the size of the pages allocated */
-   page_size = vma_kernel_pagesize(vma);
+   *page_size = vma_kernel_pagesize(vma);
+   *vma_start = vma->vm_start;
+   *vma_end = vma->vm_end;
+out:
+   up_read(&mm->mmap_sem);
+   return rc;
+}
+
+int cxllib_handle_fault(struct mm_struct *mm, u64 addr, u64 size, u64 flags)
+{
+   int rc;
+   u64 dar, vma_start, vma_end;
+   unsigned long page_size;
+
+   if (mm == NULL)
+   return -EFAULT;
+
+   /*
+* The buffer we have to process can extend over several pages
+* and may also cover several VMAs.
+* We iterate over all the pages. The page size could vary
+* between VMAs.
+*/
+   rc = get_vma_info(mm, addr, &vma_start, &vma_end, &page_size);
+   if (rc)
+   return rc;
+
+   for (dar = (addr & ~(page_size - 1)); dar < (addr + size);
+dar += page_size) {
+   if (dar < vma_start || dar >= vma_end) {



IIUC, we are fetching the vma to get just the page_size with which it is 
mapped? Can't we iterate with PAGE_SIZE? Considering hugetlb page size 
will be larger than PAGE_SIZE, we might call into cxl_handle_mm_fault 
multiple times for a hugetlb page. Does that cause any issue? Also can 
cxl be used with hugetlb mappings?



+   /*
+* We don't hold the mm->mmap_sem semaphore
+* while iterating, since the semaphore is
+* required by one of the lower-level page
+* fault processing functions and it could
+* create a deadlock.
+*
+* It means the VMAs can be altered between 2
+* loop iterations and we could theoretically
+* miss a page (however unlikely). But that's
+* not really a problem, as the driver will
+* retry access, get another page fault on the
+* missing page and call us again.
+*/
+   rc = get_vma_info(mm, dar, &vma_start, &vma_end,
+   &page_size);
+   if (rc)
+   return rc;
 

Re: [PATCH v2] cxl: Fix possible deadlock when processing page faults from cxllib

2018-04-03 Thread Frederic Barrat



On 03/04/2018 at 16:06, Laurent Dufour wrote:

On 03/04/2018 15:54, Frederic Barrat wrote:

cxllib_handle_fault() is called by an external driver when it needs to
have the host resolve page faults for a buffer. The buffer can cover
several pages and VMAs. The function iterates over all the pages used
by the buffer, based on the page size of the VMA.

To ensure some stability while processing the faults, the thread T1
grabs the mm->mmap_sem semaphore with read access (R1). However, when
processing a page fault for a single page, one of the underlying
functions, copro_handle_mm_fault(), also grabs the same semaphore with
read access (R2). So the thread T1 takes the semaphore twice.

If another thread T2 tries to access the semaphore in write mode W1
(say, because it wants to allocate memory and calls 'brk'), then that
thread T2 will have to wait because there's a reader (R1). If the
thread T1 is processing a new page at that time, it won't get an
automatic grant at R2, because there's now a writer thread
waiting (T2). And we have a deadlock.

The timeline is:
1. thread T1 owns the semaphore with read access R1
2. thread T2 requests write access W1 and waits
3. thread T1 requests read access R2 and waits

The fix is for the thread T1 to release the semaphore R1 once it got
the information it needs from the current VMA. The address space/VMAs
could evolve while T1 iterates over the full buffer, but in the
unlikely case where T1 misses a page, the external driver will raise a
new page fault when retrying the memory access.

Fixes: 3ced8d730063 ("cxl: Export library to support IBM XSL")
Cc: sta...@vger.kernel.org # 4.13+
Signed-off-by: Frederic Barrat 


What are the changes introduced in this v2 ?


How did that happen! I would have sworn to have added the changelog!

Code is the same. I rewrote the commit message to hopefully make it more 
readable.


  Fred




---
  drivers/misc/cxl/cxllib.c | 85 ++-
  1 file changed, 55 insertions(+), 30 deletions(-)

diff --git a/drivers/misc/cxl/cxllib.c b/drivers/misc/cxl/cxllib.c
index 30ccba436b3b..55cd35d1a9cc 100644
--- a/drivers/misc/cxl/cxllib.c
+++ b/drivers/misc/cxl/cxllib.c
@@ -208,49 +208,74 @@ int cxllib_get_PE_attributes(struct task_struct *task,
  }
  EXPORT_SYMBOL_GPL(cxllib_get_PE_attributes);

-int cxllib_handle_fault(struct mm_struct *mm, u64 addr, u64 size, u64 flags)
+static int get_vma_info(struct mm_struct *mm, u64 addr,
+   u64 *vma_start, u64 *vma_end,
+   unsigned long *page_size)
  {
-   int rc;
-   u64 dar;
struct vm_area_struct *vma = NULL;
-   unsigned long page_size;
-
-   if (mm == NULL)
-   return -EFAULT;
+   int rc = 0;

	down_read(&mm->mmap_sem);

vma = find_vma(mm, addr);
if (!vma) {
-   pr_err("Can't find vma for addr %016llx\n", addr);
rc = -EFAULT;
goto out;
}
-   /* get the size of the pages allocated */
-   page_size = vma_kernel_pagesize(vma);
-
-   for (dar = (addr & ~(page_size - 1)); dar < (addr + size); dar += 
page_size) {
-   if (dar < vma->vm_start || dar >= vma->vm_end) {
-   vma = find_vma(mm, addr);
-   if (!vma) {
-   pr_err("Can't find vma for addr %016llx\n", 
addr);
-   rc = -EFAULT;
-   goto out;
-   }
-   /* get the size of the pages allocated */
-   page_size = vma_kernel_pagesize(vma);
+   *page_size = vma_kernel_pagesize(vma);
+   *vma_start = vma->vm_start;
+   *vma_end = vma->vm_end;
+out:
+   up_read(&mm->mmap_sem);
+   return rc;
+}
+
+int cxllib_handle_fault(struct mm_struct *mm, u64 addr, u64 size, u64 flags)
+{
+   int rc;
+   u64 dar, vma_start, vma_end;
+   unsigned long page_size;
+
+   if (mm == NULL)
+   return -EFAULT;
+
+   /*
+* The buffer we have to process can extend over several pages
+* and may also cover several VMAs.
+* We iterate over all the pages. The page size could vary
+* between VMAs.
+*/
+   rc = get_vma_info(mm, addr, &vma_start, &vma_end, &page_size);
+   if (rc)
+   return rc;
+
+   for (dar = (addr & ~(page_size - 1)); dar < (addr + size);
+dar += page_size) {
+   if (dar < vma_start || dar >= vma_end) {
+   /*
+* We don't hold the mm->mmap_sem semaphore
+* while iterating, since the semaphore is
+* required by one of the lower-level page
+* fault processing functions and it could
+* create a deadlock.
+*
+* It means the VMAs can be altered 

[RESEND v2 4/4] powerpc/powernv: Create platform devs for nvdimm buses

2018-04-03 Thread Oliver O'Halloran
Scan the devicetree for nodes with an nvdimm-bus compatible string and
create a platform device for each of them.

Signed-off-by: Oliver O'Halloran 
---
 arch/powerpc/platforms/powernv/opal.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/opal.c 
b/arch/powerpc/platforms/powernv/opal.c
index c15182765ff5..c37485a3c5c9 100644
--- a/arch/powerpc/platforms/powernv/opal.c
+++ b/arch/powerpc/platforms/powernv/opal.c
@@ -821,6 +821,9 @@ static int __init opal_init(void)
/* Create i2c platform devices */
opal_pdev_init("ibm,opal-i2c");
 
+   /* Handle non-volatile memory devices */
+   opal_pdev_init("pmem-region");
+
/* Setup a heatbeat thread if requested by OPAL */
opal_init_heartbeat();
 
-- 
2.9.5



[RESEND v2 3/4] doc/devicetree: Persistent memory region bindings

2018-04-03 Thread Oliver O'Halloran
Add device-tree binding documentation for the nvdimm region driver.

Cc: devicet...@vger.kernel.org
Signed-off-by: Oliver O'Halloran 
---
v2: Changed name from nvdimm-region to pmem-region.
Cleaned up the example binding and fixed the overlapping regions.
Added support for multiple regions in a single reg.
---
 .../devicetree/bindings/pmem/pmem-region.txt   | 80 ++
 MAINTAINERS|  1 +
 2 files changed, 81 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/pmem/pmem-region.txt

diff --git a/Documentation/devicetree/bindings/pmem/pmem-region.txt 
b/Documentation/devicetree/bindings/pmem/pmem-region.txt
new file mode 100644
index ..de48dc8cd562
--- /dev/null
+++ b/Documentation/devicetree/bindings/pmem/pmem-region.txt
@@ -0,0 +1,80 @@
+Device-tree bindings for persistent memory regions
+-
+
+Persistent memory refers to a class of memory devices that are:
+
+   a) Usable as main system memory (i.e. cacheable), and
+   b) Retain their contents across power failure.
+
+Given b) it is best to think of persistent memory as a kind of memory mapped
+storage device. To ensure data integrity the operating system needs to manage
+persistent regions separately to the normal memory pool. To aid with that this
+binding provides a standardised interface for discovering where persistent
+memory regions exist inside the physical address space.
+
+Bindings for the region nodes:
+-
+
+Required properties:
+   - compatible = "pmem-region"
+
+   - reg = ;
+   The system physical address range of this nvdimm region.
+
+   If the reg property contains multiple address ranges
+   each address range will be treated as though it was specified
+   in a separate device node. Having multiple address ranges in a
+   node implies no special relationship between the two ranges.
+
+Optional properties:
+   - Any relevant NUMA associativity properties for the target platform.
+
+   - A "volatile" property indicating that this region is actually in
+ normal DRAM and does not require cache flushes after each write.
+
+ If this property is absent then the OS must assume that the region
+ is backed by non-volatile memory.
+
+A complete example:
+
+
+Here we define three 4KB regions:
+
+   a) A volatile region at 0x5000 on numa node 0,
+   b) A non-volatile region at 0x6000, and
+   c) A non-volatile region at 0x8000.
+
+/dts-v1/;
+
+/ {
+   #address-cells = <1>;
+   #size-cells = <1>;
+
+   platform {
+   compatible = "simple-bus";
+   #address-cells = <1>;
+   #size-cells = <1>;
+   ranges;
+
+   /*
+* This node specifies one volatile region spanning from
+* 0x5000 to 0x5fff.
+*/
+   pmem@5000 {
+   compatible = "pmem-region";
+   reg = <0x5000 0x1000>;
+   numa-node-id = <0>;
+   volatile;
+   };
+
+   /*
+* This node specifies two 4KB regions that are backed by
+* non-volatile memory.
+*/
+   pmem@6000 {
+   compatible = "pmem-region";
+   reg = <0x6000 0x1000 0x8000 0x1000>;
+   };
+   };
+};
+
diff --git a/MAINTAINERS b/MAINTAINERS
index 6ef38be700e8..fa3c9211d6ff 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8041,6 +8041,7 @@ L:linux-nvd...@lists.01.org
 Q: https://patchwork.kernel.org/project/linux-nvdimm/list/
 S: Supported
 F: drivers/nvdimm/of_pmem.c
+F: Documentation/devicetree/bindings/pmem/pmem-region.txt
 
 LIBNVDIMM: NON-VOLATILE MEMORY DEVICE SUBSYSTEM
 M: Dan Williams 
-- 
2.9.5



[RESEND v2 2/4] libnvdimm: Add device-tree based driver

2018-04-03 Thread Oliver O'Halloran
This patch adds preliminary device-tree bindings for persistent memory
regions. The driver registers a libnvdimm bus for each pmem-region
node and each address range under the node is converted to a region
within that bus.

Signed-off-by: Oliver O'Halloran 
---
v2: Made each bus have a separate node rather than having a shared bus.
Renamed to of_pmem rather than of_nvdimm.
Changed log level of happy-path messages to debug.
---
 MAINTAINERS  |   7 +++
 drivers/nvdimm/Kconfig   |  10 
 drivers/nvdimm/Makefile  |   1 +
 drivers/nvdimm/of_pmem.c | 119 +++
 4 files changed, 137 insertions(+)
 create mode 100644 drivers/nvdimm/of_pmem.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 4e62756936fa..6ef38be700e8 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8035,6 +8035,13 @@ Q:   
https://patchwork.kernel.org/project/linux-nvdimm/list/
 S: Supported
 F: drivers/nvdimm/pmem*
 
+LIBNVDIMM: DEVICETREE BINDINGS
+M: Oliver O'Halloran 
+L: linux-nvd...@lists.01.org
+Q: https://patchwork.kernel.org/project/linux-nvdimm/list/
+S: Supported
+F: drivers/nvdimm/of_pmem.c
+
 LIBNVDIMM: NON-VOLATILE MEMORY DEVICE SUBSYSTEM
 M: Dan Williams 
 L: linux-nvd...@lists.01.org
diff --git a/drivers/nvdimm/Kconfig b/drivers/nvdimm/Kconfig
index a65f2e1d9f53..2d6862bf7436 100644
--- a/drivers/nvdimm/Kconfig
+++ b/drivers/nvdimm/Kconfig
@@ -102,4 +102,14 @@ config NVDIMM_DAX
 
  Select Y if unsure
 
+config OF_PMEM
+   tristate "Device-tree support for persistent memory regions"
+   depends on OF
+   default LIBNVDIMM
+   help
+ Allows regions of persistent memory to be described in the
+ device-tree.
+
+ Select Y if unsure.
+
 endif
diff --git a/drivers/nvdimm/Makefile b/drivers/nvdimm/Makefile
index 70d5f3ad9909..e8847045dac0 100644
--- a/drivers/nvdimm/Makefile
+++ b/drivers/nvdimm/Makefile
@@ -4,6 +4,7 @@ obj-$(CONFIG_BLK_DEV_PMEM) += nd_pmem.o
 obj-$(CONFIG_ND_BTT) += nd_btt.o
 obj-$(CONFIG_ND_BLK) += nd_blk.o
 obj-$(CONFIG_X86_PMEM_LEGACY) += nd_e820.o
+obj-$(CONFIG_OF_PMEM) += of_pmem.o
 
 nd_pmem-y := pmem.o
 
diff --git a/drivers/nvdimm/of_pmem.c b/drivers/nvdimm/of_pmem.c
new file mode 100644
index ..374c796ea1de
--- /dev/null
+++ b/drivers/nvdimm/of_pmem.c
@@ -0,0 +1,119 @@
+// SPDX-License-Identifier: GPL-2.0+
+
+#define pr_fmt(fmt) "of_pmem: " fmt
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static const struct attribute_group *region_attr_groups[] = {
+   &nd_region_attribute_group,
+   &nd_device_attribute_group,
+   NULL,
+};
+
+static const struct attribute_group *bus_attr_groups[] = {
+   &nvdimm_bus_attribute_group,
+   NULL,
+};
+
+struct of_nd_private {
+   struct nvdimm_bus_descriptor bus_desc;
+   struct nvdimm_bus *bus;
+};
+
+static int of_nd_region_probe(struct platform_device *pdev)
+{
+   struct of_nd_private *priv;
+   struct device_node *np;
+   struct nvdimm_bus *bus;
+   bool is_volatile;
+   int i;
+
+   np = dev_of_node(>dev);
+   if (!np)
+   return -ENXIO;
+
+   priv = kzalloc(sizeof(*priv), GFP_KERNEL);
+   if (!priv)
+   return -ENOMEM;
+
+   priv->bus_desc.attr_groups = bus_attr_groups;
+   priv->bus_desc.provider_name = "of_pmem";
+   priv->bus_desc.module = THIS_MODULE;
+   priv->bus_desc.of_node = np;
+
+   priv->bus = bus = nvdimm_bus_register(&pdev->dev, &priv->bus_desc);
+   if (!bus) {
+   kfree(priv);
+   return -ENODEV;
+   }
+   platform_set_drvdata(pdev, priv);
+
+   is_volatile = !!of_find_property(np, "volatile", NULL);
+   dev_dbg(&pdev->dev, "Registering %s regions from %pOF\n",
+   is_volatile ? "volatile" : "non-volatile",  np);
+
+   for (i = 0; i < pdev->num_resources; i++) {
+   struct nd_region_desc ndr_desc;
+   struct nd_region *region;
+
+   /*
+* NB: libnvdimm copies the data from ndr_desc into its own
+* structures so passing a stack pointer is fine.
+*/
+   memset(&ndr_desc, 0, sizeof(ndr_desc));
+   ndr_desc.attr_groups = region_attr_groups;
+   ndr_desc.numa_node = of_node_to_nid(np);
+   ndr_desc.res = &pdev->resource[i];
+   ndr_desc.of_node = np;
+   set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
+
+   if (is_volatile)
+   region = nvdimm_volatile_region_create(bus, &ndr_desc);
+   else
+   region = nvdimm_pmem_region_create(bus, &ndr_desc);
+
+   if (!region)
+   dev_warn(&pdev->dev, "Unable to register region %pR 
from %pOF\n",
+   ndr_desc.res, np);
+   else
+   dev_dbg(&pdev->dev, "Registered 

[RESEND v2 1/4] libnvdimm: Add of_node to region and bus descriptors

2018-04-03 Thread Oliver O'Halloran
We want to be able to cross reference the region and bus devices
with the device tree node that they were spawned from. libNVDIMM
handles creating the actual devices for these internally, so we
need to pass in a pointer to the relevant node in the descriptor.

Signed-off-by: Oliver O'Halloran 
Acked-by: Dan Williams 
---
 drivers/nvdimm/bus.c | 1 +
 drivers/nvdimm/region_devs.c | 1 +
 include/linux/libnvdimm.h| 3 +++
 3 files changed, 5 insertions(+)

diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
index 78eabc3a1ab1..c6106914f396 100644
--- a/drivers/nvdimm/bus.c
+++ b/drivers/nvdimm/bus.c
@@ -358,6 +358,7 @@ struct nvdimm_bus *nvdimm_bus_register(struct device 
*parent,
nvdimm_bus->dev.release = nvdimm_bus_release;
nvdimm_bus->dev.groups = nd_desc->attr_groups;
nvdimm_bus->dev.bus = &nvdimm_bus_type;
+   nvdimm_bus->dev.of_node = nd_desc->of_node;
dev_set_name(&nvdimm_bus->dev, "ndbus%d", nvdimm_bus->id);
rc = device_register(&nvdimm_bus->dev);
if (rc) {
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index e6d01911e092..2f1d5771100e 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -1005,6 +1005,7 @@ static struct nd_region *nd_region_create(struct 
nvdimm_bus *nvdimm_bus,
dev->parent = &nvdimm_bus->dev;
dev->type = dev_type;
dev->groups = ndr_desc->attr_groups;
+   dev->of_node = ndr_desc->of_node;
nd_region->ndr_size = resource_size(ndr_desc->res);
nd_region->ndr_start = ndr_desc->res->start;
nd_device_register(dev);
diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
index ff855ed965fb..f61cb5050297 100644
--- a/include/linux/libnvdimm.h
+++ b/include/linux/libnvdimm.h
@@ -76,12 +76,14 @@ typedef int (*ndctl_fn)(struct nvdimm_bus_descriptor 
*nd_desc,
struct nvdimm *nvdimm, unsigned int cmd, void *buf,
unsigned int buf_len, int *cmd_rc);
 
+struct device_node;
 struct nvdimm_bus_descriptor {
const struct attribute_group **attr_groups;
unsigned long bus_dsm_mask;
unsigned long cmd_mask;
struct module *module;
char *provider_name;
+   struct device_node *of_node;
ndctl_fn ndctl;
int (*flush_probe)(struct nvdimm_bus_descriptor *nd_desc);
int (*clear_to_send)(struct nvdimm_bus_descriptor *nd_desc,
@@ -123,6 +125,7 @@ struct nd_region_desc {
int num_lanes;
int numa_node;
unsigned long flags;
+   struct device_node *of_node;
 };
 
 struct device;
-- 
2.9.5



[PATCH 4/4] powerpc/powernv: Create platform devs for nvdimm buses

2018-04-03 Thread Oliver O'Halloran
Scan the devicetree for nodes with an nvdimm-bus compatible string and
create a platform device for each of them.

Signed-off-by: Oliver O'Halloran 
---
 arch/powerpc/platforms/powernv/opal.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/opal.c 
b/arch/powerpc/platforms/powernv/opal.c
index c15182765ff5..c37485a3c5c9 100644
--- a/arch/powerpc/platforms/powernv/opal.c
+++ b/arch/powerpc/platforms/powernv/opal.c
@@ -821,6 +821,9 @@ static int __init opal_init(void)
/* Create i2c platform devices */
opal_pdev_init("ibm,opal-i2c");
 
+   /* Handle non-volatile memory devices */
+   opal_pdev_init("pmem-region");
+
/* Setup a heatbeat thread if requested by OPAL */
opal_init_heartbeat();
 
-- 
2.9.5



[PATCH 2/4] libnvdimm: Add device-tree based driver

2018-04-03 Thread Oliver O'Halloran
This patch adds preliminary device-tree bindings for persistent memory
regions. The driver registers a libnvdimm bus for each pmem-region
node and each address range under the node is converted to a region
within that bus.

Signed-off-by: Oliver O'Halloran 
---
v2: Made each bus have a separate node rather than having a shared bus.
Renamed to of_pmem rather than of_nvdimm.
Changed log level of happy-path messages to debug.
---
 MAINTAINERS  |   7 +++
 drivers/nvdimm/Kconfig   |  10 
 drivers/nvdimm/Makefile  |   1 +
 drivers/nvdimm/of_pmem.c | 119 +++
 4 files changed, 137 insertions(+)
 create mode 100644 drivers/nvdimm/of_pmem.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 4e62756936fa..6ef38be700e8 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8035,6 +8035,13 @@ Q:   
https://patchwork.kernel.org/project/linux-nvdimm/list/
 S: Supported
 F: drivers/nvdimm/pmem*
 
+LIBNVDIMM: DEVICETREE BINDINGS
+M: Oliver O'Halloran 
+L: linux-nvd...@lists.01.org
+Q: https://patchwork.kernel.org/project/linux-nvdimm/list/
+S: Supported
+F: drivers/nvdimm/of_pmem.c
+
 LIBNVDIMM: NON-VOLATILE MEMORY DEVICE SUBSYSTEM
 M: Dan Williams 
 L: linux-nvd...@lists.01.org
diff --git a/drivers/nvdimm/Kconfig b/drivers/nvdimm/Kconfig
index a65f2e1d9f53..2d6862bf7436 100644
--- a/drivers/nvdimm/Kconfig
+++ b/drivers/nvdimm/Kconfig
@@ -102,4 +102,14 @@ config NVDIMM_DAX
 
  Select Y if unsure
 
+config OF_PMEM
+   tristate "Device-tree support for persistent memory regions"
+   depends on OF
+   default LIBNVDIMM
+   help
+ Allows regions of persistent memory to be described in the
+ device-tree.
+
+ Select Y if unsure.
+
 endif
diff --git a/drivers/nvdimm/Makefile b/drivers/nvdimm/Makefile
index 70d5f3ad9909..e8847045dac0 100644
--- a/drivers/nvdimm/Makefile
+++ b/drivers/nvdimm/Makefile
@@ -4,6 +4,7 @@ obj-$(CONFIG_BLK_DEV_PMEM) += nd_pmem.o
 obj-$(CONFIG_ND_BTT) += nd_btt.o
 obj-$(CONFIG_ND_BLK) += nd_blk.o
 obj-$(CONFIG_X86_PMEM_LEGACY) += nd_e820.o
+obj-$(CONFIG_OF_PMEM) += of_pmem.o
 
 nd_pmem-y := pmem.o
 
diff --git a/drivers/nvdimm/of_pmem.c b/drivers/nvdimm/of_pmem.c
new file mode 100644
index ..374c796ea1de
--- /dev/null
+++ b/drivers/nvdimm/of_pmem.c
@@ -0,0 +1,119 @@
+// SPDX-License-Identifier: GPL-2.0+
+
+#define pr_fmt(fmt) "of_pmem: " fmt
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static const struct attribute_group *region_attr_groups[] = {
+   &nd_region_attribute_group,
+   &nd_device_attribute_group,
+   NULL,
+};
+
+static const struct attribute_group *bus_attr_groups[] = {
+   &nvdimm_bus_attribute_group,
+   NULL,
+};
+
+struct of_nd_private {
+   struct nvdimm_bus_descriptor bus_desc;
+   struct nvdimm_bus *bus;
+};
+
+static int of_nd_region_probe(struct platform_device *pdev)
+{
+   struct of_nd_private *priv;
+   struct device_node *np;
+   struct nvdimm_bus *bus;
+   bool is_volatile;
+   int i;
+
+   np = dev_of_node(>dev);
+   if (!np)
+   return -ENXIO;
+
+   priv = kzalloc(sizeof(*priv), GFP_KERNEL);
+   if (!priv)
+   return -ENOMEM;
+
+   priv->bus_desc.attr_groups = bus_attr_groups;
+   priv->bus_desc.provider_name = "of_pmem";
+   priv->bus_desc.module = THIS_MODULE;
+   priv->bus_desc.of_node = np;
+
+   priv->bus = bus = nvdimm_bus_register(&pdev->dev, &priv->bus_desc);
+   if (!bus) {
+   kfree(priv);
+   return -ENODEV;
+   }
+   platform_set_drvdata(pdev, priv);
+
+   is_volatile = !!of_find_property(np, "volatile", NULL);
+   dev_dbg(&pdev->dev, "Registering %s regions from %pOF\n",
+   is_volatile ? "volatile" : "non-volatile",  np);
+
+   for (i = 0; i < pdev->num_resources; i++) {
+   struct nd_region_desc ndr_desc;
+   struct nd_region *region;
+
+   /*
+* NB: libnvdimm copies the data from ndr_desc into its own
+* structures so passing a stack pointer is fine.
+*/
+   memset(&ndr_desc, 0, sizeof(ndr_desc));
+   ndr_desc.attr_groups = region_attr_groups;
+   ndr_desc.numa_node = of_node_to_nid(np);
+   ndr_desc.res = &pdev->resource[i];
+   ndr_desc.of_node = np;
+   set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
+
+   if (is_volatile)
+   region = nvdimm_volatile_region_create(bus, &ndr_desc);
+   else
+   region = nvdimm_pmem_region_create(bus, &ndr_desc);
+
+   if (!region)
+   dev_warn(&pdev->dev, "Unable to register region %pR 
from %pOF\n",
+   ndr_desc.res, np);
+   else
+   dev_dbg(&pdev->dev, "Registered 

[PATCH 3/4] doc/devicetree: Persistent memory region bindings

2018-04-03 Thread Oliver O'Halloran
Add device-tree binding documentation for the nvdimm region driver.

Cc: devicet...@vger.kernel.org
Signed-off-by: Oliver O'Halloran 
---
v2: Changed name from nvdimm-region to pmem-region.
Cleaned up the example binding and fixed the overlapping regions.
Added support for multiple regions in a single reg.
---
 .../devicetree/bindings/pmem/pmem-region.txt   | 80 ++
 MAINTAINERS|  1 +
 2 files changed, 81 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/pmem/pmem-region.txt

diff --git a/Documentation/devicetree/bindings/pmem/pmem-region.txt 
b/Documentation/devicetree/bindings/pmem/pmem-region.txt
new file mode 100644
index ..de48dc8cd562
--- /dev/null
+++ b/Documentation/devicetree/bindings/pmem/pmem-region.txt
@@ -0,0 +1,80 @@
+Device-tree bindings for persistent memory regions
+-
+
+Persistent memory refers to a class of memory devices that are:
+
+   a) Usable as main system memory (i.e. cacheable), and
+   b) Retain their contents across power failure.
+
+Given b) it is best to think of persistent memory as a kind of memory mapped
+storage device. To ensure data integrity the operating system needs to manage
+persistent regions separately to the normal memory pool. To aid with that this
+binding provides a standardised interface for discovering where persistent
+memory regions exist inside the physical address space.
+
+Bindings for the region nodes:
+-
+
+Required properties:
+   - compatible = "pmem-region"
+
+   - reg = ;
+   The system physical address range of this nvdimm region.
+
+   If the reg property contains multiple address ranges
+   each address range will be treated as though it was specified
+   in a separate device node. Having multiple address ranges in a
+   node implies no special relationship between the two ranges.
+
+Optional properties:
+   - Any relevant NUMA associativity properties for the target platform.
+
+   - A "volatile" property indicating that this region is actually in
+ normal DRAM and does not require cache flushes after each write.
+
+ If this property is absent then the OS must assume that the region
+ is backed by non-volatile memory.
+
+A complete example:
+
+
+Here we define three 4KB regions:
+
+   a) A volatile region at 0x5000 on numa node 0,
+   b) A non-volatile region at 0x6000, and
+   c) A non-volatile region at 0x8000.
+
+/dts-v1/;
+
+/ {
+   #address-cells = <1>;
+   #size-cells = <1>;
+
+   platform {
+   compatible = "simple-bus";
+   #address-cells = <1>;
+   #size-cells = <1>;
+   ranges;
+
+   /*
+* This node specifies one volatile region spanning from
+* 0x5000 to 0x5fff.
+*/
+   pmem@5000 {
+   compatible = "pmem-region";
+   reg = <0x5000 0x1000>;
+   numa-node-id = <0>;
+   volatile;
+   };
+
+   /*
+* This node specifies two 4KB regions that are backed by
+* non-volatile memory.
+*/
+   pmem@6000 {
+   compatible = "pmem-region";
+   reg = <0x6000 0x1000 0x8000 0x1000>;
+   };
+   };
+};
+
diff --git a/MAINTAINERS b/MAINTAINERS
index 6ef38be700e8..fa3c9211d6ff 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8041,6 +8041,7 @@ L:linux-nvd...@lists.01.org
 Q: https://patchwork.kernel.org/project/linux-nvdimm/list/
 S: Supported
 F: drivers/nvdimm/of_pmem.c
+F: Documentation/devicetree/bindings/pmem/pmem-region.txt
 
 LIBNVDIMM: NON-VOLATILE MEMORY DEVICE SUBSYSTEM
 M: Dan Williams 
-- 
2.9.5



[PATCH 1/4] libnvdimm: Add of_node to region and bus descriptors

2018-04-03 Thread Oliver O'Halloran
We want to be able to cross reference the region and bus devices
with the device tree node that they were spawned from. libNVDIMM
handles creating the actual devices for these internally, so we
need to pass in a pointer to the relevant node in the descriptor.

Signed-off-by: Oliver O'Halloran 
Acked-by: Dan Williams 
---
 drivers/nvdimm/bus.c | 1 +
 drivers/nvdimm/region_devs.c | 1 +
 include/linux/libnvdimm.h| 3 +++
 3 files changed, 5 insertions(+)

diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
index 78eabc3a1ab1..c6106914f396 100644
--- a/drivers/nvdimm/bus.c
+++ b/drivers/nvdimm/bus.c
@@ -358,6 +358,7 @@ struct nvdimm_bus *nvdimm_bus_register(struct device 
*parent,
nvdimm_bus->dev.release = nvdimm_bus_release;
nvdimm_bus->dev.groups = nd_desc->attr_groups;
nvdimm_bus->dev.bus = &nvdimm_bus_type;
+   nvdimm_bus->dev.of_node = nd_desc->of_node;
dev_set_name(&nvdimm_bus->dev, "ndbus%d", nvdimm_bus->id);
rc = device_register(&nvdimm_bus->dev);
if (rc) {
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index e6d01911e092..2f1d5771100e 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -1005,6 +1005,7 @@ static struct nd_region *nd_region_create(struct 
nvdimm_bus *nvdimm_bus,
dev->parent = &nvdimm_bus->dev;
dev->type = dev_type;
dev->groups = ndr_desc->attr_groups;
+   dev->of_node = ndr_desc->of_node;
nd_region->ndr_size = resource_size(ndr_desc->res);
nd_region->ndr_start = ndr_desc->res->start;
nd_device_register(dev);
diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
index ff855ed965fb..f61cb5050297 100644
--- a/include/linux/libnvdimm.h
+++ b/include/linux/libnvdimm.h
@@ -76,12 +76,14 @@ typedef int (*ndctl_fn)(struct nvdimm_bus_descriptor 
*nd_desc,
struct nvdimm *nvdimm, unsigned int cmd, void *buf,
unsigned int buf_len, int *cmd_rc);
 
+struct device_node;
 struct nvdimm_bus_descriptor {
const struct attribute_group **attr_groups;
unsigned long bus_dsm_mask;
unsigned long cmd_mask;
struct module *module;
char *provider_name;
+   struct device_node *of_node;
ndctl_fn ndctl;
int (*flush_probe)(struct nvdimm_bus_descriptor *nd_desc);
int (*clear_to_send)(struct nvdimm_bus_descriptor *nd_desc,
@@ -123,6 +125,7 @@ struct nd_region_desc {
int num_lanes;
int numa_node;
unsigned long flags;
+   struct device_node *of_node;
 };
 
 struct device;
-- 
2.9.5



Re: [PATCH v2] cxl: Fix possible deadlock when processing page faults from cxllib

2018-04-03 Thread Laurent Dufour
On 03/04/2018 15:54, Frederic Barrat wrote:
> cxllib_handle_fault() is called by an external driver when it needs to
> have the host resolve page faults for a buffer. The buffer can cover
> several pages and VMAs. The function iterates over all the pages used
> by the buffer, based on the page size of the VMA.
> 
> To ensure some stability while processing the faults, the thread T1
> grabs the mm->mmap_sem semaphore with read access (R1). However, when
> processing a page fault for a single page, one of the underlying
> functions, copro_handle_mm_fault(), also grabs the same semaphore with
> read access (R2). So the thread T1 takes the semaphore twice.
> 
> If another thread T2 tries to access the semaphore in write mode W1
> (say, because it wants to allocate memory and calls 'brk'), then that
> thread T2 will have to wait because there's a reader (R1). If the
> thread T1 is processing a new page at that time, it won't get an
> automatic grant at R2, because there's now a writer thread
> waiting (T2). And we have a deadlock.
> 
> The timeline is:
> 1. thread T1 owns the semaphore with read access R1
> 2. thread T2 requests write access W1 and waits
> 3. thread T1 requests read access R2 and waits
> 
> The fix is for the thread T1 to release the semaphore R1 once it got
> the information it needs from the current VMA. The address space/VMAs
> could evolve while T1 iterates over the full buffer, but in the
> unlikely case where T1 misses a page, the external driver will raise a
> new page fault when retrying the memory access.
> 
> Fixes: 3ced8d730063 ("cxl: Export library to support IBM XSL")
> Cc: sta...@vger.kernel.org # 4.13+
> Signed-off-by: Frederic Barrat 

What are the changes introduced in this v2 ?

> ---
>  drivers/misc/cxl/cxllib.c | 85 
> ++-
>  1 file changed, 55 insertions(+), 30 deletions(-)
> 
> diff --git a/drivers/misc/cxl/cxllib.c b/drivers/misc/cxl/cxllib.c
> index 30ccba436b3b..55cd35d1a9cc 100644
> --- a/drivers/misc/cxl/cxllib.c
> +++ b/drivers/misc/cxl/cxllib.c
> @@ -208,49 +208,74 @@ int cxllib_get_PE_attributes(struct task_struct *task,
>  }
>  EXPORT_SYMBOL_GPL(cxllib_get_PE_attributes);
> 
> -int cxllib_handle_fault(struct mm_struct *mm, u64 addr, u64 size, u64 flags)
> +static int get_vma_info(struct mm_struct *mm, u64 addr,
> + u64 *vma_start, u64 *vma_end,
> + unsigned long *page_size)
>  {
> - int rc;
> - u64 dar;
>   struct vm_area_struct *vma = NULL;
> - unsigned long page_size;
> -
> - if (mm == NULL)
> - return -EFAULT;
> + int rc = 0;
> 
>   down_read(&mm->mmap_sem);
> 
>   vma = find_vma(mm, addr);
>   if (!vma) {
> - pr_err("Can't find vma for addr %016llx\n", addr);
>   rc = -EFAULT;
>   goto out;
>   }
> - /* get the size of the pages allocated */
> - page_size = vma_kernel_pagesize(vma);
> -
> - for (dar = (addr & ~(page_size - 1)); dar < (addr + size); dar += 
> page_size) {
> - if (dar < vma->vm_start || dar >= vma->vm_end) {
> - vma = find_vma(mm, addr);
> - if (!vma) {
> - pr_err("Can't find vma for addr %016llx\n", 
> addr);
> - rc = -EFAULT;
> - goto out;
> - }
> - /* get the size of the pages allocated */
> - page_size = vma_kernel_pagesize(vma);
> + *page_size = vma_kernel_pagesize(vma);
> + *vma_start = vma->vm_start;
> + *vma_end = vma->vm_end;
> +out:
> + up_read(&mm->mmap_sem);
> + return rc;
> +}
> +
> +int cxllib_handle_fault(struct mm_struct *mm, u64 addr, u64 size, u64 flags)
> +{
> + int rc;
> + u64 dar, vma_start, vma_end;
> + unsigned long page_size;
> +
> + if (mm == NULL)
> + return -EFAULT;
> +
> + /*
> +  * The buffer we have to process can extend over several pages
> +  * and may also cover several VMAs.
> +  * We iterate over all the pages. The page size could vary
> +  * between VMAs.
> +  */
> + rc = get_vma_info(mm, addr, &vma_start, &vma_end, &page_size);
> + if (rc)
> + return rc;
> +
> + for (dar = (addr & ~(page_size - 1)); dar < (addr + size);
> +  dar += page_size) {
> + if (dar < vma_start || dar >= vma_end) {
> + /*
> +  * We don't hold the mm->mmap_sem semaphore
> +  * while iterating, since the semaphore is
> +  * required by one of the lower-level page
> +  * fault processing functions and it could
> +  * create a deadlock.
> +  *
> +  * It means the VMAs can be altered between 2
> +  * loop iterations and we could theoretically
> +  * 

[PATCH v2] cxl: Fix possible deadlock when processing page faults from cxllib

2018-04-03 Thread Frederic Barrat
cxllib_handle_fault() is called by an external driver when it needs to
have the host resolve page faults for a buffer. The buffer can cover
several pages and VMAs. The function iterates over all the pages used
by the buffer, based on the page size of the VMA.

To ensure some stability while processing the faults, the thread T1
grabs the mm->mmap_sem semaphore with read access (R1). However, when
processing a page fault for a single page, one of the underlying
functions, copro_handle_mm_fault(), also grabs the same semaphore with
read access (R2). So the thread T1 takes the semaphore twice.

If another thread T2 tries to access the semaphore in write mode W1
(say, because it wants to allocate memory and calls 'brk'), then that
thread T2 will have to wait because there's a reader (R1). If the
thread T1 is processing a new page at that time, it won't get an
automatic grant at R2, because there's now a writer thread
waiting (T2). And we have a deadlock.

The timeline is:
1. thread T1 owns the semaphore with read access R1
2. thread T2 requests write access W1 and waits
3. thread T1 requests read access R2 and waits

The fix is for the thread T1 to release the semaphore R1 once it got
the information it needs from the current VMA. The address space/VMAs
could evolve while T1 iterates over the full buffer, but in the
unlikely case where T1 misses a page, the external driver will raise a
new page fault when retrying the memory access.
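
For reference, the broken pattern can be sketched as follows. This is an
illustrative sketch only, with hypothetical names: buggy_handle_fault() and
fault_one_page() stand in for cxllib_handle_fault() and the lower-level
cxl_handle_mm_fault()/copro_handle_mm_fault() path, which takes mmap_sem
for reading on its own.

static int buggy_handle_fault(struct mm_struct *mm, u64 addr, u64 size,
                              unsigned long page_size)
{
        u64 dar;
        int rc = 0;

        down_read(&mm->mmap_sem);               /* 1. T1 owns R1 */
        for (dar = (addr & ~(page_size - 1)); dar < (addr + size);
             dar += page_size) {
                /*
                 * 2. T2 calls down_write(&mm->mmap_sem) and waits (W1).
                 * 3. The nested read acquisition inside fault_one_page()
                 *    (R2) queues behind the waiting writer W1, while W1
                 *    waits for R1 to be released: T1 and T2 deadlock.
                 */
                rc = fault_one_page(mm, dar);   /* takes mmap_sem for read */
                if (rc)
                        break;
        }
        up_read(&mm->mmap_sem);
        return rc;
}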

Fixes: 3ced8d730063 ("cxl: Export library to support IBM XSL")
Cc: sta...@vger.kernel.org # 4.13+
Signed-off-by: Frederic Barrat 
---
 drivers/misc/cxl/cxllib.c | 85 ++-
 1 file changed, 55 insertions(+), 30 deletions(-)

diff --git a/drivers/misc/cxl/cxllib.c b/drivers/misc/cxl/cxllib.c
index 30ccba436b3b..55cd35d1a9cc 100644
--- a/drivers/misc/cxl/cxllib.c
+++ b/drivers/misc/cxl/cxllib.c
@@ -208,49 +208,74 @@ int cxllib_get_PE_attributes(struct task_struct *task,
 }
 EXPORT_SYMBOL_GPL(cxllib_get_PE_attributes);
 
-int cxllib_handle_fault(struct mm_struct *mm, u64 addr, u64 size, u64 flags)
+static int get_vma_info(struct mm_struct *mm, u64 addr,
+   u64 *vma_start, u64 *vma_end,
+   unsigned long *page_size)
 {
-   int rc;
-   u64 dar;
struct vm_area_struct *vma = NULL;
-   unsigned long page_size;
-
-   if (mm == NULL)
-   return -EFAULT;
+   int rc = 0;
 
	down_read(&mm->mmap_sem);
 
vma = find_vma(mm, addr);
if (!vma) {
-   pr_err("Can't find vma for addr %016llx\n", addr);
rc = -EFAULT;
goto out;
}
-   /* get the size of the pages allocated */
-   page_size = vma_kernel_pagesize(vma);
-
-   for (dar = (addr & ~(page_size - 1)); dar < (addr + size); dar += 
page_size) {
-   if (dar < vma->vm_start || dar >= vma->vm_end) {
-   vma = find_vma(mm, addr);
-   if (!vma) {
-   pr_err("Can't find vma for addr %016llx\n", 
addr);
-   rc = -EFAULT;
-   goto out;
-   }
-   /* get the size of the pages allocated */
-   page_size = vma_kernel_pagesize(vma);
+   *page_size = vma_kernel_pagesize(vma);
+   *vma_start = vma->vm_start;
+   *vma_end = vma->vm_end;
+out:
+   up_read(&mm->mmap_sem);
+   return rc;
+}
+
+int cxllib_handle_fault(struct mm_struct *mm, u64 addr, u64 size, u64 flags)
+{
+   int rc;
+   u64 dar, vma_start, vma_end;
+   unsigned long page_size;
+
+   if (mm == NULL)
+   return -EFAULT;
+
+   /*
+* The buffer we have to process can extend over several pages
+* and may also cover several VMAs.
+* We iterate over all the pages. The page size could vary
+* between VMAs.
+*/
+   rc = get_vma_info(mm, addr, &vma_start, &vma_end, &page_size);
+   if (rc)
+   return rc;
+
+   for (dar = (addr & ~(page_size - 1)); dar < (addr + size);
+dar += page_size) {
+   if (dar < vma_start || dar >= vma_end) {
+   /*
+* We don't hold the mm->mmap_sem semaphore
+* while iterating, since the semaphore is
+* required by one of the lower-level page
+* fault processing functions and it could
+* create a deadlock.
+*
+* It means the VMAs can be altered between 2
+* loop iterations and we could theoretically
+* miss a page (however unlikely). But that's
+* not really a problem, as the driver will
+* retry access, get another page fault on the
+* 

Re: [PATCH 2/2] smp: introduce kick_active_cpus_sync()

2018-04-03 Thread Mark Rutland
Hi Yury,

On Sun, Apr 01, 2018 at 02:11:08PM +0300, Yury Norov wrote:
> +/*
> + * Flush I-cache if CPU is in extended quiescent state
> + */

This comment is misleading. An ISB doesn't touch the I-cache; it forces
a context synchronization event.

> + .macro  isb_if_eqs
> +#ifndef CONFIG_TINY_RCU
> + bl  rcu_is_watching
> + tst w0, #0xff
> + b.ne1f

The TST+B.NE can be a CBNZ:

bl  rcu_is_watching
cbnz	x0, 1f
isb
1:

> + /* Pairs with aarch64_insn_patch_text for EQS CPUs. */
> + isb
> +1:
> +#endif
> + .endm
> +
>  el0_sync_invalid:
>   inv_entry 0, BAD_SYNC
>  ENDPROC(el0_sync_invalid)
> @@ -840,8 +861,10 @@ el0_svc:
>   mov wsc_nr, #__NR_syscalls
>  el0_svc_naked:   // compat entry point
>   stp x0, xscno, [sp, #S_ORIG_X0] // save the original x0 and 
> syscall number
> + isb_if_eqs
>   enable_dbg_and_irq
> - ct_user_exit 1
> + ct_user_exit

I don't think this is safe. Here we issue the ISB *before* exiting a
quiescent state, so I think we can race with another CPU that calls
kick_all_active_cpus_sync, e.g.

CPU0CPU1

ISB
patch_some_text()
kick_all_active_cpus_sync()
ct_user_exit

// not synchronized!
use_of_patched_text()

... and therefore the ISB has no effect, which could be disastrous.

I believe we need the ISB *after* we transition into a non-quiescent
state, so that we can't possibly miss a context synchronization event.

Thanks,
Mark.


Re: [PATCH] cxl: Fix possible deadlock when processing page faults from cxllib

2018-04-03 Thread Vaibhav Jain
Frederic Barrat  writes:

> cxllib_handle_fault() is called by an external driver when it needs to
> have the host process page faults for a buffer which may cover several
> pages. Currently the function holds the mm->mmap_sem semaphore with
> read access while iterating over the buffer, since it could span
> several VMAs. When calling a lower-level function to handle the page
> fault for a single page, the semaphore is accessed again in read
> mode. That is wrong and can lead to deadlocks if a writer tries to
> sneak in while a buffer of several pages is being processed.
>
> The fix is to release the semaphore once cxllib_handle_fault() got the
> information it needs from the current vma. The address space/VMAs
> could evolve while we iterate over the full buffer, but in the
> unlikely case where we miss a page, the driver will raise a new page
> fault when retrying.
>
> Fixes: 3ced8d730063 ("cxl: Export library to support IBM XSL")
> Cc: sta...@vger.kernel.org # 4.13+
> Signed-off-by: Frederic Barrat 

Reviewed-by: Vaibhav Jain 



Re: [PATCH v3 4/7] powerpc/fadump: exclude memory holes while reserving memory in second kernel.

2018-04-03 Thread Mahesh Jagannath Salgaonkar
On 04/03/2018 03:21 PM, Hari Bathini wrote:
> 
> 
> On Monday 02 April 2018 12:00 PM, Mahesh J Salgaonkar wrote:
>> From: Mahesh Salgaonkar 
>>
>> The second kernel, during early boot after the crash, reserves the rest of
>> the memory above the boot memory size to make sure it does not touch any of
>> the dump memory area. It uses memblock_reserve(), which reserves the
>> specified memory region irrespective of memory holes present within that
>> region. There are cases where the previous kernel hot-removed some of its
>> memory, leaving memory holes behind. In such cases the fadump kernel reports
>> an incorrect number of reserved pages through the arch_reserved_kernel_pages()
>> hook, causing the kernel to hang or panic.
>>
>> Fix this by excluding memory holes while reserving the rest of the memory
>> above the boot memory size during the second kernel boot after a crash.
>>
>> Signed-off-by: Mahesh Salgaonkar 
>> ---
>>   arch/powerpc/kernel/fadump.c |   17 -
>>   1 file changed, 16 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
>> index 011f1aa7abab..a497e9fb93fe 100644
>> --- a/arch/powerpc/kernel/fadump.c
>> +++ b/arch/powerpc/kernel/fadump.c
>> @@ -433,6 +433,21 @@ static inline unsigned long
>> get_fadump_metadata_base(
>>   return be64_to_cpu(fdm->metadata_region.source_address);
>>   }
>>
>> +static void fadump_memblock_reserve(unsigned long base, unsigned long
>> size)
>> +{
>> +    struct memblock_region *reg;
>> +    unsigned long start, end;
>> +
>> +    for_each_memblock(memory, reg) {
>> +    start = max(base, (unsigned long)reg->base);
>> +    end = reg->base + reg->size;
>> +    end = min(base + size, end);
>> +
>> +    if (start < end)
>> +    memblock_reserve(start, end - start);
>> +    }
>> +}
>> +
>>   int __init fadump_reserve_mem(void)
>>   {
>>   unsigned long base, size, memory_boundary;
>> @@ -487,7 +502,7 @@ int __init fadump_reserve_mem(void)
>>    */
>>   base = fw_dump.boot_memory_size;
>>   size = memory_boundary - base;
>> -    memblock_reserve(base, size);
>> +    fadump_memblock_reserve(base, size);
>>   printk(KERN_INFO "Reserved %ldMB of memory at %ldMB "
> 
> Mahesh, you may want to change this print as well, as it would be
> misleading in case of holes in the memory.

Yeah, maybe we can just move that printk inside
fadump_memblock_reserve() as well.
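
One possible shape for that, as an illustrative sketch only (not the posted
patch), with the message printed per reserved range so that holes are
reflected accurately:

static void fadump_memblock_reserve(unsigned long base, unsigned long size)
{
        struct memblock_region *reg;
        unsigned long start, end;

        for_each_memblock(memory, reg) {
                start = max(base, (unsigned long)reg->base);
                end = min(base + size, (unsigned long)(reg->base + reg->size));

                if (start < end) {
                        memblock_reserve(start, end - start);
                        /* report exactly what got reserved, range by range */
                        pr_info("Reserved %ldMB of memory at %ldMB for saving crash dump\n",
                                (unsigned long)((end - start) >> 20),
                                (unsigned long)(start >> 20));
                }
        }
}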

Thanks,
-Mahesh.

> 
> Thanks
> Hari
> 
>>   "for saving crash dump\n",
>>   (unsigned long)(size >> 20),
>>
> 



Re: [PATCH v3 6/7] powerpc/fadump: Do not allow hot-remove memory from fadump reserved area.

2018-04-03 Thread Mahesh Jagannath Salgaonkar
On 04/03/2018 08:48 AM, Pingfan Liu wrote:
> I think CMA already protects us from hot-remove, so this patch is not necessary.

Yes, but only for memory in the declared CMA region that has been allocated
with cma_alloc(). The rest of the memory inside the CMA region, which hasn't
been allocated through cma_alloc(), can still be hot-removed. Hence this patch
is necessary.

fadump allocates only one page from its CMA region, for the metadata region.
All the remaining memory is free for applications to use and is still
vulnerable to hot-remove.

Thanks,
-Mahesh.

> 
> Regards,
> Pingfan
> 
> On Mon, Apr 2, 2018 at 2:30 PM, Mahesh J Salgaonkar
>  wrote:
>> From: Mahesh Salgaonkar 
>>
>> For fadump to work successfully there should not be any holes in the
>> reserved memory ranges where the kernel has asked firmware to move the
>> contents of old kernel memory in the event of a crash. But this memory area
>> is currently not protected from hot-remove operations. Hence, the fadump
>> service can fail to re-register after a hot-remove operation if the
>> hot-removed memory belongs to the fadump reserved region. To avoid this,
>> make sure that memory from the fadump reserved area is not hot-removable
>> while fadump is registered.
>>
>> However, if the user still wants to remove that memory, they can do so by
>> manually stopping the fadump service before the hot-remove operation.
>>
>> Signed-off-by: Mahesh Salgaonkar 
>> ---
>>  arch/powerpc/include/asm/fadump.h   |2 +-
>>  arch/powerpc/kernel/fadump.c|   10 --
>>  arch/powerpc/platforms/pseries/hotplug-memory.c |7 +--
>>  3 files changed, 14 insertions(+), 5 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/fadump.h 
>> b/arch/powerpc/include/asm/fadump.h
>> index 44b6ebfe9be6..d16dc77107a8 100644
>> --- a/arch/powerpc/include/asm/fadump.h
>> +++ b/arch/powerpc/include/asm/fadump.h
>> @@ -207,7 +207,7 @@ struct fad_crash_memory_ranges {
>> unsigned long long  size;
>>  };
>>
>> -extern int is_fadump_boot_memory_area(u64 addr, ulong size);
>> +extern int is_fadump_memory_area(u64 addr, ulong size);
>>  extern int early_init_dt_scan_fw_dump(unsigned long node,
>> const char *uname, int depth, void *data);
>>  extern int fadump_reserve_mem(void);
>> diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
>> index 59aaf0357a52..2c3c7e655eec 100644
>> --- a/arch/powerpc/kernel/fadump.c
>> +++ b/arch/powerpc/kernel/fadump.c
>> @@ -162,13 +162,19 @@ int __init early_init_dt_scan_fw_dump(unsigned long 
>> node,
>>
>>  /*
>>   * If fadump is registered, check if the memory provided
>> - * falls within boot memory area.
>> + * falls within boot memory area and reserved memory area.
>>   */
>> -int is_fadump_boot_memory_area(u64 addr, ulong size)
>> +int is_fadump_memory_area(u64 addr, ulong size)
>>  {
>> +   u64 d_start = fw_dump.reserve_dump_area_start;
>> +   u64 d_end = d_start + fw_dump.reserve_dump_area_size;
>> +
>> if (!fw_dump.dump_registered)
>> return 0;
>>
>> +   if (((addr + size) > d_start) && (addr <= d_end))
>> +   return 1;
>> +
>> return (addr + size) > RMA_START && addr <= fw_dump.boot_memory_size;
>>  }
>>
>> diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
>> b/arch/powerpc/platforms/pseries/hotplug-memory.c
>> index c1578f54c626..e4c658cda3a7 100644
>> --- a/arch/powerpc/platforms/pseries/hotplug-memory.c
>> +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
>> @@ -389,8 +389,11 @@ static bool lmb_is_removable(struct drmem_lmb *lmb)
>> phys_addr = lmb->base_addr;
>>
>>  #ifdef CONFIG_FA_DUMP
>> -   /* Don't hot-remove memory that falls in fadump boot memory area */
>> -   if (is_fadump_boot_memory_area(phys_addr, block_sz))
>> +   /*
>> +* Don't hot-remove memory that falls in fadump boot memory area
>> +* and memory that is reserved for capturing old kernel memory.
>> +*/
>> +   if (is_fadump_memory_area(phys_addr, block_sz))
>> return false;
>>  #endif
>>
>>
> 



Re: [PATCH] cxl: Fix possible deadlock when processing page faults from cxllib

2018-04-03 Thread Laurent Dufour


On 03/04/2018 11:43, Frederic Barrat wrote:
> cxllib_handle_fault() is called by an external driver when it needs to
> have the host process page faults for a buffer which may cover several
> pages. Currently the function holds the mm->mmap_sem semaphore with
> read access while iterating over the buffer, since it could span
> several VMAs. When calling a lower-level function to handle the page
> fault for a single page, the semaphore is accessed again in read
> mode. That is wrong and can lead to deadlocks if a writer tries to
> sneak in while a buffer of several pages is being processed.
> 
> The fix is to release the semaphore once cxllib_handle_fault() got the
> information it needs from the current vma. The address space/VMAs
> could evolve while we iterate over the full buffer, but in the
> unlikely case where we miss a page, the driver will raise a new page
> fault when retrying.
> 
> Fixes: 3ced8d730063 ("cxl: Export library to support IBM XSL")
> Cc: sta...@vger.kernel.org # 4.13+
> Signed-off-by: Frederic Barrat 

FWIW,
Reviewed-by: Laurent Dufour 

> ---
>  drivers/misc/cxl/cxllib.c | 85 
> ++-
>  1 file changed, 55 insertions(+), 30 deletions(-)
> 
> diff --git a/drivers/misc/cxl/cxllib.c b/drivers/misc/cxl/cxllib.c
> index 30ccba436b3b..55cd35d1a9cc 100644
> --- a/drivers/misc/cxl/cxllib.c
> +++ b/drivers/misc/cxl/cxllib.c
> @@ -208,49 +208,74 @@ int cxllib_get_PE_attributes(struct task_struct *task,
>  }
>  EXPORT_SYMBOL_GPL(cxllib_get_PE_attributes);
> 
> -int cxllib_handle_fault(struct mm_struct *mm, u64 addr, u64 size, u64 flags)
> +static int get_vma_info(struct mm_struct *mm, u64 addr,
> + u64 *vma_start, u64 *vma_end,
> + unsigned long *page_size)
>  {
> - int rc;
> - u64 dar;
>   struct vm_area_struct *vma = NULL;
> - unsigned long page_size;
> -
> - if (mm == NULL)
> - return -EFAULT;
> + int rc = 0;
> 
>   down_read(&mm->mmap_sem);
> 
>   vma = find_vma(mm, addr);
>   if (!vma) {
> - pr_err("Can't find vma for addr %016llx\n", addr);
>   rc = -EFAULT;
>   goto out;
>   }
> - /* get the size of the pages allocated */
> - page_size = vma_kernel_pagesize(vma);
> -
> - for (dar = (addr & ~(page_size - 1)); dar < (addr + size); dar += 
> page_size) {
> - if (dar < vma->vm_start || dar >= vma->vm_end) {
> - vma = find_vma(mm, addr);
> - if (!vma) {
> - pr_err("Can't find vma for addr %016llx\n", 
> addr);
> - rc = -EFAULT;
> - goto out;
> - }
> - /* get the size of the pages allocated */
> - page_size = vma_kernel_pagesize(vma);
> + *page_size = vma_kernel_pagesize(vma);
> + *vma_start = vma->vm_start;
> + *vma_end = vma->vm_end;
> +out:
> + up_read(&mm->mmap_sem);
> + return rc;
> +}
> +
> +int cxllib_handle_fault(struct mm_struct *mm, u64 addr, u64 size, u64 flags)
> +{
> + int rc;
> + u64 dar, vma_start, vma_end;
> + unsigned long page_size;
> +
> + if (mm == NULL)
> + return -EFAULT;
> +
> + /*
> +  * The buffer we have to process can extend over several pages
> +  * and may also cover several VMAs.
> +  * We iterate over all the pages. The page size could vary
> +  * between VMAs.
> +  */
> + rc = get_vma_info(mm, addr, &vma_start, &vma_end, &page_size);
> + if (rc)
> + return rc;
> +
> + for (dar = (addr & ~(page_size - 1)); dar < (addr + size);
> +  dar += page_size) {
> + if (dar < vma_start || dar >= vma_end) {
> + /*
> +  * We don't hold the mm->mmap_sem semaphore
> +  * while iterating, since the semaphore is
> +  * required by one of the lower-level page
> +  * fault processing functions and it could
> +  * create a deadlock.
> +  *
> +  * It means the VMAs can be altered between 2
> +  * loop iterations and we could theoretically
> +  * miss a page (however unlikely). But that's
> +  * not really a problem, as the driver will
> +  * retry access, get another page fault on the
> +  * missing page and call us again.
> +  */
> + rc = get_vma_info(mm, dar, &vma_start, &vma_end,
> + &page_size);
> + if (rc)
> + return rc;
>   }
> 
>   rc = cxl_handle_mm_fault(mm, flags, dar);
> - if (rc) {
> - pr_err("cxl_handle_mm_fault failed %d", rc);

Re: [PATCH v3 4/7] powerpc/fadump: exclude memory holes while reserving memory in second kernel.

2018-04-03 Thread Hari Bathini



On Monday 02 April 2018 12:00 PM, Mahesh J Salgaonkar wrote:

From: Mahesh Salgaonkar 

The second kernel, during early boot after the crash, reserves the rest of
the memory above the boot memory size to make sure it does not touch any of
the dump memory area. It uses memblock_reserve(), which reserves the specified
memory region irrespective of memory holes present within that region.
There are cases where the previous kernel hot-removed some of its memory,
leaving memory holes behind. In such cases the fadump kernel reports an
incorrect number of reserved pages through the arch_reserved_kernel_pages()
hook, causing the kernel to hang or panic.

Fix this by excluding memory holes while reserving the rest of the memory
above the boot memory size during the second kernel boot after a crash.

Signed-off-by: Mahesh Salgaonkar 
---
  arch/powerpc/kernel/fadump.c |   17 -
  1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index 011f1aa7abab..a497e9fb93fe 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -433,6 +433,21 @@ static inline unsigned long get_fadump_metadata_base(
return be64_to_cpu(fdm->metadata_region.source_address);
  }

+static void fadump_memblock_reserve(unsigned long base, unsigned long size)
+{
+   struct memblock_region *reg;
+   unsigned long start, end;
+
+   for_each_memblock(memory, reg) {
+   start = max(base, (unsigned long)reg->base);
+   end = reg->base + reg->size;
+   end = min(base + size, end);
+
+   if (start < end)
+   memblock_reserve(start, end - start);
+   }
+}
+
  int __init fadump_reserve_mem(void)
  {
unsigned long base, size, memory_boundary;
@@ -487,7 +502,7 @@ int __init fadump_reserve_mem(void)
 */
base = fw_dump.boot_memory_size;
size = memory_boundary - base;
-   memblock_reserve(base, size);
+   fadump_memblock_reserve(base, size);
printk(KERN_INFO "Reserved %ldMB of memory at %ldMB "


Mahesh, you may want to change this print as well, as it would be
misleading in case of holes in the memory.

Thanks
Hari


"for saving crash dump\n",
(unsigned long)(size >> 20),





[PATCH] cxl: Fix possible deadlock when processing page faults from cxllib

2018-04-03 Thread Frederic Barrat
cxllib_handle_fault() is called by an external driver when it needs to
have the host process page faults for a buffer which may cover several
pages. Currently the function holds the mm->mmap_sem semaphore with
read access while iterating over the buffer, since it could span
several VMAs. When calling a lower-level function to handle the page
fault for a single page, the semaphore is accessed again in read
mode. That is wrong and can lead to deadlocks if a writer tries to
sneak in while a buffer of several pages is being processed.

The fix is to release the semaphore once cxllib_handle_fault() got the
information it needs from the current vma. The address space/VMAs
could evolve while we iterate over the full buffer, but in the
unlikely case where we miss a page, the driver will raise a new page
fault when retrying.

Fixes: 3ced8d730063 ("cxl: Export library to support IBM XSL")
Cc: sta...@vger.kernel.org # 4.13+
Signed-off-by: Frederic Barrat 
---
 drivers/misc/cxl/cxllib.c | 85 ++-
 1 file changed, 55 insertions(+), 30 deletions(-)

diff --git a/drivers/misc/cxl/cxllib.c b/drivers/misc/cxl/cxllib.c
index 30ccba436b3b..55cd35d1a9cc 100644
--- a/drivers/misc/cxl/cxllib.c
+++ b/drivers/misc/cxl/cxllib.c
@@ -208,49 +208,74 @@ int cxllib_get_PE_attributes(struct task_struct *task,
 }
 EXPORT_SYMBOL_GPL(cxllib_get_PE_attributes);
 
-int cxllib_handle_fault(struct mm_struct *mm, u64 addr, u64 size, u64 flags)
+static int get_vma_info(struct mm_struct *mm, u64 addr,
+   u64 *vma_start, u64 *vma_end,
+   unsigned long *page_size)
 {
-   int rc;
-   u64 dar;
struct vm_area_struct *vma = NULL;
-   unsigned long page_size;
-
-   if (mm == NULL)
-   return -EFAULT;
+   int rc = 0;
 
	down_read(&mm->mmap_sem);
 
vma = find_vma(mm, addr);
if (!vma) {
-   pr_err("Can't find vma for addr %016llx\n", addr);
rc = -EFAULT;
goto out;
}
-   /* get the size of the pages allocated */
-   page_size = vma_kernel_pagesize(vma);
-
-   for (dar = (addr & ~(page_size - 1)); dar < (addr + size); dar += 
page_size) {
-   if (dar < vma->vm_start || dar >= vma->vm_end) {
-   vma = find_vma(mm, addr);
-   if (!vma) {
-   pr_err("Can't find vma for addr %016llx\n", 
addr);
-   rc = -EFAULT;
-   goto out;
-   }
-   /* get the size of the pages allocated */
-   page_size = vma_kernel_pagesize(vma);
+   *page_size = vma_kernel_pagesize(vma);
+   *vma_start = vma->vm_start;
+   *vma_end = vma->vm_end;
+out:
+   up_read(&mm->mmap_sem);
+   return rc;
+}
+
+int cxllib_handle_fault(struct mm_struct *mm, u64 addr, u64 size, u64 flags)
+{
+   int rc;
+   u64 dar, vma_start, vma_end;
+   unsigned long page_size;
+
+   if (mm == NULL)
+   return -EFAULT;
+
+   /*
+* The buffer we have to process can extend over several pages
+* and may also cover several VMAs.
+* We iterate over all the pages. The page size could vary
+* between VMAs.
+*/
+   rc = get_vma_info(mm, addr, &vma_start, &vma_end, &page_size);
+   if (rc)
+   return rc;
+
+   for (dar = (addr & ~(page_size - 1)); dar < (addr + size);
+dar += page_size) {
+   if (dar < vma_start || dar >= vma_end) {
+   /*
+* We don't hold the mm->mmap_sem semaphore
+* while iterating, since the semaphore is
+* required by one of the lower-level page
+* fault processing functions and it could
+* create a deadlock.
+*
+* It means the VMAs can be altered between 2
+* loop iterations and we could theoretically
+* miss a page (however unlikely). But that's
+* not really a problem, as the driver will
+* retry access, get another page fault on the
+* missing page and call us again.
+*/
+   rc = get_vma_info(mm, dar, &vma_start, &vma_end,
+   &page_size);
+   if (rc)
+   return rc;
}
 
rc = cxl_handle_mm_fault(mm, flags, dar);
-   if (rc) {
-   pr_err("cxl_handle_mm_fault failed %d", rc);
-   rc = -EFAULT;
-   goto out;
-   }
+   if (rc)
+   return -EFAULT;
}
-   rc = 0;
-out:
-   

[PATCH 1/4] powerpc/64/kexec: fix race in kexec when XIVE is shut down

2018-04-03 Thread Cédric Le Goater
The kexec_state KEXEC_STATE_IRQS_OFF barrier is reached by all
secondary CPUs before the kexec_cpu_down() operation is called on
secondaries. This can cause conflicts and provoke errors in the XIVE
hcalls when XIVE is shut down with H_INT_RESET on the primary CPU.

To synchronize the kexec_cpu_down() operations and make sure the
secondaries have completed their task before the primary starts doing
the same, let's move the primary kexec_cpu_down() after the
KEXEC_STATE_REAL_MODE barrier.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/kernel/machine_kexec_64.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/machine_kexec_64.c 
b/arch/powerpc/kernel/machine_kexec_64.c
index 49d34d7271e7..212ecb8e829c 100644
--- a/arch/powerpc/kernel/machine_kexec_64.c
+++ b/arch/powerpc/kernel/machine_kexec_64.c
@@ -230,16 +230,16 @@ static void kexec_prepare_cpus(void)
/* we are sure every CPU has IRQs off at this point */
kexec_all_irq_disabled = 1;
 
-   /* after we tell the others to go down */
-   if (ppc_md.kexec_cpu_down)
-   ppc_md.kexec_cpu_down(0, 0);
-
/*
 * Before removing MMU mappings make sure all CPUs have entered real
 * mode:
 */
kexec_prepare_cpus_wait(KEXEC_STATE_REAL_MODE);
 
+   /* after we tell the others to go down */
+   if (ppc_md.kexec_cpu_down)
+   ppc_md.kexec_cpu_down(0, 0);
+
put_cpu();
 }
 
-- 
2.13.6



[PATCH 2/4] powerpc/xive: fix hcall H_INT_RESET to support long busy delays

2018-04-03 Thread Cédric Le Goater
The hcall H_INT_RESET can take some time to complete. In such cases it
returns an H_LONG_BUSY_* code, requiring the caller to sleep for a while
before retrying.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/sysdev/xive/spapr.c | 53 
 1 file changed, 48 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/spapr.c b/arch/powerpc/sysdev/xive/spapr.c
index 091f1d0d0af1..7113f5d87952 100644
--- a/arch/powerpc/sysdev/xive/spapr.c
+++ b/arch/powerpc/sysdev/xive/spapr.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -108,6 +109,52 @@ static void xive_irq_bitmap_free(int irq)
}
 }
 
+
+/* Based on the similar routines in RTAS */
+static unsigned int plpar_busy_delay_time(long rc)
+{
+   unsigned int ms = 0;
+
+   if (H_IS_LONG_BUSY(rc)) {
+   ms = get_longbusy_msecs(rc);
+   } else if (rc == H_BUSY) {
+   ms = 10; /* seems appropriate for XIVE hcalls */
+   }
+
+   return ms;
+}
+
+static unsigned int plpar_busy_delay(int rc)
+{
+   unsigned int ms;
+
+   might_sleep();
+   ms = plpar_busy_delay_time(rc);
+   if (ms && need_resched())
+   msleep(ms);
+
+   return ms;
+}
+
+/*
+ * Note: this call has a partition wide scope and can take a while to
+ * complete. If it returns H_LONG_BUSY_* it should be retried
+ * periodically.
+ */
+static long plpar_int_reset(unsigned long flags)
+{
+   long rc;
+
+   do {
+   rc = plpar_hcall_norets(H_INT_RESET, flags);
+   } while (plpar_busy_delay(rc));
+
+   if (rc)
+   pr_err("H_INT_RESET failed %ld\n", rc);
+
+   return rc;
+}
+
 static long plpar_int_get_source_info(unsigned long flags,
  unsigned long lisn,
  unsigned long *src_flags,
@@ -445,11 +492,7 @@ static void xive_spapr_put_ipi(unsigned int cpu, struct 
xive_cpu *xc)
 
 static void xive_spapr_shutdown(void)
 {
-   long rc;
-
-   rc = plpar_hcall_norets(H_INT_RESET, 0);
-   if (rc)
-   pr_err("H_INT_RESET failed %ld\n", rc);
+   plpar_int_reset(0);
 }
 
 /*
-- 
2.13.6



[PATCH 4/4] powerpc/xive: prepare all hcalls to support long busy delays

2018-04-03 Thread Cédric Le Goater
This is not the case for the moment, but future releases of pHyp might
need to introduce some synchronisation routines under the hood which
would make the XIVE hcalls take longer to complete.

As this was done for H_INT_RESET, let's wrap the other hcalls in a
loop catching the H_LONG_BUSY_* codes.
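
For reference, the retry idiom applied at each call site looks like the
sketch below (plpar_int_sync_sketch() is a hypothetical name; the hcall
wrapper and plpar_busy_delay() come from this series). The series open-codes
the loop at every call site, presumably because each hcall takes a different
argument list:

static long plpar_int_sync_sketch(unsigned long flags, unsigned long lisn)
{
        long rc;

        /*
         * Retry while the hypervisor reports H_BUSY or H_LONG_BUSY_*;
         * plpar_busy_delay() may sleep between retries and returns 0
         * for any other return code, ending the loop.
         */
        do {
                rc = plpar_hcall_norets(H_INT_SYNC, flags, lisn);
        } while (plpar_busy_delay(rc));

        return rc;
}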

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/sysdev/xive/spapr.c | 36 
 1 file changed, 28 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/spapr.c b/arch/powerpc/sysdev/xive/spapr.c
index 7113f5d87952..97ea0a67a173 100644
--- a/arch/powerpc/sysdev/xive/spapr.c
+++ b/arch/powerpc/sysdev/xive/spapr.c
@@ -165,7 +165,10 @@ static long plpar_int_get_source_info(unsigned long flags,
unsigned long retbuf[PLPAR_HCALL_BUFSIZE];
long rc;
 
-   rc = plpar_hcall(H_INT_GET_SOURCE_INFO, retbuf, flags, lisn);
+   do {
+   rc = plpar_hcall(H_INT_GET_SOURCE_INFO, retbuf, flags, lisn);
+   } while (plpar_busy_delay(rc));
+
if (rc) {
pr_err("H_INT_GET_SOURCE_INFO lisn=%ld failed %ld\n", lisn, rc);
return rc;
@@ -198,8 +201,11 @@ static long plpar_int_set_source_config(unsigned long 
flags,
flags, lisn, target, prio, sw_irq);
 
 
-   rc = plpar_hcall_norets(H_INT_SET_SOURCE_CONFIG, flags, lisn,
-   target, prio, sw_irq);
+   do {
+   rc = plpar_hcall_norets(H_INT_SET_SOURCE_CONFIG, flags, lisn,
+   target, prio, sw_irq);
+   } while (plpar_busy_delay(rc));
+
if (rc) {
pr_err("H_INT_SET_SOURCE_CONFIG lisn=%ld target=%lx prio=%lx 
failed %ld\n",
   lisn, target, prio, rc);
@@ -218,7 +224,11 @@ static long plpar_int_get_queue_info(unsigned long flags,
unsigned long retbuf[PLPAR_HCALL_BUFSIZE];
long rc;
 
-   rc = plpar_hcall(H_INT_GET_QUEUE_INFO, retbuf, flags, target, priority);
+   do {
+   rc = plpar_hcall(H_INT_GET_QUEUE_INFO, retbuf, flags, target,
+priority);
+   } while (plpar_busy_delay(rc));
+
if (rc) {
pr_err("H_INT_GET_QUEUE_INFO cpu=%ld prio=%ld failed %ld\n",
   target, priority, rc);
@@ -247,8 +257,11 @@ static long plpar_int_set_queue_config(unsigned long flags,
pr_devel("H_INT_SET_QUEUE_CONFIG flags=%lx target=%lx priority=%lx 
qpage=%lx qsize=%lx\n",
flags,  target, priority, qpage, qsize);
 
-   rc = plpar_hcall_norets(H_INT_SET_QUEUE_CONFIG, flags, target,
-   priority, qpage, qsize);
+   do {
+   rc = plpar_hcall_norets(H_INT_SET_QUEUE_CONFIG, flags, target,
+   priority, qpage, qsize);
+   } while (plpar_busy_delay(rc));
+
if (rc) {
pr_err("H_INT_SET_QUEUE_CONFIG cpu=%ld prio=%ld qpage=%lx 
returned %ld\n",
   target, priority, qpage, rc);
@@ -262,7 +275,10 @@ static long plpar_int_sync(unsigned long flags, unsigned 
long lisn)
 {
long rc;
 
-   rc = plpar_hcall_norets(H_INT_SYNC, flags, lisn);
+   do {
+   rc = plpar_hcall_norets(H_INT_SYNC, flags, lisn);
+   } while (plpar_busy_delay(rc));
+
if (rc) {
pr_err("H_INT_SYNC lisn=%ld returned %ld\n", lisn, rc);
return  rc;
@@ -285,7 +301,11 @@ static long plpar_int_esb(unsigned long flags,
pr_devel("H_INT_ESB flags=%lx lisn=%lx offset=%lx in=%lx\n",
flags,  lisn, offset, in_data);
 
-   rc = plpar_hcall(H_INT_ESB, retbuf, flags, lisn, offset, in_data);
+   do {
+   rc = plpar_hcall(H_INT_ESB, retbuf, flags, lisn, offset,
+in_data);
+   } while (plpar_busy_delay(rc));
+
if (rc) {
pr_err("H_INT_ESB lisn=%ld offset=%ld returned %ld\n",
   lisn, offset, rc);
-- 
2.13.6



[PATCH 3/4] powerpc/xive: shutdown XIVE when kexec or kdump is performed

2018-04-03 Thread Cédric Le Goater
The hcall H_INT_RESET should be called to make sure XIVE is fully
reset.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/platforms/pseries/kexec.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/kexec.c 
b/arch/powerpc/platforms/pseries/kexec.c
index eeb13429d685..1d9bbf8e7357 100644
--- a/arch/powerpc/platforms/pseries/kexec.c
+++ b/arch/powerpc/platforms/pseries/kexec.c
@@ -52,8 +52,11 @@ void pseries_kexec_cpu_down(int crash_shutdown, int 
secondary)
}
}
 
-   if (xive_enabled())
+   if (xive_enabled()) {
xive_kexec_teardown_cpu(secondary);
-   else
+
+   if (!secondary)
+   xive_shutdown();
+   } else
xics_kexec_teardown_cpu(secondary);
 }
-- 
2.13.6



[PATCH 0/4] powerpc/xive: add support for H_INT_RESET

2018-04-03 Thread Cédric Le Goater
H_INT_RESET performs a reset of the Hypervisor internal interrupt
structures, removing all settings done with H_INT_SET_SOURCE_CONFIG
and H_INT_SET_QUEUE_CONFIG. This is most important for kdump and kexec
to be able to restart in a clean state.

The first patch closes a window in the kexec sequence in which the
H_INT_RESET hcall can be run concurrently with other hcalls when CPUs
are brought down. The other patches wire up the reset hcall, making
sure it supports the H_LONG_BUSY_* codes returned by the Hypervisor.

Cédric Le Goater (4):
  powerpc/64/kexec: fix race in kexec when XIVE is shut down
  powerpc/xive: fix hcall H_INT_RESET to support long busy delays
  powerpc/xive: shutdown XIVE when kexec or kdump is performed
  powerpc/xive: prepare all hcalls to support long busy delays

 arch/powerpc/kernel/machine_kexec_64.c |  8 +--
 arch/powerpc/platforms/pseries/kexec.c |  7 ++-
 arch/powerpc/sysdev/xive/spapr.c   | 89 +-
 3 files changed, 85 insertions(+), 19 deletions(-)

-- 
2.13.6