Re: [PATCH 8/8] powerpc/64s: idle POWER8 avoid full state loss recovery when possible

2017-03-16 Thread Nicholas Piggin
On Thu, 16 Mar 2017 21:42:01 +0530
Gautham R Shenoy  wrote:

Hey, thanks for the review.

> Hi Nick,
> 
> On Tue, Mar 14, 2017 at 07:23:49PM +1000, Nicholas Piggin wrote:
> > If not all threads were in winkle, full state loss recovery is not
> > necessary and can be avoided. A previous patch removed this optimisation
> > due to some complexity with the implementation. Re-implement it by
> > counting the number of threads in winkle with the per-core idle state.
> > Only restore full state loss if all threads were in winkle.
> > 
> > This has a small window of false positives right before threads execute
> > winkle and just after they wake up, when the winkle count does not
> > reflect the true number of threads in winkle. This is not a significant
> > problem in comparison with even the minimum winkle duration. For
> > correctness, a false positive is not a problem (only false negatives
> > would be).  


> > @@ -517,10 +526,34 @@ END_FTR_SECTION_IFSET(CPU_FTR_HVMODE)
> >  * At this stage
> >  * cr2 - eq if first thread to wakeup in core
> >  * cr3-  gt if waking up with partial/complete hypervisor state loss
> > +* ISA300:
> >  * cr4 - gt or eq if waking up from complete hypervisor state loss.  
> 
> For ISA 300, we need to restore hypervisor thread resources if we are
> waking up a state that is as deep or deeper than
> pnv_first_deep_stop_state. In that case, we expect either the "gt" bit
> or the "eq" bit of cr4 to be set.
> 
> Before this patch, on ISA 207, cr4 would be "eq" if we are waking up
> from winkle.

Yes, that was based on testing the thread idle state (nap/sleep/winkle).
The condition has become more complex now, so it moved below.


> >  */
> > 
> >  BEGIN_FTR_SECTION
> > +   /*
> > +* Were we in winkle?
> > +* If yes, check if all threads were in winkle, decrement our
> > +* winkle count, set all thread winkle bits if all were in winkle.
> > +* Check if our thread has a winkle bit set, and set cr4 accordingly
> > +* (to match ISA300, above).
> > +*/
> > +   cmpwi   r18,PNV_THREAD_WINKLE
> > +   bne 2f
> > +   andis.  r9,r15,PNV_CORE_IDLE_WINKLE_COUNT_ALL_BIT@h
> > +   subis   r15,r15,PNV_CORE_IDLE_WINKLE_COUNT@h
> > +   beq 2f
> > +   ori r15,r15,PNV_CORE_IDLE_THREAD_WINKLE_BITS /* all were winkle */  
> 
> So PNV_CORE_IDLE_THREAD_WINKLE_BITS will be set by the first waking
> thread in a winkle'd core. Subsequent waking thread(s) can only clear
> their respective bits from the winkle bits.

Yes. It was easier to do it on the wakeup side because the sleep side
does not take the lock, only uses atomic load/store.


> > +2:
> > +   /* Shift thread bit to winkle mask, then test if this thread is set,
> > +* and remove it from the winkle bits */
> > +   slwir8,r7,8
> > +   and r8,r8,r15
> > +   andcr15,r15,r8
> > +   cmpwi   cr4,r8,1 /* cr4 will be gt if our bit is set, lt if not */  
> 
> Very clever indeed! So we seem to be doing the following:
> 
> winkle_entry(thread_bit)
> {
>   atomic_inc(core_winkle_count);
> }
> 
> winkle_exit(thread_bit)
> {
>   atomic {
>  if (core_winkle_count == 8)
> set all thread bits in core_winkle_mask;
> 
>  if (thread_bit set in core_winkle_mask) {
> cr4 will be "gt".
>  }
> 
>  clear thread_bit from core_winkle_mask;
>   }
> }
> 
> I would suggest adding a bit more documentation given the
> subtlety.

Yes, commenting the algorithm in pseudo C is probably a good idea. It is
a bit tricky to follow.
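Something like this, maybe (a sketch of the logic only, with illustrative
names rather than the real asm symbols):

    /* sleep entry: no lock taken, just an atomic update of the count */
    winkle_count++;

    /* wakeup, under the core idle lock */
    if (we_were_in_winkle) {
            if (winkle_count == threads_per_core)   /* whole core winkled */
                    thread_winkle_bits = ALL_THREAD_BITS;
            winkle_count--;
    }

    /* our bit set => all threads winkled while we slept: full state loss */
    full_winkle = thread_winkle_bits & my_thread_bit;
    thread_winkle_bits &= ~my_thread_bit;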

> 
> > +
> > +   cmpwi   cr4,r18,PNV_THREAD_WINKLE  
> 
> This second comparison seems to be undoing the benefit of the
> optimization by setting "eq" in cr4 for every thread that wakes up
> from winkle irrespective of whether the core has winkled or not.

No, good catch. That was some leftover debugging code that got in there
I think. Hmm, I'm not sure how much of my testing that's invalidated, so
I'll have to fix it up and re-run some more tests.

> 
> However, this is quite subtle, so good chances that my addled brain
> has gotten it wrong.
> 
> Case 1: If a thread were waking up from winkle. We have 2
> possibilities
> 
>   a) The entire core winkled at least once ever since this thread
>   went to winkle. In this case we want cr4 to be "eq" or "gt" so
>   that we restore the hypervisor resources.
>   Now, for this case the first thread in the core that wakes up would find
>   count == PNV_CORE_IDLE_WINKLE_COUNT_ALL_BIT and hence would have
>   set PNV_CORE_IDLE_THREAD_WINKLE_BITS which includes this
>   thread's bit as well. Hence, this thread
>   would find its bit set in r8, thereby cr4 would be "gt". Which
>   is good for us!
> 
>   b) The entire core has not entered winkle. In which case we
>   haven't lost any hypervisor resources, so no need to restore
>   these. We would like cr4 to have neither 

Re: [PATCH 4/8] powerpc/64s: fix POWER9 machine check handler from stop state

2017-03-16 Thread Nicholas Piggin
On Fri, 17 Mar 2017 12:49:27 +1000
Nicholas Piggin  wrote:

> On Thu, 16 Mar 2017 18:10:48 +0530
> Mahesh Jagannath Salgaonkar  wrote:
> 
> > On 03/14/2017 02:53 PM, Nicholas Piggin wrote:  

> > Looks like we are not winding up. Shouldn't we? What if we may end up
> > in pnv_wakeup_noloss() which assumes that no GPRs are lost. Am I missing
> > anything?
> 
> Hmm, on second look, I don't think any non-volatile GPRs are overwritten
> in this path. But this MCE is a slow path, and it is a much longer path
> than the system reset idle wakeup... So I'll add the napstatelost with
> a comment.

On third look, I'll just add the comment. The windup does not restore
non-volatile GPRs either, and in general we're careful not to use them
in exception handlers. So I think it's okay.

Thanks,
Nick


[PATCH] powerpc/64s: fix idle wakeup potential to clobber registers

2017-03-16 Thread Nicholas Piggin
We concluded there may be a window where the idle wakeup code could
get to pnv_wakeup_tb_loss (which clobbers non-volatile GPRs), but the
hardware may set SRR1[46:47] to 01b (no state loss) which would
result in the wakeup code failing to restore non-volatile GPRs.

I was not able to trigger this condition with trivial tests on
real hardware or simulator, but the ISA (at least 2.07) seems to
allow for it, and Gautham says that it can happen if there is an
exception pending when the sleep/winkle instruction is executed.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/idle_book3s.S | 20 +---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/idle_book3s.S 
b/arch/powerpc/kernel/idle_book3s.S
index 995728736677..6fd08219248d 100644
--- a/arch/powerpc/kernel/idle_book3s.S
+++ b/arch/powerpc/kernel/idle_book3s.S
@@ -449,9 +449,23 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
 _GLOBAL(pnv_wakeup_tb_loss)
ld  r1,PACAR1(r13)
/*
-* Before entering any idle state, the NVGPRs are saved in the stack
-* and they are restored before switching to the process context. Hence
-* until they are restored, they are free to be used.
+* Before entering any idle state, the NVGPRs are saved in the stack.
+* If there was a state loss, or PACA_NAPSTATELOST was set, then the
+* NVGPRs are restored. If we are here, it is likely that state is lost,
+* but not guaranteed -- neither ISA207 nor ISA300 tests to reach
+* here are the same as the test to restore NVGPRS:
+* PACA_THREAD_IDLE_STATE test for ISA207, PSSCR test for ISA300,
+* and SRR1 test for restoring NVGPRs.
+*
+* We are about to clobber NVGPRs now, so set NAPSTATELOST to
+* guarantee they will always be restored. This might be tightened
+* with careful reading of specs (particularly for ISA300) but this
+* is already a slow wakeup path and it's simpler to be safe.
+*/
+   li  r0,1
+   stb r0,PACA_NAPSTATELOST(r13)
+
+   /*
 *
 * Save SRR1 and LR in NVGPRs as they might be clobbered in
 * opal_call() (called in CHECK_HMI_INTERRUPT). SRR1 is required
-- 
2.11.0



[PATCH kernel v10 00/10] powerpc/kvm/vfio: Enable in-kernel acceleration

2017-03-16 Thread Alexey Kardashevskiy
This is my current queue of patches to add acceleration of TCE
updates in KVM.

This is based on Linus's tree, sha1 d528ae0d3dfe.

Please comment. Thanks.

Changes:
v10:
* fixed bugs in 10/10
* fixed 04/10 to avoid iommu_table get/put race in 10/10

v9:
* renamed a few exported symbols in 04/10
* reworked various objects' reference counting in 10/10

v8:
* kept fixing oddities with error handling in 10/10

v7:
* added realmode's WARN_ON_ONCE_RM in arch/powerpc/kvm/book3s_64_vio_hv.c

v6:
* reworked the last patch in terms of error handling and parameters checking

v5:
* replaced "KVM: PPC: Separate TCE validation from update" with
"KVM: PPC: iommu: Unify TCE checking"
* changed already reviewed "powerpc/iommu/vfio_spapr_tce: Cleanup iommu_table 
disposal"
* reworked "KVM: PPC: VFIO: Add in-kernel acceleration for VFIO"
* more details in individual commit logs

v4:
* addressed comments from v3
* updated subject lines with correct component names
* regrouped the patchset in order:
- powerpc fixes;
- vfio_spapr_tce driver fixes;
- KVM/PPC fixes;
- KVM+PPC+VFIO;
* everything except last 2 patches have "Reviewed-By: David"

v3:
* there was no full repost, only last patch was posted

v2:
* 11/11 reworked to use new notifiers, it is rather RFC as it still has
an issue;
* got 09/11, 10/11 to use notifiers in 11/11;
* added rb: David to most of patches and added a comment in 05/11.


Alexey Kardashevskiy (10):
  powerpc/mmu: Add real mode support for IOMMU preregistered memory
  powerpc/powernv/iommu: Add real mode version of
iommu_table_ops::exchange()
  powerpc/iommu/vfio_spapr_tce: Cleanup iommu_table disposal
  powerpc/vfio_spapr_tce: Add reference counting to iommu_table
  KVM: PPC: Reserve KVM_CAP_SPAPR_TCE_VFIO capability number
  KVM: PPC: Enable IOMMU_API for KVM_BOOK3S_64 permanently
  KVM: PPC: Pass kvm* to kvmppc_find_table()
  KVM: PPC: Use preregistered memory API to access TCE list
  KVM: PPC: iommu: Unify TCE checking
  KVM: PPC: VFIO: Add in-kernel acceleration for VFIO

 Documentation/virtual/kvm/devices/vfio.txt |  18 +-
 arch/powerpc/include/asm/iommu.h   |  32 ++-
 arch/powerpc/include/asm/kvm_host.h|   8 +
 arch/powerpc/include/asm/kvm_ppc.h |  12 +-
 arch/powerpc/include/asm/mmu_context.h |   4 +
 include/uapi/linux/kvm.h   |   7 +
 arch/powerpc/kernel/iommu.c|  89 ++---
 arch/powerpc/kvm/book3s_64_vio.c   | 308 -
 arch/powerpc/kvm/book3s_64_vio_hv.c| 303 +++-
 arch/powerpc/kvm/powerpc.c |   2 +
 arch/powerpc/mm/mmu_context_iommu.c|  39 
 arch/powerpc/platforms/powernv/pci-ioda.c  |  46 +++--
 arch/powerpc/platforms/powernv/pci.c   |   1 +
 arch/powerpc/platforms/pseries/iommu.c |   3 +-
 arch/powerpc/platforms/pseries/vio.c   |   2 +-
 drivers/vfio/vfio_iommu_spapr_tce.c|   2 +-
 virt/kvm/vfio.c| 104 ++
 arch/powerpc/kvm/Kconfig   |   1 +
 18 files changed, 874 insertions(+), 107 deletions(-)

-- 
2.11.0



[PATCH kernel v10 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO

2017-03-16 Thread Alexey Kardashevskiy
This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
and H_STUFF_TCE requests targeted at an IOMMU TCE table used for VFIO
without passing them to user space, which saves time on switching
to user space and back.

This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
KVM tries to handle a TCE request in real mode; if that fails,
it passes the request to virtual mode to complete the operation.
If the virtual mode handler fails, the request is passed to
the user space; this is not expected to happen though.

To avoid dealing with page use counters (which is tricky in real mode),
this only accelerates SPAPR TCE IOMMU v2 clients which are required
to pre-register the userspace memory. The very first TCE request will
be handled in the VFIO SPAPR TCE driver anyway as the userspace view
of the TCE table (iommu_table::it_userspace) is not allocated till
the very first mapping happens and we cannot call vmalloc in real mode.

If we fail to update a hardware IOMMU table for an unexpected reason, we just
clear it and move on as there is nothing really we can do about it -
for example, if we hot plug a VFIO device to a guest, existing TCE tables
will be mirrored automatically to the hardware and there is no interface
to report to the guest about possible failures.

This adds a new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
and associates a physical IOMMU table with the SPAPR TCE table (which
is a guest view of the hardware IOMMU table). The iommu_table object
is cached and referenced so we do not have to look up for it in real mode.
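For reference, user space would be expected to enable the acceleration
roughly along these lines (a sketch only; it assumes the two-fd
kvm_vfio_spapr_tce layout from this series, and the device fd comes from
KVM_CREATE_DEVICE with KVM_DEV_TYPE_VFIO):

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /* Attach a VFIO group to an in-kernel SPAPR TCE table (sketch). */
    static int kvm_vfio_attach_spapr_tce(int vfio_kvm_dev_fd,
                                         int vfio_group_fd, int tce_table_fd)
    {
            struct kvm_vfio_spapr_tce param = {
                    .groupfd = vfio_group_fd,  /* from /dev/vfio/<group> */
                    .tablefd = tce_table_fd,   /* from KVM_CREATE_SPAPR_TCE_64 */
            };
            struct kvm_device_attr attr = {
                    .group = KVM_DEV_VFIO_GROUP,
                    .attr  = KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE,
                    .addr  = (uint64_t)(uintptr_t)&param,
            };

            /* On failure user space simply keeps handling TCEs itself. */
            return ioctl(vfio_kvm_dev_fd, KVM_SET_DEVICE_ATTR, &attr);
    }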

This does not implement the UNSET counterpart as there is no use for it -
once the acceleration is enabled, the existing userspace won't
disable it unless a VFIO container is destroyed; this adds necessary
cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.

This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
space.

This adds a real mode version of WARN_ON_ONCE() as the generic version
causes problems with rcu_sched. Since we are testing what vmalloc_to_phys()
returns in the code, this also adds a check for the already existing
vmalloc_to_phys() call in kvmppc_rm_h_put_tce_indirect().

This finally makes use of vfio_external_user_iommu_id() which was
introduced quite some time ago and was considered for removal.

Tests show that this patch increases transmission speed from 220MB/s
to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb ethernet card).

Signed-off-by: Alexey Kardashevskiy 
---
Changes:
v10:
* fixed leaking references in virt/kvm/vfio.c
* moved code to helpers - kvm_vfio_group_get_iommu_group, 
kvm_spapr_tce_release_vfio_group
* fixed possible race between referencing table and destroying it via
VFIO add/remove window ioctls()

v9:
* removed referencing a group in KVM, only referencing iommu_table's now
* fixed a reference leak in KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE handler
* fixed typo in vfio.txt
* removed @argsz and @flags from struct kvm_vfio_spapr_tce

v8:
* changed all (!pua) checks to return H_TOO_HARD as ioctl() is supposed
to handle them
* changed vmalloc_to_phys() callers to return H_HARDWARE
* changed real mode iommu_tce_xchg_rm() callers to return H_TOO_HARD
and added a comment about this in the code
* changed virtual mode iommu_tce_xchg() callers to return H_HARDWARE
and do WARN_ON
* added WARN_ON_ONCE_RM(!rmap) in kvmppc_rm_h_put_tce_indirect() to
have all vmalloc_to_phys() callsites covered

v7:
* added realmode-friendly WARN_ON_ONCE_RM

v6:
* changed handling of errors returned by kvmppc_(rm_)tce_iommu_(un)map()
* moved kvmppc_gpa_to_ua() to TCE validation

v5:
* changed error codes in multiple places
* added bunch of WARN_ON() in places which should not really happen
* added a check that an iommu table is not already attached to a LIOBN
* dropped explicit calls to iommu_tce_clear_param_check/
iommu_tce_put_param_check as kvmppc_tce_validate/kvmppc_ioba_validate
call them anyway (since the previous patch)
* if we fail to update a hardware IOMMU table for unexpected reason,
this just clears the entry

v4:
* added note to the commit log about allowing multiple updates of
the same IOMMU table;
* instead of checking for if any memory was preregistered, this
returns H_TOO_HARD if a specific page was not;
* fixed comments from v3 about error handling in many places;
* simplified TCE handlers and merged IOMMU parts inline - for example,
there used to be kvmppc_h_put_tce_iommu(), now it is merged into
kvmppc_h_put_tce(); this allows to check IOBA boundaries against
the first attached table only (makes the code simpler);

v3:
* simplified not to use VFIO group notifiers
* reworked cleanup, should be cleaner/simpler now

v2:
* reworked to use new VFIO notifiers
* now same iommu_table may appear in the list several times, to be fixed later
---
 Documentation/virtual/kvm/devices/vfio.txt |  18 +-
 arch/powerpc/include/asm/kvm_host.h|   8 +
 

[PATCH kernel v10 09/10] KVM: PPC: iommu: Unify TCE checking

2017-03-16 Thread Alexey Kardashevskiy
This reworks the helpers for checking TCE update parameters so that they
can be used in KVM.

This should cause no behavioral change.

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: David Gibson 
---
Changes:
v6:
* s/tce/gpa/ as TCE without permission bits is a GPA and this is what is
passed everywhere
---
 arch/powerpc/include/asm/iommu.h| 20 +++-
 arch/powerpc/include/asm/kvm_ppc.h  |  6 --
 arch/powerpc/kernel/iommu.c | 37 +
 arch/powerpc/kvm/book3s_64_vio_hv.c | 31 +++
 4 files changed, 39 insertions(+), 55 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index d96142572e6d..8a8ce220d7d0 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -296,11 +296,21 @@ static inline void iommu_restore(void)
 #endif
 
 /* The API to support IOMMU operations for VFIO */
-extern int iommu_tce_clear_param_check(struct iommu_table *tbl,
-   unsigned long ioba, unsigned long tce_value,
-   unsigned long npages);
-extern int iommu_tce_put_param_check(struct iommu_table *tbl,
-   unsigned long ioba, unsigned long tce);
+extern int iommu_tce_check_ioba(unsigned long page_shift,
+   unsigned long offset, unsigned long size,
+   unsigned long ioba, unsigned long npages);
+extern int iommu_tce_check_gpa(unsigned long page_shift,
+   unsigned long gpa);
+
+#define iommu_tce_clear_param_check(tbl, ioba, tce_value, npages) \
+   (iommu_tce_check_ioba((tbl)->it_page_shift,   \
+   (tbl)->it_offset, (tbl)->it_size, \
+   (ioba), (npages)) || (tce_value))
+#define iommu_tce_put_param_check(tbl, ioba, gpa) \
+   (iommu_tce_check_ioba((tbl)->it_page_shift,   \
+   (tbl)->it_offset, (tbl)->it_size, \
+   (ioba), 1) || \
+   iommu_tce_check_gpa((tbl)->it_page_shift, (gpa)))
 
 extern void iommu_flush_tce(struct iommu_table *tbl);
 extern int iommu_take_ownership(struct iommu_table *tbl);
diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index eba8988d8443..72c2a155641f 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -169,8 +169,10 @@ extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
struct kvm_create_spapr_tce_64 *args);
 extern struct kvmppc_spapr_tce_table *kvmppc_find_table(
struct kvm *kvm, unsigned long liobn);
-extern long kvmppc_ioba_validate(struct kvmppc_spapr_tce_table *stt,
-   unsigned long ioba, unsigned long npages);
+#define kvmppc_ioba_validate(stt, ioba, npages) \
+   (iommu_tce_check_ioba((stt)->page_shift, (stt)->offset, \
+   (stt)->size, (ioba), (npages)) ?\
+   H_PARAMETER : H_SUCCESS)
 extern long kvmppc_tce_validate(struct kvmppc_spapr_tce_table *tt,
unsigned long tce);
 extern long kvmppc_gpa_to_ua(struct kvm *kvm, unsigned long gpa,
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index af915da5e03a..e73927352672 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -963,47 +963,36 @@ void iommu_flush_tce(struct iommu_table *tbl)
 }
 EXPORT_SYMBOL_GPL(iommu_flush_tce);
 
-int iommu_tce_clear_param_check(struct iommu_table *tbl,
-   unsigned long ioba, unsigned long tce_value,
-   unsigned long npages)
+int iommu_tce_check_ioba(unsigned long page_shift,
+   unsigned long offset, unsigned long size,
+   unsigned long ioba, unsigned long npages)
 {
-   /* tbl->it_ops->clear() does not support any value but 0 */
-   if (tce_value)
-   return -EINVAL;
+   unsigned long mask = (1UL << page_shift) - 1;
 
-   if (ioba & ~IOMMU_PAGE_MASK(tbl))
+   if (ioba & mask)
return -EINVAL;
 
-   ioba >>= tbl->it_page_shift;
-   if (ioba < tbl->it_offset)
+   ioba >>= page_shift;
+   if (ioba < offset)
return -EINVAL;
 
-   if ((ioba + npages) > (tbl->it_offset + tbl->it_size))
+   if ((ioba + 1) > (offset + size))
return -EINVAL;
 
return 0;
 }
-EXPORT_SYMBOL_GPL(iommu_tce_clear_param_check);
+EXPORT_SYMBOL_GPL(iommu_tce_check_ioba);
 
-int iommu_tce_put_param_check(struct iommu_table *tbl,
-   unsigned long ioba, unsigned long tce)
+int iommu_tce_check_gpa(unsigned long page_shift, unsigned long gpa)
 {
-   if (tce & ~IOMMU_PAGE_MASK(tbl))
-   return -EINVAL;
-
-   if (ioba & ~IOMMU_PAGE_MASK(tbl))
-   return -EINVAL;
-
-   ioba >>= 

[PATCH kernel v10 08/10] KVM: PPC: Use preregistered memory API to access TCE list

2017-03-16 Thread Alexey Kardashevskiy
VFIO on sPAPR already implements guest memory pre-registration
when the entire guest RAM gets pinned. This can be used to translate
the physical address of a guest page containing the TCE list
from H_PUT_TCE_INDIRECT.

This makes use of the pre-registered memory API to access TCE list
pages in order to avoid unnecessary locking on the KVM memory
reverse map, as we know that all of guest memory is pinned and
we have a flat array mapping GPA to HPA. This makes it simpler and
quicker to index into that array (even with looking up the
kernel page tables in vmalloc_to_phys) than it is to find the memslot,
lock the rmap entry, look up the user page tables, and unlock the rmap
entry. Note that the rmap pointer is initialized to NULL
where declared (not in this patch).

If a requested chunk of memory has not been preregistered, this will
fall back to the non-preregistered case and lock the rmap.

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: David Gibson 
---
Changes:
v4:
* removed oneline inlines
* now falls back to locking rmap if TCE list is not in preregistered memory

v2:
* updated the commit log with David's comment
---
 arch/powerpc/kvm/book3s_64_vio_hv.c | 58 +++--
 1 file changed, 42 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c 
b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 918af76ab2b6..0f145fc7a3a5 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -239,6 +239,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
long i, ret = H_SUCCESS;
unsigned long tces, entry, ua = 0;
unsigned long *rmap = NULL;
+   bool prereg = false;
 
stt = kvmppc_find_table(vcpu->kvm, liobn);
if (!stt)
@@ -259,23 +260,47 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
if (ret != H_SUCCESS)
return ret;
 
-   if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, , ))
-   return H_TOO_HARD;
+   if (mm_iommu_preregistered(vcpu->kvm->mm)) {
+   /*
+* We get here if guest memory was pre-registered which
+* is normally VFIO case and gpa->hpa translation does not
+* depend on hpt.
+*/
+   struct mm_iommu_table_group_mem_t *mem;
 
-   rmap = (void *) vmalloc_to_phys(rmap);
+   if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, , NULL))
+   return H_TOO_HARD;
 
-   /*
-* Synchronize with the MMU notifier callbacks in
-* book3s_64_mmu_hv.c (kvm_unmap_hva_hv etc.).
-* While we have the rmap lock, code running on other CPUs
-* cannot finish unmapping the host real page that backs
-* this guest real page, so we are OK to access the host
-* real page.
-*/
-   lock_rmap(rmap);
-   if (kvmppc_rm_ua_to_hpa(vcpu, ua, )) {
-   ret = H_TOO_HARD;
-   goto unlock_exit;
+   mem = mm_iommu_lookup_rm(vcpu->kvm->mm, ua, IOMMU_PAGE_SIZE_4K);
+   if (mem)
+   prereg = mm_iommu_ua_to_hpa_rm(mem, ua, ) == 0;
+   }
+
+   if (!prereg) {
+   /*
+* This is usually a case of a guest with emulated devices only
+* when TCE list is not in preregistered memory.
+* We do not require memory to be preregistered in this case
+* so lock rmap and do __find_linux_pte_or_hugepte().
+*/
+   if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, , ))
+   return H_TOO_HARD;
+
+   rmap = (void *) vmalloc_to_phys(rmap);
+
+   /*
+* Synchronize with the MMU notifier callbacks in
+* book3s_64_mmu_hv.c (kvm_unmap_hva_hv etc.).
+* While we have the rmap lock, code running on other CPUs
+* cannot finish unmapping the host real page that backs
+* this guest real page, so we are OK to access the host
+* real page.
+*/
+   lock_rmap(rmap);
+   if (kvmppc_rm_ua_to_hpa(vcpu, ua, )) {
+   ret = H_TOO_HARD;
+   goto unlock_exit;
+   }
}
 
for (i = 0; i < npages; ++i) {
@@ -289,7 +314,8 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
}
 
 unlock_exit:
-   unlock_rmap(rmap);
+   if (rmap)
+   unlock_rmap(rmap);
 
return ret;
 }
-- 
2.11.0



[PATCH kernel v10 07/10] KVM: PPC: Pass kvm* to kvmppc_find_table()

2017-03-16 Thread Alexey Kardashevskiy
The guest view TCE tables are per KVM anyway (not per VCPU) so pass kvm*
there. This will be used in the following patches where we will be
attaching VFIO containers to LIOBNs via ioctl() to KVM (rather than
to VCPU).

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: David Gibson 
---
 arch/powerpc/include/asm/kvm_ppc.h  |  2 +-
 arch/powerpc/kvm/book3s_64_vio.c|  7 ---
 arch/powerpc/kvm/book3s_64_vio_hv.c | 13 +++--
 3 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index dd11c4c8c56a..eba8988d8443 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -168,7 +168,7 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
 extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
struct kvm_create_spapr_tce_64 *args);
 extern struct kvmppc_spapr_tce_table *kvmppc_find_table(
-   struct kvm_vcpu *vcpu, unsigned long liobn);
+   struct kvm *kvm, unsigned long liobn);
 extern long kvmppc_ioba_validate(struct kvmppc_spapr_tce_table *stt,
unsigned long ioba, unsigned long npages);
 extern long kvmppc_tce_validate(struct kvmppc_spapr_tce_table *tt,
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 3e26cd4979f9..e96a4590464c 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -214,12 +214,13 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
  unsigned long ioba, unsigned long tce)
 {
-   struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn);
+   struct kvmppc_spapr_tce_table *stt;
long ret;
 
/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
/*  liobn, ioba, tce); */
 
+   stt = kvmppc_find_table(vcpu->kvm, liobn);
if (!stt)
return H_TOO_HARD;
 
@@ -247,7 +248,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
u64 __user *tces;
u64 tce;
 
-   stt = kvmppc_find_table(vcpu, liobn);
+   stt = kvmppc_find_table(vcpu->kvm, liobn);
if (!stt)
return H_TOO_HARD;
 
@@ -301,7 +302,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
struct kvmppc_spapr_tce_table *stt;
long i, ret;
 
-   stt = kvmppc_find_table(vcpu, liobn);
+   stt = kvmppc_find_table(vcpu->kvm, liobn);
if (!stt)
return H_TOO_HARD;
 
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c 
b/arch/powerpc/kvm/book3s_64_vio_hv.c
index e4c4ea973e57..918af76ab2b6 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -48,10 +48,9 @@
  * WARNING: This will be called in real or virtual mode on HV KVM and virtual
  *  mode on PR KVM
  */
-struct kvmppc_spapr_tce_table *kvmppc_find_table(struct kvm_vcpu *vcpu,
+struct kvmppc_spapr_tce_table *kvmppc_find_table(struct kvm *kvm,
unsigned long liobn)
 {
-   struct kvm *kvm = vcpu->kvm;
struct kvmppc_spapr_tce_table *stt;
 
list_for_each_entry_lockless(stt, >arch.spapr_tce_tables, list)
@@ -182,12 +181,13 @@ EXPORT_SYMBOL_GPL(kvmppc_gpa_to_ua);
 long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
unsigned long ioba, unsigned long tce)
 {
-   struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn);
+   struct kvmppc_spapr_tce_table *stt;
long ret;
 
/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
/*  liobn, ioba, tce); */
 
+   stt = kvmppc_find_table(vcpu->kvm, liobn);
if (!stt)
return H_TOO_HARD;
 
@@ -240,7 +240,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
unsigned long tces, entry, ua = 0;
unsigned long *rmap = NULL;
 
-   stt = kvmppc_find_table(vcpu, liobn);
+   stt = kvmppc_find_table(vcpu->kvm, liobn);
if (!stt)
return H_TOO_HARD;
 
@@ -301,7 +301,7 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
struct kvmppc_spapr_tce_table *stt;
long i, ret;
 
-   stt = kvmppc_find_table(vcpu, liobn);
+   stt = kvmppc_find_table(vcpu->kvm, liobn);
if (!stt)
return H_TOO_HARD;
 
@@ -322,12 +322,13 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
 long kvmppc_h_get_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
  unsigned long ioba)
 {
-   struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn);
+   struct kvmppc_spapr_tce_table *stt;
long ret;
unsigned long idx;
struct page *page;
u64 *tbl;
 
+   stt = kvmppc_find_table(vcpu->kvm, liobn);
if (!stt)
return H_TOO_HARD;
 
-- 

[PATCH kernel v10 06/10] KVM: PPC: Enable IOMMU_API for KVM_BOOK3S_64 permanently

2017-03-16 Thread Alexey Kardashevskiy
It does not make much sense to have KVM in book3s-64 and
not to have IOMMU bits for PCI pass through support as it costs little
and allows VFIO to function on book3s KVM.

Having IOMMU_API always enabled makes it unnecessary to have a lot of
"#ifdef IOMMU_API" in arch/powerpc/kvm/book3s_64_vio*. With those
ifdef's we could have only user space emulated devices accelerated
(but not VFIO) which do not seem to be very useful.

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: David Gibson 
---
 arch/powerpc/kvm/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index 029be26b5a17..65a471de96de 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -67,6 +67,7 @@ config KVM_BOOK3S_64
select KVM_BOOK3S_64_HANDLER
select KVM
select KVM_BOOK3S_PR_POSSIBLE if !KVM_BOOK3S_HV_POSSIBLE
+   select SPAPR_TCE_IOMMU if IOMMU_SUPPORT
---help---
  Support running unmodified book3s_64 and book3s_32 guest kernels
  in virtual machines on book3s_64 host processors.
-- 
2.11.0



[PATCH kernel v10 05/10] KVM: PPC: Reserve KVM_CAP_SPAPR_TCE_VFIO capability number

2017-03-16 Thread Alexey Kardashevskiy
This adds a capability number for in-kernel support for VFIO on
SPAPR platform.

The capability tells the user space whether the in-kernel handlers of
H_PUT_TCE can handle VFIO-targeted requests or not. If not, the user space
must not attempt allocating a TCE table in the host kernel via
the KVM_CREATE_SPAPR_TCE KVM ioctl, because in that case TCE requests
would not be passed to the user space, which is the desired action in
such a situation.
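For illustration, the user space check is just the usual extension query
(a sketch; kvm_fd here is assumed to be the /dev/kvm or VM file descriptor):

    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /* Non-zero if in-kernel H_PUT_TCE handlers can serve VFIO-backed tables. */
    static int have_spapr_tce_vfio(int kvm_fd)
    {
            return ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_SPAPR_TCE_VFIO) > 0;
    }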

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: David Gibson 
---
 include/uapi/linux/kvm.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index f51d5082a377..f5a52ffb6b58 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -883,6 +883,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_PPC_MMU_RADIX 134
 #define KVM_CAP_PPC_MMU_HASH_V3 135
 #define KVM_CAP_IMMEDIATE_EXIT 136
+#define KVM_CAP_SPAPR_TCE_VFIO 137
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
2.11.0



[PATCH kernel v10 04/10] powerpc/vfio_spapr_tce: Add reference counting to iommu_table

2017-03-16 Thread Alexey Kardashevskiy
So far iommu_table objects were only used in virtual mode and had
a single owner. We are going to change this by implementing in-kernel
acceleration of DMA mapping requests. The proposed acceleration
will handle requests in real mode and KVM will keep references to tables.

This adds a kref to iommu_table and defines new helpers to update it.
This replaces iommu_free_table() with iommu_tce_table_put() and makes
iommu_free_table() static. iommu_tce_table_get() is not used in this patch
but it will be in the following patch.

Since this touches prototypes, this also removes the @node_name parameter as
it has never been really useful on powernv and carrying it for
the pseries platform code to iommu_free_table() seems to be quite
useless as well.

This should cause no behavioral change.
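The intended usage pattern on the consumer side is roughly the following
(a sketch for illustration only, not part of this patch; the surrounding
code and error value are hypothetical):

    /* Take a reference while the table is associated with a LIOBN. */
    tbl = iommu_tce_table_get(tbl);
    if (!tbl)
            return -EBUSY;  /* the table is already being torn down */

    /* ... cached pointer may now be used, including from real mode ... */

    /* Drop the reference when the association is removed. */
    iommu_tce_table_put(tbl);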

Signed-off-by: Alexey Kardashevskiy 
---
Changes:
v10:
* iommu_tce_table_get() can fail now if a table is being destroyed, will be
used in 10/10
* iommu_tce_table_put() returns what kref_put() returned
* iommu_tce_table_put() got WARN_ON(!tbl) as the callers already check
for it and do not call _put() when tbl==NULL

v9:
* s/iommu_table_get/iommu_tce_table_get/ and
s/iommu_table_put/iommu_tce_table_put/ -- so I removed r-b/a-b
---
 arch/powerpc/include/asm/iommu.h  |  5 +++--
 arch/powerpc/kernel/iommu.c   | 27 ++-
 arch/powerpc/platforms/powernv/pci-ioda.c | 14 +++---
 arch/powerpc/platforms/powernv/pci.c  |  1 +
 arch/powerpc/platforms/pseries/iommu.c|  3 ++-
 arch/powerpc/platforms/pseries/vio.c  |  2 +-
 drivers/vfio/vfio_iommu_spapr_tce.c   |  2 +-
 7 files changed, 37 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 4554699aec02..d96142572e6d 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -119,6 +119,7 @@ struct iommu_table {
struct list_head it_group_list;/* List of iommu_table_group_link */
unsigned long *it_userspace; /* userspace view of the table */
struct iommu_table_ops *it_ops;
+   struct krefit_kref;
 };
 
 #define IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry) \
@@ -151,8 +152,8 @@ static inline void *get_iommu_table_base(struct device *dev)
 
 extern int dma_iommu_dma_supported(struct device *dev, u64 mask);
 
-/* Frees table for an individual device node */
-extern void iommu_free_table(struct iommu_table *tbl, const char *node_name);
+extern struct iommu_table *iommu_tce_table_get(struct iommu_table *tbl);
+extern int iommu_tce_table_put(struct iommu_table *tbl);
 
 /* Initializes an iommu_table based in values set in the passed-in
  * structure
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index bc142d87130f..af915da5e03a 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -711,13 +711,13 @@ struct iommu_table *iommu_init_table(struct iommu_table 
*tbl, int nid)
return tbl;
 }
 
-void iommu_free_table(struct iommu_table *tbl, const char *node_name)
+static void iommu_table_free(struct kref *kref)
 {
unsigned long bitmap_sz;
unsigned int order;
+   struct iommu_table *tbl;
 
-   if (!tbl)
-   return;
+   tbl = container_of(kref, struct iommu_table, it_kref);
 
if (tbl->it_ops->free)
tbl->it_ops->free(tbl);
@@ -736,7 +736,7 @@ void iommu_free_table(struct iommu_table *tbl, const char 
*node_name)
 
/* verify that table contains no entries */
if (!bitmap_empty(tbl->it_map, tbl->it_size))
-   pr_warn("%s: Unexpected TCEs for %s\n", __func__, node_name);
+   pr_warn("%s: Unexpected TCEs\n", __func__);
 
/* calculate bitmap size in bytes */
bitmap_sz = BITS_TO_LONGS(tbl->it_size) * sizeof(unsigned long);
@@ -748,7 +748,24 @@ void iommu_free_table(struct iommu_table *tbl, const char 
*node_name)
/* free table */
kfree(tbl);
 }
-EXPORT_SYMBOL_GPL(iommu_free_table);
+
+struct iommu_table *iommu_tce_table_get(struct iommu_table *tbl)
+{
+   if (kref_get_unless_zero(>it_kref))
+   return tbl;
+
+   return NULL;
+}
+EXPORT_SYMBOL_GPL(iommu_tce_table_get);
+
+int iommu_tce_table_put(struct iommu_table *tbl)
+{
+   if (WARN_ON(!tbl))
+   return 0;
+
+   return kref_put(>it_kref, iommu_table_free);
+}
+EXPORT_SYMBOL_GPL(iommu_tce_table_put);
 
 /* Creates TCEs for a user provided buffer.  The user buffer must be
  * contiguous real kernel storage (not vmalloc).  The address passed here
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 5dae54cb11e3..ee4cdb5b893f 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1424,7 +1424,7 @@ static void pnv_pci_ioda2_release_dma_pe(struct pci_dev 
*dev, struct pnv_ioda_pe
iommu_group_put(pe->table_group.group);
 

[PATCH kernel v10 03/10] powerpc/iommu/vfio_spapr_tce: Cleanup iommu_table disposal

2017-03-16 Thread Alexey Kardashevskiy
At the moment an iommu_table can be disposed of by either calling
iommu_free_table() directly or it_ops::free(); the only implementation
of free() is in IODA2 - pnv_ioda2_table_free() - and it calls
iommu_free_table() anyway.

As we are going to have reference counting on tables, we need a unified
way of disposing of tables.

This moves it_ops::free() call into iommu_free_table() and makes use
of the latter. The free() callback now handles only platform-specific
data.

Since from now on iommu_free_table() calls it_ops->free(), we need
to have it_ops initialized before calling iommu_free_table(), so this
moves this initialization into pnv_pci_ioda2_create_table().

This should cause no behavioral change.

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: David Gibson 
Acked-by: Alex Williamson 
---
Changes:
v5:
* moved "tbl->it_ops = _ioda2_iommu_ops" earlier and updated
the commit log
---
 arch/powerpc/kernel/iommu.c   |  4 
 arch/powerpc/platforms/powernv/pci-ioda.c | 10 --
 drivers/vfio/vfio_iommu_spapr_tce.c   |  2 +-
 3 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 9bace5df05d5..bc142d87130f 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -719,6 +719,9 @@ void iommu_free_table(struct iommu_table *tbl, const char 
*node_name)
if (!tbl)
return;
 
+   if (tbl->it_ops->free)
+   tbl->it_ops->free(tbl);
+
if (!tbl->it_map) {
kfree(tbl);
return;
@@ -745,6 +748,7 @@ void iommu_free_table(struct iommu_table *tbl, const char 
*node_name)
/* free table */
kfree(tbl);
 }
+EXPORT_SYMBOL_GPL(iommu_free_table);
 
 /* Creates TCEs for a user provided buffer.  The user buffer must be
  * contiguous real kernel storage (not vmalloc).  The address passed here
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 572e9c9f1ea0..5dae54cb11e3 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1424,7 +1424,6 @@ static void pnv_pci_ioda2_release_dma_pe(struct pci_dev 
*dev, struct pnv_ioda_pe
iommu_group_put(pe->table_group.group);
BUG_ON(pe->table_group.group);
}
-   pnv_pci_ioda2_table_free_pages(tbl);
iommu_free_table(tbl, of_node_full_name(dev->dev.of_node));
 }
 
@@ -2040,7 +2039,6 @@ static void pnv_ioda2_tce_free(struct iommu_table *tbl, 
long index,
 static void pnv_ioda2_table_free(struct iommu_table *tbl)
 {
pnv_pci_ioda2_table_free_pages(tbl);
-   iommu_free_table(tbl, "pnv");
 }
 
 static struct iommu_table_ops pnv_ioda2_iommu_ops = {
@@ -2317,6 +2315,8 @@ static long pnv_pci_ioda2_create_table(struct 
iommu_table_group *table_group,
if (!tbl)
return -ENOMEM;
 
+   tbl->it_ops = _ioda2_iommu_ops;
+
ret = pnv_pci_ioda2_table_alloc_pages(nid,
bus_offset, page_shift, window_size,
levels, tbl);
@@ -2325,8 +2325,6 @@ static long pnv_pci_ioda2_create_table(struct 
iommu_table_group *table_group,
return ret;
}
 
-   tbl->it_ops = _ioda2_iommu_ops;
-
*ptbl = tbl;
 
return 0;
@@ -2367,7 +2365,7 @@ static long pnv_pci_ioda2_setup_default_config(struct 
pnv_ioda_pe *pe)
if (rc) {
pe_err(pe, "Failed to configure 32-bit TCE table, err %ld\n",
rc);
-   pnv_ioda2_table_free(tbl);
+   iommu_free_table(tbl, "");
return rc;
}
 
@@ -2455,7 +2453,7 @@ static void pnv_ioda2_take_ownership(struct 
iommu_table_group *table_group)
pnv_pci_ioda2_unset_window(>table_group, 0);
if (pe->pbus)
pnv_ioda_setup_bus_dma(pe, pe->pbus, false);
-   pnv_ioda2_table_free(tbl);
+   iommu_free_table(tbl, "pnv");
 }
 
 static void pnv_ioda2_release_ownership(struct iommu_table_group *table_group)
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
b/drivers/vfio/vfio_iommu_spapr_tce.c
index cf3de91fbfe7..fbec7348a7e5 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -680,7 +680,7 @@ static void tce_iommu_free_table(struct tce_container 
*container,
unsigned long pages = tbl->it_allocated_size >> PAGE_SHIFT;
 
tce_iommu_userspace_view_free(tbl, container->mm);
-   tbl->it_ops->free(tbl);
+   iommu_free_table(tbl, "");
decrement_locked_vm(container->mm, pages);
 }
 
-- 
2.11.0



[PATCH kernel v10 02/10] powerpc/powernv/iommu: Add real mode version of iommu_table_ops::exchange()

2017-03-16 Thread Alexey Kardashevskiy
In real mode, TCE tables are invalidated using special
cache-inhibited store instructions which are not available in
virtual mode.

This defines and implements the exchange_rm() callback. This does not
define set_rm/clear_rm/flush_rm callbacks as there is no user for those -
exchange/exchange_rm are only to be used by KVM for VFIO.

The exchange_rm callback is defined for IODA1/IODA2 powernv platforms.

This replaces list_for_each_entry_rcu with its lockless version as
from now on pnv_pci_ioda2_tce_invalidate() can be called in
real mode too.

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: David Gibson 
---
 arch/powerpc/include/asm/iommu.h  |  7 +++
 arch/powerpc/kernel/iommu.c   | 23 +++
 arch/powerpc/platforms/powernv/pci-ioda.c | 26 +-
 3 files changed, 55 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 2c1d50792944..4554699aec02 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -64,6 +64,11 @@ struct iommu_table_ops {
long index,
unsigned long *hpa,
enum dma_data_direction *direction);
+   /* Real mode */
+   int (*exchange_rm)(struct iommu_table *tbl,
+   long index,
+   unsigned long *hpa,
+   enum dma_data_direction *direction);
 #endif
void (*clear)(struct iommu_table *tbl,
long index, long npages);
@@ -208,6 +213,8 @@ extern void iommu_del_device(struct device *dev);
 extern int __init tce_iommu_bus_notifier_init(void);
 extern long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
unsigned long *hpa, enum dma_data_direction *direction);
+extern long iommu_tce_xchg_rm(struct iommu_table *tbl, unsigned long entry,
+   unsigned long *hpa, enum dma_data_direction *direction);
 #else
 static inline void iommu_register_group(struct iommu_table_group *table_group,
int pci_domain_number,
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 5f202a566ec5..9bace5df05d5 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1004,6 +1004,29 @@ long iommu_tce_xchg(struct iommu_table *tbl, unsigned 
long entry,
 }
 EXPORT_SYMBOL_GPL(iommu_tce_xchg);
 
+long iommu_tce_xchg_rm(struct iommu_table *tbl, unsigned long entry,
+   unsigned long *hpa, enum dma_data_direction *direction)
+{
+   long ret;
+
+   ret = tbl->it_ops->exchange_rm(tbl, entry, hpa, direction);
+
+   if (!ret && ((*direction == DMA_FROM_DEVICE) ||
+   (*direction == DMA_BIDIRECTIONAL))) {
+   struct page *pg = realmode_pfn_to_page(*hpa >> PAGE_SHIFT);
+
+   if (likely(pg)) {
+   SetPageDirty(pg);
+   } else {
+   tbl->it_ops->exchange_rm(tbl, entry, hpa, direction);
+   ret = -EFAULT;
+   }
+   }
+
+   return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_tce_xchg_rm);
+
 int iommu_take_ownership(struct iommu_table *tbl)
 {
unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index e36738291c32..572e9c9f1ea0 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1860,6 +1860,17 @@ static int pnv_ioda1_tce_xchg(struct iommu_table *tbl, 
long index,
 
return ret;
 }
+
+static int pnv_ioda1_tce_xchg_rm(struct iommu_table *tbl, long index,
+   unsigned long *hpa, enum dma_data_direction *direction)
+{
+   long ret = pnv_tce_xchg(tbl, index, hpa, direction);
+
+   if (!ret)
+   pnv_pci_p7ioc_tce_invalidate(tbl, index, 1, true);
+
+   return ret;
+}
 #endif
 
 static void pnv_ioda1_tce_free(struct iommu_table *tbl, long index,
@@ -1874,6 +1885,7 @@ static struct iommu_table_ops pnv_ioda1_iommu_ops = {
.set = pnv_ioda1_tce_build,
 #ifdef CONFIG_IOMMU_API
.exchange = pnv_ioda1_tce_xchg,
+   .exchange_rm = pnv_ioda1_tce_xchg_rm,
 #endif
.clear = pnv_ioda1_tce_free,
.get = pnv_tce_get,
@@ -1948,7 +1960,7 @@ static void pnv_pci_ioda2_tce_invalidate(struct 
iommu_table *tbl,
 {
struct iommu_table_group_link *tgl;
 
-   list_for_each_entry_rcu(tgl, >it_group_list, next) {
+   list_for_each_entry_lockless(tgl, >it_group_list, next) {
struct pnv_ioda_pe *pe = container_of(tgl->table_group,
struct pnv_ioda_pe, table_group);
struct pnv_phb *phb = pe->phb;
@@ -2004,6 +2016,17 @@ static int pnv_ioda2_tce_xchg(struct iommu_table *tbl, 
long index,
 
return ret;
 }
+
+static int 

[PATCH kernel v10 01/10] powerpc/mmu: Add real mode support for IOMMU preregistered memory

2017-03-16 Thread Alexey Kardashevskiy
This makes mm_iommu_lookup() able to work in realmode by replacing
list_for_each_entry_rcu() (which can do debug stuff which can fail in
real mode) with list_for_each_entry_lockless().

This adds a realmode version of mm_iommu_ua_to_hpa() which adds
an explicit vmalloc'd-to-linear address conversion.
Unlike mm_iommu_ua_to_hpa(), mm_iommu_ua_to_hpa_rm() can fail.

This changes mm_iommu_preregistered() to receive @mm as in real mode
@current does not always have a correct pointer.

This adds a realmode version of mm_iommu_lookup() which receives @mm
(for the same reason as mm_iommu_preregistered()) and uses the
lockless version of list_for_each_entry_rcu().

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: David Gibson 
---
 arch/powerpc/include/asm/mmu_context.h |  4 
 arch/powerpc/mm/mmu_context_iommu.c| 39 ++
 2 files changed, 43 insertions(+)

diff --git a/arch/powerpc/include/asm/mmu_context.h 
b/arch/powerpc/include/asm/mmu_context.h
index b9e3f0aca261..c70c8272523d 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -29,10 +29,14 @@ extern void mm_iommu_init(struct mm_struct *mm);
 extern void mm_iommu_cleanup(struct mm_struct *mm);
 extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct mm_struct *mm,
unsigned long ua, unsigned long size);
+extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup_rm(
+   struct mm_struct *mm, unsigned long ua, unsigned long size);
 extern struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
unsigned long ua, unsigned long entries);
 extern long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
unsigned long ua, unsigned long *hpa);
+extern long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
+   unsigned long ua, unsigned long *hpa);
 extern long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem);
 extern void mm_iommu_mapped_dec(struct mm_iommu_table_group_mem_t *mem);
 #endif
diff --git a/arch/powerpc/mm/mmu_context_iommu.c 
b/arch/powerpc/mm/mmu_context_iommu.c
index 497130c5c742..fc67bd766eaf 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -314,6 +314,25 @@ struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct 
mm_struct *mm,
 }
 EXPORT_SYMBOL_GPL(mm_iommu_lookup);
 
+struct mm_iommu_table_group_mem_t *mm_iommu_lookup_rm(struct mm_struct *mm,
+   unsigned long ua, unsigned long size)
+{
+   struct mm_iommu_table_group_mem_t *mem, *ret = NULL;
+
+   list_for_each_entry_lockless(mem, >context.iommu_group_mem_list,
+   next) {
+   if ((mem->ua <= ua) &&
+   (ua + size <= mem->ua +
+(mem->entries << PAGE_SHIFT))) {
+   ret = mem;
+   break;
+   }
+   }
+
+   return ret;
+}
+EXPORT_SYMBOL_GPL(mm_iommu_lookup_rm);
+
 struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
unsigned long ua, unsigned long entries)
 {
@@ -345,6 +364,26 @@ long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t 
*mem,
 }
 EXPORT_SYMBOL_GPL(mm_iommu_ua_to_hpa);
 
+long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
+   unsigned long ua, unsigned long *hpa)
+{
+   const long entry = (ua - mem->ua) >> PAGE_SHIFT;
+   void *va = >hpas[entry];
+   unsigned long *pa;
+
+   if (entry >= mem->entries)
+   return -EFAULT;
+
+   pa = (void *) vmalloc_to_phys(va);
+   if (!pa)
+   return -EFAULT;
+
+   *hpa = *pa | (ua & ~PAGE_MASK);
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(mm_iommu_ua_to_hpa_rm);
+
 long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem)
 {
if (atomic64_inc_not_zero(>mapped))
-- 
2.11.0



Re: [PATCH] tty: hvc: don't allocate a buffer for console print on stack

2017-03-16 Thread Greg Kroah-Hartman
On Fri, Feb 17, 2017 at 11:42:45PM +0300, Jan Dakinevich wrote:
> The buffer is used by virtio console driver as DMA buffer. Since v4.9
> (if VMAP_STACK is enabled) we shouldn't use the stack for DMA.

You shouldn't use 'static' data either, that's not always guaranteed to
be DMA-able, right?

> 
> Signed-off-by: Jan Dakinevich 
> ---
>  drivers/tty/hvc/hvc_console.c | 7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/tty/hvc/hvc_console.c b/drivers/tty/hvc/hvc_console.c
> index 9b5c0fb..1ce6aaf 100644
> --- a/drivers/tty/hvc/hvc_console.c
> +++ b/drivers/tty/hvc/hvc_console.c
> @@ -143,10 +143,15 @@ static struct hvc_struct *hvc_get_by_index(int index)
>  static void hvc_console_print(struct console *co, const char *b,
> unsigned count)
>  {
> - char c[N_OUTBUF] __ALIGNED__;
>   unsigned i = 0, n = 0;
>   int r, donecr = 0, index = co->index;
>  
> + /*
> +  * Access to the buffer is serialized by console_sem in caller code from
> +  * kernel/printk/printk.c
> +  */
> + static char c[N_OUTBUF] __ALIGNED__;

What about allocating it dynamically?  That's the correct thing to do.

thanks,

greg k-h


Re: [PATCH] powerpc/pseries: Don't give a warning when HPT resizing isn't available

2017-03-16 Thread Michael Ellerman
David Gibson  writes:

> As of 438cc81a41 "powerpc/pseries: Automatically resize HPT for memory hot
> add/remove" when running on the pseries platform, we always attempt to
> use the PAPR extension to resize the hashed page table (HPT) when we add
> or remove memory.
>
> This is fine, but when the extension isn't available we'll give a harmless,
> but scary warning.  This patch suppresses the warning in this case.  It
> will still warn if the feature is supposed to be available, but didn't
> work.
>
> Signed-off-by: David Gibson 
> ---
>  arch/powerpc/mm/hash_utils_64.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> Kind of cosmetic, but getting an error message on every memory hot
> plug/unplug attempt if your host doesn't support HPT resizing is
> pretty ugly.  So I think this is a candidate for quick inclusion.

Yeah thanks, I forgot I was going to send a patch for it.

I was thinking of doing the following instead, or maybe we can do both?

diff --git a/arch/powerpc/platforms/pseries/lpar.c 
b/arch/powerpc/platforms/pseries/lpar.c
index 251060cf1713..8b1fe895daa3 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -751,7 +751,9 @@ void __init hpte_init_pseries(void)
mmu_hash_ops.flush_hash_range= pSeries_lpar_flush_hash_range;
mmu_hash_ops.hpte_clear_all  = pseries_hpte_clear_all;
mmu_hash_ops.hugepage_invalidate = pSeries_lpar_hugepage_invalidate;
-   mmu_hash_ops.resize_hpt  = pseries_lpar_resize_hpt;
+
+   if (firmware_has_feature(FW_FEATURE_HPT_RESIZE))
+   mmu_hash_ops.resize_hpt = pseries_lpar_resize_hpt;
 }
 
 void radix_init_pseries(void)


cheers


Re: [PATCH 1/3] cxl: Re-factor cxl_pci_afu_read_err_buffer()

2017-03-16 Thread Andrew Donnellan

On 14/03/17 15:06, Vaibhav Jain wrote:

This patch moves, renames and re-factors the function
cxl_pci_afu_read_err_buffer(). The function is now moved to native.c from
pci.c and renamed to native_afu_read_err_buffer().

Also, the ability to copy data from h/w that enforces 4/8 byte aligned
access is useful and better shared across other functions. So this
patch moves the core logic of the existing cxl_pci_afu_read_err_buffer()
to a new function named __aligned_memcpy(). The new implementation of
native_afu_read_err_buffer() is simply a call to __aligned_memcpy()
with the appropriate actual parameters.

Signed-off-by: Vaibhav Jain 


Comments below.

Reviewed-by: Andrew Donnellan 


---
 drivers/misc/cxl/cxl.h|  3 ---
 drivers/misc/cxl/native.c | 56 ++-
 drivers/misc/cxl/pci.c| 44 -
 3 files changed, 55 insertions(+), 48 deletions(-)

diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h
index 79e60ec..ef683b7 100644
--- a/drivers/misc/cxl/cxl.h
+++ b/drivers/misc/cxl/cxl.h
@@ -739,9 +739,6 @@ static inline u64 cxl_p2n_read(struct cxl_afu *afu, 
cxl_p2n_reg_t reg)
return ~0ULL;
 }

-ssize_t cxl_pci_afu_read_err_buffer(struct cxl_afu *afu, char *buf,
-   loff_t off, size_t count);
-
 /* Internal functions wrapped in cxl_base to allow PHB to call them */
 bool _cxl_pci_associate_default_context(struct pci_dev *dev, struct cxl_afu 
*afu);
 void _cxl_pci_disable_device(struct pci_dev *dev);
diff --git a/drivers/misc/cxl/native.c b/drivers/misc/cxl/native.c
index 7ae7105..20d3df6 100644
--- a/drivers/misc/cxl/native.c
+++ b/drivers/misc/cxl/native.c
@@ -1276,6 +1276,60 @@ static int native_afu_cr_write8(struct cxl_afu *afu, int 
cr, u64 off, u8 in)
return rc;
 }

+#define ERR_BUFF_MAX_COPY_SIZE PAGE_SIZE
+
+/*
+ * __aligned_memcpy:
+ * Copies count or max_read bytes (whichever is smaller) from src to dst buffer
+ * starting at offset off in src buffer. This specialized implementation of
+ * memcpy_fromio is needed as capi h/w only supports 4/8 bytes aligned access.
+ * So in case the requested offset/count arent 8 byte aligned the function uses


aren't


+ * a bounce buffer which can be max ERR_BUFF_MAX_COPY_SIZE == PAGE_SIZE
+ */
+static ssize_t __aligned_memcpy(void *dst, void __iomem *src, loff_t off,
+  size_t count, size_t max_read)
+{
+   loff_t aligned_start, aligned_end;
+   size_t aligned_length;
+   void *tbuf;
+
+   if (count == 0 || off < 0 || (size_t)off >= max_read)
+   return 0;
+
+   /* calculate aligned read window */
+   count = min((size_t)(max_read - off), count);
+   aligned_start = round_down(off, 8);
+   aligned_end = round_up(off + count, 8);
+   aligned_length = aligned_end - aligned_start;
+
+   /* max we can copy in one read is PAGE_SIZE */
+   if (aligned_length > ERR_BUFF_MAX_COPY_SIZE) {


I'm not sure if the name ERR_BUFF_MAX_COPY_SIZE makes sense here any more.


+   aligned_length = ERR_BUFF_MAX_COPY_SIZE;
+   count = ERR_BUFF_MAX_COPY_SIZE - (off & 0x7);
+   }
+
+   /* use bounce buffer for copy */
+   tbuf = (void *)__get_free_page(GFP_TEMPORARY);
+   if (!tbuf)
+   return -ENOMEM;
+
+   /* perform aligned read from the mmio region */
+   memcpy_fromio(tbuf, src + aligned_start, aligned_length);
+   memcpy(dst, tbuf + (off & 0x7), count);
+
+   free_page((unsigned long)tbuf);
+
+   return count;
+}
+
+static ssize_t native_afu_read_err_buffer(struct cxl_afu *afu, char *buf,
+   loff_t off, size_t count)
+{
+   void __iomem *ebuf = afu->native->afu_desc_mmio + afu->eb_offset;
+
+   return __aligned_memcpy(buf, ebuf, off, count, afu->eb_len);
+}
+
 const struct cxl_backend_ops cxl_native_ops = {
.module = THIS_MODULE,
.adapter_reset = cxl_pci_reset,
@@ -1294,7 +1348,7 @@ const struct cxl_backend_ops cxl_native_ops = {
.support_attributes = native_support_attributes,
.link_ok = cxl_adapter_link_ok,
.release_afu = cxl_pci_release_afu,
-   .afu_read_err_buffer = cxl_pci_afu_read_err_buffer,
+   .afu_read_err_buffer = native_afu_read_err_buffer,
.afu_check_and_enable = native_afu_check_and_enable,
.afu_activate_mode = native_afu_activate_mode,
.afu_deactivate_mode = native_afu_deactivate_mode,
diff --git a/drivers/misc/cxl/pci.c b/drivers/misc/cxl/pci.c
index 91f6459..541dc9a 100644
--- a/drivers/misc/cxl/pci.c
+++ b/drivers/misc/cxl/pci.c
@@ -1051,50 +1051,6 @@ static int sanitise_afu_regs(struct cxl_afu *afu)
return 0;
 }

-#define ERR_BUFF_MAX_COPY_SIZE PAGE_SIZE
-/*
- * afu_eb_read:
- * Called from sysfs and reads the afu error info buffer. The h/w only supports
- * 4/8 bytes aligned access. So in case the requested 

Re: [PATCH V2 09/11] powerpc/mm: Lower the max real address to 51 bits

2017-03-16 Thread Aneesh Kumar K.V



On Friday 17 March 2017 02:56 AM, Benjamin Herrenschmidt wrote:

On Thu, 2017-03-16 at 16:02 +0530, Aneesh Kumar K.V wrote:

The max value supported by hardware is a 51-bit address. The radix page table
defines a slot of 57 bits for future expansion. We restrict the value supported
in the linux kernel to 51 bits, so that we can use the bits between 57 and 51
for storing hash linux page table bits. This is done in the next patch.


All of them? I would keep some for future backward compatibility. It's likely
that a successor to P9 will have more physical address bits. I feel nervous
limiting to precisely what P9 supports.



What do you want to keep as MAX PFN bits here? Any new expansion will
eat into the software bits defined by Radix and hence can't be used
for generic features.



This will free up the software page table bits to be used for features
that are needed for both hash and radix. The current hash Linux page table
format doesn't have any free software bits. Moving the hash-specific bits to
the top of the RPN field frees up the software bits for other purposes.



-aneesh



Re: [PATCH V2 11/11] powerpc/mm: Move hash specific pte bits to be top bits of RPN

2017-03-16 Thread Aneesh Kumar K.V



On Friday 17 March 2017 04:04 AM, Paul Mackerras wrote:

On Thu, Mar 16, 2017 at 04:02:09PM +0530, Aneesh Kumar K.V wrote:


.


/* pte contains a translation */

+
+/*
+ * Top and bottom bits of RPN which can be used by hash
+ * translation mode, because we expect them to be zero
+ * otherwise.
+ */
 #define _RPAGE_RPN00x01000
 #define _RPAGE_RPN10x02000
+#define _RPAGE_RPN45   0x0100UL
+#define _RPAGE_RPN44   0x0080UL
+#define _RPAGE_RPN43   0x0040UL
+#define _RPAGE_RPN42   0x0020UL
+#define _RPAGE_RPN41   0x0010UL
+#define _RPAGE_RPN40   0x0008UL


If RPN0 is 0x1000, then this is actually RPN39 as far as I can see,
and the other RPN4* bits are likewise off by one.


0x0100 >> 12 = 0x1000

I guess I got the naming wrong. It is a count of 45 bits, hence the numbering
should end at RPN44. I will fix it up in the next update.


-aneesh



[PATCH v3 07/10] VAS: Define vas_rx_win_open() interface

2017-03-16 Thread Sukadev Bhattiprolu
Define the vas_rx_win_open() interface. This interface is intended to be
used by the Nest Accelerator (NX) driver(s) to setup receive windows for
one or more NX engines (which implement compression/encryption algorithms
in the hardware).

Follow-on patches will provide an interface to close the window and to open
a send window that kernel subsystems can use to access the NX engines.

The interface to open a receive window is expected to be invoked for each
instance of VAS in the system.
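
To make the intended call sequence concrete, here is a minimal sketch of how
an NX driver might use this interface (the function name and the FIFO
allocation are illustrative assumptions, not part of this patch):

	static struct vas_window *nx842_open_rxwin(int vasid, void *fifo,
				int fifo_size)
	{
		struct vas_rx_win_attr rxattr;

		/* Start from the NX defaults set by vas_init_rx_win_attr(). */
		vas_init_rx_win_attr(&rxattr, VAS_COP_TYPE_842);

		/* The receive FIFO buffer and its size are owned by the caller. */
		rxattr.rx_fifo = fifo;
		rxattr.rx_fifo_size = fifo_size;

		/* Returns a window handle, or ERR_PTR() on error. */
		return vas_rx_win_open(vasid, VAS_COP_TYPE_842, &rxattr);
	}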

Signed-off-by: Sukadev Bhattiprolu 
---

Changelog[v3]:
- Fault receive windows must enable interrupts and disable
  notifications. NX Windows are opposite.
- Use macros rather than enum for threshold-control mode
- Ignore irq_ports for in-kernel windows. They are needed for
  user space windows and will be added later
---
 arch/powerpc/include/asm/vas.h  |  46 +
 drivers/misc/vas/vas-internal.h |  11 +++
 drivers/misc/vas/vas-window.c   | 200 
 3 files changed, 257 insertions(+)

diff --git a/arch/powerpc/include/asm/vas.h b/arch/powerpc/include/asm/vas.h
index 184eeb2..d49b05b 100644
--- a/arch/powerpc/include/asm/vas.h
+++ b/arch/powerpc/include/asm/vas.h
@@ -38,6 +38,51 @@ enum vas_cop_type {
 };
 
 /*
+ * Receive window attributes specified by the (in-kernel) owner of window.
+ */
+struct vas_rx_win_attr {
+   void *rx_fifo;
+   int rx_fifo_size;
+   int wcreds_max;
+
+   bool pin_win;
+   bool rej_no_credit;
+   bool tx_wcred_mode;
+   bool rx_wcred_mode;
+   bool tx_win_ord_mode;
+   bool rx_win_ord_mode;
+   bool data_stamp;
+   bool nx_win;
+   bool fault_win;
+   bool notify_disable;
+   bool intr_disable;
+   bool notify_early;
+
+   int lnotify_lpid;
+   int lnotify_pid;
+   int lnotify_tid;
+   int pswid;
+
+   int tc_mode;
+};
+
+/*
+ * Helper to initialize receive window attributes to defaults for an
+ * NX window.
+ */
+extern void vas_init_rx_win_attr(struct vas_rx_win_attr *rxattr,
+   enum vas_cop_type cop);
+
+/*
+ * Open a VAS receive window for the instance of VAS identified by @vasid
+ * Use @attr to initialize the attributes of the window.
+ *
+ * Return a handle to the window or ERR_PTR() on error.
+ */
+extern struct vas_window *vas_rx_win_open(int vasid, enum vas_cop_type cop,
+   struct vas_rx_win_attr *attr);
+
+/*
  * Get/Set bit fields
  */
 #define GET_FIELD(m, v)(((v) & (m)) >> MASK_LSH(m))
@@ -46,4 +91,5 @@ enum vas_cop_type {
   (((v) & ~(m)) | ((((typeof(v))(val)) << MASK_LSH(m)) & (m)))
 
 #endif /* __KERNEL__ */
+
 #endif
diff --git a/drivers/misc/vas/vas-internal.h b/drivers/misc/vas/vas-internal.h
index 8e721df..1e5c94b 100644
--- a/drivers/misc/vas/vas-internal.h
+++ b/drivers/misc/vas/vas-internal.h
@@ -402,6 +402,16 @@ extern struct vas_instance *find_vas_instance(int vasid);
 #define VREG(r)VREG_SFX(r, _OFFSET)
 
 #ifndef vas_debug
+static inline void dump_rx_win_attr(struct vas_rx_win_attr *attr)
+{
+   pr_err("VAS: fault %d, notify %d, intr %d early %d\n",
+   attr->fault_win, attr->notify_disable,
+   attr->intr_disable, attr->notify_early);
+
+   pr_err("VAS: rx_fifo_size %d, max value %d\n",
+   attr->rx_fifo_size, VAS_RX_FIFO_SIZE_MAX);
+}
+
 static inline void vas_log_write(struct vas_window *win, char *name,
void *regptr, uint64_t val)
 {
@@ -414,6 +424,7 @@ static inline void vas_log_write(struct vas_window *win, 
char *name,
 #else  /* vas_debug */
 
 #define vas_log_write(win, name, reg, val)
+#define dump_rx_win_attr(attr)
 
 #endif /* vas_debug */
 
diff --git a/drivers/misc/vas/vas-window.c b/drivers/misc/vas/vas-window.c
index 9233bf5..605a2fa 100644
--- a/drivers/misc/vas/vas-window.c
+++ b/drivers/misc/vas/vas-window.c
@@ -547,3 +547,203 @@ int vas_window_reset(struct vas_instance *vinst, int 
winid)
 
return 0;
 }
+
+struct vas_window *get_vinstance_rxwin(struct vas_instance *vinst,
+   enum vas_cop_type cop)
+{
+   struct vas_window *rxwin;
+
+   mutex_lock(&vinst->mutex);
+
+   rxwin = vinst->rxwin[cop];
+   if (rxwin)
+   atomic_inc(&rxwin->num_txwins);
+
+   mutex_unlock(&vinst->mutex);
+
+   return rxwin;
+}
+
+static void set_vinstance_rxwin(struct vas_instance *vinst,
+   enum vas_cop_type cop, struct vas_window *window)
+{
+   mutex_lock(&vinst->mutex);
+
+   /*
+* There should only be one receive window for a coprocessor type.
+*/
+   WARN_ON_ONCE(vinst->rxwin[cop]);
+   vinst->rxwin[cop] = window;
+
+   mutex_unlock(&vinst->mutex);
+}
+
+static void init_winctx_for_rxwin(struct vas_window *rxwin,
+   struct vas_rx_win_attr *rxattr,
+   struct 

[PATCH v3 09/10] VAS: Define vas_tx_win_open()

2017-03-16 Thread Sukadev Bhattiprolu
Define an interface to open a VAS send window. This interface is
intended to be used by the Nest Accelerator (NX) driver(s) to open
a send window and use it to submit compression/encryption requests
to a VAS receive window.

The receive window, identified by the [node, chip, cop] parameters,
must already be open in VAS (i.e. connected to an NX engine).
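
For illustration, a sketch of how a kernel client might open the matching
send window (the function name, and taking lpid/pidr from the LPID/PID SPRs,
are assumptions for the example, not part of this patch):

	static struct vas_window *nx842_open_txwin(int vasid)
	{
		struct vas_tx_win_attr txattr;

		vas_init_tx_win_attr(&txattr, VAS_COP_TYPE_842);

		/* Identify the partition and the sending (kernel) context. */
		txattr.lpid = mfspr(SPRN_LPID);
		txattr.pidr = mfspr(SPRN_PID);

		/* The VAS_COP_TYPE_842 receive window must already be open. */
		return vas_tx_win_open(vasid, VAS_COP_TYPE_842, &txattr);
	}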

Signed-off-by: Sukadev Bhattiprolu 

---
Changelog [v3]:
- Distinguish between hardware PID (SPRN_PID) and Linux pid.
- Use macros rather than enum for threshold-control mode
- Set the pid of send window from attr (needed for user space
  send windows).
- Ignore irq port setting for now. They are needed for user space
  windows and will be added later
---
 arch/powerpc/include/asm/vas.h |  42 +++
 drivers/misc/vas/vas-window.c  | 157 +
 2 files changed, 199 insertions(+)

diff --git a/arch/powerpc/include/asm/vas.h b/arch/powerpc/include/asm/vas.h
index fc00249..ff8da98 100644
--- a/arch/powerpc/include/asm/vas.h
+++ b/arch/powerpc/include/asm/vas.h
@@ -67,6 +67,29 @@ struct vas_rx_win_attr {
 };
 
 /*
+ * Window attributes specified by the in-kernel owner of a send window.
+ */
+struct vas_tx_win_attr {
+   enum vas_cop_type cop;
+   int wcreds_max;
+   int lpid;
+   int pidr;   /* hardware PID (from SPRN_PID) */
+   int pid;/* linux process id */
+   int pswid;
+   int rsvd_txbuf_count;
+   int tc_mode;
+
+   bool user_win;
+   bool pin_win;
+   bool rej_no_credit;
+   bool rsvd_txbuf_enable;
+   bool tx_wcred_mode;
+   bool rx_wcred_mode;
+   bool tx_win_ord_mode;
+   bool rx_win_ord_mode;
+};
+
+/*
  * Helper to initialize receive window attributes to defaults for an
  * NX window.
  */
@@ -83,6 +106,25 @@ extern struct vas_window *vas_rx_win_open(int vasid, enum 
vas_cop_type cop,
struct vas_rx_win_attr *attr);
 
 /*
+ * Helper to initialize send window attributes to defaults for an NX window.
+ */
+extern void vas_init_tx_win_attr(struct vas_tx_win_attr *txattr,
+   enum vas_cop_type cop);
+
+/*
+ * Open a VAS send window for the instance of VAS identified by @vasid
+ * and the co-processor type @cop. Use @attr to initialize attributes
+ * of the window.
+ *
+ * Note: The instance of VAS must already have an open receive window for
+ * the coprocessor type @cop.
+ *
+ * Return a handle to the send window or ERR_PTR() on error.
+ */
+struct vas_window *vas_tx_win_open(int vasid, enum vas_cop_type cop,
+   struct vas_tx_win_attr *attr);
+
+/*
  * Close the send or receive window identified by @win. For receive windows
  * return -EAGAIN if there are active send windows attached to this receive
  * window.
diff --git a/drivers/misc/vas/vas-window.c b/drivers/misc/vas/vas-window.c
index 40e4f7d..9caf10b 100644
--- a/drivers/misc/vas/vas-window.c
+++ b/drivers/misc/vas/vas-window.c
@@ -756,6 +756,163 @@ struct vas_window *vas_rx_win_open(int vasid, enum 
vas_cop_type cop,
return ERR_PTR(rc);
 }
 
+void vas_init_tx_win_attr(struct vas_tx_win_attr *txattr, enum vas_cop_type 
cop)
+{
+   memset(txattr, 0, sizeof(*txattr));
+
+   if (cop == VAS_COP_TYPE_842 || cop == VAS_COP_TYPE_842_HIPRI) {
+   txattr->rej_no_credit = false;
+   txattr->rx_wcred_mode = true;
+   txattr->tx_wcred_mode = true;
+   txattr->rx_win_ord_mode = true;
+   txattr->tx_win_ord_mode = true;
+   }
+}
+
+static void init_winctx_for_txwin(struct vas_window *txwin,
+   struct vas_tx_win_attr *txattr,
+   struct vas_winctx *winctx)
+{
+   /*
+* We first zero all fields and only set non-zero ones. Following
+* are some fields set to 0/false for the stated reason:
+*
+*  ->notify_os_intr_regIn powerNV, send intrs to HV
+*  ->rsvd_txbuf_count  Not supported yet.
+*  ->notify_disableFalse for NX windows
+*  ->xtra_writeFalse for NX windows
+*  ->notify_early  NA for NX windows
+*  ->lnotify_lpid  NA for Tx windows
+*  ->lnotify_pid   NA for Tx windows
+*  ->lnotify_tid   NA for Tx windows
+*  ->tx_win_cred_mode  Ignore for now for NX windows
+*  ->rx_win_cred_mode  Ignore for now for NX windows
+*/
+   memset(winctx, 0, sizeof(struct vas_winctx));
+
+   winctx->wcreds_max = txattr->wcreds_max ?: VAS_WCREDS_DEFAULT;
+
+   winctx->user_win = txattr->user_win;
+   winctx->nx_win = txwin->rxwin->nx_win;
+   winctx->pin_win = txattr->pin_win;
+
+   winctx->rx_wcred_mode = txattr->rx_wcred_mode;
+   winctx->tx_wcred_mode = txattr->tx_wcred_mode;
+   winctx->rx_word_mode = txattr->rx_win_ord_mode;
+   

[PATCH v3 08/10] VAS: Define vas_win_close() interface

2017-03-16 Thread Sukadev Bhattiprolu
Define the vas_win_close() interface, which should be used to close a
send or receive window.

While the hardware configurations required to open send and receive windows
differ, the configuration to close a window is the same for both. So we use
a single interface to close the window.
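
As a sketch of the expected teardown order (txwin/rxwin being whatever
handles the driver saved from the corresponding open calls; the error
handling is illustrative only):

	/* Send windows must be closed before the receive window they feed. */
	rc = vas_win_close(txwin);
	if (!rc)
		rc = vas_win_close(rxwin);
	if (rc == -EAGAIN)
		pr_devel("rx window still has active send windows attached\n");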

Signed-off-by: Sukadev Bhattiprolu 
---
Changelog[v3]:
- Fix order of parameters in GET_FIELD().
- Update references and sequence for closing/quiescing a window.
---
 arch/powerpc/include/asm/vas.h |   7 +++
 drivers/misc/vas/vas-window.c  | 120 +
 2 files changed, 127 insertions(+)

diff --git a/arch/powerpc/include/asm/vas.h b/arch/powerpc/include/asm/vas.h
index d49b05b..fc00249 100644
--- a/arch/powerpc/include/asm/vas.h
+++ b/arch/powerpc/include/asm/vas.h
@@ -83,6 +83,13 @@ extern struct vas_window *vas_rx_win_open(int vasid, enum 
vas_cop_type cop,
struct vas_rx_win_attr *attr);
 
 /*
+ * Close the send or receive window identified by @win. For receive windows
+ * return -EAGAIN if there are active send windows attached to this receive
+ * window.
+ */
+int vas_win_close(struct vas_window *win);
+
+/*
  * Get/Set bit fields
  */
 #define GET_FIELD(m, v)(((v) & (m)) >> MASK_LSH(m))
diff --git a/drivers/misc/vas/vas-window.c b/drivers/misc/vas/vas-window.c
index 605a2fa..40e4f7d 100644
--- a/drivers/misc/vas/vas-window.c
+++ b/drivers/misc/vas/vas-window.c
@@ -548,6 +548,14 @@ int vas_window_reset(struct vas_instance *vinst, int winid)
return 0;
 }
 
+static void put_rx_win(struct vas_window *rxwin)
+{
+   /* Better not be a send window! */
+   WARN_ON_ONCE(rxwin->tx_win);
+
+   atomic_dec(&rxwin->num_txwins);
+}
+
 struct vas_window *get_vinstance_rxwin(struct vas_instance *vinst,
enum vas_cop_type cop)
 {
@@ -747,3 +755,115 @@ struct vas_window *vas_rx_win_open(int vasid, enum 
vas_cop_type cop,
	vas_release_window_id(&vinst->ida, rxwin->winid);
return ERR_PTR(rc);
 }
+
+static void poll_window_busy_state(struct vas_window *window)
+{
+   int busy;
+   uint64_t val;
+
+retry:
+   /*
+* Poll Window Busy flag
+*/
+   val = read_hvwc_reg(window, VREG(WIN_STATUS));
+   busy = GET_FIELD(VAS_WIN_BUSY, val);
+   if (busy) {
+   val = 0;
+   schedule_timeout(2000);
+   goto retry;
+   }
+}
+
+static void poll_window_credits_return(struct vas_window *window)
+{
+   int credits;
+   uint64_t val;
+
+retry:
+   /*
+* Poll Window Credits
+*/
+   if (window->tx_win) {
+   val = read_hvwc_reg(window, VREG(TX_WCRED));
+   credits = GET_FIELD(VAS_TX_WCRED, val);
+   } else {
+   val = read_hvwc_reg(window, VREG(LRX_WCRED));
+   credits = GET_FIELD(VAS_LRX_WCRED, val);
+   }
+
+   if (credits != VAS_WCREDS_DEFAULT) {
+   val = 0;
+   schedule_timeout(2000);
+   goto retry;
+   }
+}
+
+static void poll_window_castout(struct vas_window *window)
+{
+   int cached;
+   uint64_t val;
+
+   /* Cast window context out of the cache */
+retry:
+   val = read_hvwc_reg(window, VREG(WIN_CTX_CACHING_CTL));
+   cached = GET_FIELD(VAS_WIN_CACHE_STATUS, val);
+   if (cached) {
+   val = 0ULL;
+   val = SET_FIELD(VAS_CASTOUT_REQ, val, 1);
+   val = SET_FIELD(VAS_PUSH_TO_MEM, val, 0);
+   write_hvwc_reg(window, VREG(WIN_CTX_CACHING_CTL), val);
+
+   schedule_timeout(2000);
+   goto retry;
+   }
+}
+
+/*
+ * Close a window.
+ *
+ * See Section 1.12.1 of VAS workbook v1.05 for details on closing window:
+ * - disable new paste operations (unmap paste address)
+ * - Poll for the "Window Busy" bit to be cleared
+ * - Clear the Open/Enable bit for the Window.
+ * - Poll for return of window Credits (implies FIFO empty for Rx win?)
+ * - Unpin and cast window context out of cache
+ *
+ * Besides the hardware, kernel has some bookkeeping of course.
+ */
+int vas_win_close(struct vas_window *window)
+{
+   uint64_t val;
+
+   if (!window)
+   return 0;
+
+   if (!window->tx_win && atomic_read(&window->num_txwins) != 0) {
+   pr_devel("VAS: Attempting to close an active Rx window!\n");
+   WARN_ON_ONCE(1);
+   return -EAGAIN;
+   }
+
+   unmap_wc_paste_kaddr(window);
+
+   poll_window_busy_state(window);
+
+   /* Unpin window from cache and close it */
+   val = read_hvwc_reg(window, VREG(WINCTL));
+   val = SET_FIELD(VAS_WINCTL_PIN, val, 0);
+   val = SET_FIELD(VAS_WINCTL_OPEN, val, 0);
+   write_hvwc_reg(window, VREG(WINCTL), val);
+
+   poll_window_credits_return(window);
+
+   poll_window_castout(window);
+
+   /* if send window, drop reference to matching receive 

[PATCH v3 10/10] VAS: Define copy/paste interfaces

2017-03-16 Thread Sukadev Bhattiprolu
Define interfaces (wrappers) to the 'copy' and 'paste' instructions
(which are new in PowerISA 3.0). These are intended to be used
by NX driver(s) to submit Coprocessor Request Blocks (CRBs) to the
NX hardware engines.
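
A minimal sketch of how an NX driver might submit a request with these
wrappers, respecting the "offset must be 0, first/last must be true"
restrictions described below (the function name is an assumption):

	static int nx842_submit_crb(struct vas_window *txwin, void *crb)
	{
		int rc;

		/* Stage the CRB in the copy buffer... */
		rc = vas_copy_crb(crb, 0, true);
		if (rc)
			return rc;

		/* ...then paste it to the send window; @re is true for NX. */
		return vas_paste_crb(txwin, 0, true, true);
	}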

Signed-off-by: Sukadev Bhattiprolu 

---
Changelog[v3]
- Map raw CR value from paste instruction into an error code.
---
 arch/powerpc/include/asm/vas.h  | 13 
 drivers/misc/vas/copy-paste.h   | 74 +
 drivers/misc/vas/vas-internal.h | 14 
 drivers/misc/vas/vas-window.c   | 50 
 4 files changed, 151 insertions(+)
 create mode 100644 drivers/misc/vas/copy-paste.h

diff --git a/arch/powerpc/include/asm/vas.h b/arch/powerpc/include/asm/vas.h
index ff8da98..1ef81ed 100644
--- a/arch/powerpc/include/asm/vas.h
+++ b/arch/powerpc/include/asm/vas.h
@@ -132,6 +132,19 @@ struct vas_window *vas_tx_win_open(int vasid, enum 
vas_cop_type cop,
 int vas_win_close(struct vas_window *win);
 
 /*
+ * Copy the co-processor request block (CRB) @crb into the local L2 cache.
+ * For now, @offset must be 0 and @first must be true.
+ */
+extern int vas_copy_crb(void *crb, int offset, bool first);
+
+/*
+ * Paste a previously copied CRB (see vas_copy_crb()) from the L2 cache to
+ * the hardware address associated with the window @win. For now, @off must be
+ * 0 and @last must be true. @re is expected/assumed to be true for NX windows.
+ */
+extern int vas_paste_crb(struct vas_window *win, int off, bool last, bool re);
+
+/*
  * Get/Set bit fields
  */
 #define GET_FIELD(m, v)(((v) & (m)) >> MASK_LSH(m))
diff --git a/drivers/misc/vas/copy-paste.h b/drivers/misc/vas/copy-paste.h
new file mode 100644
index 000..7783bb8
--- /dev/null
+++ b/drivers/misc/vas/copy-paste.h
@@ -0,0 +1,74 @@
+/*
+ * Copyright 2016 IBM Corp.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+/*
+ * Macros taken from tools/testing/selftests/powerpc/context_switch/cp_abort.c
+ */
+#define PASTE(RA, RB, L, RC) \
+   .long (0x7c00070c | (RA) << (31-15) | (RB) << (31-20) \
+ | (L) << (31-10) | (RC) << (31-31))
+
+#define COPY(RA, RB, L) \
+   .long (0x7c00060c | (RA) << (31-15) | (RB) << (31-20) \
+ | (L) << (31-10))
+
+#define CR0_FXM"0x80"
+#define CR0_SHIFT  28
+#define CR0_MASK   0xF
+/*
+ * Copy/paste instructions:
+ *
+ * copy RA,RB,L
+ * Copy contents of address (RA) + effective_address(RB)
+ * to internal copy-buffer.
+ *
+ * L == 1 indicates this is the first copy.
+ *
+ * L == 0 indicates its a continuation of a prior first copy.
+ *
+ * paste RA,RB,L
+ * Paste contents of internal copy-buffer to the address
+ * (RA) + effective_address(RB)
+ *
+ * L == 0 indicates its a continuation of a prior paste. i.e.
+ * don't wait for the completion or update status.
+ *
+ * L == 1 indicates this is the last paste in the group (i.e.
+ * wait for the group to complete and update status in CR0).
+ *
+ * For Power9, the L bit must be 'true' in both copy and paste.
+ */
+
+static inline int vas_copy(void *crb, int offset, int first)
+{
+   WARN_ON_ONCE(!first);
+
+   __asm__ __volatile(stringify_in_c(COPY(%0, %1, %2))";"
+   :
+   : "b" (offset), "b" (crb), "i" (1)
+   : "memory");
+
+   return 0;
+}
+
+static inline int vas_paste(void *paste_address, int offset, int last)
+{
+   unsigned long long cr;
+
+   WARN_ON_ONCE(!last);
+
+   cr = 0;
+   __asm__ __volatile(stringify_in_c(PASTE(%1, %2, 1, 1))";"
+   "mfocrf %0," CR0_FXM ";"
+   : "=r" (cr)
+   : "b" (paste_address), "b" (offset)
+   : "memory");
+
+   return cr;
+}
diff --git a/drivers/misc/vas/vas-internal.h b/drivers/misc/vas/vas-internal.h
index 1e5c94b..54e2a31 100644
--- a/drivers/misc/vas/vas-internal.h
+++ b/drivers/misc/vas/vas-internal.h
@@ -456,4 +456,18 @@ static inline uint64_t read_hvwc_reg(struct vas_window 
*win,
return in_be64(win->hvwc_map+reg);
 }
 
+#ifdef vas_debug
+
+static void print_fifo_msg_count(struct vas_window *txwin)
+{
+   uint64_t read_hvwc_reg(struct vas_window *w, char *n, uint64_t o);
+   pr_devel("Winid %d, Msg count %llu\n", txwin->winid,
+   (uint64_t)read_hvwc_reg(txwin, VREG(LRFIFO_PUSH)));
+}
+#else  /* vas_debug */
+
+#define print_fifo_msg_count(window)
+
+#endif /* vas_debug */
+
 #endif
diff --git a/drivers/misc/vas/vas-window.c b/drivers/misc/vas/vas-window.c
index 9caf10b..fa2dd72 100644
--- a/drivers/misc/vas/vas-window.c
+++ 

[PATCH v3 05/10] VAS: Define helpers to init window context

2017-03-16 Thread Sukadev Bhattiprolu
Define helpers to initialize window context registers of the VAS
hardware. These will be used in follow-on patches when opening/closing
VAS windows.
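
As a small illustration of the VREG() helper added below, a register write
such as

	write_hvwc_reg(window, VREG(LPID), 0ULL);

expands (via VREG_SFX) to

	write_hvwc_reg(window, "LPID", VAS_LPID_OFFSET, 0ULL);

i.e. each call passes both the register's name (used only for debug logging)
and its offset in the window context.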

Signed-off-by: Sukadev Bhattiprolu 
---
Changelog[v3]
- Have caller, rather than init_xlate_regs() reset window regs
  so we don't reset any settings caller may already have set.
- Translation mode should be 0x3 (0b11) not 0x11.
- Skip initializing read-only registers NX_UTIL and NX_UTIL_SE
- Skip initializing adder registers from UWC - they are already
  initialized from the HVWC.
- Check winctx->user_win when setting translation registers
---
 drivers/misc/vas/vas-internal.h |  59 ++-
 drivers/misc/vas/vas-window.c   | 334 
 2 files changed, 390 insertions(+), 3 deletions(-)

diff --git a/drivers/misc/vas/vas-internal.h b/drivers/misc/vas/vas-internal.h
index 15b62e0..8e721df 100644
--- a/drivers/misc/vas/vas-internal.h
+++ b/drivers/misc/vas/vas-internal.h
@@ -11,6 +11,7 @@
 #define VAS_INTERNAL_H
 #include 
 #include 
+#include 
 #include 
 
 #ifdef CONFIG_PPC_4K_PAGES
@@ -336,9 +337,6 @@ struct vas_window {
	/* Fields applicable only to receive windows */
enum vas_cop_type cop;
atomic_t num_txwins;
-
-   int32_t hwirq;
-   uint64_t irq_port;
 };
 
 /*
@@ -392,4 +390,59 @@ struct vas_winctx {
 extern int vas_initialized;
 extern int vas_window_reset(struct vas_instance *vinst, int winid);
 extern struct vas_instance *find_vas_instance(int vasid);
+
+/*
+ * VREG(x):
+ * Expand a register's short name (eg: LPID) into two parameters:
+ * - the register's short name in string form ("LPID"), and
+ * - the name of the macro (eg: VAS_LPID_OFFSET), defining the
+ *   register's offset in the window context
+ */
+#define VREG_SFX(n, s) __stringify(n), VAS_##n##s
+#define VREG(r)VREG_SFX(r, _OFFSET)
+
+#ifndef vas_debug
+static inline void vas_log_write(struct vas_window *win, char *name,
+   void *regptr, uint64_t val)
+{
+   if (val)
+   pr_err("%swin #%d: %s reg %p, val 0x%llx\n",
+   win->tx_win ? "Tx" : "Rx", win->winid, name,
+   regptr, val);
+}
+
+#else  /* vas_debug */
+
+#define vas_log_write(win, name, reg, val)
+
+#endif /* vas_debug */
+
+static inline void write_uwc_reg(struct vas_window *win, char *name,
+   int32_t reg, uint64_t val)
+{
+   void *regptr;
+
+   regptr = win->uwc_map + reg;
+   vas_log_write(win, name, regptr, val);
+
+   out_be64(regptr, val);
+}
+
+static inline void write_hvwc_reg(struct vas_window *win, char *name,
+   int32_t reg, uint64_t val)
+{
+   void *regptr;
+
+   regptr = win->hvwc_map + reg;
+   vas_log_write(win, name, regptr, val);
+
+   out_be64(regptr, val);
+}
+
+static inline uint64_t read_hvwc_reg(struct vas_window *win,
+   char *name __maybe_unused, int32_t reg)
+{
+   return in_be64(win->hvwc_map+reg);
+}
+
 #endif
diff --git a/drivers/misc/vas/vas-window.c b/drivers/misc/vas/vas-window.c
index 32dd1d0..edf5c9f 100644
--- a/drivers/misc/vas/vas-window.c
+++ b/drivers/misc/vas/vas-window.c
@@ -14,6 +14,8 @@
 #include 
 #include "vas-internal.h"
 
+static int fault_winid;
+
 /*
  * Compute the paste address region for the window @window using the
  * ->win_base_addr and ->win_id_shift we got from device tree.
@@ -138,6 +140,338 @@ int map_wc_mmio_bars(struct vas_window *window)
return 0;
 }
 
+/*
+ * Reset all valid registers in the HV and OS/User Window Contexts for
+ * the window identified by @window.
+ *
+ * NOTE: We cannot really use a for loop to reset window context. Not all
+ *  offsets in a window context are valid registers and the valid
+ *  registers are not sequential. And, we can only write to offsets
+ *  with valid registers (or is that only in Simics?).
+ */
+void reset_window_regs(struct vas_window *window)
+{
+   write_hvwc_reg(window, VREG(LPID), 0ULL);
+   write_hvwc_reg(window, VREG(PID), 0ULL);
+   write_hvwc_reg(window, VREG(XLATE_MSR), 0ULL);
+   write_hvwc_reg(window, VREG(XLATE_LPCR), 0ULL);
+   write_hvwc_reg(window, VREG(XLATE_CTL), 0ULL);
+   write_hvwc_reg(window, VREG(AMR), 0ULL);
+   write_hvwc_reg(window, VREG(SEIDR), 0ULL);
+   write_hvwc_reg(window, VREG(FAULT_TX_WIN), 0ULL);
+   write_hvwc_reg(window, VREG(OSU_INTR_SRC_RA), 0ULL);
+   write_hvwc_reg(window, VREG(HV_INTR_SRC_RA), 0ULL);
+   write_hvwc_reg(window, VREG(PSWID), 0ULL);
+   write_hvwc_reg(window, VREG(SPARE1), 0ULL);
+   write_hvwc_reg(window, VREG(SPARE2), 0ULL);
+   write_hvwc_reg(window, VREG(SPARE3), 0ULL);
+   write_hvwc_reg(window, VREG(SPARE4), 0ULL);
+   write_hvwc_reg(window, VREG(SPARE5), 0ULL);
+   write_hvwc_reg(window, VREG(SPARE6), 0ULL);
+   

[PATCH v3 06/10] VAS: Define helpers to alloc/free windows

2017-03-16 Thread Sukadev Bhattiprolu
Define helpers to allocate/free VAS window objects. These will
be used in follow-on patches when opening/closing windows.
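
For reference, a sketch of how these helpers are expected to pair up in the
later open paths (error handling is illustrative; vinst->ida is the
per-instance window id allocator):

	winid = vas_assign_window_id(&vinst->ida);
	if (winid < 0)
		return ERR_PTR(winid);

	window = vas_window_alloc(vinst, winid);
	if (!window)
		vas_release_window_id(&vinst->ida, winid);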

Signed-off-by: Sukadev Bhattiprolu 
---
 drivers/misc/vas/vas-window.c | 74 +--
 1 file changed, 72 insertions(+), 2 deletions(-)

diff --git a/drivers/misc/vas/vas-window.c b/drivers/misc/vas/vas-window.c
index edf5c9f..9233bf5 100644
--- a/drivers/misc/vas/vas-window.c
+++ b/drivers/misc/vas/vas-window.c
@@ -119,7 +119,7 @@ static void unmap_wc_mmio_bars(struct vas_window *window)
  * OS/User Window Context (UWC) MMIO Base Address Region for the given window.
  * Map these bus addresses and save the mapped kernel addresses in @window.
  */
-int map_wc_mmio_bars(struct vas_window *window)
+static int map_wc_mmio_bars(struct vas_window *window)
 {
int len;
uint64_t start;
@@ -472,8 +472,78 @@ int init_winctx_regs(struct vas_window *window, struct 
vas_winctx *winctx)
return 0;
 }
 
-/* stub for now */
+DEFINE_SPINLOCK(vas_ida_lock);
+
+void vas_release_window_id(struct ida *ida, int winid)
+{
+   spin_lock(&vas_ida_lock);
+   ida_remove(ida, winid);
+   spin_unlock(&vas_ida_lock);
+}
+
+int vas_assign_window_id(struct ida *ida)
+{
+   int rc, winid;
+
+   rc = ida_pre_get(ida, GFP_KERNEL);
+   if (!rc)
+   return -EAGAIN;
+
+   spin_lock(&vas_ida_lock);
+   rc = ida_get_new_above(ida, 0, &winid);
+   spin_unlock(&vas_ida_lock);
+
+   if (rc)
+   return rc;
+
+   if (winid > VAS_MAX_WINDOWS_PER_CHIP) {
+   pr_err("VAS: Too many (%d) open windows\n", winid);
+   vas_release_window_id(ida, winid);
+   return -EAGAIN;
+   }
+
+   return winid;
+}
+
+static void vas_window_free(struct vas_window *window)
+{
+   unmap_wc_mmio_bars(window);
+   kfree(window->paste_addr_name);
+   kfree(window);
+}
+
+static struct vas_window *vas_window_alloc(struct vas_instance *vinst, int id)
+{
+   struct vas_window *window;
+
+   window = kzalloc(sizeof(*window), GFP_KERNEL);
+   if (!window)
+   return NULL;
+
+   window->vinst = vinst;
+   window->winid = id;
+
+   if (map_wc_mmio_bars(window))
+   goto out_free;
+
+   return window;
+
+out_free:
+   kfree(window);
+   return NULL;
+}
+
 int vas_window_reset(struct vas_instance *vinst, int winid)
 {
+   struct vas_window *window;
+
+   window = vas_window_alloc(vinst, winid);
+   if (!window)
+   return -ENOMEM;
+
+   reset_window_regs(window);
+
+   vas_window_free(window);
+
return 0;
 }
-- 
2.7.4



[PATCH v3 01/10] VAS: Define macros, register fields and structures

2017-03-16 Thread Sukadev Bhattiprolu
Define macros for the VAS hardware registers and bit-fields as well
as couple of data structures needed by the VAS driver.

Signed-off-by: Sukadev Bhattiprolu 
---
Changelog[v3]
- Rename winctx->pid to winctx->pidr to reflect that it's a value
  from the PID register (SPRN_PID), not the Linux process id.
- Make it easier to split header into kernel/user parts
- To keep user interface simple, use macros rather than enum for
  the threshold-control modes.
- Add a pid field to struct vas_window - needed for user space
  send windows.

Changelog[v2]
- Add an overview of VAS in vas-internal.h
- Get window context parameters from device tree and drop
  unnecessary macros.
---
 MAINTAINERS |   6 +
 arch/powerpc/include/asm/vas.h  |  43 +
 drivers/misc/vas/vas-internal.h | 392 
 3 files changed, 441 insertions(+)
 create mode 100644 arch/powerpc/include/asm/vas.h
 create mode 100644 drivers/misc/vas/vas-internal.h

diff --git a/MAINTAINERS b/MAINTAINERS
index c265a5f..2a910c9 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -13213,6 +13213,12 @@ S: Maintained
 F: Documentation/fb/uvesafb.txt
 F: drivers/video/fbdev/uvesafb.*
 
+VAS (IBM Virtual Accelerator Switchboard) DRIVER
+M: Sukadev Bhattiprolu 
+L: linuxppc-dev@lists.ozlabs.org
+S: Supported
+F: drivers/misc/vas/*
+
 VF610 NAND DRIVER
 M: Stefan Agner 
 L: linux-...@lists.infradead.org
diff --git a/arch/powerpc/include/asm/vas.h b/arch/powerpc/include/asm/vas.h
new file mode 100644
index 000..6d35ce6
--- /dev/null
+++ b/arch/powerpc/include/asm/vas.h
@@ -0,0 +1,43 @@
+/*
+ * Copyright 2016 IBM Corp.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#ifndef VAS_H
+#define VAS_H
+
+/*
+ * Threshold Control Mode: Have paste operation fail if the number of
+ * requests in receive FIFO exceeds a threshold.
+ *
+ * NOTE: No special error code yet if paste is rejected because of these
+ *  limits. So users can't distinguish between this and other errors.
+ */
+#define VAS_THRESH_DISABLED0
+#define VAS_THRESH_FIFO_GT_HALF_FULL   1
+#define VAS_THRESH_FIFO_GT_QTR_FULL2
+#define VAS_THRESH_FIFO_GT_EIGHTH_FULL 3
+
+#ifdef __KERNEL__
+
+#define VAS_RX_FIFO_SIZE_MAX   (8 << 20)   /* 8MB */
+/*
+ * Co-processor Engine type.
+ */
+enum vas_cop_type {
+   VAS_COP_TYPE_FAULT,
+   VAS_COP_TYPE_842,
+   VAS_COP_TYPE_842_HIPRI,
+   VAS_COP_TYPE_GZIP,
+   VAS_COP_TYPE_GZIP_HIPRI,
+   VAS_COP_TYPE_MAX,
+};
+
+
+#endif /* __KERNEL__ */
+
+#endif
diff --git a/drivers/misc/vas/vas-internal.h b/drivers/misc/vas/vas-internal.h
new file mode 100644
index 000..ce48f14
--- /dev/null
+++ b/drivers/misc/vas/vas-internal.h
@@ -0,0 +1,392 @@
+/*
+ * Copyright 2016 IBM Corp.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#ifndef VAS_INTERNAL_H
+#define VAS_INTERNAL_H
+#include 
+#include 
+#include 
+
+#ifdef CONFIG_PPC_4K_PAGES
+#  error "TODO: Compute RMA/Paste-address for 4K pages."
+#else
+#ifndef CONFIG_PPC_64K_PAGES
+#  error "Unexpected Page size."
+#endif
+#endif
+
+/*
+ * Overview of Virtual Accelerator Switchboard (VAS).
+ *
+ * VAS is a hardware "switchboard" that allows senders and receivers to
+ * exchange messages with _minimal_ kernel involvement. The receivers are
+ * typically NX coprocessor engines that perform compression or encryption
+ * in hardware, but receivers can also be other software threads.
+ *
+ * Senders are user/kernel threads that submit compression/encryption or
+ * other requests to the receivers. Senders must format their messages as
+ * Coprocessor Request Blocks (CRB)s and submit them using the instructions
+ * "copy" and "paste" which were introduced in Power9.
+ *
+ * A Power node can have (up to?) 8 Power chips. There is one instance of
+ * VAS in each Power9 chip. Each instance of VAS has 64K windows or ports.
+ * Senders and receivers must each connect to a separate window before they
+ * can exchange messages through the switchboard.
+ *
+ * Each window is described by two types of window contexts:
+ *
+ * Hypervisor Window Context (HVWC) of size VAS_HVWC_SIZE bytes
+ *
+ * OS/User Window Context (UWC) of size VAS_UWC_SIZE bytes.
+ *
+ * A window context can be viewed as a set of 64-bit registers. The settings
+ * in these registers configure/control/determine the behavior of the VAS
+ * hardware when 

[PATCH v3 03/10] VAS: Define vas_init() and vas_exit()

2017-03-16 Thread Sukadev Bhattiprolu
Implement vas_init() and vas_exit() functions for a new VAS module.
This VAS module is essentially a library for other device drivers
and kernel users of the NX coprocessors like NX-842 and NX-GZIP.

Signed-off-by: Sukadev Bhattiprolu 
---
Changelog[v3]:
- Zero vas_instances memory on allocation
- [Haren Myneni] Fix description in Kconfig
Changelog[v2]:
- Get HVWC, UWC and window address parameters from device tree.
---
 MAINTAINERS |   8 ++-
 arch/powerpc/include/asm/reg.h  |   1 +
 drivers/misc/Kconfig|   1 +
 drivers/misc/Makefile   |   1 +
 drivers/misc/vas/Kconfig|  21 ++
 drivers/misc/vas/Makefile   |   3 +
 drivers/misc/vas/vas-internal.h |   3 +
 drivers/misc/vas/vas-window.c   |  19 +
 drivers/misc/vas/vas.c  | 155 
 9 files changed, 210 insertions(+), 2 deletions(-)
 create mode 100644 drivers/misc/vas/Kconfig
 create mode 100644 drivers/misc/vas/Makefile
 create mode 100644 drivers/misc/vas/vas-window.c
 create mode 100644 drivers/misc/vas/vas.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 2a910c9..4037252 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3673,8 +3673,6 @@ F:arch/powerpc/platforms/powernv/pci-cxl.c
 F: drivers/misc/cxl/
 F: include/misc/cxl*
 F: include/uapi/misc/cxl.h
-F: Documentation/powerpc/cxl.txt
-F: Documentation/ABI/testing/sysfs-class-cxl
 
 CXLFLASH (IBM Coherent Accelerator Processor Interface CAPI Flash) SCSI DRIVER
 M: Manoj N. Kumar 
@@ -3686,6 +3684,12 @@ F:   drivers/scsi/cxlflash/
 F: include/uapi/scsi/cxlflash_ioctls.h
 F: Documentation/powerpc/cxlflash.txt
 
+VAS (IBM Virtual Accelerator Switch) DRIVER
+M: Sukadev Bhattiprolu 
+L: linuxppc-dev@lists.ozlabs.org
+S: Supported
+F: drivers/misc/vas/
+
 STMMAC ETHERNET DRIVER
 M: Giuseppe Cavallaro 
 M: Alexandre Torgue 
diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index fc879fd..7a45ff7 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -1225,6 +1225,7 @@
 #define PVR_POWER8E0x004B
 #define PVR_POWER8NVL  0x004C
 #define PVR_POWER8 0x004D
+#define PVR_POWER9 0x004E
 #define PVR_BE 0x0070
 #define PVR_PA6T   0x0090
 
diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig
index c290990..97d652e 100644
--- a/drivers/misc/Kconfig
+++ b/drivers/misc/Kconfig
@@ -783,4 +783,5 @@ source "drivers/misc/mic/Kconfig"
 source "drivers/misc/genwqe/Kconfig"
 source "drivers/misc/echo/Kconfig"
 source "drivers/misc/cxl/Kconfig"
+source "drivers/misc/vas/Kconfig"
 endmenu
diff --git a/drivers/misc/Makefile b/drivers/misc/Makefile
index 7a3ea89..5201ffd 100644
--- a/drivers/misc/Makefile
+++ b/drivers/misc/Makefile
@@ -53,6 +53,7 @@ obj-$(CONFIG_GENWQE)  += genwqe/
 obj-$(CONFIG_ECHO) += echo/
 obj-$(CONFIG_VEXPRESS_SYSCFG)  += vexpress-syscfg.o
 obj-$(CONFIG_CXL_BASE) += cxl/
+obj-$(CONFIG_VAS)  += vas/
 obj-$(CONFIG_PANEL) += panel.o
 
 lkdtm-$(CONFIG_LKDTM)  += lkdtm_core.o
diff --git a/drivers/misc/vas/Kconfig b/drivers/misc/vas/Kconfig
new file mode 100644
index 000..43cedda
--- /dev/null
+++ b/drivers/misc/vas/Kconfig
@@ -0,0 +1,21 @@
+#
+# IBM Virtual Accelerator Switchboard (VAS) compatible devices
+#depends on PPC_POWERNV && PCI_MSI && EEH
+#
+
+config VAS
+   tristate "Support for IBM Virtual Accelerator Switchboard (VAS)"
+   depends on PPC_POWERNV
+   default n
+   help
+ Select this option to enable driver support for IBM Virtual
+ Accelerator Switchboard (VAS).
+
+ VAS allows accelerators in coprocessors like NX-842 to be
+ directly available to a user process. This driver enables
+ userspace programs to access these accelerators via device
+ nodes like /dev/crypto/nx-gzip.
+
+ VAS adapters are found in POWER9 based systems.
+
+ If unsure, say N.
diff --git a/drivers/misc/vas/Makefile b/drivers/misc/vas/Makefile
new file mode 100644
index 000..7dd7139
--- /dev/null
+++ b/drivers/misc/vas/Makefile
@@ -0,0 +1,3 @@
+ccflags-y  := $(call cc-disable-warning, 
unused-const-variable)
+ccflags-$(CONFIG_PPC_WERROR)   += -Werror
+obj-$(CONFIG_VAS)  += vas.o vas-window.o
diff --git a/drivers/misc/vas/vas-internal.h b/drivers/misc/vas/vas-internal.h
index ce48f14..15b62e0 100644
--- a/drivers/misc/vas/vas-internal.h
+++ b/drivers/misc/vas/vas-internal.h
@@ -389,4 +389,7 @@ struct vas_winctx {
enum vas_notify_after_count notify_after_count;
 };
 
+extern int vas_initialized;
+extern int vas_window_reset(struct vas_instance *vinst, int winid);
+extern struct vas_instance *find_vas_instance(int vasid);
 #endif
diff --git 

[PATCH v3 04/10] VAS: Define helpers for access MMIO regions

2017-03-16 Thread Sukadev Bhattiprolu
Define some helper functions to access the MMIO regions. We use these
in follow-on patches to read/write VAS hardware registers. These
helpers are also used later to issue 'paste' instructions to submit
requests to the NX hardware engines.

Signed-off-by: Sukadev Bhattiprolu 
---
Changelog [v3]:
- Minor reorg/cleanup of map/unmap functions

Changelog [v2]:
- Get HVWC, UWC and paste addresses from window->vinst (i.e DT)
  rather than kernel macros.
---
 drivers/misc/vas/vas-window.c | 126 ++
 1 file changed, 126 insertions(+)

diff --git a/drivers/misc/vas/vas-window.c b/drivers/misc/vas/vas-window.c
index 468f3bf..32dd1d0 100644
--- a/drivers/misc/vas/vas-window.c
+++ b/drivers/misc/vas/vas-window.c
@@ -9,9 +9,135 @@
 
 #include 
 #include 
+#include 
+#include 
 #include 
 #include "vas-internal.h"
 
+/*
+ * Compute the paste address region for the window @window using the
+ * ->win_base_addr and ->win_id_shift we got from device tree.
+ */
+void compute_paste_address(struct vas_window *window, uint64_t *addr, int *len)
+{
+   uint64_t base, shift;
+   int winid;
+
+   base = window->vinst->win_base_addr;
+   shift = window->vinst->win_id_shift;
+   winid = window->winid;
+
+   *addr  = base + (winid << shift);
+   *len = PAGE_SIZE;
+
+   pr_debug("Txwin #%d: Paste addr 0x%llx\n", winid, *addr);
+}
+
+static inline void get_hvwc_mmio_bar(struct vas_window *window,
+   uint64_t *start, int *len)
+{
+   uint64_t pbaddr;
+
+   pbaddr = window->vinst->hvwc_bar_start;
+   *start = pbaddr + window->winid * VAS_HVWC_SIZE;
+   *len = VAS_HVWC_SIZE;
+}
+
+static inline void get_uwc_mmio_bar(struct vas_window *window,
+   uint64_t *start, int *len)
+{
+   uint64_t pbaddr;
+
+   pbaddr = window->vinst->uwc_bar_start;
+   *start = pbaddr + window->winid * VAS_UWC_SIZE;
+   *len = VAS_UWC_SIZE;
+}
+
+static void *map_mmio_region(char *name, uint64_t start, int len)
+{
+   void *map;
+
+   if (!request_mem_region(start, len, name)) {
+   pr_devel("%s(): request_mem_region(0x%llx, %d) failed\n",
+   __func__, start, len);
+   return NULL;
+   }
+
+   map = __ioremap(start, len, pgprot_val(pgprot_cached(__pgprot(0))));
+   if (!map) {
+   pr_devel("%s(): ioremap(0x%llx, %d) failed\n", __func__, start,
+   len);
+   return NULL;
+   }
+
+   return map;
+}
+
+/*
+ * Unmap the MMIO regions for a window.
+ */
+static void unmap_wc_paste_kaddr(struct vas_window *window)
+{
+   int len;
+   uint64_t busaddr_start;
+
+   if (window->paste_kaddr) {
+   iounmap(window->paste_kaddr);
+   compute_paste_address(window, &busaddr_start, &len);
+   release_mem_region((phys_addr_t)busaddr_start, len);
+   window->paste_kaddr = NULL;
+   }
+
+}
+
+static void unmap_wc_mmio_bars(struct vas_window *window)
+{
+   int len;
+   uint64_t busaddr_start;
+
+   unmap_wc_paste_kaddr(window);
+
+   if (window->hvwc_map) {
+   iounmap(window->hvwc_map);
+   get_hvwc_mmio_bar(window, &busaddr_start, &len);
+   release_mem_region((phys_addr_t)busaddr_start, len);
+   window->hvwc_map = NULL;
+   }
+
+   if (window->uwc_map) {
+   iounmap(window->uwc_map);
+   get_uwc_mmio_bar(window, &busaddr_start, &len);
+   release_mem_region((phys_addr_t)busaddr_start, len);
+   window->uwc_map = NULL;
+   }
+}
+
+/*
+ * Find the Hypervisor Window Context (HVWC) MMIO Base Address Region and the
+ * OS/User Window Context (UWC) MMIO Base Address Region for the given window.
+ * Map these bus addresses and save the mapped kernel addresses in @window.
+ */
+int map_wc_mmio_bars(struct vas_window *window)
+{
+   int len;
+   uint64_t start;
+
+   window->paste_kaddr = window->hvwc_map = window->uwc_map = NULL;
+
+   get_hvwc_mmio_bar(window, &start, &len);
+   window->hvwc_map = map_mmio_region("HVWCM_Window", start, len);
+
+   get_uwc_mmio_bar(window, &start, &len);
+   window->uwc_map = map_mmio_region("UWCM_Window", start, len);
+
+   if (!window->hvwc_map || !window->uwc_map) {
+   unmap_wc_mmio_bars(window);
+   return -1;
+   }
+
+   return 0;
+}
+
 /* stub for now */
 int vas_window_reset(struct vas_instance *vinst, int winid)
 {
-- 
2.7.4



[PATCH v3 02/10] Move GET_FIELD/SET_FIELD to vas.h

2017-03-16 Thread Sukadev Bhattiprolu
Move the GET_FIELD and SET_FIELD macros to vas.h as VAS and other
users of VAS, including NX-842 can use those macros.

There is a lot of related code between the VAS/NX kernel drivers
and skiboot. For consistency switch the order of parameters in
SET_FIELD to match the order in skiboot.
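
For readers cross-checking the reordering, a small worked example (the mask
value 0x00f0 is made up for illustration):

	/* MASK_LSH(0x00f0) == 4, so the field occupies bits 4..7 */
	v = SET_FIELD(0x00f0, 0, 0x5);	/* mask, value, field: v == 0x50 */
	f = GET_FIELD(0x00f0, v);	/* f == 0x5 */

i.e. the mask now always comes first, matching skiboot.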

Signed-off-by: Sukadev Bhattiprolu 
---

Changelog[v3]
- Fix order of parameters in nx-842 driver.
---
 arch/powerpc/include/asm/vas.h | 8 +++-
 drivers/crypto/nx/nx-842-powernv.c | 7 ---
 drivers/crypto/nx/nx-842.h | 5 -
 3 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/vas.h b/arch/powerpc/include/asm/vas.h
index 6d35ce6..184eeb2 100644
--- a/arch/powerpc/include/asm/vas.h
+++ b/arch/powerpc/include/asm/vas.h
@@ -37,7 +37,13 @@ enum vas_cop_type {
VAS_COP_TYPE_MAX,
 };
 
+/*
+ * Get/Set bit fields
+ */
+#define GET_FIELD(m, v)(((v) & (m)) >> MASK_LSH(m))
+#define MASK_LSH(m)(__builtin_ffsl(m) - 1)
+#define SET_FIELD(m, v, val)   \
   (((v) & ~(m)) | ((((typeof(v))(val)) << MASK_LSH(m)) & (m)))
 
 #endif /* __KERNEL__ */
-
 #endif
diff --git a/drivers/crypto/nx/nx-842-powernv.c 
b/drivers/crypto/nx/nx-842-powernv.c
index 1710f80..3abb045 100644
--- a/drivers/crypto/nx/nx-842-powernv.c
+++ b/drivers/crypto/nx/nx-842-powernv.c
@@ -22,6 +22,7 @@
 
 #include 
 #include 
+#include 
 
 MODULE_LICENSE("GPL");
 MODULE_AUTHOR("Dan Streetman ");
@@ -424,9 +425,9 @@ static int nx842_powernv_function(const unsigned char *in, 
unsigned int inlen,
 
/* set up CCW */
ccw = 0;
-   ccw = SET_FIELD(ccw, CCW_CT, nx842_ct);
-   ccw = SET_FIELD(ccw, CCW_CI_842, 0); /* use 0 for hw auto-selection */
-   ccw = SET_FIELD(ccw, CCW_FC_842, fc);
+   ccw = SET_FIELD(CCW_CT, ccw, nx842_ct);
+   ccw = SET_FIELD(CCW_CI_842, ccw, 0); /* use 0 for hw auto-selection */
+   ccw = SET_FIELD(CCW_FC_842, ccw, fc);
 
/* set up CRB's CSB addr */
csb_addr = nx842_get_pa(csb) & CRB_CSB_ADDRESS;
diff --git a/drivers/crypto/nx/nx-842.h b/drivers/crypto/nx/nx-842.h
index a4eee3b..30929bd 100644
--- a/drivers/crypto/nx/nx-842.h
+++ b/drivers/crypto/nx/nx-842.h
@@ -100,11 +100,6 @@ static inline unsigned long nx842_get_pa(void *addr)
return page_to_phys(vmalloc_to_page(addr)) + offset_in_page(addr);
 }
 
-/* Get/Set bit fields */
-#define MASK_LSH(m)(__builtin_ffsl(m) - 1)
-#define GET_FIELD(v, m)(((v) & (m)) >> MASK_LSH(m))
-#define SET_FIELD(v, m, val)   (((v) & ~(m)) | (((val) << MASK_LSH(m)) & (m)))
-
 /**
  * This provides the driver's constraints.  Different nx842 implementations
  * may have varying requirements.  The constraints are:
-- 
2.7.4



[PATCH v3 00/10] Enable VAS

2017-03-16 Thread Sukadev Bhattiprolu
Power9 introduces a hardware subsystem referred to as the Virtual
Accelerator Switchboard (VAS). VAS allows kernel subsystems and user
space processes to directly access the Nest Accelerator (NX) engines
which implement compression and encryption algorithms in the hardware.

NX has been in Power processors since Power7+, but access to the NX
engines was through the 'icswx' instruction which is only available
to the kernel/hypervisor. Starting with Power9, access to the NX
engines is provided to both kernel and user space processes through
VAS.

The switchboard (i.e. VAS) multiplexes accesses between "receivers" and
"senders", where the "receivers" are typically the NX engines and
"senders" are the kernel subsystems and user processes that wish to
access the receivers (NX engines).  Once a sender is "connected" to
a receiver through the switchboard, the senders can submit compression/
encryption requests to the hardware using the new (PowerISA 3.0)
"copy" and "paste" instructions.

In the initial OPAL and PowerNV kernel patchsets, the "senders" can
only be kernel subsystems (eg NX-842 driver). A follow-on patch set 
will allow senders to be user-space processes.

This kernel patch set configures the VAS subsystems and provides
kernel interfaces to drivers like NX-842 to open receive and send
windows in VAS and to submit requests to the NX engine.

This patch set has been tested in a Simics Power9 environment using
a modified NX-842 kernel driver and a compression self-test module from
Power8. The corresponding OPAL patchset for VAS support was posted to
skiboot mailing list:

https://lists.ozlabs.org/pipermail/skiboot/2017-January/006193.html

OPAL and kernel patchsets for NX-842 driver will be posted separately.
All four patchsets are needed to effectively use VAS/NX in Power9.

Thanks to input from Ben Herrenschmidt, Michael Neuling, Michael Ellerman
and Haren Myneni.

Changelog[v3]
- Rebase to v4.11-rc1
- Add interfaces to initialize send/receive window attributes to
  defaults that drivers can use (see arch/powerpc/include/asm/vas.h)
- Modify interface vas_paste() to return 0 or error code
- Fix a bug in setting Translation Control Mode (0b11 not 0x11)
- Enable send-window-credit checking 
- Reorg code  in vas_win_close()
- Minor reorgs and tweaks to register field settings to make it
  easier to add support for user space windows.
- Skip writing to read-only registers
- Start window indexing from 0 rather than 1

Changelog[v2]
- Use vas-id, HVWC, UWC and paste address, entries from device tree
  rather than defining/computing them in kernel and reorg code.

Sukadev Bhattiprolu (10):
  VAS: Define macros, register fields and structures
  Move GET_FIELD/SET_FIELD to vas.h
  VAS: Define vas_init() and vas_exit()
  VAS: Define helpers for access MMIO regions
  VAS: Define helpers to init window context
  VAS: Define helpers to alloc/free windows
  VAS: Define vas_rx_win_open() interface
  VAS: Define vas_win_close() interface
  VAS: Define vas_tx_win_open()
  VAS: Define copy/paste interfaces

 MAINTAINERS|   14 +-
 arch/powerpc/include/asm/reg.h |1 +
 arch/powerpc/include/asm/vas.h |  157 ++
 drivers/crypto/nx/nx-842-powernv.c |7 +-
 drivers/crypto/nx/nx-842.h |5 -
 drivers/misc/Kconfig   |1 +
 drivers/misc/Makefile  |1 +
 drivers/misc/vas/Kconfig   |   21 +
 drivers/misc/vas/Makefile  |3 +
 drivers/misc/vas/copy-paste.h  |   74 +++
 drivers/misc/vas/vas-internal.h|  473 
 drivers/misc/vas/vas-window.c  | 1076 
 drivers/misc/vas/vas.c |  155 ++
 13 files changed, 1978 insertions(+), 10 deletions(-)
 create mode 100644 arch/powerpc/include/asm/vas.h
 create mode 100644 drivers/misc/vas/Kconfig
 create mode 100644 drivers/misc/vas/Makefile
 create mode 100644 drivers/misc/vas/copy-paste.h
 create mode 100644 drivers/misc/vas/vas-internal.h
 create mode 100644 drivers/misc/vas/vas-window.c
 create mode 100644 drivers/misc/vas/vas.c

-- 
2.7.4



Re: [PATCH 4/8] powerpc/64s: fix POWER9 machine check handler from stop state

2017-03-16 Thread Nicholas Piggin
On Thu, 16 Mar 2017 18:10:48 +0530
Mahesh Jagannath Salgaonkar  wrote:

> On 03/14/2017 02:53 PM, Nicholas Piggin wrote:
> > The ISA specifies power save wakeup can cause a machine check interrupt.
> > The machine check handler currently has code to handle that for POWER8,
> > but POWER9 crashes when trying to execute the P8 style sleep
> > instructions.
> > 
> > So queue up the machine check, then call into the idle code to wake up
> > as the system reset interrupt does, rather than attempting to sleep
> > again without going through the main idle path.
> > 
> > Reviewed-by: Gautham R. Shenoy 
> > Signed-off-by: Nicholas Piggin 
> > ---
> >  arch/powerpc/include/asm/reg.h   |  1 +
> >  arch/powerpc/kernel/exceptions-64s.S | 69 
> > ++--
> >  2 files changed, 35 insertions(+), 35 deletions(-)
> > 
> > diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
> > index fc879fd6bdae..8bbdfacce970 100644
> > --- a/arch/powerpc/include/asm/reg.h
> > +++ b/arch/powerpc/include/asm/reg.h
> > @@ -656,6 +656,7 @@
> >  #define   SRR1_ISI_PROT0x0800 /* ISI: Other protection 
> > fault */
> >  #define   SRR1_WAKEMASK0x0038 /* reason for wakeup */
> >  #define   SRR1_WAKEMASK_P8 0x003c /* reason for wakeup on POWER8 and 9 
> > */
> > +#define   SRR1_WAKEMCE_RESVD   0x003c /* Unused/reserved value 
> > used by MCE wakeup to indicate cause to idle wakeup handler */
> >  #define   SRR1_WAKESYSERR  0x0030 /* System error */
> >  #define   SRR1_WAKEEE  0x0020 /* External interrupt */
> >  #define   SRR1_WAKEHVI 0x0024 /* Hypervisor Virtualization 
> > Interrupt (P9) */
> > diff --git a/arch/powerpc/kernel/exceptions-64s.S 
> > b/arch/powerpc/kernel/exceptions-64s.S
> > index e390fcd04bcb..5779d2d6a192 100644
> > --- a/arch/powerpc/kernel/exceptions-64s.S
> > +++ b/arch/powerpc/kernel/exceptions-64s.S
> > @@ -306,6 +306,33 @@ EXC_COMMON_BEGIN(machine_check_common)
> > /* restore original r1. */  \
> > ld  r1,GPR1(r1)
> > 
> > +#ifdef CONFIG_PPC_P7_NAP
> > +EXC_COMMON_BEGIN(machine_check_idle_common)
> > +   bl  machine_check_queue_event
> > +   /*
> > +* Queue the machine check, then reload SRR1 and use it to set
> > +* CR3 according to pnv_powersave_wakeup convention.
> > +*/
> > +   ld  r12,_MSR(r1)
> > +   rlwinm  r11,r12,47-31,30,31
> > +   cmpwi   cr3,r11,2
> > +
> > +   /*
> > +* Now put SRR1_WAKEMCE_RESVD into SRR1, allows it to follow the
> > +* system reset wakeup code.
> > +*/
> > +   orisr12,r12,SRR1_WAKEMCE_RESVD@h
> > +   mtspr   SPRN_SRR1,r12
> > +   std r12,_MSR(r1)
> > +
> > +   /*
> > +* Decrement MCE nesting after finishing with the stack.
> > +*/
> > +   lhz r11,PACA_IN_MCE(r13)
> > +   subir11,r11,1
> > +   sth r11,PACA_IN_MCE(r13)  
> 
> Looks like we are not winding up.. Shouldn't we ? What if we may end up
> in pnv_wakeup_noloss() which assumes that no GPRs are lost. Am I missing
> anything ?

Hmm, on second look, I don't think any non-volatile GPRs are overwritten
in this path. But this MCE is a slow path, and it is a much longer path
than the system reset idle wakeup... So I'll add the napstatelost with
a comment.

Thanks,
Nick


Re: [PATCH kernel v9 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO

2017-03-16 Thread Alexey Kardashevskiy
Thanks for the quick review, there is one comment below.

On 17/03/17 03:51, Alex Williamson wrote:
> On Thu, 16 Mar 2017 18:09:32 +1100
> Alexey Kardashevskiy  wrote:
> 
>> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
>> and H_STUFF_TCE requests targeting an IOMMU TCE table used for VFIO
>> without passing them to user space which saves time on switching
>> to user space and back.
>>
>> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
>> KVM tries to handle a TCE request in real mode; if that fails,
>> it passes the request to virtual mode to complete the operation.
>> If the virtual mode handler fails, the request is passed to
>> user space; this is not expected to happen though.
>>
>> To avoid dealing with page use counters (which is tricky in real mode),
>> this only accelerates SPAPR TCE IOMMU v2 clients which are required
>> to pre-register the userspace memory. The very first TCE request will
>> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
>> of the TCE table (iommu_table::it_userspace) is not allocated till
>> the very first mapping happens and we cannot call vmalloc in real mode.
>>
>> If we fail to update a hardware IOMMU table for an unexpected reason, we just
>> clear it and move on as there is nothing really we can do about it -
>> for example, if we hot plug a VFIO device to a guest, existing TCE tables
>> will be mirrored automatically to the hardware and there is no interface
>> to report to the guest about possible failures.
>>
>> This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
>> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
>> and associates a physical IOMMU table with the SPAPR TCE table (which
>> is a guest view of the hardware IOMMU table). The iommu_table object
>> is cached and referenced so we do not have to look up for it in real mode.
>>
>> This does not implement the UNSET counterpart as there is no use for it -
>> once the acceleration is enabled, the existing userspace won't
>> disable it unless a VFIO container is destroyed; this adds necessary
>> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
>>
>> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
>> space.
>>
>> This adds real mode version of WARN_ON_ONCE() as the generic version
>> causes problems with rcu_sched. Since we are testing what vmalloc_to_phys()
>> returns in the code, this also adds a check for already existing
>> vmalloc_to_phys() call in kvmppc_rm_h_put_tce_indirect().
>>
>> This finally makes use of vfio_external_user_iommu_id() which was
>> introduced quite some time ago and was considered for removal.
>>
>> Tests show that this patch increases transmission speed from 220MB/s
>> to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb ethernet card).
>>
>> Signed-off-by: Alexey Kardashevskiy 
>> ---
>> Changes:
>> v9:
>> * removed referencing a group in KVM, only referencing iommu_table's now
>> * fixed a reference leak in KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE handler
>> * fixed typo in vfio.txt
>> * removed @argsz and @flags from struct kvm_vfio_spapr_tce
>>
>> v8:
>> * changed all (!pua) checks to return H_TOO_HARD as ioctl() is supposed
>> to handle them
>> * changed vmalloc_to_phys() callers to return H_HARDWARE
>> * changed real mode iommu_tce_xchg_rm() callers to return H_TOO_HARD
>> and added a comment about this in the code
>> * changed virtual mode iommu_tce_xchg() callers to return H_HARDWARE
>> and do WARN_ON
>> * added WARN_ON_ONCE_RM(!rmap) in kvmppc_rm_h_put_tce_indirect() to
>> have all vmalloc_to_phys() callsites covered
>>
>> v7:
>> * added realmode-friendly WARN_ON_ONCE_RM
>>
>> v6:
>> * changed handling of errors returned by kvmppc_(rm_)tce_iommu_(un)map()
>> * moved kvmppc_gpa_to_ua() to TCE validation
>>
>> v5:
>> * changed error codes in multiple places
>> * added bunch of WARN_ON() in places which should not really happen
>> * adde a check that an iommu table is not attached already to LIOBN
>> * dropped explicit calls to iommu_tce_clear_param_check/
>> iommu_tce_put_param_check as kvmppc_tce_validate/kvmppc_ioba_validate
>> call them anyway (since the previous patch)
>> * if we fail to update a hardware IOMMU table for unexpected reason,
>> this just clears the entry
>>
>> v4:
>> * added note to the commit log about allowing multiple updates of
>> the same IOMMU table;
>> * instead of checking for if any memory was preregistered, this
>> returns H_TOO_HARD if a specific page was not;
>> * fixed comments from v3 about error handling in many places;
>> * simplified TCE handlers and merged IOMMU parts inline - for example,
>> there used to be kvmppc_h_put_tce_iommu(), now it is merged into
>> kvmppc_h_put_tce(); this allows to check IOBA boundaries against
>> the first attached table only (makes the code simpler);
>>
>> v3:
>> * simplified not to use VFIO group notifiers
>> * reworked cleanup, should be cleaner/simpler now
>>
>> v2:
>> * 

Re: [PATCH] kernfs: Check KERNFS_HAS_RELEASE before calling kernfs_release_file()

2017-03-16 Thread Greg Kroah-Hartman
On Thu, Mar 16, 2017 at 05:14:30PM -0400, Tejun Heo wrote:
> Hello, Greg.
> 
> On Tue, Mar 14, 2017 at 11:08:29AM +0800, Greg Kroah-Hartman wrote:
> > Tejun, want to take this through your tree, or at the least, give me an
> > ack for this?
> 
> Just acked.  I think going through your tree is better for this one.

Ok, will take it in my tree, thanks.

greg k-h


Re: [Patch v5] powerpc/powernv: add hdat attribute to sysfs

2017-03-16 Thread Oliver O'Halloran
On Thu, Mar 2, 2017 at 4:44 PM, Matt Brown  wrote:
> The HDAT data area is consumed by skiboot and turned into a device-tree.
> In some cases we would like to look directly at the HDAT, so this patch
> adds a sysfs node to allow it to be viewed.  This is not possible through
> /dev/mem as it is reserved memory which is stopped by the /dev/mem filter.
> This patch also adds sysfs nodes for all properties in the device-tree
> under /ibm,opal/firmware/exports.
>
> Signed-off-by: Matt Brown 
> ---
> Changes between v4 and v5:
> - all properties under /ibm,opal/firmware/exports in the device-tree
>   are now added as new sysfs nodes
> - the new sysfs nodes are now placed under /opal/exports
> - added a generic read function for all exported attributes
> ---
>  arch/powerpc/platforms/powernv/opal.c | 84 
> +++
>  1 file changed, 84 insertions(+)
>
> diff --git a/arch/powerpc/platforms/powernv/opal.c 
> b/arch/powerpc/platforms/powernv/opal.c
> index 2822935..fbb8264 100644
> --- a/arch/powerpc/platforms/powernv/opal.c
> +++ b/arch/powerpc/platforms/powernv/opal.c
> @@ -36,6 +36,9 @@
>  /* /sys/firmware/opal */
>  struct kobject *opal_kobj;
>
> +/* /sys/firmware/opal/exports */
> +struct kobject *opal_export_kobj;
> +
>  struct opal {
> u64 base;
> u64 entry;
> @@ -604,6 +607,82 @@ static void opal_export_symmap(void)
> pr_warn("Error %d creating OPAL symbols file\n", rc);
>  }
>
> +

> +static int opal_exports_sysfs_init(void)
> +{
> +   opal_export_kobj = kobject_create_and_add("exports", opal_kobj);
> +   if (!opal_export_kobj) {
> +   pr_warn("kobject_create_and_add opal_exports failed\n");
> +   return -ENOMEM;
> +   }
> +
> +   return 0;
> +}

This can be folded into opal_export_attrs().

> +
> +static ssize_t export_attr_read(struct file *fp, struct kobject *kobj,
> +struct bin_attribute *bin_attr, char *buf,
> +loff_t off, size_t count)
> +{
> +   return memory_read_from_buffer(buf, count, , bin_attr->private,
> +  bin_attr->size);
> +}
> +
> +static struct bin_attribute *exported_attrs;
> +/*
> + * opal_export_attrs: creates a sysfs node for each property listed in
> + * the device-tree under /ibm,opal/firmware/exports/
> + * All new sysfs nodes are created under /opal/exports/.
> + * This allows for reserved memory regions (e.g. HDAT) to be read.
> + * The new sysfs nodes are only readable by root.
> + */
> +static void opal_export_attrs(void)
> +{
> +   const __be64 *syms;
> +   unsigned int size;
> +   struct device_node *fw;
> +   struct property *prop;
> +   int rc;
> +   int attr_count = 0;
> +   int n = 0;
> +

> +   fw = of_find_node_by_path("/ibm,opal/firmware/exports");
> +   if (!fw)
> +   return;

devicetree nodes are reference counted, so when you take a reference to
one using of_find_node_* you should use of_node_put() to drop the reference
when you're finished with it. Of course, there's plenty of existing code that
doesn't do this, but that's no reason to make a bad problem worse ;)
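Roughly (just a sketch, reusing the node path from the patch):

        struct device_node *fw;

        fw = of_find_node_by_path("/ibm,opal/firmware/exports");
        if (!fw)
                return;

        /* ... walk the properties and create the sysfs attributes ... */

        of_node_put(fw);        /* drops the reference of_find_node_by_path() took */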

> +
> +   for (prop = fw->properties; prop != NULL; prop = prop->next)
> +   attr_count++;
> +
> +   if (attr_count > 2)
> +   exported_attrs = 
> kmalloc(sizeof(exported_attrs)*(attr_count-2),
> +   __GFP_IO | __GFP_FS);

Why are you using __GFP_IO | __GFP_FS instead of GFP_KERNEL? Also,
using kzalloc(), which zeros memory, over kmalloc() is a good idea in
general since structures can contain fields that change the behaviour
of the function that you pass them to.
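i.e. something like (a sketch; note it also uses sizeof(*exported_attrs), the
element size, rather than sizeof(exported_attrs), the pointer size):

        exported_attrs = kzalloc(sizeof(*exported_attrs) * (attr_count - 2),
                                 GFP_KERNEL);
        if (!exported_attrs)
                return;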

> +
> +
> +   for_each_property_of_node(fw, prop) {
> +
> +   syms = of_get_property(fw, prop->name, );
> +
> +   if (!strcmp(prop->name, "name") ||
> +   !strcmp(prop->name, "phandle"))
> +   continue;
> +
> +   if (!syms || size != 2 * sizeof(__be64))
> +   continue;
> +

> +   (exported_attrs+n)->attr.name = prop->name;

References to DT properties are only valid while you hold a reference to
the DT node that contains them. DT nodes and properties can (in
theory) be changed at runtime; in practice this only really
happens for nodes that refer to hotpluggable devices (memory, PCI,
etc), but it's still poor form to rely on that not happening. You can
make a copy of the name with kstrdup() and store that pointer for as
long as you like, since you can guarantee the copy will exist until
you explicitly kfree() it.
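i.e. something like this inside the property loop (just a sketch):

        (exported_attrs + n)->attr.name = kstrdup(prop->name, GFP_KERNEL);
        if (!(exported_attrs + n)->attr.name)
                continue;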

> +   (exported_attrs+n)->attr.mode = 0400;
> +   (exported_attrs+n)->read = export_attr_read;
> +   (exported_attrs+n)->private = __va(be64_to_cpu(syms[0]));
> +   

[PATCH] powerpc/pseries: Don't give a warning when HPT resizing isn't available

2017-03-16 Thread David Gibson
As of 438cc81a41 "powerpc/pseries: Automatically resize HPT for memory hot
add/remove" when running on the pseries platform, we always attempt to
use the PAPR extension to resize the hashed page table (HPT) when we add
or remove memory.

This is fine, but when the extension is not available we'll give a
harmless, but scary warning.  This patch suppresses the warning in that
case.  It will still warn if the feature is supposed to be available,
but didn't work.

Signed-off-by: David Gibson 
---
 arch/powerpc/mm/hash_utils_64.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Kind of cosmetic, but getting an error message on every memory hot
plug/unplug attempt if your host doesn't support HPT resizing is
pretty ugly.  So I think this is a candidate for quick inclusion.

diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index c554768..cc16e4f 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -771,7 +771,7 @@ void resize_hpt_for_hotplug(unsigned long new_mem_size)
int rc;
 
rc = mmu_hash_ops.resize_hpt(target_hpt_shift);
-   if (rc)
+   if (rc && (rc != -ENODEV))
printk(KERN_WARNING
   "Unable to resize hash page table to target 
order %d: %d\n",
   target_hpt_shift, rc);
-- 
2.9.3



Re: [RFC PATCH 00/13] Introduce first class virtual address spaces

2017-03-16 Thread Till Smejkal
On Thu, 16 Mar 2017, Thomas Gleixner wrote:
> Why do we need yet another mechanism to represent something which looks
> like a file instead of simply using existing mechanisms and extend them?

You are right. I also recognized during the discussion with Andy, Chris,
Matthew, Luck, Rich and the others that there are already other techniques
in the Linux kernel that can achieve the same functionality when combined.
As I said also to the others, I will drop the VAS segments for future
versions. The first class virtual address space feature was the more
interesting part of the patchset anyways.

Thanks
Till


Re: [RFC PATCH 00/13] Introduce first class virtual address spaces

2017-03-16 Thread Till Smejkal
On Thu, 16 Mar 2017, Thomas Gleixner wrote:
> On Thu, 16 Mar 2017, Till Smejkal wrote:
> > On Thu, 16 Mar 2017, Thomas Gleixner wrote:
> > > Why do we need yet another mechanism to represent something which looks
> > > like a file instead of simply using existing mechanisms and extend them?
> > 
> > You are right. I also recognized during the discussion with Andy, Chris,
> > Matthew, Luck, Rich and the others that there are already other
> > techniques in the Linux kernel that can achieve the same functionality
> > when combined. As I said also to the others, I will drop the VAS segments
> > for future versions. The first class virtual address space feature was
> > the more interesting part of the patchset anyways.
> 
> While you are at it, could you please drop this 'first class' marketing as
> well? It has zero technical value, really.

Yes of course. I am sorry for the trouble that I caused already.

Thanks
Till


Re: [RFC PATCH 00/13] Introduce first class virtual address spaces

2017-03-16 Thread Thomas Gleixner
On Thu, 16 Mar 2017, Till Smejkal wrote:
> On Thu, 16 Mar 2017, Thomas Gleixner wrote:
> > Why do we need yet another mechanism to represent something which looks
> > like a file instead of simply using existing mechanisms and extend them?
> 
> You are right. I also recognized during the discussion with Andy, Chris,
> Matthew, Luck, Rich and the others that there are already other
> techniques in the Linux kernel that can achieve the same functionality
> when combined. As I said also to the others, I will drop the VAS segments
> for future versions. The first class virtual address space feature was
> the more interesting part of the patchset anyways.

While you are at it, could you please drop this 'first class' marketing as
well? It has zero technical value, really.

Thanks,

tglx


Re: [RFC PATCH 00/13] Introduce first class virtual address spaces

2017-03-16 Thread Thomas Gleixner
On Wed, 15 Mar 2017, Till Smejkal wrote:
> On Wed, 15 Mar 2017, Andy Lutomirski wrote:

> > > VAS segments on the other side would provide a functionality to
> > > achieve the same without the need of any mounted filesystem. However,
> > > I agree, that this is just a small advantage compared to what can
> > > already be achieved with the existing functionality provided by the
> > > Linux kernel.
> > 
> > I see this "small advantage" as "resource leak and security problem".
> 
> I don't agree here. VAS segments are basically in-memory files that are
> handled by the kernel directly without using a file system. Hence, if an

Why do we need yet another mechanism to represent something which looks
like a file instead of simply using existing mechanisms and extend them?

Thanks,

tglx


Re: [RFC PATCH 00/13] Introduce first class virtual address spaces

2017-03-16 Thread Till Smejkal
On Wed, 15 Mar 2017, Luck, Tony wrote:
> On Wed, Mar 15, 2017 at 03:02:34PM -0700, Till Smejkal wrote:
> > I don't agree here. VAS segments are basically in-memory files that are 
> > handled by
> > the kernel directly without using a file system. Hence, if an application 
> > uses a VAS
> > segment to store data the same rules apply as if it uses a file. Everything 
> > that it
> > saves in the VAS segment might be accessible by other applications. An 
> > application
> > using VAS segments should be aware of this fact. In addition, the resources 
> > that are
> > represented by a VAS segment are not leaked. As I said, VAS segments are 
> > much like
> > files. Hence, if you don't want to use them any more, delete them. But as 
> > with files,
> > the kernel will not delete them for you (although something like this can 
> > be added).
> 
> So how do they differ from shmget(2), shmat(2), shmdt(2), shmctl(2)?
> 
> Apart from VAS having better names, instead of silly "key_t key" ones.

Unfortunately, I have to admit that the VAS segments don't differ from shm*
a lot. The implementation is different, but the functionality that you can
achieve with it is very similar. I am sorry. We should have looked more
closely at the whole functionality that is provided by the shmem subsystem
before working on VAS segments.

However, VAS segments are not the key part of this patch set. The more
interesting functionality in our opinion is the introduction of first class
virtual address spaces and what they can be used for. VAS segments were just
another logical step for us (from first class virtual address spaces to
first class virtual address space segments), but since their functionality
can be achieved with various other already existing features of the Linux
kernel, I will probably drop them in future versions of the patchset.

Thanks
Till


Re: [RFC PATCH 00/13] Introduce first class virtual address spaces

2017-03-16 Thread Luck, Tony
On Wed, Mar 15, 2017 at 03:02:34PM -0700, Till Smejkal wrote:
> I don't agree here. VAS segments are basically in-memory files that are 
> handled by
> the kernel directly without using a file system. Hence, if an application 
> uses a VAS
> segment to store data the same rules apply as if it uses a file. Everything 
> that it
> saves in the VAS segment might be accessible by other applications. An 
> application
> using VAS segments should be aware of this fact. In addition, the resources 
> that are
> represented by a VAS segment are not leaked. As I said, VAS segments are much 
> like
> files. Hence, if you don't want to use them any more, delete them. But as 
> with files,
> the kernel will not delete them for you (although something like this can be 
> added).

So how do they differ from shmget(2), shmat(2), shmdt(2), shmctl(2)?

Apart from VAS having better names, instead of silly "key_t key" ones.

-Tony


Re: [RFC PATCH 00/13] Introduce first class virtual address spaces

2017-03-16 Thread Till Smejkal
On Wed, 15 Mar 2017, Andy Lutomirski wrote:
> On Wed, Mar 15, 2017 at 12:44 PM, Till Smejkal
>  wrote:
> > On Wed, 15 Mar 2017, Andy Lutomirski wrote:
> >> > One advantage of VAS segments is that they can be globally queried by 
> >> > user programs
> >> > which means that VAS segments can be shared by applications that not 
> >> > necessarily have
> >> > to be related. If I am not mistaken, MAP_SHARED of pure in memory data 
> >> > will only work
> >> > if the tasks that share the memory region are related (aka. have a 
> >> > common parent that
> >> > initialized the shared mapping). Otherwise, the shared mapping have to 
> >> > be backed by a
> >> > file.
> >>
> >> What's wrong with memfd_create()?
> >>
> >> > VAS segments on the other side allow sharing of pure in memory data by
> >> > arbitrary related tasks without the need of a file. This becomes 
> >> > especially
> >> > interesting if one combines VAS segments with non-volatile memory since 
> >> > one can keep
> >> > data structures in the NVM and still be able to share them between 
> >> > multiple tasks.
> >>
> >> What's wrong with regular mmap?
> >
> > I never wanted to say that there is something wrong with regular mmap. We 
> > just
> > figured that with VAS segments you could remove the need to mmap your 
> > shared data but
> > instead can keep everything purely in memory.
> 
> memfd does that.

Yes, that's right. Thanks for giving me the pointer to this. I should have
researched more carefully before starting to work on VAS segments.
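For completeness, a minimal sketch of the memfd route (assuming a libc that
provides the memfd_create() wrapper, otherwise syscall(__NR_memfd_create, ...);
the "vas-demo" name is arbitrary):

        #define _GNU_SOURCE
        #include <sys/mman.h>
        #include <unistd.h>

        /* Create an anonymous, filesystem-less shared memory object and map it. */
        static int share_anon_memory(size_t size, void **out)
        {
                int fd = memfd_create("vas-demo", MFD_CLOEXEC);
                if (fd < 0)
                        return -1;
                if (ftruncate(fd, size) < 0) {
                        close(fd);
                        return -1;
                }
                *out = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
                if (*out == MAP_FAILED) {
                        close(fd);
                        return -1;
                }
                /* fd can be handed to an unrelated task over a unix socket
                 * (SCM_RIGHTS); the receiver mmap()s it MAP_SHARED and sees
                 * the same pages. */
                return fd;
        }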

> > VAS segments on the other side would provide a functionality to
> > achieve the same without the need of any mounted filesystem. However, I 
> > agree, that
> > this is just a small advantage compared to what can already be achieved 
> > with the
> > existing functionality provided by the Linux kernel.
> 
> I see this "small advantage" as "resource leak and security problem".

I don't agree here. VAS segments are basically in-memory files that are
handled by the kernel directly without using a file system. Hence, if an
application uses a VAS segment to store data the same rules apply as if it
uses a file. Everything that it saves in the VAS segment might be accessible
by other applications. An application using VAS segments should be aware of
this fact. In addition, the resources that are represented by a VAS segment
are not leaked. As I said, VAS segments are much like files. Hence, if you
don't want to use them any more, delete them. But as with files, the kernel
will not delete them for you (although something like this can be added).

> >> This sounds complicated and fragile.  What happens if a heuristically
> >> shared region coincides with a region in the "first class address
> >> space" being selected?
> >
> > If such a conflict happens, the task cannot use the first class address 
> > space and the
> > corresponding system call will return an error. However, with the current 
> > available
> > virtual address space size that programs can use, such conflicts are 
> > probably rare.
> 
> A bug that hits 1% of the time is often worse than one that hits 100%
> of the time because debugging it is miserable.

I don't agree that this is a bug at all. If there is a conflict in the memory
layout of the ASes the application simply cannot use this first class virtual
address space. Every application that wants to use first class virtual
address spaces should check for error return values and handle them.

This situation is similar to mapping a file at some special address in memory
because the file contains pointer based data structures and the application
wants to use them, but the kernel cannot map the file at this particular
position in the application's AS because there is already a different
conflicting mapping. If an application wants to do such things, it should
also handle all the errors that can occur.
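A rough sketch of the error handling I have in mind (the helper and the hint
address are made up for illustration; without MAP_FIXED the kernel may place
the mapping elsewhere, which the application then has to treat as a conflict):

        #include <sys/mman.h>
        #include <errno.h>
        #include <stddef.h>

        /* Map 'fd' at the address its pointer-based data structures were built
         * for; fail (rather than silently relocate) if that address is taken. */
        static void *map_at_expected(void *want, size_t len, int fd)
        {
                void *p = mmap(want, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

                if (p == MAP_FAILED)
                        return NULL;            /* ordinary mmap() failure */
                if (p != want) {                /* kernel placed it elsewhere: conflict */
                        munmap(p, len);
                        errno = EEXIST;
                        return NULL;
                }
                return p;
        }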

Till


Re: [PATCH V2 07/11] powerpc/mm: Conditional defines of pte bits are messy

2017-03-16 Thread Paul Mackerras
On Thu, Mar 16, 2017 at 04:02:05PM +0530, Aneesh Kumar K.V wrote:
> Signed-off-by: Aneesh Kumar K.V 

I think it would be better if the subject was something like "Define
_PAGE_SOFT_DIRTY unconditionally" and the comment about conditional
defines was the patch description.

For the code change:

Reviewed-by: Paul Mackerras 


Re: [PATCH V2 05/11] powerpc/mm: Add translation mode information in /proc/cpuinfo

2017-03-16 Thread Paul Mackerras
On Thu, Mar 16, 2017 at 04:02:03PM +0530, Aneesh Kumar K.V wrote:
> With this we have on powernv and pseries /proc/cpuinfo reporting
> 
> timebase: 51200
> platform: PowerNV
> model   : 8247-22L
> machine : PowerNV 8247-22L
> firmware: OPAL
> MMU   : Hash
> 
> Signed-off-by: Aneesh Kumar K.V 

Reviewed-by: Paul Mackerras 


Re: [PATCH V2 09/11] powerpc/mm: Lower the max real address to 51 bits

2017-03-16 Thread Paul Mackerras
On Thu, Mar 16, 2017 at 04:02:07PM +0530, Aneesh Kumar K.V wrote:
> Max value supported by hardware is 51 bits address. Radix page table define
> a slot of 57 bits for future expansion. We restrict the value supported in
> linux kernel 51 bits, so that we can use the bits between 57-51 for storing
> hash linux page table bits. This is done in the next patch.
> 
> This will free up the software page table bits to be used for features
> that are needed for both hash and radix. The current hash linux page table
> format doesn't have any free software bits. Moving hash linux page table
> specific bits to top of RPN field free up the software bits for other purpose.
> 
> Signed-off-by: Aneesh Kumar K.V 
> ---

There are a couple of comment typos below, but for the actual code change:

Reviewed-by: Paul Mackerras 

>  arch/powerpc/include/asm/book3s/64/pgtable.h | 24 ++--
>  1 file changed, 22 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
> b/arch/powerpc/include/asm/book3s/64/pgtable.h
> index 96566df547a8..c470dcc815d5 100644
> --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
> +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
> @@ -38,6 +38,25 @@
>  #define _RPAGE_RSV4  0x0200000000000000UL
>  #define _RPAGE_RPN0  0x01000
>  #define _RPAGE_RPN1  0x02000
> +/* Max physicall address bit as per radix table */

physical not physicall

> +#define _RPAGE_PA_MAX57
> +/*
> + * Max physical address bit we will use for now.
> + *
> + * This is mostly a hardware limitation and for now Power9 has
> + * a 51 bit limit.
> + *
> + * This is different from the number of physical bit required to address
> + * the last byte of memory. That is defined by MAX_PHYSMEM_BITS.
> + * MAX_PHYSMEM_BITS is a linux limitation imposed by the maximum
> + * number of sections we can support (SECTIONS_SHIFT).
> + *
> + * This is different from Radix page table limitation above and
> + * should always be less than that. The limit is done such that
> + * we can overload the bits between _RPAGE_PA_MAX and _PAGE_PA_MAX
> + * for hash linux page table specific bits.
> + */
> +#define _PAGE_PA_MAX 51
>  
>  #define _PAGE_SOFT_DIRTY _RPAGE_SW3 /* software: software dirty tracking 
> */
>  #define _PAGE_SPECIAL_RPAGE_SW2 /* software: special page */
> @@ -51,10 +70,11 @@
>   */
>  #define _PAGE_NO_CACHE   _PAGE_TOLERANT
>  /*
> - * We support 57 bit real address in pte. Clear everything above 57, and
> + * We support _RPAGE_PA_MAX bit real address in pte. On the linux side
> + * we are limited by _PAGE_PA_MAX. Clear everything above _PAGE_PA_MAX
>   * every thing below PAGE_SHIFT;

You lost an "and" in that last sentence.

>   */
> -#define PTE_RPN_MASK (((1UL << 57) - 1) & (PAGE_MASK))
> +#define PTE_RPN_MASK (((1UL << _PAGE_PA_MAX) - 1) & (PAGE_MASK))
>  /*
>   * set of bits not changed in pmd_modify. Even though we have hash specific 
> bits
>   * in here, on radix we expect them to be zero.
> -- 
> 2.7.4

Paul.


Re: [PATCH V2 03/11] powerpc/mm: Cleanup bits definition between hash and radix.

2017-03-16 Thread Paul Mackerras
On Thu, Mar 16, 2017 at 04:02:01PM +0530, Aneesh Kumar K.V wrote:
> Define everything based on bits present in pgtable.h. This will help in easily
> identifying overlapping bits between hash/radix.
> 
> No functional change with this patch.
> 
> Signed-off-by: Aneesh Kumar K.V 

Reviewed-by: Paul Mackerras 


Re: [PATCH V2 10/11] powerpc/mm/radix: Make max pfn bits a variable

2017-03-16 Thread Paul Mackerras
On Thu, Mar 16, 2017 at 04:02:08PM +0530, Aneesh Kumar K.V wrote:
> This makes max physical address bits a variable so that hash and radix
> translation mode can choose what value to use. In this patch we also switch
> the radix translation mode to use 57 bits. This makes it resilient to future
> changes to max pfn supported by platforms.
> 
> This patch is split from the previous one to make the review easier.
> 
> Signed-off-by: Aneesh Kumar K.V 

Why do we need to do this now?  It seems like this will add overhead
every time we set a PTE for no current benefit.

Paul.


Re: [PATCH V2 02/11] powerpc/mm/slice: when computing slice mask limit lowe slice max addr correctly

2017-03-16 Thread Paul Mackerras
On Thu, Mar 16, 2017 at 04:02:00PM +0530, Aneesh Kumar K.V wrote:
> For low slice max addr should be less that 4G
 than

A more verbose explanation of the off-by-1 error that you are fixing
is needed here.  Tell us what goes wrong with the current code and why
your fix is the correct one.

> 
> Signed-off-by: Aneesh Kumar K.V 

For the code change:

Reviewed-by: Paul Mackerras 


Re: [PATCH V2 06/11] powerpc/mm/hugetlb: Filter out hugepage size not supported by page table layout

2017-03-16 Thread Paul Mackerras
On Thu, Mar 16, 2017 at 04:02:04PM +0530, Aneesh Kumar K.V wrote:
> Without this if firmware reports 1MB page size support we will crash
> trying to use 1MB as hugetlb page size.
> 
> echo 300 > /sys/kernel/mm/hugepages/hugepages-1024kB/nr_hugepages
> 
> kernel BUG at ./arch/powerpc/include/asm/hugetlb.h:19!
> .
> 
> [c000e2c27b30] c029dae8 .hugetlb_fault+0x638/0xda0
> [c000e2c27c30] c026fb64 .handle_mm_fault+0x844/0x1d70
> [c000e2c27d70] c004805c .do_page_fault+0x3dc/0x7c0
> [c000e2c27e30] c000ac98 handle_page_fault+0x10/0x30
> 
> With fix, we don't enable 1MB as hugepage size.
> 
> bash-4.2# cd /sys/kernel/mm/hugepages/
> bash-4.2# ls
> hugepages-16384kB  hugepages-16777216kB
> 
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  arch/powerpc/mm/hugetlbpage.c | 20 
>  1 file changed, 20 insertions(+)
> 
> diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
> index 8c3389cbcd12..eb8d42bac00b 100644
> --- a/arch/powerpc/mm/hugetlbpage.c
> +++ b/arch/powerpc/mm/hugetlbpage.c
> @@ -738,6 +738,7 @@ static int __init add_huge_page_size(unsigned long long 
> size)
>   int shift = __ffs(size);
>   int mmu_psize;
>  
> +#ifndef CONFIG_PPC_BOOK3S_64

This #ifndef doesn't really seem necessary.  All it is removing is a
check for size <= PAGE_SIZE.  Yes that check is subsumed by the checks
you are adding below, but on the other hand, #if[n]defs inside
functions are ugly and make the code harder to read.  Since this is
not a hot path, let's not have the ifndef.

>   /* Check that it is a page size supported by the hardware and
>* that it fits within pagetable and slice limits. */
>   if (size <= PAGE_SIZE)
> @@ -749,10 +750,29 @@ static int __init add_huge_page_size(unsigned long long 
> size)
>   if (!is_power_of_2(size) || (shift > SLICE_HIGH_SHIFT))
>   return -EINVAL;
>  #endif
> +#endif /* CONFIG_PPC_BOOK3S_64 */
>  
>   if ((mmu_psize = shift_to_mmu_psize(shift)) < 0)
>   return -EINVAL;
>  
> +#ifdef CONFIG_PPC_BOOK3S_64
> + /*
> +  * We need to make sure that for different page sizes reported by
> +  * firmware we only add hugetlb support for page sizes that can be
> +  * supported by linux page table layout.
> +  * For now we have
> +  * Radix: 2M
> +  * Hash: 16M and 16G
> +  */
> + if (radix_enabled()) {
> + if (mmu_psize != MMU_PAGE_2M)
> + return -EINVAL;
> + } else {
> + if (mmu_psize != MMU_PAGE_16M && mmu_psize != MMU_PAGE_16G)
> + return -EINVAL;
> + }
> +#endif
> +
>   BUG_ON(mmu_psize_defs[mmu_psize].shift != shift);
>  
>   /* Return if huge page size has already been setup */
> -- 
> 2.7.4

Paul.


Re: [PATCH V2 04/11] powerpc/mm/radix: rename _PAGE_LARGE to R_PAGE_LARGE

2017-03-16 Thread Paul Mackerras
On Thu, Mar 16, 2017 at 04:02:02PM +0530, Aneesh Kumar K.V wrote:
> This bit is only used by radix and it is nice to follow the naming style of 
> having
> bit name start with H_/R_ depending on which translation mode they are used.
> 
> No functional change in this patch.
> 
> Signed-off-by: Aneesh Kumar K.V 

Reviewed-by: Paul Mackerras 


Re: [PATCH V2 08/11] powerpc/mm: Express everything based on Radix page table defines

2017-03-16 Thread Paul Mackerras
On Thu, Mar 16, 2017 at 04:02:06PM +0530, Aneesh Kumar K.V wrote:
> Signed-off-by: Aneesh Kumar K.V 

This change seems correct, but of minimal benefit.

The subject could be better expressed.  How about "Define all PTE bits
based on radix definitions" or something like that?  "Everything" is a
bit too broad.

For the code change:

Reviewed-by: Paul Mackerras 


Re: [PATCH V2 11/11] powerpc/mm: Move hash specific pte bits to be top bits of RPN

2017-03-16 Thread Paul Mackerras
On Thu, Mar 16, 2017 at 04:02:09PM +0530, Aneesh Kumar K.V wrote:
> We don't support the full 57 bits of physical address and hence can overload
> the top bits of RPN as hash specific pte bits.
> 
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  arch/powerpc/include/asm/book3s/64/hash.h| 18 ++
>  arch/powerpc/include/asm/book3s/64/pgtable.h | 19 ---
>  arch/powerpc/mm/hash_native_64.c |  1 +
>  3 files changed, 23 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/book3s/64/hash.h 
> b/arch/powerpc/include/asm/book3s/64/hash.h
> index af3c88624d3a..33eb1a650317 100644
> --- a/arch/powerpc/include/asm/book3s/64/hash.h
> +++ b/arch/powerpc/include/asm/book3s/64/hash.h
> @@ -6,20 +6,14 @@
>   * Common bits between 4K and 64K pages in a linux-style PTE.
>   * Additional bits may be defined in pgtable-hash64-*.h
>   *
> - * Note: We only support user read/write permissions. Supervisor always
> - * have full read/write to pages above PAGE_OFFSET (pages below that
> - * always use the user access permissions).
> - *
> - * We could create separate kernel read-only if we used the 3 PP bits
> - * combinations that newer processors provide but we currently don't.
>   */
> -#define H_PAGE_BUSY  _RPAGE_SW1 /* software: PTE & hash are busy */
> +#define H_PAGE_BUSY  _RPAGE_RPN45 /* software: PTE & hash are busy */
>  #define H_PTE_NONE_MASK  _PAGE_HPTEFLAGS
> -#define H_PAGE_F_GIX_SHIFT   57
> -/* (7ul << 57) HPTE index within HPTEG */
> -#define H_PAGE_F_GIX (_RPAGE_RSV2 | _RPAGE_RSV3 | _RPAGE_RSV4)
> -#define H_PAGE_F_SECOND  _RPAGE_RSV1 /* HPTE is in 2ndary 
> HPTEG */
> -#define H_PAGE_HASHPTE   _RPAGE_SW0  /* PTE has associated 
> HPTE */
> +#define H_PAGE_F_GIX_SHIFT   52
> +/* (7ul << 53) HPTE index within HPTEG */
> +#define H_PAGE_F_SECOND  _RPAGE_RPN44/* HPTE is in 2ndary 
> HPTEG */
> +#define H_PAGE_F_GIX (_RPAGE_RPN43 | _RPAGE_RPN42 | _RPAGE_RPN41)
> +#define H_PAGE_HASHPTE   _RPAGE_RPN40/* PTE has associated 
> HPTE */
>  /*
>   * Max physical address bit we will use for now.
>   *
> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
> b/arch/powerpc/include/asm/book3s/64/pgtable.h
> index eb82b60b5c89..3d104f8ad891 100644
> --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
> +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
> @@ -36,16 +36,29 @@
>  #define _RPAGE_RSV2  0x0800000000000000UL
>  #define _RPAGE_RSV3  0x0400000000000000UL
>  #define _RPAGE_RSV4  0x0200000000000000UL
> +
> +#define _PAGE_PTE    0x4000000000000000UL    /* distinguishes PTEs from pointers */
> +#define _PAGE_PRESENT        0x8000000000000000UL    /* pte contains a translation */
> +
> +/*
> + * Top and bottom bits of RPN which can be used by hash
> + * translation mode, because we expect them to be zero
> + * otherwise.
> + */
>  #define _RPAGE_RPN0  0x01000
>  #define _RPAGE_RPN1  0x02000
> +#define _RPAGE_RPN45 0x0100000000000000UL
> +#define _RPAGE_RPN44 0x0080000000000000UL
> +#define _RPAGE_RPN43 0x0040000000000000UL
> +#define _RPAGE_RPN42 0x0020000000000000UL
> +#define _RPAGE_RPN41 0x0010000000000000UL
> +#define _RPAGE_RPN40 0x0008000000000000UL

If RPN0 is 0x1000, then this is actually RPN39 as far as I can see,
and the other RPN4* bits are likewise off by one.
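Spelling out the arithmetic (assuming _RPAGE_RPN0 really is 0x1000, i.e. bit 12,
as in the hunk above):

        /* RPN bit n sits at bit position 12 + n, so 0x0008000000000000UL is
         * bit 51, i.e. RPN39 rather than RPN40; the other new defines are
         * shifted the same way. */
        _Static_assert(0x01000UL == 1UL << 12, "_RPAGE_RPN0 is bit 12");
        _Static_assert(0x0008000000000000UL == 1UL << (12 + 39),
                       "the value named RPN40 is really RPN39");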

Paul.


Re: [PATCH V2 01/11] powerpc/mm/nohash: MM_SLICE is only used by book3s 64

2017-03-16 Thread Paul Mackerras
On Thu, Mar 16, 2017 at 04:01:59PM +0530, Aneesh Kumar K.V wrote:
> BOOKE code is dead code as per the Kconfig details. So make it simpler
> by enabling MM_SLICE only for book3s_64. The changes w.r.t. nohash are just
> removing dead code. W.r.t. ppc64, 4K without hugetlb will now enable MM_SLICE.
> But that is good, because we drop one extra variant which probably is not
> getting tested much.
> 
> Signed-off-by: Aneesh Kumar K.V 

Reviewed-by: Paul Mackerras 


Re: [PATCH V2 09/11] powerpc/mm: Lower the max real address to 51 bits

2017-03-16 Thread Benjamin Herrenschmidt
On Thu, 2017-03-16 at 16:02 +0530, Aneesh Kumar K.V wrote:
> Max value supported by hardware is 51 bits address. Radix page table define
> a slot of 57 bits for future expansion. We restrict the value supported in
> linux kernel 51 bits, so that we can use the bits between 57-51 for storing
> hash linux page table bits. This is done in the next patch.

All of them ? I would keep some for future backward compatibility. It's likely
that a successor to P9 will have more physical address bits. I feel nervous
limiting to precisely what P9 supports.

> This will free up the software page table bits to be used for features
> that are needed for both hash and radix. The current hash linux page table
> format doesn't have any free software bits. Moving hash linux page table
> specific bits to top of RPN field free up the software bits for other purpose.
> 
> > Signed-off-by: Aneesh Kumar K.V 
> ---
>  arch/powerpc/include/asm/book3s/64/pgtable.h | 24 ++--
>  1 file changed, 22 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
> b/arch/powerpc/include/asm/book3s/64/pgtable.h
> index 96566df547a8..c470dcc815d5 100644
> --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
> +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
> @@ -38,6 +38,25 @@
> >  #define _RPAGE_RSV4  0x0200000000000000UL
> >  #define _RPAGE_RPN0  0x01000
> >  #define _RPAGE_RPN1  0x02000
> +/* Max physicall address bit as per radix table */
> > +#define _RPAGE_PA_MAX  57
> +/*
> + * Max physical address bit we will use for now.
> + *
> + * This is mostly a hardware limitation and for now Power9 has
> + * a 51 bit limit.
> + *
> + * This is different from the number of physical bit required to address
> + * the last byte of memory. That is defined by MAX_PHYSMEM_BITS.
> + * MAX_PHYSMEM_BITS is a linux limitation imposed by the maximum
> + * number of sections we can support (SECTIONS_SHIFT).
> + *
> + * This is different from Radix page table limitation above and
> + * should always be less than that. The limit is done such that
> + * we can overload the bits between _RPAGE_PA_MAX and _PAGE_PA_MAX
> + * for hash linux page table specific bits.
> + */
> > +#define _PAGE_PA_MAX   51
>  
> >  #define _PAGE_SOFT_DIRTY   _RPAGE_SW3 /* software: software dirty tracking 
> > */
> >  #define _PAGE_SPECIAL  _RPAGE_SW2 /* software: special page */
> @@ -51,10 +70,11 @@
>   */
> >  #define _PAGE_NO_CACHE _PAGE_TOLERANT
>  /*
> - * We support 57 bit real address in pte. Clear everything above 57, and
> + * We support _RPAGE_PA_MAX bit real address in pte. On the linux side
> + * we are limited by _PAGE_PA_MAX. Clear everything above _PAGE_PA_MAX
>   * every thing below PAGE_SHIFT;
>   */
> > -#define PTE_RPN_MASK   (((1UL << 57) - 1) & (PAGE_MASK))
> > +#define PTE_RPN_MASK   (((1UL << _PAGE_PA_MAX) - 1) & (PAGE_MASK))
>  /*
>   * set of bits not changed in pmd_modify. Even though we have hash specific 
> bits
>   * in here, on radix we expect them to be zero.


Re: [PATCH] kernfs: Check KERNFS_HAS_RELEASE before calling kernfs_release_file()

2017-03-16 Thread Tejun Heo
Hello, Greg.

On Tue, Mar 14, 2017 at 11:08:29AM +0800, Greg Kroah-Hartman wrote:
> Tejun, want to take this through your tree, or at the least, give me an
> ack for this?

Just acked.  I think going through your tree is better for this one.

Thanks!

-- 
tejun


Re: [PATCH] kernfs: Check KERNFS_HAS_RELEASE before calling kernfs_release_file()

2017-03-16 Thread Tejun Heo
On Tue, Mar 14, 2017 at 08:17:00AM +0530, Vaibhav Jain wrote:
> We recently started seeing a kernel oops when a module tries to remove a
> memory mapped sysfs bin_attribute. On closer investigation the root
> cause seems to be kernfs_release_file() trying to call the
> kernfs_op.release() callback, which is NULL for such sysfs
> bin_attributes. The oops occurs when kernfs_release_file() is called from
> kernfs_drain_open_files() to cleanup any open handles with active
> memory mappings.
> 
> The patch fixes this by checking for flag KERNFS_HAS_RELEASE before
> calling kernfs_release_file() in function kernfs_drain_open_files().
> 
> On ppc64-le arch with cxl module the oops back-trace is of the
> form below:
> [  861.381126] Unable to handle kernel paging request for instruction fetch
> [  861.381360] Faulting instruction address: 0x
> [  861.381428] Oops: Kernel access of bad area, sig: 11 [#1]
> 
> [  861.382481] NIP:  LR: c0362c60 CTR:
> 
> 
> Call Trace:
> [c00f1680b750] [c0362c34] kernfs_drain_open_files+0x104/0x1d0 
> (unreliable)
> [c00f1680b790] [c035fa00] __kernfs_remove+0x260/0x2c0
> [c00f1680b820] [c0360da0] kernfs_remove_by_name_ns+0x60/0xe0
> [c00f1680b8b0] [c03638f4] sysfs_remove_bin_file+0x24/0x40
> [c00f1680b8d0] [c062a164] device_remove_bin_file+0x24/0x40
> [c00f1680b8f0] [d9b7b22c] cxl_sysfs_afu_remove+0x144/0x170 [cxl]
> [c00f1680b940] [d9b7c7e4] cxl_remove+0x6c/0x1a0 [cxl]
> [c00f1680b990] [c052f694] pci_device_remove+0x64/0x110
> [c00f1680b9d0] [c06321d4] 
> device_release_driver_internal+0x1f4/0x2b0
> [c00f1680ba20] [c0525cb0] pci_stop_bus_device+0xa0/0xd0
> [c00f1680ba60] [c0525e80] pci_stop_and_remove_bus_device+0x20/0x40
> [c00f1680ba90] [c004a6c4] pci_hp_remove_devices+0x84/0xc0
> [c00f1680bad0] [c004a688] pci_hp_remove_devices+0x48/0xc0
> [c00f1680bb10] [c09dfda4] eeh_reset_device+0xb0/0x290
> [c00f1680bbb0] [c0032b4c] eeh_handle_normal_event+0x47c/0x530
> [c00f1680bc60] [c0032e64] eeh_handle_event+0x174/0x350
> [c00f1680bd10] [c0033228] eeh_event_handler+0x1e8/0x1f0
> [c00f1680bdc0] [c00d384c] kthread+0x14c/0x190
> [c00f1680be30] [c000b5a0] ret_from_kernel_thread+0x5c/0xbc
> 
> Fixes: f83f3c515654("kernfs: fix locking around kernfs_ops->release()
> callback")
> Signed-off-by: Vaibhav Jain 

Acked-by: Tejun Heo 

Thanks.

-- 
tejun


[PATCH v2 2/2] powerpc/fadump: update fadump documentation

2017-03-16 Thread Hari Bathini
Now that the unnecessary restriction of reserving memory for fadump at
the top of RAM has been dropped, update the documentation accordingly.

Signed-off-by: Hari Bathini 
---
 Documentation/powerpc/firmware-assisted-dump.txt |   34 +++---
 1 file changed, 17 insertions(+), 17 deletions(-)

diff --git a/Documentation/powerpc/firmware-assisted-dump.txt 
b/Documentation/powerpc/firmware-assisted-dump.txt
index 3007bc9..19b1e3d 100644
--- a/Documentation/powerpc/firmware-assisted-dump.txt
+++ b/Documentation/powerpc/firmware-assisted-dump.txt
@@ -105,21 +105,21 @@ memory is held.
 
 If there is no waiting dump data, then only the memory required
 to hold CPU state, HPTE region, boot memory dump and elfcore
-header, is reserved at the top of memory (see Fig. 1). This area
-is *not* released: this region will be kept permanently reserved,
-so that it can act as a receptacle for a copy of the boot memory
-content in addition to CPU state and HPTE region, in the case a
-crash does occur.
+header, is usually reserved at an offset greater than boot memory
+size (see Fig. 1). This area is *not* released: this region will
+be kept permanently reserved, so that it can act as a receptacle
+for a copy of the boot memory content in addition to CPU state
+and HPTE region, in the case a crash does occur.
 
   o Memory Reservation during first kernel
 
-  Low memoryTop of memory
+  Low memory Top of memory
   0  boot memory size   |
-  |   |   |<--Reserved dump area -->|
-  V   V   |   Permanent Reservation V
-  +---+--/ /--+---++---++
-  |   |   |CPU|HPTE|  DUMP |ELF |
-  +---+--/ /--+---++---++
+  |   ||<--Reserved dump area -->|  |
+  V   V|   Permanent Reservation |  V
+  +---+--/ /---+---++---++--+
+  |   ||CPU|HPTE|  DUMP |ELF |  |
+  +---+--/ /---+---++---++--+
 |   ^
 |   |
 \   /
@@ -135,12 +135,12 @@ crash does occur.
   0  boot memory size   |
   |   |<- Reserved dump area --- -->|
   V   V V
-  +---+--/ /--+---++---++
-  |   |   |CPU|HPTE|  DUMP |ELF |
-  +---+--/ /--+---++---++
-||
-VV
-   Used by second/proc/vmcore
+  +---+--/ /---+---++---++--+
+  |   ||CPU|HPTE|  DUMP |ELF |  |
+  +---+--/ /---+---++---++--+
+|  |
+V  V
+   Used by second/proc/vmcore
kernel to boot
Fig. 2
 



[PATCH v2 1/2] powerpc/fadump: reserve memory at an offset closer to bottom of RAM

2017-03-16 Thread Hari Bathini
Currently, the area to preserve boot memory is reserved at the top of
RAM. This leaves fadump vulnerable to memory hot-remove operations. As
memory for fadump has to be reserved early in the boot process, fadump
can't be registered after a memory hot-remove operation. Though this
problem can't be eliminated completely, the impact can be minimized by
reserving memory at an offset closer to the bottom of RAM. The offset
for fadump memory reservation can be any value greater than fadump boot
memory size.

Signed-off-by: Hari Bathini 
---

Changes from v1:
* Finding the offset based on memory holes/availability
* Improved error checking


 arch/powerpc/kernel/fadump.c |   33 ++---
 1 file changed, 26 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index 8ff0dd4..33b2da3 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -319,15 +319,34 @@ int __init fadump_reserve_mem(void)
pr_debug("fadumphdr_addr = %p\n",
(void *) fw_dump.fadumphdr_addr);
} else {
-   /* Reserve the memory at the top of memory. */
size = get_fadump_area_size();
-   base = memory_boundary - size;
-   memblock_reserve(base, size);
-   printk(KERN_INFO "Reserved %ldMB of memory at %ldMB "
-   "for firmware-assisted dump\n",
-   (unsigned long)(size >> 20),
-   (unsigned long)(base >> 20));
+
+   /*
+* Reserve memory at an offset closer to bottom of the RAM to
+* minimize the impact of memory hot-remove operation. We can't
+* use memblock_find_in_range() here since it doesn't allocate
+* from bottom to top.
+*/
+   for (base = fw_dump.boot_memory_size;
+base <= (memory_boundary - size);
+base += size) {
+   if (memblock_is_region_memory(base, size) &&
+   !memblock_is_region_reserved(base, size))
+   break;
+   }
+   if ((base > (memory_boundary - size)) ||
+   memblock_reserve(base, size)) {
+   pr_err("Failed to reserve memory\n");
+   return 0;
+   }
+
+   pr_info("Reserved %ldMB of memory at %ldMB for firmware-"
+   "assisted dump (System RAM: %ldMB)\n",
+   (unsigned long)(size >> 20),
+   (unsigned long)(base >> 20),
+   (unsigned long)(memblock_phys_mem_size() >> 20));
}
+
fw_dump.reserve_dump_area_start = base;
fw_dump.reserve_dump_area_size = size;
return 1;



Re: [PATCH] net: ethernet: fs_enet: Remove useless includes

2017-03-16 Thread David Miller
From: Christophe Leroy 
Date: Thu, 16 Mar 2017 10:18:04 +0100 (CET)

> CONFIG_8xx is being deprecated. Since the includes dependent on
> CONFIG_8xx are useless, just drop them.
> 
> Signed-off-by: Christophe Leroy 

Applied.


Re: [PATCH] isdn: hardware: mISDN: Remove reference to CONFIG_8xx

2017-03-16 Thread David Miller
From: Christophe Leroy 
Date: Thu, 16 Mar 2017 10:18:02 +0100 (CET)

> CONFIG_8xx is deprecated and should soon be removed in favor
> of CONFIG_PPC_8xx.
> Anyway, hfc_multi_8xx.h only uses 8xx I/O ports which are
> linked to the CPM1 communication processor included in the 8xx
> rather than the 8xx itself.
> 
> This patch therefore makes it dependent on CONFIG_CPM1 instead,
> like several other drivers.
> 
> Signed-off-by: Christophe Leroy 

Applied.


RE: [PATCH 08/29] drivers, md: convert mddev.active from atomic_t to refcount_t

2017-03-16 Thread Reshetova, Elena
> On Tue, 2017-03-14 at 12:29 +, Reshetova, Elena wrote:
> > > Elena Reshetova  writes:
> > >
> > > > refcount_t type and corresponding API should be
> > > > used instead of atomic_t when the variable is used as
> > > > a reference counter. This allows to avoid accidental
> > > > refcounter overflows that might lead to use-after-free
> > > > situations.
> > > >
> > > > Signed-off-by: Elena Reshetova 
> > > > Signed-off-by: Hans Liljestrand 
> > > > Signed-off-by: Kees Cook 
> > > > Signed-off-by: David Windsor 
> > > > ---
> > > >  drivers/md/md.c | 6 +++---
> > > >  drivers/md/md.h | 3 ++-
> > > >  2 files changed, 5 insertions(+), 4 deletions(-)
> > >
> > > When booting linux-next (specifically 5be4921c9958ec) I'm seeing
> > > the
> > > backtrace below. I suspect this patch is just exposing an existing
> > > issue?
> >
> > Yes, we have actually been following this issue in the another
> > thread.
> > It looks like the object is re-used somehow, but I can't quite
> > understand how just by reading the code.
> > This was what I put into the previous thread:
> >
> > "The log below indicates that you are using your refcounter in a bit
> > weird way in mddev_find().
> > However, I can't find the place (just by reading the code) where you
> > would increment refcounter from zero (vs. setting it to one).
> > It looks like you either iterate over existing nodes (and increment
> > their counters, which should be >= 1 at the time of increment) or
> > create a new node, but then mddev_init() sets the counter to 1. "
> >
> > If you can help to understand what is going on with the object
> > creation/destruction, would be appreciated!
> >
> > Also Shaohua Li stopped this patch coming from his tree since the
> > issue was caught at that time, so we are not going to merge this
> > until we figure it out.
> 
> Asking on the correct list (dm-devel) would have got you the easy
> answer:  The refcount behind mddev->active is a genuine atomic.  It has
> refcount properties but only if the array fails to initialise (in that
> case, final put kills it).  Once it's added to the system as a gendisk,
> it cannot be freed until md_free().  Thus its ->active count can go to
> zero (when it becomes inactive; usually because of an unmount). On a
> simple allocation regardless of outcome, the last executed statement in
> md_alloc is mddev_put(): that destroys the device if we didn't manage
> to create it or returns 0 and adds an inactive device to the system
> which the user can get with mddev_find().

Thank you James for explaining this! I guess in this case the conversion
doesn't make sense.
And sorry about not asking in the correct place: we are handling many similar
patches now, and while I try to reach the right audience using the
get_maintainer script, it doesn't always succeed.

Best Regards,
Elena.

> 
> James
> 



Re: [PATCH kernel v9 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO

2017-03-16 Thread Alex Williamson
On Thu, 16 Mar 2017 18:09:32 +1100
Alexey Kardashevskiy  wrote:

> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> and H_STUFF_TCE requests targeted an IOMMU TCE table used for VFIO
> without passing them to user space which saves time on switching
> to user space and back.
> 
> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
> KVM tries to handle a TCE request in real mode; if that fails,
> it passes the request to the virtual mode to complete the operation.
> If the virtual mode handler also fails, the request is passed to
> the user space; this is not expected to happen though.
> 
> To avoid dealing with page use counters (which is tricky in real mode),
> this only accelerates SPAPR TCE IOMMU v2 clients which are required
> to pre-register the userspace memory. The very first TCE request will
> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
> of the TCE table (iommu_table::it_userspace) is not allocated till
> the very first mapping happens and we cannot call vmalloc in real mode.
> 
> If we fail to update a hardware IOMMU table for an unexpected reason, we just
> clear it and move on as there is nothing really we can do about it -
> for example, if we hot plug a VFIO device to a guest, existing TCE tables
> will be mirrored automatically to the hardware and there is no interface
> to report to the guest about possible failures.
> 
> This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
> and associates a physical IOMMU table with the SPAPR TCE table (which
> is a guest view of the hardware IOMMU table). The iommu_table object
> is cached and referenced so we do not have to look up for it in real mode.
> 
> This does not implement the UNSET counterpart as there is no use for it -
> once the acceleration is enabled, the existing userspace won't
> disable it unless a VFIO container is destroyed; this adds necessary
> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
> 
> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
> space.
> 
> This adds a real mode version of WARN_ON_ONCE() as the generic version
> causes problems with rcu_sched. Since we are testing what vmalloc_to_phys()
> returns in the code, this also adds a check for the already existing
> vmalloc_to_phys() call in kvmppc_rm_h_put_tce_indirect().
> 
> This finally makes use of vfio_external_user_iommu_id() which was
> introduced quite some time ago and was considered for removal.
> 
> Tests show that this patch increases transmission speed from 220MB/s
> to 750..1020MB/s on 10Gb network (Chelsio CXGB3 10Gb ethernet card).
> 
> Signed-off-by: Alexey Kardashevskiy 
> ---
> Changes:
> v9:
> * removed referencing a group in KVM, only referencing iommu_table's now
> * fixed a reference leak in KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE handler
> * fixed typo in vfio.txt
> * removed @argsz and @flags from struct kvm_vfio_spapr_tce
> 
> v8:
> * changed all (!pua) checks to return H_TOO_HARD as ioctl() is supposed
> to handle them
> * changed vmalloc_to_phys() callers to return H_HARDWARE
> * changed real mode iommu_tce_xchg_rm() callers to return H_TOO_HARD
> and added a comment about this in the code
> * changed virtual mode iommu_tce_xchg() callers to return H_HARDWARE
> and do WARN_ON
> * added WARN_ON_ONCE_RM(!rmap) in kvmppc_rm_h_put_tce_indirect() to
> have all vmalloc_to_phys() callsites covered
> 
> v7:
> * added realmode-friendly WARN_ON_ONCE_RM
> 
> v6:
> * changed handling of errors returned by kvmppc_(rm_)tce_iommu_(un)map()
> * moved kvmppc_gpa_to_ua() to TCE validation
> 
> v5:
> * changed error codes in multiple places
> * added bunch of WARN_ON() in places which should not really happen
> * added a check that an iommu table is not already attached to LIOBN
> * dropped explicit calls to iommu_tce_clear_param_check/
> iommu_tce_put_param_check as kvmppc_tce_validate/kvmppc_ioba_validate
> call them anyway (since the previous patch)
> * if we fail to update a hardware IOMMU table for an unexpected reason,
> this just clears the entry
> 
> v4:
> * added note to the commit log about allowing multiple updates of
> the same IOMMU table;
> * instead of checking for if any memory was preregistered, this
> returns H_TOO_HARD if a specific page was not;
> * fixed comments from v3 about error handling in many places;
> * simplified TCE handlers and merged IOMMU parts inline - for example,
> there used to be kvmppc_h_put_tce_iommu(), now it is merged into
> kvmppc_h_put_tce(); this allows to check IOBA boundaries against
> the first attached table only (makes the code simpler);
> 
> v3:
> * simplified not to use VFIO group notifiers
> * reworked cleanup, should be cleaner/simpler now
> 
> v2:
> * reworked to use new VFIO notifiers
> * now same iommu_table may appear in the list several times, to be fixed later
> ---
>  Documentation/virtual/kvm/devices/vfio.txt |  18 +-
>  

Re: [GIT PULL 0/6] perf/core improvements and fixes

2017-03-16 Thread Ingo Molnar

* Arnaldo Carvalho de Melo <a...@kernel.org> wrote:

> Hi Ingo,
> 
>   Please consider pulling,
> 
> - Arnaldo
> 
> Test results at the end of this message, as usual.
> 
> The following changes since commit ffa86c2f1a8862cf58c873f6f14d4b2c3250fb48:
> 
>   Merge tag 'perf-core-for-mingo-4.12-20170314' of 
> git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core 
> (2017-03-15 19:27:27 +0100)
> 
> are available in the git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git 
> tags/perf-core-for-mingo-4.12-20170316
> 
> for you to fetch changes up to 61f35d750683b21e9e3836e309195c79c1daed74:
> 
>   uprobes: Default UPROBES_EVENTS to Y (2017-03-16 12:42:02 -0300)
> 
> 
> perf/core improvements and fixes:
> 
> New features:
> 
> - Add 'brstackinsn' field in 'perf script' to reuse the x86 instruction
>   decoder used in the Intel PT code to study hot paths to samples (Andi Kleen)
> 
> Kernel:
> 
> - Default UPROBES_EVENTS to Y (Alexei Starovoitov)
> 
> - Fix check for kretprobe offset within function entry (Naveen N. Rao)
> 
> Infrastructure:
> 
> - Introduce util func is_sdt_event() (Ravi Bangoria)
> 
> - Make perf_event__synthesize_mmap_events() scale on older kernels where
>   reading /proc/pid/maps is way slower than reading /proc/pid/task/pid/maps 
> (Stephane Eranian)
> 
> Signed-off-by: Arnaldo Carvalho de Melo <a...@redhat.com>
> 
> 
> Andi Kleen (1):
>   perf script: Add 'brstackinsn' for branch stacks
> 
> Arnaldo Carvalho de Melo (2):
>   tools headers: Sync {tools/,}arch/x86/include/asm/cpufeatures.h
>   uprobes: Default UPROBES_EVENTS to Y
> 
> Naveen N. Rao (1):
>   trace/kprobes: Fix check for kretprobe offset within function entry
> 
> Ravi Bangoria (1):
>   perf probe: Introduce util func is_sdt_event()
> 
> Stephane Eranian (1):
>   perf tools: Make perf_event__synthesize_mmap_events() scale
> 
>  include/linux/kprobes.h|   1 +
>  kernel/kprobes.c   |  40 ++--
>  kernel/trace/Kconfig   |   2 +-
>  kernel/trace/trace_kprobe.c|   2 +-
>  tools/arch/x86/include/asm/cpufeatures.h   |   5 +-
>  tools/perf/Documentation/perf-script.txt   |  13 +-
>  tools/perf/builtin-script.c| 264 
> -
>  tools/perf/util/Build  |   1 +
>  tools/perf/util/dump-insn.c|  14 ++
>  tools/perf/util/dump-insn.h|  22 ++
>  tools/perf/util/event.c|   4 +-
>  .../util/intel-pt-decoder/intel-pt-insn-decoder.c  |  24 ++
>  tools/perf/util/parse-events.h |  20 ++
>  tools/perf/util/probe-event.c  |   9 +-
>  14 files changed, 381 insertions(+), 40 deletions(-)
>  create mode 100644 tools/perf/util/dump-insn.c
>  create mode 100644 tools/perf/util/dump-insn.h

Pulled, thanks a lot Arnaldo!

Ingo


Re: [PATCH 8/8] powerpc/64s: idle POWER8 avoid full state loss recovery when possible

2017-03-16 Thread Gautham R Shenoy
Hi Nick,

On Tue, Mar 14, 2017 at 07:23:49PM +1000, Nicholas Piggin wrote:
> If not all threads were in winkle, full state loss recovery is not
> necessary and can be avoided. A previous patch removed this optimisation
> due to some complexity with the implementation. Re-implement it by
> counting the number of threads in winkle with the per-core idle state.
> Only restore full state loss if all threads were in winkle.
> 
> This has a small window of false positives right before threads execute
> winkle and just after they wake up, when the winkle count does not
> reflect the true number of threads in winkle. This is not a significant
> problem in comparison with even the minimum winkle duration. For
> correctness, a false positive is not a problem (only false negatives
> would be).

I like the technique.. Some comments below.

> 
> Signed-off-by: Nicholas Piggin 
> ---
>  arch/powerpc/include/asm/cpuidle.h | 32 ---
>  arch/powerpc/kernel/idle_book3s.S  | 45 
> +-
>  2 files changed, 68 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/cpuidle.h 
> b/arch/powerpc/include/asm/cpuidle.h
> index b9d9f960dffd..b68a5cd75ae8 100644
> --- a/arch/powerpc/include/asm/cpuidle.h
> +++ b/arch/powerpc/include/asm/cpuidle.h
> @@ -2,13 +2,39 @@
>  #define _ASM_POWERPC_CPUIDLE_H
> 
>  #ifdef CONFIG_PPC_POWERNV
> -/* Used in powernv idle state management */
> +/* Thread state used in powernv idle state management */
>  #define PNV_THREAD_RUNNING  0
>  #define PNV_THREAD_NAP  1
>  #define PNV_THREAD_SLEEP2
>  #define PNV_THREAD_WINKLE   3
> -#define PNV_CORE_IDLE_LOCK_BIT  0x10000000
> -#define PNV_CORE_IDLE_THREAD_BITS   0x000000FF
> +
> +/*
> + * Core state used in powernv idle for POWER8.
> + *
> + * The lock bit synchronizes updates to the state, as well as parts of the
> + * sleep/wake code (see kernel/idle_book3s.S).
> + *
> + * Bottom 8 bits track the idle state of each thread. Bit is cleared before
> + * the thread executes an idle instruction (nap/sleep/winkle).
> + *
> + * Then there is winkle tracking. A core does not lose complete state
> + * until every thread is in winkle. So the winkle count field counts the
> + * number of threads in winkle (small window of false positives is okay
> + * around the sleep/wake, so long as there are no false negatives).
> + *
> + * When the winkle count reaches 8 (the COUNT_ALL_BIT becomes set), then
> + * the THREAD_WINKLE_BITS are set, which indicate which threads have not
> + * yet woken from the winkle state.
> + */
> +#define PNV_CORE_IDLE_LOCK_BIT   0x10000000
> +
> +#define PNV_CORE_IDLE_WINKLE_COUNT   0x00010000
> +#define PNV_CORE_IDLE_WINKLE_COUNT_ALL_BIT   0x00080000
> +#define PNV_CORE_IDLE_WINKLE_COUNT_BITS  0x000F0000
> +#define PNV_CORE_IDLE_THREAD_WINKLE_BITS_SHIFT   8
> +#define PNV_CORE_IDLE_THREAD_WINKLE_BITS 0x0000FF00
> +
> +#define PNV_CORE_IDLE_THREAD_BITS0x000000FF
> 
>  /*
>   *  NOTE =
> diff --git a/arch/powerpc/kernel/idle_book3s.S 
> b/arch/powerpc/kernel/idle_book3s.S
> index 3cb75907c5c5..87518a1dca50 100644
> --- a/arch/powerpc/kernel/idle_book3s.S
> +++ b/arch/powerpc/kernel/idle_book3s.S
> @@ -210,15 +210,20 @@ pnv_enter_arch207_idle_mode:
>   /* Sleep or winkle */
>   lbz r7,PACA_THREAD_MASK(r13)
>   ld  r14,PACA_CORE_IDLE_STATE_PTR(r13)
> + li  r5,0
> + beq cr3,3f
> + lis r5,PNV_CORE_IDLE_WINKLE_COUNT@h
> +3:
>  lwarx_loop1:
>   lwarx   r15,0,r14
> 
>   andis.  r9,r15,PNV_CORE_IDLE_LOCK_BIT@h
>   bnel-   core_idle_lock_held
> 
> + add r15,r15,r5  /* Add if winkle */
>   andcr15,r15,r7  /* Clear thread bit */
> 
> - andi.   r15,r15,PNV_CORE_IDLE_THREAD_BITS
> + andi.   r9,r15,PNV_CORE_IDLE_THREAD_BITS
> 
>  /*
>   * If cr0 = 0, then current thread is the last thread of the core entering
> @@ -437,16 +442,14 @@ pnv_restore_hyp_resource_arch300:
>  pnv_restore_hyp_resource_arch207:
>   /*
>* POWER ISA 2.07 or less.
> -  * Check if we slept with winkle.
> +  * Check if we slept with sleep or winkle.
>*/
>   ld  r2,PACATOC(r13);
> 
> - lbz r0,PACA_THREAD_IDLE_STATE(r13)
> - cmpwi   cr2,r0,PNV_THREAD_NAP
> - cmpwi   cr4,r0,PNV_THREAD_WINKLE
> + lbz r4,PACA_THREAD_IDLE_STATE(r13)
>   li  r0,PNV_THREAD_RUNNING
>   stb r0,PACA_THREAD_IDLE_STATE(r13)  /* Clear thread state */
> -
> + cmpwi   cr2,r4,PNV_THREAD_NAP
>   bgt cr2,pnv_wakeup_tb_loss  /* Either sleep or Winkle */
> 
> 
> @@ -467,7 +470,12 @@ pnv_restore_hyp_resource_arch207:
>   *
>   * r13 - PACA
>   * cr3 - gt if waking up with partial/complete hypervisor state loss
> + *
> + * If ISA300:
>   * cr4 - 

[GIT PULL 0/6] perf/core improvements and fixes

2017-03-16 Thread Arnaldo Carvalho de Melo
Hi Ingo,

Please consider pulling,

- Arnaldo

Test results at the end of this message, as usual.

The following changes since commit ffa86c2f1a8862cf58c873f6f14d4b2c3250fb48:

  Merge tag 'perf-core-for-mingo-4.12-20170314' of 
git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core 
(2017-03-15 19:27:27 +0100)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git 
tags/perf-core-for-mingo-4.12-20170316

for you to fetch changes up to 61f35d750683b21e9e3836e309195c79c1daed74:

  uprobes: Default UPROBES_EVENTS to Y (2017-03-16 12:42:02 -0300)


perf/core improvements and fixes:

New features:

- Add 'brstackinsn' field in 'perf script' to reuse the x86 instruction
  decoder used in the Intel PT code to study hot paths to samples (Andi Kleen)

Kernel:

- Default UPROBES_EVENTS to Y (Alexei Starovoitov)

- Fix check for kretprobe offset within function entry (Naveen N. Rao)

Infrastructure:

- Introduce util func is_sdt_event() (Ravi Bangoria)

- Make perf_event__synthesize_mmap_events() scale on older kernels where
  reading /proc/pid/maps is way slower than reading /proc/pid/task/pid/maps 
(Stephane Eranian)

Signed-off-by: Arnaldo Carvalho de Melo <a...@redhat.com>


Andi Kleen (1):
  perf script: Add 'brstackinsn' for branch stacks

Arnaldo Carvalho de Melo (2):
  tools headers: Sync {tools/,}arch/x86/include/asm/cpufeatures.h
  uprobes: Default UPROBES_EVENTS to Y

Naveen N. Rao (1):
  trace/kprobes: Fix check for kretprobe offset within function entry

Ravi Bangoria (1):
  perf probe: Introduce util func is_sdt_event()

Stephane Eranian (1):
  perf tools: Make perf_event__synthesize_mmap_events() scale

 include/linux/kprobes.h|   1 +
 kernel/kprobes.c   |  40 ++--
 kernel/trace/Kconfig   |   2 +-
 kernel/trace/trace_kprobe.c|   2 +-
 tools/arch/x86/include/asm/cpufeatures.h   |   5 +-
 tools/perf/Documentation/perf-script.txt   |  13 +-
 tools/perf/builtin-script.c| 264 -
 tools/perf/util/Build  |   1 +
 tools/perf/util/dump-insn.c|  14 ++
 tools/perf/util/dump-insn.h|  22 ++
 tools/perf/util/event.c|   4 +-
 .../util/intel-pt-decoder/intel-pt-insn-decoder.c  |  24 ++
 tools/perf/util/parse-events.h |  20 ++
 tools/perf/util/probe-event.c  |   9 +-
 14 files changed, 381 insertions(+), 40 deletions(-)
 create mode 100644 tools/perf/util/dump-insn.c
 create mode 100644 tools/perf/util/dump-insn.h

Test results:

The first ones are container (docker) based builds of tools/perf with and
without libelf support, objtool where it is supported and samples/bpf/, ditto.
Where clang is available, it is also used to build perf with/without libelf.

Several are cross builds (the ones with -x-ARCH) plus the android one; those
may not have all the features built, due to lack of multi-arch devel packages,
which are available and being used so far on just a few, like
debian:experimental-x-{arm64,mipsel}.

The 'perf test' one will perform a variety of tests exercising
tools/perf/util/, tools/lib/{bpf,traceevent,etc}, as well as run perf commands
with a variety of command line event specifications to then intercept the
sys_perf_event syscall to check that the perf_event_attr fields are set up as
expected, among a variety of other unit tests.

Then there are the 'make -C tools/perf build-test' ones, which build tools/perf/
with a variety of feature sets, exercising the build with an incomplete set of
features as well as with a complete one. It is planned to have it run on each
of the containers mentioned above, using some container orchestration
infrastructure. Get in contact if interested in helping having this in place.

  # dm
   1 alpine:3.4: Ok
   2 alpine:3.5: Ok
   3 alpine:edge: Ok
   4 android-ndk:r12b-arm: Ok
   5 archlinux:latest: Ok
   6 centos:5: Ok
   7 centos:6: Ok
   8 centos:7: Ok
   9 debian:7: Ok
  10 debian:8: Ok
  11 debian:experimental: Ok
  12 debian:experimental-x-arm64: Ok
  13 debian:experimental-x-mips: Ok
  14 debian:experimental-x-mips64: Ok
  15 debian:experimental-x-mipsel: Ok
  16 fedora:20: Ok
  17 fedora:21: Ok
  18 fedora:22: Ok
  19 fedora:23: Ok
  20 fedora:24: Ok
  21 fedora:24-x-ARC-uClibc: Ok
  22 fedora:25: Ok
  23 fedora:rawhide: Ok
  24 mageia:5: Ok
  25 opensuse:13.2: Ok
  26 opensuse:42.1: Ok
  27 opensuse:tumbleweed: Ok
  28 ubuntu:12.04.5: Ok
  29 ubuntu:14.04.4: Ok
  30 ubuntu:14.04.4-x-linaro-arm64: Ok
  31 ubuntu:15.10: Ok
  32 ubuntu:16.04: Ok
  33 ubuntu:16.04-x-arm: Ok
  34 ubuntu:16.04-x-arm64: Ok
  35 ubuntu:16.04-x-powerpc: Ok
  36 

[PATCH 2/6] trace/kprobes: Fix check for kretprobe offset within function entry

2017-03-16 Thread Arnaldo Carvalho de Melo
From: "Naveen N. Rao" 

perf specifies an offset from _text, and since this offset is fed
directly into the arch-specific helper, the kprobes tracer rejects
installation of kretprobes through perf. Fix this by looking up the
actual offset from a function for the specified sym+offset.

Refactor and reuse existing routines to limit code duplication -- we
repurpose kprobe_addr() for determining final kprobe address and we
split out the function entry offset determination into a separate
generic helper.
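
To make the refactoring concrete, the split-out generic helper amounts to
roughly the sketch below (the function body is not fully visible in the
truncated diff further down, so treat this as an approximation rather than
the exact patch contents). It would sit in kernel/kprobes.c, which already
pulls in <linux/kallsyms.h> and <linux/err.h>:

bool function_offset_within_entry(kprobe_opcode_t *addr, const char *sym,
                                  unsigned long offset)
{
        /* Resolve sym+offset (or addr) via the refactored _kprobe_addr() */
        kprobe_opcode_t *kp_addr = _kprobe_addr(addr, sym, offset);

        if (IS_ERR(kp_addr))
                return false;

        /*
         * Recompute the offset within the resolved function and let the
         * arch helper veto offsets that do not fall on the function entry.
         */
        if (!kallsyms_lookup_size_offset((unsigned long)kp_addr, NULL, &offset) ||
            !arch_function_offset_within_entry(offset))
                return false;

        return true;
}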

Before patch:

  naveen@ubuntu:~/linux/tools/perf$ sudo ./perf probe -v do_open%return
  probe-definition(0): do_open%return
  symbol:do_open file:(null) line:0 offset:0 return:1 lazy:(null)
  0 arguments
  Looking at the vmlinux_path (8 entries long)
  Using /boot/vmlinux for symbols
  Open Debuginfo file: /boot/vmlinux
  Try to find probe point from debuginfo.
  Matched function: do_open [2d0c7ff]
  Probe point found: do_open+0
  Matched function: do_open [35d76dc]
  found inline addr: 0xc04ba9c4
  Failed to find "do_open%return",
   because do_open is an inlined function and has no return point.
  An error occurred in debuginfo analysis (-22).
  Trying to use symbols.
  Opening /sys/kernel/debug/tracing//README write=0
  Opening /sys/kernel/debug/tracing//kprobe_events write=1
  Writing event: r:probe/do_open _text+4469776
  Failed to write event: Invalid argument
Error: Failed to add events. Reason: Invalid argument (Code: -22)
  naveen@ubuntu:~/linux/tools/perf$ dmesg | tail
  
  [   33.568656] Given offset is not valid for return probe.

After patch:

  naveen@ubuntu:~/linux/tools/perf$ sudo ./perf probe -v do_open%return
  probe-definition(0): do_open%return
  symbol:do_open file:(null) line:0 offset:0 return:1 lazy:(null)
  0 arguments
  Looking at the vmlinux_path (8 entries long)
  Using /boot/vmlinux for symbols
  Open Debuginfo file: /boot/vmlinux
  Try to find probe point from debuginfo.
  Matched function: do_open [2d0c7d6]
  Probe point found: do_open+0
  Matched function: do_open [35d76b3]
  found inline addr: 0xc04ba9e4
  Failed to find "do_open%return",
   because do_open is an inlined function and has no return point.
  An error occurred in debuginfo analysis (-22).
  Trying to use symbols.
  Opening /sys/kernel/debug/tracing//README write=0
  Opening /sys/kernel/debug/tracing//kprobe_events write=1
  Writing event: r:probe/do_open _text+4469808
  Writing event: r:probe/do_open_1 _text+4956344
  Added new events:
probe:do_open(on do_open%return)
probe:do_open_1  (on do_open%return)

  You can now use it in all perf tools, such as:

  perf record -e probe:do_open_1 -aR sleep 1

  naveen@ubuntu:~/linux/tools/perf$ sudo cat /sys/kernel/debug/kprobes/list
  c0041370  k  kretprobe_trampoline+0x0[OPTIMIZED]
  c04ba0b8  r  do_open+0x8[DISABLED]
  c0443430  r  do_open+0x0[DISABLED]

Signed-off-by: Naveen N. Rao 
Acked-by: Masami Hiramatsu 
Cc: Ananth N Mavinakayanahalli 
Cc: Michael Ellerman 
Cc: Steven Rostedt 
Cc: linuxppc-dev@lists.ozlabs.org
Link: 
http://lkml.kernel.org/r/d8cd1ef420ec22e3643ac332fdabcffc77319a42.1488961018.git.naveen.n@linux.vnet.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo 
---
 include/linux/kprobes.h |  1 +
 kernel/kprobes.c| 40 ++--
 kernel/trace/trace_kprobe.c |  2 +-
 3 files changed, 28 insertions(+), 15 deletions(-)

diff --git a/include/linux/kprobes.h b/include/linux/kprobes.h
index 177bdf6c6aeb..47e4da5b4fa2 100644
--- a/include/linux/kprobes.h
+++ b/include/linux/kprobes.h
@@ -268,6 +268,7 @@ extern void show_registers(struct pt_regs *regs);
 extern void kprobes_inc_nmissed_count(struct kprobe *p);
 extern bool arch_within_kprobe_blacklist(unsigned long addr);
 extern bool arch_function_offset_within_entry(unsigned long offset);
+extern bool function_offset_within_entry(kprobe_opcode_t *addr, const char 
*sym, unsigned long offset);
 
 extern bool within_kprobe_blacklist(unsigned long addr);
 
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 4780ec236035..d733479a10ee 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -1391,21 +1391,19 @@ bool within_kprobe_blacklist(unsigned long addr)
  * This returns encoded errors if it fails to look up symbol or invalid
  * combination of parameters.
  */
-static kprobe_opcode_t *kprobe_addr(struct kprobe *p)
+static kprobe_opcode_t *_kprobe_addr(kprobe_opcode_t *addr,
+   const char *symbol_name, unsigned int offset)
 {
-   kprobe_opcode_t *addr = p->addr;
-
-   if ((p->symbol_name && p->addr) ||
-   (!p->symbol_name && !p->addr))
+   if ((symbol_name && addr) || (!symbol_name && !addr))
goto invalid;
 
-   if (p->symbol_name) {
-   

Re: [PATCH 4/8] powerpc/64s: fix POWER9 machine check handler from stop state

2017-03-16 Thread Gautham R Shenoy
Hi,

On Thu, Mar 16, 2017 at 11:05:20PM +1000, Nicholas Piggin wrote:
> On Thu, 16 Mar 2017 18:10:48 +0530
> Mahesh Jagannath Salgaonkar  wrote:
> 
> > On 03/14/2017 02:53 PM, Nicholas Piggin wrote:
> > > The ISA specifies power save wakeup can cause a machine check interrupt.
> > > The machine check handler currently has code to handle that for POWER8,
> > > but POWER9 crashes when trying to execute the P8 style sleep
> > > instructions.
> > > 
> > > So queue up the machine check, then call into the idle code to wake up
> > > as the system reset interrupt does, rather than attempting to sleep
> > > again without going through the main idle path.
> > > 
> > > Reviewed-by: Gautham R. Shenoy 
> > > Signed-off-by: Nicholas Piggin 
> > > ---
> > >  arch/powerpc/include/asm/reg.h   |  1 +
> > >  arch/powerpc/kernel/exceptions-64s.S | 69 
> > > ++--
> > >  2 files changed, 35 insertions(+), 35 deletions(-)
> > > 
> > > diff --git a/arch/powerpc/include/asm/reg.h 
> > > b/arch/powerpc/include/asm/reg.h
> > > index fc879fd6bdae..8bbdfacce970 100644
> > > --- a/arch/powerpc/include/asm/reg.h
> > > +++ b/arch/powerpc/include/asm/reg.h
> > > @@ -656,6 +656,7 @@
> > >  #define   SRR1_ISI_PROT  0x0800 /* ISI: Other protection 
> > > fault */
> > >  #define   SRR1_WAKEMASK  0x0038 /* reason for wakeup */
> > >  #define   SRR1_WAKEMASK_P8   0x003c /* reason for wakeup on 
> > > POWER8 and 9 */
> > > +#define   SRR1_WAKEMCE_RESVD 0x003c /* Unused/reserved value 
> > > used by MCE wakeup to indicate cause to idle wakeup handler */
> > >  #define   SRR1_WAKESYSERR0x0030 /* System error */
> > >  #define   SRR1_WAKEEE0x0020 /* External interrupt */
> > >  #define   SRR1_WAKEHVI   0x0024 /* Hypervisor Virtualization 
> > > Interrupt (P9) */
> > > diff --git a/arch/powerpc/kernel/exceptions-64s.S 
> > > b/arch/powerpc/kernel/exceptions-64s.S
> > > index e390fcd04bcb..5779d2d6a192 100644
> > > --- a/arch/powerpc/kernel/exceptions-64s.S
> > > +++ b/arch/powerpc/kernel/exceptions-64s.S
> > > @@ -306,6 +306,33 @@ EXC_COMMON_BEGIN(machine_check_common)
> > >   /* restore original r1. */  \
> > >   ld  r1,GPR1(r1)
> > > 
> > > +#ifdef CONFIG_PPC_P7_NAP
> > > +EXC_COMMON_BEGIN(machine_check_idle_common)
> > > + bl  machine_check_queue_event
> > > + /*
> > > +  * Queue the machine check, then reload SRR1 and use it to set
> > > +  * CR3 according to pnv_powersave_wakeup convention.
> > > +  */
> > > + ld  r12,_MSR(r1)
> > > + rlwinm  r11,r12,47-31,30,31
> > > + cmpwi   cr3,r11,2
> > > +
> > > + /*
> > > +  * Now put SRR1_WAKEMCE_RESVD into SRR1, allows it to follow the
> > > +  * system reset wakeup code.
> > > +  */
> > > + orisr12,r12,SRR1_WAKEMCE_RESVD@h
> > > + mtspr   SPRN_SRR1,r12
> > > + std r12,_MSR(r1)
> > > +
> > > + /*
> > > +  * Decrement MCE nesting after finishing with the stack.
> > > +  */
> > > + lhz r11,PACA_IN_MCE(r13)
> > > + subir11,r11,1
> > > + sth r11,PACA_IN_MCE(r13)  
> > 
> > Looks like we are not winding up.. Shouldn't we ? What if we may end up
> > in pnv_wakeup_noloss() which assumes that no GPRs are lost. Am I missing
> > anything ?

Nice catch! This can occur if SRR1[46:47] == 0b01.

>
> Hmm, no I think you're right. Thanks, good catch. But can we do it with
> just setting PACA_NAPSTATELOST?

Unconditionally setting PACA_NAPSTATELOST should be sufficient.

> 
> > 
> > > + b   pnv_powersave_wakeup
> > > +#endif
> > >   /*  
> > 
> > [...]
> > 
> > Rest looks good to me.
> > 
> > Reviewed-by: Mahesh J Salgaonkar 
> 
> Thanks,
> Nick
> 



Re: [PATCH 4/8] powerpc/64s: fix POWER9 machine check handler from stop state

2017-03-16 Thread Nicholas Piggin
On Thu, 16 Mar 2017 18:10:48 +0530
Mahesh Jagannath Salgaonkar  wrote:

> On 03/14/2017 02:53 PM, Nicholas Piggin wrote:
> > The ISA specifies power save wakeup can cause a machine check interrupt.
> > The machine check handler currently has code to handle that for POWER8,
> > but POWER9 crashes when trying to execute the P8 style sleep
> > instructions.
> > 
> > So queue up the machine check, then call into the idle code to wake up
> > as the system reset interrupt does, rather than attempting to sleep
> > again without going through the main idle path.
> > 
> > Reviewed-by: Gautham R. Shenoy 
> > Signed-off-by: Nicholas Piggin 
> > ---
> >  arch/powerpc/include/asm/reg.h   |  1 +
> >  arch/powerpc/kernel/exceptions-64s.S | 69 
> > ++--
> >  2 files changed, 35 insertions(+), 35 deletions(-)
> > 
> > diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
> > index fc879fd6bdae..8bbdfacce970 100644
> > --- a/arch/powerpc/include/asm/reg.h
> > +++ b/arch/powerpc/include/asm/reg.h
> > @@ -656,6 +656,7 @@
> >  #define   SRR1_ISI_PROT0x0800 /* ISI: Other protection 
> > fault */
> >  #define   SRR1_WAKEMASK0x0038 /* reason for wakeup */
> >  #define   SRR1_WAKEMASK_P8 0x003c /* reason for wakeup on POWER8 and 9 
> > */
> > +#define   SRR1_WAKEMCE_RESVD   0x003c /* Unused/reserved value 
> > used by MCE wakeup to indicate cause to idle wakeup handler */
> >  #define   SRR1_WAKESYSERR  0x0030 /* System error */
> >  #define   SRR1_WAKEEE  0x0020 /* External interrupt */
> >  #define   SRR1_WAKEHVI 0x0024 /* Hypervisor Virtualization 
> > Interrupt (P9) */
> > diff --git a/arch/powerpc/kernel/exceptions-64s.S 
> > b/arch/powerpc/kernel/exceptions-64s.S
> > index e390fcd04bcb..5779d2d6a192 100644
> > --- a/arch/powerpc/kernel/exceptions-64s.S
> > +++ b/arch/powerpc/kernel/exceptions-64s.S
> > @@ -306,6 +306,33 @@ EXC_COMMON_BEGIN(machine_check_common)
> > /* restore original r1. */  \
> > ld  r1,GPR1(r1)
> > 
> > +#ifdef CONFIG_PPC_P7_NAP
> > +EXC_COMMON_BEGIN(machine_check_idle_common)
> > +   bl  machine_check_queue_event
> > +   /*
> > +* Queue the machine check, then reload SRR1 and use it to set
> > +* CR3 according to pnv_powersave_wakeup convention.
> > +*/
> > +   ld  r12,_MSR(r1)
> > +   rlwinm  r11,r12,47-31,30,31
> > +   cmpwi   cr3,r11,2
> > +
> > +   /*
> > +* Now put SRR1_WAKEMCE_RESVD into SRR1, allows it to follow the
> > +* system reset wakeup code.
> > +*/
> > +   orisr12,r12,SRR1_WAKEMCE_RESVD@h
> > +   mtspr   SPRN_SRR1,r12
> > +   std r12,_MSR(r1)
> > +
> > +   /*
> > +* Decrement MCE nesting after finishing with the stack.
> > +*/
> > +   lhz r11,PACA_IN_MCE(r13)
> > +   subir11,r11,1
> > +   sth r11,PACA_IN_MCE(r13)  
> 
> Looks like we are not winding up.. Shouldn't we ? What if we may end up
> in pnv_wakeup_noloss() which assumes that no GPRs are lost. Am I missing
> anything ?

Hmm, no I think you're right. Thanks, good catch. But can we do it with
just setting PACA_NAPSTATELOST?

> 
> > +   b   pnv_powersave_wakeup
> > +#endif
> > /*  
> 
> [...]
> 
> Rest looks good to me.
> 
> Reviewed-by: Mahesh J Salgaonkar 

Thanks,
Nick


Re: [PATCH 7/8] powerpc/64s: idle do not hold reservation longer than required

2017-03-16 Thread Nicholas Piggin
On Thu, 16 Mar 2017 18:13:28 +0530
Gautham R Shenoy  wrote:

> Hi Nick,
> 
> On Tue, Mar 14, 2017 at 07:23:48PM +1000, Nicholas Piggin wrote:
> > When taking the core idle state lock, grab it immediately like a
> > regular lock, rather than adding more tests in there. Holding the lock
> > keeps it stable, so there is no need to do it while holding the
> > reservation.  
> 
> I agree with this patch. Just a minor query
> 
> > 
> > Signed-off-by: Nicholas Piggin 
> > ---
> >  arch/powerpc/kernel/idle_book3s.S | 20 +++-
> >  1 file changed, 11 insertions(+), 9 deletions(-)
> > 
> > diff --git a/arch/powerpc/kernel/idle_book3s.S 
> > b/arch/powerpc/kernel/idle_book3s.S
> > index 1c91dc35c559..3cb75907c5c5 100644
> > --- a/arch/powerpc/kernel/idle_book3s.S
> > +++ b/arch/powerpc/kernel/idle_book3s.S
> > @@ -488,12 +488,12 @@ BEGIN_FTR_SECTION
> > CHECK_HMI_INTERRUPT
> >  END_FTR_SECTION_IFSET(CPU_FTR_HVMODE)
> > 
> > -   lbz r7,PACA_THREAD_MASK(r13)
> > ld  r14,PACA_CORE_IDLE_STATE_PTR(r13)
> > -lwarx_loop2:
> > -   lwarx   r15,0,r14
> > -   andis.  r9,r15,PNV_CORE_IDLE_LOCK_BIT@h
> > +   lbz r7,PACA_THREAD_MASK(r13)  
> 
> Is reversing the order of loads into r7 and r14 intentional?

Oh, yes I guess it is because we use r14 result first. I should have
mentioned it but I forgot about it. Probably they decode together,
but you might get them in different cycles.

Thanks for the review!

Thanks,
Nick


Re: [PATCH 1/4] crypto: powerpc - Factor out the core CRC vpmsum algorithm

2017-03-16 Thread Daniel Axtens
> So although this sits in arch/powerpc, it's heavy on the crypto which is
> not my area of expertise (to say the least!), so I think it should
> probably go via Herbert and the crypto tree?

That was my thought as well. Sorry - probably should have put that in
the comments somewhere.

Regards,
Daniel


Re: [PATCH 7/8] powerpc/64s: idle do not hold reservation longer than required

2017-03-16 Thread Gautham R Shenoy
Hi Nick,

On Tue, Mar 14, 2017 at 07:23:48PM +1000, Nicholas Piggin wrote:
> When taking the core idle state lock, grab it immediately like a
> regular lock, rather than adding more tests in there. Holding the lock
> keeps it stable, so there is no need to do it while holding the
> reservation.

I agree with this patch. Just a minor query

> 
> Signed-off-by: Nicholas Piggin 
> ---
>  arch/powerpc/kernel/idle_book3s.S | 20 +++-
>  1 file changed, 11 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/idle_book3s.S 
> b/arch/powerpc/kernel/idle_book3s.S
> index 1c91dc35c559..3cb75907c5c5 100644
> --- a/arch/powerpc/kernel/idle_book3s.S
> +++ b/arch/powerpc/kernel/idle_book3s.S
> @@ -488,12 +488,12 @@ BEGIN_FTR_SECTION
>   CHECK_HMI_INTERRUPT
>  END_FTR_SECTION_IFSET(CPU_FTR_HVMODE)
> 
> - lbz r7,PACA_THREAD_MASK(r13)
>   ld  r14,PACA_CORE_IDLE_STATE_PTR(r13)
> -lwarx_loop2:
> - lwarx   r15,0,r14
> - andis.  r9,r15,PNV_CORE_IDLE_LOCK_BIT@h
> + lbz r7,PACA_THREAD_MASK(r13)

Is reversing the order of loads into r7 and r14 intentional?

Other than that,
Reviewed-by: Gautham R. Shenoy 

> +
>   /*
> +  * Take the core lock to synchronize against other threads.
> +  *
>* Lock bit is set in one of the 2 cases-
>* a. In the sleep/winkle enter path, the last thread is executing
>* fastsleep workaround code.
> @@ -501,7 +501,14 @@ lwarx_loop2:
>* workaround undo code or resyncing timebase or restoring context
>* In either case loop until the lock bit is cleared.
>*/
> +1:
> + lwarx   r15,0,r14
> + andis.  r9,r15,PNV_CORE_IDLE_LOCK_BIT@h
>   bnel-   core_idle_lock_held
> + orisr15,r15,PNV_CORE_IDLE_LOCK_BIT@h
> + stwcx.  r15,0,r14
> + bne-1b
> + isync
> 
>   andi.   r9,r15,PNV_CORE_IDLE_THREAD_BITS
>   cmpwi   cr2,r9,0
> @@ -513,11 +520,6 @@ lwarx_loop2:
>* cr4 - gt or eq if waking up from complete hypervisor state loss.
>*/
> 
> - orisr15,r15,PNV_CORE_IDLE_LOCK_BIT@h
> - stwcx.  r15,0,r14
> - bne-lwarx_loop2
> - isync
> -
>  BEGIN_FTR_SECTION
>   lbz r4,PACA_SUBCORE_SIBLING_MASK(r13)
>   and r4,r4,r15
> -- 
> 2.11.0
> 



Re: [PATCH 4/8] powerpc/64s: fix POWER9 machine check handler from stop state

2017-03-16 Thread Mahesh Jagannath Salgaonkar
On 03/14/2017 02:53 PM, Nicholas Piggin wrote:
> The ISA specifies power save wakeup can cause a machine check interrupt.
> The machine check handler currently has code to handle that for POWER8,
> but POWER9 crashes when trying to execute the P8 style sleep
> instructions.
> 
> So queue up the machine check, then call into the idle code to wake up
> as the system reset interrupt does, rather than attempting to sleep
> again without going through the main idle path.
> 
> Reviewed-by: Gautham R. Shenoy 
> Signed-off-by: Nicholas Piggin 
> ---
>  arch/powerpc/include/asm/reg.h   |  1 +
>  arch/powerpc/kernel/exceptions-64s.S | 69 
> ++--
>  2 files changed, 35 insertions(+), 35 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
> index fc879fd6bdae..8bbdfacce970 100644
> --- a/arch/powerpc/include/asm/reg.h
> +++ b/arch/powerpc/include/asm/reg.h
> @@ -656,6 +656,7 @@
>  #define   SRR1_ISI_PROT  0x0800 /* ISI: Other protection 
> fault */
>  #define   SRR1_WAKEMASK  0x0038 /* reason for wakeup */
>  #define   SRR1_WAKEMASK_P8   0x003c /* reason for wakeup on POWER8 and 9 
> */
> +#define   SRR1_WAKEMCE_RESVD 0x003c /* Unused/reserved value used by MCE 
> wakeup to indicate cause to idle wakeup handler */
>  #define   SRR1_WAKESYSERR0x0030 /* System error */
>  #define   SRR1_WAKEEE0x0020 /* External interrupt */
>  #define   SRR1_WAKEHVI   0x0024 /* Hypervisor Virtualization 
> Interrupt (P9) */
> diff --git a/arch/powerpc/kernel/exceptions-64s.S 
> b/arch/powerpc/kernel/exceptions-64s.S
> index e390fcd04bcb..5779d2d6a192 100644
> --- a/arch/powerpc/kernel/exceptions-64s.S
> +++ b/arch/powerpc/kernel/exceptions-64s.S
> @@ -306,6 +306,33 @@ EXC_COMMON_BEGIN(machine_check_common)
>   /* restore original r1. */  \
>   ld  r1,GPR1(r1)
> 
> +#ifdef CONFIG_PPC_P7_NAP
> +EXC_COMMON_BEGIN(machine_check_idle_common)
> + bl  machine_check_queue_event
> + /*
> +  * Queue the machine check, then reload SRR1 and use it to set
> +  * CR3 according to pnv_powersave_wakeup convention.
> +  */
> + ld  r12,_MSR(r1)
> + rlwinm  r11,r12,47-31,30,31
> + cmpwi   cr3,r11,2
> +
> + /*
> +  * Now put SRR1_WAKEMCE_RESVD into SRR1, allows it to follow the
> +  * system reset wakeup code.
> +  */
> + orisr12,r12,SRR1_WAKEMCE_RESVD@h
> + mtspr   SPRN_SRR1,r12
> + std r12,_MSR(r1)
> +
> + /*
> +  * Decrement MCE nesting after finishing with the stack.
> +  */
> + lhz r11,PACA_IN_MCE(r13)
> + subir11,r11,1
> + sth r11,PACA_IN_MCE(r13)

Looks like we are not winding up.. Shouldn't we ? What if we may end up
in pnv_wakeup_noloss() which assumes that no GPRs are lost. Am I missing
anything ?

> + b   pnv_powersave_wakeup
> +#endif
>   /*

[...]

Rest looks good to me.

Reviewed-by: Mahesh J Salgaonkar 

Thanks,
-Mahesh.



Re: [PATCH 5/8] powerpc/64s: use PACA_THREAD_IDLE_STATE only in POWER8

2017-03-16 Thread Nicholas Piggin
On Thu, 16 Mar 2017 17:24:03 +0530
Gautham R Shenoy  wrote:

> Hi Nick,
> 
> On Tue, Mar 14, 2017 at 07:23:46PM +1000, Nicholas Piggin wrote:
> > POWER9 does not use this field, so it should be moved into the POWER8
> > code. Update the documentation in the paca struct too.
> > 
> > Signed-off-by: Nicholas Piggin 
> > ---
> >  arch/powerpc/include/asm/paca.h   | 12 ++--
> >  arch/powerpc/kernel/idle_book3s.S | 13 +++--
> >  2 files changed, 17 insertions(+), 8 deletions(-)
> > 
> > diff --git a/arch/powerpc/include/asm/paca.h 
> > b/arch/powerpc/include/asm/paca.h
> > index 708c3e592eeb..bbb59e226a9f 100644
> > --- a/arch/powerpc/include/asm/paca.h
> > +++ b/arch/powerpc/include/asm/paca.h
> > @@ -165,11 +165,19 @@ struct paca_struct {
> >  #endif
> > 
> >  #ifdef CONFIG_PPC_POWERNV
> > -   /* Per-core mask tracking idle threads and a lock bit-[L][] */
> > +   /* CPU idle fields */
> > +
> > +   /*
> > +* Per-core word used to synchronize between threads. See
> > +* asm/cpuidle.h, PNV_CORE_IDLE_*
> > +*/
> > u32 *core_idle_state_ptr;
> > -   u8 thread_idle_state;   /* PNV_THREAD_RUNNING/NAP/SLEEP */
> > /* Mask to indicate thread id in core */
> > u8 thread_mask;
> > +
> > +   /* POWER8 specific fields */
> > +   /* PNV_THREAD_RUNNING/NAP/SLEEP */
> > +   u8 thread_idle_state;  
> 
> I am planning to use this in POWER9 DD1 to distinguish between a
> SRESET received when the thread was running vs when it was in stop.
> Unfortunately the SRR1[46:47] are not cleared in the former case. So
> we need a way in software to distinguish between the two.

Okay, we can skip this for now. It was not a critical part of my
patches, just a tidy up.

Thanks,
Nick


Re: [PATCH 6/8] powerpc/64s: idle expand usable core idle state bits

2017-03-16 Thread Gautham R Shenoy
Hi Nick,

On Tue, Mar 14, 2017 at 07:23:47PM +1000, Nicholas Piggin wrote:
> In preparation for adding more bits to the core idle state word,
> move the lock bit up, and unlock by flipping the lock bit rather
> than masking off all but the thread bits.
> 
> Add branch hints for atomic operations while we're here.

Looks good.

Reviewed-by: Gautham R. Shenoy 

> 
> Signed-off-by: Nicholas Piggin 
> ---
>  arch/powerpc/include/asm/cpuidle.h |  4 ++--
>  arch/powerpc/kernel/idle_book3s.S  | 33 +
>  2 files changed, 19 insertions(+), 18 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/cpuidle.h 
> b/arch/powerpc/include/asm/cpuidle.h
> index 155731557c9b..b9d9f960dffd 100644
> --- a/arch/powerpc/include/asm/cpuidle.h
> +++ b/arch/powerpc/include/asm/cpuidle.h
> @@ -7,8 +7,8 @@
>  #define PNV_THREAD_NAP  1
>  #define PNV_THREAD_SLEEP2
>  #define PNV_THREAD_WINKLE   3
> -#define PNV_CORE_IDLE_LOCK_BIT  0x100
> -#define PNV_CORE_IDLE_THREAD_BITS   0x0FF
> +#define PNV_CORE_IDLE_LOCK_BIT  0x1000
> +#define PNV_CORE_IDLE_THREAD_BITS   0x00FF
> 
>  /*
>   *  NOTE =
> diff --git a/arch/powerpc/kernel/idle_book3s.S 
> b/arch/powerpc/kernel/idle_book3s.S
> index 9bdfba75a5e7..1c91dc35c559 100644
> --- a/arch/powerpc/kernel/idle_book3s.S
> +++ b/arch/powerpc/kernel/idle_book3s.S
> @@ -95,12 +95,12 @@ ALT_FTR_SECTION_END_IFSET(CPU_FTR_ARCH_300)
>  core_idle_lock_held:
>   HMT_LOW
>  3:   lwz r15,0(r14)
> - andi.   r15,r15,PNV_CORE_IDLE_LOCK_BIT
> + andis.  r15,r15,PNV_CORE_IDLE_LOCK_BIT@h
>   bne 3b
>   HMT_MEDIUM
>   lwarx   r15,0,r14
> - andi.   r9,r15,PNV_CORE_IDLE_LOCK_BIT
> - bne core_idle_lock_held
> + andis.  r9,r15,PNV_CORE_IDLE_LOCK_BIT@h
> + bne-core_idle_lock_held
>   blr
> 
>  /*
> @@ -213,8 +213,8 @@ pnv_enter_arch207_idle_mode:
>  lwarx_loop1:
>   lwarx   r15,0,r14
> 
> - andi.   r9,r15,PNV_CORE_IDLE_LOCK_BIT
> - bnelcore_idle_lock_held
> + andis.  r9,r15,PNV_CORE_IDLE_LOCK_BIT@h
> + bnel-   core_idle_lock_held
> 
>   andcr15,r15,r7  /* Clear thread bit */
> 
> @@ -241,7 +241,7 @@ common_enter: /* common code for all the threads entering 
> sleep or winkle */
>   IDLE_STATE_ENTER_SEQ_NORET(PPC_SLEEP)
> 
>  fastsleep_workaround_at_entry:
> - ori r15,r15,PNV_CORE_IDLE_LOCK_BIT
> + orisr15,r15,PNV_CORE_IDLE_LOCK_BIT@h
>   stwcx.  r15,0,r14
>   bne-lwarx_loop1
>   isync
> @@ -251,10 +251,10 @@ fastsleep_workaround_at_entry:
>   li  r4,1
>   bl  opal_config_cpu_idle_state
> 
> - /* Clear Lock bit */
> - li  r0,0
> + /* Unlock */
> + xoris   r15,r15,PNV_CORE_IDLE_LOCK_BIT@h
>   lwsync
> - stw r0,0(r14)
> + stw r15,0(r14)
>   b   common_enter
> 
>  enter_winkle:
> @@ -302,8 +302,8 @@ power_enter_stop:
> 
>  lwarx_loop_stop:
>   lwarx   r15,0,r14
> - andi.   r9,r15,PNV_CORE_IDLE_LOCK_BIT
> - bnelcore_idle_lock_held
> + andis.  r9,r15,PNV_CORE_IDLE_LOCK_BIT@h
> + bnel-   core_idle_lock_held
>   andcr15,r15,r7  /* Clear thread bit */
> 
>   stwcx.  r15,0,r14
> @@ -492,7 +492,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_HVMODE)
>   ld  r14,PACA_CORE_IDLE_STATE_PTR(r13)
>  lwarx_loop2:
>   lwarx   r15,0,r14
> - andi.   r9,r15,PNV_CORE_IDLE_LOCK_BIT
> + andis.  r9,r15,PNV_CORE_IDLE_LOCK_BIT@h
>   /*
>* Lock bit is set in one of the 2 cases-
>* a. In the sleep/winkle enter path, the last thread is executing
> @@ -501,9 +501,10 @@ lwarx_loop2:
>* workaround undo code or resyncing timebase or restoring context
>* In either case loop until the lock bit is cleared.
>*/
> - bnelcore_idle_lock_held
> + bnel-   core_idle_lock_held
> 
> - cmpwi   cr2,r15,0
> + andi.   r9,r15,PNV_CORE_IDLE_THREAD_BITS
> + cmpwi   cr2,r9,0
> 
>   /*
>* At this stage
> @@ -512,7 +513,7 @@ lwarx_loop2:
>* cr4 - gt or eq if waking up from complete hypervisor state loss.
>*/
> 
> - ori r15,r15,PNV_CORE_IDLE_LOCK_BIT
> + orisr15,r15,PNV_CORE_IDLE_LOCK_BIT@h
>   stwcx.  r15,0,r14
>   bne-lwarx_loop2
>   isync
> @@ -602,7 +603,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
>   mtspr   SPRN_WORC,r4
> 
>  clear_lock:
> - andi.   r15,r15,PNV_CORE_IDLE_THREAD_BITS
> + xoris   r15,r15,PNV_CORE_IDLE_LOCK_BIT@h
>   lwsync
>   stw r15,0(r14)
> 
> -- 
> 2.11.0
> 



Re: [PATCH 5/8] powerpc/64s: use PACA_THREAD_IDLE_STATE only in POWER8

2017-03-16 Thread Gautham R Shenoy
Hi Nick,

On Tue, Mar 14, 2017 at 07:23:46PM +1000, Nicholas Piggin wrote:
> POWER9 does not use this field, so it should be moved into the POWER8
> code. Update the documentation in the paca struct too.
> 
> Signed-off-by: Nicholas Piggin 
> ---
>  arch/powerpc/include/asm/paca.h   | 12 ++--
>  arch/powerpc/kernel/idle_book3s.S | 13 +++--
>  2 files changed, 17 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
> index 708c3e592eeb..bbb59e226a9f 100644
> --- a/arch/powerpc/include/asm/paca.h
> +++ b/arch/powerpc/include/asm/paca.h
> @@ -165,11 +165,19 @@ struct paca_struct {
>  #endif
> 
>  #ifdef CONFIG_PPC_POWERNV
> - /* Per-core mask tracking idle threads and a lock bit-[L][] */
> + /* CPU idle fields */
> +
> + /*
> +  * Per-core word used to synchronize between threads. See
> +  * asm/cpuidle.h, PNV_CORE_IDLE_*
> +  */
>   u32 *core_idle_state_ptr;
> - u8 thread_idle_state;   /* PNV_THREAD_RUNNING/NAP/SLEEP */
>   /* Mask to indicate thread id in core */
>   u8 thread_mask;
> +
> + /* POWER8 specific fields */
> + /* PNV_THREAD_RUNNING/NAP/SLEEP */
> + u8 thread_idle_state;

I am planning to use this in POWER9 DD1 to distinguish between a
SRESET received when the thread was running vs when it was in stop.
Unfortunately the SRR1[46:47] are not cleared in the former case. So
we need a way in software to distinguish between the two.

>   /* Mask to denote subcore sibling threads */
>   u8 subcore_sibling_mask;
>  #endif
> diff --git a/arch/powerpc/kernel/idle_book3s.S 
> b/arch/powerpc/kernel/idle_book3s.S
> index 9284ea0762b1..9bdfba75a5e7 100644
> --- a/arch/powerpc/kernel/idle_book3s.S
> +++ b/arch/powerpc/kernel/idle_book3s.S
> @@ -389,9 +389,6 @@ FTR_SECTION_ELSE
>   bl  pnv_restore_hyp_resource_arch207
>  ALT_FTR_SECTION_END_IFSET(CPU_FTR_ARCH_300)
> 
> - li  r0,PNV_THREAD_RUNNING
> - stb r0,PACA_THREAD_IDLE_STATE(r13)  /* Clear thread state */
> -
>  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
>   li  r0,KVM_HWTHREAD_IN_KERNEL
>   stb r0,HSTATE_HWTHREAD_STATE(r13)
> @@ -445,9 +442,13 @@ pnv_restore_hyp_resource_arch207:
>   ld  r2,PACATOC(r13);
> 
>   lbz r0,PACA_THREAD_IDLE_STATE(r13)
> - cmpwi   cr2,r0,PNV_THREAD_NAP
> - cmpwi   cr4,r0,PNV_THREAD_WINKLE
> - bgt cr2,pnv_wakeup_tb_loss  /* Either sleep or Winkle */
> + cmpwi   cr2,r0,PNV_THREAD_NAP
> + cmpwi   cr4,r0,PNV_THREAD_WINKLE
> + li  r0,PNV_THREAD_RUNNING
> + stb r0,PACA_THREAD_IDLE_STATE(r13)  /* Clear thread state */
> +
> + bgt cr2,pnv_wakeup_tb_loss  /* Either sleep or Winkle */
> +
> 
>   /*
>* We fall through here if PACA_THREAD_IDLE_STATE shows we are waking
> -- 
> 2.11.0
> 



Re: [PATCH 3/8] powerpc/64s: use alternative feature patching

2017-03-16 Thread Gautham R Shenoy
On Tue, Mar 14, 2017 at 07:23:44PM +1000, Nicholas Piggin wrote:
> This reduces the number of nops for POWER8.

Nice!
> 
> Signed-off-by: Nicholas Piggin 

Reviewed-by: Gautham R. Shenoy 

> ---
>  arch/powerpc/kernel/idle_book3s.S | 19 ---
>  1 file changed, 12 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/idle_book3s.S 
> b/arch/powerpc/kernel/idle_book3s.S
> index 405631b2c229..9284ea0762b1 100644
> --- a/arch/powerpc/kernel/idle_book3s.S
> +++ b/arch/powerpc/kernel/idle_book3s.S
> @@ -383,7 +383,11 @@ _GLOBAL(power9_idle_stop)
>   */
>  .global pnv_powersave_wakeup
>  pnv_powersave_wakeup:
> - bl  pnv_restore_hyp_resource
> +BEGIN_FTR_SECTION
> + bl  pnv_restore_hyp_resource_arch300
> +FTR_SECTION_ELSE
> + bl  pnv_restore_hyp_resource_arch207
> +ALT_FTR_SECTION_END_IFSET(CPU_FTR_ARCH_300)
> 
>   li  r0,PNV_THREAD_RUNNING
>   stb r0,PACA_THREAD_IDLE_STATE(r13)  /* Clear thread state */
> @@ -411,14 +415,13 @@ pnv_powersave_wakeup:
>   *
>   * cr3 - set to gt if waking up with partial/complete hypervisor state loss
>   */
> -pnv_restore_hyp_resource:
> - ld  r2,PACATOC(r13);
> -
> -BEGIN_FTR_SECTION
> +pnv_restore_hyp_resource_arch300:
>   /*
>* POWER ISA 3. Use PSSCR to determine if we
>* are waking up from deep idle state
>*/
> + ld  r2,PACATOC(r13);
> +
>   LOAD_REG_ADDRBASE(r5,pnv_first_deep_stop_state)
>   ld  r4,ADDROFF(pnv_first_deep_stop_state)(r5)
> 
> @@ -433,12 +436,14 @@ BEGIN_FTR_SECTION
> 
>   blr /* Waking up without hypervisor state loss. */
> 
> -END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
> -
> +/* Same calling convention as arch300 */
> +pnv_restore_hyp_resource_arch207:
>   /*
>* POWER ISA 2.07 or less.
>* Check if we slept with winkle.
>*/
> + ld  r2,PACATOC(r13);
> +
>   lbz r0,PACA_THREAD_IDLE_STATE(r13)
>   cmpwi   cr2,r0,PNV_THREAD_NAP
>   cmpwi   cr4,r0,PNV_THREAD_WINKLE
> -- 
> 2.11.0
> 



Re: [PATCH 2/8] powerpc/64s: stop using bit in HSPRG0 to test winkle

2017-03-16 Thread Gautham R Shenoy
Hi Nick,

On Tue, Mar 14, 2017 at 07:23:43PM +1000, Nicholas Piggin wrote:
> The POWER8 idle code has a neat trick of programming the power on engine
> to restore a low bit into HSPRG0, so idle wakeup code can test and see
> if it has been programmed this way and therefore lost all state,
> avoiding the expensive full restore if not.
> 
> However this messes with our r13 PACA pointer, and requires HSPRG0 to
> be written to throughout the exception handlers and idle wakeup, rather
> than just once on kernel entry.
> 
> Remove this complexity and assume winkle sleeps always require a state
> restore. This speedup is later re-introduced by counting per-core winkles
> and setting a bitmap of threads with state loss when all are in winkle.
>

Looks good to me.

> Signed-off-by: Nicholas Piggin 

Reviewed-by: Gautham R. Shenoy 


--
Thanks and Regards
gautham.



Re: [PATCH 1/4] crypto: powerpc - Factor out the core CRC vpmsum algorithm

2017-03-16 Thread Anton Blanchard
Hi David,

> While not part of this change, the unrolled loops look as though
> they just destroy the cpu cache.
> I'd like to be convinced that anything does CRC over long enough buffers
> to make it a gain at all.

btrfs data checksumming is one area.

> With modern (not that modern now) superscalar cpus you can often
> get the loop instructions 'for free'.

A branch on POWER8 is a three cycle redirect. The vpmsum instructions
are 6 cycles.

> Sometimes pipelining the loop is needed to get full throughput.
> Unlike the IP checksum, you don't even have to 'loop carry' the
> cpu carry flag.

It went through quite a lot of simulation to reach peak performance.
The loop is quite delicate, we have to pace it just right to avoid
some pipeline reject conditions.

Note also that we already modulo schedule the loop across three
iterations, required to hide the latency of the vpmsum instructions.

Anton


[PATCH updated] powerpc/mm/hash: Skip using reserved virtual address range

2017-03-16 Thread Aneesh Kumar K.V
Now that we use all of the available virtual address range, we need to make sure
we don't generate a VSID that overlaps with the reserved VSID range. The
reserved VSID range includes the virtual address range used by the adjunct
partition and also the VRMA virtual segment. We find the context value that
can result in generating such a VSID and reserve it early in boot.

We don't look at the adjunct range, because for now we disable adjunct usage
in a Linux LPAR via the CAS interface.
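
As an aside, the modular inverse constants this scheme relies on can be
sanity checked outside the kernel. A minimal userspace check of the
protovsid -> vsid -> protovsid round trip for 1TB segments, assuming the
68-bit VA layout introduced earlier in this series and a 64-bit gcc/clang
with __int128 (constants taken from the patch below):

#include <assert.h>

#define VSID_MULTIPLIER_1T      12538073UL
#define VSID_MULINV_1T          209034062UL
/* va_bits = 68, SID_SHIFT_1T = 40, so the vsid modulus is 2^28 - 1 */
#define VSID_MODULUS_1T         ((1UL << (68 - 40)) - 1)

int main(void)
{
        unsigned long protovsid = 0x12345UL;    /* any value below the modulus */
        /* vsid = (protovsid * multiplier) % modulus, per the vsid_unscramble() comment */
        unsigned long vsid = (unsigned __int128)protovsid *
                                VSID_MULTIPLIER_1T % VSID_MODULUS_1T;
        /* and back again with the modular inverse, as vsid_unscramble() does */
        unsigned long back = (unsigned __int128)vsid *
                                VSID_MULINV_1T % VSID_MODULUS_1T;

        assert(back == protovsid);
        return 0;
}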

Signed-off-by: Aneesh Kumar K.V 
---
Changes:
* handle context 0 correctly when reserving. (p4 and p5 will
hit that case) 

 arch/powerpc/include/asm/book3s/64/mmu-hash.h |  7 
 arch/powerpc/include/asm/kvm_book3s_64.h  |  2 -
 arch/powerpc/include/asm/mmu_context.h|  1 +
 arch/powerpc/mm/hash_utils_64.c   | 58 +++
 arch/powerpc/mm/mmu_context_book3s64.c| 28 +
 5 files changed, 94 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h 
b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
index c99ea6bbd82c..ac987e08ce63 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
@@ -578,11 +578,18 @@ extern void slb_set_size(u16 size);
 #define VSID_MULTIPLIER_256M   ASM_CONST(12538073) /* 24-bit prime */
 #define VSID_BITS_256M (VA_BITS - SID_SHIFT)
 #define VSID_BITS_65_256M  (65 - SID_SHIFT)
+/*
+ * Modular multiplicative inverse of VSID_MULTIPLIER under modulo VSID_MODULUS
+ */
+#define VSID_MULINV_256M   ASM_CONST(665548017062)
 
 #define VSID_MULTIPLIER_1T ASM_CONST(12538073) /* 24-bit prime */
 #define VSID_BITS_1T   (VA_BITS - SID_SHIFT_1T)
 #define VSID_BITS_65_1T(65 - SID_SHIFT_1T)
+#define VSID_MULINV_1T ASM_CONST(209034062)
 
+/* 1TB VSID reserved for VRMA */
+#define VRMA_VSID  0x1ffUL
 #define USER_VSID_RANGE(1UL << (ESID_BITS + SID_SHIFT))
 
 /* 4 bits per slice and we have one slice per 1TB */
diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h 
b/arch/powerpc/include/asm/kvm_book3s_64.h
index d9b48f5bb606..d55c7f881ce7 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -49,8 +49,6 @@ static inline bool kvm_is_radix(struct kvm *kvm)
 #define KVM_DEFAULT_HPT_ORDER  24  /* 16MB HPT by default */
 #endif
 
-#define VRMA_VSID  0x1ffUL /* 1TB VSID reserved for VRMA */
-
 /*
  * We use a lock bit in HPTE dword 0 to synchronize updates and
  * accesses to each HPTE, and another bit to indicate non-present
diff --git a/arch/powerpc/include/asm/mmu_context.h 
b/arch/powerpc/include/asm/mmu_context.h
index 8fe1ba1808d3..757d4a9e1a1c 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -51,6 +51,7 @@ static inline void switch_mmu_context(struct mm_struct *prev,
return switch_slb(tsk, next);
 }
 
+extern void hash__resv_context(int context_id);
 extern int hash__get_new_context(void);
 extern void __destroy_context(int context_id);
 static inline void mmu_context_init(void) { }
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index d96ba04d8844..80ae6f42854a 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -1847,4 +1847,62 @@ static int __init hash64_debugfs(void)
 }
 machine_device_initcall(pseries, hash64_debugfs);
 
+/*
+ * if modinv is the modular multiplicative inverse of x modulo vsid_modulus and
+ * vsid = (protovsid * x) % vsid_modulus, then we say
+ *
+ * protovsid = (vsid * modinv) % vsid_modulus
+ */
+static unsigned long vsid_unscramble(unsigned long vsid, int ssize)
+{
+   unsigned long protovsid;
+   unsigned long va_bits = VA_BITS;
+   unsigned long modinv, vsid_modulus;
+   unsigned long max_mod_inv, tmp_modinv;
+
+
+   if (!mmu_has_feature(MMU_FTR_68_BIT_VA))
+   va_bits = 65;
+
+   if (ssize == MMU_SEGSIZE_256M) {
+   modinv = VSID_MULINV_256M;
+   vsid_modulus = ((1UL << (va_bits - SID_SHIFT)) - 1);
+   } else {
+   modinv = VSID_MULINV_1T;
+   vsid_modulus = ((1UL << (va_bits - SID_SHIFT_1T)) - 1);
+   }
+   /*
+* vsid outside our range.
+*/
+   if (vsid >= vsid_modulus)
+   return 0;
+
+   /* Check if (vsid * modinv) overflow (63 bits) */
+   max_mod_inv = 0x7fffull / vsid;
+   if (modinv < max_mod_inv)
+   return (vsid * modinv) % vsid_modulus;
+
+   tmp_modinv = modinv/max_mod_inv;
+   modinv %= max_mod_inv;
+
+   protovsid = (((vsid * max_mod_inv) % vsid_modulus) * tmp_modinv) % 
vsid_modulus;
+   protovsid = (protovsid + vsid * modinv) % vsid_modulus;
+   return protovsid;
+}
+
+static int __init hash_init_reserved_context(void)
+{
+   unsigned long protovsid;
+
+   /*

Re: ioctl structs differ from x86_64?

2017-03-16 Thread Michael Ellerman
Harshal Patil  writes:

> Hello,
> I am looking into a bug, https://bugzilla.linux.ibm.com/show_bug.cgi?id=152493
> ( external mirror is at, https://github.com/opencontainers/runc/issues/1364) 
> Recently in runc code, they added this code
> https://github.com/opencontainers/runc/commit/eea28f480db435dbef4a275de9776b9934818b8c#diff-5f5c07d0cab3ce2086437d3d43c0d25fR155.
> As you can see they set -onlcr to get rid of \r (line no. 164). Golang, in 
> which
> runc is written, doesn't have any bindings for ioctls. This means you have to
> invoke C code directly (that's what they are doing there).
> Our guess is the ioctls in ppc64le differ from those on x86_64, and that's why
> the code which is clearing the onlcr bit
> (https://github.com/opencontainers/runc/commit/eea28f480db435dbef4a275de9776b9934818b8c#diff-5f5c07d0cab3ce2086437d3d43c0d25fR164)
> is failing on ppc64le but works fine on x86_64. 

I think you've probably got enough replies, but the short answer is
"yes", IOCTL numbers do differ across architectures - including
potentially between 32-bit and 64-bit.
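
For completeness, the usual way to sidestep the per-architecture ioctl
numbers from C (or from Go via cgo) is to go through the portable termios
API, which the C library maps to the right request values on each
architecture. A minimal sketch, with clear_onlcr() just an illustrative
name:

#include <termios.h>

/* Clear ONLCR on fd without hard-coding arch-specific ioctl numbers. */
static int clear_onlcr(int fd)
{
        struct termios t;

        if (tcgetattr(fd, &t) != 0)
                return -1;
        t.c_oflag &= ~ONLCR;
        return tcsetattr(fd, TCSANOW, &t);
}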

cheers


Re: [PATCH 1/4] crypto: powerpc - Factor out the core CRC vpmsum algorithm

2017-03-16 Thread Michael Ellerman
Daniel Axtens  writes:

> The core nuts and bolts of the crc32c vpmsum algorithm will
> also work for a number of other CRC algorithms with different
> polynomials. Factor out the function into a new asm file.
>
> To handle multiple users of the function, a user simply
> provides constants, defines the name of their CRC function,
> and then #includes the core algorithm file.
>
> Cc: Anton Blanchard 
> Signed-off-by: Daniel Axtens 
>
> --
>
> It's possible at this point to argue that the address
> of the constant tables should be passed in to the function,
> rather than doing this somewhat unconventional #include.
>
> However, we're about to add further #ifdef's back into the core
> that will be provided by the encapsulating code, and which couldn't
> be done as a variable without performance loss.
> ---
>  arch/powerpc/crypto/crc32-vpmsum_core.S | 726 
> 
>  arch/powerpc/crypto/crc32c-vpmsum_asm.S | 714 +--
>  2 files changed, 729 insertions(+), 711 deletions(-)
>  create mode 100644 arch/powerpc/crypto/crc32-vpmsum_core.S

So although this sits in arch/powerpc, it's heavy on the crypto which is
not my area of expertise (to say the least!), so I think it should
probably go via Herbert and the crypto tree?

cheers


Re: [PATCH 2/4] crypto: powerpc - Re-enable non-REFLECTed CRCs

2017-03-16 Thread Michael Ellerman
Daniel Axtens  writes:

> When CRC32c was included in the kernel, Anton ripped out
> the #ifdefs around reflected polynomials, because CRC32c
> is always reflected. However, not all CRCs use reflection
> so we'd like to make it optional.
>
> Restore the REFLECT parts from Anton's original CRC32
> implementation (https://github.com/antonblanchard/crc32-vpmsum)
>
> That implementation is available under GPLv2+, so we're OK
> from a licensing point of view:
> https://github.com/antonblanchard/crc32-vpmsum/blob/master/LICENSE.TXT

It's also written by Anton and copyright IBM, so you (we (IBM)) could
always just relicense it anyway.

So doubly OK IMO.

cheers


[PATCH V4 10/14] powerpc/mm/hash: Convert mask to unsigned long

2017-03-16 Thread Aneesh Kumar K.V
This doesn't have any functional change, but it helps avoid mistakes in case
the shift value changes.
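
As a contrived userspace illustration of the hazard (not from the patch): a
32-bit mask type silently drops the high bits once the shift grows past 31,
which is the class of mistake the unsigned long type avoids. unsigned int is
used for the "bad" case only to keep the example well defined; plain int has
the same problem.

#include <assert.h>

int main(void)
{
        unsigned long shift = 40;                       /* imagine the shift growing */
        unsigned int  bad   = (1ul << shift) - 1;       /* truncated to 0xffffffff */
        unsigned long good  = (1ul << shift) - 1;       /* 0xffffffffff as intended */

        assert(bad != good);
        return 0;
}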

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/64/mmu-hash.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h 
b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
index 078d7bf93a69..73f34a98ce99 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
@@ -409,7 +409,7 @@ static inline unsigned long hpt_vpn(unsigned long ea,
 static inline unsigned long hpt_hash(unsigned long vpn,
 unsigned int shift, int ssize)
 {
-   int mask;
+   unsigned long mask;
unsigned long hash, vsid;
 
/* VPN_SHIFT can be atmost 12 */
-- 
2.7.4



[PATCH V4 11/14] powerpc/mm/hash: Increase VA range to 128TB

2017-03-16 Thread Aneesh Kumar K.V
We update the hash Linux page table layout so that we can support 512TB, but
we limit TASK_SIZE to 128TB. We can switch to 128TB by default unconditionally,
because that is the max virtual address supported by other architectures. We
will later add a mechanism to increase the application's effective address
range to 512TB on demand.

Having the page table layout accommodate 512TB makes testing large memory
configurations easier, with fewer code changes to the kernel.
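
(For reference, the new index sizes give 16 + 8 + 5 + 5 + 15 = 49 bits of
page table reach with 64K pages and 12 + 9 + 7 + 9 + 12 = 49 bits with 4K
pages, i.e. 512TB in both configurations, while TASK_SIZE_128TB below stays
at 2^47 = 128TB.)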

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/64/hash-4k.h  |  2 +-
 arch/powerpc/include/asm/book3s/64/hash-64k.h |  2 +-
 arch/powerpc/include/asm/processor.h  | 22 ++
 3 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h 
b/arch/powerpc/include/asm/book3s/64/hash-4k.h
index 0c4e470571ca..b4b5e6b671ca 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
@@ -8,7 +8,7 @@
 #define H_PTE_INDEX_SIZE  9
 #define H_PMD_INDEX_SIZE  7
 #define H_PUD_INDEX_SIZE  9
-#define H_PGD_INDEX_SIZE  9
+#define H_PGD_INDEX_SIZE  12
 
 #ifndef __ASSEMBLY__
 #define H_PTE_TABLE_SIZE   (sizeof(pte_t) << H_PTE_INDEX_SIZE)
diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h 
b/arch/powerpc/include/asm/book3s/64/hash-64k.h
index 7be54f9590a3..214219dff87c 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-64k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h
@@ -4,7 +4,7 @@
 #define H_PTE_INDEX_SIZE  8
 #define H_PMD_INDEX_SIZE  5
 #define H_PUD_INDEX_SIZE  5
-#define H_PGD_INDEX_SIZE  12
+#define H_PGD_INDEX_SIZE  15
 
 /*
  * 64k aligned address free up few of the lower bits of RPN for us
diff --git a/arch/powerpc/include/asm/processor.h 
b/arch/powerpc/include/asm/processor.h
index e0fecbcea2a2..146c3a91d89f 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -102,11 +102,25 @@ void release_thread(struct task_struct *);
 #endif
 
 #ifdef CONFIG_PPC64
-/* 64-bit user address space is 46-bits (64TB user VM) */
-#define TASK_SIZE_USER64 (0x4000UL)
+/*
+ * 64-bit user address space can have multiple limits
+ * For now supported values are:
+ */
+#define TASK_SIZE_64TB  (0x4000UL)
+#define TASK_SIZE_128TB (0x8000UL)
+#define TASK_SIZE_512TB (0x0002UL)
 
-/* 
- * 32-bit user address space is 4GB - 1 page 
+#ifdef CONFIG_PPC_BOOK3S_64
+/*
+ * Max value currently used:
+ */
+#define TASK_SIZE_USER64 TASK_SIZE_128TB
+#else
+#define TASK_SIZE_USER64 TASK_SIZE_64TB
+#endif
+
+/*
+ * 32-bit user address space is 4GB - 1 page
  * (this 1 page is needed so referencing of 0x generates EFAULT
  */
 #define TASK_SIZE_USER32 (0x0001UL - (1*PAGE_SIZE))
-- 
2.7.4



[PATCH V4 14/14] powerpc/mm/hash: Skip using reserved virtual address range

2017-03-16 Thread Aneesh Kumar K.V
Now that we use all of the available virtual address range, we need to make sure
we don't generate a VSID that overlaps with the reserved VSID range. The
reserved VSID range includes the virtual address range used by the adjunct
partition and also the VRMA virtual segment. We find the context value that
can result in generating such a VSID and reserve it early in boot.

We don't look at the adjunct range, because for now we disable adjunct usage
in a Linux LPAR via the CAS interface.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/64/mmu-hash.h |  7 
 arch/powerpc/include/asm/kvm_book3s_64.h  |  2 -
 arch/powerpc/include/asm/mmu_context.h|  1 +
 arch/powerpc/mm/hash_utils_64.c   | 58 +++
 arch/powerpc/mm/mmu_context_book3s64.c| 23 +++
 5 files changed, 89 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h 
b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
index c99ea6bbd82c..ac987e08ce63 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
@@ -578,11 +578,18 @@ extern void slb_set_size(u16 size);
 #define VSID_MULTIPLIER_256M   ASM_CONST(12538073) /* 24-bit prime */
 #define VSID_BITS_256M (VA_BITS - SID_SHIFT)
 #define VSID_BITS_65_256M  (65 - SID_SHIFT)
+/*
+ * Modular multiplicative inverse of VSID_MULTIPLIER under modulo VSID_MODULUS
+ */
+#define VSID_MULINV_256M   ASM_CONST(665548017062)
 
 #define VSID_MULTIPLIER_1T ASM_CONST(12538073) /* 24-bit prime */
 #define VSID_BITS_1T   (VA_BITS - SID_SHIFT_1T)
 #define VSID_BITS_65_1T(65 - SID_SHIFT_1T)
+#define VSID_MULINV_1T ASM_CONST(209034062)
 
+/* 1TB VSID reserved for VRMA */
+#define VRMA_VSID  0x1ffUL
 #define USER_VSID_RANGE(1UL << (ESID_BITS + SID_SHIFT))
 
 /* 4 bits per slice and we have one slice per 1TB */
diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h 
b/arch/powerpc/include/asm/kvm_book3s_64.h
index d9b48f5bb606..d55c7f881ce7 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -49,8 +49,6 @@ static inline bool kvm_is_radix(struct kvm *kvm)
 #define KVM_DEFAULT_HPT_ORDER  24  /* 16MB HPT by default */
 #endif
 
-#define VRMA_VSID  0x1ffUL /* 1TB VSID reserved for VRMA */
-
 /*
  * We use a lock bit in HPTE dword 0 to synchronize updates and
  * accesses to each HPTE, and another bit to indicate non-present
diff --git a/arch/powerpc/include/asm/mmu_context.h 
b/arch/powerpc/include/asm/mmu_context.h
index 8fe1ba1808d3..757d4a9e1a1c 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -51,6 +51,7 @@ static inline void switch_mmu_context(struct mm_struct *prev,
return switch_slb(tsk, next);
 }
 
+extern void hash__resv_context(int context_id);
 extern int hash__get_new_context(void);
 extern void __destroy_context(int context_id);
 static inline void mmu_context_init(void) { }
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index d96ba04d8844..80ae6f42854a 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -1847,4 +1847,62 @@ static int __init hash64_debugfs(void)
 }
 machine_device_initcall(pseries, hash64_debugfs);
 
+/*
+ * if modinv is the modular multiplicative inverse of x modulo vsid_modulus and
+ * vsid = (protovsid * x) % vsid_modulus, then we say
+ *
+ * protovsid = (vsid * modinv) % vsid_modulus
+ */
+static unsigned long vsid_unscramble(unsigned long vsid, int ssize)
+{
+   unsigned long protovsid;
+   unsigned long va_bits = VA_BITS;
+   unsigned long modinv, vsid_modulus;
+   unsigned long max_mod_inv, tmp_modinv;
+
+
+   if (!mmu_has_feature(MMU_FTR_68_BIT_VA))
+   va_bits = 65;
+
+   if (ssize == MMU_SEGSIZE_256M) {
+   modinv = VSID_MULINV_256M;
+   vsid_modulus = ((1UL << (va_bits - SID_SHIFT)) - 1);
+   } else {
+   modinv = VSID_MULINV_1T;
+   vsid_modulus = ((1UL << (va_bits - SID_SHIFT_1T)) - 1);
+   }
+   /*
+* vsid outside our range.
+*/
+   if (vsid >= vsid_modulus)
+   return 0;
+
+   /* Check if (vsid * modinv) overflow (63 bits) */
+   max_mod_inv = 0x7fffull / vsid;
+   if (modinv < max_mod_inv)
+   return (vsid * modinv) % vsid_modulus;
+
+   tmp_modinv = modinv/max_mod_inv;
+   modinv %= max_mod_inv;
+
+   protovsid = (((vsid * max_mod_inv) % vsid_modulus) * tmp_modinv) % 
vsid_modulus;
+   protovsid = (protovsid + vsid * modinv) % vsid_modulus;
+   return protovsid;
+}
+
+static int __init hash_init_reserved_context(void)
+{
+   unsigned long protovsid;
+
+   /*
+* VRMA_VSID to skip list. We don't bother about
+* 

[PATCH V4 13/14] powerpc/mm/hash64: Store task size in PACA

2017-03-16 Thread Aneesh Kumar K.V
We can optimize the slice page size array copy to the paca by copying only the
range based on the task size. This requires us to not look at the page size
array beyond the task size in the paca on an SLB fault. To enable that, copy
the task size to the paca, where it will be used during SLB faults.

We can take an SLB fault on an mm even before we set the task_size in
setup_new_exec. To make sure our paca has the details of the default page
size, init mm->task_size with the max value early. Later we will adjust this
based on the task personality.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/paca.h   | 4 +++-
 arch/powerpc/kernel/asm-offsets.c | 4 
 arch/powerpc/kernel/paca.c| 1 +
 arch/powerpc/mm/slb_low.S | 8 +++-
 4 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index f48c250339fd..25f4a1c14759 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -144,7 +144,9 @@ struct paca_struct {
u16 mm_ctx_sllp;
 #endif
 #endif
-
+#ifdef CONFIG_PPC_STD_MMU_64
+   u64 task_size;
+#endif
/*
 * then miscellaneous read-write fields
 */
diff --git a/arch/powerpc/kernel/asm-offsets.c 
b/arch/powerpc/kernel/asm-offsets.c
index 4367e7df51a1..a60ef1d976ab 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -189,6 +189,10 @@ int main(void)
 #endif /* CONFIG_PPC_MM_SLICES */
 #endif
 
+#ifdef CONFIG_PPC_STD_MMU_64
+   DEFINE(PACATASKSIZE, offsetof(struct paca_struct, task_size));
+#endif
+
 #ifdef CONFIG_PPC_BOOK3E
OFFSET(PACAPGD, paca_struct, pgd);
OFFSET(PACA_KERNELPGD, paca_struct, kernel_pgd);
diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c
index bffdbd6d6774..50b60e23d07f 100644
--- a/arch/powerpc/kernel/paca.c
+++ b/arch/powerpc/kernel/paca.c
@@ -254,6 +254,7 @@ void copy_mm_to_paca(struct mm_struct *mm)
get_paca()->mm_ctx_id = context->id;
 #ifdef CONFIG_PPC_MM_SLICES
VM_BUG_ON(!mm->task_size);
+   get_paca()->task_size = mm->task_size;
get_paca()->mm_ctx_low_slices_psize = context->low_slices_psize;
memcpy(&get_paca()->mm_ctx_high_slices_psize,
   &context->high_slices_psize, TASK_SLICE_ARRAY_SZ(mm));
diff --git a/arch/powerpc/mm/slb_low.S b/arch/powerpc/mm/slb_low.S
index 35e91e89640f..b09e7748856f 100644
--- a/arch/powerpc/mm/slb_low.S
+++ b/arch/powerpc/mm/slb_low.S
@@ -149,7 +149,13 @@ END_MMU_FTR_SECTION_IFCLR(MMU_FTR_1T_SEGMENT)
 * For userspace addresses, make sure this is region 0.
 */
cmpdi   r9, 0
-   bne 8f
+   bne-8f
+/*
+ * user space make sure we are within the allowed limit
+*/
+   ld  r11,PACATASKSIZE(r13)
+   cmpld   r3,r11
+   bge-8f
 
/* when using slices, we extract the psize off the slice bitmaps
 * and then we need to get the sllp encoding off the mmu_psize_defs
-- 
2.7.4



[PATCH V4 12/14] powerpc/mm/slice: Use mm task_size as max value of slice index

2017-03-16 Thread Aneesh Kumar K.V
In the follow-up patch, we will increase the slice array size to handle the
512TB range, but will limit the task size to 128TB. Avoid doing unnecessary
computation and avoid doing slice mask related operations above task_size.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/64/mmu-hash.h |  3 ++-
 arch/powerpc/kernel/paca.c|  3 ++-
 arch/powerpc/kernel/setup-common.c|  9 +
 arch/powerpc/mm/mmu_context_book3s64.c|  8 
 arch/powerpc/mm/slice.c   | 22 --
 5 files changed, 33 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h 
b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
index 73f34a98ce99..c99ea6bbd82c 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
@@ -586,7 +586,8 @@ extern void slb_set_size(u16 size);
 #define USER_VSID_RANGE(1UL << (ESID_BITS + SID_SHIFT))
 
 /* 4 bits per slice and we have one slice per 1TB */
-#define SLICE_ARRAY_SIZE  (H_PGTABLE_RANGE >> 41)
+#define SLICE_ARRAY_SIZE   (H_PGTABLE_RANGE >> 41)
+#define TASK_SLICE_ARRAY_SZ(x) ((x)->task_size >> 41)
 
 #ifndef __ASSEMBLY__
 
diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c
index e2cf745a4b94..bffdbd6d6774 100644
--- a/arch/powerpc/kernel/paca.c
+++ b/arch/powerpc/kernel/paca.c
@@ -253,9 +253,10 @@ void copy_mm_to_paca(struct mm_struct *mm)
 
get_paca()->mm_ctx_id = context->id;
 #ifdef CONFIG_PPC_MM_SLICES
+   VM_BUG_ON(!mm->task_size);
get_paca()->mm_ctx_low_slices_psize = context->low_slices_psize;
memcpy(&get_paca()->mm_ctx_high_slices_psize,
-  &context->high_slices_psize, SLICE_ARRAY_SIZE);
+  &context->high_slices_psize, TASK_SLICE_ARRAY_SZ(mm));
 #else /* CONFIG_PPC_MM_SLICES */
get_paca()->mm_ctx_user_psize = context->user_psize;
get_paca()->mm_ctx_sllp = context->sllp;
diff --git a/arch/powerpc/kernel/setup-common.c 
b/arch/powerpc/kernel/setup-common.c
index 4697da895133..aaf1e2befbcb 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -920,6 +920,15 @@ void __init setup_arch(char **cmdline_p)
init_mm.end_code = (unsigned long) _etext;
init_mm.end_data = (unsigned long) _edata;
init_mm.brk = klimit;
+
+#ifdef CONFIG_PPC_MM_SLICES
+#ifdef CONFIG_PPC64
+   init_mm.task_size = TASK_SIZE_USER64;
+#else
+#error "task_size not initialized."
+#endif
+#endif
+
 #ifdef CONFIG_PPC_64K_PAGES
init_mm.context.pte_frag = NULL;
 #endif
diff --git a/arch/powerpc/mm/mmu_context_book3s64.c 
b/arch/powerpc/mm/mmu_context_book3s64.c
index e6a5bcbf8abe..9ab6cd2923be 100644
--- a/arch/powerpc/mm/mmu_context_book3s64.c
+++ b/arch/powerpc/mm/mmu_context_book3s64.c
@@ -86,6 +86,14 @@ static int hash__init_new_context(struct mm_struct *mm)
 * We should not be calling init_new_context() on init_mm. Hence a
 * check against 0 is ok.
 */
+#ifdef CONFIG_PPC_MM_SLICES
+   /*
+* We do switch_slb early in fork, even before we setup the 
mm->task_size.
+* Default to max task size so that we copy the default values to paca
+* which will help us to handle slb miss early.
+*/
+   mm->task_size = TASK_SIZE_USER64;
+#endif
if (mm->context.id == 0)
slice_set_user_psize(mm, mmu_virtual_psize);
subpage_prot_init_new_context(mm);
diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c
index 29154cc6a90a..5347867225ba 100644
--- a/arch/powerpc/mm/slice.c
+++ b/arch/powerpc/mm/slice.c
@@ -136,7 +136,7 @@ static void slice_mask_for_free(struct mm_struct *mm, 
struct slice_mask *ret)
if (mm->task_size <= SLICE_LOW_TOP)
return;
 
-   for (i = 0; i < SLICE_NUM_HIGH; i++)
+   for (i = 0; i < GET_HIGH_SLICE_INDEX(mm->task_size); i++)
if (!slice_high_has_vma(mm, i))
__set_bit(i, ret->high_slices);
 }
@@ -157,7 +157,7 @@ static void slice_mask_for_size(struct mm_struct *mm, int 
psize, struct slice_ma
ret->low_slices |= 1u << i;
 
hpsizes = mm->context.high_slices_psize;
-   for (i = 0; i < SLICE_NUM_HIGH; i++) {
+   for (i = 0; i < GET_HIGH_SLICE_INDEX(mm->task_size); i++) {
mask_index = i & 0x1;
index = i >> 1;
if (((hpsizes[index] >> (mask_index * 4)) & 0xf) == psize)
@@ -165,15 +165,17 @@ static void slice_mask_for_size(struct mm_struct *mm, int 
psize, struct slice_ma
}
 }
 
-static int slice_check_fit(struct slice_mask mask, struct slice_mask available)
+static int slice_check_fit(struct mm_struct *mm,
+  struct slice_mask mask, struct slice_mask available)
 {
DECLARE_BITMAP(result, SLICE_NUM_HIGH);
+   unsigned long slice_count = GET_HIGH_SLICE_INDEX(mm->task_size);
 

[PATCH V4 05/14] powerpc/mm/slice: Move slice_mask struct definition to slice.c

2017-03-16 Thread Aneesh Kumar K.V
This structure definition need not be in a header since it is used only by
slice.c. So move it to slice.c. This also allows us to use SLICE_NUM_HIGH
instead of the hard-coded 64.

I also switch the low_slices type from u16 to u64. This doesn't change the
size of the struct, because of the padding added after the u16 member, and
it lets us use the bitmap printing functions to print the slice mask.
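
As a quick sanity check of the size claim, here is a minimal userspace
sketch (the struct names are stand-ins for the kernel's DECLARE_BITMAP
member and are not part of the patch):

#include <stdint.h>
#include <stdio.h>

/* Stand-ins for the kernel struct: the u16 is padded up to the 8-byte
 * alignment of the following 64-bit bitmap word, so widening it to u64
 * does not grow the struct. */
struct slice_mask_u16 { uint16_t low_slices; uint64_t high_slices[1]; };
struct slice_mask_u64 { uint64_t low_slices; uint64_t high_slices[1]; };

int main(void)
{
	/* Both print 16 on an LP64 target. */
	printf("%zu %zu\n", sizeof(struct slice_mask_u16),
	       sizeof(struct slice_mask_u64));
	return 0;
}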

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/page_64.h | 11 ---
 arch/powerpc/mm/slice.c| 10 +-
 2 files changed, 9 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/include/asm/page_64.h 
b/arch/powerpc/include/asm/page_64.h
index bd55ff751938..c4d9654bd637 100644
--- a/arch/powerpc/include/asm/page_64.h
+++ b/arch/powerpc/include/asm/page_64.h
@@ -99,17 +99,6 @@ extern u64 ppc64_pft_size;
 #define GET_HIGH_SLICE_INDEX(addr) ((addr) >> SLICE_HIGH_SHIFT)
 
 #ifndef __ASSEMBLY__
-/*
- * One bit per slice. We have lower slices which cover 256MB segments
- * upto 4G range. That gets us 16 low slices. For the rest we track slices
- * in 1TB size.
- * 64 below is actually SLICE_NUM_HIGH to fixup complie errros
- */
-struct slice_mask {
-   u16 low_slices;
-   DECLARE_BITMAP(high_slices, 64);
-};
-
 struct mm_struct;
 
 extern unsigned long slice_get_unmapped_area(unsigned long addr,
diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c
index 2546204856f5..14e9fb525172 100644
--- a/arch/powerpc/mm/slice.c
+++ b/arch/powerpc/mm/slice.c
@@ -37,7 +37,15 @@
 #include 
 
 static DEFINE_SPINLOCK(slice_convert_lock);
-
+/*
+ * One bit per slice. We have lower slices which cover 256MB segments
+ * upto 4G range. That gets us 16 low slices. For the rest we track slices
+ * in 1TB size.
+ */
+struct slice_mask {
+   u64 low_slices;
+   DECLARE_BITMAP(high_slices, SLICE_NUM_HIGH);
+};
 
 #ifdef DEBUG
 int _slice_debug = 1;
-- 
2.7.4



[PATCH V4 09/14] powerpc/mm/hash: VSID 0 is no more an invalid VSID

2017-03-16 Thread Aneesh Kumar K.V
VSID 0 is now used by the linearly mapped region of the kernel. User space
still must not see VSID 0, but keeping the VSID-based check only confuses
the reader. Remove it and base the error checking on the address value
instead.
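
A minimal sketch of the address-based check that replaces the VSID 0
convention (hash_ea_is_valid() is a hypothetical helper; the real change is
open-coded in hash_page_mm() below):

/* Sketch only: validate the EA up front instead of deriving a VSID and
 * testing it for 0.  REGION_MASK and H_PGTABLE_RANGE are the existing
 * kernel constants. */
static inline bool hash_ea_is_valid(unsigned long ea)
{
	return (ea & ~REGION_MASK) < H_PGTABLE_RANGE;
}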

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/64/mmu-hash.h |  6 --
 arch/powerpc/mm/hash_utils_64.c   | 19 +++
 arch/powerpc/mm/pgtable-hash64.c  |  1 -
 arch/powerpc/mm/tlb_hash64.c  |  1 -
 4 files changed, 7 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h 
b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
index 3897d30820b0..078d7bf93a69 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
@@ -673,12 +673,6 @@ static inline unsigned long get_vsid(unsigned long 
context, unsigned long ea,
unsigned long vsid_bits;
unsigned long protovsid;
 
-   /*
-* Bad address. We return VSID 0 for that
-*/
-   if ((ea & ~REGION_MASK) >= H_PGTABLE_RANGE)
-   return 0;
-
if (!mmu_has_feature(MMU_FTR_68_BIT_VA))
va_bits = 65;
 
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index 0e84200a88f2..d96ba04d8844 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -1223,6 +1223,13 @@ int hash_page_mm(struct mm_struct *mm, unsigned long ea,
ea, access, trap);
trace_hash_fault(ea, access, trap);
 
+   /* Bad address. */
+   if ((ea & ~REGION_MASK) >= H_PGTABLE_RANGE) {
+   DBG_LOW("Bad address!\n");
+   rc = 1;
+   goto bail;
+   }
+
/* Get region & vsid */
switch (REGION_ID(ea)) {
case USER_REGION_ID:
@@ -1253,12 +1260,6 @@ int hash_page_mm(struct mm_struct *mm, unsigned long ea,
}
DBG_LOW(" mm=%p, mm->pgdir=%p, vsid=%016lx\n", mm, mm->pgd, vsid);
 
-   /* Bad address. */
-   if (!vsid) {
-   DBG_LOW("Bad address!\n");
-   rc = 1;
-   goto bail;
-   }
/* Get pgdir */
pgdir = mm->pgd;
if (pgdir == NULL) {
@@ -1501,8 +1502,6 @@ void hash_preload(struct mm_struct *mm, unsigned long ea,
/* Get VSID */
ssize = user_segment_size(ea);
vsid = get_vsid(mm->context.id, ea, ssize);
-   if (!vsid)
-   return;
/*
 * Hash doesn't like irqs. Walking linux page table with irq disabled
 * saves us from holding multiple locks.
@@ -1747,10 +1746,6 @@ static void kernel_map_linear_page(unsigned long vaddr, 
unsigned long lmi)
 
hash = hpt_hash(vpn, PAGE_SHIFT, mmu_kernel_ssize);
 
-   /* Don't create HPTE entries for bad address */
-   if (!vsid)
-   return;
-
ret = hpte_insert_repeating(hash, vpn, __pa(vaddr), mode,
HPTE_V_BOLTED,
mmu_linear_psize, mmu_kernel_ssize);
diff --git a/arch/powerpc/mm/pgtable-hash64.c b/arch/powerpc/mm/pgtable-hash64.c
index 8b85a14b08ea..ddfeb141af29 100644
--- a/arch/powerpc/mm/pgtable-hash64.c
+++ b/arch/powerpc/mm/pgtable-hash64.c
@@ -263,7 +263,6 @@ void hpte_do_hugepage_flush(struct mm_struct *mm, unsigned 
long addr,
if (!is_kernel_addr(addr)) {
ssize = user_segment_size(addr);
vsid = get_vsid(mm->context.id, addr, ssize);
-   WARN_ON(vsid == 0);
} else {
vsid = get_kernel_vsid(addr, mmu_kernel_ssize);
ssize = mmu_kernel_ssize;
diff --git a/arch/powerpc/mm/tlb_hash64.c b/arch/powerpc/mm/tlb_hash64.c
index 4517aa43a8b1..d8fa336bf05d 100644
--- a/arch/powerpc/mm/tlb_hash64.c
+++ b/arch/powerpc/mm/tlb_hash64.c
@@ -87,7 +87,6 @@ void hpte_need_flush(struct mm_struct *mm, unsigned long addr,
vsid = get_kernel_vsid(addr, mmu_kernel_ssize);
ssize = mmu_kernel_ssize;
}
-   WARN_ON(vsid == 0);
vpn = hpt_vpn(addr, vsid, ssize);
rpte = __real_pte(__pte(pte), ptep);
 
-- 
2.7.4



[PATCH V4 04/14] powerpc/mm: Remove redundant TASK_SIZE_USER64 checks

2017-03-16 Thread Aneesh Kumar K.V
The check against the VSID range is implied when we check the task size
against the hash and radix pgtable ranges [1], because we make sure the
page table range cannot exceed the VSID range.

[1] BUILD_BUG_ON(TASK_SIZE_USER64 > H_PGTABLE_RANGE);
    BUILD_BUG_ON(TASK_SIZE_USER64 > RADIX_PGTABLE_RANGE);

The check for a smaller task size is also removed here, because a follow-up
patch will support a task size smaller than the pgtable range.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/init_64.c| 4 
 arch/powerpc/mm/pgtable_64.c | 5 -
 2 files changed, 9 deletions(-)

diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index 9be992083d2a..8f6f2a173e47 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -71,10 +71,6 @@
 #if H_PGTABLE_RANGE > USER_VSID_RANGE
 #warning Limited user VSID range means pagetable space is wasted
 #endif
-
-#if (TASK_SIZE_USER64 < H_PGTABLE_RANGE) && (TASK_SIZE_USER64 < 
USER_VSID_RANGE)
-#warning TASK_SIZE is smaller than it needs to be.
-#endif
 #endif /* CONFIG_PPC_STD_MMU_64 */
 
 phys_addr_t memstart_addr = ~0;
diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
index ac0c7ee60de0..af660f2b2ae3 100644
--- a/arch/powerpc/mm/pgtable_64.c
+++ b/arch/powerpc/mm/pgtable_64.c
@@ -56,11 +56,6 @@
 
 #include "mmu_decl.h"
 
-#ifdef CONFIG_PPC_STD_MMU_64
-#if TASK_SIZE_USER64 > (1UL << (ESID_BITS + SID_SHIFT))
-#error TASK_SIZE_USER64 exceeds user VSID range
-#endif
-#endif
 
 #ifdef CONFIG_PPC_BOOK3S_64
 /*
-- 
2.7.4



[PATCH V4 07/14] powerpc/mm/hash: Move kernel context to the starting of context range

2017-03-16 Thread Aneesh Kumar K.V
With the current kernel, we use the top 4 contexts for the kernel. Kernel
VSIDs are built from these top context values and the effective segment ID.
In the following patches, we want to increase the max effective address to
512TB. We achieve that by increasing the number of effective segment IDs,
thereby increasing the virtual address range.

We will switch to a 68-bit virtual address in a following patch. But
platforms like p4 and p5 only support a 65-bit VA, so on them we want to
limit the virtual address to 65 bits. We do that by limiting the context
bits to 16 instead of 19, which means the max context value differs between
platforms.

To make this simpler, we move the kernel contexts to the start of the range.
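
A hedged sketch of what moving the kernel contexts to the start buys us
(kernel_context_of() is illustrative; the real change is in
get_kernel_vsid() in the hunk below):

/* Sketch: with kernel contexts 0..3, the context id falls straight out
 * of the top nibble of the effective address, independent of
 * MAX_USER_CONTEXT (and hence of the platform's context-bit limit). */
static inline unsigned long kernel_context_of(unsigned long ea)
{
	return (ea >> 60) - 0xc;	/* 0xc->0, 0xd->1, 0xe->2, 0xf->3 */
}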

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/64/mmu-hash.h | 39 ++--
 arch/powerpc/include/asm/mmu_context.h|  2 +-
 arch/powerpc/kvm/book3s_64_mmu_host.c |  2 +-
 arch/powerpc/mm/hash_utils_64.c   |  5 --
 arch/powerpc/mm/mmu_context_book3s64.c| 88 ++-
 arch/powerpc/mm/slb_low.S | 20 ++
 6 files changed, 84 insertions(+), 72 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h 
b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
index 52d8d1e4b772..37dbc9becaba 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
@@ -494,10 +494,10 @@ extern void slb_set_size(u16 size);
  * For user processes max context id is limited to ((1ul << 19) - 5)
  * for kernel space, we use the top 4 context ids to map address as below
  * NOTE: each context only support 64TB now.
- * 0x7fffc -  [ 0xc000 - 0xc0003fff ]
- * 0x7fffd -  [ 0xd000 - 0xd0003fff ]
- * 0x7fffe -  [ 0xe000 - 0xe0003fff ]
- * 0x7 -  [ 0xf000 - 0xf0003fff ]
+ * 0x0 -  [ 0xc000 - 0xc0003fff ]
+ * 0x1 -  [ 0xd000 - 0xd0003fff ]
+ * 0x2 -  [ 0xe000 - 0xe0003fff ]
+ * 0x3 -  [ 0xf000 - 0xf0003fff ]
  *
  * The proto-VSIDs are then scrambled into real VSIDs with the
  * multiplicative hash:
@@ -511,15 +511,9 @@ extern void slb_set_size(u16 size);
  * robust scattering in the hash table (at least based on some initial
  * results).
  *
- * We also consider VSID 0 special. We use VSID 0 for slb entries mapping
- * bad address. This enables us to consolidate bad address handling in
- * hash_page.
- *
  * We also need to avoid the last segment of the last context, because that
  * would give a protovsid of 0x1f. That will result in a VSID 0
- * because of the modulo operation in vsid scramble. But the vmemmap
- * (which is what uses region 0xf) will never be close to 64TB in size
- * (it's 56 bytes per page of system memory).
+ * because of the modulo operation in vsid scramble.
  */
 
 #define CONTEXT_BITS   19
@@ -532,12 +526,15 @@ extern void slb_set_size(u16 size);
 /*
  * 256MB segment
  * The proto-VSID space has 2^(CONTEX_BITS + ESID_BITS) - 1 segments
- * available for user + kernel mapping. The top 4 contexts are used for
+ * available for user + kernel mapping. The bottom 4 contexts are used for
  * kernel mapping. Each segment contains 2^28 bytes. Each
- * context maps 2^46 bytes (64TB) so we can support 2^19-1 contexts
- * (19 == 37 + 28 - 46).
+ * context maps 2^46 bytes (64TB).
+ *
+ * We also need to avoid the last segment of the last context, because that
+ * would give a protovsid of 0x1f. That will result in a VSID 0
+ * because of the modulo operation in vsid scramble.
  */
-#define MAX_USER_CONTEXT   ((ASM_CONST(1) << CONTEXT_BITS) - 5)
+#define MAX_USER_CONTEXT   ((ASM_CONST(1) << CONTEXT_BITS) - 2)
 
 /*
  * This should be computed such that protovosid * vsid_mulitplier
@@ -673,19 +670,19 @@ static inline unsigned long get_vsid(unsigned long 
context, unsigned long ea,
  * This is only valid for addresses >= PAGE_OFFSET
  *
  * For kernel space, we use the top 4 context ids to map address as below
- * 0x7fffc -  [ 0xc000 - 0xc0003fff ]
- * 0x7fffd -  [ 0xd000 - 0xd0003fff ]
- * 0x7fffe -  [ 0xe000 - 0xe0003fff ]
- * 0x7 -  [ 0xf000 - 0xf0003fff ]
+ * 0x0 -  [ 0xc000 - 0xc0003fff ]
+ * 0x1 -  [ 0xd000 - 0xd0003fff ]
+ * 0x2 -  [ 0xe000 - 0xe0003fff ]
+ * 0x3 -  [ 0xf000 - 0xf0003fff ]
  */
 static inline unsigned long get_kernel_vsid(unsigned long ea, int ssize)
 {
unsigned long context;
 
/*
-* kernel take the top 4 context from the available range
+* kernel take the first 4 context from the available range
 */
-   context = (MAX_USER_CONTEXT) + ((ea >> 60) - 0xc) + 

[PATCH V4 08/14] powerpc/mm/hash: Support 68 bit VA

2017-03-16 Thread Aneesh Kumar K.V
In order to support a large effective address range (512TB), we want to
increase the virtual address bits to 68. But we do have platforms like p4
and p5 that can only do a 65-bit VA. We support those platforms by limiting
the context bits on them to 16.

The protovsid -> vsid conversion is verified to work with both 65- and
68-bit VA values. I also documented the restrictions in a table as part of
the code comments.
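
The bit budget behind the table in the code comments, as a hedged sketch
(the SKETCH_* names are mine, not the kernel's):

/* Sketch: VA decomposition for 256MB segments (SID_SHIFT = 28).
 *
 *   VA_BITS = CONTEXT_BITS + ESID_BITS + SID_SHIFT
 *     68    =     19       +    21     +    28
 *     65    =     16       +    21     +    28     (p4/p5)
 *
 * VSID_BITS = VA_BITS - SID_SHIFT, i.e. 40 (68-bit VA) or 37 (65-bit VA),
 * matching the VSID_BITS_68VA / VSID_BITS_65VA rows of the table. */
#define SKETCH_SID_SHIFT	28
#define SKETCH_CONTEXT_BITS	19
#define SKETCH_VA_BITS		68
#define SKETCH_ESID_BITS	(SKETCH_VA_BITS - (SKETCH_SID_SHIFT + SKETCH_CONTEXT_BITS))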

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/64/mmu-hash.h | 125 --
 arch/powerpc/include/asm/mmu.h|  19 ++--
 arch/powerpc/kvm/book3s_64_mmu_host.c |   8 +-
 arch/powerpc/mm/mmu_context_book3s64.c|   8 +-
 arch/powerpc/mm/slb_low.S |  54 +--
 5 files changed, 150 insertions(+), 64 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h 
b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
index 37dbc9becaba..3897d30820b0 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
@@ -39,6 +39,7 @@
 
 /* Bits in the SLB VSID word */
 #define SLB_VSID_SHIFT 12
+#define SLB_VSID_SHIFT_256MSLB_VSID_SHIFT
 #define SLB_VSID_SHIFT_1T  24
 #define SLB_VSID_SSIZE_SHIFT   62
 #define SLB_VSID_B ASM_CONST(0xc000)
@@ -516,9 +517,19 @@ extern void slb_set_size(u16 size);
  * because of the modulo operation in vsid scramble.
  */
 
+/*
+ * Max Va bits we support as of now is 68 bits. We want 19 bit
+ * context ID.
+ * Restrictions:
+ * GPU has restrictions of not able to access beyond 128TB
+ * (47 bit effective address). We also cannot do more than 20bit PID.
+ * For p4 and p5 which can only do 65 bit VA, we restrict our CONTEXT_BITS
+ * to 16 bits (ie, we can only have 2^16 pids at the same time).
+ */
+#define VA_BITS68
 #define CONTEXT_BITS   19
-#define ESID_BITS  18
-#define ESID_BITS_1T   6
+#define ESID_BITS  (VA_BITS - (SID_SHIFT + CONTEXT_BITS))
+#define ESID_BITS_1T   (VA_BITS - (SID_SHIFT_1T + CONTEXT_BITS))
 
 #define ESID_BITS_MASK ((1 << ESID_BITS) - 1)
 #define ESID_BITS_1T_MASK  ((1 << ESID_BITS_1T) - 1)
@@ -528,62 +539,52 @@ extern void slb_set_size(u16 size);
  * The proto-VSID space has 2^(CONTEX_BITS + ESID_BITS) - 1 segments
  * available for user + kernel mapping. The bottom 4 contexts are used for
  * kernel mapping. Each segment contains 2^28 bytes. Each
- * context maps 2^46 bytes (64TB).
+ * context maps 2^49 bytes (512TB).
  *
  * We also need to avoid the last segment of the last context, because that
  * would give a protovsid of 0x1f. That will result in a VSID 0
  * because of the modulo operation in vsid scramble.
  */
 #define MAX_USER_CONTEXT   ((ASM_CONST(1) << CONTEXT_BITS) - 2)
+/*
+ * For platforms that support on 65bit VA we limit the context bits
+ */
+#define MAX_USER_CONTEXT_65BIT_VA ((ASM_CONST(1) << (65 - (SID_SHIFT + 
ESID_BITS))) - 2)
 
 /*
  * This should be computed such that protovosid * vsid_mulitplier
  * doesn't overflow 64 bits. It should also be co-prime to vsid_modulus
+ * We also need to make sure that number of bits in divisor is less
+ * than twice the number of protovsid bits for our modulus optmization to work.
+ * The below table shows the current values used.
+ *
+ * |---++++--|
+ * |   | Prime Bits | VSID_BITS_65VA | Total Bits | 2* VSID_BITS |
+ * |---++++--|
+ * | 1T| 24 | 25 | 49 |   50 |
+ * |---++++--|
+ * | 256MB | 24 | 37 | 61 |   74 |
+ * |---++++--|
+ *
+ * |---++++--|
+ * |   | Prime Bits | VSID_BITS_68VA | Total Bits | 2* VSID_BITS |
+ * |---++++--|
+ * | 1T| 24 | 28 | 52 |   56 |
+ * |---++++--|
+ * | 256MB | 24 | 40 | 64 |   80 |
+ * |---++++--|
+ *
  */
 #define VSID_MULTIPLIER_256M   ASM_CONST(12538073) /* 24-bit prime */
-#define VSID_BITS_256M (CONTEXT_BITS + ESID_BITS)
-#define VSID_MODULUS_256M  ((1UL<

[PATCH V4 06/14] powerpc/mm/slice: Update slice mask printing to use bitmap printing.

2017-03-16 Thread Aneesh Kumar K.V
We now get output like the following, which is much better.

[0.935306]  good_mask low_slice: 0-15
[0.935360]  good_mask high_slice: 0-511

Compared to

[0.953414]  good_mask: - 1.

I also fixed an error with slice_dbg printing.
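
For reference, a minimal sketch of the %*pbl specifier the new
slice_print_mask() relies on (the printk bitmap-list extension that
produces the "0-511" style output above):

#include <linux/bitmap.h>
#include <linux/printk.h>

/* Sketch only: %*pbl prints a bitmap as a ranged list; the field width
 * is the number of bits and the argument is a pointer to the bitmap. */
static void sketch_print_mask(void)
{
	DECLARE_BITMAP(example, 512);

	bitmap_fill(example, 512);
	pr_devel("high_slice: %*pbl\n", 512, example);	/* "0-511" */
}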

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/slice.c | 30 +++---
 1 file changed, 7 insertions(+), 23 deletions(-)

diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c
index 14e9fb525172..29154cc6a90a 100644
--- a/arch/powerpc/mm/slice.c
+++ b/arch/powerpc/mm/slice.c
@@ -52,29 +52,13 @@ int _slice_debug = 1;
 
 static void slice_print_mask(const char *label, struct slice_mask mask)
 {
-   char*p, buf[SLICE_NUM_LOW + 3 + SLICE_NUM_HIGH + 1];
-   int i;
-
if (!_slice_debug)
return;
-   p = buf;
-   for (i = 0; i < SLICE_NUM_LOW; i++)
-   *(p++) = (mask.low_slices & (1 << i)) ? '1' : '0';
-   *(p++) = ' ';
-   *(p++) = '-';
-   *(p++) = ' ';
-   for (i = 0; i < SLICE_NUM_HIGH; i++) {
-   if (test_bit(i, mask.high_slices))
-   *(p++) = '1';
-   else
-   *(p++) = '0';
-   }
-   *(p++) = 0;
-
-   printk(KERN_DEBUG "%s:%s\n", label, buf);
+   pr_devel("%s low_slice: %*pbl\n", label, (int)SLICE_NUM_LOW, 
_slices);
+   pr_devel("%s high_slice: %*pbl\n", label, (int)SLICE_NUM_HIGH, 
mask.high_slices);
 }
 
-#define slice_dbg(fmt...) do { if (_slice_debug) pr_debug(fmt); } while(0)
+#define slice_dbg(fmt...) do { if (_slice_debug) pr_devel(fmt); } while (0)
 
 #else
 
@@ -243,8 +227,8 @@ static void slice_convert(struct mm_struct *mm, struct 
slice_mask mask, int psiz
}
 
slice_dbg(" lsps=%lx, hsps=%lx\n",
- mm->context.low_slices_psize,
- mm->context.high_slices_psize);
+ (unsigned long)mm->context.low_slices_psize,
+ (unsigned long)mm->context.high_slices_psize);
 
spin_unlock_irqrestore(_convert_lock, flags);
 
@@ -686,8 +670,8 @@ void slice_set_user_psize(struct mm_struct *mm, unsigned 
int psize)
 
 
slice_dbg(" lsps=%lx, hsps=%lx\n",
- mm->context.low_slices_psize,
- mm->context.high_slices_psize);
+ (unsigned long)mm->context.low_slices_psize,
+ (unsigned long)mm->context.high_slices_psize);
 
  bail:
spin_unlock_irqrestore(_convert_lock, flags);
-- 
2.7.4



[PATCH V4 03/14] powerpc/mm: Move copy_mm_to_paca to paca.c

2017-03-16 Thread Aneesh Kumar K.V
We also change the function argument to struct mm_struct *. Move the
function to paca.c so that it can see the definition of struct mm_struct.
No functional change in this patch.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/paca.h | 18 +-
 arch/powerpc/kernel/paca.c  | 19 +++
 arch/powerpc/mm/hash_utils_64.c |  4 ++--
 arch/powerpc/mm/slb.c   |  2 +-
 arch/powerpc/mm/slice.c |  2 +-
 5 files changed, 24 insertions(+), 21 deletions(-)

diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index 708c3e592eeb..f48c250339fd 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -206,23 +206,7 @@ struct paca_struct {
 #endif
 };
 
-#ifdef CONFIG_PPC_BOOK3S
-static inline void copy_mm_to_paca(mm_context_t *context)
-{
-   get_paca()->mm_ctx_id = context->id;
-#ifdef CONFIG_PPC_MM_SLICES
-   get_paca()->mm_ctx_low_slices_psize = context->low_slices_psize;
-   memcpy(_paca()->mm_ctx_high_slices_psize,
-  >high_slices_psize, SLICE_ARRAY_SIZE);
-#else
-   get_paca()->mm_ctx_user_psize = context->user_psize;
-   get_paca()->mm_ctx_sllp = context->sllp;
-#endif
-}
-#else
-static inline void copy_mm_to_paca(mm_context_t *context){}
-#endif
-
+extern void copy_mm_to_paca(struct mm_struct *mm);
 extern struct paca_struct *paca;
 extern void initialise_paca(struct paca_struct *new_paca, int cpu);
 extern void setup_paca(struct paca_struct *new_paca);
diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c
index dfc479df9634..e2cf745a4b94 100644
--- a/arch/powerpc/kernel/paca.c
+++ b/arch/powerpc/kernel/paca.c
@@ -245,3 +245,22 @@ void __init free_unused_pacas(void)
 
free_lppacas();
 }
+
+void copy_mm_to_paca(struct mm_struct *mm)
+{
+#ifdef CONFIG_PPC_BOOK3S
+   mm_context_t *context = >context;
+
+   get_paca()->mm_ctx_id = context->id;
+#ifdef CONFIG_PPC_MM_SLICES
+   get_paca()->mm_ctx_low_slices_psize = context->low_slices_psize;
+   memcpy(_paca()->mm_ctx_high_slices_psize,
+  >high_slices_psize, SLICE_ARRAY_SIZE);
+#else /* CONFIG_PPC_MM_SLICES */
+   get_paca()->mm_ctx_user_psize = context->user_psize;
+   get_paca()->mm_ctx_sllp = context->sllp;
+#endif
+#else /* CONFIG_PPC_BOOK3S */
+   return;
+#endif
+}
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index d990c3332057..3bccc9d6e5d3 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -1121,7 +1121,7 @@ void demote_segment_4k(struct mm_struct *mm, unsigned 
long addr)
copro_flush_all_slbs(mm);
if ((get_paca_psize(addr) != MMU_PAGE_4K) && (current->mm == mm)) {
 
-   copy_mm_to_paca(>context);
+   copy_mm_to_paca(mm);
slb_flush_and_rebolt();
}
 }
@@ -1193,7 +1193,7 @@ static void check_paca_psize(unsigned long ea, struct 
mm_struct *mm,
 {
if (user_region) {
if (psize != get_paca_psize(ea)) {
-   copy_mm_to_paca(>context);
+   copy_mm_to_paca(mm);
slb_flush_and_rebolt();
}
} else if (get_paca()->vmalloc_sllp !=
diff --git a/arch/powerpc/mm/slb.c b/arch/powerpc/mm/slb.c
index 5e01b2ece1d0..98ae810b8c21 100644
--- a/arch/powerpc/mm/slb.c
+++ b/arch/powerpc/mm/slb.c
@@ -229,7 +229,7 @@ void switch_slb(struct task_struct *tsk, struct mm_struct 
*mm)
asm volatile("slbie %0" : : "r" (slbie_data));
 
get_paca()->slb_cache_ptr = 0;
-   copy_mm_to_paca(>context);
+   copy_mm_to_paca(mm);
 
/*
 * preload some userspace segments into the SLB.
diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c
index 88d8a36e1c97..2546204856f5 100644
--- a/arch/powerpc/mm/slice.c
+++ b/arch/powerpc/mm/slice.c
@@ -192,7 +192,7 @@ static void slice_flush_segments(void *parm)
if (mm != current->active_mm)
return;
 
-   copy_mm_to_paca(>active_mm->context);
+   copy_mm_to_paca(current->active_mm);
 
local_irq_save(flags);
slb_flush_and_rebolt();
-- 
2.7.4



[PATCH V4 02/14] powerpc/mm/slice: Update the function prototype

2017-03-16 Thread Aneesh Kumar K.V
This avoids copying the slice_mask struct as a function return value.
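
The call-site pattern, summarised as a hedged sketch (the actual
conversions are in the hunks below):

/* Sketch of the call-site change:
 *
 *	struct slice_mask good_mask;
 *
 *	before:	good_mask = slice_mask_for_size(mm, psize);
 *	after:	slice_mask_for_size(mm, psize, &good_mask);
 *
 * The mask stays in the caller's storage instead of being returned by
 * value and copied. */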

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/slice.c | 62 ++---
 1 file changed, 28 insertions(+), 34 deletions(-)

diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c
index 2b7a3f8295c0..88d8a36e1c97 100644
--- a/arch/powerpc/mm/slice.c
+++ b/arch/powerpc/mm/slice.c
@@ -75,19 +75,18 @@ static void slice_print_mask(const char *label, struct 
slice_mask mask) {}
 
 #endif
 
-static struct slice_mask slice_range_to_mask(unsigned long start,
-unsigned long len)
+static void slice_range_to_mask(unsigned long start, unsigned long len,
+   struct slice_mask *ret)
 {
unsigned long end = start + len - 1;
-   struct slice_mask ret;
 
-   ret.low_slices = 0;
-   bitmap_zero(ret.high_slices, SLICE_NUM_HIGH);
+   ret->low_slices = 0;
+   bitmap_zero(ret->high_slices, SLICE_NUM_HIGH);
 
if (start < SLICE_LOW_TOP) {
unsigned long mend = min(end, (SLICE_LOW_TOP - 1));
 
-   ret.low_slices = (1u << (GET_LOW_SLICE_INDEX(mend) + 1))
+   ret->low_slices = (1u << (GET_LOW_SLICE_INDEX(mend) + 1))
- (1u << GET_LOW_SLICE_INDEX(start));
}
 
@@ -96,9 +95,8 @@ static struct slice_mask slice_range_to_mask(unsigned long 
start,
unsigned long align_end = ALIGN(end, (1UL <high_slices, start_index, count);
}
-   return ret;
 }
 
 static int slice_area_is_free(struct mm_struct *mm, unsigned long addr,
@@ -132,53 +130,47 @@ static int slice_high_has_vma(struct mm_struct *mm, 
unsigned long slice)
return !slice_area_is_free(mm, start, end - start);
 }
 
-static struct slice_mask slice_mask_for_free(struct mm_struct *mm)
+static void slice_mask_for_free(struct mm_struct *mm, struct slice_mask *ret)
 {
-   struct slice_mask ret;
unsigned long i;
 
-   ret.low_slices = 0;
-   bitmap_zero(ret.high_slices, SLICE_NUM_HIGH);
+   ret->low_slices = 0;
+   bitmap_zero(ret->high_slices, SLICE_NUM_HIGH);
 
for (i = 0; i < SLICE_NUM_LOW; i++)
if (!slice_low_has_vma(mm, i))
-   ret.low_slices |= 1u << i;
+   ret->low_slices |= 1u << i;
 
if (mm->task_size <= SLICE_LOW_TOP)
-   return ret;
+   return;
 
for (i = 0; i < SLICE_NUM_HIGH; i++)
if (!slice_high_has_vma(mm, i))
-   __set_bit(i, ret.high_slices);
-
-   return ret;
+   __set_bit(i, ret->high_slices);
 }
 
-static struct slice_mask slice_mask_for_size(struct mm_struct *mm, int psize)
+static void slice_mask_for_size(struct mm_struct *mm, int psize, struct 
slice_mask *ret)
 {
unsigned char *hpsizes;
int index, mask_index;
-   struct slice_mask ret;
unsigned long i;
u64 lpsizes;
 
-   ret.low_slices = 0;
-   bitmap_zero(ret.high_slices, SLICE_NUM_HIGH);
+   ret->low_slices = 0;
+   bitmap_zero(ret->high_slices, SLICE_NUM_HIGH);
 
lpsizes = mm->context.low_slices_psize;
for (i = 0; i < SLICE_NUM_LOW; i++)
if (((lpsizes >> (i * 4)) & 0xf) == psize)
-   ret.low_slices |= 1u << i;
+   ret->low_slices |= 1u << i;
 
hpsizes = mm->context.high_slices_psize;
for (i = 0; i < SLICE_NUM_HIGH; i++) {
mask_index = i & 0x1;
index = i >> 1;
if (((hpsizes[index] >> (mask_index * 4)) & 0xf) == psize)
-   __set_bit(i, ret.high_slices);
+   __set_bit(i, ret->high_slices);
}
-
-   return ret;
 }
 
 static int slice_check_fit(struct slice_mask mask, struct slice_mask available)
@@ -460,7 +452,7 @@ unsigned long slice_get_unmapped_area(unsigned long addr, 
unsigned long len,
/* First make up a "good" mask of slices that have the right size
 * already
 */
-   good_mask = slice_mask_for_size(mm, psize);
+   slice_mask_for_size(mm, psize, _mask);
slice_print_mask(" good_mask", good_mask);
 
/*
@@ -485,7 +477,7 @@ unsigned long slice_get_unmapped_area(unsigned long addr, 
unsigned long len,
 #ifdef CONFIG_PPC_64K_PAGES
/* If we support combo pages, we can allow 64k pages in 4k slices */
if (psize == MMU_PAGE_64K) {
-   compat_mask = slice_mask_for_size(mm, MMU_PAGE_4K);
+   slice_mask_for_size(mm, MMU_PAGE_4K, _mask);
if (fixed)
slice_or_mask(_mask, _mask);
}
@@ -494,7 +486,7 @@ unsigned 

[PATCH V4 01/14] powerpc/mm/slice: Convert slice_mask high slice to a bitmap

2017-03-16 Thread Aneesh Kumar K.V
In a follow-up patch we want to increase the VA range, which will require
high_slices to have more than 64 bits. To enable this, convert high_slices
to a bitmap. We keep the number of bits the same in this patch and change
it to a higher value later.
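
A hedged sketch of how the u64 operations map onto the kernel bitmap API
after the conversion (a generic mapping, not new patch code):

/* u64 mask                      ->  bitmap equivalent                                    */
/* ret.high_slices = 0;          ->  bitmap_zero(ret.high_slices, SLICE_NUM_HIGH);        */
/* ret.high_slices |= 1ul << i;  ->  __set_bit(i, ret.high_slices);                       */
/* ret.high_slices & (1ul << i)  ->  test_bit(i, ret.high_slices);                        */
/* dst.high |= src.high;         ->  bitmap_or(dst.high, dst.high, src.high, SLICE_NUM_HIGH); */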

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/page_64.h |  15 ++---
 arch/powerpc/mm/slice.c| 110 +
 2 files changed, 80 insertions(+), 45 deletions(-)

diff --git a/arch/powerpc/include/asm/page_64.h 
b/arch/powerpc/include/asm/page_64.h
index 3e83d2a20b6f..bd55ff751938 100644
--- a/arch/powerpc/include/asm/page_64.h
+++ b/arch/powerpc/include/asm/page_64.h
@@ -98,19 +98,16 @@ extern u64 ppc64_pft_size;
 #define GET_LOW_SLICE_INDEX(addr)  ((addr) >> SLICE_LOW_SHIFT)
 #define GET_HIGH_SLICE_INDEX(addr) ((addr) >> SLICE_HIGH_SHIFT)
 
+#ifndef __ASSEMBLY__
 /*
- * 1 bit per slice and we have one slice per 1TB
- * Right now we support only 64TB.
- * IF we change this we will have to change the type
- * of high_slices
+ * One bit per slice. We have lower slices which cover 256MB segments
+ * upto 4G range. That gets us 16 low slices. For the rest we track slices
+ * in 1TB size.
+ * 64 below is actually SLICE_NUM_HIGH to fixup complie errros
  */
-#define SLICE_MASK_SIZE 8
-
-#ifndef __ASSEMBLY__
-
 struct slice_mask {
u16 low_slices;
-   u64 high_slices;
+   DECLARE_BITMAP(high_slices, 64);
 };
 
 struct mm_struct;
diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c
index bf150557dba8..2b7a3f8295c0 100644
--- a/arch/powerpc/mm/slice.c
+++ b/arch/powerpc/mm/slice.c
@@ -36,11 +36,6 @@
 #include 
 #include 
 
-/* some sanity checks */
-#if (H_PGTABLE_RANGE >> 43) > SLICE_MASK_SIZE
-#error H_PGTABLE_RANGE exceeds slice_mask high_slices size
-#endif
-
 static DEFINE_SPINLOCK(slice_convert_lock);
 
 
@@ -49,7 +44,7 @@ int _slice_debug = 1;
 
 static void slice_print_mask(const char *label, struct slice_mask mask)
 {
-   char*p, buf[16 + 3 + 64 + 1];
+   char*p, buf[SLICE_NUM_LOW + 3 + SLICE_NUM_HIGH + 1];
int i;
 
if (!_slice_debug)
@@ -60,8 +55,12 @@ static void slice_print_mask(const char *label, struct 
slice_mask mask)
*(p++) = ' ';
*(p++) = '-';
*(p++) = ' ';
-   for (i = 0; i < SLICE_NUM_HIGH; i++)
-   *(p++) = (mask.high_slices & (1ul << i)) ? '1' : '0';
+   for (i = 0; i < SLICE_NUM_HIGH; i++) {
+   if (test_bit(i, mask.high_slices))
+   *(p++) = '1';
+   else
+   *(p++) = '0';
+   }
*(p++) = 0;
 
printk(KERN_DEBUG "%s:%s\n", label, buf);
@@ -80,7 +79,10 @@ static struct slice_mask slice_range_to_mask(unsigned long 
start,
 unsigned long len)
 {
unsigned long end = start + len - 1;
-   struct slice_mask ret = { 0, 0 };
+   struct slice_mask ret;
+
+   ret.low_slices = 0;
+   bitmap_zero(ret.high_slices, SLICE_NUM_HIGH);
 
if (start < SLICE_LOW_TOP) {
unsigned long mend = min(end, (SLICE_LOW_TOP - 1));
@@ -89,10 +91,13 @@ static struct slice_mask slice_range_to_mask(unsigned long 
start,
- (1u << GET_LOW_SLICE_INDEX(start));
}
 
-   if ((start + len) > SLICE_LOW_TOP)
-   ret.high_slices = (1ul << (GET_HIGH_SLICE_INDEX(end) + 1))
-   - (1ul << GET_HIGH_SLICE_INDEX(start));
+   if ((start + len) > SLICE_LOW_TOP) {
+   unsigned long start_index = GET_HIGH_SLICE_INDEX(start);
+   unsigned long align_end = ALIGN(end, (1UL <

[PATCH V4 00/14] powerpc/mm/ppc64: Add 128TB support

2017-03-16 Thread Aneesh Kumar K.V
This patch series increases the effective virtual address range of
applications from 64TB to 128TB. We do that by supporting a 68-bit virtual
address. On platforms that can only do a 65-bit virtual address, we limit
the max contexts to a 16-bit value instead of 19 bits.

The patch series also switches the page table layout so that we can do a
512TB effective address. But we still limit TASK_SIZE to 128TB. This is
done to make sure we don't break applications that make assumptions about
the max address returned by the OS. We can switch to 128TB without a Linux
personality value because other architectures already use 128TB as the max
address.
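
The address-space arithmetic behind the series, summarised as a hedged
sketch (numbers taken from the patches themselves):

/* Sketch: the bit budget the series works with.
 *
 *   VA bits = CONTEXT_BITS + ESID_BITS + SID_SHIFT
 *     68    =     19       +    21     +    28        (new maximum)
 *     65    =     16       +    21     +    28        (p4/p5 limit)
 *
 *   per-context span = 1UL << (ESID_BITS + SID_SHIFT) = 1UL << 49 = 512TB
 *   TASK_SIZE stays at 1UL << 47 = 128TB for application compatibility. */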

Changes from V3:
* Rebase to latest upstream
* Fixes based on testing

Changes from V2:
* Handle hugepage size correctly.


Aneesh Kumar K.V (14):
  powerpc/mm/slice: Convert slice_mask high slice to a bitmap
  powerpc/mm/slice: Update the function prototype
  powerpc/mm: Move copy_mm_to_paca to paca.c
  powerpc/mm: Remove redundant TASK_SIZE_USER64 checks
  powerpc/mm/slice: Move slice_mask struct definition to slice.c
  powerpc/mm/slice: Update slice mask printing to use bitmap printing.
  powerpc/mm/hash: Move kernel context to the starting of context range
  powerpc/mm/hash: Support 68 bit VA
  powerpc/mm/hash: VSID 0 is no more an invalid VSID
  powerpc/mm/hash: Convert mask to unsigned long
  powerpc/mm/hash: Increase VA range to 128TB
  powerpc/mm/slice: Use mm task_size as max value of slice index
  powerpc/mm/hash64: Store task size in PACA
  powerpc/mm/hash: Skip using reserved virtual address range

 arch/powerpc/include/asm/book3s/64/hash-4k.h  |   2 +-
 arch/powerpc/include/asm/book3s/64/hash-64k.h |   2 +-
 arch/powerpc/include/asm/book3s/64/mmu-hash.h | 178 +--
 arch/powerpc/include/asm/kvm_book3s_64.h  |   2 -
 arch/powerpc/include/asm/mmu.h|  19 ++-
 arch/powerpc/include/asm/mmu_context.h|   3 +-
 arch/powerpc/include/asm/paca.h   |  22 +--
 arch/powerpc/include/asm/page_64.h|  14 --
 arch/powerpc/include/asm/processor.h  |  22 ++-
 arch/powerpc/kernel/asm-offsets.c |   4 +
 arch/powerpc/kernel/paca.c|  21 +++
 arch/powerpc/kernel/setup-common.c|   9 ++
 arch/powerpc/kvm/book3s_64_mmu_host.c |  10 +-
 arch/powerpc/mm/hash_utils_64.c   |  86 +---
 arch/powerpc/mm/init_64.c |   4 -
 arch/powerpc/mm/mmu_context_book3s64.c| 127 +
 arch/powerpc/mm/pgtable-hash64.c  |   1 -
 arch/powerpc/mm/pgtable_64.c  |   5 -
 arch/powerpc/mm/slb.c |   2 +-
 arch/powerpc/mm/slb_low.S |  82 +++
 arch/powerpc/mm/slice.c   | 194 +++---
 arch/powerpc/mm/tlb_hash64.c  |   1 -
 22 files changed, 517 insertions(+), 293 deletions(-)

-- 
2.7.4



[PATCH V2 11/11] powerpc/mm: Move hash specific pte bits to be top bits of RPN

2017-03-16 Thread Aneesh Kumar K.V
We don't support the full 57 bits of physical address under hash and hence
can overload the top bits of the RPN as hash-specific PTE bits.
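
Why those top RPN bits are free, as a short hedged sketch (the constants
come from this patch and the previous one; the SKETCH_* name is mine):

/* Sketch: hash limits physical addresses to H_PAGE_PA_MAX = 51 bits,
 * while the PTE layout reserves RPN space up to _RPAGE_PA_MAX = 57 bits
 * for radix.  The RPN bits in between are always zero under hash, so
 * they can carry H_PAGE_BUSY, H_PAGE_HASHPTE, H_PAGE_F_SECOND and
 * H_PAGE_F_GIX, freeing the former _RPAGE_RSV* bits. */
#define SKETCH_HASH_FREE_RPN_BITS	(57 - 51)	/* 6 bits to overload */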

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/64/hash.h| 18 ++
 arch/powerpc/include/asm/book3s/64/pgtable.h | 19 ---
 arch/powerpc/mm/hash_native_64.c |  1 +
 3 files changed, 23 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/hash.h 
b/arch/powerpc/include/asm/book3s/64/hash.h
index af3c88624d3a..33eb1a650317 100644
--- a/arch/powerpc/include/asm/book3s/64/hash.h
+++ b/arch/powerpc/include/asm/book3s/64/hash.h
@@ -6,20 +6,14 @@
  * Common bits between 4K and 64K pages in a linux-style PTE.
  * Additional bits may be defined in pgtable-hash64-*.h
  *
- * Note: We only support user read/write permissions. Supervisor always
- * have full read/write to pages above PAGE_OFFSET (pages below that
- * always use the user access permissions).
- *
- * We could create separate kernel read-only if we used the 3 PP bits
- * combinations that newer processors provide but we currently don't.
  */
-#define H_PAGE_BUSY_RPAGE_SW1 /* software: PTE & hash are busy */
+#define H_PAGE_BUSY_RPAGE_RPN45 /* software: PTE & hash are busy */
 #define H_PTE_NONE_MASK_PAGE_HPTEFLAGS
-#define H_PAGE_F_GIX_SHIFT 57
-/* (7ul << 57) HPTE index within HPTEG */
-#define H_PAGE_F_GIX   (_RPAGE_RSV2 | _RPAGE_RSV3 | _RPAGE_RSV4)
-#define H_PAGE_F_SECOND_RPAGE_RSV1 /* HPTE is in 2ndary 
HPTEG */
-#define H_PAGE_HASHPTE _RPAGE_SW0  /* PTE has associated HPTE */
+#define H_PAGE_F_GIX_SHIFT 52
+/* (7ul << 53) HPTE index within HPTEG */
+#define H_PAGE_F_SECOND_RPAGE_RPN44/* HPTE is in 2ndary 
HPTEG */
+#define H_PAGE_F_GIX   (_RPAGE_RPN43 | _RPAGE_RPN42 | _RPAGE_RPN41)
+#define H_PAGE_HASHPTE _RPAGE_RPN40/* PTE has associated HPTE */
 /*
  * Max physical address bit we will use for now.
  *
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index eb82b60b5c89..3d104f8ad891 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -36,16 +36,29 @@
 #define _RPAGE_RSV20x0800UL
 #define _RPAGE_RSV30x0400UL
 #define _RPAGE_RSV40x0200UL
+
+#define _PAGE_PTE  0x4000UL/* distinguishes PTEs 
from pointers */
+#define _PAGE_PRESENT  0x8000UL/* pte contains a 
translation */
+
+/*
+ * Top and bottom bits of RPN which can be used by hash
+ * translation mode, because we expect them to be zero
+ * otherwise.
+ */
 #define _RPAGE_RPN00x01000
 #define _RPAGE_RPN10x02000
+#define _RPAGE_RPN45   0x0100UL
+#define _RPAGE_RPN44   0x0080UL
+#define _RPAGE_RPN43   0x0040UL
+#define _RPAGE_RPN42   0x0020UL
+#define _RPAGE_RPN41   0x0010UL
+#define _RPAGE_RPN40   0x0008UL
+
 /* Max physicall address bit as per radix table */
 #define _RPAGE_PA_MAX  57
 
 #define _PAGE_SOFT_DIRTY   _RPAGE_SW3 /* software: software dirty tracking 
*/
 #define _PAGE_SPECIAL  _RPAGE_SW2 /* software: special page */
-
-#define _PAGE_PTE  0x4000UL/* distinguishes PTEs 
from pointers */
-#define _PAGE_PRESENT  0x8000UL/* pte contains a 
translation */
 /*
  * Drivers request for cache inhibited pte mapping using _PAGE_NO_CACHE
  * Instead of fixing all of them, add an alternate define which
diff --git a/arch/powerpc/mm/hash_native_64.c b/arch/powerpc/mm/hash_native_64.c
index cc332608e656..917a5a336441 100644
--- a/arch/powerpc/mm/hash_native_64.c
+++ b/arch/powerpc/mm/hash_native_64.c
@@ -246,6 +246,7 @@ static long native_hpte_insert(unsigned long hpte_group, 
unsigned long vpn,
 
__asm__ __volatile__ ("ptesync" : : : "memory");
 
+   BUILD_BUG_ON(H_PAGE_F_SECOND != (1ul  << (H_PAGE_F_GIX_SHIFT + 3)));
return i | (!!(vflags & HPTE_V_SECONDARY) << 3);
 }
 
-- 
2.7.4



[PATCH V2 10/11] powerpc/mm/radix: Make max pfn bits a variable

2017-03-16 Thread Aneesh Kumar K.V
This makes the max physical address bits a variable so that the hash and
radix translation modes can choose which value to use. In this patch we
also switch radix translation to use 57 bits. This makes it resilient to
future changes in the max pfn supported by platforms.

This patch is split from the previous one to make review easier.
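
A hedged sketch of how the variable mask might be selected at MMU init
(the hook names sketch_hash_init/sketch_radix_init are assumptions; the
diffstat shows one-line additions to the hash and radix setup paths):

/* Sketch only: pick __pte_rpn_mask per translation mode at early init. */
unsigned long __pte_rpn_mask;

static void sketch_hash_init(void)
{
	__pte_rpn_mask = H_PTE_RPN_MASK;	/* 51-bit physical limit */
}

static void sketch_radix_init(void)
{
	__pte_rpn_mask = RADIX_PTE_RPN_MASK;	/* 57-bit physical limit */
}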

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/64/hash.h| 18 ++
 arch/powerpc/include/asm/book3s/64/pgtable.h | 28 +---
 arch/powerpc/include/asm/book3s/64/radix.h   |  4 
 arch/powerpc/mm/hash_utils_64.c  |  1 +
 arch/powerpc/mm/pgtable-radix.c  |  1 +
 arch/powerpc/mm/pgtable_64.c |  3 +++
 6 files changed, 32 insertions(+), 23 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/hash.h 
b/arch/powerpc/include/asm/book3s/64/hash.h
index ec2828b1db07..af3c88624d3a 100644
--- a/arch/powerpc/include/asm/book3s/64/hash.h
+++ b/arch/powerpc/include/asm/book3s/64/hash.h
@@ -20,6 +20,24 @@
 #define H_PAGE_F_GIX   (_RPAGE_RSV2 | _RPAGE_RSV3 | _RPAGE_RSV4)
 #define H_PAGE_F_SECOND_RPAGE_RSV1 /* HPTE is in 2ndary 
HPTEG */
 #define H_PAGE_HASHPTE _RPAGE_SW0  /* PTE has associated HPTE */
+/*
+ * Max physical address bit we will use for now.
+ *
+ * This is mostly a hardware limitation and for now Power9 has
+ * a 51 bit limit.
+ *
+ * This is different from the number of physical bit required to address
+ * the last byte of memory. That is defined by MAX_PHYSMEM_BITS.
+ * MAX_PHYSMEM_BITS is a linux limitation imposed by the maximum
+ * number of sections we can support (SECTIONS_SHIFT).
+ *
+ * This is different from Radix page table limitation and
+ * should always be less than that. The limit is done such that
+ * we can overload the bits between _RPAGE_PA_MAX and H_PAGE_PA_MAX
+ * for hash linux page table specific bits.
+ */
+#define H_PAGE_PA_MAX  51
+#define H_PTE_RPN_MASK (((1UL << H_PAGE_PA_MAX) - 1) & (PAGE_MASK))
 
 #ifdef CONFIG_PPC_64K_PAGES
 #include 
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index c470dcc815d5..eb82b60b5c89 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -40,23 +40,6 @@
 #define _RPAGE_RPN10x02000
 /* Max physicall address bit as per radix table */
 #define _RPAGE_PA_MAX  57
-/*
- * Max physical address bit we will use for now.
- *
- * This is mostly a hardware limitation and for now Power9 has
- * a 51 bit limit.
- *
- * This is different from the number of physical bit required to address
- * the last byte of memory. That is defined by MAX_PHYSMEM_BITS.
- * MAX_PHYSMEM_BITS is a linux limitation imposed by the maximum
- * number of sections we can support (SECTIONS_SHIFT).
- *
- * This is different from Radix page table limitation above and
- * should always be less than that. The limit is done such that
- * we can overload the bits between _RPAGE_PA_MAX and _PAGE_PA_MAX
- * for hash linux page table specific bits.
- */
-#define _PAGE_PA_MAX   51
 
 #define _PAGE_SOFT_DIRTY   _RPAGE_SW3 /* software: software dirty tracking 
*/
 #define _PAGE_SPECIAL  _RPAGE_SW2 /* software: special page */
@@ -70,12 +53,6 @@
  */
 #define _PAGE_NO_CACHE _PAGE_TOLERANT
 /*
- * We support _RPAGE_PA_MAX bit real address in pte. On the linux side
- * we are limited by _PAGE_PA_MAX. Clear everything above _PAGE_PA_MAX
- * every thing below PAGE_SHIFT;
- */
-#define PTE_RPN_MASK   (((1UL << _PAGE_PA_MAX) - 1) & (PAGE_MASK))
-/*
  * set of bits not changed in pmd_modify. Even though we have hash specific 
bits
  * in here, on radix we expect them to be zero.
  */
@@ -180,6 +157,11 @@
 
 #ifndef __ASSEMBLY__
 /*
+ * based on max physical address bit that we want to encode in page table
+ */
+extern unsigned long __pte_rpn_mask;
+#define PTE_RPN_MASK __pte_rpn_mask
+/*
  * page table defines
  */
 extern unsigned long __pte_index_size;
diff --git a/arch/powerpc/include/asm/book3s/64/radix.h 
b/arch/powerpc/include/asm/book3s/64/radix.h
index ac16d1943022..142739b31174 100644
--- a/arch/powerpc/include/asm/book3s/64/radix.h
+++ b/arch/powerpc/include/asm/book3s/64/radix.h
@@ -24,6 +24,10 @@
 
 /* An empty PTE can still have a R or C writeback */
 #define RADIX_PTE_NONE_MASK(_PAGE_DIRTY | _PAGE_ACCESSED)
+/*
+ * Clear everything above _RPAGE_PA_MAX every thing below PAGE_SHIFT
+ */
+#define RADIX_PTE_RPN_MASK (((1UL << _RPAGE_PA_MAX) - 1) & 
(PAGE_MASK))
 
 /* Bits to set in a RPMD/RPUD/RPGD */
 #define RADIX_PMD_VAL_BITS (0x8000UL | 
RADIX_PTE_INDEX_SIZE)
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index c554768b1fa2..d990c3332057 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -986,6 

[PATCH V2 08/11] powerpc/mm: Express everything based on Radix page table defines

2017-03-16 Thread Aneesh Kumar K.V
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/64/hash-64k.h | 4 ++--
 arch/powerpc/include/asm/book3s/64/pgtable.h  | 2 ++
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h 
b/arch/powerpc/include/asm/book3s/64/hash-64k.h
index b39f0b86405e..7be54f9590a3 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-64k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h
@@ -10,8 +10,8 @@
  * 64k aligned address free up few of the lower bits of RPN for us
  * We steal that here. For more deatils look at pte_pfn/pfn_pte()
  */
-#define H_PAGE_COMBO   0x1000 /* this is a combo 4k page */
-#define H_PAGE_4K_PFN  0x2000 /* PFN is for a single 4k page */
+#define H_PAGE_COMBO   _RPAGE_RPN0 /* this is a combo 4k page */
+#define H_PAGE_4K_PFN  _RPAGE_RPN1 /* PFN is for a single 4k page */
 /*
  * We need to differentiate between explicit huge page and THP huge
  * page, since THP huge page also need to track real subpage details
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 4d4ff9a324f0..96566df547a8 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -36,6 +36,8 @@
 #define _RPAGE_RSV20x0800UL
 #define _RPAGE_RSV30x0400UL
 #define _RPAGE_RSV40x0200UL
+#define _RPAGE_RPN00x01000
+#define _RPAGE_RPN10x02000
 
 #define _PAGE_SOFT_DIRTY   _RPAGE_SW3 /* software: software dirty tracking 
*/
 #define _PAGE_SPECIAL  _RPAGE_SW2 /* software: special page */
-- 
2.7.4


