Re: [PATCH v2] ASoC: fsl_asrc: Add an option to select internal ratio mode

2020-07-03 Thread Nicolin Chen
On Fri, Jul 03, 2020 at 11:50:20PM +0100, Mark Brown wrote:
> On Fri, Jul 03, 2020 at 03:46:58PM -0700, Nicolin Chen wrote:
> 
> > > [1/1] ASoC: fsl_asrc: Add an option to select internal ratio mode
> > >   commit: d0250cf4f2abfbea64ed247230f08f5ae23979f0
> 
> > You already applied v3 of this change:
> > https://mailman.alsa-project.org/pipermail/alsa-devel/2020-July/169976.html
> 
> > And it's already in linux-next also. Not sure what's happening...
> 
> The script can't always tell the difference between versions - it looks
> like it's notified for v2 based on seeing v3 in git.

OK... as long as no revert or re-applying happens, we can ignore it :)

Thanks


Re: [PATCH v2] ASoC: fsl_asrc: Add an option to select internal ratio mode

2020-07-03 Thread Mark Brown
On Fri, Jul 03, 2020 at 03:46:58PM -0700, Nicolin Chen wrote:

> > [1/1] ASoC: fsl_asrc: Add an option to select internal ratio mode
> >   commit: d0250cf4f2abfbea64ed247230f08f5ae23979f0

> You already applied v3 of this change:
> https://mailman.alsa-project.org/pipermail/alsa-devel/2020-July/169976.html

> And it's already in linux-next also. Not sure what's happening...

The script can't always tell the difference between versions - it looks
like it's notified for v2 based on seeing v3 in git.




Re: [PATCH v2] ASoC: fsl_asrc: Add an option to select internal ratio mode

2020-07-03 Thread Nicolin Chen
Hi Mark,

On Fri, Jul 03, 2020 at 06:03:43PM +0100, Mark Brown wrote:
> On Tue, 30 Jun 2020 16:47:56 +0800, Shengjiu Wang wrote:
> > The ASRC not only supports ideal ratio mode, but also supports
> > internal ratio mode.
> > 
> > For internal ratio mode, the rate of the clock source should be
> > divisible with no remainder by the sample rate, otherwise there is
> > sound distortion.
> > 
> > [...]
> 
> Applied to
> 
>https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound.git for-next
> 
> Thanks!
> 
> [1/1] ASoC: fsl_asrc: Add an option to select internal ratio mode
>   commit: d0250cf4f2abfbea64ed247230f08f5ae23979f0

You already applied v3 of this change:
https://mailman.alsa-project.org/pipermail/alsa-devel/2020-July/169976.html

And it's already in linux-next also. Not sure what's happening...


Re: [PATCH v2] ASoC: fsl_asrc: Add an option to select internal ratio mode

2020-07-03 Thread Mark Brown
On Tue, 30 Jun 2020 16:47:56 +0800, Shengjiu Wang wrote:
> The ASRC not only supports ideal ratio mode, but also supports
> internal ratio mode.
> 
> For internal ratio mode, the rate of the clock source should be
> divisible with no remainder by the sample rate, otherwise there is
> sound distortion.
> 
> [...]

Applied to

   https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound.git for-next

Thanks!

[1/1] ASoC: fsl_asrc: Add an option to select internal ratio mode
  commit: d0250cf4f2abfbea64ed247230f08f5ae23979f0

All being well this means that it will be integrated into the linux-next
tree (usually sometime in the next 24 hours) and sent to Linus during
the next merge window (or sooner if it is a bug fix), however if
problems are discovered then the patch may be dropped or reverted.

You may get further e-mails resulting from automated or manual testing
and review of the tree, please engage with people reporting problems and
send followup patches addressing any issues that are reported if needed.

If any updates are required or you are submitting further changes they
should be sent as incremental updates against current git, existing
patches will not be replaced.

Please add any relevant lists and maintainers to the CCs when replying
to this mail.

Thanks,
Mark


[PATCH 0/2] Rework secure memslot dropping

2020-07-03 Thread Laurent Dufour
When doing memory hotplug on a secure VM, the secure pages are not
properly cleaned from the secure device when the memslot is dropped. This
silent error then prevents the SVM from rebooting properly after the
following sequence of commands is run in the QEMU monitor:

device_add pc-dimm,id=dimm1,memdev=mem1
device_del dimm1
device_add pc-dimm,id=dimm1,memdev=mem1

At reboot time, when the kernel boots again and switches to secure mode,
page_in fails for the pages in the memslot because the cleanup was not
done properly: the memslot is flagged as invalid during the hot unplug,
so the page fault mechanism is never triggered.

To prevent that, instead of relying on the page fault mechanism to
trigger the page-out of the secure pages when the memslot is dropped, it
seems simpler to call the function doing the page-out directly. This way
the state of the memslot does not interfere with the page-out process,
as illustrated by the sketch below.
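
In rough pseudo-code, the intended flow looks like the following (a
hedged, simplified sketch of what patch 2/2 of this series does; VMA
caching, mmap_sem handling, skip_page_out handling and error checking are
omitted, and gpa_of() is a made-up placeholder for looking up the guest
physical address):

/* Simplified sketch: page out each secure page of the memslot directly,
 * under kvm->arch.uvmem_lock, instead of relying on the fault path. */
static void drop_secure_pages(const struct kvm_memory_slot *slot,
			      struct kvm *kvm)
{
	unsigned long gfn = slot->base_gfn;
	unsigned long addr = slot->userspace_addr;
	unsigned long uvmem_pfn;
	int i;

	for (i = slot->npages; i; --i, ++gfn, addr += PAGE_SIZE) {
		struct vm_area_struct *vma = find_vma(kvm->mm, addr);

		mutex_lock(&kvm->arch.uvmem_lock);
		if (kvmppc_gfn_is_uvmem_pfn(gfn, kvm, &uvmem_pfn))
			__kvmppc_svm_page_out(vma, addr, addr + PAGE_SIZE,
					      PAGE_SHIFT, kvm, gpa_of(gfn));
		mutex_unlock(&kvm->arch.uvmem_lock);
	}
}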

This series applies on top of Ram's series titled:
"[PATCH v3 0/4] Migrate non-migrated pages of a SVM."
https://lore.kernel.org/linuxppc-dev/1592606622-29884-1-git-send-email-linux...@us.ibm.com/#r

Laurent Dufour (2):
  KVM: PPC: Book3S HV: move kvmppc_svm_page_out up
  KVM: PPC: Book3S HV: rework secure mem slot dropping

 arch/powerpc/kvm/book3s_hv_uvmem.c | 220 +
 1 file changed, 127 insertions(+), 93 deletions(-)

-- 
2.27.0



[PATCH 1/2] KVM: PPC: Book3S HV: move kvmppc_svm_page_out up

2020-07-03 Thread Laurent Dufour
kvmppc_svm_page_out() will need to be called by kvmppc_uvmem_drop_pages(),
so move it up earlier in this file.

Furthermore, it will be useful to call this function while already
holding kvm->arch.uvmem_lock, so prefix the original function with __,
remove the locking from it, and introduce a wrapper which calls that
function with the lock held.

There is no functional change.

Cc: Ram Pai 
Cc: Bharata B Rao 
Cc: Paul Mackerras 
Signed-off-by: Laurent Dufour 
---
 arch/powerpc/kvm/book3s_hv_uvmem.c | 166 -
 1 file changed, 90 insertions(+), 76 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 778a6ea86991..852cc9ae6a0b 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -435,6 +435,96 @@ unsigned long kvmppc_h_svm_init_done(struct kvm *kvm)
return ret;
 }
 
+/*
+ * Provision a new page on HV side and copy over the contents
+ * from secure memory using UV_PAGE_OUT uvcall.
+ * Caller must hold kvm->arch.uvmem_lock.
+ */
+static int __kvmppc_svm_page_out(struct vm_area_struct *vma,
+   unsigned long start,
+   unsigned long end, unsigned long page_shift,
+   struct kvm *kvm, unsigned long gpa)
+{
+   unsigned long src_pfn, dst_pfn = 0;
+   struct migrate_vma mig;
+   struct page *dpage, *spage;
+   struct kvmppc_uvmem_page_pvt *pvt;
+   unsigned long pfn;
+   int ret = U_SUCCESS;
+
+   memset(&mig, 0, sizeof(mig));
+   mig.vma = vma;
+   mig.start = start;
+   mig.end = end;
+   mig.src = &src_pfn;
+   mig.dst = &dst_pfn;
+   mig.src_owner = &kvmppc_uvmem_pgmap;
+
+   /* The requested page is already paged-out, nothing to do */
+   if (!kvmppc_gfn_is_uvmem_pfn(gpa >> page_shift, kvm, NULL))
+   return ret;
+
+   ret = migrate_vma_setup(&mig);
+   if (ret)
+   return -1;
+
+   spage = migrate_pfn_to_page(*mig.src);
+   if (!spage || !(*mig.src & MIGRATE_PFN_MIGRATE))
+   goto out_finalize;
+
+   if (!is_zone_device_page(spage))
+   goto out_finalize;
+
+   dpage = alloc_page_vma(GFP_HIGHUSER, vma, start);
+   if (!dpage) {
+   ret = -1;
+   goto out_finalize;
+   }
+
+   lock_page(dpage);
+   pvt = spage->zone_device_data;
+   pfn = page_to_pfn(dpage);
+
+   /*
+* This function is used in two cases:
+* - When HV touches a secure page, for which we do UV_PAGE_OUT
+* - When a secure page is converted to shared page, we *get*
+*   the page to essentially unmap the device page. In this
+*   case we skip page-out.
+*/
+   if (!pvt->skip_page_out)
+   ret = uv_page_out(kvm->arch.lpid, pfn << page_shift,
+ gpa, 0, page_shift);
+
+   if (ret == U_SUCCESS)
+   *mig.dst = migrate_pfn(pfn) | MIGRATE_PFN_LOCKED;
+   else {
+   unlock_page(dpage);
+   __free_page(dpage);
+   goto out_finalize;
+   }
+
+   migrate_vma_pages(&mig);
+
+out_finalize:
+   migrate_vma_finalize(&mig);
+   return ret;
+}
+
+static inline int kvmppc_svm_page_out(struct vm_area_struct *vma,
+ unsigned long start, unsigned long end,
+ unsigned long page_shift,
+ struct kvm *kvm, unsigned long gpa)
+{
+   int ret;
+
+   mutex_lock(&kvm->arch.uvmem_lock);
+   ret = __kvmppc_svm_page_out(vma, start, end, page_shift, kvm, gpa);
+   mutex_unlock(&kvm->arch.uvmem_lock);
+
+   return ret;
+}
+
 /*
  * Drop device pages that we maintain for the secure guest
  *
@@ -801,82 +891,6 @@ unsigned long kvmppc_h_svm_page_in(struct kvm *kvm, 
unsigned long gpa,
return ret;
 }
 
-/*
- * Provision a new page on HV side and copy over the contents
- * from secure memory using UV_PAGE_OUT uvcall.
- */
-static int kvmppc_svm_page_out(struct vm_area_struct *vma,
-   unsigned long start,
-   unsigned long end, unsigned long page_shift,
-   struct kvm *kvm, unsigned long gpa)
-{
-   unsigned long src_pfn, dst_pfn = 0;
-   struct migrate_vma mig;
-   struct page *dpage, *spage;
-   struct kvmppc_uvmem_page_pvt *pvt;
-   unsigned long pfn;
-   int ret = U_SUCCESS;
-
-   memset(&mig, 0, sizeof(mig));
-   mig.vma = vma;
-   mig.start = start;
-   mig.end = end;
-   mig.src = &src_pfn;
-   mig.dst = &dst_pfn;
-   mig.src_owner = &kvmppc_uvmem_pgmap;
-
-   mutex_lock(&kvm->arch.uvmem_lock);
-   /* The requested page is already paged-out, nothing to do */
-   if (!kvmppc_gfn_is_uvmem_pfn(gpa >> page_shift, kvm, NULL))
-   goto out;
-
-   ret = migrate_vma_setup(&mig);
-   if (ret)
-   goto out;
-
-   spage = migrate_pfn_to_page(*mig.src);
-   if (!spage || !(*mig.src & 

[PATCH 2/2] KVM: PPC: Book3S HV: rework secure mem slot dropping

2020-07-03 Thread Laurent Dufour
When a secure memslot is dropped, all the pages backed in the secure
device (i.e. really backed by secure memory by the Ultravisor) should be
paged out to normal pages. Previously, this was achieved by triggering
the page fault mechanism, which calls kvmppc_svm_page_out() on each page.

This can't work when hot unplugging a memory slot because the memory slot
is flagged as invalid and gfn_to_pfn() then does not try to access the
page, so the page fault mechanism is not triggered.

Since the final goal is to make a call to kvmppc_svm_page_out(), it seems
simpler to call it directly instead of triggering such a mechanism. This
way kvmppc_uvmem_drop_pages() can be called even when hot unplugging a
memslot.

Since kvmppc_uvmem_drop_pages() is already holding kvm->arch.uvmem_lock,
the call is made to __kvmppc_svm_page_out().
As __kvmppc_svm_page_out() needs the vma pointer to migrate the pages, the
VMA is fetched lazily, to avoid calling find_vma() all the time. In
addition, the mmap_sem is held in read mode during that time, not in write
mode, since the virtual memory layout is not impacted, and
kvm->arch.uvmem_lock prevents concurrent operations on the secure device.

Cc: Ram Pai 
Cc: Bharata B Rao 
Cc: Paul Mackerras 
Signed-off-by: Laurent Dufour 
---
 arch/powerpc/kvm/book3s_hv_uvmem.c | 54 --
 1 file changed, 37 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 852cc9ae6a0b..479ddf16d18c 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -533,35 +533,55 @@ static inline int kvmppc_svm_page_out(struct 
vm_area_struct *vma,
  * fault on them, do fault time migration to replace the device PTEs in
  * QEMU page table with normal PTEs from newly allocated pages.
  */
-void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *free,
+void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *slot,
 struct kvm *kvm, bool skip_page_out)
 {
int i;
struct kvmppc_uvmem_page_pvt *pvt;
-   unsigned long pfn, uvmem_pfn;
-   unsigned long gfn = free->base_gfn;
+   struct page *uvmem_page;
+   struct vm_area_struct *vma = NULL;
+   unsigned long uvmem_pfn, gfn;
+   unsigned long addr, end;
+
+   down_read(&kvm->mm->mmap_sem);
+
+   addr = slot->userspace_addr;
+   end = addr + (slot->npages * PAGE_SIZE);
 
-   for (i = free->npages; i; --i, ++gfn) {
-   struct page *uvmem_page;
+   gfn = slot->base_gfn;
+   for (i = slot->npages; i; --i, ++gfn, addr += PAGE_SIZE) {
+
+   /* Fetch the VMA if addr is not in the latest fetched one */
+   if (!vma || (addr < vma->vm_start || addr >= vma->vm_end)) {
+   vma = find_vma_intersection(kvm->mm, addr, end);
+   if (!vma ||
+   vma->vm_start > addr || vma->vm_end < end) {
+   pr_err("Can't find VMA for gfn:0x%lx\n", gfn);
+   break;
+   }
+   }
 
	mutex_lock(&kvm->arch.uvmem_lock);
-   if (!kvmppc_gfn_is_uvmem_pfn(gfn, kvm, &uvmem_pfn)) {
+
+   if (kvmppc_gfn_is_uvmem_pfn(gfn, kvm, &uvmem_pfn)) {
+   uvmem_page = pfn_to_page(uvmem_pfn);
+   pvt = uvmem_page->zone_device_data;
+   pvt->skip_page_out = skip_page_out;
+   pvt->remove_gfn = true;
+
+   if (__kvmppc_svm_page_out(vma, addr, addr + PAGE_SIZE,
+ PAGE_SHIFT, kvm, pvt->gpa))
+   pr_err("Can't page out gpa:0x%lx addr:0x%lx\n",
+  pvt->gpa, addr);
+   } else {
+   /* Remove the shared flag if any */
kvmppc_gfn_remove(gfn, kvm);
-   mutex_unlock(&kvm->arch.uvmem_lock);
-   continue;
}
 
-   uvmem_page = pfn_to_page(uvmem_pfn);
-   pvt = uvmem_page->zone_device_data;
-   pvt->skip_page_out = skip_page_out;
-   pvt->remove_gfn = true;
	mutex_unlock(&kvm->arch.uvmem_lock);
-
-   pfn = gfn_to_pfn(kvm, gfn);
-   if (is_error_noslot_pfn(pfn))
-   continue;
-   kvm_release_pfn_clean(pfn);
}
+
+   up_read(&kvm->mm->mmap_sem);
 }
 
 unsigned long kvmppc_h_svm_init_abort(struct kvm *kvm)
-- 
2.27.0



Re: [PATCH v2] ASoC: fsl_asrc: Add an option to select internal ratio mode

2020-07-03 Thread Mark Brown
On Tue, Jun 30, 2020 at 04:47:56PM +0800, Shengjiu Wang wrote:
> The ASRC not only supports ideal ratio mode, but also supports
> internal ratio mode.

This doesn't apply against current code, please check and resend.




[RFC PATCH 5/5] selftests/powerpc: Remove powerpc special cases from stack expansion test

2020-07-03 Thread Michael Ellerman
Now that the powerpc code behaves the same as other architectures we
can drop the special cases we had.

Signed-off-by: Michael Ellerman 
---
 .../powerpc/mm/stack_expansion_ldst.c | 41 +++
 1 file changed, 5 insertions(+), 36 deletions(-)

diff --git a/tools/testing/selftests/powerpc/mm/stack_expansion_ldst.c 
b/tools/testing/selftests/powerpc/mm/stack_expansion_ldst.c
index 95c3f3de16a1..ed9143990888 100644
--- a/tools/testing/selftests/powerpc/mm/stack_expansion_ldst.c
+++ b/tools/testing/selftests/powerpc/mm/stack_expansion_ldst.c
@@ -56,13 +56,7 @@ int consume_stack(unsigned long target_sp, unsigned long 
stack_high, int delta,
 #else
asm volatile ("mov %%rsp, %[sp]" : [sp] "=r" (stack_top_sp));
 #endif
-
-   // Kludge, delta < 0 indicates relative to SP
-   if (delta < 0)
-   target = stack_top_sp + delta;
-   else
-   target = stack_high - delta + 1;
-
+   target = stack_high - delta + 1;
volatile char *p = (char *)target;
 
if (type == STORE)
@@ -162,41 +156,16 @@ static int test_one(unsigned int stack_used, int delta, 
enum access_type type)
 
 static void test_one_type(enum access_type type, unsigned long page_size, 
unsigned long rlim_cur)
 {
-   assert(test_one(DEFAULT_SIZE, 512 * _KB, type) == 0);
+   unsigned long delta;
 
-   // powerpc has a special case to allow up to 1MB
-   assert(test_one(DEFAULT_SIZE, 1 * _MB, type) == 0);
-
-#ifdef __powerpc__
-   // This fails on powerpc because it's > 1MB and is not a stdu &
-   // not close to r1
-   assert(test_one(DEFAULT_SIZE, 1 * _MB + 8, type) != 0);
-#else
-   assert(test_one(DEFAULT_SIZE, 1 * _MB + 8, type) == 0);
-#endif
-
-#ifdef __powerpc__
-   // Accessing way past the stack pointer is not allowed on powerpc
-   assert(test_one(DEFAULT_SIZE, rlim_cur, type) != 0);
-#else
// We should be able to access anywhere within the rlimit
+   for (delta = page_size; delta <= rlim_cur; delta += page_size)
+   assert(test_one(DEFAULT_SIZE, delta, type) == 0);
+
assert(test_one(DEFAULT_SIZE, rlim_cur, type) == 0);
-#endif
 
// But if we go past the rlimit it should fail
assert(test_one(DEFAULT_SIZE, rlim_cur + 1, type) != 0);
-
-   // Above 1MB powerpc only allows accesses within 4096 bytes of
-   // r1 for accesses that aren't stdu
-   assert(test_one(1 * _MB + page_size - 128, -4096, type) == 0);
-#ifdef __powerpc__
-   assert(test_one(1 * _MB + page_size - 128, -4097, type) != 0);
-#else
-   assert(test_one(1 * _MB + page_size - 128, -4097, type) == 0);
-#endif
-
-   // By consuming 2MB of stack we test the stdu case
-   assert(test_one(2 * _MB + page_size - 128, -4096, type) == 0);
 }
 
 static int test(void)
-- 
2.25.1



[RFC PATCH 4/5] powerpc/mm: Remove custom stack expansion checking

2020-07-03 Thread Michael Ellerman
We have powerpc specific logic in our page fault handling to decide if
an access to an unmapped address below the stack pointer should expand
the stack VMA.

The logic aims to prevent userspace from doing bad accesses below the
stack pointer. However as long as the stack is < 1MB in size, we allow
all accesses without further checks. Adding some debug I see that I
can do a full kernel build and LTP run, and not a single process has
used more than 1MB of stack. So for the majority of processes the
logic never even fires.

We also recently found a nasty bug in this code which could cause
userspace programs to be killed during signal delivery. It went
unnoticed presumably because most processes use < 1MB of stack.

The generic mm code has also grown support for stack guard pages since
this code was originally written, so the most heinous case of the
stack expanding into other mappings is now handled for us.

Finally although some other arches have special logic in this path,
from what I can tell none of x86, arm64, arm and s390 impose any extra
checks other than those in expand_stack().

So drop our complicated logic and, like other architectures, just let the
stack expand as long as it's within the rlimit.

Signed-off-by: Michael Ellerman 
---
 arch/powerpc/mm/fault.c | 106 ++--
 1 file changed, 5 insertions(+), 101 deletions(-)

diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index ed01329dd12b..925a7231abb3 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -42,39 +42,7 @@
 #include 
 #include 
 
-/*
- * Check whether the instruction inst is a store using
- * an update addressing form which will update r1.
- */
-static bool store_updates_sp(struct ppc_inst inst)
-{
-   /* check for 1 in the rA field */
-   if (((ppc_inst_val(inst) >> 16) & 0x1f) != 1)
-   return false;
-   /* check major opcode */
-   switch (ppc_inst_primary_opcode(inst)) {
-   case OP_STWU:
-   case OP_STBU:
-   case OP_STHU:
-   case OP_STFSU:
-   case OP_STFDU:
-   return true;
-   case OP_STD:/* std or stdu */
-   return (ppc_inst_val(inst) & 3) == 1;
-   case OP_31:
-   /* check minor opcode */
-   switch ((ppc_inst_val(inst) >> 1) & 0x3ff) {
-   case OP_31_XOP_STDUX:
-   case OP_31_XOP_STWUX:
-   case OP_31_XOP_STBUX:
-   case OP_31_XOP_STHUX:
-   case OP_31_XOP_STFSUX:
-   case OP_31_XOP_STFDUX:
-   return true;
-   }
-   }
-   return false;
-}
+
 /*
  * do_page_fault error handling helpers
  */
@@ -267,54 +235,6 @@ static bool bad_kernel_fault(struct pt_regs *regs, 
unsigned long error_code,
return false;
 }
 
-static bool bad_stack_expansion(struct pt_regs *regs, unsigned long address,
-   struct vm_area_struct *vma, unsigned int flags,
-   bool *must_retry)
-{
-   /*
-* N.B. The POWER/Open ABI allows programs to access up to
-* 288 bytes below the stack pointer.
-* The kernel signal delivery code writes up to 4KB
-* below the stack pointer (r1) before decrementing it.
-* The exec code can write slightly over 640kB to the stack
-* before setting the user r1.  Thus we allow the stack to
-* expand to 1MB without further checks.
-*/
-   if (address + 0x100000 < vma->vm_end) {
-   struct ppc_inst __user *nip = (struct ppc_inst __user 
*)regs->nip;
-   /* get user regs even if this fault is in kernel mode */
-   struct pt_regs *uregs = current->thread.regs;
-   if (uregs == NULL)
-   return true;
-
-   /*
-* A user-mode access to an address a long way below
-* the stack pointer is only valid if the instruction
-* is one which would update the stack pointer to the
-* address accessed if the instruction completed,
-* i.e. either stwu rs,n(r1) or stwux rs,r1,rb
-* (or the byte, halfword, float or double forms).
-*
-* If we don't check this then any write to the area
-* between the last mapped region and the stack will
-* expand the stack rather than segfaulting.
-*/
-   if (address + 4096 >= uregs->gpr[1])
-   return false;
-
-   if ((flags & FAULT_FLAG_WRITE) && (flags & FAULT_FLAG_USER) &&
-   access_ok(nip, sizeof(*nip))) {
-   struct ppc_inst inst;
-
-   if (!probe_user_read_inst(&inst, nip))
-   return !store_updates_sp(inst);
-   *must_retry = true;
-   }
-   return true;
-   }
-   return 

[PATCH 2/5] powerpc: Allow 4096 bytes of stack expansion for the signal frame

2020-07-03 Thread Michael Ellerman
We have powerpc specific logic in our page fault handling to decide if
an access to an unmapped address below the stack pointer should expand
the stack VMA.

The code was originally added in 2004 "ported from 2.4". The rough
logic is that the stack is allowed to grow to 1MB with no extra
checking. Over 1MB the access must be within 2048 bytes of the stack
pointer, or be from a user instruction that updates the stack pointer.

The 2048 byte allowance below the stack pointer is there to cover the
288 "red zone" as well as the "about 1.5kB" needed by the signal
delivery code.
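
Condensed into a small, self-contained sketch (illustrative only; the
helper name and parameters are made up here and this is not the actual
arch/powerpc/mm/fault.c code), the rule described above is roughly:

#include <stdbool.h>

#define ONE_MB 0x100000UL

/* Hedged sketch of the historical powerpc rule described above.
 * 'address' is the faulting address, 'vma_end' the current top of the
 * stack VMA, 'r1' the user stack pointer, and 'insn_updates_sp' says
 * whether the faulting instruction is a stwu/stdu style store that
 * updates r1. */
static bool stack_expansion_allowed(unsigned long address, unsigned long vma_end,
				    unsigned long r1, bool insn_updates_sp)
{
	/* Up to 1MB of stack: always allow the expansion. */
	if (address + ONE_MB >= vma_end)
		return true;

	/* Over 1MB: the access must be within 2048 bytes below r1
	 * (the red zone plus the signal delivery allowance)... */
	if (address + 2048 >= r1)
		return true;

	/* ...or come from an instruction that updates the stack pointer. */
	return insn_updates_sp;
}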

Unfortunately since then the signal frame has expanded, and is now
4096 bytes on 64-bit kernels with transactional memory enabled. This
means if a process has consumed more than 1MB of stack, and its stack
pointer lies less than 4096 bytes from the next page boundary, signal
delivery will fault when trying to expand the stack and the process
will see a SEGV.

The 2048 byte allowance was sufficient until 2008 as the signal frame was:

struct rt_sigframe {
struct ucontext    uc;           /*     0  1440 */
/* --- cacheline 11 boundary (1408 bytes) was 32 bytes ago --- */
long unsigned int  _unused[2];   /*  1440    16 */
unsigned int       tramp[6];     /*  1456    24 */
struct siginfo *   pinfo;        /*  1480     8 */
void *             puc;          /*  1488     8 */
struct siginfo     info;         /*  1496   128 */
/* --- cacheline 12 boundary (1536 bytes) was 88 bytes ago --- */
char               abigap[288];  /*  1624   288 */

/* size: 1920, cachelines: 15, members: 7 */
/* padding: 8 */
};

Then in commit ce48b2100785 ("powerpc: Add VSX context save/restore,
ptrace and signal support") (Jul 2008) the signal frame expanded to
2176 bytes:

struct rt_sigframe {
struct ucontext    uc;           /*     0  1696 */   <--
/* --- cacheline 13 boundary (1664 bytes) was 32 bytes ago --- */
long unsigned int  _unused[2];   /*  1696    16 */
unsigned int       tramp[6];     /*  1712    24 */
struct siginfo *   pinfo;        /*  1736     8 */
void *             puc;          /*  1744     8 */
struct siginfo     info;         /*  1752   128 */
/* --- cacheline 14 boundary (1792 bytes) was 88 bytes ago --- */
char               abigap[288];  /*  1880   288 */

/* size: 2176, cachelines: 17, members: 7 */
/* padding: 8 */
};

At this point we should have been exposed to the bug, though as far as
I know it was never reported. I no longer have a system old enough to
easily test on.

Then in 2010 commit 320b2b8de126 ("mm: keep a guard page below a
grow-down stack segment") caused our stack expansion code to never
trigger, as there was always a VMA found for a write up to PAGE_SIZE
below r1.

That meant the bug was hidden as we continued to expand the signal
frame in commit 2b0a576d15e0 ("powerpc: Add new transactional memory
state to the signal context") (Feb 2013):

struct rt_sigframe {
struct ucontext    uc;           /*     0  1696 */
/* --- cacheline 13 boundary (1664 bytes) was 32 bytes ago --- */
struct ucontext    uc_transact;  /*  1696  1696 */   <--
/* --- cacheline 26 boundary (3328 bytes) was 64 bytes ago --- */
long unsigned int  _unused[2];   /*  3392    16 */
unsigned int       tramp[6];     /*  3408    24 */
struct siginfo *   pinfo;        /*  3432     8 */
void *             puc;          /*  3440     8 */
struct siginfo     info;         /*  3448   128 */
/* --- cacheline 27 boundary (3456 bytes) was 120 bytes ago --- */
char               abigap[288];  /*  3576   288 */

/* size: 3872, cachelines: 31, members: 8 */
/* padding: 8 */
/* last cacheline: 32 bytes */
};

And commit 573ebfa6601f ("powerpc: Increase stack redzone for 64-bit
userspace to 512 bytes") (Feb 2014):

struct rt_sigframe {
struct ucontext    uc;           /*     0  1696 */
/* --- cacheline 13 boundary (1664 bytes) was 32 bytes ago --- */
struct ucontext    uc_transact;  /*  1696  1696 */
/* --- cacheline 26 boundary (3328 bytes) was 64 bytes ago --- */
long unsigned int  _unused[2];   /*  3392    16 */
unsigned int       tramp[6];     /*  3408    24 */
struct siginfo *   pinfo;        /*  3432     8 */
void *             puc;          /*  3440     8 */
struct 

[PATCH 3/5] selftests/powerpc: Update the stack expansion test

2020-07-03 Thread Michael Ellerman
Update the stack expansion load/store test to take into account the
new allowance of 4096 bytes below the stack pointer.

Signed-off-by: Michael Ellerman 
---
 .../selftests/powerpc/mm/stack_expansion_ldst.c| 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/tools/testing/selftests/powerpc/mm/stack_expansion_ldst.c 
b/tools/testing/selftests/powerpc/mm/stack_expansion_ldst.c
index 0587e11437f5..95c3f3de16a1 100644
--- a/tools/testing/selftests/powerpc/mm/stack_expansion_ldst.c
+++ b/tools/testing/selftests/powerpc/mm/stack_expansion_ldst.c
@@ -186,17 +186,17 @@ static void test_one_type(enum access_type type, unsigned 
long page_size, unsign
// But if we go past the rlimit it should fail
assert(test_one(DEFAULT_SIZE, rlim_cur + 1, type) != 0);
 
-   // Above 1MB powerpc only allows accesses within 2048 bytes of
+   // Above 1MB powerpc only allows accesses within 4096 bytes of
// r1 for accesses that aren't stdu
-   assert(test_one(1 * _MB + page_size - 128, -2048, type) == 0);
+   assert(test_one(1 * _MB + page_size - 128, -4096, type) == 0);
 #ifdef __powerpc__
-   assert(test_one(1 * _MB + page_size - 128, -2049, type) != 0);
+   assert(test_one(1 * _MB + page_size - 128, -4097, type) != 0);
 #else
-   assert(test_one(1 * _MB + page_size - 128, -2049, type) == 0);
+   assert(test_one(1 * _MB + page_size - 128, -4097, type) == 0);
 #endif
 
// By consuming 2MB of stack we test the stdu case
-   assert(test_one(2 * _MB + page_size - 128, -2048, type) == 0);
+   assert(test_one(2 * _MB + page_size - 128, -4096, type) == 0);
 }
 
 static int test(void)
-- 
2.25.1



[PATCH 1/5] selftests/powerpc: Add test of stack expansion logic

2020-07-03 Thread Michael Ellerman
We have custom stack expansion checks that it turns out are extremely
badly tested and contain bugs, surprise. So add some tests that
exercise the code and capture the current boundary conditions.

The signal test currently fails on 64-bit kernels because the 2048
byte allowance for the signal frame is too small, we will fix that in
a subsequent patch.

Signed-off-by: Michael Ellerman 
---
 tools/testing/selftests/powerpc/mm/.gitignore |   2 +
 tools/testing/selftests/powerpc/mm/Makefile   |   8 +-
 .../powerpc/mm/stack_expansion_ldst.c | 233 ++
 .../powerpc/mm/stack_expansion_signal.c   | 114 +
 tools/testing/selftests/powerpc/pmu/lib.h |   1 +
 5 files changed, 357 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/powerpc/mm/stack_expansion_ldst.c
 create mode 100644 tools/testing/selftests/powerpc/mm/stack_expansion_signal.c

diff --git a/tools/testing/selftests/powerpc/mm/.gitignore 
b/tools/testing/selftests/powerpc/mm/.gitignore
index 2ca523255b1b..8bfa3b39f628 100644
--- a/tools/testing/selftests/powerpc/mm/.gitignore
+++ b/tools/testing/selftests/powerpc/mm/.gitignore
@@ -8,3 +8,5 @@ wild_bctr
 large_vm_fork_separation
 bad_accesses
 tlbie_test
+stack_expansion_ldst
+stack_expansion_signal
diff --git a/tools/testing/selftests/powerpc/mm/Makefile 
b/tools/testing/selftests/powerpc/mm/Makefile
index b9103c4bb414..3937b277c288 100644
--- a/tools/testing/selftests/powerpc/mm/Makefile
+++ b/tools/testing/selftests/powerpc/mm/Makefile
@@ -3,7 +3,8 @@
$(MAKE) -C ../
 
 TEST_GEN_PROGS := hugetlb_vs_thp_test subpage_prot prot_sao segv_errors 
wild_bctr \
- large_vm_fork_separation bad_accesses
+ large_vm_fork_separation bad_accesses stack_expansion_signal \
+ stack_expansion_ldst
 TEST_GEN_PROGS_EXTENDED := tlbie_test
 TEST_GEN_FILES := tempfile
 
@@ -18,6 +19,11 @@ $(OUTPUT)/wild_bctr: CFLAGS += -m64
 $(OUTPUT)/large_vm_fork_separation: CFLAGS += -m64
 $(OUTPUT)/bad_accesses: CFLAGS += -m64
 
+$(OUTPUT)/stack_expansion_signal: ../utils.c ../pmu/lib.c
+
+$(OUTPUT)/stack_expansion_ldst: CFLAGS += -fno-stack-protector
+$(OUTPUT)/stack_expansion_ldst: ../utils.c
+
 $(OUTPUT)/tempfile:
dd if=/dev/zero of=$@ bs=64k count=1
 
diff --git a/tools/testing/selftests/powerpc/mm/stack_expansion_ldst.c 
b/tools/testing/selftests/powerpc/mm/stack_expansion_ldst.c
new file mode 100644
index ..0587e11437f5
--- /dev/null
+++ b/tools/testing/selftests/powerpc/mm/stack_expansion_ldst.c
@@ -0,0 +1,233 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Test that loads/stores expand the stack segment, or trigger a SEGV, in
+ * various conditions.
+ *
+ * Based on test code by Tom Lane.
+ */
+
+#undef NDEBUG
+#include 
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define _KB (1024)
+#define _MB (1024 * 1024)
+
+volatile char *stack_top_ptr;
+volatile unsigned long stack_top_sp;
+volatile char c;
+
+enum access_type {
+   LOAD,
+   STORE,
+};
+
+/*
+ * Consume stack until the stack pointer is below @target_sp, then do an access
+ * (load or store) at offset @delta from either the base of the stack or the
+ * current stack pointer.
+ */
+__attribute__ ((noinline))
+int consume_stack(unsigned long target_sp, unsigned long stack_high, int 
delta, enum access_type type)
+{
+   unsigned long target;
+   char stack_cur;
+
+   if ((unsigned long)&stack_cur > target_sp)
+   return consume_stack(target_sp, stack_high, delta, type);
+   else {
+   // We don't really need this, but without it GCC might not
+   // generate a recursive call above.
+   stack_top_ptr = &stack_cur;
+
+#ifdef __powerpc__
+   asm volatile ("mr %[sp], %%r1" : [sp] "=r" (stack_top_sp));
+#else
+   asm volatile ("mov %%rsp, %[sp]" : [sp] "=r" (stack_top_sp));
+#endif
+
+   // Kludge, delta < 0 indicates relative to SP
+   if (delta < 0)
+   target = stack_top_sp + delta;
+   else
+   target = stack_high - delta + 1;
+
+   volatile char *p = (char *)target;
+
+   if (type == STORE)
+   *p = c;
+   else
+   c = *p;
+
+   // Do something to prevent the stack frame being popped prior to
+   // our access above.
+   getpid();
+   }
+
+   return 0;
+}
+
+static int search_proc_maps(char *needle, unsigned long *low, unsigned long 
*high)
+{
+   unsigned long start, end;
+   static char buf[4096];
+   char name[128];
+   FILE *f;
+   int rc;
+
+   f = fopen("/proc/self/maps", "r");
+   if (!f) {
+   perror("fopen");
+   return -1;
+   }
+
+   while (fgets(buf, sizeof(buf), f)) {
+   rc = sscanf(buf, "%lx-%lx 

Re: [v2 PATCH] crypto: af_alg - Fix regression on empty requests

2020-07-03 Thread Luis Chamberlain
On Thu, Jul 02, 2020 at 01:32:21PM +1000, Herbert Xu wrote:
> On Tue, Jun 30, 2020 at 02:18:11PM +0530, Naresh Kamboju wrote:
> > 
> > Since we are on this subject,
> > LTP af_alg02  test case fails on stable 4.9 and stable 4.4
> > This is not a regression because the test case has been failing from
> > the beginning.
> > 
> > Is this test case expected to fail on stable 4.9 and 4.4 ?
> > or any chance to fix this on these older branches ?
> > 
> > Test output:
> > af_alg02.c:52: BROK: Timed out while reading from request socket.
> > 
> > ref:
> > https://qa-reports.linaro.org/lkft/linux-stable-rc-4.9-oe/build/v4.9.228-191-g082e807235d7/testrun/2884917/suite/ltp-crypto-tests/test/af_alg02/history/
> > https://qa-reports.linaro.org/lkft/linux-stable-rc-4.9-oe/build/v4.9.228-191-g082e807235d7/testrun/2884606/suite/ltp-crypto-tests/test/af_alg02/log
> 
> Actually this test really is broken.

FWIW the patch "umh: fix processed error when UMH_WAIT_PROC is used" was
dropped from linux-next for now as it was missing checks for signals.
I'll be open coding all the checks for each UMH_WAIT_PROC caller next. It's
not clear if this was the issue with this test case, but I figured I'd let
you know.

  Luis


Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-07-03 Thread Srikar Dronamraju
* Michal Hocko  [2020-07-03 12:59:44]:

> > Honestly, I do not have any idea. I've traced it down to
> > Author: Andi Kleen 
> > Date:   Tue Jan 11 15:35:48 2005 -0800
> > 
> > [PATCH] x86_64: Fix ACPI SRAT NUMA parsing
> > 
> > Fix fallout from the recent nodemask_t changes. The node ids assigned
> > in the SRAT parser were off by one.
> > 
> > I added a new first_unset_node() function to nodemask.h to allocate
> > IDs sanely.
> > 
> > Signed-off-by: Andi Kleen 
> > Signed-off-by: Linus Torvalds 
> > 
> > which doesn't really tell all that much. The historical baggage and a
> > long term behavior which is not really trivial to fix I suspect.
> 
> Thinking about this some more, this logic makes some sense afterall.
> Especially in the world without memory hotplug which was very likely the
> case back then. It is much better to have compact node mask rather than
> sparse one. After all node numbers shouldn't really matter as long as
> you have a clear mapping to the HW. I am not sure we export that
> information (except for the kernel ring buffer) though.
> 
> The memory hotplug changes that somehow because you can hotremove numa
> nodes and therefore make the nodemask sparse but that is not a common
> case. I am not sure what would happen if a completely new node was added
> and its corresponding node was already used by the renumbered one
> though. It would likely conflate the two I am afraid. But I am not sure
> this is really possible with x86 and a lack of a bug report would
> suggest that nobody is doing that at least.
> 

JFYI,
Satheesh, copied on this mail chain, had opened a bug a year ago about a crash
with vcpu hotplug on a memoryless node.

https://bugzilla.kernel.org/show_bug.cgi?id=202187

-- 
Thanks and Regards
Srikar Dronamraju


[PATCH 2/2] powerpc/powernv/idle: save-restore DAWR0,DAWRX0 for P10

2020-07-03 Thread Pratik Rajesh Sampat
The additional registers DAWR0 and DAWRX0 may be lost on POWER10 for
stop levels < 4.
Therefore save the values of these SPRs before entering a "stop"
state and restore their values on wakeup.

Signed-off-by: Pratik Rajesh Sampat 
---
 arch/powerpc/platforms/powernv/idle.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/idle.c 
b/arch/powerpc/platforms/powernv/idle.c
index 19d94d021357..471d4a65b1fa 100644
--- a/arch/powerpc/platforms/powernv/idle.c
+++ b/arch/powerpc/platforms/powernv/idle.c
@@ -600,6 +600,8 @@ struct p9_sprs {
u64 iamr;
u64 amor;
u64 uamor;
+   u64 dawr0;
+   u64 dawrx0;
 };
 
 static unsigned long power9_idle_stop(unsigned long psscr, bool mmu_on)
@@ -677,6 +679,10 @@ static unsigned long power9_idle_stop(unsigned long psscr, 
bool mmu_on)
sprs.tscr   = mfspr(SPRN_TSCR);
if (!firmware_has_feature(FW_FEATURE_ULTRAVISOR))
sprs.ldbar = mfspr(SPRN_LDBAR);
+   if (cpu_has_feature(CPU_FTR_ARCH_31)) {
+   sprs.dawr0 = mfspr(SPRN_DAWR0);
+   sprs.dawrx0 = mfspr(SPRN_DAWRX0);
+   }
 
sprs_saved = true;
 
@@ -792,6 +798,10 @@ static unsigned long power9_idle_stop(unsigned long psscr, 
bool mmu_on)
mtspr(SPRN_MMCR2,   sprs.mmcr2);
if (!firmware_has_feature(FW_FEATURE_ULTRAVISOR))
mtspr(SPRN_LDBAR, sprs.ldbar);
+   if (cpu_has_feature(CPU_FTR_ARCH_31)) {
+   mtspr(SPRN_DAWR0, sprs.dawr0);
+   mtspr(SPRN_DAWRX0, sprs.dawrx0);
+   }
 
mtspr(SPRN_SPRG3,   local_paca->sprg_vdso);
 
-- 
2.25.4



[PATCH 1/2] powerpc/powernv/idle: Exclude mfspr on HID1, 4, 5 on P9 and above

2020-07-03 Thread Pratik Rajesh Sampat
From POWER9 onwards, support for the HID1, HID4 and HID5 registers has
been removed.
Although mfspr on these registers worked on POWER9, the POWER10
simulator does not recognize them. Move their assignment under the
check for machines older than POWER9.

Signed-off-by: Pratik Rajesh Sampat 
---
 arch/powerpc/platforms/powernv/idle.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/idle.c 
b/arch/powerpc/platforms/powernv/idle.c
index 2dd467383a88..19d94d021357 100644
--- a/arch/powerpc/platforms/powernv/idle.c
+++ b/arch/powerpc/platforms/powernv/idle.c
@@ -73,9 +73,6 @@ static int pnv_save_sprs_for_deep_states(void)
 */
uint64_t lpcr_val   = mfspr(SPRN_LPCR);
uint64_t hid0_val   = mfspr(SPRN_HID0);
-   uint64_t hid1_val   = mfspr(SPRN_HID1);
-   uint64_t hid4_val   = mfspr(SPRN_HID4);
-   uint64_t hid5_val   = mfspr(SPRN_HID5);
uint64_t hmeer_val  = mfspr(SPRN_HMEER);
uint64_t msr_val = MSR_IDLE;
uint64_t psscr_val = pnv_deepest_stop_psscr_val;
@@ -117,6 +114,9 @@ static int pnv_save_sprs_for_deep_states(void)
 
/* Only p8 needs to set extra HID regiters */
if (!cpu_has_feature(CPU_FTR_ARCH_300)) {
+   uint64_t hid1_val = mfspr(SPRN_HID1);
+   uint64_t hid4_val = mfspr(SPRN_HID4);
+   uint64_t hid5_val = mfspr(SPRN_HID5);
 
rc = opal_slw_set_reg(pir, SPRN_HID1, hid1_val);
if (rc != 0)
-- 
2.25.4



Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-07-03 Thread Michal Hocko
On Fri 03-07-20 13:32:21, David Hildenbrand wrote:
> On 03.07.20 12:59, Michal Hocko wrote:
> > On Fri 03-07-20 11:24:17, Michal Hocko wrote:
> >> [Cc Andi]
> >>
> >> On Fri 03-07-20 11:10:01, Michal Suchanek wrote:
> >>> On Wed, Jul 01, 2020 at 02:21:10PM +0200, Michal Hocko wrote:
>  On Wed 01-07-20 13:30:57, David Hildenbrand wrote:
> >> [...]
> > Yep, looks like it.
> >
> > [0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0
> > [0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0
> > [0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0
> > [0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0
> > [0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x-0x0009]
> > [0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x0010-0xbfff]
> > [0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x1-0x13fff]
> 
>  This begs a question whether ppc can do the same thing?
> >>> Or x86 stop doing it so that you can see on what node you are running?
> >>>
> >>> What's the point of this indirection other than another way of avoiding
> >>> empty node 0?
> >>
> >> Honestly, I do not have any idea. I've traced it down to
> >> Author: Andi Kleen 
> >> Date:   Tue Jan 11 15:35:48 2005 -0800
> >>
> >> [PATCH] x86_64: Fix ACPI SRAT NUMA parsing
> >>
> >> Fix fallout from the recent nodemask_t changes. The node ids assigned
> >> in the SRAT parser were off by one.
> >>
> >> I added a new first_unset_node() function to nodemask.h to allocate
> >> IDs sanely.
> >>
> >> Signed-off-by: Andi Kleen 
> >> Signed-off-by: Linus Torvalds 
> >>
> >> which doesn't really tell all that much. The historical baggage and a
> >> long term behavior which is not really trivial to fix I suspect.
> > 
> > Thinking about this some more, this logic makes some sense afterall.
> > Especially in the world without memory hotplug which was very likely the
> > case back then. It is much better to have compact node mask rather than
> > sparse one. After all node numbers shouldn't really matter as long as
> > you have a clear mapping to the HW. I am not sure we export that
> > information (except for the kernel ring buffer) though.
> > 
> > The memory hotplug changes that somehow because you can hotremove numa
> > nodes and therefore make the nodemask sparse but that is not a common
> > case. I am not sure what would happen if a completely new node was added
> > and its corresponding node was already used by the renumbered one
> > though. It would likely conflate the two I am afraid. But I am not sure
> > this is really possible with x86 and a lack of a bug report would
> > suggest that nobody is doing that at least.
> > 
> 
> I think the ACPI code takes care of properly mapping PXM to nodes.
> 
> So if I start with PXM 0 empty and PXM 1 populated, I will get
> PXM 1 == node 0 as described. Once I hotplug something to PXM 0 in QEMU
> 
> $ echo "object_add memory-backend-ram,id=mem0,size=1G" | sudo nc -U 
> /var/tmp/monitor
> $ echo "device_add pc-dimm,id=dimm0,memdev=mem0,node=0" | sudo nc -U 
> /var/tmp/monitor
> 
> $ echo "info numa" | sudo nc -U /var/tmp/monitor
> QEMU 5.0.50 monitor - type 'help' for more information
> (qemu) info numa
> 2 nodes
> node 0 cpus:
> node 0 size: 1024 MB
> node 0 plugged: 1024 MB
> node 1 cpus: 0 1 2 3
> node 1 size: 4096 MB
> node 1 plugged: 0 MB

Thanks for double checking.

> I get in the guest:
> 
> [   50.174435] [ cut here ]
> [   50.175436] node 1 was absent from the node_possible_map
> [   50.176844] WARNING: CPU: 0 PID: 7 at mm/memory_hotplug.c:1021 
> add_memory_resource+0x8c/0x290

This would mean that the ACPI code or whoever does the remaping is not
adding the new node into possible nodes.

[...]
> I remember that we added that check just recently (due to powerpc if I am not 
> wrong).
> Not sure why that triggers here.

This was a misbehaving Qemu IIRC providing a garbage map.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-07-03 Thread David Hildenbrand
On 03.07.20 12:59, Michal Hocko wrote:
> On Fri 03-07-20 11:24:17, Michal Hocko wrote:
>> [Cc Andi]
>>
>> On Fri 03-07-20 11:10:01, Michal Suchanek wrote:
>>> On Wed, Jul 01, 2020 at 02:21:10PM +0200, Michal Hocko wrote:
 On Wed 01-07-20 13:30:57, David Hildenbrand wrote:
>> [...]
> Yep, looks like it.
>
> [0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0
> [0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0
> [0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0
> [0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0
> [0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x-0x0009]
> [0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x0010-0xbfff]
> [0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x1-0x13fff]

 This begs a question whether ppc can do the same thing?
>>> Or x86 stop doing it so that you can see on what node you are running?
>>>
>>> What's the point of this indirection other than another way of avoiding
>>> empty node 0?
>>
>> Honestly, I do not have any idea. I've traced it down to
>> Author: Andi Kleen 
>> Date:   Tue Jan 11 15:35:48 2005 -0800
>>
>> [PATCH] x86_64: Fix ACPI SRAT NUMA parsing
>>
>> Fix fallout from the recent nodemask_t changes. The node ids assigned
>> in the SRAT parser were off by one.
>>
>> I added a new first_unset_node() function to nodemask.h to allocate
>> IDs sanely.
>>
>> Signed-off-by: Andi Kleen 
>> Signed-off-by: Linus Torvalds 
>>
>> which doesn't really tell all that much. The historical baggage and a
>> long term behavior which is not really trivial to fix I suspect.
> 
> Thinking about this some more, this logic makes some sense afterall.
> Especially in the world without memory hotplug which was very likely the
> case back then. It is much better to have compact node mask rather than
> sparse one. After all node numbers shouldn't really matter as long as
> you have a clear mapping to the HW. I am not sure we export that
> information (except for the kernel ring buffer) though.
> 
> The memory hotplug changes that somehow because you can hotremove numa
> nodes and therefore make the nodemask sparse but that is not a common
> case. I am not sure what would happen if a completely new node was added
> and its corresponding node was already used by the renumbered one
> though. It would likely conflate the two I am afraid. But I am not sure
> this is really possible with x86 and a lack of a bug report would
> suggest that nobody is doing that at least.
> 

I think the ACPI code takes care of properly mapping PXM to nodes.

So if I start with PXM 0 empty and PXM 1 populated, I will get
PXM 1 == node 0 as described. Once I hotplug something to PXM 0 in QEMU

$ echo "object_add memory-backend-ram,id=mem0,size=1G" | sudo nc -U 
/var/tmp/monitor
$ echo "device_add pc-dimm,id=dimm0,memdev=mem0,node=0" | sudo nc -U 
/var/tmp/monitor

$ echo "info numa" | sudo nc -U /var/tmp/monitor
QEMU 5.0.50 monitor - type 'help' for more information
(qemu) info numa
2 nodes
node 0 cpus:
node 0 size: 1024 MB
node 0 plugged: 1024 MB
node 1 cpus: 0 1 2 3
node 1 size: 4096 MB
node 1 plugged: 0 MB

I get in the guest:

[   50.174435] [ cut here ]
[   50.175436] node 1 was absent from the node_possible_map
[   50.176844] WARNING: CPU: 0 PID: 7 at mm/memory_hotplug.c:1021 
add_memory_resource+0x8c/0x290
[   50.176844] Modules linked in:
[   50.176845] CPU: 0 PID: 7 Comm: kworker/u8:0 Not tainted 5.8.0-rc2+ #4
[   50.176846] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.4
[   50.176846] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
[   50.176847] RIP: 0010:add_memory_resource+0x8c/0x290
[   50.176849] Code: 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 63 c5 48 89 04 24 48 
0f a3 05 94 6c 1c 01 72 17 89 ee 48 c78
[   50.176849] RSP: 0018:a7a1c0043d48 EFLAGS: 00010296
[   50.176850] RAX: 002c RBX: 8bc633e63b80 RCX: 
[   50.176851] RDX: 8bc63bc27060 RSI: 8bc63bc18d00 RDI: 8bc63bc18d00
[   50.176851] RBP: 0001 R08: 01e1 R09: a7a1c0043bd8
[   50.176852] R10: 0005 R11:  R12: 00014000
[   50.176852] R13: 00017fff R14: 4000 R15: 00018000
[   50.176853] FS:  () GS:8bc63bc0() 
knlGS:
[   50.176853] CS:  0010 DS:  ES:  CR0: 80050033
[   50.176855] CR2: 55dfcbfc5ee8 CR3: aca0a000 CR4: 06f0
[   50.176855] DR0:  DR1:  DR2: 
[   50.176856] DR3:  DR6: fffe0ff0 DR7: 0400
[   50.176856] Call Trace:
[   50.176856]  __add_memory+0x33/0x70
[   50.176857]  acpi_memory_device_add+0x132/0x2f2
[   50.176857]  acpi_bus_attach+0xd2/0x200
[   50.176858]  acpi_bus_scan+0x33/0x70
[   50.176858]  acpi_device_hotplug+0x298/0x390
[   50.176858]  

Re: [PATCH 16/26] mm/powerpc: Use general page fault accounting

2020-07-03 Thread Michael Ellerman
Peter Xu  writes:
> Use the general page fault accounting by passing regs into handle_mm_fault().
>
> CC: Michael Ellerman 
> CC: Benjamin Herrenschmidt 
> CC: Paul Mackerras 
> CC: linuxppc-dev@lists.ozlabs.org
> Signed-off-by: Peter Xu 
> ---
>  arch/powerpc/mm/fault.c | 11 +++
>  1 file changed, 3 insertions(+), 8 deletions(-)
>
> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> index 992b10c3761c..e325d13efaf5 100644
> --- a/arch/powerpc/mm/fault.c
> +++ b/arch/powerpc/mm/fault.c
> @@ -563,7 +563,7 @@ static int __do_page_fault(struct pt_regs *regs, unsigned 
> long address,
>* make sure we exit gracefully rather than endlessly redo
>* the fault.
>*/
> - fault = handle_mm_fault(vma, address, flags, NULL);
> + fault = handle_mm_fault(vma, address, flags, regs);
>  
>  #ifdef CONFIG_PPC_MEM_KEYS
>   /*
> @@ -604,14 +604,9 @@ static int __do_page_fault(struct pt_regs *regs, 
> unsigned long address,
>   /*
>* Major/minor page fault accounting.
>*/
> - if (major) {
> - current->maj_flt++;
> - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs, address);
> + if (major)
>   cmo_account_page_fault();
> - } else {
> - current->min_flt++;
> - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);
> - }
> +
>   return 0;
>  }
>  NOKPROBE_SYMBOL(__do_page_fault);


You do change the logic a bit if regs is NULL (in mm_account_fault()),
but regs can never be NULL in this path, so it looks OK to me.

Acked-by: Michael Ellerman 

cheers


Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-07-03 Thread Michal Hocko
On Fri 03-07-20 11:24:17, Michal Hocko wrote:
> [Cc Andi]
> 
> On Fri 03-07-20 11:10:01, Michal Suchanek wrote:
> > On Wed, Jul 01, 2020 at 02:21:10PM +0200, Michal Hocko wrote:
> > > On Wed 01-07-20 13:30:57, David Hildenbrand wrote:
> [...]
> > > > Yep, looks like it.
> > > > 
> > > > [0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0
> > > > [0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0
> > > > [0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0
> > > > [0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0
> > > > [0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x-0x0009]
> > > > [0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x0010-0xbfff]
> > > > [0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x1-0x13fff]
> > > 
> > > This begs a question whether ppc can do the same thing?
> > Or x86 stop doing it so that you can see on what node you are running?
> > 
> > What's the point of this indirection other than another way of avoiding
> > empty node 0?
> 
> Honestly, I do not have any idea. I've traced it down to
> Author: Andi Kleen 
> Date:   Tue Jan 11 15:35:48 2005 -0800
> 
> [PATCH] x86_64: Fix ACPI SRAT NUMA parsing
> 
> Fix fallout from the recent nodemask_t changes. The node ids assigned
> in the SRAT parser were off by one.
> 
> I added a new first_unset_node() function to nodemask.h to allocate
> IDs sanely.
> 
> Signed-off-by: Andi Kleen 
> Signed-off-by: Linus Torvalds 
> 
> which doesn't really tell all that much. The historical baggage and a
> long term behavior which is not really trivial to fix I suspect.

Thinking about this some more, this logic makes some sense afterall.
Especially in the world without memory hotplug which was very likely the
case back then. It is much better to have compact node mask rather than
sparse one. After all node numbers shouldn't really matter as long as
you have a clear mapping to the HW. I am not sure we export that
information (except for the kernel ring buffer) though.

The memory hotplug changes that somehow because you can hotremove numa
nodes and therefore make the nodemask sparse but that is not a common
case. I am not sure what would happen if a completely new node was added
and its corresponding node was already used by the renumbered one
though. It would likely conflate the two I am afraid. But I am not sure
this is really possible with x86 and a lack of a bug report would
suggest that nobody is doing that at least.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 5/8] powerpc/64s: implement queued spinlocks and rwlocks

2020-07-03 Thread Michael Ellerman
Nicholas Piggin  writes:
> Excerpts from Will Deacon's message of July 2, 2020 8:35 pm:
>> On Thu, Jul 02, 2020 at 08:25:43PM +1000, Nicholas Piggin wrote:
>>> Excerpts from Will Deacon's message of July 2, 2020 6:02 pm:
>>> > On Thu, Jul 02, 2020 at 05:48:36PM +1000, Nicholas Piggin wrote:
>>> >> diff --git a/arch/powerpc/include/asm/qspinlock.h 
>>> >> b/arch/powerpc/include/asm/qspinlock.h
>>> >> new file mode 100644
>>> >> index ..f84da77b6bb7
>>> >> --- /dev/null
>>> >> +++ b/arch/powerpc/include/asm/qspinlock.h
>>> >> @@ -0,0 +1,20 @@
>>> >> +/* SPDX-License-Identifier: GPL-2.0 */
>>> >> +#ifndef _ASM_POWERPC_QSPINLOCK_H
>>> >> +#define _ASM_POWERPC_QSPINLOCK_H
>>> >> +
>>> >> +#include 
>>> >> +
>>> >> +#define _Q_PENDING_LOOPS (1 << 9) /* not tuned */
>>> >> +
>>> >> +#define smp_mb__after_spinlock()   smp_mb()
>>> >> +
>>> >> +static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
>>> >> +{
>>> >> +smp_mb();
>>> >> +return atomic_read(>val);
>>> >> +}
>>> > 
>>> > Why do you need the smp_mb() here?
>>> 
>>> A long and sad tale that ends here 51d7d5205d338
>>> 
>>> Should probably at least refer to that commit from here, since this one 
>>> is not going to git blame back there. I'll add something.
>> 
>> Is this still an issue, though?
>> 
>> See 38b850a73034 (where we added a similar barrier on arm64) and then
>> c6f5d02b6a0f (where we removed it).
>> 
>
> Oh nice, I didn't know that went away. Thanks for the heads up.

Argh! I spent so much time chasing that damn bug in the ipc code.

> I'm going to say I'm too scared to remove it while changing the
> spinlock algorithm, but I'll open an issue and we should look at 
> removing it.

Sounds good.

cheers


[RFC PATCH v0 2/2] KVM: PPC: Book3S HV: Use H_RPT_INVALIDATE in nested KVM

2020-07-03 Thread Bharata B Rao
In the nested KVM case, replace H_TLB_INVALIDATE with the new hcall
H_RPT_INVALIDATE if it is available. The availability of this hcall
is determined from the "hcall-rpt-invalidate" string in the
ibm,hypertas-functions DT property.

Signed-off-by: Bharata B Rao 
---
 arch/powerpc/include/asm/firmware.h   |  4 +++-
 arch/powerpc/kvm/book3s_64_mmu_radix.c| 26 ++-
 arch/powerpc/kvm/book3s_hv_nested.c   | 13 ++--
 arch/powerpc/platforms/pseries/firmware.c |  1 +
 4 files changed, 36 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/include/asm/firmware.h 
b/arch/powerpc/include/asm/firmware.h
index 6003c2e533a0..aa6a5ef5d483 100644
--- a/arch/powerpc/include/asm/firmware.h
+++ b/arch/powerpc/include/asm/firmware.h
@@ -52,6 +52,7 @@
 #define FW_FEATURE_PAPR_SCMASM_CONST(0x0020)
 #define FW_FEATURE_ULTRAVISOR  ASM_CONST(0x0040)
 #define FW_FEATURE_STUFF_TCE   ASM_CONST(0x0080)
+#define FW_FEATURE_RPT_INVALIDATE ASM_CONST(0x0100)
 
 #ifndef __ASSEMBLY__
 
@@ -71,7 +72,8 @@ enum {
FW_FEATURE_TYPE1_AFFINITY | FW_FEATURE_PRRN |
FW_FEATURE_HPT_RESIZE | FW_FEATURE_DRMEM_V2 |
FW_FEATURE_DRC_INFO | FW_FEATURE_BLOCK_REMOVE |
-   FW_FEATURE_PAPR_SCM | FW_FEATURE_ULTRAVISOR,
+   FW_FEATURE_PAPR_SCM | FW_FEATURE_ULTRAVISOR |
+   FW_FEATURE_RPT_INVALIDATE,
FW_FEATURE_PSERIES_ALWAYS = 0,
FW_FEATURE_POWERNV_POSSIBLE = FW_FEATURE_OPAL | FW_FEATURE_ULTRAVISOR,
FW_FEATURE_POWERNV_ALWAYS = 0,
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index e738ea652192..8411e42eedbd 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * Supported radix tree geometry.
@@ -313,9 +314,17 @@ void kvmppc_radix_tlbie_page(struct kvm *kvm, unsigned 
long addr,
}
 
psi = shift_to_mmu_psize(pshift);
-   rb = addr | (mmu_get_ap(psi) << PPC_BITLSHIFT(58));
-   rc = plpar_hcall_norets(H_TLB_INVALIDATE, H_TLBIE_P1_ENC(0, 0, 1),
-   lpid, rb);
+   if (!firmware_has_feature(FW_FEATURE_RPT_INVALIDATE)) {
+   rb = addr | (mmu_get_ap(psi) << PPC_BITLSHIFT(58));
+   rc = plpar_hcall_norets(H_TLB_INVALIDATE,
+   H_TLBIE_P1_ENC(0, 0, 1), lpid, rb);
+   } else {
+   rc = pseries_rpt_invalidate(lpid, H_RPTI_TARGET_CMMU,
+   H_RPTI_TYPE_NESTED |
+   H_RPTI_TYPE_TLB,
+   psize_to_rpti_pgsize(psi),
+   addr, addr + psize);
+   }
if (rc)
pr_err("KVM: TLB page invalidation hcall failed, rc=%ld\n", rc);
 }
@@ -329,8 +338,15 @@ static void kvmppc_radix_flush_pwc(struct kvm *kvm, 
unsigned int lpid)
return;
}
 
-   rc = plpar_hcall_norets(H_TLB_INVALIDATE, H_TLBIE_P1_ENC(1, 0, 1),
-   lpid, TLBIEL_INVAL_SET_LPID);
+   if (!firmware_has_feature(FW_FEATURE_RPT_INVALIDATE))
+   rc = plpar_hcall_norets(H_TLB_INVALIDATE,
+   H_TLBIE_P1_ENC(1, 0, 1),
+   lpid, TLBIEL_INVAL_SET_LPID);
+   else
+   rc = pseries_rpt_invalidate(lpid, H_RPTI_TARGET_CMMU,
+   H_RPTI_TYPE_NESTED |
+   H_RPTI_TYPE_PWC, H_RPTI_PAGE_ALL,
+   0, -1UL);
if (rc)
pr_err("KVM: TLB PWC invalidation hcall failed, rc=%ld\n", rc);
 }
diff --git a/arch/powerpc/kvm/book3s_hv_nested.c 
b/arch/powerpc/kvm/book3s_hv_nested.c
index efb78d37f29a..4d023c451be4 100644
--- a/arch/powerpc/kvm/book3s_hv_nested.c
+++ b/arch/powerpc/kvm/book3s_hv_nested.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static struct patb_entry *pseries_partition_tb;
 
@@ -401,8 +402,16 @@ static void kvmhv_flush_lpid(unsigned int lpid)
return;
}
 
-   rc = plpar_hcall_norets(H_TLB_INVALIDATE, H_TLBIE_P1_ENC(2, 0, 1),
-   lpid, TLBIEL_INVAL_SET_LPID);
+   if (!firmware_has_feature(FW_FEATURE_RPT_INVALIDATE))
+   rc = plpar_hcall_norets(H_TLB_INVALIDATE,
+   H_TLBIE_P1_ENC(2, 0, 1),
+   lpid, TLBIEL_INVAL_SET_LPID);
+   else
+   rc = pseries_rpt_invalidate(lpid, H_RPTI_TARGET_CMMU,
+   H_RPTI_TYPE_NESTED |
+   H_RPTI_TYPE_TLB | H_RPTI_TYPE_PWC |
+   H_RPTI_TYPE_PAT,
+   

[RFC PATCH v0 1/2] KVM: PPC: Book3S HV: Add support for H_RPT_INVALIDATE (nested case only)

2020-07-03 Thread Bharata B Rao
Implement the H_RPT_INVALIDATE hcall; only the nested case is supported
currently.

A KVM capability KVM_CAP_RPT_INVALIDATE is added to indicate the
support for this hcall.
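
For illustration, once this patch is applied userspace could probe for the
new capability with the standard KVM_CHECK_EXTENSION ioctl (a hedged
sketch, not part of the patch; error handling omitted):

#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Returns non-zero if the kernel advertises KVM_CAP_RPT_INVALIDATE for
 * this VM. vm_fd is assumed to be a file descriptor obtained from
 * KVM_CREATE_VM; KVM_CAP_RPT_INVALIDATE is the constant this patch adds
 * to include/uapi/linux/kvm.h. */
static int kvm_has_rpt_invalidate(int vm_fd)
{
	return ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_RPT_INVALIDATE) > 0;
}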

Signed-off-by: Bharata B Rao 
---
 Documentation/virt/kvm/api.rst| 17 
 .../include/asm/book3s/64/tlbflush-radix.h| 18 
 arch/powerpc/include/asm/kvm_book3s.h |  3 +
 arch/powerpc/kvm/book3s_hv.c  | 32 +++
 arch/powerpc/kvm/book3s_hv_nested.c   | 94 +++
 arch/powerpc/kvm/powerpc.c|  3 +
 arch/powerpc/mm/book3s64/radix_tlb.c  |  4 -
 include/uapi/linux/kvm.h  |  1 +
 8 files changed, 168 insertions(+), 4 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 426f94582b7a..d235d16a4bf0 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -5843,6 +5843,23 @@ controlled by the kvm module parameter halt_poll_ns. 
This capability allows
 the maximum halt time to specified on a per-VM basis, effectively overriding
 the module parameter for the target VM.
 
+7.21 KVM_CAP_RPT_INVALIDATE
+---------------------------
+
+:Capability: KVM_CAP_RPT_INVALIDATE
+:Architectures: ppc
+:Type: vm
+
+This capability indicates that the kernel is capable of handling
+H_RPT_INVALIDATE hcall.
+
+In order to enable the use of H_RPT_INVALIDATE in the guest,
+user space might have to advertise it for the guest. For example,
+IBM pSeries (sPAPR) guest starts using it if "hcall-rpt-invalidate" is
+present in the "ibm,hypertas-functions" device-tree property.
+
+This capability is always enabled.
+
 8. Other capabilities.
 ======================
 
diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h 
b/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
index 94439e0cefc9..aace7e9b2397 100644
--- a/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
+++ b/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
@@ -4,6 +4,10 @@
 
 #include 
 
+#define RIC_FLUSH_TLB 0
+#define RIC_FLUSH_PWC 1
+#define RIC_FLUSH_ALL 2
+
 struct vm_area_struct;
 struct mm_struct;
 struct mmu_gather;
@@ -21,6 +25,20 @@ static inline u64 psize_to_rpti_pgsize(unsigned long psize)
return H_RPTI_PAGE_ALL;
 }
 
+static inline int rpti_pgsize_to_psize(unsigned long page_size)
+{
+   if (page_size == H_RPTI_PAGE_4K)
+   return MMU_PAGE_4K;
+   if (page_size == H_RPTI_PAGE_64K)
+   return MMU_PAGE_64K;
+   if (page_size == H_RPTI_PAGE_2M)
+   return MMU_PAGE_2M;
+   if (page_size == H_RPTI_PAGE_1G)
+   return MMU_PAGE_1G;
+   else
+   return MMU_PAGE_64K; /* Default */
+}
+
 static inline int mmu_get_ap(int psize)
 {
return mmu_psize_defs[psize].ap;
diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index d32ec9ae73bd..0f1c5fa6e8ce 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -298,6 +298,9 @@ void kvmhv_set_ptbl_entry(unsigned int lpid, u64 dw0, u64 
dw1);
 void kvmhv_release_all_nested(struct kvm *kvm);
 long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu);
 long kvmhv_do_nested_tlbie(struct kvm_vcpu *vcpu);
+long kvmhv_h_rpti_nested(struct kvm_vcpu *vcpu, unsigned long lpid,
+unsigned long type, unsigned long pg_sizes,
+unsigned long start, unsigned long end);
 int kvmhv_run_single_vcpu(struct kvm_vcpu *vcpu,
  u64 time_limit, unsigned long lpcr);
 void kvmhv_save_hv_regs(struct kvm_vcpu *vcpu, struct hv_guest_state *hr);
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 6bf66649ab92..2f772183f249 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -895,6 +895,28 @@ static int kvmppc_get_yield_count(struct kvm_vcpu *vcpu)
return yield_count;
 }
 
+static long kvmppc_h_rpt_invalidate(struct kvm_vcpu *vcpu,
+   unsigned long pid, unsigned long target,
+   unsigned long type, unsigned long pg_sizes,
+   unsigned long start, unsigned long end)
+{
+   if (end < start)
+   return H_P5;
+
+   if (!(type & H_RPTI_TYPE_NESTED))
+   return H_P3;
+
+   if (!nesting_enabled(vcpu->kvm))
+   return H_FUNCTION;
+
+   /* Support only cores as target */
+   if (target != H_RPTI_TARGET_CMMU)
+   return H_P2;
+
+   return kvmhv_h_rpti_nested(vcpu, pid, (type & ~H_RPTI_TYPE_NESTED),
+  pg_sizes, start, end);
+}
+
 int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
 {
unsigned long req = kvmppc_get_gpr(vcpu, 3);
@@ -1103,6 +1125,14 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
 */
ret = kvmppc_h_svm_init_abort(vcpu->kvm);
break;
+   

[RFC PATCH v0 0/2] Use H_RPT_INVALIDATE for nested guest

2020-07-03 Thread Bharata B Rao
This patchset adds support for the new hcall H_RPT_INVALIDATE
(currently handles nested case only) and replaces the nested tlb flush
calls with this new hcall if the support for the same exists.

This applies on top of "[PATCH v3 0/3] Off-load TLB invalidations to host
for !GTSE" patchset that was posted at:

https://lore.kernel.org/linuxppc-dev/20200703053608.12884-1-bhar...@linux.ibm.com/T/#t

H_RPT_INVALIDATE

Syntax:
int64   /* H_Success: Return code on successful completion */
    /* H_Busy - repeat the call with the same */
    /* H_Parameter, H_P2, H_P3, H_P4, H_P5 : Invalid parameters */
    hcall(const uint64 H_RPT_INVALIDATE, /* Invalidate RPT translation 
lookaside information */
  uint64 pid,   /* PID/LPID to invalidate */
  uint64 target,    /* Invalidation target */
  uint64 type,  /* Type of lookaside information */
  uint64 pageSizes, /* Page sizes */
  uint64 start, /* Start of Effective Address (EA) range 
(inclusive) */
  uint64 end)   /* End of EA range (exclusive) */

Invalidation targets (target)
-----------------------------
Core MMU    0x01 /* All virtual processors in the partition */
Core local MMU  0x02 /* Current virtual processor */
Nest MMU    0x04 /* All nest/accelerator agents in use by the partition */

A combination of the above can be specified, except core and core local.

Type of translation to invalidate (type)
----------------------------------------
NESTED   0x0001  /* Invalidate nested guest partition-scope */
TLB  0x0002  /* Invalidate TLB */
PWC  0x0004  /* Invalidate Page Walk Cache */
PRT  0x0008  /* Invalidate Process Table Entries if NESTED is clear */
PAT  0x0008  /* Invalidate Partition Table Entries if NESTED is set */

A combination of the above can be specified.

Page size mask (pageSizes)
--------------------------
4K  0x01
64K 0x02
2M  0x04
1G  0x08
All sizes   (-1UL)

A combination of the above can be specified.
All page sizes can be selected with -1.

Semantics: Invalidate radix tree lookaside information
   matching the parameters given.
* Return H_P2, H_P3 or H_P4 if target, type, or pageSizes parameters are
  different from the defined values.
* Return H_PARAMETER if NESTED is set and pid is not a valid nested
  LPID allocated to this partition
* Return H_P5 if (start, end) doesn't form a valid range. Start and end
  should be a valid Quadrant address and  end > start.
* Return H_NotSupported if the partition is not running in radix
  translation mode.
* May invalidate more translation information than requested.
* If start = 0 and end = -1, set the range to cover all valid addresses.
  Else start and end should be aligned to 4kB (lower 12 bits clear).
* If NESTED is clear, then invalidate process scoped lookaside information.
  Else pid specifies a nested LPID, and the invalidation is performed
  on nested guest partition table and nested guest partition scope real
  addresses.
* If pid = 0 and NESTED is clear, then valid addresses are quadrant 3 and
  quadrant 0 spaces, Else valid addresses are quadrant 0.
* Pages which are fully covered by the range are to be invalidated.
  Those which are partially covered are considered outside invalidation
  range, which allows a caller to optimally invalidate ranges that may
  contain mixed page sizes.
* Return H_SUCCESS on success.
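
A rough guest-side sketch of one such invocation (illustrative only; it
assumes the pseries_rpt_invalidate() wrapper and the H_RPTI_* constants
from the prerequisite !GTSE series linked above):

    /*
     * Sketch: flush the TLB and page-walk cache for one nested LPID,
     * all page sizes, over the whole address range (start = 0, end = -1).
     */
    static long flush_all_for_nested_lpid(unsigned long lpid)
    {
            return pseries_rpt_invalidate(lpid, H_RPTI_TARGET_CMMU,
                                          H_RPTI_TYPE_NESTED |
                                          H_RPTI_TYPE_TLB | H_RPTI_TYPE_PWC,
                                          H_RPTI_PAGE_ALL, 0, -1UL);
    }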

Bharata B Rao (2):
  KVM: PPC: Book3S HV: Add support for H_RPT_INVALIDATE (nested case
only)
  KVM: PPC: Book3S HV: Use H_RPT_INVALIDATE in nested KVM

 Documentation/virt/kvm/api.rst|  17 +++
 .../include/asm/book3s/64/tlbflush-radix.h|  18 +++
 arch/powerpc/include/asm/firmware.h   |   4 +-
 arch/powerpc/include/asm/kvm_book3s.h |   3 +
 arch/powerpc/kvm/book3s_64_mmu_radix.c|  26 -
 arch/powerpc/kvm/book3s_hv.c  |  32 ++
 arch/powerpc/kvm/book3s_hv_nested.c   | 107 +-
 arch/powerpc/kvm/powerpc.c|   3 +
 arch/powerpc/mm/book3s64/radix_tlb.c  |   4 -
 arch/powerpc/platforms/pseries/firmware.c |   1 +
 include/uapi/linux/kvm.h  |   1 +
 11 files changed, 204 insertions(+), 12 deletions(-)

-- 
2.21.3



Re: [PATCH V3 (RESEND) 2/3] mm/sparsemem: Enable vmem_altmap support in vmemmap_alloc_block_buf()

2020-07-03 Thread Michael Ellerman
Catalin Marinas  writes:
> On Thu, Jun 18, 2020 at 06:45:29AM +0530, Anshuman Khandual wrote:
>> There are many instances where vmemap allocation is often switched between
>> regular memory and device memory just based on whether altmap is available
>> or not. vmemmap_alloc_block_buf() is used in various platforms to allocate
>> vmemmap mappings. Lets also enable it to handle altmap based device memory
>> allocation along with existing regular memory allocations. This will help
>> in avoiding the altmap based allocation switch in many places.
>> 
>> While here also implement a regular memory allocation fallback mechanism
>> when the first preferred device memory allocation fails. This will ensure
>> preserving the existing semantics on powerpc platform. To summarize there
>> are three different methods to call vmemmap_alloc_block_buf().
>> 
>> (., NULL,   false) /* Allocate from system RAM */
>> (., altmap, false) /* Allocate from altmap without any fallback */
>> (., altmap, true)  /* Allocate from altmap with fallback (system RAM) */
> [...]
>> diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
>> index bc73abf0bc25..01e25b56eccb 100644
>> --- a/arch/powerpc/mm/init_64.c
>> +++ b/arch/powerpc/mm/init_64.c
>> @@ -225,12 +225,12 @@ int __meminit vmemmap_populate(unsigned long start, 
>> unsigned long end, int node,
>>   * fall back to system memory if the altmap allocation fail.
>>   */
>>  if (altmap && !altmap_cross_boundary(altmap, start, page_size)) 
>> {
>> -p = altmap_alloc_block_buf(page_size, altmap);
>> -if (!p)
>> -pr_debug("altmap block allocation failed, 
>> falling back to system memory");
>> +p = vmemmap_alloc_block_buf(page_size, node,
>> +altmap, true);
>> +} else {
>> +p = vmemmap_alloc_block_buf(page_size, node,
>> +NULL, false);
>>  }
>> -if (!p)
>> -p = vmemmap_alloc_block_buf(page_size, node);
>>  if (!p)
>>  return -ENOMEM;
>
> Is the fallback argument actually necessary. It may be cleaner to just
> leave the code as is with the choice between altmap and NULL. If an arch
> needs a fallback (only powerpc), they have the fallback in place
> already. I don't see the powerpc code any better after this change.

Yeah I agree.

cheers


Re: [PATCH 0/4] ASoC: fsl_asrc: allow selecting arbitrary clocks

2020-07-03 Thread Arnaud Ferraris
Hi Nic,

Le 02/07/2020 à 20:42, Nicolin Chen a écrit :
> Hi Arnaud,
> 
> On Thu, Jul 02, 2020 at 04:22:31PM +0200, Arnaud Ferraris wrote:
>> The current ASRC driver hardcodes the input and output clocks used for
>> sample rate conversions. In order to allow greater flexibility and to
>> cover more use cases, it would be preferable to select the clocks using
>> device-tree properties.
> 
> We recent just merged a new change that auto-selecting internal
> clocks based on sample rates as the first option -- ideal ratio
> mode is the fallback mode now. Please refer to:
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20200702=d0250cf4f2abfbea64ed247230f08f5ae23979f0

That looks interesting, thanks for pointing this out!
I'll rebase and see how it works for my use-case, will keep you informed.

Regards,
Arnaud


Re: [PATCH v4 24/41] powerpc/book3s64/pkeys: Store/restore userspace AMR correctly on entry and exit from kernel

2020-07-03 Thread Aneesh Kumar K.V

On 7/3/20 2:48 PM, Nicholas Piggin wrote:

Excerpts from Aneesh Kumar K.V's message of June 15, 2020 4:14 pm:

This prepare kernel to operate with a different value than userspace AMR.
For this, AMR needs to be saved and restored on entry and return from the
kernel.

With KUAP we modify kernel AMR when accessing user address from the kernel
via copy_to/from_user interfaces.

If MMU_FTR_KEY is enabled we always use the key mechanism to implement KUAP
feature. If MMU_FTR_KEY is not supported and if we support MMU_FTR_KUAP
(radix translation on POWER9), we can skip restoring AMR on return
to userspace. Userspace won't be using AMR in that specific config.

Signed-off-by: Aneesh Kumar K.V 
---
  arch/powerpc/include/asm/book3s/64/kup.h | 141 ++-
  arch/powerpc/kernel/entry_64.S   |   6 +-
  arch/powerpc/kernel/exceptions-64s.S |   4 +-
  arch/powerpc/kernel/syscall_64.c |  26 -
  4 files changed, 144 insertions(+), 33 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/kup.h 
b/arch/powerpc/include/asm/book3s/64/kup.h
index e6ee1c34842f..6979cd1a0003 100644
--- a/arch/powerpc/include/asm/book3s/64/kup.h
+++ b/arch/powerpc/include/asm/book3s/64/kup.h
@@ -13,18 +13,47 @@
  
  #ifdef __ASSEMBLY__
  
-.macro kuap_restore_amr	gpr1, gpr2

-#ifdef CONFIG_PPC_KUAP
+.macro kuap_restore_user_amr gpr1
+#if defined(CONFIG_PPC_MEM_KEYS)
BEGIN_MMU_FTR_SECTION_NESTED(67)
-   mfspr   \gpr1, SPRN_AMR
+   /*
+* AMR is going to be different when
+* returning to userspace.
+*/
+   ld  \gpr1, STACK_REGS_KUAP(r1)
+   isync
+   mtspr   SPRN_AMR, \gpr1
+
+   /* No isync required, see kuap_restore_user_amr() */
+   END_MMU_FTR_SECTION_NESTED_IFSET(MMU_FTR_PKEY , 67)
+#endif
+.endm
+
+.macro kuap_restore_kernel_amr gpr1, gpr2
+#if defined(CONFIG_PPC_MEM_KEYS)
+   BEGIN_MMU_FTR_SECTION_NESTED(67)
+   b   99f  // handle_pkey_restore_amr
+   END_MMU_FTR_SECTION_NESTED_IFSET(MMU_FTR_PKEY , 67)
+
+   BEGIN_MMU_FTR_SECTION_NESTED(68)
+   b   99f  // handle_kuap_restore_amr
+   MMU_FTR_SECTION_ELSE_NESTED(68)
+   b   100f  // skip_restore_amr
+   ALT_MMU_FTR_SECTION_END_NESTED_IFSET(MMU_FTR_KUAP, 68)
+
+99:
+   /*
+* AMR is going to be mostly the same since we are
+* returning to the kernel. Compare and do a mtspr.
+*/
ld  \gpr2, STACK_REGS_KUAP(r1)
+   mfspr   \gpr1, SPRN_AMR
cmpd\gpr1, \gpr2
-   beq 998f
+   beq 100f
isync
mtspr   SPRN_AMR, \gpr2
/* No isync required, see kuap_restore_amr() */
-998:
-   END_MMU_FTR_SECTION_NESTED_IFSET(MMU_FTR_KUAP, 67)
+100:  // skip_restore_amr


Can't you code it like this? (_IFCLR requires none of the bits to be
set)

BEGIN_MMU_FTR_SECTION_NESTED(67)
b   99f // nothing using AMR, no need to restore
END_MMU_FTR_SECTION_NESTED_IFCLR(MMU_FTR_PKEY | MMU_FTR_KUAP, 67)

That saves you a branch in the common case of using AMR. Similar
for others.




Yes i could switch to that. The code is taking extra 200 cycles even 
with KUAP/KUEP disabled and no keys being used on hash. I am yet to 
analyze this closely. So will rework things based on that analysis.





@@ -69,22 +133,40 @@
  
  extern u64 default_uamor;
  
-static inline void kuap_restore_amr(struct pt_regs *regs, unsigned long amr)

+static inline void kuap_restore_user_amr(struct pt_regs *regs)
  {
-   if (mmu_has_feature(MMU_FTR_KUAP) && unlikely(regs->kuap != amr)) {
-   isync();
-   mtspr(SPRN_AMR, regs->kuap);
-   /*
-* No isync required here because we are about to RFI back to
-* previous context before any user accesses would be made,
-* which is a CSI.
-*/
+   if (!mmu_has_feature(MMU_FTR_PKEY))
+   return;


If you have PKEY but not KUAP, do you still have to restore?




Yes, because user space pkey is now set on the exit path. This is needed 
to handle things like exec/fork().




+
+   isync();
+   mtspr(SPRN_AMR, regs->kuap);
+   /*
+* No isync required here because we are about to rfi
+* back to previous context before any user accesses
+* would be made, which is a CSI.
+*/
+}
+
+static inline void kuap_restore_kernel_amr(struct pt_regs *regs,
+  unsigned long amr)
+{
+   if (mmu_has_feature(MMU_FTR_KUAP) || mmu_has_feature(MMU_FTR_PKEY)) {
+
+   if (unlikely(regs->kuap != amr)) {
+   isync();
+   mtspr(SPRN_AMR, regs->kuap);
+   /*
+* No isync required here because we are about to rfi
+* back to previous context before any user accesses
+* would be made, which is a CSI.
+*/
+   }

Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-07-03 Thread Michal Hocko
[Cc Andi]

On Fri 03-07-20 11:10:01, Michal Suchanek wrote:
> On Wed, Jul 01, 2020 at 02:21:10PM +0200, Michal Hocko wrote:
> > On Wed 01-07-20 13:30:57, David Hildenbrand wrote:
[...]
> > > Yep, looks like it.
> > > 
> > > [0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0
> > > [0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0
> > > [0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0
> > > [0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0
> > > [0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x-0x0009]
> > > [0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x0010-0xbfff]
> > > [0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x1-0x13fff]
> > 
> > This begs a question whether ppc can do the same thing?
> Or x86 stop doing it so that you can see on what node you are running?
> 
> What's the point of this indirection other than another way of avoiding
> empty node 0?

Honestly, I do not have any idea. I've traced it down to
Author: Andi Kleen 
Date:   Tue Jan 11 15:35:48 2005 -0800

[PATCH] x86_64: Fix ACPI SRAT NUMA parsing

Fix fallout from the recent nodemask_t changes. The node ids assigned
in the SRAT parser were off by one.

I added a new first_unset_node() function to nodemask.h to allocate
IDs sanely.

Signed-off-by: Andi Kleen 
Signed-off-by: Linus Torvalds 

which doesn't really tell all that much. The historical baggage and a
long term behavior which is not really trivial to fix I suspect.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/2] dt-bindings: sound: fsl-asoc-card: add new compatible for I2S slave

2020-07-03 Thread Arnaud Ferraris



Le 02/07/2020 à 17:42, Mark Brown a écrit :
> On Thu, Jul 02, 2020 at 05:28:03PM +0200, Arnaud Ferraris wrote:
>> Le 02/07/2020 à 16:31, Mark Brown a écrit :
> 
>>> Why require that the CODEC be clock master here - why not make this
>>> configurable, reusing the properties from the generic and audio graph
>>> cards?
> 
>> This is partly because I'm not sure how to do it (yet), but mostly
>> because I don't have the hardware to test this (the 2 CODECs present on
>> my only i.MX6 board are both clock master)
> 
> Take a look at what the generic cards are doing, it's a library function 
> asoc_simple_parse_daifmt().  It's not the end of the world if you can't
> test it properly - if it turns out it's buggy somehow someone can always
> fix the code later but an ABI is an ABI so we can't change it.
> 

Thanks for the hints, I'll look into it.
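
From a quick look, the generic cards end up doing roughly the following
at parse time (a sketch only; dev, card_np, codec_np and dai_link are
placeholder names here, not actual fsl-asoc-card variables):

	/* Sketch: derive the DAI format, including clock/frame mastering,
	 * from the device tree instead of hardcoding it.
	 */
	unsigned int daifmt;
	int ret;

	ret = asoc_simple_parse_daifmt(dev, card_np, codec_np,
				       NULL /* prefix */, &daifmt);
	if (ret < 0)
		return ret;

	dai_link->dai_fmt = daifmt;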

Regards,
Arnaud


Re: [PATCH v4 24/41] powerpc/book3s64/pkeys: Store/restore userspace AMR correctly on entry and exit from kernel

2020-07-03 Thread Nicholas Piggin
Excerpts from Aneesh Kumar K.V's message of June 15, 2020 4:14 pm:
> This prepare kernel to operate with a different value than userspace AMR.
> For this, AMR needs to be saved and restored on entry and return from the
> kernel.
> 
> With KUAP we modify kernel AMR when accessing user address from the kernel
> via copy_to/from_user interfaces.
> 
> If MMU_FTR_KEY is enabled we always use the key mechanism to implement KUAP
> feature. If MMU_FTR_KEY is not supported and if we support MMU_FTR_KUAP
> (radix translation on POWER9), we can skip restoring AMR on return
> to userspace. Userspace won't be using AMR in that specific config.
> 
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  arch/powerpc/include/asm/book3s/64/kup.h | 141 ++-
>  arch/powerpc/kernel/entry_64.S   |   6 +-
>  arch/powerpc/kernel/exceptions-64s.S |   4 +-
>  arch/powerpc/kernel/syscall_64.c |  26 -
>  4 files changed, 144 insertions(+), 33 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/book3s/64/kup.h 
> b/arch/powerpc/include/asm/book3s/64/kup.h
> index e6ee1c34842f..6979cd1a0003 100644
> --- a/arch/powerpc/include/asm/book3s/64/kup.h
> +++ b/arch/powerpc/include/asm/book3s/64/kup.h
> @@ -13,18 +13,47 @@
>  
>  #ifdef __ASSEMBLY__
>  
> -.macro kuap_restore_amr  gpr1, gpr2
> -#ifdef CONFIG_PPC_KUAP
> +.macro kuap_restore_user_amr gpr1
> +#if defined(CONFIG_PPC_MEM_KEYS)
>   BEGIN_MMU_FTR_SECTION_NESTED(67)
> - mfspr   \gpr1, SPRN_AMR
> + /*
> +  * AMR is going to be different when
> +  * returning to userspace.
> +  */
> + ld  \gpr1, STACK_REGS_KUAP(r1)
> + isync
> + mtspr   SPRN_AMR, \gpr1
> +
> + /* No isync required, see kuap_restore_user_amr() */
> + END_MMU_FTR_SECTION_NESTED_IFSET(MMU_FTR_PKEY , 67)
> +#endif
> +.endm
> +
> +.macro kuap_restore_kernel_amr   gpr1, gpr2
> +#if defined(CONFIG_PPC_MEM_KEYS)
> + BEGIN_MMU_FTR_SECTION_NESTED(67)
> + b   99f  // handle_pkey_restore_amr
> + END_MMU_FTR_SECTION_NESTED_IFSET(MMU_FTR_PKEY , 67)
> +
> + BEGIN_MMU_FTR_SECTION_NESTED(68)
> + b   99f  // handle_kuap_restore_amr
> + MMU_FTR_SECTION_ELSE_NESTED(68)
> + b   100f  // skip_restore_amr
> + ALT_MMU_FTR_SECTION_END_NESTED_IFSET(MMU_FTR_KUAP, 68)
> +
> +99:
> + /*
> +  * AMR is going to be mostly the same since we are
> +  * returning to the kernel. Compare and do a mtspr.
> +  */
>   ld  \gpr2, STACK_REGS_KUAP(r1)
> + mfspr   \gpr1, SPRN_AMR
>   cmpd\gpr1, \gpr2
> - beq 998f
> + beq 100f
>   isync
>   mtspr   SPRN_AMR, \gpr2
>   /* No isync required, see kuap_restore_amr() */
> -998:
> - END_MMU_FTR_SECTION_NESTED_IFSET(MMU_FTR_KUAP, 67)
> +100:  // skip_restore_amr

Can't you code it like this? (_IFCLR requires none of the bits to be 
set)

BEGIN_MMU_FTR_SECTION_NESTED(67)
b   99f // nothing using AMR, no need to restore
END_MMU_FTR_SECTION_NESTED_IFCLR(MMU_FTR_PKEY | MMU_FTR_KUAP, 67)

That saves you a branch in the common case of using AMR. Similar
for others.

> @@ -69,22 +133,40 @@
>  
>  extern u64 default_uamor;
>  
> -static inline void kuap_restore_amr(struct pt_regs *regs, unsigned long amr)
> +static inline void kuap_restore_user_amr(struct pt_regs *regs)
>  {
> - if (mmu_has_feature(MMU_FTR_KUAP) && unlikely(regs->kuap != amr)) {
> - isync();
> - mtspr(SPRN_AMR, regs->kuap);
> - /*
> -  * No isync required here because we are about to RFI back to
> -  * previous context before any user accesses would be made,
> -  * which is a CSI.
> -  */
> + if (!mmu_has_feature(MMU_FTR_PKEY))
> + return;

If you have PKEY but not KUAP, do you still have to restore?

> +
> + isync();
> + mtspr(SPRN_AMR, regs->kuap);
> + /*
> +  * No isync required here because we are about to rfi
> +  * back to previous context before any user accesses
> +  * would be made, which is a CSI.
> +  */
> +}
> +
> +static inline void kuap_restore_kernel_amr(struct pt_regs *regs,
> +unsigned long amr)
> +{
> + if (mmu_has_feature(MMU_FTR_KUAP) || mmu_has_feature(MMU_FTR_PKEY)) {
> +
> + if (unlikely(regs->kuap != amr)) {
> + isync();
> + mtspr(SPRN_AMR, regs->kuap);
> + /*
> +  * No isync required here because we are about to rfi
> +  * back to previous context before any user accesses
> +  * would be made, which is a CSI.
> +  */
> + }
>   }
>  }
>  
>  static inline unsigned long kuap_get_and_check_amr(void)
>  {
> - if (mmu_has_feature(MMU_FTR_KUAP)) {
> + if (mmu_has_feature(MMU_FTR_KUAP) || mmu_has_feature(MMU_FTR_PKEY)) {
>   unsigned long amr = mfspr(SPRN_AMR);
> 

Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-07-03 Thread Michal Suchánek
On Wed, Jul 01, 2020 at 02:21:10PM +0200, Michal Hocko wrote:
> On Wed 01-07-20 13:30:57, David Hildenbrand wrote:
> > On 01.07.20 13:06, David Hildenbrand wrote:
> > > On 01.07.20 13:01, Srikar Dronamraju wrote:
> > >> * David Hildenbrand  [2020-07-01 12:15:54]:
> > >>
> > >>> On 01.07.20 12:04, Srikar Dronamraju wrote:
> >  * Michal Hocko  [2020-07-01 10:42:00]:
> > 
> > >
> > >>
> > >> 2. Also existence of dummy node also leads to inconsistent 
> > >> information. The
> > >> number of online nodes is inconsistent with the information in the
> > >> device-tree and resource-dump
> > >>
> > >> 3. When the dummy node is present, single node non-Numa systems end 
> > >> up showing
> > >> up as NUMA systems and numa_balancing gets enabled. This will mean 
> > >> we take
> > >> the hit from the unnecessary numa hinting faults.
> > >
> > > I have to say that I dislike the node online/offline state and 
> > > directly
> > > exporting that to the userspace. Users should only care whether the 
> > > node
> > > has memory/cpus. Numa nodes can be online without any memory. Just
> > > offline all the present memory blocks but do not physically hot remove
> > > them and you are in the same situation. If users are confused by an
> > > output of tools like numactl -H then those could be updated and hide
> > > nodes without any memory
> > >
> > > The autonuma problem sounds interesting but again this patch doesn't
> > > really solve the underlying problem because I strongly suspect that 
> > > the
> > > problem is still there when a numa node gets all its memory offline as
> > > mentioned above.
> 
> I would really appreciate a feedback to these two as well.
> 
> > > While I completely agree that making node 0 special is wrong, I have
> > > still hard time to review this very simply looking patch because all 
> > > the
> > > numa initialization is so spread around that this might just blow up
> > > at unexpected places. IIRC we have discussed testing in the previous
> > > version and David has provided a way to emulate these configurations
> > > on x86. Did you manage to use those instruction for additional testing
> > > on other than ppc architectures?
> > >
> > 
> >  I have tried all the steps that David mentioned and reported back at
> >  https://lore.kernel.org/lkml/20200511174731.gd1...@linux.vnet.ibm.com/t/#u
> > 
> >  As a summary, David's steps are still not creating a 
> >  memoryless/cpuless on
> >  x86 VM.
> > >>>
> > >>> Now, that is wrong. You get a memoryless/cpuless node, which is *not
> > >>> online*. Once you hotplug some memory, it will switch online. Once you
> > >>> remove memory, it will switch back offline.
> > >>>
> > >>
> > >> Let me clarify, we are looking for a node 0 which is cpuless/memoryless 
> > >> at
> > >> boot.  The code in question tries to handle a cpuless/memoryless node 0 
> > >> at
> > >> boot.
> > > 
> > > I was just correcting your statement, because it was wrong.
> > > 
> > > Could be that x86 code maps PXM 1 to node 0 because PXM 1 does neither
> > > have CPUs nor memory. That would imply that we can, in fact, never have
> > > node 0 offline during boot.
> > > 
> > 
> > Yep, looks like it.
> > 
> > [0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0
> > [0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0
> > [0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0
> > [0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0
> > [0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x-0x0009]
> > [0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x0010-0xbfff]
> > [0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x1-0x13fff]
> 
> This begs a question whether ppc can do the same thing?
Or x86 stop doing it so that you can see on what node you are running?

What's the point of this indirection other than another way of avoiding
empty node 0?

Thanks

Michal


[PATCH v2 6/6] powerpc/qspinlock: optimised atomic_try_cmpxchg_lock that adds the lock hint

2020-07-03 Thread Nicholas Piggin
This brings the behaviour of the uncontended fast path back to
roughly equivalent to simple spinlocks -- a single atomic op with
lock hint.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/atomic.h| 28 
 arch/powerpc/include/asm/qspinlock.h |  2 +-
 2 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/atomic.h 
b/arch/powerpc/include/asm/atomic.h
index 498785ffc25f..f6a3d145ffb7 100644
--- a/arch/powerpc/include/asm/atomic.h
+++ b/arch/powerpc/include/asm/atomic.h
@@ -193,6 +193,34 @@ static __inline__ int atomic_dec_return_relaxed(atomic_t 
*v)
 #define atomic_xchg(v, new) (xchg(&((v)->counter), new))
 #define atomic_xchg_relaxed(v, new) xchg_relaxed(&((v)->counter), (new))
 
+/*
+ * Don't want to override the generic atomic_try_cmpxchg_acquire, because
+ * we add a lock hint to the lwarx, which may not be wanted for the
+ * _acquire case (and is not used by the other _acquire variants so it
+ * would be a surprise).
+ */
+static __always_inline bool
+atomic_try_cmpxchg_lock(atomic_t *v, int *old, int new)
+{
+   int r, o = *old;
+
+   __asm__ __volatile__ (
+"1:\t" PPC_LWARX(%0,0,%2,1) "  # atomic_try_cmpxchg_acquire\n"
+"  cmpw0,%0,%3 \n"
+"  bne-2f  \n"
+"  stwcx.  %4,0,%2 \n"
+"  bne-1b  \n"
+"\t"   PPC_ACQUIRE_BARRIER "   \n"
+"2:\n"
+   : "=" (r), "+m" (v->counter)
+   : "r" (>counter), "r" (o), "r" (new)
+   : "cr0", "memory");
+
+   if (unlikely(r != o))
+   *old = r;
+   return likely(r == o);
+}
+
 /**
  * atomic_fetch_add_unless - add unless the number is a given value
  * @v: pointer of type atomic_t
diff --git a/arch/powerpc/include/asm/qspinlock.h 
b/arch/powerpc/include/asm/qspinlock.h
index 0960a0de2467..beb6aa4628e7 100644
--- a/arch/powerpc/include/asm/qspinlock.h
+++ b/arch/powerpc/include/asm/qspinlock.h
@@ -26,7 +26,7 @@ static __always_inline void queued_spin_lock(struct qspinlock 
*lock)
 {
u32 val = 0;
 
-   if (likely(atomic_try_cmpxchg_acquire(&lock->val, &val, _Q_LOCKED_VAL)))
+   if (likely(atomic_try_cmpxchg_lock(&lock->val, &val, _Q_LOCKED_VAL)))
return;
 
queued_spin_lock_slowpath(lock, val);
-- 
2.23.0



[PATCH v2 5/6] powerpc/pseries: implement paravirt qspinlocks for SPLPAR

2020-07-03 Thread Nicholas Piggin
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/paravirt.h   | 28 ++
 arch/powerpc/include/asm/qspinlock.h  | 55 +++
 arch/powerpc/include/asm/qspinlock_paravirt.h |  5 ++
 arch/powerpc/platforms/pseries/Kconfig|  5 ++
 arch/powerpc/platforms/pseries/setup.c|  6 +-
 include/asm-generic/qspinlock.h   |  2 +
 6 files changed, 100 insertions(+), 1 deletion(-)
 create mode 100644 arch/powerpc/include/asm/qspinlock_paravirt.h

diff --git a/arch/powerpc/include/asm/paravirt.h 
b/arch/powerpc/include/asm/paravirt.h
index 7a8546660a63..f2d51f929cf5 100644
--- a/arch/powerpc/include/asm/paravirt.h
+++ b/arch/powerpc/include/asm/paravirt.h
@@ -29,6 +29,16 @@ static inline void yield_to_preempted(int cpu, u32 
yield_count)
 {
plpar_hcall_norets(H_CONFER, get_hard_smp_processor_id(cpu), 
yield_count);
 }
+
+static inline void prod_cpu(int cpu)
+{
+   plpar_hcall_norets(H_PROD, get_hard_smp_processor_id(cpu));
+}
+
+static inline void yield_to_any(void)
+{
+   plpar_hcall_norets(H_CONFER, -1, 0);
+}
 #else
 static inline bool is_shared_processor(void)
 {
@@ -45,6 +55,19 @@ static inline void yield_to_preempted(int cpu, u32 
yield_count)
 {
___bad_yield_to_preempted(); /* This would be a bug */
 }
+
+extern void ___bad_yield_to_any(void);
+static inline void yield_to_any(void)
+{
+   ___bad_yield_to_any(); /* This would be a bug */
+}
+
+extern void ___bad_prod_cpu(void);
+static inline void prod_cpu(int cpu)
+{
+   ___bad_prod_cpu(); /* This would be a bug */
+}
+
 #endif
 
 #define vcpu_is_preempted vcpu_is_preempted
@@ -57,5 +80,10 @@ static inline bool vcpu_is_preempted(int cpu)
return false;
 }
 
+static inline bool pv_is_native_spin_unlock(void)
+{
+ return !is_shared_processor();
+}
+
 #endif /* __KERNEL__ */
 #endif /* __ASM_PARAVIRT_H */
diff --git a/arch/powerpc/include/asm/qspinlock.h 
b/arch/powerpc/include/asm/qspinlock.h
index c49e33e24edd..0960a0de2467 100644
--- a/arch/powerpc/include/asm/qspinlock.h
+++ b/arch/powerpc/include/asm/qspinlock.h
@@ -3,9 +3,36 @@
 #define _ASM_POWERPC_QSPINLOCK_H
 
 #include 
+#include 
 
 #define _Q_PENDING_LOOPS   (1 << 9) /* not tuned */
 
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+extern void __pv_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+
+static __always_inline void queued_spin_lock_slowpath(struct qspinlock *lock, 
u32 val)
+{
+   if (!is_shared_processor())
+   native_queued_spin_lock_slowpath(lock, val);
+   else
+   __pv_queued_spin_lock_slowpath(lock, val);
+}
+#else
+extern void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+#endif
+
+static __always_inline void queued_spin_lock(struct qspinlock *lock)
+{
+   u32 val = 0;
+
+   if (likely(atomic_try_cmpxchg_acquire(&lock->val, &val, _Q_LOCKED_VAL)))
+   return;
+
+   queued_spin_lock_slowpath(lock, val);
+}
+#define queued_spin_lock queued_spin_lock
+
 #define smp_mb__after_spinlock()   smp_mb()
 
 static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
@@ -20,6 +47,34 @@ static __always_inline int queued_spin_is_locked(struct 
qspinlock *lock)
 }
 #define queued_spin_is_locked queued_spin_is_locked
 
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+#define SPIN_THRESHOLD (1<<15) /* not tuned */
+
+static __always_inline void pv_wait(u8 *ptr, u8 val)
+{
+   if (*ptr != val)
+   return;
+   yield_to_any();
+   /*
+* We could pass in a CPU here if waiting in the queue and yield to
+* the previous CPU in the queue.
+*/
+}
+
+static __always_inline void pv_kick(int cpu)
+{
+   prod_cpu(cpu);
+}
+
+extern void __pv_init_lock_hash(void);
+
+static inline void pv_spinlocks_init(void)
+{
+   __pv_init_lock_hash();
+}
+
+#endif
+
 #include 
 
 #endif /* _ASM_POWERPC_QSPINLOCK_H */
diff --git a/arch/powerpc/include/asm/qspinlock_paravirt.h 
b/arch/powerpc/include/asm/qspinlock_paravirt.h
new file mode 100644
index ..6dbdb8a4f84f
--- /dev/null
+++ b/arch/powerpc/include/asm/qspinlock_paravirt.h
@@ -0,0 +1,5 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+#ifndef __ASM_QSPINLOCK_PARAVIRT_H
+#define __ASM_QSPINLOCK_PARAVIRT_H
+
+#endif /* __ASM_QSPINLOCK_PARAVIRT_H */
diff --git a/arch/powerpc/platforms/pseries/Kconfig 
b/arch/powerpc/platforms/pseries/Kconfig
index 24c18362e5ea..756e727b383f 100644
--- a/arch/powerpc/platforms/pseries/Kconfig
+++ b/arch/powerpc/platforms/pseries/Kconfig
@@ -25,9 +25,14 @@ config PPC_PSERIES
select SWIOTLB
default y
 
+config PARAVIRT_SPINLOCKS
+   bool
+   default n
+
 config PPC_SPLPAR
depends on PPC_PSERIES
bool "Support for shared-processor logical partitions"
+   select PARAVIRT_SPINLOCKS if PPC_QUEUED_SPINLOCKS
help
  Enabling this option will make the kernel run more 

[PATCH v2 4/6] powerpc/64s: implement queued spinlocks and rwlocks

2020-07-03 Thread Nicholas Piggin
These have shown significantly improved performance and fairness when
spinlock contention is moderate to high on very large systems.

 [ Numbers hopefully forthcoming after more testing, but initial
   results look good ]

Thanks to the fast path, single threaded performance is not noticeably
hurt.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/Kconfig  | 13 
 arch/powerpc/include/asm/Kbuild   |  2 ++
 arch/powerpc/include/asm/qspinlock.h  | 25 +++
 arch/powerpc/include/asm/spinlock.h   |  5 +
 arch/powerpc/include/asm/spinlock_types.h |  5 +
 arch/powerpc/lib/Makefile |  3 +++
 include/asm-generic/qspinlock.h   |  2 ++
 7 files changed, 55 insertions(+)
 create mode 100644 arch/powerpc/include/asm/qspinlock.h

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 9fa23eb320ff..b17575109876 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -145,6 +145,8 @@ config PPC
select ARCH_SUPPORTS_ATOMIC_RMW
select ARCH_USE_BUILTIN_BSWAP
select ARCH_USE_CMPXCHG_LOCKREF if PPC64
+   select ARCH_USE_QUEUED_RWLOCKS  if PPC_QUEUED_SPINLOCKS
+   select ARCH_USE_QUEUED_SPINLOCKSif PPC_QUEUED_SPINLOCKS
select ARCH_WANT_IPC_PARSE_VERSION
select ARCH_WEAK_RELEASE_ACQUIRE
select BINFMT_ELF
@@ -490,6 +492,17 @@ config HOTPLUG_CPU
 
  Say N if you are unsure.
 
+config PPC_QUEUED_SPINLOCKS
+   bool "Queued spinlocks"
+   depends on SMP
+   default "y" if PPC_BOOK3S_64
+   help
+ Say Y here to use queued spinlocks which are more complex
+ but give better scalability and fairness on large SMP and NUMA
+ systems.
+
+ If unsure, say "Y" if you have lots of cores, otherwise "N".
+
 config ARCH_CPU_PROBE_RELEASE
def_bool y
depends on HOTPLUG_CPU
diff --git a/arch/powerpc/include/asm/Kbuild b/arch/powerpc/include/asm/Kbuild
index dadbcf3a0b1e..1dd8b6adff5e 100644
--- a/arch/powerpc/include/asm/Kbuild
+++ b/arch/powerpc/include/asm/Kbuild
@@ -6,5 +6,7 @@ generated-y += syscall_table_spu.h
 generic-y += export.h
 generic-y += local64.h
 generic-y += mcs_spinlock.h
+generic-y += qrwlock.h
+generic-y += qspinlock.h
 generic-y += vtime.h
 generic-y += early_ioremap.h
diff --git a/arch/powerpc/include/asm/qspinlock.h 
b/arch/powerpc/include/asm/qspinlock.h
new file mode 100644
index ..c49e33e24edd
--- /dev/null
+++ b/arch/powerpc/include/asm/qspinlock.h
@@ -0,0 +1,25 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_POWERPC_QSPINLOCK_H
+#define _ASM_POWERPC_QSPINLOCK_H
+
+#include 
+
+#define _Q_PENDING_LOOPS   (1 << 9) /* not tuned */
+
+#define smp_mb__after_spinlock()   smp_mb()
+
+static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
+{
+   /*
+* This barrier was added to simple spinlocks by commit 51d7d5205d338,
+* but it should now be possible to remove it, as arm64 has done with
+* commit c6f5d02b6a0f.
+*/
+   smp_mb();
+   return atomic_read(&lock->val);
+}
+#define queued_spin_is_locked queued_spin_is_locked
+
+#include 
+
+#endif /* _ASM_POWERPC_QSPINLOCK_H */
diff --git a/arch/powerpc/include/asm/spinlock.h 
b/arch/powerpc/include/asm/spinlock.h
index 21357fe05fe0..434615f1d761 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -3,7 +3,12 @@
 #define __ASM_SPINLOCK_H
 #ifdef __KERNEL__
 
+#ifdef CONFIG_PPC_QUEUED_SPINLOCKS
+#include 
+#include 
+#else
 #include 
+#endif
 
 #endif /* __KERNEL__ */
 #endif /* __ASM_SPINLOCK_H */
diff --git a/arch/powerpc/include/asm/spinlock_types.h 
b/arch/powerpc/include/asm/spinlock_types.h
index 3906f52dae65..c5d742f18021 100644
--- a/arch/powerpc/include/asm/spinlock_types.h
+++ b/arch/powerpc/include/asm/spinlock_types.h
@@ -6,6 +6,11 @@
 # error "please don't include this file directly"
 #endif
 
+#ifdef CONFIG_PPC_QUEUED_SPINLOCKS
+#include 
+#include 
+#else
 #include 
+#endif
 
 #endif
diff --git a/arch/powerpc/lib/Makefile b/arch/powerpc/lib/Makefile
index 5e994cda8e40..d66a645503eb 100644
--- a/arch/powerpc/lib/Makefile
+++ b/arch/powerpc/lib/Makefile
@@ -41,7 +41,10 @@ obj-$(CONFIG_PPC_BOOK3S_64) += copyuser_power7.o 
copypage_power7.o \
 obj64-y+= copypage_64.o copyuser_64.o mem_64.o hweight_64.o \
   memcpy_64.o memcpy_mcsafe_64.o
 
+ifndef CONFIG_PPC_QUEUED_SPINLOCKS
 obj64-$(CONFIG_SMP)+= locks.o
+endif
+
 obj64-$(CONFIG_ALTIVEC)+= vmx-helper.o
 obj64-$(CONFIG_KPROBES_SANITY_TEST)+= test_emulate_step.o \
   test_emulate_step_exec_instr.o
diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
index fde943d180e0..fb0a814d4395 100644
--- a/include/asm-generic/qspinlock.h
+++ b/include/asm-generic/qspinlock.h
@@ -12,6 +12,7 @@
 
 #include 
 
+#ifndef queued_spin_is_locked
 /**
  * 

[PATCH v2 3/6] powerpc: move spinlock implementation to simple_spinlock

2020-07-03 Thread Nicholas Piggin
To prepare for queued spinlocks. This is a simple rename except to update
preprocessor guard name and a file reference.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/simple_spinlock.h| 292 ++
 .../include/asm/simple_spinlock_types.h   |  21 ++
 arch/powerpc/include/asm/spinlock.h   | 285 +
 arch/powerpc/include/asm/spinlock_types.h |  12 +-
 4 files changed, 315 insertions(+), 295 deletions(-)
 create mode 100644 arch/powerpc/include/asm/simple_spinlock.h
 create mode 100644 arch/powerpc/include/asm/simple_spinlock_types.h

diff --git a/arch/powerpc/include/asm/simple_spinlock.h 
b/arch/powerpc/include/asm/simple_spinlock.h
new file mode 100644
index ..e048c041c4a9
--- /dev/null
+++ b/arch/powerpc/include/asm/simple_spinlock.h
@@ -0,0 +1,292 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+#ifndef __ASM_SIMPLE_SPINLOCK_H
+#define __ASM_SIMPLE_SPINLOCK_H
+#ifdef __KERNEL__
+
+/*
+ * Simple spin lock operations.  
+ *
+ * Copyright (C) 2001-2004 Paul Mackerras , IBM
+ * Copyright (C) 2001 Anton Blanchard , IBM
+ * Copyright (C) 2002 Dave Engebretsen , IBM
+ * Rework to support virtual processors
+ *
+ * Type of int is used as a full 64b word is not necessary.
+ *
+ * (the type definitions are in asm/simple_spinlock_types.h)
+ */
+#include 
+#include 
+#ifdef CONFIG_PPC64
+#include 
+#endif
+#include 
+#include 
+
+#ifdef CONFIG_PPC64
+/* use 0x800000yy when locked, where yy == CPU number */
+#ifdef __BIG_ENDIAN__
+#define LOCK_TOKEN (*(u32 *)(&get_paca()->lock_token))
+#else
+#define LOCK_TOKEN (*(u32 *)(&get_paca()->paca_index))
+#endif
+#else
+#define LOCK_TOKEN 1
+#endif
+
+static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
+{
+   return lock.slock == 0;
+}
+
+static inline int arch_spin_is_locked(arch_spinlock_t *lock)
+{
+   smp_mb();
+   return !arch_spin_value_unlocked(*lock);
+}
+
+/*
+ * This returns the old value in the lock, so we succeeded
+ * in getting the lock if the return value is 0.
+ */
+static inline unsigned long __arch_spin_trylock(arch_spinlock_t *lock)
+{
+   unsigned long tmp, token;
+
+   token = LOCK_TOKEN;
+   __asm__ __volatile__(
+"1:" PPC_LWARX(%0,0,%2,1) "\n\
+   cmpwi   0,%0,0\n\
+   bne-2f\n\
+   stwcx.  %1,0,%2\n\
+   bne-1b\n"
+   PPC_ACQUIRE_BARRIER
+"2:"
+   : "=" (tmp)
+   : "r" (token), "r" (>slock)
+   : "cr0", "memory");
+
+   return tmp;
+}
+
+static inline int arch_spin_trylock(arch_spinlock_t *lock)
+{
+   return __arch_spin_trylock(lock) == 0;
+}
+
+/*
+ * On a system with shared processors (that is, where a physical
+ * processor is multiplexed between several virtual processors),
+ * there is no point spinning on a lock if the holder of the lock
+ * isn't currently scheduled on a physical processor.  Instead
+ * we detect this situation and ask the hypervisor to give the
+ * rest of our timeslice to the lock holder.
+ *
+ * So that we can tell which virtual processor is holding a lock,
+ * we put 0x8000 | smp_processor_id() in the lock when it is
+ * held.  Conveniently, we have a word in the paca that holds this
+ * value.
+ */
+
+#if defined(CONFIG_PPC_SPLPAR)
+/* We only yield to the hypervisor if we are in shared processor mode */
+void splpar_spin_yield(arch_spinlock_t *lock);
+void splpar_rw_yield(arch_rwlock_t *lock);
+#else /* SPLPAR */
+static inline void splpar_spin_yield(arch_spinlock_t *lock) {};
+static inline void splpar_rw_yield(arch_rwlock_t *lock) {};
+#endif
+
+static inline void spin_yield(arch_spinlock_t *lock)
+{
+   if (is_shared_processor())
+   splpar_spin_yield(lock);
+   else
+   barrier();
+}
+
+static inline void rw_yield(arch_rwlock_t *lock)
+{
+   if (is_shared_processor())
+   splpar_rw_yield(lock);
+   else
+   barrier();
+}
+
+static inline void arch_spin_lock(arch_spinlock_t *lock)
+{
+   while (1) {
+   if (likely(__arch_spin_trylock(lock) == 0))
+   break;
+   do {
+   HMT_low();
+   if (is_shared_processor())
+   splpar_spin_yield(lock);
+   } while (unlikely(lock->slock != 0));
+   HMT_medium();
+   }
+}
+
+static inline
+void arch_spin_lock_flags(arch_spinlock_t *lock, unsigned long flags)
+{
+   unsigned long flags_dis;
+
+   while (1) {
+   if (likely(__arch_spin_trylock(lock) == 0))
+   break;
+   local_save_flags(flags_dis);
+   local_irq_restore(flags);
+   do {
+   HMT_low();
+   if (is_shared_processor())
+   splpar_spin_yield(lock);
+   } while (unlikely(lock->slock != 0));
+   HMT_medium();
+   

[PATCH v2 2/6] powerpc/pseries: move some PAPR paravirt functions to their own file

2020-07-03 Thread Nicholas Piggin
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/paravirt.h | 61 +
 arch/powerpc/include/asm/spinlock.h | 24 +---
 arch/powerpc/lib/locks.c| 12 +++---
 3 files changed, 68 insertions(+), 29 deletions(-)
 create mode 100644 arch/powerpc/include/asm/paravirt.h

diff --git a/arch/powerpc/include/asm/paravirt.h 
b/arch/powerpc/include/asm/paravirt.h
new file mode 100644
index ..7a8546660a63
--- /dev/null
+++ b/arch/powerpc/include/asm/paravirt.h
@@ -0,0 +1,61 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+#ifndef __ASM_PARAVIRT_H
+#define __ASM_PARAVIRT_H
+#ifdef __KERNEL__
+
+#include 
+#include 
+#ifdef CONFIG_PPC64
+#include 
+#include 
+#endif
+
+#ifdef CONFIG_PPC_SPLPAR
+DECLARE_STATIC_KEY_FALSE(shared_processor);
+
+static inline bool is_shared_processor(void)
+{
+   return static_branch_unlikely(&shared_processor);
+}
+
+/* If bit 0 is set, the cpu has been preempted */
+static inline u32 yield_count_of(int cpu)
+{
+   __be32 yield_count = READ_ONCE(lppaca_of(cpu).yield_count);
+   return be32_to_cpu(yield_count);
+}
+
+static inline void yield_to_preempted(int cpu, u32 yield_count)
+{
+   plpar_hcall_norets(H_CONFER, get_hard_smp_processor_id(cpu), 
yield_count);
+}
+#else
+static inline bool is_shared_processor(void)
+{
+   return false;
+}
+
+static inline u32 yield_count_of(int cpu)
+{
+   return 0;
+}
+
+extern void ___bad_yield_to_preempted(void);
+static inline void yield_to_preempted(int cpu, u32 yield_count)
+{
+   ___bad_yield_to_preempted(); /* This would be a bug */
+}
+#endif
+
+#define vcpu_is_preempted vcpu_is_preempted
+static inline bool vcpu_is_preempted(int cpu)
+{
+   if (!is_shared_processor())
+   return false;
+   if (yield_count_of(cpu) & 1)
+   return true;
+   return false;
+}
+
+#endif /* __KERNEL__ */
+#endif /* __ASM_PARAVIRT_H */
diff --git a/arch/powerpc/include/asm/spinlock.h 
b/arch/powerpc/include/asm/spinlock.h
index 2d620896cdae..79be9bb10bbb 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -15,11 +15,10 @@
  *
  * (the type definitions are in asm/spinlock_types.h)
  */
-#include 
 #include 
+#include 
 #ifdef CONFIG_PPC64
 #include 
-#include 
 #endif
 #include 
 #include 
@@ -35,18 +34,6 @@
 #define LOCK_TOKEN 1
 #endif
 
-#ifdef CONFIG_PPC_PSERIES
-DECLARE_STATIC_KEY_FALSE(shared_processor);
-
-#define vcpu_is_preempted vcpu_is_preempted
-static inline bool vcpu_is_preempted(int cpu)
-{
-   if (!static_branch_unlikely(&shared_processor))
-   return false;
-   return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1);
-}
-#endif
-
 static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
 {
return lock.slock == 0;
@@ -110,15 +97,6 @@ static inline void splpar_spin_yield(arch_spinlock_t *lock) 
{};
 static inline void splpar_rw_yield(arch_rwlock_t *lock) {};
 #endif
 
-static inline bool is_shared_processor(void)
-{
-#ifdef CONFIG_PPC_SPLPAR
-   return static_branch_unlikely(&shared_processor);
-#else
-   return false;
-#endif
-}
-
 static inline void spin_yield(arch_spinlock_t *lock)
 {
if (is_shared_processor())
diff --git a/arch/powerpc/lib/locks.c b/arch/powerpc/lib/locks.c
index 6440d5943c00..04165b7a163f 100644
--- a/arch/powerpc/lib/locks.c
+++ b/arch/powerpc/lib/locks.c
@@ -27,14 +27,14 @@ void splpar_spin_yield(arch_spinlock_t *lock)
return;
holder_cpu = lock_value & 0xffff;
BUG_ON(holder_cpu >= NR_CPUS);
-   yield_count = be32_to_cpu(lppaca_of(holder_cpu).yield_count);
+
+   yield_count = yield_count_of(holder_cpu);
if ((yield_count & 1) == 0)
return; /* virtual cpu is currently running */
rmb();
if (lock->slock != lock_value)
return; /* something has changed */
-   plpar_hcall_norets(H_CONFER,
-   get_hard_smp_processor_id(holder_cpu), yield_count);
+   yield_to_preempted(holder_cpu, yield_count);
 }
 EXPORT_SYMBOL_GPL(splpar_spin_yield);
 
@@ -53,13 +53,13 @@ void splpar_rw_yield(arch_rwlock_t *rw)
return; /* no write lock at present */
holder_cpu = lock_value & 0xffff;
BUG_ON(holder_cpu >= NR_CPUS);
-   yield_count = be32_to_cpu(lppaca_of(holder_cpu).yield_count);
+
+   yield_count = yield_count_of(holder_cpu);
if ((yield_count & 1) == 0)
return; /* virtual cpu is currently running */
rmb();
if (rw->lock != lock_value)
return; /* something has changed */
-   plpar_hcall_norets(H_CONFER,
-   get_hard_smp_processor_id(holder_cpu), yield_count);
+   yield_to_preempted(holder_cpu, yield_count);
 }
 #endif
-- 
2.23.0



[PATCH v2 1/6] powerpc/powernv: must include hvcall.h to get PAPR defines

2020-07-03 Thread Nicholas Piggin
An include goes away in future patches which breaks compilation
without this.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/platforms/powernv/pci-ioda-tce.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda-tce.c 
b/arch/powerpc/platforms/powernv/pci-ioda-tce.c
index f923359d8afc..8eba6ece7808 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda-tce.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda-tce.c
@@ -15,6 +15,7 @@
 
 #include 
 #include 
+#include  /* share error returns with PAPR */
 #include "pci.h"
 
 unsigned long pnv_ioda_parse_tce_sizes(struct pnv_phb *phb)
-- 
2.23.0



[PATCH v2 0/6] powerpc: queued spinlocks and rwlocks

2020-07-03 Thread Nicholas Piggin
v2 is updated to account for feedback from Will, Peter, and
Waiman (thank you), and trims off a couple of RFC and unrelated
patches.

Thanks,
Nick

Nicholas Piggin (6):
  powerpc/powernv: must include hvcall.h to get PAPR defines
  powerpc/pseries: move some PAPR paravirt functions to their own file
  powerpc: move spinlock implementation to simple_spinlock
  powerpc/64s: implement queued spinlocks and rwlocks
  powerpc/pseries: implement paravirt qspinlocks for SPLPAR
  powerpc/qspinlock: optimised atomic_try_cmpxchg_lock that adds the
lock hint

 arch/powerpc/Kconfig  |  13 +
 arch/powerpc/include/asm/Kbuild   |   2 +
 arch/powerpc/include/asm/atomic.h |  28 ++
 arch/powerpc/include/asm/paravirt.h   |  89 +
 arch/powerpc/include/asm/qspinlock.h  |  80 +
 arch/powerpc/include/asm/qspinlock_paravirt.h |   5 +
 arch/powerpc/include/asm/simple_spinlock.h| 292 +
 .../include/asm/simple_spinlock_types.h   |  21 ++
 arch/powerpc/include/asm/spinlock.h   | 308 +-
 arch/powerpc/include/asm/spinlock_types.h |  17 +-
 arch/powerpc/lib/Makefile |   3 +
 arch/powerpc/lib/locks.c  |  12 +-
 arch/powerpc/platforms/powernv/pci-ioda-tce.c |   1 +
 arch/powerpc/platforms/pseries/Kconfig|   5 +
 arch/powerpc/platforms/pseries/setup.c|   6 +-
 include/asm-generic/qspinlock.h   |   4 +
 16 files changed, 564 insertions(+), 322 deletions(-)
 create mode 100644 arch/powerpc/include/asm/paravirt.h
 create mode 100644 arch/powerpc/include/asm/qspinlock.h
 create mode 100644 arch/powerpc/include/asm/qspinlock_paravirt.h
 create mode 100644 arch/powerpc/include/asm/simple_spinlock.h
 create mode 100644 arch/powerpc/include/asm/simple_spinlock_types.h

-- 
2.23.0



Re: [PATCH] powerpc/powernv: machine check handler for POWER10

2020-07-03 Thread kernel test robot
Hi Nicholas,

I love your patch! Perhaps something to improve:

[auto build test WARNING on powerpc/next]
[also build test WARNING on v5.8-rc3 next-20200702]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use  as documented in
https://git-scm.com/docs/git-format-patch]

url:
https://github.com/0day-ci/linux/commits/Nicholas-Piggin/powerpc-powernv-machine-check-handler-for-POWER10/20200703-073739
base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
config: powerpc-allyesconfig (attached as .config)
compiler: powerpc64-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross 
ARCH=powerpc 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot 

All warnings (new ones prefixed by >>):

   arch/powerpc/kernel/mce_power.c:709:6: warning: no previous prototype for 
'__machine_check_early_realmode_p7' [-Wmissing-prototypes]
 709 | long __machine_check_early_realmode_p7(struct pt_regs *regs)
 |  ^
   arch/powerpc/kernel/mce_power.c:717:6: warning: no previous prototype for 
'__machine_check_early_realmode_p8' [-Wmissing-prototypes]
 717 | long __machine_check_early_realmode_p8(struct pt_regs *regs)
 |  ^
   arch/powerpc/kernel/mce_power.c:722:6: warning: no previous prototype for 
'__machine_check_early_realmode_p9' [-Wmissing-prototypes]
 722 | long __machine_check_early_realmode_p9(struct pt_regs *regs)
 |  ^
>> arch/powerpc/kernel/mce_power.c:740:6: warning: no previous prototype for 
>> '__machine_check_early_realmode_p10' [-Wmissing-prototypes]
 740 | long __machine_check_early_realmode_p10(struct pt_regs *regs)
 |  ^~

vim +/__machine_check_early_realmode_p10 +740 arch/powerpc/kernel/mce_power.c

   739  
 > 740  long __machine_check_early_realmode_p10(struct pt_regs *regs)

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org


.config.gz
Description: application/gzip


[PATCH] powerpc/perf: Add kernel support for new MSR[HV PR] bits in trace-imc.

2020-07-03 Thread Anju T Sudhakar
IMC trace-mode record has MSR[HV PR] bits added in the third DW.
These bits can be used to set the cpumode for the instruction pointer
captured in each sample.

Add support in the kernel to use these bits to set the cpumode for
each sample.

Signed-off-by: Anju T Sudhakar 
---
 arch/powerpc/include/asm/imc-pmu.h |  5 +
 arch/powerpc/perf/imc-pmu.c| 29 -
 2 files changed, 29 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/imc-pmu.h 
b/arch/powerpc/include/asm/imc-pmu.h
index 4da4fcba0684..4f897993b710 100644
--- a/arch/powerpc/include/asm/imc-pmu.h
+++ b/arch/powerpc/include/asm/imc-pmu.h
@@ -99,6 +99,11 @@ struct trace_imc_data {
  */
 #define IMC_TRACE_RECORD_TB1_MASK  0x3ffULL
 
+/*
+ * Bit 0:1 in third DW of IMC trace record
+ * specifies the MSR[HV PR] values.
+ */
+#define IMC_TRACE_RECORD_VAL_HVPR(x)   ((x) >> 62)
 
 /*
  * Device tree parser code detects IMC pmu support and
diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
index cb50a9e1fd2d..310922fed9eb 100644
--- a/arch/powerpc/perf/imc-pmu.c
+++ b/arch/powerpc/perf/imc-pmu.c
@@ -1178,11 +1178,30 @@ static int trace_imc_prepare_sample(struct 
trace_imc_data *mem,
header->size = sizeof(*header) + event->header_size;
header->misc = 0;
 
-   if (is_kernel_addr(data->ip))
-   header->misc |= PERF_RECORD_MISC_KERNEL;
-   else
-   header->misc |= PERF_RECORD_MISC_USER;
-
+   if (cpu_has_feature(CPU_FTRS_POWER9)) {
+   if (is_kernel_addr(data->ip))
+   header->misc |= PERF_RECORD_MISC_KERNEL;
+   else
+   header->misc |= PERF_RECORD_MISC_USER;
+   } else {
+   switch (IMC_TRACE_RECORD_VAL_HVPR(mem->val)) {
+   case 0:/* when MSR HV and PR not set in the trace-record */
+   header->misc |= PERF_RECORD_MISC_GUEST_KERNEL;
+   break;
+   case 1: /* MSR HV is 0 and PR is 1 */
+   header->misc |= PERF_RECORD_MISC_GUEST_USER;
+   break;
+   case 2: /* MSR Hv is 1 and PR is 0 */
+   header->misc |= PERF_RECORD_MISC_HYPERVISOR;
+   break;
+   case 3: /* MSR HV is 1 and PR is 1 */
+   header->misc |= PERF_RECORD_MISC_USER;
+   break;
+   default:
+   pr_info("IMC: Unable to set the flag based on MSR 
bits\n");
+   break;
+   }
+   }
perf_event_header__init_id(header, data, event);
 
return 0;
-- 
2.25.4



Re: [PATCH V3 (RESEND) 2/3] mm/sparsemem: Enable vmem_altmap support in vmemmap_alloc_block_buf()

2020-07-03 Thread Anshuman Khandual



On 07/02/2020 07:37 PM, Catalin Marinas wrote:
> On Thu, Jun 18, 2020 at 06:45:29AM +0530, Anshuman Khandual wrote:
>> There are many instances where vmemap allocation is often switched between
>> regular memory and device memory just based on whether altmap is available
>> or not. vmemmap_alloc_block_buf() is used in various platforms to allocate
>> vmemmap mappings. Lets also enable it to handle altmap based device memory
>> allocation along with existing regular memory allocations. This will help
>> in avoiding the altmap based allocation switch in many places.
>>
>> While here also implement a regular memory allocation fallback mechanism
>> when the first preferred device memory allocation fails. This will ensure
>> preserving the existing semantics on powerpc platform. To summarize there
>> are three different methods to call vmemmap_alloc_block_buf().
>>
>> (., NULL,   false) /* Allocate from system RAM */
>> (., altmap, false) /* Allocate from altmap without any fallback */
>> (., altmap, true)  /* Allocate from altmap with fallback (system RAM) */
> [...]
>> diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
>> index bc73abf0bc25..01e25b56eccb 100644
>> --- a/arch/powerpc/mm/init_64.c
>> +++ b/arch/powerpc/mm/init_64.c
>> @@ -225,12 +225,12 @@ int __meminit vmemmap_populate(unsigned long start, 
>> unsigned long end, int node,
>>   * fall back to system memory if the altmap allocation fail.
>>   */
>>  if (altmap && !altmap_cross_boundary(altmap, start, page_size)) 
>> {
>> -p = altmap_alloc_block_buf(page_size, altmap);
>> -if (!p)
>> -pr_debug("altmap block allocation failed, 
>> falling back to system memory");
>> +p = vmemmap_alloc_block_buf(page_size, node,
>> +altmap, true);
>> +} else {
>> +p = vmemmap_alloc_block_buf(page_size, node,
>> +NULL, false);
>>  }
>> -if (!p)
>> -p = vmemmap_alloc_block_buf(page_size, node);
>>  if (!p)
>>  return -ENOMEM;
> 
> Is the fallback argument actually necessary. It may be cleaner to just
> leave the code as is with the choice between altmap and NULL. If an arch
> needs a fallback (only powerpc), they have the fallback in place
> already. I don't see the powerpc code any better after this change.
> 
> I'm fine with the altmap argument though.

Okay. Will drop 'fallback' from vmemmap_alloc_block_buf() and update the
callers. There will also be a single change in the subsequent patch i.e
vmemmap_alloc_block_buf(PMD_SIZE, node, altmap).
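
With the fallback argument gone, the powerpc call site should reduce to
roughly this (sketch only, assuming the three-argument form):

		if (altmap && !altmap_cross_boundary(altmap, start, page_size)) {
			p = vmemmap_alloc_block_buf(page_size, node, altmap);
			if (!p)
				pr_debug("altmap block allocation failed, falling back to system memory");
		}
		if (!p)
			p = vmemmap_alloc_block_buf(page_size, node, NULL);
		if (!p)
			return -ENOMEM;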


Re: [PATCH v2 5/6] powerpc/pseries/iommu: Make use of DDW even if it does not map the partition

2020-07-03 Thread Leonardo Bras
On Thu, 2020-07-02 at 10:31 +1000, Alexey Kardashevskiy wrote:
> 
> On 02/07/2020 09:48, Leonardo Bras wrote:
> > On Wed, 2020-07-01 at 16:57 -0300, Leonardo Bras wrote:
> > > > It is not necessarily "direct" anymore as the name suggests, you may
> > > > want to change that. DMA64_PROPNAME, may be. Thanks,
> > > > 
> > > 
> > > Yeah, you are right.
> > > I will change this for next version, also changing the string name to
> > > reflect this.
> > > 
> > > -#define DIRECT64_PROPNAME "linux,direct64-ddr-window-info"
> > > +#define DMA64_PROPNAME "linux,dma64-ddr-window-info"
> > > 
> > > Is that ok?
> > > 
> > > Thank you for helping!
> > 
> > In fact, there are a lot of places in this file where it's called direct
> > window. Should I replace everything?
> > Should it be in a separate patch?
> 
> If it looks simple and you write a nice commit log explaining all that
> and why you are not reusing the existing ibm,dma-window property (to
> provide a clue what "reset" will reset to? is there any other reason?)
> for that - sure, do it :)
> 

v3 available here:
http://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=187348=%2A=both

Best regards,
Leonardo



[PATCH v3 6/6] powerpc/pseries/iommu: Rename "direct window" to "dma window"

2020-07-03 Thread Leonardo Bras
A previous change introduced the usage of DDW as a bigger indirect DMA
mapping when the DDW available size does not map the whole partition.

As most of the code that manipulates direct mappings was reused for
indirect mappings, it's necessary to rename all names and debug/info
messages to reflect that it can be used for both kinds of mapping.

Also, define DEFAULT_DMA_WIN as "ibm,dma-window" to document that
it's the name of the default DMA window.

Those changes are not supposed to change how the code works in any
way, just adjust naming.

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/platforms/pseries/iommu.c | 101 +
 1 file changed, 53 insertions(+), 48 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index c652177de09c..070b80efc43a 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -339,7 +339,7 @@ struct dynamic_dma_window_prop {
__be32  window_shift;   /* ilog2(tce_window_size) */
 };
 
-struct direct_window {
+struct dma_win {
struct device_node *device;
const struct dynamic_dma_window_prop *prop;
struct list_head list;
@@ -359,12 +359,13 @@ struct ddw_create_response {
u32 addr_lo;
 };
 
-static LIST_HEAD(direct_window_list);
+static LIST_HEAD(dma_win_list);
 /* prevents races between memory on/offline and window creation */
-static DEFINE_SPINLOCK(direct_window_list_lock);
+static DEFINE_SPINLOCK(dma_win_list_lock);
 /* protects initializing window twice for same device */
-static DEFINE_MUTEX(direct_window_init_mutex);
+static DEFINE_MUTEX(dma_win_init_mutex);
 #define DMA64_PROPNAME "linux,dma64-ddr-window-info"
+#define DEFAULT_DMA_WIN "ibm,dma-window"
 
 static int tce_clearrange_multi_pSeriesLP(unsigned long start_pfn,
unsigned long num_pfn, const void *arg)
@@ -697,9 +698,12 @@ static void pci_dma_bus_setup_pSeriesLP(struct pci_bus 
*bus)
pr_debug("pci_dma_bus_setup_pSeriesLP: setting up bus %pOF\n",
 dn);
 
-   /* Find nearest ibm,dma-window, walking up the device tree */
+   /*
+* Find nearest ibm,dma-window (default DMA window), walking up the
+* device tree
+*/
for (pdn = dn; pdn != NULL; pdn = pdn->parent) {
-   dma_window = of_get_property(pdn, "ibm,dma-window", NULL);
+   dma_window = of_get_property(pdn, DEFAULT_DMA_WIN, NULL);
if (dma_window != NULL)
break;
}
@@ -710,7 +714,8 @@ static void pci_dma_bus_setup_pSeriesLP(struct pci_bus *bus)
dma_window = alt_dma_window;
 
if (dma_window == NULL) {
-   pr_debug("  no ibm,dma-window nor linux,dma64-ddr-window-info property !\n");
+   pr_debug("  no %s nor %s property !\n",
+DEFAULT_DMA_WIN, DMA64_PROPNAME);
return;
}
 
@@ -808,11 +813,11 @@ static void remove_dma_window(struct device_node *np, u32 
*ddw_avail,
 
ret = rtas_call(ddw_avail[DDW_REMOVE_PE_DMA_WIN], 1, 1, NULL, liobn);
if (ret)
-   pr_warn("%pOF: failed to remove direct window: rtas returned "
+   pr_warn("%pOF: failed to remove dma window: rtas returned "
"%d to ibm,remove-pe-dma-window(%x) %llx\n",
np, ret, ddw_avail[DDW_REMOVE_PE_DMA_WIN], liobn);
else
-   pr_debug("%pOF: successfully removed direct window: rtas returned "
+   pr_debug("%pOF: successfully removed dma window: rtas returned "
"%d to ibm,remove-pe-dma-window(%x) %llx\n",
np, ret, ddw_avail[DDW_REMOVE_PE_DMA_WIN], liobn);
 }
@@ -840,26 +845,26 @@ static void remove_ddw(struct device_node *np, bool 
remove_prop)
 
ret = of_remove_property(np, win);
if (ret)
-   pr_warn("%pOF: failed to remove direct window property: %d\n",
+   pr_warn("%pOF: failed to remove dma window property: %d\n",
np, ret);
 }
 
 static u64 find_existing_ddw(struct device_node *pdn)
 {
-   struct direct_window *window;
-   const struct dynamic_dma_window_prop *direct64;
+   struct dma_win *window;
+   const struct dynamic_dma_window_prop *dma64;
u64 dma_addr = 0;
 
-   spin_lock(&direct_window_list_lock);
+   spin_lock(&dma_win_list_lock);
/* check if we already created a window and dupe that config if so */
-   list_for_each_entry(window, &direct_window_list, list) {
+   list_for_each_entry(window, &dma_win_list, list) {
if (window->device == pdn) {
-   direct64 = window->prop;
-   dma_addr = be64_to_cpu(direct64->dma_base);
+   dma64 = window->prop;
+   dma_addr = be64_to_cpu(dma64->dma_base);
break;
}
  

[PATCH v3 5/6] powerpc/pseries/iommu: Make use of DDW even if it does not map the partition

2020-07-03 Thread Leonardo Bras
As of today, if the biggest DDW that can be created can't map the whole
partition, its creation is skipped and the default DMA window
"ibm,dma-window" is used instead.

Usually this DDW is bigger than the default DMA window, and it performs
better, so it would be nice to use it instead.

The DDW created will be used for direct mapping by default.
If it's not available, indirect mapping will be used instead.

As there will never be both mappings at the same time, the same property
name can be used for the created DDW.

So renaming
define DIRECT64_PROPNAME "linux,direct64-ddr-window-info"
to
define DMA64_PROPNAME "linux,dma64-ddr-window-info"
looks the right thing to do.

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/platforms/pseries/iommu.c | 38 --
 1 file changed, 24 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 5b520ac354c6..c652177de09c 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -364,7 +364,7 @@ static LIST_HEAD(direct_window_list);
 static DEFINE_SPINLOCK(direct_window_list_lock);
 /* protects initializing window twice for same device */
 static DEFINE_MUTEX(direct_window_init_mutex);
-#define DIRECT64_PROPNAME "linux,direct64-ddr-window-info"
+#define DMA64_PROPNAME "linux,dma64-ddr-window-info"
 
 static int tce_clearrange_multi_pSeriesLP(unsigned long start_pfn,
unsigned long num_pfn, const void *arg)
@@ -690,7 +690,7 @@ static void pci_dma_bus_setup_pSeriesLP(struct pci_bus *bus)
struct iommu_table *tbl;
struct device_node *dn, *pdn;
struct pci_dn *ppci;
-   const __be32 *dma_window = NULL;
+   const __be32 *dma_window = NULL, *alt_dma_window = NULL;
 
dn = pci_bus_to_OF_node(bus);
 
@@ -704,8 +704,13 @@ static void pci_dma_bus_setup_pSeriesLP(struct pci_bus 
*bus)
break;
}
 
+   /* If there is a DDW available, use it instead */
+   alt_dma_window = of_get_property(pdn, DMA64_PROPNAME, NULL);
+   if (alt_dma_window)
+   dma_window = alt_dma_window;
+
if (dma_window == NULL) {
-   pr_debug("  no ibm,dma-window property !\n");
+   pr_debug("  no ibm,dma-window nor linux,dma64-ddr-window-info property !\n");
return;
}
 
@@ -823,7 +828,7 @@ static void remove_ddw(struct device_node *np, bool 
remove_prop)
if (ret)
return;
 
-   win = of_find_property(np, DIRECT64_PROPNAME, NULL);
+   win = of_find_property(np, DMA64_PROPNAME, NULL);
if (!win)
return;
 
@@ -869,8 +874,8 @@ static int find_existing_ddw_windows(void)
if (!firmware_has_feature(FW_FEATURE_LPAR))
return 0;
 
-   for_each_node_with_property(pdn, DIRECT64_PROPNAME) {
-   direct64 = of_get_property(pdn, DIRECT64_PROPNAME, &len);
+   for_each_node_with_property(pdn, DMA64_PROPNAME) {
+   direct64 = of_get_property(pdn, DMA64_PROPNAME, &len);
if (!direct64)
continue;
 
@@ -1205,23 +1210,26 @@ static u64 enable_ddw(struct pci_dev *dev, struct 
device_node *pdn)
  query.page_size);
goto out_restore_defwin;
}
+
/* verify the window * number of ptes will map the partition */
-   /* check largest block * page size > max memory hotplug addr */
max_addr = ddw_memory_hotplug_max();
if (query.largest_available_block < (max_addr >> page_shift)) {
-   dev_dbg(&dev->dev, "can't map partition max 0x%llx with %llu "
- "%llu-sized pages\n", max_addr,  query.largest_available_block,
- 1ULL << page_shift);
-   goto out_restore_defwin;
+   dev_dbg(&dev->dev, "can't map partition max 0x%llx with %llu %llu-sized pages\n",
+   max_addr, query.largest_available_block,
+   1ULL << page_shift);
+
+   len = order_base_2(query.largest_available_block << page_shift);
+   } else {
+   len = order_base_2(max_addr);
}
-   len = order_base_2(max_addr);
+
win64 = kzalloc(sizeof(struct property), GFP_KERNEL);
if (!win64) {
dev_info(&dev->dev,
"couldn't allocate property for 64bit dma window\n");
goto out_restore_defwin;
}
-   win64->name = kstrdup(DIRECT64_PROPNAME, GFP_KERNEL);
+   win64->name = kstrdup(DMA64_PROPNAME, GFP_KERNEL);
win64->value = ddwprop = kmalloc(sizeof(*ddwprop), GFP_KERNEL);
win64->length = sizeof(*ddwprop);
if (!win64->name || !win64->value) {
@@ -1268,7 +1276,9 @@ static u64 enable_ddw(struct pci_dev *dev, struct 
device_node *pdn)
list_add(&window->list, &direct_window_list);
spin_unlock(&direct_window_list_lock);
 
-   

[PATCH v3 4/6] powerpc/pseries/iommu: Remove default DMA window before creating DDW

2020-07-03 Thread Leonardo Bras
On LoPAR "DMA Window Manipulation Calls", it's recommended to remove the
default DMA window for the device, before attempting to configure a DDW,
in order to make the maximum resources available for the next DDW to be
created.

This is a requirement for using DDW on devices in which hypervisor
allows only one DMA window.

If setting up a new DDW fails anywhere after the removal of this
default DMA window, the default window needs to be restored.
For this, an implementation of ibm,reset-pe-dma-windows rtas call is
needed:

Platforms supporting the DDW option starting with LoPAR level 2.7 implement
ibm,ddw-extensions. The first extension available (index 2) carries the
token for ibm,reset-pe-dma-windows rtas call, which is used to restore
the default DMA window for a device, if it has been deleted.

It does so by resetting the TCE table allocation for the PE to its
boot time value, available in "ibm,dma-window" device tree node.

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/platforms/pseries/iommu.c | 83 +-
 1 file changed, 69 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 4e33147825cc..5b520ac354c6 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -1066,6 +1066,38 @@ static phys_addr_t ddw_memory_hotplug_max(void)
return max_addr;
 }
 
+/*
+ * Platforms supporting the DDW option starting with LoPAR level 2.7 implement
+ * ibm,ddw-extensions, which carries the rtas token for
+ * ibm,reset-pe-dma-windows.
+ * That rtas-call can be used to restore the default DMA window for the device.
+ */
+static void reset_dma_window(struct pci_dev *dev, struct device_node *par_dn)
+{
+   int ret;
+   u32 cfg_addr, reset_dma_win;
+   u64 buid;
+   struct device_node *dn;
+   struct pci_dn *pdn;
+
+   ret = ddw_read_ext(par_dn, DDW_EXT_RESET_DMA_WIN, &reset_dma_win);
+   if (ret)
+   return;
+
+   dn = pci_device_to_OF_node(dev);
+   pdn = PCI_DN(dn);
+   buid = pdn->phb->buid;
+   cfg_addr = ((pdn->busno << 16) | (pdn->devfn << 8));
+
+   ret = rtas_call(reset_dma_win, 3, 1, NULL, cfg_addr, BUID_HI(buid),
+   BUID_LO(buid));
+   if (ret)
+   dev_info(&dev->dev,
+"ibm,reset-pe-dma-windows(%x) %x %x %x returned %d ",
+reset_dma_win, cfg_addr, BUID_HI(buid), BUID_LO(buid),
+ret);
+}
+
 /*
  * If the PE supports dynamic dma windows, and there is space for a table
  * that can map all pages in a linear offset, then setup such a table,
@@ -1079,7 +,7 @@ static phys_addr_t ddw_memory_hotplug_max(void)
  */
 static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 {
-   int len, ret;
+   int len, ret, reset_win_ext;
struct ddw_query_response query;
struct ddw_create_response create;
int page_shift;
@@ -1087,7 +1119,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct 
device_node *pdn)
struct device_node *dn;
u32 ddw_avail[DDW_APPLICABLE_SIZE];
struct direct_window *window;
-   struct property *win64;
+   struct property *win64, *default_win = NULL;
struct dynamic_dma_window_prop *ddwprop;
struct failed_ddw_pdn *fpdn;
 
@@ -1122,7 +1154,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct 
device_node *pdn)
if (ret)
goto out_failed;
 
-   /*
+   /*
 * Query if there is a second window of size to map the
 * whole partition.  Query returns number of windows, largest
 * block assigned to PE (partition endpoint), and two bitmasks
@@ -1133,14 +1165,34 @@ static u64 enable_ddw(struct pci_dev *dev, struct 
device_node *pdn)
if (ret != 0)
goto out_failed;
 
+   /*
+* If there is no window available, remove the default DMA window,
+* if it's present. This will make all the resources available to the
+* new DDW window.
+* If anything fails after this, we need to restore it, so also check
+* for extensions presence.
+*/
if (query.windows_available == 0) {
-   /*
-* no additional windows are available for this device.
-* We might be able to reallocate the existing window,
-* trading in for a larger page size.
-*/
-   dev_dbg(>dev, "no free dynamic windows");
-   goto out_failed;
+   default_win = of_find_property(pdn, "ibm,dma-window", NULL);
+   if (!default_win)
+   goto out_failed;
+
+   reset_win_ext = ddw_read_ext(pdn, DDW_EXT_RESET_DMA_WIN, NULL);
+   if (reset_win_ext)
+   goto out_failed;
+
+   remove_dma_window(pdn, ddw_avail, default_win);
+
+   /* Query again, 

[PATCH v3 3/6] powerpc/pseries/iommu: Move window-removing part of remove_ddw into remove_dma_window

2020-07-03 Thread Leonardo Bras
Move the window-removing part of remove_ddw into a new function
(remove_dma_window), so it can be used to remove other DMA windows.

It's useful for removing DMA windows that don't create DIRECT64_PROPNAME
property, like the default DMA window from the device, which uses
"ibm,dma-window".

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/platforms/pseries/iommu.c | 45 +++---
 1 file changed, 27 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 1a933c4e8bba..4e33147825cc 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -781,25 +781,14 @@ static int __init disable_ddw_setup(char *str)
 
 early_param("disable_ddw", disable_ddw_setup);
 
-static void remove_ddw(struct device_node *np, bool remove_prop)
+static void remove_dma_window(struct device_node *np, u32 *ddw_avail,
+ struct property *win)
 {
struct dynamic_dma_window_prop *dwp;
-   struct property *win64;
-   u32 ddw_avail[DDW_APPLICABLE_SIZE];
u64 liobn;
-   int ret = 0;
-
-   ret = of_property_read_u32_array(np, "ibm,ddw-applicable",
-    &ddw_avail[0], DDW_APPLICABLE_SIZE);
-
-   win64 = of_find_property(np, DIRECT64_PROPNAME, NULL);
-   if (!win64)
-   return;
-
-   if (ret || win64->length < sizeof(*dwp))
-   goto delprop;
+   int ret;
 
-   dwp = win64->value;
+   dwp = win->value;
liobn = (u64)be32_to_cpu(dwp->liobn);
 
/* clear the whole window, note the arg is in kernel pages */
@@ -821,10 +810,30 @@ static void remove_ddw(struct device_node *np, bool 
remove_prop)
pr_debug("%pOF: successfully removed direct window: rtas returned "
"%d to ibm,remove-pe-dma-window(%x) %llx\n",
np, ret, ddw_avail[DDW_REMOVE_PE_DMA_WIN], liobn);
+}
+
+static void remove_ddw(struct device_node *np, bool remove_prop)
+{
+   struct property *win;
+   u32 ddw_avail[DDW_APPLICABLE_SIZE];
+   int ret = 0;
+
+   ret = of_property_read_u32_array(np, "ibm,ddw-applicable",
+    &ddw_avail[0], DDW_APPLICABLE_SIZE);
+   if (ret)
+   return;
+
+   win = of_find_property(np, DIRECT64_PROPNAME, NULL);
+   if (!win)
+   return;
+
+   if (win->length >= sizeof(struct dynamic_dma_window_prop))
+   remove_dma_window(np, ddw_avail, win);
+
+   if (!remove_prop)
+   return;
 
-delprop:
-   if (remove_prop)
-   ret = of_remove_property(np, win64);
+   ret = of_remove_property(np, win);
if (ret)
pr_warn("%pOF: failed to remove direct window property: %d\n",
np, ret);
-- 
2.25.4



[PATCH v3 2/6] powerpc/pseries/iommu: Update call to ibm,query-pe-dma-windows

2020-07-03 Thread Leonardo Bras
From LoPAR level 2.8, "ibm,ddw-extensions" index 3 can make the number of
outputs from "ibm,query-pe-dma-windows" go from 5 to 6.

This change of output size is meant to expand the address size of
largest_available_block PE TCE from 32-bit to 64-bit, which ends up
shifting page_size and migration_capable.

This ends up requiring the update of
ddw_query_response->largest_available_block from u32 to u64, and manually
assigning the values from the buffer into this struct, according to
output size.

Also, a routine was created to help read the ddw extensions as
suggested by LoPAR: first reading the size of the extension array from
index 0, checking if the property exists, and then returning its value.

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/platforms/pseries/iommu.c | 91 +++---
 1 file changed, 81 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index ac0d6376bdad..1a933c4e8bba 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -47,6 +47,12 @@ enum {
DDW_APPLICABLE_SIZE
 };
 
+enum {
+   DDW_EXT_SIZE = 0,
+   DDW_EXT_RESET_DMA_WIN = 1,
+   DDW_EXT_QUERY_OUT_SIZE = 2
+};
+
 static struct iommu_table_group *iommu_pseries_alloc_group(int node)
 {
struct iommu_table_group *table_group;
@@ -342,7 +348,7 @@ struct direct_window {
 /* Dynamic DMA Window support */
 struct ddw_query_response {
u32 windows_available;
-   u32 largest_available_block;
+   u64 largest_available_block;
u32 page_size;
u32 migration_capable;
 };
@@ -877,14 +883,62 @@ static int find_existing_ddw_windows(void)
 }
 machine_arch_initcall(pseries, find_existing_ddw_windows);
 
+/**
+ * ddw_read_ext - Get the value of an DDW extension
+ * @np:device node from which the extension value is to be read.
+ * @extnum:index number of the extension.
+ * @value: pointer to return value, modified when extension is available.
+ *
+ * Checks if "ibm,ddw-extensions" exists for this node, and get the value
+ * on index 'extnum'.
+ * It can be used only to check if a property exists, passing value == NULL.
+ *
+ * Returns:
+ * 0 if extension successfully read
+ * -EINVAL if the "ibm,ddw-extensions" does not exist,
+ * -ENODATA if "ibm,ddw-extensions" does not have a value, and
+ * -EOVERFLOW if "ibm,ddw-extensions" does not contain this extension.
+ */
+static inline int ddw_read_ext(const struct device_node *np, int extnum,
+  u32 *value)
+{
+   static const char propname[] = "ibm,ddw-extensions";
+   u32 count;
+   int ret;
+
+   ret = of_property_read_u32_index(np, propname, DDW_EXT_SIZE, &count);
+   if (ret)
+   return ret;
+
+   if (count < extnum)
+   return -EOVERFLOW;
+
+   if (!value)
+   value = &count;
+
+   return of_property_read_u32_index(np, propname, extnum, value);
+}
+
 static int query_ddw(struct pci_dev *dev, const u32 *ddw_avail,
-   struct ddw_query_response *query)
+struct ddw_query_response *query,
+struct device_node *parent)
 {
struct device_node *dn;
struct pci_dn *pdn;
-   u32 cfg_addr;
+   u32 cfg_addr, ext_query, query_out[5];
u64 buid;
-   int ret;
+   int ret, out_sz;
+
+   /*
+* From LoPAR level 2.8, "ibm,ddw-extensions" index 3 can rule how many
+* output parameters ibm,query-pe-dma-windows will have, ranging from
+* 5 to 6.
+*/
+   ret = ddw_read_ext(parent, DDW_EXT_QUERY_OUT_SIZE, &ext_query);
+   if (!ret && ext_query == 1)
+   out_sz = 6;
+   else
+   out_sz = 5;
 
/*
 * Get the config address and phb buid of the PE window.
@@ -897,11 +951,28 @@ static int query_ddw(struct pci_dev *dev, const u32 
*ddw_avail,
buid = pdn->phb->buid;
cfg_addr = ((pdn->busno << 16) | (pdn->devfn << 8));
 
-   ret = rtas_call(ddw_avail[DDW_QUERY_PE_DMA_WIN], 3, 5, (u32 *)query,
+   ret = rtas_call(ddw_avail[DDW_QUERY_PE_DMA_WIN], 3, out_sz, query_out,
cfg_addr, BUID_HI(buid), BUID_LO(buid));
-   dev_info(&dev->dev, "ibm,query-pe-dma-windows(%x) %x %x %x"
-   " returned %d\n", ddw_avail[DDW_QUERY_PE_DMA_WIN], cfg_addr,
-BUID_HI(buid), BUID_LO(buid), ret);
+   dev_info(&dev->dev, "ibm,query-pe-dma-windows(%x) %x %x %x returned %d\n",
+ddw_avail[DDW_QUERY_PE_DMA_WIN], cfg_addr, BUID_HI(buid),
+BUID_LO(buid), ret);
+
+   switch (out_sz) {
+   case 5:
+   query->windows_available = query_out[0];
+   query->largest_available_block = query_out[1];
+   query->page_size = query_out[2];
+   query->migration_capable = query_out[3];
+   break;
+   case 6:
+ 
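
For the six-output layout, a sketch of how the remaining fields could be
unpacked, assuming the wider largest_available_block occupies two consecutive
32-bit cells and shifts page_size and migration_capable down by one (the cell
order is an assumption, not taken from this patch):

	case 6:
		query->windows_available = query_out[0];
		query->largest_available_block = ((u64)query_out[1] << 32) |
						 query_out[2];
		query->page_size = query_out[3];
		query->migration_capable = query_out[4];
		break;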

[PATCH v3 1/6] powerpc/pseries/iommu: Create defines for operations in ibm,ddw-applicable

2020-07-03 Thread Leonardo Bras
Create defines to help handle ibm,ddw-applicable values, avoiding
confusion about the index of given operations.

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/platforms/pseries/iommu.c | 43 --
 1 file changed, 26 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 6d47b4a3ce39..ac0d6376bdad 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -39,6 +39,14 @@
 
 #include "pseries.h"
 
+enum {
+   DDW_QUERY_PE_DMA_WIN  = 0,
+   DDW_CREATE_PE_DMA_WIN = 1,
+   DDW_REMOVE_PE_DMA_WIN = 2,
+
+   DDW_APPLICABLE_SIZE
+};
+
 static struct iommu_table_group *iommu_pseries_alloc_group(int node)
 {
struct iommu_table_group *table_group;
@@ -771,12 +779,12 @@ static void remove_ddw(struct device_node *np, bool 
remove_prop)
 {
struct dynamic_dma_window_prop *dwp;
struct property *win64;
-   u32 ddw_avail[3];
+   u32 ddw_avail[DDW_APPLICABLE_SIZE];
u64 liobn;
int ret = 0;
 
ret = of_property_read_u32_array(np, "ibm,ddw-applicable",
-    &ddw_avail[0], 3);
+    &ddw_avail[0], DDW_APPLICABLE_SIZE);
 
win64 = of_find_property(np, DIRECT64_PROPNAME, NULL);
if (!win64)
@@ -798,15 +806,15 @@ static void remove_ddw(struct device_node *np, bool 
remove_prop)
pr_debug("%pOF successfully cleared tces in window.\n",
 np);
 
-   ret = rtas_call(ddw_avail[2], 1, 1, NULL, liobn);
+   ret = rtas_call(ddw_avail[DDW_REMOVE_PE_DMA_WIN], 1, 1, NULL, liobn);
if (ret)
pr_warn("%pOF: failed to remove direct window: rtas returned "
"%d to ibm,remove-pe-dma-window(%x) %llx\n",
-   np, ret, ddw_avail[2], liobn);
+   np, ret, ddw_avail[DDW_REMOVE_PE_DMA_WIN], liobn);
else
pr_debug("%pOF: successfully removed direct window: rtas returned "
"%d to ibm,remove-pe-dma-window(%x) %llx\n",
-   np, ret, ddw_avail[2], liobn);
+   np, ret, ddw_avail[DDW_REMOVE_PE_DMA_WIN], liobn);
 
 delprop:
if (remove_prop)
@@ -889,11 +897,11 @@ static int query_ddw(struct pci_dev *dev, const u32 
*ddw_avail,
buid = pdn->phb->buid;
cfg_addr = ((pdn->busno << 16) | (pdn->devfn << 8));
 
-   ret = rtas_call(ddw_avail[0], 3, 5, (u32 *)query,
- cfg_addr, BUID_HI(buid), BUID_LO(buid));
+   ret = rtas_call(ddw_avail[DDW_QUERY_PE_DMA_WIN], 3, 5, (u32 *)query,
+   cfg_addr, BUID_HI(buid), BUID_LO(buid));
dev_info(&dev->dev, "ibm,query-pe-dma-windows(%x) %x %x %x"
-   " returned %d\n", ddw_avail[0], cfg_addr, BUID_HI(buid),
-   BUID_LO(buid), ret);
+   " returned %d\n", ddw_avail[DDW_QUERY_PE_DMA_WIN], cfg_addr,
+BUID_HI(buid), BUID_LO(buid), ret);
return ret;
 }
 
@@ -920,15 +928,16 @@ static int create_ddw(struct pci_dev *dev, const u32 
*ddw_avail,
 
do {
/* extra outputs are LIOBN and dma-addr (hi, lo) */
-   ret = rtas_call(ddw_avail[1], 5, 4, (u32 *)create,
-   cfg_addr, BUID_HI(buid), BUID_LO(buid),
-   page_shift, window_shift);
+   ret = rtas_call(ddw_avail[DDW_CREATE_PE_DMA_WIN], 5, 4,
+   (u32 *)create, cfg_addr, BUID_HI(buid),
+   BUID_LO(buid), page_shift, window_shift);
} while (rtas_busy_delay(ret));
dev_info(&dev->dev,
"ibm,create-pe-dma-window(%x) %x %x %x %x %x returned %d "
-   "(liobn = 0x%x starting addr = %x %x)\n", ddw_avail[1],
-cfg_addr, BUID_HI(buid), BUID_LO(buid), page_shift,
-window_shift, ret, create->liobn, create->addr_hi, create->addr_lo);
+   "(liobn = 0x%x starting addr = %x %x)\n",
+ddw_avail[DDW_CREATE_PE_DMA_WIN], cfg_addr, BUID_HI(buid),
+BUID_LO(buid), page_shift, window_shift, ret, create->liobn,
+create->addr_hi, create->addr_lo);
 
return ret;
 }
@@ -996,7 +1005,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct 
device_node *pdn)
int page_shift;
u64 dma_addr, max_addr;
struct device_node *dn;
-   u32 ddw_avail[3];
+   u32 ddw_avail[DDW_APPLICABLE_SIZE];
struct direct_window *window;
struct property *win64;
struct dynamic_dma_window_prop *ddwprop;
@@ -1029,7 +1038,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct 
device_node *pdn)
 * the property is actually in the parent, not the PE
 */
ret = of_property_read_u32_array(pdn, "ibm,ddw-applicable",
-    &ddw_avail[0], 3);
+ 

[PATCH v3 0/6] Remove default DMA window before creating DDW

2020-07-03 Thread Leonardo Bras
There are some devices in which a hypervisor may only allow 1 DMA window
to exist at a time, and in those cases, a DDW is never created for them,
since the default DMA window keeps using this resource.

LoPAR recommends this procedure:
1. Remove the default DMA window,
2. Query for which configs the DDW can be created,
3. Create a DDW (see the sketch below).
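
A rough sketch of how the series wires these steps into enable_ddw()
(function names come from the patches below; error handling and the
ddw_read_ext() checks are omitted):

	if (query.windows_available == 0) {
		default_win = of_find_property(pdn, "ibm,dma-window", NULL);
		remove_dma_window(pdn, ddw_avail, default_win);		/* step 1 */
		query_ddw(dev, ddw_avail, &query, pdn);			/* step 2 */
	}
	/* ... window size computed, property allocated ... */
	ret = create_ddw(dev, ddw_avail, &create, page_shift, len);	/* step 3 */
	if (ret != 0)
		reset_dma_window(dev, pdn);	/* restore the default window */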

Patch #1:
Create defines for outputs of ibm,ddw-applicable, so it's easier to
identify them.

Patch #2:
- After LoPAR level 2.8, there is an extension that can make
  ibm,query-pe-dma-windows have 6 outputs instead of 5. This changes the
  order of the outputs, and that can cause some trouble.
- query_ddw() was updated to check how many outputs the
  ibm,query-pe-dma-windows is supposed to have, update the rtas_call() and
  deal correctly with the outputs in both cases.
- This patch looks somewhat unrelated to the series, but it can avoid future
  problems with DDW creation.

Patch #3 moves the window-removing code from remove_ddw() to
remove_dma_window(), creating a way to delete any DMA window, so it can be
used to delete the default DMA window.

Patch #4 makes use of the remove_dma_window() from patch #3 to remove the
default DMA window before query_ddw(). It also implements a new rtas call
to recover the default DMA window, in case anything fails after it was
removed, and a DDW couldn't be created.

Patch #5:
Instead of destroying the created DDW if it doesn't map the whole
partition, use it in place of the default DMA window, as it improves
performance.

Patch #6:
Does some renaming of 'direct window' to 'dma window', given the DDW
created can now be also used in indirect mapping if direct mapping is not
available.

All patches were tested into an LPAR with an Ethernet VF:
4005:01:00.0 Ethernet controller: Mellanox Technologies MT27700 Family
[ConnectX-4 Virtual Function]

Patch #5 was tested with a 64GB DDW which did not map the whole
partition (128G). Performance improvement noticed by using the DDW instead
of the default DMA window:

64 thread write throughput: +203.0%
64 thread read throughput: +17.5%
1 thread write throughput: +20.5%
1 thread read throughput: +3.43%
Average write latency: -23.0%
Average read latency:  -2.26%

---
Changes since v2:
- Change the way ibm,ddw-extensions is accessed, using a proper function
  instead of doing this inline every time it's used.
- Remove previous patch #6, as it doesn't look like it would be useful.
- Add a new patch for changing names from direct* to dma*, as indirect 
  mapping can be used from now on.
- Fix some typos, correct some define usage.
- v2 link: 
http://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=185433=%2A=both

Changes since v1:
- Add defines for ibm,ddw-applicable and ibm,ddw-extensions outputs
- Merge aux function query_ddw_out_sz() into query_ddw()
- Merge reset_dma_window() patch (prev. #2) into remove default DMA
  window patch (#4).
- Keep device_node *np name instead of using pdn in remove_*()
- Rename 'device_node *pdn' into 'parent' in new functions
- Rename dfl_win to default_win
- Only remove the default DMA window if there is no window available
  in first query.
- Check if default DMA window can be restored before removing it.
- Fix 'uninitialized use' (found by travis mpe:ci-test)
- New patches #5 and #6
- v1 link: 
http://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=184420=%2A=both

Special thanks to Alexey Kardashevskiy and Oliver O'Halloran for
the feedback provided!

Leonardo Bras (6):
  powerpc/pseries/iommu: Create defines for operations in
ibm,ddw-applicable
  powerpc/pseries/iommu: Update call to ibm,query-pe-dma-windows
  powerpc/pseries/iommu: Move window-removing part of remove_ddw into
remove_dma_window
  powerpc/pseries/iommu: Remove default DMA window before creating DDW
  powerpc/pseries/iommu: Make use of DDW even if it does not map the
partition
  powerpc/pseries/iommu: Rename "direct window" to "dma window"

 arch/powerpc/platforms/pseries/iommu.c | 379 ++---
 1 file changed, 269 insertions(+), 110 deletions(-)

-- 
2.25.4