Re: [PATCH v2] ASoC: fsl_asrc: Add an option to select internal ratio mode
On Fri, Jul 03, 2020 at 11:50:20PM +0100, Mark Brown wrote: > On Fri, Jul 03, 2020 at 03:46:58PM -0700, Nicolin Chen wrote: > > > > [1/1] ASoC: fsl_asrc: Add an option to select internal ratio mode > > > commit: d0250cf4f2abfbea64ed247230f08f5ae23979f0 > > > You already applied v3 of this change: > > https://mailman.alsa-project.org/pipermail/alsa-devel/2020-July/169976.html > > > And it's already in linux-next also. Not sure what's happening... > > The script can't always tell the difference between versions - it looks > like it's notified for v2 based on seeing v3 in git. OK..as long as no revert nor re-applying happens, we can ignore :) Thanks
Re: [PATCH v2] ASoC: fsl_asrc: Add an option to select internal ratio mode
On Fri, Jul 03, 2020 at 03:46:58PM -0700, Nicolin Chen wrote: > > [1/1] ASoC: fsl_asrc: Add an option to select internal ratio mode > > commit: d0250cf4f2abfbea64ed247230f08f5ae23979f0 > You already applied v3 of this change: > https://mailman.alsa-project.org/pipermail/alsa-devel/2020-July/169976.html > And it's already in linux-next also. Not sure what's happening... The script can't always tell the difference between versions - it looks like it's notified for v2 based on seeing v3 in git.
Re: [PATCH v2] ASoC: fsl_asrc: Add an option to select internal ratio mode
Hi Mark, On Fri, Jul 03, 2020 at 06:03:43PM +0100, Mark Brown wrote: > On Tue, 30 Jun 2020 16:47:56 +0800, Shengjiu Wang wrote: > > The ASRC not only supports ideal ratio mode, but also supports > > internal ratio mode. > > > > For internal rato mode, the rate of clock source should be divided > > with no remainder by sample rate, otherwise there is sound > > distortion. > > > > [...] > > Applied to > >https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound.git for-next > > Thanks! > > [1/1] ASoC: fsl_asrc: Add an option to select internal ratio mode > commit: d0250cf4f2abfbea64ed247230f08f5ae23979f0 You already applied v3 of this change: https://mailman.alsa-project.org/pipermail/alsa-devel/2020-July/169976.html And it's already in linux-next also. Not sure what's happening...
Re: [PATCH v2] ASoC: fsl_asrc: Add an option to select internal ratio mode
On Tue, 30 Jun 2020 16:47:56 +0800, Shengjiu Wang wrote: > The ASRC not only supports ideal ratio mode, but also supports > internal ratio mode. > > For internal rato mode, the rate of clock source should be divided > with no remainder by sample rate, otherwise there is sound > distortion. > > [...] Applied to https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound.git for-next Thanks! [1/1] ASoC: fsl_asrc: Add an option to select internal ratio mode commit: d0250cf4f2abfbea64ed247230f08f5ae23979f0 All being well this means that it will be integrated into the linux-next tree (usually sometime in the next 24 hours) and sent to Linus during the next merge window (or sooner if it is a bug fix), however if problems are discovered then the patch may be dropped or reverted. You may get further e-mails resulting from automated or manual testing and review of the tree, please engage with people reporting problems and send followup patches addressing any issues that are reported if needed. If any updates are required or you are submitting further changes they should be sent as incremental updates against current git, existing patches will not be replaced. Please add any relevant lists and maintainers to the CCs when replying to this mail. Thanks, Mark
[PATCH 0/2] Rework secure memslot dropping
When doing memory hotplug on a secure VM, the secure pages are not properly cleaned from the secure device when the memslot is dropped. This silent error then prevents the SVM from rebooting correctly after the following sequence of commands is run in the QEMU monitor:

device_add pc-dimm,id=dimm1,memdev=mem1
device_del dimm1
device_add pc-dimm,id=dimm1,memdev=mem1

At reboot time, when the kernel boots again and switches to secure mode, page_in fails for the pages in the memslot because the cleanup was not done properly: the memslot is flagged as invalid during the hot unplug, so the page fault mechanism is never triggered.

To prevent that, instead of relying on the page fault mechanism to trigger the page-out of the secure pages when the memslot is dropped, it seems simpler to call the function doing the page-out directly. This way the state of the memslot does not interfere with the page-out process.

This series applies on top of Ram's series titled "[PATCH v3 0/4] Migrate non-migrated pages of a SVM.": https://lore.kernel.org/linuxppc-dev/1592606622-29884-1-git-send-email-linux...@us.ibm.com/#r

Laurent Dufour (2): KVM: PPC: Book3S HV: move kvmppc_svm_page_out up KVM: PPC: Book3S HV: rework secure mem slot dropping

arch/powerpc/kvm/book3s_hv_uvmem.c | 220 + 1 file changed, 127 insertions(+), 93 deletions(-) -- 2.27.0
[PATCH 1/2] KVM: PPC: Book3S HV: move kvmppc_svm_page_out up
kvmppc_svm_page_out() will need to be called by kvmppc_uvmem_drop_pages(), so move it higher up in this file. Furthermore, it will be useful to call this function while already holding kvm->arch.uvmem_lock, so prefix the original function with __ and remove the locking from it, then introduce a wrapper which calls that function with the lock held.

There is no functional change.

Cc: Ram Pai Cc: Bharata B Rao Cc: Paul Mackerras Signed-off-by: Laurent Dufour ---

arch/powerpc/kvm/book3s_hv_uvmem.c | 166 - 1 file changed, 90 insertions(+), 76 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 778a6ea86991..852cc9ae6a0b 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -435,6 +435,96 @@ unsigned long kvmppc_h_svm_init_done(struct kvm *kvm)
 	return ret;
 }
 
+/*
+ * Provision a new page on HV side and copy over the contents
+ * from secure memory using UV_PAGE_OUT uvcall.
+ * Caller must hold kvm->arch.uvmem_lock.
+ */
+static int __kvmppc_svm_page_out(struct vm_area_struct *vma,
+		unsigned long start,
+		unsigned long end, unsigned long page_shift,
+		struct kvm *kvm, unsigned long gpa)
+{
+	unsigned long src_pfn, dst_pfn = 0;
+	struct migrate_vma mig;
+	struct page *dpage, *spage;
+	struct kvmppc_uvmem_page_pvt *pvt;
+	unsigned long pfn;
+	int ret = U_SUCCESS;
+
+	memset(&mig, 0, sizeof(mig));
+	mig.vma = vma;
+	mig.start = start;
+	mig.end = end;
+	mig.src = &src_pfn;
+	mig.dst = &dst_pfn;
+	mig.src_owner = &kvmppc_uvmem_pgmap;
+
+	/* The requested page is already paged-out, nothing to do */
+	if (!kvmppc_gfn_is_uvmem_pfn(gpa >> page_shift, kvm, NULL))
+		return ret;
+
+	ret = migrate_vma_setup(&mig);
+	if (ret)
+		return -1;
+
+	spage = migrate_pfn_to_page(*mig.src);
+	if (!spage || !(*mig.src & MIGRATE_PFN_MIGRATE))
+		goto out_finalize;
+
+	if (!is_zone_device_page(spage))
+		goto out_finalize;
+
+	dpage = alloc_page_vma(GFP_HIGHUSER, vma, start);
+	if (!dpage) {
+		ret = -1;
+		goto out_finalize;
+	}
+
+	lock_page(dpage);
+	pvt = spage->zone_device_data;
+	pfn = page_to_pfn(dpage);
+
+	/*
+	 * This function is used in two cases:
+	 * - When HV touches a secure page, for which we do UV_PAGE_OUT
+	 * - When a secure page is converted to shared page, we *get*
+	 *   the page to essentially unmap the device page. In this
+	 *   case we skip page-out.
+	 */
+	if (!pvt->skip_page_out)
+		ret = uv_page_out(kvm->arch.lpid, pfn << page_shift,
+				  gpa, 0, page_shift);
+
+	if (ret == U_SUCCESS)
+		*mig.dst = migrate_pfn(pfn) | MIGRATE_PFN_LOCKED;
+	else {
+		unlock_page(dpage);
+		__free_page(dpage);
+		goto out_finalize;
+	}
+
+	migrate_vma_pages(&mig);
+
+out_finalize:
+	migrate_vma_finalize(&mig);
+	return ret;
+}
+
+static inline int kvmppc_svm_page_out(struct vm_area_struct *vma,
+		unsigned long start, unsigned long end,
+		unsigned long page_shift,
+		struct kvm *kvm, unsigned long gpa)
+{
+	int ret;
+
+	mutex_lock(&kvm->arch.uvmem_lock);
+	ret = __kvmppc_svm_page_out(vma, start, end, page_shift, kvm, gpa);
+	mutex_unlock(&kvm->arch.uvmem_lock);
+
+	return ret;
+}
+
 /*
  * Drop device pages that we maintain for the secure guest
  *
@@ -801,82 +891,6 @@ unsigned long kvmppc_h_svm_page_in(struct kvm *kvm, unsigned long gpa,
 	return ret;
 }
 
-/*
- * Provision a new page on HV side and copy over the contents
- * from secure memory using UV_PAGE_OUT uvcall.
- */
-static int kvmppc_svm_page_out(struct vm_area_struct *vma,
-		unsigned long start,
-		unsigned long end, unsigned long page_shift,
-		struct kvm *kvm, unsigned long gpa)
-{
-	unsigned long src_pfn, dst_pfn = 0;
-	struct migrate_vma mig;
-	struct page *dpage, *spage;
-	struct kvmppc_uvmem_page_pvt *pvt;
-	unsigned long pfn;
-	int ret = U_SUCCESS;
-
-	memset(&mig, 0, sizeof(mig));
-	mig.vma = vma;
-	mig.start = start;
-	mig.end = end;
-	mig.src = &src_pfn;
-	mig.dst = &dst_pfn;
-	mig.src_owner = &kvmppc_uvmem_pgmap;
-
-	mutex_lock(&kvm->arch.uvmem_lock);
-	/* The requested page is already paged-out, nothing to do */
-	if (!kvmppc_gfn_is_uvmem_pfn(gpa >> page_shift, kvm, NULL))
-		goto out;
-
-	ret = migrate_vma_setup(&mig);
-	if (ret)
-		goto out;
-
-	spage = migrate_pfn_to_page(*mig.src);
-	if (!spage || !(*mig.src &
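The `__`-prefixed function plus locked wrapper split introduced above is a common kernel idiom: the underscore variant documents "caller must hold the lock", while the wrapper takes and drops it. A rough user-space illustration of that contract (hypothetical names, a plain flag standing in for the kernel mutex, not kernel code):

```c
#include <assert.h>

static int lock_held;
static int counter;

static void take_lock(void)    { assert(!lock_held); lock_held = 1; }
static void release_lock(void) { assert(lock_held);  lock_held = 0; }

/* __do_work: caller must already hold the lock (mirrors __kvmppc_svm_page_out) */
static int __do_work(int delta)
{
	assert(lock_held);	/* the contract the __ prefix documents */
	counter += delta;	/* protected state */
	return counter;
}

/* do_work: convenience wrapper that takes the lock itself */
static int do_work(int delta)
{
	int ret;

	take_lock();
	ret = __do_work(delta);
	release_lock();
	return ret;
}
```

The split lets a caller that already holds the lock (here, kvmppc_uvmem_drop_pages()) reuse the core logic without deadlocking on a recursive acquire.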
[PATCH 2/2] KVM: PPC: Book3S HV: rework secure mem slot dropping
When a secure memslot is dropped, all the pages backed in the secure device (i.e. really backed by secure memory by the Ultravisor) should be paged out to normal pages. Previously, this was achieved by triggering the page fault mechanism, which calls kvmppc_svm_page_out() on each page.

This can't work when hot unplugging a memory slot because the memory slot is flagged as invalid and gfn_to_pfn() then does not try to access the page, so the page fault mechanism is not triggered.

Since the final goal is to make a call to kvmppc_svm_page_out(), it seems simpler to call it directly instead of triggering such a mechanism. This way kvmppc_uvmem_drop_pages() can be called even when hot unplugging a memslot.

Since kvmppc_uvmem_drop_pages() is already holding kvm->arch.uvmem_lock, the call is made to __kvmppc_svm_page_out(). As __kvmppc_svm_page_out() needs the vma pointer to migrate the pages, the VMA is fetched lazily, to avoid calling find_vma() for every page. In addition, the mmap_sem is held in read mode during that time, not in write mode, since the virtual memory layout is not impacted, and kvm->arch.uvmem_lock prevents concurrent operations on the secure device.

Cc: Ram Pai Cc: Bharata B Rao Cc: Paul Mackerras Signed-off-by: Laurent Dufour ---

arch/powerpc/kvm/book3s_hv_uvmem.c | 54 -- 1 file changed, 37 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 852cc9ae6a0b..479ddf16d18c 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -533,35 +533,55 @@ static inline int kvmppc_svm_page_out(struct vm_area_struct *vma,
  * fault on them, do fault time migration to replace the device PTEs in
  * QEMU page table with normal PTEs from newly allocated pages.
  */
-void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *free,
+void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *slot,
 		struct kvm *kvm, bool skip_page_out)
 {
 	int i;
 	struct kvmppc_uvmem_page_pvt *pvt;
-	unsigned long pfn, uvmem_pfn;
-	unsigned long gfn = free->base_gfn;
+	struct page *uvmem_page;
+	struct vm_area_struct *vma = NULL;
+	unsigned long uvmem_pfn, gfn;
+	unsigned long addr, end;
+
+	down_read(&kvm->mm->mmap_sem);
+
+	addr = slot->userspace_addr;
+	end = addr + (slot->npages * PAGE_SIZE);
 
-	for (i = free->npages; i; --i, ++gfn) {
-		struct page *uvmem_page;
+	gfn = slot->base_gfn;
+	for (i = slot->npages; i; --i, ++gfn, addr += PAGE_SIZE) {
+
+		/* Fetch the VMA if addr is not in the latest fetched one */
+		if (!vma || (addr < vma->vm_start || addr >= vma->vm_end)) {
+			vma = find_vma_intersection(kvm->mm, addr, end);
+			if (!vma ||
+			    vma->vm_start > addr || vma->vm_end < end) {
+				pr_err("Can't find VMA for gfn:0x%lx\n", gfn);
+				break;
+			}
+		}
 
 		mutex_lock(&kvm->arch.uvmem_lock);
-		if (!kvmppc_gfn_is_uvmem_pfn(gfn, kvm, &uvmem_pfn)) {
+
+		if (kvmppc_gfn_is_uvmem_pfn(gfn, kvm, &uvmem_pfn)) {
+			uvmem_page = pfn_to_page(uvmem_pfn);
+			pvt = uvmem_page->zone_device_data;
+			pvt->skip_page_out = skip_page_out;
+			pvt->remove_gfn = true;
+
+			if (__kvmppc_svm_page_out(vma, addr, addr + PAGE_SIZE,
+						  PAGE_SHIFT, kvm, pvt->gpa))
+				pr_err("Can't page out gpa:0x%lx addr:0x%lx\n",
+				       pvt->gpa, addr);
+		} else {
+			/* Remove the shared flag if any */
 			kvmppc_gfn_remove(gfn, kvm);
-			mutex_unlock(&kvm->arch.uvmem_lock);
-			continue;
 		}
 
-		uvmem_page = pfn_to_page(uvmem_pfn);
-		pvt = uvmem_page->zone_device_data;
-		pvt->skip_page_out = skip_page_out;
-		pvt->remove_gfn = true;
 		mutex_unlock(&kvm->arch.uvmem_lock);
-
-		pfn = gfn_to_pfn(kvm, gfn);
-		if (is_error_noslot_pfn(pfn))
-			continue;
-		kvm_release_pfn_clean(pfn);
 	}
+
+	up_read(&kvm->mm->mmap_sem);
 }
 
 unsigned long kvmppc_h_svm_init_abort(struct kvm *kvm)
-- 
2.27.0
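The lazy VMA fetch in the loop above amounts to caching the last lookup and refreshing it only when the address falls outside the cached range. A minimal user-space sketch of that pattern (the `find_range()` stand-in and its 16-page granularity are invented for illustration; in the patch the equivalent is find_vma_intersection()):

```c
#include <assert.h>

struct range { unsigned long start, end; };

static int lookups;	/* counts how often we had to "find" a range */

/* stand-in for find_vma(): each 16-page (64 KiB) block is one range */
static struct range find_range(unsigned long addr)
{
	struct range r;

	lookups++;
	r.start = addr & ~0xffffUL;
	r.end = r.start + 0x10000;
	return r;
}

/* walk [addr, end) page by page, refreshing the cached range lazily */
static void walk(unsigned long addr, unsigned long end)
{
	struct range r = { 0, 0 };

	for (; addr < end; addr += 0x1000) {
		if (addr < r.start || addr >= r.end)
			r = find_range(addr);	/* only on a cache miss */
		/* ... operate on the page at addr here ... */
	}
}
```

Walking 32 pages spanning two 64 KiB blocks costs only two lookups instead of 32, which is the point of caching the VMA across loop iterations.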
Re: [PATCH v2] ASoC: fsl_asrc: Add an option to select internal ratio mode
On Tue, Jun 30, 2020 at 04:47:56PM +0800, Shengjiu Wang wrote: > The ASRC not only supports ideal ratio mode, but also supports > internal ratio mode. This doesn't apply against current code, please check and resend.
[RFC PATCH 5/5] selftests/powerpc: Remove powerpc special cases from stack expansion test
Now that the powerpc code behaves the same as other architectures we can drop the special cases we had. Signed-off-by: Michael Ellerman --- .../powerpc/mm/stack_expansion_ldst.c | 41 +++ 1 file changed, 5 insertions(+), 36 deletions(-) diff --git a/tools/testing/selftests/powerpc/mm/stack_expansion_ldst.c b/tools/testing/selftests/powerpc/mm/stack_expansion_ldst.c index 95c3f3de16a1..ed9143990888 100644 --- a/tools/testing/selftests/powerpc/mm/stack_expansion_ldst.c +++ b/tools/testing/selftests/powerpc/mm/stack_expansion_ldst.c @@ -56,13 +56,7 @@ int consume_stack(unsigned long target_sp, unsigned long stack_high, int delta, #else asm volatile ("mov %%rsp, %[sp]" : [sp] "=r" (stack_top_sp)); #endif - - // Kludge, delta < 0 indicates relative to SP - if (delta < 0) - target = stack_top_sp + delta; - else - target = stack_high - delta + 1; - + target = stack_high - delta + 1; volatile char *p = (char *)target; if (type == STORE) @@ -162,41 +156,16 @@ static int test_one(unsigned int stack_used, int delta, enum access_type type) static void test_one_type(enum access_type type, unsigned long page_size, unsigned long rlim_cur) { - assert(test_one(DEFAULT_SIZE, 512 * _KB, type) == 0); + unsigned long delta; - // powerpc has a special case to allow up to 1MB - assert(test_one(DEFAULT_SIZE, 1 * _MB, type) == 0); - -#ifdef __powerpc__ - // This fails on powerpc because it's > 1MB and is not a stdu & - // not close to r1 - assert(test_one(DEFAULT_SIZE, 1 * _MB + 8, type) != 0); -#else - assert(test_one(DEFAULT_SIZE, 1 * _MB + 8, type) == 0); -#endif - -#ifdef __powerpc__ - // Accessing way past the stack pointer is not allowed on powerpc - assert(test_one(DEFAULT_SIZE, rlim_cur, type) != 0); -#else // We should be able to access anywhere within the rlimit + for (delta = page_size; delta <= rlim_cur; delta += page_size) + assert(test_one(DEFAULT_SIZE, delta, type) == 0); + assert(test_one(DEFAULT_SIZE, rlim_cur, type) == 0); -#endif // But if we go past the rlimit it should 
fail assert(test_one(DEFAULT_SIZE, rlim_cur + 1, type) != 0); - - // Above 1MB powerpc only allows accesses within 4096 bytes of - // r1 for accesses that aren't stdu - assert(test_one(1 * _MB + page_size - 128, -4096, type) == 0); -#ifdef __powerpc__ - assert(test_one(1 * _MB + page_size - 128, -4097, type) != 0); -#else - assert(test_one(1 * _MB + page_size - 128, -4097, type) == 0); -#endif - - // By consuming 2MB of stack we test the stdu case - assert(test_one(2 * _MB + page_size - 128, -4096, type) == 0); } static int test(void) -- 2.25.1
[RFC PATCH 4/5] powerpc/mm: Remove custom stack expansion checking
We have powerpc specific logic in our page fault handling to decide if an access to an unmapped address below the stack pointer should expand the stack VMA. The logic aims to prevent userspace from doing bad accesses below the stack pointer. However as long as the stack is < 1MB in size, we allow all accesses without further checks. Adding some debug I see that I can do a full kernel build and LTP run, and not a single process has used more than 1MB of stack. So for the majority of processes the logic never even fires. We also recently found a nasty bug in this code which could cause userspace programs to be killed during signal delivery. It went unnoticed presumably because most processes use < 1MB of stack. The generic mm code has also grown support for stack guard pages since this code was originally written, so the most heinous case of the stack expanding into other mappings is now handled for us. Finally although some other arches have special logic in this path, from what I can tell none of x86, arm64, arm and s390 impose any extra checks other than those in expand_stack(). So drop our complicated logic and like other architectures just let the stack expand as long as its within the rlimit. Signed-off-by: Michael Ellerman --- arch/powerpc/mm/fault.c | 106 ++-- 1 file changed, 5 insertions(+), 101 deletions(-) diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c index ed01329dd12b..925a7231abb3 100644 --- a/arch/powerpc/mm/fault.c +++ b/arch/powerpc/mm/fault.c @@ -42,39 +42,7 @@ #include #include -/* - * Check whether the instruction inst is a store using - * an update addressing form which will update r1. 
- */ -static bool store_updates_sp(struct ppc_inst inst) -{ - /* check for 1 in the rA field */ - if (((ppc_inst_val(inst) >> 16) & 0x1f) != 1) - return false; - /* check major opcode */ - switch (ppc_inst_primary_opcode(inst)) { - case OP_STWU: - case OP_STBU: - case OP_STHU: - case OP_STFSU: - case OP_STFDU: - return true; - case OP_STD:/* std or stdu */ - return (ppc_inst_val(inst) & 3) == 1; - case OP_31: - /* check minor opcode */ - switch ((ppc_inst_val(inst) >> 1) & 0x3ff) { - case OP_31_XOP_STDUX: - case OP_31_XOP_STWUX: - case OP_31_XOP_STBUX: - case OP_31_XOP_STHUX: - case OP_31_XOP_STFSUX: - case OP_31_XOP_STFDUX: - return true; - } - } - return false; -} + /* * do_page_fault error handling helpers */ @@ -267,54 +235,6 @@ static bool bad_kernel_fault(struct pt_regs *regs, unsigned long error_code, return false; } -static bool bad_stack_expansion(struct pt_regs *regs, unsigned long address, - struct vm_area_struct *vma, unsigned int flags, - bool *must_retry) -{ - /* -* N.B. The POWER/Open ABI allows programs to access up to -* 288 bytes below the stack pointer. -* The kernel signal delivery code writes up to 4KB -* below the stack pointer (r1) before decrementing it. -* The exec code can write slightly over 640kB to the stack -* before setting the user r1. Thus we allow the stack to -* expand to 1MB without further checks. -*/ - if (address + 0x10 < vma->vm_end) { - struct ppc_inst __user *nip = (struct ppc_inst __user *)regs->nip; - /* get user regs even if this fault is in kernel mode */ - struct pt_regs *uregs = current->thread.regs; - if (uregs == NULL) - return true; - - /* -* A user-mode access to an address a long way below -* the stack pointer is only valid if the instruction -* is one which would update the stack pointer to the -* address accessed if the instruction completed, -* i.e. either stwu rs,n(r1) or stwux rs,r1,rb -* (or the byte, halfword, float or double forms). 
-	 *
-	 * If we don't check this then any write to the area
-	 * between the last mapped region and the stack will
-	 * expand the stack rather than segfaulting.
-	 */
-	if (address + 4096 >= uregs->gpr[1])
-		return false;
-
-	if ((flags & FAULT_FLAG_WRITE) && (flags & FAULT_FLAG_USER) &&
-	    access_ok(nip, sizeof(*nip))) {
-		struct ppc_inst inst;
-
-		if (!probe_user_read_inst(&inst, nip))
-			return !store_updates_sp(inst);
-		*must_retry = true;
-	}
-	return true;
-}
-	return
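For reference, the deleted store_updates_sp() decided whether a faulting instruction was an update-form store through r1 by checking the rA field and the primary opcode. A simplified sketch of that decode, covering only the word/byte/halfword D-forms (the real function also handled stdu, the floating-point forms and the opcode-31 X-form variants):

```c
#include <assert.h>
#include <stdint.h>

/* Primary opcodes for the D-form update stores the old check recognised */
#define OP_STWU 37
#define OP_STBU 39
#define OP_STHU 45

/* Does this instruction word look like stwu/stbu/sthu rS,n(r1),
 * i.e. a store that updates r1 on completion? */
static int updates_r1(uint32_t inst)
{
	if (((inst >> 16) & 0x1f) != 1)	/* rA field must be r1 */
		return 0;
	switch (inst >> 26) {		/* primary opcode, bits 0..5 */
	case OP_STWU:
	case OP_STBU:
	case OP_STHU:
		return 1;
	}
	return 0;
}
```

The classic 32-bit prologue `stwu r1,-16(r1)` encodes as 0x9421fff0 (opcode 37, rS=1, rA=1, d=-16) and is exactly the case this check was meant to whitelist.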
[PATCH 2/5] powerpc: Allow 4096 bytes of stack expansion for the signal frame
We have powerpc specific logic in our page fault handling to decide if an access to an unmapped address below the stack pointer should expand the stack VMA.

The code was originally added in 2004 "ported from 2.4". The rough logic is that the stack is allowed to grow to 1MB with no extra checking. Over 1MB the access must be within 2048 bytes of the stack pointer, or be from a user instruction that updates the stack pointer.

The 2048 byte allowance below the stack pointer is there to cover the 288 byte "red zone" as well as the "about 1.5kB" needed by the signal delivery code.

Unfortunately since then the signal frame has expanded, and is now 4096 bytes on 64-bit kernels with transactional memory enabled. This means if a process has consumed more than 1MB of stack, and its stack pointer lies less than 4096 bytes from the next page boundary, signal delivery will fault when trying to expand the stack and the process will see a SEGV.

The 2048 byte allowance was sufficient until 2008 as the signal frame was:

struct rt_sigframe {
	struct ucontext            uc;            /*     0  1440 */
	/* --- cacheline 11 boundary (1408 bytes) was 32 bytes ago --- */
	long unsigned int          _unused[2];    /*  1440    16 */
	unsigned int               tramp[6];      /*  1456    24 */
	struct siginfo *           pinfo;         /*  1480     8 */
	void *                     puc;           /*  1488     8 */
	struct siginfo             info;          /*  1496   128 */
	/* --- cacheline 12 boundary (1536 bytes) was 88 bytes ago --- */
	char                       abigap[288];   /*  1624   288 */

	/* size: 1920, cachelines: 15, members: 7 */
	/* padding: 8 */
};

Then in commit ce48b2100785 ("powerpc: Add VSX context save/restore, ptrace and signal support") (Jul 2008) the signal frame expanded to 2176 bytes:

struct rt_sigframe {
	struct ucontext            uc;            /*     0  1696 */	<--
	/* --- cacheline 13 boundary (1664 bytes) was 32 bytes ago --- */
	long unsigned int          _unused[2];    /*  1696    16 */
	unsigned int               tramp[6];      /*  1712    24 */
	struct siginfo *           pinfo;         /*  1736     8 */
	void *                     puc;           /*  1744     8 */
	struct siginfo             info;          /*  1752   128 */
	/* --- cacheline 14 boundary (1792 bytes) was 88 bytes ago --- */
	char                       abigap[288];   /*  1880   288 */

	/* size: 2176, cachelines: 17, members: 7 */
	/* padding: 8 */
};

At this point we should have been exposed to the bug, though as far as I know it was never reported. I no longer have a system old enough to easily test on.

Then in 2010 commit 320b2b8de126 ("mm: keep a guard page below a grow-down stack segment") caused our stack expansion code to never trigger, as there was always a VMA found for a write up to PAGE_SIZE below r1.

That meant the bug was hidden as we continued to expand the signal frame in commit 2b0a576d15e0 ("powerpc: Add new transactional memory state to the signal context") (Feb 2013):

struct rt_sigframe {
	struct ucontext            uc;            /*     0  1696 */
	/* --- cacheline 13 boundary (1664 bytes) was 32 bytes ago --- */
	struct ucontext            uc_transact;   /*  1696  1696 */	<--
	/* --- cacheline 26 boundary (3328 bytes) was 64 bytes ago --- */
	long unsigned int          _unused[2];    /*  3392    16 */
	unsigned int               tramp[6];      /*  3408    24 */
	struct siginfo *           pinfo;         /*  3432     8 */
	void *                     puc;           /*  3440     8 */
	struct siginfo             info;          /*  3448   128 */
	/* --- cacheline 27 boundary (3456 bytes) was 120 bytes ago --- */
	char                       abigap[288];   /*  3576   288 */

	/* size: 3872, cachelines: 31, members: 8 */
	/* padding: 8 */
	/* last cacheline: 32 bytes */
};

And commit 573ebfa6601f ("powerpc: Increase stack redzone for 64-bit userspace to 512 bytes") (Feb 2014):

struct rt_sigframe {
	struct ucontext            uc;            /*     0  1696 */
	/* --- cacheline 13 boundary (1664 bytes) was 32 bytes ago --- */
	struct ucontext            uc_transact;   /*  1696  1696 */
	/* --- cacheline 26 boundary (3328 bytes) was 64 bytes ago --- */
	long unsigned int          _unused[2];    /*  3392    16 */
	unsigned int               tramp[6];      /*  3408    24 */
	struct siginfo *           pinfo;         /*  3432     8 */
	void *                     puc;           /*  3440     8 */
	struct
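The margins quoted in the changelog can be checked mechanically: a 2048-byte allowance covered the original 1920-byte frame, stopped covering it once VSX (2176 bytes) and later TM (3872 bytes) grew it, while 4096 covers the current worst case. A trivial sketch (the byte counts are those quoted above, not computed from kernel headers):

```c
#include <assert.h>

/* rt_sigframe sizes quoted in the changelog, by era */
enum {
	FRAME_PRE_VSX = 1920,	/* original, pre Jul 2008 */
	FRAME_VSX     = 2176,	/* after VSX context, ce48b2100785 */
	FRAME_TM      = 3872,	/* after TM state, 2b0a576d15e0 */
	FRAME_CURRENT = 4096,	/* worst case cited for 64-bit with TM */
};

/* an allowance below r1 is sufficient iff it covers the whole frame */
static int allowance_covers(int allowance, int frame_size)
{
	return allowance >= frame_size;
}
```

This is exactly why the patch bumps the allowance in bad_stack_expansion() from 2048 to 4096 bytes.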
[PATCH 3/5] selftests/powerpc: Update the stack expansion test
Update the stack expansion load/store test to take into account the new allowance of 4096 bytes below the stack pointer. Signed-off-by: Michael Ellerman --- .../selftests/powerpc/mm/stack_expansion_ldst.c| 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/tools/testing/selftests/powerpc/mm/stack_expansion_ldst.c b/tools/testing/selftests/powerpc/mm/stack_expansion_ldst.c index 0587e11437f5..95c3f3de16a1 100644 --- a/tools/testing/selftests/powerpc/mm/stack_expansion_ldst.c +++ b/tools/testing/selftests/powerpc/mm/stack_expansion_ldst.c @@ -186,17 +186,17 @@ static void test_one_type(enum access_type type, unsigned long page_size, unsign // But if we go past the rlimit it should fail assert(test_one(DEFAULT_SIZE, rlim_cur + 1, type) != 0); - // Above 1MB powerpc only allows accesses within 2048 bytes of + // Above 1MB powerpc only allows accesses within 4096 bytes of // r1 for accesses that aren't stdu - assert(test_one(1 * _MB + page_size - 128, -2048, type) == 0); + assert(test_one(1 * _MB + page_size - 128, -4096, type) == 0); #ifdef __powerpc__ - assert(test_one(1 * _MB + page_size - 128, -2049, type) != 0); + assert(test_one(1 * _MB + page_size - 128, -4097, type) != 0); #else - assert(test_one(1 * _MB + page_size - 128, -2049, type) == 0); + assert(test_one(1 * _MB + page_size - 128, -4097, type) == 0); #endif // By consuming 2MB of stack we test the stdu case - assert(test_one(2 * _MB + page_size - 128, -2048, type) == 0); + assert(test_one(2 * _MB + page_size - 128, -4096, type) == 0); } static int test(void) -- 2.25.1
[PATCH 1/5] selftests/powerpc: Add test of stack expansion logic
We have custom stack expansion checks that it turns out are extremely badly tested and contain bugs, surprise. So add some tests that exercise the code and capture the current boundary conditions. The signal test currently fails on 64-bit kernels because the 2048 byte allowance for the signal frame is too small, we will fix that in a subsequent patch. Signed-off-by: Michael Ellerman --- tools/testing/selftests/powerpc/mm/.gitignore | 2 + tools/testing/selftests/powerpc/mm/Makefile | 8 +- .../powerpc/mm/stack_expansion_ldst.c | 233 ++ .../powerpc/mm/stack_expansion_signal.c | 114 + tools/testing/selftests/powerpc/pmu/lib.h | 1 + 5 files changed, 357 insertions(+), 1 deletion(-) create mode 100644 tools/testing/selftests/powerpc/mm/stack_expansion_ldst.c create mode 100644 tools/testing/selftests/powerpc/mm/stack_expansion_signal.c diff --git a/tools/testing/selftests/powerpc/mm/.gitignore b/tools/testing/selftests/powerpc/mm/.gitignore index 2ca523255b1b..8bfa3b39f628 100644 --- a/tools/testing/selftests/powerpc/mm/.gitignore +++ b/tools/testing/selftests/powerpc/mm/.gitignore @@ -8,3 +8,5 @@ wild_bctr large_vm_fork_separation bad_accesses tlbie_test +stack_expansion_ldst +stack_expansion_signal diff --git a/tools/testing/selftests/powerpc/mm/Makefile b/tools/testing/selftests/powerpc/mm/Makefile index b9103c4bb414..3937b277c288 100644 --- a/tools/testing/selftests/powerpc/mm/Makefile +++ b/tools/testing/selftests/powerpc/mm/Makefile @@ -3,7 +3,8 @@ $(MAKE) -C ../ TEST_GEN_PROGS := hugetlb_vs_thp_test subpage_prot prot_sao segv_errors wild_bctr \ - large_vm_fork_separation bad_accesses + large_vm_fork_separation bad_accesses stack_expansion_signal \ + stack_expansion_ldst TEST_GEN_PROGS_EXTENDED := tlbie_test TEST_GEN_FILES := tempfile @@ -18,6 +19,11 @@ $(OUTPUT)/wild_bctr: CFLAGS += -m64 $(OUTPUT)/large_vm_fork_separation: CFLAGS += -m64 $(OUTPUT)/bad_accesses: CFLAGS += -m64 +$(OUTPUT)/stack_expansion_signal: ../utils.c ../pmu/lib.c + 
+$(OUTPUT)/stack_expansion_ldst: CFLAGS += -fno-stack-protector
+$(OUTPUT)/stack_expansion_ldst: ../utils.c
+
 $(OUTPUT)/tempfile:
 	dd if=/dev/zero of=$@ bs=64k count=1
diff --git a/tools/testing/selftests/powerpc/mm/stack_expansion_ldst.c b/tools/testing/selftests/powerpc/mm/stack_expansion_ldst.c
new file mode 100644
index ..0587e11437f5
--- /dev/null
+++ b/tools/testing/selftests/powerpc/mm/stack_expansion_ldst.c
@@ -0,0 +1,233 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Test that loads/stores expand the stack segment, or trigger a SEGV, in
+ * various conditions.
+ *
+ * Based on test code by Tom Lane.
+ */
+
+#undef NDEBUG
+#include 
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define _KB (1024)
+#define _MB (1024 * 1024)
+
+volatile char *stack_top_ptr;
+volatile unsigned long stack_top_sp;
+volatile char c;
+
+enum access_type {
+	LOAD,
+	STORE,
+};
+
+/*
+ * Consume stack until the stack pointer is below @target_sp, then do an access
+ * (load or store) at offset @delta from either the base of the stack or the
+ * current stack pointer.
+ */
+__attribute__ ((noinline))
+int consume_stack(unsigned long target_sp, unsigned long stack_high, int delta, enum access_type type)
+{
+	unsigned long target;
+	char stack_cur;
+
+	if ((unsigned long)&stack_cur > target_sp)
+		return consume_stack(target_sp, stack_high, delta, type);
+	else {
+		// We don't really need this, but without it GCC might not
+		// generate a recursive call above.
+		stack_top_ptr = &stack_cur;
+
+#ifdef __powerpc__
+		asm volatile ("mr %[sp], %%r1" : [sp] "=r" (stack_top_sp));
+#else
+		asm volatile ("mov %%rsp, %[sp]" : [sp] "=r" (stack_top_sp));
+#endif
+
+		// Kludge, delta < 0 indicates relative to SP
+		if (delta < 0)
+			target = stack_top_sp + delta;
+		else
+			target = stack_high - delta + 1;
+
+		volatile char *p = (char *)target;
+
+		if (type == STORE)
+			*p = c;
+		else
+			c = *p;
+
+		// Do something to prevent the stack frame being popped prior to
+		// our access above.
+		getpid();
+	}
+
+	return 0;
+}
+
+static int search_proc_maps(char *needle, unsigned long *low, unsigned long *high)
+{
+	unsigned long start, end;
+	static char buf[4096];
+	char name[128];
+	FILE *f;
+	int rc;
+
+	f = fopen("/proc/self/maps", "r");
+	if (!f) {
+		perror("fopen");
+		return -1;
+	}
+
+	while (fgets(buf, sizeof(buf), f)) {
+		rc = sscanf(buf, "%lx-%lx
Re: [v2 PATCH] crypto: af_alg - Fix regression on empty requests
On Thu, Jul 02, 2020 at 01:32:21PM +1000, Herbert Xu wrote: > On Tue, Jun 30, 2020 at 02:18:11PM +0530, Naresh Kamboju wrote: > > > > Since we are on this subject, > > LTP af_alg02 test case fails on stable 4.9 and stable 4.4 > > This is not a regression because the test case has been failing from > > the beginning. > > > > Is this test case expected to fail on stable 4.9 and 4.4 ? > > or any chance to fix this on these older branches ? > > > > Test output: > > af_alg02.c:52: BROK: Timed out while reading from request socket. > > > > ref: > > https://qa-reports.linaro.org/lkft/linux-stable-rc-4.9-oe/build/v4.9.228-191-g082e807235d7/testrun/2884917/suite/ltp-crypto-tests/test/af_alg02/history/ > > https://qa-reports.linaro.org/lkft/linux-stable-rc-4.9-oe/build/v4.9.228-191-g082e807235d7/testrun/2884606/suite/ltp-crypto-tests/test/af_alg02/log > > Actually this test really is broken. FWIW the patch "umh: fix processed error when UMH_WAIT_PROC is used" was dropped from linux-next for now as it was missing checks for signals. I'll be open coding all the checks for each UMH_WAIT_PROC caller next. It's not clear if this was the issue with this test case, but I figured I'd let you know. Luis
Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
* Michal Hocko [2020-07-03 12:59:44]: > > Honestly, I do not have any idea. I've traced it down to > > Author: Andi Kleen > > Date: Tue Jan 11 15:35:48 2005 -0800 > > > > [PATCH] x86_64: Fix ACPI SRAT NUMA parsing > > > > Fix fallout from the recent nodemask_t changes. The node ids assigned > > in the SRAT parser were off by one. > > > > I added a new first_unset_node() function to nodemask.h to allocate > > IDs sanely. > > > > Signed-off-by: Andi Kleen > > Signed-off-by: Linus Torvalds > > > > which doesn't really tell all that much. The historical baggage and a > > long term behavior which is not really trivial to fix I suspect. > > Thinking about this some more, this logic makes some sense afterall. > Especially in the world without memory hotplug which was very likely the > case back then. It is much better to have compact node mask rather than > sparse one. After all node numbers shouldn't really matter as long as > you have a clear mapping to the HW. I am not sure we export that > information (except for the kernel ring buffer) though. > > The memory hotplug changes that somehow because you can hotremove numa > nodes and therefore make the nodemask sparse but that is not a common > case. I am not sure what would happen if a completely new node was added > and its corresponding node was already used by the renumbered one > though. It would likely conflate the two I am afraid. But I am not sure > this is really possible with x86 and a lack of a bug report would > suggest that nobody is doing that at least. > JFYI, Satheesh (copied on this mail chain) opened a bug a year ago about a crash with vcpu hotplug on a memoryless node: https://bugzilla.kernel.org/show_bug.cgi?id=202187 -- Thanks and Regards Srikar Dronamraju
[PATCH 2/2] powerpc/powernv/idle: save-restore DAWR0,DAWRX0 for P10
Additional registers DAWR0, DAWRX0 may be lost on Power 10 for stop levels < 4. Therefore save the values of these SPRs before entering a "stop" state and restore their values on wakeup. Signed-off-by: Pratik Rajesh Sampat --- arch/powerpc/platforms/powernv/idle.c | 10 ++ 1 file changed, 10 insertions(+) diff --git a/arch/powerpc/platforms/powernv/idle.c b/arch/powerpc/platforms/powernv/idle.c index 19d94d021357..471d4a65b1fa 100644 --- a/arch/powerpc/platforms/powernv/idle.c +++ b/arch/powerpc/platforms/powernv/idle.c @@ -600,6 +600,8 @@ struct p9_sprs { u64 iamr; u64 amor; u64 uamor; + u64 dawr0; + u64 dawrx0; }; static unsigned long power9_idle_stop(unsigned long psscr, bool mmu_on) @@ -677,6 +679,10 @@ static unsigned long power9_idle_stop(unsigned long psscr, bool mmu_on) sprs.tscr = mfspr(SPRN_TSCR); if (!firmware_has_feature(FW_FEATURE_ULTRAVISOR)) sprs.ldbar = mfspr(SPRN_LDBAR); + if (cpu_has_feature(CPU_FTR_ARCH_31)) { + sprs.dawr0 = mfspr(SPRN_DAWR0); + sprs.dawrx0 = mfspr(SPRN_DAWRX0); + } sprs_saved = true; @@ -792,6 +798,10 @@ static unsigned long power9_idle_stop(unsigned long psscr, bool mmu_on) mtspr(SPRN_MMCR2, sprs.mmcr2); if (!firmware_has_feature(FW_FEATURE_ULTRAVISOR)) mtspr(SPRN_LDBAR, sprs.ldbar); + if (cpu_has_feature(CPU_FTR_ARCH_31)) { + mtspr(SPRN_DAWR0, sprs.dawr0); + mtspr(SPRN_DAWRX0, sprs.dawrx0); + } mtspr(SPRN_SPRG3, local_paca->sprg_vdso); -- 2.25.4
[PATCH 1/2] powerpc/powernv/idle: Exclude mfspr on HID1, 4, 5 on P9 and above
From POWER9 onwards, support for the HID1, HID4 and HID5 registers has been removed. Although mfspr on these registers still worked on POWER9, it is unrecognized on the POWER10 simulator. Move their reads under the existing check for machines older than POWER9. Signed-off-by: Pratik Rajesh Sampat --- arch/powerpc/platforms/powernv/idle.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/platforms/powernv/idle.c b/arch/powerpc/platforms/powernv/idle.c index 2dd467383a88..19d94d021357 100644 --- a/arch/powerpc/platforms/powernv/idle.c +++ b/arch/powerpc/platforms/powernv/idle.c @@ -73,9 +73,6 @@ static int pnv_save_sprs_for_deep_states(void) */ uint64_t lpcr_val = mfspr(SPRN_LPCR); uint64_t hid0_val = mfspr(SPRN_HID0); - uint64_t hid1_val = mfspr(SPRN_HID1); - uint64_t hid4_val = mfspr(SPRN_HID4); - uint64_t hid5_val = mfspr(SPRN_HID5); uint64_t hmeer_val = mfspr(SPRN_HMEER); uint64_t msr_val = MSR_IDLE; uint64_t psscr_val = pnv_deepest_stop_psscr_val; @@ -117,6 +114,9 @@ static int pnv_save_sprs_for_deep_states(void) /* Only p8 needs to set extra HID regiters */ if (!cpu_has_feature(CPU_FTR_ARCH_300)) { + uint64_t hid1_val = mfspr(SPRN_HID1); + uint64_t hid4_val = mfspr(SPRN_HID4); + uint64_t hid5_val = mfspr(SPRN_HID5); rc = opal_slw_set_reg(pir, SPRN_HID1, hid1_val); if (rc != 0) -- 2.25.4
Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
On Fri 03-07-20 13:32:21, David Hildenbrand wrote: > On 03.07.20 12:59, Michal Hocko wrote: > > On Fri 03-07-20 11:24:17, Michal Hocko wrote: > >> [Cc Andi] > >> > >> On Fri 03-07-20 11:10:01, Michal Suchanek wrote: > >>> On Wed, Jul 01, 2020 at 02:21:10PM +0200, Michal Hocko wrote: > On Wed 01-07-20 13:30:57, David Hildenbrand wrote: > >> [...] > > Yep, looks like it. > > > > [0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0 > > [0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0 > > [0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0 > > [0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0 > > [0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x-0x0009] > > [0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x0010-0xbfff] > > [0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x1-0x13fff] > > This begs a question whether ppc can do the same thing? > >>> Or x86 stop doing it so that you can see on what node you are running? > >>> > >>> What's the point of this indirection other than another way of avoiding > >>> empty node 0? > >> > >> Honestly, I do not have any idea. I've traced it down to > >> Author: Andi Kleen > >> Date: Tue Jan 11 15:35:48 2005 -0800 > >> > >> [PATCH] x86_64: Fix ACPI SRAT NUMA parsing > >> > >> Fix fallout from the recent nodemask_t changes. The node ids assigned > >> in the SRAT parser were off by one. > >> > >> I added a new first_unset_node() function to nodemask.h to allocate > >> IDs sanely. > >> > >> Signed-off-by: Andi Kleen > >> Signed-off-by: Linus Torvalds > >> > >> which doesn't really tell all that much. The historical baggage and a > >> long term behavior which is not really trivial to fix I suspect. > > > > Thinking about this some more, this logic makes some sense afterall. > > Especially in the world without memory hotplug which was very likely the > > case back then. It is much better to have compact node mask rather than > > sparse one. After all node numbers shouldn't really matter as long as > > you have a clear mapping to the HW. 
I am not sure we export that > > information (except for the kernel ring buffer) though. > > > > The memory hotplug changes that somehow because you can hotremove numa > > nodes and therefore make the nodemask sparse but that is not a common > > case. I am not sure what would happen if a completely new node was added > > and its corresponding node was already used by the renumbered one > > though. It would likely conflate the two I am afraid. But I am not sure > > this is really possible with x86 and a lack of a bug report would > > suggest that nobody is doing that at least. > > > > I think the ACPI code takes care of properly mapping PXM to nodes. > > So if I start with PXM 0 empty and PXM 1 populated, I will get > PXM 1 == node 0 as described. Once I hotplug something to PXM 0 in QEMU > > $ echo "object_add memory-backend-ram,id=mem0,size=1G" | sudo nc -U > /var/tmp/monitor > $ echo "device_add pc-dimm,id=dimm0,memdev=mem0,node=0" | sudo nc -U > /var/tmp/monitor > > $ echo "info numa" | sudo nc -U /var/tmp/monitor > QEMU 5.0.50 monitor - type 'help' for more information > (qemu) info numa > 2 nodes > node 0 cpus: > node 0 size: 1024 MB > node 0 plugged: 1024 MB > node 1 cpus: 0 1 2 3 > node 1 size: 4096 MB > node 1 plugged: 0 MB Thanks for double checking. > I get in the guest: > > [ 50.174435] [ cut here ] > [ 50.175436] node 1 was absent from the node_possible_map > [ 50.176844] WARNING: CPU: 0 PID: 7 at mm/memory_hotplug.c:1021 > add_memory_resource+0x8c/0x290 This would mean that the ACPI code or whoever does the remaping is not adding the new node into possible nodes. [...] > I remember that we added that check just recently (due to powerpc if I am not > wrong). > Not sure why that triggers here. This was a misbehaving Qemu IIRC providing a garbage map. -- Michal Hocko SUSE Labs
Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
On 03.07.20 12:59, Michal Hocko wrote: > On Fri 03-07-20 11:24:17, Michal Hocko wrote: >> [Cc Andi] >> >> On Fri 03-07-20 11:10:01, Michal Suchanek wrote: >>> On Wed, Jul 01, 2020 at 02:21:10PM +0200, Michal Hocko wrote: On Wed 01-07-20 13:30:57, David Hildenbrand wrote: >> [...] > Yep, looks like it. > > [0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0 > [0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0 > [0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0 > [0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0 > [0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x-0x0009] > [0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x0010-0xbfff] > [0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x1-0x13fff] This begs a question whether ppc can do the same thing? >>> Or x86 stop doing it so that you can see on what node you are running? >>> >>> What's the point of this indirection other than another way of avoiding >>> empty node 0? >> >> Honestly, I do not have any idea. I've traced it down to >> Author: Andi Kleen >> Date: Tue Jan 11 15:35:48 2005 -0800 >> >> [PATCH] x86_64: Fix ACPI SRAT NUMA parsing >> >> Fix fallout from the recent nodemask_t changes. The node ids assigned >> in the SRAT parser were off by one. >> >> I added a new first_unset_node() function to nodemask.h to allocate >> IDs sanely. >> >> Signed-off-by: Andi Kleen >> Signed-off-by: Linus Torvalds >> >> which doesn't really tell all that much. The historical baggage and a >> long term behavior which is not really trivial to fix I suspect. > > Thinking about this some more, this logic makes some sense afterall. > Especially in the world without memory hotplug which was very likely the > case back then. It is much better to have compact node mask rather than > sparse one. After all node numbers shouldn't really matter as long as > you have a clear mapping to the HW. I am not sure we export that > information (except for the kernel ring buffer) though. 
> > The memory hotplug changes that somehow because you can hotremove numa > nodes and therefore make the nodemask sparse but that is not a common > case. I am not sure what would happen if a completely new node was added > and its corresponding node was already used by the renumbered one > though. It would likely conflate the two I am afraid. But I am not sure > this is really possible with x86 and a lack of a bug report would > suggest that nobody is doing that at least. > I think the ACPI code takes care of properly mapping PXM to nodes. So if I start with PXM 0 empty and PXM 1 populated, I will get PXM 1 == node 0 as described. Once I hotplug something to PXM 0 in QEMU $ echo "object_add memory-backend-ram,id=mem0,size=1G" | sudo nc -U /var/tmp/monitor $ echo "device_add pc-dimm,id=dimm0,memdev=mem0,node=0" | sudo nc -U /var/tmp/monitor $ echo "info numa" | sudo nc -U /var/tmp/monitor QEMU 5.0.50 monitor - type 'help' for more information (qemu) info numa 2 nodes node 0 cpus: node 0 size: 1024 MB node 0 plugged: 1024 MB node 1 cpus: 0 1 2 3 node 1 size: 4096 MB node 1 plugged: 0 MB I get in the guest: [ 50.174435] [ cut here ] [ 50.175436] node 1 was absent from the node_possible_map [ 50.176844] WARNING: CPU: 0 PID: 7 at mm/memory_hotplug.c:1021 add_memory_resource+0x8c/0x290 [ 50.176844] Modules linked in: [ 50.176845] CPU: 0 PID: 7 Comm: kworker/u8:0 Not tainted 5.8.0-rc2+ #4 [ 50.176846] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.4 [ 50.176846] Workqueue: kacpi_hotplug acpi_hotplug_work_fn [ 50.176847] RIP: 0010:add_memory_resource+0x8c/0x290 [ 50.176849] Code: 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 63 c5 48 89 04 24 48 0f a3 05 94 6c 1c 01 72 17 89 ee 48 c78 [ 50.176849] RSP: 0018:a7a1c0043d48 EFLAGS: 00010296 [ 50.176850] RAX: 002c RBX: 8bc633e63b80 RCX: [ 50.176851] RDX: 8bc63bc27060 RSI: 8bc63bc18d00 RDI: 8bc63bc18d00 [ 50.176851] RBP: 0001 R08: 01e1 R09: a7a1c0043bd8 [ 50.176852] R10: 0005 R11: 
R12: 00014000 [ 50.176852] R13: 00017fff R14: 4000 R15: 00018000 [ 50.176853] FS: () GS:8bc63bc0() knlGS: [ 50.176853] CS: 0010 DS: ES: CR0: 80050033 [ 50.176855] CR2: 55dfcbfc5ee8 CR3: aca0a000 CR4: 06f0 [ 50.176855] DR0: DR1: DR2: [ 50.176856] DR3: DR6: fffe0ff0 DR7: 0400 [ 50.176856] Call Trace: [ 50.176856] __add_memory+0x33/0x70 [ 50.176857] acpi_memory_device_add+0x132/0x2f2 [ 50.176857] acpi_bus_attach+0xd2/0x200 [ 50.176858] acpi_bus_scan+0x33/0x70 [ 50.176858] acpi_device_hotplug+0x298/0x390 [ 50.176858]
Re: [PATCH 16/26] mm/powerpc: Use general page fault accounting
Peter Xu writes: > Use the general page fault accounting by passing regs into handle_mm_fault(). > > CC: Michael Ellerman > CC: Benjamin Herrenschmidt > CC: Paul Mackerras > CC: linuxppc-dev@lists.ozlabs.org > Signed-off-by: Peter Xu > --- > arch/powerpc/mm/fault.c | 11 +++ > 1 file changed, 3 insertions(+), 8 deletions(-) > > diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c > index 992b10c3761c..e325d13efaf5 100644 > --- a/arch/powerpc/mm/fault.c > +++ b/arch/powerpc/mm/fault.c > @@ -563,7 +563,7 @@ static int __do_page_fault(struct pt_regs *regs, unsigned > long address, >* make sure we exit gracefully rather than endlessly redo >* the fault. >*/ > - fault = handle_mm_fault(vma, address, flags, NULL); > + fault = handle_mm_fault(vma, address, flags, regs); > > #ifdef CONFIG_PPC_MEM_KEYS > /* > @@ -604,14 +604,9 @@ static int __do_page_fault(struct pt_regs *regs, > unsigned long address, > /* >* Major/minor page fault accounting. >*/ > - if (major) { > - current->maj_flt++; > - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs, address); > + if (major) > cmo_account_page_fault(); > - } else { > - current->min_flt++; > - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address); > - } > + > return 0; > } > NOKPROBE_SYMBOL(__do_page_fault); You do change the logic a bit if regs is NULL (in mm_account_fault()), but regs can never be NULL in this path, so it looks OK to me. Acked-by: Michael Ellerman cheers
Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
On Fri 03-07-20 11:24:17, Michal Hocko wrote: > [Cc Andi] > > On Fri 03-07-20 11:10:01, Michal Suchanek wrote: > > On Wed, Jul 01, 2020 at 02:21:10PM +0200, Michal Hocko wrote: > > > On Wed 01-07-20 13:30:57, David Hildenbrand wrote: > [...] > > > > Yep, looks like it. > > > > > > > > [0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0 > > > > [0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0 > > > > [0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0 > > > > [0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0 > > > > [0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x-0x0009] > > > > [0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x0010-0xbfff] > > > > [0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x1-0x13fff] > > > > > > This begs a question whether ppc can do the same thing? > > Or x86 stop doing it so that you can see on what node you are running? > > > > What's the point of this indirection other than another way of avoiding > > empty node 0? > > Honestly, I do not have any idea. I've traced it down to > Author: Andi Kleen > Date: Tue Jan 11 15:35:48 2005 -0800 > > [PATCH] x86_64: Fix ACPI SRAT NUMA parsing > > Fix fallout from the recent nodemask_t changes. The node ids assigned > in the SRAT parser were off by one. > > I added a new first_unset_node() function to nodemask.h to allocate > IDs sanely. > > Signed-off-by: Andi Kleen > Signed-off-by: Linus Torvalds > > which doesn't really tell all that much. The historical baggage and a > long term behavior which is not really trivial to fix I suspect. Thinking about this some more, this logic makes some sense afterall. Especially in the world without memory hotplug which was very likely the case back then. It is much better to have compact node mask rather than sparse one. After all node numbers shouldn't really matter as long as you have a clear mapping to the HW. I am not sure we export that information (except for the kernel ring buffer) though. 
The memory hotplug changes that somehow because you can hotremove numa nodes and therefore make the nodemask sparse but that is not a common case. I am not sure what would happen if a completely new node was added and its corresponding node was already used by the renumbered one though. It would likely conflate the two I am afraid. But I am not sure this is really possible with x86 and a lack of a bug report would suggest that nobody is doing that at least. -- Michal Hocko SUSE Labs
Re: [PATCH 5/8] powerpc/64s: implement queued spinlocks and rwlocks
Nicholas Piggin writes: > Excerpts from Will Deacon's message of July 2, 2020 8:35 pm: >> On Thu, Jul 02, 2020 at 08:25:43PM +1000, Nicholas Piggin wrote: >>> Excerpts from Will Deacon's message of July 2, 2020 6:02 pm: >>> > On Thu, Jul 02, 2020 at 05:48:36PM +1000, Nicholas Piggin wrote: >>> >> diff --git a/arch/powerpc/include/asm/qspinlock.h >>> >> b/arch/powerpc/include/asm/qspinlock.h >>> >> new file mode 100644 >>> >> index ..f84da77b6bb7 >>> >> --- /dev/null >>> >> +++ b/arch/powerpc/include/asm/qspinlock.h >>> >> @@ -0,0 +1,20 @@ >>> >> +/* SPDX-License-Identifier: GPL-2.0 */ >>> >> +#ifndef _ASM_POWERPC_QSPINLOCK_H >>> >> +#define _ASM_POWERPC_QSPINLOCK_H >>> >> + >>> >> +#include >>> >> + >>> >> +#define _Q_PENDING_LOOPS(1 << 9) /* not tuned */ >>> >> + >>> >> +#define smp_mb__after_spinlock() smp_mb() >>> >> + >>> >> +static __always_inline int queued_spin_is_locked(struct qspinlock *lock) >>> >> +{ >>> >> +smp_mb(); >>> >> +return atomic_read(>val); >>> >> +} >>> > >>> > Why do you need the smp_mb() here? >>> >>> A long and sad tale that ends here 51d7d5205d338 >>> >>> Should probably at least refer to that commit from here, since this one >>> is not going to git blame back there. I'll add something. >> >> Is this still an issue, though? >> >> See 38b850a73034 (where we added a similar barrier on arm64) and then >> c6f5d02b6a0f (where we removed it). >> > > Oh nice, I didn't know that went away. Thanks for the heads up. Argh! I spent so much time chasing that damn bug in the ipc code. > I'm going to say I'm too scared to remove it while changing the > spinlock algorithm, but I'll open an issue and we should look at > removing it. Sounds good. cheers
[RFC PATCH v0 2/2] KVM: PPC: Book3S HV: Use H_RPT_INVALIDATE in nested KVM
In the nested KVM case, replace H_TLB_INVALIDATE by the new hcall H_RPT_INVALIDATE if available. The availability of this hcall is determined from "hcall-rpt-invalidate" string in ibm,hypertas-functions DT property. Signed-off-by: Bharata B Rao --- arch/powerpc/include/asm/firmware.h | 4 +++- arch/powerpc/kvm/book3s_64_mmu_radix.c| 26 ++- arch/powerpc/kvm/book3s_hv_nested.c | 13 ++-- arch/powerpc/platforms/pseries/firmware.c | 1 + 4 files changed, 36 insertions(+), 8 deletions(-) diff --git a/arch/powerpc/include/asm/firmware.h b/arch/powerpc/include/asm/firmware.h index 6003c2e533a0..aa6a5ef5d483 100644 --- a/arch/powerpc/include/asm/firmware.h +++ b/arch/powerpc/include/asm/firmware.h @@ -52,6 +52,7 @@ #define FW_FEATURE_PAPR_SCMASM_CONST(0x0020) #define FW_FEATURE_ULTRAVISOR ASM_CONST(0x0040) #define FW_FEATURE_STUFF_TCE ASM_CONST(0x0080) +#define FW_FEATURE_RPT_INVALIDATE ASM_CONST(0x0100) #ifndef __ASSEMBLY__ @@ -71,7 +72,8 @@ enum { FW_FEATURE_TYPE1_AFFINITY | FW_FEATURE_PRRN | FW_FEATURE_HPT_RESIZE | FW_FEATURE_DRMEM_V2 | FW_FEATURE_DRC_INFO | FW_FEATURE_BLOCK_REMOVE | - FW_FEATURE_PAPR_SCM | FW_FEATURE_ULTRAVISOR, + FW_FEATURE_PAPR_SCM | FW_FEATURE_ULTRAVISOR | + FW_FEATURE_RPT_INVALIDATE, FW_FEATURE_PSERIES_ALWAYS = 0, FW_FEATURE_POWERNV_POSSIBLE = FW_FEATURE_OPAL | FW_FEATURE_ULTRAVISOR, FW_FEATURE_POWERNV_ALWAYS = 0, diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c index e738ea652192..8411e42eedbd 100644 --- a/arch/powerpc/kvm/book3s_64_mmu_radix.c +++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c @@ -21,6 +21,7 @@ #include #include #include +#include /* * Supported radix tree geometry. 
@@ -313,9 +314,17 @@ void kvmppc_radix_tlbie_page(struct kvm *kvm, unsigned long addr, } psi = shift_to_mmu_psize(pshift); - rb = addr | (mmu_get_ap(psi) << PPC_BITLSHIFT(58)); - rc = plpar_hcall_norets(H_TLB_INVALIDATE, H_TLBIE_P1_ENC(0, 0, 1), - lpid, rb); + if (!firmware_has_feature(FW_FEATURE_RPT_INVALIDATE)) { + rb = addr | (mmu_get_ap(psi) << PPC_BITLSHIFT(58)); + rc = plpar_hcall_norets(H_TLB_INVALIDATE, + H_TLBIE_P1_ENC(0, 0, 1), lpid, rb); + } else { + rc = pseries_rpt_invalidate(lpid, H_RPTI_TARGET_CMMU, + H_RPTI_TYPE_NESTED | + H_RPTI_TYPE_TLB, + psize_to_rpti_pgsize(psi), + addr, addr + psize); + } if (rc) pr_err("KVM: TLB page invalidation hcall failed, rc=%ld\n", rc); } @@ -329,8 +338,15 @@ static void kvmppc_radix_flush_pwc(struct kvm *kvm, unsigned int lpid) return; } - rc = plpar_hcall_norets(H_TLB_INVALIDATE, H_TLBIE_P1_ENC(1, 0, 1), - lpid, TLBIEL_INVAL_SET_LPID); + if (!firmware_has_feature(FW_FEATURE_RPT_INVALIDATE)) + rc = plpar_hcall_norets(H_TLB_INVALIDATE, + H_TLBIE_P1_ENC(1, 0, 1), + lpid, TLBIEL_INVAL_SET_LPID); + else + rc = pseries_rpt_invalidate(lpid, H_RPTI_TARGET_CMMU, + H_RPTI_TYPE_NESTED | + H_RPTI_TYPE_PWC, H_RPTI_PAGE_ALL, + 0, -1UL); if (rc) pr_err("KVM: TLB PWC invalidation hcall failed, rc=%ld\n", rc); } diff --git a/arch/powerpc/kvm/book3s_hv_nested.c b/arch/powerpc/kvm/book3s_hv_nested.c index efb78d37f29a..4d023c451be4 100644 --- a/arch/powerpc/kvm/book3s_hv_nested.c +++ b/arch/powerpc/kvm/book3s_hv_nested.c @@ -19,6 +19,7 @@ #include #include #include +#include static struct patb_entry *pseries_partition_tb; @@ -401,8 +402,16 @@ static void kvmhv_flush_lpid(unsigned int lpid) return; } - rc = plpar_hcall_norets(H_TLB_INVALIDATE, H_TLBIE_P1_ENC(2, 0, 1), - lpid, TLBIEL_INVAL_SET_LPID); + if (!firmware_has_feature(FW_FEATURE_RPT_INVALIDATE)) + rc = plpar_hcall_norets(H_TLB_INVALIDATE, + H_TLBIE_P1_ENC(2, 0, 1), + lpid, TLBIEL_INVAL_SET_LPID); + else + rc = pseries_rpt_invalidate(lpid, H_RPTI_TARGET_CMMU, + H_RPTI_TYPE_NESTED 
| + H_RPTI_TYPE_TLB | H_RPTI_TYPE_PWC | + H_RPTI_TYPE_PAT, +
[RFC PATCH v0 1/2] KVM: PPC: Book3S HV: Add support for H_RPT_INVALIDATE (nested case only)
Implements H_RPT_INVALIDATE hcall and supports only nested case currently. A KVM capability KVM_CAP_RPT_INVALIDATE is added to indicate the support for this hcall. Signed-off-by: Bharata B Rao --- Documentation/virt/kvm/api.rst| 17 .../include/asm/book3s/64/tlbflush-radix.h| 18 arch/powerpc/include/asm/kvm_book3s.h | 3 + arch/powerpc/kvm/book3s_hv.c | 32 +++ arch/powerpc/kvm/book3s_hv_nested.c | 94 +++ arch/powerpc/kvm/powerpc.c| 3 + arch/powerpc/mm/book3s64/radix_tlb.c | 4 - include/uapi/linux/kvm.h | 1 + 8 files changed, 168 insertions(+), 4 deletions(-) diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index 426f94582b7a..d235d16a4bf0 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -5843,6 +5843,23 @@ controlled by the kvm module parameter halt_poll_ns. This capability allows the maximum halt time to specified on a per-VM basis, effectively overriding the module parameter for the target VM. +7.21 KVM_CAP_RPT_INVALIDATE +-- + +:Capability: KVM_CAP_RPT_INVALIDATE +:Architectures: ppc +:Type: vm + +This capability indicates that the kernel is capable of handling +H_RPT_INVALIDATE hcall. + +In order to enable the use of H_RPT_INVALIDATE in the guest, +user space might have to advertise it for the guest. For example, +IBM pSeries (sPAPR) guest starts using it if "hcall-rpt-invalidate" is +present in the "ibm,hypertas-functions" device-tree property. + +This capability is always enabled. + 8. Other capabilities. 
== diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h b/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h index 94439e0cefc9..aace7e9b2397 100644 --- a/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h +++ b/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h @@ -4,6 +4,10 @@ #include +#define RIC_FLUSH_TLB 0 +#define RIC_FLUSH_PWC 1 +#define RIC_FLUSH_ALL 2 + struct vm_area_struct; struct mm_struct; struct mmu_gather; @@ -21,6 +25,20 @@ static inline u64 psize_to_rpti_pgsize(unsigned long psize) return H_RPTI_PAGE_ALL; } +static inline int rpti_pgsize_to_psize(unsigned long page_size) +{ + if (page_size == H_RPTI_PAGE_4K) + return MMU_PAGE_4K; + if (page_size == H_RPTI_PAGE_64K) + return MMU_PAGE_64K; + if (page_size == H_RPTI_PAGE_2M) + return MMU_PAGE_2M; + if (page_size == H_RPTI_PAGE_1G) + return MMU_PAGE_1G; + else + return MMU_PAGE_64K; /* Default */ +} + static inline int mmu_get_ap(int psize) { return mmu_psize_defs[psize].ap; diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h index d32ec9ae73bd..0f1c5fa6e8ce 100644 --- a/arch/powerpc/include/asm/kvm_book3s.h +++ b/arch/powerpc/include/asm/kvm_book3s.h @@ -298,6 +298,9 @@ void kvmhv_set_ptbl_entry(unsigned int lpid, u64 dw0, u64 dw1); void kvmhv_release_all_nested(struct kvm *kvm); long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu); long kvmhv_do_nested_tlbie(struct kvm_vcpu *vcpu); +long kvmhv_h_rpti_nested(struct kvm_vcpu *vcpu, unsigned long lpid, +unsigned long type, unsigned long pg_sizes, +unsigned long start, unsigned long end); int kvmhv_run_single_vcpu(struct kvm_vcpu *vcpu, u64 time_limit, unsigned long lpcr); void kvmhv_save_hv_regs(struct kvm_vcpu *vcpu, struct hv_guest_state *hr); diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c index 6bf66649ab92..2f772183f249 100644 --- a/arch/powerpc/kvm/book3s_hv.c +++ b/arch/powerpc/kvm/book3s_hv.c @@ -895,6 +895,28 @@ static int kvmppc_get_yield_count(struct kvm_vcpu 
*vcpu) return yield_count; } +static long kvmppc_h_rpt_invalidate(struct kvm_vcpu *vcpu, + unsigned long pid, unsigned long target, + unsigned long type, unsigned long pg_sizes, + unsigned long start, unsigned long end) +{ + if (end < start) + return H_P5; + + if (!(type & H_RPTI_TYPE_NESTED)) + return H_P3; + + if (!nesting_enabled(vcpu->kvm)) + return H_FUNCTION; + + /* Support only cores as target */ + if (target != H_RPTI_TARGET_CMMU) + return H_P2; + + return kvmhv_h_rpti_nested(vcpu, pid, (type & ~H_RPTI_TYPE_NESTED), + pg_sizes, start, end); +} + int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu) { unsigned long req = kvmppc_get_gpr(vcpu, 3); @@ -1103,6 +1125,14 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu) */ ret = kvmppc_h_svm_init_abort(vcpu->kvm); break; +
[RFC PATCH v0 0/2] Use H_RPT_INVALIDATE for nested guest
This patchset adds support for the new hcall H_RPT_INVALIDATE (currently handles the nested case only) and replaces the nested TLB flush calls with this new hcall when support for it exists. This applies on top of the "[PATCH v3 0/3] Off-load TLB invalidations to host for !GTSE" patchset that was posted at: https://lore.kernel.org/linuxppc-dev/20200703053608.12884-1-bhar...@linux.ibm.com/T/#t

H_RPT_INVALIDATE
----------------

Syntax:

int64   /* H_Success: Return code on successful completion */
        /* H_Busy - repeat the call with the same parameters */
        /* H_Parameter, H_P2, H_P3, H_P4, H_P5 : Invalid parameters */
hcall(const uint64 H_RPT_INVALIDATE, /* Invalidate RPT translation lookaside information */
      uint64 pid,       /* PID/LPID to invalidate */
      uint64 target,    /* Invalidation target */
      uint64 type,      /* Type of lookaside information */
      uint64 pageSizes, /* Page sizes */
      uint64 start,     /* Start of Effective Address (EA) range (inclusive) */
      uint64 end)       /* End of EA range (exclusive) */

Invalidation targets (target)
-----------------------------
Core MMU        0x01 /* All virtual processors in the partition */
Core local MMU  0x02 /* Current virtual processor */
Nest MMU        0x04 /* All nest/accelerator agents in use by the partition */

A combination of the above can be specified, except core and core local.

Type of translation to invalidate (type)
----------------------------------------
NESTED  0x0001 /* Invalidate nested guest partition-scope */
TLB     0x0002 /* Invalidate TLB */
PWC     0x0004 /* Invalidate Page Walk Cache */
PRT     0x0008 /* Invalidate Process Table Entries if NESTED is clear */
PAT     0x0008 /* Invalidate Partition Table Entries if NESTED is set */

A combination of the above can be specified.

Page size mask (pageSizes)
--------------------------
4K   0x01
64K  0x02
2M   0x04
1G   0x08
All sizes (-1UL)

A combination of the above can be specified. All page sizes can be selected with -1.

Semantics: Invalidate radix tree lookaside information matching the parameters given.
* Return H_P2, H_P3 or H_P4 if the target, type, or pageSizes parameters are different from the defined values.
* Return H_PARAMETER if NESTED is set and pid is not a valid nested LPID allocated to this partition.
* Return H_P5 if (start, end) doesn't form a valid range. Start and end should be a valid Quadrant address and end > start.
* Return H_NotSupported if the partition is not running in radix translation mode.
* May invalidate more translation information than requested.
* If start = 0 and end = -1, set the range to cover all valid addresses. Else start and end should be aligned to 4kB (lower 12 bits clear).
* If NESTED is clear, then invalidate process-scoped lookaside information. Else pid specifies a nested LPID, and the invalidation is performed on the nested guest partition table and nested guest partition-scope real addresses.
* If pid = 0 and NESTED is clear, then valid addresses are quadrant 3 and quadrant 0 spaces. Else valid addresses are quadrant 0.
* Pages which are fully covered by the range are to be invalidated. Those which are partially covered are considered outside the invalidation range, which allows a caller to optimally invalidate ranges that may contain mixed page sizes.
* Return H_SUCCESS on success.

Bharata B Rao (2):
  KVM: PPC: Book3S HV: Add support for H_RPT_INVALIDATE (nested case only)
  KVM: PPC: Book3S HV: Use H_RPT_INVALIDATE in nested KVM

 Documentation/virt/kvm/api.rst                     |  17 +++
 .../include/asm/book3s/64/tlbflush-radix.h         |  18 +++
 arch/powerpc/include/asm/firmware.h                |   4 +-
 arch/powerpc/include/asm/kvm_book3s.h              |   3 +
 arch/powerpc/kvm/book3s_64_mmu_radix.c             |  26 -
 arch/powerpc/kvm/book3s_hv.c                       |  32 ++
 arch/powerpc/kvm/book3s_hv_nested.c                | 107 +-
 arch/powerpc/kvm/powerpc.c                         |   3 +
 arch/powerpc/mm/book3s64/radix_tlb.c               |   4 -
 arch/powerpc/platforms/pseries/firmware.c          |   1 +
 include/uapi/linux/kvm.h                           |   1 +
 11 files changed, 204 insertions(+), 12 deletions(-)

-- 
2.21.3
Re: [PATCH V3 (RESEND) 2/3] mm/sparsemem: Enable vmem_altmap support in vmemmap_alloc_block_buf()
Catalin Marinas writes: > On Thu, Jun 18, 2020 at 06:45:29AM +0530, Anshuman Khandual wrote: >> There are many instances where vmemap allocation is often switched between >> regular memory and device memory just based on whether altmap is available >> or not. vmemmap_alloc_block_buf() is used in various platforms to allocate >> vmemmap mappings. Lets also enable it to handle altmap based device memory >> allocation along with existing regular memory allocations. This will help >> in avoiding the altmap based allocation switch in many places. >> >> While here also implement a regular memory allocation fallback mechanism >> when the first preferred device memory allocation fails. This will ensure >> preserving the existing semantics on powerpc platform. To summarize there >> are three different methods to call vmemmap_alloc_block_buf(). >> >> (., NULL, false) /* Allocate from system RAM */ >> (., altmap, false) /* Allocate from altmap without any fallback */ >> (., altmap, true) /* Allocate from altmap with fallback (system RAM) */ > [...] >> diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c >> index bc73abf0bc25..01e25b56eccb 100644 >> --- a/arch/powerpc/mm/init_64.c >> +++ b/arch/powerpc/mm/init_64.c >> @@ -225,12 +225,12 @@ int __meminit vmemmap_populate(unsigned long start, >> unsigned long end, int node, >> * fall back to system memory if the altmap allocation fail. >> */ >> if (altmap && !altmap_cross_boundary(altmap, start, page_size)) >> { >> -p = altmap_alloc_block_buf(page_size, altmap); >> -if (!p) >> -pr_debug("altmap block allocation failed, >> falling back to system memory"); >> +p = vmemmap_alloc_block_buf(page_size, node, >> +altmap, true); >> +} else { >> +p = vmemmap_alloc_block_buf(page_size, node, >> +NULL, false); >> } >> -if (!p) >> -p = vmemmap_alloc_block_buf(page_size, node); >> if (!p) >> return -ENOMEM; > > Is the fallback argument actually necessary. 
> It may be cleaner to just leave the code as is with the choice between
> altmap and NULL. If an arch needs a fallback (only powerpc), they have
> the fallback in place already. I don't see the powerpc code any better
> after this change.

Yeah I agree.

cheers
Re: [PATCH 0/4] ASoC: fsl_asrc: allow selecting arbitrary clocks
Hi Nic, On 02/07/2020 20:42, Nicolin Chen wrote: > Hi Arnaud, > > On Thu, Jul 02, 2020 at 04:22:31PM +0200, Arnaud Ferraris wrote: >> The current ASRC driver hardcodes the input and output clocks used for >> sample rate conversions. In order to allow greater flexibility and to >> cover more use cases, it would be preferable to select the clocks using >> device-tree properties. > > We recently merged a new change that auto-selects internal > clocks based on sample rates as the first option -- ideal ratio > mode is the fallback mode now. Please refer to: > https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20200702=d0250cf4f2abfbea64ed247230f08f5ae23979f0 That looks interesting, thanks for pointing this out! I'll rebase and see how it works for my use case, and will keep you informed. Regards, Arnaud
Re: [PATCH v4 24/41] powerpc/book3s64/pkeys: Store/restore userspace AMR correctly on entry and exit from kernel
On 7/3/20 2:48 PM, Nicholas Piggin wrote: Excerpts from Aneesh Kumar K.V's message of June 15, 2020 4:14 pm: This prepare kernel to operate with a different value than userspace AMR. For this, AMR needs to be saved and restored on entry and return from the kernel. With KUAP we modify kernel AMR when accessing user address from the kernel via copy_to/from_user interfaces. If MMU_FTR_KEY is enabled we always use the key mechanism to implement KUAP feature. If MMU_FTR_KEY is not supported and if we support MMU_FTR_KUAP (radix translation on POWER9), we can skip restoring AMR on return to userspace. Userspace won't be using AMR in that specific config. Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/include/asm/book3s/64/kup.h | 141 ++- arch/powerpc/kernel/entry_64.S | 6 +- arch/powerpc/kernel/exceptions-64s.S | 4 +- arch/powerpc/kernel/syscall_64.c | 26 - 4 files changed, 144 insertions(+), 33 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/64/kup.h b/arch/powerpc/include/asm/book3s/64/kup.h index e6ee1c34842f..6979cd1a0003 100644 --- a/arch/powerpc/include/asm/book3s/64/kup.h +++ b/arch/powerpc/include/asm/book3s/64/kup.h @@ -13,18 +13,47 @@ #ifdef __ASSEMBLY__ -.macro kuap_restore_amr gpr1, gpr2 -#ifdef CONFIG_PPC_KUAP +.macro kuap_restore_user_amr gpr1 +#if defined(CONFIG_PPC_MEM_KEYS) BEGIN_MMU_FTR_SECTION_NESTED(67) - mfspr \gpr1, SPRN_AMR + /* +* AMR is going to be different when +* returning to userspace. 
+*/ + ld \gpr1, STACK_REGS_KUAP(r1) + isync + mtspr SPRN_AMR, \gpr1 + + /* No isync required, see kuap_restore_user_amr() */ + END_MMU_FTR_SECTION_NESTED_IFSET(MMU_FTR_PKEY , 67) +#endif +.endm + +.macro kuap_restore_kernel_amr gpr1, gpr2 +#if defined(CONFIG_PPC_MEM_KEYS) + BEGIN_MMU_FTR_SECTION_NESTED(67) + b 99f // handle_pkey_restore_amr + END_MMU_FTR_SECTION_NESTED_IFSET(MMU_FTR_PKEY , 67) + + BEGIN_MMU_FTR_SECTION_NESTED(68) + b 99f // handle_kuap_restore_amr + MMU_FTR_SECTION_ELSE_NESTED(68) + b 100f // skip_restore_amr + ALT_MMU_FTR_SECTION_END_NESTED_IFSET(MMU_FTR_KUAP, 68) + +99: + /* +* AMR is going to be mostly the same since we are +* returning to the kernel. Compare and do a mtspr. +*/ ld \gpr2, STACK_REGS_KUAP(r1) + mfspr \gpr1, SPRN_AMR cmpd\gpr1, \gpr2 - beq 998f + beq 100f isync mtspr SPRN_AMR, \gpr2 /* No isync required, see kuap_restore_amr() */ -998: - END_MMU_FTR_SECTION_NESTED_IFSET(MMU_FTR_KUAP, 67) +100: // skip_restore_amr Can't you code it like this? (_IFCLR requires none of the bits to be set) BEGIN_MMU_FTR_SECTION_NESTED(67) b 99f // nothing using AMR, no need to restore END_MMU_FTR_SECTION_NESTED_IFCLR(MMU_FTR_PKEY | MMU_FTR_KUAP, 67) That saves you a branch in the common case of using AMR. Similar for others. Yes i could switch to that. The code is taking extra 200 cycles even with KUAP/KUEP disabled and no keys being used on hash. I am yet to analyze this closely. So will rework things based on that analysis. @@ -69,22 +133,40 @@ extern u64 default_uamor; -static inline void kuap_restore_amr(struct pt_regs *regs, unsigned long amr) +static inline void kuap_restore_user_amr(struct pt_regs *regs) { - if (mmu_has_feature(MMU_FTR_KUAP) && unlikely(regs->kuap != amr)) { - isync(); - mtspr(SPRN_AMR, regs->kuap); - /* -* No isync required here because we are about to RFI back to -* previous context before any user accesses would be made, -* which is a CSI. 
-*/ + if (!mmu_has_feature(MMU_FTR_PKEY)) + return; If you have PKEY but not KUAP, do you still have to restore? Yes, because user space pkey is now set on the exit path. This is needed to handle things like exec/fork(). + + isync(); + mtspr(SPRN_AMR, regs->kuap); + /* +* No isync required here because we are about to rfi +* back to previous context before any user accesses +* would be made, which is a CSI. +*/ +} + +static inline void kuap_restore_kernel_amr(struct pt_regs *regs, + unsigned long amr) +{ + if (mmu_has_feature(MMU_FTR_KUAP) || mmu_has_feature(MMU_FTR_PKEY)) { + + if (unlikely(regs->kuap != amr)) { + isync(); + mtspr(SPRN_AMR, regs->kuap); + /* +* No isync required here because we are about to rfi +* back to previous context before any user accesses +* would be made, which is a CSI. +*/ + }
Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
[Cc Andi] On Fri 03-07-20 11:10:01, Michal Suchanek wrote: > On Wed, Jul 01, 2020 at 02:21:10PM +0200, Michal Hocko wrote: > > On Wed 01-07-20 13:30:57, David Hildenbrand wrote: [...] > > > Yep, looks like it. > > > > > > [0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0 > > > [0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0 > > > [0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0 > > > [0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0 > > > [0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x-0x0009] > > > [0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x0010-0xbfff] > > > [0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x1-0x13fff] > > > > This begs a question whether ppc can do the same thing? > Or x86 stop doing it so that you can see on what node you are running? > > What's the point of this indirection other than another way of avoiding > empty node 0? Honestly, I do not have any idea. I've traced it down to Author: Andi Kleen Date: Tue Jan 11 15:35:48 2005 -0800 [PATCH] x86_64: Fix ACPI SRAT NUMA parsing Fix fallout from the recent nodemask_t changes. The node ids assigned in the SRAT parser were off by one. I added a new first_unset_node() function to nodemask.h to allocate IDs sanely. Signed-off-by: Andi Kleen Signed-off-by: Linus Torvalds which doesn't really tell all that much. The historical baggage and a long term behavior which is not really trivial to fix I suspect. -- Michal Hocko SUSE Labs
Re: [PATCH 1/2] dt-bindings: sound: fsl-asoc-card: add new compatible for I2S slave
On 02/07/2020 at 17:42, Mark Brown wrote: > On Thu, Jul 02, 2020 at 05:28:03PM +0200, Arnaud Ferraris wrote: >> On 02/07/2020 at 16:31, Mark Brown wrote: > >>> Why require that the CODEC be clock master here - why not make this >>> configurable, reusing the properties from the generic and audio graph >>> cards? > >> This is partly because I'm not sure how to do it (yet), but mostly >> because I don't have the hardware to test this (the 2 CODECs present on >> my only i.MX6 board are both clock master) > > Take a look at what the generic cards are doing, it's a library function > asoc_simple_parse_daifmt(). It's not the end of the world if you can't > test it properly - if it turns out it's buggy somehow someone can always > fix the code later but an ABI is an ABI so we can't change it. > Thanks for the hints, I'll look into it. Regards, Arnaud
Re: [PATCH v4 24/41] powerpc/book3s64/pkeys: Store/restore userspace AMR correctly on entry and exit from kernel
Excerpts from Aneesh Kumar K.V's message of June 15, 2020 4:14 pm: > This prepare kernel to operate with a different value than userspace AMR. > For this, AMR needs to be saved and restored on entry and return from the > kernel. > > With KUAP we modify kernel AMR when accessing user address from the kernel > via copy_to/from_user interfaces. > > If MMU_FTR_KEY is enabled we always use the key mechanism to implement KUAP > feature. If MMU_FTR_KEY is not supported and if we support MMU_FTR_KUAP > (radix translation on POWER9), we can skip restoring AMR on return > to userspace. Userspace won't be using AMR in that specific config. > > Signed-off-by: Aneesh Kumar K.V > --- > arch/powerpc/include/asm/book3s/64/kup.h | 141 ++- > arch/powerpc/kernel/entry_64.S | 6 +- > arch/powerpc/kernel/exceptions-64s.S | 4 +- > arch/powerpc/kernel/syscall_64.c | 26 - > 4 files changed, 144 insertions(+), 33 deletions(-) > > diff --git a/arch/powerpc/include/asm/book3s/64/kup.h > b/arch/powerpc/include/asm/book3s/64/kup.h > index e6ee1c34842f..6979cd1a0003 100644 > --- a/arch/powerpc/include/asm/book3s/64/kup.h > +++ b/arch/powerpc/include/asm/book3s/64/kup.h > @@ -13,18 +13,47 @@ > > #ifdef __ASSEMBLY__ > > -.macro kuap_restore_amr gpr1, gpr2 > -#ifdef CONFIG_PPC_KUAP > +.macro kuap_restore_user_amr gpr1 > +#if defined(CONFIG_PPC_MEM_KEYS) > BEGIN_MMU_FTR_SECTION_NESTED(67) > - mfspr \gpr1, SPRN_AMR > + /* > + * AMR is going to be different when > + * returning to userspace. 
> + */ > + ld \gpr1, STACK_REGS_KUAP(r1) > + isync > + mtspr SPRN_AMR, \gpr1 > + > + /* No isync required, see kuap_restore_user_amr() */ > + END_MMU_FTR_SECTION_NESTED_IFSET(MMU_FTR_PKEY , 67) > +#endif > +.endm > + > +.macro kuap_restore_kernel_amr gpr1, gpr2 > +#if defined(CONFIG_PPC_MEM_KEYS) > + BEGIN_MMU_FTR_SECTION_NESTED(67) > + b 99f // handle_pkey_restore_amr > + END_MMU_FTR_SECTION_NESTED_IFSET(MMU_FTR_PKEY , 67) > + > + BEGIN_MMU_FTR_SECTION_NESTED(68) > + b 99f // handle_kuap_restore_amr > + MMU_FTR_SECTION_ELSE_NESTED(68) > + b 100f // skip_restore_amr > + ALT_MMU_FTR_SECTION_END_NESTED_IFSET(MMU_FTR_KUAP, 68) > + > +99: > + /* > + * AMR is going to be mostly the same since we are > + * returning to the kernel. Compare and do a mtspr. > + */ > ld \gpr2, STACK_REGS_KUAP(r1) > + mfspr \gpr1, SPRN_AMR > cmpd\gpr1, \gpr2 > - beq 998f > + beq 100f > isync > mtspr SPRN_AMR, \gpr2 > /* No isync required, see kuap_restore_amr() */ > -998: > - END_MMU_FTR_SECTION_NESTED_IFSET(MMU_FTR_KUAP, 67) > +100: // skip_restore_amr Can't you code it like this? (_IFCLR requires none of the bits to be set) BEGIN_MMU_FTR_SECTION_NESTED(67) b 99f // nothing using AMR, no need to restore END_MMU_FTR_SECTION_NESTED_IFCLR(MMU_FTR_PKEY | MMU_FTR_KUAP, 67) That saves you a branch in the common case of using AMR. Similar for others. > @@ -69,22 +133,40 @@ > > extern u64 default_uamor; > > -static inline void kuap_restore_amr(struct pt_regs *regs, unsigned long amr) > +static inline void kuap_restore_user_amr(struct pt_regs *regs) > { > - if (mmu_has_feature(MMU_FTR_KUAP) && unlikely(regs->kuap != amr)) { > - isync(); > - mtspr(SPRN_AMR, regs->kuap); > - /* > - * No isync required here because we are about to RFI back to > - * previous context before any user accesses would be made, > - * which is a CSI. > - */ > + if (!mmu_has_feature(MMU_FTR_PKEY)) > + return; If you have PKEY but not KUAP, do you still have to restore? 
> + > + isync(); > + mtspr(SPRN_AMR, regs->kuap); > + /* > + * No isync required here because we are about to rfi > + * back to previous context before any user accesses > + * would be made, which is a CSI. > + */ > +} > + > +static inline void kuap_restore_kernel_amr(struct pt_regs *regs, > +unsigned long amr) > +{ > + if (mmu_has_feature(MMU_FTR_KUAP) || mmu_has_feature(MMU_FTR_PKEY)) { > + > + if (unlikely(regs->kuap != amr)) { > + isync(); > + mtspr(SPRN_AMR, regs->kuap); > + /* > + * No isync required here because we are about to rfi > + * back to previous context before any user accesses > + * would be made, which is a CSI. > + */ > + } > } > } > > static inline unsigned long kuap_get_and_check_amr(void) > { > - if (mmu_has_feature(MMU_FTR_KUAP)) { > + if (mmu_has_feature(MMU_FTR_KUAP) || mmu_has_feature(MMU_FTR_PKEY)) { > unsigned long amr = mfspr(SPRN_AMR); >
Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
On Wed, Jul 01, 2020 at 02:21:10PM +0200, Michal Hocko wrote: > On Wed 01-07-20 13:30:57, David Hildenbrand wrote: > > On 01.07.20 13:06, David Hildenbrand wrote: > > > On 01.07.20 13:01, Srikar Dronamraju wrote: > > >> * David Hildenbrand [2020-07-01 12:15:54]: > > >> > > >>> On 01.07.20 12:04, Srikar Dronamraju wrote: > > * Michal Hocko [2020-07-01 10:42:00]: > > > > > > > >> > > >> 2. Also existence of dummy node also leads to inconsistent > > >> information. The > > >> number of online nodes is inconsistent with the information in the > > >> device-tree and resource-dump > > >> > > >> 3. When the dummy node is present, single node non-Numa systems end > > >> up showing > > >> up as NUMA systems and numa_balancing gets enabled. This will mean > > >> we take > > >> the hit from the unnecessary numa hinting faults. > > > > > > I have to say that I dislike the node online/offline state and > > > directly > > > exporting that to the userspace. Users should only care whether the > > > node > > > has memory/cpus. Numa nodes can be online without any memory. Just > > > offline all the present memory blocks but do not physically hot remove > > > them and you are in the same situation. If users are confused by an > > > output of tools like numactl -H then those could be updated and hide > > > nodes without any memory > > > > > > The autonuma problem sounds interesting but again this patch doesn't > > > really solve the underlying problem because I strongly suspect that > > > the > > > problem is still there when a numa node gets all its memory offline as > > > mentioned above. > > I would really appreciate a feedback to these two as well. > > > > While I completely agree that making node 0 special is wrong, I have > > > still hard time to review this very simply looking patch because all > > > the > > > numa initialization is so spread around that this might just blow up > > > at unexpected places. 
IIRC we have discussed testing in the previous > > > version and David has provided a way to emulate these configurations > > > on x86. Did you manage to use those instruction for additional testing > > > on other than ppc architectures? > > > > > > > I have tried all the steps that David mentioned and reported back at > > https://lore.kernel.org/lkml/20200511174731.gd1...@linux.vnet.ibm.com/t/#u > > > > As a summary, David's steps are still not creating a > > memoryless/cpuless on > > x86 VM. > > >>> > > >>> Now, that is wrong. You get a memoryless/cpuless node, which is *not > > >>> online*. Once you hotplug some memory, it will switch online. Once you > > >>> remove memory, it will switch back offline. > > >>> > > >> > > >> Let me clarify, we are looking for a node 0 which is cpuless/memoryless > > >> at > > >> boot. The code in question tries to handle a cpuless/memoryless node 0 > > >> at > > >> boot. > > > > > > I was just correcting your statement, because it was wrong. > > > > > > Could be that x86 code maps PXM 1 to node 0 because PXM 1 does neither > > > have CPUs nor memory. That would imply that we can, in fact, never have > > > node 0 offline during boot. > > > > > > > Yep, looks like it. > > > > [0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0 > > [0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0 > > [0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0 > > [0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0 > > [0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x-0x0009] > > [0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x0010-0xbfff] > > [0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x1-0x13fff] > > This begs a question whether ppc can do the same thing? Or x86 stop doing it so that you can see on what node you are running? What's the point of this indirection other than another way of avoiding empty node 0? Thanks Michal
[PATCH v2 6/6] powerpc/qspinlock: optimised atomic_try_cmpxchg_lock that adds the lock hint
This brings the behaviour of the uncontended fast path back to roughly equivalent to simple spinlocks -- a single atomic op with lock hint.

Signed-off-by: Nicholas Piggin
---
 arch/powerpc/include/asm/atomic.h    | 28 ++++++++++++++++++++++++++++
 arch/powerpc/include/asm/qspinlock.h |  2 +-
 2 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/atomic.h b/arch/powerpc/include/asm/atomic.h
index 498785ffc25f..f6a3d145ffb7 100644
--- a/arch/powerpc/include/asm/atomic.h
+++ b/arch/powerpc/include/asm/atomic.h
@@ -193,6 +193,34 @@ static __inline__ int atomic_dec_return_relaxed(atomic_t *v)
 #define atomic_xchg(v, new) (xchg(&((v)->counter), new))
 #define atomic_xchg_relaxed(v, new) xchg_relaxed(&((v)->counter), (new))
 
+/*
+ * Don't want to override the generic atomic_try_cmpxchg_acquire, because
+ * we add a lock hint to the lwarx, which may not be wanted for the
+ * _acquire case (and is not used by the other _acquire variants so it
+ * would be a surprise).
+ */
+static __always_inline bool
+atomic_try_cmpxchg_lock(atomic_t *v, int *old, int new)
+{
+	int r, o = *old;
+
+	__asm__ __volatile__ (
+"1:\t"	PPC_LWARX(%0,0,%2,1) "	# atomic_try_cmpxchg_acquire\n"
+"	cmpw	0,%0,%3\n"
+"	bne-	2f\n"
+"	stwcx.	%4,0,%2\n"
+"	bne-	1b\n"
+"\t"	PPC_ACQUIRE_BARRIER "\n"
+"2:\n"
+	: "=&r" (r), "+m" (v->counter)
+	: "r" (&v->counter), "r" (o), "r" (new)
+	: "cr0", "memory");
+
+	if (unlikely(r != o))
+		*old = r;
+	return likely(r == o);
+}
+
 /**
  * atomic_fetch_add_unless - add unless the number is a given value
  * @v: pointer of type atomic_t
diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h
index 0960a0de2467..beb6aa4628e7 100644
--- a/arch/powerpc/include/asm/qspinlock.h
+++ b/arch/powerpc/include/asm/qspinlock.h
@@ -26,7 +26,7 @@ static __always_inline void queued_spin_lock(struct qspinlock *lock)
 {
 	u32 val = 0;
 
-	if (likely(atomic_try_cmpxchg_acquire(&lock->val, &val, _Q_LOCKED_VAL)))
+	if (likely(atomic_try_cmpxchg_lock(&lock->val, &val, _Q_LOCKED_VAL)))
 		return;
 	queued_spin_lock_slowpath(lock, val);
-- 
2.23.0
[PATCH v2 5/6] powerpc/pseries: implement paravirt qspinlocks for SPLPAR
Signed-off-by: Nicholas Piggin --- arch/powerpc/include/asm/paravirt.h | 28 ++ arch/powerpc/include/asm/qspinlock.h | 55 +++ arch/powerpc/include/asm/qspinlock_paravirt.h | 5 ++ arch/powerpc/platforms/pseries/Kconfig| 5 ++ arch/powerpc/platforms/pseries/setup.c| 6 +- include/asm-generic/qspinlock.h | 2 + 6 files changed, 100 insertions(+), 1 deletion(-) create mode 100644 arch/powerpc/include/asm/qspinlock_paravirt.h diff --git a/arch/powerpc/include/asm/paravirt.h b/arch/powerpc/include/asm/paravirt.h index 7a8546660a63..f2d51f929cf5 100644 --- a/arch/powerpc/include/asm/paravirt.h +++ b/arch/powerpc/include/asm/paravirt.h @@ -29,6 +29,16 @@ static inline void yield_to_preempted(int cpu, u32 yield_count) { plpar_hcall_norets(H_CONFER, get_hard_smp_processor_id(cpu), yield_count); } + +static inline void prod_cpu(int cpu) +{ + plpar_hcall_norets(H_PROD, get_hard_smp_processor_id(cpu)); +} + +static inline void yield_to_any(void) +{ + plpar_hcall_norets(H_CONFER, -1, 0); +} #else static inline bool is_shared_processor(void) { @@ -45,6 +55,19 @@ static inline void yield_to_preempted(int cpu, u32 yield_count) { ___bad_yield_to_preempted(); /* This would be a bug */ } + +extern void ___bad_yield_to_any(void); +static inline void yield_to_any(void) +{ + ___bad_yield_to_any(); /* This would be a bug */ +} + +extern void ___bad_prod_cpu(void); +static inline void prod_cpu(int cpu) +{ + ___bad_prod_cpu(); /* This would be a bug */ +} + #endif #define vcpu_is_preempted vcpu_is_preempted @@ -57,5 +80,10 @@ static inline bool vcpu_is_preempted(int cpu) return false; } +static inline bool pv_is_native_spin_unlock(void) +{ + return !is_shared_processor(); +} + #endif /* __KERNEL__ */ #endif /* __ASM_PARAVIRT_H */ diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h index c49e33e24edd..0960a0de2467 100644 --- a/arch/powerpc/include/asm/qspinlock.h +++ b/arch/powerpc/include/asm/qspinlock.h @@ -3,9 +3,36 @@ #define _ASM_POWERPC_QSPINLOCK_H 
#include +#include #define _Q_PENDING_LOOPS (1 << 9) /* not tuned */ +#ifdef CONFIG_PARAVIRT_SPINLOCKS +extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val); +extern void __pv_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val); + +static __always_inline void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val) +{ + if (!is_shared_processor()) + native_queued_spin_lock_slowpath(lock, val); + else + __pv_queued_spin_lock_slowpath(lock, val); +} +#else +extern void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val); +#endif + +static __always_inline void queued_spin_lock(struct qspinlock *lock) +{ + u32 val = 0; + + if (likely(atomic_try_cmpxchg_acquire(>val, , _Q_LOCKED_VAL))) + return; + + queued_spin_lock_slowpath(lock, val); +} +#define queued_spin_lock queued_spin_lock + #define smp_mb__after_spinlock() smp_mb() static __always_inline int queued_spin_is_locked(struct qspinlock *lock) @@ -20,6 +47,34 @@ static __always_inline int queued_spin_is_locked(struct qspinlock *lock) } #define queued_spin_is_locked queued_spin_is_locked +#ifdef CONFIG_PARAVIRT_SPINLOCKS +#define SPIN_THRESHOLD (1<<15) /* not tuned */ + +static __always_inline void pv_wait(u8 *ptr, u8 val) +{ + if (*ptr != val) + return; + yield_to_any(); + /* +* We could pass in a CPU here if waiting in the queue and yield to +* the previous CPU in the queue. 
+*/ +} + +static __always_inline void pv_kick(int cpu) +{ + prod_cpu(cpu); +} + +extern void __pv_init_lock_hash(void); + +static inline void pv_spinlocks_init(void) +{ + __pv_init_lock_hash(); +} + +#endif + #include #endif /* _ASM_POWERPC_QSPINLOCK_H */ diff --git a/arch/powerpc/include/asm/qspinlock_paravirt.h b/arch/powerpc/include/asm/qspinlock_paravirt.h new file mode 100644 index ..6dbdb8a4f84f --- /dev/null +++ b/arch/powerpc/include/asm/qspinlock_paravirt.h @@ -0,0 +1,5 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ +#ifndef __ASM_QSPINLOCK_PARAVIRT_H +#define __ASM_QSPINLOCK_PARAVIRT_H + +#endif /* __ASM_QSPINLOCK_PARAVIRT_H */ diff --git a/arch/powerpc/platforms/pseries/Kconfig b/arch/powerpc/platforms/pseries/Kconfig index 24c18362e5ea..756e727b383f 100644 --- a/arch/powerpc/platforms/pseries/Kconfig +++ b/arch/powerpc/platforms/pseries/Kconfig @@ -25,9 +25,14 @@ config PPC_PSERIES select SWIOTLB default y +config PARAVIRT_SPINLOCKS + bool + default n + config PPC_SPLPAR depends on PPC_PSERIES bool "Support for shared-processor logical partitions" + select PARAVIRT_SPINLOCKS if PPC_QUEUED_SPINLOCKS help Enabling this option will make the kernel run more
[PATCH v2 4/6] powerpc/64s: implement queued spinlocks and rwlocks
These have shown significantly improved performance and fairness when spinlock contention is moderate to high on very large systems. [ Numbers hopefully forthcoming after more testing, but initial results look good ] Thanks to the fast path, single threaded performance is not noticeably hurt. Signed-off-by: Nicholas Piggin --- arch/powerpc/Kconfig | 13 arch/powerpc/include/asm/Kbuild | 2 ++ arch/powerpc/include/asm/qspinlock.h | 25 +++ arch/powerpc/include/asm/spinlock.h | 5 + arch/powerpc/include/asm/spinlock_types.h | 5 + arch/powerpc/lib/Makefile | 3 +++ include/asm-generic/qspinlock.h | 2 ++ 7 files changed, 55 insertions(+) create mode 100644 arch/powerpc/include/asm/qspinlock.h diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 9fa23eb320ff..b17575109876 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -145,6 +145,8 @@ config PPC select ARCH_SUPPORTS_ATOMIC_RMW select ARCH_USE_BUILTIN_BSWAP select ARCH_USE_CMPXCHG_LOCKREF if PPC64 + select ARCH_USE_QUEUED_RWLOCKS if PPC_QUEUED_SPINLOCKS + select ARCH_USE_QUEUED_SPINLOCKS if PPC_QUEUED_SPINLOCKS select ARCH_WANT_IPC_PARSE_VERSION select ARCH_WEAK_RELEASE_ACQUIRE select BINFMT_ELF @@ -490,6 +492,17 @@ config HOTPLUG_CPU Say N if you are unsure. +config PPC_QUEUED_SPINLOCKS + bool "Queued spinlocks" + depends on SMP + default "y" if PPC_BOOK3S_64 + help + Say Y here to use queued spinlocks which are more complex + but give better scalability and fairness on large SMP and NUMA + systems. + + If unsure, say "Y" if you have lots of cores, otherwise "N". 
+ config ARCH_CPU_PROBE_RELEASE def_bool y depends on HOTPLUG_CPU diff --git a/arch/powerpc/include/asm/Kbuild b/arch/powerpc/include/asm/Kbuild index dadbcf3a0b1e..1dd8b6adff5e 100644 --- a/arch/powerpc/include/asm/Kbuild +++ b/arch/powerpc/include/asm/Kbuild @@ -6,5 +6,7 @@ generated-y += syscall_table_spu.h generic-y += export.h generic-y += local64.h generic-y += mcs_spinlock.h +generic-y += qrwlock.h +generic-y += qspinlock.h generic-y += vtime.h generic-y += early_ioremap.h diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h new file mode 100644 index ..c49e33e24edd --- /dev/null +++ b/arch/powerpc/include/asm/qspinlock.h @@ -0,0 +1,25 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _ASM_POWERPC_QSPINLOCK_H +#define _ASM_POWERPC_QSPINLOCK_H + +#include + +#define _Q_PENDING_LOOPS (1 << 9) /* not tuned */ + +#define smp_mb__after_spinlock() smp_mb() + +static __always_inline int queued_spin_is_locked(struct qspinlock *lock) +{ + /* +* This barrier was added to simple spinlocks by commit 51d7d5205d338, +* but it should now be possible to remove it, asm arm64 has done with +* commit c6f5d02b6a0f. 
+*/ + smp_mb(); + return atomic_read(>val); +} +#define queued_spin_is_locked queued_spin_is_locked + +#include + +#endif /* _ASM_POWERPC_QSPINLOCK_H */ diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h index 21357fe05fe0..434615f1d761 100644 --- a/arch/powerpc/include/asm/spinlock.h +++ b/arch/powerpc/include/asm/spinlock.h @@ -3,7 +3,12 @@ #define __ASM_SPINLOCK_H #ifdef __KERNEL__ +#ifdef CONFIG_PPC_QUEUED_SPINLOCKS +#include +#include +#else #include +#endif #endif /* __KERNEL__ */ #endif /* __ASM_SPINLOCK_H */ diff --git a/arch/powerpc/include/asm/spinlock_types.h b/arch/powerpc/include/asm/spinlock_types.h index 3906f52dae65..c5d742f18021 100644 --- a/arch/powerpc/include/asm/spinlock_types.h +++ b/arch/powerpc/include/asm/spinlock_types.h @@ -6,6 +6,11 @@ # error "please don't include this file directly" #endif +#ifdef CONFIG_PPC_QUEUED_SPINLOCKS +#include +#include +#else #include +#endif #endif diff --git a/arch/powerpc/lib/Makefile b/arch/powerpc/lib/Makefile index 5e994cda8e40..d66a645503eb 100644 --- a/arch/powerpc/lib/Makefile +++ b/arch/powerpc/lib/Makefile @@ -41,7 +41,10 @@ obj-$(CONFIG_PPC_BOOK3S_64) += copyuser_power7.o copypage_power7.o \ obj64-y+= copypage_64.o copyuser_64.o mem_64.o hweight_64.o \ memcpy_64.o memcpy_mcsafe_64.o +ifndef CONFIG_PPC_QUEUED_SPINLOCKS obj64-$(CONFIG_SMP)+= locks.o +endif + obj64-$(CONFIG_ALTIVEC)+= vmx-helper.o obj64-$(CONFIG_KPROBES_SANITY_TEST)+= test_emulate_step.o \ test_emulate_step_exec_instr.o diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h index fde943d180e0..fb0a814d4395 100644 --- a/include/asm-generic/qspinlock.h +++ b/include/asm-generic/qspinlock.h @@ -12,6 +12,7 @@ #include +#ifndef queued_spin_is_locked /** *
[PATCH v2 3/6] powerpc: move spinlock implementation to simple_spinlock
To prepare for queued spinlocks. This is a simple rename except to update preprocessor guard name and a file reference. Signed-off-by: Nicholas Piggin --- arch/powerpc/include/asm/simple_spinlock.h| 292 ++ .../include/asm/simple_spinlock_types.h | 21 ++ arch/powerpc/include/asm/spinlock.h | 285 + arch/powerpc/include/asm/spinlock_types.h | 12 +- 4 files changed, 315 insertions(+), 295 deletions(-) create mode 100644 arch/powerpc/include/asm/simple_spinlock.h create mode 100644 arch/powerpc/include/asm/simple_spinlock_types.h diff --git a/arch/powerpc/include/asm/simple_spinlock.h b/arch/powerpc/include/asm/simple_spinlock.h new file mode 100644 index ..e048c041c4a9 --- /dev/null +++ b/arch/powerpc/include/asm/simple_spinlock.h @@ -0,0 +1,292 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ +#ifndef __ASM_SIMPLE_SPINLOCK_H +#define __ASM_SIMPLE_SPINLOCK_H +#ifdef __KERNEL__ + +/* + * Simple spin lock operations. + * + * Copyright (C) 2001-2004 Paul Mackerras , IBM + * Copyright (C) 2001 Anton Blanchard , IBM + * Copyright (C) 2002 Dave Engebretsen , IBM + * Rework to support virtual processors + * + * Type of int is used as a full 64b word is not necessary. + * + * (the type definitions are in asm/simple_spinlock_types.h) + */ +#include +#include +#ifdef CONFIG_PPC64 +#include +#endif +#include +#include + +#ifdef CONFIG_PPC64 +/* use 0x80yy when locked, where yy == CPU number */ +#ifdef __BIG_ENDIAN__ +#define LOCK_TOKEN (*(u32 *)(_paca()->lock_token)) +#else +#define LOCK_TOKEN (*(u32 *)(_paca()->paca_index)) +#endif +#else +#define LOCK_TOKEN 1 +#endif + +static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock) +{ + return lock.slock == 0; +} + +static inline int arch_spin_is_locked(arch_spinlock_t *lock) +{ + smp_mb(); + return !arch_spin_value_unlocked(*lock); +} + +/* + * This returns the old value in the lock, so we succeeded + * in getting the lock if the return value is 0. 
+ */ +static inline unsigned long __arch_spin_trylock(arch_spinlock_t *lock) +{ + unsigned long tmp, token; + + token = LOCK_TOKEN; + __asm__ __volatile__( +"1:" PPC_LWARX(%0,0,%2,1) "\n\ + cmpwi 0,%0,0\n\ + bne-2f\n\ + stwcx. %1,0,%2\n\ + bne-1b\n" + PPC_ACQUIRE_BARRIER +"2:" + : "=" (tmp) + : "r" (token), "r" (>slock) + : "cr0", "memory"); + + return tmp; +} + +static inline int arch_spin_trylock(arch_spinlock_t *lock) +{ + return __arch_spin_trylock(lock) == 0; +} + +/* + * On a system with shared processors (that is, where a physical + * processor is multiplexed between several virtual processors), + * there is no point spinning on a lock if the holder of the lock + * isn't currently scheduled on a physical processor. Instead + * we detect this situation and ask the hypervisor to give the + * rest of our timeslice to the lock holder. + * + * So that we can tell which virtual processor is holding a lock, + * we put 0x8000 | smp_processor_id() in the lock when it is + * held. Conveniently, we have a word in the paca that holds this + * value. 
+ */ + +#if defined(CONFIG_PPC_SPLPAR) +/* We only yield to the hypervisor if we are in shared processor mode */ +void splpar_spin_yield(arch_spinlock_t *lock); +void splpar_rw_yield(arch_rwlock_t *lock); +#else /* SPLPAR */ +static inline void splpar_spin_yield(arch_spinlock_t *lock) {}; +static inline void splpar_rw_yield(arch_rwlock_t *lock) {}; +#endif + +static inline void spin_yield(arch_spinlock_t *lock) +{ + if (is_shared_processor()) + splpar_spin_yield(lock); + else + barrier(); +} + +static inline void rw_yield(arch_rwlock_t *lock) +{ + if (is_shared_processor()) + splpar_rw_yield(lock); + else + barrier(); +} + +static inline void arch_spin_lock(arch_spinlock_t *lock) +{ + while (1) { + if (likely(__arch_spin_trylock(lock) == 0)) + break; + do { + HMT_low(); + if (is_shared_processor()) + splpar_spin_yield(lock); + } while (unlikely(lock->slock != 0)); + HMT_medium(); + } +} + +static inline +void arch_spin_lock_flags(arch_spinlock_t *lock, unsigned long flags) +{ + unsigned long flags_dis; + + while (1) { + if (likely(__arch_spin_trylock(lock) == 0)) + break; + local_save_flags(flags_dis); + local_irq_restore(flags); + do { + HMT_low(); + if (is_shared_processor()) + splpar_spin_yield(lock); + } while (unlikely(lock->slock != 0)); + HMT_medium(); +
[PATCH v2 2/6] powerpc/pseries: move some PAPR paravirt functions to their own file
Signed-off-by: Nicholas Piggin --- arch/powerpc/include/asm/paravirt.h | 61 + arch/powerpc/include/asm/spinlock.h | 24 +--- arch/powerpc/lib/locks.c| 12 +++--- 3 files changed, 68 insertions(+), 29 deletions(-) create mode 100644 arch/powerpc/include/asm/paravirt.h diff --git a/arch/powerpc/include/asm/paravirt.h b/arch/powerpc/include/asm/paravirt.h new file mode 100644 index ..7a8546660a63 --- /dev/null +++ b/arch/powerpc/include/asm/paravirt.h @@ -0,0 +1,61 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ +#ifndef __ASM_PARAVIRT_H +#define __ASM_PARAVIRT_H +#ifdef __KERNEL__ + +#include +#include +#ifdef CONFIG_PPC64 +#include +#include +#endif + +#ifdef CONFIG_PPC_SPLPAR +DECLARE_STATIC_KEY_FALSE(shared_processor); + +static inline bool is_shared_processor(void) +{ + return static_branch_unlikely(_processor); +} + +/* If bit 0 is set, the cpu has been preempted */ +static inline u32 yield_count_of(int cpu) +{ + __be32 yield_count = READ_ONCE(lppaca_of(cpu).yield_count); + return be32_to_cpu(yield_count); +} + +static inline void yield_to_preempted(int cpu, u32 yield_count) +{ + plpar_hcall_norets(H_CONFER, get_hard_smp_processor_id(cpu), yield_count); +} +#else +static inline bool is_shared_processor(void) +{ + return false; +} + +static inline u32 yield_count_of(int cpu) +{ + return 0; +} + +extern void ___bad_yield_to_preempted(void); +static inline void yield_to_preempted(int cpu, u32 yield_count) +{ + ___bad_yield_to_preempted(); /* This would be a bug */ +} +#endif + +#define vcpu_is_preempted vcpu_is_preempted +static inline bool vcpu_is_preempted(int cpu) +{ + if (!is_shared_processor()) + return false; + if (yield_count_of(cpu) & 1) + return true; + return false; +} + +#endif /* __KERNEL__ */ +#endif /* __ASM_PARAVIRT_H */ diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h index 2d620896cdae..79be9bb10bbb 100644 --- a/arch/powerpc/include/asm/spinlock.h +++ b/arch/powerpc/include/asm/spinlock.h @@ -15,11 +15,10 @@ 
* * (the type definitions are in asm/spinlock_types.h) */ -#include #include +#include #ifdef CONFIG_PPC64 #include -#include #endif #include #include @@ -35,18 +34,6 @@ #define LOCK_TOKEN 1 #endif -#ifdef CONFIG_PPC_PSERIES -DECLARE_STATIC_KEY_FALSE(shared_processor); - -#define vcpu_is_preempted vcpu_is_preempted -static inline bool vcpu_is_preempted(int cpu) -{ - if (!static_branch_unlikely(_processor)) - return false; - return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1); -} -#endif - static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock) { return lock.slock == 0; @@ -110,15 +97,6 @@ static inline void splpar_spin_yield(arch_spinlock_t *lock) {}; static inline void splpar_rw_yield(arch_rwlock_t *lock) {}; #endif -static inline bool is_shared_processor(void) -{ -#ifdef CONFIG_PPC_SPLPAR - return static_branch_unlikely(_processor); -#else - return false; -#endif -} - static inline void spin_yield(arch_spinlock_t *lock) { if (is_shared_processor()) diff --git a/arch/powerpc/lib/locks.c b/arch/powerpc/lib/locks.c index 6440d5943c00..04165b7a163f 100644 --- a/arch/powerpc/lib/locks.c +++ b/arch/powerpc/lib/locks.c @@ -27,14 +27,14 @@ void splpar_spin_yield(arch_spinlock_t *lock) return; holder_cpu = lock_value & 0x; BUG_ON(holder_cpu >= NR_CPUS); - yield_count = be32_to_cpu(lppaca_of(holder_cpu).yield_count); + + yield_count = yield_count_of(holder_cpu); if ((yield_count & 1) == 0) return; /* virtual cpu is currently running */ rmb(); if (lock->slock != lock_value) return; /* something has changed */ - plpar_hcall_norets(H_CONFER, - get_hard_smp_processor_id(holder_cpu), yield_count); + yield_to_preempted(holder_cpu, yield_count); } EXPORT_SYMBOL_GPL(splpar_spin_yield); @@ -53,13 +53,13 @@ void splpar_rw_yield(arch_rwlock_t *rw) return; /* no write lock at present */ holder_cpu = lock_value & 0x; BUG_ON(holder_cpu >= NR_CPUS); - yield_count = be32_to_cpu(lppaca_of(holder_cpu).yield_count); + + yield_count = yield_count_of(holder_cpu); if 
((yield_count & 1) == 0) return; /* virtual cpu is currently running */ rmb(); if (rw->lock != lock_value) return; /* something has changed */ - plpar_hcall_norets(H_CONFER, - get_hard_smp_processor_id(holder_cpu), yield_count); + yield_to_preempted(holder_cpu, yield_count); } #endif -- 2.23.0
[PATCH v2 1/6] powerpc/powernv: must include hvcall.h to get PAPR defines
An include goes away in future patches, which breaks compilation without this. Signed-off-by: Nicholas Piggin --- arch/powerpc/platforms/powernv/pci-ioda-tce.c | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/powerpc/platforms/powernv/pci-ioda-tce.c b/arch/powerpc/platforms/powernv/pci-ioda-tce.c index f923359d8afc..8eba6ece7808 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda-tce.c +++ b/arch/powerpc/platforms/powernv/pci-ioda-tce.c @@ -15,6 +15,7 @@ #include #include +#include <asm/hvcall.h> /* share error returns with PAPR */ #include "pci.h" unsigned long pnv_ioda_parse_tce_sizes(struct pnv_phb *phb) -- 2.23.0
[PATCH v2 0/6] powerpc: queued spinlocks and rwlocks
v2 is updated to account for feedback from Will, Peter, and Waiman (thank you), and trims off a couple of RFC and unrelated patches. Thanks, Nick Nicholas Piggin (6): powerpc/powernv: must include hvcall.h to get PAPR defines powerpc/pseries: move some PAPR paravirt functions to their own file powerpc: move spinlock implementation to simple_spinlock powerpc/64s: implement queued spinlocks and rwlocks powerpc/pseries: implement paravirt qspinlocks for SPLPAR powerpc/qspinlock: optimised atomic_try_cmpxchg_lock that adds the lock hint arch/powerpc/Kconfig | 13 + arch/powerpc/include/asm/Kbuild | 2 + arch/powerpc/include/asm/atomic.h | 28 ++ arch/powerpc/include/asm/paravirt.h | 89 + arch/powerpc/include/asm/qspinlock.h | 80 + arch/powerpc/include/asm/qspinlock_paravirt.h | 5 + arch/powerpc/include/asm/simple_spinlock.h| 292 + .../include/asm/simple_spinlock_types.h | 21 ++ arch/powerpc/include/asm/spinlock.h | 308 +- arch/powerpc/include/asm/spinlock_types.h | 17 +- arch/powerpc/lib/Makefile | 3 + arch/powerpc/lib/locks.c | 12 +- arch/powerpc/platforms/powernv/pci-ioda-tce.c | 1 + arch/powerpc/platforms/pseries/Kconfig| 5 + arch/powerpc/platforms/pseries/setup.c| 6 +- include/asm-generic/qspinlock.h | 4 + 16 files changed, 564 insertions(+), 322 deletions(-) create mode 100644 arch/powerpc/include/asm/paravirt.h create mode 100644 arch/powerpc/include/asm/qspinlock.h create mode 100644 arch/powerpc/include/asm/qspinlock_paravirt.h create mode 100644 arch/powerpc/include/asm/simple_spinlock.h create mode 100644 arch/powerpc/include/asm/simple_spinlock_types.h -- 2.23.0
Re: [PATCH] powerpc/powernv: machine check handler for POWER10
Hi Nicholas, I love your patch! Perhaps something to improve: [auto build test WARNING on powerpc/next] [also build test WARNING on v5.8-rc3 next-20200702] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use as documented in https://git-scm.com/docs/git-format-patch] url: https://github.com/0day-ci/linux/commits/Nicholas-Piggin/powerpc-powernv-machine-check-handler-for-POWER10/20200703-073739 base: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next config: powerpc-allyesconfig (attached as .config) compiler: powerpc64-linux-gcc (GCC) 9.3.0 reproduce (this is a W=1 build): wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # save the attached .config to linux build tree COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=powerpc If you fix the issue, kindly add following tag as appropriate Reported-by: kernel test robot All warnings (new ones prefixed by >>): arch/powerpc/kernel/mce_power.c:709:6: warning: no previous prototype for '__machine_check_early_realmode_p7' [-Wmissing-prototypes] 709 | long __machine_check_early_realmode_p7(struct pt_regs *regs) | ^ arch/powerpc/kernel/mce_power.c:717:6: warning: no previous prototype for '__machine_check_early_realmode_p8' [-Wmissing-prototypes] 717 | long __machine_check_early_realmode_p8(struct pt_regs *regs) | ^ arch/powerpc/kernel/mce_power.c:722:6: warning: no previous prototype for '__machine_check_early_realmode_p9' [-Wmissing-prototypes] 722 | long __machine_check_early_realmode_p9(struct pt_regs *regs) | ^ >> arch/powerpc/kernel/mce_power.c:740:6: warning: no previous prototype for >> '__machine_check_early_realmode_p10' [-Wmissing-prototypes] 740 | long __machine_check_early_realmode_p10(struct pt_regs *regs) | ^~ vim +/__machine_check_early_realmode_p10 +740 arch/powerpc/kernel/mce_power.c 739 > 740 long 
__machine_check_early_realmode_p10(struct pt_regs *regs) --- 0-DAY CI Kernel Test Service, Intel Corporation https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org
[PATCH] powerpc/perf: Add kernel support for new MSR[HV PR] bits in trace-imc.
IMC trace-mode record has MSR[HV PR] bits added in the third DW. These bits can be used to set the cpumode for the instruction pointer captured in each sample. Add support in kernel to use these bits to set the cpumode for each sample. Signed-off-by: Anju T Sudhakar --- arch/powerpc/include/asm/imc-pmu.h | 5 + arch/powerpc/perf/imc-pmu.c| 29 - 2 files changed, 29 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/include/asm/imc-pmu.h b/arch/powerpc/include/asm/imc-pmu.h index 4da4fcba0684..4f897993b710 100644 --- a/arch/powerpc/include/asm/imc-pmu.h +++ b/arch/powerpc/include/asm/imc-pmu.h @@ -99,6 +99,11 @@ struct trace_imc_data { */ #define IMC_TRACE_RECORD_TB1_MASK 0x3ffULL +/* + * Bit 0:1 in third DW of IMC trace record + * specifies the MSR[HV PR] values. + */ +#define IMC_TRACE_RECORD_VAL_HVPR(x) ((x) >> 62) /* * Device tree parser code detects IMC pmu support and diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c index cb50a9e1fd2d..310922fed9eb 100644 --- a/arch/powerpc/perf/imc-pmu.c +++ b/arch/powerpc/perf/imc-pmu.c @@ -1178,11 +1178,30 @@ static int trace_imc_prepare_sample(struct trace_imc_data *mem, header->size = sizeof(*header) + event->header_size; header->misc = 0; - if (is_kernel_addr(data->ip)) - header->misc |= PERF_RECORD_MISC_KERNEL; - else - header->misc |= PERF_RECORD_MISC_USER; - + if (cpu_has_feature(CPU_FTRS_POWER9)) { + if (is_kernel_addr(data->ip)) + header->misc |= PERF_RECORD_MISC_KERNEL; + else + header->misc |= PERF_RECORD_MISC_USER; + } else { + switch (IMC_TRACE_RECORD_VAL_HVPR(mem->val)) { + case 0:/* when MSR HV and PR not set in the trace-record */ + header->misc |= PERF_RECORD_MISC_GUEST_KERNEL; + break; + case 1: /* MSR HV is 0 and PR is 1 */ + header->misc |= PERF_RECORD_MISC_GUEST_USER; + break; + case 2: /* MSR Hv is 1 and PR is 0 */ + header->misc |= PERF_RECORD_MISC_HYPERVISOR; + break; + case 3: /* MSR HV is 1 and PR is 1 */ + header->misc |= PERF_RECORD_MISC_USER; + break; + default: + 
pr_info("IMC: Unable to set the flag based on MSR bits\n"); + break; + } + } perf_event_header__init_id(header, data, event); return 0; -- 2.25.4
Re: [PATCH V3 (RESEND) 2/3] mm/sparsemem: Enable vmem_altmap support in vmemmap_alloc_block_buf()
On 07/02/2020 07:37 PM, Catalin Marinas wrote: > On Thu, Jun 18, 2020 at 06:45:29AM +0530, Anshuman Khandual wrote: >> There are many instances where vmemap allocation is often switched between >> regular memory and device memory just based on whether altmap is available >> or not. vmemmap_alloc_block_buf() is used in various platforms to allocate >> vmemmap mappings. Lets also enable it to handle altmap based device memory >> allocation along with existing regular memory allocations. This will help >> in avoiding the altmap based allocation switch in many places. >> >> While here also implement a regular memory allocation fallback mechanism >> when the first preferred device memory allocation fails. This will ensure >> preserving the existing semantics on powerpc platform. To summarize there >> are three different methods to call vmemmap_alloc_block_buf(). >> >> (., NULL, false) /* Allocate from system RAM */ >> (., altmap, false) /* Allocate from altmap without any fallback */ >> (., altmap, true) /* Allocate from altmap with fallback (system RAM) */ > [...] >> diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c >> index bc73abf0bc25..01e25b56eccb 100644 >> --- a/arch/powerpc/mm/init_64.c >> +++ b/arch/powerpc/mm/init_64.c >> @@ -225,12 +225,12 @@ int __meminit vmemmap_populate(unsigned long start, >> unsigned long end, int node, >> * fall back to system memory if the altmap allocation fail. >> */ >> if (altmap && !altmap_cross_boundary(altmap, start, page_size)) >> { >> -p = altmap_alloc_block_buf(page_size, altmap); >> -if (!p) >> -pr_debug("altmap block allocation failed, >> falling back to system memory"); >> +p = vmemmap_alloc_block_buf(page_size, node, >> +altmap, true); >> +} else { >> +p = vmemmap_alloc_block_buf(page_size, node, >> +NULL, false); >> } >> -if (!p) >> -p = vmemmap_alloc_block_buf(page_size, node); >> if (!p) >> return -ENOMEM; > > Is the fallback argument actually necessary. 
It may be cleaner to just > leave the code as is with the choice between altmap and NULL. If an arch > needs a fallback (only powerpc), they have the fallback in place > already. I don't see the powerpc code any better after this change. > > I'm fine with the altmap argument though. Okay. Will drop 'fallback' from vmemmap_alloc_block_buf() and update the callers. There will also be a single change in the subsequent patch i.e vmemmap_alloc_block_buf(PMD_SIZE, node, altmap).
Re: [PATCH v2 5/6] powerpc/pseries/iommu: Make use of DDW even if it does not map the partition
On Thu, 2020-07-02 at 10:31 +1000, Alexey Kardashevskiy wrote: > > On 02/07/2020 09:48, Leonardo Bras wrote: > > On Wed, 2020-07-01 at 16:57 -0300, Leonardo Bras wrote: > > > > It is not necessarily "direct" anymore as the name suggests, you may > > > > want to change that. DMA64_PROPNAME, may be. Thanks, > > > > > > > > > > Yeah, you are right. > > > I will change this for next version, also changing the string name to > > > reflect this. > > > > > > -#define DIRECT64_PROPNAME "linux,direct64-ddr-window-info" > > > +#define DMA64_PROPNAME "linux,dma64-ddr-window-info" > > > > > > Is that ok? > > > > > > Thank you for helping! > > > > In fact, there is a lot of places in this file where it's called direct > > window. Should I replace everything? > > Should it be in a separated patch? > > If it looks simple and you write a nice commit log explaining all that > and why you are not reusing the existing ibm,dma-window property (to > provide a clue what "reset" will reset to? is there any other reason?) > for that - sure, do it :) > v3 available here: http://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=187348=%2A=both Best regards, Leonardo
[PATCH v3 6/6] powerpc/pseries/iommu: Rename "direct window" to "dma window"
A previous change introduced the usage of DDW as a bigger indirect DMA mapping when the DDW available size does not map the whole partition. As most of the code that manipulates direct mappings was reused for indirect mappings, it's necessary to rename all names and debug/info messages to reflect that it can be used for both kinds of mapping. Also, defines DEFAULT_DMA_WIN as "ibm,dma-window" to document that it's the name of the default DMA window. Those changes are not supposed to change how the code works in any way, just adjust naming. Signed-off-by: Leonardo Bras --- arch/powerpc/platforms/pseries/iommu.c | 101 + 1 file changed, 53 insertions(+), 48 deletions(-) diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c index c652177de09c..070b80efc43a 100644 --- a/arch/powerpc/platforms/pseries/iommu.c +++ b/arch/powerpc/platforms/pseries/iommu.c @@ -339,7 +339,7 @@ struct dynamic_dma_window_prop { __be32 window_shift; /* ilog2(tce_window_size) */ }; -struct direct_window { +struct dma_win { struct device_node *device; const struct dynamic_dma_window_prop *prop; struct list_head list; @@ -359,12 +359,13 @@ struct ddw_create_response { u32 addr_lo; }; -static LIST_HEAD(direct_window_list); +static LIST_HEAD(dma_win_list); /* prevents races between memory on/offline and window creation */ -static DEFINE_SPINLOCK(direct_window_list_lock); +static DEFINE_SPINLOCK(dma_win_list_lock); /* protects initializing window twice for same device */ -static DEFINE_MUTEX(direct_window_init_mutex); +static DEFINE_MUTEX(dma_win_init_mutex); #define DMA64_PROPNAME "linux,dma64-ddr-window-info" +#define DEFAULT_DMA_WIN "ibm,dma-window" static int tce_clearrange_multi_pSeriesLP(unsigned long start_pfn, unsigned long num_pfn, const void *arg) @@ -697,9 +698,12 @@ static void pci_dma_bus_setup_pSeriesLP(struct pci_bus *bus) pr_debug("pci_dma_bus_setup_pSeriesLP: setting up bus %pOF\n", dn); - /* Find nearest ibm,dma-window, walking up the device 
tree */ + /* +* Find nearest ibm,dma-window (default DMA window), walking up the +* device tree +*/ for (pdn = dn; pdn != NULL; pdn = pdn->parent) { - dma_window = of_get_property(pdn, "ibm,dma-window", NULL); + dma_window = of_get_property(pdn, DEFAULT_DMA_WIN, NULL); if (dma_window != NULL) break; } @@ -710,7 +714,8 @@ static void pci_dma_bus_setup_pSeriesLP(struct pci_bus *bus) dma_window = alt_dma_window; if (dma_window == NULL) { - pr_debug(" no ibm,dma-window nor linux,dma64-ddr-window-info property !\n"); + pr_debug(" no %s nor %s property !\n", +DEFAULT_DMA_WIN, DMA64_PROPNAME); return; } @@ -808,11 +813,11 @@ static void remove_dma_window(struct device_node *np, u32 *ddw_avail, ret = rtas_call(ddw_avail[DDW_REMOVE_PE_DMA_WIN], 1, 1, NULL, liobn); if (ret) - pr_warn("%pOF: failed to remove direct window: rtas returned " + pr_warn("%pOF: failed to remove dma window: rtas returned " "%d to ibm,remove-pe-dma-window(%x) %llx\n", np, ret, ddw_avail[DDW_REMOVE_PE_DMA_WIN], liobn); else - pr_debug("%pOF: successfully removed direct window: rtas returned " + pr_debug("%pOF: successfully removed dma window: rtas returned " "%d to ibm,remove-pe-dma-window(%x) %llx\n", np, ret, ddw_avail[DDW_REMOVE_PE_DMA_WIN], liobn); } @@ -840,26 +845,26 @@ static void remove_ddw(struct device_node *np, bool remove_prop) ret = of_remove_property(np, win); if (ret) - pr_warn("%pOF: failed to remove direct window property: %d\n", + pr_warn("%pOF: failed to remove dma window property: %d\n", np, ret); } static u64 find_existing_ddw(struct device_node *pdn) { - struct direct_window *window; - const struct dynamic_dma_window_prop *direct64; + struct dma_win *window; + const struct dynamic_dma_window_prop *dma64; u64 dma_addr = 0; - spin_lock(_window_list_lock); + spin_lock(_win_list_lock); /* check if we already created a window and dupe that config if so */ - list_for_each_entry(window, _window_list, list) { + list_for_each_entry(window, _win_list, list) { if (window->device == pdn) { - 
direct64 = window->prop; - dma_addr = be64_to_cpu(direct64->dma_base); + dma64 = window->prop; + dma_addr = be64_to_cpu(dma64->dma_base); break; }
[PATCH v3 5/6] powerpc/pseries/iommu: Make use of DDW even if it does not map the partition
As of today, if the biggest DDW that can be created can't map the whole partition, it's creation is skipped and the default DMA window "ibm,dma-window" is used instead. Usually this DDW is bigger than the default DMA window, and it performs better, so it would be nice to use it instead. The ddw created will be used for direct mapping by default. If it's not available, indirect mapping sill be used instead. As there will never have both mappings at the same time, the same property name can be used for the created DDW. So renaming define DIRECT64_PROPNAME "linux,direct64-ddr-window-info" to define DMA64_PROPNAME "linux,dma64-ddr-window-info" looks the right thing to do. Signed-off-by: Leonardo Bras --- arch/powerpc/platforms/pseries/iommu.c | 38 -- 1 file changed, 24 insertions(+), 14 deletions(-) diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c index 5b520ac354c6..c652177de09c 100644 --- a/arch/powerpc/platforms/pseries/iommu.c +++ b/arch/powerpc/platforms/pseries/iommu.c @@ -364,7 +364,7 @@ static LIST_HEAD(direct_window_list); static DEFINE_SPINLOCK(direct_window_list_lock); /* protects initializing window twice for same device */ static DEFINE_MUTEX(direct_window_init_mutex); -#define DIRECT64_PROPNAME "linux,direct64-ddr-window-info" +#define DMA64_PROPNAME "linux,dma64-ddr-window-info" static int tce_clearrange_multi_pSeriesLP(unsigned long start_pfn, unsigned long num_pfn, const void *arg) @@ -690,7 +690,7 @@ static void pci_dma_bus_setup_pSeriesLP(struct pci_bus *bus) struct iommu_table *tbl; struct device_node *dn, *pdn; struct pci_dn *ppci; - const __be32 *dma_window = NULL; + const __be32 *dma_window = NULL, *alt_dma_window = NULL; dn = pci_bus_to_OF_node(bus); @@ -704,8 +704,13 @@ static void pci_dma_bus_setup_pSeriesLP(struct pci_bus *bus) break; } + /* If there is a DDW available, use it instead */ + alt_dma_window = of_get_property(pdn, DMA64_PROPNAME, NULL); + if (alt_dma_window) + dma_window = 
alt_dma_window; + if (dma_window == NULL) { - pr_debug(" no ibm,dma-window property !\n"); + pr_debug(" no ibm,dma-window nor linux,dma64-ddr-window-info property !\n"); return; } @@ -823,7 +828,7 @@ static void remove_ddw(struct device_node *np, bool remove_prop) if (ret) return; - win = of_find_property(np, DIRECT64_PROPNAME, NULL); + win = of_find_property(np, DMA64_PROPNAME, NULL); if (!win) return; @@ -869,8 +874,8 @@ static int find_existing_ddw_windows(void) if (!firmware_has_feature(FW_FEATURE_LPAR)) return 0; - for_each_node_with_property(pdn, DIRECT64_PROPNAME) { - direct64 = of_get_property(pdn, DIRECT64_PROPNAME, ); + for_each_node_with_property(pdn, DMA64_PROPNAME) { + direct64 = of_get_property(pdn, DMA64_PROPNAME, ); if (!direct64) continue; @@ -1205,23 +1210,26 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn) query.page_size); goto out_restore_defwin; } + /* verify the window * number of ptes will map the partition */ - /* check largest block * page size > max memory hotplug addr */ max_addr = ddw_memory_hotplug_max(); if (query.largest_available_block < (max_addr >> page_shift)) { - dev_dbg(>dev, "can't map partition max 0x%llx with %llu " - "%llu-sized pages\n", max_addr, query.largest_available_block, - 1ULL << page_shift); - goto out_restore_defwin; + dev_dbg(>dev, "can't map partition max 0x%llx with %llu %llu-sized pages\n", + max_addr, query.largest_available_block, + 1ULL << page_shift); + + len = order_base_2(query.largest_available_block << page_shift); + } else { + len = order_base_2(max_addr); } - len = order_base_2(max_addr); + win64 = kzalloc(sizeof(struct property), GFP_KERNEL); if (!win64) { dev_info(>dev, "couldn't allocate property for 64bit dma window\n"); goto out_restore_defwin; } - win64->name = kstrdup(DIRECT64_PROPNAME, GFP_KERNEL); + win64->name = kstrdup(DMA64_PROPNAME, GFP_KERNEL); win64->value = ddwprop = kmalloc(sizeof(*ddwprop), GFP_KERNEL); win64->length = sizeof(*ddwprop); if (!win64->name || 
!win64->value) { @@ -1268,7 +1276,9 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn) list_add(>list, _window_list); spin_unlock(_window_list_lock); -
[PATCH v3 4/6] powerpc/pseries/iommu: Remove default DMA window before creating DDW
On LoPAR "DMA Window Manipulation Calls", it's recommended to remove the default DMA window for the device, before attempting to configure a DDW, in order to make the maximum resources available for the next DDW to be created. This is a requirement for using DDW on devices in which hypervisor allows only one DMA window. If setting up a new DDW fails anywhere after the removal of this default DMA window, it's needed to restore the default DMA window. For this, an implementation of ibm,reset-pe-dma-windows rtas call is needed: Platforms supporting the DDW option starting with LoPAR level 2.7 implement ibm,ddw-extensions. The first extension available (index 2) carries the token for ibm,reset-pe-dma-windows rtas call, which is used to restore the default DMA window for a device, if it has been deleted. It does so by resetting the TCE table allocation for the PE to it's boot time value, available in "ibm,dma-window" device tree node. Signed-off-by: Leonardo Bras --- arch/powerpc/platforms/pseries/iommu.c | 83 +- 1 file changed, 69 insertions(+), 14 deletions(-) diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c index 4e33147825cc..5b520ac354c6 100644 --- a/arch/powerpc/platforms/pseries/iommu.c +++ b/arch/powerpc/platforms/pseries/iommu.c @@ -1066,6 +1066,38 @@ static phys_addr_t ddw_memory_hotplug_max(void) return max_addr; } +/* + * Platforms supporting the DDW option starting with LoPAR level 2.7 implement + * ibm,ddw-extensions, which carries the rtas token for + * ibm,reset-pe-dma-windows. + * That rtas-call can be used to restore the default DMA window for the device. 
+ */ +static void reset_dma_window(struct pci_dev *dev, struct device_node *par_dn) +{ + int ret; + u32 cfg_addr, reset_dma_win; + u64 buid; + struct device_node *dn; + struct pci_dn *pdn; + + ret = ddw_read_ext(par_dn, DDW_EXT_RESET_DMA_WIN, _dma_win); + if (ret) + return; + + dn = pci_device_to_OF_node(dev); + pdn = PCI_DN(dn); + buid = pdn->phb->buid; + cfg_addr = ((pdn->busno << 16) | (pdn->devfn << 8)); + + ret = rtas_call(reset_dma_win, 3, 1, NULL, cfg_addr, BUID_HI(buid), + BUID_LO(buid)); + if (ret) + dev_info(>dev, +"ibm,reset-pe-dma-windows(%x) %x %x %x returned %d ", +reset_dma_win, cfg_addr, BUID_HI(buid), BUID_LO(buid), +ret); +} + /* * If the PE supports dynamic dma windows, and there is space for a table * that can map all pages in a linear offset, then setup such a table, @@ -1079,7 +,7 @@ static phys_addr_t ddw_memory_hotplug_max(void) */ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn) { - int len, ret; + int len, ret, reset_win_ext; struct ddw_query_response query; struct ddw_create_response create; int page_shift; @@ -1087,7 +1119,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn) struct device_node *dn; u32 ddw_avail[DDW_APPLICABLE_SIZE]; struct direct_window *window; - struct property *win64; + struct property *win64, *default_win = NULL; struct dynamic_dma_window_prop *ddwprop; struct failed_ddw_pdn *fpdn; @@ -1122,7 +1154,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn) if (ret) goto out_failed; - /* + /* * Query if there is a second window of size to map the * whole partition. Query returns number of windows, largest * block assigned to PE (partition endpoint), and two bitmasks @@ -1133,14 +1165,34 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn) if (ret != 0) goto out_failed; + /* +* If there is no window available, remove the default DMA window, +* if it's present. This will make all the resources available to the +* new DDW window. 
+* If anything fails after this, we need to restore it, so also check +* for extensions presence. +*/ if (query.windows_available == 0) { - /* -* no additional windows are available for this device. -* We might be able to reallocate the existing window, -* trading in for a larger page size. -*/ - dev_dbg(>dev, "no free dynamic windows"); - goto out_failed; + default_win = of_find_property(pdn, "ibm,dma-window", NULL); + if (!default_win) + goto out_failed; + + reset_win_ext = ddw_read_ext(pdn, DDW_EXT_RESET_DMA_WIN, NULL); + if (reset_win_ext) + goto out_failed; + + remove_dma_window(pdn, ddw_avail, default_win); + + /* Query again,
[PATCH v3 3/6] powerpc/pseries/iommu: Move window-removing part of remove_ddw into remove_dma_window
Move the window-removing part of remove_ddw into a new function (remove_dma_window), so it can be used to remove other DMA windows. It's useful for removing DMA windows that don't create DIRECT64_PROPNAME property, like the default DMA window from the device, which uses "ibm,dma-window". Signed-off-by: Leonardo Bras --- arch/powerpc/platforms/pseries/iommu.c | 45 +++--- 1 file changed, 27 insertions(+), 18 deletions(-) diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c index 1a933c4e8bba..4e33147825cc 100644 --- a/arch/powerpc/platforms/pseries/iommu.c +++ b/arch/powerpc/platforms/pseries/iommu.c @@ -781,25 +781,14 @@ static int __init disable_ddw_setup(char *str) early_param("disable_ddw", disable_ddw_setup); -static void remove_ddw(struct device_node *np, bool remove_prop) +static void remove_dma_window(struct device_node *np, u32 *ddw_avail, + struct property *win) { struct dynamic_dma_window_prop *dwp; - struct property *win64; - u32 ddw_avail[DDW_APPLICABLE_SIZE]; u64 liobn; - int ret = 0; - - ret = of_property_read_u32_array(np, "ibm,ddw-applicable", -_avail[0], DDW_APPLICABLE_SIZE); - - win64 = of_find_property(np, DIRECT64_PROPNAME, NULL); - if (!win64) - return; - - if (ret || win64->length < sizeof(*dwp)) - goto delprop; + int ret; - dwp = win64->value; + dwp = win->value; liobn = (u64)be32_to_cpu(dwp->liobn); /* clear the whole window, note the arg is in kernel pages */ @@ -821,10 +810,30 @@ static void remove_ddw(struct device_node *np, bool remove_prop) pr_debug("%pOF: successfully removed direct window: rtas returned " "%d to ibm,remove-pe-dma-window(%x) %llx\n", np, ret, ddw_avail[DDW_REMOVE_PE_DMA_WIN], liobn); +} + +static void remove_ddw(struct device_node *np, bool remove_prop) +{ + struct property *win; + u32 ddw_avail[DDW_APPLICABLE_SIZE]; + int ret = 0; + + ret = of_property_read_u32_array(np, "ibm,ddw-applicable", +_avail[0], DDW_APPLICABLE_SIZE); + if (ret) + return; + + win = of_find_property(np, 
DIRECT64_PROPNAME, NULL); + if (!win) + return; + + if (win->length >= sizeof(struct dynamic_dma_window_prop)) + remove_dma_window(np, ddw_avail, win); + + if (!remove_prop) + return; -delprop: - if (remove_prop) - ret = of_remove_property(np, win64); + ret = of_remove_property(np, win); if (ret) pr_warn("%pOF: failed to remove direct window property: %d\n", np, ret); -- 2.25.4
[PATCH v3 2/6] powerpc/pseries/iommu: Update call to ibm,query-pe-dma-windows
>From LoPAR level 2.8, "ibm,ddw-extensions" index 3 can make the number of outputs from "ibm,query-pe-dma-windows" go from 5 to 6. This change of output size is meant to expand the address size of largest_available_block PE TCE from 32-bit to 64-bit, which ends up shifting page_size and migration_capable. This ends up requiring the update of ddw_query_response->largest_available_block from u32 to u64, and manually assigning the values from the buffer into this struct, according to output size. Also, a routine was created for helping reading the ddw extensions as suggested by LoPAR: First reading the size of the extension array from index 0, checking if the property exists, and then returning it's value. Signed-off-by: Leonardo Bras --- arch/powerpc/platforms/pseries/iommu.c | 91 +++--- 1 file changed, 81 insertions(+), 10 deletions(-) diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c index ac0d6376bdad..1a933c4e8bba 100644 --- a/arch/powerpc/platforms/pseries/iommu.c +++ b/arch/powerpc/platforms/pseries/iommu.c @@ -47,6 +47,12 @@ enum { DDW_APPLICABLE_SIZE }; +enum { + DDW_EXT_SIZE = 0, + DDW_EXT_RESET_DMA_WIN = 1, + DDW_EXT_QUERY_OUT_SIZE = 2 +}; + static struct iommu_table_group *iommu_pseries_alloc_group(int node) { struct iommu_table_group *table_group; @@ -342,7 +348,7 @@ struct direct_window { /* Dynamic DMA Window support */ struct ddw_query_response { u32 windows_available; - u32 largest_available_block; + u64 largest_available_block; u32 page_size; u32 migration_capable; }; @@ -877,14 +883,62 @@ static int find_existing_ddw_windows(void) } machine_arch_initcall(pseries, find_existing_ddw_windows); +/** + * ddw_read_ext - Get the value of an DDW extension + * @np:device node from which the extension value is to be read. + * @extnum:index number of the extension. + * @value: pointer to return value, modified when extension is available. 
+ * + * Checks if "ibm,ddw-extensions" exists for this node, and get the value + * on index 'extnum'. + * It can be used only to check if a property exists, passing value == NULL. + * + * Returns: + * 0 if extension successfully read + * -EINVAL if the "ibm,ddw-extensions" does not exist, + * -ENODATA if "ibm,ddw-extensions" does not have a value, and + * -EOVERFLOW if "ibm,ddw-extensions" does not contain this extension. + */ +static inline int ddw_read_ext(const struct device_node *np, int extnum, + u32 *value) +{ + static const char propname[] = "ibm,ddw-extensions"; + u32 count; + int ret; + + ret = of_property_read_u32_index(np, propname, DDW_EXT_SIZE, ); + if (ret) + return ret; + + if (count < extnum) + return -EOVERFLOW; + + if (!value) + value = + + return of_property_read_u32_index(np, propname, extnum, value); +} + static int query_ddw(struct pci_dev *dev, const u32 *ddw_avail, - struct ddw_query_response *query) +struct ddw_query_response *query, +struct device_node *parent) { struct device_node *dn; struct pci_dn *pdn; - u32 cfg_addr; + u32 cfg_addr, ext_query, query_out[5]; u64 buid; - int ret; + int ret, out_sz; + + /* +* From LoPAR level 2.8, "ibm,ddw-extensions" index 3 can rule how many +* output parameters ibm,query-pe-dma-windows will have, ranging from +* 5 to 6. +*/ + ret = ddw_read_ext(parent, DDW_EXT_QUERY_OUT_SIZE, _query); + if (!ret && ext_query == 1) + out_sz = 6; + else + out_sz = 5; /* * Get the config address and phb buid of the PE window. 
@@ -897,11 +951,28 @@ static int query_ddw(struct pci_dev *dev, const u32 *ddw_avail, buid = pdn->phb->buid; cfg_addr = ((pdn->busno << 16) | (pdn->devfn << 8)); - ret = rtas_call(ddw_avail[DDW_QUERY_PE_DMA_WIN], 3, 5, (u32 *)query, + ret = rtas_call(ddw_avail[DDW_QUERY_PE_DMA_WIN], 3, out_sz, query_out, cfg_addr, BUID_HI(buid), BUID_LO(buid)); - dev_info(>dev, "ibm,query-pe-dma-windows(%x) %x %x %x" - " returned %d\n", ddw_avail[DDW_QUERY_PE_DMA_WIN], cfg_addr, -BUID_HI(buid), BUID_LO(buid), ret); + dev_info(>dev, "ibm,query-pe-dma-windows(%x) %x %x %x returned %d\n", +ddw_avail[DDW_QUERY_PE_DMA_WIN], cfg_addr, BUID_HI(buid), +BUID_LO(buid), ret); + + switch (out_sz) { + case 5: + query->windows_available = query_out[0]; + query->largest_available_block = query_out[1]; + query->page_size = query_out[2]; + query->migration_capable = query_out[3]; + break; + case 6: +
[PATCH v3 1/6] powerpc/pseries/iommu: Create defines for operations in ibm,ddw-applicable
Create defines to help handling ibm,ddw-applicable values, avoiding
confusion about the index of given operations.

Signed-off-by: Leonardo Bras
---
 arch/powerpc/platforms/pseries/iommu.c | 43 ++++++++++++++++++++++++++-----------------
 1 file changed, 26 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 6d47b4a3ce39..ac0d6376bdad 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -39,6 +39,14 @@
 #include "pseries.h"
 
+enum {
+	DDW_QUERY_PE_DMA_WIN  = 0,
+	DDW_CREATE_PE_DMA_WIN = 1,
+	DDW_REMOVE_PE_DMA_WIN = 2,
+
+	DDW_APPLICABLE_SIZE
+};
+
 static struct iommu_table_group *iommu_pseries_alloc_group(int node)
 {
 	struct iommu_table_group *table_group;
@@ -771,12 +779,12 @@ static void remove_ddw(struct device_node *np, bool remove_prop)
 {
 	struct dynamic_dma_window_prop *dwp;
 	struct property *win64;
-	u32 ddw_avail[3];
+	u32 ddw_avail[DDW_APPLICABLE_SIZE];
 	u64 liobn;
 	int ret = 0;
 
 	ret = of_property_read_u32_array(np, "ibm,ddw-applicable",
-					 &ddw_avail[0], 3);
+					 &ddw_avail[0], DDW_APPLICABLE_SIZE);
 
 	win64 = of_find_property(np, DIRECT64_PROPNAME, NULL);
 	if (!win64)
@@ -798,15 +806,15 @@ static void remove_ddw(struct device_node *np, bool remove_prop)
 
 	pr_debug("%pOF successfully cleared tces in window.\n", np);
 
-	ret = rtas_call(ddw_avail[2], 1, 1, NULL, liobn);
+	ret = rtas_call(ddw_avail[DDW_REMOVE_PE_DMA_WIN], 1, 1, NULL, liobn);
 	if (ret)
 		pr_warn("%pOF: failed to remove direct window: rtas returned "
 			"%d to ibm,remove-pe-dma-window(%x) %llx\n",
-			np, ret, ddw_avail[2], liobn);
+			np, ret, ddw_avail[DDW_REMOVE_PE_DMA_WIN], liobn);
 	else
 		pr_debug("%pOF: successfully removed direct window: rtas returned "
 			"%d to ibm,remove-pe-dma-window(%x) %llx\n",
-			np, ret, ddw_avail[2], liobn);
+			np, ret, ddw_avail[DDW_REMOVE_PE_DMA_WIN], liobn);
 
 delprop:
 	if (remove_prop)
@@ -889,11 +897,11 @@ static int query_ddw(struct pci_dev *dev, const u32 *ddw_avail,
 	buid = pdn->phb->buid;
 	cfg_addr = ((pdn->busno << 16) |
		    (pdn->devfn << 8));
 
-	ret = rtas_call(ddw_avail[0], 3, 5, (u32 *)query,
-			cfg_addr, BUID_HI(buid), BUID_LO(buid));
+	ret = rtas_call(ddw_avail[DDW_QUERY_PE_DMA_WIN], 3, 5, (u32 *)query,
+			cfg_addr, BUID_HI(buid), BUID_LO(buid));
 	dev_info(&dev->dev, "ibm,query-pe-dma-windows(%x) %x %x %x"
-		 " returned %d\n", ddw_avail[0], cfg_addr, BUID_HI(buid),
-		 BUID_LO(buid), ret);
+		 " returned %d\n", ddw_avail[DDW_QUERY_PE_DMA_WIN], cfg_addr,
+		 BUID_HI(buid), BUID_LO(buid), ret);
 
 	return ret;
 }
@@ -920,15 +928,16 @@ static int create_ddw(struct pci_dev *dev, const u32 *ddw_avail,
 
 	do {
 		/* extra outputs are LIOBN and dma-addr (hi, lo) */
-		ret = rtas_call(ddw_avail[1], 5, 4, (u32 *)create,
-				cfg_addr, BUID_HI(buid), BUID_LO(buid),
-				page_shift, window_shift);
+		ret = rtas_call(ddw_avail[DDW_CREATE_PE_DMA_WIN], 5, 4,
+				(u32 *)create, cfg_addr, BUID_HI(buid),
+				BUID_LO(buid), page_shift, window_shift);
 	} while (rtas_busy_delay(ret));
 	dev_info(&dev->dev,
 		 "ibm,create-pe-dma-window(%x) %x %x %x %x %x returned %d "
-		 "(liobn = 0x%x starting addr = %x %x)\n", ddw_avail[1],
-		 cfg_addr, BUID_HI(buid), BUID_LO(buid), page_shift,
-		 window_shift, ret, create->liobn, create->addr_hi, create->addr_lo);
+		 "(liobn = 0x%x starting addr = %x %x)\n",
+		 ddw_avail[DDW_CREATE_PE_DMA_WIN], cfg_addr, BUID_HI(buid),
+		 BUID_LO(buid), page_shift, window_shift, ret, create->liobn,
+		 create->addr_hi, create->addr_lo);
 
 	return ret;
 }
@@ -996,7 +1005,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 	int page_shift;
 	u64 dma_addr, max_addr;
 	struct device_node *dn;
-	u32 ddw_avail[3];
+	u32 ddw_avail[DDW_APPLICABLE_SIZE];
 	struct direct_window *window;
 	struct property *win64;
 	struct dynamic_dma_window_prop *ddwprop;
@@ -1029,7 +1038,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 	 * the property is actually in the parent, not the PE
 	 */
 	ret = of_property_read_u32_array(pdn, "ibm,ddw-applicable",
-					 &ddw_avail[0], 3);
+					 &ddw_avail[0], DDW_APPLICABLE_SIZE);
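The point of patch #1 is that "ibm,ddw-applicable" carries three RTAS tokens in a fixed order (query, create, remove), and indexing them with magic numbers 0/1/2 is error-prone. A minimal userspace sketch of the same idiom, with hypothetical token values standing in for what would be read from the device tree:

```c
#include <assert.h>
#include <stdint.h>

/* The defines patch #1 introduces for "ibm,ddw-applicable" indices. */
enum {
	DDW_QUERY_PE_DMA_WIN  = 0,
	DDW_CREATE_PE_DMA_WIN = 1,
	DDW_REMOVE_PE_DMA_WIN = 2,

	DDW_APPLICABLE_SIZE	/* doubles as the array length */
};

/* Pick the RTAS token for one operation out of the property's values. */
static uint32_t ddw_token(const uint32_t *ddw_avail, int op)
{
	return ddw_avail[op];
}
```

The trailing enumerator gives the array size for free, so a fourth operation added to the property would only require one new enum line.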
[PATCH v3 0/6] Remove default DMA window before creating DDW
There are some devices in which a hypervisor may only allow one DMA window
to exist at a time, and in those cases a DDW is never created for them,
since the default DMA window keeps using this resource.

LoPAR recommends this procedure:
1. Remove the default DMA window,
2. Query for which configs the DDW can be created,
3. Create a DDW.

Patch #1:
Create defines for outputs of ibm,ddw-applicable, so it's easier to
identify them.

Patch #2:
- After LoPAR level 2.8, there is an extension that can make
  ibm,query-pe-dma-windows have 6 outputs instead of 5. This changes the
  order of the outputs, and that can cause some trouble.
- query_ddw() was updated to check how many outputs
  ibm,query-pe-dma-windows is supposed to have, update the rtas_call()
  and deal correctly with the outputs in both cases.
- This patch looks somewhat unrelated to the series, but it can avoid
  future problems on DDW creation.

Patch #3 moves the window-removing code from remove_ddw() to
remove_dma_window(), creating a way to delete any DMA window, so it can
be used to delete the default DMA window.

Patch #4 makes use of remove_dma_window() from patch #3 to remove the
default DMA window before query_ddw(). It also implements a new rtas call
to recover the default DMA window, in case anything fails after it was
removed and a DDW couldn't be created.

Patch #5:
Instead of destroying the created DDW if it doesn't map the whole
partition, make use of it instead of the default DMA window, as it
improves performance.

Patch #6:
Does some renaming of 'direct window' to 'dma window', given the DDW
created can now also be used in indirect mapping if direct mapping is
not available.

All patches were tested in an LPAR with an Ethernet VF:
4005:01:00.0 Ethernet controller: Mellanox Technologies MT27700 Family
[ConnectX-4 Virtual Function]

Patch #5 was tested with a 64GB DDW which did not map the whole
partition (128G).
Performance improvement noticed by using the DDW instead of the default
DMA window:

64 thread write throughput: +203.0%
64 thread read throughput:  +17.5%
1 thread write throughput:  +20.5%
1 thread read throughput:   +3.43%
Average write latency:      -23.0%
Average read latency:       -2.26%

---
Changes since v2:
- Change the way ibm,ddw-extensions is accessed, using a proper function
  instead of doing this inline every time it's used.
- Remove previous patch #6, as it doesn't look like it would be useful.
- Add new patch, for changing names from direct* to dma*, as indirect
  mapping can be used from now on.
- Fix some typos, correct some define usage.
- v2 link: http://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=185433=%2A=both

Changes since v1:
- Add defines for ibm,ddw-applicable and ibm,ddw-extensions outputs
- Merge aux function query_ddw_out_sz() into query_ddw()
- Merge reset_dma_window() patch (prev. #2) into remove default DMA
  window patch (#4).
- Keep device_node *np name instead of using pdn in remove_*()
- Rename 'device_node *pdn' into 'parent' in new functions
- Rename dfl_win to default_win
- Only remove the default DMA window if there is no window available in
  first query.
- Check if default DMA window can be restored before removing it.
- Fix 'uninitialized use' (found by travis mpe:ci-test)
- New patches #5 and #6
- v1 link: http://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=184420=%2A=both

Special thanks to Alexey Kardashevskiy and Oliver O'Halloran for the
feedback provided!
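The three-step procedure the cover letter quotes from LoPAR, plus the recovery path patch #4 adds, can be sketched as a small userspace state machine. Everything below is a mock: the functions stand in for the RTAS calls and the actual patch has more nuance (e.g. it only removes the default window when the first query reports none available, and checks restorability up front).

```c
#include <assert.h>
#include <stdbool.h>

/* Mock state standing in for hypervisor-side window bookkeeping. */
static bool default_window_present = true;
static int  windows_available = 1;	/* reported by the mock query */
static bool create_should_fail;

static void remove_default_dma_window(void) { default_window_present = false; }
static void reset_default_dma_window(void)  { default_window_present = true; }
static int  query_windows_available(void)   { return windows_available; }
static int  create_ddw(void)                { return create_should_fail ? -1 : 0; }

/*
 * Sketch of the flow: remove the default window (step 1), query (step 2),
 * create the DDW (step 3); on any failure, put the default window back.
 * Returns 0 with a DDW in place, -1 after restoring the default window.
 */
static int enable_ddw_flow(void)
{
	remove_default_dma_window();
	if (query_windows_available() > 0 && create_ddw() == 0)
		return 0;

	reset_default_dma_window();
	return -1;
}
```

The recovery branch is the interesting part: without it, a failed create_ddw() would leave the device with no usable DMA window at all.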
Leonardo Bras (6):
  powerpc/pseries/iommu: Create defines for operations in
    ibm,ddw-applicable
  powerpc/pseries/iommu: Update call to ibm,query-pe-dma-windows
  powerpc/pseries/iommu: Move window-removing part of remove_ddw into
    remove_dma_window
  powerpc/pseries/iommu: Remove default DMA window before creating DDW
  powerpc/pseries/iommu: Make use of DDW even if it does not map the
    partition
  powerpc/pseries/iommu: Rename "direct window" to "dma window"

 arch/powerpc/platforms/pseries/iommu.c | 379 ++++++++++++++++++++-----------
 1 file changed, 269 insertions(+), 110 deletions(-)

-- 
2.25.4