Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
* Michal Hocko [2020-07-01 14:21:10]:

> > >> 2. Also the existence of the dummy node leads to inconsistent
> > >> information: the number of online nodes is inconsistent with the
> > >> information in the device-tree and resource dump.
> > >>
> > >> 3. When the dummy node is present, single-node non-NUMA systems end
> > >> up showing up as NUMA systems and numa_balancing gets enabled. This
> > >> means we take the hit from the unnecessary NUMA hinting faults.
> > >
> > > I have to say that I dislike the node online/offline state and
> > > directly exporting that to the userspace. Users should only care
> > > whether the node has memory/cpus. NUMA nodes can be online without
> > > any memory. Just offline all the present memory blocks but do not
> > > physically hot remove them and you are in the same situation. If
> > > users are confused by an output of tools like numactl -H then those
> > > could be updated and hide nodes without any memory&cpus.
> > >
> > > The autonuma problem sounds interesting but again this patch doesn't
> > > really solve the underlying problem because I strongly suspect that
> > > the problem is still there when a numa node gets all its memory
> > > offlined as mentioned above.

I would really appreciate feedback on these two as well.

1. It's not just numactl that needs to be fixed but all tools/utilities
that depend on /sys/devices/system/node/online. Are we saying not to
rely on the output given by the kernel but to do further verification?
Also, how would user space differentiate between the case where the
kernel missed marking a node as offline and the case where the memory
was offlined on a cpuless node but the node wasn't marked offline?

2. Regarding autonuma, the case of offline memory is user/admin driven,
so if there is a performance hit, it's something driven by user/admin
actions. Also, how often do we see users offline the complete memory of
a cpuless node on a 2-node system?
> > [0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0
> > [0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0
> > [0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0
> > [0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0
> > [0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x-0x0009]
> > [0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x0010-0xbfff]
> > [0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x1-0x13fff]
>
> This begs a question whether ppc can do the same thing?

Certainly ppc can be made to adapt to this situation, but that would be
a workaround. Do we have a reason why we think node 0 is unique and
special? If yes, can we document it so that in future people also know
why we consider node 0 to be special? I do understand the *fear of the
unknown*, but when we are unable to theoretically or practically come
up with a case, it may be better to hit the situation to understand
what that unknown is.

> I would swear that we've had x86 system with node 0 but I cannot really
> find it and it is possible that it was not x86 after all...

--
Thanks and Regards
Srikar Dronamraju
Re: [PATCH v2 02/10] KVM: PPC: Book3S HV: Save/restore new PMU registers
> On 01-Jul-2020, at 4:41 PM, Paul Mackerras wrote:
>
> On Wed, Jul 01, 2020 at 05:20:54AM -0400, Athira Rajeev wrote:
>> PowerISA v3.1 has added new performance monitoring unit (PMU)
>> special purpose registers (SPRs). They are
>>
>> Monitor Mode Control Register 3 (MMCR3)
>> Sampled Instruction Event Register A (SIER2)
>> Sampled Instruction Event Register B (SIER3)
>>
>> Patch adds support to save/restore these new
>> SPRs while entering/exiting guest.
>
> This mostly looks reasonable, at a quick glance at least, but I am
> puzzled by two of the changes you are making. See below.
>
>> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
>> index 6bf66649..c265800 100644
>> --- a/arch/powerpc/kvm/book3s_hv.c
>> +++ b/arch/powerpc/kvm/book3s_hv.c
>> @@ -1698,7 +1698,8 @@ static int kvmppc_get_one_reg_hv(struct kvm_vcpu *vcpu, u64 id,
>>  		*val = get_reg_val(id, vcpu->arch.sdar);
>>  		break;
>>  	case KVM_REG_PPC_SIER:
>> -		*val = get_reg_val(id, vcpu->arch.sier);
>> +		i = id - KVM_REG_PPC_SIER;
>> +		*val = get_reg_val(id, vcpu->arch.sier[i]);
>
> This is inside a switch (id) statement, so here we know that id is
> KVM_REG_PPC_SIER. In other words i will always be zero, so what is
> the point of doing the subtraction?
>
>>  		break;
>>  	case KVM_REG_PPC_IAMR:
>>  		*val = get_reg_val(id, vcpu->arch.iamr);
>> @@ -1919,7 +1920,8 @@ static int kvmppc_set_one_reg_hv(struct kvm_vcpu *vcpu, u64 id,
>>  		vcpu->arch.sdar = set_reg_val(id, *val);
>>  		break;
>>  	case KVM_REG_PPC_SIER:
>> -		vcpu->arch.sier = set_reg_val(id, *val);
>> +		i = id - KVM_REG_PPC_SIER;
>> +		vcpu->arch.sier[i] = set_reg_val(id, *val);
>
> Same comment here.

Hi Paul,

Thanks for reviewing the patch. Yes, true that currently `i` will
always be zero since the only case is KVM_REG_PPC_SIER. I have kept the
subtraction here anticipating the addition of new registers to the
switch case, e.g.:

	case KVM_REG_PPC_SIER ... KVM_REG_PPC_SIER3:

> I think that new defines for the new registers will need to be added
> to arch/powerpc/include/uapi/asm/kvm.h and
> Documentation/virt/kvm/api.rst, and then new cases will need to be
> added to these switch statements.

Yes, the new registers are not yet added to kvm.h. I will address these
comments and include changes for arch/powerpc/include/uapi/asm/kvm.h
and Documentation/virt/kvm/api.rst in the next version.

> By the way, please cc kvm-...@vger.kernel.org and k...@vger.kernel.org
> on KVM patches.

Sure, will include the KVM mailing lists in the next version.

Thanks
Athira

> Paul.
Re: [PATCH 12/20] block: remove the request_queue argument from blk_queue_split
On Wed, Jul 1, 2020 at 2:02 AM Christoph Hellwig wrote:
>
> The queue can be trivially derived from the bio, so pass one less
> argument.
>
> Signed-off-by: Christoph Hellwig
> ---
[...]
> drivers/md/md.c | 2 +-

For md.c:

Acked-by: Song Liu
[v2 PATCH] crypto: af_alg - Fix regression on empty requests
On Tue, Jun 30, 2020 at 02:18:11PM +0530, Naresh Kamboju wrote:
>
> Since we are on this subject,
> LTP af_alg02 test case fails on stable 4.9 and stable 4.4
> This is not a regression because the test case has been failing from
> the beginning.
>
> Is this test case expected to fail on stable 4.9 and 4.4 ?
> or any chance to fix this on these older branches ?
>
> Test output:
> af_alg02.c:52: BROK: Timed out while reading from request socket.
>
> ref:
> https://qa-reports.linaro.org/lkft/linux-stable-rc-4.9-oe/build/v4.9.228-191-g082e807235d7/testrun/2884917/suite/ltp-crypto-tests/test/af_alg02/history/
> https://qa-reports.linaro.org/lkft/linux-stable-rc-4.9-oe/build/v4.9.228-191-g082e807235d7/testrun/2884606/suite/ltp-crypto-tests/test/af_alg02/log

Actually this test really is broken. Even though empty requests are
legal, they should never be done with no write(2) at all, because that
fundamentally breaks the use of a blocking read(2) to wait for more
data. Granted, this has been broken since 2017, but I'm not going to
reintroduce that behaviour just because of a broken test case.

So please either remove af_alg02 or fix it by adding a control message
through sendmsg(2).

Thanks,

---8<---
Some user-space programs rely on crypto requests that have no control
metadata. This broke when a check was added to require the presence of
control metadata with the ctx->init flag. This patch fixes the
regression by setting ctx->init as long as one sendmsg(2) has been
made, with or without a control message.
Reported-by: Sachin Sant
Reported-by: Naresh Kamboju
Fixes: f3c802a1f300 ("crypto: algif_aead - Only wake up when...")
Signed-off-by: Herbert Xu

diff --git a/crypto/af_alg.c b/crypto/af_alg.c
index 9fcb91ea10c41..5882ed46f1adb 100644
--- a/crypto/af_alg.c
+++ b/crypto/af_alg.c
@@ -851,6 +851,7 @@ int af_alg_sendmsg(struct socket *sock, struct msghdr *msg, size_t size,
 		err = -EINVAL;
 		goto unlock;
 	}
+	ctx->init = true;

 	if (init) {
 		ctx->enc = enc;
@@ -858,7 +859,6 @@ int af_alg_sendmsg(struct socket *sock, struct msghdr *msg, size_t size,
 			memcpy(ctx->iv, con.iv->iv, ivsize);

 		ctx->aead_assoclen = con.aead_assoclen;
-		ctx->init = true;
 	}

 	while (size) {
--
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
[PATCH v8 4/4] powerpc: Book3S 64-bit "heavyweight" KASAN support
Implement a limited form of KASAN for Book3S 64-bit machines running
under the Radix MMU:

- Set aside the last 1/8th of the first contiguous block of physical
  memory to provide writable shadow for the linear map. For annoying
  reasons documented below, the memory size must be specified at
  compile time.

- Enable the compiler instrumentation to check addresses and maintain
  the shadow region. (This is the guts of KASAN which we can easily
  reuse.)

- Require kasan-vmalloc support to handle modules and anything else in
  vmalloc space.

- KASAN needs to be able to validate all pointer accesses, but we can't
  back all kernel addresses with real memory - only linear map and
  vmalloc. On boot, set up a single page of read-only shadow that marks
  all these accesses as valid.

- Make our stack-walking code KASAN-safe by using READ_ONCE_NOCHECK -
  generic code, arm64, s390 and x86 all do this for similar sorts of
  reasons: when unwinding a stack, we might touch memory that KASAN has
  marked as being out-of-bounds. In our case we often get this when
  checking for an exception frame because we're checking an arbitrary
  offset into the stack frame. See commit 20955746320e ("s390/kasan:
  avoid false positives during stack unwind"), commit bcaf669b4bdb
  ("arm64: disable kasan when accessing frame->fp in unwind_frame"),
  commit 91e08ab0c851 ("x86/dumpstack: Prevent KASAN false positive
  warnings") and commit 6e22c8366416 ("tracing, kasan: Silence Kasan
  warning in check_stack of stack_tracer").

- Document KASAN in both generic and powerpc docs.

Background
----------

KASAN support on Book3S is a bit tricky to get right:

- It would be good to support inline instrumentation so as to be able
  to catch stack issues that cannot be caught with outline mode.
- Inline instrumentation requires a fixed offset.
- Book3S runs code in real mode after booting. Most notably a lot of
  KVM runs in real mode, and it would be good to be able to instrument
  it.
- Because code runs in real mode after boot, the offset has to point to
  valid memory both in and out of real mode.

[ppc64 mm note: The kernel installs a linear mapping at effective
address c000... onward. This is a one-to-one mapping with physical
memory from ... onward. Because of how memory accesses work on powerpc
64-bit Book3S, a kernel pointer in the linear map accesses the same
memory both with translations on (accessing as an 'effective address'),
and with translations off (accessing as a 'real address'). This works
in both guests and the hypervisor. For more details, see s5.7 of Book
III of version 3 of the ISA, in particular the Storage Control
Overview, s5.7.3, and s5.7.5 - noting that this KASAN implementation
currently only supports Radix.]

One approach is just to give up on inline instrumentation. This way all
checks can be delayed until after everything is set up correctly, and
the address-to-shadow calculations can be overridden. However, the
features and speed boost provided by inline instrumentation are worth
trying to do better.

If _at compile time_ it is known how much contiguous physical memory a
system has, the top 1/8th of the first block of physical memory can be
set aside for the shadow. This is a big hammer and comes with 3 big
consequences:

- there's no nice way to handle physically discontiguous memory, so
  only the first physical memory block can be used.
- kernels will simply fail to boot on machines with less memory than
  specified when compiling.
- kernels running on machines with more memory than specified when
  compiling will simply ignore the extra memory.

Despite the limitations, it can still find bugs, e.g.
http://patchwork.ozlabs.org/patch/1103775/

At the moment, this physical memory limit must be set _even for outline
mode_. This may be changed in a later series - a different
implementation could be added for outline mode that dynamically
allocates shadow at a fixed offset. For example, see
https://patchwork.ozlabs.org/patch/795211/

Suggested-by: Michael Ellerman
Cc: Balbir Singh # ppc64 out-of-line radix version
Cc: Christophe Leroy # ppc32 version
Reviewed-by: # focussed mainly on Documentation and things impacting PPC32
Signed-off-by: Daniel Axtens
---
 Documentation/dev-tools/kasan.rst            |   9 +-
 Documentation/powerpc/kasan.txt              | 112 ++-
 arch/powerpc/Kconfig                         |   3 +-
 arch/powerpc/Kconfig.debug                   |  23 +++-
 arch/powerpc/Makefile                        |  11 ++
 arch/powerpc/include/asm/book3s/64/hash.h    |   4 +
 arch/powerpc/include/asm/book3s/64/pgtable.h |   7 ++
 arch/powerpc/include/asm/book3s/64/radix.h   |   5 +
 arch/powerpc/include/asm/kasan.h             |  11 +-
 arch/powerpc/kernel/Makefile                 |   2 +
 arch/powerpc/kernel/process.c                |  16 +--
 arch/powerpc/ke
[PATCH v8 3/4] powerpc/mm/kasan: rename kasan_init_32.c to init_32.c
kasan is already implied by the directory name, we don't need to
repeat it.

Suggested-by: Christophe Leroy
Signed-off-by: Daniel Axtens
---
 arch/powerpc/mm/kasan/Makefile                       | 2 +-
 arch/powerpc/mm/kasan/{kasan_init_32.c => init_32.c} | 0
 2 files changed, 1 insertion(+), 1 deletion(-)
 rename arch/powerpc/mm/kasan/{kasan_init_32.c => init_32.c} (100%)

diff --git a/arch/powerpc/mm/kasan/Makefile b/arch/powerpc/mm/kasan/Makefile
index bb1a5408b86b..42fb628a44fd 100644
--- a/arch/powerpc/mm/kasan/Makefile
+++ b/arch/powerpc/mm/kasan/Makefile
@@ -2,6 +2,6 @@
 KASAN_SANITIZE := n

-obj-$(CONFIG_PPC32) += kasan_init_32.o
+obj-$(CONFIG_PPC32) += init_32.o
 obj-$(CONFIG_PPC_8xx) += 8xx.o
 obj-$(CONFIG_PPC_BOOK3S_32) += book3s_32.o

diff --git a/arch/powerpc/mm/kasan/kasan_init_32.c b/arch/powerpc/mm/kasan/init_32.c
similarity index 100%
rename from arch/powerpc/mm/kasan/kasan_init_32.c
rename to arch/powerpc/mm/kasan/init_32.c
--
2.25.1
[PATCH v8 2/4] kasan: Document support on 32-bit powerpc
KASAN is supported on 32-bit powerpc and the docs should reflect this.

Document s390 support while we're at it.

Suggested-by: Christophe Leroy
Reviewed-by: Christophe Leroy
Signed-off-by: Daniel Axtens
---
 Documentation/dev-tools/kasan.rst |  7 +++++--
 Documentation/powerpc/kasan.txt   | 12 ++++++++++++
 2 files changed, 17 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/powerpc/kasan.txt

diff --git a/Documentation/dev-tools/kasan.rst b/Documentation/dev-tools/kasan.rst
index c652d740735d..554cbee1d240 100644
--- a/Documentation/dev-tools/kasan.rst
+++ b/Documentation/dev-tools/kasan.rst
@@ -22,7 +22,8 @@ global variables yet.
 Tag-based KASAN is only supported in Clang and requires version 7.0.0 or later.

 Currently generic KASAN is supported for the x86_64, arm64, xtensa, s390 and
-riscv architectures, and tag-based KASAN is supported only for arm64.
+riscv architectures. It is also supported on 32-bit powerpc kernels. Tag-based
+KASAN is supported only on arm64.

 Usage
 -----
@@ -255,7 +256,9 @@ CONFIG_KASAN_VMALLOC

 With ``CONFIG_KASAN_VMALLOC``, KASAN can cover vmalloc space at the
-cost of greater memory usage. Currently this is only supported on x86.
+cost of greater memory usage. Currently this supported on x86, s390
+and 32-bit powerpc. It is optional, except on 32-bit powerpc kernels
+with module support, where it is required.

 This works by hooking into vmalloc and vmap, and dynamically
 allocating real shadow memory to back the mappings.

diff --git a/Documentation/powerpc/kasan.txt b/Documentation/powerpc/kasan.txt
new file mode 100644
index ..26bb0e8bb18c
--- /dev/null
+++ b/Documentation/powerpc/kasan.txt
@@ -0,0 +1,12 @@
+KASAN is supported on powerpc on 32-bit only.
+
+32 bit support
+==============
+
+KASAN is supported on both hash and nohash MMUs on 32-bit.
+
+The shadow area sits at the top of the kernel virtual memory space above the
+fixmap area and occupies one eighth of the total kernel virtual memory space.
+
+Instrumentation of the vmalloc area is optional, unless built with modules,
+in which case it is required.
--
2.25.1
[PATCH v8 0/4] KASAN for powerpc64 radix
Building on the work of Christophe, Aneesh and Balbir, I've ported
KASAN to 64-bit Book3S kernels running on the Radix MMU.

This provides full inline instrumentation on radix, but does require
that you be able to specify the amount of physically contiguous memory
on the system at compile time. More details in patch 4.

v8 is just a rebase of v7 on a more recent powerpc/merge and a fixup of
a whitespace error.

Module globals still don't work, but that's due to some 'clever'
renaming of a section that the powerpc module loading code does to
avoid more complicated relocations/trampolines rather than anything to
do with KASAN.

Daniel Axtens (4):
  kasan: define and use MAX_PTRS_PER_* for early shadow tables
  kasan: Document support on 32-bit powerpc
  powerpc/mm/kasan: rename kasan_init_32.c to init_32.c
  powerpc: Book3S 64-bit "heavyweight" KASAN support

 Documentation/dev-tools/kasan.rst             |   8 +-
 Documentation/powerpc/kasan.txt               | 122 ++
 arch/powerpc/Kconfig                          |   3 +-
 arch/powerpc/Kconfig.debug                    |  23 +++-
 arch/powerpc/Makefile                         |  11 ++
 arch/powerpc/include/asm/book3s/64/hash.h     |   4 +
 arch/powerpc/include/asm/book3s/64/pgtable.h  |   7 +
 arch/powerpc/include/asm/book3s/64/radix.h    |   5 +
 arch/powerpc/include/asm/kasan.h              |  11 +-
 arch/powerpc/kernel/Makefile                  |   2 +
 arch/powerpc/kernel/process.c                 |  16 ++-
 arch/powerpc/kernel/prom.c                    |  76 ++-
 arch/powerpc/mm/kasan/Makefile                |   3 +-
 .../mm/kasan/{kasan_init_32.c => init_32.c}   |   0
 arch/powerpc/mm/kasan/init_book3s_64.c        |  73 +++
 arch/powerpc/mm/ptdump/ptdump.c               |  10 +-
 arch/powerpc/platforms/Kconfig.cputype        |   1 +
 include/linux/kasan.h                         |  18 ++-
 mm/kasan/init.c                               |   6 +-
 19 files changed, 377 insertions(+), 22 deletions(-)
 create mode 100644 Documentation/powerpc/kasan.txt
 rename arch/powerpc/mm/kasan/{kasan_init_32.c => init_32.c} (100%)
 create mode 100644 arch/powerpc/mm/kasan/init_book3s_64.c
--
2.25.1
[PATCH v8 1/4] kasan: define and use MAX_PTRS_PER_* for early shadow tables
powerpc has a variable number of PTRS_PER_*, set at runtime based on
the MMU that the kernel is booted under.

This means the PTRS_PER_* are no longer constants, and therefore
breaks the build.

Define default MAX_PTRS_PER_*s in the same style as MAX_PTRS_PER_P4D.
As KASAN is the only user at the moment, just define them in the kasan
header, and have them default to PTRS_PER_* unless overridden in arch
code.

Suggested-by: Christophe Leroy
Suggested-by: Balbir Singh
Reviewed-by: Christophe Leroy
Reviewed-by: Balbir Singh
Signed-off-by: Daniel Axtens
---
 include/linux/kasan.h | 18 +++++++++++++++---
 mm/kasan/init.c       |  6 +++---
 2 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/include/linux/kasan.h b/include/linux/kasan.h
index 82522e996c76..b6f94952333b 100644
--- a/include/linux/kasan.h
+++ b/include/linux/kasan.h
@@ -14,10 +14,22 @@ struct task_struct;
 #include
 #include

+#ifndef MAX_PTRS_PER_PTE
+#define MAX_PTRS_PER_PTE PTRS_PER_PTE
+#endif
+
+#ifndef MAX_PTRS_PER_PMD
+#define MAX_PTRS_PER_PMD PTRS_PER_PMD
+#endif
+
+#ifndef MAX_PTRS_PER_PUD
+#define MAX_PTRS_PER_PUD PTRS_PER_PUD
+#endif
+
 extern unsigned char kasan_early_shadow_page[PAGE_SIZE];
-extern pte_t kasan_early_shadow_pte[PTRS_PER_PTE];
-extern pmd_t kasan_early_shadow_pmd[PTRS_PER_PMD];
-extern pud_t kasan_early_shadow_pud[PTRS_PER_PUD];
+extern pte_t kasan_early_shadow_pte[MAX_PTRS_PER_PTE];
+extern pmd_t kasan_early_shadow_pmd[MAX_PTRS_PER_PMD];
+extern pud_t kasan_early_shadow_pud[MAX_PTRS_PER_PUD];
 extern p4d_t kasan_early_shadow_p4d[MAX_PTRS_PER_P4D];

 int kasan_populate_early_shadow(const void *shadow_start,

diff --git a/mm/kasan/init.c b/mm/kasan/init.c
index fe6be0be1f76..42bca3d27db8 100644
--- a/mm/kasan/init.c
+++ b/mm/kasan/init.c
@@ -46,7 +46,7 @@ static inline bool kasan_p4d_table(pgd_t pgd)
 }
 #endif
 #if CONFIG_PGTABLE_LEVELS > 3
-pud_t kasan_early_shadow_pud[PTRS_PER_PUD] __page_aligned_bss;
+pud_t kasan_early_shadow_pud[MAX_PTRS_PER_PUD] __page_aligned_bss;
 static inline bool kasan_pud_table(p4d_t p4d)
 {
 	return p4d_page(p4d) == virt_to_page(lm_alias(kasan_early_shadow_pud));
@@ -58,7 +58,7 @@ static inline bool kasan_pud_table(p4d_t p4d)
 }
 #endif
 #if CONFIG_PGTABLE_LEVELS > 2
-pmd_t kasan_early_shadow_pmd[PTRS_PER_PMD] __page_aligned_bss;
+pmd_t kasan_early_shadow_pmd[MAX_PTRS_PER_PMD] __page_aligned_bss;
 static inline bool kasan_pmd_table(pud_t pud)
 {
 	return pud_page(pud) == virt_to_page(lm_alias(kasan_early_shadow_pmd));
@@ -69,7 +69,7 @@ static inline bool kasan_pmd_table(pud_t pud)
 	return false;
 }
 #endif
-pte_t kasan_early_shadow_pte[PTRS_PER_PTE] __page_aligned_bss;
+pte_t kasan_early_shadow_pte[MAX_PTRS_PER_PTE] __page_aligned_bss;
 static inline bool kasan_pte_table(pmd_t pmd)
 {
--
2.25.1
[PATCH AUTOSEL 5.4 23/40] powerpc/kvm/book3s64: Fix kernel crash with nested kvm & DEBUG_VIRTUAL
From: "Aneesh Kumar K.V"

[ Upstream commit c1ed1754f271f6b7acb1bfdc8cfb62220fbed423 ]

With CONFIG_DEBUG_VIRTUAL=y, __pa() checks for addr value and if it's
less than PAGE_OFFSET it leads to a BUG().

  #define __pa(x)
  ({
  	VIRTUAL_BUG_ON((unsigned long)(x) < PAGE_OFFSET);
  	(unsigned long)(x) & 0x0fffUL;
  })

  kernel BUG at arch/powerpc/kvm/book3s_64_mmu_radix.c:43!
  cpu 0x70: Vector: 700 (Program Check) at [c018a2187360]
      pc: c0161b30: __kvmhv_copy_tofrom_guest_radix+0x130/0x1f0
      lr: c0161d5c: kvmhv_copy_from_guest_radix+0x3c/0x80
  ...
  kvmhv_copy_from_guest_radix+0x3c/0x80
  kvmhv_load_from_eaddr+0x48/0xc0
  kvmppc_ld+0x98/0x1e0
  kvmppc_load_last_inst+0x50/0x90
  kvmppc_hv_emulate_mmio+0x288/0x2b0
  kvmppc_book3s_radix_page_fault+0xd8/0x2b0
  kvmppc_book3s_hv_page_fault+0x37c/0x1050
  kvmppc_vcpu_run_hv+0xbb8/0x1080
  kvmppc_vcpu_run+0x34/0x50
  kvm_arch_vcpu_ioctl_run+0x2fc/0x410
  kvm_vcpu_ioctl+0x2b4/0x8f0
  ksys_ioctl+0xf4/0x150
  sys_ioctl+0x28/0x80
  system_call_exception+0x104/0x1d0
  system_call_common+0xe8/0x214

kvmhv_copy_tofrom_guest_radix() uses a NULL value for to/from to
indicate direction of copy. Avoid calling __pa() if the value is NULL
to avoid the BUG().

Signed-off-by: Aneesh Kumar K.V
[mpe: Massage change log a bit to mention CONFIG_DEBUG_VIRTUAL]
Signed-off-by: Michael Ellerman
Link: https://lore.kernel.org/r/20200611120159.680284-1-aneesh.ku...@linux.ibm.com
Signed-off-by: Sasha Levin
---
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 43b56f8f6bebd..da8375437d161 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -38,7 +38,8 @@ unsigned long __kvmhv_copy_tofrom_guest_radix(int lpid, int pid,
 	/* Can't access quadrants 1 or 2 in non-HV mode, call the HV to do it */
 	if (kvmhv_on_pseries())
 		return plpar_hcall_norets(H_COPY_TOFROM_GUEST, lpid, pid, eaddr,
-					  __pa(to), __pa(from), n);
+					  (to != NULL) ? __pa(to): 0,
+					  (from != NULL) ? __pa(from): 0, n);

 	quadrant = 1;
 	if (!pid)
--
2.25.1
[PATCH AUTOSEL 5.4 22/40] ibmvnic: continue to init in CRQ reset returns H_CLOSED
From: Dany Madden

[ Upstream commit 8b40eb73509f5704a0e8cd25de0163876299f1a7 ]

Continue the reset path when partner adapter is not ready or H_CLOSED
is returned from reset crq. This patch allows the CRQ init to proceed
to establish a valid CRQ for traffic to flow after reset.

Signed-off-by: Dany Madden
Signed-off-by: David S. Miller
Signed-off-by: Sasha Levin
---
 drivers/net/ethernet/ibm/ibmvnic.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index 5a42ddeecfe50..143a9722ad11a 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -1865,13 +1865,18 @@ static int do_reset(struct ibmvnic_adapter *adapter,
 		release_sub_crqs(adapter, 1);
 	} else {
 		rc = ibmvnic_reset_crq(adapter);
-		if (!rc)
+		if (rc == H_CLOSED || rc == H_SUCCESS) {
 			rc = vio_enable_interrupts(adapter->vdev);
+			if (rc)
+				netdev_err(adapter->netdev,
+					   "Reset failed to enable interrupts. rc=%d\n",
+					   rc);
+		}
 	}

 	if (rc) {
 		netdev_err(adapter->netdev,
-			   "Couldn't initialize crq. rc=%d\n", rc);
+			   "Reset couldn't initialize crq. rc=%d\n", rc);
 		goto out;
 	}
--
2.25.1
[PATCH AUTOSEL 5.7 29/53] powerpc/kvm/book3s64: Fix kernel crash with nested kvm & DEBUG_VIRTUAL
From: "Aneesh Kumar K.V"

[ Upstream commit c1ed1754f271f6b7acb1bfdc8cfb62220fbed423 ]

With CONFIG_DEBUG_VIRTUAL=y, __pa() checks for addr value and if it's
less than PAGE_OFFSET it leads to a BUG().

  #define __pa(x)
  ({
  	VIRTUAL_BUG_ON((unsigned long)(x) < PAGE_OFFSET);
  	(unsigned long)(x) & 0x0fffUL;
  })

  kernel BUG at arch/powerpc/kvm/book3s_64_mmu_radix.c:43!
  cpu 0x70: Vector: 700 (Program Check) at [c018a2187360]
      pc: c0161b30: __kvmhv_copy_tofrom_guest_radix+0x130/0x1f0
      lr: c0161d5c: kvmhv_copy_from_guest_radix+0x3c/0x80
  ...
  kvmhv_copy_from_guest_radix+0x3c/0x80
  kvmhv_load_from_eaddr+0x48/0xc0
  kvmppc_ld+0x98/0x1e0
  kvmppc_load_last_inst+0x50/0x90
  kvmppc_hv_emulate_mmio+0x288/0x2b0
  kvmppc_book3s_radix_page_fault+0xd8/0x2b0
  kvmppc_book3s_hv_page_fault+0x37c/0x1050
  kvmppc_vcpu_run_hv+0xbb8/0x1080
  kvmppc_vcpu_run+0x34/0x50
  kvm_arch_vcpu_ioctl_run+0x2fc/0x410
  kvm_vcpu_ioctl+0x2b4/0x8f0
  ksys_ioctl+0xf4/0x150
  sys_ioctl+0x28/0x80
  system_call_exception+0x104/0x1d0
  system_call_common+0xe8/0x214

kvmhv_copy_tofrom_guest_radix() uses a NULL value for to/from to
indicate direction of copy. Avoid calling __pa() if the value is NULL
to avoid the BUG().

Signed-off-by: Aneesh Kumar K.V
[mpe: Massage change log a bit to mention CONFIG_DEBUG_VIRTUAL]
Signed-off-by: Michael Ellerman
Link: https://lore.kernel.org/r/20200611120159.680284-1-aneesh.ku...@linux.ibm.com
Signed-off-by: Sasha Levin
---
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index bc6c1aa3d0e92..ef40addd52c65 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -40,7 +40,8 @@ unsigned long __kvmhv_copy_tofrom_guest_radix(int lpid, int pid,
 	/* Can't access quadrants 1 or 2 in non-HV mode, call the HV to do it */
 	if (kvmhv_on_pseries())
 		return plpar_hcall_norets(H_COPY_TOFROM_GUEST, lpid, pid, eaddr,
-					  __pa(to), __pa(from), n);
+					  (to != NULL) ? __pa(to): 0,
+					  (from != NULL) ? __pa(from): 0, n);

 	quadrant = 1;
 	if (!pid)
--
2.25.1
[PATCH AUTOSEL 5.7 28/53] ibmvnic: continue to init in CRQ reset returns H_CLOSED
From: Dany Madden

[ Upstream commit 8b40eb73509f5704a0e8cd25de0163876299f1a7 ]

Continue the reset path when partner adapter is not ready or H_CLOSED
is returned from reset crq. This patch allows the CRQ init to proceed
to establish a valid CRQ for traffic to flow after reset.

Signed-off-by: Dany Madden
Signed-off-by: David S. Miller
Signed-off-by: Sasha Levin
---
 drivers/net/ethernet/ibm/ibmvnic.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index 1b4d04e4474bb..2dbcbdbb0e4ad 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -1958,13 +1958,18 @@ static int do_reset(struct ibmvnic_adapter *adapter,
 		release_sub_crqs(adapter, 1);
 	} else {
 		rc = ibmvnic_reset_crq(adapter);
-		if (!rc)
+		if (rc == H_CLOSED || rc == H_SUCCESS) {
 			rc = vio_enable_interrupts(adapter->vdev);
+			if (rc)
+				netdev_err(adapter->netdev,
+					   "Reset failed to enable interrupts. rc=%d\n",
+					   rc);
+		}
 	}

 	if (rc) {
 		netdev_err(adapter->netdev,
-			   "Couldn't initialize crq. rc=%d\n", rc);
+			   "Reset couldn't initialize crq. rc=%d\n", rc);
 		goto out;
 	}
--
2.25.1
Re: [PATCH v2 5/6] powerpc/pseries/iommu: Make use of DDW even if it does not map the partition
On Thu, 2020-07-02 at 10:31 +1000, Alexey Kardashevskiy wrote:
> > In fact, there are a lot of places in this file where it's called
> > direct window. Should I replace everything?
> > Should it be in a separate patch?
>
> If it looks simple and you write a nice commit log explaining all that
> and why you are not reusing the existing ibm,dma-window property
> for that - sure, do it :)

Nice, I will do that :)

> (to provide a clue what "reset" will reset to? is there any other
> reason?)

That's the main reason here. The way I perceive this, ibm,dma-window
should only point to the default DMA window, which is guaranteed to
always be the same, even if it's destroyed and re-created. So I see no
point in destroying / overwriting it.

On the other hand, I also thought about using a new node name for this
window, but it would be very troublesome and I could see no real gain.

Thanks !
Re: Memory: 880608K/983040K .... 36896K reserved ?
Joakim Tjernlund writes:
> I cannot figure out how the xxxK reserved item works in:
> Memory: 880608K/983040K available (9532K kernel code, 1104K rwdata,
> 3348K rodata, 1088K init, 1201K bss, 36896K reserved ...

It's calculated as:

  (physpages - totalram_pages() - totalcma_pages)

The naming is a bit historical I guess. But roughly physpages is the
total number of pages of RAM we think exist in the system.
totalram_pages() is the total number of pages that have been freed to
the buddy allocator. totalcma_pages is pages used by CMA, which is
probably 0 for you.

So the amount "reserved" is the memory that hasn't been freed to the
buddy allocator by memblock.

You should be able to see it in debugfs:

  # cat /sys/kernel/debug/memblock/reserved
  0: 0x..0x02b40e57
  1: 0x02b41000..0x02b413ff
  2: 0x02b5..0x02ba
  3: 0x0a91..0x0e93
  4: 0x0fe8..0x0fe9
  5: 0x0feac000..0x0ffebfff
  6: 0x0ffed400..0x0ffed7ff
  7: 0x0ffeda80..0x0ffeebff
  8: 0x0ffeee80..0x0ffe
  9: 0x0fff0280..0x0fff13ff
  ...

> Is there a way to tune(lower it) this memory?

Some or most of those reserved regions will be things your firmware
told you to reserve, so you need to work out what each region is. They
might be firmware things, or even holes in RAM; you need to dig deeper
to find out what is what.

cheers
Re: [PATCH v2 1/6] powerpc/pseries/iommu: Create defines for operations in ibm,ddw-applicable
On Thu, 2020-07-02 at 10:43 +1000, Alexey Kardashevskiy wrote:
> > Or should one stick to #define in this case?
>
> imho a matter of taste but after some grepping it feels like #define
> is mostly used, which does not mean it is a good idea. Keep it enum
> and see if it passes mpe's filter :)

Good idea :)

Thanks !
Re: [PATCH v2 1/6] powerpc/pseries/iommu: Create defines for operations in ibm,ddw-applicable
On 02/07/2020 10:36, Leonardo Bras wrote:
> On Thu, 2020-07-02 at 10:21 +1000, Alexey Kardashevskiy wrote:
>>> enum {
>>>	DDW_QUERY_PE_DMA_WIN,
>>>	DDW_CREATE_PE_DMA_WIN,
>>>	DDW_REMOVE_PE_DMA_WIN,
>>>
>>>	DDW_APPLICABLE_SIZE
>>> }
>>> IMO, it looks better than all the defines before.
>>>
>>> What do you think?
>>
>> No, not really, these come from a binary interface so the reader of
>> this cares about absolute numbers and rather wants to see them
>> explicitly.
>
> Makes sense to me.
> I am still getting experience on where to use enum vs define. Thanks
> for the tip!
>
> Using something like
>
> enum {
>	DDW_QUERY_PE_DMA_WIN = 0,
>	DDW_CREATE_PE_DMA_WIN = 1,
>	DDW_REMOVE_PE_DMA_WIN = 2,
>
>	DDW_APPLICABLE_SIZE
> };
>
> would be fine too?

This is fine too.

> Or should one stick to #define in this case?

imho a matter of taste but after some grepping it feels like #define
is mostly used, which does not mean it is a good idea. Keep it enum
and see if it passes mpe's filter :)

--
Alexey
Re: [PATCH v2 1/6] powerpc/pseries/iommu: Create defines for operations in ibm,ddw-applicable
On Thu, 2020-07-02 at 10:21 +1000, Alexey Kardashevskiy wrote:
> > enum {
> >	DDW_QUERY_PE_DMA_WIN,
> >	DDW_CREATE_PE_DMA_WIN,
> >	DDW_REMOVE_PE_DMA_WIN,
> >
> >	DDW_APPLICABLE_SIZE
> > }
> > IMO, it looks better than all the defines before.
> >
> > What do you think?
>
> No, not really, these come from a binary interface so the reader of
> this cares about absolute numbers and rather wants to see them
> explicitly.

Makes sense to me.
I am still getting experience on where to use enum vs define. Thanks
for the tip!

Using something like

enum {
	DDW_QUERY_PE_DMA_WIN = 0,
	DDW_CREATE_PE_DMA_WIN = 1,
	DDW_REMOVE_PE_DMA_WIN = 2,

	DDW_APPLICABLE_SIZE
};

would be fine too? Or should one stick to #define in this case?

Thank you,
Re: [PATCH v2 5/6] powerpc/pseries/iommu: Make use of DDW even if it does not map the partition
On 02/07/2020 09:48, Leonardo Bras wrote: > On Wed, 2020-07-01 at 16:57 -0300, Leonardo Bras wrote: >>> It is not necessarily "direct" anymore as the name suggests, you may >>> want to change that. DMA64_PROPNAME, may be. Thanks, >>> >> >> Yeah, you are right. >> I will change this for next version, also changing the string name to >> reflect this. >> >> -#define DIRECT64_PROPNAME "linux,direct64-ddr-window-info" >> +#define DMA64_PROPNAME "linux,dma64-ddr-window-info" >> >> Is that ok? >> >> Thank you for helping! > > In fact, there are a lot of places in this file where it's called direct > window. Should I replace everything? > Should it be in a separate patch? If it looks simple and you write a nice commit log explaining all that and why you are not reusing the existing ibm,dma-window property (to provide a clue what "reset" will reset to? is there any other reason?) for that - sure, do it :) -- Alexey
Re: [PATCH v2 2/6] powerpc/pseries/iommu: Update call to ibm,query-pe-dma-windows
On Thu, 2020-07-02 at 10:18 +1000, Alexey Kardashevskiy wrote: > > On 02/07/2020 00:04, Leonardo Bras wrote: > > On Wed, 2020-07-01 at 18:17 +1000, Alexey Kardashevskiy wrote: > > > > +#define DDW_EXT_SIZE 0 > > > > +#define DDW_EXT_RESET_DMA_WIN 1 > > > > +#define DDW_EXT_QUERY_OUT_SIZE 2 > > > > > > #define DDW_EXT_LAST (DDW_EXT_QUERY_OUT_SIZE + 1) > > > ... > > > > > > > > > > + > > > > static struct iommu_table_group *iommu_pseries_alloc_group(int node) > > > > { > > > > struct iommu_table_group *table_group; > > > > @@ -339,7 +343,7 @@ struct direct_window { > > > > /* Dynamic DMA Window support */ > > > > struct ddw_query_response { > > > > u32 windows_available; > > > > - u32 largest_available_block; > > > > + u64 largest_available_block; > > > > u32 page_size; > > > > u32 migration_capable; > > > > }; > > > > @@ -875,13 +879,29 @@ static int find_existing_ddw_windows(void) > > > > machine_arch_initcall(pseries, find_existing_ddw_windows); > > > > > > > > static int query_ddw(struct pci_dev *dev, const u32 *ddw_avail, > > > > - struct ddw_query_response *query) > > > > +struct ddw_query_response *query, > > > > +struct device_node *parent) > > > > { > > > > struct device_node *dn; > > > > struct pci_dn *pdn; > > > > - u32 cfg_addr; > > > > + u32 cfg_addr, query_out[5], ddw_ext[DDW_EXT_QUERY_OUT_SIZE + 1]; > > > > > > ... and use DDW_EXT_LAST here. > > > > Because of the growing nature of ddw-extensions, I intentionally let > > this be (DDW_EXT_QUERY_OUT_SIZE + 1). If we create a DDW_EXT_LAST, it > > will be incremented in the future if more extensions come to exist. > > > > I mean, I previously saw no reason for allocating space for extensions > > after the desired one, as they won't be used here. > > Ah, my bad, you're right. 
> > > > > > > > > u64 buid; > > > > - int ret; > > > > + int ret, out_sz; > > > > + > > > > + /* > > > > +* From LoPAR level 2.8, "ibm,ddw-extensions" index 3 can rule > > > > how many > > > > +* output parameters ibm,query-pe-dma-windows will have, > > > > ranging from > > > > +* 5 to 6. > > > > +*/ > > > > + > > > > + ret = of_property_read_u32_array(parent, "ibm,ddw-extensions", > > > > +&ddw_ext[0], > > > > +DDW_EXT_QUERY_OUT_SIZE + 1); > > In this case, I made sure not to cross (DDW_EXT_QUERY_OUT_SIZE + 1) > > while reading the extensions from the property. > > > > What do you think about it? > > I think you want something like: > > static inline int ddw_read_ext(const struct device_node *np, int extnum, > u32 *ret) > { > return of_property_read_u32_index(np, "ibm,ddw-extensions", extnum + 1, ret); > } > > These "+1"'s all over the place are confusing. That's a great idea! I was not aware it was possible to read a single value[index] directly from the property, but it makes total sense to use it. Thank you!
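The helper suggested above can be modeled outside the kernel to show why hiding the "+1" helps. This is a userspace sketch, not the kernel implementation: the array stands in for the "ibm,ddw-extensions" property, whose entry 0 holds its own count, so extension number N lives at array index N + 1 (the exact index convention is modeled loosely on the thread).

```c
#include <stddef.h>

typedef unsigned int u32;

/*
 * Userspace model of the suggested ddw_read_ext() helper. Entry 0 of the
 * property is the extension count, so extension N sits at index N + 1;
 * burying that "+1" in a single helper is the point of the suggestion.
 */
static int ddw_read_ext(const u32 *prop, size_t prop_len, int extnum, u32 *ret)
{
	size_t idx = (size_t)extnum + 1;	/* skip the count word */

	if (prop == NULL || idx >= prop_len)
		return -1;			/* stands in for -EINVAL */
	*ret = prop[idx];
	return 0;
}
```

With this in place, callers ask for `DDW_EXT_RESET_DMA_WIN` or `DDW_EXT_QUERY_OUT_SIZE` by number and never repeat the offset arithmetic.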
Re: [PATCH v2 4/6] powerpc/pseries/iommu: Remove default DMA window before creating DDW
On 02/07/2020 05:48, Leonardo Bras wrote: > On Wed, 2020-07-01 at 18:17 +1000, Alexey Kardashevskiy wrote: >> >> On 24/06/2020 16:24, Leonardo Bras wrote: >>> On LoPAR "DMA Window Manipulation Calls", it's recommended to remove the >>> default DMA window for the device, before attempting to configure a DDW, >>> in order to make the maximum resources available for the next DDW to be >>> created. >>> >>> This is a requirement for some devices to use DDW, given they only >>> allow one DMA window. >> >> Devices never know about these windows, it is purely PHB's side of >> things. A device can access any address on the bus, the bus can generate >> an exception if there is no window behind the address OR some other >> device's MMIO. We could actually create a second window in addition to >> the first one and allocate bus addresses from both, we just simplifying >> this by merging two separate non-adjacent windows into one. > > That's interesting, I was not aware of this. > I will try to improve this commit message with this info. > Thanks for sharing! > > If setting up a new DDW fails anywhere after the removal of this >>> default DMA window, it's needed to restore the default DMA window. >>> For this, an implementation of ibm,reset-pe-dma-windows rtas call is >>> needed: >>> >>> Platforms supporting the DDW option starting with LoPAR level 2.7 implement >>> ibm,ddw-extensions. The first extension available (index 2) carries the >>> token for ibm,reset-pe-dma-windows rtas call, which is used to restore >>> the default DMA window for a device, if it has been deleted. >>> >>> It does so by resetting the TCE table allocation for the PE to it's >>> boot time value, available in "ibm,dma-window" device tree node. 
>>> >>> Signed-off-by: Leonardo Bras >>> --- >>> arch/powerpc/platforms/pseries/iommu.c | 70 ++ >>> 1 file changed, 61 insertions(+), 9 deletions(-) >>> >>> diff --git a/arch/powerpc/platforms/pseries/iommu.c >>> b/arch/powerpc/platforms/pseries/iommu.c >>> index a8840d9e1c35..4fcf00016fb1 100644 >>> --- a/arch/powerpc/platforms/pseries/iommu.c >>> +++ b/arch/powerpc/platforms/pseries/iommu.c >>> @@ -1029,6 +1029,39 @@ static phys_addr_t ddw_memory_hotplug_max(void) >>> return max_addr; >>> } >>> >>> +/* >>> + * Platforms supporting the DDW option starting with LoPAR level 2.7 >>> implement >>> + * ibm,ddw-extensions, which carries the rtas token for >>> + * ibm,reset-pe-dma-windows. >>> + * That rtas-call can be used to restore the default DMA window for the >>> device. >>> + */ >>> +static void reset_dma_window(struct pci_dev *dev, struct device_node >>> *par_dn) >>> +{ >>> + int ret; >>> + u32 cfg_addr, ddw_ext[DDW_EXT_RESET_DMA_WIN + 1]; >>> + u64 buid; >>> + struct device_node *dn; >>> + struct pci_dn *pdn; >>> + >>> + ret = of_property_read_u32_array(par_dn, "ibm,ddw-extensions", >>> +&ddw_ext[0], DDW_EXT_RESET_DMA_WIN + >>> 1); >>> + if (ret) >>> + return; >>> + >>> + dn = pci_device_to_OF_node(dev); >>> + pdn = PCI_DN(dn); >>> + buid = pdn->phb->buid; >>> + cfg_addr = ((pdn->busno << 16) | (pdn->devfn << 8)); >>> + >>> + ret = rtas_call(ddw_ext[DDW_EXT_RESET_DMA_WIN], 3, 1, NULL, cfg_addr, >>> + BUID_HI(buid), BUID_LO(buid)); >>> + if (ret) >>> + dev_info(&dev->dev, >>> +"ibm,reset-pe-dma-windows(%x) %x %x %x returned %d ", >>> +ddw_ext[1], cfg_addr, BUID_HI(buid), BUID_LO(buid), >> >> s/ddw_ext[1]/ddw_ext[DDW_EXT_RESET_DMA_WIN]/ > > Good catch! I missed this one. 
> >> >> >>> +ret); >>> +} >>> + >>> /* >>> * If the PE supports dynamic dma windows, and there is space for a table >>> * that can map all pages in a linear offset, then setup such a table, >>> @@ -1049,8 +1082,9 @@ static u64 enable_ddw(struct pci_dev *dev, struct >>> device_node *pdn) >>> u64 dma_addr, max_addr; >>> struct device_node *dn; >>> u32 ddw_avail[DDW_APPLICABLE_SIZE]; >>> + >> >> Unrelated new empty line. > > Fixed! > >> >> >>> struct direct_window *window; >>> - struct property *win64; >>> + struct property *win64, *default_win = NULL, *ddw_ext = NULL; >>> struct dynamic_dma_window_prop *ddwprop; >>> struct failed_ddw_pdn *fpdn; >>> >>> @@ -1085,7 +1119,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct >>> device_node *pdn) >>> if (ret) >>> goto out_failed; >>> >>> - /* >>> + /* >>> * Query if there is a second window of size to map the >>> * whole partition. Query returns number of windows, largest >>> * block assigned to PE (partition endpoint), and two bitmasks >>> @@ -1096,15 +1130,31 @@ static u64 enable_ddw(struct pci_dev *dev, struct >>> device_node *pdn) >>> if (ret != 0) >>> goto out_failed; >>> >>> + /* >>> +* If there is no window available, remove the default DMA window, >>> +* if it's present.
Re: [PATCH v2 1/6] powerpc/pseries/iommu: Create defines for operations in ibm,ddw-applicable
On 01/07/2020 23:28, Leonardo Bras wrote: > On Wed, 2020-07-01 at 18:16 +1000, Alexey Kardashevskiy wrote: >> >> On 24/06/2020 16:24, Leonardo Bras wrote: >>> Create defines to help handling ibm,ddw-applicable values, avoiding >>> confusion about the index of given operations. >>> >>> Signed-off-by: Leonardo Bras >>> --- >>> arch/powerpc/platforms/pseries/iommu.c | 40 +++--- >>> 1 file changed, 23 insertions(+), 17 deletions(-) >>> >>> diff --git a/arch/powerpc/platforms/pseries/iommu.c >>> b/arch/powerpc/platforms/pseries/iommu.c >>> index 6d47b4a3ce39..68d2aa9c71a8 100644 >>> --- a/arch/powerpc/platforms/pseries/iommu.c >>> +++ b/arch/powerpc/platforms/pseries/iommu.c >>> @@ -39,6 +39,11 @@ >>> >>> #include "pseries.h" >>> >>> +#define DDW_QUERY_PE_DMA_WIN 0 >>> +#define DDW_CREATE_PE_DMA_WIN 1 >>> +#define DDW_REMOVE_PE_DMA_WIN 2 >>> +#define DDW_APPLICABLE_SIZE3 >> >> #define DDW_APPLICABLE_SIZE (DDW_REMOVE_PE_DMA_WIN + 1) >> >> thanks, > > Thanks for the feedback! > About this (and patch #2), would it be better to use enum ? > enum { > DDW_QUERY_PE_DMA_WIN, > DDW_CREATE_PE_DMA_WIN, > DDW_REMOVE_PE_DMA_WIN, > > DDW_APPLICABLE_SIZE > } > IMO, it looks better than all the defines before. > > What do you think? No, not really, these come from a binary interface so the reader of this cares about absolute numbers and rather wants to see them explicitly. -- Alexey
Re: [PATCH v2 2/6] powerpc/pseries/iommu: Update call to ibm,query-pe-dma-windows
On 02/07/2020 00:04, Leonardo Bras wrote: > On Wed, 2020-07-01 at 18:17 +1000, Alexey Kardashevskiy wrote: >> >>> +#define DDW_EXT_SIZE 0 >>> +#define DDW_EXT_RESET_DMA_WIN 1 >>> +#define DDW_EXT_QUERY_OUT_SIZE 2 >> >> #define DDW_EXT_LAST (DDW_EXT_QUERY_OUT_SIZE + 1) >> ... >> >> >>> + >>> static struct iommu_table_group *iommu_pseries_alloc_group(int node) >>> { >>> struct iommu_table_group *table_group; >>> @@ -339,7 +343,7 @@ struct direct_window { >>> /* Dynamic DMA Window support */ >>> struct ddw_query_response { >>> u32 windows_available; >>> - u32 largest_available_block; >>> + u64 largest_available_block; >>> u32 page_size; >>> u32 migration_capable; >>> }; >>> @@ -875,13 +879,29 @@ static int find_existing_ddw_windows(void) >>> machine_arch_initcall(pseries, find_existing_ddw_windows); >>> >>> static int query_ddw(struct pci_dev *dev, const u32 *ddw_avail, >>> - struct ddw_query_response *query) >>> +struct ddw_query_response *query, >>> +struct device_node *parent) >>> { >>> struct device_node *dn; >>> struct pci_dn *pdn; >>> - u32 cfg_addr; >>> + u32 cfg_addr, query_out[5], ddw_ext[DDW_EXT_QUERY_OUT_SIZE + 1]; >> >> ... and use DDW_EXT_LAST here. > > Because of the growing nature of ddw-extensions, I intentionally let > this be (DDW_EXT_QUERY_OUT_SIZE + 1). If we create a DDW_EXT_LAST, it > will be incremented in the future if more extensions come to exist. > > I mean, I previously saw no reason for allocating space for extensions > after the desired one, as they won't be used here. Ah, my bad, you're right. > >> >> >>> u64 buid; >>> - int ret; >>> + int ret, out_sz; >>> + >>> + /* >>> +* From LoPAR level 2.8, "ibm,ddw-extensions" index 3 can rule how many >>> +* output parameters ibm,query-pe-dma-windows will have, ranging from >>> +* 5 to 6. 
>>> +*/ >>> + >>> + ret = of_property_read_u32_array(parent, "ibm,ddw-extensions", >>> +&ddw_ext[0], >>> +DDW_EXT_QUERY_OUT_SIZE + 1); > > In this case, I made sure not to cross (DDW_EXT_QUERY_OUT_SIZE + 1) > while reading the extensions from the property. > > What do you think about it? I think you want something like: static inline int ddw_read_ext(const struct device_node *np, int extnum, u32 *ret) { return of_property_read_u32_index(np, "ibm,ddw-extensions", extnum + 1, ret); } These "+1"'s all over the place are confusing. -- Alexey
Re: [PATCH v2 5/6] powerpc/pseries/iommu: Make use of DDW even if it does not map the partition
On Wed, 2020-07-01 at 16:57 -0300, Leonardo Bras wrote: > > It is not necessarily "direct" anymore as the name suggests, you may > > want to change that. DMA64_PROPNAME, may be. Thanks, > > > > Yeah, you are right. > I will change this for next version, also changing the string name to > reflect this. > > -#define DIRECT64_PROPNAME "linux,direct64-ddr-window-info" > +#define DMA64_PROPNAME "linux,dma64-ddr-window-info" > > Is that ok? > > Thank you for helping! In fact, there are a lot of places in this file where it's called direct window. Should I replace everything? Should it be in a separate patch? Best regards, Leonardo
Re: [PATCH] ASoC: fsl_sai: Refine regcache usage with pm runtime
On Mon, 29 Jun 2020 14:42:33 +0800, Shengjiu Wang wrote: > When there is dedicated power domain bound with device, after probing > the power will be disabled, then registers are not accessible in > fsl_sai_dai_probe(), so regcache only need to be enabled in end of > probe() and regcache_mark_dirty should be moved to pm runtime resume > callback function. Applied to https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound.git for-next Thanks! [1/1] ASoC: fsl_sai: Refine regcache usage with pm runtime commit: d8d702e19e997cf3f172487e0659d0e68aa5ede5 All being well this means that it will be integrated into the linux-next tree (usually sometime in the next 24 hours) and sent to Linus during the next merge window (or sooner if it is a bug fix), however if problems are discovered then the patch may be dropped or reverted. You may get further e-mails resulting from automated or manual testing and review of the tree, please engage with people reporting problems and send followup patches addressing any issues that are reported if needed. If any updates are required or you are submitting further changes they should be sent as incremental updates against current git, existing patches will not be replaced. Please add any relevant lists and maintainers to the CCs when replying to this mail. Thanks, Mark
Re: [PATCH v3] ASoC: fsl_asrc: Add an option to select internal ratio mode
On Tue, 30 Jun 2020 21:56:07 +0800, Shengjiu Wang wrote: > The ASRC not only supports ideal ratio mode, but also supports > internal ratio mode. > > For internal rato mode, the rate of clock source should be divided > with no remainder by sample rate, otherwise there is sound > distortion. > > [...] Applied to https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound.git for-next Thanks! [1/1] ASoC: fsl_asrc: Add an option to select internal ratio mode commit: d0250cf4f2abfbea64ed247230f08f5ae23979f0 All being well this means that it will be integrated into the linux-next tree (usually sometime in the next 24 hours) and sent to Linus during the next merge window (or sooner if it is a bug fix), however if problems are discovered then the patch may be dropped or reverted. You may get further e-mails resulting from automated or manual testing and review of the tree, please engage with people reporting problems and send followup patches addressing any issues that are reported if needed. If any updates are required or you are submitting further changes they should be sent as incremental updates against current git, existing patches will not be replaced. Please add any relevant lists and maintainers to the CCs when replying to this mail. Thanks, Mark
Re: [PATCH v2 6/6] powerpc/pseries/iommu: Avoid errors when DDW starts at 0x00
On Wed, 2020-07-01 at 18:04 +1000, Alexey Kardashevskiy wrote: > > On 27/06/2020 03:46, Leonardo Bras wrote: > > On Wed, 2020-06-24 at 03:24 -0300, Leonardo Bras wrote: > > > As of today, enable_ddw() will return a non-null DMA address if the > > > created DDW maps the whole partition. If the address is valid, > > > iommu_bypass_supported_pSeriesLP() will consider iommu bypass enabled. > > > > > > This can cause some trouble if the DDW happens to start at 0x00. > > > > > > Instead if checking if the address is non-null, check directly if > > > the DDW maps the whole partition, so it can bypass iommu. > > > > > > Signed-off-by: Leonardo Bras > > > > This patch has a bug in it. I will rework it soon. > > I'd rather suggest this: > > https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20180725095032.2196-2-...@ozlabs.ru/ > > Although it does not look like you are actually going to have windows > starting at 0. Thanks, Yeah, agree. I am thinking of dropping this one, as I don't see much good to be done here. Thank you!
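The pitfall behind the dropped patch can be modeled in a few lines of plain C (all names here are hypothetical, not the kernel's): when a returned address doubles as the success flag, a perfectly valid window starting at bus address 0 is misread as failure, so the "maps the whole partition" decision is better carried as an explicit flag.

```c
#include <stdbool.h>

typedef unsigned long long u64;

/*
 * Hypothetical result of a DDW setup. dma_addr may legitimately be 0,
 * so whether the window covers the whole partition is tracked separately
 * instead of being inferred from "dma_addr != 0".
 */
struct ddw_result {
	u64  dma_addr;			/* window start on the bus */
	bool maps_whole_partition;	/* explicit, not a sentinel */
};

static bool iommu_bypass_ok(const struct ddw_result *r)
{
	/* decide on the explicit flag, never on the address value */
	return r->maps_whole_partition;
}
```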
Re: [PATCH v2 5/6] powerpc/pseries/iommu: Make use of DDW even if it does not map the partition
On Wed, 2020-07-01 at 18:16 +1000, Alexey Kardashevskiy wrote: > > On 24/06/2020 16:24, Leonardo Bras wrote: > > As of today, if a DDW is created and can't map the whole partition, it's > > removed and the default DMA window "ibm,dma-window" is used instead. > > > > Usually this DDW is bigger than the default DMA window, so it would be > > better to make use of it instead. > > > > Signed-off-by: Leonardo Bras > > --- > > arch/powerpc/platforms/pseries/iommu.c | 28 +- > > 1 file changed, 19 insertions(+), 9 deletions(-) > > > > diff --git a/arch/powerpc/platforms/pseries/iommu.c > > b/arch/powerpc/platforms/pseries/iommu.c > > index 4fcf00016fb1..2d217cda4075 100644 > > --- a/arch/powerpc/platforms/pseries/iommu.c > > +++ b/arch/powerpc/platforms/pseries/iommu.c > > @@ -685,7 +685,7 @@ static void pci_dma_bus_setup_pSeriesLP(struct pci_bus > > *bus) > > struct iommu_table *tbl; > > struct device_node *dn, *pdn; > > struct pci_dn *ppci; > > - const __be32 *dma_window = NULL; > > + const __be32 *dma_window = NULL, *alt_dma_window = NULL; > > > > dn = pci_bus_to_OF_node(bus); > > > > @@ -699,8 +699,13 @@ static void pci_dma_bus_setup_pSeriesLP(struct pci_bus > > *bus) > > break; > > } > > > > + /* If there is a DDW available, use it instead */ > > + alt_dma_window = of_get_property(pdn, DIRECT64_PROPNAME, NULL); > > It is not necessarily "direct" anymore as the name suggests, you may > want to change that. DMA64_PROPNAME, may be. Thanks, > Yeah, you are right. I will change this for next version, also changing the string name to reflect this. -#define DIRECT64_PROPNAME "linux,direct64-ddr-window-info" +#define DMA64_PROPNAME "linux,dma64-ddr-window-info" Is that ok? Thank you for helping!
Re: [PATCH v2 4/6] powerpc/pseries/iommu: Remove default DMA window before creating DDW
On Wed, 2020-07-01 at 18:17 +1000, Alexey Kardashevskiy wrote: > > On 24/06/2020 16:24, Leonardo Bras wrote: > > On LoPAR "DMA Window Manipulation Calls", it's recommended to remove the > > default DMA window for the device, before attempting to configure a DDW, > > in order to make the maximum resources available for the next DDW to be > > created. > > > > This is a requirement for some devices to use DDW, given they only > > allow one DMA window. > > Devices never know about these windows, it is purely PHB's side of > things. A device can access any address on the bus, the bus can generate > an exception if there is no window behind the address OR some other > device's MMIO. We could actually create a second window in addition to > the first one and allocate bus addresses from both, we just simplifying > this by merging two separate non-adjacent windows into one. That's interesting, I was not aware of this. I will try to improve this commit message with this info. Thanks for sharing! > > > > If setting up a new DDW fails anywhere after the removal of this > > default DMA window, it's needed to restore the default DMA window. > > For this, an implementation of ibm,reset-pe-dma-windows rtas call is > > needed: > > > > Platforms supporting the DDW option starting with LoPAR level 2.7 implement > > ibm,ddw-extensions. The first extension available (index 2) carries the > > token for ibm,reset-pe-dma-windows rtas call, which is used to restore > > the default DMA window for a device, if it has been deleted. > > > > It does so by resetting the TCE table allocation for the PE to it's > > boot time value, available in "ibm,dma-window" device tree node. 
> > > > Signed-off-by: Leonardo Bras > > --- > > arch/powerpc/platforms/pseries/iommu.c | 70 ++ > > 1 file changed, 61 insertions(+), 9 deletions(-) > > > > diff --git a/arch/powerpc/platforms/pseries/iommu.c > > b/arch/powerpc/platforms/pseries/iommu.c > > index a8840d9e1c35..4fcf00016fb1 100644 > > --- a/arch/powerpc/platforms/pseries/iommu.c > > +++ b/arch/powerpc/platforms/pseries/iommu.c > > @@ -1029,6 +1029,39 @@ static phys_addr_t ddw_memory_hotplug_max(void) > > return max_addr; > > } > > > > +/* > > + * Platforms supporting the DDW option starting with LoPAR level 2.7 > > implement > > + * ibm,ddw-extensions, which carries the rtas token for > > + * ibm,reset-pe-dma-windows. > > + * That rtas-call can be used to restore the default DMA window for the > > device. > > + */ > > +static void reset_dma_window(struct pci_dev *dev, struct device_node > > *par_dn) > > +{ > > + int ret; > > + u32 cfg_addr, ddw_ext[DDW_EXT_RESET_DMA_WIN + 1]; > > + u64 buid; > > + struct device_node *dn; > > + struct pci_dn *pdn; > > + > > + ret = of_property_read_u32_array(par_dn, "ibm,ddw-extensions", > > +&ddw_ext[0], DDW_EXT_RESET_DMA_WIN + > > 1); > > + if (ret) > > + return; > > + > > + dn = pci_device_to_OF_node(dev); > > + pdn = PCI_DN(dn); > > + buid = pdn->phb->buid; > > + cfg_addr = ((pdn->busno << 16) | (pdn->devfn << 8)); > > + > > + ret = rtas_call(ddw_ext[DDW_EXT_RESET_DMA_WIN], 3, 1, NULL, cfg_addr, > > + BUID_HI(buid), BUID_LO(buid)); > > + if (ret) > > + dev_info(&dev->dev, > > +"ibm,reset-pe-dma-windows(%x) %x %x %x returned %d ", > > +ddw_ext[1], cfg_addr, BUID_HI(buid), BUID_LO(buid), > > s/ddw_ext[1]/ddw_ext[DDW_EXT_RESET_DMA_WIN]/ Good catch! I missed this one. 
> > > > +ret); > > +} > > + > > /* > > * If the PE supports dynamic dma windows, and there is space for a table > > * that can map all pages in a linear offset, then setup such a table, > > @@ -1049,8 +1082,9 @@ static u64 enable_ddw(struct pci_dev *dev, struct > > device_node *pdn) > > u64 dma_addr, max_addr; > > struct device_node *dn; > > u32 ddw_avail[DDW_APPLICABLE_SIZE]; > > + > > Unrelated new empty line. Fixed! > > > > struct direct_window *window; > > - struct property *win64; > > + struct property *win64, *default_win = NULL, *ddw_ext = NULL; > > struct dynamic_dma_window_prop *ddwprop; > > struct failed_ddw_pdn *fpdn; > > > > @@ -1085,7 +1119,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct > > device_node *pdn) > > if (ret) > > goto out_failed; > > > > - /* > > + /* > > * Query if there is a second window of size to map the > > * whole partition. Query returns number of windows, largest > > * block assigned to PE (partition endpoint), and two bitmasks > > @@ -1096,15 +1130,31 @@ static u64 enable_ddw(struct pci_dev *dev, struct > > device_node *pdn) > > if (ret != 0) > > goto out_failed; > > > > + /* > > +* If there is no window available, remove the default DMA window, > > +* if it's present. This will make all the resources available to the > > +* new DDW wi
Memory: 880608K/983040K .... 36896K reserved ?
I cannot figure out how the xxxK reserved item works in: Memory: 880608K/983040K available (9532K kernel code, 1104K rwdata, 3348K rodata, 1088K init, 1201K bss, 36896K reserved ... Is there a way to tune (lower) this memory? Jocke
Re: [PATCH 01/11] kexec_file: allow archs to handle special regions while locating memory hole
On 01/07/20 1:16 pm, Dave Young wrote: > On 06/29/20 at 05:26pm, Hari Bathini wrote: >> Hi Petr, >> >> On 29/06/20 5:09 pm, Petr Tesarik wrote: >>> Hi Hari, >>> >>> is there any good reason to add two more functions with a very similar >>> name to an existing function? AFAICS all you need is a way to call a >>> PPC64-specific function from within kexec_add_buffer (PATCH 4/11), so >>> you could add something like this: >>> >>> int __weak arch_kexec_locate_mem_hole(struct kexec_buf *kbuf) >>> { >>> return 0; >>> } >>> >>> Call this function from kexec_add_buffer where appropriate and then >>> override it for PPC64 (it roughly corresponds to your >>> kexec_locate_mem_hole_ppc64() from PATCH 4/11). >>> >>> FWIW it would make it easier for me to follow the resulting code. >> >> Right, Petr. >> >> I was trying out a few things before I ended up with what I sent here. >> But yeah... I did realize arch_kexec_locate_mem_hole() would have been better >> after sending out v1. Will take care of that in v2. > > Another way is to use an arch-private function to locate the mem hole, then set > kbuf->mem, and then call kexec_add_buffer; it will skip the common locate > hole function. Dave, I did think about it. But there are a couple of places this can get tricky. One is ima_add_kexec_buffer() and the other is kexec_elf_load(). These call sites could be updated to set kbuf->mem before kexec_add_buffer(). But the current approach seemed like the better option, as it creates a single point of control in setting up segment buffers and also makes adding any new segments simpler, arch-specific segments or otherwise. Thanks Hari
Re: [PATCH 10/20] dm: stop using ->queuedata
On Wed, Jul 01 2020 at 4:59am -0400, Christoph Hellwig wrote: > Instead of setting up the queuedata as well just use one private data > field. > > Signed-off-by: Christoph Hellwig Acked-by: Mike Snitzer
Re: [PATCH 04/11] ppc64/kexec_file: avoid stomping memory used by special regions
On 01/07/20 1:10 pm, Dave Young wrote: > Hi Hari, > On 06/27/20 at 12:35am, Hari Bathini wrote: >> crashkernel region could have an overlap with special memory regions >> like opal, rtas, tce-table & such. These regions are referred to as >> exclude memory ranges. Set up these ranges during image probe in order >> to avoid them while finding the buffer for different kdump segments. >> Implement kexec_locate_mem_hole_ppc64() that locates a memory hole >> accounting for these ranges. Also, override arch_kexec_add_buffer() >> to locate a memory hole & later call __kexec_add_buffer() function >> with kbuf->mem set to skip the generic locate memory hole lookup. >> >> Signed-off-by: Hari Bathini >> --- >> arch/powerpc/include/asm/crashdump-ppc64.h | 10 + >> arch/powerpc/include/asm/kexec.h |7 - >> arch/powerpc/kexec/elf_64.c|7 + >> arch/powerpc/kexec/file_load_64.c | 292 >> >> 4 files changed, 312 insertions(+), 4 deletions(-) >> create mode 100644 arch/powerpc/include/asm/crashdump-ppc64.h >> > [snip] >> /** >> + * get_exclude_memory_ranges - Get exclude memory ranges. This list includes >> + * regions like opal/rtas, tce-table, initrd, >> + * kernel, htab which should be avoided while >> + * setting up kexec load segments. >> + * @mem_ranges:Range list to add the memory ranges to. >> + * >> + * Returns 0 on success, negative errno on error. 
>> + */ >> +static int get_exclude_memory_ranges(struct crash_mem **mem_ranges) >> +{ >> +int ret; >> + >> +ret = add_tce_mem_ranges(mem_ranges); >> +if (ret) >> +goto out; >> + >> +ret = add_initrd_mem_range(mem_ranges); >> +if (ret) >> +goto out; >> + >> +ret = add_htab_mem_range(mem_ranges); >> +if (ret) >> +goto out; >> + >> +ret = add_kernel_mem_range(mem_ranges); >> +if (ret) >> +goto out; >> + >> +ret = add_rtas_mem_range(mem_ranges, false); >> +if (ret) >> +goto out; >> + >> +ret = add_opal_mem_range(mem_ranges, false); >> +if (ret) >> +goto out; >> + >> +ret = add_reserved_ranges(mem_ranges); >> +if (ret) >> +goto out; >> + >> +/* exclude memory ranges should be sorted for easy lookup */ >> +sort_memory_ranges(*mem_ranges); >> +out: >> +if (ret) >> +pr_err("Failed to setup exclude memory ranges\n"); >> +return ret; >> +} > > I'm confused about the "overlap with crashkernel memory", does that mean > those normal kernel used memory could be put in crashkernel reserved There are regions that could overlap with crashkernel region but they are not normal kernel used memory though. These are regions that kernel and/or f/w chose to place at a particular address for real mode accessibility and/or memory layout between kernel & f/w kind of thing. > memory range? If so why can't just skip those areas while crashkernel > doing the reservation? crashkernel region has a dependency to be in the first memory block for it to be accessible in real mode. Accommodating this requirement while addressing other requirements would mean something like what we have now. A list of possible special memory regions in crashkernel region to take care of. I have plans to split crashkernel region into low & high to have exclusive regions for crashkernel, even if that means to have two of them. But that is for another day with its own set of complexities to deal with... Thanks Hari
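The "exclude memory ranges should be sorted for easy lookup" remark above can be illustrated with a small standalone sketch (not the kernel implementation; names and the inclusive-end convention are assumptions): once the list is sorted by start address, checking whether a candidate kexec buffer overlaps any excluded region can stop early at the first range that begins past the candidate.

```c
#include <stdbool.h>
#include <stddef.h>

typedef unsigned long long u64;

struct mem_range {
	u64 start;
	u64 end;	/* inclusive */
};

/*
 * Sketch of the lookup a sorted exclude list enables: scan in order and
 * bail out as soon as a range starts beyond the candidate buffer, since
 * every later range starts even further away.
 */
static bool overlaps_excluded(const struct mem_range *ranges, size_t n,
			      u64 start, u64 end)
{
	for (size_t i = 0; i < n; i++) {
		if (ranges[i].start > end)
			break;		/* sorted: no later range can overlap */
		if (ranges[i].end >= start)
			return true;
	}
	return false;
}
```

A real locate-mem-hole pass would walk candidate holes and skip any for which this predicate is true.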
Re: [PATCH 16/20] block: move ->make_request_fn to struct block_device_operations
On Wed, Jul 1, 2020 at 2:01 AM Christoph Hellwig wrote: > > The make_request_fn is a little weird in that it sits directly in > struct request_queue instead of an operation vector. Replace it with > a block_device_operations method called submit_bio (which describes much > better what it does). Also remove the request_queue argument to it, as > the queue can be derived pretty trivially from the bio. > > Signed-off-by: Christoph Hellwig > --- [..] > drivers/nvdimm/blk.c | 5 +- > drivers/nvdimm/btt.c | 5 +- > drivers/nvdimm/pmem.c | 5 +- For drivers/nvdimm Acked-by: Dan Williams
Re: [PATCH v2 2/6] powerpc/pseries/iommu: Update call to ibm,query-pe-dma-windows
On Wed, 2020-07-01 at 18:17 +1000, Alexey Kardashevskiy wrote: > > > +#define DDW_EXT_SIZE 0 > > +#define DDW_EXT_RESET_DMA_WIN 1 > > +#define DDW_EXT_QUERY_OUT_SIZE 2 > > #define DDW_EXT_LAST (DDW_EXT_QUERY_OUT_SIZE + 1) > ... > > > > + > > static struct iommu_table_group *iommu_pseries_alloc_group(int node) > > { > > struct iommu_table_group *table_group; > > @@ -339,7 +343,7 @@ struct direct_window { > > /* Dynamic DMA Window support */ > > struct ddw_query_response { > > u32 windows_available; > > - u32 largest_available_block; > > + u64 largest_available_block; > > u32 page_size; > > u32 migration_capable; > > }; > > @@ -875,13 +879,29 @@ static int find_existing_ddw_windows(void) > > machine_arch_initcall(pseries, find_existing_ddw_windows); > > > > static int query_ddw(struct pci_dev *dev, const u32 *ddw_avail, > > - struct ddw_query_response *query) > > +struct ddw_query_response *query, > > +struct device_node *parent) > > { > > struct device_node *dn; > > struct pci_dn *pdn; > > - u32 cfg_addr; > > + u32 cfg_addr, query_out[5], ddw_ext[DDW_EXT_QUERY_OUT_SIZE + 1]; > > ... and use DDW_EXT_LAST here. Because of the growing nature of ddw-extensions, I intentionally let this be (DDW_EXT_QUERY_OUT_SIZE + 1). If we create a DDW_EXT_LAST, it will be incremented in the future if more extensions come to exist. I mean, I previously saw no reason for allocating space for extensions after the desired one, as they won't be used here. > > > > u64 buid; > > - int ret; > > + int ret, out_sz; > > + > > + /* > > +* From LoPAR level 2.8, "ibm,ddw-extensions" index 3 can rule how many > > +* output parameters ibm,query-pe-dma-windows will have, ranging from > > +* 5 to 6. > > +*/ > > + > > + ret = of_property_read_u32_array(parent, "ibm,ddw-extensions", > > +&ddw_ext[0], > > +DDW_EXT_QUERY_OUT_SIZE + 1); In this case, I made sure not to cross (DDW_EXT_QUERY_OUT_SIZE + 1) while reading the extensions from the property. What do you think about it? Best regards, Leonardo
Re: rename ->make_request_fn and move it to the block_device_operations v2
On 7/1/20 2:59 AM, Christoph Hellwig wrote: > Hi Jens, > > this series moves the make_request_fn method into block_device_operations > with the much more descriptive ->submit_bio name. It then also gives > generic_make_request a more descriptive name, and further optimize the > path to issue to blk-mq, removing the need for the direct_make_request > bypass. Thanks, I'll try this again. -- Jens Axboe
[PATCH v2 2/2] powerpc/papr_scm: Add support for fetching nvdimm 'fuel-gauge' metric
We add support for reporting the 'fuel-gauge' NVDIMM metric via the PAPR_PDSM_HEALTH pdsm payload. The 'fuel-gauge' metric indicates the usable life remaining of a papr-scm compatible NVDIMM. PHYP exposes this metric via the H_SCM_PERFORMANCE_STATS hcall. The metric value is returned from the pdsm by extending the return payload 'struct nd_papr_pdsm_health' without breaking the ABI. A new field 'dimm_fuel_gauge' to hold the metric value is introduced at the end of the payload struct and its presence is indicated by the extension flag PDSM_DIMM_HEALTH_RUN_GAUGE_VALID. The patch introduces a new function papr_pdsm_fuel_gauge() that is called from papr_pdsm_health(). If fetching NVDIMM performance stats is supported, 'papr_pdsm_fuel_gauge()' allocates an output buffer large enough to hold the performance stat and passes it to drc_pmem_query_stats(), which issues the HCALL to PHYP. The returned stat value is then populated in the 'struct nd_papr_pdsm_health.dimm_fuel_gauge' field with the extension flag 'PDSM_DIMM_HEALTH_RUN_GAUGE_VALID' set in 'struct nd_papr_pdsm_health.extension_flags'. Signed-off-by: Vaibhav Jain --- Changelog: v2: * Restructure code in papr_pdsm_fuel_gauge() to handle the error case first [ Ira ] * Ignore the return value of papr_pdsm_fuel_gauge() in papr_pdsm_health() [ Ira ] --- arch/powerpc/include/uapi/asm/papr_pdsm.h | 9 + arch/powerpc/platforms/pseries/papr_scm.c | 49 +++ 2 files changed, 58 insertions(+) diff --git a/arch/powerpc/include/uapi/asm/papr_pdsm.h b/arch/powerpc/include/uapi/asm/papr_pdsm.h index 9ccecc1d6840..50ef95e2f5b1 100644 --- a/arch/powerpc/include/uapi/asm/papr_pdsm.h +++ b/arch/powerpc/include/uapi/asm/papr_pdsm.h @@ -72,6 +72,11 @@ #define PAPR_PDSM_DIMM_CRITICAL 2 #define PAPR_PDSM_DIMM_FATAL 3 +/* struct nd_papr_pdsm_health.extension_flags field flags */ + +/* Indicate that the 'dimm_fuel_gauge' field is valid */ +#define PDSM_DIMM_HEALTH_RUN_GAUGE_VALID 1 + /* * Struct exchanged between kernel & ndctl in for PAPR_PDSM_HEALTH * Various 
flags indicate the health status of the dimm. @@ -84,6 +89,7 @@ * dimm_locked : Contents of the dimm cant be modified until CEC reboot * dimm_encrypted : Contents of dimm are encrypted. * dimm_health : Dimm health indicator. One of PAPR_PDSM_DIMM_ + * dimm_fuel_gauge : Life remaining of DIMM as a percentage from 0-100 */ struct nd_papr_pdsm_health { union { @@ -96,6 +102,9 @@ struct nd_papr_pdsm_health { __u8 dimm_locked; __u8 dimm_encrypted; __u16 dimm_health; + + /* Extension flag PDSM_DIMM_HEALTH_RUN_GAUGE_VALID */ + __u16 dimm_fuel_gauge; }; __u8 buf[ND_PDSM_PAYLOAD_MAX_SIZE]; }; diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c index bde2433822b2..b02bed8475f9 100644 --- a/arch/powerpc/platforms/pseries/papr_scm.c +++ b/arch/powerpc/platforms/pseries/papr_scm.c @@ -498,6 +498,51 @@ static int is_cmd_valid(struct nvdimm *nvdimm, unsigned int cmd, void *buf, return 0; } +static int papr_pdsm_fuel_gauge(struct papr_scm_priv *p, + union nd_pdsm_payload *payload) +{ + int rc, size; + u64 statval; + struct papr_scm_perf_stat *stat; + struct papr_scm_perf_stats *stats; + + /* Silently fail if fetching performance metrics isn't supported */ + if (!p->stat_buffer_len) + return 0; + + /* Allocate request buffer enough to hold single performance stat */ + size = sizeof(struct papr_scm_perf_stats) + + sizeof(struct papr_scm_perf_stat); + + stats = kzalloc(size, GFP_KERNEL); + if (!stats) + return -ENOMEM; + + stat = &stats->scm_statistic[0]; + memcpy(&stat->stat_id, "MemLife ", sizeof(stat->stat_id)); + stat->stat_val = 0; + + /* Fetch the fuel gauge and populate it in payload */ + rc = drc_pmem_query_stats(p, stats, size, 1, NULL); + if (rc) { + dev_dbg(&p->pdev->dev, "Err(%d) fetching fuel gauge\n", rc); + goto free_stats; + } + + statval = be64_to_cpu(stat->stat_val); + dev_dbg(&p->pdev->dev, + "Fetched fuel-gauge %llu", statval); + payload->health.extension_flags |= + PDSM_DIMM_HEALTH_RUN_GAUGE_VALID; + 
payload->health.dimm_fuel_gauge = statval; + + rc = sizeof(struct nd_papr_pdsm_health); + +free_stats: + kfree(stats); + return rc; +} + /* Fetch the DIMM health info and populate it in provided package. */ static int papr_pdsm_health(struct papr_scm_priv *p, union nd_pdsm_payload *payload) @@ -538,6 +583,10 @@ static int papr_pdsm_health(struct papr_scm_priv *p, /* struct populated hence can release the mutex now */ mutex_unloc
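From a consumer's point of view (e.g. ndctl), the ABI-extension scheme above can be modelled with a hedged standalone sketch. The struct below is a trimmed stand-in, not the actual uapi layout, and it uses the 'life_used_percentage = 100 - fuel_gauge' relation from the cover letter:

```c
#include <assert.h>
#include <stdint.h>

#define PDSM_DIMM_HEALTH_RUN_GAUGE_VALID 1

/* Trimmed stand-in for struct nd_papr_pdsm_health: only the two
 * fields this example needs; field widths are assumptions. */
struct pdsm_health {
	uint64_t extension_flags;
	uint16_t dimm_fuel_gauge;	/* life remaining, 0-100 */
};

/* Return life used as a percentage, or -1 when the kernel did not
 * report the metric (old kernel, or the stats fetch failed silently). */
static int life_used_percentage(const struct pdsm_health *h)
{
	if (!(h->extension_flags & PDSM_DIMM_HEALTH_RUN_GAUGE_VALID))
		return -1;
	return 100 - h->dimm_fuel_gauge;
}
```

Because a kernel that cannot fetch the stat simply leaves the flag clear, old and new userspace interoperate without an ABI break.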
[PATCH v2 1/2] powerpc/papr_scm: Fetch nvdimm performance stats from PHYP
Update papr_scm.c to query dimm performance statistics from PHYP via the H_SCM_PERFORMANCE_STATS hcall and export them to user-space as the PAPR specific NVDIMM attribute 'perf_stats' in sysfs. The patch also provides sysfs ABI documentation for the stats being reported and their meanings. At NVDIMM probe time in papr_scm_nvdimm_init(), a special variant of the H_SCM_PERFORMANCE_STATS hcall is issued to check if collection of performance statistics is supported or not. If successful, PHYP returns the maximum possible buffer length needed to read all performance stats. This returned value is stored in a per-nvdimm attribute 'stat_buffer_len'. The layout of the request buffer for reading NVDIMM performance stats from PHYP is defined in 'struct papr_scm_perf_stats' and 'struct papr_scm_perf_stat'. These structs are used in the newly introduced drc_pmem_query_stats() that issues the H_SCM_PERFORMANCE_STATS hcall. The sysfs access function perf_stats_show() uses the 'stat_buffer_len' value to allocate a buffer large enough to hold all possible NVDIMM performance stats and passes it to drc_pmem_query_stats() to populate. Finally, statistics reported in the buffer are formatted into the sysfs access function output buffer. Signed-off-by: Vaibhav Jain --- Changelog: v2: * Fixed a bug in drc_pmem_query_stats() that was passing a NULL pointer to virt_to_pfn() [ Ira ] * Updated 'struct papr_scm_perf_stats' and 'struct papr_scm_perf_stat' to use big-endian types. [ Aneesh ] * s/len_stat_buffer/stat_buffer_len/ [ Aneesh ] * s/statistics_id/stat_id/ , s/statistics_val/stat_val/ [ Aneesh ] * Conversion from big endian to CPU endian happens later rather than immediately after it is fetched from PHYP. 
* Changed a log statement to unambiguously report dimm performance stats are not available for the given nvdimm [ Ira ] * Restructured some code to handle the error case first [ Ira ] --- Documentation/ABI/testing/sysfs-bus-papr-pmem | 27 arch/powerpc/platforms/pseries/papr_scm.c | 134 ++ 2 files changed, 161 insertions(+) diff --git a/Documentation/ABI/testing/sysfs-bus-papr-pmem b/Documentation/ABI/testing/sysfs-bus-papr-pmem index 5b10d036a8d4..c1a67275c43f 100644 --- a/Documentation/ABI/testing/sysfs-bus-papr-pmem +++ b/Documentation/ABI/testing/sysfs-bus-papr-pmem @@ -25,3 +25,30 @@ Description: NVDIMM have been scrubbed. * "locked" : Indicating that NVDIMM contents cant be modified until next power cycle. + +What: /sys/bus/nd/devices/nmemX/papr/perf_stats +Date: May, 2020 +KernelVersion: v5.9 +Contact: linuxppc-dev , linux-nvd...@lists.01.org, +Description: + (RO) Report various performance stats related to papr-scm NVDIMM + device. Each stat is reported on a new line with each line + composed of a stat-identifier followed by it value. 
Below are + currently known dimm performance stats which are reported: + + * "CtlResCt" : Controller Reset Count + * "CtlResTm" : Controller Reset Elapsed Time + * "PonSecs " : Power-on Seconds + * "MemLife " : Life Remaining + * "CritRscU" : Critical Resource Utilization + * "HostLCnt" : Host Load Count + * "HostSCnt" : Host Store Count + * "HostSDur" : Host Store Duration + * "HostLDur" : Host Load Duration + * "MedRCnt " : Media Read Count + * "MedWCnt " : Media Write Count + * "MedRDur " : Media Read Duration + * "MedWDur " : Media Write Duration + * "CchRHCnt" : Cache Read Hit Count + * "CchWHCnt" : Cache Write Hit Count + * "FastWCnt" : Fast Write Count \ No newline at end of file diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c index 9c569078a09f..bde2433822b2 100644 --- a/arch/powerpc/platforms/pseries/papr_scm.c +++ b/arch/powerpc/platforms/pseries/papr_scm.c @@ -62,6 +62,26 @@ PAPR_PMEM_HEALTH_FATAL |\ PAPR_PMEM_HEALTH_UNHEALTHY) +#define PAPR_SCM_PERF_STATS_EYECATCHER __stringify(SCMSTATS) +#define PAPR_SCM_PERF_STATS_VERSION 0x1 + +/* Struct holding a single performance metric */ +struct papr_scm_perf_stat { + u8 stat_id[8]; + __be64 stat_val; +} __packed; + +/* Struct exchanged between kernel and PHYP for fetching drc perf stats */ +struct papr_scm_perf_stats { + u8 eye_catcher[8]; + /* Should be PAPR_SCM_PERF_STATS_VERSION */ + __be32 stats_version; + /* Number of stats following */ + __be32 num_statistics; + /* zero or more performance matrics */ + struct papr_scm_perf_stat scm_statistic[]; +} __packed; + /* private struct associated with each region */ struct papr_scm_priv { struct platform_
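Walking the returned catalogue reduces to scanning fixed-size {8-byte id, 64-bit value} records, as a hedged standalone illustration of the buffer layout the patch defines (endianness handling is elided; the host-order struct below is a simplification of the __packed big-endian 'struct papr_scm_perf_stat' above):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Simplified host-order mirror of struct papr_scm_perf_stat; the
 * real struct carries a big-endian stat_val and is __packed. */
struct perf_stat {
	char stat_id[8];	/* space-padded, not NUL-terminated */
	uint64_t stat_val;
};

/* Scan num records for an 8-byte stat id; return its value, or 0
 * when the id is absent. In the real buffer the records follow an
 * 8-byte eye catcher and two 32-bit header words. */
static uint64_t find_stat(const struct perf_stat *st, uint32_t num,
			  const char *id8)
{
	uint32_t i;

	for (i = 0; i < num; i++)
		if (!memcmp(st[i].stat_id, id8, 8))
			return st[i].stat_val;
	return 0;
}
```

The 8-byte identifiers ("MemLife ", "PonSecs ", ...) match the stat names listed in the sysfs ABI documentation above.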
[PATCH v2 0/2] powerpc/papr_scm: add support for reporting NVDIMM 'life_used_percentage' metric
Changes since v1 [1]: * Minor restructuring of code as suggested by Ira * Renaming a few members of 'struct papr_scm_perf_[stat|stats]' * Fixed a bug where a NULL pointer was potentially passed to virt_to_phys(). * Using big-endian types rather than cpu-native types to receive data from PHYP in 'struct papr_scm_perf_[stat|stats]' * Some minor log message improvements. [1] https://lore.kernel.org/linux-nvdimm/20200622042451.22448-1-vaib...@linux.ibm.com --- This small patchset implements kernel side support for reporting the 'life_used_percentage' metric in NDCTL with dimm health output for papr-scm NVDIMMs. With the corresponding NDCTL side changes [2] the output should look like: $ sudo ndctl list -DH [ { "dev":"nmem0", "health":{ "health_state":"ok", "life_used_percentage":0, "shutdown_state":"clean" } } ] PHYP supports the H_SCM_PERFORMANCE_STATS hcall through which an LPAR can fetch various performance stats including the 'fuel_gauge' percentage for an NVDIMM. The 'fuel_gauge' metric indicates the usable life remaining of an NVDIMM expressed as a percentage, and 'life_used_percentage' can be calculated as 'life_used_percentage = 100 - fuel_gauge'. Structure of the patchset: The first patch implements the scaffolding needed to issue the H_SCM_PERFORMANCE_STATS hcall and fetch the performance stats catalogue. The patch also implements support for the 'perf_stats' sysfs attribute to report the full catalogue of performance stats supported by PHYP. The second and final patch implements support for sending this value to libndctl by extending the PAPR_PDSM_HEALTH pdsm payload to add a new field named 'dimm_fuel_gauge' to it. 
References == [2] https://github.com/vaibhav92/ndctl/tree/papr_scm_health_v13_run_guage Vaibhav Jain (2): powerpc/papr_scm: Fetch nvdimm performance stats from PHYP powerpc/papr_scm: Add support for fetching nvdimm 'fuel-gauge' metric Documentation/ABI/testing/sysfs-bus-papr-pmem | 27 +++ arch/powerpc/include/uapi/asm/papr_pdsm.h | 9 + arch/powerpc/platforms/pseries/papr_scm.c | 183 ++ 3 files changed, 219 insertions(+) -- 2.26.2
Re: [PATCH v2 30/30] misc: cxl: flash: Remove unused pointer
On Wed, 01 Jul 2020, Greg KH wrote: > On Wed, Jul 01, 2020 at 09:31:18AM +0100, Lee Jones wrote: > > The DRC index pointer is updated on an OPCODE_ADD, but never > > actually read. Remove the unused pointer and shift up OPCODE_ADD > > to group with OPCODE_DELETE which also provides a noop. > > > > Fixes the following W=1 kernel build warning: > > > > drivers/misc/cxl/flash.c: In function ‘update_devicetree’: > > drivers/misc/cxl/flash.c:178:16: warning: variable ‘drc_index’ set but not > > used [-Wunused-but-set-variable] > > 178 | __be32 *data, drc_index, phandle; > > | ^ > > > > Cc: Frederic Barrat > > Cc: Andrew Donnellan > > Cc: linuxppc-dev@lists.ozlabs.org > > Signed-off-by: Lee Jones > > --- > > drivers/misc/cxl/flash.c | 7 ++- > > 1 file changed, 2 insertions(+), 5 deletions(-) > > > > diff --git a/drivers/misc/cxl/flash.c b/drivers/misc/cxl/flash.c > > index cb9cca35a2263..24e3dfcc91a74 100644 > > --- a/drivers/misc/cxl/flash.c > > +++ b/drivers/misc/cxl/flash.c > > @@ -175,7 +175,7 @@ static int update_devicetree(struct cxl *adapter, s32 > > scope) > > struct update_nodes_workarea *unwa; > > u32 action, node_count; > > int token, rc, i; > > - __be32 *data, drc_index, phandle; > > + __be32 *data, phandle; > > char *buf; > > > > token = rtas_token("ibm,update-nodes"); > > @@ -206,15 +206,12 @@ static int update_devicetree(struct cxl *adapter, s32 > > scope) > > > > switch (action) { > > case OPCODE_DELETE: > > + case OPCODE_ADD: > > /* nothing to do */ > > break; > > case OPCODE_UPDATE: > > update_node(phandle, scope); > > break; > > - case OPCODE_ADD: > > - /* nothing to do, just move pointer */ > > - drc_index = *data++; > > - break; > > I think this is not correct, as *data++ changes the value there, and so > this changes the logic of the code. Great spot. Looks like I overlooked that the pointer itself is being incremented. > Dropping this one from my queue. Sounds good. I'll fix this up and send it separately at a later date. 
-- Lee Jones [李琼斯] Senior Technical Lead - Developer Services Linaro.org │ Open source software for Arm SoCs Follow Linaro: Facebook | Twitter | Blog
Re: [PATCH v2 1/6] powerpc/pseries/iommu: Create defines for operations in ibm,ddw-applicable
On Wed, 2020-07-01 at 18:16 +1000, Alexey Kardashevskiy wrote: > > On 24/06/2020 16:24, Leonardo Bras wrote: > > Create defines to help handling ibm,ddw-applicable values, avoiding > > confusion about the index of given operations. > > > > Signed-off-by: Leonardo Bras > > --- > > arch/powerpc/platforms/pseries/iommu.c | 40 +++--- > > 1 file changed, 23 insertions(+), 17 deletions(-) > > > > diff --git a/arch/powerpc/platforms/pseries/iommu.c > > b/arch/powerpc/platforms/pseries/iommu.c > > index 6d47b4a3ce39..68d2aa9c71a8 100644 > > --- a/arch/powerpc/platforms/pseries/iommu.c > > +++ b/arch/powerpc/platforms/pseries/iommu.c > > @@ -39,6 +39,11 @@ > > > > #include "pseries.h" > > > > +#define DDW_QUERY_PE_DMA_WIN 0 > > +#define DDW_CREATE_PE_DMA_WIN 1 > > +#define DDW_REMOVE_PE_DMA_WIN 2 > > +#define DDW_APPLICABLE_SIZE3 > > #define DDW_APPLICABLE_SIZE (DDW_REMOVE_PE_DMA_WIN + 1) > > thanks, Thanks for the feedback! About this (and patch #2), would it be better to use enum ? enum { DDW_QUERY_PE_DMA_WIN, DDW_CREATE_PE_DMA_WIN, DDW_REMOVE_PE_DMA_WIN, DDW_APPLICABLE_SIZE } IMO, it looks better than all the defines before. What do you think? Best regards,
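As a minimal standalone sketch of the enum proposed above (values mirror the defines in the patch): because enumerators auto-increment from 0, the trailing member always equals the count of operations, with no manual "+ 1" to maintain when a new index is appended.

```c
#include <assert.h>

/* Sketch of the proposed enum for "ibm,ddw-applicable" indices. */
enum {
	DDW_QUERY_PE_DMA_WIN,	/* 0 */
	DDW_CREATE_PE_DMA_WIN,	/* 1 */
	DDW_REMOVE_PE_DMA_WIN,	/* 2 */
	DDW_APPLICABLE_SIZE	/* 3: count of operations above */
};
```

This encodes the same invariant as the suggested `#define DDW_APPLICABLE_SIZE (DDW_REMOVE_PE_DMA_WIN + 1)`, but implicitly.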
Re: [PATCH v2 30/30] misc: cxl: flash: Remove unused pointer
On Wed, Jul 01, 2020 at 09:31:18AM +0100, Lee Jones wrote: > The DRC index pointer is updated on an OPCODE_ADD, but never > actually read. Remove the unused pointer and shift up OPCODE_ADD > to group with OPCODE_DELETE which also provides a noop. > > Fixes the following W=1 kernel build warning: > > drivers/misc/cxl/flash.c: In function ‘update_devicetree’: > drivers/misc/cxl/flash.c:178:16: warning: variable ‘drc_index’ set but not > used [-Wunused-but-set-variable] > 178 | __be32 *data, drc_index, phandle; > | ^ > > Cc: Frederic Barrat > Cc: Andrew Donnellan > Cc: linuxppc-dev@lists.ozlabs.org > Signed-off-by: Lee Jones > --- > drivers/misc/cxl/flash.c | 7 ++- > 1 file changed, 2 insertions(+), 5 deletions(-) > > diff --git a/drivers/misc/cxl/flash.c b/drivers/misc/cxl/flash.c > index cb9cca35a2263..24e3dfcc91a74 100644 > --- a/drivers/misc/cxl/flash.c > +++ b/drivers/misc/cxl/flash.c > @@ -175,7 +175,7 @@ static int update_devicetree(struct cxl *adapter, s32 > scope) > struct update_nodes_workarea *unwa; > u32 action, node_count; > int token, rc, i; > - __be32 *data, drc_index, phandle; > + __be32 *data, phandle; > char *buf; > > token = rtas_token("ibm,update-nodes"); > @@ -206,15 +206,12 @@ static int update_devicetree(struct cxl *adapter, s32 > scope) > > switch (action) { > > case OPCODE_DELETE: > + case OPCODE_ADD: > /* nothing to do */ > break; > case OPCODE_UPDATE: > update_node(phandle, scope); > break; > - case OPCODE_ADD: > - /* nothing to do, just move pointer */ > - drc_index = *data++; > - break; I think this is not correct, as *data++ changes the value there, and so this changes the logic of the code. Dropping this one from my queue. thanks, greg k-h
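The pointer-arithmetic issue can be reduced to a hedged standalone sketch (the opcode values and stream layout here are illustrative, not taken from flash.c): each action word may be followed by payload words, so OPCODE_ADD must still advance the cursor past its drc_index word even though the value itself is discarded.

```c
#include <assert.h>
#include <stdint.h>

#define OPCODE_DELETE 0x1	/* illustrative values */
#define OPCODE_ADD    0x3

/* Return the word that would be interpreted as the second action's
 * opcode. Passing skip_add_payload == 0 models the rejected hunk:
 * the cursor then lands on the drc_index payload instead of the
 * next opcode, misparsing the rest of the stream. */
static uint32_t second_opcode(const uint32_t *data, int skip_add_payload)
{
	uint32_t action = *data++;

	if (action == OPCODE_ADD && skip_add_payload)
		data++;		/* step over the unused drc_index word */
	return *data;
}
```

This is why `drc_index = *data++;` cannot simply be deleted; only the dead store can go, e.g. by keeping a bare `data++;` under OPCODE_ADD.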
Re: [PATCH 04/11] ppc64/kexec_file: avoid stomping memory used by special regions
On 07/01/2020 03:40 PM, Dave Young wrote: > Hi Hari, > On 06/27/20 at 12:35am, Hari Bathini wrote: >> The crashkernel region could have an overlap with special memory regions >> like opal, rtas, tce-table & such. These regions are referred to as >> exclude memory ranges. Set up these ranges during image probe in order >> to avoid them while finding the buffer for different kdump segments. >> Implement kexec_locate_mem_hole_ppc64() that locates a memory hole >> accounting for these ranges. Also, override arch_kexec_add_buffer() >> to locate a memory hole & later call __kexec_add_buffer() function >> with kbuf->mem set to skip the generic locate memory hole lookup. >> >> Signed-off-by: Hari Bathini >> --- >> arch/powerpc/include/asm/crashdump-ppc64.h | 10 + >> arch/powerpc/include/asm/kexec.h |7 - >> arch/powerpc/kexec/elf_64.c|7 + >> arch/powerpc/kexec/file_load_64.c | 292 >> >> 4 files changed, 312 insertions(+), 4 deletions(-) >> create mode 100644 arch/powerpc/include/asm/crashdump-ppc64.h >> > [snip] >> /** >> + * get_exclude_memory_ranges - Get exclude memory ranges. This list includes >> + * regions like opal/rtas, tce-table, initrd, >> + * kernel, htab which should be avoided while >> + * setting up kexec load segments. >> + * @mem_ranges:Range list to add the memory ranges to. >> + * >> + * Returns 0 on success, negative errno on error. 
>> + */ >> +static int get_exclude_memory_ranges(struct crash_mem **mem_ranges) >> +{ >> +int ret; >> + >> +ret = add_tce_mem_ranges(mem_ranges); >> +if (ret) >> +goto out; >> + >> +ret = add_initrd_mem_range(mem_ranges); >> +if (ret) >> +goto out; >> + >> +ret = add_htab_mem_range(mem_ranges); >> +if (ret) >> +goto out; >> + >> +ret = add_kernel_mem_range(mem_ranges); >> +if (ret) >> +goto out; >> + >> +ret = add_rtas_mem_range(mem_ranges, false); >> +if (ret) >> +goto out; >> + >> +ret = add_opal_mem_range(mem_ranges, false); >> +if (ret) >> +goto out; >> + >> +ret = add_reserved_ranges(mem_ranges); >> +if (ret) >> +goto out; >> + >> +/* exclude memory ranges should be sorted for easy lookup */ >> +sort_memory_ranges(*mem_ranges); >> +out: >> +if (ret) >> +pr_err("Failed to setup exclude memory ranges\n"); >> +return ret; >> +} > > I'm confused about the "overlap with crashkernel memory", does that mean > those normal kernel used memory could be put in crashkernel reserved > memory range? If so why can't just skip those areas while crashkernel > doing the reservation? I raised the same question in another mail. As per Hari's answer, "kexec -p" skips these ranges in user space, and the same logic should be done for "kexec -s -p". Regards, Pingfan
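A hedged standalone sketch of why the exclude list is kept sorted (the helper below is illustrative and not taken from file_load_64.c): with ranges ordered by start address, an overlap check for a candidate buffer can terminate at the first range that begins at or beyond the candidate's end.

```c
#include <assert.h>
#include <stdint.h>

struct mem_range {
	uint64_t start, end;	/* half-open: [start, end) */
};

/* Return 1 if [start, end) overlaps any of n exclude ranges sorted
 * by start address. Sorting lets the scan stop early rather than
 * walking the entire list for every candidate memory hole. */
static int overlaps_excluded(const struct mem_range *r, int n,
			     uint64_t start, uint64_t end)
{
	int i;

	for (i = 0; i < n && r[i].start < end; i++)
		if (r[i].end > start)
			return 1;
	return 0;
}
```

A locate-mem-hole routine would call a check like this for each candidate before settling on a buffer placement.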
Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
On Tue 30-06-20 09:31:25, Srikar Dronamraju wrote: > * Christopher Lameter [2020-06-29 14:58:40]: > > > On Wed, 24 Jun 2020, Srikar Dronamraju wrote: > > > > > Currently Linux kernel with CONFIG_NUMA on a system with multiple > > > possible nodes, marks node 0 as online at boot. However in practice, > > > there are systems which have node 0 as memoryless and cpuless. > > > > Maybe add something to explain why you are not simply mapping the > > existing memory to NUMA node 0 which is after all just a numbering scheme > > used by the kernel and can be used arbitrarily? > > > > I thought Michal Hocko already gave a clear picture on why mapping is a bad > idea. https://lore.kernel.org/lkml/20200316085425.gb11...@dhcp22.suse.cz/t/#u > Are you suggesting that we add that as part of the changelog? Well, I was not aware that x86 already does renumber. So there is a certain precedent. As I've said, I do not really like that, but this is what is already happening. If renumbering is not an option, then just handle that in the ppc code explicitly. A generic solution would be preferable of course, but as I've said it is really hard to check for correctness and potential subtle issues. -- Michal Hocko SUSE Labs
Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
On Wed 01-07-20 13:30:57, David Hildenbrand wrote: > On 01.07.20 13:06, David Hildenbrand wrote: > > On 01.07.20 13:01, Srikar Dronamraju wrote: > >> * David Hildenbrand [2020-07-01 12:15:54]: > >> > >>> On 01.07.20 12:04, Srikar Dronamraju wrote: > * Michal Hocko [2020-07-01 10:42:00]: > > > > >> > >> 2. Also existence of dummy node also leads to inconsistent > >> information. The > >> number of online nodes is inconsistent with the information in the > >> device-tree and resource-dump > >> > >> 3. When the dummy node is present, single node non-Numa systems end up > >> showing > >> up as NUMA systems and numa_balancing gets enabled. This will mean we > >> take > >> the hit from the unnecessary numa hinting faults. > > > > I have to say that I dislike the node online/offline state and directly > > exporting that to the userspace. Users should only care whether the node > > has memory/cpus. Numa nodes can be online without any memory. Just > > offline all the present memory blocks but do not physically hot remove > > them and you are in the same situation. If users are confused by an > > output of tools like numactl -H then those could be updated and hide > > nodes without any memory&cpus. > > > > The autonuma problem sounds interesting but again this patch doesn't > > really solve the underlying problem because I strongly suspect that the > > problem is still there when a numa node gets all its memory offline as > > mentioned above. I would really appreciate a feedback to these two as well. > > While I completely agree that making node 0 special is wrong, I have > > still hard time to review this very simply looking patch because all the > > numa initialization is so spread around that this might just blow up > > at unexpected places. IIRC we have discussed testing in the previous > > version and David has provided a way to emulate these configurations > > on x86. Did you manage to use those instruction for additional testing > > on other than ppc architectures? 
> > > > I have tried all the steps that David mentioned and reported back at > https://lore.kernel.org/lkml/20200511174731.gd1...@linux.vnet.ibm.com/t/#u > > As a summary, David's steps are still not creating a memoryless/cpuless > on > x86 VM. > >>> > >>> Now, that is wrong. You get a memoryless/cpuless node, which is *not > >>> online*. Once you hotplug some memory, it will switch online. Once you > >>> remove memory, it will switch back offline. > >>> > >> > >> Let me clarify, we are looking for a node 0 which is cpuless/memoryless at > >> boot. The code in question tries to handle a cpuless/memoryless node 0 at > >> boot. > > > > I was just correcting your statement, because it was wrong. > > > > Could be that x86 code maps PXM 1 to node 0 because PXM 1 does neither > > have CPUs nor memory. That would imply that we can, in fact, never have > > node 0 offline during boot. > > > > Yep, looks like it. > > [0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0 > [0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0 > [0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0 > [0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0 > [0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x-0x0009] > [0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x0010-0xbfff] > [0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x1-0x13fff] This begs the question: can ppc do the same thing? I would swear that we've had an x86 system with node 0, but I cannot really find it, and it is possible that it was not x86 after all... -- Michal Hocko SUSE Labs
Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
On 01.07.20 13:06, David Hildenbrand wrote: > On 01.07.20 13:01, Srikar Dronamraju wrote: >> * David Hildenbrand [2020-07-01 12:15:54]: >> >>> On 01.07.20 12:04, Srikar Dronamraju wrote: * Michal Hocko [2020-07-01 10:42:00]: > >> >> 2. Also existence of dummy node also leads to inconsistent information. >> The >> number of online nodes is inconsistent with the information in the >> device-tree and resource-dump >> >> 3. When the dummy node is present, single node non-Numa systems end up >> showing >> up as NUMA systems and numa_balancing gets enabled. This will mean we >> take >> the hit from the unnecessary numa hinting faults. > > I have to say that I dislike the node online/offline state and directly > exporting that to the userspace. Users should only care whether the node > has memory/cpus. Numa nodes can be online without any memory. Just > offline all the present memory blocks but do not physically hot remove > them and you are in the same situation. If users are confused by an > output of tools like numactl -H then those could be updated and hide > nodes without any memory&cpus. > > The autonuma problem sounds interesting but again this patch doesn't > really solve the underlying problem because I strongly suspect that the > problem is still there when a numa node gets all its memory offline as > mentioned above. > > While I completely agree that making node 0 special is wrong, I have > still hard time to review this very simply looking patch because all the > numa initialization is so spread around that this might just blow up > at unexpected places. IIRC we have discussed testing in the previous > version and David has provided a way to emulate these configurations > on x86. Did you manage to use those instruction for additional testing > on other than ppc architectures? 
> I have tried all the steps that David mentioned and reported back at https://lore.kernel.org/lkml/20200511174731.gd1...@linux.vnet.ibm.com/t/#u As a summary, David's steps are still not creating a memoryless/cpuless on x86 VM. >>> >>> Now, that is wrong. You get a memoryless/cpuless node, which is *not >>> online*. Once you hotplug some memory, it will switch online. Once you >>> remove memory, it will switch back offline. >>> >> >> Let me clarify, we are looking for a node 0 which is cpuless/memoryless at >> boot. The code in question tries to handle a cpuless/memoryless node 0 at >> boot. > > I was just correcting your statement, because it was wrong. > > Could be that x86 code maps PXM 1 to node 0 because PXM 1 does neither > have CPUs nor memory. That would imply that we can, in fact, never have > node 0 offline during boot. > Yep, looks like it. [0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0 [0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0 [0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0 [0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0 [0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x-0x0009] [0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x0010-0xbfff] [0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x1-0x13fff] -- Thanks, David / dhildenb
Re: [PATCH v2 02/10] KVM: PPC: Book3S HV: Save/restore new PMU registers
On Wed, Jul 01, 2020 at 05:20:54AM -0400, Athira Rajeev wrote: > PowerISA v3.1 has added new performance monitoring unit (PMU) > special purpose registers (SPRs). They are > > Monitor Mode Control Register 3 (MMCR3) > Sampled Instruction Event Register A (SIER2) > Sampled Instruction Event Register B (SIER3) > > Patch adds support to save/restore these new > SPRs while entering/exiting guest. This mostly looks reasonable, at a quick glance at least, but I am puzzled by two of the changes you are making. See below. > diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c > index 6bf66649..c265800 100644 > --- a/arch/powerpc/kvm/book3s_hv.c > +++ b/arch/powerpc/kvm/book3s_hv.c > @@ -1698,7 +1698,8 @@ static int kvmppc_get_one_reg_hv(struct kvm_vcpu *vcpu, > u64 id, > *val = get_reg_val(id, vcpu->arch.sdar); > break; > case KVM_REG_PPC_SIER: > - *val = get_reg_val(id, vcpu->arch.sier); > + i = id - KVM_REG_PPC_SIER; > + *val = get_reg_val(id, vcpu->arch.sier[i]); This is inside a switch (id) statement, so here we know that id is KVM_REG_PPC_SIER. In other words, i will always be zero, so what is the point of doing the subtraction? > break; > case KVM_REG_PPC_IAMR: > *val = get_reg_val(id, vcpu->arch.iamr); > @@ -1919,7 +1920,8 @@ static int kvmppc_set_one_reg_hv(struct kvm_vcpu *vcpu, > u64 id, > vcpu->arch.sdar = set_reg_val(id, *val); > break; > case KVM_REG_PPC_SIER: > - vcpu->arch.sier = set_reg_val(id, *val); > + i = id - KVM_REG_PPC_SIER; > + vcpu->arch.sier[i] = set_reg_val(id, *val); Same comment here. I think that new defines for the new registers will need to be added to arch/powerpc/include/uapi/asm/kvm.h and Documentation/virt/kvm/api.rst, and then new cases will need to be added to these switch statements. By the way, please cc kvm-...@vger.kernel.org and k...@vger.kernel.org on KVM patches. Paul.
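The indexing pattern under discussion can be illustrated with a hedged standalone sketch (the register ID values below are invented; the real ones would come from arch/powerpc/include/uapi/asm/kvm.h): only once SIER2/SIER3 get their own IDs adjacent to KVM_REG_PPC_SIER does the subtraction yield a useful index.

```c
#include <assert.h>

/* Hypothetical contiguous register IDs -- NOT the real uapi values. */
enum {
	KVM_REG_PPC_SIER = 0x40,
	KVM_REG_PPC_SIER2,	/* 0x41 */
	KVM_REG_PPC_SIER3	/* 0x42 */
};

/* Map a register ID to an index into a sier[3] array, or -1 for IDs
 * outside the range. Inside a switch case that matches only
 * KVM_REG_PPC_SIER, the equivalent subtraction is always zero --
 * which is the point Paul raises above. */
static int sier_index(unsigned long id)
{
	if (id >= KVM_REG_PPC_SIER && id <= KVM_REG_PPC_SIER3)
		return (int)(id - KVM_REG_PPC_SIER);
	return -1;
}
```

With distinct case labels (or a range check like this one) for the three IDs, one handler can serve all three SIER registers.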
Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
On 01.07.20 13:01, Srikar Dronamraju wrote: > * David Hildenbrand [2020-07-01 12:15:54]: > >> On 01.07.20 12:04, Srikar Dronamraju wrote: >>> * Michal Hocko [2020-07-01 10:42:00]: >>> > > 2. Also existence of dummy node also leads to inconsistent information. > The > number of online nodes is inconsistent with the information in the > device-tree and resource-dump > > 3. When the dummy node is present, single node non-Numa systems end up > showing > up as NUMA systems and numa_balancing gets enabled. This will mean we take > the hit from the unnecessary numa hinting faults. I have to say that I dislike the node online/offline state and directly exporting that to the userspace. Users should only care whether the node has memory/cpus. Numa nodes can be online without any memory. Just offline all the present memory blocks but do not physically hot remove them and you are in the same situation. If users are confused by an output of tools like numactl -H then those could be updated and hide nodes without any memory&cpus. The autonuma problem sounds interesting but again this patch doesn't really solve the underlying problem because I strongly suspect that the problem is still there when a numa node gets all its memory offline as mentioned above. While I completely agree that making node 0 special is wrong, I have still hard time to review this very simply looking patch because all the numa initialization is so spread around that this might just blow up at unexpected places. IIRC we have discussed testing in the previous version and David has provided a way to emulate these configurations on x86. Did you manage to use those instruction for additional testing on other than ppc architectures? >>> >>> I have tried all the steps that David mentioned and reported back at >>> https://lore.kernel.org/lkml/20200511174731.gd1...@linux.vnet.ibm.com/t/#u >>> >>> As a summary, David's steps are still not creating a memoryless/cpuless on >>> x86 VM. >> >> Now, that is wrong. 
You get a memoryless/cpuless node, which is *not >> online*. Once you hotplug some memory, it will switch online. Once you >> remove memory, it will switch back offline. >> > > Let me clarify, we are looking for a node 0 which is cpuless/memoryless at > boot. The code in question tries to handle a cpuless/memoryless node 0 at > boot. I was just correcting your statement, because it was wrong. Could be that x86 code maps PXM 1 to node 0 because PXM 1 does neither have CPUs nor memory. That would imply that we can, in fact, never have node 0 offline during boot. -- Thanks, David / dhildenb
Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
* David Hildenbrand [2020-07-01 12:15:54]: > On 01.07.20 12:04, Srikar Dronamraju wrote: > > * Michal Hocko [2020-07-01 10:42:00]: > > > >> > >>> > >>> 2. Also existence of dummy node also leads to inconsistent information. > >>> The > >>> number of online nodes is inconsistent with the information in the > >>> device-tree and resource-dump > >>> > >>> 3. When the dummy node is present, single node non-Numa systems end up > >>> showing > >>> up as NUMA systems and numa_balancing gets enabled. This will mean we take > >>> the hit from the unnecessary numa hinting faults. > >> > >> I have to say that I dislike the node online/offline state and directly > >> exporting that to the userspace. Users should only care whether the node > >> has memory/cpus. Numa nodes can be online without any memory. Just > >> offline all the present memory blocks but do not physically hot remove > >> them and you are in the same situation. If users are confused by an > >> output of tools like numactl -H then those could be updated and hide > >> nodes without any memory&cpus. > >> > >> The autonuma problem sounds interesting but again this patch doesn't > >> really solve the underlying problem because I strongly suspect that the > >> problem is still there when a numa node gets all its memory offline as > >> mentioned above. > >> > >> While I completely agree that making node 0 special is wrong, I have > >> still hard time to review this very simply looking patch because all the > >> numa initialization is so spread around that this might just blow up > >> at unexpected places. IIRC we have discussed testing in the previous > >> version and David has provided a way to emulate these configurations > >> on x86. Did you manage to use those instruction for additional testing > >> on other than ppc architectures? 
> >>
> >
> > I have tried all the steps that David mentioned and reported back at
> > https://lore.kernel.org/lkml/20200511174731.gd1...@linux.vnet.ibm.com/t/#u
> >
> > As a summary, David's steps are still not creating a memoryless/cpuless
> > node on an x86 VM.
>
> Now, that is wrong. You get a memoryless/cpuless node, which is *not
> online*. Once you hotplug some memory, it will switch online. Once you
> remove memory, it will switch back offline.
>

Let me clarify, we are looking for a node 0 which is cpuless/memoryless at
boot. The code in question tries to handle a cpuless/memoryless node 0 at
boot.

With the steps that you gave, node 0 was always populated; node 1 or some
other node would be memoryless/cpuless and offline. That case should not
be affected by this patch. I don't see how adding/hotplugging/removing
memory to a node after boot is going to affect the changes that I have
made. Please do correct me if I have misunderstood.

-- 
Thanks and Regards
Srikar Dronamraju
Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
On 01.07.20 12:04, Srikar Dronamraju wrote: > * Michal Hocko [2020-07-01 10:42:00]: > >> >>> >>> 2. Also existence of dummy node also leads to inconsistent information. The >>> number of online nodes is inconsistent with the information in the >>> device-tree and resource-dump >>> >>> 3. When the dummy node is present, single node non-Numa systems end up >>> showing >>> up as NUMA systems and numa_balancing gets enabled. This will mean we take >>> the hit from the unnecessary numa hinting faults. >> >> I have to say that I dislike the node online/offline state and directly >> exporting that to the userspace. Users should only care whether the node >> has memory/cpus. Numa nodes can be online without any memory. Just >> offline all the present memory blocks but do not physically hot remove >> them and you are in the same situation. If users are confused by an >> output of tools like numactl -H then those could be updated and hide >> nodes without any memory&cpus. >> >> The autonuma problem sounds interesting but again this patch doesn't >> really solve the underlying problem because I strongly suspect that the >> problem is still there when a numa node gets all its memory offline as >> mentioned above. >> >> While I completely agree that making node 0 special is wrong, I have >> still hard time to review this very simply looking patch because all the >> numa initialization is so spread around that this might just blow up >> at unexpected places. IIRC we have discussed testing in the previous >> version and David has provided a way to emulate these configurations >> on x86. Did you manage to use those instruction for additional testing >> on other than ppc architectures? >> > > I have tried all the steps that David mentioned and reported back at > https://lore.kernel.org/lkml/20200511174731.gd1...@linux.vnet.ibm.com/t/#u > > As a summary, David's steps are still not creating a memoryless/cpuless on > x86 VM. Now, that is wrong. 
You get a memoryless/cpuless node, which is *not online*. Once you hotplug some memory, it will switch online. Once you remove memory, it will switch back offline. -- Thanks, David / dhildenb
Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
* Michal Hocko [2020-07-01 10:42:00]:

> >
> > 2. Also existence of dummy node also leads to inconsistent information. The
> > number of online nodes is inconsistent with the information in the
> > device-tree and resource-dump
> >
> > 3. When the dummy node is present, single node non-Numa systems end up
> > showing up as NUMA systems and numa_balancing gets enabled. This will mean
> > we take the hit from the unnecessary numa hinting faults.
>
> I have to say that I dislike the node online/offline state and directly
> exporting that to the userspace. Users should only care whether the node
> has memory/cpus. Numa nodes can be online without any memory. Just
> offline all the present memory blocks but do not physically hot remove
> them and you are in the same situation. If users are confused by an
> output of tools like numactl -H then those could be updated and hide
> nodes without any memory&cpus.
>
> The autonuma problem sounds interesting but again this patch doesn't
> really solve the underlying problem because I strongly suspect that the
> problem is still there when a numa node gets all its memory offline as
> mentioned above.
>
> While I completely agree that making node 0 special is wrong, I have
> still hard time to review this very simply looking patch because all the
> numa initialization is so spread around that this might just blow up
> at unexpected places. IIRC we have discussed testing in the previous
> version and David has provided a way to emulate these configurations
> on x86. Did you manage to use those instruction for additional testing
> on other than ppc architectures?
>

I have tried all the steps that David mentioned and reported back at
https://lore.kernel.org/lkml/20200511174731.gd1...@linux.vnet.ibm.com/t/#u

As a summary, David's steps are still not creating a memoryless/cpuless
node on an x86 VM. I have tried booting with NUMA/non-NUMA on all the x86
machines that I could get to.

-- 
Thanks and Regards
Srikar Dronamraju
[powerpc:merge] BUILD SUCCESS 3c0356e8994ed88e5234897c0ffee4188f8b9287
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git merge
branch HEAD: 3c0356e8994ed88e5234897c0ffee4188f8b9287  Automatic merge of 'master', 'next' and 'fixes' (2020-06-30 22:07)

elapsed time: 1282m

configs tested: 98
configs skipped: 1

The following configs have been built successfully.
More configs may be tested in the coming days.

arm:        defconfig allyesconfig allmodconfig allnoconfig
arm64:      allyesconfig defconfig allmodconfig allnoconfig
i386:       allyesconfig defconfig debian-10.3 allnoconfig
ia64:       allmodconfig defconfig allnoconfig allyesconfig
m68k:       allmodconfig allnoconfig sun3_defconfig defconfig allyesconfig
nios2:      defconfig allyesconfig
openrisc:   defconfig allyesconfig
c6x:        allyesconfig allnoconfig
nds32:      defconfig allnoconfig
csky:       allyesconfig defconfig
alpha:      defconfig allyesconfig
xtensa:     allyesconfig defconfig
h8300:      allyesconfig allmodconfig
arc:        defconfig allyesconfig
sh:         allmodconfig allnoconfig
microblaze: allnoconfig
mips:       allyesconfig allnoconfig allmodconfig
parisc:     allnoconfig defconfig allyesconfig allmodconfig
powerpc:    defconfig allyesconfig rhel-kconfig allmodconfig allnoconfig
i386:       randconfig-a001-20200630 randconfig-a003-20200630 randconfig-a002-20200630 randconfig-a004-20200630 randconfig-a005-20200630 randconfig-a006-20200630
x86_64:     randconfig-a011-20200630 randconfig-a014-20200630 randconfig-a013-20200630 randconfig-a015-20200630 randconfig-a016-20200630 randconfig-a012-20200630
i386:       randconfig-a011-20200630 randconfig-a016-20200630 randconfig-a015-20200630 randconfig-a012-20200630 randconfig-a014-20200630 randconfig-a013-20200630
riscv:      allyesconfig allnoconfig defconfig allmodconfig
s390:       allyesconfig allnoconfig allmodconfig defconfig
sparc:      allyesconfig defconfig
sparc64:    defconfig allnoconfig allyesconfig allmodconfig
um:         allmodconfig allnoconfig allyesconfig defconfig
x86_64:     rhel rhel-7.2-clear lkp fedora-25 rhel-7.6 rhel-7.6-kselftests rhel-8.3 kexec

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org
[PATCH v2 10/10] powerpc/perf: Add extended regs support for power10 platform
Include capability flag `PERF_PMU_CAP_EXTENDED_REGS` for power10 and expose MMCR3, SIER2, SIER3 registers as part of extended regs. Also introduce `PERF_REG_PMU_MASK_31` to define extended mask value at runtime for power10 Signed-off-by: Athira Rajeev --- arch/powerpc/include/uapi/asm/perf_regs.h | 6 ++ arch/powerpc/perf/perf_regs.c | 10 +- arch/powerpc/perf/power10-pmu.c | 6 ++ tools/arch/powerpc/include/uapi/asm/perf_regs.h | 6 ++ tools/perf/arch/powerpc/include/perf_regs.h | 3 +++ tools/perf/arch/powerpc/util/perf_regs.c| 6 ++ 6 files changed, 36 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/uapi/asm/perf_regs.h b/arch/powerpc/include/uapi/asm/perf_regs.h index 485b1d5..020b51c 100644 --- a/arch/powerpc/include/uapi/asm/perf_regs.h +++ b/arch/powerpc/include/uapi/asm/perf_regs.h @@ -52,6 +52,9 @@ enum perf_event_powerpc_regs { PERF_REG_POWERPC_MMCR0, PERF_REG_POWERPC_MMCR1, PERF_REG_POWERPC_MMCR2, + PERF_REG_POWERPC_MMCR3, + PERF_REG_POWERPC_SIER2, + PERF_REG_POWERPC_SIER3, /* Max regs without the extended regs */ PERF_REG_POWERPC_MAX = PERF_REG_POWERPC_MMCRA + 1, }; @@ -62,4 +65,7 @@ enum perf_event_powerpc_regs { #define PERF_REG_PMU_MASK_300 (((1ULL << (PERF_REG_POWERPC_MMCR2 + 1)) - 1) \ - PERF_REG_PMU_MASK) +/* PERF_REG_EXTENDED_MASK value for CPU_FTR_ARCH_31 */ +#define PERF_REG_PMU_MASK_31 (((1ULL << (PERF_REG_POWERPC_SIER3 + 1)) - 1) \ + - PERF_REG_PMU_MASK) #endif /* _UAPI_ASM_POWERPC_PERF_REGS_H */ diff --git a/arch/powerpc/perf/perf_regs.c b/arch/powerpc/perf/perf_regs.c index c8a7e8c..c969935 100644 --- a/arch/powerpc/perf/perf_regs.c +++ b/arch/powerpc/perf/perf_regs.c @@ -81,6 +81,12 @@ static u64 get_ext_regs_value(int idx) return mfspr(SPRN_MMCR1); case PERF_REG_POWERPC_MMCR2: return mfspr(SPRN_MMCR2); + case PERF_REG_POWERPC_MMCR3: + return mfspr(SPRN_MMCR3); + case PERF_REG_POWERPC_SIER2: + return mfspr(SPRN_SIER2); + case PERF_REG_POWERPC_SIER3: + return mfspr(SPRN_SIER3); default: return 0; } } @@ -89,7 +95,9 @@ u64 
perf_reg_value(struct pt_regs *regs, int idx) { u64 PERF_REG_EXTENDED_MAX; - if (cpu_has_feature(CPU_FTR_ARCH_300)) + if (cpu_has_feature(CPU_FTR_ARCH_31)) + PERF_REG_EXTENDED_MAX = PERF_REG_POWERPC_SIER3 + 1; + else if (cpu_has_feature(CPU_FTR_ARCH_300)) PERF_REG_EXTENDED_MAX = PERF_REG_POWERPC_MMCR2 + 1; if (idx == PERF_REG_POWERPC_SIER && diff --git a/arch/powerpc/perf/power10-pmu.c b/arch/powerpc/perf/power10-pmu.c index 07fb919..51082d6 100644 --- a/arch/powerpc/perf/power10-pmu.c +++ b/arch/powerpc/perf/power10-pmu.c @@ -86,6 +86,8 @@ #define POWER10_MMCRA_IFM3 0xC000UL #define POWER10_MMCRA_BHRB_MASK0xC000UL +extern u64 mask_var; + /* Table of alternatives, sorted by column 0 */ static const unsigned int power10_event_alternatives[][MAX_ALT] = { { PM_RUN_CYC_ALT, PM_RUN_CYC }, @@ -397,6 +399,7 @@ static void power10_config_bhrb(u64 pmu_bhrb_filter) .cache_events = &power10_cache_events, .attr_groups= power10_pmu_attr_groups, .bhrb_nr= 32, + .capabilities = PERF_PMU_CAP_EXTENDED_REGS, }; int init_power10_pmu(void) @@ -408,6 +411,9 @@ int init_power10_pmu(void) strcmp(cur_cpu_spec->oprofile_cpu_type, "ppc64/power10")) return -ENODEV; + /* Set the PERF_REG_EXTENDED_MASK here */ + mask_var = PERF_REG_PMU_MASK_31; + rc = register_power_pmu(&power10_pmu); if (rc) return rc; diff --git a/tools/arch/powerpc/include/uapi/asm/perf_regs.h b/tools/arch/powerpc/include/uapi/asm/perf_regs.h index 485b1d5..020b51c 100644 --- a/tools/arch/powerpc/include/uapi/asm/perf_regs.h +++ b/tools/arch/powerpc/include/uapi/asm/perf_regs.h @@ -52,6 +52,9 @@ enum perf_event_powerpc_regs { PERF_REG_POWERPC_MMCR0, PERF_REG_POWERPC_MMCR1, PERF_REG_POWERPC_MMCR2, + PERF_REG_POWERPC_MMCR3, + PERF_REG_POWERPC_SIER2, + PERF_REG_POWERPC_SIER3, /* Max regs without the extended regs */ PERF_REG_POWERPC_MAX = PERF_REG_POWERPC_MMCRA + 1, }; @@ -62,4 +65,7 @@ enum perf_event_powerpc_regs { #define PERF_REG_PMU_MASK_300 (((1ULL << (PERF_REG_POWERPC_MMCR2 + 1)) - 1) \ - PERF_REG_PMU_MASK) +/* 
PERF_REG_EXTENDED_MASK value for CPU_FTR_ARCH_31 */ +#define PERF_REG_PMU_MASK_31 (((1ULL << (PERF_REG_POWERPC_SIER3 + 1)) - 1) \ + - PERF_REG_PMU_MASK) #endif /* _UAPI_ASM_POWERPC_PERF_REGS_H */ diff --git a/tools/perf/a
[PATCH v2 09/10] tools/perf: Add perf tools support for extended register capability in powerpc
From: Anju T Sudhakar Add extended regs to sample_reg_mask in the tool side to use with `-I?` option. Perf tools side uses extended mask to display the platform supported register names (with -I? option) to the user and also send this mask to the kernel to capture the extended registers in each sample. Hence decide the mask value based on the processor version. Signed-off-by: Anju T Sudhakar [Decide extended mask at run time based on platform] Signed-off-by: Athira Rajeev Reviewed-by: Madhavan Srinivasan --- tools/arch/powerpc/include/uapi/asm/perf_regs.h | 14 ++- tools/perf/arch/powerpc/include/perf_regs.h | 5 ++- tools/perf/arch/powerpc/util/perf_regs.c| 55 + 3 files changed, 72 insertions(+), 2 deletions(-) diff --git a/tools/arch/powerpc/include/uapi/asm/perf_regs.h b/tools/arch/powerpc/include/uapi/asm/perf_regs.h index f599064..485b1d5 100644 --- a/tools/arch/powerpc/include/uapi/asm/perf_regs.h +++ b/tools/arch/powerpc/include/uapi/asm/perf_regs.h @@ -48,6 +48,18 @@ enum perf_event_powerpc_regs { PERF_REG_POWERPC_DSISR, PERF_REG_POWERPC_SIER, PERF_REG_POWERPC_MMCRA, - PERF_REG_POWERPC_MAX, + /* Extended registers */ + PERF_REG_POWERPC_MMCR0, + PERF_REG_POWERPC_MMCR1, + PERF_REG_POWERPC_MMCR2, + /* Max regs without the extended regs */ + PERF_REG_POWERPC_MAX = PERF_REG_POWERPC_MMCRA + 1, }; + +#define PERF_REG_PMU_MASK ((1ULL << PERF_REG_POWERPC_MAX) - 1) + +/* PERF_REG_EXTENDED_MASK value for CPU_FTR_ARCH_300 */ +#define PERF_REG_PMU_MASK_300 (((1ULL << (PERF_REG_POWERPC_MMCR2 + 1)) - 1) \ + - PERF_REG_PMU_MASK) + #endif /* _UAPI_ASM_POWERPC_PERF_REGS_H */ diff --git a/tools/perf/arch/powerpc/include/perf_regs.h b/tools/perf/arch/powerpc/include/perf_regs.h index e18a355..46ed00d 100644 --- a/tools/perf/arch/powerpc/include/perf_regs.h +++ b/tools/perf/arch/powerpc/include/perf_regs.h @@ -64,7 +64,10 @@ [PERF_REG_POWERPC_DAR] = "dar", [PERF_REG_POWERPC_DSISR] = "dsisr", [PERF_REG_POWERPC_SIER] = "sier", - [PERF_REG_POWERPC_MMCRA] = "mmcra" + 
[PERF_REG_POWERPC_MMCRA] = "mmcra", + [PERF_REG_POWERPC_MMCR0] = "mmcr0", + [PERF_REG_POWERPC_MMCR1] = "mmcr1", + [PERF_REG_POWERPC_MMCR2] = "mmcr2", }; static inline const char *perf_reg_name(int id) diff --git a/tools/perf/arch/powerpc/util/perf_regs.c b/tools/perf/arch/powerpc/util/perf_regs.c index 0a52429..9179230 100644 --- a/tools/perf/arch/powerpc/util/perf_regs.c +++ b/tools/perf/arch/powerpc/util/perf_regs.c @@ -6,9 +6,14 @@ #include "../../../util/perf_regs.h" #include "../../../util/debug.h" +#include "../../../util/event.h" +#include "../../../util/header.h" +#include "../../../perf-sys.h" #include +#define PVR_POWER9 0x004E + const struct sample_reg sample_reg_masks[] = { SMPL_REG(r0, PERF_REG_POWERPC_R0), SMPL_REG(r1, PERF_REG_POWERPC_R1), @@ -55,6 +60,9 @@ SMPL_REG(dsisr, PERF_REG_POWERPC_DSISR), SMPL_REG(sier, PERF_REG_POWERPC_SIER), SMPL_REG(mmcra, PERF_REG_POWERPC_MMCRA), + SMPL_REG(mmcr0, PERF_REG_POWERPC_MMCR0), + SMPL_REG(mmcr1, PERF_REG_POWERPC_MMCR1), + SMPL_REG(mmcr2, PERF_REG_POWERPC_MMCR2), SMPL_REG_END }; @@ -163,3 +171,50 @@ int arch_sdt_arg_parse_op(char *old_op, char **new_op) return SDT_ARG_VALID; } + +uint64_t arch__intr_reg_mask(void) +{ + struct perf_event_attr attr = { + .type = PERF_TYPE_HARDWARE, + .config = PERF_COUNT_HW_CPU_CYCLES, + .sample_type= PERF_SAMPLE_REGS_INTR, + .precise_ip = 1, + .disabled = 1, + .exclude_kernel = 1, + }; + int fd, ret; + char buffer[64]; + u32 version; + u64 extended_mask = 0; + + /* Get the PVR value to set the extended +* mask specific to platform +*/ + get_cpuid(buffer, sizeof(buffer)); + ret = sscanf(buffer, "%u,", &version); + + if (ret != 1) { + pr_debug("Failed to get the processor version, unable to output extended registers\n"); + return PERF_REGS_MASK; + } + + if (version == PVR_POWER9) + extended_mask = PERF_REG_PMU_MASK_300; + else + return PERF_REGS_MASK; + + attr.sample_regs_intr = extended_mask; + attr.sample_period = 1; + event_attr_init(&attr); + + /* +* check if the pmu supports 
perf extended regs, before +* returning the register mask to sample. +*/ + fd = sys_perf_event_open(&attr, 0, -1, -1, 0); + if (fd != -1) { + close(fd); + return (extended_mask | PERF_REGS_MASK); + } + return PERF_REGS_MASK; +} -- 1.8.3.1
[PATCH v2 08/10] powerpc/perf: Add support for outputting extended regs in perf intr_regs
From: Anju T Sudhakar Add support for perf extended register capability in powerpc. The capability flag PERF_PMU_CAP_EXTENDED_REGS, is used to indicate the PMU which support extended registers. The generic code define the mask of extended registers as 0 for non supported architectures. Patch adds extended regs support for power9 platform by exposing MMCR0, MMCR1 and MMCR2 registers. REG_RESERVED mask needs update to include extended regs. `PERF_REG_EXTENDED_MASK`, contains mask value of the supported registers, is defined at runtime in the kernel based on platform since the supported registers may differ from one processor version to another and hence the MASK value. with patch -- available registers: r0 r1 r2 r3 r4 r5 r6 r7 r8 r9 r10 r11 r12 r13 r14 r15 r16 r17 r18 r19 r20 r21 r22 r23 r24 r25 r26 r27 r28 r29 r30 r31 nip msr orig_r3 ctr link xer ccr softe trap dar dsisr sier mmcra mmcr0 mmcr1 mmcr2 PERF_RECORD_SAMPLE(IP, 0x1): 4784/4784: 0 period: 1 addr: 0 ... intr regs: mask 0x ABI 64-bit r00xc012b77c r10xc03fe5e03930 r20xc1b0e000 r30xc03fdcddf800 r40xc03fc788 r50x9c422724be r60xc03fe5e03908 r70xff63bddc8706 r80x9e4 r90x0 r10 0x1 r11 0x0 r12 0xc01299c0 r13 0xc03c4800 r14 0x0 r15 0x7fffdd8b8b00 r16 0x0 r17 0x7fffdd8be6b8 r18 0x7e7076607730 r19 0x2f r20 0xc0001fc26c68 r21 0xc0002041e4227e00 r22 0xc0002018fb60 r23 0x1 r24 0xc03ffec4d900 r25 0x8000 r26 0x0 r27 0x1 r28 0x1 r29 0xc1be1260 r30 0x6008010 r31 0xc03ffebb7218 nip 0xc012b910 msr 0x90009033 orig_r3 0xc012b86c ctr 0xc01299c0 link 0xc012b77c xer 0x0 ccr 0x2800 softe 0x1 trap 0xf00 dar 0x0 dsisr 0x800 sier 0x0 mmcra 0x800 mmcr0 0x82008090 mmcr1 0x1e00 mmcr2 0x0 ... 
thread: perf:4784 Signed-off-by: Anju T Sudhakar [Defined PERF_REG_EXTENDED_MASK at run time to add support for different platforms ] Signed-off-by: Athira Rajeev Reviewed-by: Madhavan Srinivasan --- arch/powerpc/include/asm/perf_event_server.h | 8 +++ arch/powerpc/include/uapi/asm/perf_regs.h| 14 +++- arch/powerpc/perf/core-book3s.c | 1 + arch/powerpc/perf/perf_regs.c| 34 +--- arch/powerpc/perf/power9-pmu.c | 6 + 5 files changed, 59 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/include/asm/perf_event_server.h b/arch/powerpc/include/asm/perf_event_server.h index cb207f8..e8d35b6 100644 --- a/arch/powerpc/include/asm/perf_event_server.h +++ b/arch/powerpc/include/asm/perf_event_server.h @@ -15,6 +15,9 @@ #define MAX_EVENT_ALTERNATIVES 8 #define MAX_LIMITED_HWCOUNTERS 2 +extern u64 mask_var; +#define PERF_REG_EXTENDED_MASK mask_var + struct perf_event; /* @@ -55,6 +58,11 @@ struct power_pmu { int *blacklist_ev; /* BHRB entries in the PMU */ int bhrb_nr; + /* +* set this flag with `PERF_PMU_CAP_EXTENDED_REGS` if +* the pmu supports extended perf regs capability +*/ + int capabilities; }; /* diff --git a/arch/powerpc/include/uapi/asm/perf_regs.h b/arch/powerpc/include/uapi/asm/perf_regs.h index f599064..485b1d5 100644 --- a/arch/powerpc/include/uapi/asm/perf_regs.h +++ b/arch/powerpc/include/uapi/asm/perf_regs.h @@ -48,6 +48,18 @@ enum perf_event_powerpc_regs { PERF_REG_POWERPC_DSISR, PERF_REG_POWERPC_SIER, PERF_REG_POWERPC_MMCRA, - PERF_REG_POWERPC_MAX, + /* Extended registers */ + PERF_REG_POWERPC_MMCR0, + PERF_REG_POWERPC_MMCR1, + PERF_REG_POWERPC_MMCR2, + /* Max regs without the extended regs */ + PERF_REG_POWERPC_MAX = PERF_REG_POWERPC_MMCRA + 1, }; + +#define PERF_REG_PMU_MASK ((1ULL << PERF_REG_POWERPC_MAX) - 1) + +/* PERF_REG_EXTENDED_MASK value for CPU_FTR_ARCH_300 */ +#define PERF_REG_PMU_MASK_300 (((1ULL << (PERF_REG_POWERPC_MMCR2 + 1)) - 1) \ + - PERF_REG_PMU_MASK) + #endif /* _UAPI_ASM_POWERPC_PERF_REGS_H */ diff --git 
a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c index 9709606..382d770 100644 --- a/arch/powerpc/perf/core-book3s.c +++ b/arch/powerpc/perf/core-book3s.c @@ -2317,6 +2317,7 @@ int register_power_pmu(struct power_pmu *pmu) pmu->name); power_pmu.attr_groups = ppmu->attr_groups; + power_pmu.capabilities |= (ppmu->capabilities & PERF_PMU_CAP_EXTENDED_REGS); #ifdef MSR_HV /* diff --git a/arch/powerpc/perf/perf_regs.c b/arch/powerpc/perf/perf_regs.c index a213a0a..c8a7e8c 10064
[PATCH v2 07/10] powerpc/perf: support BHRB disable bit and new filtering modes
PowerISA v3.1 has few updates for the Branch History Rolling Buffer(BHRB). First is the addition of BHRB disable bit and second new filtering modes for BHRB. BHRB disable is controlled via Monitor Mode Control Register A (MMCRA) bit 26, namely "BHRB Recording Disable (BHRBRD)". This field controls whether BHRB entries are written when BHRB recording is enabled by other bits. Patch implements support for this BHRB disable bit. Secondly PowerISA v3.1 introduce filtering support for PERF_SAMPLE_BRANCH_IND_CALL/COND. The patch adds BHRB filter support for "ind_call" and "cond" in power10_bhrb_filter_map(). 'commit bb19af816025 ("powerpc/perf: Prevent kernel address leak to userspace via BHRB buffer")' added a check in bhrb_read() to filter the kernel address from BHRB buffer. Patch here modified it to avoid that check for PowerISA v3.1 based processors, since PowerISA v3.1 allows only MSR[PR]=1 address to be written to BHRB buffer. Signed-off-by: Athira Rajeev --- arch/powerpc/perf/core-book3s.c | 27 +-- arch/powerpc/perf/isa207-common.c | 13 + arch/powerpc/perf/power10-pmu.c | 13 +++-- arch/powerpc/platforms/powernv/idle.c | 14 ++ 4 files changed, 59 insertions(+), 8 deletions(-) diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c index fad5159..9709606 100644 --- a/arch/powerpc/perf/core-book3s.c +++ b/arch/powerpc/perf/core-book3s.c @@ -466,9 +466,13 @@ static void power_pmu_bhrb_read(struct perf_event *event, struct cpu_hw_events * * addresses at this point. Check the privileges before * exporting it to userspace (avoid exposure of regions * where we could have speculative execution) +* Incase of ISA 310, BHRB will capture only user-space +* address,hence include a check before filtering code */ - if (is_kernel_addr(addr) && perf_allow_kernel(&event->attr) != 0) - continue; + if (!(ppmu->flags & PPMU_ARCH_310S)) + if (is_kernel_addr(addr) && + perf_allow_kernel(&event->attr) != 0) + continue; /* Branches are read most recent first (ie. 
mfbhrb 0 is * the most recent branch). @@ -1212,7 +1216,7 @@ static void write_mmcr0(struct cpu_hw_events *cpuhw, unsigned long mmcr0) static void power_pmu_disable(struct pmu *pmu) { struct cpu_hw_events *cpuhw; - unsigned long flags, mmcr0, val; + unsigned long flags, mmcr0, val, mmcra = 0; if (!ppmu) return; @@ -1245,12 +1249,23 @@ static void power_pmu_disable(struct pmu *pmu) mb(); isync(); + val = mmcra = cpuhw->mmcr[2]; + /* * Disable instruction sampling if it was enabled */ - if (cpuhw->mmcr[2] & MMCRA_SAMPLE_ENABLE) { - mtspr(SPRN_MMCRA, - cpuhw->mmcr[2] & ~MMCRA_SAMPLE_ENABLE); + if (cpuhw->mmcr[2] & MMCRA_SAMPLE_ENABLE) + mmcra = cpuhw->mmcr[2] & ~MMCRA_SAMPLE_ENABLE; + + /* Disable BHRB via mmcra [:26] for p10 if needed */ + if (!(cpuhw->mmcr[2] & MMCRA_BHRB_DISABLE)) + mmcra |= MMCRA_BHRB_DISABLE; + + /* Write SPRN_MMCRA if mmcra has either disabled +* instruction sampling or BHRB +*/ + if (val != mmcra) { + mtspr(SPRN_MMCRA, mmcra); mb(); isync(); } diff --git a/arch/powerpc/perf/isa207-common.c b/arch/powerpc/perf/isa207-common.c index 7d4839e..463d925 100644 --- a/arch/powerpc/perf/isa207-common.c +++ b/arch/powerpc/perf/isa207-common.c @@ -404,6 +404,12 @@ int isa207_compute_mmcr(u64 event[], int n_ev, mmcra = mmcr1 = mmcr2 = mmcr3 = 0; + /* Disable bhrb unless explicitly requested +* by setting MMCRA [:26] bit. +*/ + if (cpu_has_feature(CPU_FTR_ARCH_31)) + mmcra |= MMCRA_BHRB_DISABLE; + /* Second pass: assign PMCs, set all MMCR1 fields */ for (i = 0; i < n_ev; ++i) { pmc = (event[i] >> EVENT_PMC_SHIFT) & EVENT_PMC_MASK; @@ -475,10 +481,17 @@ int isa207_compute_mmcr(u64 event[], int n_ev, } if (event[i] & EVENT_WANTS_BHRB) { + /* set MMCRA[:26] to 0 for Power10 to enable BHRB */ + if (cpu_has_feature(CPU_FTR_ARCH_31)) + mmcra &= ~MMCRA_BHRB_DISABLE; val = (event[i] >> EVENT_IFM_SHIFT) & EVENT_IFM_MASK; m
[PATCH v2 05/10] powerpc/perf: Update Power PMU cache_events to u64 type
Events of type PERF_TYPE_HW_CACHE was described for Power PMU as: int (*cache_events)[type][op][result]; where type, op, result values unpacked from the event attribute config value is used to generate the raw event code at runtime. So far the event code values which used to create these cache-related events were within 32 bit and `int` type worked. In power10, some of the event codes are of 64-bit value and hence update the Power PMU cache_events to `u64` type in `power_pmu` struct. Also propagate this change to existing all PMU driver code paths which are using ppmu->cache_events. Signed-off-by: Athira Rajeev --- arch/powerpc/include/asm/perf_event_server.h | 2 +- arch/powerpc/perf/core-book3s.c | 2 +- arch/powerpc/perf/generic-compat-pmu.c | 2 +- arch/powerpc/perf/mpc7450-pmu.c | 2 +- arch/powerpc/perf/power5+-pmu.c | 2 +- arch/powerpc/perf/power5-pmu.c | 2 +- arch/powerpc/perf/power6-pmu.c | 2 +- arch/powerpc/perf/power7-pmu.c | 2 +- arch/powerpc/perf/power8-pmu.c | 2 +- arch/powerpc/perf/power9-pmu.c | 2 +- arch/powerpc/perf/ppc970-pmu.c | 2 +- 11 files changed, 11 insertions(+), 11 deletions(-) diff --git a/arch/powerpc/include/asm/perf_event_server.h b/arch/powerpc/include/asm/perf_event_server.h index 895aeaa..cb207f8 100644 --- a/arch/powerpc/include/asm/perf_event_server.h +++ b/arch/powerpc/include/asm/perf_event_server.h @@ -47,7 +47,7 @@ struct power_pmu { const struct attribute_group**attr_groups; int n_generic; int *generic_events; - int (*cache_events)[PERF_COUNT_HW_CACHE_MAX] + u64 (*cache_events)[PERF_COUNT_HW_CACHE_MAX] [PERF_COUNT_HW_CACHE_OP_MAX] [PERF_COUNT_HW_CACHE_RESULT_MAX]; diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c index 5c64bd3..58bfb9a 100644 --- a/arch/powerpc/perf/core-book3s.c +++ b/arch/powerpc/perf/core-book3s.c @@ -1820,7 +1820,7 @@ static void hw_perf_event_destroy(struct perf_event *event) static int hw_perf_cache_event(u64 config, u64 *eventp) { unsigned long type, op, result; - int ev; + 
u64 ev; if (!ppmu->cache_events) return -EINVAL; diff --git a/arch/powerpc/perf/generic-compat-pmu.c b/arch/powerpc/perf/generic-compat-pmu.c index 5e5a54d..eb8a6aaf 100644 --- a/arch/powerpc/perf/generic-compat-pmu.c +++ b/arch/powerpc/perf/generic-compat-pmu.c @@ -101,7 +101,7 @@ enum { * 0 means not supported, -1 means nonsensical, other values * are event codes. */ -static int generic_compat_cache_events[C(MAX)][C(OP_MAX)][C(RESULT_MAX)] = { +static u64 generic_compat_cache_events[C(MAX)][C(OP_MAX)][C(RESULT_MAX)] = { [ C(L1D) ] = { [ C(OP_READ) ] = { [ C(RESULT_ACCESS) ] = 0, diff --git a/arch/powerpc/perf/mpc7450-pmu.c b/arch/powerpc/perf/mpc7450-pmu.c index 4d5ef92..cf1eb89 100644 --- a/arch/powerpc/perf/mpc7450-pmu.c +++ b/arch/powerpc/perf/mpc7450-pmu.c @@ -354,7 +354,7 @@ static void mpc7450_disable_pmc(unsigned int pmc, unsigned long mmcr[]) * 0 means not supported, -1 means nonsensical, other values * are event codes. */ -static int mpc7450_cache_events[C(MAX)][C(OP_MAX)][C(RESULT_MAX)] = { +static u64 mpc7450_cache_events[C(MAX)][C(OP_MAX)][C(RESULT_MAX)] = { [C(L1D)] = {/* RESULT_ACCESS RESULT_MISS */ [C(OP_READ)] = {0, 0x225 }, [C(OP_WRITE)] = { 0, 0x227 }, diff --git a/arch/powerpc/perf/power5+-pmu.c b/arch/powerpc/perf/power5+-pmu.c index f857454..9252281 100644 --- a/arch/powerpc/perf/power5+-pmu.c +++ b/arch/powerpc/perf/power5+-pmu.c @@ -618,7 +618,7 @@ static void power5p_disable_pmc(unsigned int pmc, unsigned long mmcr[]) * 0 means not supported, -1 means nonsensical, other values * are event codes. 
*/ -static int power5p_cache_events[C(MAX)][C(OP_MAX)][C(RESULT_MAX)] = { +static u64 power5p_cache_events[C(MAX)][C(OP_MAX)][C(RESULT_MAX)] = { [C(L1D)] = {/* RESULT_ACCESS RESULT_MISS */ [C(OP_READ)] = {0x1c10a8, 0x3c1088}, [C(OP_WRITE)] = { 0x2c10a8, 0xc10c3 }, diff --git a/arch/powerpc/perf/power5-pmu.c b/arch/powerpc/perf/power5-pmu.c index da52eca..3b36630 100644 --- a/arch/powerpc/perf/power5-pmu.c +++ b/arch/powerpc/perf/power5-pmu.c @@ -560,7 +560,7 @@ static void power5_disable_pmc(unsigned int pmc, unsigned long mmcr[]) * 0 means not supported, -1 means nonsensical, other values * are event codes. */ -static int power5_cache_events[C(MAX)][C(OP_MAX)][C(RESULT_MAX)] = { +static u64 power5_cache_events[C(MAX)][C(OP_MAX)][C(RESULT_MAX)] = { [C(L1D)] = {/* RESULT_ACCESS RESULT_MISS *
[PATCH v2 06/10] powerpc/perf: power10 Performance Monitoring support
Base enablement patch to register performance monitoring hardware support for power10. Patch introduce the raw event encoding format, defines the supported list of events, config fields for the event attributes and their corresponding bit values which are exported via sysfs. Patch also enhances the support function in isa207_common.c to include power10 pmu hardware. [Enablement of base PMU driver code] Signed-off-by: Madhavan Srinivasan [Addition of ISA macros for counter support functions] Signed-off-by: Athira Rajeev --- arch/powerpc/perf/Makefile | 2 +- arch/powerpc/perf/core-book3s.c | 2 + arch/powerpc/perf/internal.h| 1 + arch/powerpc/perf/isa207-common.c | 59 - arch/powerpc/perf/isa207-common.h | 33 ++- arch/powerpc/perf/power10-events-list.h | 70 ++ arch/powerpc/perf/power10-pmu.c | 410 7 files changed, 566 insertions(+), 11 deletions(-) create mode 100644 arch/powerpc/perf/power10-events-list.h create mode 100644 arch/powerpc/perf/power10-pmu.c diff --git a/arch/powerpc/perf/Makefile b/arch/powerpc/perf/Makefile index 53d614e..c02854d 100644 --- a/arch/powerpc/perf/Makefile +++ b/arch/powerpc/perf/Makefile @@ -9,7 +9,7 @@ obj-$(CONFIG_PPC_PERF_CTRS) += core-book3s.o bhrb.o obj64-$(CONFIG_PPC_PERF_CTRS) += ppc970-pmu.o power5-pmu.o \ power5+-pmu.o power6-pmu.o power7-pmu.o \ isa207-common.o power8-pmu.o power9-pmu.o \ - generic-compat-pmu.o + generic-compat-pmu.o power10-pmu.o obj32-$(CONFIG_PPC_PERF_CTRS) += mpc7450-pmu.o obj-$(CONFIG_PPC_POWERNV) += imc-pmu.o diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c index 58bfb9a..fad5159 100644 --- a/arch/powerpc/perf/core-book3s.c +++ b/arch/powerpc/perf/core-book3s.c @@ -2333,6 +2333,8 @@ static int __init init_ppc64_pmu(void) return 0; else if (!init_power9_pmu()) return 0; + else if (!init_power10_pmu()) + return 0; else if (!init_ppc970_pmu()) return 0; else diff --git a/arch/powerpc/perf/internal.h b/arch/powerpc/perf/internal.h index f755c64..80bbf72 100644 --- 
a/arch/powerpc/perf/internal.h +++ b/arch/powerpc/perf/internal.h @@ -9,4 +9,5 @@ extern int init_power7_pmu(void); extern int init_power8_pmu(void); extern int init_power9_pmu(void); +extern int init_power10_pmu(void); extern int init_generic_compat_pmu(void); diff --git a/arch/powerpc/perf/isa207-common.c b/arch/powerpc/perf/isa207-common.c index 4c86da5..7d4839e 100644 --- a/arch/powerpc/perf/isa207-common.c +++ b/arch/powerpc/perf/isa207-common.c @@ -55,7 +55,9 @@ static bool is_event_valid(u64 event) { u64 valid_mask = EVENT_VALID_MASK; - if (cpu_has_feature(CPU_FTR_ARCH_300)) + if (cpu_has_feature(CPU_FTR_ARCH_31)) + valid_mask = p10_EVENT_VALID_MASK; + else if (cpu_has_feature(CPU_FTR_ARCH_300)) valid_mask = p9_EVENT_VALID_MASK; return !(event & ~valid_mask); @@ -69,6 +71,14 @@ static inline bool is_event_marked(u64 event) return false; } +static unsigned long sdar_mod_val(u64 event) +{ + if (cpu_has_feature(CPU_FTR_ARCH_31)) + return p10_SDAR_MODE(event); + + return p9_SDAR_MODE(event); +} + static void mmcra_sdar_mode(u64 event, unsigned long *mmcra) { /* @@ -79,7 +89,7 @@ static void mmcra_sdar_mode(u64 event, unsigned long *mmcra) * MMCRA[SDAR_MODE] will be programmed as "0b01" for continous sampling * mode and will be un-changed when setting MMCRA[63] (Marked events). * -* Incase of Power9: +* Incase of Power9/power10: * Marked event: MMCRA[SDAR_MODE] will be set to 0b00 ('No Updates'), * or if group already have any marked events. 
* For rest @@ -90,8 +100,8 @@ static void mmcra_sdar_mode(u64 event, unsigned long *mmcra) if (cpu_has_feature(CPU_FTR_ARCH_300)) { if (is_event_marked(event) || (*mmcra & MMCRA_SAMPLE_ENABLE)) *mmcra &= MMCRA_SDAR_MODE_NO_UPDATES; - else if (p9_SDAR_MODE(event)) - *mmcra |= p9_SDAR_MODE(event) << MMCRA_SDAR_MODE_SHIFT; + else if (sdar_mod_val(event)) + *mmcra |= sdar_mod_val(event) << MMCRA_SDAR_MODE_SHIFT; else *mmcra |= MMCRA_SDAR_MODE_DCACHE; } else @@ -134,7 +144,11 @@ static bool is_thresh_cmp_valid(u64 event) /* * Check the mantissa upper two bits are not zero, unless the * exponent is also zero. See the THRESH_CMP_MANTISSA doc. +* Power10: thresh_cmp is replaced by l2_l3 event select. */ + if (cpu_has_feature(CPU_FTR_ARCH_31)) + re
[PATCH v2 04/10] powerpc/perf: Add power10_feat to dt_cpu_ftrs
From: Madhavan Srinivasan Add power10 feature function to dt_cpu_ftrs.c along with a power10 specific init() to initialize pmu sprs. Signed-off-by: Madhavan Srinivasan --- arch/powerpc/include/asm/reg.h| 3 +++ arch/powerpc/kernel/cpu_setup_power.S | 7 +++ arch/powerpc/kernel/dt_cpu_ftrs.c | 26 ++ 3 files changed, 36 insertions(+) diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h index 21a1b2d..900ada1 100644 --- a/arch/powerpc/include/asm/reg.h +++ b/arch/powerpc/include/asm/reg.h @@ -1068,6 +1068,9 @@ #define MMCR0_PMC2_LOADMISSTIME0x5 #endif +/* BHRB disable bit for PowerISA v3.10 */ +#define MMCRA_BHRB_DISABLE 0x0020 + /* * SPRG usage: * diff --git a/arch/powerpc/kernel/cpu_setup_power.S b/arch/powerpc/kernel/cpu_setup_power.S index efdcfa7..e8b3370c 100644 --- a/arch/powerpc/kernel/cpu_setup_power.S +++ b/arch/powerpc/kernel/cpu_setup_power.S @@ -233,3 +233,10 @@ __init_PMU_ISA207: li r5,0 mtspr SPRN_MMCRS,r5 blr + +__init_PMU_ISA31: + li r5,0 + mtspr SPRN_MMCR3,r5 + LOAD_REG_IMMEDIATE(r5, MMCRA_BHRB_DISABLE) + mtspr SPRN_MMCRA,r5 + blr diff --git a/arch/powerpc/kernel/dt_cpu_ftrs.c b/arch/powerpc/kernel/dt_cpu_ftrs.c index a0edeb3..14a513f 100644 --- a/arch/powerpc/kernel/dt_cpu_ftrs.c +++ b/arch/powerpc/kernel/dt_cpu_ftrs.c @@ -449,6 +449,31 @@ static int __init feat_enable_pmu_power9(struct dt_cpu_feature *f) return 1; } +static void init_pmu_power10(void) +{ + init_pmu_power9(); + + mtspr(SPRN_MMCR3, 0); + mtspr(SPRN_MMCRA, MMCRA_BHRB_DISABLE); +} + +static int __init feat_enable_pmu_power10(struct dt_cpu_feature *f) +{ + hfscr_pmu_enable(); + + init_pmu_power10(); + init_pmu_registers = init_pmu_power10; + + cur_cpu_spec->cpu_features |= CPU_FTR_MMCRA; + cur_cpu_spec->cpu_user_features |= PPC_FEATURE_PSERIES_PERFMON_COMPAT; + + cur_cpu_spec->num_pmcs = 6; + cur_cpu_spec->pmc_type = PPC_PMC_IBM; + cur_cpu_spec->oprofile_cpu_type = "ppc64/power10"; + + return 1; +} + static int __init feat_enable_tm(struct dt_cpu_feature *f) { 
#ifdef CONFIG_PPC_TRANSACTIONAL_MEM @@ -638,6 +663,7 @@ struct dt_cpu_feature_match { {"pc-relative-addressing", feat_enable, 0}, {"machine-check-power9", feat_enable_mce_power9, 0}, {"performance-monitor-power9", feat_enable_pmu_power9, 0}, + {"performance-monitor-power10", feat_enable_pmu_power10, 0}, {"event-based-branch-v3", feat_enable, 0}, {"random-number-generator", feat_enable, 0}, {"system-call-vectored", feat_disable, 0}, -- 1.8.3.1
[PATCH v2 03/10] powerpc/xmon: Add PowerISA v3.1 PMU SPRs
From: Madhavan Srinivasan PowerISA v3.1 added three new performance monitoring unit (PMU) special purpose registers (SPRs): Monitor Mode Control Register 3 (MMCR3), Sampled Instruction Event Register 2 (SIER2) and Sampled Instruction Event Register 3 (SIER3). This patch adds a new dump function, dump_310_sprs(), to print these SPR values. Signed-off-by: Madhavan Srinivasan --- arch/powerpc/xmon/xmon.c | 15 +++ 1 file changed, 15 insertions(+) diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c index 7efe4bc..8917fe8 100644 --- a/arch/powerpc/xmon/xmon.c +++ b/arch/powerpc/xmon/xmon.c @@ -2022,6 +2022,20 @@ static void dump_300_sprs(void) #endif } +static void dump_310_sprs(void) +{ +#ifdef CONFIG_PPC64 + if (!cpu_has_feature(CPU_FTR_ARCH_31)) + return; + + printf("mmcr3 = %.16lx\n", + mfspr(SPRN_MMCR3)); + + printf("sier2 = %.16lx sier3 = %.16lx\n", + mfspr(SPRN_SIER2), mfspr(SPRN_SIER3)); +#endif +} + static void dump_one_spr(int spr, bool show_unimplemented) { unsigned long val; @@ -2076,6 +2090,7 @@ static void super_regs(void) dump_206_sprs(); dump_207_sprs(); dump_300_sprs(); + dump_310_sprs(); return; } -- 1.8.3.1
[PATCH v2 02/10] KVM: PPC: Book3S HV: Save/restore new PMU registers
PowerISA v3.1 has added new performance monitoring unit (PMU) special purpose registers (SPRs): Monitor Mode Control Register 3 (MMCR3), Sampled Instruction Event Register 2 (SIER2) and Sampled Instruction Event Register 3 (SIER3). The patch adds support to save/restore these new SPRs while entering/exiting the guest. Signed-off-by: Athira Rajeev --- arch/powerpc/include/asm/kvm_book3s_asm.h | 2 +- arch/powerpc/include/asm/kvm_host.h | 4 ++-- arch/powerpc/kernel/asm-offsets.c | 3 +++ arch/powerpc/kvm/book3s_hv.c | 6 -- arch/powerpc/kvm/book3s_hv_interrupts.S | 8 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 24 6 files changed, 42 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/include/asm/kvm_book3s_asm.h b/arch/powerpc/include/asm/kvm_book3s_asm.h index 45704f2..078f464 100644 --- a/arch/powerpc/include/asm/kvm_book3s_asm.h +++ b/arch/powerpc/include/asm/kvm_book3s_asm.h @@ -119,7 +119,7 @@ struct kvmppc_host_state { void __iomem *xive_tima_virt; u32 saved_xirr; u64 dabr; - u64 host_mmcr[7]; /* MMCR 0,1,A, SIAR, SDAR, MMCR2, SIER */ + u64 host_mmcr[10]; /* MMCR 0,1,A, SIAR, SDAR, MMCR2, SIER, MMCR3, SIER2/3 */ u32 host_pmc[8]; u64 host_purr; u64 host_spurr; diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index 7e2d061..d718061 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -637,12 +637,12 @@ struct kvm_vcpu_arch { u32 ccr1; u32 dbsr; - u64 mmcr[5]; + u64 mmcr[6]; u32 pmc[8]; u32 spmc[2]; u64 siar; u64 sdar; - u64 sier; + u64 sier[3]; #ifdef CONFIG_PPC_TRANSACTIONAL_MEM u64 tfhar; u64 texasr; diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c index 6657dc6..20a8b1e 100644 --- a/arch/powerpc/kernel/asm-offsets.c +++ b/arch/powerpc/kernel/asm-offsets.c @@ -696,6 +696,9 @@ int main(void) HSTATE_FIELD(HSTATE_SDAR, host_mmcr[4]); HSTATE_FIELD(HSTATE_MMCR2, host_mmcr[5]); HSTATE_FIELD(HSTATE_SIER, host_mmcr[6]); + HSTATE_FIELD(HSTATE_MMCR3, host_mmcr[7]); 
+ HSTATE_FIELD(HSTATE_SIER2, host_mmcr[8]); + HSTATE_FIELD(HSTATE_SIER3, host_mmcr[9]); HSTATE_FIELD(HSTATE_PMC1, host_pmc[0]); HSTATE_FIELD(HSTATE_PMC2, host_pmc[1]); HSTATE_FIELD(HSTATE_PMC3, host_pmc[2]); diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c index 6bf66649..c265800 100644 --- a/arch/powerpc/kvm/book3s_hv.c +++ b/arch/powerpc/kvm/book3s_hv.c @@ -1698,7 +1698,8 @@ static int kvmppc_get_one_reg_hv(struct kvm_vcpu *vcpu, u64 id, *val = get_reg_val(id, vcpu->arch.sdar); break; case KVM_REG_PPC_SIER: - *val = get_reg_val(id, vcpu->arch.sier); + i = id - KVM_REG_PPC_SIER; + *val = get_reg_val(id, vcpu->arch.sier[i]); break; case KVM_REG_PPC_IAMR: *val = get_reg_val(id, vcpu->arch.iamr); @@ -1919,7 +1920,8 @@ static int kvmppc_set_one_reg_hv(struct kvm_vcpu *vcpu, u64 id, vcpu->arch.sdar = set_reg_val(id, *val); break; case KVM_REG_PPC_SIER: - vcpu->arch.sier = set_reg_val(id, *val); + i = id - KVM_REG_PPC_SIER; + vcpu->arch.sier[i] = set_reg_val(id, *val); break; case KVM_REG_PPC_IAMR: vcpu->arch.iamr = set_reg_val(id, *val); diff --git a/arch/powerpc/kvm/book3s_hv_interrupts.S b/arch/powerpc/kvm/book3s_hv_interrupts.S index 63fd81f..59822cb 100644 --- a/arch/powerpc/kvm/book3s_hv_interrupts.S +++ b/arch/powerpc/kvm/book3s_hv_interrupts.S @@ -140,6 +140,14 @@ BEGIN_FTR_SECTION std r8, HSTATE_MMCR2(r13) std r9, HSTATE_SIER(r13) END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S) +BEGIN_FTR_SECTION + mfspr r5, SPRN_MMCR3 + mfspr r6, SPRN_SIER2 + mfspr r7, SPRN_SIER3 + std r5, HSTATE_MMCR3(r13) + std r6, HSTATE_SIER2(r13) + std r7, HSTATE_SIER3(r13) +END_FTR_SECTION_IFSET(CPU_FTR_ARCH_31) mfspr r3, SPRN_PMC1 mfspr r5, SPRN_PMC2 mfspr r6, SPRN_PMC3 diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S index 7194389..57b6c14 100644 --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S @@ -3436,6 +3436,14 @@ END_FTR_SECTION_IFSET(CPU_FTR_PMAO_BUG) mtspr SPRN_SIAR, r7 mtspr 
SPRN_SDAR, r8 BEGIN_FTR_SECTION + ld r5, VCPU_MMCR + 40(r4) + ld r6, VCPU_SIER + 8(r4) + ld r7, VCPU_SIER + 16(r4) + mtspr SPRN_MMCR3, r5 + mtspr SPRN_SIER2, r6 + mtspr SPRN_SIER3, r7 +END_FTR_SECTION_IFSET(CPU_FTR_ARCH_31) +BEGIN_FTR_SECTION
[PATCH v2 01/10] powerpc/perf: Add support for ISA3.1 PMU SPRs
From: Madhavan Srinivasan PowerISA v3.1 includes new performance monitoring unit(PMU) special purpose registers (SPRs). They are Monitor Mode Control Register 3 (MMCR3) Sampled Instruction Event Register 2 (SIER2) Sampled Instruction Event Register 3 (SIER3) MMCR3 is added for further sampling related configuration control. SIER2/SIER3 are added to provide additional information about the sampled instruction. Patch adds new PPMU flag called "PPMU_ARCH_310S" to support handling of these new SPRs, updates the struct thread_struct to include these new SPRs, increase the size of mmcr[] array by one to include MMCR3 in struct cpu_hw_event. This is needed to support programming of MMCR3 SPR during event_[enable/disable]. Patch also adds the sysfs support for the MMCR3 SPR along with SPRN_ macros for these new pmu sprs. Signed-off-by: Madhavan Srinivasan --- arch/powerpc/include/asm/perf_event_server.h | 1 + arch/powerpc/include/asm/processor.h | 4 arch/powerpc/include/asm/reg.h | 6 ++ arch/powerpc/kernel/sysfs.c | 8 arch/powerpc/perf/core-book3s.c | 29 ++-- 5 files changed, 46 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/perf_event_server.h b/arch/powerpc/include/asm/perf_event_server.h index 3e9703f..895aeaa 100644 --- a/arch/powerpc/include/asm/perf_event_server.h +++ b/arch/powerpc/include/asm/perf_event_server.h @@ -69,6 +69,7 @@ struct power_pmu { #define PPMU_HAS_SIER 0x0040 /* Has SIER */ #define PPMU_ARCH_207S 0x0080 /* PMC is architecture v2.07S */ #define PPMU_NO_SIAR 0x0100 /* Do not use SIAR */ +#define PPMU_ARCH_310S 0x0200 /* Has MMCR3, SIER2 and SIER3 */ /* * Values for flags to get_alternatives() diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h index 52a6783..a466e94 100644 --- a/arch/powerpc/include/asm/processor.h +++ b/arch/powerpc/include/asm/processor.h @@ -272,6 +272,10 @@ struct thread_struct { unsignedmmcr0; unsignedused_ebb; + unsigned long mmcr3; + unsigned long sier2; + unsigned 
long sier3; + #endif }; diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h index 88e6c78..21a1b2d 100644 --- a/arch/powerpc/include/asm/reg.h +++ b/arch/powerpc/include/asm/reg.h @@ -876,7 +876,9 @@ #define MMCR0_FCHV 0x0001UL /* freeze conditions in hypervisor mode */ #define SPRN_MMCR1 798 #define SPRN_MMCR2 785 +#define SPRN_MMCR3 754 #define SPRN_UMMCR2769 +#define SPRN_UMMCR3738 #define SPRN_MMCRA 0x312 #define MMCRA_SDSYNC 0x8000UL /* SDAR synced with SIAR */ #define MMCRA_SDAR_DCACHE_MISS 0x4000UL @@ -918,6 +920,10 @@ #define SIER_SIHV0x100 /* Sampled MSR_HV */ #define SIER_SIAR_VALID 0x040 /* SIAR contents valid */ #define SIER_SDAR_VALID 0x020 /* SDAR contents valid */ +#define SPRN_SIER2 752 +#define SPRN_SIER3 753 +#define SPRN_USIER2736 +#define SPRN_USIER3737 #define SPRN_SIAR 796 #define SPRN_SDAR 797 #define SPRN_TACR 888 diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c index 571b325..46b4ebc 100644 --- a/arch/powerpc/kernel/sysfs.c +++ b/arch/powerpc/kernel/sysfs.c @@ -622,8 +622,10 @@ void ppc_enable_pmcs(void) SYSFS_PMCSETUP(pmc8, SPRN_PMC8); SYSFS_PMCSETUP(mmcra, SPRN_MMCRA); +SYSFS_PMCSETUP(mmcr3, SPRN_MMCR3); static DEVICE_ATTR(mmcra, 0600, show_mmcra, store_mmcra); +static DEVICE_ATTR(mmcr3, 0600, show_mmcr3, store_mmcr3); #endif /* HAS_PPC_PMC56 */ @@ -886,6 +888,9 @@ static int register_cpu_online(unsigned int cpu) #ifdef CONFIG_PMU_SYSFS if (cpu_has_feature(CPU_FTR_MMCRA)) device_create_file(s, &dev_attr_mmcra); + + if (cpu_has_feature(CPU_FTR_ARCH_31)) + device_create_file(s, &dev_attr_mmcr3); #endif /* CONFIG_PMU_SYSFS */ if (cpu_has_feature(CPU_FTR_PURR)) { @@ -980,6 +985,9 @@ static int unregister_cpu_online(unsigned int cpu) #ifdef CONFIG_PMU_SYSFS if (cpu_has_feature(CPU_FTR_MMCRA)) device_remove_file(s, &dev_attr_mmcra); + + if (cpu_has_feature(CPU_FTR_ARCH_31)) + device_remove_file(s, &dev_attr_mmcr3); #endif /* CONFIG_PMU_SYSFS */ if (cpu_has_feature(CPU_FTR_PURR)) { diff --git 
a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c index cd6a742..5c64bd3 100644 --- a/arch/powerpc/perf/core-book3s.c +++ b/arch/powerpc/perf/core-book3s.c @@ -39,10 +39,10 @@ struct cpu_hw_events { unsigned int flags[MAX_HWEVENTS]; /* * The order of the MMCR array is: -* - 64-bit, MMCR0, MMCR1, MMCRA, MMCR2 +* - 64-bit, MMCR0, MMCR1, MMCRA, MMCR2, MMCR3 * - 32-bit, MMCR0, MMCR1, MMCR2 */ - unsigned long mmcr[4]; +
[PATCH v2 00/10] powerpc/perf: Add support for power10 PMU Hardware
The patch series adds support for power10 PMU hardware. Anju T Sudhakar (2): powerpc/perf: Add support for outputting extended regs in perf intr_regs tools/perf: Add perf tools support for extended register capability in powerpc Athira Rajeev (5): KVM: PPC: Book3S HV: Save/restore new PMU registers powerpc/perf: Update Power PMU cache_events to u64 type powerpc/perf: power10 Performance Monitoring support powerpc/perf: support BHRB disable bit and new filtering modes powerpc/perf: Add extended regs support for power10 platform Madhavan Srinivasan (3): powerpc/perf: Add support for ISA3.1 PMU SPRs powerpc/xmon: Add PowerISA v3.1 PMU SPRs powerpc/perf: Add power10_feat to dt_cpu_ftrs --- Changes from v1 -> v2 - Added support for extended regs in powerpc for power9/power10 platform ( patches 8, 9, 10) - Addressed change/removal of some event codes in the PMU driver --- arch/powerpc/include/asm/kvm_book3s_asm.h | 2 +- arch/powerpc/include/asm/kvm_host.h | 4 +- arch/powerpc/include/asm/perf_event_server.h| 11 +- arch/powerpc/include/asm/processor.h| 4 + arch/powerpc/include/asm/reg.h | 9 + arch/powerpc/include/uapi/asm/perf_regs.h | 20 +- arch/powerpc/kernel/asm-offsets.c | 3 + arch/powerpc/kernel/cpu_setup_power.S | 7 + arch/powerpc/kernel/dt_cpu_ftrs.c | 26 ++ arch/powerpc/kernel/sysfs.c | 8 + arch/powerpc/kvm/book3s_hv.c| 6 +- arch/powerpc/kvm/book3s_hv_interrupts.S | 8 + arch/powerpc/kvm/book3s_hv_rmhandlers.S | 24 ++ arch/powerpc/perf/Makefile | 2 +- arch/powerpc/perf/core-book3s.c | 61 +++- arch/powerpc/perf/generic-compat-pmu.c | 2 +- arch/powerpc/perf/internal.h| 1 + arch/powerpc/perf/isa207-common.c | 72 +++- arch/powerpc/perf/isa207-common.h | 33 +- arch/powerpc/perf/mpc7450-pmu.c | 2 +- arch/powerpc/perf/perf_regs.c | 42 ++- arch/powerpc/perf/power10-events-list.h | 70 arch/powerpc/perf/power10-pmu.c | 425 arch/powerpc/perf/power5+-pmu.c | 2 +- arch/powerpc/perf/power5-pmu.c | 2 +- arch/powerpc/perf/power6-pmu.c | 2 +- arch/powerpc/perf/power7-pmu.c | 2 +- 
arch/powerpc/perf/power8-pmu.c | 2 +- arch/powerpc/perf/power9-pmu.c | 8 +- arch/powerpc/perf/ppc970-pmu.c | 2 +- arch/powerpc/platforms/powernv/idle.c | 14 + arch/powerpc/xmon/xmon.c| 15 + tools/arch/powerpc/include/uapi/asm/perf_regs.h | 20 +- tools/perf/arch/powerpc/include/perf_regs.h | 8 +- tools/perf/arch/powerpc/util/perf_regs.c| 61 35 files changed, 939 insertions(+), 41 deletions(-) create mode 100644 arch/powerpc/perf/power10-events-list.h create mode 100644 arch/powerpc/perf/power10-pmu.c -- 1.8.3.1
[PATCH 20/20] block: remove direct_make_request
Now that submit_bio_noacct has a decent blk-mq fast path there is no more need for this bypass. Signed-off-by: Christoph Hellwig --- block/blk-core.c | 28 drivers/md/dm.c | 5 + drivers/nvme/host/multipath.c | 2 +- include/linux/blkdev.h| 1 - 4 files changed, 2 insertions(+), 34 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index 2ff166f0d24ee3..bf882b8d84450c 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -1211,34 +1211,6 @@ blk_qc_t submit_bio_noacct(struct bio *bio) } EXPORT_SYMBOL(submit_bio_noacct); -/** - * direct_make_request - hand a buffer directly to its device driver for I/O - * @bio: The bio describing the location in memory and on the device. - * - * This function behaves like submit_bio_noacct(), but does not protect - * against recursion. Must only be used if the called driver is known - * to be blk-mq based. - */ -blk_qc_t direct_make_request(struct bio *bio) -{ - struct gendisk *disk = bio->bi_disk; - - if (WARN_ON_ONCE(!disk->queue->mq_ops)) { - bio_io_error(bio); - return BLK_QC_T_NONE; - } - if (!submit_bio_checks(bio)) - return BLK_QC_T_NONE; - if (unlikely(bio_queue_enter(bio))) - return BLK_QC_T_NONE; - if (!blk_crypto_bio_prep(&bio)) { - blk_queue_exit(disk->queue); - return BLK_QC_T_NONE; - } - return blk_mq_submit_bio(bio); -} -EXPORT_SYMBOL_GPL(direct_make_request); - /** * submit_bio - submit a bio to the block device layer for I/O * @bio: The &struct bio which describes the I/O diff --git a/drivers/md/dm.c b/drivers/md/dm.c index b32b539dbace56..2cb33896198c4c 100644 --- a/drivers/md/dm.c +++ b/drivers/md/dm.c @@ -1302,10 +1302,7 @@ static blk_qc_t __map_bio(struct dm_target_io *tio) /* the bio has been remapped so dispatch it */ trace_block_bio_remap(clone->bi_disk->queue, clone, bio_dev(io->orig_bio), sector); - if (md->type == DM_TYPE_NVME_BIO_BASED) - ret = direct_make_request(clone); - else - ret = submit_bio_noacct(clone); + ret = submit_bio_noacct(clone); break; case DM_MAPIO_KILL: free_tio(tio); diff 
--git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c index f07fa47c251d9d..a986ac52c4cc7f 100644 --- a/drivers/nvme/host/multipath.c +++ b/drivers/nvme/host/multipath.c @@ -314,7 +314,7 @@ blk_qc_t nvme_ns_head_submit_bio(struct bio *bio) trace_block_bio_remap(bio->bi_disk->queue, bio, disk_devt(ns->head->disk), bio->bi_iter.bi_sector); - ret = direct_make_request(bio); + ret = submit_bio_noacct(bio); } else if (nvme_available_path(head)) { dev_warn_ratelimited(dev, "no usable path - requeuing I/O\n"); diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index b73cfa6a5141df..1cc913ffdbe21e 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -853,7 +853,6 @@ static inline void rq_flush_dcache_pages(struct request *rq) extern int blk_register_queue(struct gendisk *disk); extern void blk_unregister_queue(struct gendisk *disk); blk_qc_t submit_bio_noacct(struct bio *bio); -extern blk_qc_t direct_make_request(struct bio *bio); extern void blk_rq_init(struct request_queue *q, struct request *rq); extern void blk_put_request(struct request *); extern struct request *blk_get_request(struct request_queue *, unsigned int op, -- 2.26.2
[PATCH 19/20] block: shortcut __submit_bio_noacct for blk-mq drivers
For blk-mq drivers bios can only be inserted for the same queue. So bypass the complicated sorting logic in __submit_bio_noacct with a simpler blk-mq submission helper. Signed-off-by: Christoph Hellwig --- block/blk-core.c | 30 ++ 1 file changed, 30 insertions(+) diff --git a/block/blk-core.c b/block/blk-core.c index 57b5dc00d44cb1..2ff166f0d24ee3 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -1152,6 +1152,34 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio) return ret; } +static blk_qc_t __submit_bio_noacct_mq(struct bio *bio) +{ + struct gendisk *disk = bio->bi_disk; + struct bio_list bio_list; + blk_qc_t ret = BLK_QC_T_NONE; + + bio_list_init(&bio_list); + current->bio_list = &bio_list; + + do { + WARN_ON_ONCE(bio->bi_disk != disk); + + if (unlikely(bio_queue_enter(bio) != 0)) + continue; + + if (!blk_crypto_bio_prep(&bio)) { + blk_queue_exit(disk->queue); + ret = BLK_QC_T_NONE; + continue; + } + + ret = blk_mq_submit_bio(bio); + } while ((bio = bio_list_pop(&bio_list))); + + current->bio_list = NULL; + return ret; +} + /** * submit_bio_noacct - re-submit a bio to the block device layer for I/O * @bio: The bio describing the location in memory and on the device. @@ -1177,6 +1205,8 @@ blk_qc_t submit_bio_noacct(struct bio *bio) return BLK_QC_T_NONE; } + if (!bio->bi_disk->fops->submit_bio) + return __submit_bio_noacct_mq(bio); return __submit_bio_noacct(bio); } EXPORT_SYMBOL(submit_bio_noacct); -- 2.26.2
[PATCH 17/20] block: rename generic_make_request to submit_bio_noacct
generic_make_request has always been very confusingly misnamed, so rename it to submit_bio_noacct to make it clear that it is submit_bio minus accounting and a few checks. Signed-off-by: Christoph Hellwig --- Documentation/block/biodoc.rst| 2 +- .../fault-injection/fault-injection.rst | 2 +- Documentation/trace/ftrace.rst| 4 +-- block/bio.c | 14 block/blk-core.c | 32 +-- block/blk-crypto-fallback.c | 2 +- block/blk-crypto.c| 2 +- block/blk-merge.c | 2 +- block/blk-throttle.c | 4 +-- block/bounce.c| 2 +- drivers/block/drbd/drbd_int.h | 6 ++-- drivers/block/drbd/drbd_main.c| 2 +- drivers/block/drbd/drbd_receiver.c| 2 +- drivers/block/drbd/drbd_req.c | 2 +- drivers/block/drbd/drbd_worker.c | 2 +- drivers/block/pktcdvd.c | 2 +- drivers/lightnvm/pblk-read.c | 2 +- drivers/md/bcache/bcache.h| 2 +- drivers/md/bcache/btree.c | 2 +- drivers/md/bcache/request.c | 7 ++-- drivers/md/dm-cache-target.c | 6 ++-- drivers/md/dm-clone-target.c | 10 +++--- drivers/md/dm-crypt.c | 6 ++-- drivers/md/dm-delay.c | 2 +- drivers/md/dm-era-target.c| 2 +- drivers/md/dm-integrity.c | 4 +-- drivers/md/dm-mpath.c | 2 +- drivers/md/dm-raid1.c | 2 +- drivers/md/dm-snap-persistent.c | 2 +- drivers/md/dm-snap.c | 6 ++-- drivers/md/dm-thin.c | 4 +-- drivers/md/dm-verity-target.c | 2 +- drivers/md/dm-writecache.c| 2 +- drivers/md/dm-zoned-target.c | 2 +- drivers/md/dm.c | 10 +++--- drivers/md/md-faulty.c| 4 +-- drivers/md/md-linear.c| 4 +-- drivers/md/md-multipath.c | 4 +-- drivers/md/raid0.c| 8 ++--- drivers/md/raid1.c| 14 drivers/md/raid10.c | 28 drivers/md/raid5.c| 10 +++--- drivers/nvme/host/multipath.c | 2 +- include/linux/blkdev.h| 2 +- 44 files changed, 115 insertions(+), 118 deletions(-) diff --git a/Documentation/block/biodoc.rst b/Documentation/block/biodoc.rst index 267384159bf793..afda5e30a82e5a 100644 --- a/Documentation/block/biodoc.rst +++ b/Documentation/block/biodoc.rst @@ -1036,7 +1036,7 @@ Now the generic block layer performs partition-remapping early and thus provides drivers with 
a sector number relative to whole device, rather than having to take partition number into account in order to arrive at the true sector number. The routine blk_partition_remap() is invoked by -generic_make_request even before invoking the queue specific ->submit_bio, +submit_bio_noacct even before invoking the queue specific ->submit_bio, so the i/o scheduler also gets to operate on whole disk sector numbers. This should typically not require changes to block drivers, it just never gets to invoke its own partition sector offset calculations since all bios diff --git a/Documentation/fault-injection/fault-injection.rst b/Documentation/fault-injection/fault-injection.rst index f51bb21d20e44b..f850ad018b70a8 100644 --- a/Documentation/fault-injection/fault-injection.rst +++ b/Documentation/fault-injection/fault-injection.rst @@ -24,7 +24,7 @@ Available fault injection capabilities injects disk IO errors on devices permitted by setting /sys/block//make-it-fail or - /sys/block///make-it-fail. (generic_make_request()) + /sys/block///make-it-fail. (submit_bio_noacct()) - fail_mmc_request diff --git a/Documentation/trace/ftrace.rst b/Documentation/trace/ftrace.rst index 430a16283103d4..80ba765a82379e 100644 --- a/Documentation/trace/ftrace.rst +++ b/Documentation/trace/ftrace.rst @@ -1453,7 +1453,7 @@ function-trace, we get a much larger output:: => __blk_run_queue_uncond => __blk_run_queue => blk_queue_bio - => generic_make_request + => submit_bio_noacct => submit_bio => submit_bh => __ext3_get_inode_loc @@ -1738,7 +1738,7 @@ tracers. => __blk_run_queue_uncond => __blk_run_queue => blk_queue_bio - => generic_make_request + => submit_bio_noacct => submit_bio => submit_bh => ext3_bread diff --git a/block/bio.c b/block/bio.c index fc1299f9d86a24..ef91782fd668ce 100644 --- a/block/bio.c +++ b/block/bio.c @@ -358,7 +358,7 @@ st
[PATCH 18/20] block: refactor submit_bio_noacct
Split out a __submit_bio_noacct helper for the actual de-recursion algorithm, and simplify the loop by using a continue when we can't enter the queue for a bio. Signed-off-by: Christoph Hellwig --- block/blk-core.c | 143 +-- 1 file changed, 75 insertions(+), 68 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index ff9a88d2d244cb..57b5dc00d44cb1 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -1084,6 +1084,74 @@ static blk_qc_t __submit_bio(struct bio *bio) return ret; } +/* + * The loop in this function may be a bit non-obvious, and so deserves some + * explanation: + * + * - Before entering the loop, bio->bi_next is NULL (as all callers ensure + *that), so we have a list with a single bio. + * - We pretend that we have just taken it off a longer list, so we assign + *bio_list to a pointer to the bio_list_on_stack, thus initialising the + *bio_list of new bios to be added. ->submit_bio() may indeed add some more + *bios through a recursive call to submit_bio_noacct. If it did, we find a + *non-NULL value in bio_list and re-enter the loop from the top. + * - In this case we really did just take the bio of the top of the list (no + *pretending) and so remove it from bio_list, and call into ->submit_bio() + *again. + * + * bio_list_on_stack[0] contains bios submitted by the current ->submit_bio. + * bio_list_on_stack[1] contains bios that were submitted before the current + * ->submit_bio_bio, but that haven't been processed yet. + */ +static blk_qc_t __submit_bio_noacct(struct bio *bio) +{ + struct bio_list bio_list_on_stack[2]; + blk_qc_t ret = BLK_QC_T_NONE; + + BUG_ON(bio->bi_next); + + bio_list_init(&bio_list_on_stack[0]); + current->bio_list = bio_list_on_stack; + + do { + struct request_queue *q = bio->bi_disk->queue; + struct bio_list lower, same; + + if (unlikely(bio_queue_enter(bio) != 0)) + continue; + + /* +* Create a fresh bio_list for all subordinate requests. 
+*/ + bio_list_on_stack[1] = bio_list_on_stack[0]; + bio_list_init(&bio_list_on_stack[0]); + + ret = __submit_bio(bio); + + /* +* Sort new bios into those for a lower level and those for the +* same level. +*/ + bio_list_init(&lower); + bio_list_init(&same); + while ((bio = bio_list_pop(&bio_list_on_stack[0])) != NULL) + if (q == bio->bi_disk->queue) + bio_list_add(&same, bio); + else + bio_list_add(&lower, bio); + + /* +* Now assemble so we handle the lowest level first. +*/ + bio_list_merge(&bio_list_on_stack[0], &lower); + bio_list_merge(&bio_list_on_stack[0], &same); + bio_list_merge(&bio_list_on_stack[0], &bio_list_on_stack[1]); + } while ((bio = bio_list_pop(&bio_list_on_stack[0]))); + + current->bio_list = NULL; + return ret; +} + /** * submit_bio_noacct - re-submit a bio to the block device layer for I/O * @bio: The bio describing the location in memory and on the device. @@ -1095,82 +1163,21 @@ static blk_qc_t __submit_bio(struct bio *bio) */ blk_qc_t submit_bio_noacct(struct bio *bio) { - /* -* bio_list_on_stack[0] contains bios submitted by the current -* ->submit_bio. -* bio_list_on_stack[1] contains bios that were submitted before the -* current ->submit_bio_bio, but that haven't been processed yet. -*/ - struct bio_list bio_list_on_stack[2]; - blk_qc_t ret = BLK_QC_T_NONE; - if (!submit_bio_checks(bio)) - goto out; + return BLK_QC_T_NONE; /* -* We only want one ->submit_bio to be active at a time, else -* stack usage with stacked devices could be a problem. So use -* current->bio_list to keep a list of requests submited by a -* ->submit_bio method. current->bio_list is also used as a -* flag to say if submit_bio_noacct is currently active in this -* task or not. If it is NULL, then no make_request is active. If -* it is non-NULL, then a make_request is active, and new requests -* should be added at the tail +* We only want one ->submit_bio to be active at a time, else stack +* usage with stacked devices could be a problem. 
Use current->bio_list +* to collect a list of requests submited by a ->submit_bio method while +* it is active, and then process them after it returned. */ if (current->bio_list) { bio_list_add(¤t->bio_list[0], bio); - goto out; +
[PATCH 16/20] block: move ->make_request_fn to struct block_device_operations
The make_request_fn is a little weird in that it sits directly in struct request_queue instead of an operation vector. Replace it with a block_device_operations method called submit_bio (which describes much better what it does). Also remove the request_queue argument to it, as the queue can be derived pretty trivially from the bio. Signed-off-by: Christoph Hellwig --- Documentation/block/biodoc.rst| 2 +- .../block/writeback_cache_control.rst | 2 +- arch/m68k/emu/nfblock.c | 5 +- arch/xtensa/platforms/iss/simdisk.c | 5 +- block/blk-cgroup.c| 2 +- block/blk-core.c | 53 +++ block/blk-mq.c| 10 ++-- block/blk.h | 2 - drivers/block/brd.c | 5 +- drivers/block/drbd/drbd_int.h | 2 +- drivers/block/drbd/drbd_main.c| 9 ++-- drivers/block/drbd/drbd_req.c | 2 +- drivers/block/null_blk_main.c | 17 -- drivers/block/pktcdvd.c | 11 ++-- drivers/block/ps3vram.c | 15 +++--- drivers/block/rsxx/dev.c | 7 ++- drivers/block/umem.c | 5 +- drivers/block/zram/zram_drv.c | 11 ++-- drivers/lightnvm/core.c | 8 +-- drivers/lightnvm/pblk-init.c | 12 +++-- drivers/md/bcache/request.c | 4 +- drivers/md/bcache/request.h | 4 +- drivers/md/bcache/super.c | 23 +--- drivers/md/dm.c | 23 drivers/md/md.c | 5 +- drivers/nvdimm/blk.c | 5 +- drivers/nvdimm/btt.c | 5 +- drivers/nvdimm/pmem.c | 5 +- drivers/nvme/host/core.c | 1 + drivers/nvme/host/multipath.c | 5 +- drivers/nvme/host/nvme.h | 1 + drivers/s390/block/dcssblk.c | 9 ++-- drivers/s390/block/xpram.c| 6 +-- include/linux/blk-mq.h| 2 +- include/linux/blkdev.h| 7 +-- include/linux/lightnvm.h | 3 +- 36 files changed, 153 insertions(+), 140 deletions(-) diff --git a/Documentation/block/biodoc.rst b/Documentation/block/biodoc.rst index b964796ec9c780..267384159bf793 100644 --- a/Documentation/block/biodoc.rst +++ b/Documentation/block/biodoc.rst @@ -1036,7 +1036,7 @@ Now the generic block layer performs partition-remapping early and thus provides drivers with a sector number relative to whole device, rather than having to take partition number into 
account in order to arrive at the true sector number. The routine blk_partition_remap() is invoked by -generic_make_request even before invoking the queue specific make_request_fn, +generic_make_request even before invoking the queue specific ->submit_bio, so the i/o scheduler also gets to operate on whole disk sector numbers. This should typically not require changes to block drivers, it just never gets to invoke its own partition sector offset calculations since all bios diff --git a/Documentation/block/writeback_cache_control.rst b/Documentation/block/writeback_cache_control.rst index 2c752c57c14c62..b208488d0aae85 100644 --- a/Documentation/block/writeback_cache_control.rst +++ b/Documentation/block/writeback_cache_control.rst @@ -47,7 +47,7 @@ the Forced Unit Access is implemented. The REQ_PREFLUSH and REQ_FUA flags may both be set on a single bio. -Implementation details for make_request_fn based block drivers +Implementation details for bio based block drivers -- These drivers will always see the REQ_PREFLUSH and REQ_FUA bits as they sit diff --git a/arch/m68k/emu/nfblock.c b/arch/m68k/emu/nfblock.c index 87e8b1700acd28..92d26c81244134 100644 --- a/arch/m68k/emu/nfblock.c +++ b/arch/m68k/emu/nfblock.c @@ -59,7 +59,7 @@ struct nfhd_device { struct gendisk *disk; }; -static blk_qc_t nfhd_make_request(struct request_queue *queue, struct bio *bio) +static blk_qc_t nfhd_submit_bio(struct bio *bio) { struct nfhd_device *dev = bio->bi_disk->private_data; struct bio_vec bvec; @@ -93,6 +93,7 @@ static int nfhd_getgeo(struct block_device *bdev, struct hd_geometry *geo) static const struct block_device_operations nfhd_ops = { .owner = THIS_MODULE, + .submit_bio = nfhd_submit_bio, .getgeo = nfhd_getgeo, }; @@ -118,7 +119,7 @@ static int __init nfhd_init_one(int id, u32 blocks, u32 bsize) dev->bsize = bsize; dev->bshift = ffs(bsize) - 10; - dev->queue = blk_alloc_queue(nfhd_make_request, NUMA_NO_NODE); + dev->queue = blk_alloc_queue(NUMA_
[PATCH 15/20] block: remove the nr_sectors variable in generic_make_request_checks
The variable is only used once, so just open code the bio_sectors() there.

Signed-off-by: Christoph Hellwig
---
 block/blk-core.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 37435d0d433564..28f60985dc75cc 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -974,7 +974,6 @@ static noinline_for_stack bool
 generic_make_request_checks(struct bio *bio)
 {
 	struct request_queue *q = bio->bi_disk->queue;
-	int nr_sectors = bio_sectors(bio);
 	blk_status_t status = BLK_STS_IOERR;
 
 	might_sleep();
@@ -1007,7 +1006,7 @@ generic_make_request_checks(struct bio *bio)
 	if (op_is_flush(bio->bi_opf) &&
 	    !test_bit(QUEUE_FLAG_WC, &q->queue_flags)) {
 		bio->bi_opf &= ~(REQ_PREFLUSH | REQ_FUA);
-		if (!nr_sectors) {
+		if (!bio_sectors(bio)) {
 			status = BLK_STS_OK;
 			goto end_io;
 		}
-- 
2.26.2

[PATCH 12/20] block: remove the request_queue argument from blk_queue_split
The queue can be trivially derived from the bio, so pass one less argument. Signed-off-by: Christoph Hellwig --- block/blk-merge.c | 21 ++--- block/blk-mq.c| 2 +- block/blk.h | 3 +-- drivers/block/drbd/drbd_req.c | 2 +- drivers/block/pktcdvd.c | 2 +- drivers/block/ps3vram.c | 2 +- drivers/block/rsxx/dev.c | 2 +- drivers/block/umem.c | 2 +- drivers/lightnvm/pblk-init.c | 4 ++-- drivers/md/dm.c | 2 +- drivers/md/md.c | 2 +- drivers/nvme/host/multipath.c | 9 - drivers/s390/block/dcssblk.c | 2 +- drivers/s390/block/xpram.c| 2 +- include/linux/blkdev.h| 2 +- 15 files changed, 28 insertions(+), 31 deletions(-) diff --git a/block/blk-merge.c b/block/blk-merge.c index 9c9fb21584b64e..20fa2290604105 100644 --- a/block/blk-merge.c +++ b/block/blk-merge.c @@ -283,20 +283,20 @@ static struct bio *blk_bio_segment_split(struct request_queue *q, /** * __blk_queue_split - split a bio and submit the second half - * @q: [in] request queue pointer * @bio: [in, out] bio to be split * @nr_segs: [out] number of segments in the first bio * * Split a bio into two bios, chain the two bios, submit the second half and * store a pointer to the first half in *@bio. If the second bio is still too * big it will be split by a recursive call to this function. Since this - * function may allocate a new bio from @q->bio_split, it is the responsibility - * of the caller to ensure that @q is only released after processing of the + * function may allocate a new bio from @bio->bi_disk->queue->bio_split, it is + * the responsibility of the caller to ensure that + * @bio->bi_disk->queue->bio_split is only released after processing of the * split bio has finished. 
*/ -void __blk_queue_split(struct request_queue *q, struct bio **bio, - unsigned int *nr_segs) +void __blk_queue_split(struct bio **bio, unsigned int *nr_segs) { + struct request_queue *q = (*bio)->bi_disk->queue; struct bio *split = NULL; switch (bio_op(*bio)) { @@ -345,20 +345,19 @@ void __blk_queue_split(struct request_queue *q, struct bio **bio, /** * blk_queue_split - split a bio and submit the second half - * @q: [in] request queue pointer * @bio: [in, out] bio to be split * * Split a bio into two bios, chains the two bios, submit the second half and * store a pointer to the first half in *@bio. Since this function may allocate - * a new bio from @q->bio_split, it is the responsibility of the caller to - * ensure that @q is only released after processing of the split bio has - * finished. + * a new bio from @bio->bi_disk->queue->bio_split, it is the responsibility of + * the caller to ensure that @bio->bi_disk->queue->bio_split is only released + * after processing of the split bio has finished. 
*/ -void blk_queue_split(struct request_queue *q, struct bio **bio) +void blk_queue_split(struct bio **bio) { unsigned int nr_segs; - __blk_queue_split(q, bio, &nr_segs); + __blk_queue_split(bio, &nr_segs); } EXPORT_SYMBOL(blk_queue_split); diff --git a/block/blk-mq.c b/block/blk-mq.c index 65e0846fd06519..dbadb7defd618a 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -2166,7 +2166,7 @@ blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio) blk_status_t ret; blk_queue_bounce(q, &bio); - __blk_queue_split(q, &bio, &nr_segs); + __blk_queue_split(&bio, &nr_segs); if (!bio_integrity_prep(bio)) goto queue_exit; diff --git a/block/blk.h b/block/blk.h index 0184a31fe4dfaf..0114fd92c8a05b 100644 --- a/block/blk.h +++ b/block/blk.h @@ -220,8 +220,7 @@ ssize_t part_timeout_show(struct device *, struct device_attribute *, char *); ssize_t part_timeout_store(struct device *, struct device_attribute *, const char *, size_t); -void __blk_queue_split(struct request_queue *q, struct bio **bio, - unsigned int *nr_segs); +void __blk_queue_split(struct bio **bio, unsigned int *nr_segs); int ll_back_merge_fn(struct request *req, struct bio *bio, unsigned int nr_segs); int ll_front_merge_fn(struct request *req, struct bio *bio, diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c index 3f09b2ab977822..9368680474223a 100644 --- a/drivers/block/drbd/drbd_req.c +++ b/drivers/block/drbd/drbd_req.c @@ -1598,7 +1598,7 @@ blk_qc_t drbd_make_request(struct request_queue *q, struct bio *bio) struct drbd_device *device = bio->bi_disk->private_data; unsigned long start_jif; - blk_queue_split(q, &bio); + blk_queue_split(&bio); start_jif = jiffies; diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c index 27a33adc41e487..29b0c62dc86c1f 100644 --- a/drivers/block/pktcdvd.c +++ b/drivers/block/pktcdvd.c @@ -2434,7 +2434,7 @@ s
[PATCH 14/20] block: remove the NULL queue check in generic_make_request_checks
All registered disks must have a valid queue pointer, so don't bother to log a warning for that case.

Signed-off-by: Christoph Hellwig
---
 block/blk-core.c | 12 +-----------
 1 file changed, 1 insertion(+), 11 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 95dca74534ff73..37435d0d433564 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -973,22 +973,12 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q,
 static noinline_for_stack bool
 generic_make_request_checks(struct bio *bio)
 {
-	struct request_queue *q;
+	struct request_queue *q = bio->bi_disk->queue;
 	int nr_sectors = bio_sectors(bio);
 	blk_status_t status = BLK_STS_IOERR;
-	char b[BDEVNAME_SIZE];
 
 	might_sleep();
 
-	q = bio->bi_disk->queue;
-	if (unlikely(!q)) {
-		printk(KERN_ERR
-		       "generic_make_request: Trying to access "
-			"nonexistent block-device %s (%Lu)\n",
-			bio_devname(bio, b), (long long)bio->bi_iter.bi_sector);
-		goto end_io;
-	}
-
 	/*
 	 * For a REQ_NOWAIT based request, return -EOPNOTSUPP
 	 * if queue is not a request based queue.
-- 
2.26.2
[PATCH 13/20] block: tidy up a warning in bio_check_ro
The "generic_make_request: " prefix has no value, and will soon become stale.

Signed-off-by: Christoph Hellwig
---
 block/blk-core.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 76cfd5709f66cd..95dca74534ff73 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -869,8 +869,7 @@ static inline bool bio_check_ro(struct bio *bio, struct hd_struct *part)
 			return false;
 
 		WARN_ONCE(1,
-		       "generic_make_request: Trying to write "
-			"to read-only block-device %s (partno %d)\n",
+		       "Trying to write to read-only block-device %s (partno %d)\n",
 			bio_devname(bio, b), part->partno);
 		/* Older lvm-tools actually trigger this */
 		return false;
-- 
2.26.2
[PATCH 11/20] fs: remove a weird comment in submit_bh_wbc
All bios can get remapped if submitted to partitions. No need to comment on that.

Signed-off-by: Christoph Hellwig
---
 fs/buffer.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 64fe82ec65ff1f..2725ebbcfdc246 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -3040,12 +3040,7 @@ static int submit_bh_wbc(int op, int op_flags, struct buffer_head *bh,
 	if (test_set_buffer_req(bh) && (op == REQ_OP_WRITE))
 		clear_buffer_write_io_error(bh);
 
-	/*
-	 * from here on down, it's all bio -- do the initial mapping,
-	 * submit_bio -> generic_make_request may further map this bio around
-	 */
 	bio = bio_alloc(GFP_NOIO, 1);
-
 	bio->bi_iter.bi_sector = bh->b_blocknr * (bh->b_size >> 9);
 	bio_set_dev(bio, bh->b_bdev);
 	bio->bi_write_hint = write_hint;
-- 
2.26.2
[PATCH 10/20] dm: stop using ->queuedata
Instead of setting up the queuedata as well just use one private data field.

Signed-off-by: Christoph Hellwig
---
 drivers/md/dm.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index e44473fe0f4873..c8d91f271c272e 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1789,7 +1789,7 @@ static blk_qc_t dm_process_bio(struct mapped_device *md,
 
 static blk_qc_t dm_make_request(struct request_queue *q, struct bio *bio)
 {
-	struct mapped_device *md = q->queuedata;
+	struct mapped_device *md = bio->bi_disk->private_data;
 	blk_qc_t ret = BLK_QC_T_NONE;
 	int srcu_idx;
 	struct dm_table *map;
@@ -1995,7 +1995,6 @@ static struct mapped_device *alloc_dev(int minor)
 	md->queue = blk_alloc_queue(dm_make_request, numa_node_id);
 	if (!md->queue)
 		goto bad;
-	md->queue->queuedata = md;
 
 	md->disk = alloc_disk_node(1, md->numa_node_id);
 	if (!md->disk)
-- 
2.26.2
[PATCH 02/20] simdisk: stop using ->queuedata
Instead of setting up the queuedata as well just use one private data field.

Signed-off-by: Christoph Hellwig
---
 arch/xtensa/platforms/iss/simdisk.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/arch/xtensa/platforms/iss/simdisk.c b/arch/xtensa/platforms/iss/simdisk.c
index 49322b66cda931..31b5020077a059 100644
--- a/arch/xtensa/platforms/iss/simdisk.c
+++ b/arch/xtensa/platforms/iss/simdisk.c
@@ -103,7 +103,7 @@ static void simdisk_transfer(struct simdisk *dev, unsigned long sector,
 
 static blk_qc_t simdisk_make_request(struct request_queue *q, struct bio *bio)
 {
-	struct simdisk *dev = q->queuedata;
+	struct simdisk *dev = bio->bi_disk->private_data;
 	struct bio_vec bvec;
 	struct bvec_iter iter;
 	sector_t sector = bio->bi_iter.bi_sector;
@@ -273,8 +273,6 @@ static int __init simdisk_setup(struct simdisk *dev, int which,
 		goto out_alloc_queue;
 	}
 
-	dev->queue->queuedata = dev;
-
 	dev->gd = alloc_disk(SIMDISK_MINORS);
 	if (dev->gd == NULL) {
 		pr_err("alloc_disk failed\n");
-- 
2.26.2
[PATCH 09/20] bcache: stop setting ->queuedata
Nothing in bcache actually uses the ->queuedata field.

Signed-off-by: Christoph Hellwig
---
 drivers/md/bcache/super.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 2014016f9a60d3..21aa168113d30b 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -876,7 +876,6 @@ static int bcache_device_init(struct bcache_device *d, unsigned int block_size,
 		return -ENOMEM;
 
 	d->disk->queue			= q;
-	q->queuedata			= d;
 	q->backing_dev_info->congested_data = d;
 	q->limits.max_hw_sectors	= UINT_MAX;
 	q->limits.max_sectors		= UINT_MAX;
-- 
2.26.2
rename ->make_request_fn and move it to the block_device_operations v2
Hi Jens, this series moves the make_request_fn method into block_device_operations with the much more descriptive ->submit_bio name. It then also gives generic_make_request a more descriptive name, and further optimize the path to issue to blk-mq, removing the need for the direct_make_request bypass. Changes since v1: - fix a null pointer dereference when dispatching from bio to request based drivers - clean up a few more comments Diffstat: Documentation/block/biodoc.rst|2 Documentation/block/writeback_cache_control.rst |2 Documentation/fault-injection/fault-injection.rst |2 Documentation/trace/ftrace.rst|4 arch/m68k/emu/nfblock.c |8 arch/xtensa/platforms/iss/simdisk.c |9 block/bio.c | 14 - block/blk-cgroup.c|2 block/blk-core.c | 255 +- block/blk-crypto-fallback.c |2 block/blk-crypto.c|2 block/blk-merge.c | 23 - block/blk-mq.c| 12 - block/blk-throttle.c |4 block/blk.h |5 block/bounce.c|2 drivers/block/brd.c |5 drivers/block/drbd/drbd_int.h |8 drivers/block/drbd/drbd_main.c| 12 - drivers/block/drbd/drbd_receiver.c|2 drivers/block/drbd/drbd_req.c |8 drivers/block/drbd/drbd_worker.c |2 drivers/block/null_blk_main.c | 19 + drivers/block/pktcdvd.c | 15 - drivers/block/ps3vram.c | 20 - drivers/block/rsxx/dev.c | 14 - drivers/block/umem.c | 11 drivers/block/zram/zram_drv.c | 14 - drivers/lightnvm/core.c |8 drivers/lightnvm/pblk-init.c | 16 - drivers/lightnvm/pblk-read.c |2 drivers/md/bcache/bcache.h|2 drivers/md/bcache/btree.c |2 drivers/md/bcache/request.c | 11 drivers/md/bcache/request.h |4 drivers/md/bcache/super.c | 24 +- drivers/md/dm-cache-target.c |6 drivers/md/dm-clone-target.c | 10 drivers/md/dm-crypt.c |6 drivers/md/dm-delay.c |2 drivers/md/dm-era-target.c|2 drivers/md/dm-integrity.c |4 drivers/md/dm-mpath.c |2 drivers/md/dm-raid1.c |2 drivers/md/dm-snap-persistent.c |2 drivers/md/dm-snap.c |6 drivers/md/dm-thin.c |4 drivers/md/dm-verity-target.c |2 drivers/md/dm-writecache.c|2 drivers/md/dm-zoned-target.c |2 drivers/md/dm.c | 41 +-- drivers/md/md-faulty.c|4 
drivers/md/md-linear.c|4 drivers/md/md-multipath.c |4 drivers/md/md.c |7 drivers/md/raid0.c|8 drivers/md/raid1.c| 14 - drivers/md/raid10.c | 28 +- drivers/md/raid5.c| 10 drivers/nvdimm/blk.c |5 drivers/nvdimm/btt.c |5 drivers/nvdimm/pmem.c |5 drivers/nvme/host/core.c |1 drivers/nvme/host/multipath.c | 18 - drivers/nvme/host/nvme.h |1 drivers/s390/block/dcssblk.c | 11 drivers/s390/block/xpram.c|8 fs/buffer.c |5 include/linux/blk-mq.h|2 include/linux/blkdev.h| 12 - include/linux/lightnvm.h |3 71 files changed, 387 insertions(+), 408 deletions(-)
[PATCH 03/20] drbd: stop using ->queuedata
Instead of setting up the queuedata as well just use one private data field.

Signed-off-by: Christoph Hellwig
---
 drivers/block/drbd/drbd_main.c | 1 -
 drivers/block/drbd/drbd_req.c  | 2 +-
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 45fbd526c453bc..26f4e0aa7393b4 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -2805,7 +2805,6 @@ enum drbd_ret_code drbd_create_device(struct drbd_config_context *adm_ctx, unsig
 	if (!q)
 		goto out_no_q;
 	device->rq_queue = q;
-	q->queuedata = device;
 
 	disk = alloc_disk(1);
 	if (!disk)
diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
index c80a2f1c3c2a73..3f09b2ab977822 100644
--- a/drivers/block/drbd/drbd_req.c
+++ b/drivers/block/drbd/drbd_req.c
@@ -1595,7 +1595,7 @@ void do_submit(struct work_struct *ws)
 
 blk_qc_t drbd_make_request(struct request_queue *q, struct bio *bio)
 {
-	struct drbd_device *device = (struct drbd_device *) q->queuedata;
+	struct drbd_device *device = bio->bi_disk->private_data;
 	unsigned long start_jif;
 
 	blk_queue_split(q, &bio);
-- 
2.26.2
[PATCH 06/20] rsxx: stop using ->queuedata
Instead of setting up the queuedata as well just use one private data field.

Signed-off-by: Christoph Hellwig
---
 drivers/block/rsxx/dev.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/block/rsxx/dev.c b/drivers/block/rsxx/dev.c
index 3ba07ab30c84f5..6a4d8d26e32cbd 100644
--- a/drivers/block/rsxx/dev.c
+++ b/drivers/block/rsxx/dev.c
@@ -119,7 +119,7 @@ static void bio_dma_done_cb(struct rsxx_cardinfo *card,
 
 static blk_qc_t rsxx_make_request(struct request_queue *q, struct bio *bio)
 {
-	struct rsxx_cardinfo *card = q->queuedata;
+	struct rsxx_cardinfo *card = bio->bi_disk->private_data;
 	struct rsxx_bio_meta *bio_meta;
 	blk_status_t st = BLK_STS_IOERR;
 
@@ -267,8 +267,6 @@ int rsxx_setup_dev(struct rsxx_cardinfo *card)
 		card->queue->limits.discard_alignment   = RSXX_HW_BLK_SIZE;
 	}
 
-	card->queue->queuedata = card;
-
 	snprintf(card->gendisk->disk_name, sizeof(card->gendisk->disk_name),
 		 "rsxx%d", card->disk_id);
 	card->gendisk->major = card->major;
@@ -289,7 +287,6 @@ void rsxx_destroy_dev(struct rsxx_cardinfo *card)
 	card->gendisk = NULL;
 
 	blk_cleanup_queue(card->queue);
-	card->queue->queuedata = NULL;
 
 	unregister_blkdev(card->major, DRIVER_NAME);
 }
-- 
2.26.2
[PATCH 07/20] umem: stop using ->queuedata
Instead of setting up the queuedata as well just use one private data field.

Signed-off-by: Christoph Hellwig
---
 drivers/block/umem.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/block/umem.c b/drivers/block/umem.c
index 1e2aa5ae27963c..5498f1cf36b3fe 100644
--- a/drivers/block/umem.c
+++ b/drivers/block/umem.c
@@ -521,7 +521,8 @@ static int mm_check_plugged(struct cardinfo *card)
 
 static blk_qc_t mm_make_request(struct request_queue *q, struct bio *bio)
 {
-	struct cardinfo *card = q->queuedata;
+	struct cardinfo *card = bio->bi_disk->private_data;
+
 	pr_debug("mm_make_request %llu %u\n",
 		 (unsigned long long)bio->bi_iter.bi_sector,
 		 bio->bi_iter.bi_size);
@@ -888,7 +889,6 @@ static int mm_pci_probe(struct pci_dev *dev, const struct pci_device_id *id)
 	card->queue = blk_alloc_queue(mm_make_request, NUMA_NO_NODE);
 	if (!card->queue)
 		goto failed_alloc;
-	card->queue->queuedata = card;
 
 	tasklet_init(&card->tasklet, process_page, (unsigned long)card);
-- 
2.26.2
[PATCH 04/20] null_blk: stop using ->queuedata for bio mode
Instead of setting up the queuedata as well just use one private data field.

Signed-off-by: Christoph Hellwig
---
 drivers/block/null_blk_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/block/null_blk_main.c b/drivers/block/null_blk_main.c
index 82259242b9b5c9..93ce0a00b2ae01 100644
--- a/drivers/block/null_blk_main.c
+++ b/drivers/block/null_blk_main.c
@@ -1392,7 +1392,7 @@ static blk_qc_t null_queue_bio(struct request_queue *q, struct bio *bio)
 {
 	sector_t sector = bio->bi_iter.bi_sector;
 	sector_t nr_sectors = bio_sectors(bio);
-	struct nullb *nullb = q->queuedata;
+	struct nullb *nullb = bio->bi_disk->private_data;
 	struct nullb_queue *nq = nullb_to_queue(nullb);
 	struct nullb_cmd *cmd;
-- 
2.26.2
[PATCH 01/20] nfblock: stop using ->queuedata
Instead of setting up the queuedata as well just use one private data field.

Signed-off-by: Christoph Hellwig
Reviewed-by: Geert Uytterhoeven
Acked-by: Geert Uytterhoeven
---
 arch/m68k/emu/nfblock.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/m68k/emu/nfblock.c b/arch/m68k/emu/nfblock.c
index c3a630440512e9..87e8b1700acd28 100644
--- a/arch/m68k/emu/nfblock.c
+++ b/arch/m68k/emu/nfblock.c
@@ -61,7 +61,7 @@ struct nfhd_device {
 
 static blk_qc_t nfhd_make_request(struct request_queue *queue, struct bio *bio)
 {
-	struct nfhd_device *dev = queue->queuedata;
+	struct nfhd_device *dev = bio->bi_disk->private_data;
 	struct bio_vec bvec;
 	struct bvec_iter iter;
 	int dir, len, shift;
@@ -122,7 +122,6 @@ static int __init nfhd_init_one(int id, u32 blocks, u32 bsize)
 	if (dev->queue == NULL)
 		goto free_dev;
 
-	dev->queue->queuedata = dev;
 	blk_queue_logical_block_size(dev->queue, bsize);
 
 	dev->disk = alloc_disk(16);
-- 
2.26.2
[PATCH 05/20] ps3vram: stop using ->queuedata
Instead of setting up the queuedata as well just use one private data field.

Signed-off-by: Christoph Hellwig
---
 drivers/block/ps3vram.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/block/ps3vram.c b/drivers/block/ps3vram.c
index 821d4d8b1d763e..5a1d1d137c7248 100644
--- a/drivers/block/ps3vram.c
+++ b/drivers/block/ps3vram.c
@@ -587,7 +587,7 @@ static struct bio *ps3vram_do_bio(struct ps3_system_bus_device *dev,
 
 static blk_qc_t ps3vram_make_request(struct request_queue *q, struct bio *bio)
 {
-	struct ps3_system_bus_device *dev = q->queuedata;
+	struct ps3_system_bus_device *dev = bio->bi_disk->private_data;
 	struct ps3vram_priv *priv = ps3_system_bus_get_drvdata(dev);
 	int busy;
 
@@ -745,7 +745,6 @@ static int ps3vram_probe(struct ps3_system_bus_device *dev)
 	}
 
 	priv->queue = queue;
-	queue->queuedata = dev;
 	blk_queue_max_segments(queue, BLK_MAX_SEGMENTS);
 	blk_queue_max_segment_size(queue, BLK_MAX_SEGMENT_SIZE);
 	blk_queue_max_hw_sectors(queue, BLK_SAFE_MAX_SECTORS);
-- 
2.26.2
[PATCH 08/20] zram: stop using ->queuedata
Instead of setting up the queuedata as well just use one private data field.

Signed-off-by: Christoph Hellwig
---
 drivers/block/zram/zram_drv.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 6e2ad90b17a376..0564e3f384089e 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1586,7 +1586,7 @@ static void __zram_make_request(struct zram *zram, struct bio *bio)
  */
 static blk_qc_t zram_make_request(struct request_queue *queue, struct bio *bio)
 {
-	struct zram *zram = queue->queuedata;
+	struct zram *zram = bio->bi_disk->private_data;
 
 	if (!valid_io_request(zram, bio->bi_iter.bi_sector,
 					bio->bi_iter.bi_size)) {
@@ -1912,7 +1912,6 @@ static int zram_add(void)
 	zram->disk->first_minor = device_id;
 	zram->disk->fops = &zram_devops;
 	zram->disk->queue = queue;
-	zram->disk->queue->queuedata = zram;
 	zram->disk->private_data = zram;
 	snprintf(zram->disk->disk_name, 16, "zram%d", device_id);
-- 
2.26.2
Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
On Wed 24-06-20 14:58:46, Srikar Dronamraju wrote: > Currently Linux kernel with CONFIG_NUMA on a system with multiple > possible nodes, marks node 0 as online at boot. However in practice, > there are systems which have node 0 as memoryless and cpuless. > > This can cause numa_balancing to be enabled on systems with only one node > with memory and CPUs. The existence of this dummy node which is cpuless and > memoryless node can confuse users/scripts looking at output of lscpu / > numactl. > > By marking, N_ONLINE as NODE_MASK_NONE, lets stop assuming that Node 0 is > always online. > > v5.8-rc2 > available: 2 nodes (0,2) > node 0 cpus: > node 0 size: 0 MB > node 0 free: 0 MB > node 2 cpus: 0 1 2 3 4 5 6 7 > node 2 size: 32625 MB > node 2 free: 31490 MB > node distances: > node 0 2 >0: 10 20 >2: 20 10 > > proc and sys files > -- > /sys/devices/system/node/online:0,2 > /proc/sys/kernel/numa_balancing:1 > /sys/devices/system/node/has_cpu: 2 > /sys/devices/system/node/has_memory:2 > /sys/devices/system/node/has_normal_memory: 2 > /sys/devices/system/node/possible: 0-31 > > v5.8-rc2 + patch > -- > available: 1 nodes (2) > node 2 cpus: 0 1 2 3 4 5 6 7 > node 2 size: 32625 MB > node 2 free: 31487 MB > node distances: > node 2 >2: 10 > > proc and sys files > -- > /sys/devices/system/node/online:2 > /proc/sys/kernel/numa_balancing:0 > /sys/devices/system/node/has_cpu: 2 > /sys/devices/system/node/has_memory:2 > /sys/devices/system/node/has_normal_memory: 2 > /sys/devices/system/node/possible: 0-31 > > Note: On Powerpc, cpu_to_node of possible but not present cpus would > previously return 0. Hence this commit depends on commit ("powerpc/numa: Set > numa_node for all possible cpus") and commit ("powerpc/numa: Prefer node id > queried from vphn"). Without the 2 commits, Powerpc system might crash. > > 1. User space applications like Numactl, lscpu, that parse the sysfs tend to > believe there is an extra online node. This tends to confuse users and > applications. 
Other user space applications start believing that system was > not able to use all the resources (i.e missing resources) or the system was > not setup correctly. > > 2. Also existence of dummy node also leads to inconsistent information. The > number of online nodes is inconsistent with the information in the > device-tree and resource-dump > > 3. When the dummy node is present, single node non-Numa systems end up showing > up as NUMA systems and numa_balancing gets enabled. This will mean we take > the hit from the unnecessary numa hinting faults. I have to say that I dislike the node online/offline state and directly exporting that to the userspace. Users should only care whether the node has memory/cpus. Numa nodes can be online without any memory. Just offline all the present memory blocks but do not physically hot remove them and you are in the same situation. If users are confused by an output of tools like numactl -H then those could be updated and hide nodes without any memory&cpus. The autonuma problem sounds interesting but again this patch doesn't really solve the underlying problem because I strongly suspect that the problem is still there when a numa node gets all its memory offline as mentioned above. While I completely agree that making node 0 special is wrong, I have still hard time to review this very simply looking patch because all the numa initialization is so spread around that this might just blow up at unexpected places. IIRC we have discussed testing in the previous version and David has provided a way to emulate these configurations on x86. Did you manage to use those instruction for additional testing on other than ppc architectures? > Cc: linuxppc-dev@lists.ozlabs.org > Cc: linux...@kvack.org > Cc: linux-ker...@vger.kernel.org > Cc: Michal Hocko > Cc: Mel Gorman > Cc: Vlastimil Babka > Cc: "Kirill A. 
Shutemov" > Cc: Christopher Lameter > Cc: Michael Ellerman > Cc: Andrew Morton > Cc: Linus Torvalds > Cc: Gautham R Shenoy > Cc: Satheesh Rajendran > Cc: David Hildenbrand > Signed-off-by: Srikar Dronamraju > --- > Changelog v4:->v5: > - rebased to v5.8-rc2 > link v4: > http://lore.kernel.org/lkml/20200512132937.19295-1-sri...@linux.vnet.ibm.com/t/#u > > Changelog v1:->v2: > - Rebased to v5.7-rc3 > Link v2: > https://lore.kernel.org/linuxppc-dev/20200428093836.27190-1-sri...@linux.vnet.ibm.com/t/#u > > mm/page_alloc.c | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 48eb0f1410d4..5187664558e1 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -117,8 +117,10 @@ EXPORT_SYMBOL(latent_entropy); > */ > nodemask_t node_states[NR_NODE_STATES] __read_mostly = { > [N_POSSIBLE] = NODE_MASK_ALL, > +#ifdef CONFIG_NUMA > + [N_ONLIN
[PATCH v2 30/30] misc: cxl: flash: Remove unused pointer
The DRC index pointer is updated on an OPCODE_ADD, but never actually read. Remove the unused pointer and shift OPCODE_ADD up to group it with OPCODE_DELETE, which is also a no-op.

Fixes the following W=1 kernel build warning:

 drivers/misc/cxl/flash.c: In function ‘update_devicetree’:
 drivers/misc/cxl/flash.c:178:16: warning: variable ‘drc_index’ set but not used [-Wunused-but-set-variable]
   178 |  __be32 *data, drc_index, phandle;
       |                ^

Cc: Frederic Barrat
Cc: Andrew Donnellan
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Lee Jones
---
 drivers/misc/cxl/flash.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/drivers/misc/cxl/flash.c b/drivers/misc/cxl/flash.c
index cb9cca35a2263..24e3dfcc91a74 100644
--- a/drivers/misc/cxl/flash.c
+++ b/drivers/misc/cxl/flash.c
@@ -175,7 +175,7 @@ static int update_devicetree(struct cxl *adapter, s32 scope)
 	struct update_nodes_workarea *unwa;
 	u32 action, node_count;
 	int token, rc, i;
-	__be32 *data, drc_index, phandle;
+	__be32 *data, phandle;
 	char *buf;
 
 	token = rtas_token("ibm,update-nodes");
@@ -206,15 +206,12 @@ static int update_devicetree(struct cxl *adapter, s32 scope)
 
 			switch (action) {
 			case OPCODE_DELETE:
+			case OPCODE_ADD:
 				/* nothing to do */
 				break;
 			case OPCODE_UPDATE:
 				update_node(phandle, scope);
 				break;
-			case OPCODE_ADD:
-				/* nothing to do, just move pointer */
-				drc_index = *data++;
-				break;
 			}
 		}
 	}
-- 
2.25.1
[PATCH v2 28/30] misc: ocxl: config: Provide correct formatting to function headers
A nice attempt was made to provide kerneldoc headers for read_template_version() and read_afu_lpc_memory_info(); however, the provided formatting does not match what is expected by kerneldoc.

Fixes the following W=1 warnings:

 drivers/misc/ocxl/config.c:286: warning: Function parameter or member 'dev' not described in 'read_template_version'
 drivers/misc/ocxl/config.c:286: warning: Function parameter or member 'fn' not described in 'read_template_version'
 drivers/misc/ocxl/config.c:286: warning: Function parameter or member 'len' not described in 'read_template_version'
 drivers/misc/ocxl/config.c:286: warning: Function parameter or member 'version' not described in 'read_template_version'
 drivers/misc/ocxl/config.c:489: warning: Function parameter or member 'dev' not described in 'read_afu_lpc_memory_info'
 drivers/misc/ocxl/config.c:489: warning: Function parameter or member 'fn' not described in 'read_afu_lpc_memory_info'
 drivers/misc/ocxl/config.c:489: warning: Function parameter or member 'afu' not described in 'read_afu_lpc_memory_info'

Cc: Frederic Barrat
Cc: Andrew Donnellan
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Lee Jones
---
 drivers/misc/ocxl/config.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/misc/ocxl/config.c b/drivers/misc/ocxl/config.c
index c8e19bfb5ef90..e3b99a39d207e 100644
--- a/drivers/misc/ocxl/config.c
+++ b/drivers/misc/ocxl/config.c
@@ -273,11 +273,11 @@ static int read_afu_info(struct pci_dev *dev, struct ocxl_fn_config *fn,
 }
 
 /**
- * Read the template version from the AFU
- * dev: the device for the AFU
- * fn: the AFU offsets
- * len: outputs the template length
- * version: outputs the major<<8,minor version
+ * read_template_version - Read the template version from the AFU
+ * @dev: the device for the AFU
+ * @fn: the AFU offsets
+ * @len: outputs the template length
+ * @version: outputs the major<<8,minor version
  *
  * Returns 0 on success, negative on failure
  */
@@ -476,10 +476,10 @@ static int validate_afu(struct pci_dev *dev, struct ocxl_afu_config *afu)
 }
 
 /**
- * Populate AFU metadata regarding LPC memory
- * dev: the device for the AFU
- * fn: the AFU offsets
- * afu: the AFU struct to populate the LPC metadata into
+ * read_afu_lpc_memory_info - Populate AFU metadata regarding LPC memory
+ * @dev: the device for the AFU
+ * @fn: the AFU offsets
+ * @afu: the AFU struct to populate the LPC metadata into
  *
  * Returns 0 on success, negative on failure
  */
-- 
2.25.1
[PATCH v2 27/30] misc: cxl: hcalls: Demote half-assed kerneldoc attempt
Function headers will need a lot of work before they reach the standards expected of kerneldoc. Demote them down to basic comments/headers, for now at least.

Fixes the following W=1 kernel build warnings:

drivers/misc/cxl/hcalls.c:175: warning: Function parameter or member 'unit_address' not described in 'cxl_h_detach_process'
drivers/misc/cxl/hcalls.c:175: warning: Function parameter or member 'process_token' not described in 'cxl_h_detach_process'
drivers/misc/cxl/hcalls.c:207: warning: Function parameter or member 'unit_address' not described in 'cxl_h_control_function'
drivers/misc/cxl/hcalls.c:207: warning: Function parameter or member 'op' not described in 'cxl_h_control_function'
drivers/misc/cxl/hcalls.c:207: warning: Function parameter or member 'p1' not described in 'cxl_h_control_function'
drivers/misc/cxl/hcalls.c:207: warning: Function parameter or member 'p2' not described in 'cxl_h_control_function'
drivers/misc/cxl/hcalls.c:207: warning: Function parameter or member 'p3' not described in 'cxl_h_control_function'
drivers/misc/cxl/hcalls.c:207: warning: Function parameter or member 'p4' not described in 'cxl_h_control_function'
drivers/misc/cxl/hcalls.c:207: warning: Function parameter or member 'out' not described in 'cxl_h_control_function'
drivers/misc/cxl/hcalls.c:245: warning: Function parameter or member 'unit_address' not described in 'cxl_h_reset_afu'
drivers/misc/cxl/hcalls.c:258: warning: Function parameter or member 'unit_address' not described in 'cxl_h_suspend_process'
drivers/misc/cxl/hcalls.c:258: warning: Function parameter or member 'process_token' not described in 'cxl_h_suspend_process'
drivers/misc/cxl/hcalls.c:271: warning: Function parameter or member 'unit_address' not described in 'cxl_h_resume_process'
drivers/misc/cxl/hcalls.c:271: warning: Function parameter or member 'process_token' not described in 'cxl_h_resume_process'
drivers/misc/cxl/hcalls.c:284: warning: Function parameter or member 'unit_address' not described in 'cxl_h_read_error_state'
drivers/misc/cxl/hcalls.c:284: warning: Function parameter or member 'state' not described in 'cxl_h_read_error_state'
drivers/misc/cxl/hcalls.c:300: warning: Function parameter or member 'unit_address' not described in 'cxl_h_get_afu_err'
drivers/misc/cxl/hcalls.c:300: warning: Function parameter or member 'offset' not described in 'cxl_h_get_afu_err'
drivers/misc/cxl/hcalls.c:300: warning: Function parameter or member 'buf_address' not described in 'cxl_h_get_afu_err'
drivers/misc/cxl/hcalls.c:300: warning: Function parameter or member 'len' not described in 'cxl_h_get_afu_err'
drivers/misc/cxl/hcalls.c:320: warning: Function parameter or member 'unit_address' not described in 'cxl_h_get_config'
drivers/misc/cxl/hcalls.c:320: warning: Function parameter or member 'cr_num' not described in 'cxl_h_get_config'
drivers/misc/cxl/hcalls.c:320: warning: Function parameter or member 'offset' not described in 'cxl_h_get_config'
drivers/misc/cxl/hcalls.c:320: warning: Function parameter or member 'buf_address' not described in 'cxl_h_get_config'
drivers/misc/cxl/hcalls.c:320: warning: Function parameter or member 'len' not described in 'cxl_h_get_config'
drivers/misc/cxl/hcalls.c:333: warning: Function parameter or member 'unit_address' not described in 'cxl_h_terminate_process'
drivers/misc/cxl/hcalls.c:333: warning: Function parameter or member 'process_token' not described in 'cxl_h_terminate_process'
drivers/misc/cxl/hcalls.c:351: warning: Function parameter or member 'unit_address' not described in 'cxl_h_collect_vpd'
drivers/misc/cxl/hcalls.c:351: warning: Function parameter or member 'record' not described in 'cxl_h_collect_vpd'
drivers/misc/cxl/hcalls.c:351: warning: Function parameter or member 'list_address' not described in 'cxl_h_collect_vpd'
drivers/misc/cxl/hcalls.c:351: warning: Function parameter or member 'num' not described in 'cxl_h_collect_vpd'
drivers/misc/cxl/hcalls.c:351: warning: Function parameter or member 'out' not described in 'cxl_h_collect_vpd'
drivers/misc/cxl/hcalls.c:362: warning: Function parameter or member 'unit_address' not described in 'cxl_h_get_fn_error_interrupt'
drivers/misc/cxl/hcalls.c:362: warning: Function parameter or member 'reg' not described in 'cxl_h_get_fn_error_interrupt'
drivers/misc/cxl/hcalls.c:374: warning: Function parameter or member 'unit_address' not described in 'cxl_h_ack_fn_error_interrupt'
drivers/misc/cxl/hcalls.c:374: warning: Function parameter or member 'value' not described in 'cxl_h_ack_fn_error_interrupt'
drivers/misc/cxl/hcalls.c:386: warning: Function parameter or member 'unit_address' not described in 'cxl_h_get_error_log'
drivers/misc/cxl/hcalls.c:386: warning: Function parameter or member 'value' not described in 'cxl_h_get_error_log'
drivers/misc/cxl/hcalls.c:399: warning: Function parameter or member 'unit_address' not described in 'cxl_h_collect_int_info'
drivers/misc/cxl/
Re: [PATCH v2 4/6] powerpc/pseries/iommu: Remove default DMA window before creating DDW
On 24/06/2020 16:24, Leonardo Bras wrote:
> On LoPAR "DMA Window Manipulation Calls", it's recommended to remove the
> default DMA window for the device, before attempting to configure a DDW,
> in order to make the maximum resources available for the next DDW to be
> created.
>
> This is a requirement for some devices to use DDW, given they only
> allow one DMA window.

Devices never know about these windows; they are purely the PHB's side of things. A device can access any address on the bus, and the bus generates an exception if there is no window behind the address or the access hits some other device's MMIO. We could actually create a second window in addition to the first one and allocate bus addresses from both; we are just simplifying things by merging two separate non-adjacent windows into one.

> If setting up a new DDW fails anywhere after the removal of this
> default DMA window, it's needed to restore the default DMA window.
> For this, an implementation of ibm,reset-pe-dma-windows rtas call is
> needed:
>
> Platforms supporting the DDW option starting with LoPAR level 2.7 implement
> ibm,ddw-extensions. The first extension available (index 2) carries the
> token for ibm,reset-pe-dma-windows rtas call, which is used to restore
> the default DMA window for a device, if it has been deleted.
>
> It does so by resetting the TCE table allocation for the PE to its
> boot time value, available in "ibm,dma-window" device tree node.
> > Signed-off-by: Leonardo Bras > --- > arch/powerpc/platforms/pseries/iommu.c | 70 ++ > 1 file changed, 61 insertions(+), 9 deletions(-) > > diff --git a/arch/powerpc/platforms/pseries/iommu.c > b/arch/powerpc/platforms/pseries/iommu.c > index a8840d9e1c35..4fcf00016fb1 100644 > --- a/arch/powerpc/platforms/pseries/iommu.c > +++ b/arch/powerpc/platforms/pseries/iommu.c > @@ -1029,6 +1029,39 @@ static phys_addr_t ddw_memory_hotplug_max(void) > return max_addr; > } > > +/* > + * Platforms supporting the DDW option starting with LoPAR level 2.7 > implement > + * ibm,ddw-extensions, which carries the rtas token for > + * ibm,reset-pe-dma-windows. > + * That rtas-call can be used to restore the default DMA window for the > device. > + */ > +static void reset_dma_window(struct pci_dev *dev, struct device_node *par_dn) > +{ > + int ret; > + u32 cfg_addr, ddw_ext[DDW_EXT_RESET_DMA_WIN + 1]; > + u64 buid; > + struct device_node *dn; > + struct pci_dn *pdn; > + > + ret = of_property_read_u32_array(par_dn, "ibm,ddw-extensions", > + &ddw_ext[0], DDW_EXT_RESET_DMA_WIN + > 1); > + if (ret) > + return; > + > + dn = pci_device_to_OF_node(dev); > + pdn = PCI_DN(dn); > + buid = pdn->phb->buid; > + cfg_addr = ((pdn->busno << 16) | (pdn->devfn << 8)); > + > + ret = rtas_call(ddw_ext[DDW_EXT_RESET_DMA_WIN], 3, 1, NULL, cfg_addr, > + BUID_HI(buid), BUID_LO(buid)); > + if (ret) > + dev_info(&dev->dev, > + "ibm,reset-pe-dma-windows(%x) %x %x %x returned %d ", > + ddw_ext[1], cfg_addr, BUID_HI(buid), BUID_LO(buid), s/ddw_ext[1]/ddw_ext[DDW_EXT_RESET_DMA_WIN]/ > + ret); > +} > + > /* > * If the PE supports dynamic dma windows, and there is space for a table > * that can map all pages in a linear offset, then setup such a table, > @@ -1049,8 +1082,9 @@ static u64 enable_ddw(struct pci_dev *dev, struct > device_node *pdn) > u64 dma_addr, max_addr; > struct device_node *dn; > u32 ddw_avail[DDW_APPLICABLE_SIZE]; > + Unrelated new empty line. 
> struct direct_window *window; > - struct property *win64; > + struct property *win64, *default_win = NULL, *ddw_ext = NULL; > struct dynamic_dma_window_prop *ddwprop; > struct failed_ddw_pdn *fpdn; > > @@ -1085,7 +1119,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct > device_node *pdn) > if (ret) > goto out_failed; > > - /* > + /* >* Query if there is a second window of size to map the >* whole partition. Query returns number of windows, largest >* block assigned to PE (partition endpoint), and two bitmasks > @@ -1096,15 +1130,31 @@ static u64 enable_ddw(struct pci_dev *dev, struct > device_node *pdn) > if (ret != 0) > goto out_failed; > > + /* > + * If there is no window available, remove the default DMA window, > + * if it's present. This will make all the resources available to the > + * new DDW window. > + * If anything fails after this, we need to restore it, so also check > + * for extensions presence. > + */ > if (query.windows_available == 0) { Does phyp really always advertise 0 windows for these VFs? What is in the largest_available_block when windows_available==0? > - /* > - * no additional windows are available for this dev
Re: [PATCH v2 2/6] powerpc/pseries/iommu: Update call to ibm,query-pe-dma-windows
On 24/06/2020 16:24, Leonardo Bras wrote: > From LoPAR level 2.8, "ibm,ddw-extensions" index 3 can make the number of > outputs from "ibm,query-pe-dma-windows" go from 5 to 6. > > This change of output size is meant to expand the address size of > largest_available_block PE TCE from 32-bit to 64-bit, which ends up > shifting page_size and migration_capable. > > This ends up requiring the update of > ddw_query_response->largest_available_block from u32 to u64, and manually > assigning the values from the buffer into this struct, according to > output size. > > Signed-off-by: Leonardo Bras > --- > arch/powerpc/platforms/pseries/iommu.c | 57 +- > 1 file changed, 47 insertions(+), 10 deletions(-) > > diff --git a/arch/powerpc/platforms/pseries/iommu.c > b/arch/powerpc/platforms/pseries/iommu.c > index 68d2aa9c71a8..558e5441c355 100644 > --- a/arch/powerpc/platforms/pseries/iommu.c > +++ b/arch/powerpc/platforms/pseries/iommu.c > @@ -44,6 +44,10 @@ > #define DDW_REMOVE_PE_DMA_WIN2 > #define DDW_APPLICABLE_SIZE 3 > > +#define DDW_EXT_SIZE 0 > +#define DDW_EXT_RESET_DMA_WIN1 > +#define DDW_EXT_QUERY_OUT_SIZE 2 #define DDW_EXT_LAST (DDW_EXT_QUERY_OUT_SIZE + 1) ... > + > static struct iommu_table_group *iommu_pseries_alloc_group(int node) > { > struct iommu_table_group *table_group; > @@ -339,7 +343,7 @@ struct direct_window { > /* Dynamic DMA Window support */ > struct ddw_query_response { > u32 windows_available; > - u32 largest_available_block; > + u64 largest_available_block; > u32 page_size; > u32 migration_capable; > }; > @@ -875,13 +879,29 @@ static int find_existing_ddw_windows(void) > machine_arch_initcall(pseries, find_existing_ddw_windows); > > static int query_ddw(struct pci_dev *dev, const u32 *ddw_avail, > - struct ddw_query_response *query) > + struct ddw_query_response *query, > + struct device_node *parent) > { > struct device_node *dn; > struct pci_dn *pdn; > - u32 cfg_addr; > + u32 cfg_addr, query_out[5], ddw_ext[DDW_EXT_QUERY_OUT_SIZE + 1]; ... 
and use DDW_EXT_LAST here. > u64 buid; > - int ret; > + int ret, out_sz; > + > + /* > + * From LoPAR level 2.8, "ibm,ddw-extensions" index 3 can rule how many > + * output parameters ibm,query-pe-dma-windows will have, ranging from > + * 5 to 6. > + */ > + > + ret = of_property_read_u32_array(parent, "ibm,ddw-extensions", > + &ddw_ext[0], > + DDW_EXT_QUERY_OUT_SIZE + 1); > + if (ret && ddw_ext[DDW_EXT_SIZE] > 1 && >= DDW_EXT_QUERY_OUT_SIZE ? Thanks, > + ddw_ext[DDW_EXT_QUERY_OUT_SIZE] == 1) > + out_sz = 6; > + else > + out_sz = 5; > > /* >* Get the config address and phb buid of the PE window. > @@ -894,11 +914,28 @@ static int query_ddw(struct pci_dev *dev, const u32 > *ddw_avail, > buid = pdn->phb->buid; > cfg_addr = ((pdn->busno << 16) | (pdn->devfn << 8)); > > - ret = rtas_call(ddw_avail[DDW_QUERY_PE_DMA_WIN], 3, 5, (u32 *)query, > + ret = rtas_call(ddw_avail[DDW_QUERY_PE_DMA_WIN], 3, out_sz, query_out, > cfg_addr, BUID_HI(buid), BUID_LO(buid)); > - dev_info(&dev->dev, "ibm,query-pe-dma-windows(%x) %x %x %x" > - " returned %d\n", ddw_avail[DDW_QUERY_PE_DMA_WIN], cfg_addr, > - BUID_HI(buid), BUID_LO(buid), ret); > + dev_info(&dev->dev, "ibm,query-pe-dma-windows(%x) %x %x %x returned > %d\n", > + ddw_avail[DDW_QUERY_PE_DMA_WIN], cfg_addr, BUID_HI(buid), > + BUID_LO(buid), ret); > + > + switch (out_sz) { > + case 5: > + query->windows_available = query_out[0]; > + query->largest_available_block = query_out[1]; > + query->page_size = query_out[2]; > + query->migration_capable = query_out[3]; > + break; > + case 6: > + query->windows_available = query_out[0]; > + query->largest_available_block = ((u64)query_out[1] << 32) | > + query_out[2]; > + query->page_size = query_out[3]; > + query->migration_capable = query_out[4]; > + break; > + } > + > return ret; > } > > @@ -1046,7 +1083,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct > device_node *pdn) >* of page sizes: supported and supported for migrate-dma. 
>*/ > dn = pci_device_to_OF_node(dev); > - ret = query_ddw(dev, ddw_avail, &query); > + ret = query_ddw(dev, ddw_avail, &query, pdn); > if (ret != 0) > goto out_failed; > > @@ -1074,7 +,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct > device_node *pdn) > /* check largest block * page size > max memory hotplug addr */ > m
Re: [PATCH v2 1/6] powerpc/pseries/iommu: Create defines for operations in ibm,ddw-applicable
On 24/06/2020 16:24, Leonardo Bras wrote: > Create defines to help handling ibm,ddw-applicable values, avoiding > confusion about the index of given operations. > > Signed-off-by: Leonardo Bras > --- > arch/powerpc/platforms/pseries/iommu.c | 40 +++--- > 1 file changed, 23 insertions(+), 17 deletions(-) > > diff --git a/arch/powerpc/platforms/pseries/iommu.c > b/arch/powerpc/platforms/pseries/iommu.c > index 6d47b4a3ce39..68d2aa9c71a8 100644 > --- a/arch/powerpc/platforms/pseries/iommu.c > +++ b/arch/powerpc/platforms/pseries/iommu.c > @@ -39,6 +39,11 @@ > > #include "pseries.h" > > +#define DDW_QUERY_PE_DMA_WIN 0 > +#define DDW_CREATE_PE_DMA_WIN1 > +#define DDW_REMOVE_PE_DMA_WIN2 > +#define DDW_APPLICABLE_SIZE 3 #define DDW_APPLICABLE_SIZE (DDW_REMOVE_PE_DMA_WIN + 1) thanks, > + > static struct iommu_table_group *iommu_pseries_alloc_group(int node) > { > struct iommu_table_group *table_group; > @@ -771,12 +776,12 @@ static void remove_ddw(struct device_node *np, bool > remove_prop) > { > struct dynamic_dma_window_prop *dwp; > struct property *win64; > - u32 ddw_avail[3]; > + u32 ddw_avail[DDW_APPLICABLE_SIZE]; > u64 liobn; > int ret = 0; > > ret = of_property_read_u32_array(np, "ibm,ddw-applicable", > - &ddw_avail[0], 3); > + &ddw_avail[0], DDW_APPLICABLE_SIZE); > > win64 = of_find_property(np, DIRECT64_PROPNAME, NULL); > if (!win64) > @@ -798,15 +803,15 @@ static void remove_ddw(struct device_node *np, bool > remove_prop) > pr_debug("%pOF successfully cleared tces in window.\n", >np); > > - ret = rtas_call(ddw_avail[2], 1, 1, NULL, liobn); > + ret = rtas_call(ddw_avail[DDW_REMOVE_PE_DMA_WIN], 1, 1, NULL, liobn); > if (ret) > pr_warn("%pOF: failed to remove direct window: rtas returned " > "%d to ibm,remove-pe-dma-window(%x) %llx\n", > - np, ret, ddw_avail[2], liobn); > + np, ret, ddw_avail[DDW_REMOVE_PE_DMA_WIN], liobn); > else > pr_debug("%pOF: successfully removed direct window: rtas > returned " > "%d to ibm,remove-pe-dma-window(%x) %llx\n", > - np, ret, 
ddw_avail[2], liobn); > + np, ret, ddw_avail[DDW_REMOVE_PE_DMA_WIN], liobn); > > delprop: > if (remove_prop) > @@ -889,11 +894,11 @@ static int query_ddw(struct pci_dev *dev, const u32 > *ddw_avail, > buid = pdn->phb->buid; > cfg_addr = ((pdn->busno << 16) | (pdn->devfn << 8)); > > - ret = rtas_call(ddw_avail[0], 3, 5, (u32 *)query, > - cfg_addr, BUID_HI(buid), BUID_LO(buid)); > + ret = rtas_call(ddw_avail[DDW_QUERY_PE_DMA_WIN], 3, 5, (u32 *)query, > + cfg_addr, BUID_HI(buid), BUID_LO(buid)); > dev_info(&dev->dev, "ibm,query-pe-dma-windows(%x) %x %x %x" > - " returned %d\n", ddw_avail[0], cfg_addr, BUID_HI(buid), > - BUID_LO(buid), ret); > + " returned %d\n", ddw_avail[DDW_QUERY_PE_DMA_WIN], cfg_addr, > + BUID_HI(buid), BUID_LO(buid), ret); > return ret; > } > > @@ -920,15 +925,16 @@ static int create_ddw(struct pci_dev *dev, const u32 > *ddw_avail, > > do { > /* extra outputs are LIOBN and dma-addr (hi, lo) */ > - ret = rtas_call(ddw_avail[1], 5, 4, (u32 *)create, > - cfg_addr, BUID_HI(buid), BUID_LO(buid), > - page_shift, window_shift); > + ret = rtas_call(ddw_avail[DDW_CREATE_PE_DMA_WIN], 5, 4, > + (u32 *)create, cfg_addr, BUID_HI(buid), > + BUID_LO(buid), page_shift, window_shift); > } while (rtas_busy_delay(ret)); > dev_info(&dev->dev, > "ibm,create-pe-dma-window(%x) %x %x %x %x %x returned %d " > - "(liobn = 0x%x starting addr = %x %x)\n", ddw_avail[1], > - cfg_addr, BUID_HI(buid), BUID_LO(buid), page_shift, > - window_shift, ret, create->liobn, create->addr_hi, > create->addr_lo); > + "(liobn = 0x%x starting addr = %x %x)\n", > + ddw_avail[DDW_CREATE_PE_DMA_WIN], cfg_addr, BUID_HI(buid), > + BUID_LO(buid), page_shift, window_shift, ret, create->liobn, > + create->addr_hi, create->addr_lo); > > return ret; > } > @@ -996,7 +1002,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct > device_node *pdn) > int page_shift; > u64 dma_addr, max_addr; > struct device_node *dn; > - u32 ddw_avail[3]; > + u32 ddw_avail[DDW_APPLICABLE_SIZE]; > struct direct_window 
*window; > struct property *win64; > struct dynamic_dma_window_prop *ddwprop; > @@ -1029,7 +1035,7 @@ static u64 enable_ddw(struct pci_dev *dev
Re: [PATCH v2 5/6] powerpc/pseries/iommu: Make use of DDW even if it does not map the partition
On 24/06/2020 16:24, Leonardo Bras wrote: > As of today, if a DDW is created and can't map the whole partition, it's > removed and the default DMA window "ibm,dma-window" is used instead. > > Usually this DDW is bigger than the default DMA window, so it would be > better to make use of it instead. > > Signed-off-by: Leonardo Bras > --- > arch/powerpc/platforms/pseries/iommu.c | 28 +- > 1 file changed, 19 insertions(+), 9 deletions(-) > > diff --git a/arch/powerpc/platforms/pseries/iommu.c > b/arch/powerpc/platforms/pseries/iommu.c > index 4fcf00016fb1..2d217cda4075 100644 > --- a/arch/powerpc/platforms/pseries/iommu.c > +++ b/arch/powerpc/platforms/pseries/iommu.c > @@ -685,7 +685,7 @@ static void pci_dma_bus_setup_pSeriesLP(struct pci_bus > *bus) > struct iommu_table *tbl; > struct device_node *dn, *pdn; > struct pci_dn *ppci; > - const __be32 *dma_window = NULL; > + const __be32 *dma_window = NULL, *alt_dma_window = NULL; > > dn = pci_bus_to_OF_node(bus); > > @@ -699,8 +699,13 @@ static void pci_dma_bus_setup_pSeriesLP(struct pci_bus > *bus) > break; > } > > + /* If there is a DDW available, use it instead */ > + alt_dma_window = of_get_property(pdn, DIRECT64_PROPNAME, NULL); It is not necessarily "direct" anymore as the name suggests, you may want to change that. DMA64_PROPNAME, may be. 
Thanks, > + if (alt_dma_window) > + dma_window = alt_dma_window; > + > if (dma_window == NULL) { > - pr_debug(" no ibm,dma-window property !\n"); > + pr_debug(" no ibm,dma-window nor > linux,direct64-ddr-window-info property !\n"); > return; > } > > @@ -1166,16 +1171,19 @@ static u64 enable_ddw(struct pci_dev *dev, struct > device_node *pdn) > query.page_size); > goto out_failed; > } > + > /* verify the window * number of ptes will map the partition */ > - /* check largest block * page size > max memory hotplug addr */ > max_addr = ddw_memory_hotplug_max(); > if (query.largest_available_block < (max_addr >> page_shift)) { > - dev_dbg(&dev->dev, "can't map partition max 0x%llx with %llu " > - "%llu-sized pages\n", max_addr, > query.largest_available_block, > - 1ULL << page_shift); > - goto out_failed; > + dev_dbg(&dev->dev, "can't map partition max 0x%llx with %llu > %llu-sized pages\n", > + max_addr, query.largest_available_block, > + 1ULL << page_shift); > + > + len = order_base_2(query.largest_available_block << page_shift); > + } else { > + len = order_base_2(max_addr); > } > - len = order_base_2(max_addr); > + > win64 = kzalloc(sizeof(struct property), GFP_KERNEL); > if (!win64) { > dev_info(&dev->dev, > @@ -1229,7 +1237,9 @@ static u64 enable_ddw(struct pci_dev *dev, struct > device_node *pdn) > list_add(&window->list, &direct_window_list); > spin_unlock(&direct_window_list_lock); > > - dma_addr = be64_to_cpu(ddwprop->dma_base); > + /* Only returns the dma_addr if DDW maps the whole partition */ > + if (len == order_base_2(max_addr)) > + dma_addr = be64_to_cpu(ddwprop->dma_base); > goto out_unlock; > > out_free_window: > -- Alexey
[powerpc:fixes-test] BUILD SUCCESS 19ab500edb5d6020010caba48ce3b4ce4182ab63
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git fixes-test
branch HEAD: 19ab500edb5d6020010caba48ce3b4ce4182ab63  powerpc/mm/pkeys: Make pkey access check work on execute_only_key

elapsed time: 2625m
configs tested: 163
configs skipped: 11

The following configs have been built successfully. More configs may be tested in the coming days.

arm defconfig
arm allyesconfig
arm allmodconfig
arm allnoconfig
arm64 allyesconfig
arm64 defconfig
arm64 allmodconfig
arm64 allnoconfig
mips tb0287_defconfig
powerpc mpc866_ads_defconfig
mips pistachio_defconfig
sparc64 allyesconfig
arm s3c6400_defconfig
mips malta_kvm_defconfig
powerpc mvme5100_defconfig
arm s3c2410_defconfig
microblaze defconfig
mips mtx1_defconfig
arm orion5x_defconfig
mips bcm63xx_defconfig
riscv nommu_k210_defconfig
i386 alldefconfig
mips vocore2_defconfig
um i386_defconfig
mips maltaup_defconfig
m68k multi_defconfig
m68k defconfig
sh rts7751r2d1_defconfig
arm viper_defconfig
powerpc ppc44x_defconfig
sh se7724_defconfig
mips pic32mzda_defconfig
arm pleb_defconfig
arm ezx_defconfig
xtensa common_defconfig
mips capcella_defconfig
arm lpc32xx_defconfig
arm versatile_defconfig
xtensa iss_defconfig
sh sh7785lcr_defconfig
mips sb1250_swarm_defconfig
mips gcw0_defconfig
mips decstation_defconfig
sparc allyesconfig
arc nsim_700_defconfig
mips rbtx49xx_defconfig
i386 allnoconfig
i386 allyesconfig
i386 defconfig
i386 debian-10.3
ia64 allmodconfig
ia64 defconfig
ia64 allnoconfig
ia64 allyesconfig
m68k allmodconfig
m68k allnoconfig
m68k sun3_defconfig
m68k allyesconfig
nios2 defconfig
nios2 allyesconfig
openrisc defconfig
c6x allyesconfig
c6x allnoconfig
openrisc allyesconfig
nds32 defconfig
nds32 allnoconfig
csky allyesconfig
csky defconfig
alpha defconfig
alpha allyesconfig
xtensa allyesconfig
h8300 allyesconfig
h8300 allmodconfig
xtensa defconfig
arc defconfig
sh allmodconfig
sh allnoconfig
microblaze allnoconfig
arc allyesconfig
mips allyesconfig
mips allnoconfig
mips allmodconfig
parisc allnoconfig
parisc defconfig
parisc allyesconfig
parisc allmodconfig
powerpc defconfig
powerpc allyesconfig
powerpc rhel-kconfig
powerpc allmodconfig
powerpc allnoconfig
i386 randconfig-a001-20200630
i386 randconfig-a003-20200630
i386 randconfig-a002-20200630
i386 randconfig-a004-20200630
i386 randconfig-a005-20200630
i386 randconfig-a006-20200630
i386 randconfig-a002-20200701
i386 randconfig-a001-20200701
i386 randconfig-a006-20200701
i386
Re: [PATCH v2 6/6] powerpc/pseries/iommu: Avoid errors when DDW starts at 0x00
On 27/06/2020 03:46, Leonardo Bras wrote: > On Wed, 2020-06-24 at 03:24 -0300, Leonardo Bras wrote: >> As of today, enable_ddw() will return a non-null DMA address if the >> created DDW maps the whole partition. If the address is valid, >> iommu_bypass_supported_pSeriesLP() will consider iommu bypass enabled. >> >> This can cause some trouble if the DDW happens to start at 0x00. >> >> Instead if checking if the address is non-null, check directly if >> the DDW maps the whole partition, so it can bypass iommu. >> >> Signed-off-by: Leonardo Bras > > This patch has a bug in it. I will rework it soon. I'd rather suggest this: https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20180725095032.2196-2-...@ozlabs.ru/ Although it does not look like you are actually going to have windows starting at 0. Thanks, > Please keep reviewing patches 1-5. > > Best regards, > Leonardo > -- Alexey
Re: [PATCH 01/11] kexec_file: allow archs to handle special regions while locating memory hole
On 06/29/20 at 05:26pm, Hari Bathini wrote:
> Hi Petr,
>
> On 29/06/20 5:09 pm, Petr Tesarik wrote:
> > Hi Hari,
> >
> > is there any good reason to add two more functions with a very similar
> > name to an existing function? AFAICS all you need is a way to call a
> > PPC64-specific function from within kexec_add_buffer (PATCH 4/11), so
> > you could add something like this:
> >
> > int __weak arch_kexec_locate_mem_hole(struct kexec_buf *kbuf)
> > {
> >         return 0;
> > }
> >
> > Call this function from kexec_add_buffer where appropriate and then
> > override it for PPC64 (it roughly corresponds to your
> > kexec_locate_mem_hole_ppc64() from PATCH 4/11).
> >
> > FWIW it would make it easier for me to follow the resulting code.
>
> Right, Petr.
>
> I was trying out a few things before I ended up with what I sent here.
> But yeah.. I did realize arch_kexec_locate_mem_hole() would have been better
> after sending out v1. Will take care of that in v2.

Another way is to use an arch-private function to locate the memory hole, set kbuf->mem, and then call kexec_add_buffer(); it will skip the common locate-hole function. But other than that I have some confusion about those excluded ranges. Replied with a question to patch 4.

Thanks
Dave
Re: [PATCH 04/11] ppc64/kexec_file: avoid stomping memory used by special regions
Hi Hari, On 06/27/20 at 12:35am, Hari Bathini wrote: > crashkernel region could have an overlap with special memory regions > like opal, rtas, tce-table & such. These regions are referred to as > exclude memory ranges. Setup this ranges during image probe in order > to avoid them while finding the buffer for different kdump segments. > Implement kexec_locate_mem_hole_ppc64() that locates a memory hole > accounting for these ranges. Also, override arch_kexec_add_buffer() > to locate a memory hole & later call __kexec_add_buffer() function > with kbuf->mem set to skip the generic locate memory hole lookup. > > Signed-off-by: Hari Bathini > --- > arch/powerpc/include/asm/crashdump-ppc64.h | 10 + > arch/powerpc/include/asm/kexec.h |7 - > arch/powerpc/kexec/elf_64.c|7 + > arch/powerpc/kexec/file_load_64.c | 292 > > 4 files changed, 312 insertions(+), 4 deletions(-) > create mode 100644 arch/powerpc/include/asm/crashdump-ppc64.h > [snip] > /** > + * get_exclude_memory_ranges - Get exclude memory ranges. This list includes > + * regions like opal/rtas, tce-table, initrd, > + * kernel, htab which should be avoided while > + * setting up kexec load segments. > + * @mem_ranges:Range list to add the memory ranges to. > + * > + * Returns 0 on success, negative errno on error. 
> + */
> +static int get_exclude_memory_ranges(struct crash_mem **mem_ranges)
> +{
> +        int ret;
> +
> +        ret = add_tce_mem_ranges(mem_ranges);
> +        if (ret)
> +                goto out;
> +
> +        ret = add_initrd_mem_range(mem_ranges);
> +        if (ret)
> +                goto out;
> +
> +        ret = add_htab_mem_range(mem_ranges);
> +        if (ret)
> +                goto out;
> +
> +        ret = add_kernel_mem_range(mem_ranges);
> +        if (ret)
> +                goto out;
> +
> +        ret = add_rtas_mem_range(mem_ranges, false);
> +        if (ret)
> +                goto out;
> +
> +        ret = add_opal_mem_range(mem_ranges, false);
> +        if (ret)
> +                goto out;
> +
> +        ret = add_reserved_ranges(mem_ranges);
> +        if (ret)
> +                goto out;
> +
> +        /* exclude memory ranges should be sorted for easy lookup */
> +        sort_memory_ranges(*mem_ranges);
> +out:
> +        if (ret)
> +                pr_err("Failed to setup exclude memory ranges\n");
> +        return ret;
> +}

I'm confused about the "overlap with crashkernel memory": does that mean memory used by the normal kernel could end up inside the crashkernel reserved memory range? If so, why can't we just skip those areas when crashkernel does the reservation?

Thanks
Dave
[PATCH v7 6/7] powerpc/pmem: Avoid the barrier in flush routines
nvdimm expects the flush routines to only mark the cache lines clean. The barrier that makes the stores globally visible is issued in nvdimm_flush().

Signed-off-by: Aneesh Kumar K.V
---
 arch/powerpc/lib/pmem.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/arch/powerpc/lib/pmem.c b/arch/powerpc/lib/pmem.c
index 5a61aaeb6930..21210fa676e5 100644
--- a/arch/powerpc/lib/pmem.c
+++ b/arch/powerpc/lib/pmem.c
@@ -19,9 +19,6 @@ static inline void __clean_pmem_range(unsigned long start, unsigned long stop)
 	for (i = 0; i < size >> shift; i++, addr += bytes)
 		asm volatile(PPC_DCBSTPS(%0, %1): :"i"(0), "r"(addr): "memory");
-
-
-	asm volatile(PPC_PHWSYNC ::: "memory");
 }
 
 static inline void __flush_pmem_range(unsigned long start, unsigned long stop)
@@ -34,9 +31,6 @@ static inline void __flush_pmem_range(unsigned long start, unsigned long stop)
 	for (i = 0; i < size >> shift; i++, addr += bytes)
 		asm volatile(PPC_DCBFPS(%0, %1): :"i"(0), "r"(addr): "memory");
-
-
-	asm volatile(PPC_PHWSYNC ::: "memory");
 }
 
 static inline void clean_pmem_range(unsigned long start, unsigned long stop)
-- 
2.26.2
Re: [PATCH v3] ASoC: fsl_asrc: Add an option to select internal ratio mode
On Tue, Jun 30, 2020 at 09:56:07PM +0800, Shengjiu Wang wrote:
> The ASRC not only supports ideal ratio mode, but also supports
> internal ratio mode.
>
> For internal ratio mode, the rate of the clock source should be divided
> with no remainder by the sample rate, otherwise there is sound
> distortion.
>
> Add function fsl_asrc_select_clk() to find a proper clock source for
> internal ratio mode; if such a clock source is available then internal
> ratio mode will be selected.
>
> With this change, the ideal ratio mode is no longer the only option for users.
>
> Signed-off-by: Shengjiu Wang

Reviewed-by: Nicolin Chen
Re: [PATCH v2 4/4] mm/vmalloc: Hugepage vmalloc mappings
> static void *__vmalloc_node(unsigned long size, unsigned long align,
> -			gfp_t gfp_mask, pgprot_t prot,
> -			int node, const void *caller);
> +			gfp_t gfp_mask, pgprot_t prot, unsigned long vm_flags,
> +			int node, const void *caller);
> static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> -				pgprot_t prot, int node)
> +				pgprot_t prot, unsigned int page_shift,
> +				int node)
> {
> 	struct page **pages;
> +	unsigned long addr = (unsigned long)area->addr;
> +	unsigned long size = get_vm_area_size(area);
> +	unsigned int page_order = page_shift - PAGE_SHIFT;
> 	unsigned int nr_pages, array_size, i;
> 	const gfp_t nested_gfp = (gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO;
> 	const gfp_t alloc_mask = gfp_mask | __GFP_NOWARN;
> 	const gfp_t highmem_mask = (gfp_mask & (GFP_DMA | GFP_DMA32)) ?
> -					0 :
> -					__GFP_HIGHMEM;
> +					0 : __GFP_HIGHMEM;
>
> -	nr_pages = get_vm_area_size(area) >> PAGE_SHIFT;
> +	nr_pages = size >> page_shift;

While trying out this patchset, we encountered a BUG_ON() in account_kernel_stack() in kernel/fork.c:

	BUG_ON(vm->nr_pages != THREAD_SIZE / PAGE_SIZE);

which obviously should be updated accordingly.

> 	array_size = (nr_pages * sizeof(struct page *));
>
> 	/* Please note that the recursion is strictly bounded. */
> 	if (array_size > PAGE_SIZE) {
> 		pages = __vmalloc_node(array_size, 1, nested_gfp|highmem_mask,
> -				PAGE_KERNEL, node, area->caller);
> +				PAGE_KERNEL, 0, node, area->caller);
> 	} else {
> 		pages = kmalloc_node(array_size, nested_gfp, node);
> 	}
[PATCH v7 5/7] powerpc/pmem: Update ppc64 to use the new barrier instruction.
pmem on POWER10 can now use phwsync instead of hwsync to ensure all previous writes are architecturally visible for the platform buffer flush.

Signed-off-by: Aneesh Kumar K.V
---
 arch/powerpc/include/asm/barrier.h | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
index 123adcefd40f..35c1b8f3aa68 100644
--- a/arch/powerpc/include/asm/barrier.h
+++ b/arch/powerpc/include/asm/barrier.h
@@ -7,6 +7,10 @@
 
 #include
 
+#ifndef __ASSEMBLY__
+#include
+#endif
+
 /*
  * Memory barrier.
  * The sync instruction guarantees that all memory accesses initiated
@@ -97,6 +101,15 @@ do { \
 #define barrier_nospec()
 #endif /* CONFIG_PPC_BARRIER_NOSPEC */
 
+/*
+ * pmem_wmb() ensures that all stores for which the modification
+ * are written to persistent storage by preceding dcbfps/dcbstps
+ * instructions have updated persistent storage before any data
+ * access or data transfer caused by subsequent instructions is
+ * initiated.
+ */
+#define pmem_wmb() __asm__ __volatile__(PPC_PHWSYNC ::: "memory")
+
 #include
 
 #endif /* _ASM_POWERPC_BARRIER_H */
-- 
2.26.2