[PATCH v2] KVM: PPC: Book3S HV nestedv2: Cancel pending DEC exception
This reverts commit 180c6b072bf3 ("KVM: PPC: Book3S HV nestedv2: Do not cancel pending decrementer exception") [1] which prevented canceling a pending HDEC exception for nestedv2 KVM guests. It was done to avoid overhead of a H_GUEST_GET_STATE hcall to read the 'DEC expiry TB' register which was higher compared to handling extra decrementer exceptions. However recent benchmarks indicate that overhead of not handling 'DECR' expiry for Nested KVM Guest(L2) is higher and results in much larger exits to Pseries Host(L1) as indicated by the Unixbench-arithoh bench[2] Metric| Current upstream| Revert [1] | Difference % arithoh-count (10)| 3244831634 | 3403089673 | +04.88% kvm_hv:kvm_guest_exit | 513558 | 152441 | -70.32% probe:kvmppc_gsb_recv | 28060 | 28110 | +00.18% N=1 As indicated by the data above that reverting [1] results in substantial reduction in number of L2->L1 exits with only slight increase in number of H_GUEST_GET_STATE hcalls to read the value of 'DEC expiry TB'. This results in an overall ~4% improvement of arithoh[2] throughput. [1] commit 180c6b072bf3 ("KVM: PPC: Book3S HV nestedv2: Do not cancel pending decrementer exception") [2] https://github.com/kdlucas/byte-unixbench/ Fixes: 180c6b072bf3 ("KVM: PPC: Book3S HV nestedv2: Do not cancel pending decrementer exception") Signed-off-by: Vaibhav Jain --- Changelog: Since v1: https://lore.kernel.org/all/20240313072625.76804-1-vaib...@linux.ibm.com * Updated/Corrected patch title and description * Included data on test benchmark results for Unixbench-arithoh bench. --- arch/powerpc/kvm/book3s_hv.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c index 8e86eb577eb8..692a7c6f5fd9 100644 --- a/arch/powerpc/kvm/book3s_hv.c +++ b/arch/powerpc/kvm/book3s_hv.c @@ -4857,7 +4857,7 @@ int kvmhv_run_single_vcpu(struct kvm_vcpu *vcpu, u64 time_limit, * entering a nested guest in which case the decrementer is now owned * by L2 and the L1 decrementer is provided in hdec_expires */ - if (!kvmhv_is_nestedv2() && kvmppc_core_pending_dec(vcpu) && + if (kvmppc_core_pending_dec(vcpu) && ((tb < kvmppc_dec_expires_host_tb(vcpu)) || (trap == BOOK3S_INTERRUPT_SYSCALL && kvmppc_get_gpr(vcpu, 3) == H_ENTER_NESTED))) -- 2.44.0
[PATCH] KVM: PPC: Book3S HV nestedv2: Cancel pending HDEC exception
This reverts commit 180c6b072bf360b686e53d893d8dcf7dbbaec6bb ("KVM: PPC: Book3S HV nestedv2: Do not cancel pending decrementer exception") which prevented cancelling a pending HDEC exception for nestedv2 KVM guests. It was done to avoid overhead of a H_GUEST_GET_STATE hcall to read the 'HDEC expiry TB' register which was higher compared to handling extra decrementer exceptions. This overhead of reading 'HDEC expiry TB' register has been mitigated recently by the L0 hypervisor(PowerVM) by putting the value of this register in L2 guest-state output buffer on trap to L1. From there the value of this register is cached, made available in kvmhv_run_single_vcpu() to compare it against host(L1) timebase and cancel the pending hypervisor decrementer exception if needed. Fixes: 180c6b072bf3 ("KVM: PPC: Book3S HV nestedv2: Do not cancel pending decrementer exception") Signed-off-by: Vaibhav Jain --- arch/powerpc/kvm/book3s_hv.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c index 0b921704da45..e47b954ce266 100644 --- a/arch/powerpc/kvm/book3s_hv.c +++ b/arch/powerpc/kvm/book3s_hv.c @@ -4856,7 +4856,7 @@ int kvmhv_run_single_vcpu(struct kvm_vcpu *vcpu, u64 time_limit, * entering a nested guest in which case the decrementer is now owned * by L2 and the L1 decrementer is provided in hdec_expires */ - if (!kvmhv_is_nestedv2() && kvmppc_core_pending_dec(vcpu) && + if (kvmppc_core_pending_dec(vcpu) && ((tb < kvmppc_dec_expires_host_tb(vcpu)) || (trap == BOOK3S_INTERRUPT_SYSCALL && kvmppc_get_gpr(vcpu, 3) == H_ENTER_NESTED))) -- 2.44.0
Re: [PATCH 2/2] powerpc/pseries: Set CPU_FTR_DBELL according to ibm,pi-features
Nicholas Piggin writes: > PAPR will define a new ibm,pi-features bit which says that doorbells > should not be used even on architectures where they exist. This could be > because they are emulated and slower than using the interrupt controller > directly for IPIs. > > Wire this bit into the pi-features parser to clear CPU_FTR_DBELL, and > ensure CPU_FTR_DBELL is not in CPU_FTRS_ALWAYS. > > Signed-off-by: Nicholas Piggin > --- Tested this patch on a PP64-LE lpar with the patch[1] and seeing the relevant pi-feature bit CPU_FTR_DBELL in `cur_cpu_spec->cpu_features` getting cleared. [1] https://lore.kernel.org/all/20240207035220.339726-1-npig...@gmail.com Hence, Tested-by: Vaibhav Jain -- Cheers ~ Vaibhav
Re: [PATCH 1/2] powerpc/pseries: Add a clear modifier to ibm,pa/pi-features parser
Nicholas Piggin writes: > When a new ibm,pa/pi-features bit is introduced that is intended to > apply to existing systems and features, it may have an "inverted" > meaning (i.e., bit clear => feature available; bit set => unavailable). > Depending on the nature of the feature, this may give the best > backward compatibility result where old firmware will continue to > have that bit clear and therefore the feature available. > > The 'invert' modifier presumably was introduced for this type of > feature bit. However it invert will set the feature if the bit is > clear, which prevents it being used in the situation where an old > CPU lacks a feature that a new CPU has, then a new firmware comes > out to disable that feature on the new CPU if the bit is set. > Adding an 'invert' entry for that feature would incorrectly enable > it for the old CPU. > > So add a 'clear' modifier that clears the feature if the bit is set, > but it does not set the feature if the bit is clear. The feature > is expected to be set in the cpu table. > > This replaces the 'invert' modifier, which is unused since commit > 7d4703455168 ("powerpc/feature: Remove CPU_FTR_NODSISRALIGN"). > > Signed-off-by: Nicholas Piggin > --- Tested this patch on a PP64-LE lpar with the patch[1] and seeing the relevant pi-feature bit CPU_FTR_DBELL getting cleared. [1] https://lore.kernel.org/all/20240207035220.339726-2-npig...@gmail.com Hence, Tested-by: Vaibhav Jain -- Cheers ~ Vaibhav
Re: [PATCH v2] KVM: PPC: Book3S HV: Fix L2 guest reboot failure due to empty 'arch_compat'
Hi Amit, Thanks for the patch. Minor comment on the patch below: Amit Machhiwal writes: > > +static inline unsigned long map_pcr_to_cap(unsigned long pcr) > +{ > + unsigned long cap = 0; > + > + switch (pcr) { > + case PCR_ARCH_300: > + cap = H_GUEST_CAP_POWER9; > + break; > + case PCR_ARCH_31: > + cap = H_GUEST_CAP_POWER10; Though CONFIG_CC_IMPLICIT_FALLTHROUGH and '-Wimplicit-fallthrough' doesnt explicitly flag this usage, please consider using the 'fallthrough;' keyword here. However you probably dont want this switch-case to fallthrough so please use a 'break' instead. > + default: > + break; > + } > + > + return cap; > +} > + > With the suggested change above Reviewed-by: Vaibhav Jain -- Cheers ~ Vaibhav
Re: [PATCH] KVM: PPC: Book3S HV nestedv2: Fix an error handling path in gs_msg_ops_kvmhv_nestedv2_config_fill_info()
Christophe JAILLET writes: > if (kvmppc_gsm_includes(gsm, KVMPPC_GSID_RUN_OUTPUT)) { > - kvmppc_gse_put_buff_info(gsb, KVMPPC_GSID_RUN_OUTPUT, > - cfg->vcpu_run_output_cfg); > + rc = kvmppc_gse_put_buff_info(gsb, KVMPPC_GSID_RUN_OUTPUT, > + cfg->vcpu_run_output_cfg); > if (rc < 0) > return rc; > } > -- Thanks for catching this and the change looks good to me. Reviewed-by: Vaibhav Jain -- Cheers ~ Vaibhav
[PATCH] powerpc/hvcall: Reorder Nestedv2 hcall opcodes
This trivial patch reorders the newly introduced hcall opcodes for Nestedv2 to follow the increasing-opcode-number convention followed in 'hvcall.h'. The patch also updates the value for MAX_HCALL_OPCODE which is at various places in arch code for range checking. Fixes: 19d31c5f1157 ("KVM: PPC: Add support for nestedv2 guests") Suggested-by: Michael Ellerman Signed-off-by: Vaibhav Jain --- arch/powerpc/include/asm/hvcall.h | 21 +++-- 1 file changed, 11 insertions(+), 10 deletions(-) diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h index ddb99e982917..605ed2b58aff 100644 --- a/arch/powerpc/include/asm/hvcall.h +++ b/arch/powerpc/include/asm/hvcall.h @@ -349,7 +349,17 @@ #define H_GET_ENERGY_SCALE_INFO0x450 #define H_PKS_SIGNED_UPDATE0x454 #define H_WATCHDOG 0x45C -#define MAX_HCALL_OPCODE H_WATCHDOG +#define H_WATCHDOG 0x45C +#define H_GUEST_GET_CAPABILITIES 0x460 +#define H_GUEST_SET_CAPABILITIES 0x464 +#define H_GUEST_CREATE 0x470 +#define H_GUEST_CREATE_VCPU0x474 +#define H_GUEST_GET_STATE 0x478 +#define H_GUEST_SET_STATE 0x47C +#define H_GUEST_RUN_VCPU 0x480 +#define H_GUEST_COPY_MEMORY0x484 +#define H_GUEST_DELETE 0x488 +#define MAX_HCALL_OPCODE H_GUEST_DELETE /* Scope args for H_SCM_UNBIND_ALL */ #define H_UNBIND_SCOPE_ALL (0x1) @@ -393,15 +403,6 @@ #define H_ENTER_NESTED 0xF804 #define H_TLB_INVALIDATE 0xF808 #define H_COPY_TOFROM_GUEST0xF80C -#define H_GUEST_GET_CAPABILITIES 0x460 -#define H_GUEST_SET_CAPABILITIES 0x464 -#define H_GUEST_CREATE 0x470 -#define H_GUEST_CREATE_VCPU0x474 -#define H_GUEST_GET_STATE 0x478 -#define H_GUEST_SET_STATE 0x47C -#define H_GUEST_RUN_VCPU 0x480 -#define H_GUEST_COPY_MEMORY0x484 -#define H_GUEST_DELETE 0x488 /* Flags for H_SVM_PAGE_IN */ #define H_PAGE_IN_SHARED0x1 -- 2.43.0
Re: [PATCH 09/12] KVM: PPC: Book3S HV nestedv2: Do not call H_COPY_TOFROM_GUEST
Hi Aneesh, "Aneesh Kumar K.V" writes: >> Yes, Agreed and thats a nice suggestion. However ATM the hypervisor >> supporting Nestedv2 doesnt have support for this hcall. In future >> once we have support for this hcall for nestedv2 from the hypervisor >> we can replace this branch with a firmware_has_feature() test. >> > > What I am suggesting is we convert that conditional to firmware_has_feature > so that > later when hypervisor supports this hcall all older kernel can make > use of the copy_tofrom_guest without any code change. AFAIK for firmware_has_feature to work we either need: - A way to call this hcall with some invalid args. However lpid/pid for guest arent allocated during boot. - A way for hypervisor to advertise support for this hcall before the L1 kernel boots. ATM L0 dosent support for any of these two ways. I can do a follow up patch later when we have a clarity on how we want to advertise support for this hcall. For now current kernel supporting nestedv2 wont be using this hcall assuming its not supported. Future kernels can use one of the two ways above to set the firmware_has_feature flag to take advantage of this hcall. -- Cheers ~ Vaibhav
Re: [PATCH 01/12] KVM: PPC: Book3S HV nestedv2: Invalidate RPT before deleting a guest
"Aneesh Kumar K.V" writes: > Vaibhav Jain writes: > >> Hi Aneesh, >> >> Thanks for looking into this patch. My responses inline below: >> >> "Aneesh Kumar K.V (IBM)" writes: >> >>> Vaibhav Jain writes: >>> >>>> From: Jordan Niethe >>>> >>>> An L0 must invalidate the L2's RPT during H_GUEST_DELETE if this has not >>>> already been done. This is a slow operation that means H_GUEST_DELETE >>>> must return H_BUSY multiple times before completing. Invalidating the >>>> tables before deleting the guest so there is less work for the L0 to do. >>>> >>>> Signed-off-by: Jordan Niethe >>>> --- >>>> arch/powerpc/include/asm/kvm_book3s.h | 1 + >>>> arch/powerpc/kvm/book3s_hv.c | 6 -- >>>> arch/powerpc/kvm/book3s_hv_nested.c | 2 +- >>>> 3 files changed, 6 insertions(+), 3 deletions(-) >>>> >>>> diff --git a/arch/powerpc/include/asm/kvm_book3s.h >>>> b/arch/powerpc/include/asm/kvm_book3s.h >>>> index 4f527d09c92b..a37736ed3728 100644 >>>> --- a/arch/powerpc/include/asm/kvm_book3s.h >>>> +++ b/arch/powerpc/include/asm/kvm_book3s.h >>>> @@ -302,6 +302,7 @@ void kvmhv_nested_exit(void); >>>> void kvmhv_vm_nested_init(struct kvm *kvm); >>>> long kvmhv_set_partition_table(struct kvm_vcpu *vcpu); >>>> long kvmhv_copy_tofrom_guest_nested(struct kvm_vcpu *vcpu); >>>> +void kvmhv_flush_lpid(u64 lpid); >>>> void kvmhv_set_ptbl_entry(u64 lpid, u64 dw0, u64 dw1); >>>> void kvmhv_release_all_nested(struct kvm *kvm); >>>> long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu); >>>> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c >>>> index 1ed6ec140701..5543e8490cd9 100644 >>>> --- a/arch/powerpc/kvm/book3s_hv.c >>>> +++ b/arch/powerpc/kvm/book3s_hv.c >>>> @@ -5691,10 +5691,12 @@ static void kvmppc_core_destroy_vm_hv(struct kvm >>>> *kvm) >>>>kvmhv_set_ptbl_entry(kvm->arch.lpid, 0, 0); >>>>} >>>> >>>> - if (kvmhv_is_nestedv2()) >>>> + if (kvmhv_is_nestedv2()) { >>>> + kvmhv_flush_lpid(kvm->arch.lpid); >>>>plpar_guest_delete(0, kvm->arch.lpid); >>>> >>> >>> I am not sure I follow the optimization here. I would expect the >>> hypervisor to kill all the translation caches as part of guest_delete. >>> What is the benefit of doing a lpid flush outside the guest delete? >>> >> Thats right. However without this optimization the H_GUEST_DELETE hcall >> in plpar_guest_delete() returns H_BUSY multiple times resulting in >> multiple hcalls to the hypervisor until it finishes. Flushing the guest >> translation cache upfront reduces the number of HCALLs L1 guests has to >> make to delete a L2 guest via H_GUEST_DELETE. >> > > can we add that as a comment above that kvmhv_flush_lpid()? Sure, I will put up a comment with that detail in v2 of the patch series. > > -aneesh -- Cheers ~ Vaibhav
Re: [PATCH 09/12] KVM: PPC: Book3S HV nestedv2: Do not call H_COPY_TOFROM_GUEST
Hi Aneesh, Thanks for looking into this patch. My responses inline: "Aneesh Kumar K.V (IBM)" writes: > May be we should use > firmware_has_feature(FW_FEATURE_H_COPY_TOFROM_GUEST))? > > the nestedv2 can end up using the above hcall if it is supported by the > hypervisor right? In its absence we will have to translate the guest ea > using xlate and then use kvm_guest_read to read location using the guest > real address right? That xlate will also involves multiple kvm_guest_read. > > Yes, Agreed and thats a nice suggestion. However ATM the hypervisor supporting Nestedv2 doesnt have support for this hcall. In future once we have support for this hcall for nestedv2 from the hypervisor we can replace this branch with a firmware_has_feature() test. >> Signed-off-by: Jordan Niethe >> --- >> arch/powerpc/kvm/book3s_64_mmu_radix.c | 3 +++ >> 1 file changed, 3 insertions(+) >> >> diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c >> b/arch/powerpc/kvm/book3s_64_mmu_radix.c >> index 916af6c153a5..4a1abb9f7c05 100644 >> --- a/arch/powerpc/kvm/book3s_64_mmu_radix.c >> +++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c >> @@ -40,6 +40,9 @@ unsigned long __kvmhv_copy_tofrom_guest_radix(int lpid, >> int pid, >> unsigned long quadrant, ret = n; >> bool is_load = !!to; >> >> +if (kvmhv_is_nestedv2()) >> +return H_UNSUPPORTED; >> + >> /* Can't access quadrants 1 or 2 in non-HV mode, call the HV to do it */ >> if (kvmhv_on_pseries()) >> return plpar_hcall_norets(H_COPY_TOFROM_GUEST, lpid, pid, eaddr, >> -- >> 2.42.0 -- Cheers ~ Vaibhav
Re: [PATCH 01/12] KVM: PPC: Book3S HV nestedv2: Invalidate RPT before deleting a guest
Hi Aneesh, Thanks for looking into this patch. My responses inline below: "Aneesh Kumar K.V (IBM)" writes: > Vaibhav Jain writes: > >> From: Jordan Niethe >> >> An L0 must invalidate the L2's RPT during H_GUEST_DELETE if this has not >> already been done. This is a slow operation that means H_GUEST_DELETE >> must return H_BUSY multiple times before completing. Invalidating the >> tables before deleting the guest so there is less work for the L0 to do. >> >> Signed-off-by: Jordan Niethe >> --- >> arch/powerpc/include/asm/kvm_book3s.h | 1 + >> arch/powerpc/kvm/book3s_hv.c | 6 -- >> arch/powerpc/kvm/book3s_hv_nested.c | 2 +- >> 3 files changed, 6 insertions(+), 3 deletions(-) >> >> diff --git a/arch/powerpc/include/asm/kvm_book3s.h >> b/arch/powerpc/include/asm/kvm_book3s.h >> index 4f527d09c92b..a37736ed3728 100644 >> --- a/arch/powerpc/include/asm/kvm_book3s.h >> +++ b/arch/powerpc/include/asm/kvm_book3s.h >> @@ -302,6 +302,7 @@ void kvmhv_nested_exit(void); >> void kvmhv_vm_nested_init(struct kvm *kvm); >> long kvmhv_set_partition_table(struct kvm_vcpu *vcpu); >> long kvmhv_copy_tofrom_guest_nested(struct kvm_vcpu *vcpu); >> +void kvmhv_flush_lpid(u64 lpid); >> void kvmhv_set_ptbl_entry(u64 lpid, u64 dw0, u64 dw1); >> void kvmhv_release_all_nested(struct kvm *kvm); >> long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu); >> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c >> index 1ed6ec140701..5543e8490cd9 100644 >> --- a/arch/powerpc/kvm/book3s_hv.c >> +++ b/arch/powerpc/kvm/book3s_hv.c >> @@ -5691,10 +5691,12 @@ static void kvmppc_core_destroy_vm_hv(struct kvm >> *kvm) >> kvmhv_set_ptbl_entry(kvm->arch.lpid, 0, 0); >> } >> >> -if (kvmhv_is_nestedv2()) >> +if (kvmhv_is_nestedv2()) { >> +kvmhv_flush_lpid(kvm->arch.lpid); >> plpar_guest_delete(0, kvm->arch.lpid); >> > > I am not sure I follow the optimization here. I would expect the > hypervisor to kill all the translation caches as part of guest_delete. > What is the benefit of doing a lpid flush outside the guest delete? > Thats right. However without this optimization the H_GUEST_DELETE hcall in plpar_guest_delete() returns H_BUSY multiple times resulting in multiple hcalls to the hypervisor until it finishes. Flushing the guest translation cache upfront reduces the number of HCALLs L1 guests has to make to delete a L2 guest via H_GUEST_DELETE. >> -else >> +} else { >> kvmppc_free_lpid(kvm->arch.lpid); >> +} >> >> kvmppc_free_pimap(kvm); >> } >> diff --git a/arch/powerpc/kvm/book3s_hv_nested.c >> b/arch/powerpc/kvm/book3s_hv_nested.c >> index 3b658b8696bc..5c375ec1a3c6 100644 >> --- a/arch/powerpc/kvm/book3s_hv_nested.c >> +++ b/arch/powerpc/kvm/book3s_hv_nested.c >> @@ -503,7 +503,7 @@ void kvmhv_nested_exit(void) >> } >> } >> >> -static void kvmhv_flush_lpid(u64 lpid) >> +void kvmhv_flush_lpid(u64 lpid) >> { >> long rc; >> >> -- >> 2.42.0 -- Cheers ~ Vaibhav
[PATCH 11/12] KVM: PPC: Reduce reliance on analyse_instr() in mmio emulation
From: Jordan Niethe Commit 709236039964 ("KVM: PPC: Reimplement non-SIMD LOAD/STORE instruction mmio emulation with analyse_instr() input") and commit 2b33cb585f94 ("KVM: PPC: Reimplement LOAD_FP/STORE_FP instruction mmio emulation with analyse_instr() input") made kvmppc_emulate_loadstore() use the results from analyse_instr() for instruction emulation. In particular the effective address from analyse_instr() is used for UPDATE type instructions and fact that op.val is all ready endian corrected is used in the STORE case. However, these changes now have some negative implications for the nestedv2 case. For analyse_instr() to determine the correct effective address, the GPRs must be loaded from the L0. This is not needed as vcpu->arch.vaddr_accessed is already set. Change back to using vcpu->arch.vaddr_accessed. In the STORE case, use kvmppc_get_gpr() value instead of the op.val. kvmppc_get_gpr() will reload from the L0 if needed in the nestedv2 case. This means if a byte reversal is needed must now be passed to kvmppc_handle_store() like in the kvmppc_handle_load() case. This means the call to kvmhv_nestedv2_reload_ptregs() can be avoided as there is no concern about op.val being stale. Drop the call to kvmhv_nestedv2_mark_dirty_ptregs() as without the call to kvmhv_nestedv2_reload_ptregs(), stale state could be marked as valid. This is fine as the required marking things dirty is already handled for the UPDATE case by the call to kvmppc_set_gpr(). For LOADs, it is handled in kvmppc_complete_mmio_load(). This is called either directly in __kvmppc_handle_load() if the load can be handled in KVM, or on the next kvm_arch_vcpu_ioctl_run() if an exit was required. Signed-off-by: Jordan Niethe --- arch/powerpc/kvm/emulate_loadstore.c | 21 ++--- 1 file changed, 10 insertions(+), 11 deletions(-) diff --git a/arch/powerpc/kvm/emulate_loadstore.c b/arch/powerpc/kvm/emulate_loadstore.c index 077fd88a0b68..ec60c7979718 100644 --- a/arch/powerpc/kvm/emulate_loadstore.c +++ b/arch/powerpc/kvm/emulate_loadstore.c @@ -93,7 +93,6 @@ int kvmppc_emulate_loadstore(struct kvm_vcpu *vcpu) emulated = EMULATE_FAIL; vcpu->arch.regs.msr = kvmppc_get_msr(vcpu); - kvmhv_nestedv2_reload_ptregs(vcpu, >arch.regs); if (analyse_instr(, >arch.regs, inst) == 0) { int type = op.type & INSTR_TYPE_MASK; int size = GETSIZE(op.type); @@ -112,7 +111,7 @@ int kvmppc_emulate_loadstore(struct kvm_vcpu *vcpu) op.reg, size, !instr_byte_swap); if ((op.type & UPDATE) && (emulated != EMULATE_FAIL)) - kvmppc_set_gpr(vcpu, op.update_reg, op.ea); + kvmppc_set_gpr(vcpu, op.update_reg, vcpu->arch.vaddr_accessed); break; } @@ -132,7 +131,7 @@ int kvmppc_emulate_loadstore(struct kvm_vcpu *vcpu) KVM_MMIO_REG_FPR|op.reg, size, 1); if ((op.type & UPDATE) && (emulated != EMULATE_FAIL)) - kvmppc_set_gpr(vcpu, op.update_reg, op.ea); + kvmppc_set_gpr(vcpu, op.update_reg, vcpu->arch.vaddr_accessed); break; #endif @@ -224,16 +223,17 @@ int kvmppc_emulate_loadstore(struct kvm_vcpu *vcpu) break; } #endif - case STORE: - /* if need byte reverse, op.val has been reversed by -* analyse_instr(). -*/ - emulated = kvmppc_handle_store(vcpu, op.val, size, 1); + case STORE: { + int instr_byte_swap = op.type & BYTEREV; + + emulated = kvmppc_handle_store(vcpu, kvmppc_get_gpr(vcpu, op.reg), + size, !instr_byte_swap); if ((op.type & UPDATE) && (emulated != EMULATE_FAIL)) - kvmppc_set_gpr(vcpu, op.update_reg, op.ea); + kvmppc_set_gpr(vcpu, op.update_reg, vcpu->arch.vaddr_accessed); break; + } #ifdef CONFIG_PPC_FPU case STORE_FP: if (kvmppc_check_fp_disabled(vcpu)) @@ -254,7 +254,7 @@ int kvmppc_emulate_loadstore(struct kvm_vcpu *vcpu) kvmppc_get_fpr(vcpu, op.reg), size, 1); if ((op.type & UPDATE) && (emulated != EMULATE_FAIL)) - kvmppc_set_gpr(vcpu, op.update_reg, op.ea); + kvmppc_set_gpr(vcpu, op.update_reg, vcpu->arch.vaddr_accessed); break; #endif @@ -358,7 +358,6 @@ int kvmppc_emulate_loadstore(struct kvm_vcpu *vcpu) } trace_kvm_ppc_instr(ppc_inst_val(inst),
[PATCH 09/12] KVM: PPC: Book3S HV nestedv2: Do not call H_COPY_TOFROM_GUEST
From: Jordan Niethe H_COPY_TOFROM_GUEST is part of the nestedv1 API and so should not be called by a nestedv2 host. Do not attempt to call it. Signed-off-by: Jordan Niethe --- arch/powerpc/kvm/book3s_64_mmu_radix.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c index 916af6c153a5..4a1abb9f7c05 100644 --- a/arch/powerpc/kvm/book3s_64_mmu_radix.c +++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c @@ -40,6 +40,9 @@ unsigned long __kvmhv_copy_tofrom_guest_radix(int lpid, int pid, unsigned long quadrant, ret = n; bool is_load = !!to; + if (kvmhv_is_nestedv2()) + return H_UNSUPPORTED; + /* Can't access quadrants 1 or 2 in non-HV mode, call the HV to do it */ if (kvmhv_on_pseries()) return plpar_hcall_norets(H_COPY_TOFROM_GUEST, lpid, pid, eaddr, -- 2.42.0
[PATCH 12/12] KVM: PPC: Book3S HV nestedv2: Do not cancel pending decrementer exception
From: Jordan Niethe In the nestedv2 case, if there is a pending decrementer exception, the L1 must get the L2's timebase from the L0 to see if the exception should be cancelled. This adds the overhead of a H_GUEST_GET_STATE call to the likely case in which the decrementer should not be cancelled. Avoid this logic for the nestedv2 case. Signed-off-by: Jordan Niethe --- arch/powerpc/kvm/book3s_hv.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c index 2ee3f2478570..e48126a59ba7 100644 --- a/arch/powerpc/kvm/book3s_hv.c +++ b/arch/powerpc/kvm/book3s_hv.c @@ -4834,7 +4834,7 @@ int kvmhv_run_single_vcpu(struct kvm_vcpu *vcpu, u64 time_limit, * entering a nested guest in which case the decrementer is now owned * by L2 and the L1 decrementer is provided in hdec_expires */ - if (kvmppc_core_pending_dec(vcpu) && + if (!kvmhv_is_nestedv2() && kvmppc_core_pending_dec(vcpu) && ((tb < kvmppc_dec_expires_host_tb(vcpu)) || (trap == BOOK3S_INTERRUPT_SYSCALL && kvmppc_get_gpr(vcpu, 3) == H_ENTER_NESTED))) -- 2.42.0
[PATCH 03/12] KVM: PPC: Book3S HV nestedv2: Do not check msr on hcalls
From: Jordan Niethe The check for a hcall coming from userspace is done for KVM-PR. This is not supported for nestedv2 and the L0 will directly inject the necessary exception to the L2 if userspace performs a hcall. Avoid checking the MSR and thus avoid a H_GUEST_GET_STATE hcall in the L1. Signed-off-by: Jordan Niethe --- arch/powerpc/kvm/book3s_hv.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c index 5543e8490cd9..069c336b6f3c 100644 --- a/arch/powerpc/kvm/book3s_hv.c +++ b/arch/powerpc/kvm/book3s_hv.c @@ -1688,7 +1688,7 @@ static int kvmppc_handle_exit_hv(struct kvm_vcpu *vcpu, { int i; - if (unlikely(__kvmppc_get_msr_hv(vcpu) & MSR_PR)) { + if (!kvmhv_is_nestedv2() && unlikely(__kvmppc_get_msr_hv(vcpu) & MSR_PR)) { /* * Guest userspace executed sc 1. This can only be * reached by the P9 path because the old path @@ -4949,7 +4949,7 @@ static int kvmppc_vcpu_run_hv(struct kvm_vcpu *vcpu) if (run->exit_reason == KVM_EXIT_PAPR_HCALL) { accumulate_time(vcpu, >arch.hcall); - if (WARN_ON_ONCE(__kvmppc_get_msr_hv(vcpu) & MSR_PR)) { + if (!kvmhv_is_nestedv2() && WARN_ON_ONCE(__kvmppc_get_msr_hv(vcpu) & MSR_PR)) { /* * These should have been caught reflected * into the guest by now. Final sanity check: -- 2.42.0
[PATCH 10/12] KVM: PPC: Book3S HV nestedv2: Register the VPA with the L0
From: Jordan Niethe In the nestedv2 case, the L1 may register the L2's VPA with the L0. This allows the L0 to manage the L2's dispatch count, as well as enable possible performance optimisations by seeing if certain resources are not being used by the L2 (such as the PMCs). Use the H_GUEST_SET_STATE call to inform the L0 of the L2's VPA address. This can not be done in the H_GUEST_VCPU_RUN input buffer. Signed-off-by: Jordan Niethe --- arch/powerpc/include/asm/kvm_book3s_64.h | 1 + arch/powerpc/kvm/book3s_hv.c | 38 ++-- arch/powerpc/kvm/book3s_hv_nestedv2.c| 29 ++ 3 files changed, 59 insertions(+), 9 deletions(-) diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h b/arch/powerpc/include/asm/kvm_book3s_64.h index 2477021bff54..d8729ec81ca0 100644 --- a/arch/powerpc/include/asm/kvm_book3s_64.h +++ b/arch/powerpc/include/asm/kvm_book3s_64.h @@ -682,6 +682,7 @@ void kvmhv_nestedv2_vcpu_free(struct kvm_vcpu *vcpu, struct kvmhv_nestedv2_io *i int kvmhv_nestedv2_flush_vcpu(struct kvm_vcpu *vcpu, u64 time_limit); int kvmhv_nestedv2_set_ptbl_entry(unsigned long lpid, u64 dw0, u64 dw1); int kvmhv_nestedv2_parse_output(struct kvm_vcpu *vcpu); +int kvmhv_nestedv2_set_vpa(struct kvm_vcpu *vcpu, unsigned long vpa); #endif /* CONFIG_KVM_BOOK3S_HV_POSSIBLE */ diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c index 47fe470375df..2ee3f2478570 100644 --- a/arch/powerpc/kvm/book3s_hv.c +++ b/arch/powerpc/kvm/book3s_hv.c @@ -650,7 +650,8 @@ static unsigned long do_h_register_vpa(struct kvm_vcpu *vcpu, return err; } -static void kvmppc_update_vpa(struct kvm_vcpu *vcpu, struct kvmppc_vpa *vpap) +static void kvmppc_update_vpa(struct kvm_vcpu *vcpu, struct kvmppc_vpa *vpap, + struct kvmppc_vpa *old_vpap) { struct kvm *kvm = vcpu->kvm; void *va; @@ -690,9 +691,8 @@ static void kvmppc_update_vpa(struct kvm_vcpu *vcpu, struct kvmppc_vpa *vpap) kvmppc_unpin_guest_page(kvm, va, gpa, false); va = NULL; } - if (vpap->pinned_addr) - kvmppc_unpin_guest_page(kvm, vpap->pinned_addr, vpap->gpa, - vpap->dirty); + *old_vpap = *vpap; + vpap->gpa = gpa; vpap->pinned_addr = va; vpap->dirty = false; @@ -702,6 +702,9 @@ static void kvmppc_update_vpa(struct kvm_vcpu *vcpu, struct kvmppc_vpa *vpap) static void kvmppc_update_vpas(struct kvm_vcpu *vcpu) { + struct kvm *kvm = vcpu->kvm; + struct kvmppc_vpa old_vpa = { 0 }; + if (!(vcpu->arch.vpa.update_pending || vcpu->arch.slb_shadow.update_pending || vcpu->arch.dtl.update_pending)) @@ -709,17 +712,34 @@ static void kvmppc_update_vpas(struct kvm_vcpu *vcpu) spin_lock(>arch.vpa_update_lock); if (vcpu->arch.vpa.update_pending) { - kvmppc_update_vpa(vcpu, >arch.vpa); - if (vcpu->arch.vpa.pinned_addr) + kvmppc_update_vpa(vcpu, >arch.vpa, _vpa); + if (old_vpa.pinned_addr) { + if (kvmhv_is_nestedv2()) + kvmhv_nestedv2_set_vpa(vcpu, ~0ull); + kvmppc_unpin_guest_page(kvm, old_vpa.pinned_addr, old_vpa.gpa, + old_vpa.dirty); + } + if (vcpu->arch.vpa.pinned_addr) { init_vpa(vcpu, vcpu->arch.vpa.pinned_addr); + if (kvmhv_is_nestedv2()) + kvmhv_nestedv2_set_vpa(vcpu, __pa(vcpu->arch.vpa.pinned_addr)); + } } if (vcpu->arch.dtl.update_pending) { - kvmppc_update_vpa(vcpu, >arch.dtl); + kvmppc_update_vpa(vcpu, >arch.dtl, _vpa); + if (old_vpa.pinned_addr) + kvmppc_unpin_guest_page(kvm, old_vpa.pinned_addr, old_vpa.gpa, + old_vpa.dirty); vcpu->arch.dtl_ptr = vcpu->arch.dtl.pinned_addr; vcpu->arch.dtl_index = 0; } - if (vcpu->arch.slb_shadow.update_pending) - kvmppc_update_vpa(vcpu, >arch.slb_shadow); + if (vcpu->arch.slb_shadow.update_pending) { + kvmppc_update_vpa(vcpu, >arch.slb_shadow, _vpa); + if (old_vpa.pinned_addr) + kvmppc_unpin_guest_page(kvm, old_vpa.pinned_addr, old_vpa.gpa, + old_vpa.dirty); + } + spin_unlock(>arch.vpa_update_lock); } diff --git a/arch/powerpc/kvm/book3s_hv_nestedv2.c b/arch/powerpc/kvm/book3s_hv_nestedv2.c index fd3c4f2d9480..5378eb40b162 100644 --- a/arch/powerpc/kvm/book3s_hv_nestedv2.c +++ b/arch/powerpc/kvm/book3s_hv_nestedv2.c @@ -855,6 +855,35 @@ int kvmhv_nestedv2_set_ptbl_entry(unsigned long lpid, u64 dw0, u64 dw1) }
[PATCH 08/12] KVM: PPC: Book3S HV nestedv2: Avoid msr check in kvmppc_handle_exit_hv()
From: Jordan Niethe The msr check in kvmppc_handle_exit_hv() is not needed for nestedv2 hosts, skip the check to avoid a H_GUEST_GET_STATE hcall. Signed-off-by: Jordan Niethe --- arch/powerpc/kvm/book3s_hv.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c index 4dc6a928073f..47fe470375df 100644 --- a/arch/powerpc/kvm/book3s_hv.c +++ b/arch/powerpc/kvm/book3s_hv.c @@ -1597,7 +1597,7 @@ static int kvmppc_handle_exit_hv(struct kvm_vcpu *vcpu, * That can happen due to a bug, or due to a machine check * occurring at just the wrong time. */ - if (__kvmppc_get_msr_hv(vcpu) & MSR_HV) { + if (!kvmhv_is_nestedv2() && (__kvmppc_get_msr_hv(vcpu) & MSR_HV)) { printk(KERN_EMERG "KVM trap in HV mode!\n"); printk(KERN_EMERG "trap=0x%x | pc=0x%lx | msr=0x%llx\n", vcpu->arch.trap, kvmppc_get_pc(vcpu), -- 2.42.0
[PATCH 04/12] KVM: PPC: Book3S HV nestedv2: Get the PID only if needed to copy tofrom a guest
From: Jordan Niethe kvmhv_copy_tofrom_guest_radix() gets the PID at the start of the function. If pid is not used, then this is a wasteful H_GUEST_GET_STATE hcall for nestedv2 hosts. Move the assignment to where pid will be used. Suggested-by: Nicholas Piggin Signed-off-by: Jordan Niethe --- arch/powerpc/kvm/book3s_64_mmu_radix.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c index 175a8eb2681f..916af6c153a5 100644 --- a/arch/powerpc/kvm/book3s_64_mmu_radix.c +++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c @@ -97,7 +97,7 @@ static long kvmhv_copy_tofrom_guest_radix(struct kvm_vcpu *vcpu, gva_t eaddr, void *to, void *from, unsigned long n) { int lpid = vcpu->kvm->arch.lpid; - int pid = kvmppc_get_pid(vcpu); + int pid; /* This would cause a data segment intr so don't allow the access */ if (eaddr & (0x3FFUL << 52)) @@ -110,6 +110,8 @@ static long kvmhv_copy_tofrom_guest_radix(struct kvm_vcpu *vcpu, gva_t eaddr, /* If accessing quadrant 3 then pid is expected to be 0 */ if (((eaddr >> 62) & 0x3) == 0x3) pid = 0; + else + pid = kvmppc_get_pid(vcpu); eaddr &= ~(0xFFFUL << 52); -- 2.42.0
[PATCH 07/12] KVM: PPC: Book3S HV nestedv2: Do not inject certain interrupts
From: Jordan Niethe There is no need to inject an external interrupt in kvmppc_book3s_irqprio_deliver() as the test for BOOK3S_IRQPRIO_EXTERNAL in kvmhv_run_single_vcpu() before guest entry will raise LPCR_MER if needed. There is also no need to inject the decrementer interrupt as this will be raised within the L2 if needed. Avoiding these injections reduces H_GUEST_GET_STATE hcalls by the L1. Suggested-by: Nicholas Piggin Signed-off-by: Jordan Niethe --- arch/powerpc/kvm/book3s.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c index 6cd20ab9e94e..8acec144120e 100644 --- a/arch/powerpc/kvm/book3s.c +++ b/arch/powerpc/kvm/book3s.c @@ -302,11 +302,11 @@ static int kvmppc_book3s_irqprio_deliver(struct kvm_vcpu *vcpu, switch (priority) { case BOOK3S_IRQPRIO_DECREMENTER: - deliver = (kvmppc_get_msr(vcpu) & MSR_EE) && !crit; + deliver = !kvmhv_is_nestedv2() && (kvmppc_get_msr(vcpu) & MSR_EE) && !crit; vec = BOOK3S_INTERRUPT_DECREMENTER; break; case BOOK3S_IRQPRIO_EXTERNAL: - deliver = (kvmppc_get_msr(vcpu) & MSR_EE) && !crit; + deliver = !kvmhv_is_nestedv2() && (kvmppc_get_msr(vcpu) & MSR_EE) && !crit; vec = BOOK3S_INTERRUPT_EXTERNAL; break; case BOOK3S_IRQPRIO_SYSTEM_RESET: -- 2.42.0
[PATCH 00/12] KVM: PPC: Nested APIv2 : Performance improvements
From: vaibhav This patch series introduces series of performance improvements to recently added support for Nested APIv2 PPC64 Guests via [1]. Details for Nested APIv2 for PPC64 Guests is available in Documentation/powerpc/kvm-nested.rst. This patch series introduces various optimizations for a Nested APIv2 guests namely: * Reduce the number times L1 hypervisor requests for L2 state from L0. * Register the L2 VPA with L1 * Optimizing interrupt delivery of some interrupt types. * Optimize emulation of mmio loads/stores for L2 in L1. The hcalls needed for testing these patches have been implemented in the spapr qemu model and is available at [2]. There are scripts available to assist in setting up an environment for testing nested guests at [3]. These patches are consequence of insights from on going performance engineering effort for improving performance of Nested APIv2 Guests. Special thanks goes to: * Gautam Menghani * Jordan Niethe * Nicholas Piggin * Vaidyanathan Srinivasan Refs: [1] https://lore.kernel.org/all/20230905034658.82835-1-jniet...@gmail.com [2] https://github.com/planetharsh/qemu/tree/upstream-0714-kop [3] https://github.com/iamjpn/kvm-powervm-test Jordan Niethe (11): KVM: PPC: Book3S HV nestedv2: Invalidate RPT before deleting a guest KVM: PPC: Book3S HV nestedv2: Avoid reloading the tb offset KVM: PPC: Book3S HV nestedv2: Do not check msr on hcalls KVM: PPC: Book3S HV nestedv2: Get the PID only if needed to copy tofrom a guest KVM: PPC: Book3S HV nestedv2: Ensure LPCR_MER bit is passed to the L0 KVM: PPC: Book3S HV nestedv2: Do not inject certain interrupts KVM: PPC: Book3S HV nestedv2: Avoid msr check in kvmppc_handle_exit_hv() KVM: PPC: Book3S HV nestedv2: Do not call H_COPY_TOFROM_GUEST KVM: PPC: Book3S HV nestedv2: Register the VPA with the L0 KVM: PPC: Reduce reliance on analyse_instr() in mmio emulation KVM: PPC: Book3S HV nestedv2: Do not cancel pending decrementer exception Nicholas Piggin (1): KVM: PPC: Book3S HV: Handle pending exceptions on guest entry with MSR_EE arch/powerpc/include/asm/kvm_book3s.h| 10 +++- arch/powerpc/include/asm/kvm_book3s_64.h | 1 + arch/powerpc/kvm/book3s.c| 4 +- arch/powerpc/kvm/book3s_64_mmu_radix.c | 7 ++- arch/powerpc/kvm/book3s_hv.c | 72 +--- arch/powerpc/kvm/book3s_hv_nested.c | 2 +- arch/powerpc/kvm/book3s_hv_nestedv2.c| 29 ++ arch/powerpc/kvm/emulate_loadstore.c | 21 --- 8 files changed, 107 insertions(+), 39 deletions(-) -- 2.42.0
[PATCH 06/12] KVM: PPC: Book3S HV: Handle pending exceptions on guest entry with MSR_EE
From: Nicholas Piggin Commit 026728dc5d41 ("KVM: PPC: Book3S HV P9: Inject pending xive interrupts at guest entry") changed guest entry so that if external interrupts are enabled, BOOK3S_IRQPRIO_EXTERNAL is not tested for. Test for this regardless of MSR_EE. For an L1 host, do not inject an interrupt, but always use LPCR_MER. If the L0 desires it can inject an interrupt. Fixes: 026728dc5d41 ("KVM: PPC: Book3S HV P9: Inject pending xive interrupts at guest entry") Signed-off-by: Nicholas Piggin [jpn: use kvmpcc_get_msr(), write commit message] Signed-off-by: Jordan Niethe --- arch/powerpc/kvm/book3s_hv.c | 18 -- 1 file changed, 12 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c index 6d1f0bca27aa..4dc6a928073f 100644 --- a/arch/powerpc/kvm/book3s_hv.c +++ b/arch/powerpc/kvm/book3s_hv.c @@ -4738,13 +4738,19 @@ int kvmhv_run_single_vcpu(struct kvm_vcpu *vcpu, u64 time_limit, if (!nested) { kvmppc_core_prepare_to_enter(vcpu); - if (__kvmppc_get_msr_hv(vcpu) & MSR_EE) { - if (xive_interrupt_pending(vcpu)) + if (test_bit(BOOK3S_IRQPRIO_EXTERNAL, +>arch.pending_exceptions) || + xive_interrupt_pending(vcpu)) { + /* +* For nested HV, don't synthesize but always pass MER, +* the L0 will be able to optimise that more +* effectively than manipulating registers directly. +*/ + if (!kvmhv_on_pseries() && (__kvmppc_get_msr_hv(vcpu) & MSR_EE)) kvmppc_inject_interrupt_hv(vcpu, - BOOK3S_INTERRUPT_EXTERNAL, 0); - } else if (test_bit(BOOK3S_IRQPRIO_EXTERNAL, ->arch.pending_exceptions)) { - lpcr |= LPCR_MER; + BOOK3S_INTERRUPT_EXTERNAL, 0); + else + lpcr |= LPCR_MER; } } else if (vcpu->arch.pending_exceptions || vcpu->arch.doorbell_request || -- 2.42.0
[PATCH 05/12] KVM: PPC: Book3S HV nestedv2: Ensure LPCR_MER bit is passed to the L0
From: Jordan Niethe LPCR_MER is conditionally set during entry to a guest if there is a pending external interrupt. In the nestedv2 case, this change is not being communicated to the L0, which means it is not being set in the L2. Ensure the updated LPCR value is passed to the L0. Signed-off-by: Jordan Niethe --- arch/powerpc/kvm/book3s_hv.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c index 069c336b6f3c..6d1f0bca27aa 100644 --- a/arch/powerpc/kvm/book3s_hv.c +++ b/arch/powerpc/kvm/book3s_hv.c @@ -4084,6 +4084,8 @@ static int kvmhv_vcpu_entry_nestedv2(struct kvm_vcpu *vcpu, u64 time_limit, if (rc < 0) return -EINVAL; + kvmppc_gse_put_u64(io->vcpu_run_input, KVMPPC_GSID_LPCR, lpcr); + accumulate_time(vcpu, >arch.in_guest); rc = plpar_guest_run_vcpu(0, vcpu->kvm->arch.lpid, vcpu->vcpu_id, , ); -- 2.42.0
[PATCH 01/12] KVM: PPC: Book3S HV nestedv2: Invalidate RPT before deleting a guest
From: Jordan Niethe An L0 must invalidate the L2's RPT during H_GUEST_DELETE if this has not already been done. This is a slow operation that means H_GUEST_DELETE must return H_BUSY multiple times before completing. Invalidating the tables before deleting the guest so there is less work for the L0 to do. Signed-off-by: Jordan Niethe --- arch/powerpc/include/asm/kvm_book3s.h | 1 + arch/powerpc/kvm/book3s_hv.c | 6 -- arch/powerpc/kvm/book3s_hv_nested.c | 2 +- 3 files changed, 6 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h index 4f527d09c92b..a37736ed3728 100644 --- a/arch/powerpc/include/asm/kvm_book3s.h +++ b/arch/powerpc/include/asm/kvm_book3s.h @@ -302,6 +302,7 @@ void kvmhv_nested_exit(void); void kvmhv_vm_nested_init(struct kvm *kvm); long kvmhv_set_partition_table(struct kvm_vcpu *vcpu); long kvmhv_copy_tofrom_guest_nested(struct kvm_vcpu *vcpu); +void kvmhv_flush_lpid(u64 lpid); void kvmhv_set_ptbl_entry(u64 lpid, u64 dw0, u64 dw1); void kvmhv_release_all_nested(struct kvm *kvm); long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu); diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c index 1ed6ec140701..5543e8490cd9 100644 --- a/arch/powerpc/kvm/book3s_hv.c +++ b/arch/powerpc/kvm/book3s_hv.c @@ -5691,10 +5691,12 @@ static void kvmppc_core_destroy_vm_hv(struct kvm *kvm) kvmhv_set_ptbl_entry(kvm->arch.lpid, 0, 0); } - if (kvmhv_is_nestedv2()) + if (kvmhv_is_nestedv2()) { + kvmhv_flush_lpid(kvm->arch.lpid); plpar_guest_delete(0, kvm->arch.lpid); - else + } else { kvmppc_free_lpid(kvm->arch.lpid); + } kvmppc_free_pimap(kvm); } diff --git a/arch/powerpc/kvm/book3s_hv_nested.c b/arch/powerpc/kvm/book3s_hv_nested.c index 3b658b8696bc..5c375ec1a3c6 100644 --- a/arch/powerpc/kvm/book3s_hv_nested.c +++ b/arch/powerpc/kvm/book3s_hv_nested.c @@ -503,7 +503,7 @@ void kvmhv_nested_exit(void) } } -static void kvmhv_flush_lpid(u64 lpid) +void kvmhv_flush_lpid(u64 lpid) { long rc; -- 2.42.0
[PATCH 02/12] KVM: PPC: Book3S HV nestedv2: Avoid reloading the tb offset
From: Jordan Niethe The kvmppc_get_tb_offset() getter reloads KVMPPC_GSID_TB_OFFSET from the L0 for nestedv2 host. This is unnecessary as the value does not change. KVMPPC_GSID_TB_OFFSET also need not be reloaded in kvmppc_{s,g}et_dec_expires(). Signed-off-by: Jordan Niethe --- arch/powerpc/include/asm/kvm_book3s.h | 9 ++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h index a37736ed3728..3e1e2a698c9e 100644 --- a/arch/powerpc/include/asm/kvm_book3s.h +++ b/arch/powerpc/include/asm/kvm_book3s.h @@ -594,13 +594,17 @@ static inline u##size kvmppc_get_##reg(struct kvm_vcpu *vcpu) \ KVMPPC_BOOK3S_VCORE_ACCESSOR(vtb, 64, KVMPPC_GSID_VTB) -KVMPPC_BOOK3S_VCORE_ACCESSOR(tb_offset, 64, KVMPPC_GSID_TB_OFFSET) KVMPPC_BOOK3S_VCORE_ACCESSOR_GET(arch_compat, 32, KVMPPC_GSID_LOGICAL_PVR) KVMPPC_BOOK3S_VCORE_ACCESSOR_GET(lpcr, 64, KVMPPC_GSID_LPCR) +KVMPPC_BOOK3S_VCORE_ACCESSOR_SET(tb_offset, 64, KVMPPC_GSID_TB_OFFSET) + +static inline u64 kvmppc_get_tb_offset(struct kvm_vcpu *vcpu) +{ + return vcpu->arch.vcore->tb_offset; +} static inline u64 kvmppc_get_dec_expires(struct kvm_vcpu *vcpu) { - WARN_ON(kvmhv_nestedv2_cached_reload(vcpu, KVMPPC_GSID_TB_OFFSET) < 0); WARN_ON(kvmhv_nestedv2_cached_reload(vcpu, KVMPPC_GSID_DEC_EXPIRY_TB) < 0); return vcpu->arch.dec_expires; } @@ -608,7 +612,6 @@ static inline u64 kvmppc_get_dec_expires(struct kvm_vcpu *vcpu) static inline void kvmppc_set_dec_expires(struct kvm_vcpu *vcpu, u64 val) { vcpu->arch.dec_expires = val; - WARN_ON(kvmhv_nestedv2_cached_reload(vcpu, KVMPPC_GSID_TB_OFFSET) < 0); kvmhv_nestedv2_mark_dirty(vcpu, KVMPPC_GSID_DEC_EXPIRY_TB); } -- 2.42.0
Re: [PATCH] powerpc: Enable generic cpu idle-loop
Thanks for looking at this patch Mpe. Michael Ellerman writes: > Vaibhav Jain writes: >> This minor patch enables config option GENERIC_IDLE_POLL_SETUP for arch >> powerpc. This should add support for kernel param 'nohlt'. > > Which is named after an 8086 instruction :), but oh well. > Thanks. This is an interesting trivia :) >> Powerpc kernel also supports another kernel boot-time param called >> 'powersave' which can also be used to disable all cpu idle-states and >> forces CPU to an idle-loop similar to what cpu_idle_poll() does. This >> patch however makes powerpc kernel-parameters better aligned to the >> generic boot-time parameters. > > It would be nice if we could make our powersave=off parameter just > enable this generic logic. > > Have you looked at if that's possible? At a glance it looks like it > should be, when cpu_idle_force_poll is true do_idle() never calls > cpuidle_idle_call(), so the cpuidle drivers are never invoked. Thanks for suggesting this and it looks like a simple fix to arch/idle.c code. I have posted a v2 with the needed change on the mailing list at https://lore.kernel.org/all/20230821045928.1350893-1-vaib...@linux.ibm.com -- Cheers ~ Vaibhav
[PATCH v2] powerpc: Enable generic cpu idle-loop
This minor patch enables config option GENERIC_IDLE_POLL_SETUP for arch powerpc. This should add support for kernel param 'nohlt'. Powerpc kernel also supports another kernel boot-time param called 'powersave' which can also be used to disable all cpu idle-states and forces CPU to an idle-loop similar to what cpu_idle_poll() does. Hence this patch also updates the handling of 'powersave=off' kernel param to enable generic cpu idle-loop if its enabled. Signed-off-by: Vaibhav Jain --- Changelog: Since v1: https://lore.kernel.org/all/20230818050739.827851-1-vaib...@linux.ibm.com * Updated powersave_off() to enable generic cpu idle-loop if 'powersave=off' kernel arg is given. [Mpe] * Update patch description --- Documentation/admin-guide/kernel-parameters.txt | 2 +- arch/powerpc/Kconfig| 1 + arch/powerpc/kernel/idle.c | 3 +++ 3 files changed, 5 insertions(+), 1 deletion(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 722b6eca2e93..6b657ebafcfb 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -3738,7 +3738,7 @@ nohibernate [HIBERNATION] Disable hibernation and resume. - nohlt [ARM,ARM64,MICROBLAZE,MIPS,SH] Forces the kernel to + nohlt [ARM,ARM64,MICROBLAZE,MIPS,PPC,SH] Forces the kernel to busy wait in do_idle() and not use the arch_cpu_idle() implementation; requires CONFIG_GENERIC_IDLE_POLL_SETUP to be effective. This is useful on platforms where the diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 0b1172cbeccb..574661403800 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -193,6 +193,7 @@ config PPC select GENERIC_CPU_VULNERABILITIES if PPC_BARRIER_NOSPEC select GENERIC_EARLY_IOREMAP select GENERIC_GETTIMEOFDAY + select GENERIC_IDLE_POLL_SETUP select GENERIC_IRQ_SHOW select GENERIC_IRQ_SHOW_LEVEL select GENERIC_PCI_IOMAPif PCI diff --git a/arch/powerpc/kernel/idle.c b/arch/powerpc/kernel/idle.c index b1c0418b25c8..7367a0698459 100644 --- a/arch/powerpc/kernel/idle.c +++ b/arch/powerpc/kernel/idle.c @@ -21,6 +21,7 @@ #include #include #include +#include #include #include @@ -35,6 +36,8 @@ EXPORT_SYMBOL(cpuidle_disable); static int __init powersave_off(char *arg) { + /* Use generic idle loop if thats available */ + cpu_idle_poll_ctrl(true); ppc_md.power_save = NULL; cpuidle_disable = IDLE_POWERSAVE_OFF; return 1; -- 2.41.0
[PATCH] powerpc: Enable generic cpu idle-loop
This minor patch enables config option GENERIC_IDLE_POLL_SETUP for arch powerpc. This should add support for kernel param 'nohlt'. Powerpc kernel also supports another kernel boot-time param called 'powersave' which can also be used to disable all cpu idle-states and forces CPU to an idle-loop similar to what cpu_idle_poll() does. This patch however makes powerpc kernel-parameters better aligned to the generic boot-time parameters. Signed-off-by: Vaibhav Jain --- Documentation/admin-guide/kernel-parameters.txt | 2 +- arch/powerpc/Kconfig| 1 + 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 722b6eca2e93..6b657ebafcfb 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -3738,7 +3738,7 @@ nohibernate [HIBERNATION] Disable hibernation and resume. - nohlt [ARM,ARM64,MICROBLAZE,MIPS,SH] Forces the kernel to + nohlt [ARM,ARM64,MICROBLAZE,MIPS,PPC,SH] Forces the kernel to busy wait in do_idle() and not use the arch_cpu_idle() implementation; requires CONFIG_GENERIC_IDLE_POLL_SETUP to be effective. This is useful on platforms where the diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 0b1172cbeccb..574661403800 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -193,6 +193,7 @@ config PPC select GENERIC_CPU_VULNERABILITIES if PPC_BARRIER_NOSPEC select GENERIC_EARLY_IOREMAP select GENERIC_GETTIMEOFDAY + select GENERIC_IDLE_POLL_SETUP select GENERIC_IRQ_SHOW select GENERIC_IRQ_SHOW_LEVEL select GENERIC_PCI_IOMAPif PCI -- 2.41.0
Re: [PATCH v3] powerpc/papr_scm: Fix nvdimm event mappings
Kajol Jain writes: > Commit 4c08d4bbc089 ("powerpc/papr_scm: Add perf interface support") > added performance monitoring support for papr-scm nvdimm devices via > perf interface. Commit also added an array in papr_scm_priv > structure called "nvdimm_events_map", which got filled based on the > result of H_SCM_PERFORMANCE_STATS hcall. > > Currently there is an assumption that the order of events in the > stats buffer, returned by the hypervisor is same. And order also > happens to matches with the events specified in nvdimm driver code. > But this assumption is not documented in Power Architecture > Platform Requirements (PAPR) document. Although the order > of events happens to be same on current generation od system, but > it might not be true in future generation systems. Fix the issue, by > adding a static mapping for nvdimm events to corresponding stat-id, > and removing the dynamic map from papr_scm_priv structure. Also > remove the function papr_scm_pmu_check_events from papr_scm.c file, > as we no longer need to copy stat-ids dynamically. > > Fixes: 4c08d4bbc089 ("powerpc/papr_scm: Add perf interface support") > Reported-by: Aneesh Kumar K.V > Signed-off-by: Kajol Jain > --- > arch/powerpc/platforms/pseries/papr_scm.c | 88 +++ > 1 file changed, 27 insertions(+), 61 deletions(-) > > --- > Changelog: > v2 -> v3 > - Remove function papr_scm_pmu_check_events() and replace the > event checks in papr_scm_pmu_register() function with p->stat_buffer_len > as suggested by Vaibhav Jain > Link to the patch v2: > https://lore.kernel.org/all/20220711034605.212683-1-kj...@linux.ibm.com/ V3 patch looks good to me. Reviewed-by: Vaibhav Jain -- Cheers ~ Vaibhav
Re: [PATCH v2] powerpc/papr_scm: Fix nvdimm event mappings
Hi Kajol, Thanks for the patch. Minor review comment below: Kajol Jain writes: > Commit 4c08d4bbc089 ("powerpc/papr_scm: Add perf interface support") > added performance monitoring support for papr-scm nvdimm devices via > perf interface. Commit also added an array in papr_scm_priv > structure called "nvdimm_events_map", which got filled based on the > result of H_SCM_PERFORMANCE_STATS hcall. > > Currently there is an assumption that the order of events in the > stats buffer, returned by the hypervisor is same. And that order also > matches with the events specified in nvdimm driver code. > But this assumption is not documented anywhere in Power Architecture > Platform Requirements (PAPR) document. Although the order > of events happens to be same on current systems, but it might > not be true in future generation systems. Fix the issue, by > adding a static mapping for nvdimm events to corresponding stat-id, > and removing the dynamic map from papr_scm_priv structure. > > Fixes: 4c08d4bbc089 ("powerpc/papr_scm: Add perf interface support") > Reported-by: Aneesh Kumar K.V > Signed-off-by: Kajol Jain > @@ -460,10 +480,9 @@ static void papr_scm_pmu_del(struct perf_event *event, > int flags) > > static int papr_scm_pmu_check_events(struct papr_scm_priv *p, struct > nvdimm_pmu *nd_pmu) > { > - struct papr_scm_perf_stat *stat; > struct papr_scm_perf_stats *stats; > u32 available_events; > - int index, rc = 0; > + int rc; > > if (!p->stat_buffer_len) > return -ENOENT; > @@ -476,34 +495,12 @@ static int papr_scm_pmu_check_events(struct > papr_scm_priv *p, struct nvdimm_pmu > /* Allocate the buffer for phyp where stats are written */ > stats = kzalloc(p->stat_buffer_len, GFP_KERNEL); > if (!stats) { > - rc = -ENOMEM; > - return rc; > + return -ENOMEM; > } > > /* Called to get list of events supported */ > rc = drc_pmem_query_stats(p, stats, 0); > - if (rc) > - goto out; > - > - /* > - * Allocate memory and populate nvdimm_event_map. > - * Allocate an extra element for NULL entry > - */ > - p->nvdimm_events_map = kcalloc(available_events + 1, > -sizeof(stat->stat_id), > -GFP_KERNEL); > - if (!p->nvdimm_events_map) { > - rc = -ENOMEM; > - goto out; > - } > > - /* Copy all stat_ids to event map */ > - for (index = 0, stat = stats->scm_statistic; > - index < available_events; index++, ++stat) { > - memcpy(>nvdimm_events_map[index * sizeof(stat->stat_id)], > ->stat_id, sizeof(stat->stat_id)); > - } > -out: > kfree(stats); > return rc; > } Earlier implementation of papr_scm_pmu_check_events() would copy the contents of returned stat-ids to struct papr_scm_priv->nvdimm_events_map, hence it was needed. With static events map you dont really need to call drc_pmem_query_stats() as that would have been already being done once in papr_scm_probe() before papr_scm_pmu_register() is called: papr_scm_probe() { ... /* Try retrieving the stat buffer and see if its supported */ stat_size = drc_pmem_query_stats(p, NULL, 0); ... papr_scm_pmu_register(p); ... } I would suggest replacing single callsite of papr_scm_pmu_check_events() with the check if (!p->stat_buffer_len) goto pmu_check_events_err; -- Cheers ~ Vaibhav
Re: [PATCH v2] of: check previous kernel's ima-kexec-buffer against memory bounds
Thanks for looking into this patch Rob, I have addressed your review comment in v3 of the patch posted at https://lore.kernel.org/all/20220531041446.3334259-1-vaib...@linux.ibm.com/ -- Cheers ~ Vaibhav
[PATCH v3] of: check previous kernel's ima-kexec-buffer against memory bounds
Presently ima_get_kexec_buffer() doesn't check if the previous kernel's ima-kexec-buffer lies outside the addressable memory range. This can result in a kernel panic if the new kernel is booted with 'mem=X' arg and the ima-kexec-buffer was allocated beyond that range by the previous kernel. The panic is usually of the form below: $ sudo kexec --initrd initrd vmlinux --append='mem=16G' BUG: Unable to handle kernel data access on read at 0xc000c01fff7f Faulting instruction address: 0xc0837974 Oops: Kernel access of bad area, sig: 11 [#1] NIP [c0837974] ima_restore_measurement_list+0x94/0x6c0 LR [c083b55c] ima_load_kexec_buffer+0xac/0x160 Call Trace: [c371fa80] [c083b55c] ima_load_kexec_buffer+0xac/0x160 [c371fb00] [c20512c4] ima_init+0x80/0x108 [c371fb70] [c20514dc] init_ima+0x4c/0x120 [c371fbf0] [c0012240] do_one_initcall+0x60/0x2c0 [c371fcc0] [c2004ad0] kernel_init_freeable+0x344/0x3ec [c371fda0] [c00128a4] kernel_init+0x34/0x1b0 [c371fe10] [c000ce64] ret_from_kernel_thread+0x5c/0x64 Instruction dump: f92100b8 f92100c0 90e10090 910100a0 4182050c 282a0017 3bc0 40810330 7c0802a6 fb610198 7c9b2378 f80101d0 2c090001 40820614 e9240010 ---[ end trace ]--- Fix this issue by checking returned PFN range of previous kernel's ima-kexec-buffer with page_is_ram() to ensure correct memory bounds. Fixes: 467d27824920 ("powerpc: ima: get the kexec buffer passed by the previous kernel") Cc: Frank Rowand Cc: Prakhar Srivastava Cc: Lakshmi Ramasubramanian Cc: Thiago Jung Bauermann Cc: Rob Herring Cc: Ritesh Harjani Cc: Robin Murphy Signed-off-by: Vaibhav Jain --- Changelog == v3: * change the type for {start,end}_pfn to unsigned long [ Ritesh ] * Switched to page_is_ram() from pfn_vaild() [ Rob ] v2: * Instead of using memblock to determine the valid bounds use pfn_valid() to do so since memblock may not be available late after the kernel init. [ Mpe ] * Changed the patch prefix from 'powerpc' to 'of' [ Mpe ] * Updated the 'Fixes' tag to point to correct commit that introduced this function. [ Rob ] * Fixed some whitespace/tab issues in the patch description [ Rob ] * Added another check for checking ig 'tmp_size' for ima-kexec-buffer is > 0 --- drivers/of/kexec.c | 17 + 1 file changed, 17 insertions(+) diff --git a/drivers/of/kexec.c b/drivers/of/kexec.c index 8d374cc552be..91b04b04eec4 100644 --- a/drivers/of/kexec.c +++ b/drivers/of/kexec.c @@ -126,6 +126,7 @@ int ima_get_kexec_buffer(void **addr, size_t *size) { int ret, len; unsigned long tmp_addr; + unsigned long start_pfn, end_pfn; size_t tmp_size; const void *prop; @@ -140,6 +141,22 @@ int ima_get_kexec_buffer(void **addr, size_t *size) if (ret) return ret; + /* Do some sanity on the returned size for the ima-kexec buffer */ + if (!tmp_size) + return -ENOENT; + + /* +* Calculate the PFNs for the buffer and ensure +* they are with in addressable memory. +*/ + start_pfn = PHYS_PFN(tmp_addr); + end_pfn = PHYS_PFN(tmp_addr + tmp_size - 1); + if (!page_is_ram(start_pfn) || !page_is_ram(end_pfn)) { + pr_warn("IMA buffer at 0x%lx, size = 0x%zx beyond memory\n", + tmp_addr, tmp_size); + return -EINVAL; + } + *addr = __va(tmp_addr); *size = tmp_size; -- 2.35.1
Re: [PATCH v2] of: check previous kernel's ima-kexec-buffer against memory bounds
Hi Ritesh, thanks for looking into this patch, Ritesh Harjani writes: > Just a minor nit which I noticed. > > On 22/05/24 11:20AM, Vaibhav Jain wrote: >> Presently ima_get_kexec_buffer() doesn't check if the previous kernel's >> ima-kexec-buffer lies outside the addressable memory range. This can result >> in a kernel panic if the new kernel is booted with 'mem=X' arg and the >> ima-kexec-buffer was allocated beyond that range by the previous kernel. >> The panic is usually of the form below: >> >> $ sudo kexec --initrd initrd vmlinux --append='mem=16G' >> >> >> BUG: Unable to handle kernel data access on read at 0xc000c01fff7f >> Faulting instruction address: 0xc0837974 >> Oops: Kernel access of bad area, sig: 11 [#1] >> >> NIP [c0837974] ima_restore_measurement_list+0x94/0x6c0 >> LR [c083b55c] ima_load_kexec_buffer+0xac/0x160 >> Call Trace: >> [c371fa80] [c083b55c] ima_load_kexec_buffer+0xac/0x160 >> [c371fb00] [c20512c4] ima_init+0x80/0x108 >> [c371fb70] [c20514dc] init_ima+0x4c/0x120 >> [c371fbf0] [c0012240] do_one_initcall+0x60/0x2c0 >> [c371fcc0] [c2004ad0] kernel_init_freeable+0x344/0x3ec >> [c371fda0] [c00128a4] kernel_init+0x34/0x1b0 >> [c371fe10] [c000ce64] ret_from_kernel_thread+0x5c/0x64 >> Instruction dump: >> f92100b8 f92100c0 90e10090 910100a0 4182050c 282a0017 3bc0 40810330 >> 7c0802a6 fb610198 7c9b2378 f80101d0 2c090001 40820614 e9240010 >> ---[ end trace ]--- >> >> Fix this issue by checking returned PFN range of previous kernel's >> ima-kexec-buffer with pfn_valid to ensure correct memory bounds. >> >> Fixes: 467d27824920 ("powerpc: ima: get the kexec buffer passed by the >> previous kernel") >> Cc: Frank Rowand >> Cc: Prakhar Srivastava >> Cc: Lakshmi Ramasubramanian >> Cc: Thiago Jung Bauermann >> Cc: Rob Herring >> Signed-off-by: Vaibhav Jain >> >> --- >> Changelog >> == >> >> v2: >> * Instead of using memblock to determine the valid bounds use pfn_valid() to >> do >> so since memblock may not be available late after the kernel init. [ Mpe ] >> * Changed the patch prefix from 'powerpc' to 'of' [ Mpe ] >> * Updated the 'Fixes' tag to point to correct commit that introduced this >> function. [ Rob ] >> * Fixed some whitespace/tab issues in the patch description [ Rob ] >> * Added another check for checking ig 'tmp_size' for ima-kexec-buffer is > 0 >> --- >> drivers/of/kexec.c | 17 + >> 1 file changed, 17 insertions(+) >> >> diff --git a/drivers/of/kexec.c b/drivers/of/kexec.c >> index 8d374cc552be..879e984fe901 100644 >> --- a/drivers/of/kexec.c >> +++ b/drivers/of/kexec.c >> @@ -126,6 +126,7 @@ int ima_get_kexec_buffer(void **addr, size_t *size) >> { >> int ret, len; >> unsigned long tmp_addr; >> +unsigned int start_pfn, end_pfn; > > ^^^ Shouldn't this be unsigned long? Thanks for catching this. Yes that should be 'unsigned long'. Will resend the patch with this fixed. > > -ritesh > >> size_t tmp_size; >> const void *prop; >> >> @@ -140,6 +141,22 @@ int ima_get_kexec_buffer(void **addr, size_t *size) >> if (ret) >> return ret; >> >> +/* Do some sanity on the returned size for the ima-kexec buffer */ >> +if (!tmp_size) >> +return -ENOENT; >> + >> +/* >> + * Calculate the PFNs for the buffer and ensure >> + * they are with in addressable memory. >> + */ >> +start_pfn = PHYS_PFN(tmp_addr); >> +end_pfn = PHYS_PFN(tmp_addr + tmp_size - 1); >> +if (!pfn_valid(start_pfn) || !pfn_valid(end_pfn)) { >> +pr_warn("IMA buffer at 0x%lx, size = 0x%zx beyond memory\n", >> +tmp_addr, tmp_size); >> +return -EINVAL; >> +} >> + >> *addr = __va(tmp_addr); >> *size = tmp_size; >> >> -- >> 2.35.1 >> -- Cheers ~ Vaibhav
Re: [powerpc] linux-next 20220520 boot failure (drc_pmem_query_stats)
Thanks for reporting this Sachin, I have posted a fix for this at https://lore.kernel.org/nvdimm/20220524112353.1718454-1-vaib...@linux.ibm.com Sachin Sant writes: > While booting linux-next (5.18.0-rc7-next-20220520) on a Power10 LPAR > configure with pmem following oops is seen. The LPAR fails to boot to > login prompt. > > [ 10.948211] papr_scm ibm,persistent-memory:ibm,pmemory@44104001: > Permission denied while accessing performance stats > [ 10.948536] Kernel attempted to write user page (1c) - exploit attempt? > (uid: 0) > [ 10.948539] BUG: Kernel NULL pointer dereference on write at 0x001c > [ 10.948540] Faulting instruction address: 0xc00801b90844 > [ 10.948542] Oops: Kernel access of bad area, sig: 11 [#1] > [ 10.948563] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries > [ 10.948568] Modules linked in: papr_scm(E+) libnvdimm(E) vmx_crypto(E) > ext4(E) mbcache(E) jbd2(E) sd_mod(E) t10_pi(E) crc64_rocksoft(E) crc64(E) > sg(E) ibmvscsi(E) scsi_transport_srp(E) ibmveth(E) fuse(E) > [ 10.948587] CPU: 25 PID: 796 Comm: systemd-udevd Tainted: GE > 5.18.0-rc7-next-20220520 #2 > [ 10.948592] NIP: c00801b90844 LR: c00801b92794 CTR: > c00801b907f8 > [ 10.948595] REGS: c0003082b110 TRAP: 0300 Tainted: GE > (5.18.0-rc7-next-20220520) > [ 10.948600] MSR: 8280b033 CR: > 44222822 XER: 0001 > [ 10.948613] CFAR: c007c744 DAR: 001c DSISR: 4200 > IRQMASK: 0 > [ 10.948613] GPR00: c00801b92794 c0003082b3b0 c00801bc8000 > c941bc00 > [ 10.948613] GPR04: 0010 c00016001800 > c0003082b420 > [ 10.948613] GPR08: 001c 0100 53544154 > c00801b92c98 > [ 10.948613] GPR12: c00801b907f8 c00abfd02b00 c0003082bd00 > 0001372bd8b0 > [ 10.948613] GPR16: ff20 c008008911b8 c0080089 > 11d0 > [ 10.948613] GPR20: 0001 c0003082bbc0 c00801bc0a88 > > [ 10.948613] GPR24: c2950e30 > 0010 > [ 10.948613] GPR28: c941bc00 0010 0020 > c941bc00 > [ 10.948660] NIP [c00801b90844] drc_pmem_query_stats+0x5c/0x270 > [papr_scm] > [ 10.948667] LR [c00801b92794] papr_scm_probe+0x2ac/0x6ec [papr_scm] > [ 10.948673] Call Trace: > [ 10.948675] [c0003082b3b0] [c941bca0] 0xc941bca0 > (unreliable) > [ 10.948680] [c0003082b460] [c00801b92794] > papr_scm_probe+0x2ac/0x6ec [papr_scm] > [ 10.948687] [c0003082b550] [c09809b8] platform_probe+0x98/0x150 > [ 10.948694] [c0003082b5d0] [c097bf2c] really_probe+0xfc/0x510 > [ 10.948699] [c0003082b650] [c097c4bc] > __driver_probe_device+0x17c/0x230 > [ 10.948704] [c0003082b6d0] [c097c5c8] > driver_probe_device+0x58/0x120 > [ 10.948709] [c0003082b710] [c097ce0c] > __driver_attach+0xfc/0x230 > [ 10.948714] [c0003082b790] [c0978458] > bus_for_each_dev+0xa8/0x130 > [ 10.948718] [c0003082b7f0] [c097b2c4] driver_attach+0x34/0x50 > [ 10.948722] [c0003082b810] [c097a508] > bus_add_driver+0x1e8/0x350 > [ 10.948729] [c0003082b8a0] [c097def8] > driver_register+0x98/0x1a0 > [ 10.948736] [c0003082b910] [c09804a8] > __platform_driver_register+0x38/0x50 > [ 10.948741] [c0003082b930] [c00801b92c10] papr_scm_init+0x3c/0x78 > [papr_scm] > [ 10.948747] [c0003082b960] [c0011ff0] > do_one_initcall+0x60/0x2d0 > [ 10.948753] [c0003082ba30] [c023627c] do_init_module+0x6c/0x2d0 > [ 10.948760] [c0003082bab0] [c0239650] load_module+0x1e90/0x2290 > [ 10.948765] [c0003082bc90] [c0239d9c] > __do_sys_finit_module+0xdc/0x180 > [ 10.948771] [c0003082bdb0] [c00335fc] > system_call_exception+0x17c/0x350 > [ 10.948777] [c0003082be10] [c000c53c] > system_call_common+0xec/0x270 > [ 10.948782] --- interrupt: c00 at 0x7fffa3f2f1d4 > [ 10.948785] NIP: 7fffa3f2f1d4 LR: 7fffa456ea9c CTR: > > [ 10.948789] REGS: c0003082be80 TRAP: 0c00 Tainted: GE > (5.18.0-rc7-next-20220520) > [ 10.948793] MSR: 8280f033 > CR: 2804 XER: > [ 10.948805] IRQMASK: 0 > [ 10.948805] GPR00: 0161 7fffd70550b0 7fffa4007300 > 0011 > [ 10.948805] GPR04: 7fffa457ad30 0011 > > [ 10.948805] GPR08: > > [ 10.948805] GPR12: 7fffa4656580 0002 > 0001372bd8b0 > [ 10.948805] GPR16: 000137300108 0001372c5c68 > > [ 10.948805] GPR20:
[PATCH] powerpc/papr_scm: don't requests stats with '0' sized stats buffer
Sachin reported [1] that on a POWER-10 lpar he is seeing a kernel panic being reported with vPMEM when papr_scm probe is being called. The panic is of the form below and is observed only with following option disabled(profile) for the said LPAR 'Enable Performance Information Collection' in the HMC: Kernel attempted to write user page (1c) - exploit attempt? (uid: 0) BUG: Kernel NULL pointer dereference on write at 0x001c Faulting instruction address: 0xc00801b90844 Oops: Kernel access of bad area, sig: 11 [#1] NIP [c00801b90844] drc_pmem_query_stats+0x5c/0x270 [papr_scm] LR [c00801b92794] papr_scm_probe+0x2ac/0x6ec [papr_scm] Call Trace: 0xc941bca0 (unreliable) papr_scm_probe+0x2ac/0x6ec [papr_scm] platform_probe+0x98/0x150 really_probe+0xfc/0x510 __driver_probe_device+0x17c/0x230 ---[ end trace ]--- Kernel panic - not syncing: Fatal exception On investigation looks like this panic was caused due to a 'stat_buffer' of size==0 being provided to drc_pmem_query_stats() to fetch all performance stats-ids of an NVDIMM. However drc_pmem_query_stats() shouldn't have been called since the vPMEM NVDIMM doesn't support and performance stat-id's. This was caused due to missing check for 'p->stat_buffer_len' at the beginning of papr_scm_pmu_check_events() which indicates that the NVDIMM doesn't support performance-stats. Fix this by introducing the check for 'p->stat_buffer_len' at the beginning of papr_scm_pmu_check_events(). [1] https://lore.kernel.org/all/6b3a522a-6a5f-4cc9-b268-0c63aa6e0...@linux.ibm.com Fixes: 0e0946e22f3665d2732 ("powerpc/papr_scm: Fix leaking nvdimm_events_map elements") Reported-by: Sachin Sant Signed-off-by: Vaibhav Jain --- arch/powerpc/platforms/pseries/papr_scm.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c index 181b855b3050..82cae08976bc 100644 --- a/arch/powerpc/platforms/pseries/papr_scm.c +++ b/arch/powerpc/platforms/pseries/papr_scm.c @@ -465,6 +465,9 @@ static int papr_scm_pmu_check_events(struct papr_scm_priv *p, struct nvdimm_pmu u32 available_events; int index, rc = 0; + if (!p->stat_buffer_len) + return -ENOENT; + available_events = (p->stat_buffer_len - sizeof(struct papr_scm_perf_stats)) / sizeof(struct papr_scm_perf_stat); if (available_events == 0) -- 2.35.1
Re: [PATCH] powerpc: check previous kernel's ima-kexec-buffer against memory bounds
Michael Ellerman writes: > > Rob's point is that commit fee3ff99bc67 only moved existing code, the > bug already existed. > > The function was introduced in: > > 467d27824920 ("powerpc: ima: get the kexec buffer passed by the previous > kernel") > > So that's where the Fixes tag should point. Thanks Mpe for explaining and apologies Rob as I initially didnt get the point you made. I wrongly assumed that since the patch can only be applied on top of commit fee3ff99bc67 hence fixes should point to fee3ff99bc67. I have sent out a v2 with the updated Fixes tag and also changes as suggested by Mpe and Rob at https://lore.kernel.org/all/20220524055042.1527968-1-vaib...@linux.ibm.com/ -- Cheers ~ Vaibhav
[PATCH v2] of: check previous kernel's ima-kexec-buffer against memory bounds
Presently ima_get_kexec_buffer() doesn't check if the previous kernel's ima-kexec-buffer lies outside the addressable memory range. This can result in a kernel panic if the new kernel is booted with 'mem=X' arg and the ima-kexec-buffer was allocated beyond that range by the previous kernel. The panic is usually of the form below: $ sudo kexec --initrd initrd vmlinux --append='mem=16G' BUG: Unable to handle kernel data access on read at 0xc000c01fff7f Faulting instruction address: 0xc0837974 Oops: Kernel access of bad area, sig: 11 [#1] NIP [c0837974] ima_restore_measurement_list+0x94/0x6c0 LR [c083b55c] ima_load_kexec_buffer+0xac/0x160 Call Trace: [c371fa80] [c083b55c] ima_load_kexec_buffer+0xac/0x160 [c371fb00] [c20512c4] ima_init+0x80/0x108 [c371fb70] [c20514dc] init_ima+0x4c/0x120 [c371fbf0] [c0012240] do_one_initcall+0x60/0x2c0 [c371fcc0] [c2004ad0] kernel_init_freeable+0x344/0x3ec [c371fda0] [c00128a4] kernel_init+0x34/0x1b0 [c371fe10] [c000ce64] ret_from_kernel_thread+0x5c/0x64 Instruction dump: f92100b8 f92100c0 90e10090 910100a0 4182050c 282a0017 3bc0 40810330 7c0802a6 fb610198 7c9b2378 f80101d0 2c090001 40820614 e9240010 ---[ end trace ]--- Fix this issue by checking returned PFN range of previous kernel's ima-kexec-buffer with pfn_valid to ensure correct memory bounds. Fixes: 467d27824920 ("powerpc: ima: get the kexec buffer passed by the previous kernel") Cc: Frank Rowand Cc: Prakhar Srivastava Cc: Lakshmi Ramasubramanian Cc: Thiago Jung Bauermann Cc: Rob Herring Signed-off-by: Vaibhav Jain --- Changelog == v2: * Instead of using memblock to determine the valid bounds use pfn_valid() to do so since memblock may not be available late after the kernel init. [ Mpe ] * Changed the patch prefix from 'powerpc' to 'of' [ Mpe ] * Updated the 'Fixes' tag to point to correct commit that introduced this function. [ Rob ] * Fixed some whitespace/tab issues in the patch description [ Rob ] * Added another check for checking ig 'tmp_size' for ima-kexec-buffer is > 0 --- drivers/of/kexec.c | 17 + 1 file changed, 17 insertions(+) diff --git a/drivers/of/kexec.c b/drivers/of/kexec.c index 8d374cc552be..879e984fe901 100644 --- a/drivers/of/kexec.c +++ b/drivers/of/kexec.c @@ -126,6 +126,7 @@ int ima_get_kexec_buffer(void **addr, size_t *size) { int ret, len; unsigned long tmp_addr; + unsigned int start_pfn, end_pfn; size_t tmp_size; const void *prop; @@ -140,6 +141,22 @@ int ima_get_kexec_buffer(void **addr, size_t *size) if (ret) return ret; + /* Do some sanity on the returned size for the ima-kexec buffer */ + if (!tmp_size) + return -ENOENT; + + /* +* Calculate the PFNs for the buffer and ensure +* they are with in addressable memory. +*/ + start_pfn = PHYS_PFN(tmp_addr); + end_pfn = PHYS_PFN(tmp_addr + tmp_size - 1); + if (!pfn_valid(start_pfn) || !pfn_valid(end_pfn)) { + pr_warn("IMA buffer at 0x%lx, size = 0x%zx beyond memory\n", + tmp_addr, tmp_size); + return -EINVAL; + } + *addr = __va(tmp_addr); *size = tmp_size; -- 2.35.1
Re: [PATCH] powerpc: check previous kernel's ima-kexec-buffer against memory bounds
Rob Herring writes: > On Thu, May 19, 2022 at 01:35:47AM +0530, Vaibhav Jain wrote: >> Presently ima_get_kexec_buffer() doesn't check if the previous kernel's >> ima-kexec-buffer lies outside the addressable memory range. This can result >> in a kernel panic if the new kernel is booted with 'mem=X' arg and the >> ima-kexec-buffer was allocated beyond that range by the previous kernel. >> The panic is usually of the form below: >> >> $ sudo kexec --initrd initrd vmlinux --append='mem=16G' >> >> >> BUG: Unable to handle kernel data access on read at 0xc000c01fff7f >> Faulting instruction address: 0xc0837974 >> Oops: Kernel access of bad area, sig: 11 [#1] >> >> NIP [c0837974] ima_restore_measurement_list+0x94/0x6c0 >> LR [c083b55c] ima_load_kexec_buffer+0xac/0x160 >> Call Trace: >> [c371fa80] [c083b55c] ima_load_kexec_buffer+0xac/0x160 >> [c371fb00] [c20512c4] ima_init+0x80/0x108 >> [c371fb70] [c20514dc] init_ima+0x4c/0x120 >> [c371fbf0] [c0012240] do_one_initcall+0x60/0x2c0 >> [c371fcc0] [c2004ad0] kernel_init_freeable+0x344/0x3ec >> [c371fda0] [c00128a4] kernel_init+0x34/0x1b0 >> [c371fe10] [c000ce64] ret_from_kernel_thread+0x5c/0x64 >> Instruction dump: >> f92100b8 f92100c0 90e10090 910100a0 4182050c 282a0017 3bc0 40810330 >> 7c0802a6 fb610198 7c9b2378 f80101d0 2c090001 40820614 e9240010 >> ---[ end trace ]--- >> >> Fix this issue by checking returned address/size of previous kernel's >> ima-kexec-buffer against memblock's memory bounds. >> >> Fixes: fee3ff99bc67("powerpc: Move arch independent ima kexec functions to >> drivers/of/kexec.c") > > Your mailer (and/or you) corrupted this. There should be a space between > the commit hash and subject, and it should not be wrapped. I probably messed it up. Will resend a fixed patch. > > It should also not have a blank line here. Sure. Will get this fixed > > But more importantly, how did this commit introduce the problem? It just > moved the code and didn't have any such check. Yes, the code didnt have the necessary check to see if the address for previous kernel IMA buffer is beyond the currently addressable memory. I have described the problem in patch description. -- Cheers ~ Vaibhav
[PATCH] powerpc: check previous kernel's ima-kexec-buffer against memory bounds
Presently ima_get_kexec_buffer() doesn't check if the previous kernel's ima-kexec-buffer lies outside the addressable memory range. This can result in a kernel panic if the new kernel is booted with 'mem=X' arg and the ima-kexec-buffer was allocated beyond that range by the previous kernel. The panic is usually of the form below: $ sudo kexec --initrd initrd vmlinux --append='mem=16G' BUG: Unable to handle kernel data access on read at 0xc000c01fff7f Faulting instruction address: 0xc0837974 Oops: Kernel access of bad area, sig: 11 [#1] NIP [c0837974] ima_restore_measurement_list+0x94/0x6c0 LR [c083b55c] ima_load_kexec_buffer+0xac/0x160 Call Trace: [c371fa80] [c083b55c] ima_load_kexec_buffer+0xac/0x160 [c371fb00] [c20512c4] ima_init+0x80/0x108 [c371fb70] [c20514dc] init_ima+0x4c/0x120 [c371fbf0] [c0012240] do_one_initcall+0x60/0x2c0 [c371fcc0] [c2004ad0] kernel_init_freeable+0x344/0x3ec [c371fda0] [c00128a4] kernel_init+0x34/0x1b0 [c371fe10] [c000ce64] ret_from_kernel_thread+0x5c/0x64 Instruction dump: f92100b8 f92100c0 90e10090 910100a0 4182050c 282a0017 3bc0 40810330 7c0802a6 fb610198 7c9b2378 f80101d0 2c090001 40820614 e9240010 ---[ end trace ]--- Fix this issue by checking returned address/size of previous kernel's ima-kexec-buffer against memblock's memory bounds. Fixes: fee3ff99bc67("powerpc: Move arch independent ima kexec functions to drivers/of/kexec.c") Cc: Frank Rowand Cc: Prakhar Srivastava Cc: Lakshmi Ramasubramanian Cc: Thiago Jung Bauermann Cc: Rob Herring Signed-off-by: Vaibhav Jain --- drivers/of/kexec.c | 7 +++ 1 file changed, 7 insertions(+) diff --git a/drivers/of/kexec.c b/drivers/of/kexec.c index b9bd1cff1793..c73007eda52d 100644 --- a/drivers/of/kexec.c +++ b/drivers/of/kexec.c @@ -140,6 +140,13 @@ int ima_get_kexec_buffer(void **addr, size_t *size) if (ret) return ret; + /* if the ima-kexec-buffer goes beyond the addressable memory */ + if (!memblock_is_region_memory(tmp_addr, tmp_size)) { + pr_warn("IMA buffer at 0x%lx, size = 0x%zx beyond memory\n", + tmp_addr, tmp_size); + return -EINVAL; + } + *addr = __va(tmp_addr); *size = tmp_size; -- 2.35.1
[PATCH v2] powerpc/papr_scm: Fix leaking nvdimm_events_map elements
Right now 'char *' elements allocated for individual 'stat_id' in 'papr_scm_priv.nvdimm_events_map[]' during papr_scm_pmu_check_events(), get leaked in papr_scm_remove() and papr_scm_pmu_register(), papr_scm_pmu_check_events() error paths. Also individual 'stat_id' arent NULL terminated 'char *' instead they are fixed 8-byte sized identifiers. However papr_scm_pmu_register() assumes it to be a NULL terminated 'char *' and at other places it assumes it to be a 'papr_scm_perf_stat.stat_id' sized string which is 8-byes in size. Fix this by allocating the memory for papr_scm_priv.nvdimm_events_map to also include space for 'stat_id' entries. This is possible since number of available events/stat_ids are known upfront. This saves some memory and one extra level of indirection from 'nvdimm_events_map' to 'stat_id'. Also rest of the code can continue to call 'kfree(papr_scm_priv.nvdimm_events_map)' without needing to iterate over the array and free up individual elements. Fixes: 4c08d4bbc089 ("powerpc/papr_scm: Add perf interface support") Signed-off-by: Vaibhav Jain --- Change-log: v2: * Removed the typedef 'stat_id_t' which will be introduced in a later patch [Aneesh] * Updated the patch description a bit --- arch/powerpc/platforms/pseries/papr_scm.c | 54 ++- 1 file changed, 24 insertions(+), 30 deletions(-) diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c index 39962c905542..181b855b3050 100644 --- a/arch/powerpc/platforms/pseries/papr_scm.c +++ b/arch/powerpc/platforms/pseries/papr_scm.c @@ -125,8 +125,8 @@ struct papr_scm_priv { /* The bits which needs to be overridden */ u64 health_bitmap_inject_mask; -/* array to have event_code and stat_id mappings */ - char **nvdimm_events_map; + /* array to have event_code and stat_id mappings */ + u8 *nvdimm_events_map; }; static int papr_scm_pmem_flush(struct nd_region *nd_region, @@ -370,7 +370,7 @@ static int papr_scm_pmu_get_value(struct perf_event *event, struct device *dev, stat = >scm_statistic[0]; memcpy(>stat_id, - p->nvdimm_events_map[event->attr.config], + >nvdimm_events_map[event->attr.config * sizeof(stat->stat_id)], sizeof(stat->stat_id)); stat->stat_val = 0; @@ -462,14 +462,13 @@ static int papr_scm_pmu_check_events(struct papr_scm_priv *p, struct nvdimm_pmu { struct papr_scm_perf_stat *stat; struct papr_scm_perf_stats *stats; - int index, rc, count; u32 available_events; - - if (!p->stat_buffer_len) - return -ENOENT; + int index, rc = 0; available_events = (p->stat_buffer_len - sizeof(struct papr_scm_perf_stats)) / sizeof(struct papr_scm_perf_stat); + if (available_events == 0) + return -EOPNOTSUPP; /* Allocate the buffer for phyp where stats are written */ stats = kzalloc(p->stat_buffer_len, GFP_KERNEL); @@ -478,35 +477,30 @@ static int papr_scm_pmu_check_events(struct papr_scm_priv *p, struct nvdimm_pmu return rc; } - /* Allocate memory to nvdimm_event_map */ - p->nvdimm_events_map = kcalloc(available_events, sizeof(char *), GFP_KERNEL); - if (!p->nvdimm_events_map) { - rc = -ENOMEM; - goto out_stats; - } - /* Called to get list of events supported */ rc = drc_pmem_query_stats(p, stats, 0); if (rc) - goto out_nvdimm_events_map; - - for (index = 0, stat = stats->scm_statistic, count = 0; -index < available_events; index++, ++stat) { - p->nvdimm_events_map[count] = kmemdup_nul(stat->stat_id, 8, GFP_KERNEL); - if (!p->nvdimm_events_map[count]) { - rc = -ENOMEM; - goto out_nvdimm_events_map; - } + goto out; - count++; + /* +* Allocate memory and populate nvdimm_event_map. +* Allocate an extra element for NULL entry +*/ + p->nvdimm_events_map = kcalloc(available_events + 1, + sizeof(stat->stat_id), + GFP_KERNEL); + if (!p->nvdimm_events_map) { + rc = -ENOMEM; + goto out; } - p->nvdimm_events_map[count] = NULL; - kfree(stats); - return 0; -out_nvdimm_events_map: - kfree(p->nvdimm_events_map); -out_stats: + /* Copy all stat_ids to event map */ + for (index = 0, stat = stats->scm_statistic; +index < available_events; index++, ++stat) { + memcpy(>nvdimm_events_map[index * sizeof(stat->stat_id)], + >stat_id, sizeof(stat->stat_id)); + } +out
Re: [PATCH] powerpc/papr_scm: Fix leaking nvdimm_events_map elements
Thanks for reviewing this patch Aneesh, "Aneesh Kumar K.V" writes: > Vaibhav Jain writes: > >> Right now 'char *' elements allocated individual 'stat_id' in >> 'papr_scm_priv.nvdimm_events_map' during papr_scm_pmu_check_events() leak in >> papr_scm_remove() and papr_scm_pmu_register(), papr_scm_pmu_check_events() >> error >> paths. >> >> Also individual 'stat_id' arent NULL terminated 'char *' instead they are >> fixed >> 8-byte sized identifiers. However papr_scm_pmu_register() assumes it to be a >> NULL terminated 'char *' and at other places it assumes it to be a >> 'papr_scm_perf_stat.stat_id' sized string which is 8-byes in size. >> >> Fix this by allocating the memory for papr_scm_priv.nvdimm_events_map to also >> include space for 'stat_id' entries. This is possible since number of >> available >> events/stat_ids are known upfront. This saves some memory and one extra >> level of >> indirection from 'nvdimm_events_map' to 'stat_id'. Also rest of the code >> can continue to call 'kfree(papr_scm_priv.nvdimm_events_map)' without >> needing to >> iterate over the array and free up individual elements. >> >> Also introduce a new typedef called 'state_id_t' thats a 'u8[8]' and can be >> used >> across papr_scm to deal with stat_ids. >> >> Fixes: 4c08d4bbc089 ("powerpc/papr_scm: Add perf interface support") >> Signed-off-by: Vaibhav Jain >> --- >> arch/powerpc/platforms/pseries/papr_scm.c | 48 +++ >> 1 file changed, 22 insertions(+), 26 deletions(-) >> >> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c >> b/arch/powerpc/platforms/pseries/papr_scm.c >> index 39962c905542..f33a865ad5fb 100644 >> --- a/arch/powerpc/platforms/pseries/papr_scm.c >> +++ b/arch/powerpc/platforms/pseries/papr_scm.c >> @@ -70,8 +70,10 @@ >> #define PAPR_SCM_PERF_STATS_VERSION 0x1 >> >> /* Struct holding a single performance metric */ >> +typedef u8 stat_id_t[8]; >> + >> struct papr_scm_perf_stat { >> -u8 stat_id[8]; >> +stat_id_t stat_id; >> __be64 stat_val; >> } __packed; > > Can we do this as two patch? One that fix the leak and other that adds > new type? > Yeah, sure that would be a simple change. I will send a v2 across that only removes the leak and then a subsequent patch that changes the usage of stat_id within papr_scm to the new typedef >> >> @@ -126,7 +128,7 @@ struct papr_scm_priv { >> u64 health_bitmap_inject_mask; >> >> /* array to have event_code and stat_id mappings */ >> -char **nvdimm_events_map; >> +stat_id_t *nvdimm_events_map; >> }; >> >> static int papr_scm_pmem_flush(struct nd_region *nd_region, >> @@ -462,7 +464,7 @@ static int papr_scm_pmu_check_events(struct >> papr_scm_priv *p, struct nvdimm_pmu >> { >> struct papr_scm_perf_stat *stat; >> struct papr_scm_perf_stats *stats; >> -int index, rc, count; >> +int index, rc = 0; >> u32 available_events; >> >> if (!p->stat_buffer_len) >> @@ -478,35 +480,29 @@ static int papr_scm_pmu_check_events(struct >> papr_scm_priv *p, struct nvdimm_pmu >> return rc; >> } >> >> -/* Allocate memory to nvdimm_event_map */ >> -p->nvdimm_events_map = kcalloc(available_events, sizeof(char *), >> GFP_KERNEL); >> -if (!p->nvdimm_events_map) { >> -rc = -ENOMEM; >> -goto out_stats; >> -} >> - >> /* Called to get list of events supported */ >> rc = drc_pmem_query_stats(p, stats, 0); >> if (rc) >> -goto out_nvdimm_events_map; >> - >> -for (index = 0, stat = stats->scm_statistic, count = 0; >> - index < available_events; index++, ++stat) { >> -p->nvdimm_events_map[count] = kmemdup_nul(stat->stat_id, 8, >> GFP_KERNEL); >> -if (!p->nvdimm_events_map[count]) { >> -rc = -ENOMEM; >> -goto out_nvdimm_events_map; >> -} >> +goto out; >> >> -count++; >> +/* >> + * Allocate memory and populate nvdimm_event_map. >> + * Allocate an extra element for NULL entry >> + */ >> +p->nvdimm_events_map = kcalloc(available_events + 1, >> + sizeof(stat_id_t), GFP_KERNEL); >> +if (!p->nvdimm_events_map) { >> +rc = -ENOMEM; >> +goto out; >> } >> -p->nvdimm_events_map[count] = NULL; >> -kfree(stats); >> -return 0; >> >> -out_nvdimm_events_map: >> -kfree(p->nvdimm_events_map); >> -out_stats: >> +/* Copy all stat_ids to event map */ >> +for (index = 0, stat = stats->scm_statistic; >> + index < available_events; index++, ++stat) { >> +memcpy(>nvdimm_events_map[index], >stat_id, >> + sizeof(stat_id_t)); >> +} >> +out: >> kfree(stats); >> return rc; >> } >> >> base-commit: 348c71344111d7a48892e3e52264ff11956fc196 >> -- >> 2.35.1 > -- Cheers ~ Vaibhav
[PATCH] powerpc/papr_scm: Fix leaking nvdimm_events_map elements
Right now 'char *' elements allocated individual 'stat_id' in 'papr_scm_priv.nvdimm_events_map' during papr_scm_pmu_check_events() leak in papr_scm_remove() and papr_scm_pmu_register(), papr_scm_pmu_check_events() error paths. Also individual 'stat_id' arent NULL terminated 'char *' instead they are fixed 8-byte sized identifiers. However papr_scm_pmu_register() assumes it to be a NULL terminated 'char *' and at other places it assumes it to be a 'papr_scm_perf_stat.stat_id' sized string which is 8-byes in size. Fix this by allocating the memory for papr_scm_priv.nvdimm_events_map to also include space for 'stat_id' entries. This is possible since number of available events/stat_ids are known upfront. This saves some memory and one extra level of indirection from 'nvdimm_events_map' to 'stat_id'. Also rest of the code can continue to call 'kfree(papr_scm_priv.nvdimm_events_map)' without needing to iterate over the array and free up individual elements. Also introduce a new typedef called 'state_id_t' thats a 'u8[8]' and can be used across papr_scm to deal with stat_ids. Fixes: 4c08d4bbc089 ("powerpc/papr_scm: Add perf interface support") Signed-off-by: Vaibhav Jain --- arch/powerpc/platforms/pseries/papr_scm.c | 48 +++ 1 file changed, 22 insertions(+), 26 deletions(-) diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c index 39962c905542..f33a865ad5fb 100644 --- a/arch/powerpc/platforms/pseries/papr_scm.c +++ b/arch/powerpc/platforms/pseries/papr_scm.c @@ -70,8 +70,10 @@ #define PAPR_SCM_PERF_STATS_VERSION 0x1 /* Struct holding a single performance metric */ +typedef u8 stat_id_t[8]; + struct papr_scm_perf_stat { - u8 stat_id[8]; + stat_id_t stat_id; __be64 stat_val; } __packed; @@ -126,7 +128,7 @@ struct papr_scm_priv { u64 health_bitmap_inject_mask; /* array to have event_code and stat_id mappings */ - char **nvdimm_events_map; + stat_id_t *nvdimm_events_map; }; static int papr_scm_pmem_flush(struct nd_region *nd_region, @@ -462,7 +464,7 @@ static int papr_scm_pmu_check_events(struct papr_scm_priv *p, struct nvdimm_pmu { struct papr_scm_perf_stat *stat; struct papr_scm_perf_stats *stats; - int index, rc, count; + int index, rc = 0; u32 available_events; if (!p->stat_buffer_len) @@ -478,35 +480,29 @@ static int papr_scm_pmu_check_events(struct papr_scm_priv *p, struct nvdimm_pmu return rc; } - /* Allocate memory to nvdimm_event_map */ - p->nvdimm_events_map = kcalloc(available_events, sizeof(char *), GFP_KERNEL); - if (!p->nvdimm_events_map) { - rc = -ENOMEM; - goto out_stats; - } - /* Called to get list of events supported */ rc = drc_pmem_query_stats(p, stats, 0); if (rc) - goto out_nvdimm_events_map; - - for (index = 0, stat = stats->scm_statistic, count = 0; -index < available_events; index++, ++stat) { - p->nvdimm_events_map[count] = kmemdup_nul(stat->stat_id, 8, GFP_KERNEL); - if (!p->nvdimm_events_map[count]) { - rc = -ENOMEM; - goto out_nvdimm_events_map; - } + goto out; - count++; + /* +* Allocate memory and populate nvdimm_event_map. +* Allocate an extra element for NULL entry +*/ + p->nvdimm_events_map = kcalloc(available_events + 1, + sizeof(stat_id_t), GFP_KERNEL); + if (!p->nvdimm_events_map) { + rc = -ENOMEM; + goto out; } - p->nvdimm_events_map[count] = NULL; - kfree(stats); - return 0; -out_nvdimm_events_map: - kfree(p->nvdimm_events_map); -out_stats: + /* Copy all stat_ids to event map */ + for (index = 0, stat = stats->scm_statistic; +index < available_events; index++, ++stat) { + memcpy(>nvdimm_events_map[index], >stat_id, + sizeof(stat_id_t)); + } +out: kfree(stats); return rc; } base-commit: 348c71344111d7a48892e3e52264ff11956fc196 -- 2.35.1
Re: [PATCH] powerpc/papr_scm: Fix buffer overflow issue with CONFIG_FORTIFY_SOURCE
Hi Kajol, Thanks for the patch. Minor review comment below: Kajol Jain writes: > With CONFIG_FORTIFY_SOURCE enabled, string functions will also perform > dynamic checks for string size which can panic the kernel, > like incase of overflow detection. > > In papr_scm, papr_scm_pmu_check_events function uses stat->stat_id > with string operations, to populate the nvdimm_events_map array. > Since stat_id variable is not NULL terminated, the kernel panics > with CONFIG_FORTIFY_SOURCE enabled at boot time. > > Below are the logs of kernel panic: > > [0.090221][T1] detected buffer overflow in __fortify_strlen > [0.090241][T1] [ cut here ] > [0.090246][T1] kernel BUG at lib/string_helpers.c:980! > [0.090253][T1] Oops: Exception in kernel mode, sig: 5 [#1] > > [0.090375][T1] NIP [c077dad0] fortify_panic+0x28/0x38 > [0.090382][T1] LR [c077dacc] fortify_panic+0x24/0x38 > [0.090387][T1] Call Trace: > [0.090390][T1] [c022d77836e0] [c077dacc] > fortify_panic+0x24/0x38 (unreliable) > [9.297707] [T1] [c0080deb2660] > papr_scm_pmu_check_events.constprop.0+0x118/0x220 [papr_scm] > [9.297721] [T1] [c0080deb2cb0] papr_scm_probe+0x288/0x62c > [papr_scm] > [9.297732] [T1] [c09b46a8] platform_probe+0x98/0x150 > > Fix this issue by using kmemdup_nul function to copy the content of > stat->stat_id directly to the nvdimm_events_map array. > > Fixes: 4c08d4bbc089 ("powerpc/papr_scm: Add perf interface support") > Signed-off-by: Kajol Jain > --- > arch/powerpc/platforms/pseries/papr_scm.c | 7 ++- > 1 file changed, 2 insertions(+), 5 deletions(-) > > diff --git a/arch/powerpc/platforms/pseries/papr_scm.c > b/arch/powerpc/platforms/pseries/papr_scm.c > index f58728d5f10d..39962c905542 100644 > --- a/arch/powerpc/platforms/pseries/papr_scm.c > +++ b/arch/powerpc/platforms/pseries/papr_scm.c > @@ -462,7 +462,6 @@ static int papr_scm_pmu_check_events(struct papr_scm_priv > *p, struct nvdimm_pmu > { > struct papr_scm_perf_stat *stat; > struct papr_scm_perf_stats *stats; > - char *statid; > int index, rc, count; > u32 available_events; > > @@ -493,14 +492,12 @@ static int papr_scm_pmu_check_events(struct > papr_scm_priv *p, struct nvdimm_pmu > > for (index = 0, stat = stats->scm_statistic, count = 0; >index < available_events; index++, ++stat) { > - statid = kzalloc(strlen(stat->stat_id) + 1, GFP_KERNEL); > - if (!statid) { > + p->nvdimm_events_map[count] = kmemdup_nul(stat->stat_id, > 8, GFP_KERNEL); s/8/sizeof(stat->stat_id)/ > + if (!p->nvdimm_events_map[count]) { > rc = -ENOMEM; > goto out_nvdimm_events_map; > } > > - strcpy(statid, stat->stat_id); > - p->nvdimm_events_map[count] = statid; > count++; > } > p->nvdimm_events_map[count] = NULL; > -- > 2.31.1 > -- Cheers ~ Vaibhav
Re: [PATCH 2/2] powerpc/papr_scm: Fix build failure when CONFIG_PERF_EVENTS is not set
Dan Williams writes: > On Tue, Mar 22, 2022 at 7:30 AM kajoljain wrote: >> >> >> >> On 3/22/22 03:09, Dan Williams wrote: >> > On Fri, Mar 18, 2022 at 4:42 AM Kajol Jain wrote: >> >> >> >> The following build failure occures when CONFIG_PERF_EVENTS is not set >> >> as generic pmu functions are not visible in that scenario. >> >> >> >> arch/powerpc/platforms/pseries/papr_scm.c:372:35: error: ‘struct >> >> perf_event’ has no member named ‘attr’ >> >> p->nvdimm_events_map[event->attr.config], >> >>^~ >> >> In file included from ./include/linux/list.h:5, >> >> from ./include/linux/kobject.h:19, >> >> from ./include/linux/of.h:17, >> >> from arch/powerpc/platforms/pseries/papr_scm.c:5: >> >> arch/powerpc/platforms/pseries/papr_scm.c: In function >> >> ‘papr_scm_pmu_event_init’: >> >> arch/powerpc/platforms/pseries/papr_scm.c:389:49: error: ‘struct >> >> perf_event’ has no member named ‘pmu’ >> >> struct nvdimm_pmu *nd_pmu = to_nvdimm_pmu(event->pmu); >> >> ^~ >> >> ./include/linux/container_of.h:18:26: note: in definition of macro >> >> ‘container_of’ >> >> void *__mptr = (void *)(ptr); \ >> >> ^~~ >> >> arch/powerpc/platforms/pseries/papr_scm.c:389:30: note: in expansion of >> >> macro ‘to_nvdimm_pmu’ >> >> struct nvdimm_pmu *nd_pmu = to_nvdimm_pmu(event->pmu); >> >> ^ >> >> In file included from ./include/linux/bits.h:22, >> >> from ./include/linux/bitops.h:6, >> >> from ./include/linux/of.h:15, >> >> from arch/powerpc/platforms/pseries/papr_scm.c:5: >> >> >> >> Fix the build issue by adding check for CONFIG_PERF_EVENTS config option >> >> and disabling the papr_scm perf interface support incase this config >> >> is not set >> >> >> >> Fixes: 4c08d4bbc089 ("powerpc/papr_scm: Add perf interface support") >> >> (Commit id >> >> based on linux-next tree) >> >> Signed-off-by: Kajol Jain >> >> --- >> >> arch/powerpc/platforms/pseries/papr_scm.c | 15 +++ >> > >> > This is a bit messier than I would have liked mainly because it dumps >> > a bunch of ifdefery into a C file contrary to coding style, "Wherever >> > possible, don't use preprocessor conditionals (#if, #ifdef) in .c >> > files". I would expect this all to move to an organization like: >> >> Hi Dan, >> Thanks for reviewing the patches. Inorder to avoid the multiple >> ifdefs checks, we can also add stub function for papr_scm_pmu_register. >> With that change we will just have one ifdef check for >> CONFIG_PERF_EVENTS config in both papr_scm.c and nd.h file. Hence we can >> avoid adding new files specific for papr_scm perf interface. >> >> Below is the code snippet for that change, let me know if looks fine to >> you. I tested it >> with set/unset PAPR_SCM config value and set/unset PERF_EVENTS config >> value combinations. >> >> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c >> b/arch/powerpc/platforms/pseries/papr_scm.c >> index 4dd513d7c029..38fabb44d3c3 100644 >> --- a/arch/powerpc/platforms/pseries/papr_scm.c >> +++ b/arch/powerpc/platforms/pseries/papr_scm.c >> @@ -69,8 +69,6 @@ >> #define PAPR_SCM_PERF_STATS_EYECATCHER __stringify(SCMSTATS) >> #define PAPR_SCM_PERF_STATS_VERSION 0x1 >> >> -#define to_nvdimm_pmu(_pmu)container_of(_pmu, struct nvdimm_pmu, pmu) >> - >> /* Struct holding a single performance metric */ >> struct papr_scm_perf_stat { >> u8 stat_id[8]; >> @@ -346,6 +344,9 @@ static ssize_t drc_pmem_query_stats(struct >> papr_scm_priv *p, >> return 0; >> } >> >> +#ifdef CONFIG_PERF_EVENTS >> +#define to_nvdimm_pmu(_pmu)container_of(_pmu, struct nvdimm_pmu, pmu) >> + >> static int papr_scm_pmu_get_value(struct perf_event *event, struct >> device *dev, u64 *count) >> { >> struct papr_scm_perf_stat *stat; >> @@ -558,6 +559,10 @@ static void papr_scm_pmu_register(struct >> papr_scm_priv *p) >> dev_info(>pdev->dev, "nvdimm pmu didn't register rc=%d\n", rc); >> } >> >> +#else >> +static inline void papr_scm_pmu_register(struct papr_scm_priv *p) { } > > Since this isn't in a header file, it does not need to be marked > "inline" the compiler will figure it out. > >> +#endif > > It might be time to create: > > arch/powerpc/platforms/pseries/papr_scm.h > > ...there is quite a bit of header material accrued in papr_scm.c and > once the ifdefs start landing in it then it becomes a nominal coding > style issue. That said, this is certainly more palatable than the > previous version. So if you don't want to create papr_scm.h yet for > this, at least make a note in the changelog that the first portion of > arch/powerpc/platforms/pseries/papr_scm.c is effectively papr_scm.h > content and may move there in the future, or something like that. Great suggestion Dan and incidently we are already working on
Re: linux-next: manual merge of the nvdimm tree with the powerpc tree
Stephen Rothwell writes: > Hi all, > > Today's linux-next merge of the nvdimm tree got a conflict in: > > arch/powerpc/platforms/pseries/papr_scm.c > > between commit: > > bbbca72352bb ("powerpc/papr_scm: Implement initial support for injecting > smart errors") > > from the powerpc tree and commit: > > 4c08d4bbc089 ("powerpc/papr_scm: Add perf interface support") > > from the nvdimm tree. > > I fixed it up (see below) and can carry the fix as necessary. This > is now fixed as far as linux-next is concerned, but any non trivial > conflicts should be mentioned to your upstream maintainer when your tree > is submitted for merging. You may also want to consider cooperating > with the maintainer of the conflicting tree to minimise any particularly > complex conflicts. Thanks for this correction Stephen and the change looks ok to me. I verified the functionality introduced by kernel commit bbbca72352bb ("powerpc/papr_scm: Implement initial support for injecting smart errors") on the 'next-20220315' and found it to be working fine. -- Cheers ~ Vaibhav
[PATCH v4] powerpc/papr_scm: Implement initial support for injecting smart errors
Presently PAPR doesn't support injecting smart errors on an NVDIMM. This makes testing the NVDIMM health reporting functionality difficult as simulating NVDIMM health related events need a hacked up qemu version. To solve this problem this patch proposes simulating certain set of NVDIMM health related events in papr_scm. Specifically 'fatal' health state and 'dirty' shutdown state. These error can be injected via the user-space 'ndctl-inject-smart(1)' command. With the proposed patch and corresponding ndctl patches following command flow is expected: $ sudo ndctl list -DH -d nmem0 ... "health_state":"ok", "shutdown_state":"clean", ... # inject unsafe shutdown and fatal health error $ sudo ndctl inject-smart nmem0 -Uf ... "health_state":"fatal", "shutdown_state":"dirty", ... # uninject all errors $ sudo ndctl inject-smart nmem0 -N ... "health_state":"ok", "shutdown_state":"clean", ... The patch adds a new member 'health_bitmap_inject_mask' inside struct papr_scm_priv which is then bitwise ANDed to the health bitmap fetched from the hypervisor. The value for 'health_bitmap_inject_mask' is accessible from sysfs at nmemX/papr/health_bitmap_inject. A new PDSM named 'SMART_INJECT' is proposed that accepts newly introduced 'struct nd_papr_pdsm_smart_inject' as payload thats exchanged between libndctl and papr_scm to indicate the requested smart-error states. When the processing the PDSM 'SMART_INJECT', papr_pdsm_smart_inject() constructs a pair or 'inject_mask' and 'clear_mask' bitmaps from the payload and bit-blt it to the 'health_bitmap_inject_mask'. This ensures the after being fetched from the hypervisor, the health_bitmap reflects requested smart-error states. Signed-off-by: Vaibhav Jain Signed-off-by: Shivaprasad G Bhat --- Changelog: Since v3: * Renamed the sysfs entry from 'health_bitmap_override' to 'health_bitmap_inject'. * Simplified the variable names and removed the 'health_bitmap_{mask,override}' members. Instead replaced them with a single 'health_bitmap_inject_mask' member. [Aneesh] * Updated the sysfs documentations and commit description. * Used READ/WRITE_ONCE macros at places where 'health_bitmap_inject_mask' may be accessed concurrently. Since v2: * Rebased the patch to ppc-next * Added documentation for newly introduced sysfs attribute 'health_bitmap_override' Since v1: * Updated the patch description. * Removed dependency of a header movement patch. * Removed '__packed' attribute for 'struct nd_papr_pdsm_smart_inject' [Aneesh] --- Documentation/ABI/testing/sysfs-bus-papr-pmem | 12 +++ arch/powerpc/include/uapi/asm/papr_pdsm.h | 18 arch/powerpc/platforms/pseries/papr_scm.c | 90 ++- 3 files changed, 117 insertions(+), 3 deletions(-) diff --git a/Documentation/ABI/testing/sysfs-bus-papr-pmem b/Documentation/ABI/testing/sysfs-bus-papr-pmem index 95254cec92bf..4ac0673901e7 100644 --- a/Documentation/ABI/testing/sysfs-bus-papr-pmem +++ b/Documentation/ABI/testing/sysfs-bus-papr-pmem @@ -61,3 +61,15 @@ Description: * "CchRHCnt" : Cache Read Hit Count * "CchWHCnt" : Cache Write Hit Count * "FastWCnt" : Fast Write Count + +What: /sys/bus/nd/devices/nmemX/papr/health_bitmap_inject +Date: Jan, 2022 +KernelVersion: v5.17 +Contact: linuxppc-dev , nvd...@lists.linux.dev, +Description: + (RO) Reports the health bitmap inject bitmap that is applied to + bitmap received from PowerVM via the H_SCM_HEALTH. This is used + to forcibly set specific bits returned from Hcall. These is then + used to simulate various health or shutdown states for an nvdimm + and are set by user-space tools like ndctl by issuing a PAPR DSM. + diff --git a/arch/powerpc/include/uapi/asm/papr_pdsm.h b/arch/powerpc/include/uapi/asm/papr_pdsm.h index 82488b1e7276..17439925045c 100644 --- a/arch/powerpc/include/uapi/asm/papr_pdsm.h +++ b/arch/powerpc/include/uapi/asm/papr_pdsm.h @@ -116,6 +116,22 @@ struct nd_papr_pdsm_health { }; }; +/* Flags for injecting specific smart errors */ +#define PDSM_SMART_INJECT_HEALTH_FATAL (1 << 0) +#define PDSM_SMART_INJECT_BAD_SHUTDOWN (1 << 1) + +struct nd_papr_pdsm_smart_inject { + union { + struct { + /* One or more of PDSM_SMART_INJECT_ */ + __u32 flags; + __u8 fatal_enable; + __u8 unsafe_shutdown_enable; + }; + __u8 buf[ND_PDSM_PAYLOAD_MAX_SIZE]; + }; +}; + /* * Methods to be embedded in ND_CMD_CALL request. These are sent to the kernel * via 'nd_cmd_pkg.nd_command' member of the ioctl struct @@ -123,12 +139,14 @@ struct nd_papr_pdsm_health
Re: [PATCH v3] powerpc/papr_scm: Implement initial support for injecting smart errors
Hi Ira, Thanks for reviewing this patch. Ira Weiny writes: > On Thu, Jan 13, 2022 at 05:32:52PM +0530, Vaibhav Jain wrote: > [snip] > >> >> +/* Inject a smart error Add the dirty-shutdown-counter value to the pdsm */ >> +static int papr_pdsm_smart_inject(struct papr_scm_priv *p, >> + union nd_pdsm_payload *payload) >> +{ >> +int rc; >> +u32 supported_flags = 0; >> +u64 mask = 0, override = 0; >> + >> +/* Check for individual smart error flags and update mask and override >> */ >> +if (payload->smart_inject.flags & PDSM_SMART_INJECT_HEALTH_FATAL) { >> +supported_flags |= PDSM_SMART_INJECT_HEALTH_FATAL; >> +mask |= PAPR_PMEM_HEALTH_FATAL; >> +override |= payload->smart_inject.fatal_enable ? >> +PAPR_PMEM_HEALTH_FATAL : 0; >> +} >> + >> +if (payload->smart_inject.flags & PDSM_SMART_INJECT_BAD_SHUTDOWN) { >> +supported_flags |= PDSM_SMART_INJECT_BAD_SHUTDOWN; >> +mask |= PAPR_PMEM_SHUTDOWN_DIRTY; >> +override |= payload->smart_inject.unsafe_shutdown_enable ? >> +PAPR_PMEM_SHUTDOWN_DIRTY : 0; >> +} >> + > > I'm struggling to see why there is a need for both a flag and an 8 bit > 'enable' > value? > This is to enable the inject/uninject error usecase with ndctl which lets user select individual error conditions like bad_shutdown or fatal-health state. The nd_papr_pdsm_smart_inject.flag field indicates which error conditions needs to be tweaked and individual __u8 fields like 'fatal_enable' are boolean values to indicate the inject/uninject state of that error condition. For e.g to uninject fatal-health and inject unsafe-shutdown following nd_papr_pdsm_smart_inject payload can be sent: { .flags = PDSM_SMART_INJECT_HEALTH_FATAL | PDSM_SMART_INJECT_BAD_SHUTDOWN, .fatal_enable = 0, .unsafe_shutdown_enable = 1, } To just to inject fatal-health following nd_papr_pdsm_smart_inject payload can be sent: { .flags = PDSM_SMART_INJECT_HEALTH_FATAL, .fatal_enable = 1, .unsafe_shutdown_enable = , } > Ira > -- Cheers ~ Vaibhav
Re: [PATCH v3] powerpc/papr_scm: Implement initial support for injecting smart errors
Thanks for reviewing this patch Aneesh. My responses to your comments below: "Aneesh Kumar K.V" writes: > Vaibhav Jain writes: > >> Presently PAPR doesn't support injecting smart errors on an >> NVDIMM. This makes testing the NVDIMM health reporting functionality >> difficult as simulating NVDIMM health related events need a hacked up >> qemu version. >> >> To solve this problem this patch proposes simulating certain set of >> NVDIMM health related events in papr_scm. Specifically 'fatal' health >> state and 'dirty' shutdown state. These error can be injected via the >> user-space 'ndctl-inject-smart(1)' command. With the proposed patch and >> corresponding ndctl patches following command flow is expected: >> >> $ sudo ndctl list -DH -d nmem0 >> ... >> "health_state":"ok", >> "shutdown_state":"clean", >> ... >> # inject unsafe shutdown and fatal health error >> $ sudo ndctl inject-smart nmem0 -Uf >> ... >> "health_state":"fatal", >> "shutdown_state":"dirty", >> ... >> # uninject all errors >> $ sudo ndctl inject-smart nmem0 -N >> ... >> "health_state":"ok", >> "shutdown_state":"clean", >> ... >> >> The patch adds two members 'health_bitmap_mask' and >> 'health_bitmap_override' inside struct papr_scm_priv which are then >> bit blt'ed to the health bitmaps fetched from the hypervisor. In case >> we are not able to fetch health information from the hypervisor we >> service the health bitmap from these two members. These members are >> accessible from sysfs at nmemX/papr/health_bitmap_override >> >> A new PDSM named 'SMART_INJECT' is proposed that accepts newly >> introduced 'struct nd_papr_pdsm_smart_inject' as payload thats >> exchanged between libndctl and papr_scm to indicate the requested >> smart-error states. >> >> When the processing the PDSM 'SMART_INJECT', papr_pdsm_smart_inject() >> constructs a pair or 'mask' and 'override' bitmaps from the payload >> and bit-blt it to the 'health_bitmap_{mask, override}' members. This >> ensures the after being fetched from the hypervisor, the health_bitmap >> reflects requested smart-error states. >> >> Signed-off-by: Vaibhav Jain >> Signed-off-by: Shivaprasad G Bhat >> --- >> Changelog: >> >> Since v2: >> * Rebased the patch to ppc-next >> * Added documentation for newly introduced sysfs attribute >> 'health_bitmap_override' >> >> Since v1: >> * Updated the patch description. >> * Removed dependency of a header movement patch. >> * Removed '__packed' attribute for 'struct nd_papr_pdsm_smart_inject' >> [Aneesh] >> --- >> Documentation/ABI/testing/sysfs-bus-papr-pmem | 13 +++ >> arch/powerpc/include/uapi/asm/papr_pdsm.h | 18 >> arch/powerpc/platforms/pseries/papr_scm.c | 94 ++- >> 3 files changed, 122 insertions(+), 3 deletions(-) >> >> diff --git a/Documentation/ABI/testing/sysfs-bus-papr-pmem >> b/Documentation/ABI/testing/sysfs-bus-papr-pmem >> index 95254cec92bf..8a0b2a7f7671 100644 >> --- a/Documentation/ABI/testing/sysfs-bus-papr-pmem >> +++ b/Documentation/ABI/testing/sysfs-bus-papr-pmem >> @@ -61,3 +61,16 @@ Description: >> * "CchRHCnt" : Cache Read Hit Count >> * "CchWHCnt" : Cache Write Hit Count >> * "FastWCnt" : Fast Write Count >> + >> +What: /sys/bus/nd/devices/nmemX/papr/health_bitmap_override >> +Date: Jan, 2022 >> +KernelVersion: v5.17 >> +Contact:linuxppc-dev , >> nvd...@lists.linux.dev, >> +Description: >> +(RO) Reports the health bitmap override/mask bitmaps that are >> +applied to top bitmap received from PowerVM via the H_SCM_HEALTH >> +Hcall. Together these can be used to forcibly set/reset specific >> +bits returned from Hcall. These bitmaps are presently used to >> +simulate various health or shutdown states for an nvdimm and are >> +set by user-space tools like ndctl by issuing a PAPR DSM. >> + >> diff --git a/arch/powerpc/include/uapi/asm/papr_pdsm.h >> b/arch/powerpc/include/uapi/asm/papr_pdsm.h >> index 82488b1e7276..17439925045c 100644 >> --- a/arch/powerpc/include/uapi/asm/papr_pdsm.h >> +++ b/arch/powerpc/include/uapi/asm/papr_pdsm.h >> @@ -116,6 +116,22 @@ struct nd_
[PATCH v3] powerpc/papr_scm: Implement initial support for injecting smart errors
Presently PAPR doesn't support injecting smart errors on an NVDIMM. This makes testing the NVDIMM health reporting functionality difficult as simulating NVDIMM health related events need a hacked up qemu version. To solve this problem this patch proposes simulating certain set of NVDIMM health related events in papr_scm. Specifically 'fatal' health state and 'dirty' shutdown state. These error can be injected via the user-space 'ndctl-inject-smart(1)' command. With the proposed patch and corresponding ndctl patches following command flow is expected: $ sudo ndctl list -DH -d nmem0 ... "health_state":"ok", "shutdown_state":"clean", ... # inject unsafe shutdown and fatal health error $ sudo ndctl inject-smart nmem0 -Uf ... "health_state":"fatal", "shutdown_state":"dirty", ... # uninject all errors $ sudo ndctl inject-smart nmem0 -N ... "health_state":"ok", "shutdown_state":"clean", ... The patch adds two members 'health_bitmap_mask' and 'health_bitmap_override' inside struct papr_scm_priv which are then bit blt'ed to the health bitmaps fetched from the hypervisor. In case we are not able to fetch health information from the hypervisor we service the health bitmap from these two members. These members are accessible from sysfs at nmemX/papr/health_bitmap_override A new PDSM named 'SMART_INJECT' is proposed that accepts newly introduced 'struct nd_papr_pdsm_smart_inject' as payload thats exchanged between libndctl and papr_scm to indicate the requested smart-error states. When the processing the PDSM 'SMART_INJECT', papr_pdsm_smart_inject() constructs a pair or 'mask' and 'override' bitmaps from the payload and bit-blt it to the 'health_bitmap_{mask, override}' members. This ensures the after being fetched from the hypervisor, the health_bitmap reflects requested smart-error states. Signed-off-by: Vaibhav Jain Signed-off-by: Shivaprasad G Bhat --- Changelog: Since v2: * Rebased the patch to ppc-next * Added documentation for newly introduced sysfs attribute 'health_bitmap_override' Since v1: * Updated the patch description. * Removed dependency of a header movement patch. * Removed '__packed' attribute for 'struct nd_papr_pdsm_smart_inject' [Aneesh] --- Documentation/ABI/testing/sysfs-bus-papr-pmem | 13 +++ arch/powerpc/include/uapi/asm/papr_pdsm.h | 18 arch/powerpc/platforms/pseries/papr_scm.c | 94 ++- 3 files changed, 122 insertions(+), 3 deletions(-) diff --git a/Documentation/ABI/testing/sysfs-bus-papr-pmem b/Documentation/ABI/testing/sysfs-bus-papr-pmem index 95254cec92bf..8a0b2a7f7671 100644 --- a/Documentation/ABI/testing/sysfs-bus-papr-pmem +++ b/Documentation/ABI/testing/sysfs-bus-papr-pmem @@ -61,3 +61,16 @@ Description: * "CchRHCnt" : Cache Read Hit Count * "CchWHCnt" : Cache Write Hit Count * "FastWCnt" : Fast Write Count + +What: /sys/bus/nd/devices/nmemX/papr/health_bitmap_override +Date: Jan, 2022 +KernelVersion: v5.17 +Contact: linuxppc-dev , nvd...@lists.linux.dev, +Description: + (RO) Reports the health bitmap override/mask bitmaps that are + applied to top bitmap received from PowerVM via the H_SCM_HEALTH + Hcall. Together these can be used to forcibly set/reset specific + bits returned from Hcall. These bitmaps are presently used to + simulate various health or shutdown states for an nvdimm and are + set by user-space tools like ndctl by issuing a PAPR DSM. + diff --git a/arch/powerpc/include/uapi/asm/papr_pdsm.h b/arch/powerpc/include/uapi/asm/papr_pdsm.h index 82488b1e7276..17439925045c 100644 --- a/arch/powerpc/include/uapi/asm/papr_pdsm.h +++ b/arch/powerpc/include/uapi/asm/papr_pdsm.h @@ -116,6 +116,22 @@ struct nd_papr_pdsm_health { }; }; +/* Flags for injecting specific smart errors */ +#define PDSM_SMART_INJECT_HEALTH_FATAL (1 << 0) +#define PDSM_SMART_INJECT_BAD_SHUTDOWN (1 << 1) + +struct nd_papr_pdsm_smart_inject { + union { + struct { + /* One or more of PDSM_SMART_INJECT_ */ + __u32 flags; + __u8 fatal_enable; + __u8 unsafe_shutdown_enable; + }; + __u8 buf[ND_PDSM_PAYLOAD_MAX_SIZE]; + }; +}; + /* * Methods to be embedded in ND_CMD_CALL request. These are sent to the kernel * via 'nd_cmd_pkg.nd_command' member of the ioctl struct @@ -123,12 +139,14 @@ struct nd_papr_pdsm_health { enum papr_pdsm { PAPR_PDSM_MIN = 0x0, PAPR_PDSM_HEALTH, + PAPR_PDSM_SMART_INJECT, PAPR_PDSM_MAX, }; /* Maximal union that can hold all possible payload types */ union nd_pdsm_payload { stru
Re: [RFC PATCH v2] powerpc/papr_scm: Move duplicate definitions to common header files
Hi Mpe, Thanks for looking into this patch. Michael Ellerman writes: > Shivaprasad G Bhat writes: >> papr_scm and ndtest share common PDSM payload structs like >> nd_papr_pdsm_health. Presently these structs are duplicated across >> papr_pdsm.h >> and ndtest.h header files. Since 'ndtest' is essentially arch independent >> and can >> run on platforms other than PPC64, a way needs to be deviced to avoid >> redundancy >> and duplication of PDSM structs in future. >> >> So the patch proposes moving the PDSM header from arch/powerpc/include/uapi/ >> to >> the generic include/uapi/linux directory. Also, there are some #defines >> common >> between papr_scm and ndtest which are not exported to the user space. So, >> move >> them to a header file which can be shared across ndtest and papr_scm via >> newly >> introduced include/linux/papr_scm.h. >> >> Signed-off-by: Shivaprasad G Bhat >> Signed-off-by: Vaibhav Jain >> Suggested-by: "Aneesh Kumar K.V" >> --- >> Changelog: >> >> Since v1: >> Link: >> https://patchwork.kernel.org/project/linux-nvdimm/patch/162505488483.72147.12741153746322191381.stgit@56e104a48989/ >> * Removed dependency on this patch for the other patches >> >> MAINTAINERS |2 >> arch/powerpc/include/uapi/asm/papr_pdsm.h | 165 >> - >> arch/powerpc/platforms/pseries/papr_scm.c | 43 >> include/linux/papr_scm.h | 48 >> include/uapi/linux/papr_pdsm.h| 165 >> + > > This doesn't make sense to me. > > Anything with papr (or PAPR) in the name is fundamentally powerpc > specific, it doesn't belong in a generic header, or in a generic > location. > > What's the actual problem you're trying to solve? > The ndtest module (tools/testing/nvdimm/test/ndtest.c) is implemented in an arch independed way to enable testing of PAPR PDSMs on non ppc64 platforms like x86_64. It uses the same PDSM structs as used by papr_scm to communicate with libndctl userspace. Since papr_scm is ppc64 arch specific we were so far duplicating the PDSM structures between ndtest and papr_scm. The patch tries to solve this duplication by moving the shared structs to arch independent common include dirs. Secondly, PDSMs describes how userspace can use NVDIMM_FAMILY_PAPR to interact with NVDIMMs. So potentially a new NVDIMM beyond powerpc arch can add its support and would need access to the same structs used by papr_scm and ndtest. In that context it would make sense to move PDSM headers to generic include dirs. > If it's just including papr_scm bits into ndtest.c then that should be > as simple as: > > #ifdef __powerpc__ > #include > #endif > > Shouldn't it? > No, as ndtest implements support for NVDIMM_FAMILY_PAPR and would need access to PDSM related structs which presently are only available for powerpc. > cheers > -- Cheers ~ Vaibhav
[PATCH] powerpc/papr_scm: Implement initial support for injecting smart errors
Presently PAPR doesn't support injecting smart errors on an NVDIMM. This makes testing the NVDIMM health reporting functionality difficult as simulating NVDIMM health related events need a hacked up qemu version. To solve this problem this patch proposes simulating certain set of NVDIMM health related events in papr_scm. Specifically 'fatal' health state and 'dirty' shutdown state. These error can be injected via the user-space 'ndctl-inject-smart(1)' command. With the proposed patch and corresponding ndctl patches following command flow is expected: $ sudo ndctl list -DH -d nmem0 ... "health_state":"ok", "shutdown_state":"clean", ... # inject unsafe shutdown and fatal health error $ sudo ndctl inject-smart nmem0 -Uf ... "health_state":"fatal", "shutdown_state":"dirty", ... # uninject all errors $ sudo ndctl inject-smart nmem0 -N ... "health_state":"ok", "shutdown_state":"clean", ... The patch adds two members 'health_bitmap_mask' and 'health_bitmap_override' inside struct papr_scm_priv which are then bit blt'ed[1] to the health bitmaps fetched from the hypervisor. In case we are not able to fetch health information from the hypervisor we service the health bitmap from these two members. These members are accessible from sysfs at nmemX/papr/health_bitmap_override A new PDSM named 'SMART_INJECT' is proposed that accepts newly introduced 'struct nd_papr_pdsm_smart_inject' as payload thats exchanged between libndctl and papr_scm to indicate the requested smart-error states. When the processing the PDSM 'SMART_INJECT', papr_pdsm_smart_inject() constructs a pair or 'mask' and 'override' bitmaps from the payload and bit-blt it to the 'health_bitmap_{mask, override}' members. This ensures the after being fetched from the hypervisor, the health_bitmap reflects requested smart-error states. The patch is based on [2] "powerpc/papr_scm: Move duplicate definitions to common header files". [1] : https://en.wikipedia.org/wiki/Bit_blit [2] : https://lore.kernel.org/nvdimm/162505488483.72147.12741153746322191381.stgit@56e104a48989 Signed-off-by: Vaibhav Jain Signed-off-by: Shivaprasad G Bhat --- arch/powerpc/platforms/pseries/papr_scm.c | 94 ++- include/uapi/linux/papr_pdsm.h| 18 + 2 files changed, 109 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c index 0c56db5a1427..b7437c61a270 100644 --- a/arch/powerpc/platforms/pseries/papr_scm.c +++ b/arch/powerpc/platforms/pseries/papr_scm.c @@ -29,6 +29,10 @@ (1ul << ND_CMD_SET_CONFIG_DATA) | \ (1ul << ND_CMD_CALL)) +/* Use bitblt method to override specific bits in the '_bitmap_' */ +#define BITBLT_BITMAP(_bitmap_, _mask_, _override_)\ + (((_bitmap_) & ~(_mask_)) | ((_mask_) & (_override_))) + /* Struct holding a single performance metric */ struct papr_scm_perf_stat { u8 stat_id[8]; @@ -81,6 +85,12 @@ struct papr_scm_priv { /* length of the stat buffer as expected by phyp */ size_t stat_buffer_len; + + /* The bits which needs to be overridden */ + u64 health_bitmap_mask; + + /* The overridden values for the bits having the masks set */ + u64 health_bitmap_override; }; static int papr_scm_pmem_flush(struct nd_region *nd_region, @@ -308,19 +318,28 @@ static ssize_t drc_pmem_query_stats(struct papr_scm_priv *p, static int __drc_pmem_query_health(struct papr_scm_priv *p) { unsigned long ret[PLPAR_HCALL_BUFSIZE]; + u64 bitmap = 0; long rc; /* issue the hcall */ rc = plpar_hcall(H_SCM_HEALTH, ret, p->drc_index); - if (rc != H_SUCCESS) { + if (rc == H_SUCCESS) + bitmap = ret[0] & ret[1]; + else if (rc == H_FUNCTION) + dev_info_once(>pdev->dev, + "Hcall H_SCM_HEALTH not implemented, assuming empty health bitmap"); + else { + dev_err(>pdev->dev, "Failed to query health information, Err:%ld\n", rc); return -ENXIO; } p->lasthealth_jiffies = jiffies; - p->health_bitmap = ret[0] & ret[1]; - + /* Allow overriding specific health bits via bit blt. */ + bitmap = BITBLT_BITMAP(bitmap, p->health_bitmap_mask, + p->health_bitmap_override); + WRITE_ONCE(p->health_bitmap, bitmap); dev_dbg(>pdev->dev, "Queried dimm health info. Bitmap:0x%016lx Mask:0x%016lx\n", ret[0], ret[1]); @@ -630,6 +649,54 @@ static int papr_pdsm_health(struct papr_scm_priv *p, return rc; } +/* Inject a smart error Add the dirty-shutdown-counter value
[RESEND-PATCH v2] powerpc/papr_scm: Add support for reporting dirty-shutdown-count
Persistent memory devices like NVDIMMs can loose cached writes in case something prevents flush on power-fail. Such situations are termed as dirty shutdown and are exposed to applications as last-shutdown-state (LSS) flag and a dirty-shutdown-counter(DSC) as described at [1]. The latter being useful in conditions where multiple applications want to detect a dirty shutdown event without racing with one another. PAPR-NVDIMMs have so far only exposed LSS style flags to indicate a dirty-shutdown-state. This patch further adds support for DSC via the "ibm,persistence-failed-count" device tree property of an NVDIMM. This property is a monotonic increasing 64-bit counter thats an indication of number of times an NVDIMM has encountered a dirty-shutdown event causing persistence loss. Since this value is not expected to change after system-boot hence papr_scm reads & caches its value during NVDIMM probe and exposes it as a PAPR sysfs attributed named 'dirty_shutdown' to match the name of similarly named NFIT sysfs attribute. Also this value is available to libnvdimm via PAPR_PDSM_HEALTH payload. 'struct nd_papr_pdsm_health' has been extended to add a new member called 'dimm_dsc' presence of which is indicated by the newly introduced PDSM_DIMM_DSC_VALID flag. References: [1] https://pmem.io/documents/Dirty_Shutdown_Handling-V1.0.pdf Signed-off-by: Vaibhav Jain Reviewed-by: Aneesh Kumar K.V --- Changelog: Resend: * Added ack by Aneesh. * Minor fix to changelog of v2 patch v2: * Rebased the patch on latest ppc-next tree * s/psdm/pdsm/g [ Santosh ] --- arch/powerpc/include/uapi/asm/papr_pdsm.h | 6 + arch/powerpc/platforms/pseries/papr_scm.c | 30 +++ 2 files changed, 36 insertions(+) diff --git a/arch/powerpc/include/uapi/asm/papr_pdsm.h b/arch/powerpc/include/uapi/asm/papr_pdsm.h index 50ef95e2f5b1..82488b1e7276 100644 --- a/arch/powerpc/include/uapi/asm/papr_pdsm.h +++ b/arch/powerpc/include/uapi/asm/papr_pdsm.h @@ -77,6 +77,9 @@ /* Indicate that the 'dimm_fuel_gauge' field is valid */ #define PDSM_DIMM_HEALTH_RUN_GAUGE_VALID 1 +/* Indicate that the 'dimm_dsc' field is valid */ +#define PDSM_DIMM_DSC_VALID 2 + /* * Struct exchanged between kernel & ndctl in for PAPR_PDSM_HEALTH * Various flags indicate the health status of the dimm. @@ -105,6 +108,9 @@ struct nd_papr_pdsm_health { /* Extension flag PDSM_DIMM_HEALTH_RUN_GAUGE_VALID */ __u16 dimm_fuel_gauge; + + /* Extension flag PDSM_DIMM_DSC_VALID */ + __u64 dimm_dsc; }; __u8 buf[ND_PDSM_PAYLOAD_MAX_SIZE]; }; diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c index 11e7b90a3360..eb8be0eb6119 100644 --- a/arch/powerpc/platforms/pseries/papr_scm.c +++ b/arch/powerpc/platforms/pseries/papr_scm.c @@ -114,6 +114,9 @@ struct papr_scm_priv { /* Health information for the dimm */ u64 health_bitmap; + /* Holds the last known dirty shutdown counter value */ + u64 dirty_shutdown_counter; + /* length of the stat buffer as expected by phyp */ size_t stat_buffer_len; }; @@ -603,6 +606,16 @@ static int papr_pdsm_fuel_gauge(struct papr_scm_priv *p, return rc; } +/* Add the dirty-shutdown-counter value to the pdsm */ +static int papr_pdsm_dsc(struct papr_scm_priv *p, +union nd_pdsm_payload *payload) +{ + payload->health.extension_flags |= PDSM_DIMM_DSC_VALID; + payload->health.dimm_dsc = p->dirty_shutdown_counter; + + return sizeof(struct nd_papr_pdsm_health); +} + /* Fetch the DIMM health info and populate it in provided package. */ static int papr_pdsm_health(struct papr_scm_priv *p, union nd_pdsm_payload *payload) @@ -646,6 +659,8 @@ static int papr_pdsm_health(struct papr_scm_priv *p, /* Populate the fuel gauge meter in the payload */ papr_pdsm_fuel_gauge(p, payload); + /* Populate the dirty-shutdown-counter field */ + papr_pdsm_dsc(p, payload); rc = sizeof(struct nd_papr_pdsm_health); @@ -907,6 +922,16 @@ static ssize_t flags_show(struct device *dev, } DEVICE_ATTR_RO(flags); +static ssize_t dirty_shutdown_show(struct device *dev, + struct device_attribute *attr, char *buf) +{ + struct nvdimm *dimm = to_nvdimm(dev); + struct papr_scm_priv *p = nvdimm_provider_data(dimm); + + return sysfs_emit(buf, "%llu\n", p->dirty_shutdown_counter); +} +DEVICE_ATTR_RO(dirty_shutdown); + static umode_t papr_nd_attribute_visible(struct kobject *kobj, struct attribute *attr, int n) { @@ -925,6 +950,7 @@ static umode_t papr_nd_attribute_visible(struct kobject *kobj, static struct attribute *papr_nd_attributes[] = { _attr_flags.attr, _attr_perf_stats
Re: [PATCH] powerpc/papr_scm: Add support for reporting dirty-shutdown-count
Thanks for catching this Santosh. Have fixed this in v2 version of this patch -- Cheers ~ Vaibhav Santosh Sivaraj writes: > Hi Vaibhav, > > Vaibhav Jain writes: > >> Persistent memory devices like NVDIMMs can loose cached writes in case >> something prevents flush on power-fail. Such situations are termed as >> dirty shutdown and are exposed to applications as >> last-shutdown-state (LSS) flag and a dirty-shutdown-counter(DSC) as >> described at [1]. The latter being useful in conditions where multiple >> applications want to detect a dirty shutdown event without racing with >> one another. >> >> PAPR-NVDIMMs have so far only exposed LSS style flags to indicate a >> dirty-shutdown-state. This patch further adds support for DSC via the >> "ibm,persistence-failed-count" device tree property of an NVDIMM. This >> property is a monotonic increasing 64-bit counter thats an indication >> of number of times an NVDIMM has encountered a dirty-shutdown event >> causing persistence loss. >> >> Since this value is not expected to change after system-boot hence >> papr_scm reads & caches its value during NVDIMM probe and exposes it >> as a PAPR sysfs attributed named 'dirty_shutdown' to match the name of >> similarly named NFIT sysfs attribute. Also this value is available to >> libnvdimm via PAPR_PDSM_HEALTH payload. 'struct nd_papr_pdsm_health' >> has been extended to add a new member called 'dimm_dsc' presence of >> which is indicated by the newly introduced PDSM_DIMM_DSC_VALID flag. >> >> References: >> [1] https://pmem.io/documents/Dirty_Shutdown_Handling-V1.0.pdf >> >> Signed-off-by: Vaibhav Jain >> --- >> arch/powerpc/include/uapi/asm/papr_pdsm.h | 6 + >> arch/powerpc/platforms/pseries/papr_scm.c | 30 +++ >> 2 files changed, 36 insertions(+) >> >> diff --git a/arch/powerpc/include/uapi/asm/papr_pdsm.h >> b/arch/powerpc/include/uapi/asm/papr_pdsm.h >> index 50ef95e2f5b1..82488b1e7276 100644 >> --- a/arch/powerpc/include/uapi/asm/papr_pdsm.h >> +++ b/arch/powerpc/include/uapi/asm/papr_pdsm.h >> @@ -77,6 +77,9 @@ >> /* Indicate that the 'dimm_fuel_gauge' field is valid */ >> #define PDSM_DIMM_HEALTH_RUN_GAUGE_VALID 1 >> >> +/* Indicate that the 'dimm_dsc' field is valid */ >> +#define PDSM_DIMM_DSC_VALID 2 >> + >> /* >> * Struct exchanged between kernel & ndctl in for PAPR_PDSM_HEALTH >> * Various flags indicate the health status of the dimm. >> @@ -105,6 +108,9 @@ struct nd_papr_pdsm_health { >> >> /* Extension flag PDSM_DIMM_HEALTH_RUN_GAUGE_VALID */ >> __u16 dimm_fuel_gauge; >> + >> +/* Extension flag PDSM_DIMM_DSC_VALID */ >> +__u64 dimm_dsc; >> }; >> __u8 buf[ND_PDSM_PAYLOAD_MAX_SIZE]; >> }; >> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c >> b/arch/powerpc/platforms/pseries/papr_scm.c >> index 11e7b90a3360..68f0d3d5e899 100644 >> --- a/arch/powerpc/platforms/pseries/papr_scm.c >> +++ b/arch/powerpc/platforms/pseries/papr_scm.c >> @@ -114,6 +114,9 @@ struct papr_scm_priv { >> /* Health information for the dimm */ >> u64 health_bitmap; >> >> +/* Holds the last known dirty shutdown counter value */ >> +u64 dirty_shutdown_counter; >> + >> /* length of the stat buffer as expected by phyp */ >> size_t stat_buffer_len; >> }; >> @@ -603,6 +606,16 @@ static int papr_pdsm_fuel_gauge(struct papr_scm_priv *p, >> return rc; >> } >> >> +/* Add the dirty-shutdown-counter value to the pdsm */ >> +static int papr_psdm_dsc(struct papr_scm_priv *p, > should be pdsm >> + union nd_pdsm_payload *payload) >> +{ >> +payload->health.extension_flags |= PDSM_DIMM_DSC_VALID; >> +payload->health.dimm_dsc = p->dirty_shutdown_counter; >> + >> +return sizeof(struct nd_papr_pdsm_health); >> +} >> + >> /* Fetch the DIMM health info and populate it in provided package. */ >> static int papr_pdsm_health(struct papr_scm_priv *p, >> union nd_pdsm_payload *payload) >> @@ -646,6 +659,8 @@ static int papr_pdsm_health(struct papr_scm_priv *p, >> >> /* Populate the fuel gauge meter in the payload */ >> papr_pdsm_fuel_gauge(p, payload); >> +/* Populate the dirty-shutdown-counter field */ >> +papr_psdm_dsc(p, payload); >
[PATCH v2] powerpc/papr_scm: Add support for reporting dirty-shutdown-count
Persistent memory devices like NVDIMMs can loose cached writes in case something prevents flush on power-fail. Such situations are termed as dirty shutdown and are exposed to applications as last-shutdown-state (LSS) flag and a dirty-shutdown-counter(DSC) as described at [1]. The latter being useful in conditions where multiple applications want to detect a dirty shutdown event without racing with one another. PAPR-NVDIMMs have so far only exposed LSS style flags to indicate a dirty-shutdown-state. This patch further adds support for DSC via the "ibm,persistence-failed-count" device tree property of an NVDIMM. This property is a monotonic increasing 64-bit counter thats an indication of number of times an NVDIMM has encountered a dirty-shutdown event causing persistence loss. Since this value is not expected to change after system-boot hence papr_scm reads & caches its value during NVDIMM probe and exposes it as a PAPR sysfs attributed named 'dirty_shutdown' to match the name of similarly named NFIT sysfs attribute. Also this value is available to libnvdimm via PAPR_PDSM_HEALTH payload. 'struct nd_papr_pdsm_health' has been extended to add a new member called 'dimm_dsc' presence of which is indicated by the newly introduced PDSM_DIMM_DSC_VALID flag. References: [1] https://pmem.io/documents/Dirty_Shutdown_Handling-V1.0.pdf Signed-off-by: Vaibhav Jain --- Changelog: v2: * Rebased the patch on latest ppc-next tree * s/psdm/psdm/g [ Santosh ] --- arch/powerpc/include/uapi/asm/papr_pdsm.h | 6 + arch/powerpc/platforms/pseries/papr_scm.c | 30 +++ 2 files changed, 36 insertions(+) diff --git a/arch/powerpc/include/uapi/asm/papr_pdsm.h b/arch/powerpc/include/uapi/asm/papr_pdsm.h index 50ef95e2f5b1..82488b1e7276 100644 --- a/arch/powerpc/include/uapi/asm/papr_pdsm.h +++ b/arch/powerpc/include/uapi/asm/papr_pdsm.h @@ -77,6 +77,9 @@ /* Indicate that the 'dimm_fuel_gauge' field is valid */ #define PDSM_DIMM_HEALTH_RUN_GAUGE_VALID 1 +/* Indicate that the 'dimm_dsc' field is valid */ +#define PDSM_DIMM_DSC_VALID 2 + /* * Struct exchanged between kernel & ndctl in for PAPR_PDSM_HEALTH * Various flags indicate the health status of the dimm. @@ -105,6 +108,9 @@ struct nd_papr_pdsm_health { /* Extension flag PDSM_DIMM_HEALTH_RUN_GAUGE_VALID */ __u16 dimm_fuel_gauge; + + /* Extension flag PDSM_DIMM_DSC_VALID */ + __u64 dimm_dsc; }; __u8 buf[ND_PDSM_PAYLOAD_MAX_SIZE]; }; diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c index 11e7b90a3360..eb8be0eb6119 100644 --- a/arch/powerpc/platforms/pseries/papr_scm.c +++ b/arch/powerpc/platforms/pseries/papr_scm.c @@ -114,6 +114,9 @@ struct papr_scm_priv { /* Health information for the dimm */ u64 health_bitmap; + /* Holds the last known dirty shutdown counter value */ + u64 dirty_shutdown_counter; + /* length of the stat buffer as expected by phyp */ size_t stat_buffer_len; }; @@ -603,6 +606,16 @@ static int papr_pdsm_fuel_gauge(struct papr_scm_priv *p, return rc; } +/* Add the dirty-shutdown-counter value to the pdsm */ +static int papr_pdsm_dsc(struct papr_scm_priv *p, +union nd_pdsm_payload *payload) +{ + payload->health.extension_flags |= PDSM_DIMM_DSC_VALID; + payload->health.dimm_dsc = p->dirty_shutdown_counter; + + return sizeof(struct nd_papr_pdsm_health); +} + /* Fetch the DIMM health info and populate it in provided package. */ static int papr_pdsm_health(struct papr_scm_priv *p, union nd_pdsm_payload *payload) @@ -646,6 +659,8 @@ static int papr_pdsm_health(struct papr_scm_priv *p, /* Populate the fuel gauge meter in the payload */ papr_pdsm_fuel_gauge(p, payload); + /* Populate the dirty-shutdown-counter field */ + papr_pdsm_dsc(p, payload); rc = sizeof(struct nd_papr_pdsm_health); @@ -907,6 +922,16 @@ static ssize_t flags_show(struct device *dev, } DEVICE_ATTR_RO(flags); +static ssize_t dirty_shutdown_show(struct device *dev, + struct device_attribute *attr, char *buf) +{ + struct nvdimm *dimm = to_nvdimm(dev); + struct papr_scm_priv *p = nvdimm_provider_data(dimm); + + return sysfs_emit(buf, "%llu\n", p->dirty_shutdown_counter); +} +DEVICE_ATTR_RO(dirty_shutdown); + static umode_t papr_nd_attribute_visible(struct kobject *kobj, struct attribute *attr, int n) { @@ -925,6 +950,7 @@ static umode_t papr_nd_attribute_visible(struct kobject *kobj, static struct attribute *papr_nd_attributes[] = { _attr_flags.attr, _attr_perf_stats.attr, + _attr_dirty_shutdown.attr, NULL, }; @@ -1149,6 +1175,10 @@ static int pap
[PATCH] powerpc/papr_scm: Add support for reporting dirty-shutdown-count
Persistent memory devices like NVDIMMs can loose cached writes in case something prevents flush on power-fail. Such situations are termed as dirty shutdown and are exposed to applications as last-shutdown-state (LSS) flag and a dirty-shutdown-counter(DSC) as described at [1]. The latter being useful in conditions where multiple applications want to detect a dirty shutdown event without racing with one another. PAPR-NVDIMMs have so far only exposed LSS style flags to indicate a dirty-shutdown-state. This patch further adds support for DSC via the "ibm,persistence-failed-count" device tree property of an NVDIMM. This property is a monotonic increasing 64-bit counter thats an indication of number of times an NVDIMM has encountered a dirty-shutdown event causing persistence loss. Since this value is not expected to change after system-boot hence papr_scm reads & caches its value during NVDIMM probe and exposes it as a PAPR sysfs attributed named 'dirty_shutdown' to match the name of similarly named NFIT sysfs attribute. Also this value is available to libnvdimm via PAPR_PDSM_HEALTH payload. 'struct nd_papr_pdsm_health' has been extended to add a new member called 'dimm_dsc' presence of which is indicated by the newly introduced PDSM_DIMM_DSC_VALID flag. References: [1] https://pmem.io/documents/Dirty_Shutdown_Handling-V1.0.pdf Signed-off-by: Vaibhav Jain --- arch/powerpc/include/uapi/asm/papr_pdsm.h | 6 + arch/powerpc/platforms/pseries/papr_scm.c | 30 +++ 2 files changed, 36 insertions(+) diff --git a/arch/powerpc/include/uapi/asm/papr_pdsm.h b/arch/powerpc/include/uapi/asm/papr_pdsm.h index 50ef95e2f5b1..82488b1e7276 100644 --- a/arch/powerpc/include/uapi/asm/papr_pdsm.h +++ b/arch/powerpc/include/uapi/asm/papr_pdsm.h @@ -77,6 +77,9 @@ /* Indicate that the 'dimm_fuel_gauge' field is valid */ #define PDSM_DIMM_HEALTH_RUN_GAUGE_VALID 1 +/* Indicate that the 'dimm_dsc' field is valid */ +#define PDSM_DIMM_DSC_VALID 2 + /* * Struct exchanged between kernel & ndctl in for PAPR_PDSM_HEALTH * Various flags indicate the health status of the dimm. @@ -105,6 +108,9 @@ struct nd_papr_pdsm_health { /* Extension flag PDSM_DIMM_HEALTH_RUN_GAUGE_VALID */ __u16 dimm_fuel_gauge; + + /* Extension flag PDSM_DIMM_DSC_VALID */ + __u64 dimm_dsc; }; __u8 buf[ND_PDSM_PAYLOAD_MAX_SIZE]; }; diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c index 11e7b90a3360..68f0d3d5e899 100644 --- a/arch/powerpc/platforms/pseries/papr_scm.c +++ b/arch/powerpc/platforms/pseries/papr_scm.c @@ -114,6 +114,9 @@ struct papr_scm_priv { /* Health information for the dimm */ u64 health_bitmap; + /* Holds the last known dirty shutdown counter value */ + u64 dirty_shutdown_counter; + /* length of the stat buffer as expected by phyp */ size_t stat_buffer_len; }; @@ -603,6 +606,16 @@ static int papr_pdsm_fuel_gauge(struct papr_scm_priv *p, return rc; } +/* Add the dirty-shutdown-counter value to the pdsm */ +static int papr_psdm_dsc(struct papr_scm_priv *p, +union nd_pdsm_payload *payload) +{ + payload->health.extension_flags |= PDSM_DIMM_DSC_VALID; + payload->health.dimm_dsc = p->dirty_shutdown_counter; + + return sizeof(struct nd_papr_pdsm_health); +} + /* Fetch the DIMM health info and populate it in provided package. */ static int papr_pdsm_health(struct papr_scm_priv *p, union nd_pdsm_payload *payload) @@ -646,6 +659,8 @@ static int papr_pdsm_health(struct papr_scm_priv *p, /* Populate the fuel gauge meter in the payload */ papr_pdsm_fuel_gauge(p, payload); + /* Populate the dirty-shutdown-counter field */ + papr_psdm_dsc(p, payload); rc = sizeof(struct nd_papr_pdsm_health); @@ -907,6 +922,16 @@ static ssize_t flags_show(struct device *dev, } DEVICE_ATTR_RO(flags); +static ssize_t dirty_shutdown_show(struct device *dev, + struct device_attribute *attr, char *buf) +{ + struct nvdimm *dimm = to_nvdimm(dev); + struct papr_scm_priv *p = nvdimm_provider_data(dimm); + + return sysfs_emit(buf, "%llu\n", p->dirty_shutdown_counter); +} +DEVICE_ATTR_RO(dirty_shutdown); + static umode_t papr_nd_attribute_visible(struct kobject *kobj, struct attribute *attr, int n) { @@ -925,6 +950,7 @@ static umode_t papr_nd_attribute_visible(struct kobject *kobj, static struct attribute *papr_nd_attributes[] = { _attr_flags.attr, _attr_perf_stats.attr, + _attr_dirty_shutdown.attr, NULL, }; @@ -1149,6 +1175,10 @@ static int papr_scm_probe(struct platform_device *pdev) p->is_volatile = !of_property_read_bool
Re: [PATCH v2] powerpc/papr_scm: Make 'perf_stats' invisible if perf-stats unavailable
Thanks for looking into this patch Dan, Dan Williams writes: > On Fri, May 7, 2021 at 4:40 AM Vaibhav Jain wrote: >> >> In case performance stats for an nvdimm are not available, reading the >> 'perf_stats' sysfs file returns an -ENOENT error. A better approach is >> to make the 'perf_stats' file entirely invisible to indicate that >> performance stats for an nvdimm are unavailable. >> >> So this patch updates 'papr_nd_attribute_group' to add a 'is_visible' >> callback implemented as newly introduced 'papr_nd_attribute_visible()' >> that returns an appropriate mode in case performance stats aren't >> supported in a given nvdimm. >> >> Also the initialization of 'papr_scm_priv.stat_buffer_len' is moved >> from papr_scm_nvdimm_init() to papr_scm_probe() so that it value is >> available when 'papr_nd_attribute_visible()' is called during nvdimm >> initialization. >> >> Fixes: 2d02bf835e57('powerpc/papr_scm: Fetch nvdimm performance stats from >> PHYP') > > Since this has been the exposed ABI since v5.9 perhaps a note / > analysis is needed here that the disappearance of this file will not > break any existing scripts/tools. > Yes, I have added a note in patch description and also updated Documentation/ABI/testing/sysfs-bus-papr-pmem with a note on when this sysfs attr would be available. Thanks for this suggestion. >> Signed-off-by: Vaibhav Jain >> --- >> Changelog: >> >> v2: >> * Removed a redundant dev_info() from pap_scm_nvdimm_init() [ Aneesh ] >> * Marked papr_nd_attribute_visible() as static which also fixed the >> build warning reported by kernel build robot >> --- >> arch/powerpc/platforms/pseries/papr_scm.c | 35 --- >> 1 file changed, 24 insertions(+), 11 deletions(-) >> >> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c >> b/arch/powerpc/platforms/pseries/papr_scm.c >> index e2b69cc3beaf..11e7b90a3360 100644 >> --- a/arch/powerpc/platforms/pseries/papr_scm.c >> +++ b/arch/powerpc/platforms/pseries/papr_scm.c >> @@ -907,6 +907,20 @@ static ssize_t flags_show(struct device *dev, >> } >> DEVICE_ATTR_RO(flags); >> >> +static umode_t papr_nd_attribute_visible(struct kobject *kobj, >> +struct attribute *attr, int n) >> +{ >> + struct device *dev = container_of(kobj, typeof(*dev), kobj); > > This can use the kobj_to_dev() helper. > Fixed in v3 >> + struct nvdimm *nvdimm = to_nvdimm(dev); >> + struct papr_scm_priv *p = nvdimm_provider_data(nvdimm); >> + >> + /* For if perf-stats not available remove perf_stats sysfs */ >> + if (attr == _attr_perf_stats.attr && p->stat_buffer_len == 0) >> + return 0; >> + >> + return attr->mode; >> +} >> + >> /* papr_scm specific dimm attributes */ >> static struct attribute *papr_nd_attributes[] = { >> _attr_flags.attr, >> @@ -916,6 +930,7 @@ static struct attribute *papr_nd_attributes[] = { >> >> static struct attribute_group papr_nd_attribute_group = { >> .name = "papr", >> + .is_visible = papr_nd_attribute_visible, >> .attrs = papr_nd_attributes, >> }; >> >> @@ -931,7 +946,6 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p) >> struct nd_region_desc ndr_desc; >> unsigned long dimm_flags; >> int target_nid, online_nid; >> - ssize_t stat_size; >> >> p->bus_desc.ndctl = papr_scm_ndctl; >> p->bus_desc.module = THIS_MODULE; >> @@ -1016,16 +1030,6 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv >> *p) >> list_add_tail(>region_list, _nd_regions); >> mutex_unlock(_ndr_lock); >> >> - /* Try retriving the stat buffer and see if its supported */ >> - stat_size = drc_pmem_query_stats(p, NULL, 0); >> - if (stat_size > 0) { >> - p->stat_buffer_len = stat_size; >> - dev_dbg(>pdev->dev, "Max perf-stat size %lu-bytes\n", >> - p->stat_buffer_len); >> - } else { >> - dev_info(>pdev->dev, "Dimm performance stats >> unavailable\n"); >> - } >> - >> return 0; >> >> err: nvdimm_bus_unregister(p->bus); >> @@ -1102,6 +1106,7 @@ static int papr_scm_probe(struct platform_device *pdev) >> u64 blocks, block_size; >> struct papr_scm_priv *p; >> const char *uuid_str; >
[PATCH v3] powerpc/papr_scm: Make 'perf_stats' invisible if perf-stats unavailable
In case performance stats for an nvdimm are not available, reading the 'perf_stats' sysfs file returns an -ENOENT error. A better approach is to make the 'perf_stats' file entirely invisible to indicate that performance stats for an nvdimm are unavailable. So this patch updates 'papr_nd_attribute_group' to add a 'is_visible' callback implemented as newly introduced 'papr_nd_attribute_visible()' that returns an appropriate mode in case performance stats aren't supported in a given nvdimm. Also the initialization of 'papr_scm_priv.stat_buffer_len' is moved from papr_scm_nvdimm_init() to papr_scm_probe() so that it value is available when 'papr_nd_attribute_visible()' is called during nvdimm initialization. Even though 'perf_stats' attribute is available since v5.9, there are no known user-space tools/scripts that are dependent on presence of its sysfs file. Hence I dont expect any user-space breakage with this patch. Fixes: 2d02bf835e57('powerpc/papr_scm: Fetch nvdimm performance stats from PHYP') Reviewed-by: Dan Williams Signed-off-by: Vaibhav Jain --- Changelog: v3: * Minor spell corrections [ Dan Williams ] * Switched to kobj_to_dev() helper in papr_nd_attribute_visible() [ Dan Williams ] * Updated ABI/../sysfs-bus-papr-pmem to indicate that the attribute is only available for devices that support performance stats. * Updated patch description. v2: * Removed a redundant dev_info() from pap_scm_nvdimm_init() [ Aneesh ] * Marked papr_nd_attribute_visible() as static which also fixed the build warning reported by kernel build robot --- Documentation/ABI/testing/sysfs-bus-papr-pmem | 8 +++-- arch/powerpc/platforms/pseries/papr_scm.c | 35 +-- 2 files changed, 29 insertions(+), 14 deletions(-) diff --git a/Documentation/ABI/testing/sysfs-bus-papr-pmem b/Documentation/ABI/testing/sysfs-bus-papr-pmem index 8316c33862a0..0aa02bf2bde5 100644 --- a/Documentation/ABI/testing/sysfs-bus-papr-pmem +++ b/Documentation/ABI/testing/sysfs-bus-papr-pmem @@ -39,9 +39,11 @@ KernelVersion: v5.9 Contact: linuxppc-dev , linux-nvd...@lists.01.org, Description: (RO) Report various performance stats related to papr-scm NVDIMM - device. Each stat is reported on a new line with each line - composed of a stat-identifier followed by it value. Below are - currently known dimm performance stats which are reported: + device. This attribute is only available for NVDIMM devices + that support reporting NVDIMM performance stats. Each stat is + reported on a new line with each line composed of a + stat-identifier followed by it value. Below are currently known + dimm performance stats which are reported: * "CtlResCt" : Controller Reset Count * "CtlResTm" : Controller Reset Elapsed Time diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c index e2b69cc3beaf..e8577e7e49ab 100644 --- a/arch/powerpc/platforms/pseries/papr_scm.c +++ b/arch/powerpc/platforms/pseries/papr_scm.c @@ -907,6 +907,20 @@ static ssize_t flags_show(struct device *dev, } DEVICE_ATTR_RO(flags); +static umode_t papr_nd_attribute_visible(struct kobject *kobj, +struct attribute *attr, int n) +{ + struct device *dev = kobj_to_dev(kobj); + struct nvdimm *nvdimm = to_nvdimm(dev); + struct papr_scm_priv *p = nvdimm_provider_data(nvdimm); + + /* For if perf-stats not available remove perf_stats sysfs */ + if (attr == _attr_perf_stats.attr && p->stat_buffer_len == 0) + return 0; + + return attr->mode; +} + /* papr_scm specific dimm attributes */ static struct attribute *papr_nd_attributes[] = { _attr_flags.attr, @@ -916,6 +930,7 @@ static struct attribute *papr_nd_attributes[] = { static struct attribute_group papr_nd_attribute_group = { .name = "papr", + .is_visible = papr_nd_attribute_visible, .attrs = papr_nd_attributes, }; @@ -931,7 +946,6 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p) struct nd_region_desc ndr_desc; unsigned long dimm_flags; int target_nid, online_nid; - ssize_t stat_size; p->bus_desc.ndctl = papr_scm_ndctl; p->bus_desc.module = THIS_MODULE; @@ -1016,16 +1030,6 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p) list_add_tail(>region_list, _nd_regions); mutex_unlock(_ndr_lock); - /* Try retriving the stat buffer and see if its supported */ - stat_size = drc_pmem_query_stats(p, NULL, 0); - if (stat_size > 0) { - p->stat_buffer_len = stat_size; - dev_dbg(>pdev->dev, "Max perf-stat size %lu-bytes\n", - p->stat_buffer_len); - } els
[PATCH v3] powerpc/papr_scm: Reduce error severity if nvdimm stats inaccessible
Currently drc_pmem_qeury_stats() generates a dev_err in case "Enable Performance Information Collection" feature is disabled from HMC or performance stats are not available for an nvdimm. The error is of the form below: papr_scm ibm,persistent-memory:ibm,pmemory@44104001: Failed to query performance stats, Err:-10 This error message confuses users as it implies a possible problem with the nvdimm even though its due to a disabled/unavailable feature. We fix this by explicitly handling the H_AUTHORITY and H_UNSUPPORTED errors from the H_SCM_PERFORMANCE_STATS hcall. In case of H_AUTHORITY error an info message is logged instead of an error, saying that "Permission denied while accessing performance stats" and an EPERM error is returned back. In case of H_UNSUPPORTED error we return a EOPNOTSUPP error back from drc_pmem_query_stats() indicating that performance stats-query operation is not supported on this nvdimm. Fixes: 2d02bf835e57('powerpc/papr_scm: Fetch nvdimm performance stats from PHYP') Signed-off-by: Vaibhav Jain --- Changelog v3: * Return EOPNOTSUPP error in case of H_UNSUPPORTED [ Ira ] * Return EPERM in case of H_AUTHORITY [ Ira ] * Updated patch description v2: * Updated the message logged in case of H_AUTHORITY error [ Ira ] * Switched from dev_warn to dev_info in case of H_AUTHORITY error. * Instead of -EPERM return -EACCESS for H_AUTHORITY error. * Added explicit handling of H_UNSUPPORTED error. --- arch/powerpc/platforms/pseries/papr_scm.c | 7 +++ 1 file changed, 7 insertions(+) diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c index ef26fe40efb0..e2b69cc3beaf 100644 --- a/arch/powerpc/platforms/pseries/papr_scm.c +++ b/arch/powerpc/platforms/pseries/papr_scm.c @@ -310,6 +310,13 @@ static ssize_t drc_pmem_query_stats(struct papr_scm_priv *p, dev_err(>pdev->dev, "Unknown performance stats, Err:0x%016lX\n", ret[0]); return -ENOENT; + } else if (rc == H_AUTHORITY) { + dev_info(>pdev->dev, +"Permission denied while accessing performance stats"); + return -EPERM; + } else if (rc == H_UNSUPPORTED) { + dev_dbg(>pdev->dev, "Performance stats unsupported\n"); + return -EOPNOTSUPP; } else if (rc != H_SUCCESS) { dev_err(>pdev->dev, "Failed to query performance stats, Err:%lld\n", rc); -- 2.31.1
Re: [PATCH v2] powerpc/papr_scm: Reduce error severity if nvdimm stats inaccessible
Hi Ira, Thanks for looking into this patch Ira Weiny writes: > On Thu, May 06, 2021 at 12:46:06AM +0530, Vaibhav Jain wrote: >> Currently drc_pmem_qeury_stats() generates a dev_err in case >> "Enable Performance Information Collection" feature is disabled from >> HMC or performance stats are not available for an nvdimm. The error is >> of the form below: >> >> papr_scm ibm,persistent-memory:ibm,pmemory@44104001: Failed to query >> performance stats, Err:-10 >> >> This error message confuses users as it implies a possible problem >> with the nvdimm even though its due to a disabled/unavailable >> feature. We fix this by explicitly handling the H_AUTHORITY and >> H_UNSUPPORTED errors from the H_SCM_PERFORMANCE_STATS hcall. >> >> In case of H_AUTHORITY error an info message is logged instead of an >> error, saying that "Permission denied while accessing performance >> stats". Also '-EACCES' error is return instead of -EPERM. > > I thought you clarified before that this was a permission issue. So why > change > the error to EACCES? > EACCESS("Permission Denied") felt like a more accurate error code for this case than EPERM("Operation not permitted"). So switched the usage of EPERM error code to handle the case if this hcall is not supported for an nvdimm. >> >> In case of H_UNSUPPORTED error we return a -EPERM error back from >> drc_pmem_query_stats() indicating that performance stats-query >> operation is not supported on this nvdimm. > > EPERM seems wrong here too... ENOTSUP? Yes, will change it to EOPNOTSUPP in v3. > > Ira > ___ > Linux-nvdimm mailing list -- linux-nvd...@lists.01.org > To unsubscribe send an email to linux-nvdimm-le...@lists.01.org -- Cheers ~ Vaibhav
Re: [PATCH] powerpc/papr_scm: Make 'perf_stats' invisible if perf-stats unavailable
"Aneesh Kumar K.V" writes: > Vaibhav Jain writes: > >> In case performance stats for an nvdimm are not available, reading the >> 'perf_stats' sysfs file returns an -ENOENT error. A better approach is >> to make the 'perf_stats' file entirely invisible to indicate that >> performance stats for an nvdimm are unavailable. >> >> So this patch updates 'papr_nd_attribute_group' to add a 'is_visible' >> callback implemented as newly introduced 'papr_nd_attribute_visible()' >> that returns an appropriate mode in case performance stats aren't >> supported in a given nvdimm. >> >> Also the initialization of 'papr_scm_priv.stat_buffer_len' is moved >> from papr_scm_nvdimm_init() to papr_scm_probe() so that it value is >> available when 'papr_nd_attribute_visible()' is called during nvdimm >> initialization. >> >> Fixes: 2d02bf835e57('powerpc/papr_scm: Fetch nvdimm performance stats from >> PHYP') >> Signed-off-by: Vaibhav Jain >> --- >> arch/powerpc/platforms/pseries/papr_scm.c | 37 --- >> 1 file changed, 26 insertions(+), 11 deletions(-) >> >> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c >> b/arch/powerpc/platforms/pseries/papr_scm.c >> index 12f1513f0fca..90f0af8fefe8 100644 >> --- a/arch/powerpc/platforms/pseries/papr_scm.c >> +++ b/arch/powerpc/platforms/pseries/papr_scm.c >> @@ -907,6 +907,20 @@ static ssize_t flags_show(struct device *dev, >> } >> DEVICE_ATTR_RO(flags); >> >> +umode_t papr_nd_attribute_visible(struct kobject *kobj, struct attribute >> *attr, >> + int n) >> +{ >> +struct device *dev = container_of(kobj, typeof(*dev), kobj); >> +struct nvdimm *nvdimm = to_nvdimm(dev); >> +struct papr_scm_priv *p = nvdimm_provider_data(nvdimm); >> + >> +/* For if perf-stats not available remove perf_stats sysfs */ >> +if (attr == _attr_perf_stats.attr && p->stat_buffer_len == 0) >> +return 0; >> + >> +return attr->mode; >> +} >> + >> /* papr_scm specific dimm attributes */ >> static struct attribute *papr_nd_attributes[] = { >> _attr_flags.attr, >> @@ -916,6 +930,7 @@ static struct attribute *papr_nd_attributes[] = { >> >> static struct attribute_group papr_nd_attribute_group = { >> .name = "papr", >> +.is_visible = papr_nd_attribute_visible, >> .attrs = papr_nd_attributes, >> }; >> >> @@ -931,7 +946,6 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p) >> struct nd_region_desc ndr_desc; >> unsigned long dimm_flags; >> int target_nid, online_nid; >> -ssize_t stat_size; >> >> p->bus_desc.ndctl = papr_scm_ndctl; >> p->bus_desc.module = THIS_MODULE; >> @@ -1016,16 +1030,6 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv >> *p) >> list_add_tail(>region_list, _nd_regions); >> mutex_unlock(_ndr_lock); >> >> -/* Try retriving the stat buffer and see if its supported */ >> -stat_size = drc_pmem_query_stats(p, NULL, 0); >> -if (stat_size > 0) { >> -p->stat_buffer_len = stat_size; >> -dev_dbg(>pdev->dev, "Max perf-stat size %lu-bytes\n", >> -p->stat_buffer_len); >> -} else { >> -dev_info(>pdev->dev, "Dimm performance stats unavailable\n"); >> -} >> - >> return 0; >> >> err:nvdimm_bus_unregister(p->bus); >> @@ -1102,6 +1106,7 @@ static int papr_scm_probe(struct platform_device *pdev) >> u64 blocks, block_size; >> struct papr_scm_priv *p; >> const char *uuid_str; >> +ssize_t stat_size; >> u64 uuid[2]; >> int rc; >> >> @@ -1179,6 +1184,16 @@ static int papr_scm_probe(struct platform_device >> *pdev) >> p->res.name = pdev->name; >> p->res.flags = IORESOURCE_MEM; >> >> +/* Try retriving the stat buffer and see if its supported */ >> +stat_size = drc_pmem_query_stats(p, NULL, 0); >> +if (stat_size > 0) { >> +p->stat_buffer_len = stat_size; >> +dev_dbg(>pdev->dev, "Max perf-stat size %lu-bytes\n", >> +p->stat_buffer_len); >> +} else { >> +dev_info(>pdev->dev, "Dimm performance stats unavailable\n"); >> +} > > With this patch > https://lore.kernel.org/linuxppc-dev/20210505191606.51666-1-vaib...@linux.ibm.com > We are adding details of whyy performance stat query hcall failed. Do we > need to print again here? Are we being more verbose here? > Yes agree this looks more verbose with the other patch you mentioned. I have sent out a v2 of this patch with this dev_info removed. > -aneesh > ___ > Linux-nvdimm mailing list -- linux-nvd...@lists.01.org > To unsubscribe send an email to linux-nvdimm-le...@lists.01.org -- Cheers ~ Vaibhav
[PATCH v2] powerpc/papr_scm: Make 'perf_stats' invisible if perf-stats unavailable
In case performance stats for an nvdimm are not available, reading the 'perf_stats' sysfs file returns an -ENOENT error. A better approach is to make the 'perf_stats' file entirely invisible to indicate that performance stats for an nvdimm are unavailable. So this patch updates 'papr_nd_attribute_group' to add a 'is_visible' callback implemented as newly introduced 'papr_nd_attribute_visible()' that returns an appropriate mode in case performance stats aren't supported in a given nvdimm. Also the initialization of 'papr_scm_priv.stat_buffer_len' is moved from papr_scm_nvdimm_init() to papr_scm_probe() so that it value is available when 'papr_nd_attribute_visible()' is called during nvdimm initialization. Fixes: 2d02bf835e57('powerpc/papr_scm: Fetch nvdimm performance stats from PHYP') Signed-off-by: Vaibhav Jain --- Changelog: v2: * Removed a redundant dev_info() from pap_scm_nvdimm_init() [ Aneesh ] * Marked papr_nd_attribute_visible() as static which also fixed the build warning reported by kernel build robot --- arch/powerpc/platforms/pseries/papr_scm.c | 35 --- 1 file changed, 24 insertions(+), 11 deletions(-) diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c index e2b69cc3beaf..11e7b90a3360 100644 --- a/arch/powerpc/platforms/pseries/papr_scm.c +++ b/arch/powerpc/platforms/pseries/papr_scm.c @@ -907,6 +907,20 @@ static ssize_t flags_show(struct device *dev, } DEVICE_ATTR_RO(flags); +static umode_t papr_nd_attribute_visible(struct kobject *kobj, +struct attribute *attr, int n) +{ + struct device *dev = container_of(kobj, typeof(*dev), kobj); + struct nvdimm *nvdimm = to_nvdimm(dev); + struct papr_scm_priv *p = nvdimm_provider_data(nvdimm); + + /* For if perf-stats not available remove perf_stats sysfs */ + if (attr == _attr_perf_stats.attr && p->stat_buffer_len == 0) + return 0; + + return attr->mode; +} + /* papr_scm specific dimm attributes */ static struct attribute *papr_nd_attributes[] = { _attr_flags.attr, @@ -916,6 +930,7 @@ static struct attribute *papr_nd_attributes[] = { static struct attribute_group papr_nd_attribute_group = { .name = "papr", + .is_visible = papr_nd_attribute_visible, .attrs = papr_nd_attributes, }; @@ -931,7 +946,6 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p) struct nd_region_desc ndr_desc; unsigned long dimm_flags; int target_nid, online_nid; - ssize_t stat_size; p->bus_desc.ndctl = papr_scm_ndctl; p->bus_desc.module = THIS_MODULE; @@ -1016,16 +1030,6 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p) list_add_tail(>region_list, _nd_regions); mutex_unlock(_ndr_lock); - /* Try retriving the stat buffer and see if its supported */ - stat_size = drc_pmem_query_stats(p, NULL, 0); - if (stat_size > 0) { - p->stat_buffer_len = stat_size; - dev_dbg(>pdev->dev, "Max perf-stat size %lu-bytes\n", - p->stat_buffer_len); - } else { - dev_info(>pdev->dev, "Dimm performance stats unavailable\n"); - } - return 0; err: nvdimm_bus_unregister(p->bus); @@ -1102,6 +1106,7 @@ static int papr_scm_probe(struct platform_device *pdev) u64 blocks, block_size; struct papr_scm_priv *p; const char *uuid_str; + ssize_t stat_size; u64 uuid[2]; int rc; @@ -1179,6 +1184,14 @@ static int papr_scm_probe(struct platform_device *pdev) p->res.name = pdev->name; p->res.flags = IORESOURCE_MEM; + /* Try retriving the stat buffer and see if its supported */ + stat_size = drc_pmem_query_stats(p, NULL, 0); + if (stat_size > 0) { + p->stat_buffer_len = stat_size; + dev_dbg(>pdev->dev, "Max perf-stat size %lu-bytes\n", + p->stat_buffer_len); + } + rc = papr_scm_nvdimm_init(p); if (rc) goto err2; -- 2.31.1
[PATCH] powerpc/papr_scm: Make 'perf_stats' invisible if perf-stats unavailable
In case performance stats for an nvdimm are not available, reading the 'perf_stats' sysfs file returns an -ENOENT error. A better approach is to make the 'perf_stats' file entirely invisible to indicate that performance stats for an nvdimm are unavailable. So this patch updates 'papr_nd_attribute_group' to add a 'is_visible' callback implemented as newly introduced 'papr_nd_attribute_visible()' that returns an appropriate mode in case performance stats aren't supported in a given nvdimm. Also the initialization of 'papr_scm_priv.stat_buffer_len' is moved from papr_scm_nvdimm_init() to papr_scm_probe() so that it value is available when 'papr_nd_attribute_visible()' is called during nvdimm initialization. Fixes: 2d02bf835e57('powerpc/papr_scm: Fetch nvdimm performance stats from PHYP') Signed-off-by: Vaibhav Jain --- arch/powerpc/platforms/pseries/papr_scm.c | 37 --- 1 file changed, 26 insertions(+), 11 deletions(-) diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c index 12f1513f0fca..90f0af8fefe8 100644 --- a/arch/powerpc/platforms/pseries/papr_scm.c +++ b/arch/powerpc/platforms/pseries/papr_scm.c @@ -907,6 +907,20 @@ static ssize_t flags_show(struct device *dev, } DEVICE_ATTR_RO(flags); +umode_t papr_nd_attribute_visible(struct kobject *kobj, struct attribute *attr, + int n) +{ + struct device *dev = container_of(kobj, typeof(*dev), kobj); + struct nvdimm *nvdimm = to_nvdimm(dev); + struct papr_scm_priv *p = nvdimm_provider_data(nvdimm); + + /* For if perf-stats not available remove perf_stats sysfs */ + if (attr == _attr_perf_stats.attr && p->stat_buffer_len == 0) + return 0; + + return attr->mode; +} + /* papr_scm specific dimm attributes */ static struct attribute *papr_nd_attributes[] = { _attr_flags.attr, @@ -916,6 +930,7 @@ static struct attribute *papr_nd_attributes[] = { static struct attribute_group papr_nd_attribute_group = { .name = "papr", + .is_visible = papr_nd_attribute_visible, .attrs = papr_nd_attributes, }; @@ -931,7 +946,6 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p) struct nd_region_desc ndr_desc; unsigned long dimm_flags; int target_nid, online_nid; - ssize_t stat_size; p->bus_desc.ndctl = papr_scm_ndctl; p->bus_desc.module = THIS_MODULE; @@ -1016,16 +1030,6 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p) list_add_tail(>region_list, _nd_regions); mutex_unlock(_ndr_lock); - /* Try retriving the stat buffer and see if its supported */ - stat_size = drc_pmem_query_stats(p, NULL, 0); - if (stat_size > 0) { - p->stat_buffer_len = stat_size; - dev_dbg(>pdev->dev, "Max perf-stat size %lu-bytes\n", - p->stat_buffer_len); - } else { - dev_info(>pdev->dev, "Dimm performance stats unavailable\n"); - } - return 0; err: nvdimm_bus_unregister(p->bus); @@ -1102,6 +1106,7 @@ static int papr_scm_probe(struct platform_device *pdev) u64 blocks, block_size; struct papr_scm_priv *p; const char *uuid_str; + ssize_t stat_size; u64 uuid[2]; int rc; @@ -1179,6 +1184,16 @@ static int papr_scm_probe(struct platform_device *pdev) p->res.name = pdev->name; p->res.flags = IORESOURCE_MEM; + /* Try retriving the stat buffer and see if its supported */ + stat_size = drc_pmem_query_stats(p, NULL, 0); + if (stat_size > 0) { + p->stat_buffer_len = stat_size; + dev_dbg(>pdev->dev, "Max perf-stat size %lu-bytes\n", + p->stat_buffer_len); + } else { + dev_info(>pdev->dev, "Dimm performance stats unavailable\n"); + } + rc = papr_scm_nvdimm_init(p); if (rc) goto err2; -- 2.31.1
[PATCH v2] powerpc/papr_scm: Reduce error severity if nvdimm stats inaccessible
Currently drc_pmem_qeury_stats() generates a dev_err in case "Enable Performance Information Collection" feature is disabled from HMC or performance stats are not available for an nvdimm. The error is of the form below: papr_scm ibm,persistent-memory:ibm,pmemory@44104001: Failed to query performance stats, Err:-10 This error message confuses users as it implies a possible problem with the nvdimm even though its due to a disabled/unavailable feature. We fix this by explicitly handling the H_AUTHORITY and H_UNSUPPORTED errors from the H_SCM_PERFORMANCE_STATS hcall. In case of H_AUTHORITY error an info message is logged instead of an error, saying that "Permission denied while accessing performance stats". Also '-EACCES' error is return instead of -EPERM. In case of H_UNSUPPORTED error we return a -EPERM error back from drc_pmem_query_stats() indicating that performance stats-query operation is not supported on this nvdimm. Fixes: 2d02bf835e57('powerpc/papr_scm: Fetch nvdimm performance stats from PHYP') Signed-off-by: Vaibhav Jain --- Changelog v2: * Updated the message logged in case of H_AUTHORITY error [ Ira ] * Switched from dev_warn to dev_info in case of H_AUTHORITY error. * Instead of -EPERM return -EACCESS for H_AUTHORITY error. * Added explicit handling of H_UNSUPPORTED error. --- arch/powerpc/platforms/pseries/papr_scm.c | 7 +++ 1 file changed, 7 insertions(+) diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c index ef26fe40efb0..12f1513f0fca 100644 --- a/arch/powerpc/platforms/pseries/papr_scm.c +++ b/arch/powerpc/platforms/pseries/papr_scm.c @@ -310,6 +310,13 @@ static ssize_t drc_pmem_query_stats(struct papr_scm_priv *p, dev_err(>pdev->dev, "Unknown performance stats, Err:0x%016lX\n", ret[0]); return -ENOENT; + } else if (rc == H_AUTHORITY) { + dev_info(>pdev->dev, +"Permission denied while accessing performance stats"); + return -EACCES; + } else if (rc == H_UNSUPPORTED) { + dev_dbg(>pdev->dev, "Performance stats unsupported\n"); + return -EPERM; } else if (rc != H_SUCCESS) { dev_err(>pdev->dev, "Failed to query performance stats, Err:%lld\n", rc); -- 2.31.1
Re: [PATCH v1 1/1] powerpc/papr_scm: Properly handle UUID types and API
Andy Shevchenko writes: > On Thu, Apr 15, 2021 at 8:10 PM Vaibhav Jain wrote: >> >> >> Thanks for the patch Andy, >> >> Unfortunately ran into a compilation issue due to missing "#include >> " that provides definition for >> get_unaligned_le64(). Gcc reported following error: >> >> error: implicit declaration of function ‘get_unaligned_le64’ > > Right, I have not tested it (as mentioned in the comments to the patch) > >> After including the necessary header file, kernel compiled fine and I >> was able to test & verify the patch. > > Thank you very much for the testing. > > I'm not sure what the coverage of your test is. Your patch updates the way the interleaved set-cookies are populated in papr_scm which are then used to populate label entry for a namespace. I verified that the reported region setcookie hasnt changed for an nvdimm region before and after applying your patch for both BE and LE variants: # 64-bit Little endian kernel before applying the patch $ sudo cat /sys/devices/ndbus0/region0/set_cookie 0x8b6b26cbc930e2b5 # 64-bit Little endian kernel after applying your patch $ sudo cat /sys/devices/ndbus0/region0/set_cookie 0x8b6b26cbc930e2b5 # 64-bit Big endian kernel before applying your patch $ sudo cat /sys/devices/ndbus0/region0/set_cookie 0x8b6b26cbc930e2b5 # 64-bit Big endian kernel after applying your patch $ sudo cat /sys/devices/ndbus0/region0/set_cookie 0x8b6b26cbc930e2b5 > That's why I have an > additional question below. Is the byte ordering kept the same in BE > (32- and 64-bit) cases? Because I'm worrying that I might have missed > something. Libnvdimm store these cookies in label area as little endian values and based on the results above I think we are good. > > > -- > With Best Regards, > Andy Shevchenko -- Cheers ~ Vaibhav
Re: [PATCH v1 1/1] powerpc/papr_scm: Properly handle UUID types and API
Thanks for the patch Andy, Unfortunately ran into a compilation issue due to missing "#include " that provides definition for get_unaligned_le64(). Gcc reported following error: error: implicit declaration of function ‘get_unaligned_le64’ After including the necessary header file, kernel compiled fine and I was able to test & verify the patch. -- Cheers ~ Vaibhav Andy Shevchenko writes: > Parse to and export from UUID own type, before dereferencing. > This also fixes wrong comment (Little Endian UUID is something else) > and should fix Sparse warnings about assigning strict types to POD. > > Fixes: 43001c52b603 ("powerpc/papr_scm: Use ibm,unit-guid as the iset cookie") > Fixes: 259a948c4ba1 ("powerpc/pseries/scm: Use a specific endian format for > storing uuid from the device tree") > Cc: Oliver O'Halloran > Cc: Aneesh Kumar K.V > Signed-off-by: Andy Shevchenko > --- > Not tested > arch/powerpc/platforms/pseries/papr_scm.c | 13 - > 1 file changed, 8 insertions(+), 5 deletions(-) > > diff --git a/arch/powerpc/platforms/pseries/papr_scm.c > b/arch/powerpc/platforms/pseries/papr_scm.c > index ae6f5d80d5ce..4366e1902890 100644 > --- a/arch/powerpc/platforms/pseries/papr_scm.c > +++ b/arch/powerpc/platforms/pseries/papr_scm.c > @@ -1085,8 +1085,9 @@ static int papr_scm_probe(struct platform_device *pdev) > u32 drc_index, metadata_size; > u64 blocks, block_size; > struct papr_scm_priv *p; > + u8 uuid_raw[UUID_SIZE]; > const char *uuid_str; > - u64 uuid[2]; > + uuid_t uuid; > int rc; > > /* check we have all the required DT properties */ > @@ -1129,16 +1130,18 @@ static int papr_scm_probe(struct platform_device > *pdev) > p->hcall_flush_required = of_property_read_bool(dn, > "ibm,hcall-flush-required"); > > /* We just need to ensure that set cookies are unique across */ > - uuid_parse(uuid_str, (uuid_t *) uuid); > + uuid_parse(uuid_str, ); > + > /* >* cookie1 and cookie2 are not really little endian > - * we store a little endian representation of the > + * we store a raw buffer representation of the >* uuid str so that we can compare this with the label >* area cookie irrespective of the endian config with which >* the kernel is built. >*/ > - p->nd_set.cookie1 = cpu_to_le64(uuid[0]); > - p->nd_set.cookie2 = cpu_to_le64(uuid[1]); > + export_uuid(uuid_raw, ); > + p->nd_set.cookie1 = get_unaligned_le64(_raw[0]); > + p->nd_set.cookie2 = get_unaligned_le64(_raw[8]); > > /* might be zero */ > p->metadata_size = metadata_size; > -- > 2.30.2 >
Re: [PATCH] powerpc/papr_scm: Reduce error severity if nvdimm stats inaccessible
Ira Weiny writes: > On Wed, Apr 14, 2021 at 09:51:40PM +0530, Vaibhav Jain wrote: >> Thanks for looking into this patch Ira, >> >> Ira Weiny writes: >> >> > On Wed, Apr 14, 2021 at 06:10:26PM +0530, Vaibhav Jain wrote: >> >> Currently drc_pmem_qeury_stats() generates a dev_err in case >> >> "Enable Performance Information Collection" feature is disabled from >> >> HMC. The error is of the form below: >> >> >> >> papr_scm ibm,persistent-memory:ibm,pmemory@44104001: Failed to query >> >>performance stats, Err:-10 >> >> >> >> This error message confuses users as it implies a possible problem >> >> with the nvdimm even though its due to a disabled feature. >> >> >> >> So we fix this by explicitly handling the H_AUTHORITY error from the >> >> H_SCM_PERFORMANCE_STATS hcall and generating a warning instead of an >> >> error, saying that "Performance stats in-accessible". >> >> >> >> Fixes: 2d02bf835e57('powerpc/papr_scm: Fetch nvdimm performance stats >> >> from PHYP') >> >> Signed-off-by: Vaibhav Jain >> >> --- >> >> arch/powerpc/platforms/pseries/papr_scm.c | 3 +++ >> >> 1 file changed, 3 insertions(+) >> >> >> >> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c >> >> b/arch/powerpc/platforms/pseries/papr_scm.c >> >> index 835163f54244..9216424f8be3 100644 >> >> --- a/arch/powerpc/platforms/pseries/papr_scm.c >> >> +++ b/arch/powerpc/platforms/pseries/papr_scm.c >> >> @@ -277,6 +277,9 @@ static ssize_t drc_pmem_query_stats(struct >> >> papr_scm_priv *p, >> >> dev_err(>pdev->dev, >> >> "Unknown performance stats, Err:0x%016lX\n", ret[0]); >> >> return -ENOENT; >> >> + } else if (rc == H_AUTHORITY) { >> >> + dev_warn(>pdev->dev, "Performance stats in-accessible"); >> >> + return -EPERM; >> > >> > Is this because of a disabled feature or because of permissions? >> >> Its because of a disabled feature that revokes permission for a guest to >> retrieve performance statistics. >> >> The feature is called "Enable Performance Information Collection" and >> once disabled the hcall H_SCM_PERFORMANCE_STATS returns an error >> H_AUTHORITY indicating that the guest doesn't have permission to retrieve >> performance statistics. > > In that case would it be appropriate to have the error message indicate a > permission issue? > > Something like 'permission denied'? Yes, Something like "Permission denied while accessing performance stats" might be more clear and intuitive. Will update the warn message in v2. > > Ira > -- Cheers ~ Vaibhav
Re: [PATCH] powerpc/papr_scm: Reduce error severity if nvdimm stats inaccessible
Thanks for looking into this Dan, Dan Williams writes: > On Wed, Apr 14, 2021 at 5:40 AM Vaibhav Jain wrote: >> >> Currently drc_pmem_qeury_stats() generates a dev_err in case >> "Enable Performance Information Collection" feature is disabled from >> HMC. The error is of the form below: >> >> papr_scm ibm,persistent-memory:ibm,pmemory@44104001: Failed to query >> performance stats, Err:-10 >> >> This error message confuses users as it implies a possible problem >> with the nvdimm even though its due to a disabled feature. >> >> So we fix this by explicitly handling the H_AUTHORITY error from the >> H_SCM_PERFORMANCE_STATS hcall and generating a warning instead of an >> error, saying that "Performance stats in-accessible". >> >> Fixes: 2d02bf835e57('powerpc/papr_scm: Fetch nvdimm performance stats from >> PHYP') >> Signed-off-by: Vaibhav Jain >> --- >> arch/powerpc/platforms/pseries/papr_scm.c | 3 +++ >> 1 file changed, 3 insertions(+) >> >> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c >> b/arch/powerpc/platforms/pseries/papr_scm.c >> index 835163f54244..9216424f8be3 100644 >> --- a/arch/powerpc/platforms/pseries/papr_scm.c >> +++ b/arch/powerpc/platforms/pseries/papr_scm.c >> @@ -277,6 +277,9 @@ static ssize_t drc_pmem_query_stats(struct papr_scm_priv >> *p, >> dev_err(>pdev->dev, >> "Unknown performance stats, Err:0x%016lX\n", ret[0]); >> return -ENOENT; >> + } else if (rc == H_AUTHORITY) { >> + dev_warn(>pdev->dev, "Performance stats in-accessible"); >> + return -EPERM; > > So userspace can spam the kernel log? Why is kernel log message needed > at all? EPERM told the caller what happened. Currently this error message is only reported during probe of the nvdimm. So userspace cannot directly spam kernel log. The callsite for this function in papr_scm_nvdimm_init() doesnt handle specific error codes. Instead in case of an error it only reports that "Dimm performance stats are unavailable". The log message just preceeding that mentions the real cause of failure. Thats why just returning -EPERM wont be usefui. Alternatively I can update papr_scm_nvdimm_init() to report the error code returned from drc_pmem_query_stats(). -- Cheers ~ Vaibhav
Re: [PATCH] powerpc/papr_scm: Reduce error severity if nvdimm stats inaccessible
Thanks for looking into this patch Ira, Ira Weiny writes: > On Wed, Apr 14, 2021 at 06:10:26PM +0530, Vaibhav Jain wrote: >> Currently drc_pmem_qeury_stats() generates a dev_err in case >> "Enable Performance Information Collection" feature is disabled from >> HMC. The error is of the form below: >> >> papr_scm ibm,persistent-memory:ibm,pmemory@44104001: Failed to query >> performance stats, Err:-10 >> >> This error message confuses users as it implies a possible problem >> with the nvdimm even though its due to a disabled feature. >> >> So we fix this by explicitly handling the H_AUTHORITY error from the >> H_SCM_PERFORMANCE_STATS hcall and generating a warning instead of an >> error, saying that "Performance stats in-accessible". >> >> Fixes: 2d02bf835e57('powerpc/papr_scm: Fetch nvdimm performance stats from >> PHYP') >> Signed-off-by: Vaibhav Jain >> --- >> arch/powerpc/platforms/pseries/papr_scm.c | 3 +++ >> 1 file changed, 3 insertions(+) >> >> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c >> b/arch/powerpc/platforms/pseries/papr_scm.c >> index 835163f54244..9216424f8be3 100644 >> --- a/arch/powerpc/platforms/pseries/papr_scm.c >> +++ b/arch/powerpc/platforms/pseries/papr_scm.c >> @@ -277,6 +277,9 @@ static ssize_t drc_pmem_query_stats(struct papr_scm_priv >> *p, >> dev_err(>pdev->dev, >> "Unknown performance stats, Err:0x%016lX\n", ret[0]); >> return -ENOENT; >> +} else if (rc == H_AUTHORITY) { >> +dev_warn(>pdev->dev, "Performance stats in-accessible"); >> +return -EPERM; > > Is this because of a disabled feature or because of permissions? Its because of a disabled feature that revokes permission for a guest to retrieve performance statistics. The feature is called "Enable Performance Information Collection" and once disabled the hcall H_SCM_PERFORMANCE_STATS returns an error H_AUTHORITY indicating that the guest doesn't have permission to retrieve performance statistics. > > Ira > >> } else if (rc != H_SUCCESS) { >> dev_err(>pdev->dev, >> "Failed to query performance stats, Err:%lld\n", rc); >> -- >> 2.30.2 >> -- Cheers ~ Vaibhav
[PATCH] powerpc/papr_scm: Reduce error severity if nvdimm stats inaccessible
Currently drc_pmem_qeury_stats() generates a dev_err in case "Enable Performance Information Collection" feature is disabled from HMC. The error is of the form below: papr_scm ibm,persistent-memory:ibm,pmemory@44104001: Failed to query performance stats, Err:-10 This error message confuses users as it implies a possible problem with the nvdimm even though its due to a disabled feature. So we fix this by explicitly handling the H_AUTHORITY error from the H_SCM_PERFORMANCE_STATS hcall and generating a warning instead of an error, saying that "Performance stats in-accessible". Fixes: 2d02bf835e57('powerpc/papr_scm: Fetch nvdimm performance stats from PHYP') Signed-off-by: Vaibhav Jain --- arch/powerpc/platforms/pseries/papr_scm.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c index 835163f54244..9216424f8be3 100644 --- a/arch/powerpc/platforms/pseries/papr_scm.c +++ b/arch/powerpc/platforms/pseries/papr_scm.c @@ -277,6 +277,9 @@ static ssize_t drc_pmem_query_stats(struct papr_scm_priv *p, dev_err(>pdev->dev, "Unknown performance stats, Err:0x%016lX\n", ret[0]); return -ENOENT; + } else if (rc == H_AUTHORITY) { + dev_warn(>pdev->dev, "Performance stats in-accessible"); + return -EPERM; } else if (rc != H_SUCCESS) { dev_err(>pdev->dev, "Failed to query performance stats, Err:%lld\n", rc); -- 2.30.2
Re: [PATCH v3] powerpc/papr_scm: Implement support for H_SCM_FLUSH hcall
"Aneesh Kumar K.V" writes: > Vaibhav Jain writes: > >> Hi Shiva, >> >> Apologies for a late review but something just caught my eye while >> working on a different patch. >> >> Shivaprasad G Bhat writes: >> >>> Add support for ND_REGION_ASYNC capability if the device tree >>> indicates 'ibm,hcall-flush-required' property in the NVDIMM node. >>> Flush is done by issuing H_SCM_FLUSH hcall to the hypervisor. >>> >>> If the flush request failed, the hypervisor is expected to >>> to reflect the problem in the subsequent nvdimm H_SCM_HEALTH call. >>> >>> This patch prevents mmap of namespaces with MAP_SYNC flag if the >>> nvdimm requires an explicit flush[1]. >>> >>> References: >>> [1] >>> https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master/memory/ndctl.py.data/map_sync.c >>> >>> Signed-off-by: Shivaprasad G Bhat >>> --- >>> v2 - https://www.spinics.net/lists/kvm-ppc/msg18799.html >>> Changes from v2: >>>- Fixed the commit message. >>>- Add dev_dbg before the H_SCM_FLUSH hcall >>> >>> v1 - https://www.spinics.net/lists/kvm-ppc/msg18272.html >>> Changes from v1: >>>- Hcall semantics finalized, all changes are to accomodate them. >>> >>> Documentation/powerpc/papr_hcalls.rst | 14 ++ >>> arch/powerpc/include/asm/hvcall.h |3 +- >>> arch/powerpc/platforms/pseries/papr_scm.c | 40 >>> + >>> 3 files changed, 56 insertions(+), 1 deletion(-) >>> >>> diff --git a/Documentation/powerpc/papr_hcalls.rst >>> b/Documentation/powerpc/papr_hcalls.rst >>> index 48fcf1255a33..648f278eea8f 100644 >>> --- a/Documentation/powerpc/papr_hcalls.rst >>> +++ b/Documentation/powerpc/papr_hcalls.rst >>> @@ -275,6 +275,20 @@ Health Bitmap Flags: >>> Given a DRC Index collect the performance statistics for NVDIMM and copy >>> them >>> to the resultBuffer. >>> >>> +**H_SCM_FLUSH** >>> + >>> +| Input: *drcIndex, continue-token* >>> +| Out: *continue-token* >>> +| Return Value: *H_SUCCESS, H_Parameter, H_P2, H_BUSY* >>> + >>> +Given a DRC Index Flush the data to backend NVDIMM device. >>> + >>> +The hcall returns H_BUSY when the flush takes longer time and the hcall >>> needs >>> +to be issued multiple times in order to be completely serviced. The >>> +*continue-token* from the output to be passed in the argument list of >>> +subsequent hcalls to the hypervisor until the hcall is completely serviced >>> +at which point H_SUCCESS or other error is returned by the hypervisor. >>> + >> Does the hcall semantic also include measures to trigger a barrier/fence >> (like pm_wmb()) so that all the stores before the hcall are gauranteed >> to have hit the pmem controller ? > > It is upto the hypervisor to implement the right sequence to ensure the > guarantee. The hcall doesn't expect the store to reach the platform > buffers. Since the asyc_flush function is also called for performing deep_flush from libnvdimm and as the hcall doesnt gaurantee stores to reach the platform buffers, hence papr_scm_pmem_flush() should issue pm_wmb() before the hcall. This would ensure papr_scm_pmem_flush() semantics are consistent that to generic_nvdimm_flush(). Also, adding the statement "The hcall doesn't expect the store to reach the platform buffers" to the hcall documentation would be good to have. > > >> >> If not then the papr_scm_pmem_flush() introduced below should issue a >> fence like pm_wmb() before issuing the flush hcall. >> >> If yes then this behaviour should also be documented with the hcall >> semantics above. >> > > IIUC fdatasync on the backend file is enough to ensure the > guaraantee. Such a request will have REQ_PREFLUSH and REQ_FUA which will > do the necessary barrier on the backing device holding the backend file. > Right, but thats assuming nvdimm is backed by a regular file in hypervisor which may not always be the case. > -aneesh -- Cheers ~ Vaibhav
Re: [PATCH v3] powerpc/papr_scm: Implement support for H_SCM_FLUSH hcall
Hi Shiva, Apologies for a late review but something just caught my eye while working on a different patch. Shivaprasad G Bhat writes: > Add support for ND_REGION_ASYNC capability if the device tree > indicates 'ibm,hcall-flush-required' property in the NVDIMM node. > Flush is done by issuing H_SCM_FLUSH hcall to the hypervisor. > > If the flush request failed, the hypervisor is expected to > to reflect the problem in the subsequent nvdimm H_SCM_HEALTH call. > > This patch prevents mmap of namespaces with MAP_SYNC flag if the > nvdimm requires an explicit flush[1]. > > References: > [1] > https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master/memory/ndctl.py.data/map_sync.c > > Signed-off-by: Shivaprasad G Bhat > --- > v2 - https://www.spinics.net/lists/kvm-ppc/msg18799.html > Changes from v2: >- Fixed the commit message. >- Add dev_dbg before the H_SCM_FLUSH hcall > > v1 - https://www.spinics.net/lists/kvm-ppc/msg18272.html > Changes from v1: >- Hcall semantics finalized, all changes are to accomodate them. > > Documentation/powerpc/papr_hcalls.rst | 14 ++ > arch/powerpc/include/asm/hvcall.h |3 +- > arch/powerpc/platforms/pseries/papr_scm.c | 40 > + > 3 files changed, 56 insertions(+), 1 deletion(-) > > diff --git a/Documentation/powerpc/papr_hcalls.rst > b/Documentation/powerpc/papr_hcalls.rst > index 48fcf1255a33..648f278eea8f 100644 > --- a/Documentation/powerpc/papr_hcalls.rst > +++ b/Documentation/powerpc/papr_hcalls.rst > @@ -275,6 +275,20 @@ Health Bitmap Flags: > Given a DRC Index collect the performance statistics for NVDIMM and copy them > to the resultBuffer. > > +**H_SCM_FLUSH** > + > +| Input: *drcIndex, continue-token* > +| Out: *continue-token* > +| Return Value: *H_SUCCESS, H_Parameter, H_P2, H_BUSY* > + > +Given a DRC Index Flush the data to backend NVDIMM device. > + > +The hcall returns H_BUSY when the flush takes longer time and the hcall needs > +to be issued multiple times in order to be completely serviced. The > +*continue-token* from the output to be passed in the argument list of > +subsequent hcalls to the hypervisor until the hcall is completely serviced > +at which point H_SUCCESS or other error is returned by the hypervisor. > + Does the hcall semantic also include measures to trigger a barrier/fence (like pm_wmb()) so that all the stores before the hcall are gauranteed to have hit the pmem controller ? If not then the papr_scm_pmem_flush() introduced below should issue a fence like pm_wmb() before issuing the flush hcall. If yes then this behaviour should also be documented with the hcall semantics above. > References > == > .. [1] "Power Architecture Platform Reference" > diff --git a/arch/powerpc/include/asm/hvcall.h > b/arch/powerpc/include/asm/hvcall.h > index ed6086d57b22..9f7729a97ebd 100644 > --- a/arch/powerpc/include/asm/hvcall.h > +++ b/arch/powerpc/include/asm/hvcall.h > @@ -315,7 +315,8 @@ > #define H_SCM_HEALTH0x400 > #define H_SCM_PERFORMANCE_STATS 0x418 > #define H_RPT_INVALIDATE 0x448 > -#define MAX_HCALL_OPCODE H_RPT_INVALIDATE > +#define H_SCM_FLUSH 0x44C > +#define MAX_HCALL_OPCODE H_SCM_FLUSH > > /* Scope args for H_SCM_UNBIND_ALL */ > #define H_UNBIND_SCOPE_ALL (0x1) > diff --git a/arch/powerpc/platforms/pseries/papr_scm.c > b/arch/powerpc/platforms/pseries/papr_scm.c > index 835163f54244..b7a47fcc5aa5 100644 > --- a/arch/powerpc/platforms/pseries/papr_scm.c > +++ b/arch/powerpc/platforms/pseries/papr_scm.c > @@ -93,6 +93,7 @@ struct papr_scm_priv { > uint64_t block_size; > int metadata_size; > bool is_volatile; > + bool hcall_flush_required; > > uint64_t bound_addr; > > @@ -117,6 +118,39 @@ struct papr_scm_priv { > size_t stat_buffer_len; > }; > > +static int papr_scm_pmem_flush(struct nd_region *nd_region, > +struct bio *bio __maybe_unused) > +{ > + struct papr_scm_priv *p = nd_region_provider_data(nd_region); > + unsigned long ret_buf[PLPAR_HCALL_BUFSIZE]; > + uint64_t token = 0; > + int64_t rc; > + > + dev_dbg(>pdev->dev, "flush drc 0x%x", p->drc_index); > + > + do { > + rc = plpar_hcall(H_SCM_FLUSH, ret_buf, p->drc_index, token); > + token = ret_buf[0]; > + > + /* Check if we are stalled for some time */ > + if (H_IS_LONG_BUSY(rc)) { > + msleep(get_longbusy_msecs(rc)); > + rc = H_BUSY; > + } else if (rc == H_BUSY) { > + cond_resched(); > + } > + } while (rc == H_BUSY); > + > + if (rc) { > + dev_err(>pdev->dev, "flush error: %lld", rc); > + rc = -EIO; > + } else { > + dev_dbg(>pdev->dev, "flush drc 0x%x complete", p->drc_index); > + } > + > + return rc; > +} > + > static LIST_HEAD(papr_nd_regions); >
Re: [PATCH] powerpc/mm: Add cond_resched() while removing hpte mappings
Hi Mpe, Thanks for looking into this patch. Michael Ellerman writes: > Vaibhav Jain writes: >> While removing large number of mappings from hash page tables for >> large memory systems as soft-lockup is reported because of the time >> spent inside htap_remove_mapping() like one below: >> >> watchdog: BUG: soft lockup - CPU#8 stuck for 23s! >> >> NIP plpar_hcall+0x38/0x58 >> LR pSeries_lpar_hpte_invalidate+0x68/0xb0 >> Call Trace: >> 0x1fff000 (unreliable) >> pSeries_lpar_hpte_removebolted+0x9c/0x230 >> hash__remove_section_mapping+0xec/0x1c0 >> remove_section_mapping+0x28/0x3c >> arch_remove_memory+0xfc/0x150 >> devm_memremap_pages_release+0x180/0x2f0 >> devm_action_release+0x30/0x50 >> release_nodes+0x28c/0x300 >> device_release_driver_internal+0x16c/0x280 >> unbind_store+0x124/0x170 >> drv_attr_store+0x44/0x60 >> sysfs_kf_write+0x64/0x90 >> kernfs_fop_write+0x1b0/0x290 >> __vfs_write+0x3c/0x70 >> vfs_write+0xd4/0x270 >> ksys_write+0xdc/0x130 >> system_call+0x5c/0x70 >> >> Fix this by adding a cond_resched() to the loop in >> htap_remove_mapping() that issues hcall to remove hpte mapping. This >> should prevent the soft-lockup from being reported. > > Can/should we also/instead be using H_BLOCK_REMOVE? > > cheers Current mmp_ops implementation seems to use H_BULK_REMOVE for hugepages so for removing mappings for regular pages I am looking into adding a new mmu_op that can take a range to be unmapped and I did try implmenting a new mmu_op for this which can reduce the number of hash_pte lookups needed to invalidate this range. But that would need some more work so as a stop gap I have sent out a v2 with Christophe's suggestion to add a cond_resched() every HZ interval. -- Cheers ~ Vaibhav
[PATCH v2] powerpc/mm: Add cond_resched() while removing hpte mappings
While removing large number of mappings from hash page tables for large memory systems as soft-lockup is reported because of the time spent inside htap_remove_mapping() like one below: watchdog: BUG: soft lockup - CPU#8 stuck for 23s! NIP plpar_hcall+0x38/0x58 LR pSeries_lpar_hpte_invalidate+0x68/0xb0 Call Trace: 0x1fff000 (unreliable) pSeries_lpar_hpte_removebolted+0x9c/0x230 hash__remove_section_mapping+0xec/0x1c0 remove_section_mapping+0x28/0x3c arch_remove_memory+0xfc/0x150 devm_memremap_pages_release+0x180/0x2f0 devm_action_release+0x30/0x50 release_nodes+0x28c/0x300 device_release_driver_internal+0x16c/0x280 unbind_store+0x124/0x170 drv_attr_store+0x44/0x60 sysfs_kf_write+0x64/0x90 kernfs_fop_write+0x1b0/0x290 __vfs_write+0x3c/0x70 vfs_write+0xd4/0x270 ksys_write+0xdc/0x130 system_call+0x5c/0x70 Fix this by adding a cond_resched() to the loop in htap_remove_mapping() that issues hcall to remove hpte mapping. The call to cond_resched() is issued every HZ jiffies which should prevent the soft-lockup from being reported. Suggested-by: Aneesh Kumar K.V Signed-off-by: Vaibhav Jain --- Changelog: v2: Issue cond_resched() every HZ jiffies instead of each iteration of the loop. [ Christophe Leroy ] --- arch/powerpc/mm/book3s64/hash_utils.c | 13 - 1 file changed, 12 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/mm/book3s64/hash_utils.c b/arch/powerpc/mm/book3s64/hash_utils.c index 581b20a2feaf..286e7e8cb919 100644 --- a/arch/powerpc/mm/book3s64/hash_utils.c +++ b/arch/powerpc/mm/book3s64/hash_utils.c @@ -338,7 +338,7 @@ int htab_bolt_mapping(unsigned long vstart, unsigned long vend, int htab_remove_mapping(unsigned long vstart, unsigned long vend, int psize, int ssize) { - unsigned long vaddr; + unsigned long vaddr, time_limit; unsigned int step, shift; int rc; int ret = 0; @@ -351,8 +351,19 @@ int htab_remove_mapping(unsigned long vstart, unsigned long vend, /* Unmap the full range specificied */ vaddr = ALIGN_DOWN(vstart, step); + time_limit = jiffies + HZ; + for (;vaddr < vend; vaddr += step) { rc = mmu_hash_ops.hpte_removebolted(vaddr, psize, ssize); + + /* +* For large number of mappings introduce a cond_resched() +* to prevent softlockup warnings. +*/ + if (time_after(jiffies, time_limit)) { + cond_resched(); + time_limit = jiffies + HZ; + } if (rc == -ENOENT) { ret = -ENOENT; continue; -- 2.30.2
[PATCH] powerpc/papr_scm: Mark nvdimm as unarmed if needed during probe
In case an nvdimm is found to be unarmed during probe then set its NDD_UNARMED flag before nvdimm_create(). This would enforce a read-only access to the ndimm region. Presently even if an nvdimm is unarmed its not marked as read-only on ppc64 guests. The patch updates papr_scm_nvdimm_init() to force query of nvdimm health via __drc_pmem_query_health() and if nvdimm is found to be unarmed then set the nvdimm flag ND_UNARMED for nvdimm_create(). Signed-off-by: Vaibhav Jain --- arch/powerpc/platforms/pseries/papr_scm.c | 9 + 1 file changed, 9 insertions(+) diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c index 835163f54244..7e8168e19427 100644 --- a/arch/powerpc/platforms/pseries/papr_scm.c +++ b/arch/powerpc/platforms/pseries/papr_scm.c @@ -914,6 +914,15 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p) dimm_flags = 0; set_bit(NDD_LABELING, _flags); + /* +* Check if the nvdimm is unarmed. No locking needed as we are still +* initializing. Ignore error encountered if any. +*/ + __drc_pmem_query_health(p); + + if (p->health_bitmap & PAPR_PMEM_UNARMED_MASK) + set_bit(NDD_UNARMED, _flags); + p->nvdimm = nvdimm_create(p->bus, p, papr_nd_attr_groups, dimm_flags, PAPR_SCM_DIMM_CMD_MASK, 0, NULL); if (!p->nvdimm) { -- 2.30.2
Re: [PATCH v2] powerpc/papr_scm: Implement support for H_SCM_FLUSH hcall
Hi Shiva, Thanks for the patch. Few minor review comments: Shivaprasad G Bhat writes: > Add support for ND_REGION_ASYNC capability if the device tree > indicates 'ibm,hcall-flush-required' property in the NVDIMM node. > Flush is done by issuing H_SCM_FLUSH hcall to the hypervisor. > > If the flush request failed, the hypervisor is expected to > to reflect the problem in the subsequent dimm health request call. s/dimm/nvdimm s/health request call/H_SCM_HEALTH hcall/ > > This patch prevents mmap of namespaces with MAP_SYNC flag if the > nvdimm requires explicit flush[1]. s/explicit/an explicit/ > > References: > [1] > https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master/memory/ndctl.py.data/map_sync.c > > Signed-off-by: Shivaprasad G Bhat > --- > v1 - https://www.spinics.net/lists/kvm-ppc/msg18272.html > Changes from v1: >- Hcall semantics finalized, all changes are to accomodate them. > > Documentation/powerpc/papr_hcalls.rst | 14 ++ > arch/powerpc/include/asm/hvcall.h |3 +- > arch/powerpc/platforms/pseries/papr_scm.c | 39 > + > 3 files changed, 55 insertions(+), 1 deletion(-) > > diff --git a/Documentation/powerpc/papr_hcalls.rst > b/Documentation/powerpc/papr_hcalls.rst > index 48fcf1255a33..648f278eea8f 100644 > --- a/Documentation/powerpc/papr_hcalls.rst > +++ b/Documentation/powerpc/papr_hcalls.rst > @@ -275,6 +275,20 @@ Health Bitmap Flags: > Given a DRC Index collect the performance statistics for NVDIMM and copy them > to the resultBuffer. > > +**H_SCM_FLUSH** > + > +| Input: *drcIndex, continue-token* > +| Out: *continue-token* > +| Return Value: *H_SUCCESS, H_Parameter, H_P2, H_BUSY* > + > +Given a DRC Index Flush the data to backend NVDIMM device. > + > +The hcall returns H_BUSY when the flush takes longer time and the hcall needs > +to be issued multiple times in order to be completely serviced. The > +*continue-token* from the output to be passed in the argument list of > +subsequent hcalls to the hypervisor until the hcall is completely serviced > +at which point H_SUCCESS or other error is returned by the hypervisor. > + > References > == > .. [1] "Power Architecture Platform Reference" > diff --git a/arch/powerpc/include/asm/hvcall.h > b/arch/powerpc/include/asm/hvcall.h > index ed6086d57b22..9f7729a97ebd 100644 > --- a/arch/powerpc/include/asm/hvcall.h > +++ b/arch/powerpc/include/asm/hvcall.h > @@ -315,7 +315,8 @@ > #define H_SCM_HEALTH0x400 > #define H_SCM_PERFORMANCE_STATS 0x418 > #define H_RPT_INVALIDATE 0x448 > -#define MAX_HCALL_OPCODE H_RPT_INVALIDATE > +#define H_SCM_FLUSH 0x44C > +#define MAX_HCALL_OPCODE H_SCM_FLUSH > > /* Scope args for H_SCM_UNBIND_ALL */ > #define H_UNBIND_SCOPE_ALL (0x1) > diff --git a/arch/powerpc/platforms/pseries/papr_scm.c > b/arch/powerpc/platforms/pseries/papr_scm.c > index 835163f54244..f0407e135410 100644 > --- a/arch/powerpc/platforms/pseries/papr_scm.c > +++ b/arch/powerpc/platforms/pseries/papr_scm.c > @@ -93,6 +93,7 @@ struct papr_scm_priv { > uint64_t block_size; > int metadata_size; > bool is_volatile; > + bool hcall_flush_required; > > uint64_t bound_addr; > > @@ -117,6 +118,38 @@ struct papr_scm_priv { > size_t stat_buffer_len; > }; > > +static int papr_scm_pmem_flush(struct nd_region *nd_region, > +struct bio *bio __maybe_unused) > +{ > + struct papr_scm_priv *p = nd_region_provider_data(nd_region); > + unsigned long ret_buf[PLPAR_HCALL_BUFSIZE]; > + uint64_t token = 0; > + int64_t rc; > + Suggest adding a dev_dbg to to indicate a flush request to a drc. That way if the loop below gets stuck the issue can be debugged with kernel logs. > + do { > + rc = plpar_hcall(H_SCM_FLUSH, ret_buf, p->drc_index, token); > + token = ret_buf[0]; > + > + /* Check if we are stalled for some time */ > + if (H_IS_LONG_BUSY(rc)) { > + msleep(get_longbusy_msecs(rc)); > + rc = H_BUSY; > + } else if (rc == H_BUSY) { > + cond_resched(); > + } > + > + } while (rc == H_BUSY); > + > + if (rc) { > + dev_err(>pdev->dev, "flush error: %lld", rc); > + rc = -EIO; > + } else { > + dev_dbg(>pdev->dev, "flush drc 0x%x complete", p->drc_index); > + } > + > + return rc; > +} > + > static LIST_HEAD(papr_nd_regions); > static DEFINE_MUTEX(papr_ndr_lock); > > @@ -943,6 +976,11 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p) > ndr_desc.num_mappings = 1; > ndr_desc.nd_set = >nd_set; > > + if (p->hcall_flush_required) { > + set_bit(ND_REGION_ASYNC, _desc.flags); > + ndr_desc.flush = papr_scm_pmem_flush; > + } > + > if (p->is_volatile) > p->region =
[PATCH] powerpc/mm: Add cond_resched() while removing hpte mappings
While removing large number of mappings from hash page tables for large memory systems as soft-lockup is reported because of the time spent inside htap_remove_mapping() like one below: watchdog: BUG: soft lockup - CPU#8 stuck for 23s! NIP plpar_hcall+0x38/0x58 LR pSeries_lpar_hpte_invalidate+0x68/0xb0 Call Trace: 0x1fff000 (unreliable) pSeries_lpar_hpte_removebolted+0x9c/0x230 hash__remove_section_mapping+0xec/0x1c0 remove_section_mapping+0x28/0x3c arch_remove_memory+0xfc/0x150 devm_memremap_pages_release+0x180/0x2f0 devm_action_release+0x30/0x50 release_nodes+0x28c/0x300 device_release_driver_internal+0x16c/0x280 unbind_store+0x124/0x170 drv_attr_store+0x44/0x60 sysfs_kf_write+0x64/0x90 kernfs_fop_write+0x1b0/0x290 __vfs_write+0x3c/0x70 vfs_write+0xd4/0x270 ksys_write+0xdc/0x130 system_call+0x5c/0x70 Fix this by adding a cond_resched() to the loop in htap_remove_mapping() that issues hcall to remove hpte mapping. This should prevent the soft-lockup from being reported. Suggested-by: Aneesh Kumar K.V Signed-off-by: Vaibhav Jain --- arch/powerpc/mm/book3s64/hash_utils.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/powerpc/mm/book3s64/hash_utils.c b/arch/powerpc/mm/book3s64/hash_utils.c index 581b20a2feaf..ea3945c70b18 100644 --- a/arch/powerpc/mm/book3s64/hash_utils.c +++ b/arch/powerpc/mm/book3s64/hash_utils.c @@ -359,6 +359,8 @@ int htab_remove_mapping(unsigned long vstart, unsigned long vend, } if (rc < 0) return rc; + + cond_resched(); } return ret; -- 2.29.2
[RFC][PATCH 2/2] powerpc/papr_scm: Implement support for reporting generic nvdimm stats
Add support for reporting papr-scm supported generic nvdimm stats by implementing support for handling ND_CMD_GET_STAT in 'papr_scm_ndctl(). The mapping between libnvdimm generic nvdimm-stats and papr-scm specific performance-stats is embedded inside 'dimm_stats_map[]'. This array is queried by newly introduced 'papr_scm_get_stat()' that verifies if the requested nvdimm-stat is supported and if yes does an hcall via 'drc_pmem_query_stat()' to request the performance-stat and return it back to libnvdimm. Signed-off-by: Vaibhav Jain --- arch/powerpc/platforms/pseries/papr_scm.c | 66 ++- 1 file changed, 65 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c index 835163f54244..51eeab3376fd 100644 --- a/arch/powerpc/platforms/pseries/papr_scm.c +++ b/arch/powerpc/platforms/pseries/papr_scm.c @@ -25,7 +25,8 @@ ((1ul << ND_CMD_GET_CONFIG_SIZE) | \ (1ul << ND_CMD_GET_CONFIG_DATA) | \ (1ul << ND_CMD_SET_CONFIG_DATA) | \ -(1ul << ND_CMD_CALL)) +(1ul << ND_CMD_CALL) |\ +(1ul << ND_CMD_GET_STAT)) /* DIMM health bitmap bitmap indicators */ /* SCM device is unable to persist memory contents */ @@ -120,6 +121,16 @@ struct papr_scm_priv { static LIST_HEAD(papr_nd_regions); static DEFINE_MUTEX(papr_ndr_lock); +/* Map generic nvdimm stats to papr-scm stats */ +static const char * const dimm_stat_map[] = { + [ND_DIMM_STAT_INVALID] = NULL, + [ND_DIMM_STAT_MEDIA_READS] = "MedRCnt ", + [ND_DIMM_STAT_MEDIA_WRITES] = "MedWCnt ", + [ND_DIMM_STAT_READ_REQUESTS] = "HostLCnt", + [ND_DIMM_STAT_WRITE_REQUESTS] = "HostSCnt", + [ND_DIMM_STAT_MAX] = NULL, +}; + static int drc_pmem_bind(struct papr_scm_priv *p) { unsigned long ret[PLPAR_HCALL_BUFSIZE]; @@ -728,6 +739,54 @@ static int papr_scm_service_pdsm(struct papr_scm_priv *p, return pdsm_pkg->cmd_status; } +/* + * For a given pdsm request call an appropriate service function. + * Returns errors if any while handling the pdsm command package. + */ +static int papr_scm_get_stat(struct papr_scm_priv *p, +struct nd_cmd_get_dimm_stat *dimm_stat) + +{ + int rc; + ssize_t size; + struct papr_scm_perf_stat *stat; + struct papr_scm_perf_stats *stats; + + /* Check if the requested stat-id is supported */ + if (dimm_stat->stat_id >= ARRAY_SIZE(dimm_stat_map) || + !dimm_stat_map[dimm_stat->stat_id]) { + dev_dbg(>pdev->dev, "Invalid stat-id %lld\n", dimm_stat->stat_id); + return -ENOSPC; + } + + /* Allocate request buffer enough to hold single performance stat */ + size = sizeof(struct papr_scm_perf_stats) + + sizeof(struct papr_scm_perf_stat); + + stats = kzalloc(size, GFP_KERNEL); + if (!stats) + return -ENOMEM; + + stat = >scm_statistic[0]; + memcpy(>stat_id, dimm_stat_map[dimm_stat->stat_id], + sizeof(stat->stat_id)); + stat->stat_val = 0; + + /* Fetch the statistic from PHYP and copy it to provided payload */ + rc = drc_pmem_query_stats(p, stats, 1); + if (rc < 0) { + dev_dbg(>pdev->dev, "Err(%d) fetching stat '%.8s'\n", + rc, stat->stat_id); + kfree(stats); + return rc; + } + + dimm_stat->int_val = be64_to_cpu(stat->stat_val); + + kfree(stats); + return 0; +} + static int papr_scm_ndctl(struct nvdimm_bus_descriptor *nd_desc, struct nvdimm *nvdimm, unsigned int cmd, void *buf, unsigned int buf_len, int *cmd_rc) @@ -772,6 +831,11 @@ static int papr_scm_ndctl(struct nvdimm_bus_descriptor *nd_desc, *cmd_rc = papr_scm_service_pdsm(p, call_pkg); break; + case ND_CMD_GET_STAT: + *cmd_rc = papr_scm_get_stat(p, + (struct nd_cmd_get_dimm_stat *)buf); + break; + default: dev_dbg(>pdev->dev, "Unknown command = %d\n", cmd); return -EINVAL; -- 2.28.0
[RFC][PATCH 1/2] libnvdimm: Introduce ND_CMD_GET_STAT to retrieve nvdimm statistics
Implement support for exposing generic nvdimm statistics via newly introduced dimm-command ND_CMD_GET_STAT that can be handled by nvdimm command handler function and provide values for these statistics back to libnvdimm. Following generic nvdimm statistics are defined as an enumeration in 'uapi/ndctl.h': * "media_reads" : Number of media reads that have occurred since reboot. * "media_writes" : Number of media writes that have occurred since reboot. * "read_requests" : Number of read requests that have occurred since reboot. * "write_requests" : Number of write requests that have occurred since reboot. * "total_media_reads" : Total number of media reads that have occurred. * "total_media_writes" : Total number of media writes that have occurred. * "total_read_requests" : Total number of read requests that have occurred. * "total_write_requests" : Total number of write requests that have occurred. Apart from ND_CMD_GET_STAT ioctl these nvdimm statistics are also exposed via sysfs '/stats' directory for easy user-space access like below: /sys/class/nd/ndctl0/device/nmem0/stats # tail -n +1 * ==> media_reads <== 252197707602 ==> media_writes <== 20684685172 ==> read_requests <== 658810924962 ==> write_requests <== 404464081574 In case a specific nvdimm-statistic is not supported than nvdimm command handler function can simply return an error (e.g -ENOENT) for request to read that nvdimm-statistic. The value for a specific nvdimm-stat is exchanged via newly introduced 'struct nd_cmd_get_dimm_stat' that hold a single statistics and a union of possible values types. Presently only '__s64' type of generic attributes are supported. These attributes are defined in 'ndvimm/dimm_devs.c' via a helper macro 'NVDIMM_STAT_ATTR'. Signed-off-by: Vaibhav Jain --- drivers/nvdimm/bus.c | 6 ++ drivers/nvdimm/dimm_devs.c | 109 + drivers/nvdimm/nd.h| 5 ++ include/uapi/linux/ndctl.h | 27 + 4 files changed, 147 insertions(+) diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c index 2304c6183822..d53564477437 100644 --- a/drivers/nvdimm/bus.c +++ b/drivers/nvdimm/bus.c @@ -794,6 +794,12 @@ static const struct nd_cmd_desc __nd_cmd_dimm_descs[] = { .out_num = 1, .out_sizes = { UINT_MAX, }, }, + [ND_CMD_GET_STAT] = { + .in_num = 1, + .in_sizes = { sizeof(struct nd_cmd_get_dimm_stat), }, + .out_num = 1, + .out_sizes = { sizeof(struct nd_cmd_get_dimm_stat), }, + }, }; const struct nd_cmd_desc *nd_cmd_dimm_desc(int cmd) diff --git a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c index b59032e0859b..68aaa294def7 100644 --- a/drivers/nvdimm/dimm_devs.c +++ b/drivers/nvdimm/dimm_devs.c @@ -555,6 +555,114 @@ static umode_t nvdimm_firmware_visible(struct kobject *kobj, struct attribute *a return a->mode; } +/* Request a dimm stat from the bus driver */ +static int __request_dimm_stat(struct nvdimm_bus *nvdimm_bus, + struct nvdimm *dimm, u64 stat_id, + s64 *stat_val) +{ + struct nvdimm_bus_descriptor *nd_desc = nvdimm_bus->nd_desc; + struct nd_cmd_get_dimm_stat stat = { .stat_id = stat_id }; + int rc, cmd_rc; + + if (!test_bit(ND_CMD_GET_STAT, >cmd_mask)) { + pr_debug("CMD_GET_STAT not set for bus driver 0x%lx\n", +nd_desc->cmd_mask); + return -ENOENT; + } + + /* Is stat requested is known & bus driver supports fetching stats */ + if (stat_id <= ND_DIMM_STAT_INVALID || stat_id > ND_DIMM_STAT_MAX) { + WARN(1, "Unknown stat-id(%llu) requested", stat_id); + return -ENOENT; + } + + /* Ask bus driver for its stat value */ + rc = nd_desc->ndctl(nd_desc, dimm, ND_CMD_GET_STAT, + , sizeof(stat), _rc); + if (rc || cmd_rc) { + pr_debug("Unable to request stat %lld. Error (%d,%d)\n", +stat_id, rc, cmd_rc); + return rc ? rc : cmd_rc; + } + + /* Indicate error in case returned struct doesn't have the stat_id set */ + if (stat.stat_id != stat_id) { + pr_debug("Invalid statid %llu returned\n", stat.stat_id); + return -ENOENT; + } + + *stat_val = stat.int_val; + return 0; +} + +static ssize_t nvdimm_stat_attr_show(struct device *dev, +struct device_attribute *attr, +char *buf) +{ + struct nvdimm_stat_attr *nattr = container_of(attr, typeof(*nattr), attr); + struct nvdimm_bus *nvdimm_bus = walk_to_nvdimm_bus(dev); + struct nvdimm *nvdimm = to_nvd
Re: [PATCH 30/33] docs: ABI: cleanup several ABI documents
Mauro Carvalho Chehab writes: > There are some ABI documents that, while they don't generate > any warnings, they have issues when parsed by get_abi.pl script > on its output result. > > Address them, in order to provide a clean output. > > Signed-off-by: Mauro Carvalho Chehab > diff --git a/Documentation/ABI/testing/sysfs-bus-papr-pmem > b/Documentation/ABI/testing/sysfs-bus-papr-pmem > index c1a67275c43f..8316c33862a0 100644 > --- a/Documentation/ABI/testing/sysfs-bus-papr-pmem > +++ b/Documentation/ABI/testing/sysfs-bus-papr-pmem > @@ -11,19 +11,26 @@ Description: > at 'Documentation/powerpc/papr_hcalls.rst' . Below are > the flags reported in this sysfs file: > > - * "not_armed" : Indicates that NVDIMM contents will not > + * "not_armed" > + Indicates that NVDIMM contents will not > survive a power cycle. > - * "flush_fail" : Indicates that NVDIMM contents > + * "flush_fail" > + Indicates that NVDIMM contents > couldn't be flushed during last > shut-down event. > - * "restore_fail": Indicates that NVDIMM contents > + * "restore_fail" > + Indicates that NVDIMM contents > couldn't be restored during NVDIMM > initialization. > - * "encrypted" : NVDIMM contents are encrypted. > - * "smart_notify": There is health event for the NVDIMM. > - * "scrubbed": Indicating that contents of the > + * "encrypted" > + NVDIMM contents are encrypted. > + * "smart_notify" > + There is health event for the NVDIMM. > + * "scrubbed" > + Indicating that contents of the > NVDIMM have been scrubbed. > - * "locked" : Indicating that NVDIMM contents cant > + * "locked" > + Indicating that NVDIMM contents cant > be modified until next power cycle. > > What:/sys/bus/nd/devices/nmemX/papr/perf_stats > @@ -51,4 +58,4 @@ Description: > * "MedWDur " : Media Write Duration > * "CchRHCnt" : Cache Read Hit Count > * "CchWHCnt" : Cache Write Hit Count > - * "FastWCnt" : Fast Write Count > \ No newline at end of file > + * "FastWCnt" : Fast Write Count Thanks, I am fine with proposed changes to sysfs-bus-papr-pmem. Acked-by: Vaibhav Jain # for sysfs-bus-papr-pmem
Re: [PATCH] powerpc/papr_scm: Add PAPR command family to pass-through command-set
Hi Dan, Ira and Vishal, Can you please take a look at this patch. Without it the functionality to report nvdimm health via ndctl breaks on 5.9 Thanks, ~ Vaibhav Vaibhav Jain writes: > Add NVDIMM_FAMILY_PAPR to the list of valid 'dimm_family_mask' > acceptable by papr_scm. This is needed as since commit > 92fe2aa859f5 ("libnvdimm: Validate command family indices") libnvdimm > performs a validation of 'nd_cmd_pkg.nd_family' received as part of > ND_CMD_CALL processing to ensure only known command families can use > the general ND_CMD_CALL pass-through functionality. > > Without this change the ND_CMD_CALL pass-through targeting > NVDIMM_FAMILY_PAPR error out with -EINVAL. > > Fixes: 92fe2aa859f5 ("libnvdimm: Validate command family indices") > Signed-off-by: Vaibhav Jain > --- > arch/powerpc/platforms/pseries/papr_scm.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/arch/powerpc/platforms/pseries/papr_scm.c > b/arch/powerpc/platforms/pseries/papr_scm.c > index 5493bc847bd08..27268370dee00 100644 > --- a/arch/powerpc/platforms/pseries/papr_scm.c > +++ b/arch/powerpc/platforms/pseries/papr_scm.c > @@ -898,6 +898,9 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p) > p->bus_desc.of_node = p->pdev->dev.of_node; > p->bus_desc.provider_name = kstrdup(p->pdev->name, GFP_KERNEL); > > + /* Set the dimm command family mask to accept PDSMs */ > + set_bit(NVDIMM_FAMILY_PAPR, >bus_desc.dimm_family_mask); > + > if (!p->bus_desc.provider_name) > return -ENOMEM; > > -- > 2.26.2 >
Re: [PATCH -next] powerpc/papr_scm: Fix warnings about undeclared variable
Thanks for the patch. This looks good to me. Wang Wensheng writes: > Build the kernel with 'make C=2': > arch/powerpc/platforms/pseries/papr_scm.c:825:1: warning: symbol > 'dev_attr_perf_stats' was not declared. Should it be static? > Signed-off-by: Wang Wensheng Reviewed-by: Vaibhav Jain -- Cheers ~ Vaibhav
Re: [PATCH v2] powerpc/papr_scm: Fix warning triggered by perf_stats_show()
Michael Ellerman writes: > Vaibhav Jain writes: >> A warning is reported by the kernel in case perf_stats_show() returns >> an error code. The warning is of the form below: >> >> papr_scm ibm,persistent-memory:ibm,pmemory@4411: >>Failed to query performance stats, Err:-10 >> dev_attr_show: perf_stats_show+0x0/0x1c0 [papr_scm] returned bad count >> fill_read_buffer: dev_attr_show+0x0/0xb0 returned bad count >> >> On investigation it looks like that the compiler is silently truncating the >> return value of drc_pmem_query_stats() from 'long' to 'int', since the >> variable used to store the return code 'rc' is an 'int'. This >> truncated value is then returned back as a 'ssize_t' back from >> perf_stats_show() to 'dev_attr_show()' which thinks of it as a large >> unsigned number and triggers this warning.. >> >> To fix this we update the type of variable 'rc' from 'int' to >> 'ssize_t' that prevents the compiler from truncating the return value >> of drc_pmem_query_stats() and returning correct signed value back from >> perf_stats_show(). >> >> Fixes: 2d02bf835e573 ('powerpc/papr_scm: Fetch nvdimm performance >>stats from PHYP') > > Please don't word wrap the Fixes tag it breaks b4. > > I've fixed it up this time. Thanks Mpe > > cheers -- Cheers ~ Vaibhav
[PATCH] powerpc/papr_scm: Support dynamic enable/disable of performance statistics
Collection of performance statistics of an NVDIMM can be dynamically enabled/disabled from the Hypervisor Management Console even when the guest lpar is running. The current implementation however will check if the performance statistics collection is supported during NVDIMM probe and if yes will assume that to be the case afterwards. Hence we update papr_scm to remove this assumption from the code by eliminating the 'stat_buffer_len' member from 'struct papr_scm_priv' that was used to cache the max buffer size needed to fetch NVDIMM performance stats from PHYP. With that struct member gone, various functions that depended on it are updated. Specifically perf_stats_show() is updated to query the PHYP first for the size of buffer needed to hold all performance statistics instead of relying on 'stat_buffer_len' Signed-off-by: Vaibhav Jain --- arch/powerpc/platforms/pseries/papr_scm.c | 53 +++ 1 file changed, 25 insertions(+), 28 deletions(-) diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c index 27268370dee00..6697e1c3b9ebe 100644 --- a/arch/powerpc/platforms/pseries/papr_scm.c +++ b/arch/powerpc/platforms/pseries/papr_scm.c @@ -112,9 +112,6 @@ struct papr_scm_priv { /* Health information for the dimm */ u64 health_bitmap; - - /* length of the stat buffer as expected by phyp */ - size_t stat_buffer_len; }; static LIST_HEAD(papr_nd_regions); @@ -230,14 +227,15 @@ static int drc_pmem_query_n_bind(struct papr_scm_priv *p) * - If buff_stats == NULL the return value is the size in byes of the buffer * needed to hold all supported performance-statistics. * - If buff_stats != NULL and num_stats == 0 then we copy all known - * performance-statistics to 'buff_stat' and expect to be large enough to - * hold them. + * performance-statistics to 'buff_stat' and expect it to be large enough to + * hold them. The 'buff_size' args contains the size of the 'buff_stats' * - if buff_stats != NULL and num_stats > 0 then copy the requested * performance-statistics to buff_stats. */ static ssize_t drc_pmem_query_stats(struct papr_scm_priv *p, struct papr_scm_perf_stats *buff_stats, - unsigned int num_stats) + unsigned int num_stats, + size_t buff_size) { unsigned long ret[PLPAR_HCALL_BUFSIZE]; size_t size; @@ -261,12 +259,18 @@ static ssize_t drc_pmem_query_stats(struct papr_scm_priv *p, size = sizeof(struct papr_scm_perf_stats) + num_stats * sizeof(struct papr_scm_perf_stat); else - size = p->stat_buffer_len; + size = buff_size; } else { /* In case of no out buffer ignore the size */ size = 0; } + /* verify that the buffer size needed is sufficient */ + if (size > buff_size) { + __WARN(); + return -EINVAL; + } + /* Do the HCALL asking PHYP for info */ rc = plpar_hcall(H_SCM_PERFORMANCE_STATS, ret, p->drc_index, buff_stats ? virt_to_phys(buff_stats) : 0, @@ -277,6 +281,10 @@ static ssize_t drc_pmem_query_stats(struct papr_scm_priv *p, dev_err(>pdev->dev, "Unknown performance stats, Err:0x%016lX\n", ret[0]); return -ENOENT; + } else if (rc == H_AUTHORITY) { + dev_dbg(>pdev->dev, + "Performance stats in-accessible\n"); + return -EPERM; } else if (rc != H_SUCCESS) { dev_err(>pdev->dev, "Failed to query performance stats, Err:%lld\n", rc); @@ -526,10 +534,6 @@ static int papr_pdsm_fuel_gauge(struct papr_scm_priv *p, struct papr_scm_perf_stat *stat; struct papr_scm_perf_stats *stats; - /* Silently fail if fetching performance metrics isn't supported */ - if (!p->stat_buffer_len) - return 0; - /* Allocate request buffer enough to hold single performance stat */ size = sizeof(struct papr_scm_perf_stats) + sizeof(struct papr_scm_perf_stat); @@ -543,9 +547,11 @@ static int papr_pdsm_fuel_gauge(struct papr_scm_priv *p, stat->stat_val = 0; /* Fetch the fuel gauge and populate it in payload */ - rc = drc_pmem_query_stats(p, stats, 1); + rc = drc_pmem_query_stats(p, stats, 1, size); if (rc < 0) { dev_dbg(>pdev->dev, "Err(%d) fetching fuel gauge\n", rc); + /* Silently fail if unable to fetch performance metric */ + rc = 0; goto free_stats; } @@ -786,23 +792,25 @@ static ssize_t perf_stats_show(s
[PATCH] powerpc/papr_scm: Add PAPR command family to pass-through command-set
Add NVDIMM_FAMILY_PAPR to the list of valid 'dimm_family_mask' acceptable by papr_scm. This is needed as since commit 92fe2aa859f5 ("libnvdimm: Validate command family indices") libnvdimm performs a validation of 'nd_cmd_pkg.nd_family' received as part of ND_CMD_CALL processing to ensure only known command families can use the general ND_CMD_CALL pass-through functionality. Without this change the ND_CMD_CALL pass-through targeting NVDIMM_FAMILY_PAPR error out with -EINVAL. Fixes: 92fe2aa859f5 ("libnvdimm: Validate command family indices") Signed-off-by: Vaibhav Jain --- arch/powerpc/platforms/pseries/papr_scm.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c index 5493bc847bd08..27268370dee00 100644 --- a/arch/powerpc/platforms/pseries/papr_scm.c +++ b/arch/powerpc/platforms/pseries/papr_scm.c @@ -898,6 +898,9 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p) p->bus_desc.of_node = p->pdev->dev.of_node; p->bus_desc.provider_name = kstrdup(p->pdev->name, GFP_KERNEL); + /* Set the dimm command family mask to accept PDSMs */ + set_bit(NVDIMM_FAMILY_PAPR, >bus_desc.dimm_family_mask); + if (!p->bus_desc.provider_name) return -ENOMEM; -- 2.26.2
Re: [PATCH] powerpc/papr_scm: Fix warning triggered by perf_stats_show()
Ira Weiny writes: > On Sat, Sep 12, 2020 at 01:36:46AM +0530, Vaibhav Jain wrote: >> Thanks for reviewing this patch Ira, >> >> >> Ira Weiny writes: >> >> > On Thu, Sep 10, 2020 at 02:52:12PM +0530, Vaibhav Jain wrote: >> >> A warning is reported by the kernel in case perf_stats_show() returns >> >> an error code. The warning is of the form below: >> >> >> >> papr_scm ibm,persistent-memory:ibm,pmemory@4411: >> >> Failed to query performance stats, Err:-10 >> >> dev_attr_show: perf_stats_show+0x0/0x1c0 [papr_scm] returned bad count >> >> fill_read_buffer: dev_attr_show+0x0/0xb0 returned bad count >> >> >> >> On investigation it looks like that the compiler is silently truncating >> >> the >> >> return value of drc_pmem_query_stats() from 'long' to 'int', since the >> >> variable used to store the return code 'rc' is an 'int'. This >> >> truncated value is then returned back as a 'ssize_t' back from >> >> perf_stats_show() to 'dev_attr_show()' which thinks of it as a large >> >> unsigned number and triggers this warning.. >> >> >> >> To fix this we update the type of variable 'rc' from 'int' to >> >> 'ssize_t' that prevents the compiler from truncating the return value >> >> of drc_pmem_query_stats() and returning correct signed value back from >> >> perf_stats_show(). >> >> >> >> Fixes: 2d02bf835e573 ('powerpc/papr_scm: Fetch nvdimm performance >> >>stats from PHYP') >> >> Signed-off-by: Vaibhav Jain >> >> --- >> >> arch/powerpc/platforms/pseries/papr_scm.c | 3 ++- >> >> 1 file changed, 2 insertions(+), 1 deletion(-) >> >> >> >> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c >> >> b/arch/powerpc/platforms/pseries/papr_scm.c >> >> index a88a707a608aa..9f00b61676ab9 100644 >> >> --- a/arch/powerpc/platforms/pseries/papr_scm.c >> >> +++ b/arch/powerpc/platforms/pseries/papr_scm.c >> >> @@ -785,7 +785,8 @@ static int papr_scm_ndctl(struct >> >> nvdimm_bus_descriptor *nd_desc, >> >> static ssize_t perf_stats_show(struct device *dev, >> >> struct device_attribute *attr, char *buf) >> >> { >> >> - int index, rc; >> >> + int index; >> >> + ssize_t rc; >> > >> > I'm not sure this is really fixing everything here. >> >> The issue is with the statement in perf_stats_show(): >> >> 'return rc ? rc : seq_buf_used();' >> >> The function seq_buf_used() returns an 'unsigned int' and with 'rc' >> typed as 'int', forces a promotion of the expression to 'unsigned int' >> which causes a loss of signedness of 'rc' and compiler silently >> assigns this unsigned value to the function return typed as 'signed >> long'. >> >> Making 'rc', a 'signed long' forces a promotion of the expresion to >> 'signed long' which preserves the signedness of 'rc' and will also be >> compatible with the function return type. > > Ok, ok, I read this all wrong. > > FWIW I would also cast seq_buf_used() to ssize_t to show you know what you are > doing there. > >> >> > >> > drc_pmem_query_stats() can return negative errno's. Why are those not >> > checked >> > somewhere in perf_stats_show()? >> > >> For the specific invocation 'drc_pmem_query_stats(p, stats, 0)' we only >> expect return value 'rc <=0' with '0' indicating a successful fetch of >> nvdimm performance stats from hypervisor. Hence there are no explicit >> checks for negative error codes in the functions as all return values >> !=0 indicate an error. >> >> >> > It seems like all this fix is handling is a > 0 return value: 'ret[0]' from >> > line 289 in papr_scm.c... Or something? >> No, in case the arg 'num_stats' is '0' and 'buff_stats != NULL' the >> variable 'size' is assigned a non-zero value hence that specific branch >> you mentioned is never taken. Instead in case of success >> drc_pmem_query_stats() return '0' and in case of an error a negative >> error code is returned. >> >> > >> > Worse yet drc_pmem_query_stats() is returning ssize_t which is a signed >> > value. >> > Therefore, it should not be returning -errno. I'm surprised the static >> > checkers did not catch that. >> Didnt quite get the assertion here. The function is marked to return >> 'ss
[PATCH v2] powerpc/papr_scm: Fix warning triggered by perf_stats_show()
A warning is reported by the kernel in case perf_stats_show() returns an error code. The warning is of the form below: papr_scm ibm,persistent-memory:ibm,pmemory@4411: Failed to query performance stats, Err:-10 dev_attr_show: perf_stats_show+0x0/0x1c0 [papr_scm] returned bad count fill_read_buffer: dev_attr_show+0x0/0xb0 returned bad count On investigation it looks like that the compiler is silently truncating the return value of drc_pmem_query_stats() from 'long' to 'int', since the variable used to store the return code 'rc' is an 'int'. This truncated value is then returned back as a 'ssize_t' back from perf_stats_show() to 'dev_attr_show()' which thinks of it as a large unsigned number and triggers this warning.. To fix this we update the type of variable 'rc' from 'int' to 'ssize_t' that prevents the compiler from truncating the return value of drc_pmem_query_stats() and returning correct signed value back from perf_stats_show(). Fixes: 2d02bf835e573 ('powerpc/papr_scm: Fetch nvdimm performance stats from PHYP') Signed-off-by: Vaibhav Jain --- Changelog: v2: Added an explicit cast to the expression calling 'seq_buf_used()' and triggering this issue. [ Ira ] --- arch/powerpc/platforms/pseries/papr_scm.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c index a88a707a608aa..5493bc847bd08 100644 --- a/arch/powerpc/platforms/pseries/papr_scm.c +++ b/arch/powerpc/platforms/pseries/papr_scm.c @@ -785,7 +785,8 @@ static int papr_scm_ndctl(struct nvdimm_bus_descriptor *nd_desc, static ssize_t perf_stats_show(struct device *dev, struct device_attribute *attr, char *buf) { - int index, rc; + int index; + ssize_t rc; struct seq_buf s; struct papr_scm_perf_stat *stat; struct papr_scm_perf_stats *stats; @@ -820,7 +821,7 @@ static ssize_t perf_stats_show(struct device *dev, free_stats: kfree(stats); - return rc ? rc : seq_buf_used(); + return rc ? rc : (ssize_t)seq_buf_used(); } DEVICE_ATTR_ADMIN_RO(perf_stats); -- 2.26.2
Re: [PATCH] powerpc/papr_scm: Fix warning triggered by perf_stats_show()
Thanks for reviewing this patch Ira, Ira Weiny writes: > On Thu, Sep 10, 2020 at 02:52:12PM +0530, Vaibhav Jain wrote: >> A warning is reported by the kernel in case perf_stats_show() returns >> an error code. The warning is of the form below: >> >> papr_scm ibm,persistent-memory:ibm,pmemory@4411: >>Failed to query performance stats, Err:-10 >> dev_attr_show: perf_stats_show+0x0/0x1c0 [papr_scm] returned bad count >> fill_read_buffer: dev_attr_show+0x0/0xb0 returned bad count >> >> On investigation it looks like that the compiler is silently truncating the >> return value of drc_pmem_query_stats() from 'long' to 'int', since the >> variable used to store the return code 'rc' is an 'int'. This >> truncated value is then returned back as a 'ssize_t' back from >> perf_stats_show() to 'dev_attr_show()' which thinks of it as a large >> unsigned number and triggers this warning.. >> >> To fix this we update the type of variable 'rc' from 'int' to >> 'ssize_t' that prevents the compiler from truncating the return value >> of drc_pmem_query_stats() and returning correct signed value back from >> perf_stats_show(). >> >> Fixes: 2d02bf835e573 ('powerpc/papr_scm: Fetch nvdimm performance >>stats from PHYP') >> Signed-off-by: Vaibhav Jain >> --- >> arch/powerpc/platforms/pseries/papr_scm.c | 3 ++- >> 1 file changed, 2 insertions(+), 1 deletion(-) >> >> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c >> b/arch/powerpc/platforms/pseries/papr_scm.c >> index a88a707a608aa..9f00b61676ab9 100644 >> --- a/arch/powerpc/platforms/pseries/papr_scm.c >> +++ b/arch/powerpc/platforms/pseries/papr_scm.c >> @@ -785,7 +785,8 @@ static int papr_scm_ndctl(struct nvdimm_bus_descriptor >> *nd_desc, >> static ssize_t perf_stats_show(struct device *dev, >> struct device_attribute *attr, char *buf) >> { >> -int index, rc; >> +int index; >> +ssize_t rc; > > I'm not sure this is really fixing everything here. The issue is with the statement in perf_stats_show(): 'return rc ? rc : seq_buf_used();' The function seq_buf_used() returns an 'unsigned int' and with 'rc' typed as 'int', forces a promotion of the expression to 'unsigned int' which causes a loss of signedness of 'rc' and compiler silently assigns this unsigned value to the function return typed as 'signed long'. Making 'rc', a 'signed long' forces a promotion of the expresion to 'signed long' which preserves the signedness of 'rc' and will also be compatible with the function return type. > > drc_pmem_query_stats() can return negative errno's. Why are those not checked > somewhere in perf_stats_show()? > For the specific invocation 'drc_pmem_query_stats(p, stats, 0)' we only expect return value 'rc <=0' with '0' indicating a successful fetch of nvdimm performance stats from hypervisor. Hence there are no explicit checks for negative error codes in the functions as all return values !=0 indicate an error. > It seems like all this fix is handling is a > 0 return value: 'ret[0]' from > line 289 in papr_scm.c... Or something? No, in case the arg 'num_stats' is '0' and 'buff_stats != NULL' the variable 'size' is assigned a non-zero value hence that specific branch you mentioned is never taken. Instead in case of success drc_pmem_query_stats() return '0' and in case of an error a negative error code is returned. > > Worse yet drc_pmem_query_stats() is returning ssize_t which is a signed value. > Therefore, it should not be returning -errno. I'm surprised the static > checkers did not catch that. Didnt quite get the assertion here. The function is marked to return 'ssize_t' because we can return both +ve for success and -ve values to indicate errors. > > I believe I caught similar errors with a patch series before which did not pay > attention to variable types. > > Please audit this code for these types of errors and ensure you are really > doing the correct thing when using the sysfs interface. I'm pretty sure bad > things will eventually happen (if they are not already) if you return some > really big number to the sysfs core from *_show(). I think this problem is different compared to what you had previously pointed to. The values returned from drc_pmem_query_stats() can be stored in an 'int' variable too, however it was the silent promotion of a signed type to unsigned type was what caused this specific issue. -- Cheers ~ Vaibhav
[PATCH] powerpc/papr_scm: Fix warning triggered by perf_stats_show()
A warning is reported by the kernel in case perf_stats_show() returns an error code. The warning is of the form below: papr_scm ibm,persistent-memory:ibm,pmemory@4411: Failed to query performance stats, Err:-10 dev_attr_show: perf_stats_show+0x0/0x1c0 [papr_scm] returned bad count fill_read_buffer: dev_attr_show+0x0/0xb0 returned bad count On investigation it looks like that the compiler is silently truncating the return value of drc_pmem_query_stats() from 'long' to 'int', since the variable used to store the return code 'rc' is an 'int'. This truncated value is then returned back as a 'ssize_t' back from perf_stats_show() to 'dev_attr_show()' which thinks of it as a large unsigned number and triggers this warning.. To fix this we update the type of variable 'rc' from 'int' to 'ssize_t' that prevents the compiler from truncating the return value of drc_pmem_query_stats() and returning correct signed value back from perf_stats_show(). Fixes: 2d02bf835e573 ('powerpc/papr_scm: Fetch nvdimm performance stats from PHYP') Signed-off-by: Vaibhav Jain --- arch/powerpc/platforms/pseries/papr_scm.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c index a88a707a608aa..9f00b61676ab9 100644 --- a/arch/powerpc/platforms/pseries/papr_scm.c +++ b/arch/powerpc/platforms/pseries/papr_scm.c @@ -785,7 +785,8 @@ static int papr_scm_ndctl(struct nvdimm_bus_descriptor *nd_desc, static ssize_t perf_stats_show(struct device *dev, struct device_attribute *attr, char *buf) { - int index, rc; + int index; + ssize_t rc; struct seq_buf s; struct papr_scm_perf_stat *stat; struct papr_scm_perf_stats *stats; -- 2.26.2
[PATCH v2] powerpc/papr_scm: Limit the readability of 'perf_stats' sysfs attribute
The newly introduced 'perf_stats' attribute uses the default access mode of 0444 letting non-root users access performance stats of an nvdimm and potentially force the kernel into issuing large number of expensive HCALLs. Since the information exposed by this attribute cannot be cached hence its better to ward of access to this attribute from users who don't need to access these performance statistics. Hence this patch updates access mode of 'perf_stats' attribute to be only readable by root users. Fixes: 2d02bf835e573 ('powerpc/papr_scm: Fetch nvdimm performance stats from PHYP') Reported-by: Aneesh Kumar K.V Signed-off-by: Vaibhav Jain --- Change-log: v2: * Instead of checking for perfmon_capable() inside show_perf_stats() set the attribute as DEVICE_ATTR_ADMIN_RO [ Aneesh ] * Update patch description --- arch/powerpc/platforms/pseries/papr_scm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c index f439f0dfea7d1..a88a707a608aa 100644 --- a/arch/powerpc/platforms/pseries/papr_scm.c +++ b/arch/powerpc/platforms/pseries/papr_scm.c @@ -822,7 +822,7 @@ static ssize_t perf_stats_show(struct device *dev, kfree(stats); return rc ? rc : seq_buf_used(); } -DEVICE_ATTR_RO(perf_stats); +DEVICE_ATTR_ADMIN_RO(perf_stats); static ssize_t flags_show(struct device *dev, struct device_attribute *attr, char *buf) -- 2.26.2
Re: [PATCH] powerpc/papr_scm: Limit the readability of 'perf_stats' sysfs attribute
Thanks Aneesh and Mpe for reviewing this patch. Michael Ellerman writes: > "Aneesh Kumar K.V" writes: [snip] >>> >>> + /* Allow access only to perfmon capable users */ >>> + if (!perfmon_capable()) >>> + return -EACCES; >>> + >> >> An access check is usually done in open(). This is the read callback IIUC. > > Yes. Otherwise an unprivileged user can open the file, and then trick a > suid program into reading from it. Agree, but since the 'open()' for this sysfs attribute is handled by kern-fs, AFAIK dont see any direct way to enforce this policy. Only other way it seems to me is to convert the 'perf_stats' DEVICE_ATTR_RO to DEVICE_ATTR_ADMIN_RO. > > cheers -- Cheers ~ Vaibhav
[PATCH] libnvdimm: Add a NULL entry to 'nvdimm_firmware_attributes'
We recently discovered a kernel oops with 'papr_scm' module while booting ppc64 phyp guest with following back-trace: BUG: Kernel NULL pointer dereference on write at 0x0188 Faulting instruction address: 0xc05d7084 Oops: Kernel access of bad area, sig: 11 [#1] Call Trace: internal_create_group+0x128/0x4c0 (unreliable) internal_create_groups.part.4+0x70/0x130 device_add+0x458/0x9c0 nd_async_device_register+0x28/0xa0 [libnvdimm] async_run_entry_fn+0x78/0x1f0 process_one_work+0x2c0/0x5b0 worker_thread+0x88/0x650 kthread+0x1a8/0x1b0 ret_from_kernel_thread+0x5c/0x6c A bisect lead to the 'commit 48001ea50d17f ("PM, libnvdimm: Add runtime firmware activation support")' and on investigation discovered that the newly introduced 'struct attribute *nvdimm_firmware_attributes[]' is missing a terminating NULL entry in the array. This causes a loop in sysfs's 'create_files()' to read garbage beyond bounds of 'nvdimm_firmware_attributes' and trigger the oops. Fixes: 48001ea50d17f ("PM, libnvdimm: Add runtime firmware activation support") Reported-by: Sandipan Das Signed-off-by: Vaibhav Jain --- drivers/nvdimm/dimm_devs.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c index 61374def51555..b59032e0859b7 100644 --- a/drivers/nvdimm/dimm_devs.c +++ b/drivers/nvdimm/dimm_devs.c @@ -529,6 +529,7 @@ static DEVICE_ATTR_ADMIN_RW(activate); static struct attribute *nvdimm_firmware_attributes[] = { _attr_activate.attr, _attr_result.attr, + NULL, }; static umode_t nvdimm_firmware_visible(struct kobject *kobj, struct attribute *a, int n) -- 2.26.2
[PATCH] powerpc/papr_scm: Limit the readability of 'perf_stats' sysfs attribute
The newly introduced 'perf_stats' attribute uses the default access mode of 0444 letting non-root users access performance stats of an nvdimm and potentially force the kernel into issuing large number of expensive HCALLs. Since the information exposed by this attribute cannot be cached hence its better to ward of access to this attribute from users who don't need to access these performance statistics. Hence this patch adds check in perf_stats_show() to only let users that are 'perfmon_capable()' to read the nvdimm performance statistics. Fixes: 2d02bf835e573 ('powerpc/papr_scm: Fetch nvdimm performance stats from PHYP') Reported-by: Aneesh Kumar K.V Signed-off-by: Vaibhav Jain --- arch/powerpc/platforms/pseries/papr_scm.c | 4 1 file changed, 4 insertions(+) diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c index f439f0dfea7d1..36c51bf8af9a8 100644 --- a/arch/powerpc/platforms/pseries/papr_scm.c +++ b/arch/powerpc/platforms/pseries/papr_scm.c @@ -792,6 +792,10 @@ static ssize_t perf_stats_show(struct device *dev, struct nvdimm *dimm = to_nvdimm(dev); struct papr_scm_priv *p = nvdimm_provider_data(dimm); + /* Allow access only to perfmon capable users */ + if (!perfmon_capable()) + return -EACCES; + if (!p->stat_buffer_len) return -ENOENT; -- 2.26.2
Re: [PATCH] powerpc/papr_scm: Make access mode of 'perf_stats' attribute file to '0400'
Hi Mpe, Thanks for reviewing this patch. My responses below: Michael Ellerman writes: > Vaibhav Jain writes: >> The newly introduced 'perf_stats' attribute uses the default access >> mode of 0444 letting non-root users access performance stats of an >> nvdimm and potentially force the kernel into issuing large number of >> expensive HCALLs. Since the information exposed by this attribute >> cannot be cached hence its better to ward of access to this attribute >> from non-root users. >> >> Hence this patch updates the access-mode of 'perf_stats' sysfs >> attribute file to 0400 to make it only readable to root-users. > > Or should we ratelimit it? Ideal consumers of this data will be users with CAP_PERFMON or CAP_SYS_ADMIN. Also they need up-to-date values for these performance stats as these values can be time sensitive. So rate limiting may not be a complete solution since a user running 'perf' might be throttled by another user who is simply reading the sysfs file contents. So instead of setting attribute mode to 0400, will add a check for 'perfmon_capable()' in perf_stats_show() denying read access to users without CAP_PERFMON or CAP_SYS_ADMIN. > Fixes: ?? Right. I will add this in v2. > >> Reported-by: Aneesh Kumar K.V >> Signed-off-by: Vaibhav Jain > > cheers > -- Cheers ~ Vaibhav
[PATCH] powerpc/papr_scm: Make access mode of 'perf_stats' attribute file to '0400'
The newly introduced 'perf_stats' attribute uses the default access mode of 0444 letting non-root users access performance stats of an nvdimm and potentially force the kernel into issuing large number of expensive HCALLs. Since the information exposed by this attribute cannot be cached hence its better to ward of access to this attribute from non-root users. Hence this patch updates the access-mode of 'perf_stats' sysfs attribute file to 0400 to make it only readable to root-users. Reported-by: Aneesh Kumar K.V Signed-off-by: Vaibhav Jain --- arch/powerpc/platforms/pseries/papr_scm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c index f439f0dfea7d1..31864d167a2ce 100644 --- a/arch/powerpc/platforms/pseries/papr_scm.c +++ b/arch/powerpc/platforms/pseries/papr_scm.c @@ -822,7 +822,7 @@ static ssize_t perf_stats_show(struct device *dev, kfree(stats); return rc ? rc : seq_buf_used(); } -DEVICE_ATTR_RO(perf_stats); +DEVICE_ATTR(perf_stats, 0400, perf_stats_show, NULL); static ssize_t flags_show(struct device *dev, struct device_attribute *attr, char *buf) -- 2.26.2
[PATCH v4 2/2] powerpc/papr_scm: Add support for fetching nvdimm 'fuel-gauge' metric
We add support for reporting 'fuel-gauge' NVDIMM metric via PAPR_PDSM_HEALTH pdsm payload. 'fuel-gauge' metric indicates the usage life remaining of a papr-scm compatible NVDIMM. PHYP exposes this metric via the H_SCM_PERFORMANCE_STATS. The metric value is returned from the pdsm by extending the return payload 'struct nd_papr_pdsm_health' without breaking the ABI. A new field 'dimm_fuel_gauge' to hold the metric value is introduced at the end of the payload struct and its presence is indicated by by extension flag PDSM_DIMM_HEALTH_RUN_GAUGE_VALID. The patch introduces a new function papr_pdsm_fuel_gauge() that is called from papr_pdsm_health(). If fetching NVDIMM performance stats is supported then 'papr_pdsm_fuel_gauge()' allocated an output buffer large enough to hold the performance stat and passes it to drc_pmem_query_stats() that issues the HCALL to PHYP. The return value of the stat is then populated in the 'struct nd_papr_pdsm_health.dimm_fuel_gauge' field with extension flag 'PDSM_DIMM_HEALTH_RUN_GAUGE_VALID' set in 'struct nd_papr_pdsm_health.extension_flags' Signed-off-by: Vaibhav Jain --- Changelog: v4: * Moved a hunk from this patch to previous patch in the series. [ Aneesh ] v3: * Updated papr_pdsm_fuel_guage() to use the updated drc_pmem_query_stats() function. Resend: None v2: * Restructure code in papr_pdsm_fuel_gauge() to handle error case first [ Ira ] * Ignore the return value of papr_pdsm_fuel_gauge() in papr_psdm_health() [ Ira ] --- arch/powerpc/include/uapi/asm/papr_pdsm.h | 9 + arch/powerpc/platforms/pseries/papr_scm.c | 49 +++ 2 files changed, 58 insertions(+) diff --git a/arch/powerpc/include/uapi/asm/papr_pdsm.h b/arch/powerpc/include/uapi/asm/papr_pdsm.h index 9ccecc1d6840..50ef95e2f5b1 100644 --- a/arch/powerpc/include/uapi/asm/papr_pdsm.h +++ b/arch/powerpc/include/uapi/asm/papr_pdsm.h @@ -72,6 +72,11 @@ #define PAPR_PDSM_DIMM_CRITICAL 2 #define PAPR_PDSM_DIMM_FATAL 3 +/* struct nd_papr_pdsm_health.extension_flags field flags */ + +/* Indicate that the 'dimm_fuel_gauge' field is valid */ +#define PDSM_DIMM_HEALTH_RUN_GAUGE_VALID 1 + /* * Struct exchanged between kernel & ndctl in for PAPR_PDSM_HEALTH * Various flags indicate the health status of the dimm. @@ -84,6 +89,7 @@ * dimm_locked : Contents of the dimm cant be modified until CEC reboot * dimm_encrypted : Contents of dimm are encrypted. * dimm_health : Dimm health indicator. One of PAPR_PDSM_DIMM_ + * dimm_fuel_gauge : Life remaining of DIMM as a percentage from 0-100 */ struct nd_papr_pdsm_health { union { @@ -96,6 +102,9 @@ struct nd_papr_pdsm_health { __u8 dimm_locked; __u8 dimm_encrypted; __u16 dimm_health; + + /* Extension flag PDSM_DIMM_HEALTH_RUN_GAUGE_VALID */ + __u16 dimm_fuel_gauge; }; __u8 buf[ND_PDSM_PAYLOAD_MAX_SIZE]; }; diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c index f37f3f70007d..f439f0dfea7d 100644 --- a/arch/powerpc/platforms/pseries/papr_scm.c +++ b/arch/powerpc/platforms/pseries/papr_scm.c @@ -518,6 +518,51 @@ static int is_cmd_valid(struct nvdimm *nvdimm, unsigned int cmd, void *buf, return 0; } +static int papr_pdsm_fuel_gauge(struct papr_scm_priv *p, + union nd_pdsm_payload *payload) +{ + int rc, size; + u64 statval; + struct papr_scm_perf_stat *stat; + struct papr_scm_perf_stats *stats; + + /* Silently fail if fetching performance metrics isn't supported */ + if (!p->stat_buffer_len) + return 0; + + /* Allocate request buffer enough to hold single performance stat */ + size = sizeof(struct papr_scm_perf_stats) + + sizeof(struct papr_scm_perf_stat); + + stats = kzalloc(size, GFP_KERNEL); + if (!stats) + return -ENOMEM; + + stat = >scm_statistic[0]; + memcpy(>stat_id, "MemLife ", sizeof(stat->stat_id)); + stat->stat_val = 0; + + /* Fetch the fuel gauge and populate it in payload */ + rc = drc_pmem_query_stats(p, stats, 1); + if (rc < 0) { + dev_dbg(>pdev->dev, "Err(%d) fetching fuel gauge\n", rc); + goto free_stats; + } + + statval = be64_to_cpu(stat->stat_val); + dev_dbg(>pdev->dev, + "Fetched fuel-gauge %llu", statval); + payload->health.extension_flags |= + PDSM_DIMM_HEALTH_RUN_GAUGE_VALID; + payload->health.dimm_fuel_gauge = statval; + + rc = sizeof(struct nd_papr_pdsm_health); + +free_stats: + kfree(stats); + return rc; +} + /* Fetch the DIMM health info and populate it in provided package. */ static int papr_pdsm_health(s
[PATCH v4 1/2] powerpc/papr_scm: Fetch nvdimm performance stats from PHYP
Update papr_scm.c to query dimm performance statistics from PHYP via H_SCM_PERFORMANCE_STATS hcall and export them to user-space as PAPR specific NVDIMM attribute 'perf_stats' in sysfs. The patch also provide a sysfs ABI documentation for the stats being reported and their meanings. During NVDIMM probe time in papr_scm_nvdimm_init() a special variant of H_SCM_PERFORMANCE_STATS hcall is issued to check if collection of performance statistics is supported or not. If successful then a PHYP returns a maximum possible buffer length needed to read all performance stats. This returned value is stored in a per-nvdimm attribute 'stat_buffer_len'. The layout of request buffer for reading NVDIMM performance stats from PHYP is defined in 'struct papr_scm_perf_stats' and 'struct papr_scm_perf_stat'. These structs are used in newly introduced drc_pmem_query_stats() that issues the H_SCM_PERFORMANCE_STATS hcall. The sysfs access function perf_stats_show() uses value 'stat_buffer_len' to allocate a buffer large enough to hold all possible NVDIMM performance stats and passes it to drc_pmem_query_stats() to populate. Finally statistics reported in the buffer are formatted into the sysfs access function output buffer. Signed-off-by: Vaibhav Jain --- Changelog: v4: * Fixed a build issue with this patch by moving a hunk from second patch in series to this patch. [ Aneesh ] v3: * Updated drc_pmem_query_stats() to not require 'buff_size' and 'out' args to the function. Instead 'buff_size' is calculated from 'num_stats' and instead of populating 'R4' in arg 'out' the value is returned from the function in case 'R4' represents 'max-buffer-size'. Resend: None v2: * Updated 'struct papr_scm_perf_stats' and 'struct papr_scm_perf_stat' to use big-endian types. [ Aneesh ] * s/len_stat_buffer/stat_buffer_len/ [ Aneesh ] * s/statistics_id/stat_id/ , s/statistics_val/stat_val/ [ Aneesh ] * Conversion from Big endian to cpu endian happens later rather than just after its fetched from PHYP. * Changed a log statement to unambiguously report dimm performance stats are not available for the given nvdimm [ Ira ] * Restructed some code to handle error case first [ Ira ] --- Documentation/ABI/testing/sysfs-bus-papr-pmem | 27 arch/powerpc/platforms/pseries/papr_scm.c | 150 ++ 2 files changed, 177 insertions(+) diff --git a/Documentation/ABI/testing/sysfs-bus-papr-pmem b/Documentation/ABI/testing/sysfs-bus-papr-pmem index 5b10d036a8d4..c1a67275c43f 100644 --- a/Documentation/ABI/testing/sysfs-bus-papr-pmem +++ b/Documentation/ABI/testing/sysfs-bus-papr-pmem @@ -25,3 +25,30 @@ Description: NVDIMM have been scrubbed. * "locked" : Indicating that NVDIMM contents cant be modified until next power cycle. + +What: /sys/bus/nd/devices/nmemX/papr/perf_stats +Date: May, 2020 +KernelVersion: v5.9 +Contact: linuxppc-dev , linux-nvd...@lists.01.org, +Description: + (RO) Report various performance stats related to papr-scm NVDIMM + device. Each stat is reported on a new line with each line + composed of a stat-identifier followed by it value. Below are + currently known dimm performance stats which are reported: + + * "CtlResCt" : Controller Reset Count + * "CtlResTm" : Controller Reset Elapsed Time + * "PonSecs " : Power-on Seconds + * "MemLife " : Life Remaining + * "CritRscU" : Critical Resource Utilization + * "HostLCnt" : Host Load Count + * "HostSCnt" : Host Store Count + * "HostSDur" : Host Store Duration + * "HostLDur" : Host Load Duration + * "MedRCnt " : Media Read Count + * "MedWCnt " : Media Write Count + * "MedRDur " : Media Read Duration + * "MedWDur " : Media Write Duration + * "CchRHCnt" : Cache Read Hit Count + * "CchWHCnt" : Cache Write Hit Count + * "FastWCnt" : Fast Write Count \ No newline at end of file diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c index 3d1235a76ba9..f37f3f70007d 100644 --- a/arch/powerpc/platforms/pseries/papr_scm.c +++ b/arch/powerpc/platforms/pseries/papr_scm.c @@ -64,6 +64,26 @@ PAPR_PMEM_HEALTH_FATAL |\ PAPR_PMEM_HEALTH_UNHEALTHY) +#define PAPR_SCM_PERF_STATS_EYECATCHER __stringify(SCMSTATS) +#define PAPR_SCM_PERF_STATS_VERSION 0x1 + +/* Struct holding a single performance metric */ +struct papr_scm_perf_stat { + u8 stat_id[8]; + __be64 stat_
[PATCH v4 0/2] powerpc/papr_scm: add support for reporting NVDIMM 'life_used_percentage' metric
Changes since v3[1]: * Fixed a rebase issue pointed out by Aneesh in first patch in the series. [1] https://lore.kernel.org/linux-nvdimm/20200730121303.134230-1-vaib...@linux.ibm.com --- This small patchset implements kernel side support for reporting 'life_used_percentage' metric in NDCTL with dimm health output for papr-scm NVDIMMs. With corresponding NDCTL side changes output for should be like: $ sudo ndctl list -DH [ { "dev":"nmem0", "health":{ "health_state":"ok", "life_used_percentage":0, "shutdown_state":"clean" } } ] PHYP supports H_SCM_PERFORMANCE_STATS hcall through which an LPAR can fetch various performance stats including 'fuel_gauge' percentage for an NVDIMM. 'fuel_gauge' metric indicates the usable life remaining of an NVDIMM expressed as percentage and 'life_used_percentage' can be calculated as 'life_used_percentage = 100 - fuel_gauge'. Structure of the patchset = First patch implements necessary scaffolding needed to issue the H_SCM_PERFORMANCE_STATS hcall and fetch performance stats catalogue. The patch also implements support for 'perf_stats' sysfs attribute to report the full catalogue of supported performance stats by PHYP. Second and final patch implements support for sending this value to libndctl by extending the PAPR_PDSM_HEALTH pdsm payload to add a new field named 'dimm_fuel_gauge' to it. Vaibhav Jain (2): powerpc/papr_scm: Fetch nvdimm performance stats from PHYP powerpc/papr_scm: Add support for fetching nvdimm 'fuel-gauge' metric Documentation/ABI/testing/sysfs-bus-papr-pmem | 27 +++ arch/powerpc/include/uapi/asm/papr_pdsm.h | 9 + arch/powerpc/platforms/pseries/papr_scm.c | 199 ++ 3 files changed, 235 insertions(+) -- 2.26.2
[PATCH v3 2/2] powerpc/papr_scm: Add support for fetching nvdimm 'fuel-gauge' metric
We add support for reporting 'fuel-gauge' NVDIMM metric via PAPR_PDSM_HEALTH pdsm payload. 'fuel-gauge' metric indicates the usage life remaining of a papr-scm compatible NVDIMM. PHYP exposes this metric via the H_SCM_PERFORMANCE_STATS. The metric value is returned from the pdsm by extending the return payload 'struct nd_papr_pdsm_health' without breaking the ABI. A new field 'dimm_fuel_gauge' to hold the metric value is introduced at the end of the payload struct and its presence is indicated by by extension flag PDSM_DIMM_HEALTH_RUN_GAUGE_VALID. The patch introduces a new function papr_pdsm_fuel_gauge() that is called from papr_pdsm_health(). If fetching NVDIMM performance stats is supported then 'papr_pdsm_fuel_gauge()' allocated an output buffer large enough to hold the performance stat and passes it to drc_pmem_query_stats() that issues the HCALL to PHYP. The return value of the stat is then populated in the 'struct nd_papr_pdsm_health.dimm_fuel_gauge' field with extension flag 'PDSM_DIMM_HEALTH_RUN_GAUGE_VALID' set in 'struct nd_papr_pdsm_health.extension_flags' Signed-off-by: Vaibhav Jain --- Changelog: v3: * Updated papr_pdsm_fuel_guage() to use the updated drc_pmem_query_stats() function. Resend: None v2: * Restructure code in papr_pdsm_fuel_gauge() to handle error case first [ Ira ] * Ignore the return value of papr_pdsm_fuel_gauge() in papr_psdm_health() [ Ira ] --- arch/powerpc/include/uapi/asm/papr_pdsm.h | 9 arch/powerpc/platforms/pseries/papr_scm.c | 51 ++- 2 files changed, 59 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/uapi/asm/papr_pdsm.h b/arch/powerpc/include/uapi/asm/papr_pdsm.h index 9ccecc1d6840..50ef95e2f5b1 100644 --- a/arch/powerpc/include/uapi/asm/papr_pdsm.h +++ b/arch/powerpc/include/uapi/asm/papr_pdsm.h @@ -72,6 +72,11 @@ #define PAPR_PDSM_DIMM_CRITICAL 2 #define PAPR_PDSM_DIMM_FATAL 3 +/* struct nd_papr_pdsm_health.extension_flags field flags */ + +/* Indicate that the 'dimm_fuel_gauge' field is valid */ +#define PDSM_DIMM_HEALTH_RUN_GAUGE_VALID 1 + /* * Struct exchanged between kernel & ndctl in for PAPR_PDSM_HEALTH * Various flags indicate the health status of the dimm. @@ -84,6 +89,7 @@ * dimm_locked : Contents of the dimm cant be modified until CEC reboot * dimm_encrypted : Contents of dimm are encrypted. * dimm_health : Dimm health indicator. One of PAPR_PDSM_DIMM_ + * dimm_fuel_gauge : Life remaining of DIMM as a percentage from 0-100 */ struct nd_papr_pdsm_health { union { @@ -96,6 +102,9 @@ struct nd_papr_pdsm_health { __u8 dimm_locked; __u8 dimm_encrypted; __u16 dimm_health; + + /* Extension flag PDSM_DIMM_HEALTH_RUN_GAUGE_VALID */ + __u16 dimm_fuel_gauge; }; __u8 buf[ND_PDSM_PAYLOAD_MAX_SIZE]; }; diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c index 29cab86141d8..837a21083268 100644 --- a/arch/powerpc/platforms/pseries/papr_scm.c +++ b/arch/powerpc/platforms/pseries/papr_scm.c @@ -518,6 +518,51 @@ static int is_cmd_valid(struct nvdimm *nvdimm, unsigned int cmd, void *buf, return 0; } +static int papr_pdsm_fuel_gauge(struct papr_scm_priv *p, + union nd_pdsm_payload *payload) +{ + int rc, size; + u64 statval; + struct papr_scm_perf_stat *stat; + struct papr_scm_perf_stats *stats; + + /* Silently fail if fetching performance metrics isn't supported */ + if (!p->stat_buffer_len) + return 0; + + /* Allocate request buffer enough to hold single performance stat */ + size = sizeof(struct papr_scm_perf_stats) + + sizeof(struct papr_scm_perf_stat); + + stats = kzalloc(size, GFP_KERNEL); + if (!stats) + return -ENOMEM; + + stat = >scm_statistic[0]; + memcpy(>stat_id, "MemLife ", sizeof(stat->stat_id)); + stat->stat_val = 0; + + /* Fetch the fuel gauge and populate it in payload */ + rc = drc_pmem_query_stats(p, stats, 1); + if (rc < 0) { + dev_dbg(>pdev->dev, "Err(%d) fetching fuel gauge\n", rc); + goto free_stats; + } + + statval = be64_to_cpu(stat->stat_val); + dev_dbg(>pdev->dev, + "Fetched fuel-gauge %llu", statval); + payload->health.extension_flags |= + PDSM_DIMM_HEALTH_RUN_GAUGE_VALID; + payload->health.dimm_fuel_gauge = statval; + + rc = sizeof(struct nd_papr_pdsm_health); + +free_stats: + kfree(stats); + return rc; +} + /* Fetch the DIMM health info and populate it in provided package. */ static int papr_pdsm_health(struct papr_scm_priv *p, union nd_pdsm_
[PATCH v3 0/2] powerpc/papr_scm: add support for reporting NVDIMM 'life_used_percentage' metric
Changes since v2[1]: * Updated drc_pmem_query_stats() to reduce the number of input args to the function based suggestions from Aneesh. [1] https://lore.kernel.org/linux-nvdimm/20200726122030.31529-1-vaib...@linux.ibm.com --- This small patchset implements kernel side support for reporting 'life_used_percentage' metric in NDCTL with dimm health output for papr-scm NVDIMMs. With corresponding NDCTL side changes output for should be like: $ sudo ndctl list -DH [ { "dev":"nmem0", "health":{ "health_state":"ok", "life_used_percentage":0, "shutdown_state":"clean" } } ] PHYP supports H_SCM_PERFORMANCE_STATS hcall through which an LPAR can fetch various performance stats including 'fuel_gauge' percentage for an NVDIMM. 'fuel_gauge' metric indicates the usable life remaining of an NVDIMM expressed as percentage and 'life_used_percentage' can be calculated as 'life_used_percentage = 100 - fuel_gauge'. Structure of the patchset = First patch implements necessary scaffolding needed to issue the H_SCM_PERFORMANCE_STATS hcall and fetch performance stats catalogue. The patch also implements support for 'perf_stats' sysfs attribute to report the full catalogue of supported performance stats by PHYP. Second and final patch implements support for sending this value to libndctl by extending the PAPR_PDSM_HEALTH pdsm payload to add a new field named 'dimm_fuel_gauge' to it. Vaibhav Jain (2): powerpc/papr_scm: Fetch nvdimm performance stats from PHYP powerpc/papr_scm: Add support for fetching nvdimm 'fuel-gauge' metric Documentation/ABI/testing/sysfs-bus-papr-pmem | 27 +++ arch/powerpc/include/uapi/asm/papr_pdsm.h | 9 + arch/powerpc/platforms/pseries/papr_scm.c | 199 ++ 3 files changed, 235 insertions(+) -- 2.26.2
[PATCH v3 1/2] powerpc/papr_scm: Fetch nvdimm performance stats from PHYP
Update papr_scm.c to query dimm performance statistics from PHYP via H_SCM_PERFORMANCE_STATS hcall and export them to user-space as PAPR specific NVDIMM attribute 'perf_stats' in sysfs. The patch also provide a sysfs ABI documentation for the stats being reported and their meanings. During NVDIMM probe time in papr_scm_nvdimm_init() a special variant of H_SCM_PERFORMANCE_STATS hcall is issued to check if collection of performance statistics is supported or not. If successful then a PHYP returns a maximum possible buffer length needed to read all performance stats. This returned value is stored in a per-nvdimm attribute 'stat_buffer_len'. The layout of request buffer for reading NVDIMM performance stats from PHYP is defined in 'struct papr_scm_perf_stats' and 'struct papr_scm_perf_stat'. These structs are used in newly introduced drc_pmem_query_stats() that issues the H_SCM_PERFORMANCE_STATS hcall. The sysfs access function perf_stats_show() uses value 'stat_buffer_len' to allocate a buffer large enough to hold all possible NVDIMM performance stats and passes it to drc_pmem_query_stats() to populate. Finally statistics reported in the buffer are formatted into the sysfs access function output buffer. Signed-off-by: Vaibhav Jain --- Changelog: v3: * Updated drc_pmem_query_stats() to not require 'buff_size' and 'out' args to the function. Instead 'buff_size' is calculated from 'num_stats' and instead of populating 'R4' in arg 'out' the value is returned from the function in case 'R4' represents 'max-buffer-size'. [ Aneesh ] Resend: None v2: * Updated 'struct papr_scm_perf_stats' and 'struct papr_scm_perf_stat' to use big-endian types. [ Aneesh ] * s/len_stat_buffer/stat_buffer_len/ [ Aneesh ] * s/statistics_id/stat_id/ , s/statistics_val/stat_val/ [ Aneesh ] * Conversion from Big endian to cpu endian happens later rather than just after its fetched from PHYP. * Changed a log statement to unambiguously report dimm performance stats are not available for the given nvdimm [ Ira ] * Restructed some code to handle error case first [ Ira ] --- Documentation/ABI/testing/sysfs-bus-papr-pmem | 27 arch/powerpc/platforms/pseries/papr_scm.c | 150 ++ 2 files changed, 177 insertions(+) diff --git a/Documentation/ABI/testing/sysfs-bus-papr-pmem b/Documentation/ABI/testing/sysfs-bus-papr-pmem index 5b10d036a8d4..c1a67275c43f 100644 --- a/Documentation/ABI/testing/sysfs-bus-papr-pmem +++ b/Documentation/ABI/testing/sysfs-bus-papr-pmem @@ -25,3 +25,30 @@ Description: NVDIMM have been scrubbed. * "locked" : Indicating that NVDIMM contents cant be modified until next power cycle. + +What: /sys/bus/nd/devices/nmemX/papr/perf_stats +Date: May, 2020 +KernelVersion: v5.9 +Contact: linuxppc-dev , linux-nvd...@lists.01.org, +Description: + (RO) Report various performance stats related to papr-scm NVDIMM + device. Each stat is reported on a new line with each line + composed of a stat-identifier followed by it value. Below are + currently known dimm performance stats which are reported: + + * "CtlResCt" : Controller Reset Count + * "CtlResTm" : Controller Reset Elapsed Time + * "PonSecs " : Power-on Seconds + * "MemLife " : Life Remaining + * "CritRscU" : Critical Resource Utilization + * "HostLCnt" : Host Load Count + * "HostSCnt" : Host Store Count + * "HostSDur" : Host Store Duration + * "HostLDur" : Host Load Duration + * "MedRCnt " : Media Read Count + * "MedWCnt " : Media Write Count + * "MedRDur " : Media Read Duration + * "MedWDur " : Media Write Duration + * "CchRHCnt" : Cache Read Hit Count + * "CchWHCnt" : Cache Write Hit Count + * "FastWCnt" : Fast Write Count \ No newline at end of file diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c index 8fd441d32487..29cab86141d8 100644 --- a/arch/powerpc/platforms/pseries/papr_scm.c +++ b/arch/powerpc/platforms/pseries/papr_scm.c @@ -64,6 +64,26 @@ PAPR_PMEM_HEALTH_FATAL |\ PAPR_PMEM_HEALTH_UNHEALTHY) +#define PAPR_SCM_PERF_STATS_EYECATCHER __stringify(SCMSTATS) +#define PAPR_SCM_PERF_STATS_VERSION 0x1 + +/* Struct holding a single performance metric */ +struct papr_scm_perf_stat { + u8 stat_id[8]; + __be64 stat_val; +} __packed; + +/* Struct exchanged between kernel and PHYP for fetching drc perf stats */ +struct papr_scm_perf_stats { +
[RESEND PATCH v2 2/2] powerpc/papr_scm: Add support for fetching nvdimm 'fuel-gauge' metric
We add support for reporting 'fuel-gauge' NVDIMM metric via PAPR_PDSM_HEALTH pdsm payload. 'fuel-gauge' metric indicates the usage life remaining of a papr-scm compatible NVDIMM. PHYP exposes this metric via the H_SCM_PERFORMANCE_STATS. The metric value is returned from the pdsm by extending the return payload 'struct nd_papr_pdsm_health' without breaking the ABI. A new field 'dimm_fuel_gauge' to hold the metric value is introduced at the end of the payload struct and its presence is indicated by by extension flag PDSM_DIMM_HEALTH_RUN_GAUGE_VALID. The patch introduces a new function papr_pdsm_fuel_gauge() that is called from papr_pdsm_health(). If fetching NVDIMM performance stats is supported then 'papr_pdsm_fuel_gauge()' allocated an output buffer large enough to hold the performance stat and passes it to drc_pmem_query_stats() that issues the HCALL to PHYP. The return value of the stat is then populated in the 'struct nd_papr_pdsm_health.dimm_fuel_gauge' field with extension flag 'PDSM_DIMM_HEALTH_RUN_GAUGE_VALID' set in 'struct nd_papr_pdsm_health.extension_flags' Signed-off-by: Vaibhav Jain --- Changelog: Resend: None v2: * Restructure code in papr_pdsm_fuel_gauge() to handle error case first [ Ira ] * Ignore the return value of papr_pdsm_fuel_gauge() in papr_psdm_health() [ Ira ] --- arch/powerpc/include/uapi/asm/papr_pdsm.h | 9 + arch/powerpc/platforms/pseries/papr_scm.c | 49 +++ 2 files changed, 58 insertions(+) diff --git a/arch/powerpc/include/uapi/asm/papr_pdsm.h b/arch/powerpc/include/uapi/asm/papr_pdsm.h index 9ccecc1d6840..50ef95e2f5b1 100644 --- a/arch/powerpc/include/uapi/asm/papr_pdsm.h +++ b/arch/powerpc/include/uapi/asm/papr_pdsm.h @@ -72,6 +72,11 @@ #define PAPR_PDSM_DIMM_CRITICAL 2 #define PAPR_PDSM_DIMM_FATAL 3 +/* struct nd_papr_pdsm_health.extension_flags field flags */ + +/* Indicate that the 'dimm_fuel_gauge' field is valid */ +#define PDSM_DIMM_HEALTH_RUN_GAUGE_VALID 1 + /* * Struct exchanged between kernel & ndctl in for PAPR_PDSM_HEALTH * Various flags indicate the health status of the dimm. @@ -84,6 +89,7 @@ * dimm_locked : Contents of the dimm cant be modified until CEC reboot * dimm_encrypted : Contents of dimm are encrypted. * dimm_health : Dimm health indicator. One of PAPR_PDSM_DIMM_ + * dimm_fuel_gauge : Life remaining of DIMM as a percentage from 0-100 */ struct nd_papr_pdsm_health { union { @@ -96,6 +102,9 @@ struct nd_papr_pdsm_health { __u8 dimm_locked; __u8 dimm_encrypted; __u16 dimm_health; + + /* Extension flag PDSM_DIMM_HEALTH_RUN_GAUGE_VALID */ + __u16 dimm_fuel_gauge; }; __u8 buf[ND_PDSM_PAYLOAD_MAX_SIZE]; }; diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c index 3172a018..80261d241831 100644 --- a/arch/powerpc/platforms/pseries/papr_scm.c +++ b/arch/powerpc/platforms/pseries/papr_scm.c @@ -504,6 +504,51 @@ static int is_cmd_valid(struct nvdimm *nvdimm, unsigned int cmd, void *buf, return 0; } +static int papr_pdsm_fuel_gauge(struct papr_scm_priv *p, + union nd_pdsm_payload *payload) +{ + int rc, size; + u64 statval; + struct papr_scm_perf_stat *stat; + struct papr_scm_perf_stats *stats; + + /* Silently fail if fetching performance metrics isn't supported */ + if (!p->stat_buffer_len) + return 0; + + /* Allocate request buffer enough to hold single performance stat */ + size = sizeof(struct papr_scm_perf_stats) + + sizeof(struct papr_scm_perf_stat); + + stats = kzalloc(size, GFP_KERNEL); + if (!stats) + return -ENOMEM; + + stat = >scm_statistic[0]; + memcpy(>stat_id, "MemLife ", sizeof(stat->stat_id)); + stat->stat_val = 0; + + /* Fetch the fuel gauge and populate it in payload */ + rc = drc_pmem_query_stats(p, stats, size, 1, NULL); + if (rc) { + dev_dbg(>pdev->dev, "Err(%d) fetching fuel gauge\n", rc); + goto free_stats; + } + + statval = be64_to_cpu(stat->stat_val); + dev_dbg(>pdev->dev, + "Fetched fuel-gauge %llu", statval); + payload->health.extension_flags |= + PDSM_DIMM_HEALTH_RUN_GAUGE_VALID; + payload->health.dimm_fuel_gauge = statval; + + rc = sizeof(struct nd_papr_pdsm_health); + +free_stats: + kfree(stats); + return rc; +} + /* Fetch the DIMM health info and populate it in provided package. */ static int papr_pdsm_health(struct papr_scm_priv *p, union nd_pdsm_payload *payload) @@ -544,6 +589,10 @@ static int papr_pdsm_health(struct papr_scm_priv *p, /* st