[PATCH v2] KVM: PPC: Book3S HV nestedv2: Cancel pending DEC exception

2024-04-14 Thread Vaibhav Jain
This reverts commit 180c6b072bf3 ("KVM: PPC: Book3S HV nestedv2: Do not
cancel pending decrementer exception") [1] which prevented canceling a
pending HDEC exception for nestedv2 KVM guests. It was done to avoid
overhead of a H_GUEST_GET_STATE hcall to read the 'DEC expiry TB' register
which was higher compared to handling extra decrementer exceptions.

However recent benchmarks indicate that overhead of not handling 'DECR'
expiry for Nested KVM Guest(L2) is higher and results in much larger exits
to Pseries Host(L1) as indicated by the Unixbench-arithoh bench[2]

Metric| Current upstream| Revert [1]  | Difference %

arithoh-count (10)| 3244831634  | 3403089673  | +04.88%
kvm_hv:kvm_guest_exit | 513558  | 152441  | -70.32%
probe:kvmppc_gsb_recv | 28060   | 28110   | +00.18%

N=1

As indicated by the data above that reverting [1] results in substantial
reduction in number of L2->L1 exits with only slight increase in number of
H_GUEST_GET_STATE hcalls to read the value of 'DEC expiry TB'. This results
in an overall ~4% improvement of arithoh[2] throughput.

[1] commit 180c6b072bf3 ("KVM: PPC: Book3S HV nestedv2: Do not cancel pending 
decrementer exception")
[2] https://github.com/kdlucas/byte-unixbench/

Fixes: 180c6b072bf3 ("KVM: PPC: Book3S HV nestedv2: Do not cancel pending 
decrementer exception")
Signed-off-by: Vaibhav Jain 

---
Changelog:
Since v1: 
https://lore.kernel.org/all/20240313072625.76804-1-vaib...@linux.ibm.com
* Updated/Corrected patch title and description
* Included data on test benchmark results for Unixbench-arithoh bench.
---
 arch/powerpc/kvm/book3s_hv.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 8e86eb577eb8..692a7c6f5fd9 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -4857,7 +4857,7 @@ int kvmhv_run_single_vcpu(struct kvm_vcpu *vcpu, u64 
time_limit,
 * entering a nested guest in which case the decrementer is now owned
 * by L2 and the L1 decrementer is provided in hdec_expires
 */
-   if (!kvmhv_is_nestedv2() && kvmppc_core_pending_dec(vcpu) &&
+   if (kvmppc_core_pending_dec(vcpu) &&
((tb < kvmppc_dec_expires_host_tb(vcpu)) ||
 (trap == BOOK3S_INTERRUPT_SYSCALL &&
  kvmppc_get_gpr(vcpu, 3) == H_ENTER_NESTED)))
-- 
2.44.0



[PATCH] KVM: PPC: Book3S HV nestedv2: Cancel pending HDEC exception

2024-03-13 Thread Vaibhav Jain
This reverts commit 180c6b072bf360b686e53d893d8dcf7dbbaec6bb ("KVM: PPC:
Book3S HV nestedv2: Do not cancel pending decrementer exception") which
prevented cancelling a pending HDEC exception for nestedv2 KVM guests. It
was done to avoid overhead of a H_GUEST_GET_STATE hcall to read the 'HDEC
expiry TB' register which was higher compared to handling extra decrementer
exceptions.

This overhead of reading 'HDEC expiry TB' register has been mitigated
recently by the L0 hypervisor(PowerVM) by putting the value of this
register in L2 guest-state output buffer on trap to L1. From there the
value of this register is cached, made available in kvmhv_run_single_vcpu()
to compare it against host(L1) timebase and cancel the pending hypervisor
decrementer exception if needed.

Fixes: 180c6b072bf3 ("KVM: PPC: Book3S HV nestedv2: Do not cancel pending 
decrementer exception")
Signed-off-by: Vaibhav Jain 
---
 arch/powerpc/kvm/book3s_hv.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 0b921704da45..e47b954ce266 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -4856,7 +4856,7 @@ int kvmhv_run_single_vcpu(struct kvm_vcpu *vcpu, u64 
time_limit,
 * entering a nested guest in which case the decrementer is now owned
 * by L2 and the L1 decrementer is provided in hdec_expires
 */
-   if (!kvmhv_is_nestedv2() && kvmppc_core_pending_dec(vcpu) &&
+   if (kvmppc_core_pending_dec(vcpu) &&
((tb < kvmppc_dec_expires_host_tb(vcpu)) ||
 (trap == BOOK3S_INTERRUPT_SYSCALL &&
  kvmppc_get_gpr(vcpu, 3) == H_ENTER_NESTED)))
-- 
2.44.0



Re: [PATCH 2/2] powerpc/pseries: Set CPU_FTR_DBELL according to ibm,pi-features

2024-02-07 Thread Vaibhav Jain
Nicholas Piggin  writes:

> PAPR will define a new ibm,pi-features bit which says that doorbells
> should not be used even on architectures where they exist. This could be
> because they are emulated and slower than using the interrupt controller
> directly for IPIs.
>
> Wire this bit into the pi-features parser to clear CPU_FTR_DBELL, and
> ensure CPU_FTR_DBELL is not in CPU_FTRS_ALWAYS.
>
> Signed-off-by: Nicholas Piggin 
> ---

Tested this patch on a PP64-LE lpar with the patch[1] and seeing the
relevant pi-feature bit CPU_FTR_DBELL in `cur_cpu_spec->cpu_features`
getting cleared.


[1]
https://lore.kernel.org/all/20240207035220.339726-1-npig...@gmail.com

Hence,
Tested-by: Vaibhav Jain 

-- 
Cheers
~ Vaibhav


Re: [PATCH 1/2] powerpc/pseries: Add a clear modifier to ibm,pa/pi-features parser

2024-02-07 Thread Vaibhav Jain
Nicholas Piggin  writes:

> When a new ibm,pa/pi-features bit is introduced that is intended to
> apply to existing systems and features, it may have an "inverted"
> meaning (i.e., bit clear => feature available; bit set => unavailable).
> Depending on the nature of the feature, this may give the best
> backward compatibility result where old firmware will continue to
> have that bit clear and therefore the feature available.
>
> The 'invert' modifier presumably was introduced for this type of
> feature bit. However it invert will set the feature if the bit is
> clear, which prevents it being used in the situation where an old
> CPU lacks a feature that a new CPU has, then a new firmware comes
> out to disable that feature on the new CPU if the bit is set.
> Adding an 'invert' entry for that feature would incorrectly enable
> it for the old CPU.
>
> So add a 'clear' modifier that clears the feature if the bit is set,
> but it does not set the feature if the bit is clear. The feature
> is expected to be set in the cpu table.
>
> This replaces the 'invert' modifier, which is unused since commit
> 7d4703455168 ("powerpc/feature: Remove CPU_FTR_NODSISRALIGN").
>
> Signed-off-by: Nicholas Piggin 
> ---


Tested this patch on a PP64-LE lpar with the patch[1] and seeing the
relevant pi-feature bit CPU_FTR_DBELL getting cleared.

[1]
https://lore.kernel.org/all/20240207035220.339726-2-npig...@gmail.com

Hence,

Tested-by: Vaibhav Jain 

-- 
Cheers
~ Vaibhav


Re: [PATCH v2] KVM: PPC: Book3S HV: Fix L2 guest reboot failure due to empty 'arch_compat'

2024-02-05 Thread Vaibhav Jain
Hi Amit,

Thanks for the patch. Minor comment on the patch below:

Amit Machhiwal  writes:



>  
> +static inline unsigned long map_pcr_to_cap(unsigned long pcr)
> +{
> + unsigned long cap = 0;
> +
> + switch (pcr) {
> + case PCR_ARCH_300:
> + cap = H_GUEST_CAP_POWER9;
> + break;
> + case PCR_ARCH_31:
> + cap = H_GUEST_CAP_POWER10;
Though CONFIG_CC_IMPLICIT_FALLTHROUGH and '-Wimplicit-fallthrough'
doesnt explicitly flag this usage, please consider using the
'fallthrough;' keyword here.

However you probably dont want this switch-case to fallthrough so please
use a 'break' instead.

> + default:
> + break;
> + }
> +
> + return cap;
> +}
> +
>


With the suggested change above

Reviewed-by: Vaibhav Jain 

-- 
Cheers
~ Vaibhav


Re: [PATCH] KVM: PPC: Book3S HV nestedv2: Fix an error handling path in gs_msg_ops_kvmhv_nestedv2_config_fill_info()

2024-01-28 Thread Vaibhav Jain




Christophe JAILLET  writes:


>   if (kvmppc_gsm_includes(gsm, KVMPPC_GSID_RUN_OUTPUT)) {
> - kvmppc_gse_put_buff_info(gsb, KVMPPC_GSID_RUN_OUTPUT,
> -  cfg->vcpu_run_output_cfg);
> + rc = kvmppc_gse_put_buff_info(gsb, KVMPPC_GSID_RUN_OUTPUT,
> +   cfg->vcpu_run_output_cfg);
>   if (rc < 0)
>   return rc;
>   }
> -- 
Thanks for catching this and the change looks good to me.

Reviewed-by: Vaibhav Jain 

-- 
Cheers
~ Vaibhav


[PATCH] powerpc/hvcall: Reorder Nestedv2 hcall opcodes

2023-12-19 Thread Vaibhav Jain
This trivial patch reorders the newly introduced hcall opcodes for Nestedv2
to follow the increasing-opcode-number convention followed in
'hvcall.h'. The patch also updates the value for MAX_HCALL_OPCODE which is
at various places in arch code for range checking.

Fixes: 19d31c5f1157 ("KVM: PPC: Add support for nestedv2 guests")
Suggested-by: Michael Ellerman 
Signed-off-by: Vaibhav Jain
---
 arch/powerpc/include/asm/hvcall.h | 21 +++--
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/hvcall.h 
b/arch/powerpc/include/asm/hvcall.h
index ddb99e982917..605ed2b58aff 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -349,7 +349,17 @@
 #define H_GET_ENERGY_SCALE_INFO0x450
 #define H_PKS_SIGNED_UPDATE0x454
 #define H_WATCHDOG 0x45C
-#define MAX_HCALL_OPCODE   H_WATCHDOG
+#define H_WATCHDOG 0x45C
+#define H_GUEST_GET_CAPABILITIES 0x460
+#define H_GUEST_SET_CAPABILITIES 0x464
+#define H_GUEST_CREATE 0x470
+#define H_GUEST_CREATE_VCPU0x474
+#define H_GUEST_GET_STATE  0x478
+#define H_GUEST_SET_STATE  0x47C
+#define H_GUEST_RUN_VCPU   0x480
+#define H_GUEST_COPY_MEMORY0x484
+#define H_GUEST_DELETE 0x488
+#define MAX_HCALL_OPCODE   H_GUEST_DELETE
 
 /* Scope args for H_SCM_UNBIND_ALL */
 #define H_UNBIND_SCOPE_ALL (0x1)
@@ -393,15 +403,6 @@
 #define H_ENTER_NESTED 0xF804
 #define H_TLB_INVALIDATE   0xF808
 #define H_COPY_TOFROM_GUEST0xF80C
-#define H_GUEST_GET_CAPABILITIES 0x460
-#define H_GUEST_SET_CAPABILITIES 0x464
-#define H_GUEST_CREATE 0x470
-#define H_GUEST_CREATE_VCPU0x474
-#define H_GUEST_GET_STATE  0x478
-#define H_GUEST_SET_STATE  0x47C
-#define H_GUEST_RUN_VCPU   0x480
-#define H_GUEST_COPY_MEMORY0x484
-#define H_GUEST_DELETE 0x488
 
 /* Flags for H_SVM_PAGE_IN */
 #define H_PAGE_IN_SHARED0x1
-- 
2.43.0



Re: [PATCH 09/12] KVM: PPC: Book3S HV nestedv2: Do not call H_COPY_TOFROM_GUEST

2023-12-17 Thread Vaibhav Jain
Hi Aneesh,

"Aneesh Kumar K.V"  writes:


>> Yes, Agreed and thats a nice suggestion. However ATM the hypervisor 
>> supporting Nestedv2 doesnt have support for this hcall. In future
>> once we have support for this hcall for nestedv2 from the hypervisor
>> we can replace this branch with a firmware_has_feature() test.
>> 
>
> What I am suggesting is we convert that conditional to firmware_has_feature 
> so that
> later when hypervisor supports this hcall all older kernel can make
> use of the copy_tofrom_guest without any code change.

AFAIK for firmware_has_feature to work we either need:
- A way to call this hcall with some invalid args. However lpid/pid for
guest arent allocated during boot.

- A way for hypervisor to advertise support for this hcall before the L1
kernel boots.

ATM L0 dosent support for any of these two ways. I can do a follow up
patch later when we have a clarity on how we want to advertise support
for this hcall. For now current kernel supporting nestedv2 wont be
using this hcall assuming its not supported. Future kernels can use one
of the two ways above to set the firmware_has_feature flag to take
advantage of this hcall.


-- 
Cheers
~ Vaibhav


Re: [PATCH 01/12] KVM: PPC: Book3S HV nestedv2: Invalidate RPT before deleting a guest

2023-12-17 Thread Vaibhav Jain
"Aneesh Kumar K.V"  writes:

> Vaibhav Jain  writes:
>
>> Hi Aneesh,
>>
>> Thanks for looking into this patch. My responses inline below:
>>
>> "Aneesh Kumar K.V (IBM)"  writes:
>>
>>> Vaibhav Jain  writes:
>>>
>>>> From: Jordan Niethe 
>>>>
>>>> An L0 must invalidate the L2's RPT during H_GUEST_DELETE if this has not
>>>> already been done. This is a slow operation that means H_GUEST_DELETE
>>>> must return H_BUSY multiple times before completing. Invalidating the
>>>> tables before deleting the guest so there is less work for the L0 to do.
>>>>
>>>> Signed-off-by: Jordan Niethe 
>>>> ---
>>>>  arch/powerpc/include/asm/kvm_book3s.h | 1 +
>>>>  arch/powerpc/kvm/book3s_hv.c  | 6 --
>>>>  arch/powerpc/kvm/book3s_hv_nested.c   | 2 +-
>>>>  3 files changed, 6 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
>>>> b/arch/powerpc/include/asm/kvm_book3s.h
>>>> index 4f527d09c92b..a37736ed3728 100644
>>>> --- a/arch/powerpc/include/asm/kvm_book3s.h
>>>> +++ b/arch/powerpc/include/asm/kvm_book3s.h
>>>> @@ -302,6 +302,7 @@ void kvmhv_nested_exit(void);
>>>>  void kvmhv_vm_nested_init(struct kvm *kvm);
>>>>  long kvmhv_set_partition_table(struct kvm_vcpu *vcpu);
>>>>  long kvmhv_copy_tofrom_guest_nested(struct kvm_vcpu *vcpu);
>>>> +void kvmhv_flush_lpid(u64 lpid);
>>>>  void kvmhv_set_ptbl_entry(u64 lpid, u64 dw0, u64 dw1);
>>>>  void kvmhv_release_all_nested(struct kvm *kvm);
>>>>  long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu);
>>>> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
>>>> index 1ed6ec140701..5543e8490cd9 100644
>>>> --- a/arch/powerpc/kvm/book3s_hv.c
>>>> +++ b/arch/powerpc/kvm/book3s_hv.c
>>>> @@ -5691,10 +5691,12 @@ static void kvmppc_core_destroy_vm_hv(struct kvm 
>>>> *kvm)
>>>>kvmhv_set_ptbl_entry(kvm->arch.lpid, 0, 0);
>>>>}
>>>>  
>>>> -  if (kvmhv_is_nestedv2())
>>>> +  if (kvmhv_is_nestedv2()) {
>>>> +  kvmhv_flush_lpid(kvm->arch.lpid);
>>>>plpar_guest_delete(0, kvm->arch.lpid);
>>>>
>>>
>>> I am not sure I follow the optimization here. I would expect the
>>> hypervisor to kill all the translation caches as part of guest_delete.
>>> What is the benefit of doing a lpid flush outside the guest delete?
>>>
>> Thats right. However without this optimization the H_GUEST_DELETE hcall
>> in plpar_guest_delete() returns H_BUSY multiple times resulting in
>> multiple hcalls to the hypervisor until it finishes. Flushing the guest
>> translation cache upfront reduces the number of HCALLs L1 guests has to
>> make to delete a L2 guest via H_GUEST_DELETE.
>>
>
> can we add that as a comment above that kvmhv_flush_lpid()?
Sure, I will put up a comment with that detail in v2 of the patch
series.

>
> -aneesh

-- 
Cheers
~ Vaibhav


Re: [PATCH 09/12] KVM: PPC: Book3S HV nestedv2: Do not call H_COPY_TOFROM_GUEST

2023-12-10 Thread Vaibhav Jain
Hi Aneesh,

Thanks for looking into this patch. My responses inline:

"Aneesh Kumar K.V (IBM)"  writes:


> May be we should use
> firmware_has_feature(FW_FEATURE_H_COPY_TOFROM_GUEST))?
>
> the nestedv2 can end up using the above hcall if it is supported by the
> hypervisor right? In its absence we will have to translate the guest ea
> using xlate and then use kvm_guest_read to read location using the guest
> real address right? That xlate will also involves multiple kvm_guest_read.
>
>
Yes, Agreed and thats a nice suggestion. However ATM the hypervisor
supporting Nestedv2 doesnt have support for this hcall. In future once
we have support for this hcall for nestedv2 from the hypervisor we can
replace this branch with a firmware_has_feature() test.

>> Signed-off-by: Jordan Niethe 
>> ---
>>  arch/powerpc/kvm/book3s_64_mmu_radix.c | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
>> b/arch/powerpc/kvm/book3s_64_mmu_radix.c
>> index 916af6c153a5..4a1abb9f7c05 100644
>> --- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
>> +++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
>> @@ -40,6 +40,9 @@ unsigned long __kvmhv_copy_tofrom_guest_radix(int lpid, 
>> int pid,
>>  unsigned long quadrant, ret = n;
>>  bool is_load = !!to;
>>  
>> +if (kvmhv_is_nestedv2())
>> +return H_UNSUPPORTED;
>> +
>>  /* Can't access quadrants 1 or 2 in non-HV mode, call the HV to do it */
>>  if (kvmhv_on_pseries())
>>  return plpar_hcall_norets(H_COPY_TOFROM_GUEST, lpid, pid, eaddr,
>> -- 
>> 2.42.0

-- 
Cheers
~ Vaibhav


Re: [PATCH 01/12] KVM: PPC: Book3S HV nestedv2: Invalidate RPT before deleting a guest

2023-12-08 Thread Vaibhav Jain


Hi Aneesh,

Thanks for looking into this patch. My responses inline below:

"Aneesh Kumar K.V (IBM)"  writes:

> Vaibhav Jain  writes:
>
>> From: Jordan Niethe 
>>
>> An L0 must invalidate the L2's RPT during H_GUEST_DELETE if this has not
>> already been done. This is a slow operation that means H_GUEST_DELETE
>> must return H_BUSY multiple times before completing. Invalidating the
>> tables before deleting the guest so there is less work for the L0 to do.
>>
>> Signed-off-by: Jordan Niethe 
>> ---
>>  arch/powerpc/include/asm/kvm_book3s.h | 1 +
>>  arch/powerpc/kvm/book3s_hv.c  | 6 --
>>  arch/powerpc/kvm/book3s_hv_nested.c   | 2 +-
>>  3 files changed, 6 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
>> b/arch/powerpc/include/asm/kvm_book3s.h
>> index 4f527d09c92b..a37736ed3728 100644
>> --- a/arch/powerpc/include/asm/kvm_book3s.h
>> +++ b/arch/powerpc/include/asm/kvm_book3s.h
>> @@ -302,6 +302,7 @@ void kvmhv_nested_exit(void);
>>  void kvmhv_vm_nested_init(struct kvm *kvm);
>>  long kvmhv_set_partition_table(struct kvm_vcpu *vcpu);
>>  long kvmhv_copy_tofrom_guest_nested(struct kvm_vcpu *vcpu);
>> +void kvmhv_flush_lpid(u64 lpid);
>>  void kvmhv_set_ptbl_entry(u64 lpid, u64 dw0, u64 dw1);
>>  void kvmhv_release_all_nested(struct kvm *kvm);
>>  long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu);
>> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
>> index 1ed6ec140701..5543e8490cd9 100644
>> --- a/arch/powerpc/kvm/book3s_hv.c
>> +++ b/arch/powerpc/kvm/book3s_hv.c
>> @@ -5691,10 +5691,12 @@ static void kvmppc_core_destroy_vm_hv(struct kvm 
>> *kvm)
>>  kvmhv_set_ptbl_entry(kvm->arch.lpid, 0, 0);
>>  }
>>  
>> -if (kvmhv_is_nestedv2())
>> +if (kvmhv_is_nestedv2()) {
>> +kvmhv_flush_lpid(kvm->arch.lpid);
>>  plpar_guest_delete(0, kvm->arch.lpid);
>>
>
> I am not sure I follow the optimization here. I would expect the
> hypervisor to kill all the translation caches as part of guest_delete.
> What is the benefit of doing a lpid flush outside the guest delete?
>
Thats right. However without this optimization the H_GUEST_DELETE hcall
in plpar_guest_delete() returns H_BUSY multiple times resulting in
multiple hcalls to the hypervisor until it finishes. Flushing the guest
translation cache upfront reduces the number of HCALLs L1 guests has to
make to delete a L2 guest via H_GUEST_DELETE.

>> -else
>> +} else {
>>  kvmppc_free_lpid(kvm->arch.lpid);
>> +}
>>  
>>  kvmppc_free_pimap(kvm);
>>  }
>> diff --git a/arch/powerpc/kvm/book3s_hv_nested.c 
>> b/arch/powerpc/kvm/book3s_hv_nested.c
>> index 3b658b8696bc..5c375ec1a3c6 100644
>> --- a/arch/powerpc/kvm/book3s_hv_nested.c
>> +++ b/arch/powerpc/kvm/book3s_hv_nested.c
>> @@ -503,7 +503,7 @@ void kvmhv_nested_exit(void)
>>  }
>>  }
>>  
>> -static void kvmhv_flush_lpid(u64 lpid)
>> +void kvmhv_flush_lpid(u64 lpid)
>>  {
>>  long rc;
>>  
>> -- 
>> 2.42.0

-- 
Cheers
~ Vaibhav


[PATCH 11/12] KVM: PPC: Reduce reliance on analyse_instr() in mmio emulation

2023-12-01 Thread Vaibhav Jain
From: Jordan Niethe 

Commit 709236039964 ("KVM: PPC: Reimplement non-SIMD LOAD/STORE
instruction mmio emulation with analyse_instr() input") and
commit 2b33cb585f94 ("KVM: PPC: Reimplement LOAD_FP/STORE_FP instruction
mmio emulation with analyse_instr() input") made
kvmppc_emulate_loadstore() use the results from analyse_instr() for
instruction emulation. In particular the effective address from
analyse_instr() is used for UPDATE type instructions and fact that
op.val is all ready endian corrected is used in the STORE case.

However, these changes now have some negative implications for the
nestedv2 case.  For analyse_instr() to determine the correct effective
address, the GPRs must be loaded from the L0. This is not needed as
vcpu->arch.vaddr_accessed is already set. Change back to using
vcpu->arch.vaddr_accessed.

In the STORE case, use kvmppc_get_gpr() value instead of the op.val.
kvmppc_get_gpr() will reload from the L0 if needed in the nestedv2 case.
This means if a byte reversal is needed must now be passed to
kvmppc_handle_store() like in the kvmppc_handle_load() case.

This means the call to kvmhv_nestedv2_reload_ptregs() can be avoided as
there is no concern about op.val being stale. Drop the call to
kvmhv_nestedv2_mark_dirty_ptregs() as without the call to
kvmhv_nestedv2_reload_ptregs(), stale state could be marked as valid.

This is fine as the required marking things dirty is already handled for
the UPDATE case by the call to kvmppc_set_gpr(). For LOADs, it is
handled in kvmppc_complete_mmio_load(). This is called either directly
in __kvmppc_handle_load() if the load can be handled in KVM, or on the
next kvm_arch_vcpu_ioctl_run() if an exit was required.

Signed-off-by: Jordan Niethe 
---
 arch/powerpc/kvm/emulate_loadstore.c | 21 ++---
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/kvm/emulate_loadstore.c 
b/arch/powerpc/kvm/emulate_loadstore.c
index 077fd88a0b68..ec60c7979718 100644
--- a/arch/powerpc/kvm/emulate_loadstore.c
+++ b/arch/powerpc/kvm/emulate_loadstore.c
@@ -93,7 +93,6 @@ int kvmppc_emulate_loadstore(struct kvm_vcpu *vcpu)
 
emulated = EMULATE_FAIL;
vcpu->arch.regs.msr = kvmppc_get_msr(vcpu);
-   kvmhv_nestedv2_reload_ptregs(vcpu, >arch.regs);
if (analyse_instr(, >arch.regs, inst) == 0) {
int type = op.type & INSTR_TYPE_MASK;
int size = GETSIZE(op.type);
@@ -112,7 +111,7 @@ int kvmppc_emulate_loadstore(struct kvm_vcpu *vcpu)
op.reg, size, !instr_byte_swap);
 
if ((op.type & UPDATE) && (emulated != EMULATE_FAIL))
-   kvmppc_set_gpr(vcpu, op.update_reg, op.ea);
+   kvmppc_set_gpr(vcpu, op.update_reg, 
vcpu->arch.vaddr_accessed);
 
break;
}
@@ -132,7 +131,7 @@ int kvmppc_emulate_loadstore(struct kvm_vcpu *vcpu)
 KVM_MMIO_REG_FPR|op.reg, size, 1);
 
if ((op.type & UPDATE) && (emulated != EMULATE_FAIL))
-   kvmppc_set_gpr(vcpu, op.update_reg, op.ea);
+   kvmppc_set_gpr(vcpu, op.update_reg, 
vcpu->arch.vaddr_accessed);
 
break;
 #endif
@@ -224,16 +223,17 @@ int kvmppc_emulate_loadstore(struct kvm_vcpu *vcpu)
break;
}
 #endif
-   case STORE:
-   /* if need byte reverse, op.val has been reversed by
-* analyse_instr().
-*/
-   emulated = kvmppc_handle_store(vcpu, op.val, size, 1);
+   case STORE: {
+   int instr_byte_swap = op.type & BYTEREV;
+
+   emulated = kvmppc_handle_store(vcpu, 
kvmppc_get_gpr(vcpu, op.reg),
+  size, !instr_byte_swap);
 
if ((op.type & UPDATE) && (emulated != EMULATE_FAIL))
-   kvmppc_set_gpr(vcpu, op.update_reg, op.ea);
+   kvmppc_set_gpr(vcpu, op.update_reg, 
vcpu->arch.vaddr_accessed);
 
break;
+   }
 #ifdef CONFIG_PPC_FPU
case STORE_FP:
if (kvmppc_check_fp_disabled(vcpu))
@@ -254,7 +254,7 @@ int kvmppc_emulate_loadstore(struct kvm_vcpu *vcpu)
kvmppc_get_fpr(vcpu, op.reg), size, 1);
 
if ((op.type & UPDATE) && (emulated != EMULATE_FAIL))
-   kvmppc_set_gpr(vcpu, op.update_reg, op.ea);
+   kvmppc_set_gpr(vcpu, op.update_reg, 
vcpu->arch.vaddr_accessed);
 
break;
 #endif
@@ -358,7 +358,6 @@ int kvmppc_emulate_loadstore(struct kvm_vcpu *vcpu)
}
 
trace_kvm_ppc_instr(ppc_inst_val(inst), 

[PATCH 09/12] KVM: PPC: Book3S HV nestedv2: Do not call H_COPY_TOFROM_GUEST

2023-12-01 Thread Vaibhav Jain
From: Jordan Niethe 

H_COPY_TOFROM_GUEST is part of the nestedv1 API and so should not be
called by a nestedv2 host. Do not attempt to call it.

Signed-off-by: Jordan Niethe 
---
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 916af6c153a5..4a1abb9f7c05 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -40,6 +40,9 @@ unsigned long __kvmhv_copy_tofrom_guest_radix(int lpid, int 
pid,
unsigned long quadrant, ret = n;
bool is_load = !!to;
 
+   if (kvmhv_is_nestedv2())
+   return H_UNSUPPORTED;
+
/* Can't access quadrants 1 or 2 in non-HV mode, call the HV to do it */
if (kvmhv_on_pseries())
return plpar_hcall_norets(H_COPY_TOFROM_GUEST, lpid, pid, eaddr,
-- 
2.42.0



[PATCH 12/12] KVM: PPC: Book3S HV nestedv2: Do not cancel pending decrementer exception

2023-12-01 Thread Vaibhav Jain
From: Jordan Niethe 

In the nestedv2 case, if there is a pending decrementer exception, the
L1 must get the L2's timebase from the L0 to see if the exception should
be cancelled. This adds the overhead of a H_GUEST_GET_STATE call to the
likely case in which the decrementer should not be cancelled.

Avoid this logic for the nestedv2 case.

Signed-off-by: Jordan Niethe 
---
 arch/powerpc/kvm/book3s_hv.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 2ee3f2478570..e48126a59ba7 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -4834,7 +4834,7 @@ int kvmhv_run_single_vcpu(struct kvm_vcpu *vcpu, u64 
time_limit,
 * entering a nested guest in which case the decrementer is now owned
 * by L2 and the L1 decrementer is provided in hdec_expires
 */
-   if (kvmppc_core_pending_dec(vcpu) &&
+   if (!kvmhv_is_nestedv2() && kvmppc_core_pending_dec(vcpu) &&
((tb < kvmppc_dec_expires_host_tb(vcpu)) ||
 (trap == BOOK3S_INTERRUPT_SYSCALL &&
  kvmppc_get_gpr(vcpu, 3) == H_ENTER_NESTED)))
-- 
2.42.0



[PATCH 03/12] KVM: PPC: Book3S HV nestedv2: Do not check msr on hcalls

2023-12-01 Thread Vaibhav Jain
From: Jordan Niethe 

The check for a hcall coming from userspace is done for KVM-PR. This is
not supported for nestedv2 and the L0 will directly inject the necessary
exception to the L2 if userspace performs a hcall. Avoid checking the
MSR and thus avoid a H_GUEST_GET_STATE hcall in the L1.

Signed-off-by: Jordan Niethe 
---
 arch/powerpc/kvm/book3s_hv.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 5543e8490cd9..069c336b6f3c 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1688,7 +1688,7 @@ static int kvmppc_handle_exit_hv(struct kvm_vcpu *vcpu,
{
int i;
 
-   if (unlikely(__kvmppc_get_msr_hv(vcpu) & MSR_PR)) {
+   if (!kvmhv_is_nestedv2() && unlikely(__kvmppc_get_msr_hv(vcpu) 
& MSR_PR)) {
/*
 * Guest userspace executed sc 1. This can only be
 * reached by the P9 path because the old path
@@ -4949,7 +4949,7 @@ static int kvmppc_vcpu_run_hv(struct kvm_vcpu *vcpu)
if (run->exit_reason == KVM_EXIT_PAPR_HCALL) {
accumulate_time(vcpu, >arch.hcall);
 
-   if (WARN_ON_ONCE(__kvmppc_get_msr_hv(vcpu) & MSR_PR)) {
+   if (!kvmhv_is_nestedv2() && 
WARN_ON_ONCE(__kvmppc_get_msr_hv(vcpu) & MSR_PR)) {
/*
 * These should have been caught reflected
 * into the guest by now. Final sanity check:
-- 
2.42.0



[PATCH 10/12] KVM: PPC: Book3S HV nestedv2: Register the VPA with the L0

2023-12-01 Thread Vaibhav Jain
From: Jordan Niethe 

In the nestedv2 case, the L1 may register the L2's VPA with the L0. This
allows the L0 to manage the L2's dispatch count, as well as enable
possible performance optimisations by seeing if certain resources are
not being used by the L2 (such as the PMCs).

Use the H_GUEST_SET_STATE call to inform the L0 of the L2's VPA
address. This can not be done in the H_GUEST_VCPU_RUN input buffer.

Signed-off-by: Jordan Niethe 
---
 arch/powerpc/include/asm/kvm_book3s_64.h |  1 +
 arch/powerpc/kvm/book3s_hv.c | 38 ++--
 arch/powerpc/kvm/book3s_hv_nestedv2.c| 29 ++
 3 files changed, 59 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h 
b/arch/powerpc/include/asm/kvm_book3s_64.h
index 2477021bff54..d8729ec81ca0 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -682,6 +682,7 @@ void kvmhv_nestedv2_vcpu_free(struct kvm_vcpu *vcpu, struct 
kvmhv_nestedv2_io *i
 int kvmhv_nestedv2_flush_vcpu(struct kvm_vcpu *vcpu, u64 time_limit);
 int kvmhv_nestedv2_set_ptbl_entry(unsigned long lpid, u64 dw0, u64 dw1);
 int kvmhv_nestedv2_parse_output(struct kvm_vcpu *vcpu);
+int kvmhv_nestedv2_set_vpa(struct kvm_vcpu *vcpu, unsigned long vpa);
 
 #endif /* CONFIG_KVM_BOOK3S_HV_POSSIBLE */
 
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 47fe470375df..2ee3f2478570 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -650,7 +650,8 @@ static unsigned long do_h_register_vpa(struct kvm_vcpu 
*vcpu,
return err;
 }
 
-static void kvmppc_update_vpa(struct kvm_vcpu *vcpu, struct kvmppc_vpa *vpap)
+static void kvmppc_update_vpa(struct kvm_vcpu *vcpu, struct kvmppc_vpa *vpap,
+  struct kvmppc_vpa *old_vpap)
 {
struct kvm *kvm = vcpu->kvm;
void *va;
@@ -690,9 +691,8 @@ static void kvmppc_update_vpa(struct kvm_vcpu *vcpu, struct 
kvmppc_vpa *vpap)
kvmppc_unpin_guest_page(kvm, va, gpa, false);
va = NULL;
}
-   if (vpap->pinned_addr)
-   kvmppc_unpin_guest_page(kvm, vpap->pinned_addr, vpap->gpa,
-   vpap->dirty);
+   *old_vpap = *vpap;
+
vpap->gpa = gpa;
vpap->pinned_addr = va;
vpap->dirty = false;
@@ -702,6 +702,9 @@ static void kvmppc_update_vpa(struct kvm_vcpu *vcpu, struct 
kvmppc_vpa *vpap)
 
 static void kvmppc_update_vpas(struct kvm_vcpu *vcpu)
 {
+   struct kvm *kvm = vcpu->kvm;
+   struct kvmppc_vpa old_vpa = { 0 };
+
if (!(vcpu->arch.vpa.update_pending ||
  vcpu->arch.slb_shadow.update_pending ||
  vcpu->arch.dtl.update_pending))
@@ -709,17 +712,34 @@ static void kvmppc_update_vpas(struct kvm_vcpu *vcpu)
 
spin_lock(>arch.vpa_update_lock);
if (vcpu->arch.vpa.update_pending) {
-   kvmppc_update_vpa(vcpu, >arch.vpa);
-   if (vcpu->arch.vpa.pinned_addr)
+   kvmppc_update_vpa(vcpu, >arch.vpa, _vpa);
+   if (old_vpa.pinned_addr) {
+   if (kvmhv_is_nestedv2())
+   kvmhv_nestedv2_set_vpa(vcpu, ~0ull);
+   kvmppc_unpin_guest_page(kvm, old_vpa.pinned_addr, 
old_vpa.gpa,
+   old_vpa.dirty);
+   }
+   if (vcpu->arch.vpa.pinned_addr) {
init_vpa(vcpu, vcpu->arch.vpa.pinned_addr);
+   if (kvmhv_is_nestedv2())
+   kvmhv_nestedv2_set_vpa(vcpu, 
__pa(vcpu->arch.vpa.pinned_addr));
+   }
}
if (vcpu->arch.dtl.update_pending) {
-   kvmppc_update_vpa(vcpu, >arch.dtl);
+   kvmppc_update_vpa(vcpu, >arch.dtl, _vpa);
+   if (old_vpa.pinned_addr)
+   kvmppc_unpin_guest_page(kvm, old_vpa.pinned_addr, 
old_vpa.gpa,
+   old_vpa.dirty);
vcpu->arch.dtl_ptr = vcpu->arch.dtl.pinned_addr;
vcpu->arch.dtl_index = 0;
}
-   if (vcpu->arch.slb_shadow.update_pending)
-   kvmppc_update_vpa(vcpu, >arch.slb_shadow);
+   if (vcpu->arch.slb_shadow.update_pending) {
+   kvmppc_update_vpa(vcpu, >arch.slb_shadow, _vpa);
+   if (old_vpa.pinned_addr)
+   kvmppc_unpin_guest_page(kvm, old_vpa.pinned_addr, 
old_vpa.gpa,
+   old_vpa.dirty);
+   }
+
spin_unlock(>arch.vpa_update_lock);
 }
 
diff --git a/arch/powerpc/kvm/book3s_hv_nestedv2.c 
b/arch/powerpc/kvm/book3s_hv_nestedv2.c
index fd3c4f2d9480..5378eb40b162 100644
--- a/arch/powerpc/kvm/book3s_hv_nestedv2.c
+++ b/arch/powerpc/kvm/book3s_hv_nestedv2.c
@@ -855,6 +855,35 @@ int kvmhv_nestedv2_set_ptbl_entry(unsigned long lpid, u64 
dw0, u64 dw1)
 }
 

[PATCH 08/12] KVM: PPC: Book3S HV nestedv2: Avoid msr check in kvmppc_handle_exit_hv()

2023-12-01 Thread Vaibhav Jain
From: Jordan Niethe 

The msr check in kvmppc_handle_exit_hv() is not needed for nestedv2 hosts,
skip the check to avoid a H_GUEST_GET_STATE hcall.

Signed-off-by: Jordan Niethe 
---
 arch/powerpc/kvm/book3s_hv.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 4dc6a928073f..47fe470375df 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1597,7 +1597,7 @@ static int kvmppc_handle_exit_hv(struct kvm_vcpu *vcpu,
 * That can happen due to a bug, or due to a machine check
 * occurring at just the wrong time.
 */
-   if (__kvmppc_get_msr_hv(vcpu) & MSR_HV) {
+   if (!kvmhv_is_nestedv2() && (__kvmppc_get_msr_hv(vcpu) & MSR_HV)) {
printk(KERN_EMERG "KVM trap in HV mode!\n");
printk(KERN_EMERG "trap=0x%x | pc=0x%lx | msr=0x%llx\n",
vcpu->arch.trap, kvmppc_get_pc(vcpu),
-- 
2.42.0



[PATCH 04/12] KVM: PPC: Book3S HV nestedv2: Get the PID only if needed to copy tofrom a guest

2023-12-01 Thread Vaibhav Jain
From: Jordan Niethe 

kvmhv_copy_tofrom_guest_radix() gets the PID at the start of the
function. If pid is not used, then this is a wasteful H_GUEST_GET_STATE
hcall for nestedv2 hosts. Move the assignment to where pid will be used.

Suggested-by: Nicholas Piggin 
Signed-off-by: Jordan Niethe 
---
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 175a8eb2681f..916af6c153a5 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -97,7 +97,7 @@ static long kvmhv_copy_tofrom_guest_radix(struct kvm_vcpu 
*vcpu, gva_t eaddr,
  void *to, void *from, unsigned long n)
 {
int lpid = vcpu->kvm->arch.lpid;
-   int pid = kvmppc_get_pid(vcpu);
+   int pid;
 
/* This would cause a data segment intr so don't allow the access */
if (eaddr & (0x3FFUL << 52))
@@ -110,6 +110,8 @@ static long kvmhv_copy_tofrom_guest_radix(struct kvm_vcpu 
*vcpu, gva_t eaddr,
/* If accessing quadrant 3 then pid is expected to be 0 */
if (((eaddr >> 62) & 0x3) == 0x3)
pid = 0;
+   else
+   pid = kvmppc_get_pid(vcpu);
 
eaddr &= ~(0xFFFUL << 52);
 
-- 
2.42.0



[PATCH 07/12] KVM: PPC: Book3S HV nestedv2: Do not inject certain interrupts

2023-12-01 Thread Vaibhav Jain
From: Jordan Niethe 

There is no need to inject an external interrupt in
kvmppc_book3s_irqprio_deliver() as the test for BOOK3S_IRQPRIO_EXTERNAL
in kvmhv_run_single_vcpu() before guest entry will raise LPCR_MER if
needed. There is also no need to inject the decrementer interrupt as
this will be raised within the L2 if needed. Avoiding these injections
reduces H_GUEST_GET_STATE hcalls by the L1.

Suggested-by: Nicholas Piggin 
Signed-off-by: Jordan Niethe 
---
 arch/powerpc/kvm/book3s.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index 6cd20ab9e94e..8acec144120e 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -302,11 +302,11 @@ static int kvmppc_book3s_irqprio_deliver(struct kvm_vcpu 
*vcpu,
 
switch (priority) {
case BOOK3S_IRQPRIO_DECREMENTER:
-   deliver = (kvmppc_get_msr(vcpu) & MSR_EE) && !crit;
+   deliver = !kvmhv_is_nestedv2() && (kvmppc_get_msr(vcpu) & 
MSR_EE) && !crit;
vec = BOOK3S_INTERRUPT_DECREMENTER;
break;
case BOOK3S_IRQPRIO_EXTERNAL:
-   deliver = (kvmppc_get_msr(vcpu) & MSR_EE) && !crit;
+   deliver = !kvmhv_is_nestedv2() && (kvmppc_get_msr(vcpu) & 
MSR_EE) && !crit;
vec = BOOK3S_INTERRUPT_EXTERNAL;
break;
case BOOK3S_IRQPRIO_SYSTEM_RESET:
-- 
2.42.0



[PATCH 00/12] KVM: PPC: Nested APIv2 : Performance improvements

2023-12-01 Thread Vaibhav Jain
From: vaibhav 


This patch series introduces series of performance improvements to recently
added support for Nested APIv2 PPC64 Guests via [1]. Details for Nested
APIv2 for PPC64 Guests is available in Documentation/powerpc/kvm-nested.rst.

This patch series introduces various optimizations for a Nested APIv2
guests namely:

* Reduce the number times L1 hypervisor requests for L2 state from L0.
* Register the L2 VPA with L1
* Optimizing interrupt delivery of some interrupt types.
* Optimize emulation of mmio loads/stores for L2 in L1.

The hcalls needed for testing these patches have been implemented in the
spapr qemu model and is available at [2].

There are scripts available to assist in setting up an environment for
testing nested guests at [3].

These patches are consequence of insights from on going performance
engineering effort for improving performance of Nested APIv2
Guests. Special thanks goes to:
* Gautam Menghani
* Jordan Niethe
* Nicholas Piggin
* Vaidyanathan Srinivasan

Refs:
[1] https://lore.kernel.org/all/20230905034658.82835-1-jniet...@gmail.com
[2] https://github.com/planetharsh/qemu/tree/upstream-0714-kop
[3] https://github.com/iamjpn/kvm-powervm-test

Jordan Niethe (11):
  KVM: PPC: Book3S HV nestedv2: Invalidate RPT before deleting a guest
  KVM: PPC: Book3S HV nestedv2: Avoid reloading the tb offset
  KVM: PPC: Book3S HV nestedv2: Do not check msr on hcalls
  KVM: PPC: Book3S HV nestedv2: Get the PID only if needed to copy
tofrom a guest
  KVM: PPC: Book3S HV nestedv2: Ensure LPCR_MER bit is passed to the L0
  KVM: PPC: Book3S HV nestedv2: Do not inject certain interrupts
  KVM: PPC: Book3S HV nestedv2: Avoid msr check in
kvmppc_handle_exit_hv()
  KVM: PPC: Book3S HV nestedv2: Do not call H_COPY_TOFROM_GUEST
  KVM: PPC: Book3S HV nestedv2: Register the VPA with the L0
  KVM: PPC: Reduce reliance on analyse_instr() in mmio emulation
  KVM: PPC: Book3S HV nestedv2: Do not cancel pending decrementer
exception

Nicholas Piggin (1):
  KVM: PPC: Book3S HV: Handle pending exceptions on guest entry with
MSR_EE

 arch/powerpc/include/asm/kvm_book3s.h| 10 +++-
 arch/powerpc/include/asm/kvm_book3s_64.h |  1 +
 arch/powerpc/kvm/book3s.c|  4 +-
 arch/powerpc/kvm/book3s_64_mmu_radix.c   |  7 ++-
 arch/powerpc/kvm/book3s_hv.c | 72 +---
 arch/powerpc/kvm/book3s_hv_nested.c  |  2 +-
 arch/powerpc/kvm/book3s_hv_nestedv2.c| 29 ++
 arch/powerpc/kvm/emulate_loadstore.c | 21 ---
 8 files changed, 107 insertions(+), 39 deletions(-)

-- 
2.42.0



[PATCH 06/12] KVM: PPC: Book3S HV: Handle pending exceptions on guest entry with MSR_EE

2023-12-01 Thread Vaibhav Jain
From: Nicholas Piggin 

Commit 026728dc5d41 ("KVM: PPC: Book3S HV P9: Inject pending xive
interrupts at guest entry") changed guest entry so that if external
interrupts are enabled, BOOK3S_IRQPRIO_EXTERNAL is not tested for. Test
for this regardless of MSR_EE.

For an L1 host, do not inject an interrupt, but always
use LPCR_MER. If the L0 desires it can inject an interrupt.

Fixes: 026728dc5d41 ("KVM: PPC: Book3S HV P9: Inject pending xive interrupts at 
guest entry")
Signed-off-by: Nicholas Piggin 
[jpn: use kvmpcc_get_msr(), write commit message]
Signed-off-by: Jordan Niethe 
---
 arch/powerpc/kvm/book3s_hv.c | 18 --
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 6d1f0bca27aa..4dc6a928073f 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -4738,13 +4738,19 @@ int kvmhv_run_single_vcpu(struct kvm_vcpu *vcpu, u64 
time_limit,
 
if (!nested) {
kvmppc_core_prepare_to_enter(vcpu);
-   if (__kvmppc_get_msr_hv(vcpu) & MSR_EE) {
-   if (xive_interrupt_pending(vcpu))
+   if (test_bit(BOOK3S_IRQPRIO_EXTERNAL,
+>arch.pending_exceptions) ||
+   xive_interrupt_pending(vcpu)) {
+   /*
+* For nested HV, don't synthesize but always pass MER,
+* the L0 will be able to optimise that more
+* effectively than manipulating registers directly.
+*/
+   if (!kvmhv_on_pseries() && (__kvmppc_get_msr_hv(vcpu) & 
MSR_EE))
kvmppc_inject_interrupt_hv(vcpu,
-   BOOK3S_INTERRUPT_EXTERNAL, 0);
-   } else if (test_bit(BOOK3S_IRQPRIO_EXTERNAL,
->arch.pending_exceptions)) {
-   lpcr |= LPCR_MER;
+  
BOOK3S_INTERRUPT_EXTERNAL, 0);
+   else
+   lpcr |= LPCR_MER;
}
} else if (vcpu->arch.pending_exceptions ||
   vcpu->arch.doorbell_request ||
-- 
2.42.0



[PATCH 05/12] KVM: PPC: Book3S HV nestedv2: Ensure LPCR_MER bit is passed to the L0

2023-12-01 Thread Vaibhav Jain
From: Jordan Niethe 

LPCR_MER is conditionally set during entry to a guest if there is a
pending external interrupt. In the nestedv2 case, this change is not
being communicated to the L0, which means it is not being set in the L2.
Ensure the updated LPCR value is passed to the L0.

Signed-off-by: Jordan Niethe 
---
 arch/powerpc/kvm/book3s_hv.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 069c336b6f3c..6d1f0bca27aa 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -4084,6 +4084,8 @@ static int kvmhv_vcpu_entry_nestedv2(struct kvm_vcpu 
*vcpu, u64 time_limit,
if (rc < 0)
return -EINVAL;
 
+   kvmppc_gse_put_u64(io->vcpu_run_input, KVMPPC_GSID_LPCR, lpcr);
+
accumulate_time(vcpu, >arch.in_guest);
rc = plpar_guest_run_vcpu(0, vcpu->kvm->arch.lpid, vcpu->vcpu_id,
  , );
-- 
2.42.0



[PATCH 01/12] KVM: PPC: Book3S HV nestedv2: Invalidate RPT before deleting a guest

2023-12-01 Thread Vaibhav Jain
From: Jordan Niethe 

An L0 must invalidate the L2's RPT during H_GUEST_DELETE if this has not
already been done. This is a slow operation that means H_GUEST_DELETE
must return H_BUSY multiple times before completing. Invalidating the
tables before deleting the guest so there is less work for the L0 to do.

Signed-off-by: Jordan Niethe 
---
 arch/powerpc/include/asm/kvm_book3s.h | 1 +
 arch/powerpc/kvm/book3s_hv.c  | 6 --
 arch/powerpc/kvm/book3s_hv_nested.c   | 2 +-
 3 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index 4f527d09c92b..a37736ed3728 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -302,6 +302,7 @@ void kvmhv_nested_exit(void);
 void kvmhv_vm_nested_init(struct kvm *kvm);
 long kvmhv_set_partition_table(struct kvm_vcpu *vcpu);
 long kvmhv_copy_tofrom_guest_nested(struct kvm_vcpu *vcpu);
+void kvmhv_flush_lpid(u64 lpid);
 void kvmhv_set_ptbl_entry(u64 lpid, u64 dw0, u64 dw1);
 void kvmhv_release_all_nested(struct kvm *kvm);
 long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu);
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 1ed6ec140701..5543e8490cd9 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -5691,10 +5691,12 @@ static void kvmppc_core_destroy_vm_hv(struct kvm *kvm)
kvmhv_set_ptbl_entry(kvm->arch.lpid, 0, 0);
}
 
-   if (kvmhv_is_nestedv2())
+   if (kvmhv_is_nestedv2()) {
+   kvmhv_flush_lpid(kvm->arch.lpid);
plpar_guest_delete(0, kvm->arch.lpid);
-   else
+   } else {
kvmppc_free_lpid(kvm->arch.lpid);
+   }
 
kvmppc_free_pimap(kvm);
 }
diff --git a/arch/powerpc/kvm/book3s_hv_nested.c 
b/arch/powerpc/kvm/book3s_hv_nested.c
index 3b658b8696bc..5c375ec1a3c6 100644
--- a/arch/powerpc/kvm/book3s_hv_nested.c
+++ b/arch/powerpc/kvm/book3s_hv_nested.c
@@ -503,7 +503,7 @@ void kvmhv_nested_exit(void)
}
 }
 
-static void kvmhv_flush_lpid(u64 lpid)
+void kvmhv_flush_lpid(u64 lpid)
 {
long rc;
 
-- 
2.42.0



[PATCH 02/12] KVM: PPC: Book3S HV nestedv2: Avoid reloading the tb offset

2023-12-01 Thread Vaibhav Jain
From: Jordan Niethe 

The kvmppc_get_tb_offset() getter reloads KVMPPC_GSID_TB_OFFSET from the
L0 for nestedv2 host. This is unnecessary as the value does not change.
KVMPPC_GSID_TB_OFFSET also need not be reloaded in
kvmppc_{s,g}et_dec_expires().

Signed-off-by: Jordan Niethe 
---
 arch/powerpc/include/asm/kvm_book3s.h | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index a37736ed3728..3e1e2a698c9e 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -594,13 +594,17 @@ static inline u##size kvmppc_get_##reg(struct kvm_vcpu 
*vcpu) \
 
 
 KVMPPC_BOOK3S_VCORE_ACCESSOR(vtb, 64, KVMPPC_GSID_VTB)
-KVMPPC_BOOK3S_VCORE_ACCESSOR(tb_offset, 64, KVMPPC_GSID_TB_OFFSET)
 KVMPPC_BOOK3S_VCORE_ACCESSOR_GET(arch_compat, 32, KVMPPC_GSID_LOGICAL_PVR)
 KVMPPC_BOOK3S_VCORE_ACCESSOR_GET(lpcr, 64, KVMPPC_GSID_LPCR)
+KVMPPC_BOOK3S_VCORE_ACCESSOR_SET(tb_offset, 64, KVMPPC_GSID_TB_OFFSET)
+
+static inline u64 kvmppc_get_tb_offset(struct kvm_vcpu *vcpu)
+{
+   return vcpu->arch.vcore->tb_offset;
+}
 
 static inline u64 kvmppc_get_dec_expires(struct kvm_vcpu *vcpu)
 {
-   WARN_ON(kvmhv_nestedv2_cached_reload(vcpu, KVMPPC_GSID_TB_OFFSET) < 0);
WARN_ON(kvmhv_nestedv2_cached_reload(vcpu, KVMPPC_GSID_DEC_EXPIRY_TB) < 
0);
return vcpu->arch.dec_expires;
 }
@@ -608,7 +612,6 @@ static inline u64 kvmppc_get_dec_expires(struct kvm_vcpu 
*vcpu)
 static inline void kvmppc_set_dec_expires(struct kvm_vcpu *vcpu, u64 val)
 {
vcpu->arch.dec_expires = val;
-   WARN_ON(kvmhv_nestedv2_cached_reload(vcpu, KVMPPC_GSID_TB_OFFSET) < 0);
kvmhv_nestedv2_mark_dirty(vcpu, KVMPPC_GSID_DEC_EXPIRY_TB);
 }
 
-- 
2.42.0



Re: [PATCH] powerpc: Enable generic cpu idle-loop

2023-08-20 Thread Vaibhav Jain
Thanks for looking at this patch Mpe.

Michael Ellerman  writes:

> Vaibhav Jain  writes:
>> This minor patch enables config option GENERIC_IDLE_POLL_SETUP for arch
>> powerpc. This should add support for kernel param 'nohlt'.
>
> Which is named after an 8086 instruction :), but oh well.
>
Thanks. This is an interesting trivia :)

>> Powerpc kernel also supports another kernel boot-time param called
>> 'powersave' which can also be used to disable all cpu idle-states and
>> forces CPU to an idle-loop similar to what cpu_idle_poll() does. This
>> patch however makes powerpc kernel-parameters better aligned to the
>> generic boot-time parameters.
>
> It would be nice if we could make our powersave=off parameter just
> enable this generic logic.
>
> Have you looked at if that's possible? At a glance it looks like it
> should be, when cpu_idle_force_poll is true do_idle() never calls
> cpuidle_idle_call(), so the cpuidle drivers are never invoked.

Thanks for suggesting this and it looks like a simple fix to arch/idle.c
code. I have posted a v2 with the needed change on the mailing list at
https://lore.kernel.org/all/20230821045928.1350893-1-vaib...@linux.ibm.com

-- 
Cheers
~ Vaibhav


[PATCH v2] powerpc: Enable generic cpu idle-loop

2023-08-20 Thread Vaibhav Jain
This minor patch enables config option GENERIC_IDLE_POLL_SETUP for arch
powerpc. This should add support for kernel param 'nohlt'.

Powerpc kernel also supports another kernel boot-time param called
'powersave' which can also be used to disable all cpu idle-states and
forces CPU to an idle-loop similar to what cpu_idle_poll() does. Hence this
patch  also updates the handling of 'powersave=off' kernel param to enable
generic cpu idle-loop if its enabled.

Signed-off-by: Vaibhav Jain
---
Changelog:

Since v1:
https://lore.kernel.org/all/20230818050739.827851-1-vaib...@linux.ibm.com
* Updated powersave_off() to enable generic cpu idle-loop if
'powersave=off' kernel arg is given. [Mpe]
* Update patch description
---
 Documentation/admin-guide/kernel-parameters.txt | 2 +-
 arch/powerpc/Kconfig| 1 +
 arch/powerpc/kernel/idle.c  | 3 +++
 3 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 722b6eca2e93..6b657ebafcfb 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3738,7 +3738,7 @@
 
nohibernate [HIBERNATION] Disable hibernation and resume.
 
-   nohlt   [ARM,ARM64,MICROBLAZE,MIPS,SH] Forces the kernel to
+   nohlt   [ARM,ARM64,MICROBLAZE,MIPS,PPC,SH] Forces the kernel to
busy wait in do_idle() and not use the arch_cpu_idle()
implementation; requires CONFIG_GENERIC_IDLE_POLL_SETUP
to be effective. This is useful on platforms where the
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 0b1172cbeccb..574661403800 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -193,6 +193,7 @@ config PPC
select GENERIC_CPU_VULNERABILITIES  if PPC_BARRIER_NOSPEC
select GENERIC_EARLY_IOREMAP
select GENERIC_GETTIMEOFDAY
+   select GENERIC_IDLE_POLL_SETUP
select GENERIC_IRQ_SHOW
select GENERIC_IRQ_SHOW_LEVEL
select GENERIC_PCI_IOMAPif PCI
diff --git a/arch/powerpc/kernel/idle.c b/arch/powerpc/kernel/idle.c
index b1c0418b25c8..7367a0698459 100644
--- a/arch/powerpc/kernel/idle.c
+++ b/arch/powerpc/kernel/idle.c
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -35,6 +36,8 @@ EXPORT_SYMBOL(cpuidle_disable);
 
 static int __init powersave_off(char *arg)
 {
+   /* Use generic idle loop if thats available */
+   cpu_idle_poll_ctrl(true);
ppc_md.power_save = NULL;
cpuidle_disable = IDLE_POWERSAVE_OFF;
return 1;
-- 
2.41.0



[PATCH] powerpc: Enable generic cpu idle-loop

2023-08-17 Thread Vaibhav Jain
This minor patch enables config option GENERIC_IDLE_POLL_SETUP for arch
powerpc. This should add support for kernel param 'nohlt'.

Powerpc kernel also supports another kernel boot-time param called
'powersave' which can also be used to disable all cpu idle-states and
forces CPU to an idle-loop similar to what cpu_idle_poll() does. This
patch however makes powerpc kernel-parameters better aligned to the
generic boot-time parameters.

Signed-off-by: Vaibhav Jain
---
 Documentation/admin-guide/kernel-parameters.txt | 2 +-
 arch/powerpc/Kconfig| 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 722b6eca2e93..6b657ebafcfb 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3738,7 +3738,7 @@
 
nohibernate [HIBERNATION] Disable hibernation and resume.
 
-   nohlt   [ARM,ARM64,MICROBLAZE,MIPS,SH] Forces the kernel to
+   nohlt   [ARM,ARM64,MICROBLAZE,MIPS,PPC,SH] Forces the kernel to
busy wait in do_idle() and not use the arch_cpu_idle()
implementation; requires CONFIG_GENERIC_IDLE_POLL_SETUP
to be effective. This is useful on platforms where the
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 0b1172cbeccb..574661403800 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -193,6 +193,7 @@ config PPC
select GENERIC_CPU_VULNERABILITIES  if PPC_BARRIER_NOSPEC
select GENERIC_EARLY_IOREMAP
select GENERIC_GETTIMEOFDAY
+   select GENERIC_IDLE_POLL_SETUP
select GENERIC_IRQ_SHOW
select GENERIC_IRQ_SHOW_LEVEL
select GENERIC_PCI_IOMAPif PCI
-- 
2.41.0



Re: [PATCH v3] powerpc/papr_scm: Fix nvdimm event mappings

2022-08-04 Thread Vaibhav Jain


Kajol Jain  writes:

> Commit 4c08d4bbc089 ("powerpc/papr_scm: Add perf interface support")
> added performance monitoring support for papr-scm nvdimm devices via
> perf interface. Commit also added an array in papr_scm_priv
> structure called "nvdimm_events_map", which got filled based on the
> result of H_SCM_PERFORMANCE_STATS hcall.
>
> Currently there is an assumption that the order of events in the
> stats buffer, returned by the hypervisor is same. And order also
> happens to matches with the events specified in nvdimm driver code.
> But this assumption is not documented in Power Architecture
> Platform Requirements (PAPR) document. Although the order
> of events happens to be same on current generation od system, but
> it might not be true in future generation systems. Fix the issue, by
> adding a static mapping for nvdimm events to corresponding stat-id,
> and removing the dynamic map from papr_scm_priv structure. Also
> remove the function papr_scm_pmu_check_events from papr_scm.c file,
> as we no longer need to copy stat-ids dynamically.
>
> Fixes: 4c08d4bbc089 ("powerpc/papr_scm: Add perf interface support")
> Reported-by: Aneesh Kumar K.V 
> Signed-off-by: Kajol Jain 
> ---
>  arch/powerpc/platforms/pseries/papr_scm.c | 88 +++
>  1 file changed, 27 insertions(+), 61 deletions(-)
>
> ---
> Changelog:
> v2 -> v3
> - Remove function papr_scm_pmu_check_events() and replace the
>   event checks in papr_scm_pmu_register() function with p->stat_buffer_len
>   as suggested by Vaibhav Jain
>   Link to the patch v2: 
> https://lore.kernel.org/all/20220711034605.212683-1-kj...@linux.ibm.com/

V3 patch looks good to me.

Reviewed-by: Vaibhav Jain 
-- 
Cheers
~ Vaibhav


Re: [PATCH v2] powerpc/papr_scm: Fix nvdimm event mappings

2022-07-11 Thread Vaibhav Jain
Hi Kajol,

Thanks for the patch. Minor review comment below:

Kajol Jain  writes:

> Commit 4c08d4bbc089 ("powerpc/papr_scm: Add perf interface support")
> added performance monitoring support for papr-scm nvdimm devices via
> perf interface. Commit also added an array in papr_scm_priv
> structure called "nvdimm_events_map", which got filled based on the
> result of H_SCM_PERFORMANCE_STATS hcall. 
>
> Currently there is an assumption that the order of events in the
> stats buffer, returned by the hypervisor is same. And that order also
> matches with the events specified in nvdimm driver code. 
> But this assumption is not documented anywhere in Power Architecture
> Platform Requirements (PAPR) document. Although the order
> of events happens to be same on current systems, but it might
> not be true in future generation systems. Fix the issue, by
> adding a static mapping for nvdimm events to corresponding stat-id,
> and removing the dynamic map from papr_scm_priv structure.
>
> Fixes: 4c08d4bbc089 ("powerpc/papr_scm: Add perf interface support")
> Reported-by: Aneesh Kumar K.V 
> Signed-off-by: Kajol Jain 


> @@ -460,10 +480,9 @@ static void papr_scm_pmu_del(struct perf_event *event, 
> int flags)
>  
>  static int papr_scm_pmu_check_events(struct papr_scm_priv *p, struct 
> nvdimm_pmu *nd_pmu)
>  {
> - struct papr_scm_perf_stat *stat;
>   struct papr_scm_perf_stats *stats;
>   u32 available_events;
> - int index, rc = 0;
> + int rc;
>  
>   if (!p->stat_buffer_len)
>   return -ENOENT;
> @@ -476,34 +495,12 @@ static int papr_scm_pmu_check_events(struct 
> papr_scm_priv *p, struct nvdimm_pmu
>   /* Allocate the buffer for phyp where stats are written */
>   stats = kzalloc(p->stat_buffer_len, GFP_KERNEL);
>   if (!stats) {
> - rc = -ENOMEM;
> - return rc;
> + return -ENOMEM;
>   }
>  
>   /* Called to get list of events supported */
>   rc = drc_pmem_query_stats(p, stats, 0);
> - if (rc)
> - goto out;
> -
> - /*
> -  * Allocate memory and populate nvdimm_event_map.
> -  * Allocate an extra element for NULL entry
> -  */
> - p->nvdimm_events_map = kcalloc(available_events + 1,
> -sizeof(stat->stat_id),
> -GFP_KERNEL);
> - if (!p->nvdimm_events_map) {
> - rc = -ENOMEM;
> - goto out;
> - }
>  
> - /* Copy all stat_ids to event map */
> - for (index = 0, stat = stats->scm_statistic;
> -  index < available_events; index++, ++stat) {
> - memcpy(>nvdimm_events_map[index * sizeof(stat->stat_id)],
> ->stat_id, sizeof(stat->stat_id));
> - }
> -out:
>   kfree(stats);
>   return rc;
>  }

Earlier implementation of papr_scm_pmu_check_events() would copy the
contents of returned stat-ids to struct papr_scm_priv->nvdimm_events_map,
hence it was needed.

With static events map you dont really need to call
drc_pmem_query_stats() as that would have been already being done once
in papr_scm_probe() before papr_scm_pmu_register() is called:


papr_scm_probe()
{
...
/* Try retrieving the stat buffer and see if its supported */
stat_size = drc_pmem_query_stats(p, NULL, 0);
...
papr_scm_pmu_register(p);
...
}

I would suggest replacing single callsite of papr_scm_pmu_check_events()
with the check

 if (!p->stat_buffer_len)
goto pmu_check_events_err;



-- 
Cheers
~ Vaibhav



Re: [PATCH v2] of: check previous kernel's ima-kexec-buffer against memory bounds

2022-05-30 Thread Vaibhav Jain
Thanks for looking into this patch Rob,

I have addressed your review comment in v3 of the patch posted at
https://lore.kernel.org/all/20220531041446.3334259-1-vaib...@linux.ibm.com/



-- 
Cheers
~ Vaibhav


[PATCH v3] of: check previous kernel's ima-kexec-buffer against memory bounds

2022-05-30 Thread Vaibhav Jain
Presently ima_get_kexec_buffer() doesn't check if the previous kernel's
ima-kexec-buffer lies outside the addressable memory range. This can result
in a kernel panic if the new kernel is booted with 'mem=X' arg and the
ima-kexec-buffer was allocated beyond that range by the previous kernel.
The panic is usually of the form below:

$ sudo kexec --initrd initrd vmlinux --append='mem=16G'


 BUG: Unable to handle kernel data access on read at 0xc000c01fff7f
 Faulting instruction address: 0xc0837974
 Oops: Kernel access of bad area, sig: 11 [#1]

 NIP [c0837974] ima_restore_measurement_list+0x94/0x6c0
 LR [c083b55c] ima_load_kexec_buffer+0xac/0x160
 Call Trace:
 [c371fa80] [c083b55c] ima_load_kexec_buffer+0xac/0x160
 [c371fb00] [c20512c4] ima_init+0x80/0x108
 [c371fb70] [c20514dc] init_ima+0x4c/0x120
 [c371fbf0] [c0012240] do_one_initcall+0x60/0x2c0
 [c371fcc0] [c2004ad0] kernel_init_freeable+0x344/0x3ec
 [c371fda0] [c00128a4] kernel_init+0x34/0x1b0
 [c371fe10] [c000ce64] ret_from_kernel_thread+0x5c/0x64
 Instruction dump:
 f92100b8 f92100c0 90e10090 910100a0 4182050c 282a0017 3bc0 40810330
 7c0802a6 fb610198 7c9b2378 f80101d0  2c090001 40820614 e9240010
 ---[ end trace  ]---

Fix this issue by checking returned PFN range of previous kernel's
ima-kexec-buffer with page_is_ram() to ensure correct memory bounds.

Fixes: 467d27824920 ("powerpc: ima: get the kexec buffer passed by the previous 
kernel")
Cc: Frank Rowand 
Cc: Prakhar Srivastava 
Cc: Lakshmi Ramasubramanian 
Cc: Thiago Jung Bauermann 
Cc: Rob Herring 
Cc: Ritesh Harjani 
Cc: Robin Murphy 
Signed-off-by: Vaibhav Jain 
---
Changelog
==
v3:
* change the type for {start,end}_pfn to unsigned long [ Ritesh ]
* Switched to page_is_ram() from pfn_vaild() [ Rob ]

v2:
* Instead of using memblock to determine the valid bounds use pfn_valid() to do
so since memblock may not be available late after the kernel init. [ Mpe ]
* Changed the patch prefix from 'powerpc' to 'of' [ Mpe ]
* Updated the 'Fixes' tag to point to correct commit that introduced this
function. [ Rob ]
* Fixed some whitespace/tab issues in the patch description [ Rob ]
* Added another check for checking ig 'tmp_size' for ima-kexec-buffer is > 0
---
 drivers/of/kexec.c | 17 +
 1 file changed, 17 insertions(+)

diff --git a/drivers/of/kexec.c b/drivers/of/kexec.c
index 8d374cc552be..91b04b04eec4 100644
--- a/drivers/of/kexec.c
+++ b/drivers/of/kexec.c
@@ -126,6 +126,7 @@ int ima_get_kexec_buffer(void **addr, size_t *size)
 {
int ret, len;
unsigned long tmp_addr;
+   unsigned long start_pfn, end_pfn;
size_t tmp_size;
const void *prop;
 
@@ -140,6 +141,22 @@ int ima_get_kexec_buffer(void **addr, size_t *size)
if (ret)
return ret;
 
+   /* Do some sanity on the returned size for the ima-kexec buffer */
+   if (!tmp_size)
+   return -ENOENT;
+
+   /*
+* Calculate the PFNs for the buffer and ensure
+* they are with in addressable memory.
+*/
+   start_pfn = PHYS_PFN(tmp_addr);
+   end_pfn = PHYS_PFN(tmp_addr + tmp_size - 1);
+   if (!page_is_ram(start_pfn) || !page_is_ram(end_pfn)) {
+   pr_warn("IMA buffer at 0x%lx, size = 0x%zx beyond memory\n",
+   tmp_addr, tmp_size);
+   return -EINVAL;
+   }
+
*addr = __va(tmp_addr);
*size = tmp_size;
 
-- 
2.35.1



Re: [PATCH v2] of: check previous kernel's ima-kexec-buffer against memory bounds

2022-05-25 Thread Vaibhav Jain
Hi Ritesh,
thanks for looking into this patch,

Ritesh Harjani  writes:

> Just a minor nit which I noticed.
>
> On 22/05/24 11:20AM, Vaibhav Jain wrote:
>> Presently ima_get_kexec_buffer() doesn't check if the previous kernel's
>> ima-kexec-buffer lies outside the addressable memory range. This can result
>> in a kernel panic if the new kernel is booted with 'mem=X' arg and the
>> ima-kexec-buffer was allocated beyond that range by the previous kernel.
>> The panic is usually of the form below:
>>
>> $ sudo kexec --initrd initrd vmlinux --append='mem=16G'
>>
>> 
>>  BUG: Unable to handle kernel data access on read at 0xc000c01fff7f
>>  Faulting instruction address: 0xc0837974
>>  Oops: Kernel access of bad area, sig: 11 [#1]
>> 
>>  NIP [c0837974] ima_restore_measurement_list+0x94/0x6c0
>>  LR [c083b55c] ima_load_kexec_buffer+0xac/0x160
>>  Call Trace:
>>  [c371fa80] [c083b55c] ima_load_kexec_buffer+0xac/0x160
>>  [c371fb00] [c20512c4] ima_init+0x80/0x108
>>  [c371fb70] [c20514dc] init_ima+0x4c/0x120
>>  [c371fbf0] [c0012240] do_one_initcall+0x60/0x2c0
>>  [c371fcc0] [c2004ad0] kernel_init_freeable+0x344/0x3ec
>>  [c371fda0] [c00128a4] kernel_init+0x34/0x1b0
>>  [c371fe10] [c000ce64] ret_from_kernel_thread+0x5c/0x64
>>  Instruction dump:
>>  f92100b8 f92100c0 90e10090 910100a0 4182050c 282a0017 3bc0 40810330
>>  7c0802a6 fb610198 7c9b2378 f80101d0  2c090001 40820614 e9240010
>>  ---[ end trace  ]---
>>
>> Fix this issue by checking returned PFN range of previous kernel's
>> ima-kexec-buffer with pfn_valid to ensure correct memory bounds.
>>
>> Fixes: 467d27824920 ("powerpc: ima: get the kexec buffer passed by the 
>> previous kernel")
>> Cc: Frank Rowand 
>> Cc: Prakhar Srivastava 
>> Cc: Lakshmi Ramasubramanian 
>> Cc: Thiago Jung Bauermann 
>> Cc: Rob Herring 
>> Signed-off-by: Vaibhav Jain 
>>
>> ---
>> Changelog
>> ==
>>
>> v2:
>> * Instead of using memblock to determine the valid bounds use pfn_valid() to 
>> do
>> so since memblock may not be available late after the kernel init. [ Mpe ]
>> * Changed the patch prefix from 'powerpc' to 'of' [ Mpe ]
>> * Updated the 'Fixes' tag to point to correct commit that introduced this
>> function. [ Rob ]
>> * Fixed some whitespace/tab issues in the patch description [ Rob ]
>> * Added another check for checking ig 'tmp_size' for ima-kexec-buffer is > 0
>> ---
>>  drivers/of/kexec.c | 17 +
>>  1 file changed, 17 insertions(+)
>>
>> diff --git a/drivers/of/kexec.c b/drivers/of/kexec.c
>> index 8d374cc552be..879e984fe901 100644
>> --- a/drivers/of/kexec.c
>> +++ b/drivers/of/kexec.c
>> @@ -126,6 +126,7 @@ int ima_get_kexec_buffer(void **addr, size_t *size)
>>  {
>>  int ret, len;
>>  unsigned long tmp_addr;
>> +unsigned int start_pfn, end_pfn;
>
> ^^^ Shouldn't this be unsigned long?
Thanks for catching this. Yes that should be 'unsigned long'. Will
resend the patch with this fixed.

>
> -ritesh
>
>>  size_t tmp_size;
>>  const void *prop;
>>
>> @@ -140,6 +141,22 @@ int ima_get_kexec_buffer(void **addr, size_t *size)
>>  if (ret)
>>  return ret;
>>
>> +/* Do some sanity on the returned size for the ima-kexec buffer */
>> +if (!tmp_size)
>> +return -ENOENT;
>> +
>> +/*
>> + * Calculate the PFNs for the buffer and ensure
>> + * they are with in addressable memory.
>> + */
>> +start_pfn = PHYS_PFN(tmp_addr);
>> +end_pfn = PHYS_PFN(tmp_addr + tmp_size - 1);
>> +if (!pfn_valid(start_pfn) || !pfn_valid(end_pfn)) {
>> +pr_warn("IMA buffer at 0x%lx, size = 0x%zx beyond memory\n",
>> +tmp_addr, tmp_size);
>> +return -EINVAL;
>> +}
>> +
>>  *addr = __va(tmp_addr);
>>  *size = tmp_size;
>>
>> --
>> 2.35.1
>>

-- 
Cheers
~ Vaibhav


Re: [powerpc] linux-next 20220520 boot failure (drc_pmem_query_stats)

2022-05-24 Thread Vaibhav Jain
Thanks for reporting this Sachin,

I have posted a fix for this at
https://lore.kernel.org/nvdimm/20220524112353.1718454-1-vaib...@linux.ibm.com


Sachin Sant  writes:

> While booting linux-next (5.18.0-rc7-next-20220520) on a Power10 LPAR
> configure with pmem following oops is seen. The LPAR fails to boot to
> login prompt.
>
> [   10.948211] papr_scm ibm,persistent-memory:ibm,pmemory@44104001: 
> Permission denied while accessing performance stats
> [   10.948536] Kernel attempted to write user page (1c) - exploit attempt? 
> (uid: 0)
> [   10.948539] BUG: Kernel NULL pointer dereference on write at 0x001c
> [   10.948540] Faulting instruction address: 0xc00801b90844
> [   10.948542] Oops: Kernel access of bad area, sig: 11 [#1]
> [   10.948563] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
> [   10.948568] Modules linked in: papr_scm(E+) libnvdimm(E) vmx_crypto(E) 
> ext4(E) mbcache(E) jbd2(E) sd_mod(E) t10_pi(E) crc64_rocksoft(E) crc64(E) 
> sg(E) ibmvscsi(E) scsi_transport_srp(E) ibmveth(E) fuse(E)
> [   10.948587] CPU: 25 PID: 796 Comm: systemd-udevd Tainted: GE   
>   5.18.0-rc7-next-20220520 #2
> [   10.948592] NIP:  c00801b90844 LR: c00801b92794 CTR: 
> c00801b907f8
> [   10.948595] REGS: c0003082b110 TRAP: 0300   Tainted: GE
>   (5.18.0-rc7-next-20220520)
> [   10.948600] MSR:  8280b033   CR: 
> 44222822  XER: 0001
> [   10.948613] CFAR: c007c744 DAR: 001c DSISR: 4200 
> IRQMASK: 0 
> [   10.948613] GPR00: c00801b92794 c0003082b3b0 c00801bc8000 
> c941bc00 
> [   10.948613] GPR04: 0010  c00016001800 
> c0003082b420 
> [   10.948613] GPR08: 001c 0100 53544154 
> c00801b92c98 
> [   10.948613] GPR12: c00801b907f8 c00abfd02b00 c0003082bd00 
> 0001372bd8b0 
> [   10.948613] GPR16: ff20 c008008911b8 c0080089 
> 11d0 
> [   10.948613] GPR20: 0001 c0003082bbc0 c00801bc0a88 
>  
> [   10.948613] GPR24:  c2950e30  
> 0010 
> [   10.948613] GPR28: c941bc00 0010 0020 
> c941bc00 
> [   10.948660] NIP [c00801b90844] drc_pmem_query_stats+0x5c/0x270 
> [papr_scm]
> [   10.948667] LR [c00801b92794] papr_scm_probe+0x2ac/0x6ec [papr_scm]
> [   10.948673] Call Trace:
> [   10.948675] [c0003082b3b0] [c941bca0] 0xc941bca0 
> (unreliable)
> [   10.948680] [c0003082b460] [c00801b92794] 
> papr_scm_probe+0x2ac/0x6ec [papr_scm]
> [   10.948687] [c0003082b550] [c09809b8] platform_probe+0x98/0x150
> [   10.948694] [c0003082b5d0] [c097bf2c] really_probe+0xfc/0x510
> [   10.948699] [c0003082b650] [c097c4bc] 
> __driver_probe_device+0x17c/0x230
> [   10.948704] [c0003082b6d0] [c097c5c8] 
> driver_probe_device+0x58/0x120
> [   10.948709] [c0003082b710] [c097ce0c] 
> __driver_attach+0xfc/0x230
> [   10.948714] [c0003082b790] [c0978458] 
> bus_for_each_dev+0xa8/0x130
> [   10.948718] [c0003082b7f0] [c097b2c4] driver_attach+0x34/0x50
> [   10.948722] [c0003082b810] [c097a508] 
> bus_add_driver+0x1e8/0x350
> [   10.948729] [c0003082b8a0] [c097def8] 
> driver_register+0x98/0x1a0
> [   10.948736] [c0003082b910] [c09804a8] 
> __platform_driver_register+0x38/0x50
> [   10.948741] [c0003082b930] [c00801b92c10] papr_scm_init+0x3c/0x78 
> [papr_scm]
> [   10.948747] [c0003082b960] [c0011ff0] 
> do_one_initcall+0x60/0x2d0
> [   10.948753] [c0003082ba30] [c023627c] do_init_module+0x6c/0x2d0
> [   10.948760] [c0003082bab0] [c0239650] load_module+0x1e90/0x2290
> [   10.948765] [c0003082bc90] [c0239d9c] 
> __do_sys_finit_module+0xdc/0x180
> [   10.948771] [c0003082bdb0] [c00335fc] 
> system_call_exception+0x17c/0x350
> [   10.948777] [c0003082be10] [c000c53c] 
> system_call_common+0xec/0x270
> [   10.948782] --- interrupt: c00 at 0x7fffa3f2f1d4
> [   10.948785] NIP:  7fffa3f2f1d4 LR: 7fffa456ea9c CTR: 
> 
> [   10.948789] REGS: c0003082be80 TRAP: 0c00   Tainted: GE
>   (5.18.0-rc7-next-20220520)
> [   10.948793] MSR:  8280f033   
> CR: 2804  XER: 
> [   10.948805] IRQMASK: 0 
> [   10.948805] GPR00: 0161 7fffd70550b0 7fffa4007300 
> 0011 
> [   10.948805] GPR04: 7fffa457ad30  0011 
>  
> [   10.948805] GPR08:    
>  
> [   10.948805] GPR12:  7fffa4656580 0002 
> 0001372bd8b0 
> [   10.948805] GPR16: 000137300108 0001372c5c68  
>  
> [   10.948805] GPR20: 

[PATCH] powerpc/papr_scm: don't requests stats with '0' sized stats buffer

2022-05-24 Thread Vaibhav Jain
Sachin reported [1] that on a POWER-10 lpar he is seeing a kernel panic being
reported with vPMEM when papr_scm probe is being called. The panic is of the
form below and is observed only with following option disabled(profile) for the
said LPAR 'Enable Performance Information Collection' in the HMC:

 Kernel attempted to write user page (1c) - exploit attempt? (uid: 0)
 BUG: Kernel NULL pointer dereference on write at 0x001c
 Faulting instruction address: 0xc00801b90844
 Oops: Kernel access of bad area, sig: 11 [#1]

 NIP [c00801b90844] drc_pmem_query_stats+0x5c/0x270 [papr_scm]
 LR [c00801b92794] papr_scm_probe+0x2ac/0x6ec [papr_scm]
 Call Trace:
   0xc941bca0 (unreliable)
   papr_scm_probe+0x2ac/0x6ec [papr_scm]
   platform_probe+0x98/0x150
   really_probe+0xfc/0x510
   __driver_probe_device+0x17c/0x230

 ---[ end trace  ]---
 Kernel panic - not syncing: Fatal exception

On investigation looks like this panic was caused due to a 'stat_buffer' of
size==0 being provided to drc_pmem_query_stats() to fetch all performance
stats-ids of an NVDIMM. However drc_pmem_query_stats() shouldn't have been 
called
since the vPMEM NVDIMM doesn't support and performance stat-id's. This was 
caused
due to missing check for 'p->stat_buffer_len' at the beginning of
papr_scm_pmu_check_events() which indicates that the NVDIMM doesn't support
performance-stats.

Fix this by introducing the check for 'p->stat_buffer_len' at the beginning of
papr_scm_pmu_check_events().

[1] 
https://lore.kernel.org/all/6b3a522a-6a5f-4cc9-b268-0c63aa6e0...@linux.ibm.com

Fixes: 0e0946e22f3665d2732 ("powerpc/papr_scm: Fix leaking nvdimm_events_map 
elements")
Reported-by: Sachin Sant 
Signed-off-by: Vaibhav Jain 
---
 arch/powerpc/platforms/pseries/papr_scm.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index 181b855b3050..82cae08976bc 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -465,6 +465,9 @@ static int papr_scm_pmu_check_events(struct papr_scm_priv 
*p, struct nvdimm_pmu
u32 available_events;
int index, rc = 0;
 
+   if (!p->stat_buffer_len)
+   return -ENOENT;
+
available_events = (p->stat_buffer_len  - sizeof(struct 
papr_scm_perf_stats))
/ sizeof(struct papr_scm_perf_stat);
if (available_events == 0)
-- 
2.35.1




Re: [PATCH] powerpc: check previous kernel's ima-kexec-buffer against memory bounds

2022-05-23 Thread Vaibhav Jain


Michael Ellerman  writes:


>
> Rob's point is that commit fee3ff99bc67 only moved existing code, the
> bug already existed.
>
> The function was introduced in:
>
>   467d27824920 ("powerpc: ima: get the kexec buffer passed by the previous 
> kernel")
>
> So that's where the Fixes tag should point.

Thanks Mpe for explaining and apologies Rob as I initially didnt get the
point you made. I wrongly assumed that since the patch can only be
applied on top of commit fee3ff99bc67 hence fixes should point to
fee3ff99bc67.

I have sent out a v2 with the updated Fixes tag and also changes as
suggested by Mpe and Rob at
https://lore.kernel.org/all/20220524055042.1527968-1-vaib...@linux.ibm.com/

-- 
Cheers
~ Vaibhav


[PATCH v2] of: check previous kernel's ima-kexec-buffer against memory bounds

2022-05-23 Thread Vaibhav Jain
Presently ima_get_kexec_buffer() doesn't check if the previous kernel's
ima-kexec-buffer lies outside the addressable memory range. This can result
in a kernel panic if the new kernel is booted with 'mem=X' arg and the
ima-kexec-buffer was allocated beyond that range by the previous kernel.
The panic is usually of the form below:

$ sudo kexec --initrd initrd vmlinux --append='mem=16G'


 BUG: Unable to handle kernel data access on read at 0xc000c01fff7f
 Faulting instruction address: 0xc0837974
 Oops: Kernel access of bad area, sig: 11 [#1]

 NIP [c0837974] ima_restore_measurement_list+0x94/0x6c0
 LR [c083b55c] ima_load_kexec_buffer+0xac/0x160
 Call Trace:
 [c371fa80] [c083b55c] ima_load_kexec_buffer+0xac/0x160
 [c371fb00] [c20512c4] ima_init+0x80/0x108
 [c371fb70] [c20514dc] init_ima+0x4c/0x120
 [c371fbf0] [c0012240] do_one_initcall+0x60/0x2c0
 [c371fcc0] [c2004ad0] kernel_init_freeable+0x344/0x3ec
 [c371fda0] [c00128a4] kernel_init+0x34/0x1b0
 [c371fe10] [c000ce64] ret_from_kernel_thread+0x5c/0x64
 Instruction dump:
 f92100b8 f92100c0 90e10090 910100a0 4182050c 282a0017 3bc0 40810330
 7c0802a6 fb610198 7c9b2378 f80101d0  2c090001 40820614 e9240010
 ---[ end trace  ]---

Fix this issue by checking returned PFN range of previous kernel's
ima-kexec-buffer with pfn_valid to ensure correct memory bounds.

Fixes: 467d27824920 ("powerpc: ima: get the kexec buffer passed by the previous 
kernel")
Cc: Frank Rowand 
Cc: Prakhar Srivastava 
Cc: Lakshmi Ramasubramanian 
Cc: Thiago Jung Bauermann 
Cc: Rob Herring 
Signed-off-by: Vaibhav Jain 

---
Changelog
==

v2:
* Instead of using memblock to determine the valid bounds use pfn_valid() to do
so since memblock may not be available late after the kernel init. [ Mpe ]
* Changed the patch prefix from 'powerpc' to 'of' [ Mpe ]
* Updated the 'Fixes' tag to point to correct commit that introduced this
function. [ Rob ]
* Fixed some whitespace/tab issues in the patch description [ Rob ]
* Added another check for checking ig 'tmp_size' for ima-kexec-buffer is > 0
---
 drivers/of/kexec.c | 17 +
 1 file changed, 17 insertions(+)

diff --git a/drivers/of/kexec.c b/drivers/of/kexec.c
index 8d374cc552be..879e984fe901 100644
--- a/drivers/of/kexec.c
+++ b/drivers/of/kexec.c
@@ -126,6 +126,7 @@ int ima_get_kexec_buffer(void **addr, size_t *size)
 {
int ret, len;
unsigned long tmp_addr;
+   unsigned int start_pfn, end_pfn;
size_t tmp_size;
const void *prop;
 
@@ -140,6 +141,22 @@ int ima_get_kexec_buffer(void **addr, size_t *size)
if (ret)
return ret;
 
+   /* Do some sanity on the returned size for the ima-kexec buffer */
+   if (!tmp_size)
+   return -ENOENT;
+
+   /*
+* Calculate the PFNs for the buffer and ensure
+* they are with in addressable memory.
+*/
+   start_pfn = PHYS_PFN(tmp_addr);
+   end_pfn = PHYS_PFN(tmp_addr + tmp_size - 1);
+   if (!pfn_valid(start_pfn) || !pfn_valid(end_pfn)) {
+   pr_warn("IMA buffer at 0x%lx, size = 0x%zx beyond memory\n",
+   tmp_addr, tmp_size);
+   return -EINVAL;
+   }
+
*addr = __va(tmp_addr);
*size = tmp_size;
 
-- 
2.35.1



Re: [PATCH] powerpc: check previous kernel's ima-kexec-buffer against memory bounds

2022-05-22 Thread Vaibhav Jain
Rob Herring  writes:

> On Thu, May 19, 2022 at 01:35:47AM +0530, Vaibhav Jain wrote:
>> Presently ima_get_kexec_buffer() doesn't check if the previous kernel's
>> ima-kexec-buffer lies outside the addressable memory range. This can result
>> in a kernel panic if the new kernel is booted with 'mem=X' arg and the
>> ima-kexec-buffer was allocated beyond that range by the previous kernel.
>> The panic is usually of the form below:
>> 
>> $ sudo kexec --initrd initrd vmlinux --append='mem=16G'
>> 
>> 
>>  BUG: Unable to handle kernel data access on read at 0xc000c01fff7f
>>  Faulting instruction address: 0xc0837974
>>  Oops: Kernel access of bad area, sig: 11 [#1]
>> 
>>  NIP [c0837974] ima_restore_measurement_list+0x94/0x6c0
>>  LR [c083b55c] ima_load_kexec_buffer+0xac/0x160
>>  Call Trace:
>>  [c371fa80] [c083b55c] ima_load_kexec_buffer+0xac/0x160
>>  [c371fb00] [c20512c4] ima_init+0x80/0x108
>>  [c371fb70] [c20514dc] init_ima+0x4c/0x120
>>  [c371fbf0] [c0012240] do_one_initcall+0x60/0x2c0
>>  [c371fcc0] [c2004ad0] kernel_init_freeable+0x344/0x3ec
>>  [c371fda0] [c00128a4] kernel_init+0x34/0x1b0
>>  [c371fe10] [c000ce64] ret_from_kernel_thread+0x5c/0x64
>>  Instruction dump:
>>  f92100b8 f92100c0 90e10090 910100a0 4182050c 282a0017 3bc0 40810330
>>  7c0802a6 fb610198 7c9b2378 f80101d0  2c090001 40820614 e9240010
>>  ---[ end trace  ]---
>> 
>> Fix this issue by checking returned address/size of previous kernel's
>> ima-kexec-buffer against memblock's memory bounds.
>> 
>> Fixes: fee3ff99bc67("powerpc: Move arch independent ima kexec functions to
>> drivers/of/kexec.c")
>
> Your mailer (and/or you) corrupted this. There should be a space between 
> the commit hash and subject, and it should not be wrapped.
I probably messed it up. Will resend a fixed patch.

>
> It should also not have a blank line here.
Sure. Will get this fixed
>
> But more importantly, how did this commit introduce the problem? It just 
> moved the code and didn't have any such check.
Yes, the code didnt have the necessary check to see if the address for
previous kernel IMA buffer is beyond the currently addressable memory. I
have described the problem in patch description.



-- 
Cheers
~ Vaibhav


[PATCH] powerpc: check previous kernel's ima-kexec-buffer against memory bounds

2022-05-18 Thread Vaibhav Jain
Presently ima_get_kexec_buffer() doesn't check if the previous kernel's
ima-kexec-buffer lies outside the addressable memory range. This can result
in a kernel panic if the new kernel is booted with 'mem=X' arg and the
ima-kexec-buffer was allocated beyond that range by the previous kernel.
The panic is usually of the form below:

$ sudo kexec --initrd initrd vmlinux --append='mem=16G'


 BUG: Unable to handle kernel data access on read at 0xc000c01fff7f
 Faulting instruction address: 0xc0837974
 Oops: Kernel access of bad area, sig: 11 [#1]

 NIP [c0837974] ima_restore_measurement_list+0x94/0x6c0
 LR [c083b55c] ima_load_kexec_buffer+0xac/0x160
 Call Trace:
 [c371fa80] [c083b55c] ima_load_kexec_buffer+0xac/0x160
 [c371fb00] [c20512c4] ima_init+0x80/0x108
 [c371fb70] [c20514dc] init_ima+0x4c/0x120
 [c371fbf0] [c0012240] do_one_initcall+0x60/0x2c0
 [c371fcc0] [c2004ad0] kernel_init_freeable+0x344/0x3ec
 [c371fda0] [c00128a4] kernel_init+0x34/0x1b0
 [c371fe10] [c000ce64] ret_from_kernel_thread+0x5c/0x64
 Instruction dump:
 f92100b8 f92100c0 90e10090 910100a0 4182050c 282a0017 3bc0 40810330
 7c0802a6 fb610198 7c9b2378 f80101d0  2c090001 40820614 e9240010
 ---[ end trace  ]---

Fix this issue by checking returned address/size of previous kernel's
ima-kexec-buffer against memblock's memory bounds.

Fixes: fee3ff99bc67("powerpc: Move arch independent ima kexec functions to
drivers/of/kexec.c")

Cc: Frank Rowand 
Cc: Prakhar Srivastava 
Cc: Lakshmi Ramasubramanian 
Cc: Thiago Jung Bauermann 
Cc: Rob Herring 
Signed-off-by: Vaibhav Jain 
---
 drivers/of/kexec.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/drivers/of/kexec.c b/drivers/of/kexec.c
index b9bd1cff1793..c73007eda52d 100644
--- a/drivers/of/kexec.c
+++ b/drivers/of/kexec.c
@@ -140,6 +140,13 @@ int ima_get_kexec_buffer(void **addr, size_t *size)
if (ret)
return ret;
 
+   /* if the ima-kexec-buffer goes beyond the addressable memory */
+   if (!memblock_is_region_memory(tmp_addr, tmp_size)) {
+   pr_warn("IMA buffer at 0x%lx, size = 0x%zx beyond memory\n",
+   tmp_addr, tmp_size);
+   return -EINVAL;
+   }
+
*addr = __va(tmp_addr);
*size = tmp_size;
 
-- 
2.35.1



[PATCH v2] powerpc/papr_scm: Fix leaking nvdimm_events_map elements

2022-05-11 Thread Vaibhav Jain
Right now 'char *' elements allocated for individual 'stat_id' in
'papr_scm_priv.nvdimm_events_map[]' during papr_scm_pmu_check_events(), get
leaked in papr_scm_remove() and papr_scm_pmu_register(),
papr_scm_pmu_check_events() error paths.

Also individual 'stat_id' arent NULL terminated 'char *' instead they are fixed
8-byte sized identifiers. However papr_scm_pmu_register() assumes it to be a
NULL terminated 'char *' and at other places it assumes it to be a
'papr_scm_perf_stat.stat_id' sized string which is 8-byes in size.

Fix this by allocating the memory for papr_scm_priv.nvdimm_events_map to also
include space for 'stat_id' entries. This is possible since number of available
events/stat_ids are known upfront. This saves some memory and one extra level of
indirection from 'nvdimm_events_map' to 'stat_id'. Also rest of the code
can continue to call 'kfree(papr_scm_priv.nvdimm_events_map)' without needing to
iterate over the array and free up individual elements.

Fixes: 4c08d4bbc089 ("powerpc/papr_scm: Add perf interface support")
Signed-off-by: Vaibhav Jain 
---
Change-log:

v2:
* Removed the typedef 'stat_id_t' which will be introduced in a later patch
  [Aneesh]
* Updated the patch description a bit
---
 arch/powerpc/platforms/pseries/papr_scm.c | 54 ++-
 1 file changed, 24 insertions(+), 30 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index 39962c905542..181b855b3050 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -125,8 +125,8 @@ struct papr_scm_priv {
/* The bits which needs to be overridden */
u64 health_bitmap_inject_mask;
 
-/* array to have event_code and stat_id mappings */
-   char **nvdimm_events_map;
+   /* array to have event_code and stat_id mappings */
+   u8 *nvdimm_events_map;
 };
 
 static int papr_scm_pmem_flush(struct nd_region *nd_region,
@@ -370,7 +370,7 @@ static int papr_scm_pmu_get_value(struct perf_event *event, 
struct device *dev,
 
stat = >scm_statistic[0];
memcpy(>stat_id,
-  p->nvdimm_events_map[event->attr.config],
+  >nvdimm_events_map[event->attr.config * 
sizeof(stat->stat_id)],
sizeof(stat->stat_id));
stat->stat_val = 0;
 
@@ -462,14 +462,13 @@ static int papr_scm_pmu_check_events(struct papr_scm_priv 
*p, struct nvdimm_pmu
 {
struct papr_scm_perf_stat *stat;
struct papr_scm_perf_stats *stats;
-   int index, rc, count;
u32 available_events;
-
-   if (!p->stat_buffer_len)
-   return -ENOENT;
+   int index, rc = 0;
 
available_events = (p->stat_buffer_len  - sizeof(struct 
papr_scm_perf_stats))
/ sizeof(struct papr_scm_perf_stat);
+   if (available_events == 0)
+   return -EOPNOTSUPP;
 
/* Allocate the buffer for phyp where stats are written */
stats = kzalloc(p->stat_buffer_len, GFP_KERNEL);
@@ -478,35 +477,30 @@ static int papr_scm_pmu_check_events(struct papr_scm_priv 
*p, struct nvdimm_pmu
return rc;
}
 
-   /* Allocate memory to nvdimm_event_map */
-   p->nvdimm_events_map = kcalloc(available_events, sizeof(char *), 
GFP_KERNEL);
-   if (!p->nvdimm_events_map) {
-   rc = -ENOMEM;
-   goto out_stats;
-   }
-
/* Called to get list of events supported */
rc = drc_pmem_query_stats(p, stats, 0);
if (rc)
-   goto out_nvdimm_events_map;
-
-   for (index = 0, stat = stats->scm_statistic, count = 0;
-index < available_events; index++, ++stat) {
-   p->nvdimm_events_map[count] = kmemdup_nul(stat->stat_id, 8, 
GFP_KERNEL);
-   if (!p->nvdimm_events_map[count]) {
-   rc = -ENOMEM;
-   goto out_nvdimm_events_map;
-   }
+   goto out;
 
-   count++;
+   /*
+* Allocate memory and populate nvdimm_event_map.
+* Allocate an extra element for NULL entry
+*/
+   p->nvdimm_events_map = kcalloc(available_events + 1,
+  sizeof(stat->stat_id),
+  GFP_KERNEL);
+   if (!p->nvdimm_events_map) {
+   rc = -ENOMEM;
+   goto out;
}
-   p->nvdimm_events_map[count] = NULL;
-   kfree(stats);
-   return 0;
 
-out_nvdimm_events_map:
-   kfree(p->nvdimm_events_map);
-out_stats:
+   /* Copy all stat_ids to event map */
+   for (index = 0, stat = stats->scm_statistic;
+index < available_events; index++, ++stat) {
+   memcpy(>nvdimm_events_map[index * sizeof(stat->stat_id)],
+  >stat_id, sizeof(stat->stat_id));
+   }
+out

Re: [PATCH] powerpc/papr_scm: Fix leaking nvdimm_events_map elements

2022-05-11 Thread Vaibhav Jain


Thanks for reviewing this patch Aneesh,

"Aneesh Kumar K.V"  writes:

> Vaibhav Jain  writes:
>
>> Right now 'char *' elements allocated individual 'stat_id' in
>> 'papr_scm_priv.nvdimm_events_map' during papr_scm_pmu_check_events() leak in
>> papr_scm_remove() and papr_scm_pmu_register(), papr_scm_pmu_check_events() 
>> error
>> paths.
>>
>> Also individual 'stat_id' arent NULL terminated 'char *' instead they are 
>> fixed
>> 8-byte sized identifiers. However papr_scm_pmu_register() assumes it to be a
>> NULL terminated 'char *' and at other places it assumes it to be a
>> 'papr_scm_perf_stat.stat_id' sized string which is 8-byes in size.
>>
>> Fix this by allocating the memory for papr_scm_priv.nvdimm_events_map to also
>> include space for 'stat_id' entries. This is possible since number of 
>> available
>> events/stat_ids are known upfront. This saves some memory and one extra 
>> level of
>> indirection from 'nvdimm_events_map' to 'stat_id'. Also rest of the code
>> can continue to call 'kfree(papr_scm_priv.nvdimm_events_map)' without 
>> needing to
>> iterate over the array and free up individual elements.
>>
>> Also introduce a new typedef called 'state_id_t' thats a 'u8[8]' and can be 
>> used
>> across papr_scm to deal with stat_ids.
>>
>> Fixes: 4c08d4bbc089 ("powerpc/papr_scm: Add perf interface support")
>> Signed-off-by: Vaibhav Jain 
>> ---
>>  arch/powerpc/platforms/pseries/papr_scm.c | 48 +++
>>  1 file changed, 22 insertions(+), 26 deletions(-)
>>
>> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
>> b/arch/powerpc/platforms/pseries/papr_scm.c
>> index 39962c905542..f33a865ad5fb 100644
>> --- a/arch/powerpc/platforms/pseries/papr_scm.c
>> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
>> @@ -70,8 +70,10 @@
>>  #define PAPR_SCM_PERF_STATS_VERSION 0x1
>>  
>>  /* Struct holding a single performance metric */
>> +typedef u8 stat_id_t[8];
>> +
>>  struct papr_scm_perf_stat {
>> -u8 stat_id[8];
>> +stat_id_t stat_id;
>>  __be64 stat_val;
>>  } __packed;
>
> Can we do this as two patch? One that fix the leak and other that adds
> new type?
>
Yeah, sure that would be a simple change. I will send a v2 across that
only removes the leak and then a subsequent patch that changes the usage
of stat_id within papr_scm to the new typedef

>>  
>> @@ -126,7 +128,7 @@ struct papr_scm_priv {
>>  u64 health_bitmap_inject_mask;
>>  
>>   /* array to have event_code and stat_id mappings */
>> -char **nvdimm_events_map;
>> +stat_id_t *nvdimm_events_map;
>>  };
>>  
>>  static int papr_scm_pmem_flush(struct nd_region *nd_region,
>> @@ -462,7 +464,7 @@ static int papr_scm_pmu_check_events(struct 
>> papr_scm_priv *p, struct nvdimm_pmu
>>  {
>>  struct papr_scm_perf_stat *stat;
>>  struct papr_scm_perf_stats *stats;
>> -int index, rc, count;
>> +int index, rc = 0;
>>  u32 available_events;
>>  
>>  if (!p->stat_buffer_len)
>> @@ -478,35 +480,29 @@ static int papr_scm_pmu_check_events(struct 
>> papr_scm_priv *p, struct nvdimm_pmu
>>  return rc;
>>  }
>>  
>> -/* Allocate memory to nvdimm_event_map */
>> -p->nvdimm_events_map = kcalloc(available_events, sizeof(char *), 
>> GFP_KERNEL);
>> -if (!p->nvdimm_events_map) {
>> -rc = -ENOMEM;
>> -goto out_stats;
>> -}
>> -
>>  /* Called to get list of events supported */
>>  rc = drc_pmem_query_stats(p, stats, 0);
>>  if (rc)
>> -goto out_nvdimm_events_map;
>> -
>> -for (index = 0, stat = stats->scm_statistic, count = 0;
>> - index < available_events; index++, ++stat) {
>> -p->nvdimm_events_map[count] = kmemdup_nul(stat->stat_id, 8, 
>> GFP_KERNEL);
>> -if (!p->nvdimm_events_map[count]) {
>> -rc = -ENOMEM;
>> -goto out_nvdimm_events_map;
>> -}
>> +goto out;
>>  
>> -count++;
>> +/*
>> + * Allocate memory and populate nvdimm_event_map.
>> + * Allocate an extra element for NULL entry
>> + */
>> +p->nvdimm_events_map = kcalloc(available_events + 1,
>> +   sizeof(stat_id_t), GFP_KERNEL);
>> +if (!p->nvdimm_events_map) {
>> +rc = -ENOMEM;
>> +goto out;
>>  }
>> -p->nvdimm_events_map[count] = NULL;
>> -kfree(stats);
>> -return 0;
>>  
>> -out_nvdimm_events_map:
>> -kfree(p->nvdimm_events_map);
>> -out_stats:
>> +/* Copy all stat_ids to event map */
>> +for (index = 0, stat = stats->scm_statistic;
>> + index < available_events; index++, ++stat) {
>> +memcpy(>nvdimm_events_map[index], >stat_id,
>> +   sizeof(stat_id_t));
>> +}
>> +out:
>>  kfree(stats);
>>  return rc;
>>  }
>>
>> base-commit: 348c71344111d7a48892e3e52264ff11956fc196
>> -- 
>> 2.35.1
>

-- 
Cheers
~ Vaibhav



[PATCH] powerpc/papr_scm: Fix leaking nvdimm_events_map elements

2022-05-09 Thread Vaibhav Jain
Right now 'char *' elements allocated individual 'stat_id' in
'papr_scm_priv.nvdimm_events_map' during papr_scm_pmu_check_events() leak in
papr_scm_remove() and papr_scm_pmu_register(), papr_scm_pmu_check_events() error
paths.

Also individual 'stat_id' arent NULL terminated 'char *' instead they are fixed
8-byte sized identifiers. However papr_scm_pmu_register() assumes it to be a
NULL terminated 'char *' and at other places it assumes it to be a
'papr_scm_perf_stat.stat_id' sized string which is 8-byes in size.

Fix this by allocating the memory for papr_scm_priv.nvdimm_events_map to also
include space for 'stat_id' entries. This is possible since number of available
events/stat_ids are known upfront. This saves some memory and one extra level of
indirection from 'nvdimm_events_map' to 'stat_id'. Also rest of the code
can continue to call 'kfree(papr_scm_priv.nvdimm_events_map)' without needing to
iterate over the array and free up individual elements.

Also introduce a new typedef called 'state_id_t' thats a 'u8[8]' and can be used
across papr_scm to deal with stat_ids.

Fixes: 4c08d4bbc089 ("powerpc/papr_scm: Add perf interface support")
Signed-off-by: Vaibhav Jain 
---
 arch/powerpc/platforms/pseries/papr_scm.c | 48 +++
 1 file changed, 22 insertions(+), 26 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index 39962c905542..f33a865ad5fb 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -70,8 +70,10 @@
 #define PAPR_SCM_PERF_STATS_VERSION 0x1
 
 /* Struct holding a single performance metric */
+typedef u8 stat_id_t[8];
+
 struct papr_scm_perf_stat {
-   u8 stat_id[8];
+   stat_id_t stat_id;
__be64 stat_val;
 } __packed;
 
@@ -126,7 +128,7 @@ struct papr_scm_priv {
u64 health_bitmap_inject_mask;
 
 /* array to have event_code and stat_id mappings */
-   char **nvdimm_events_map;
+   stat_id_t *nvdimm_events_map;
 };
 
 static int papr_scm_pmem_flush(struct nd_region *nd_region,
@@ -462,7 +464,7 @@ static int papr_scm_pmu_check_events(struct papr_scm_priv 
*p, struct nvdimm_pmu
 {
struct papr_scm_perf_stat *stat;
struct papr_scm_perf_stats *stats;
-   int index, rc, count;
+   int index, rc = 0;
u32 available_events;
 
if (!p->stat_buffer_len)
@@ -478,35 +480,29 @@ static int papr_scm_pmu_check_events(struct papr_scm_priv 
*p, struct nvdimm_pmu
return rc;
}
 
-   /* Allocate memory to nvdimm_event_map */
-   p->nvdimm_events_map = kcalloc(available_events, sizeof(char *), 
GFP_KERNEL);
-   if (!p->nvdimm_events_map) {
-   rc = -ENOMEM;
-   goto out_stats;
-   }
-
/* Called to get list of events supported */
rc = drc_pmem_query_stats(p, stats, 0);
if (rc)
-   goto out_nvdimm_events_map;
-
-   for (index = 0, stat = stats->scm_statistic, count = 0;
-index < available_events; index++, ++stat) {
-   p->nvdimm_events_map[count] = kmemdup_nul(stat->stat_id, 8, 
GFP_KERNEL);
-   if (!p->nvdimm_events_map[count]) {
-   rc = -ENOMEM;
-   goto out_nvdimm_events_map;
-   }
+   goto out;
 
-   count++;
+   /*
+* Allocate memory and populate nvdimm_event_map.
+* Allocate an extra element for NULL entry
+*/
+   p->nvdimm_events_map = kcalloc(available_events + 1,
+  sizeof(stat_id_t), GFP_KERNEL);
+   if (!p->nvdimm_events_map) {
+   rc = -ENOMEM;
+   goto out;
}
-   p->nvdimm_events_map[count] = NULL;
-   kfree(stats);
-   return 0;
 
-out_nvdimm_events_map:
-   kfree(p->nvdimm_events_map);
-out_stats:
+   /* Copy all stat_ids to event map */
+   for (index = 0, stat = stats->scm_statistic;
+index < available_events; index++, ++stat) {
+   memcpy(>nvdimm_events_map[index], >stat_id,
+  sizeof(stat_id_t));
+   }
+out:
kfree(stats);
return rc;
 }

base-commit: 348c71344111d7a48892e3e52264ff11956fc196
-- 
2.35.1




Re: [PATCH] powerpc/papr_scm: Fix buffer overflow issue with CONFIG_FORTIFY_SOURCE

2022-05-06 Thread Vaibhav Jain
Hi Kajol,

Thanks for the patch. Minor review comment below:

Kajol Jain  writes:

> With CONFIG_FORTIFY_SOURCE enabled, string functions will also perform
> dynamic checks for string size which can panic the kernel,
> like incase of overflow detection.
>
> In papr_scm, papr_scm_pmu_check_events function uses stat->stat_id
> with string operations, to populate the nvdimm_events_map array.
> Since stat_id variable is not NULL terminated, the kernel panics
> with CONFIG_FORTIFY_SOURCE enabled at boot time.
>
> Below are the logs of kernel panic:
>
> [0.090221][T1] detected buffer overflow in __fortify_strlen
> [0.090241][T1] [ cut here ]
> [0.090246][T1] kernel BUG at lib/string_helpers.c:980!
> [0.090253][T1] Oops: Exception in kernel mode, sig: 5 [#1]
> 
> [0.090375][T1] NIP [c077dad0] fortify_panic+0x28/0x38
> [0.090382][T1] LR [c077dacc] fortify_panic+0x24/0x38
> [0.090387][T1] Call Trace:
> [0.090390][T1] [c022d77836e0] [c077dacc] 
> fortify_panic+0x24/0x38 (unreliable)
> [9.297707] [T1] [c0080deb2660] 
> papr_scm_pmu_check_events.constprop.0+0x118/0x220 [papr_scm]
> [9.297721] [T1] [c0080deb2cb0] papr_scm_probe+0x288/0x62c 
> [papr_scm]
> [9.297732] [T1] [c09b46a8] platform_probe+0x98/0x150
>
> Fix this issue by using kmemdup_nul function to copy the content of
> stat->stat_id directly to the nvdimm_events_map array.
>
> Fixes: 4c08d4bbc089 ("powerpc/papr_scm: Add perf interface support")
> Signed-off-by: Kajol Jain 
> ---
>  arch/powerpc/platforms/pseries/papr_scm.c | 7 ++-
>  1 file changed, 2 insertions(+), 5 deletions(-)
>
> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
> b/arch/powerpc/platforms/pseries/papr_scm.c
> index f58728d5f10d..39962c905542 100644
> --- a/arch/powerpc/platforms/pseries/papr_scm.c
> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> @@ -462,7 +462,6 @@ static int papr_scm_pmu_check_events(struct papr_scm_priv 
> *p, struct nvdimm_pmu
>  {
>   struct papr_scm_perf_stat *stat;
>   struct papr_scm_perf_stats *stats;
> - char *statid;
>   int index, rc, count;
>   u32 available_events;
>  
> @@ -493,14 +492,12 @@ static int papr_scm_pmu_check_events(struct 
> papr_scm_priv *p, struct nvdimm_pmu
>  
>   for (index = 0, stat = stats->scm_statistic, count = 0;
>index < available_events; index++, ++stat) {
> - statid = kzalloc(strlen(stat->stat_id) + 1, GFP_KERNEL);
> - if (!statid) {
> + p->nvdimm_events_map[count] = kmemdup_nul(stat->stat_id,
>   8, GFP_KERNEL);
s/8/sizeof(stat->stat_id)/

> + if (!p->nvdimm_events_map[count]) {
>   rc = -ENOMEM;
>   goto out_nvdimm_events_map;
>   }
>  
> - strcpy(statid, stat->stat_id);
> - p->nvdimm_events_map[count] = statid;
>   count++;
>   }
>   p->nvdimm_events_map[count] = NULL;
> -- 
> 2.31.1
>

-- 
Cheers
~ Vaibhav



Re: [PATCH 2/2] powerpc/papr_scm: Fix build failure when CONFIG_PERF_EVENTS is not set

2022-03-23 Thread Vaibhav Jain
Dan Williams  writes:

> On Tue, Mar 22, 2022 at 7:30 AM kajoljain  wrote:
>>
>>
>>
>> On 3/22/22 03:09, Dan Williams wrote:
>> > On Fri, Mar 18, 2022 at 4:42 AM Kajol Jain  wrote:
>> >>
>> >> The following build failure occures when CONFIG_PERF_EVENTS is not set
>> >> as generic pmu functions are not visible in that scenario.
>> >>
>> >> arch/powerpc/platforms/pseries/papr_scm.c:372:35: error: ‘struct 
>> >> perf_event’ has no member named ‘attr’
>> >>  p->nvdimm_events_map[event->attr.config],
>> >>^~
>> >> In file included from ./include/linux/list.h:5,
>> >>  from ./include/linux/kobject.h:19,
>> >>  from ./include/linux/of.h:17,
>> >>  from arch/powerpc/platforms/pseries/papr_scm.c:5:
>> >> arch/powerpc/platforms/pseries/papr_scm.c: In function 
>> >> ‘papr_scm_pmu_event_init’:
>> >> arch/powerpc/platforms/pseries/papr_scm.c:389:49: error: ‘struct 
>> >> perf_event’ has no member named ‘pmu’
>> >>   struct nvdimm_pmu *nd_pmu = to_nvdimm_pmu(event->pmu);
>> >>  ^~
>> >> ./include/linux/container_of.h:18:26: note: in definition of macro 
>> >> ‘container_of’
>> >>   void *__mptr = (void *)(ptr); \
>> >>   ^~~
>> >> arch/powerpc/platforms/pseries/papr_scm.c:389:30: note: in expansion of 
>> >> macro ‘to_nvdimm_pmu’
>> >>   struct nvdimm_pmu *nd_pmu = to_nvdimm_pmu(event->pmu);
>> >>   ^
>> >> In file included from ./include/linux/bits.h:22,
>> >>  from ./include/linux/bitops.h:6,
>> >>  from ./include/linux/of.h:15,
>> >>  from arch/powerpc/platforms/pseries/papr_scm.c:5:
>> >>
>> >> Fix the build issue by adding check for CONFIG_PERF_EVENTS config option
>> >> and disabling the papr_scm perf interface support incase this config
>> >> is not set
>> >>
>> >> Fixes: 4c08d4bbc089 ("powerpc/papr_scm: Add perf interface support") 
>> >> (Commit id
>> >> based on linux-next tree)
>> >> Signed-off-by: Kajol Jain 
>> >> ---
>> >>  arch/powerpc/platforms/pseries/papr_scm.c | 15 +++
>> >
>> > This is a bit messier than I would have liked mainly because it dumps
>> > a bunch of ifdefery into a C file contrary to coding style, "Wherever
>> > possible, don't use preprocessor conditionals (#if, #ifdef) in .c
>> > files". I would expect this all to move to an organization like:
>>
>> Hi Dan,
>>   Thanks for reviewing the patches. Inorder to avoid the multiple
>> ifdefs checks, we can also add stub function for papr_scm_pmu_register.
>> With that change we will just have one ifdef check for
>> CONFIG_PERF_EVENTS config in both papr_scm.c and nd.h file. Hence we can
>> avoid adding new files specific for papr_scm perf interface.
>>
>> Below is the code snippet for that change, let me know if looks fine to
>> you. I tested it
>> with set/unset PAPR_SCM config value and set/unset PERF_EVENTS config
>> value combinations.
>>
>> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c
>> b/arch/powerpc/platforms/pseries/papr_scm.c
>> index 4dd513d7c029..38fabb44d3c3 100644
>> --- a/arch/powerpc/platforms/pseries/papr_scm.c
>> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
>> @@ -69,8 +69,6 @@
>>  #define PAPR_SCM_PERF_STATS_EYECATCHER __stringify(SCMSTATS)
>>  #define PAPR_SCM_PERF_STATS_VERSION 0x1
>>
>> -#define to_nvdimm_pmu(_pmu)container_of(_pmu, struct nvdimm_pmu, pmu)
>> -
>>  /* Struct holding a single performance metric */
>>  struct papr_scm_perf_stat {
>> u8 stat_id[8];
>> @@ -346,6 +344,9 @@ static ssize_t drc_pmem_query_stats(struct
>> papr_scm_priv *p,
>> return 0;
>>  }
>>
>> +#ifdef CONFIG_PERF_EVENTS
>> +#define to_nvdimm_pmu(_pmu)container_of(_pmu, struct nvdimm_pmu, pmu)
>> +
>>  static int papr_scm_pmu_get_value(struct perf_event *event, struct
>> device *dev, u64 *count)
>>  {
>> struct papr_scm_perf_stat *stat;
>> @@ -558,6 +559,10 @@ static void papr_scm_pmu_register(struct
>> papr_scm_priv *p)
>> dev_info(>pdev->dev, "nvdimm pmu didn't register rc=%d\n", rc);
>>  }
>>
>> +#else
>> +static inline void papr_scm_pmu_register(struct papr_scm_priv *p) { }
>
> Since this isn't in a header file, it does not need to be marked
> "inline" the compiler will figure it out.
>
>> +#endif
>
> It might be time to create:
>
> arch/powerpc/platforms/pseries/papr_scm.h
>
> ...there is quite a bit of header material accrued in papr_scm.c and
> once the ifdefs start landing in it then it becomes a nominal coding
> style issue. That said, this is certainly more palatable than the
> previous version. So if you don't want to create papr_scm.h yet for
> this, at least make a note in the changelog that the first portion of
> arch/powerpc/platforms/pseries/papr_scm.c is effectively papr_scm.h
> content and may move there in the future, or something like that.

Great suggestion Dan and incidently we are already working on 

Re: linux-next: manual merge of the nvdimm tree with the powerpc tree

2022-03-15 Thread Vaibhav Jain
Stephen Rothwell  writes:

> Hi all,
>
> Today's linux-next merge of the nvdimm tree got a conflict in:
>
>   arch/powerpc/platforms/pseries/papr_scm.c
>
> between commit:
>
>   bbbca72352bb ("powerpc/papr_scm: Implement initial support for injecting 
> smart errors")
>
> from the powerpc tree and commit:
>
>   4c08d4bbc089 ("powerpc/papr_scm: Add perf interface support")
>
> from the nvdimm tree.
>
> I fixed it up (see below) and can carry the fix as necessary. This
> is now fixed as far as linux-next is concerned, but any non trivial
> conflicts should be mentioned to your upstream maintainer when your tree
> is submitted for merging.  You may also want to consider cooperating
> with the maintainer of the conflicting tree to minimise any particularly
> complex conflicts.

Thanks for this correction Stephen and the change looks ok to me. I
verified the functionality introduced by kernel commit bbbca72352bb
("powerpc/papr_scm: Implement initial support for injecting smart
errors") on the 'next-20220315' and found it to be working fine.



-- 
Cheers
~ Vaibhav


[PATCH v4] powerpc/papr_scm: Implement initial support for injecting smart errors

2022-01-24 Thread Vaibhav Jain
Presently PAPR doesn't support injecting smart errors on an
NVDIMM. This makes testing the NVDIMM health reporting functionality
difficult as simulating NVDIMM health related events need a hacked up
qemu version.

To solve this problem this patch proposes simulating certain set of
NVDIMM health related events in papr_scm. Specifically 'fatal' health
state and 'dirty' shutdown state. These error can be injected via the
user-space 'ndctl-inject-smart(1)' command. With the proposed patch and
corresponding ndctl patches following command flow is expected:

$ sudo ndctl list -DH -d nmem0
...
  "health_state":"ok",
  "shutdown_state":"clean",
...
 # inject unsafe shutdown and fatal health error
$ sudo ndctl inject-smart nmem0 -Uf
...
  "health_state":"fatal",
  "shutdown_state":"dirty",
...
 # uninject all errors
$ sudo ndctl inject-smart nmem0 -N
...
  "health_state":"ok",
  "shutdown_state":"clean",
...

The patch adds a new member 'health_bitmap_inject_mask' inside struct
papr_scm_priv which is then bitwise ANDed to the health bitmap fetched from the
hypervisor. The value for 'health_bitmap_inject_mask' is accessible from sysfs
at nmemX/papr/health_bitmap_inject.

A new PDSM named 'SMART_INJECT' is proposed that accepts newly
introduced 'struct nd_papr_pdsm_smart_inject' as payload thats
exchanged between libndctl and papr_scm to indicate the requested
smart-error states.

When the processing the PDSM 'SMART_INJECT', papr_pdsm_smart_inject()
constructs a pair or 'inject_mask' and 'clear_mask' bitmaps from the payload
and bit-blt it to the 'health_bitmap_inject_mask'. This ensures the after being
fetched from the hypervisor, the health_bitmap reflects requested smart-error
states.

Signed-off-by: Vaibhav Jain 
Signed-off-by: Shivaprasad G Bhat 
---
Changelog:

Since v3:
* Renamed the sysfs entry from 'health_bitmap_override' to
'health_bitmap_inject'.
* Simplified the variable names and removed the 'health_bitmap_{mask,override}'
members. Instead replaced them with a single 'health_bitmap_inject_mask'
member. [Aneesh]
* Updated the sysfs documentations and commit description.
* Used READ/WRITE_ONCE macros at places where 'health_bitmap_inject_mask' may be
accessed concurrently.

Since v2:
* Rebased the patch to ppc-next
* Added documentation for newly introduced sysfs attribute 
'health_bitmap_override'

Since v1:
* Updated the patch description.
* Removed dependency of a header movement patch.
* Removed '__packed' attribute for 'struct nd_papr_pdsm_smart_inject' [Aneesh]
---
 Documentation/ABI/testing/sysfs-bus-papr-pmem | 12 +++
 arch/powerpc/include/uapi/asm/papr_pdsm.h | 18 
 arch/powerpc/platforms/pseries/papr_scm.c | 90 ++-
 3 files changed, 117 insertions(+), 3 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-bus-papr-pmem 
b/Documentation/ABI/testing/sysfs-bus-papr-pmem
index 95254cec92bf..4ac0673901e7 100644
--- a/Documentation/ABI/testing/sysfs-bus-papr-pmem
+++ b/Documentation/ABI/testing/sysfs-bus-papr-pmem
@@ -61,3 +61,15 @@ Description:
* "CchRHCnt" : Cache Read Hit Count
* "CchWHCnt" : Cache Write Hit Count
* "FastWCnt" : Fast Write Count
+
+What:  /sys/bus/nd/devices/nmemX/papr/health_bitmap_inject
+Date:  Jan, 2022
+KernelVersion: v5.17
+Contact:   linuxppc-dev , 
nvd...@lists.linux.dev,
+Description:
+   (RO) Reports the health bitmap inject bitmap that is applied to
+   bitmap received from PowerVM via the H_SCM_HEALTH. This is used
+   to forcibly set specific bits returned from Hcall. These is then
+   used to simulate various health or shutdown states for an nvdimm
+   and are set by user-space tools like ndctl by issuing a PAPR 
DSM.
+
diff --git a/arch/powerpc/include/uapi/asm/papr_pdsm.h 
b/arch/powerpc/include/uapi/asm/papr_pdsm.h
index 82488b1e7276..17439925045c 100644
--- a/arch/powerpc/include/uapi/asm/papr_pdsm.h
+++ b/arch/powerpc/include/uapi/asm/papr_pdsm.h
@@ -116,6 +116,22 @@ struct nd_papr_pdsm_health {
};
 };
 
+/* Flags for injecting specific smart errors */
+#define PDSM_SMART_INJECT_HEALTH_FATAL (1 << 0)
+#define PDSM_SMART_INJECT_BAD_SHUTDOWN (1 << 1)
+
+struct nd_papr_pdsm_smart_inject {
+   union {
+   struct {
+   /* One or more of PDSM_SMART_INJECT_ */
+   __u32 flags;
+   __u8 fatal_enable;
+   __u8 unsafe_shutdown_enable;
+   };
+   __u8 buf[ND_PDSM_PAYLOAD_MAX_SIZE];
+   };
+};
+
 /*
  * Methods to be embedded in ND_CMD_CALL request. These are sent to the kernel
  * via 'nd_cmd_pkg.nd_command' member of the ioctl struct
@@ -123,12 +139,14 @@ struct nd_papr_pdsm_health 

Re: [PATCH v3] powerpc/papr_scm: Implement initial support for injecting smart errors

2022-01-19 Thread Vaibhav Jain


Hi Ira, Thanks for reviewing this patch.

Ira Weiny  writes:

> On Thu, Jan 13, 2022 at 05:32:52PM +0530, Vaibhav Jain wrote:
> [snip]
>
>>  
>> +/* Inject a smart error Add the dirty-shutdown-counter value to the pdsm */
>> +static int papr_pdsm_smart_inject(struct papr_scm_priv *p,
>> +  union nd_pdsm_payload *payload)
>> +{
>> +int rc;
>> +u32 supported_flags = 0;
>> +u64 mask = 0, override = 0;
>> +
>> +/* Check for individual smart error flags and update mask and override 
>> */
>> +if (payload->smart_inject.flags & PDSM_SMART_INJECT_HEALTH_FATAL) {
>> +supported_flags |= PDSM_SMART_INJECT_HEALTH_FATAL;
>> +mask |= PAPR_PMEM_HEALTH_FATAL;
>> +override |= payload->smart_inject.fatal_enable ?
>> +PAPR_PMEM_HEALTH_FATAL : 0;
>> +}
>> +
>> +if (payload->smart_inject.flags & PDSM_SMART_INJECT_BAD_SHUTDOWN) {
>> +supported_flags |= PDSM_SMART_INJECT_BAD_SHUTDOWN;
>> +mask |= PAPR_PMEM_SHUTDOWN_DIRTY;
>> +override |= payload->smart_inject.unsafe_shutdown_enable ?
>> +PAPR_PMEM_SHUTDOWN_DIRTY : 0;
>> +}
>> +
>
> I'm struggling to see why there is a need for both a flag and an 8 bit 
> 'enable'
> value?
>
This is to enable the inject/uninject error usecase with ndctl which
lets user select individual error conditions like bad_shutdown or
fatal-health state.

The nd_papr_pdsm_smart_inject.flag field indicates which error
conditions needs to be tweaked and individual __u8 fields like
'fatal_enable' are boolean values to indicate the inject/uninject state
of that error condition.

For e.g to uninject fatal-health and inject unsafe-shutdown following
nd_papr_pdsm_smart_inject payload can be sent:

{
.flags = PDSM_SMART_INJECT_HEALTH_FATAL |
   PDSM_SMART_INJECT_BAD_SHUTDOWN,
.fatal_enable = 0,
.unsafe_shutdown_enable = 1,
}


To just to inject fatal-health following nd_papr_pdsm_smart_inject
payload can be sent:

{
.flags = PDSM_SMART_INJECT_HEALTH_FATAL,
.fatal_enable = 1,
.unsafe_shutdown_enable = ,
}


> Ira
>

-- 
Cheers
~ Vaibhav



Re: [PATCH v3] powerpc/papr_scm: Implement initial support for injecting smart errors

2022-01-17 Thread Vaibhav Jain
Thanks for reviewing this patch Aneesh. My responses to your comments
below:

"Aneesh Kumar K.V"  writes:

> Vaibhav Jain  writes:
>
>> Presently PAPR doesn't support injecting smart errors on an
>> NVDIMM. This makes testing the NVDIMM health reporting functionality
>> difficult as simulating NVDIMM health related events need a hacked up
>> qemu version.
>>
>> To solve this problem this patch proposes simulating certain set of
>> NVDIMM health related events in papr_scm. Specifically 'fatal' health
>> state and 'dirty' shutdown state. These error can be injected via the
>> user-space 'ndctl-inject-smart(1)' command. With the proposed patch and
>> corresponding ndctl patches following command flow is expected:
>>
>> $ sudo ndctl list -DH -d nmem0
>> ...
>>   "health_state":"ok",
>>   "shutdown_state":"clean",
>> ...
>>  # inject unsafe shutdown and fatal health error
>> $ sudo ndctl inject-smart nmem0 -Uf
>> ...
>>   "health_state":"fatal",
>>   "shutdown_state":"dirty",
>> ...
>>  # uninject all errors
>> $ sudo ndctl inject-smart nmem0 -N
>> ...
>>   "health_state":"ok",
>>   "shutdown_state":"clean",
>> ...
>>
>> The patch adds two members 'health_bitmap_mask' and
>> 'health_bitmap_override' inside struct papr_scm_priv which are then
>> bit blt'ed to the health bitmaps fetched from the hypervisor. In case
>> we are not able to fetch health information from the hypervisor we
>> service the health bitmap from these two members. These members are
>> accessible from sysfs at nmemX/papr/health_bitmap_override
>>
>> A new PDSM named 'SMART_INJECT' is proposed that accepts newly
>> introduced 'struct nd_papr_pdsm_smart_inject' as payload thats
>> exchanged between libndctl and papr_scm to indicate the requested
>> smart-error states.
>>
>> When the processing the PDSM 'SMART_INJECT', papr_pdsm_smart_inject()
>> constructs a pair or 'mask' and 'override' bitmaps from the payload
>> and bit-blt it to the 'health_bitmap_{mask, override}' members. This
>> ensures the after being fetched from the hypervisor, the health_bitmap
>> reflects requested smart-error states.
>>
>> Signed-off-by: Vaibhav Jain 
>> Signed-off-by: Shivaprasad G Bhat 
>> ---
>> Changelog:
>>
>> Since v2:
>> * Rebased the patch to ppc-next
>> * Added documentation for newly introduced sysfs attribute 
>> 'health_bitmap_override'
>>
>> Since v1:
>> * Updated the patch description.
>> * Removed dependency of a header movement patch.
>> * Removed '__packed' attribute for 'struct nd_papr_pdsm_smart_inject' 
>> [Aneesh]
>> ---
>>  Documentation/ABI/testing/sysfs-bus-papr-pmem | 13 +++
>>  arch/powerpc/include/uapi/asm/papr_pdsm.h | 18 
>>  arch/powerpc/platforms/pseries/papr_scm.c | 94 ++-
>>  3 files changed, 122 insertions(+), 3 deletions(-)
>>
>> diff --git a/Documentation/ABI/testing/sysfs-bus-papr-pmem 
>> b/Documentation/ABI/testing/sysfs-bus-papr-pmem
>> index 95254cec92bf..8a0b2a7f7671 100644
>> --- a/Documentation/ABI/testing/sysfs-bus-papr-pmem
>> +++ b/Documentation/ABI/testing/sysfs-bus-papr-pmem
>> @@ -61,3 +61,16 @@ Description:
>>  * "CchRHCnt" : Cache Read Hit Count
>>  * "CchWHCnt" : Cache Write Hit Count
>>  * "FastWCnt" : Fast Write Count
>> +
>> +What:   /sys/bus/nd/devices/nmemX/papr/health_bitmap_override
>> +Date:   Jan, 2022
>> +KernelVersion:  v5.17
>> +Contact:linuxppc-dev , 
>> nvd...@lists.linux.dev,
>> +Description:
>> +(RO) Reports the health bitmap override/mask bitmaps that are
>> +applied to top bitmap received from PowerVM via the H_SCM_HEALTH
>> +Hcall. Together these can be used to forcibly set/reset specific
>> +bits returned from Hcall. These bitmaps are presently used to
>> +simulate various health or shutdown states for an nvdimm and are
>> +set by user-space tools like ndctl by issuing a PAPR DSM.
>> +
>> diff --git a/arch/powerpc/include/uapi/asm/papr_pdsm.h 
>> b/arch/powerpc/include/uapi/asm/papr_pdsm.h
>> index 82488b1e7276..17439925045c 100644
>> --- a/arch/powerpc/include/uapi/asm/papr_pdsm.h
>> +++ b/arch/powerpc/include/uapi/asm/papr_pdsm.h
>> @@ -116,6 +116,22 @@ struct nd_

[PATCH v3] powerpc/papr_scm: Implement initial support for injecting smart errors

2022-01-13 Thread Vaibhav Jain
Presently PAPR doesn't support injecting smart errors on an
NVDIMM. This makes testing the NVDIMM health reporting functionality
difficult as simulating NVDIMM health related events need a hacked up
qemu version.

To solve this problem this patch proposes simulating certain set of
NVDIMM health related events in papr_scm. Specifically 'fatal' health
state and 'dirty' shutdown state. These error can be injected via the
user-space 'ndctl-inject-smart(1)' command. With the proposed patch and
corresponding ndctl patches following command flow is expected:

$ sudo ndctl list -DH -d nmem0
...
  "health_state":"ok",
  "shutdown_state":"clean",
...
 # inject unsafe shutdown and fatal health error
$ sudo ndctl inject-smart nmem0 -Uf
...
  "health_state":"fatal",
  "shutdown_state":"dirty",
...
 # uninject all errors
$ sudo ndctl inject-smart nmem0 -N
...
  "health_state":"ok",
  "shutdown_state":"clean",
...

The patch adds two members 'health_bitmap_mask' and
'health_bitmap_override' inside struct papr_scm_priv which are then
bit blt'ed to the health bitmaps fetched from the hypervisor. In case
we are not able to fetch health information from the hypervisor we
service the health bitmap from these two members. These members are
accessible from sysfs at nmemX/papr/health_bitmap_override

A new PDSM named 'SMART_INJECT' is proposed that accepts newly
introduced 'struct nd_papr_pdsm_smart_inject' as payload thats
exchanged between libndctl and papr_scm to indicate the requested
smart-error states.

When the processing the PDSM 'SMART_INJECT', papr_pdsm_smart_inject()
constructs a pair or 'mask' and 'override' bitmaps from the payload
and bit-blt it to the 'health_bitmap_{mask, override}' members. This
ensures the after being fetched from the hypervisor, the health_bitmap
reflects requested smart-error states.

Signed-off-by: Vaibhav Jain 
Signed-off-by: Shivaprasad G Bhat 
---
Changelog:

Since v2:
* Rebased the patch to ppc-next
* Added documentation for newly introduced sysfs attribute 
'health_bitmap_override'

Since v1:
* Updated the patch description.
* Removed dependency of a header movement patch.
* Removed '__packed' attribute for 'struct nd_papr_pdsm_smart_inject' [Aneesh]
---
 Documentation/ABI/testing/sysfs-bus-papr-pmem | 13 +++
 arch/powerpc/include/uapi/asm/papr_pdsm.h | 18 
 arch/powerpc/platforms/pseries/papr_scm.c | 94 ++-
 3 files changed, 122 insertions(+), 3 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-bus-papr-pmem 
b/Documentation/ABI/testing/sysfs-bus-papr-pmem
index 95254cec92bf..8a0b2a7f7671 100644
--- a/Documentation/ABI/testing/sysfs-bus-papr-pmem
+++ b/Documentation/ABI/testing/sysfs-bus-papr-pmem
@@ -61,3 +61,16 @@ Description:
* "CchRHCnt" : Cache Read Hit Count
* "CchWHCnt" : Cache Write Hit Count
* "FastWCnt" : Fast Write Count
+
+What:  /sys/bus/nd/devices/nmemX/papr/health_bitmap_override
+Date:  Jan, 2022
+KernelVersion: v5.17
+Contact:   linuxppc-dev , 
nvd...@lists.linux.dev,
+Description:
+   (RO) Reports the health bitmap override/mask bitmaps that are
+   applied to top bitmap received from PowerVM via the H_SCM_HEALTH
+   Hcall. Together these can be used to forcibly set/reset specific
+   bits returned from Hcall. These bitmaps are presently used to
+   simulate various health or shutdown states for an nvdimm and are
+   set by user-space tools like ndctl by issuing a PAPR DSM.
+
diff --git a/arch/powerpc/include/uapi/asm/papr_pdsm.h 
b/arch/powerpc/include/uapi/asm/papr_pdsm.h
index 82488b1e7276..17439925045c 100644
--- a/arch/powerpc/include/uapi/asm/papr_pdsm.h
+++ b/arch/powerpc/include/uapi/asm/papr_pdsm.h
@@ -116,6 +116,22 @@ struct nd_papr_pdsm_health {
};
 };
 
+/* Flags for injecting specific smart errors */
+#define PDSM_SMART_INJECT_HEALTH_FATAL (1 << 0)
+#define PDSM_SMART_INJECT_BAD_SHUTDOWN (1 << 1)
+
+struct nd_papr_pdsm_smart_inject {
+   union {
+   struct {
+   /* One or more of PDSM_SMART_INJECT_ */
+   __u32 flags;
+   __u8 fatal_enable;
+   __u8 unsafe_shutdown_enable;
+   };
+   __u8 buf[ND_PDSM_PAYLOAD_MAX_SIZE];
+   };
+};
+
 /*
  * Methods to be embedded in ND_CMD_CALL request. These are sent to the kernel
  * via 'nd_cmd_pkg.nd_command' member of the ioctl struct
@@ -123,12 +139,14 @@ struct nd_papr_pdsm_health {
 enum papr_pdsm {
PAPR_PDSM_MIN = 0x0,
PAPR_PDSM_HEALTH,
+   PAPR_PDSM_SMART_INJECT,
PAPR_PDSM_MAX,
 };
 
 /* Maximal union that can hold all possible payload types */
 union nd_pdsm_payload {
stru

Re: [RFC PATCH v2] powerpc/papr_scm: Move duplicate definitions to common header files

2021-09-08 Thread Vaibhav Jain
Hi Mpe,

Thanks for looking into this patch.

Michael Ellerman  writes:

> Shivaprasad G Bhat  writes:
>> papr_scm and ndtest share common PDSM payload structs like
>> nd_papr_pdsm_health. Presently these structs are duplicated across 
>> papr_pdsm.h
>> and ndtest.h header files. Since 'ndtest' is essentially arch independent 
>> and can
>> run on platforms other than PPC64, a way needs to be deviced to avoid 
>> redundancy
>> and duplication of PDSM structs in future.
>>
>> So the patch proposes moving the PDSM header from arch/powerpc/include/uapi/ 
>> to
>> the generic include/uapi/linux directory. Also, there are some #defines 
>> common
>> between papr_scm and ndtest which are not exported to the user space. So, 
>> move
>> them to a header file which can be shared across ndtest and papr_scm via 
>> newly
>> introduced include/linux/papr_scm.h.
>>
>> Signed-off-by: Shivaprasad G Bhat 
>> Signed-off-by: Vaibhav Jain 
>> Suggested-by: "Aneesh Kumar K.V" 
>> ---
>> Changelog:
>>
>> Since v1:
>> Link: 
>> https://patchwork.kernel.org/project/linux-nvdimm/patch/162505488483.72147.12741153746322191381.stgit@56e104a48989/
>> * Removed dependency on this patch for the other patches
>>
>>  MAINTAINERS   |2 
>>  arch/powerpc/include/uapi/asm/papr_pdsm.h |  165 
>> -
>>  arch/powerpc/platforms/pseries/papr_scm.c |   43 
>>  include/linux/papr_scm.h  |   48 
>>  include/uapi/linux/papr_pdsm.h|  165 
>> +
>
> This doesn't make sense to me.
>
> Anything with papr (or PAPR) in the name is fundamentally powerpc
> specific, it doesn't belong in a generic header, or in a generic
> location.
>
> What's the actual problem you're trying to solve?
>
The ndtest module (tools/testing/nvdimm/test/ndtest.c) is implemented in
an arch independed way to enable testing of PAPR PDSMs on non ppc64
platforms like x86_64. It uses the same PDSM structs as used by papr_scm
to communicate with libndctl userspace. 

Since papr_scm is ppc64 arch specific we were so far duplicating the
PDSM structures between ndtest and papr_scm. The patch tries to solve
this duplication by moving the shared structs to arch independent common
include dirs.

Secondly, PDSMs describes how userspace can use NVDIMM_FAMILY_PAPR to
interact with NVDIMMs. So potentially a new NVDIMM beyond powerpc arch
can add its support and would need access to the same structs used by
papr_scm and ndtest. In that context it would make sense to move PDSM
headers to generic include dirs.

> If it's just including papr_scm bits into ndtest.c then that should be
> as simple as:
>
> #ifdef __powerpc__
> #include 
> #endif
>
> Shouldn't it?
>
No, as ndtest implements support for NVDIMM_FAMILY_PAPR and would need
access to PDSM related structs which presently are only available for
powerpc.

> cheers
>

-- 
Cheers
~ Vaibhav


[PATCH] powerpc/papr_scm: Implement initial support for injecting smart errors

2021-07-12 Thread Vaibhav Jain
Presently PAPR doesn't support injecting smart errors on an
NVDIMM. This makes testing the NVDIMM health reporting functionality
difficult as simulating NVDIMM health related events need a hacked up
qemu version.

To solve this problem this patch proposes simulating certain set of
NVDIMM health related events in papr_scm. Specifically 'fatal' health
state and 'dirty' shutdown state. These error can be injected via the
user-space 'ndctl-inject-smart(1)' command. With the proposed patch and
corresponding ndctl patches following command flow is expected:

$ sudo ndctl list -DH -d nmem0
...
  "health_state":"ok",
  "shutdown_state":"clean",
...
 # inject unsafe shutdown and fatal health error
$ sudo ndctl inject-smart nmem0 -Uf
...
  "health_state":"fatal",
  "shutdown_state":"dirty",
...
 # uninject all errors
$ sudo ndctl inject-smart nmem0 -N
...
  "health_state":"ok",
  "shutdown_state":"clean",
...

The patch adds two members 'health_bitmap_mask' and
'health_bitmap_override' inside struct papr_scm_priv which are then
bit blt'ed[1] to the health bitmaps fetched from the hypervisor. In case
we are not able to fetch health information from the hypervisor we
service the health bitmap from these two members. These members are
accessible from sysfs at nmemX/papr/health_bitmap_override

A new PDSM named 'SMART_INJECT' is proposed that accepts newly
introduced 'struct nd_papr_pdsm_smart_inject' as payload thats
exchanged between libndctl and papr_scm to indicate the requested
smart-error states.

When the processing the PDSM 'SMART_INJECT', papr_pdsm_smart_inject()
constructs a pair or 'mask' and 'override' bitmaps from the payload
and bit-blt it to the 'health_bitmap_{mask, override}' members. This
ensures the after being fetched from the hypervisor, the health_bitmap
reflects requested smart-error states.

The patch is based on [2] "powerpc/papr_scm: Move duplicate
definitions to common header files".

[1] : https://en.wikipedia.org/wiki/Bit_blit
[2] :
https://lore.kernel.org/nvdimm/162505488483.72147.12741153746322191381.stgit@56e104a48989

Signed-off-by: Vaibhav Jain 
Signed-off-by: Shivaprasad G Bhat 
---
 arch/powerpc/platforms/pseries/papr_scm.c | 94 ++-
 include/uapi/linux/papr_pdsm.h| 18 +
 2 files changed, 109 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index 0c56db5a1427..b7437c61a270 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -29,6 +29,10 @@
 (1ul << ND_CMD_SET_CONFIG_DATA) | \
 (1ul << ND_CMD_CALL))
 
+/* Use bitblt method to override specific bits in the '_bitmap_' */
+#define BITBLT_BITMAP(_bitmap_, _mask_, _override_)\
+   (((_bitmap_) & ~(_mask_)) | ((_mask_) & (_override_)))
+
 /* Struct holding a single performance metric */
 struct papr_scm_perf_stat {
u8 stat_id[8];
@@ -81,6 +85,12 @@ struct papr_scm_priv {
 
/* length of the stat buffer as expected by phyp */
size_t stat_buffer_len;
+
+   /* The bits which needs to be overridden */
+   u64 health_bitmap_mask;
+
+   /* The overridden values for the bits having the masks set */
+   u64 health_bitmap_override;
 };
 
 static int papr_scm_pmem_flush(struct nd_region *nd_region,
@@ -308,19 +318,28 @@ static ssize_t drc_pmem_query_stats(struct papr_scm_priv 
*p,
 static int __drc_pmem_query_health(struct papr_scm_priv *p)
 {
unsigned long ret[PLPAR_HCALL_BUFSIZE];
+   u64 bitmap = 0;
long rc;
 
/* issue the hcall */
rc = plpar_hcall(H_SCM_HEALTH, ret, p->drc_index);
-   if (rc != H_SUCCESS) {
+   if (rc == H_SUCCESS)
+   bitmap = ret[0] & ret[1];
+   else if (rc == H_FUNCTION)
+   dev_info_once(>pdev->dev,
+ "Hcall H_SCM_HEALTH not implemented, assuming 
empty health bitmap");
+   else {
+
dev_err(>pdev->dev,
"Failed to query health information, Err:%ld\n", rc);
return -ENXIO;
}
 
p->lasthealth_jiffies = jiffies;
-   p->health_bitmap = ret[0] & ret[1];
-
+   /* Allow overriding specific health bits via bit blt. */
+   bitmap = BITBLT_BITMAP(bitmap, p->health_bitmap_mask,
+  p->health_bitmap_override);
+   WRITE_ONCE(p->health_bitmap, bitmap);
dev_dbg(>pdev->dev,
"Queried dimm health info. Bitmap:0x%016lx Mask:0x%016lx\n",
ret[0], ret[1]);
@@ -630,6 +649,54 @@ static int papr_pdsm_health(struct papr_scm_priv *p,
return rc;
 }
 
+/* Inject a smart error Add the dirty-shutdown-counter value

[RESEND-PATCH v2] powerpc/papr_scm: Add support for reporting dirty-shutdown-count

2021-06-24 Thread Vaibhav Jain
Persistent memory devices like NVDIMMs can loose cached writes in case
something prevents flush on power-fail. Such situations are termed as
dirty shutdown and are exposed to applications as
last-shutdown-state (LSS) flag and a dirty-shutdown-counter(DSC) as
described at [1]. The latter being useful in conditions where multiple
applications want to detect a dirty shutdown event without racing with
one another.

PAPR-NVDIMMs have so far only exposed LSS style flags to indicate a
dirty-shutdown-state. This patch further adds support for DSC via the
"ibm,persistence-failed-count" device tree property of an NVDIMM. This
property is a monotonic increasing 64-bit counter thats an indication
of number of times an NVDIMM has encountered a dirty-shutdown event
causing persistence loss.

Since this value is not expected to change after system-boot hence
papr_scm reads & caches its value during NVDIMM probe and exposes it
as a PAPR sysfs attributed named 'dirty_shutdown' to match the name of
similarly named NFIT sysfs attribute. Also this value is available to
libnvdimm via PAPR_PDSM_HEALTH payload. 'struct nd_papr_pdsm_health'
has been extended to add a new member called 'dimm_dsc' presence of
which is indicated by the newly introduced PDSM_DIMM_DSC_VALID flag.

References:
[1] https://pmem.io/documents/Dirty_Shutdown_Handling-V1.0.pdf

Signed-off-by: Vaibhav Jain 
Reviewed-by: Aneesh Kumar K.V 
---
Changelog:

Resend:
* Added ack by Aneesh.
* Minor fix to changelog of v2 patch

v2:
* Rebased the patch on latest ppc-next tree
* s/psdm/pdsm/g   [ Santosh ]
---
 arch/powerpc/include/uapi/asm/papr_pdsm.h |  6 +
 arch/powerpc/platforms/pseries/papr_scm.c | 30 +++
 2 files changed, 36 insertions(+)

diff --git a/arch/powerpc/include/uapi/asm/papr_pdsm.h 
b/arch/powerpc/include/uapi/asm/papr_pdsm.h
index 50ef95e2f5b1..82488b1e7276 100644
--- a/arch/powerpc/include/uapi/asm/papr_pdsm.h
+++ b/arch/powerpc/include/uapi/asm/papr_pdsm.h
@@ -77,6 +77,9 @@
 /* Indicate that the 'dimm_fuel_gauge' field is valid */
 #define PDSM_DIMM_HEALTH_RUN_GAUGE_VALID 1
 
+/* Indicate that the 'dimm_dsc' field is valid */
+#define PDSM_DIMM_DSC_VALID 2
+
 /*
  * Struct exchanged between kernel & ndctl in for PAPR_PDSM_HEALTH
  * Various flags indicate the health status of the dimm.
@@ -105,6 +108,9 @@ struct nd_papr_pdsm_health {
 
/* Extension flag PDSM_DIMM_HEALTH_RUN_GAUGE_VALID */
__u16 dimm_fuel_gauge;
+
+   /* Extension flag PDSM_DIMM_DSC_VALID */
+   __u64 dimm_dsc;
};
__u8 buf[ND_PDSM_PAYLOAD_MAX_SIZE];
};
diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index 11e7b90a3360..eb8be0eb6119 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -114,6 +114,9 @@ struct papr_scm_priv {
/* Health information for the dimm */
u64 health_bitmap;
 
+   /* Holds the last known dirty shutdown counter value */
+   u64 dirty_shutdown_counter;
+
/* length of the stat buffer as expected by phyp */
size_t stat_buffer_len;
 };
@@ -603,6 +606,16 @@ static int papr_pdsm_fuel_gauge(struct papr_scm_priv *p,
return rc;
 }
 
+/* Add the dirty-shutdown-counter value to the pdsm */
+static int papr_pdsm_dsc(struct papr_scm_priv *p,
+union nd_pdsm_payload *payload)
+{
+   payload->health.extension_flags |= PDSM_DIMM_DSC_VALID;
+   payload->health.dimm_dsc = p->dirty_shutdown_counter;
+
+   return sizeof(struct nd_papr_pdsm_health);
+}
+
 /* Fetch the DIMM health info and populate it in provided package. */
 static int papr_pdsm_health(struct papr_scm_priv *p,
union nd_pdsm_payload *payload)
@@ -646,6 +659,8 @@ static int papr_pdsm_health(struct papr_scm_priv *p,
 
/* Populate the fuel gauge meter in the payload */
papr_pdsm_fuel_gauge(p, payload);
+   /* Populate the dirty-shutdown-counter field */
+   papr_pdsm_dsc(p, payload);
 
rc = sizeof(struct nd_papr_pdsm_health);
 
@@ -907,6 +922,16 @@ static ssize_t flags_show(struct device *dev,
 }
 DEVICE_ATTR_RO(flags);
 
+static ssize_t dirty_shutdown_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+   struct nvdimm *dimm = to_nvdimm(dev);
+   struct papr_scm_priv *p = nvdimm_provider_data(dimm);
+
+   return sysfs_emit(buf, "%llu\n", p->dirty_shutdown_counter);
+}
+DEVICE_ATTR_RO(dirty_shutdown);
+
 static umode_t papr_nd_attribute_visible(struct kobject *kobj,
 struct attribute *attr, int n)
 {
@@ -925,6 +950,7 @@ static umode_t papr_nd_attribute_visible(struct kobject 
*kobj,
 static struct attribute *papr_nd_attributes[] = {
_attr_flags.attr,
_attr_perf_stats

Re: [PATCH] powerpc/papr_scm: Add support for reporting dirty-shutdown-count

2021-05-27 Thread Vaibhav Jain
Thanks for catching this Santosh. Have fixed this in v2 version of this
patch

-- 
Cheers
~ Vaibhav


Santosh Sivaraj  writes:

> Hi Vaibhav,
>
> Vaibhav Jain  writes:
>
>> Persistent memory devices like NVDIMMs can loose cached writes in case
>> something prevents flush on power-fail. Such situations are termed as
>> dirty shutdown and are exposed to applications as
>> last-shutdown-state (LSS) flag and a dirty-shutdown-counter(DSC) as
>> described at [1]. The latter being useful in conditions where multiple
>> applications want to detect a dirty shutdown event without racing with
>> one another.
>>
>> PAPR-NVDIMMs have so far only exposed LSS style flags to indicate a
>> dirty-shutdown-state. This patch further adds support for DSC via the
>> "ibm,persistence-failed-count" device tree property of an NVDIMM. This
>> property is a monotonic increasing 64-bit counter thats an indication
>> of number of times an NVDIMM has encountered a dirty-shutdown event
>> causing persistence loss.
>>
>> Since this value is not expected to change after system-boot hence
>> papr_scm reads & caches its value during NVDIMM probe and exposes it
>> as a PAPR sysfs attributed named 'dirty_shutdown' to match the name of
>> similarly named NFIT sysfs attribute. Also this value is available to
>> libnvdimm via PAPR_PDSM_HEALTH payload. 'struct nd_papr_pdsm_health'
>> has been extended to add a new member called 'dimm_dsc' presence of
>> which is indicated by the newly introduced PDSM_DIMM_DSC_VALID flag.
>>
>> References:
>> [1] https://pmem.io/documents/Dirty_Shutdown_Handling-V1.0.pdf
>>
>> Signed-off-by: Vaibhav Jain 
>> ---
>>  arch/powerpc/include/uapi/asm/papr_pdsm.h |  6 +
>>  arch/powerpc/platforms/pseries/papr_scm.c | 30 +++
>>  2 files changed, 36 insertions(+)
>>
>> diff --git a/arch/powerpc/include/uapi/asm/papr_pdsm.h 
>> b/arch/powerpc/include/uapi/asm/papr_pdsm.h
>> index 50ef95e2f5b1..82488b1e7276 100644
>> --- a/arch/powerpc/include/uapi/asm/papr_pdsm.h
>> +++ b/arch/powerpc/include/uapi/asm/papr_pdsm.h
>> @@ -77,6 +77,9 @@
>>  /* Indicate that the 'dimm_fuel_gauge' field is valid */
>>  #define PDSM_DIMM_HEALTH_RUN_GAUGE_VALID 1
>>  
>> +/* Indicate that the 'dimm_dsc' field is valid */
>> +#define PDSM_DIMM_DSC_VALID 2
>> +
>>  /*
>>   * Struct exchanged between kernel & ndctl in for PAPR_PDSM_HEALTH
>>   * Various flags indicate the health status of the dimm.
>> @@ -105,6 +108,9 @@ struct nd_papr_pdsm_health {
>>  
>>  /* Extension flag PDSM_DIMM_HEALTH_RUN_GAUGE_VALID */
>>  __u16 dimm_fuel_gauge;
>> +
>> +/* Extension flag PDSM_DIMM_DSC_VALID */
>> +__u64 dimm_dsc;
>>  };
>>  __u8 buf[ND_PDSM_PAYLOAD_MAX_SIZE];
>>  };
>> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
>> b/arch/powerpc/platforms/pseries/papr_scm.c
>> index 11e7b90a3360..68f0d3d5e899 100644
>> --- a/arch/powerpc/platforms/pseries/papr_scm.c
>> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
>> @@ -114,6 +114,9 @@ struct papr_scm_priv {
>>  /* Health information for the dimm */
>>  u64 health_bitmap;
>>  
>> +/* Holds the last known dirty shutdown counter value */
>> +u64 dirty_shutdown_counter;
>> +
>>  /* length of the stat buffer as expected by phyp */
>>  size_t stat_buffer_len;
>>  };
>> @@ -603,6 +606,16 @@ static int papr_pdsm_fuel_gauge(struct papr_scm_priv *p,
>>  return rc;
>>  }
>>  
>> +/* Add the dirty-shutdown-counter value to the pdsm */
>> +static int papr_psdm_dsc(struct papr_scm_priv *p,
> should be pdsm
>> + union nd_pdsm_payload *payload)
>> +{
>> +payload->health.extension_flags |= PDSM_DIMM_DSC_VALID;
>> +payload->health.dimm_dsc = p->dirty_shutdown_counter;
>> +
>> +return sizeof(struct nd_papr_pdsm_health);
>> +}
>> +
>>  /* Fetch the DIMM health info and populate it in provided package. */
>>  static int papr_pdsm_health(struct papr_scm_priv *p,
>>  union nd_pdsm_payload *payload)
>> @@ -646,6 +659,8 @@ static int papr_pdsm_health(struct papr_scm_priv *p,
>>  
>>  /* Populate the fuel gauge meter in the payload */
>>  papr_pdsm_fuel_gauge(p, payload);
>> +/* Populate the dirty-shutdown-counter field */
>> +papr_psdm_dsc(p, payload);
>  

[PATCH v2] powerpc/papr_scm: Add support for reporting dirty-shutdown-count

2021-05-27 Thread Vaibhav Jain
Persistent memory devices like NVDIMMs can loose cached writes in case
something prevents flush on power-fail. Such situations are termed as
dirty shutdown and are exposed to applications as
last-shutdown-state (LSS) flag and a dirty-shutdown-counter(DSC) as
described at [1]. The latter being useful in conditions where multiple
applications want to detect a dirty shutdown event without racing with
one another.

PAPR-NVDIMMs have so far only exposed LSS style flags to indicate a
dirty-shutdown-state. This patch further adds support for DSC via the
"ibm,persistence-failed-count" device tree property of an NVDIMM. This
property is a monotonic increasing 64-bit counter thats an indication
of number of times an NVDIMM has encountered a dirty-shutdown event
causing persistence loss.

Since this value is not expected to change after system-boot hence
papr_scm reads & caches its value during NVDIMM probe and exposes it
as a PAPR sysfs attributed named 'dirty_shutdown' to match the name of
similarly named NFIT sysfs attribute. Also this value is available to
libnvdimm via PAPR_PDSM_HEALTH payload. 'struct nd_papr_pdsm_health'
has been extended to add a new member called 'dimm_dsc' presence of
which is indicated by the newly introduced PDSM_DIMM_DSC_VALID flag.

References:
[1] https://pmem.io/documents/Dirty_Shutdown_Handling-V1.0.pdf

Signed-off-by: Vaibhav Jain 
---
Changelog:

v2:
* Rebased the patch on latest ppc-next tree
* s/psdm/psdm/g   [ Santosh ]
---
 arch/powerpc/include/uapi/asm/papr_pdsm.h |  6 +
 arch/powerpc/platforms/pseries/papr_scm.c | 30 +++
 2 files changed, 36 insertions(+)

diff --git a/arch/powerpc/include/uapi/asm/papr_pdsm.h 
b/arch/powerpc/include/uapi/asm/papr_pdsm.h
index 50ef95e2f5b1..82488b1e7276 100644
--- a/arch/powerpc/include/uapi/asm/papr_pdsm.h
+++ b/arch/powerpc/include/uapi/asm/papr_pdsm.h
@@ -77,6 +77,9 @@
 /* Indicate that the 'dimm_fuel_gauge' field is valid */
 #define PDSM_DIMM_HEALTH_RUN_GAUGE_VALID 1
 
+/* Indicate that the 'dimm_dsc' field is valid */
+#define PDSM_DIMM_DSC_VALID 2
+
 /*
  * Struct exchanged between kernel & ndctl in for PAPR_PDSM_HEALTH
  * Various flags indicate the health status of the dimm.
@@ -105,6 +108,9 @@ struct nd_papr_pdsm_health {
 
/* Extension flag PDSM_DIMM_HEALTH_RUN_GAUGE_VALID */
__u16 dimm_fuel_gauge;
+
+   /* Extension flag PDSM_DIMM_DSC_VALID */
+   __u64 dimm_dsc;
};
__u8 buf[ND_PDSM_PAYLOAD_MAX_SIZE];
};
diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index 11e7b90a3360..eb8be0eb6119 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -114,6 +114,9 @@ struct papr_scm_priv {
/* Health information for the dimm */
u64 health_bitmap;
 
+   /* Holds the last known dirty shutdown counter value */
+   u64 dirty_shutdown_counter;
+
/* length of the stat buffer as expected by phyp */
size_t stat_buffer_len;
 };
@@ -603,6 +606,16 @@ static int papr_pdsm_fuel_gauge(struct papr_scm_priv *p,
return rc;
 }
 
+/* Add the dirty-shutdown-counter value to the pdsm */
+static int papr_pdsm_dsc(struct papr_scm_priv *p,
+union nd_pdsm_payload *payload)
+{
+   payload->health.extension_flags |= PDSM_DIMM_DSC_VALID;
+   payload->health.dimm_dsc = p->dirty_shutdown_counter;
+
+   return sizeof(struct nd_papr_pdsm_health);
+}
+
 /* Fetch the DIMM health info and populate it in provided package. */
 static int papr_pdsm_health(struct papr_scm_priv *p,
union nd_pdsm_payload *payload)
@@ -646,6 +659,8 @@ static int papr_pdsm_health(struct papr_scm_priv *p,
 
/* Populate the fuel gauge meter in the payload */
papr_pdsm_fuel_gauge(p, payload);
+   /* Populate the dirty-shutdown-counter field */
+   papr_pdsm_dsc(p, payload);
 
rc = sizeof(struct nd_papr_pdsm_health);
 
@@ -907,6 +922,16 @@ static ssize_t flags_show(struct device *dev,
 }
 DEVICE_ATTR_RO(flags);
 
+static ssize_t dirty_shutdown_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+   struct nvdimm *dimm = to_nvdimm(dev);
+   struct papr_scm_priv *p = nvdimm_provider_data(dimm);
+
+   return sysfs_emit(buf, "%llu\n", p->dirty_shutdown_counter);
+}
+DEVICE_ATTR_RO(dirty_shutdown);
+
 static umode_t papr_nd_attribute_visible(struct kobject *kobj,
 struct attribute *attr, int n)
 {
@@ -925,6 +950,7 @@ static umode_t papr_nd_attribute_visible(struct kobject 
*kobj,
 static struct attribute *papr_nd_attributes[] = {
_attr_flags.attr,
_attr_perf_stats.attr,
+   _attr_dirty_shutdown.attr,
NULL,
 };
 
@@ -1149,6 +1175,10 @@ static int pap

[PATCH] powerpc/papr_scm: Add support for reporting dirty-shutdown-count

2021-05-21 Thread Vaibhav Jain
Persistent memory devices like NVDIMMs can loose cached writes in case
something prevents flush on power-fail. Such situations are termed as
dirty shutdown and are exposed to applications as
last-shutdown-state (LSS) flag and a dirty-shutdown-counter(DSC) as
described at [1]. The latter being useful in conditions where multiple
applications want to detect a dirty shutdown event without racing with
one another.

PAPR-NVDIMMs have so far only exposed LSS style flags to indicate a
dirty-shutdown-state. This patch further adds support for DSC via the
"ibm,persistence-failed-count" device tree property of an NVDIMM. This
property is a monotonic increasing 64-bit counter thats an indication
of number of times an NVDIMM has encountered a dirty-shutdown event
causing persistence loss.

Since this value is not expected to change after system-boot hence
papr_scm reads & caches its value during NVDIMM probe and exposes it
as a PAPR sysfs attributed named 'dirty_shutdown' to match the name of
similarly named NFIT sysfs attribute. Also this value is available to
libnvdimm via PAPR_PDSM_HEALTH payload. 'struct nd_papr_pdsm_health'
has been extended to add a new member called 'dimm_dsc' presence of
which is indicated by the newly introduced PDSM_DIMM_DSC_VALID flag.

References:
[1] https://pmem.io/documents/Dirty_Shutdown_Handling-V1.0.pdf

Signed-off-by: Vaibhav Jain 
---
 arch/powerpc/include/uapi/asm/papr_pdsm.h |  6 +
 arch/powerpc/platforms/pseries/papr_scm.c | 30 +++
 2 files changed, 36 insertions(+)

diff --git a/arch/powerpc/include/uapi/asm/papr_pdsm.h 
b/arch/powerpc/include/uapi/asm/papr_pdsm.h
index 50ef95e2f5b1..82488b1e7276 100644
--- a/arch/powerpc/include/uapi/asm/papr_pdsm.h
+++ b/arch/powerpc/include/uapi/asm/papr_pdsm.h
@@ -77,6 +77,9 @@
 /* Indicate that the 'dimm_fuel_gauge' field is valid */
 #define PDSM_DIMM_HEALTH_RUN_GAUGE_VALID 1
 
+/* Indicate that the 'dimm_dsc' field is valid */
+#define PDSM_DIMM_DSC_VALID 2
+
 /*
  * Struct exchanged between kernel & ndctl in for PAPR_PDSM_HEALTH
  * Various flags indicate the health status of the dimm.
@@ -105,6 +108,9 @@ struct nd_papr_pdsm_health {
 
/* Extension flag PDSM_DIMM_HEALTH_RUN_GAUGE_VALID */
__u16 dimm_fuel_gauge;
+
+   /* Extension flag PDSM_DIMM_DSC_VALID */
+   __u64 dimm_dsc;
};
__u8 buf[ND_PDSM_PAYLOAD_MAX_SIZE];
};
diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index 11e7b90a3360..68f0d3d5e899 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -114,6 +114,9 @@ struct papr_scm_priv {
/* Health information for the dimm */
u64 health_bitmap;
 
+   /* Holds the last known dirty shutdown counter value */
+   u64 dirty_shutdown_counter;
+
/* length of the stat buffer as expected by phyp */
size_t stat_buffer_len;
 };
@@ -603,6 +606,16 @@ static int papr_pdsm_fuel_gauge(struct papr_scm_priv *p,
return rc;
 }
 
+/* Add the dirty-shutdown-counter value to the pdsm */
+static int papr_psdm_dsc(struct papr_scm_priv *p,
+union nd_pdsm_payload *payload)
+{
+   payload->health.extension_flags |= PDSM_DIMM_DSC_VALID;
+   payload->health.dimm_dsc = p->dirty_shutdown_counter;
+
+   return sizeof(struct nd_papr_pdsm_health);
+}
+
 /* Fetch the DIMM health info and populate it in provided package. */
 static int papr_pdsm_health(struct papr_scm_priv *p,
union nd_pdsm_payload *payload)
@@ -646,6 +659,8 @@ static int papr_pdsm_health(struct papr_scm_priv *p,
 
/* Populate the fuel gauge meter in the payload */
papr_pdsm_fuel_gauge(p, payload);
+   /* Populate the dirty-shutdown-counter field */
+   papr_psdm_dsc(p, payload);
 
rc = sizeof(struct nd_papr_pdsm_health);
 
@@ -907,6 +922,16 @@ static ssize_t flags_show(struct device *dev,
 }
 DEVICE_ATTR_RO(flags);
 
+static ssize_t dirty_shutdown_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+   struct nvdimm *dimm = to_nvdimm(dev);
+   struct papr_scm_priv *p = nvdimm_provider_data(dimm);
+
+   return sysfs_emit(buf, "%llu\n", p->dirty_shutdown_counter);
+}
+DEVICE_ATTR_RO(dirty_shutdown);
+
 static umode_t papr_nd_attribute_visible(struct kobject *kobj,
 struct attribute *attr, int n)
 {
@@ -925,6 +950,7 @@ static umode_t papr_nd_attribute_visible(struct kobject 
*kobj,
 static struct attribute *papr_nd_attributes[] = {
_attr_flags.attr,
_attr_perf_stats.attr,
+   _attr_dirty_shutdown.attr,
NULL,
 };
 
@@ -1149,6 +1175,10 @@ static int papr_scm_probe(struct platform_device *pdev)
p->is_volatile = !of_property_read_bool

Re: [PATCH v2] powerpc/papr_scm: Make 'perf_stats' invisible if perf-stats unavailable

2021-05-13 Thread Vaibhav Jain
Thanks for looking into this patch Dan,

Dan Williams  writes:

> On Fri, May 7, 2021 at 4:40 AM Vaibhav Jain  wrote:
>>
>> In case performance stats for an nvdimm are not available, reading the
>> 'perf_stats' sysfs file returns an -ENOENT error. A better approach is
>> to make the 'perf_stats' file entirely invisible to indicate that
>> performance stats for an nvdimm are unavailable.
>>
>> So this patch updates 'papr_nd_attribute_group' to add a 'is_visible'
>> callback implemented as newly introduced 'papr_nd_attribute_visible()'
>> that returns an appropriate mode in case performance stats aren't
>> supported in a given nvdimm.
>>
>> Also the initialization of 'papr_scm_priv.stat_buffer_len' is moved
>> from papr_scm_nvdimm_init() to papr_scm_probe() so that it value is
>> available when 'papr_nd_attribute_visible()' is called during nvdimm
>> initialization.
>>
>> Fixes: 2d02bf835e57('powerpc/papr_scm: Fetch nvdimm performance stats from 
>> PHYP')
>
> Since this has been the exposed ABI since v5.9 perhaps a note /
> analysis is needed here that the disappearance of this file will not
> break any existing scripts/tools.
>
Yes, I have added a note in patch description and also updated
Documentation/ABI/testing/sysfs-bus-papr-pmem with a note on when this
sysfs attr would be available. Thanks for this suggestion.

>> Signed-off-by: Vaibhav Jain 
>> ---
>> Changelog:
>>
>> v2:
>> * Removed a redundant dev_info() from pap_scm_nvdimm_init() [ Aneesh ]
>> * Marked papr_nd_attribute_visible() as static which also fixed the
>>   build warning reported by kernel build robot
>> ---
>>  arch/powerpc/platforms/pseries/papr_scm.c | 35 ---
>>  1 file changed, 24 insertions(+), 11 deletions(-)
>>
>> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
>> b/arch/powerpc/platforms/pseries/papr_scm.c
>> index e2b69cc3beaf..11e7b90a3360 100644
>> --- a/arch/powerpc/platforms/pseries/papr_scm.c
>> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
>> @@ -907,6 +907,20 @@ static ssize_t flags_show(struct device *dev,
>>  }
>>  DEVICE_ATTR_RO(flags);
>>
>> +static umode_t papr_nd_attribute_visible(struct kobject *kobj,
>> +struct attribute *attr, int n)
>> +{
>> +   struct device *dev = container_of(kobj, typeof(*dev), kobj);
>
> This can use the kobj_to_dev() helper.
>
Fixed in v3
>> +   struct nvdimm *nvdimm = to_nvdimm(dev);
>> +   struct papr_scm_priv *p = nvdimm_provider_data(nvdimm);
>> +
>> +   /* For if perf-stats not available remove perf_stats sysfs */
>> +   if (attr == _attr_perf_stats.attr && p->stat_buffer_len == 0)
>> +   return 0;
>> +
>> +   return attr->mode;
>> +}
>> +
>>  /* papr_scm specific dimm attributes */
>>  static struct attribute *papr_nd_attributes[] = {
>> _attr_flags.attr,
>> @@ -916,6 +930,7 @@ static struct attribute *papr_nd_attributes[] = {
>>
>>  static struct attribute_group papr_nd_attribute_group = {
>> .name = "papr",
>> +   .is_visible = papr_nd_attribute_visible,
>> .attrs = papr_nd_attributes,
>>  };
>>
>> @@ -931,7 +946,6 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
>> struct nd_region_desc ndr_desc;
>> unsigned long dimm_flags;
>> int target_nid, online_nid;
>> -   ssize_t stat_size;
>>
>> p->bus_desc.ndctl = papr_scm_ndctl;
>> p->bus_desc.module = THIS_MODULE;
>> @@ -1016,16 +1030,6 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv 
>> *p)
>> list_add_tail(>region_list, _nd_regions);
>> mutex_unlock(_ndr_lock);
>>
>> -   /* Try retriving the stat buffer and see if its supported */
>> -   stat_size = drc_pmem_query_stats(p, NULL, 0);
>> -   if (stat_size > 0) {
>> -   p->stat_buffer_len = stat_size;
>> -   dev_dbg(>pdev->dev, "Max perf-stat size %lu-bytes\n",
>> -   p->stat_buffer_len);
>> -   } else {
>> -   dev_info(>pdev->dev, "Dimm performance stats 
>> unavailable\n");
>> -   }
>> -
>> return 0;
>>
>>  err:   nvdimm_bus_unregister(p->bus);
>> @@ -1102,6 +1106,7 @@ static int papr_scm_probe(struct platform_device *pdev)
>> u64 blocks, block_size;
>> struct papr_scm_priv *p;
>> const char *uuid_str;
>

[PATCH v3] powerpc/papr_scm: Make 'perf_stats' invisible if perf-stats unavailable

2021-05-13 Thread Vaibhav Jain
In case performance stats for an nvdimm are not available, reading the
'perf_stats' sysfs file returns an -ENOENT error. A better approach is
to make the 'perf_stats' file entirely invisible to indicate that
performance stats for an nvdimm are unavailable.

So this patch updates 'papr_nd_attribute_group' to add a 'is_visible'
callback implemented as newly introduced 'papr_nd_attribute_visible()'
that returns an appropriate mode in case performance stats aren't
supported in a given nvdimm.

Also the initialization of 'papr_scm_priv.stat_buffer_len' is moved
from papr_scm_nvdimm_init() to papr_scm_probe() so that it value is
available when 'papr_nd_attribute_visible()' is called during nvdimm
initialization.

Even though 'perf_stats' attribute is available since v5.9, there are
no known user-space tools/scripts that are dependent on presence of its
sysfs file. Hence I dont expect any user-space breakage with this
patch.

Fixes: 2d02bf835e57('powerpc/papr_scm: Fetch nvdimm performance stats from 
PHYP')
Reviewed-by: Dan Williams 
Signed-off-by: Vaibhav Jain 
---
Changelog:

v3:
* Minor spell corrections [ Dan Williams ]
* Switched to kobj_to_dev() helper in papr_nd_attribute_visible() [
  Dan Williams ]
* Updated ABI/../sysfs-bus-papr-pmem to indicate that the attribute is
  only available for devices that support performance stats.
* Updated patch description.

v2:
* Removed a redundant dev_info() from pap_scm_nvdimm_init() [ Aneesh ]
* Marked papr_nd_attribute_visible() as static which also fixed the
  build warning reported by kernel build robot
---
 Documentation/ABI/testing/sysfs-bus-papr-pmem |  8 +++--
 arch/powerpc/platforms/pseries/papr_scm.c | 35 +--
 2 files changed, 29 insertions(+), 14 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-bus-papr-pmem 
b/Documentation/ABI/testing/sysfs-bus-papr-pmem
index 8316c33862a0..0aa02bf2bde5 100644
--- a/Documentation/ABI/testing/sysfs-bus-papr-pmem
+++ b/Documentation/ABI/testing/sysfs-bus-papr-pmem
@@ -39,9 +39,11 @@ KernelVersion:   v5.9
 Contact:   linuxppc-dev , 
linux-nvd...@lists.01.org,
 Description:
(RO) Report various performance stats related to papr-scm NVDIMM
-   device.  Each stat is reported on a new line with each line
-   composed of a stat-identifier followed by it value. Below are
-   currently known dimm performance stats which are reported:
+   device. This attribute is only available for NVDIMM devices
+   that support reporting NVDIMM performance stats. Each stat is
+   reported on a new line with each line composed of a
+   stat-identifier followed by it value. Below are currently known
+   dimm performance stats which are reported:
 
* "CtlResCt" : Controller Reset Count
* "CtlResTm" : Controller Reset Elapsed Time
diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index e2b69cc3beaf..e8577e7e49ab 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -907,6 +907,20 @@ static ssize_t flags_show(struct device *dev,
 }
 DEVICE_ATTR_RO(flags);
 
+static umode_t papr_nd_attribute_visible(struct kobject *kobj,
+struct attribute *attr, int n)
+{
+   struct device *dev = kobj_to_dev(kobj);
+   struct nvdimm *nvdimm = to_nvdimm(dev);
+   struct papr_scm_priv *p = nvdimm_provider_data(nvdimm);
+
+   /* For if perf-stats not available remove perf_stats sysfs */
+   if (attr == _attr_perf_stats.attr && p->stat_buffer_len == 0)
+   return 0;
+
+   return attr->mode;
+}
+
 /* papr_scm specific dimm attributes */
 static struct attribute *papr_nd_attributes[] = {
_attr_flags.attr,
@@ -916,6 +930,7 @@ static struct attribute *papr_nd_attributes[] = {
 
 static struct attribute_group papr_nd_attribute_group = {
.name = "papr",
+   .is_visible = papr_nd_attribute_visible,
.attrs = papr_nd_attributes,
 };
 
@@ -931,7 +946,6 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
struct nd_region_desc ndr_desc;
unsigned long dimm_flags;
int target_nid, online_nid;
-   ssize_t stat_size;
 
p->bus_desc.ndctl = papr_scm_ndctl;
p->bus_desc.module = THIS_MODULE;
@@ -1016,16 +1030,6 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
list_add_tail(>region_list, _nd_regions);
mutex_unlock(_ndr_lock);
 
-   /* Try retriving the stat buffer and see if its supported */
-   stat_size = drc_pmem_query_stats(p, NULL, 0);
-   if (stat_size > 0) {
-   p->stat_buffer_len = stat_size;
-   dev_dbg(>pdev->dev, "Max perf-stat size %lu-bytes\n",
-   p->stat_buffer_len);
-   } els

[PATCH v3] powerpc/papr_scm: Reduce error severity if nvdimm stats inaccessible

2021-05-07 Thread Vaibhav Jain
Currently drc_pmem_qeury_stats() generates a dev_err in case
"Enable Performance Information Collection" feature is disabled from
HMC or performance stats are not available for an nvdimm. The error is
of the form below:

papr_scm ibm,persistent-memory:ibm,pmemory@44104001: Failed to query
 performance stats, Err:-10

This error message confuses users as it implies a possible problem
with the nvdimm even though its due to a disabled/unavailable
feature. We fix this by explicitly handling the H_AUTHORITY and
H_UNSUPPORTED errors from the H_SCM_PERFORMANCE_STATS hcall.

In case of H_AUTHORITY error an info message is logged instead of an
error, saying that "Permission denied while accessing performance
stats" and an EPERM error is returned back.

In case of H_UNSUPPORTED error we return a EOPNOTSUPP error back from
drc_pmem_query_stats() indicating that performance stats-query
operation is not supported on this nvdimm.

Fixes: 2d02bf835e57('powerpc/papr_scm: Fetch nvdimm performance stats from 
PHYP')
Signed-off-by: Vaibhav Jain 
---
Changelog

v3:
* Return EOPNOTSUPP error in case of H_UNSUPPORTED [ Ira ]
* Return EPERM in case of H_AUTHORITY [ Ira ]
* Updated patch description

v2:
* Updated the message logged in case of H_AUTHORITY error [ Ira ]
* Switched from dev_warn to dev_info in case of H_AUTHORITY error.
* Instead of -EPERM return -EACCESS for H_AUTHORITY error.
* Added explicit handling of H_UNSUPPORTED error.
---
 arch/powerpc/platforms/pseries/papr_scm.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index ef26fe40efb0..e2b69cc3beaf 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -310,6 +310,13 @@ static ssize_t drc_pmem_query_stats(struct papr_scm_priv 
*p,
dev_err(>pdev->dev,
"Unknown performance stats, Err:0x%016lX\n", ret[0]);
return -ENOENT;
+   } else if (rc == H_AUTHORITY) {
+   dev_info(>pdev->dev,
+"Permission denied while accessing performance stats");
+   return -EPERM;
+   } else if (rc == H_UNSUPPORTED) {
+   dev_dbg(>pdev->dev, "Performance stats unsupported\n");
+   return -EOPNOTSUPP;
} else if (rc != H_SUCCESS) {
dev_err(>pdev->dev,
"Failed to query performance stats, Err:%lld\n", rc);
-- 
2.31.1



Re: [PATCH v2] powerpc/papr_scm: Reduce error severity if nvdimm stats inaccessible

2021-05-07 Thread Vaibhav Jain
Hi Ira,

Thanks for looking into this patch

Ira Weiny  writes:

> On Thu, May 06, 2021 at 12:46:06AM +0530, Vaibhav Jain wrote:
>> Currently drc_pmem_qeury_stats() generates a dev_err in case
>> "Enable Performance Information Collection" feature is disabled from
>> HMC or performance stats are not available for an nvdimm. The error is
>> of the form below:
>> 
>> papr_scm ibm,persistent-memory:ibm,pmemory@44104001: Failed to query
>>   performance stats, Err:-10
>> 
>> This error message confuses users as it implies a possible problem
>> with the nvdimm even though its due to a disabled/unavailable
>> feature. We fix this by explicitly handling the H_AUTHORITY and
>> H_UNSUPPORTED errors from the H_SCM_PERFORMANCE_STATS hcall.
>> 
>> In case of H_AUTHORITY error an info message is logged instead of an
>> error, saying that "Permission denied while accessing performance
>> stats". Also '-EACCES' error is return instead of -EPERM.
>
> I thought you clarified before that this was a permission issue.  So why 
> change
> the error to EACCES?
>
EACCESS("Permission Denied") felt like a more accurate error code for
this case than EPERM("Operation not permitted"). So switched the usage
of EPERM error code to handle the case if this hcall is not supported
for an nvdimm.

>> 
>> In case of H_UNSUPPORTED error we return a -EPERM error back from
>> drc_pmem_query_stats() indicating that performance stats-query
>> operation is not supported on this nvdimm.
>
> EPERM seems wrong here too...  ENOTSUP?
Yes, will change it to EOPNOTSUPP in v3.

>
> Ira
> ___
> Linux-nvdimm mailing list -- linux-nvd...@lists.01.org
> To unsubscribe send an email to linux-nvdimm-le...@lists.01.org

-- 
Cheers
~ Vaibhav


Re: [PATCH] powerpc/papr_scm: Make 'perf_stats' invisible if perf-stats unavailable

2021-05-07 Thread Vaibhav Jain
"Aneesh Kumar K.V"  writes:

> Vaibhav Jain  writes:
>
>> In case performance stats for an nvdimm are not available, reading the
>> 'perf_stats' sysfs file returns an -ENOENT error. A better approach is
>> to make the 'perf_stats' file entirely invisible to indicate that
>> performance stats for an nvdimm are unavailable.
>>
>> So this patch updates 'papr_nd_attribute_group' to add a 'is_visible'
>> callback implemented as newly introduced 'papr_nd_attribute_visible()'
>> that returns an appropriate mode in case performance stats aren't
>> supported in a given nvdimm.
>>
>> Also the initialization of 'papr_scm_priv.stat_buffer_len' is moved
>> from papr_scm_nvdimm_init() to papr_scm_probe() so that it value is
>> available when 'papr_nd_attribute_visible()' is called during nvdimm
>> initialization.
>>
>> Fixes: 2d02bf835e57('powerpc/papr_scm: Fetch nvdimm performance stats from 
>> PHYP')
>> Signed-off-by: Vaibhav Jain 
>> ---
>>  arch/powerpc/platforms/pseries/papr_scm.c | 37 ---
>>  1 file changed, 26 insertions(+), 11 deletions(-)
>>
>> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
>> b/arch/powerpc/platforms/pseries/papr_scm.c
>> index 12f1513f0fca..90f0af8fefe8 100644
>> --- a/arch/powerpc/platforms/pseries/papr_scm.c
>> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
>> @@ -907,6 +907,20 @@ static ssize_t flags_show(struct device *dev,
>>  }
>>  DEVICE_ATTR_RO(flags);
>>  
>> +umode_t papr_nd_attribute_visible(struct kobject *kobj, struct attribute 
>> *attr,
>> +  int n)
>> +{
>> +struct device *dev = container_of(kobj, typeof(*dev), kobj);
>> +struct nvdimm *nvdimm = to_nvdimm(dev);
>> +struct papr_scm_priv *p = nvdimm_provider_data(nvdimm);
>> +
>> +/* For if perf-stats not available remove perf_stats sysfs */
>> +if (attr == _attr_perf_stats.attr && p->stat_buffer_len == 0)
>> +return 0;
>> +
>> +return attr->mode;
>> +}
>> +
>>  /* papr_scm specific dimm attributes */
>>  static struct attribute *papr_nd_attributes[] = {
>>  _attr_flags.attr,
>> @@ -916,6 +930,7 @@ static struct attribute *papr_nd_attributes[] = {
>>  
>>  static struct attribute_group papr_nd_attribute_group = {
>>  .name = "papr",
>> +.is_visible = papr_nd_attribute_visible,
>>  .attrs = papr_nd_attributes,
>>  };
>>  
>> @@ -931,7 +946,6 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
>>  struct nd_region_desc ndr_desc;
>>  unsigned long dimm_flags;
>>  int target_nid, online_nid;
>> -ssize_t stat_size;
>>  
>>  p->bus_desc.ndctl = papr_scm_ndctl;
>>  p->bus_desc.module = THIS_MODULE;
>> @@ -1016,16 +1030,6 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv 
>> *p)
>>  list_add_tail(>region_list, _nd_regions);
>>  mutex_unlock(_ndr_lock);
>>  
>> -/* Try retriving the stat buffer and see if its supported */
>> -stat_size = drc_pmem_query_stats(p, NULL, 0);
>> -if (stat_size > 0) {
>> -p->stat_buffer_len = stat_size;
>> -dev_dbg(>pdev->dev, "Max perf-stat size %lu-bytes\n",
>> -p->stat_buffer_len);
>> -} else {
>> -dev_info(>pdev->dev, "Dimm performance stats unavailable\n");
>> -}
>> -
>>  return 0;
>>  
>>  err:nvdimm_bus_unregister(p->bus);
>> @@ -1102,6 +1106,7 @@ static int papr_scm_probe(struct platform_device *pdev)
>>  u64 blocks, block_size;
>>  struct papr_scm_priv *p;
>>  const char *uuid_str;
>> +ssize_t stat_size;
>>  u64 uuid[2];
>>  int rc;
>>  
>> @@ -1179,6 +1184,16 @@ static int papr_scm_probe(struct platform_device 
>> *pdev)
>>  p->res.name  = pdev->name;
>>  p->res.flags = IORESOURCE_MEM;
>>  
>> +/* Try retriving the stat buffer and see if its supported */
>> +stat_size = drc_pmem_query_stats(p, NULL, 0);
>> +if (stat_size > 0) {
>> +p->stat_buffer_len = stat_size;
>> +dev_dbg(>pdev->dev, "Max perf-stat size %lu-bytes\n",
>> +p->stat_buffer_len);
>> +} else {
>> +dev_info(>pdev->dev, "Dimm performance stats unavailable\n");
>> +}
>
> With this patch 
> https://lore.kernel.org/linuxppc-dev/20210505191606.51666-1-vaib...@linux.ibm.com
> We are adding details of whyy performance stat query hcall failed. Do we
> need to print again here?  Are we being more verbose here?
>
Yes agree this looks more verbose with the other patch you mentioned. I
have sent out a v2 of this patch with this dev_info removed.


> -aneesh
> ___
> Linux-nvdimm mailing list -- linux-nvd...@lists.01.org
> To unsubscribe send an email to linux-nvdimm-le...@lists.01.org

-- 
Cheers
~ Vaibhav


[PATCH v2] powerpc/papr_scm: Make 'perf_stats' invisible if perf-stats unavailable

2021-05-07 Thread Vaibhav Jain
In case performance stats for an nvdimm are not available, reading the
'perf_stats' sysfs file returns an -ENOENT error. A better approach is
to make the 'perf_stats' file entirely invisible to indicate that
performance stats for an nvdimm are unavailable.

So this patch updates 'papr_nd_attribute_group' to add a 'is_visible'
callback implemented as newly introduced 'papr_nd_attribute_visible()'
that returns an appropriate mode in case performance stats aren't
supported in a given nvdimm.

Also the initialization of 'papr_scm_priv.stat_buffer_len' is moved
from papr_scm_nvdimm_init() to papr_scm_probe() so that it value is
available when 'papr_nd_attribute_visible()' is called during nvdimm
initialization.

Fixes: 2d02bf835e57('powerpc/papr_scm: Fetch nvdimm performance stats from 
PHYP')
Signed-off-by: Vaibhav Jain 
---
Changelog:

v2:
* Removed a redundant dev_info() from pap_scm_nvdimm_init() [ Aneesh ]
* Marked papr_nd_attribute_visible() as static which also fixed the
  build warning reported by kernel build robot
---
 arch/powerpc/platforms/pseries/papr_scm.c | 35 ---
 1 file changed, 24 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index e2b69cc3beaf..11e7b90a3360 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -907,6 +907,20 @@ static ssize_t flags_show(struct device *dev,
 }
 DEVICE_ATTR_RO(flags);
 
+static umode_t papr_nd_attribute_visible(struct kobject *kobj,
+struct attribute *attr, int n)
+{
+   struct device *dev = container_of(kobj, typeof(*dev), kobj);
+   struct nvdimm *nvdimm = to_nvdimm(dev);
+   struct papr_scm_priv *p = nvdimm_provider_data(nvdimm);
+
+   /* For if perf-stats not available remove perf_stats sysfs */
+   if (attr == _attr_perf_stats.attr && p->stat_buffer_len == 0)
+   return 0;
+
+   return attr->mode;
+}
+
 /* papr_scm specific dimm attributes */
 static struct attribute *papr_nd_attributes[] = {
_attr_flags.attr,
@@ -916,6 +930,7 @@ static struct attribute *papr_nd_attributes[] = {
 
 static struct attribute_group papr_nd_attribute_group = {
.name = "papr",
+   .is_visible = papr_nd_attribute_visible,
.attrs = papr_nd_attributes,
 };
 
@@ -931,7 +946,6 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
struct nd_region_desc ndr_desc;
unsigned long dimm_flags;
int target_nid, online_nid;
-   ssize_t stat_size;
 
p->bus_desc.ndctl = papr_scm_ndctl;
p->bus_desc.module = THIS_MODULE;
@@ -1016,16 +1030,6 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
list_add_tail(>region_list, _nd_regions);
mutex_unlock(_ndr_lock);
 
-   /* Try retriving the stat buffer and see if its supported */
-   stat_size = drc_pmem_query_stats(p, NULL, 0);
-   if (stat_size > 0) {
-   p->stat_buffer_len = stat_size;
-   dev_dbg(>pdev->dev, "Max perf-stat size %lu-bytes\n",
-   p->stat_buffer_len);
-   } else {
-   dev_info(>pdev->dev, "Dimm performance stats unavailable\n");
-   }
-
return 0;
 
 err:   nvdimm_bus_unregister(p->bus);
@@ -1102,6 +1106,7 @@ static int papr_scm_probe(struct platform_device *pdev)
u64 blocks, block_size;
struct papr_scm_priv *p;
const char *uuid_str;
+   ssize_t stat_size;
u64 uuid[2];
int rc;
 
@@ -1179,6 +1184,14 @@ static int papr_scm_probe(struct platform_device *pdev)
p->res.name  = pdev->name;
p->res.flags = IORESOURCE_MEM;
 
+   /* Try retriving the stat buffer and see if its supported */
+   stat_size = drc_pmem_query_stats(p, NULL, 0);
+   if (stat_size > 0) {
+   p->stat_buffer_len = stat_size;
+   dev_dbg(>pdev->dev, "Max perf-stat size %lu-bytes\n",
+   p->stat_buffer_len);
+   }
+
rc = papr_scm_nvdimm_init(p);
if (rc)
goto err2;
-- 
2.31.1



[PATCH] powerpc/papr_scm: Make 'perf_stats' invisible if perf-stats unavailable

2021-05-05 Thread Vaibhav Jain
In case performance stats for an nvdimm are not available, reading the
'perf_stats' sysfs file returns an -ENOENT error. A better approach is
to make the 'perf_stats' file entirely invisible to indicate that
performance stats for an nvdimm are unavailable.

So this patch updates 'papr_nd_attribute_group' to add a 'is_visible'
callback implemented as newly introduced 'papr_nd_attribute_visible()'
that returns an appropriate mode in case performance stats aren't
supported in a given nvdimm.

Also the initialization of 'papr_scm_priv.stat_buffer_len' is moved
from papr_scm_nvdimm_init() to papr_scm_probe() so that it value is
available when 'papr_nd_attribute_visible()' is called during nvdimm
initialization.

Fixes: 2d02bf835e57('powerpc/papr_scm: Fetch nvdimm performance stats from 
PHYP')
Signed-off-by: Vaibhav Jain 
---
 arch/powerpc/platforms/pseries/papr_scm.c | 37 ---
 1 file changed, 26 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index 12f1513f0fca..90f0af8fefe8 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -907,6 +907,20 @@ static ssize_t flags_show(struct device *dev,
 }
 DEVICE_ATTR_RO(flags);
 
+umode_t papr_nd_attribute_visible(struct kobject *kobj, struct attribute *attr,
+ int n)
+{
+   struct device *dev = container_of(kobj, typeof(*dev), kobj);
+   struct nvdimm *nvdimm = to_nvdimm(dev);
+   struct papr_scm_priv *p = nvdimm_provider_data(nvdimm);
+
+   /* For if perf-stats not available remove perf_stats sysfs */
+   if (attr == _attr_perf_stats.attr && p->stat_buffer_len == 0)
+   return 0;
+
+   return attr->mode;
+}
+
 /* papr_scm specific dimm attributes */
 static struct attribute *papr_nd_attributes[] = {
_attr_flags.attr,
@@ -916,6 +930,7 @@ static struct attribute *papr_nd_attributes[] = {
 
 static struct attribute_group papr_nd_attribute_group = {
.name = "papr",
+   .is_visible = papr_nd_attribute_visible,
.attrs = papr_nd_attributes,
 };
 
@@ -931,7 +946,6 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
struct nd_region_desc ndr_desc;
unsigned long dimm_flags;
int target_nid, online_nid;
-   ssize_t stat_size;
 
p->bus_desc.ndctl = papr_scm_ndctl;
p->bus_desc.module = THIS_MODULE;
@@ -1016,16 +1030,6 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
list_add_tail(>region_list, _nd_regions);
mutex_unlock(_ndr_lock);
 
-   /* Try retriving the stat buffer and see if its supported */
-   stat_size = drc_pmem_query_stats(p, NULL, 0);
-   if (stat_size > 0) {
-   p->stat_buffer_len = stat_size;
-   dev_dbg(>pdev->dev, "Max perf-stat size %lu-bytes\n",
-   p->stat_buffer_len);
-   } else {
-   dev_info(>pdev->dev, "Dimm performance stats unavailable\n");
-   }
-
return 0;
 
 err:   nvdimm_bus_unregister(p->bus);
@@ -1102,6 +1106,7 @@ static int papr_scm_probe(struct platform_device *pdev)
u64 blocks, block_size;
struct papr_scm_priv *p;
const char *uuid_str;
+   ssize_t stat_size;
u64 uuid[2];
int rc;
 
@@ -1179,6 +1184,16 @@ static int papr_scm_probe(struct platform_device *pdev)
p->res.name  = pdev->name;
p->res.flags = IORESOURCE_MEM;
 
+   /* Try retriving the stat buffer and see if its supported */
+   stat_size = drc_pmem_query_stats(p, NULL, 0);
+   if (stat_size > 0) {
+   p->stat_buffer_len = stat_size;
+   dev_dbg(>pdev->dev, "Max perf-stat size %lu-bytes\n",
+   p->stat_buffer_len);
+   } else {
+   dev_info(>pdev->dev, "Dimm performance stats unavailable\n");
+   }
+
rc = papr_scm_nvdimm_init(p);
if (rc)
goto err2;
-- 
2.31.1



[PATCH v2] powerpc/papr_scm: Reduce error severity if nvdimm stats inaccessible

2021-05-05 Thread Vaibhav Jain
Currently drc_pmem_qeury_stats() generates a dev_err in case
"Enable Performance Information Collection" feature is disabled from
HMC or performance stats are not available for an nvdimm. The error is
of the form below:

papr_scm ibm,persistent-memory:ibm,pmemory@44104001: Failed to query
 performance stats, Err:-10

This error message confuses users as it implies a possible problem
with the nvdimm even though its due to a disabled/unavailable
feature. We fix this by explicitly handling the H_AUTHORITY and
H_UNSUPPORTED errors from the H_SCM_PERFORMANCE_STATS hcall.

In case of H_AUTHORITY error an info message is logged instead of an
error, saying that "Permission denied while accessing performance
stats". Also '-EACCES' error is return instead of -EPERM.

In case of H_UNSUPPORTED error we return a -EPERM error back from
drc_pmem_query_stats() indicating that performance stats-query
operation is not supported on this nvdimm.

Fixes: 2d02bf835e57('powerpc/papr_scm: Fetch nvdimm performance stats from 
PHYP')
Signed-off-by: Vaibhav Jain 
---
Changelog

v2:
* Updated the message logged in case of H_AUTHORITY error [ Ira ]
* Switched from dev_warn to dev_info in case of H_AUTHORITY error.
* Instead of -EPERM return -EACCESS for H_AUTHORITY error.
* Added explicit handling of H_UNSUPPORTED error.
---
 arch/powerpc/platforms/pseries/papr_scm.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index ef26fe40efb0..12f1513f0fca 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -310,6 +310,13 @@ static ssize_t drc_pmem_query_stats(struct papr_scm_priv 
*p,
dev_err(>pdev->dev,
"Unknown performance stats, Err:0x%016lX\n", ret[0]);
return -ENOENT;
+   } else if (rc == H_AUTHORITY) {
+   dev_info(>pdev->dev,
+"Permission denied while accessing performance stats");
+   return -EACCES;
+   } else if (rc == H_UNSUPPORTED) {
+   dev_dbg(>pdev->dev, "Performance stats unsupported\n");
+   return -EPERM;
} else if (rc != H_SUCCESS) {
dev_err(>pdev->dev,
"Failed to query performance stats, Err:%lld\n", rc);
-- 
2.31.1



Re: [PATCH v1 1/1] powerpc/papr_scm: Properly handle UUID types and API

2021-04-16 Thread Vaibhav Jain
Andy Shevchenko  writes:

> On Thu, Apr 15, 2021 at 8:10 PM Vaibhav Jain  wrote:
>>
>>
>> Thanks for the patch Andy,
>>
>> Unfortunately ran into a compilation issue due to missing "#include
>> " that provides definition for
>> get_unaligned_le64(). Gcc reported following error:
>>
>> error: implicit declaration of function ‘get_unaligned_le64’
>
> Right, I have not tested it (as mentioned in the comments to the patch)
>
>> After including the necessary header file, kernel compiled fine and I
>> was able to test & verify the patch.
>
> Thank you very much for the testing.
>
> I'm not sure what the coverage of your test is.

Your patch updates the way the interleaved set-cookies are populated in
papr_scm which are then used to populate label entry for a namespace. I
verified that the reported region setcookie hasnt changed for an nvdimm
region before and after applying your patch for both BE and LE variants:

# 64-bit Little endian kernel before applying the patch
$ sudo cat /sys/devices/ndbus0/region0/set_cookie
0x8b6b26cbc930e2b5

# 64-bit Little endian kernel after applying your patch
$ sudo cat /sys/devices/ndbus0/region0/set_cookie
0x8b6b26cbc930e2b5

# 64-bit Big endian kernel before applying your patch
$ sudo cat /sys/devices/ndbus0/region0/set_cookie
0x8b6b26cbc930e2b5

# 64-bit Big endian kernel after applying your patch
$ sudo cat /sys/devices/ndbus0/region0/set_cookie
0x8b6b26cbc930e2b5

> That's why I have an
> additional question below. Is the byte ordering kept the same in BE
> (32- and 64-bit) cases? Because I'm worrying that I might have missed
> something.
Libnvdimm store these cookies in label area as little endian values and
based on the results above I think we are good.
>
>
> -- 
> With Best Regards,
> Andy Shevchenko

-- 
Cheers
~ Vaibhav


Re: [PATCH v1 1/1] powerpc/papr_scm: Properly handle UUID types and API

2021-04-15 Thread Vaibhav Jain


Thanks for the patch Andy,

Unfortunately ran into a compilation issue due to missing "#include
" that provides definition for
get_unaligned_le64(). Gcc reported following error:
 
error: implicit declaration of function ‘get_unaligned_le64’

After including the necessary header file, kernel compiled fine and I
was able to test & verify the patch.

-- 
Cheers
~ Vaibhav

Andy Shevchenko  writes:

> Parse to and export from UUID own type, before dereferencing.
> This also fixes wrong comment (Little Endian UUID is something else)
> and should fix Sparse warnings about assigning strict types to POD.
>
> Fixes: 43001c52b603 ("powerpc/papr_scm: Use ibm,unit-guid as the iset cookie")
> Fixes: 259a948c4ba1 ("powerpc/pseries/scm: Use a specific endian format for 
> storing uuid from the device tree")
> Cc: Oliver O'Halloran 
> Cc: Aneesh Kumar K.V 
> Signed-off-by: Andy Shevchenko 
> ---
> Not tested
>  arch/powerpc/platforms/pseries/papr_scm.c | 13 -
>  1 file changed, 8 insertions(+), 5 deletions(-)
>
> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
> b/arch/powerpc/platforms/pseries/papr_scm.c
> index ae6f5d80d5ce..4366e1902890 100644
> --- a/arch/powerpc/platforms/pseries/papr_scm.c
> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> @@ -1085,8 +1085,9 @@ static int papr_scm_probe(struct platform_device *pdev)
>   u32 drc_index, metadata_size;
>   u64 blocks, block_size;
>   struct papr_scm_priv *p;
> + u8 uuid_raw[UUID_SIZE];
>   const char *uuid_str;
> - u64 uuid[2];
> + uuid_t uuid;
>   int rc;
>  
>   /* check we have all the required DT properties */
> @@ -1129,16 +1130,18 @@ static int papr_scm_probe(struct platform_device 
> *pdev)
>   p->hcall_flush_required = of_property_read_bool(dn, 
> "ibm,hcall-flush-required");
>  
>   /* We just need to ensure that set cookies are unique across */
> - uuid_parse(uuid_str, (uuid_t *) uuid);
> + uuid_parse(uuid_str, );
> +
>   /*
>* cookie1 and cookie2 are not really little endian
> -  * we store a little endian representation of the
> +  * we store a raw buffer representation of the
>* uuid str so that we can compare this with the label
>* area cookie irrespective of the endian config with which
>* the kernel is built.
>*/
> - p->nd_set.cookie1 = cpu_to_le64(uuid[0]);
> - p->nd_set.cookie2 = cpu_to_le64(uuid[1]);
> + export_uuid(uuid_raw, );
> + p->nd_set.cookie1 = get_unaligned_le64(_raw[0]);
> + p->nd_set.cookie2 = get_unaligned_le64(_raw[8]);
>  
>   /* might be zero */
>   p->metadata_size = metadata_size;
> -- 
> 2.30.2
>



Re: [PATCH] powerpc/papr_scm: Reduce error severity if nvdimm stats inaccessible

2021-04-15 Thread Vaibhav Jain
Ira Weiny  writes:

> On Wed, Apr 14, 2021 at 09:51:40PM +0530, Vaibhav Jain wrote:
>> Thanks for looking into this patch Ira,
>> 
>> Ira Weiny  writes:
>> 
>> > On Wed, Apr 14, 2021 at 06:10:26PM +0530, Vaibhav Jain wrote:
>> >> Currently drc_pmem_qeury_stats() generates a dev_err in case
>> >> "Enable Performance Information Collection" feature is disabled from
>> >> HMC. The error is of the form below:
>> >> 
>> >> papr_scm ibm,persistent-memory:ibm,pmemory@44104001: Failed to query
>> >>performance stats, Err:-10
>> >> 
>> >> This error message confuses users as it implies a possible problem
>> >> with the nvdimm even though its due to a disabled feature.
>> >> 
>> >> So we fix this by explicitly handling the H_AUTHORITY error from the
>> >> H_SCM_PERFORMANCE_STATS hcall and generating a warning instead of an
>> >> error, saying that "Performance stats in-accessible".
>> >> 
>> >> Fixes: 2d02bf835e57('powerpc/papr_scm: Fetch nvdimm performance stats 
>> >> from PHYP')
>> >> Signed-off-by: Vaibhav Jain 
>> >> ---
>> >>  arch/powerpc/platforms/pseries/papr_scm.c | 3 +++
>> >>  1 file changed, 3 insertions(+)
>> >> 
>> >> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
>> >> b/arch/powerpc/platforms/pseries/papr_scm.c
>> >> index 835163f54244..9216424f8be3 100644
>> >> --- a/arch/powerpc/platforms/pseries/papr_scm.c
>> >> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
>> >> @@ -277,6 +277,9 @@ static ssize_t drc_pmem_query_stats(struct 
>> >> papr_scm_priv *p,
>> >>   dev_err(>pdev->dev,
>> >>   "Unknown performance stats, Err:0x%016lX\n", ret[0]);
>> >>   return -ENOENT;
>> >> + } else if (rc == H_AUTHORITY) {
>> >> + dev_warn(>pdev->dev, "Performance stats in-accessible");
>> >> + return -EPERM;
>> >
>> > Is this because of a disabled feature or because of permissions?
>> 
>> Its because of a disabled feature that revokes permission for a guest to
>> retrieve performance statistics.
>> 
>> The feature is called "Enable Performance Information Collection" and
>> once disabled the hcall H_SCM_PERFORMANCE_STATS returns an error
>> H_AUTHORITY indicating that the guest doesn't have permission to retrieve
>> performance statistics.
>
> In that case would it be appropriate to have the error message indicate a
> permission issue?
>
> Something like 'permission denied'?

Yes, Something like "Permission denied while accessing performance
stats" might be more clear and intuitive.

Will update the warn message in v2.

>
> Ira
>

-- 
Cheers
~ Vaibhav


Re: [PATCH] powerpc/papr_scm: Reduce error severity if nvdimm stats inaccessible

2021-04-15 Thread Vaibhav Jain
Thanks for looking into this Dan,

Dan Williams  writes:

> On Wed, Apr 14, 2021 at 5:40 AM Vaibhav Jain  wrote:
>>
>> Currently drc_pmem_qeury_stats() generates a dev_err in case
>> "Enable Performance Information Collection" feature is disabled from
>> HMC. The error is of the form below:
>>
>> papr_scm ibm,persistent-memory:ibm,pmemory@44104001: Failed to query
>>  performance stats, Err:-10
>>
>> This error message confuses users as it implies a possible problem
>> with the nvdimm even though its due to a disabled feature.
>>
>> So we fix this by explicitly handling the H_AUTHORITY error from the
>> H_SCM_PERFORMANCE_STATS hcall and generating a warning instead of an
>> error, saying that "Performance stats in-accessible".
>>
>> Fixes: 2d02bf835e57('powerpc/papr_scm: Fetch nvdimm performance stats from 
>> PHYP')
>> Signed-off-by: Vaibhav Jain 
>> ---
>>  arch/powerpc/platforms/pseries/papr_scm.c | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
>> b/arch/powerpc/platforms/pseries/papr_scm.c
>> index 835163f54244..9216424f8be3 100644
>> --- a/arch/powerpc/platforms/pseries/papr_scm.c
>> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
>> @@ -277,6 +277,9 @@ static ssize_t drc_pmem_query_stats(struct papr_scm_priv 
>> *p,
>> dev_err(>pdev->dev,
>> "Unknown performance stats, Err:0x%016lX\n", ret[0]);
>> return -ENOENT;
>> +   } else if (rc == H_AUTHORITY) {
>> +   dev_warn(>pdev->dev, "Performance stats in-accessible");
>> +   return -EPERM;
>
> So userspace can spam the kernel log? Why is kernel log message needed
> at all? EPERM told the caller what happened.
Currently this error message is only reported during probe of the
nvdimm. So userspace cannot directly spam kernel log.

The callsite for this function in papr_scm_nvdimm_init() doesnt handle
specific error codes. Instead in case of an error it only reports that
"Dimm performance stats are unavailable". The log message just
preceeding that mentions the real cause of failure. Thats why just
returning -EPERM wont be usefui.

Alternatively I can update papr_scm_nvdimm_init() to report the error
code returned from drc_pmem_query_stats().

-- 
Cheers
~ Vaibhav


Re: [PATCH] powerpc/papr_scm: Reduce error severity if nvdimm stats inaccessible

2021-04-14 Thread Vaibhav Jain
Thanks for looking into this patch Ira,

Ira Weiny  writes:

> On Wed, Apr 14, 2021 at 06:10:26PM +0530, Vaibhav Jain wrote:
>> Currently drc_pmem_qeury_stats() generates a dev_err in case
>> "Enable Performance Information Collection" feature is disabled from
>> HMC. The error is of the form below:
>> 
>> papr_scm ibm,persistent-memory:ibm,pmemory@44104001: Failed to query
>>   performance stats, Err:-10
>> 
>> This error message confuses users as it implies a possible problem
>> with the nvdimm even though its due to a disabled feature.
>> 
>> So we fix this by explicitly handling the H_AUTHORITY error from the
>> H_SCM_PERFORMANCE_STATS hcall and generating a warning instead of an
>> error, saying that "Performance stats in-accessible".
>> 
>> Fixes: 2d02bf835e57('powerpc/papr_scm: Fetch nvdimm performance stats from 
>> PHYP')
>> Signed-off-by: Vaibhav Jain 
>> ---
>>  arch/powerpc/platforms/pseries/papr_scm.c | 3 +++
>>  1 file changed, 3 insertions(+)
>> 
>> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
>> b/arch/powerpc/platforms/pseries/papr_scm.c
>> index 835163f54244..9216424f8be3 100644
>> --- a/arch/powerpc/platforms/pseries/papr_scm.c
>> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
>> @@ -277,6 +277,9 @@ static ssize_t drc_pmem_query_stats(struct papr_scm_priv 
>> *p,
>>  dev_err(>pdev->dev,
>>  "Unknown performance stats, Err:0x%016lX\n", ret[0]);
>>  return -ENOENT;
>> +} else if (rc == H_AUTHORITY) {
>> +dev_warn(>pdev->dev, "Performance stats in-accessible");
>> +return -EPERM;
>
> Is this because of a disabled feature or because of permissions?

Its because of a disabled feature that revokes permission for a guest to
retrieve performance statistics.

The feature is called "Enable Performance Information Collection" and
once disabled the hcall H_SCM_PERFORMANCE_STATS returns an error
H_AUTHORITY indicating that the guest doesn't have permission to retrieve
performance statistics.

>
> Ira
>
>>  } else if (rc != H_SUCCESS) {
>>  dev_err(>pdev->dev,
>>  "Failed to query performance stats, Err:%lld\n", rc);
>> -- 
>> 2.30.2
>> 

-- 
Cheers
~ Vaibhav


[PATCH] powerpc/papr_scm: Reduce error severity if nvdimm stats inaccessible

2021-04-14 Thread Vaibhav Jain
Currently drc_pmem_qeury_stats() generates a dev_err in case
"Enable Performance Information Collection" feature is disabled from
HMC. The error is of the form below:

papr_scm ibm,persistent-memory:ibm,pmemory@44104001: Failed to query
 performance stats, Err:-10

This error message confuses users as it implies a possible problem
with the nvdimm even though its due to a disabled feature.

So we fix this by explicitly handling the H_AUTHORITY error from the
H_SCM_PERFORMANCE_STATS hcall and generating a warning instead of an
error, saying that "Performance stats in-accessible".

Fixes: 2d02bf835e57('powerpc/papr_scm: Fetch nvdimm performance stats from 
PHYP')
Signed-off-by: Vaibhav Jain 
---
 arch/powerpc/platforms/pseries/papr_scm.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index 835163f54244..9216424f8be3 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -277,6 +277,9 @@ static ssize_t drc_pmem_query_stats(struct papr_scm_priv *p,
dev_err(>pdev->dev,
"Unknown performance stats, Err:0x%016lX\n", ret[0]);
return -ENOENT;
+   } else if (rc == H_AUTHORITY) {
+   dev_warn(>pdev->dev, "Performance stats in-accessible");
+   return -EPERM;
} else if (rc != H_SUCCESS) {
dev_err(>pdev->dev,
"Failed to query performance stats, Err:%lld\n", rc);
-- 
2.30.2



Re: [PATCH v3] powerpc/papr_scm: Implement support for H_SCM_FLUSH hcall

2021-04-14 Thread Vaibhav Jain
"Aneesh Kumar K.V"  writes:

> Vaibhav Jain  writes:
>
>> Hi Shiva,
>>
>> Apologies for a late review but something just caught my eye while
>> working on a different patch.
>>
>> Shivaprasad G Bhat  writes:
>>
>>> Add support for ND_REGION_ASYNC capability if the device tree
>>> indicates 'ibm,hcall-flush-required' property in the NVDIMM node.
>>> Flush is done by issuing H_SCM_FLUSH hcall to the hypervisor.
>>>
>>> If the flush request failed, the hypervisor is expected to
>>> to reflect the problem in the subsequent nvdimm H_SCM_HEALTH call.
>>>
>>> This patch prevents mmap of namespaces with MAP_SYNC flag if the
>>> nvdimm requires an explicit flush[1].
>>>
>>> References:
>>> [1] 
>>> https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master/memory/ndctl.py.data/map_sync.c
>>>
>>> Signed-off-by: Shivaprasad G Bhat 
>>> ---
>>> v2 - https://www.spinics.net/lists/kvm-ppc/msg18799.html
>>> Changes from v2:
>>>- Fixed the commit message.
>>>- Add dev_dbg before the H_SCM_FLUSH hcall
>>>
>>> v1 - https://www.spinics.net/lists/kvm-ppc/msg18272.html
>>> Changes from v1:
>>>- Hcall semantics finalized, all changes are to accomodate them.
>>>
>>>  Documentation/powerpc/papr_hcalls.rst |   14 ++
>>>  arch/powerpc/include/asm/hvcall.h |3 +-
>>>  arch/powerpc/platforms/pseries/papr_scm.c |   40 
>>> +
>>>  3 files changed, 56 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/Documentation/powerpc/papr_hcalls.rst 
>>> b/Documentation/powerpc/papr_hcalls.rst
>>> index 48fcf1255a33..648f278eea8f 100644
>>> --- a/Documentation/powerpc/papr_hcalls.rst
>>> +++ b/Documentation/powerpc/papr_hcalls.rst
>>> @@ -275,6 +275,20 @@ Health Bitmap Flags:
>>>  Given a DRC Index collect the performance statistics for NVDIMM and copy 
>>> them
>>>  to the resultBuffer.
>>>  
>>> +**H_SCM_FLUSH**
>>> +
>>> +| Input: *drcIndex, continue-token*
>>> +| Out: *continue-token*
>>> +| Return Value: *H_SUCCESS, H_Parameter, H_P2, H_BUSY*
>>> +
>>> +Given a DRC Index Flush the data to backend NVDIMM device.
>>> +
>>> +The hcall returns H_BUSY when the flush takes longer time and the hcall 
>>> needs
>>> +to be issued multiple times in order to be completely serviced. The
>>> +*continue-token* from the output to be passed in the argument list of
>>> +subsequent hcalls to the hypervisor until the hcall is completely serviced
>>> +at which point H_SUCCESS or other error is returned by the hypervisor.
>>> +
>> Does the hcall semantic also include measures to trigger a barrier/fence
>> (like pm_wmb()) so that all the stores before the hcall are gauranteed
>> to have hit the pmem controller ?
>
> It is upto the hypervisor to implement the right sequence to ensure the
> guarantee. The hcall doesn't expect the store to reach the platform
> buffers.

Since the asyc_flush function is also called for performing
deep_flush from libnvdimm and as the hcall doesnt gaurantee stores to
reach the platform buffers, hence papr_scm_pmem_flush() should issue
pm_wmb() before the hcall.

This would ensure papr_scm_pmem_flush() semantics are consistent that to
generic_nvdimm_flush().

Also, adding the statement "The hcall doesn't expect the store to reach
the platform buffers" to the hcall documentation would be good to have.

>
>
>>
>> If not then the papr_scm_pmem_flush() introduced below should issue a
>> fence like pm_wmb() before issuing the flush hcall.
>>
>> If yes then this behaviour should also be documented with the hcall
>> semantics above.
>>
>
> IIUC fdatasync on the backend file is enough to ensure the
> guaraantee. Such a request will have REQ_PREFLUSH and REQ_FUA which will
> do the necessary barrier on the backing device holding the backend file.
>
Right, but thats assuming nvdimm is backed by a regular file in
hypervisor which may not always be the case.


> -aneesh

-- 
Cheers
~ Vaibhav


Re: [PATCH v3] powerpc/papr_scm: Implement support for H_SCM_FLUSH hcall

2021-04-13 Thread Vaibhav Jain
Hi Shiva,

Apologies for a late review but something just caught my eye while
working on a different patch.

Shivaprasad G Bhat  writes:

> Add support for ND_REGION_ASYNC capability if the device tree
> indicates 'ibm,hcall-flush-required' property in the NVDIMM node.
> Flush is done by issuing H_SCM_FLUSH hcall to the hypervisor.
>
> If the flush request failed, the hypervisor is expected to
> to reflect the problem in the subsequent nvdimm H_SCM_HEALTH call.
>
> This patch prevents mmap of namespaces with MAP_SYNC flag if the
> nvdimm requires an explicit flush[1].
>
> References:
> [1] 
> https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master/memory/ndctl.py.data/map_sync.c
>
> Signed-off-by: Shivaprasad G Bhat 
> ---
> v2 - https://www.spinics.net/lists/kvm-ppc/msg18799.html
> Changes from v2:
>- Fixed the commit message.
>- Add dev_dbg before the H_SCM_FLUSH hcall
>
> v1 - https://www.spinics.net/lists/kvm-ppc/msg18272.html
> Changes from v1:
>- Hcall semantics finalized, all changes are to accomodate them.
>
>  Documentation/powerpc/papr_hcalls.rst |   14 ++
>  arch/powerpc/include/asm/hvcall.h |3 +-
>  arch/powerpc/platforms/pseries/papr_scm.c |   40 
> +
>  3 files changed, 56 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/powerpc/papr_hcalls.rst 
> b/Documentation/powerpc/papr_hcalls.rst
> index 48fcf1255a33..648f278eea8f 100644
> --- a/Documentation/powerpc/papr_hcalls.rst
> +++ b/Documentation/powerpc/papr_hcalls.rst
> @@ -275,6 +275,20 @@ Health Bitmap Flags:
>  Given a DRC Index collect the performance statistics for NVDIMM and copy them
>  to the resultBuffer.
>  
> +**H_SCM_FLUSH**
> +
> +| Input: *drcIndex, continue-token*
> +| Out: *continue-token*
> +| Return Value: *H_SUCCESS, H_Parameter, H_P2, H_BUSY*
> +
> +Given a DRC Index Flush the data to backend NVDIMM device.
> +
> +The hcall returns H_BUSY when the flush takes longer time and the hcall needs
> +to be issued multiple times in order to be completely serviced. The
> +*continue-token* from the output to be passed in the argument list of
> +subsequent hcalls to the hypervisor until the hcall is completely serviced
> +at which point H_SUCCESS or other error is returned by the hypervisor.
> +
Does the hcall semantic also include measures to trigger a barrier/fence
(like pm_wmb()) so that all the stores before the hcall are gauranteed
to have hit the pmem controller ?

If not then the papr_scm_pmem_flush() introduced below should issue a
fence like pm_wmb() before issuing the flush hcall.

If yes then this behaviour should also be documented with the hcall
semantics above.

>  References
>  ==
>  .. [1] "Power Architecture Platform Reference"
> diff --git a/arch/powerpc/include/asm/hvcall.h 
> b/arch/powerpc/include/asm/hvcall.h
> index ed6086d57b22..9f7729a97ebd 100644
> --- a/arch/powerpc/include/asm/hvcall.h
> +++ b/arch/powerpc/include/asm/hvcall.h
> @@ -315,7 +315,8 @@
>  #define H_SCM_HEALTH0x400
>  #define H_SCM_PERFORMANCE_STATS 0x418
>  #define H_RPT_INVALIDATE 0x448
> -#define MAX_HCALL_OPCODE H_RPT_INVALIDATE
> +#define H_SCM_FLUSH  0x44C
> +#define MAX_HCALL_OPCODE H_SCM_FLUSH
>  
>  /* Scope args for H_SCM_UNBIND_ALL */
>  #define H_UNBIND_SCOPE_ALL (0x1)
> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
> b/arch/powerpc/platforms/pseries/papr_scm.c
> index 835163f54244..b7a47fcc5aa5 100644
> --- a/arch/powerpc/platforms/pseries/papr_scm.c
> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> @@ -93,6 +93,7 @@ struct papr_scm_priv {
>   uint64_t block_size;
>   int metadata_size;
>   bool is_volatile;
> + bool hcall_flush_required;
>  
>   uint64_t bound_addr;
>  
> @@ -117,6 +118,39 @@ struct papr_scm_priv {
>   size_t stat_buffer_len;
>  };
>  
> +static int papr_scm_pmem_flush(struct nd_region *nd_region,
> +struct bio *bio __maybe_unused)
> +{
> + struct papr_scm_priv *p = nd_region_provider_data(nd_region);
> + unsigned long ret_buf[PLPAR_HCALL_BUFSIZE];
> + uint64_t token = 0;
> + int64_t rc;
> +
> + dev_dbg(>pdev->dev, "flush drc 0x%x", p->drc_index);
> +
> + do {
> + rc = plpar_hcall(H_SCM_FLUSH, ret_buf, p->drc_index, token);
> + token = ret_buf[0];
> +
> + /* Check if we are stalled for some time */
> + if (H_IS_LONG_BUSY(rc)) {
> + msleep(get_longbusy_msecs(rc));
> + rc = H_BUSY;
> + } else if (rc == H_BUSY) {
> + cond_resched();
> + }
> + } while (rc == H_BUSY);
> +
> + if (rc) {
> + dev_err(>pdev->dev, "flush error: %lld", rc);
> + rc = -EIO;
> + } else {
> + dev_dbg(>pdev->dev, "flush drc 0x%x complete", p->drc_index);
> + }
> +
> + return rc;
> +}
> +
>  static LIST_HEAD(papr_nd_regions);
>  

Re: [PATCH] powerpc/mm: Add cond_resched() while removing hpte mappings

2021-04-05 Thread Vaibhav Jain
Hi Mpe,

Thanks for looking into this patch.

Michael Ellerman  writes:

> Vaibhav Jain  writes:
>> While removing large number of mappings from hash page tables for
>> large memory systems as soft-lockup is reported because of the time
>> spent inside htap_remove_mapping() like one below:
>>
>>  watchdog: BUG: soft lockup - CPU#8 stuck for 23s!
>>  
>>  NIP plpar_hcall+0x38/0x58
>>  LR  pSeries_lpar_hpte_invalidate+0x68/0xb0
>>  Call Trace:
>>   0x1fff000 (unreliable)
>>   pSeries_lpar_hpte_removebolted+0x9c/0x230
>>   hash__remove_section_mapping+0xec/0x1c0
>>   remove_section_mapping+0x28/0x3c
>>   arch_remove_memory+0xfc/0x150
>>   devm_memremap_pages_release+0x180/0x2f0
>>   devm_action_release+0x30/0x50
>>   release_nodes+0x28c/0x300
>>   device_release_driver_internal+0x16c/0x280
>>   unbind_store+0x124/0x170
>>   drv_attr_store+0x44/0x60
>>   sysfs_kf_write+0x64/0x90
>>   kernfs_fop_write+0x1b0/0x290
>>   __vfs_write+0x3c/0x70
>>   vfs_write+0xd4/0x270
>>   ksys_write+0xdc/0x130
>>   system_call+0x5c/0x70
>>
>> Fix this by adding a cond_resched() to the loop in
>> htap_remove_mapping() that issues hcall to remove hpte mapping. This
>> should prevent the soft-lockup from being reported.
>
> Can/should we also/instead be using H_BLOCK_REMOVE?
>
> cheers

Current mmp_ops implementation seems to use H_BULK_REMOVE for hugepages
so for removing mappings for regular pages I am looking into adding a
new mmu_op that can take a range to be unmapped and 
I did try implmenting a new mmu_op for this which can reduce the number
of hash_pte lookups needed to invalidate this range. But that would need
some more work so as a stop gap I have sent out a v2 with Christophe's
suggestion to add a cond_resched() every HZ interval.

-- 
Cheers
~ Vaibhav


[PATCH v2] powerpc/mm: Add cond_resched() while removing hpte mappings

2021-04-04 Thread Vaibhav Jain
While removing large number of mappings from hash page tables for
large memory systems as soft-lockup is reported because of the time
spent inside htap_remove_mapping() like one below:

 watchdog: BUG: soft lockup - CPU#8 stuck for 23s!
 
 NIP plpar_hcall+0x38/0x58
 LR  pSeries_lpar_hpte_invalidate+0x68/0xb0
 Call Trace:
  0x1fff000 (unreliable)
  pSeries_lpar_hpte_removebolted+0x9c/0x230
  hash__remove_section_mapping+0xec/0x1c0
  remove_section_mapping+0x28/0x3c
  arch_remove_memory+0xfc/0x150
  devm_memremap_pages_release+0x180/0x2f0
  devm_action_release+0x30/0x50
  release_nodes+0x28c/0x300
  device_release_driver_internal+0x16c/0x280
  unbind_store+0x124/0x170
  drv_attr_store+0x44/0x60
  sysfs_kf_write+0x64/0x90
  kernfs_fop_write+0x1b0/0x290
  __vfs_write+0x3c/0x70
  vfs_write+0xd4/0x270
  ksys_write+0xdc/0x130
  system_call+0x5c/0x70

Fix this by adding a cond_resched() to the loop in
htap_remove_mapping() that issues hcall to remove hpte mapping. The
call to cond_resched() is issued every HZ jiffies which should prevent
the soft-lockup from being reported.

Suggested-by: Aneesh Kumar K.V 
Signed-off-by: Vaibhav Jain 

---
Changelog:

v2: Issue cond_resched() every HZ jiffies instead of each iteration of
the loop. [ Christophe Leroy ]
---
 arch/powerpc/mm/book3s64/hash_utils.c | 13 -
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/book3s64/hash_utils.c 
b/arch/powerpc/mm/book3s64/hash_utils.c
index 581b20a2feaf..286e7e8cb919 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -338,7 +338,7 @@ int htab_bolt_mapping(unsigned long vstart, unsigned long 
vend,
 int htab_remove_mapping(unsigned long vstart, unsigned long vend,
  int psize, int ssize)
 {
-   unsigned long vaddr;
+   unsigned long vaddr, time_limit;
unsigned int step, shift;
int rc;
int ret = 0;
@@ -351,8 +351,19 @@ int htab_remove_mapping(unsigned long vstart, unsigned 
long vend,
 
/* Unmap the full range specificied */
vaddr = ALIGN_DOWN(vstart, step);
+   time_limit = jiffies + HZ;
+
for (;vaddr < vend; vaddr += step) {
rc = mmu_hash_ops.hpte_removebolted(vaddr, psize, ssize);
+
+   /*
+* For large number of mappings introduce a cond_resched()
+* to prevent softlockup warnings.
+*/
+   if (time_after(jiffies, time_limit)) {
+   cond_resched();
+   time_limit = jiffies + HZ;
+   }
if (rc == -ENOENT) {
ret = -ENOENT;
continue;
-- 
2.30.2



[PATCH] powerpc/papr_scm: Mark nvdimm as unarmed if needed during probe

2021-03-29 Thread Vaibhav Jain
In case an nvdimm is found to be unarmed during probe then set its
NDD_UNARMED flag before nvdimm_create(). This would enforce a
read-only access to the ndimm region. Presently even if an nvdimm is
unarmed its not marked as read-only on ppc64 guests.

The patch updates papr_scm_nvdimm_init() to force query of nvdimm
health via __drc_pmem_query_health() and if nvdimm is found to be
unarmed then set the nvdimm flag ND_UNARMED for nvdimm_create().

Signed-off-by: Vaibhav Jain 
---
 arch/powerpc/platforms/pseries/papr_scm.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index 835163f54244..7e8168e19427 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -914,6 +914,15 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
dimm_flags = 0;
set_bit(NDD_LABELING, _flags);
 
+   /*
+* Check if the nvdimm is unarmed. No locking needed as we are still
+* initializing. Ignore error encountered if any.
+*/
+   __drc_pmem_query_health(p);
+
+   if (p->health_bitmap & PAPR_PMEM_UNARMED_MASK)
+   set_bit(NDD_UNARMED, _flags);
+
p->nvdimm = nvdimm_create(p->bus, p, papr_nd_attr_groups,
  dimm_flags, PAPR_SCM_DIMM_CMD_MASK, 0, NULL);
if (!p->nvdimm) {
-- 
2.30.2



Re: [PATCH v2] powerpc/papr_scm: Implement support for H_SCM_FLUSH hcall

2021-03-24 Thread Vaibhav Jain


Hi Shiva,

Thanks for the patch. Few minor review comments:

Shivaprasad G Bhat  writes:

> Add support for ND_REGION_ASYNC capability if the device tree
> indicates 'ibm,hcall-flush-required' property in the NVDIMM node.
> Flush is done by issuing H_SCM_FLUSH hcall to the hypervisor.
>
> If the flush request failed, the hypervisor is expected to
> to reflect the problem in the subsequent dimm health request call.
s/dimm/nvdimm
s/health request call/H_SCM_HEALTH hcall/

>
> This patch prevents mmap of namespaces with MAP_SYNC flag if the
> nvdimm requires explicit flush[1].
s/explicit/an explicit/

>
> References:
> [1] 
> https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master/memory/ndctl.py.data/map_sync.c
>
> Signed-off-by: Shivaprasad G Bhat 
> ---
> v1 - https://www.spinics.net/lists/kvm-ppc/msg18272.html
> Changes from v1:
>- Hcall semantics finalized, all changes are to accomodate them.
>
>  Documentation/powerpc/papr_hcalls.rst |   14 ++
>  arch/powerpc/include/asm/hvcall.h |3 +-
>  arch/powerpc/platforms/pseries/papr_scm.c |   39 
> +
>  3 files changed, 55 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/powerpc/papr_hcalls.rst 
> b/Documentation/powerpc/papr_hcalls.rst
> index 48fcf1255a33..648f278eea8f 100644
> --- a/Documentation/powerpc/papr_hcalls.rst
> +++ b/Documentation/powerpc/papr_hcalls.rst
> @@ -275,6 +275,20 @@ Health Bitmap Flags:
>  Given a DRC Index collect the performance statistics for NVDIMM and copy them
>  to the resultBuffer.
>  
> +**H_SCM_FLUSH**
> +
> +| Input: *drcIndex, continue-token*
> +| Out: *continue-token*
> +| Return Value: *H_SUCCESS, H_Parameter, H_P2, H_BUSY*
> +
> +Given a DRC Index Flush the data to backend NVDIMM device.
> +
> +The hcall returns H_BUSY when the flush takes longer time and the hcall needs
> +to be issued multiple times in order to be completely serviced. The
> +*continue-token* from the output to be passed in the argument list of
> +subsequent hcalls to the hypervisor until the hcall is completely serviced
> +at which point H_SUCCESS or other error is returned by the hypervisor.
> +
>  References
>  ==
>  .. [1] "Power Architecture Platform Reference"
> diff --git a/arch/powerpc/include/asm/hvcall.h 
> b/arch/powerpc/include/asm/hvcall.h
> index ed6086d57b22..9f7729a97ebd 100644
> --- a/arch/powerpc/include/asm/hvcall.h
> +++ b/arch/powerpc/include/asm/hvcall.h
> @@ -315,7 +315,8 @@
>  #define H_SCM_HEALTH0x400
>  #define H_SCM_PERFORMANCE_STATS 0x418
>  #define H_RPT_INVALIDATE 0x448
> -#define MAX_HCALL_OPCODE H_RPT_INVALIDATE
> +#define H_SCM_FLUSH  0x44C
> +#define MAX_HCALL_OPCODE H_SCM_FLUSH
>  
>  /* Scope args for H_SCM_UNBIND_ALL */
>  #define H_UNBIND_SCOPE_ALL (0x1)
> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
> b/arch/powerpc/platforms/pseries/papr_scm.c
> index 835163f54244..f0407e135410 100644
> --- a/arch/powerpc/platforms/pseries/papr_scm.c
> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> @@ -93,6 +93,7 @@ struct papr_scm_priv {
>   uint64_t block_size;
>   int metadata_size;
>   bool is_volatile;
> + bool hcall_flush_required;
>  
>   uint64_t bound_addr;
>  
> @@ -117,6 +118,38 @@ struct papr_scm_priv {
>   size_t stat_buffer_len;
>  };
>  
> +static int papr_scm_pmem_flush(struct nd_region *nd_region,
> +struct bio *bio __maybe_unused)
> +{
> + struct papr_scm_priv *p = nd_region_provider_data(nd_region);
> + unsigned long ret_buf[PLPAR_HCALL_BUFSIZE];
> + uint64_t token = 0;
> + int64_t rc;
> +
Suggest adding a dev_dbg to to indicate a flush request to a drc. That
way if the loop below gets stuck the issue can be debugged with kernel
logs.
> + do {
> + rc = plpar_hcall(H_SCM_FLUSH, ret_buf, p->drc_index, token);
> + token = ret_buf[0];
> +
> + /* Check if we are stalled for some time */
> + if (H_IS_LONG_BUSY(rc)) {
> + msleep(get_longbusy_msecs(rc));
> + rc = H_BUSY;
> + } else if (rc == H_BUSY) {
> + cond_resched();
> + }
> +
> + } while (rc == H_BUSY);
> +
> + if (rc) {
> + dev_err(>pdev->dev, "flush error: %lld", rc);
> + rc = -EIO;
> + } else {
> + dev_dbg(>pdev->dev, "flush drc 0x%x complete", p->drc_index);
> + }
> +
> + return rc;
> +}
> +
>  static LIST_HEAD(papr_nd_regions);
>  static DEFINE_MUTEX(papr_ndr_lock);
>  
> @@ -943,6 +976,11 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
>   ndr_desc.num_mappings = 1;
>   ndr_desc.nd_set = >nd_set;
>  
> + if (p->hcall_flush_required) {
> + set_bit(ND_REGION_ASYNC, _desc.flags);
> + ndr_desc.flush = papr_scm_pmem_flush;
> + }
> +
>   if (p->is_volatile)
>   p->region = 

[PATCH] powerpc/mm: Add cond_resched() while removing hpte mappings

2021-03-10 Thread Vaibhav Jain
While removing large number of mappings from hash page tables for
large memory systems as soft-lockup is reported because of the time
spent inside htap_remove_mapping() like one below:

 watchdog: BUG: soft lockup - CPU#8 stuck for 23s!
 
 NIP plpar_hcall+0x38/0x58
 LR  pSeries_lpar_hpte_invalidate+0x68/0xb0
 Call Trace:
  0x1fff000 (unreliable)
  pSeries_lpar_hpte_removebolted+0x9c/0x230
  hash__remove_section_mapping+0xec/0x1c0
  remove_section_mapping+0x28/0x3c
  arch_remove_memory+0xfc/0x150
  devm_memremap_pages_release+0x180/0x2f0
  devm_action_release+0x30/0x50
  release_nodes+0x28c/0x300
  device_release_driver_internal+0x16c/0x280
  unbind_store+0x124/0x170
  drv_attr_store+0x44/0x60
  sysfs_kf_write+0x64/0x90
  kernfs_fop_write+0x1b0/0x290
  __vfs_write+0x3c/0x70
  vfs_write+0xd4/0x270
  ksys_write+0xdc/0x130
  system_call+0x5c/0x70

Fix this by adding a cond_resched() to the loop in
htap_remove_mapping() that issues hcall to remove hpte mapping. This
should prevent the soft-lockup from being reported.

Suggested-by: Aneesh Kumar K.V 
Signed-off-by: Vaibhav Jain 
---
 arch/powerpc/mm/book3s64/hash_utils.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/mm/book3s64/hash_utils.c 
b/arch/powerpc/mm/book3s64/hash_utils.c
index 581b20a2feaf..ea3945c70b18 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -359,6 +359,8 @@ int htab_remove_mapping(unsigned long vstart, unsigned long 
vend,
}
if (rc < 0)
return rc;
+
+   cond_resched();
}
 
return ret;
-- 
2.29.2



[RFC][PATCH 2/2] powerpc/papr_scm: Implement support for reporting generic nvdimm stats

2020-11-08 Thread Vaibhav Jain
Add support for reporting papr-scm supported generic nvdimm stats by
implementing support for handling ND_CMD_GET_STAT in
'papr_scm_ndctl().

The mapping between libnvdimm generic nvdimm-stats and papr-scm
specific performance-stats is embedded inside 'dimm_stats_map[]'. This
array is queried by newly introduced 'papr_scm_get_stat()' that
verifies if the requested nvdimm-stat is supported and if yes does an
hcall via 'drc_pmem_query_stat()' to request the performance-stat and
return it back to libnvdimm.

Signed-off-by: Vaibhav Jain 
---
 arch/powerpc/platforms/pseries/papr_scm.c | 66 ++-
 1 file changed, 65 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index 835163f54244..51eeab3376fd 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -25,7 +25,8 @@
((1ul << ND_CMD_GET_CONFIG_SIZE) | \
 (1ul << ND_CMD_GET_CONFIG_DATA) | \
 (1ul << ND_CMD_SET_CONFIG_DATA) | \
-(1ul << ND_CMD_CALL))
+(1ul << ND_CMD_CALL) |\
+(1ul << ND_CMD_GET_STAT))
 
 /* DIMM health bitmap bitmap indicators */
 /* SCM device is unable to persist memory contents */
@@ -120,6 +121,16 @@ struct papr_scm_priv {
 static LIST_HEAD(papr_nd_regions);
 static DEFINE_MUTEX(papr_ndr_lock);
 
+/* Map generic nvdimm stats to papr-scm stats */
+static const char * const dimm_stat_map[] = {
+   [ND_DIMM_STAT_INVALID] = NULL,
+   [ND_DIMM_STAT_MEDIA_READS] = "MedRCnt ",
+   [ND_DIMM_STAT_MEDIA_WRITES] = "MedWCnt ",
+   [ND_DIMM_STAT_READ_REQUESTS] = "HostLCnt",
+   [ND_DIMM_STAT_WRITE_REQUESTS] = "HostSCnt",
+   [ND_DIMM_STAT_MAX] = NULL,
+};
+
 static int drc_pmem_bind(struct papr_scm_priv *p)
 {
unsigned long ret[PLPAR_HCALL_BUFSIZE];
@@ -728,6 +739,54 @@ static int papr_scm_service_pdsm(struct papr_scm_priv *p,
return pdsm_pkg->cmd_status;
 }
 
+/*
+ * For a given pdsm request call an appropriate service function.
+ * Returns errors if any while handling the pdsm command package.
+ */
+static int papr_scm_get_stat(struct papr_scm_priv *p,
+struct nd_cmd_get_dimm_stat *dimm_stat)
+
+{
+   int rc;
+   ssize_t size;
+   struct papr_scm_perf_stat *stat;
+   struct papr_scm_perf_stats *stats;
+
+   /* Check if the requested stat-id is supported */
+   if (dimm_stat->stat_id >= ARRAY_SIZE(dimm_stat_map) ||
+   !dimm_stat_map[dimm_stat->stat_id]) {
+   dev_dbg(>pdev->dev, "Invalid stat-id %lld\n", 
dimm_stat->stat_id);
+   return -ENOSPC;
+   }
+
+   /* Allocate request buffer enough to hold single performance stat */
+   size = sizeof(struct papr_scm_perf_stats) +
+   sizeof(struct papr_scm_perf_stat);
+
+   stats = kzalloc(size, GFP_KERNEL);
+   if (!stats)
+   return -ENOMEM;
+
+   stat = >scm_statistic[0];
+   memcpy(>stat_id, dimm_stat_map[dimm_stat->stat_id],
+  sizeof(stat->stat_id));
+   stat->stat_val = 0;
+
+   /* Fetch the statistic from PHYP and copy it to provided payload */
+   rc = drc_pmem_query_stats(p, stats, 1);
+   if (rc < 0) {
+   dev_dbg(>pdev->dev, "Err(%d) fetching stat '%.8s'\n",
+   rc, stat->stat_id);
+   kfree(stats);
+   return rc;
+   }
+
+   dimm_stat->int_val = be64_to_cpu(stat->stat_val);
+
+   kfree(stats);
+   return 0;
+}
+
 static int papr_scm_ndctl(struct nvdimm_bus_descriptor *nd_desc,
  struct nvdimm *nvdimm, unsigned int cmd, void *buf,
  unsigned int buf_len, int *cmd_rc)
@@ -772,6 +831,11 @@ static int papr_scm_ndctl(struct nvdimm_bus_descriptor 
*nd_desc,
*cmd_rc = papr_scm_service_pdsm(p, call_pkg);
break;
 
+   case ND_CMD_GET_STAT:
+   *cmd_rc = papr_scm_get_stat(p,
+   (struct nd_cmd_get_dimm_stat *)buf);
+   break;
+
default:
dev_dbg(>pdev->dev, "Unknown command = %d\n", cmd);
return -EINVAL;
-- 
2.28.0



[RFC][PATCH 1/2] libnvdimm: Introduce ND_CMD_GET_STAT to retrieve nvdimm statistics

2020-11-08 Thread Vaibhav Jain
Implement support for exposing generic nvdimm statistics via newly
introduced dimm-command ND_CMD_GET_STAT that can be handled by nvdimm
command handler function and provide values for these statistics back
to libnvdimm. Following generic nvdimm statistics are defined as an
enumeration in 'uapi/ndctl.h':

* "media_reads" : Number of media reads that have occurred since reboot.
* "media_writes" : Number of media writes that have occurred since reboot.
* "read_requests" : Number of read requests that have occurred since reboot.
* "write_requests" : Number of write requests that have occurred since reboot.
* "total_media_reads" : Total number of media reads that have occurred.
* "total_media_writes" : Total number of media writes that have occurred.
* "total_read_requests" : Total number of read requests that have occurred.
* "total_write_requests" : Total number of write requests that have occurred.

Apart from ND_CMD_GET_STAT ioctl these nvdimm statistics are also
exposed via sysfs '/stats' directory for easy user-space
access like below:

/sys/class/nd/ndctl0/device/nmem0/stats # tail -n +1 *
==> media_reads <==
252197707602
==> media_writes <==
20684685172
==> read_requests <==
658810924962
==> write_requests <==
404464081574

In case a specific nvdimm-statistic is not supported than nvdimm
command handler function can simply return an error (e.g -ENOENT) for
request to read that nvdimm-statistic.

The value for a specific nvdimm-stat is exchanged via newly introduced
'struct nd_cmd_get_dimm_stat' that hold a single statistics and a
union of possible values types. Presently only '__s64' type of generic
attributes are supported. These attributes are defined in
'ndvimm/dimm_devs.c' via a helper macro 'NVDIMM_STAT_ATTR'.

Signed-off-by: Vaibhav Jain 
---
 drivers/nvdimm/bus.c   |   6 ++
 drivers/nvdimm/dimm_devs.c | 109 +
 drivers/nvdimm/nd.h|   5 ++
 include/uapi/linux/ndctl.h |  27 +
 4 files changed, 147 insertions(+)

diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
index 2304c6183822..d53564477437 100644
--- a/drivers/nvdimm/bus.c
+++ b/drivers/nvdimm/bus.c
@@ -794,6 +794,12 @@ static const struct nd_cmd_desc __nd_cmd_dimm_descs[] = {
.out_num = 1,
.out_sizes = { UINT_MAX, },
},
+   [ND_CMD_GET_STAT] = {
+   .in_num = 1,
+   .in_sizes = { sizeof(struct nd_cmd_get_dimm_stat), },
+   .out_num = 1,
+   .out_sizes = { sizeof(struct nd_cmd_get_dimm_stat), },
+   },
 };
 
 const struct nd_cmd_desc *nd_cmd_dimm_desc(int cmd)
diff --git a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c
index b59032e0859b..68aaa294def7 100644
--- a/drivers/nvdimm/dimm_devs.c
+++ b/drivers/nvdimm/dimm_devs.c
@@ -555,6 +555,114 @@ static umode_t nvdimm_firmware_visible(struct kobject 
*kobj, struct attribute *a
return a->mode;
 }
 
+/* Request a dimm stat from the bus driver */
+static int __request_dimm_stat(struct nvdimm_bus *nvdimm_bus,
+  struct nvdimm *dimm, u64 stat_id,
+  s64 *stat_val)
+{
+   struct nvdimm_bus_descriptor *nd_desc = nvdimm_bus->nd_desc;
+   struct nd_cmd_get_dimm_stat stat = { .stat_id = stat_id };
+   int rc, cmd_rc;
+
+   if (!test_bit(ND_CMD_GET_STAT, >cmd_mask)) {
+   pr_debug("CMD_GET_STAT not set for bus driver 0x%lx\n",
+nd_desc->cmd_mask);
+   return -ENOENT;
+   }
+
+   /* Is stat requested is known & bus driver supports fetching stats */
+   if (stat_id <= ND_DIMM_STAT_INVALID || stat_id > ND_DIMM_STAT_MAX) {
+   WARN(1, "Unknown stat-id(%llu) requested", stat_id);
+   return -ENOENT;
+   }
+
+   /* Ask bus driver for its stat value */
+   rc = nd_desc->ndctl(nd_desc, dimm, ND_CMD_GET_STAT,
+   , sizeof(stat), _rc);
+   if (rc || cmd_rc) {
+   pr_debug("Unable to request stat %lld. Error (%d,%d)\n",
+stat_id, rc, cmd_rc);
+   return rc ? rc : cmd_rc;
+   }
+
+   /* Indicate error in case returned struct doesn't have the stat_id set 
*/
+   if (stat.stat_id != stat_id) {
+   pr_debug("Invalid statid %llu returned\n", stat.stat_id);
+   return -ENOENT;
+   }
+
+   *stat_val = stat.int_val;
+   return 0;
+}
+
+static ssize_t nvdimm_stat_attr_show(struct device *dev,
+struct device_attribute *attr,
+char *buf)
+{
+   struct nvdimm_stat_attr *nattr = container_of(attr, typeof(*nattr), 
attr);
+   struct nvdimm_bus *nvdimm_bus = walk_to_nvdimm_bus(dev);
+   struct nvdimm *nvdimm = to_nvd

Re: [PATCH 30/33] docs: ABI: cleanup several ABI documents

2020-10-30 Thread Vaibhav Jain
Mauro Carvalho Chehab  writes:

> There are some ABI documents that, while they don't generate
> any warnings, they have issues when parsed by get_abi.pl script
> on its output result.
>
> Address them, in order to provide a clean output.
>
> Signed-off-by: Mauro Carvalho Chehab 


> diff --git a/Documentation/ABI/testing/sysfs-bus-papr-pmem 
> b/Documentation/ABI/testing/sysfs-bus-papr-pmem
> index c1a67275c43f..8316c33862a0 100644
> --- a/Documentation/ABI/testing/sysfs-bus-papr-pmem
> +++ b/Documentation/ABI/testing/sysfs-bus-papr-pmem
> @@ -11,19 +11,26 @@ Description:
>   at 'Documentation/powerpc/papr_hcalls.rst' . Below are
>   the flags reported in this sysfs file:
>  
> - * "not_armed"   : Indicates that NVDIMM contents will not
> + * "not_armed"
> +   Indicates that NVDIMM contents will not
> survive a power cycle.
> - * "flush_fail"  : Indicates that NVDIMM contents
> + * "flush_fail"
> +   Indicates that NVDIMM contents
> couldn't be flushed during last
> shut-down event.
> - * "restore_fail": Indicates that NVDIMM contents
> + * "restore_fail"
> +   Indicates that NVDIMM contents
> couldn't be restored during NVDIMM
> initialization.
> - * "encrypted"   : NVDIMM contents are encrypted.
> - * "smart_notify": There is health event for the NVDIMM.
> - * "scrubbed": Indicating that contents of the
> + * "encrypted"
> +   NVDIMM contents are encrypted.
> + * "smart_notify"
> +   There is health event for the NVDIMM.
> + * "scrubbed"
> +   Indicating that contents of the
> NVDIMM have been scrubbed.
> - * "locked"  : Indicating that NVDIMM contents cant
> + * "locked"
> +   Indicating that NVDIMM contents cant
> be modified until next power cycle.
>  
>  What:/sys/bus/nd/devices/nmemX/papr/perf_stats
> @@ -51,4 +58,4 @@ Description:
>   * "MedWDur " : Media Write Duration
>   * "CchRHCnt" : Cache Read Hit Count
>   * "CchWHCnt" : Cache Write Hit Count
> - * "FastWCnt" : Fast Write Count
> \ No newline at end of file
> + * "FastWCnt" : Fast Write Count


Thanks,

I am fine with proposed changes to sysfs-bus-papr-pmem.

Acked-by: Vaibhav Jain  # for sysfs-bus-papr-pmem



Re: [PATCH] powerpc/papr_scm: Add PAPR command family to pass-through command-set

2020-09-28 Thread Vaibhav Jain
Hi Dan, Ira and Vishal,

Can you please take a look at this patch. Without it the functionality
to report nvdimm health via ndctl breaks on 5.9

Thanks,
~ Vaibhav

Vaibhav Jain  writes:

> Add NVDIMM_FAMILY_PAPR to the list of valid 'dimm_family_mask'
> acceptable by papr_scm. This is needed as since commit
> 92fe2aa859f5 ("libnvdimm: Validate command family indices") libnvdimm
> performs a validation of 'nd_cmd_pkg.nd_family' received as part of
> ND_CMD_CALL processing to ensure only known command families can use
> the general ND_CMD_CALL pass-through functionality.
>
> Without this change the ND_CMD_CALL pass-through targeting
> NVDIMM_FAMILY_PAPR error out with -EINVAL.
>
> Fixes: 92fe2aa859f5 ("libnvdimm: Validate command family indices")
> Signed-off-by: Vaibhav Jain 
> ---
>  arch/powerpc/platforms/pseries/papr_scm.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
> b/arch/powerpc/platforms/pseries/papr_scm.c
> index 5493bc847bd08..27268370dee00 100644
> --- a/arch/powerpc/platforms/pseries/papr_scm.c
> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> @@ -898,6 +898,9 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
>   p->bus_desc.of_node = p->pdev->dev.of_node;
>   p->bus_desc.provider_name = kstrdup(p->pdev->name, GFP_KERNEL);
>  
> + /* Set the dimm command family mask to accept PDSMs */
> + set_bit(NVDIMM_FAMILY_PAPR, >bus_desc.dimm_family_mask);
> +
>   if (!p->bus_desc.provider_name)
>   return -ENOMEM;
>  
> -- 
> 2.26.2
>


Re: [PATCH -next] powerpc/papr_scm: Fix warnings about undeclared variable

2020-09-24 Thread Vaibhav Jain


Thanks for the patch. This looks good to me.

Wang Wensheng  writes:

> Build the kernel with 'make C=2':
> arch/powerpc/platforms/pseries/papr_scm.c:825:1: warning: symbol
> 'dev_attr_perf_stats' was not declared. Should it be static?

> Signed-off-by: Wang Wensheng 
Reviewed-by: Vaibhav Jain 

-- 
Cheers
~ Vaibhav


Re: [PATCH v2] powerpc/papr_scm: Fix warning triggered by perf_stats_show()

2020-09-15 Thread Vaibhav Jain
Michael Ellerman  writes:

> Vaibhav Jain  writes:
>> A warning is reported by the kernel in case perf_stats_show() returns
>> an error code. The warning is of the form below:
>>
>>  papr_scm ibm,persistent-memory:ibm,pmemory@4411:
>>Failed to query performance stats, Err:-10
>>  dev_attr_show: perf_stats_show+0x0/0x1c0 [papr_scm] returned bad count
>>  fill_read_buffer: dev_attr_show+0x0/0xb0 returned bad count
>>
>> On investigation it looks like that the compiler is silently truncating the
>> return value of drc_pmem_query_stats() from 'long' to 'int', since the
>> variable used to store the return code 'rc' is an 'int'. This
>> truncated value is then returned back as a 'ssize_t' back from
>> perf_stats_show() to 'dev_attr_show()' which thinks of it as a large
>> unsigned number and triggers this warning..
>>
>> To fix this we update the type of variable 'rc' from 'int' to
>> 'ssize_t' that prevents the compiler from truncating the return value
>> of drc_pmem_query_stats() and returning correct signed value back from
>> perf_stats_show().
>>
>> Fixes: 2d02bf835e573 ('powerpc/papr_scm: Fetch nvdimm performance
>>stats from PHYP')
>
> Please don't word wrap the Fixes tag it breaks b4.
>
> I've fixed it up this time.

Thanks Mpe

>
> cheers

-- 
Cheers
~ Vaibhav


[PATCH] powerpc/papr_scm: Support dynamic enable/disable of performance statistics

2020-09-13 Thread Vaibhav Jain
Collection of performance statistics of an NVDIMM can be dynamically
enabled/disabled from the Hypervisor Management Console even when the
guest lpar is running. The current implementation however will check if
the performance statistics collection is supported during NVDIMM probe
and if yes will assume that to be the case afterwards.

Hence we update papr_scm to remove this assumption from the code by
eliminating the 'stat_buffer_len' member from 'struct papr_scm_priv'
that was used to cache the max buffer size needed to fetch NVDIMM
performance stats from PHYP. With that struct member gone, various
functions that depended on it are updated. Specifically
perf_stats_show() is updated to query the PHYP first for the size of
buffer needed to hold all performance statistics instead of relying on
'stat_buffer_len'

Signed-off-by: Vaibhav Jain 
---
 arch/powerpc/platforms/pseries/papr_scm.c | 53 +++
 1 file changed, 25 insertions(+), 28 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index 27268370dee00..6697e1c3b9ebe 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -112,9 +112,6 @@ struct papr_scm_priv {
 
/* Health information for the dimm */
u64 health_bitmap;
-
-   /* length of the stat buffer as expected by phyp */
-   size_t stat_buffer_len;
 };
 
 static LIST_HEAD(papr_nd_regions);
@@ -230,14 +227,15 @@ static int drc_pmem_query_n_bind(struct papr_scm_priv *p)
  * - If buff_stats == NULL the return value is the size in byes of the buffer
  * needed to hold all supported performance-statistics.
  * - If buff_stats != NULL and num_stats == 0 then we copy all known
- * performance-statistics to 'buff_stat' and expect to be large enough to
- * hold them.
+ * performance-statistics to 'buff_stat' and expect it to be large enough to
+ * hold them. The 'buff_size' args contains the size of the 'buff_stats'
  * - if buff_stats != NULL and num_stats > 0 then copy the requested
  * performance-statistics to buff_stats.
  */
 static ssize_t drc_pmem_query_stats(struct papr_scm_priv *p,
struct papr_scm_perf_stats *buff_stats,
-   unsigned int num_stats)
+   unsigned int num_stats,
+   size_t buff_size)
 {
unsigned long ret[PLPAR_HCALL_BUFSIZE];
size_t size;
@@ -261,12 +259,18 @@ static ssize_t drc_pmem_query_stats(struct papr_scm_priv 
*p,
size = sizeof(struct papr_scm_perf_stats) +
num_stats * sizeof(struct papr_scm_perf_stat);
else
-   size = p->stat_buffer_len;
+   size = buff_size;
} else {
/* In case of no out buffer ignore the size */
size = 0;
}
 
+   /* verify that the buffer size needed is sufficient */
+   if (size > buff_size) {
+   __WARN();
+   return -EINVAL;
+   }
+
/* Do the HCALL asking PHYP for info */
rc = plpar_hcall(H_SCM_PERFORMANCE_STATS, ret, p->drc_index,
 buff_stats ? virt_to_phys(buff_stats) : 0,
@@ -277,6 +281,10 @@ static ssize_t drc_pmem_query_stats(struct papr_scm_priv 
*p,
dev_err(>pdev->dev,
"Unknown performance stats, Err:0x%016lX\n", ret[0]);
return -ENOENT;
+   } else if (rc == H_AUTHORITY) {
+   dev_dbg(>pdev->dev,
+   "Performance stats in-accessible\n");
+   return -EPERM;
} else if (rc != H_SUCCESS) {
dev_err(>pdev->dev,
"Failed to query performance stats, Err:%lld\n", rc);
@@ -526,10 +534,6 @@ static int papr_pdsm_fuel_gauge(struct papr_scm_priv *p,
struct papr_scm_perf_stat *stat;
struct papr_scm_perf_stats *stats;
 
-   /* Silently fail if fetching performance metrics isn't  supported */
-   if (!p->stat_buffer_len)
-   return 0;
-
/* Allocate request buffer enough to hold single performance stat */
size = sizeof(struct papr_scm_perf_stats) +
sizeof(struct papr_scm_perf_stat);
@@ -543,9 +547,11 @@ static int papr_pdsm_fuel_gauge(struct papr_scm_priv *p,
stat->stat_val = 0;
 
/* Fetch the fuel gauge and populate it in payload */
-   rc = drc_pmem_query_stats(p, stats, 1);
+   rc = drc_pmem_query_stats(p, stats, 1, size);
if (rc < 0) {
dev_dbg(>pdev->dev, "Err(%d) fetching fuel gauge\n", rc);
+   /* Silently fail if unable to fetch performance metric */
+   rc = 0;
goto free_stats;
}
 
@@ -786,23 +792,25 @@ static ssize_t perf_stats_show(s

[PATCH] powerpc/papr_scm: Add PAPR command family to pass-through command-set

2020-09-13 Thread Vaibhav Jain
Add NVDIMM_FAMILY_PAPR to the list of valid 'dimm_family_mask'
acceptable by papr_scm. This is needed as since commit
92fe2aa859f5 ("libnvdimm: Validate command family indices") libnvdimm
performs a validation of 'nd_cmd_pkg.nd_family' received as part of
ND_CMD_CALL processing to ensure only known command families can use
the general ND_CMD_CALL pass-through functionality.

Without this change the ND_CMD_CALL pass-through targeting
NVDIMM_FAMILY_PAPR error out with -EINVAL.

Fixes: 92fe2aa859f5 ("libnvdimm: Validate command family indices")
Signed-off-by: Vaibhav Jain 
---
 arch/powerpc/platforms/pseries/papr_scm.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index 5493bc847bd08..27268370dee00 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -898,6 +898,9 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
p->bus_desc.of_node = p->pdev->dev.of_node;
p->bus_desc.provider_name = kstrdup(p->pdev->name, GFP_KERNEL);
 
+   /* Set the dimm command family mask to accept PDSMs */
+   set_bit(NVDIMM_FAMILY_PAPR, >bus_desc.dimm_family_mask);
+
if (!p->bus_desc.provider_name)
return -ENOMEM;
 
-- 
2.26.2



Re: [PATCH] powerpc/papr_scm: Fix warning triggered by perf_stats_show()

2020-09-12 Thread Vaibhav Jain
Ira Weiny  writes:

> On Sat, Sep 12, 2020 at 01:36:46AM +0530, Vaibhav Jain wrote:
>> Thanks for reviewing this patch Ira,
>> 
>> 
>> Ira Weiny  writes:
>> 
>> > On Thu, Sep 10, 2020 at 02:52:12PM +0530, Vaibhav Jain wrote:
>> >> A warning is reported by the kernel in case perf_stats_show() returns
>> >> an error code. The warning is of the form below:
>> >> 
>> >>  papr_scm ibm,persistent-memory:ibm,pmemory@4411:
>> >> Failed to query performance stats, Err:-10
>> >>  dev_attr_show: perf_stats_show+0x0/0x1c0 [papr_scm] returned bad count
>> >>  fill_read_buffer: dev_attr_show+0x0/0xb0 returned bad count
>> >> 
>> >> On investigation it looks like that the compiler is silently truncating 
>> >> the
>> >> return value of drc_pmem_query_stats() from 'long' to 'int', since the
>> >> variable used to store the return code 'rc' is an 'int'. This
>> >> truncated value is then returned back as a 'ssize_t' back from
>> >> perf_stats_show() to 'dev_attr_show()' which thinks of it as a large
>> >> unsigned number and triggers this warning..
>> >> 
>> >> To fix this we update the type of variable 'rc' from 'int' to
>> >> 'ssize_t' that prevents the compiler from truncating the return value
>> >> of drc_pmem_query_stats() and returning correct signed value back from
>> >> perf_stats_show().
>> >> 
>> >> Fixes: 2d02bf835e573 ('powerpc/papr_scm: Fetch nvdimm performance
>> >>stats from PHYP')
>> >> Signed-off-by: Vaibhav Jain 
>> >> ---
>> >>  arch/powerpc/platforms/pseries/papr_scm.c | 3 ++-
>> >>  1 file changed, 2 insertions(+), 1 deletion(-)
>> >> 
>> >> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
>> >> b/arch/powerpc/platforms/pseries/papr_scm.c
>> >> index a88a707a608aa..9f00b61676ab9 100644
>> >> --- a/arch/powerpc/platforms/pseries/papr_scm.c
>> >> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
>> >> @@ -785,7 +785,8 @@ static int papr_scm_ndctl(struct 
>> >> nvdimm_bus_descriptor *nd_desc,
>> >>  static ssize_t perf_stats_show(struct device *dev,
>> >>  struct device_attribute *attr, char *buf)
>> >>  {
>> >> - int index, rc;
>> >> + int index;
>> >> + ssize_t rc;
>> >
>> > I'm not sure this is really fixing everything here.
>> 
>> The issue is with the statement in perf_stats_show():
>> 
>> 'return rc ? rc : seq_buf_used();'
>> 
>> The function seq_buf_used() returns an 'unsigned int' and with 'rc'
>> typed as 'int', forces a promotion of the expression to 'unsigned int'
>> which causes a loss of signedness of 'rc' and compiler silently
>> assigns this unsigned value to the function return typed as 'signed
>> long'.
>> 
>> Making 'rc', a 'signed long' forces a promotion of the expresion to
>> 'signed long' which preserves the signedness of 'rc' and will also be
>> compatible with the function return type.
>
> Ok, ok, I read this all wrong.
>
> FWIW I would also cast seq_buf_used() to ssize_t to show you know what you are
> doing there.
>
>> 
>> >
>> > drc_pmem_query_stats() can return negative errno's.  Why are those not 
>> > checked
>> > somewhere in perf_stats_show()?
>> >
>> For the specific invocation 'drc_pmem_query_stats(p, stats, 0)' we only
>> expect return value 'rc <=0' with '0' indicating a successful fetch of
>> nvdimm performance stats from hypervisor. Hence there are no explicit
>> checks for negative error codes in the functions as all return values
>> !=0 indicate an error.
>> 
>> 
>> > It seems like all this fix is handling is a > 0 return value: 'ret[0]' from
>> > line 289 in papr_scm.c...  Or something?
>> No, in case the arg 'num_stats' is '0' and 'buff_stats != NULL' the
>> variable 'size' is assigned a non-zero value hence that specific branch
>> you mentioned  is never taken. Instead in case of success
>> drc_pmem_query_stats() return '0' and in case of an error a negative
>> error code is returned.
>> 
>> >
>> > Worse yet drc_pmem_query_stats() is returning ssize_t which is a signed 
>> > value.
>> > Therefore, it should not be returning -errno.  I'm surprised the static
>> > checkers did not catch that.
>> Didnt quite get the assertion here. The function is marked to return
>> 'ss

[PATCH v2] powerpc/papr_scm: Fix warning triggered by perf_stats_show()

2020-09-12 Thread Vaibhav Jain
A warning is reported by the kernel in case perf_stats_show() returns
an error code. The warning is of the form below:

 papr_scm ibm,persistent-memory:ibm,pmemory@4411:
  Failed to query performance stats, Err:-10
 dev_attr_show: perf_stats_show+0x0/0x1c0 [papr_scm] returned bad count
 fill_read_buffer: dev_attr_show+0x0/0xb0 returned bad count

On investigation it looks like that the compiler is silently truncating the
return value of drc_pmem_query_stats() from 'long' to 'int', since the
variable used to store the return code 'rc' is an 'int'. This
truncated value is then returned back as a 'ssize_t' back from
perf_stats_show() to 'dev_attr_show()' which thinks of it as a large
unsigned number and triggers this warning..

To fix this we update the type of variable 'rc' from 'int' to
'ssize_t' that prevents the compiler from truncating the return value
of drc_pmem_query_stats() and returning correct signed value back from
perf_stats_show().

Fixes: 2d02bf835e573 ('powerpc/papr_scm: Fetch nvdimm performance
   stats from PHYP')
Signed-off-by: Vaibhav Jain 
---
Changelog:

v2: Added an explicit cast to the expression calling 'seq_buf_used()'
and triggering this issue. [ Ira ]
---
 arch/powerpc/platforms/pseries/papr_scm.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index a88a707a608aa..5493bc847bd08 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -785,7 +785,8 @@ static int papr_scm_ndctl(struct nvdimm_bus_descriptor 
*nd_desc,
 static ssize_t perf_stats_show(struct device *dev,
   struct device_attribute *attr, char *buf)
 {
-   int index, rc;
+   int index;
+   ssize_t rc;
struct seq_buf s;
struct papr_scm_perf_stat *stat;
struct papr_scm_perf_stats *stats;
@@ -820,7 +821,7 @@ static ssize_t perf_stats_show(struct device *dev,
 
 free_stats:
kfree(stats);
-   return rc ? rc : seq_buf_used();
+   return rc ? rc : (ssize_t)seq_buf_used();
 }
 DEVICE_ATTR_ADMIN_RO(perf_stats);
 
-- 
2.26.2



Re: [PATCH] powerpc/papr_scm: Fix warning triggered by perf_stats_show()

2020-09-11 Thread Vaibhav Jain
Thanks for reviewing this patch Ira,


Ira Weiny  writes:

> On Thu, Sep 10, 2020 at 02:52:12PM +0530, Vaibhav Jain wrote:
>> A warning is reported by the kernel in case perf_stats_show() returns
>> an error code. The warning is of the form below:
>> 
>>  papr_scm ibm,persistent-memory:ibm,pmemory@4411:
>>Failed to query performance stats, Err:-10
>>  dev_attr_show: perf_stats_show+0x0/0x1c0 [papr_scm] returned bad count
>>  fill_read_buffer: dev_attr_show+0x0/0xb0 returned bad count
>> 
>> On investigation it looks like that the compiler is silently truncating the
>> return value of drc_pmem_query_stats() from 'long' to 'int', since the
>> variable used to store the return code 'rc' is an 'int'. This
>> truncated value is then returned back as a 'ssize_t' back from
>> perf_stats_show() to 'dev_attr_show()' which thinks of it as a large
>> unsigned number and triggers this warning..
>> 
>> To fix this we update the type of variable 'rc' from 'int' to
>> 'ssize_t' that prevents the compiler from truncating the return value
>> of drc_pmem_query_stats() and returning correct signed value back from
>> perf_stats_show().
>> 
>> Fixes: 2d02bf835e573 ('powerpc/papr_scm: Fetch nvdimm performance
>>stats from PHYP')
>> Signed-off-by: Vaibhav Jain 
>> ---
>>  arch/powerpc/platforms/pseries/papr_scm.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>> 
>> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
>> b/arch/powerpc/platforms/pseries/papr_scm.c
>> index a88a707a608aa..9f00b61676ab9 100644
>> --- a/arch/powerpc/platforms/pseries/papr_scm.c
>> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
>> @@ -785,7 +785,8 @@ static int papr_scm_ndctl(struct nvdimm_bus_descriptor 
>> *nd_desc,
>>  static ssize_t perf_stats_show(struct device *dev,
>> struct device_attribute *attr, char *buf)
>>  {
>> -int index, rc;
>> +int index;
>> +ssize_t rc;
>
> I'm not sure this is really fixing everything here.

The issue is with the statement in perf_stats_show():

'return rc ? rc : seq_buf_used();'

The function seq_buf_used() returns an 'unsigned int' and with 'rc'
typed as 'int', forces a promotion of the expression to 'unsigned int'
which causes a loss of signedness of 'rc' and compiler silently
assigns this unsigned value to the function return typed as 'signed
long'.

Making 'rc', a 'signed long' forces a promotion of the expresion to
'signed long' which preserves the signedness of 'rc' and will also be
compatible with the function return type.

>
> drc_pmem_query_stats() can return negative errno's.  Why are those not checked
> somewhere in perf_stats_show()?
>
For the specific invocation 'drc_pmem_query_stats(p, stats, 0)' we only
expect return value 'rc <=0' with '0' indicating a successful fetch of
nvdimm performance stats from hypervisor. Hence there are no explicit
checks for negative error codes in the functions as all return values
!=0 indicate an error.


> It seems like all this fix is handling is a > 0 return value: 'ret[0]' from
> line 289 in papr_scm.c...  Or something?
No, in case the arg 'num_stats' is '0' and 'buff_stats != NULL' the
variable 'size' is assigned a non-zero value hence that specific branch
you mentioned  is never taken. Instead in case of success
drc_pmem_query_stats() return '0' and in case of an error a negative
error code is returned.

>
> Worse yet drc_pmem_query_stats() is returning ssize_t which is a signed value.
> Therefore, it should not be returning -errno.  I'm surprised the static
> checkers did not catch that.
Didnt quite get the assertion here. The function is marked to return
'ssize_t' because we can return both +ve for success and -ve values to
indicate errors.

>
> I believe I caught similar errors with a patch series before which did not pay
> attention to variable types.
>
> Please audit this code for these types of errors and ensure you are really
> doing the correct thing when using the sysfs interface.  I'm pretty sure bad
> things will eventually happen (if they are not already) if you return some
> really big number to the sysfs core from *_show().
I think this problem is different compared to what you had previously pointed
to. The values returned from drc_pmem_query_stats() can be stored in an
'int' variable too, however it was the silent promotion of a signed type
to unsigned type was what caused this specific issue.

-- 
Cheers
~ Vaibhav


[PATCH] powerpc/papr_scm: Fix warning triggered by perf_stats_show()

2020-09-10 Thread Vaibhav Jain
A warning is reported by the kernel in case perf_stats_show() returns
an error code. The warning is of the form below:

 papr_scm ibm,persistent-memory:ibm,pmemory@4411:
  Failed to query performance stats, Err:-10
 dev_attr_show: perf_stats_show+0x0/0x1c0 [papr_scm] returned bad count
 fill_read_buffer: dev_attr_show+0x0/0xb0 returned bad count

On investigation it looks like that the compiler is silently truncating the
return value of drc_pmem_query_stats() from 'long' to 'int', since the
variable used to store the return code 'rc' is an 'int'. This
truncated value is then returned back as a 'ssize_t' back from
perf_stats_show() to 'dev_attr_show()' which thinks of it as a large
unsigned number and triggers this warning..

To fix this we update the type of variable 'rc' from 'int' to
'ssize_t' that prevents the compiler from truncating the return value
of drc_pmem_query_stats() and returning correct signed value back from
perf_stats_show().

Fixes: 2d02bf835e573 ('powerpc/papr_scm: Fetch nvdimm performance
   stats from PHYP')
Signed-off-by: Vaibhav Jain 
---
 arch/powerpc/platforms/pseries/papr_scm.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index a88a707a608aa..9f00b61676ab9 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -785,7 +785,8 @@ static int papr_scm_ndctl(struct nvdimm_bus_descriptor 
*nd_desc,
 static ssize_t perf_stats_show(struct device *dev,
   struct device_attribute *attr, char *buf)
 {
-   int index, rc;
+   int index;
+   ssize_t rc;
struct seq_buf s;
struct papr_scm_perf_stat *stat;
struct papr_scm_perf_stats *stats;
-- 
2.26.2



[PATCH v2] powerpc/papr_scm: Limit the readability of 'perf_stats' sysfs attribute

2020-09-07 Thread Vaibhav Jain
The newly introduced 'perf_stats' attribute uses the default access
mode of 0444 letting non-root users access performance stats of an
nvdimm and potentially force the kernel into issuing large number of
expensive HCALLs. Since the information exposed by this attribute
cannot be cached hence its better to ward of access to this attribute
from users who don't need to access these performance statistics.

Hence this patch updates access mode of 'perf_stats' attribute to
be only readable by root users.

Fixes: 2d02bf835e573 ('powerpc/papr_scm: Fetch nvdimm performance stats from 
PHYP')
Reported-by: Aneesh Kumar K.V 
Signed-off-by: Vaibhav Jain 
---
Change-log:

v2:
* Instead of checking for perfmon_capable() inside show_perf_stats()
  set the attribute as DEVICE_ATTR_ADMIN_RO [ Aneesh ]
* Update patch description
---
 arch/powerpc/platforms/pseries/papr_scm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index f439f0dfea7d1..a88a707a608aa 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -822,7 +822,7 @@ static ssize_t perf_stats_show(struct device *dev,
kfree(stats);
return rc ? rc : seq_buf_used();
 }
-DEVICE_ATTR_RO(perf_stats);
+DEVICE_ATTR_ADMIN_RO(perf_stats);
 
 static ssize_t flags_show(struct device *dev,
  struct device_attribute *attr, char *buf)
-- 
2.26.2



Re: [PATCH] powerpc/papr_scm: Limit the readability of 'perf_stats' sysfs attribute

2020-08-19 Thread Vaibhav Jain
Thanks Aneesh and Mpe for reviewing this patch.

Michael Ellerman  writes:

> "Aneesh Kumar K.V"  writes:
[snip]
>>>   
>>> +   /* Allow access only to perfmon capable users */
>>> +   if (!perfmon_capable())
>>> +   return -EACCES;
>>> +
>>
>> An access check is usually done in open(). This is the read callback IIUC.
>
> Yes. Otherwise an unprivileged user can open the file, and then trick a
> suid program into reading from it.

Agree, but since the 'open()' for this sysfs attribute is handled
by kern-fs, AFAIK dont see any direct way to enforce this policy.

Only other way it seems to me is to convert the 'perf_stats' DEVICE_ATTR_RO
to DEVICE_ATTR_ADMIN_RO.

>
> cheers

-- 
Cheers
~ Vaibhav


[PATCH] libnvdimm: Add a NULL entry to 'nvdimm_firmware_attributes'

2020-08-14 Thread Vaibhav Jain
We recently discovered a kernel oops with 'papr_scm' module while
booting ppc64 phyp guest with following back-trace:

BUG: Kernel NULL pointer dereference on write at 0x0188
Faulting instruction address: 0xc05d7084
Oops: Kernel access of bad area, sig: 11 [#1]

Call Trace:
 internal_create_group+0x128/0x4c0 (unreliable)
 internal_create_groups.part.4+0x70/0x130
 device_add+0x458/0x9c0
 nd_async_device_register+0x28/0xa0 [libnvdimm]
 async_run_entry_fn+0x78/0x1f0
 process_one_work+0x2c0/0x5b0
 worker_thread+0x88/0x650
 kthread+0x1a8/0x1b0
 ret_from_kernel_thread+0x5c/0x6c

A bisect lead to the 'commit 48001ea50d17f ("PM, libnvdimm: Add runtime
firmware activation support")' and on investigation discovered that
the newly introduced 'struct attribute *nvdimm_firmware_attributes[]'
is missing a terminating NULL entry in the array. This causes a loop
in sysfs's 'create_files()' to read garbage beyond bounds of
'nvdimm_firmware_attributes' and trigger the oops.

Fixes: 48001ea50d17f ("PM, libnvdimm: Add runtime firmware activation support")
Reported-by: Sandipan Das 
Signed-off-by: Vaibhav Jain 
---
 drivers/nvdimm/dimm_devs.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c
index 61374def51555..b59032e0859b7 100644
--- a/drivers/nvdimm/dimm_devs.c
+++ b/drivers/nvdimm/dimm_devs.c
@@ -529,6 +529,7 @@ static DEVICE_ATTR_ADMIN_RW(activate);
 static struct attribute *nvdimm_firmware_attributes[] = {
_attr_activate.attr,
_attr_result.attr,
+   NULL,
 };
 
 static umode_t nvdimm_firmware_visible(struct kobject *kobj, struct attribute 
*a, int n)
-- 
2.26.2



[PATCH] powerpc/papr_scm: Limit the readability of 'perf_stats' sysfs attribute

2020-08-12 Thread Vaibhav Jain
The newly introduced 'perf_stats' attribute uses the default access
mode of 0444 letting non-root users access performance stats of an
nvdimm and potentially force the kernel into issuing large number of
expensive HCALLs. Since the information exposed by this attribute
cannot be cached hence its better to ward of access to this attribute
from users who don't need to access these performance statistics.

Hence this patch adds check in perf_stats_show() to only let users
that are 'perfmon_capable()' to read the nvdimm performance
statistics.

Fixes: 2d02bf835e573 ('powerpc/papr_scm: Fetch nvdimm performance stats from 
PHYP')
Reported-by: Aneesh Kumar K.V 
Signed-off-by: Vaibhav Jain 
---
 arch/powerpc/platforms/pseries/papr_scm.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index f439f0dfea7d1..36c51bf8af9a8 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -792,6 +792,10 @@ static ssize_t perf_stats_show(struct device *dev,
struct nvdimm *dimm = to_nvdimm(dev);
struct papr_scm_priv *p = nvdimm_provider_data(dimm);
 
+   /* Allow access only to perfmon capable users */
+   if (!perfmon_capable())
+   return -EACCES;
+
if (!p->stat_buffer_len)
return -ENOENT;
 
-- 
2.26.2



Re: [PATCH] powerpc/papr_scm: Make access mode of 'perf_stats' attribute file to '0400'

2020-08-12 Thread Vaibhav Jain
Hi Mpe,

Thanks for reviewing this patch. My responses below:

Michael Ellerman  writes:

> Vaibhav Jain  writes:
>> The newly introduced 'perf_stats' attribute uses the default access
>> mode of 0444 letting non-root users access performance stats of an
>> nvdimm and potentially force the kernel into issuing large number of
>> expensive HCALLs. Since the information exposed by this attribute
>> cannot be cached hence its better to ward of access to this attribute
>> from non-root users.
>>
>> Hence this patch updates the access-mode of 'perf_stats' sysfs
>> attribute file to 0400 to make it only readable to root-users.
>
> Or should we ratelimit it?
Ideal consumers of this data will be users with CAP_PERFMON or
CAP_SYS_ADMIN. Also they need up-to-date values for these performance stats
as these values can be time sensitive.

So rate limiting may not be a complete solution since a user running
'perf' might be throttled by another user who is simply reading the
sysfs file contents.

So instead of setting attribute mode to 0400, will add a check for
'perfmon_capable()' in perf_stats_show() denying read access to users
without CAP_PERFMON or CAP_SYS_ADMIN.


> Fixes: ??
Right. I will add this in v2.

>
>> Reported-by: Aneesh Kumar K.V 
>> Signed-off-by: Vaibhav Jain 
>
> cheers
>

-- 
Cheers
~ Vaibhav


[PATCH] powerpc/papr_scm: Make access mode of 'perf_stats' attribute file to '0400'

2020-08-07 Thread Vaibhav Jain
The newly introduced 'perf_stats' attribute uses the default access
mode of 0444 letting non-root users access performance stats of an
nvdimm and potentially force the kernel into issuing large number of
expensive HCALLs. Since the information exposed by this attribute
cannot be cached hence its better to ward of access to this attribute
from non-root users.

Hence this patch updates the access-mode of 'perf_stats' sysfs
attribute file to 0400 to make it only readable to root-users.

Reported-by: Aneesh Kumar K.V 
Signed-off-by: Vaibhav Jain 
---
 arch/powerpc/platforms/pseries/papr_scm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index f439f0dfea7d1..31864d167a2ce 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -822,7 +822,7 @@ static ssize_t perf_stats_show(struct device *dev,
kfree(stats);
return rc ? rc : seq_buf_used();
 }
-DEVICE_ATTR_RO(perf_stats);
+DEVICE_ATTR(perf_stats, 0400, perf_stats_show, NULL);
 
 static ssize_t flags_show(struct device *dev,
  struct device_attribute *attr, char *buf)
-- 
2.26.2



[PATCH v4 2/2] powerpc/papr_scm: Add support for fetching nvdimm 'fuel-gauge' metric

2020-07-31 Thread Vaibhav Jain
We add support for reporting 'fuel-gauge' NVDIMM metric via
PAPR_PDSM_HEALTH pdsm payload. 'fuel-gauge' metric indicates the usage
life remaining of a papr-scm compatible NVDIMM. PHYP exposes this
metric via the H_SCM_PERFORMANCE_STATS.

The metric value is returned from the pdsm by extending the return
payload 'struct nd_papr_pdsm_health' without breaking the ABI. A new
field 'dimm_fuel_gauge' to hold the metric value is introduced at the
end of the payload struct and its presence is indicated by by
extension flag PDSM_DIMM_HEALTH_RUN_GAUGE_VALID.

The patch introduces a new function papr_pdsm_fuel_gauge() that is
called from papr_pdsm_health(). If fetching NVDIMM performance stats
is supported then 'papr_pdsm_fuel_gauge()' allocated an output buffer
large enough to hold the performance stat and passes it to
drc_pmem_query_stats() that issues the HCALL to PHYP. The return value
of the stat is then populated in the 'struct
nd_papr_pdsm_health.dimm_fuel_gauge' field with extension flag
'PDSM_DIMM_HEALTH_RUN_GAUGE_VALID' set in 'struct
nd_papr_pdsm_health.extension_flags'

Signed-off-by: Vaibhav Jain 
---
Changelog:

v4:
* Moved a hunk from this patch to previous patch in the series.
  [ Aneesh ]

v3:
* Updated papr_pdsm_fuel_guage() to use the updated
  drc_pmem_query_stats() function.

Resend:
None

v2:
* Restructure code in papr_pdsm_fuel_gauge() to handle error case
first [ Ira ]
* Ignore the return value of papr_pdsm_fuel_gauge() in
papr_psdm_health() [ Ira ]
---
 arch/powerpc/include/uapi/asm/papr_pdsm.h |  9 +
 arch/powerpc/platforms/pseries/papr_scm.c | 49 +++
 2 files changed, 58 insertions(+)

diff --git a/arch/powerpc/include/uapi/asm/papr_pdsm.h 
b/arch/powerpc/include/uapi/asm/papr_pdsm.h
index 9ccecc1d6840..50ef95e2f5b1 100644
--- a/arch/powerpc/include/uapi/asm/papr_pdsm.h
+++ b/arch/powerpc/include/uapi/asm/papr_pdsm.h
@@ -72,6 +72,11 @@
 #define PAPR_PDSM_DIMM_CRITICAL  2
 #define PAPR_PDSM_DIMM_FATAL 3
 
+/* struct nd_papr_pdsm_health.extension_flags field flags */
+
+/* Indicate that the 'dimm_fuel_gauge' field is valid */
+#define PDSM_DIMM_HEALTH_RUN_GAUGE_VALID 1
+
 /*
  * Struct exchanged between kernel & ndctl in for PAPR_PDSM_HEALTH
  * Various flags indicate the health status of the dimm.
@@ -84,6 +89,7 @@
  * dimm_locked : Contents of the dimm cant be modified until CEC reboot
  * dimm_encrypted  : Contents of dimm are encrypted.
  * dimm_health : Dimm health indicator. One of PAPR_PDSM_DIMM_
+ * dimm_fuel_gauge : Life remaining of DIMM as a percentage from 0-100
  */
 struct nd_papr_pdsm_health {
union {
@@ -96,6 +102,9 @@ struct nd_papr_pdsm_health {
__u8 dimm_locked;
__u8 dimm_encrypted;
__u16 dimm_health;
+
+   /* Extension flag PDSM_DIMM_HEALTH_RUN_GAUGE_VALID */
+   __u16 dimm_fuel_gauge;
};
__u8 buf[ND_PDSM_PAYLOAD_MAX_SIZE];
};
diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index f37f3f70007d..f439f0dfea7d 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -518,6 +518,51 @@ static int is_cmd_valid(struct nvdimm *nvdimm, unsigned 
int cmd, void *buf,
return 0;
 }
 
+static int papr_pdsm_fuel_gauge(struct papr_scm_priv *p,
+   union nd_pdsm_payload *payload)
+{
+   int rc, size;
+   u64 statval;
+   struct papr_scm_perf_stat *stat;
+   struct papr_scm_perf_stats *stats;
+
+   /* Silently fail if fetching performance metrics isn't  supported */
+   if (!p->stat_buffer_len)
+   return 0;
+
+   /* Allocate request buffer enough to hold single performance stat */
+   size = sizeof(struct papr_scm_perf_stats) +
+   sizeof(struct papr_scm_perf_stat);
+
+   stats = kzalloc(size, GFP_KERNEL);
+   if (!stats)
+   return -ENOMEM;
+
+   stat = >scm_statistic[0];
+   memcpy(>stat_id, "MemLife ", sizeof(stat->stat_id));
+   stat->stat_val = 0;
+
+   /* Fetch the fuel gauge and populate it in payload */
+   rc = drc_pmem_query_stats(p, stats, 1);
+   if (rc < 0) {
+   dev_dbg(>pdev->dev, "Err(%d) fetching fuel gauge\n", rc);
+   goto free_stats;
+   }
+
+   statval = be64_to_cpu(stat->stat_val);
+   dev_dbg(>pdev->dev,
+   "Fetched fuel-gauge %llu", statval);
+   payload->health.extension_flags |=
+   PDSM_DIMM_HEALTH_RUN_GAUGE_VALID;
+   payload->health.dimm_fuel_gauge = statval;
+
+   rc = sizeof(struct nd_papr_pdsm_health);
+
+free_stats:
+   kfree(stats);
+   return rc;
+}
+
 /* Fetch the DIMM health info and populate it in provided package. */
 static int papr_pdsm_health(s

[PATCH v4 1/2] powerpc/papr_scm: Fetch nvdimm performance stats from PHYP

2020-07-31 Thread Vaibhav Jain
Update papr_scm.c to query dimm performance statistics from PHYP via
H_SCM_PERFORMANCE_STATS hcall and export them to user-space as PAPR
specific NVDIMM attribute 'perf_stats' in sysfs. The patch also
provide a sysfs ABI documentation for the stats being reported and
their meanings.

During NVDIMM probe time in papr_scm_nvdimm_init() a special variant
of H_SCM_PERFORMANCE_STATS hcall is issued to check if collection of
performance statistics is supported or not. If successful then a PHYP
returns a maximum possible buffer length needed to read all
performance stats. This returned value is stored in a per-nvdimm
attribute 'stat_buffer_len'.

The layout of request buffer for reading NVDIMM performance stats from
PHYP is defined in 'struct papr_scm_perf_stats' and 'struct
papr_scm_perf_stat'. These structs are used in newly introduced
drc_pmem_query_stats() that issues the H_SCM_PERFORMANCE_STATS hcall.

The sysfs access function perf_stats_show() uses value
'stat_buffer_len' to allocate a buffer large enough to hold all
possible NVDIMM performance stats and passes it to
drc_pmem_query_stats() to populate. Finally statistics reported in the
buffer are formatted into the sysfs access function output buffer.

Signed-off-by: Vaibhav Jain 
---
Changelog:

v4:
* Fixed a build issue with this patch by moving a hunk from second
  patch in series to this patch. [ Aneesh ]

v3:
* Updated drc_pmem_query_stats() to not require 'buff_size' and 'out'
  args to the function. Instead 'buff_size' is calculated from
  'num_stats' and instead of populating 'R4' in arg 'out' the value is
  returned from the function in case 'R4' represents
  'max-buffer-size'.

Resend:
None

v2:
* Updated 'struct papr_scm_perf_stats' and 'struct papr_scm_perf_stat'
to use big-endian types. [ Aneesh ]
* s/len_stat_buffer/stat_buffer_len/ [ Aneesh ]
* s/statistics_id/stat_id/ , s/statistics_val/stat_val/ [ Aneesh ]
* Conversion from Big endian to cpu endian happens later rather than
just after its fetched from PHYP.
* Changed a log statement to unambiguously report dimm performance
stats are not available for the given nvdimm [ Ira ]
* Restructed some code to handle error case first [ Ira ]
---
 Documentation/ABI/testing/sysfs-bus-papr-pmem |  27 
 arch/powerpc/platforms/pseries/papr_scm.c | 150 ++
 2 files changed, 177 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-bus-papr-pmem 
b/Documentation/ABI/testing/sysfs-bus-papr-pmem
index 5b10d036a8d4..c1a67275c43f 100644
--- a/Documentation/ABI/testing/sysfs-bus-papr-pmem
+++ b/Documentation/ABI/testing/sysfs-bus-papr-pmem
@@ -25,3 +25,30 @@ Description:
  NVDIMM have been scrubbed.
* "locked"  : Indicating that NVDIMM contents cant
  be modified until next power cycle.
+
+What:  /sys/bus/nd/devices/nmemX/papr/perf_stats
+Date:  May, 2020
+KernelVersion: v5.9
+Contact:   linuxppc-dev , 
linux-nvd...@lists.01.org,
+Description:
+   (RO) Report various performance stats related to papr-scm NVDIMM
+   device.  Each stat is reported on a new line with each line
+   composed of a stat-identifier followed by it value. Below are
+   currently known dimm performance stats which are reported:
+
+   * "CtlResCt" : Controller Reset Count
+   * "CtlResTm" : Controller Reset Elapsed Time
+   * "PonSecs " : Power-on Seconds
+   * "MemLife " : Life Remaining
+   * "CritRscU" : Critical Resource Utilization
+   * "HostLCnt" : Host Load Count
+   * "HostSCnt" : Host Store Count
+   * "HostSDur" : Host Store Duration
+   * "HostLDur" : Host Load Duration
+   * "MedRCnt " : Media Read Count
+   * "MedWCnt " : Media Write Count
+   * "MedRDur " : Media Read Duration
+   * "MedWDur " : Media Write Duration
+   * "CchRHCnt" : Cache Read Hit Count
+   * "CchWHCnt" : Cache Write Hit Count
+   * "FastWCnt" : Fast Write Count
\ No newline at end of file
diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index 3d1235a76ba9..f37f3f70007d 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -64,6 +64,26 @@
PAPR_PMEM_HEALTH_FATAL |\
PAPR_PMEM_HEALTH_UNHEALTHY)
 
+#define PAPR_SCM_PERF_STATS_EYECATCHER __stringify(SCMSTATS)
+#define PAPR_SCM_PERF_STATS_VERSION 0x1
+
+/* Struct holding a single performance metric */
+struct papr_scm_perf_stat {
+   u8 stat_id[8];
+   __be64 stat_

[PATCH v4 0/2] powerpc/papr_scm: add support for reporting NVDIMM 'life_used_percentage' metric

2020-07-31 Thread Vaibhav Jain
Changes since v3[1]:

* Fixed a rebase issue pointed out by Aneesh in first patch in the series.

[1] 
https://lore.kernel.org/linux-nvdimm/20200730121303.134230-1-vaib...@linux.ibm.com
---

This small patchset implements kernel side support for reporting
'life_used_percentage' metric in NDCTL with dimm health output for
papr-scm NVDIMMs. With corresponding NDCTL side changes output for
should be like:

$ sudo ndctl list -DH
[
  {
"dev":"nmem0",
"health":{
  "health_state":"ok",
  "life_used_percentage":0,
  "shutdown_state":"clean"
}
  }
]

PHYP supports H_SCM_PERFORMANCE_STATS hcall through which an LPAR can
fetch various performance stats including 'fuel_gauge' percentage for
an NVDIMM. 'fuel_gauge' metric indicates the usable life remaining of
an NVDIMM expressed as percentage and  'life_used_percentage' can be
calculated as 'life_used_percentage = 100 - fuel_gauge'.

Structure of the patchset
=
First patch implements necessary scaffolding needed to issue the
H_SCM_PERFORMANCE_STATS hcall and fetch performance stats
catalogue. The patch also implements support for 'perf_stats' sysfs
attribute to report the full catalogue of supported performance stats
by PHYP.

Second and final patch implements support for sending this value to
libndctl by extending the PAPR_PDSM_HEALTH pdsm payload to add a new
field named 'dimm_fuel_gauge' to it.

Vaibhav Jain (2):
  powerpc/papr_scm: Fetch nvdimm performance stats from PHYP
  powerpc/papr_scm: Add support for fetching nvdimm 'fuel-gauge' metric

 Documentation/ABI/testing/sysfs-bus-papr-pmem |  27 +++
 arch/powerpc/include/uapi/asm/papr_pdsm.h |   9 +
 arch/powerpc/platforms/pseries/papr_scm.c | 199 ++
 3 files changed, 235 insertions(+)

-- 
2.26.2



[PATCH v3 2/2] powerpc/papr_scm: Add support for fetching nvdimm 'fuel-gauge' metric

2020-07-30 Thread Vaibhav Jain
We add support for reporting 'fuel-gauge' NVDIMM metric via
PAPR_PDSM_HEALTH pdsm payload. 'fuel-gauge' metric indicates the usage
life remaining of a papr-scm compatible NVDIMM. PHYP exposes this
metric via the H_SCM_PERFORMANCE_STATS.

The metric value is returned from the pdsm by extending the return
payload 'struct nd_papr_pdsm_health' without breaking the ABI. A new
field 'dimm_fuel_gauge' to hold the metric value is introduced at the
end of the payload struct and its presence is indicated by by
extension flag PDSM_DIMM_HEALTH_RUN_GAUGE_VALID.

The patch introduces a new function papr_pdsm_fuel_gauge() that is
called from papr_pdsm_health(). If fetching NVDIMM performance stats
is supported then 'papr_pdsm_fuel_gauge()' allocated an output buffer
large enough to hold the performance stat and passes it to
drc_pmem_query_stats() that issues the HCALL to PHYP. The return value
of the stat is then populated in the 'struct
nd_papr_pdsm_health.dimm_fuel_gauge' field with extension flag
'PDSM_DIMM_HEALTH_RUN_GAUGE_VALID' set in 'struct
nd_papr_pdsm_health.extension_flags'

Signed-off-by: Vaibhav Jain 
---
Changelog:

v3:
* Updated papr_pdsm_fuel_guage() to use the updated
  drc_pmem_query_stats() function.

Resend:
None

v2:
* Restructure code in papr_pdsm_fuel_gauge() to handle error case
first [ Ira ]
* Ignore the return value of papr_pdsm_fuel_gauge() in
papr_psdm_health() [ Ira ]
---
 arch/powerpc/include/uapi/asm/papr_pdsm.h |  9 
 arch/powerpc/platforms/pseries/papr_scm.c | 51 ++-
 2 files changed, 59 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/uapi/asm/papr_pdsm.h 
b/arch/powerpc/include/uapi/asm/papr_pdsm.h
index 9ccecc1d6840..50ef95e2f5b1 100644
--- a/arch/powerpc/include/uapi/asm/papr_pdsm.h
+++ b/arch/powerpc/include/uapi/asm/papr_pdsm.h
@@ -72,6 +72,11 @@
 #define PAPR_PDSM_DIMM_CRITICAL  2
 #define PAPR_PDSM_DIMM_FATAL 3
 
+/* struct nd_papr_pdsm_health.extension_flags field flags */
+
+/* Indicate that the 'dimm_fuel_gauge' field is valid */
+#define PDSM_DIMM_HEALTH_RUN_GAUGE_VALID 1
+
 /*
  * Struct exchanged between kernel & ndctl in for PAPR_PDSM_HEALTH
  * Various flags indicate the health status of the dimm.
@@ -84,6 +89,7 @@
  * dimm_locked : Contents of the dimm cant be modified until CEC reboot
  * dimm_encrypted  : Contents of dimm are encrypted.
  * dimm_health : Dimm health indicator. One of PAPR_PDSM_DIMM_
+ * dimm_fuel_gauge : Life remaining of DIMM as a percentage from 0-100
  */
 struct nd_papr_pdsm_health {
union {
@@ -96,6 +102,9 @@ struct nd_papr_pdsm_health {
__u8 dimm_locked;
__u8 dimm_encrypted;
__u16 dimm_health;
+
+   /* Extension flag PDSM_DIMM_HEALTH_RUN_GAUGE_VALID */
+   __u16 dimm_fuel_gauge;
};
__u8 buf[ND_PDSM_PAYLOAD_MAX_SIZE];
};
diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index 29cab86141d8..837a21083268 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -518,6 +518,51 @@ static int is_cmd_valid(struct nvdimm *nvdimm, unsigned 
int cmd, void *buf,
return 0;
 }
 
+static int papr_pdsm_fuel_gauge(struct papr_scm_priv *p,
+   union nd_pdsm_payload *payload)
+{
+   int rc, size;
+   u64 statval;
+   struct papr_scm_perf_stat *stat;
+   struct papr_scm_perf_stats *stats;
+
+   /* Silently fail if fetching performance metrics isn't  supported */
+   if (!p->stat_buffer_len)
+   return 0;
+
+   /* Allocate request buffer enough to hold single performance stat */
+   size = sizeof(struct papr_scm_perf_stats) +
+   sizeof(struct papr_scm_perf_stat);
+
+   stats = kzalloc(size, GFP_KERNEL);
+   if (!stats)
+   return -ENOMEM;
+
+   stat = >scm_statistic[0];
+   memcpy(>stat_id, "MemLife ", sizeof(stat->stat_id));
+   stat->stat_val = 0;
+
+   /* Fetch the fuel gauge and populate it in payload */
+   rc = drc_pmem_query_stats(p, stats, 1);
+   if (rc < 0) {
+   dev_dbg(>pdev->dev, "Err(%d) fetching fuel gauge\n", rc);
+   goto free_stats;
+   }
+
+   statval = be64_to_cpu(stat->stat_val);
+   dev_dbg(>pdev->dev,
+   "Fetched fuel-gauge %llu", statval);
+   payload->health.extension_flags |=
+   PDSM_DIMM_HEALTH_RUN_GAUGE_VALID;
+   payload->health.dimm_fuel_gauge = statval;
+
+   rc = sizeof(struct nd_papr_pdsm_health);
+
+free_stats:
+   kfree(stats);
+   return rc;
+}
+
 /* Fetch the DIMM health info and populate it in provided package. */
 static int papr_pdsm_health(struct papr_scm_priv *p,
union nd_pdsm_

[PATCH v3 0/2] powerpc/papr_scm: add support for reporting NVDIMM 'life_used_percentage' metric

2020-07-30 Thread Vaibhav Jain
Changes since v2[1]:

* Updated drc_pmem_query_stats() to reduce the number of input args
  to the function based suggestions from Aneesh.

[1] 
https://lore.kernel.org/linux-nvdimm/20200726122030.31529-1-vaib...@linux.ibm.com
---

This small patchset implements kernel side support for reporting
'life_used_percentage' metric in NDCTL with dimm health output for
papr-scm NVDIMMs. With corresponding NDCTL side changes output for
should be like:

$ sudo ndctl list -DH
[
  {
"dev":"nmem0",
"health":{
  "health_state":"ok",
  "life_used_percentage":0,
  "shutdown_state":"clean"
}
  }
]

PHYP supports H_SCM_PERFORMANCE_STATS hcall through which an LPAR can
fetch various performance stats including 'fuel_gauge' percentage for
an NVDIMM. 'fuel_gauge' metric indicates the usable life remaining of
an NVDIMM expressed as percentage and  'life_used_percentage' can be
calculated as 'life_used_percentage = 100 - fuel_gauge'.

Structure of the patchset
=
First patch implements necessary scaffolding needed to issue the
H_SCM_PERFORMANCE_STATS hcall and fetch performance stats
catalogue. The patch also implements support for 'perf_stats' sysfs
attribute to report the full catalogue of supported performance stats
by PHYP.

Second and final patch implements support for sending this value to
libndctl by extending the PAPR_PDSM_HEALTH pdsm payload to add a new
field named 'dimm_fuel_gauge' to it.

Vaibhav Jain (2):
  powerpc/papr_scm: Fetch nvdimm performance stats from PHYP
  powerpc/papr_scm: Add support for fetching nvdimm 'fuel-gauge' metric

 Documentation/ABI/testing/sysfs-bus-papr-pmem |  27 +++
 arch/powerpc/include/uapi/asm/papr_pdsm.h |   9 +
 arch/powerpc/platforms/pseries/papr_scm.c | 199 ++
 3 files changed, 235 insertions(+)

-- 
2.26.2



[PATCH v3 1/2] powerpc/papr_scm: Fetch nvdimm performance stats from PHYP

2020-07-30 Thread Vaibhav Jain
Update papr_scm.c to query dimm performance statistics from PHYP via
H_SCM_PERFORMANCE_STATS hcall and export them to user-space as PAPR
specific NVDIMM attribute 'perf_stats' in sysfs. The patch also
provide a sysfs ABI documentation for the stats being reported and
their meanings.

During NVDIMM probe time in papr_scm_nvdimm_init() a special variant
of H_SCM_PERFORMANCE_STATS hcall is issued to check if collection of
performance statistics is supported or not. If successful then a PHYP
returns a maximum possible buffer length needed to read all
performance stats. This returned value is stored in a per-nvdimm
attribute 'stat_buffer_len'.

The layout of request buffer for reading NVDIMM performance stats from
PHYP is defined in 'struct papr_scm_perf_stats' and 'struct
papr_scm_perf_stat'. These structs are used in newly introduced
drc_pmem_query_stats() that issues the H_SCM_PERFORMANCE_STATS hcall.

The sysfs access function perf_stats_show() uses value
'stat_buffer_len' to allocate a buffer large enough to hold all
possible NVDIMM performance stats and passes it to
drc_pmem_query_stats() to populate. Finally statistics reported in the
buffer are formatted into the sysfs access function output buffer.

Signed-off-by: Vaibhav Jain 
---
Changelog:

v3:
* Updated drc_pmem_query_stats() to not require 'buff_size' and 'out'
  args to the function. Instead 'buff_size' is calculated from
  'num_stats' and instead of populating 'R4' in arg 'out' the value is
  returned from the function in case 'R4' represents
  'max-buffer-size'. [ Aneesh ]

Resend:
None

v2:
* Updated 'struct papr_scm_perf_stats' and 'struct papr_scm_perf_stat'
to use big-endian types. [ Aneesh ]
* s/len_stat_buffer/stat_buffer_len/ [ Aneesh ]
* s/statistics_id/stat_id/ , s/statistics_val/stat_val/ [ Aneesh ]
* Conversion from Big endian to cpu endian happens later rather than
just after its fetched from PHYP.
* Changed a log statement to unambiguously report dimm performance
stats are not available for the given nvdimm [ Ira ]
* Restructed some code to handle error case first [ Ira ]
---
 Documentation/ABI/testing/sysfs-bus-papr-pmem |  27 
 arch/powerpc/platforms/pseries/papr_scm.c | 150 ++
 2 files changed, 177 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-bus-papr-pmem 
b/Documentation/ABI/testing/sysfs-bus-papr-pmem
index 5b10d036a8d4..c1a67275c43f 100644
--- a/Documentation/ABI/testing/sysfs-bus-papr-pmem
+++ b/Documentation/ABI/testing/sysfs-bus-papr-pmem
@@ -25,3 +25,30 @@ Description:
  NVDIMM have been scrubbed.
* "locked"  : Indicating that NVDIMM contents cant
  be modified until next power cycle.
+
+What:  /sys/bus/nd/devices/nmemX/papr/perf_stats
+Date:  May, 2020
+KernelVersion: v5.9
+Contact:   linuxppc-dev , 
linux-nvd...@lists.01.org,
+Description:
+   (RO) Report various performance stats related to papr-scm NVDIMM
+   device.  Each stat is reported on a new line with each line
+   composed of a stat-identifier followed by it value. Below are
+   currently known dimm performance stats which are reported:
+
+   * "CtlResCt" : Controller Reset Count
+   * "CtlResTm" : Controller Reset Elapsed Time
+   * "PonSecs " : Power-on Seconds
+   * "MemLife " : Life Remaining
+   * "CritRscU" : Critical Resource Utilization
+   * "HostLCnt" : Host Load Count
+   * "HostSCnt" : Host Store Count
+   * "HostSDur" : Host Store Duration
+   * "HostLDur" : Host Load Duration
+   * "MedRCnt " : Media Read Count
+   * "MedWCnt " : Media Write Count
+   * "MedRDur " : Media Read Duration
+   * "MedWDur " : Media Write Duration
+   * "CchRHCnt" : Cache Read Hit Count
+   * "CchWHCnt" : Cache Write Hit Count
+   * "FastWCnt" : Fast Write Count
\ No newline at end of file
diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index 8fd441d32487..29cab86141d8 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -64,6 +64,26 @@
PAPR_PMEM_HEALTH_FATAL |\
PAPR_PMEM_HEALTH_UNHEALTHY)
 
+#define PAPR_SCM_PERF_STATS_EYECATCHER __stringify(SCMSTATS)
+#define PAPR_SCM_PERF_STATS_VERSION 0x1
+
+/* Struct holding a single performance metric */
+struct papr_scm_perf_stat {
+   u8 stat_id[8];
+   __be64 stat_val;
+} __packed;
+
+/* Struct exchanged between kernel and PHYP for fetching drc perf stats */
+struct papr_scm_perf_stats {
+   

[RESEND PATCH v2 2/2] powerpc/papr_scm: Add support for fetching nvdimm 'fuel-gauge' metric

2020-07-26 Thread Vaibhav Jain
We add support for reporting 'fuel-gauge' NVDIMM metric via
PAPR_PDSM_HEALTH pdsm payload. 'fuel-gauge' metric indicates the usage
life remaining of a papr-scm compatible NVDIMM. PHYP exposes this
metric via the H_SCM_PERFORMANCE_STATS.

The metric value is returned from the pdsm by extending the return
payload 'struct nd_papr_pdsm_health' without breaking the ABI. A new
field 'dimm_fuel_gauge' to hold the metric value is introduced at the
end of the payload struct and its presence is indicated by by
extension flag PDSM_DIMM_HEALTH_RUN_GAUGE_VALID.

The patch introduces a new function papr_pdsm_fuel_gauge() that is
called from papr_pdsm_health(). If fetching NVDIMM performance stats
is supported then 'papr_pdsm_fuel_gauge()' allocated an output buffer
large enough to hold the performance stat and passes it to
drc_pmem_query_stats() that issues the HCALL to PHYP. The return value
of the stat is then populated in the 'struct
nd_papr_pdsm_health.dimm_fuel_gauge' field with extension flag
'PDSM_DIMM_HEALTH_RUN_GAUGE_VALID' set in 'struct
nd_papr_pdsm_health.extension_flags'

Signed-off-by: Vaibhav Jain 
---
Changelog:

Resend:
None

v2:
* Restructure code in papr_pdsm_fuel_gauge() to handle error case
first [ Ira ]
* Ignore the return value of papr_pdsm_fuel_gauge() in
papr_psdm_health() [ Ira ]
---
 arch/powerpc/include/uapi/asm/papr_pdsm.h |  9 +
 arch/powerpc/platforms/pseries/papr_scm.c | 49 +++
 2 files changed, 58 insertions(+)

diff --git a/arch/powerpc/include/uapi/asm/papr_pdsm.h 
b/arch/powerpc/include/uapi/asm/papr_pdsm.h
index 9ccecc1d6840..50ef95e2f5b1 100644
--- a/arch/powerpc/include/uapi/asm/papr_pdsm.h
+++ b/arch/powerpc/include/uapi/asm/papr_pdsm.h
@@ -72,6 +72,11 @@
 #define PAPR_PDSM_DIMM_CRITICAL  2
 #define PAPR_PDSM_DIMM_FATAL 3
 
+/* struct nd_papr_pdsm_health.extension_flags field flags */
+
+/* Indicate that the 'dimm_fuel_gauge' field is valid */
+#define PDSM_DIMM_HEALTH_RUN_GAUGE_VALID 1
+
 /*
  * Struct exchanged between kernel & ndctl in for PAPR_PDSM_HEALTH
  * Various flags indicate the health status of the dimm.
@@ -84,6 +89,7 @@
  * dimm_locked : Contents of the dimm cant be modified until CEC reboot
  * dimm_encrypted  : Contents of dimm are encrypted.
  * dimm_health : Dimm health indicator. One of PAPR_PDSM_DIMM_
+ * dimm_fuel_gauge : Life remaining of DIMM as a percentage from 0-100
  */
 struct nd_papr_pdsm_health {
union {
@@ -96,6 +102,9 @@ struct nd_papr_pdsm_health {
__u8 dimm_locked;
__u8 dimm_encrypted;
__u16 dimm_health;
+
+   /* Extension flag PDSM_DIMM_HEALTH_RUN_GAUGE_VALID */
+   __u16 dimm_fuel_gauge;
};
__u8 buf[ND_PDSM_PAYLOAD_MAX_SIZE];
};
diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index 3172a018..80261d241831 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -504,6 +504,51 @@ static int is_cmd_valid(struct nvdimm *nvdimm, unsigned 
int cmd, void *buf,
return 0;
 }
 
+static int papr_pdsm_fuel_gauge(struct papr_scm_priv *p,
+   union nd_pdsm_payload *payload)
+{
+   int rc, size;
+   u64 statval;
+   struct papr_scm_perf_stat *stat;
+   struct papr_scm_perf_stats *stats;
+
+   /* Silently fail if fetching performance metrics isn't  supported */
+   if (!p->stat_buffer_len)
+   return 0;
+
+   /* Allocate request buffer enough to hold single performance stat */
+   size = sizeof(struct papr_scm_perf_stats) +
+   sizeof(struct papr_scm_perf_stat);
+
+   stats = kzalloc(size, GFP_KERNEL);
+   if (!stats)
+   return -ENOMEM;
+
+   stat = >scm_statistic[0];
+   memcpy(>stat_id, "MemLife ", sizeof(stat->stat_id));
+   stat->stat_val = 0;
+
+   /* Fetch the fuel gauge and populate it in payload */
+   rc = drc_pmem_query_stats(p, stats, size, 1, NULL);
+   if (rc) {
+   dev_dbg(>pdev->dev, "Err(%d) fetching fuel gauge\n", rc);
+   goto free_stats;
+   }
+
+   statval = be64_to_cpu(stat->stat_val);
+   dev_dbg(>pdev->dev,
+   "Fetched fuel-gauge %llu", statval);
+   payload->health.extension_flags |=
+ PDSM_DIMM_HEALTH_RUN_GAUGE_VALID;
+   payload->health.dimm_fuel_gauge = statval;
+
+   rc = sizeof(struct nd_papr_pdsm_health);
+
+free_stats:
+   kfree(stats);
+   return rc;
+}
+
 /* Fetch the DIMM health info and populate it in provided package. */
 static int papr_pdsm_health(struct papr_scm_priv *p,
union nd_pdsm_payload *payload)
@@ -544,6 +589,10 @@ static int papr_pdsm_health(struct papr_scm_priv *p,
 
/* st

  1   2   3   4   5   >