Re: [PATCH] powerpc/mce: Fix access error in mce handler

2021-09-07 Thread Ganesh

On 9/8/21 11:10 AM, Michael Ellerman wrote:


Ganesh  writes:

On 9/6/21 6:03 PM, Michael Ellerman wrote:


Ganesh Goudar  writes

Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
CPU: 5 PID: 1883 Comm: insmod Tainted: GOE 5.14.0-mce+ #137
NIP:  c0735d60 LR: c0318640 CTR: 
REGS: c0001ebff9a0 TRAP: 0300   Tainted: G   OE  (5.14.0-mce+)
MSR:  80001003   CR: 28008228  XER: 0001
CFAR: c031863c DAR: c0027fa8fe08 DSISR: 4000 IRQMASK: 0
GPR00: c03186d0 c0001ebffc40 c1b0df00 c16337e8
GPR04: c16337e8 c0027fa8fe08 0023 c16337f0
GPR08: 0023 c12ffe08  c00801460240
GPR12:  c0001ec9a900 c0002ac4bd00 
GPR16: 05a0 c008006b c008006b05a0 c0ff3068
GPR20: c0002ac4bbc0 0001 c0002ac4bbc0 c00801490298
GPR24: c00801490108 c1636198 c00801470090 c00801470058
GPR28: 0510 c0080100 c0080819 0019
NIP [c0735d60] llist_add_batch+0x0/0x40
LR [c0318640] __irq_work_queue_local+0x70/0xc0
Call Trace:
[c0001ebffc40] [c0001ebffc0c] 0xc0001ebffc0c (unreliable)
[c0001ebffc60] [c03186d0] irq_work_queue+0x40/0x70
[c0001ebffc80] [c004425c] machine_check_queue_event+0xbc/0xd0
[c0001ebffcf0] [c000838c] machine_check_early_common+0x16c/0x1f4

Fixes: 74c3354bc1d89 ("powerpc/pseries/mce: restore msr before returning from 
handler")

Please explain in more detail why that commit caused this breakage.

After enabling translation in mce_handle_error() we used to leave it enabled to 
avoid
crashing here, but now with this commit we are restoring the MSR to disable 
translation.

Are you sure we left the MMU enabled to avoid crashing there, or we just
left it enabled by accident?


No, I think we left it enabled intentionally, I mentioned about leaving it 
enabled
in my comment and commit message of a95a0a1654 "powerpc/pseries: Fix MCE handling on 
pseries".



But yeah, previously the MMU was enabled when we got here whereas now
it's not, because of that change.


Missed to mention it in commit log, I will add it.

Thanks.


diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
index 47a683cd00d2..9d1e39d42e3e 100644
--- a/arch/powerpc/kernel/mce.c
+++ b/arch/powerpc/kernel/mce.c
@@ -249,6 +249,7 @@ void machine_check_queue_event(void)
   {
int index;
struct machine_check_event evt;
+   unsigned long msr;
   
   	if (!get_mce_event(&evt, MCE_EVENT_RELEASE))

return;
@@ -262,8 +263,19 @@ void machine_check_queue_event(void)
memcpy(&local_paca->mce_info->mce_event_queue[index],
   &evt, sizeof(evt));
   
-	/* Queue irq work to process this event later. */

-   irq_work_queue(&mce_event_process_work);
+   /* Queue irq work to process this event later. Before
+* queuing the work enable translation for non radix LPAR,
+* as irq_work_queue may try to access memory outside RMO
+* region.
+*/
+   if (!radix_enabled() && firmware_has_feature(FW_FEATURE_LPAR)) {
+   msr = mfmsr();
+   mtmsr(msr | MSR_IR | MSR_DR);
+   irq_work_queue(&mce_event_process_work);
+   mtmsr(msr);
+   } else {
+   irq_work_queue(&mce_event_process_work);
+   }
   }

We already went to virtual mode and queued (different) irq work in
arch/powerpc/platforms/pseries/ras.c:mce_handle_error()

We also called save_mce_event() which also might have queued irq work,
via machine_check_ue_event().

So it really feels like something about the design is wrong if we have
to go to virtual mode again and queue more irq work here.

I guess we can probably merge this as a backportable fix, doing anything
else would be a bigger change.

I agree.


Looking at ras.c there's the comment:

 * Enable translation as we will be accessing per-cpu variables
 * in save_mce_event() which may fall outside RMO region, also

But AFAICS it's only irq_work_queue() that touches anything percpu?

Yeah, we left the comment unchanged after doing some modifications around it,
It needs to be updated, ill send a separate patch for it.

Thanks.

I see some other comments that look out of date, ie. the one above
machine_check_process_queued_event() mentions syscall exit, which is no
longer true.


ill take care of it.



There's also comments in pseries/ras.c about fwnmi_release_errinfo()
crashing in real mode, but we call it in real mode now so that must be
fixed?


Yes, it is fixed now.




So maybe we should just not be using irq_work_queue(). It's a pretty
thin wrapper around set_dec(1), perhaps we just need to hand-roll some
real-mode friendly way of doing that.

You mean, have separate 

Re: [PATCH kernel v2] KVM: PPC: Merge powerpc's debugfs entry content into generic entry

2021-09-07 Thread Alexey Kardashevskiy

[hopefulle fixed my thunderbird now]

Huh, not sure anymore after reading d56f5136b0102 "KVM: let 
kvm_destroy_vm_debugfs clean up vCPU debugfs directories" which remove

debugfs_dentry from vcpu. Paolo?


On 05/09/2021 12:27, Alexey Kardashevskiy wrote:

Please ignore this one, v3 is coming.

After I posted this, I suddenly realized that the vcpu debugfs entry
remain until the VM exists and this does not handle vcpu
hotunplug+hotplug (the ppc book3e did handle this). Thanks,


On 04/09/2021 23:35, Alexey Kardashevskiy wrote:

At the moment the generic KVM code creates an "%pid-%fd" entry per a KVM
instance; and the PPC HV KVM creates its own at "vm%pid". The Book3E KVM
creates its own entry for timings.

The problems with the PPC entries are:
1. they do not allow multiple VMs in the same process (which is extremely
rare case mostly used by syzkaller fuzzer);
2. prone to race bugs like the generic KVM code had fixed in
commit 85cd39af14f4 ("KVM: Do not leak memory for duplicate debugfs
directories").

This defines kvm_arch_create_kvm_debugfs() similar to one for vcpus.

This defines 2 hooks in kvmppc_ops for allowing specific KVM
implementations to add necessary entries. This defines handlers
for HV KVM and defines the Book3E debugfs vcpu helper as a handler.

This makes use of already existing kvm_arch_create_vcpu_debugfs
on PPC.

This removes no more used debugfs_dir pointers from PPC kvm_arch structs.

Suggested-by: Fabiano Rosas 
Signed-off-by: Alexey Kardashevskiy 
---
Changes:
v2:
* handled powerpc-booke
* s/kvm/vm/ in arch hooks
---
   arch/powerpc/include/asm/kvm_host.h|  7 +++---
   arch/powerpc/include/asm/kvm_ppc.h |  2 ++
   arch/powerpc/kvm/timing.h  |  7 +++---
   include/linux/kvm_host.h   |  3 +++
   arch/powerpc/kvm/book3s_64_mmu_hv.c|  2 +-
   arch/powerpc/kvm/book3s_64_mmu_radix.c |  2 +-
   arch/powerpc/kvm/book3s_hv.c   | 30 +-
   arch/powerpc/kvm/e500.c|  1 +
   arch/powerpc/kvm/e500mc.c  |  1 +
   arch/powerpc/kvm/powerpc.c | 15 ++---
   arch/powerpc/kvm/timing.c  | 20 -
   virt/kvm/kvm_main.c|  3 +++
   12 files changed, 44 insertions(+), 49 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 2bcac6da0a4b..f29b66cc2163 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -296,7 +296,6 @@ struct kvm_arch {
bool dawr1_enabled;
pgd_t *pgtable;
u64 process_table;
-   struct dentry *debugfs_dir;
struct kvm_resize_hpt *resize_hpt; /* protected by kvm->lock */
   #endif /* CONFIG_KVM_BOOK3S_HV_POSSIBLE */
   #ifdef CONFIG_KVM_BOOK3S_PR_POSSIBLE
@@ -672,7 +671,6 @@ struct kvm_vcpu_arch {
u64 timing_min_duration[__NUMBER_OF_KVM_EXIT_TYPES];
u64 timing_max_duration[__NUMBER_OF_KVM_EXIT_TYPES];
u64 timing_last_exit;
-   struct dentry *debugfs_exit_timing;
   #endif
   
   #ifdef CONFIG_PPC_BOOK3S

@@ -828,8 +826,6 @@ struct kvm_vcpu_arch {
struct kvmhv_tb_accumulator rm_exit;/* real-mode exit code */
struct kvmhv_tb_accumulator guest_time; /* guest execution */
struct kvmhv_tb_accumulator cede_time;  /* time napping inside guest */
-
-   struct dentry *debugfs_dir;
   #endif /* CONFIG_KVM_BOOK3S_HV_EXIT_TIMING */
   };
   
@@ -868,4 +864,7 @@ static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}

   static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}
   static inline void kvm_arch_vcpu_block_finish(struct kvm_vcpu *vcpu) {}
   
+#define __KVM_HAVE_ARCH_VCPU_DEBUGFS

+#define __KVM_HAVE_ARCH_KVM_DEBUGFS
+
   #endif /* __POWERPC_KVM_HOST_H__ */
diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index 6355a6980ccf..fd841e844b90 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -316,6 +316,8 @@ struct kvmppc_ops {
int (*svm_off)(struct kvm *kvm);
int (*enable_dawr1)(struct kvm *kvm);
bool (*hash_v3_possible)(void);
+   void (*create_vm_debugfs)(struct kvm *kvm);
+   void (*create_vcpu_debugfs)(struct kvm_vcpu *vcpu, struct dentry 
*debugfs_dentry);
   };
   
   extern struct kvmppc_ops *kvmppc_hv_ops;

diff --git a/arch/powerpc/kvm/timing.h b/arch/powerpc/kvm/timing.h
index feef7885ba82..36f7c201c6f1 100644
--- a/arch/powerpc/kvm/timing.h
+++ b/arch/powerpc/kvm/timing.h
@@ -14,8 +14,8 @@
   #ifdef CONFIG_KVM_EXIT_TIMING
   void kvmppc_init_timing_stats(struct kvm_vcpu *vcpu);
   void kvmppc_update_timing_stats(struct kvm_vcpu *vcpu);
-void kvmppc_create_vcpu_debugfs(struct kvm_vcpu *vcpu, unsigned int id);
-void kvmppc_remove_vcpu_debugfs(struct kvm_vcpu *vcpu);
+void kvmppc_create_vcpu_debugfs(struct kvm_vcpu *vcpu,
+   struct dentry *debugfs_dentry);
   
   static inline void kv

Re: [PATCH] powerpc/mce: Fix access error in mce handler

2021-09-07 Thread Michael Ellerman
Ganesh  writes:
> On 9/6/21 6:03 PM, Michael Ellerman wrote:
>
>> Ganesh Goudar  writes:
>>> We queue an irq work for deferred processing of mce event
>>> in realmode mce handler, where translation is disabled.
>>> Queuing of the work may result in accessing memory outside
>>> RMO region, such access needs the translation to be enabled
>>> for an LPAR running with hash mmu else the kernel crashes.
>>>
>>> So enable the translation before queuing the work.
>>>
>>> Without this change following trace is seen on injecting machine
>>> check error in an LPAR running with hash mmu.
>> What type of error are you injecting?
>
> SLB multihit in kernel mode.
>
>>
>>> Oops: Kernel access of bad area, sig: 11 [#1]
>>> LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
>>> CPU: 5 PID: 1883 Comm: insmod Tainted: GOE 5.14.0-mce+ #137
>>> NIP:  c0735d60 LR: c0318640 CTR: 
>>> REGS: c0001ebff9a0 TRAP: 0300   Tainted: G   OE  (5.14.0-mce+)
>>> MSR:  80001003   CR: 28008228  XER: 0001
>>> CFAR: c031863c DAR: c0027fa8fe08 DSISR: 4000 IRQMASK: 0
>>> GPR00: c03186d0 c0001ebffc40 c1b0df00 c16337e8
>>> GPR04: c16337e8 c0027fa8fe08 0023 c16337f0
>>> GPR08: 0023 c12ffe08  c00801460240
>>> GPR12:  c0001ec9a900 c0002ac4bd00 
>>> GPR16: 05a0 c008006b c008006b05a0 c0ff3068
>>> GPR20: c0002ac4bbc0 0001 c0002ac4bbc0 c00801490298
>>> GPR24: c00801490108 c1636198 c00801470090 c00801470058
>>> GPR28: 0510 c0080100 c0080819 0019
>>> NIP [c0735d60] llist_add_batch+0x0/0x40
>>> LR [c0318640] __irq_work_queue_local+0x70/0xc0
>>> Call Trace:
>>> [c0001ebffc40] [c0001ebffc0c] 0xc0001ebffc0c (unreliable)
>>> [c0001ebffc60] [c03186d0] irq_work_queue+0x40/0x70
>>> [c0001ebffc80] [c004425c] machine_check_queue_event+0xbc/0xd0
>>> [c0001ebffcf0] [c000838c] machine_check_early_common+0x16c/0x1f4
>>>
>>> Fixes: 74c3354bc1d89 ("powerpc/pseries/mce: restore msr before returning 
>>> from handler")
>> Please explain in more detail why that commit caused this breakage.
>
> After enabling translation in mce_handle_error() we used to leave it enabled 
> to avoid
> crashing here, but now with this commit we are restoring the MSR to disable 
> translation.

Are you sure we left the MMU enabled to avoid crashing there, or we just
left it enabled by accident?

But yeah, previously the MMU was enabled when we got here whereas now
it's not, because of that change.

> Missed to mention it in commit log, I will add it.

Thanks.

>>> diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
>>> index 47a683cd00d2..9d1e39d42e3e 100644
>>> --- a/arch/powerpc/kernel/mce.c
>>> +++ b/arch/powerpc/kernel/mce.c
>>> @@ -249,6 +249,7 @@ void machine_check_queue_event(void)
>>>   {
>>> int index;
>>> struct machine_check_event evt;
>>> +   unsigned long msr;
>>>   
>>> if (!get_mce_event(&evt, MCE_EVENT_RELEASE))
>>> return;
>>> @@ -262,8 +263,19 @@ void machine_check_queue_event(void)
>>> memcpy(&local_paca->mce_info->mce_event_queue[index],
>>>&evt, sizeof(evt));
>>>   
>>> -   /* Queue irq work to process this event later. */
>>> -   irq_work_queue(&mce_event_process_work);
>>> +   /* Queue irq work to process this event later. Before
>>> +* queuing the work enable translation for non radix LPAR,
>>> +* as irq_work_queue may try to access memory outside RMO
>>> +* region.
>>> +*/
>>> +   if (!radix_enabled() && firmware_has_feature(FW_FEATURE_LPAR)) {
>>> +   msr = mfmsr();
>>> +   mtmsr(msr | MSR_IR | MSR_DR);
>>> +   irq_work_queue(&mce_event_process_work);
>>> +   mtmsr(msr);
>>> +   } else {
>>> +   irq_work_queue(&mce_event_process_work);
>>> +   }
>>>   }
>> We already went to virtual mode and queued (different) irq work in
>> arch/powerpc/platforms/pseries/ras.c:mce_handle_error()
>>
>> We also called save_mce_event() which also might have queued irq work,
>> via machine_check_ue_event().
>>
>> So it really feels like something about the design is wrong if we have
>> to go to virtual mode again and queue more irq work here.
>>
>> I guess we can probably merge this as a backportable fix, doing anything
>> else would be a bigger change.
>
> I agree.
>
>>
>> Looking at ras.c there's the comment:
>>
>>   * Enable translation as we will be accessing per-cpu variables
>>   * in save_mce_event() which may fall outside RMO region, also
>>
>> But AFAICS it's only irq_work_queue() that touches anything percpu?
>
> Yeah, we left the comment unchanged after doing some modifications around it,
> It needs to be updated, ill send a separate patch for it.

Thanks.

I see 

Re: [PATCH 1/2] powerpc/perf: Expose instruction and data address registers as part of extended regs

2021-09-07 Thread Michael Ellerman
Athira Rajeev  writes:
> Patch adds support to include Sampled Instruction Address Register
> (SIAR) and Sampled Data Address Register (SDAR) SPRs as part of extended
> registers. Update the definition of PERF_REG_PMU_MASK_300/31 and
> PERF_REG_EXTENDED_MAX to include these SPR's.
>
> Signed-off-by: Athira Rajeev 
> ---
>  arch/powerpc/include/uapi/asm/perf_regs.h | 12 +++-
>  arch/powerpc/perf/perf_regs.c |  4 
>  2 files changed, 11 insertions(+), 5 deletions(-)
>
...
> diff --git a/arch/powerpc/perf/perf_regs.c b/arch/powerpc/perf/perf_regs.c
> index b931eed..51d31b6 100644
> --- a/arch/powerpc/perf/perf_regs.c
> +++ b/arch/powerpc/perf/perf_regs.c
> @@ -90,7 +90,11 @@ static u64 get_ext_regs_value(int idx)
>   return mfspr(SPRN_SIER2);
>   case PERF_REG_POWERPC_SIER3:
>   return mfspr(SPRN_SIER3);
> + case PERF_REG_POWERPC_SDAR:
> + return mfspr(SPRN_SDAR);
>  #endif
> + case PERF_REG_POWERPC_SIAR:
> + return mfspr(SPRN_SIAR);
>   default: return 0;
>   }

This file is built for all powerpc configs that have PERF_EVENTS. Which
includes CPUs that don't have SDAR or SIAR.

Don't we need checks in perf_reg_value() like we do for SIER?

I guess we already got this wrong when we added the Power10 registers,
SIER2/3 etc.

cheers


Re: [RFC PATCH v2] powerpc/papr_scm: Move duplicate definitions to common header files

2021-09-07 Thread Michael Ellerman
Shivaprasad G Bhat  writes:
> papr_scm and ndtest share common PDSM payload structs like
> nd_papr_pdsm_health. Presently these structs are duplicated across papr_pdsm.h
> and ndtest.h header files. Since 'ndtest' is essentially arch independent and 
> can
> run on platforms other than PPC64, a way needs to be deviced to avoid 
> redundancy
> and duplication of PDSM structs in future.
>
> So the patch proposes moving the PDSM header from arch/powerpc/include/uapi/ 
> to
> the generic include/uapi/linux directory. Also, there are some #defines common
> between papr_scm and ndtest which are not exported to the user space. So, move
> them to a header file which can be shared across ndtest and papr_scm via newly
> introduced include/linux/papr_scm.h.
>
> Signed-off-by: Shivaprasad G Bhat 
> Signed-off-by: Vaibhav Jain 
> Suggested-by: "Aneesh Kumar K.V" 
> ---
> Changelog:
>
> Since v1:
> Link: 
> https://patchwork.kernel.org/project/linux-nvdimm/patch/162505488483.72147.12741153746322191381.stgit@56e104a48989/
> * Removed dependency on this patch for the other patches
>
>  MAINTAINERS   |2 
>  arch/powerpc/include/uapi/asm/papr_pdsm.h |  165 
> -
>  arch/powerpc/platforms/pseries/papr_scm.c |   43 
>  include/linux/papr_scm.h  |   48 
>  include/uapi/linux/papr_pdsm.h|  165 
> +

This doesn't make sense to me.

Anything with papr (or PAPR) in the name is fundamentally powerpc
specific, it doesn't belong in a generic header, or in a generic
location.

What's the actual problem you're trying to solve?

If it's just including papr_scm bits into ndtest.c then that should be
as simple as:

#ifdef __powerpc__
#include 
#endif

Shouldn't it?

cheers


Re: [PATCH v3] ftrace: Cleanup ftrace_dyn_arch_init()

2021-09-07 Thread Weizhao Ouyang
Thanks for reply.

On 2021/9/7 23:55, LEROY Christophe wrote:
>
>> -Message d'origine-
>> De : Linuxppc-dev > bounces+christophe.leroy=csgroup...@lists.ozlabs.org> De la part de Weizhao
>> Ouyang
>>
>> Most of ARCHs use empty ftrace_dyn_arch_init(), introduce a weak common
>> ftrace_dyn_arch_init() to cleanup them.
>>
>> Signed-off-by: Weizhao Ouyang 
>> Acked-by: Heiko Carstens  (s390)
>> Acked-by: Helge Deller  (parisc)
>>
>> ---
>> Changes in v3:
>> -- fix unrecognized opcode on PowerPC
>>
>> Changes in v2:
>> -- correct CONFIG_DYNAMIC_FTRACE on PowerPC
>> -- add Acked-by tag
>>
>> ---
>>  arch/arm/kernel/ftrace.c  | 5 -
>>  arch/arm64/kernel/ftrace.c| 5 -
>>  arch/csky/kernel/ftrace.c | 5 -
>>  arch/ia64/kernel/ftrace.c | 6 --
>>  arch/microblaze/kernel/ftrace.c   | 5 -
>>  arch/mips/include/asm/ftrace.h| 2 ++
>>  arch/nds32/kernel/ftrace.c| 5 -
>>  arch/parisc/kernel/ftrace.c   | 5 -
>>  arch/powerpc/include/asm/ftrace.h | 4 
>>  arch/riscv/kernel/ftrace.c| 5 -
>>  arch/s390/kernel/ftrace.c | 5 -
>>  arch/sh/kernel/ftrace.c   | 5 -
>>  arch/sparc/kernel/ftrace.c| 5 -
>>  arch/x86/kernel/ftrace.c  | 5 -
>>  include/linux/ftrace.h| 1 -
>>  kernel/trace/ftrace.c | 5 +
>>  16 files changed, 11 insertions(+), 62 deletions(-)
>>
>> diff --git a/arch/mips/include/asm/ftrace.h b/arch/mips/include/asm/ftrace.h
>> index b463f2aa5a61..ed013e767390 100644
>> --- a/arch/mips/include/asm/ftrace.h
>> +++ b/arch/mips/include/asm/ftrace.h
>> @@ -76,6 +76,8 @@ do {\
>>
>>
>>  #ifdef CONFIG_DYNAMIC_FTRACE
>> +int __init ftrace_dyn_arch_init(void);
>> +
> Why ?
>
>
>>  static inline unsigned long ftrace_call_adjust(unsigned long addr)
>>  {
>>   return addr;
>> diff --git a/arch/powerpc/include/asm/ftrace.h
>> b/arch/powerpc/include/asm/ftrace.h
>> index debe8c4f7062..b05c43f13a4d 100644
>> --- a/arch/powerpc/include/asm/ftrace.h
>> +++ b/arch/powerpc/include/asm/ftrace.h
>> @@ -126,6 +126,10 @@ static inline void this_cpu_enable_ftrace(void) { }
>>  static inline void this_cpu_set_ftrace_enabled(u8 ftrace_enabled) { }
>>  static inline u8 this_cpu_get_ftrace_enabled(void) { return 1; }
>>  #endif /* CONFIG_PPC64 */
>> +
>> +#ifdef CONFIG_DYNAMIC_FTRACE
>> +int __init ftrace_dyn_arch_init(void);
>> +#endif /* CONFIG_DYNAMIC_FTRACE */
> Why ?
>
>>  #endif /* !__ASSEMBLY__ */
>>
>>  #endif /* _ASM_POWERPC_FTRACE */
>> diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
>> index 832e65f06754..f1eca123d89d 100644
>> --- a/include/linux/ftrace.h
>> +++ b/include/linux/ftrace.h
>> @@ -573,7 +573,6 @@ ftrace_set_early_filter(struct ftrace_ops *ops, char
>> *buf, int enable);
>>
>>  /* defined in arch */
>>  extern int ftrace_ip_converted(unsigned long ip);
>> -extern int ftrace_dyn_arch_init(void);
> Why removing that ?
>
> Have you tried to build kernel/trace/ftrace.o with C=2 ? It will likely tell 
> you that the function is not declared and that it should be static

Yes I missed this check. Under the situation, the function should be static.

> We could eventually consider that in the past, this generic declaration was 
> unrelevant because the definitions where in the arch specific sections.
> Now that you are implementing a generic weak version of this function, it 
> would make sense to have a generic declaration as well.
>
> I really don't see the point in duplicating the declaration of the function 
> in the arch specific headers.

I use declaration in arch specific headers in tend to clarify the arch has 
implement ftrace_dyn_arch_init().
Anyway, it maybe pointless, a generic declaration is enough. Will update it 
later.

>>  extern void ftrace_replace_code(int enable);
>>  extern int ftrace_update_ftrace_func(ftrace_func_t func);
>>  extern void ftrace_caller(void);
> Christophe
>
> CS Group - Document Interne


Re: [PATCH 0/5] s390/pci: automatic error recovery

2021-09-07 Thread Oliver O'Halloran
On Tue, Sep 7, 2021 at 10:21 PM Niklas Schnelle  wrote:
>
> On Tue, 2021-09-07 at 10:45 +0200, Niklas Schnelle wrote:
> > On Tue, 2021-09-07 at 12:04 +1000, Oliver O'Halloran wrote:
> > > On Mon, Sep 6, 2021 at 7:49 PM Niklas Schnelle  
> > > wrote:
> > > > Patch 3 I already sent separately resulting in the discussion below but 
> > > > without
> > > > a final conclusion.
> > > >
> > > > https://lore.kernel.org/lkml/20210720150145.640727-1-schne...@linux.ibm.com/
> > > >
> > > > I believe even though there were some doubts about the use of
> > > > pci_dev_is_added() by arch code the existing uses as well as the use in 
> > > > the
> > > > final patch of this series warrant this export.
> > >
> > > The use of pci_dev_is_added() in arch/powerpc was because in the past
> > > pci_bus_add_device() could be called before pci_device_add(). That was
> > > fixed a while ago so It should be safe to remove those calls now.
> >
> > Hmm, ok that confirms Bjorns suspicion and explains how it came to be.
> > I can certainly sent a patch for that. This would then leave only the
> > existing use in s390 which I added because of a dead lock prevention
> > and explained here:
> > https://lore.kernel.org/lkml/87d15d5eead35c9eaa667958d057cf4a81a8bf13.ca...@linux.ibm.com/
> >
> > Plus the need to use it in the recovery code of this series. I think in
> > the EEH code the need for a similar check is alleviated by the checks
> > in the beginning of
> > arch/powerpc/kernel/eeh_driver.c:eeh_handle_normal_event() especially
> > eeh_slot_presence_check() which checks presence via the hotplug slot.
> > I guess we could use our own state tracking in a similar way but felt
> > like pci_dev_is_added() is the more logical choice.

The slot check is mainly there to prevent attempts to "recover"
devices that have been surprise removed (i.e NVMe hot-unplug). The
actual recovery process operates off the eeh_pe tree which is frozen
in place when an error is detected. If a pci_dev is added or removed
it's not really a problem since those are only ever looked at when
notifying drivers which is done with the rescan_remove lock held. That
said, I wouldn't really encourage anyone to follow the EEH model since
it's pretty byzantine.

> Looking into this again, I think we actually can't easily track this
> state ourselves outside struct pci_dev. The reason for this is that
> when e.g. arch/s390/pci/pci_sysfs.c:recover_store() removes the struct
> pci_dev and scans it again the new struct pci_dev re-uses the same
> struct zpci_dev because from a platform point of view the PCI device
> was never removed but only disabled and re-enabled. Thus we can only
> distinguish the stale struct pci_dev by looking at things stored in
> struct pci_dev itself.

IMO the real problem is removing and re-adding the pci_dev. I think
it's something that's done largely because the PCI core doesn't really
provide any better mechanism for getting a device back into a
known-good state so it's abused to implement error recovery. This is
something that's always annoyed me since it conflates recovery with
hotplug. After a hot-(un)plug we might have a different device or no
device. In the recovery case we expect to start and end with the same
device. Why not apply the same logic to the pci_dev?

Something I was tinkering with before I left IBM was re-working the
way EEH handles recovering devices that don't have a driver with error
handling callbacks to something like:

1. unbind the driver
2. pci_save_state()
3. do the reset
4. pci_restore_state()
5. re-bind the driver

That would allow keeping the pci_dev around and let me delete a pile
of confusing code which handles binding the eeh_dev to the new
pci_dev. The obvious problem with that approach is the assumption the
device is functional enough to allow saving the config space, but I
don't think that's a deal breaker. We could stash a copy of the device
state before we allow drivers to attach and use that to restore the
device after the reset. The end result would be the same known-good
state that we'd get after a re-scan.

> That said, I think for the recovery case we might be able to drop the
> pci_dev_is_added() and rely on pdev->driver != NULL which we check
> anyway and that should catch any PCI device that was already removed.

Would that work if there was an error on a device without a driver
bound? If you're just trying to stop races between recovery and device
removal then pci_dev_is_added() is probably the right tool for the
job. Trying to substitute it with a proxy seems like a bad idea.


Re: [RESEND PATCH v4 4/4] powerpc/papr_scm: Document papr_scm sysfs event format entries

2021-09-07 Thread Dan Williams
On Thu, Sep 2, 2021 at 10:11 PM Kajol Jain  wrote:
>
> Details is added for the event, cpumask and format attributes
> in the ABI documentation.
>
> Acked-by: Peter Zijlstra (Intel) 
> Reviewed-by: Madhavan Srinivasan 
> Tested-by: Nageswara R Sastry 
> Signed-off-by: Kajol Jain 
> ---
>  Documentation/ABI/testing/sysfs-bus-papr-pmem | 31 +++
>  1 file changed, 31 insertions(+)
>
> diff --git a/Documentation/ABI/testing/sysfs-bus-papr-pmem 
> b/Documentation/ABI/testing/sysfs-bus-papr-pmem
> index 95254cec92bf..4d86252448f8 100644
> --- a/Documentation/ABI/testing/sysfs-bus-papr-pmem
> +++ b/Documentation/ABI/testing/sysfs-bus-papr-pmem
> @@ -61,3 +61,34 @@ Description:
> * "CchRHCnt" : Cache Read Hit Count
> * "CchWHCnt" : Cache Write Hit Count
> * "FastWCnt" : Fast Write Count
> +
> +What:  /sys/devices/nmemX/format
> +Date:  June 2021
> +Contact:   linuxppc-dev , 
> nvd...@lists.linux.dev,
> +Description:   (RO) Attribute group to describe the magic bits
> +that go into perf_event_attr.config for a particular pmu.
> +(See ABI/testing/sysfs-bus-event_source-devices-format).
> +
> +Each attribute under this group defines a bit range of the
> +perf_event_attr.config. Supported attribute is listed
> +below::
> +
> +   event  = "config:0-4"  - event ID
> +
> +   For example::
> +   noopstat = "event=0x1"
> +
> +What:  /sys/devices/nmemX/events

That's not a valid sysfs path. Did you mean /sys/bus/nd/devices/nmemX?

> +Date:  June 2021
> +Contact:   linuxppc-dev , 
> nvd...@lists.linux.dev,
> +Description:(RO) Attribute group to describe performance monitoring
> +events specific to papr-scm. Each attribute in this group 
> describes
> +a single performance monitoring event supported by this 
> nvdimm pmu.
> +The name of the file is the name of the event.
> +(See ABI/testing/sysfs-bus-event_source-devices-events).

Given these events are in the generic namespace the ABI documentation
should be generic as well. So I think move these entries to
Documentation/ABI/testing/sysfs-bus-nvdimm directly.

You can still mention papr-scm, but I would expect something like:

What:   /sys/bus/nd/devices/nmemX/events
Date:   September 2021
KernelVersion:  5.16
Contact:Kajol Jain 
Description:
(RO) Attribute group to describe performance monitoring events
for the nvdimm memory device. Each attribute in this group
describes a single performance monitoring event supported by
this nvdimm pmu.  The name of the file is the name of the event.
(See ABI/testing/sysfs-bus-event_source-devices-events). A
listing of the events supported by a given nvdimm provider type
can be found in Documentation/driver-api/nvdimm/$provider, for
example: Documentation/driver-api/nvdimm/papr-scm.



> +
> +What:  /sys/devices/nmemX/cpumask
> +Date:  June 2021
> +Contact:   linuxppc-dev , 
> nvd...@lists.linux.dev,
> +Description:   (RO) This sysfs file exposes the cpumask which is designated 
> to make
> +HCALLs to retrieve nvdimm pmu event counter data.

Seems this one would be provider generic, so no need to refer to PPC
specific concepts like HCALLs.


Re: [PATCH] pci/hotplug/pnv-php: Remove probable double put

2021-09-07 Thread Oliver O'Halloran
On Wed, Sep 8, 2021 at 8:02 AM Tyrel Datwyler  wrote:
>
> On 9/7/21 1:59 AM, Xu Wang wrote:
> > Device node iterators put the previous value of the index variable,
> > so an explicit put causes a double put.
> >
> > Signed-off-by: Xu Wang 
> > ---
> >  drivers/pci/hotplug/pnv_php.c | 1 -
> >  1 file changed, 1 deletion(-)
> >
> > diff --git a/drivers/pci/hotplug/pnv_php.c b/drivers/pci/hotplug/pnv_php.c
> > index 04565162a449..ed4d1a2c3f22 100644
> > --- a/drivers/pci/hotplug/pnv_php.c
> > +++ b/drivers/pci/hotplug/pnv_php.c
> > @@ -158,7 +158,6 @@ static void pnv_php_detach_device_nodes(struct 
> > device_node *parent)
> >   for_each_child_of_node(parent, dn) {
> >   pnv_php_detach_device_nodes(dn);
> >
> > - of_node_put(dn);
> >   of_detach_node(dn);
>
> Are you sure this is a double put? This looks to me like its meant to drive 
> tear
> down of the device by putting a long term reference and not the short term get
> that is part of the iterator.

Yeah, the put is there is to drop the initial ref so the node can be
released. It might be worth adding a comment.


Re: [PATCH] pci/hotplug/pnv-php: Remove probable double put

2021-09-07 Thread Tyrel Datwyler
On 9/7/21 1:59 AM, Xu Wang wrote:
> Device node iterators put the previous value of the index variable,
> so an explicit put causes a double put.
> 
> Signed-off-by: Xu Wang 
> ---
>  drivers/pci/hotplug/pnv_php.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/drivers/pci/hotplug/pnv_php.c b/drivers/pci/hotplug/pnv_php.c
> index 04565162a449..ed4d1a2c3f22 100644
> --- a/drivers/pci/hotplug/pnv_php.c
> +++ b/drivers/pci/hotplug/pnv_php.c
> @@ -158,7 +158,6 @@ static void pnv_php_detach_device_nodes(struct 
> device_node *parent)
>   for_each_child_of_node(parent, dn) {
>   pnv_php_detach_device_nodes(dn);
> 
> - of_node_put(dn);
>   of_detach_node(dn);

Are you sure this is a double put? This looks to me like its meant to drive tear
down of the device by putting a long term reference and not the short term get
that is part of the iterator.

-Tyrel

>   }
>  }
> 



Re: [RESEND PATCH v4 1/4] drivers/nvdimm: Add nvdimm pmu structure

2021-09-07 Thread Dan Williams
Hi Kajol,

Apologies for the delay in responding to this series, some comments below:

On Thu, Sep 2, 2021 at 10:10 PM Kajol Jain  wrote:
>
> A structure is added, called nvdimm_pmu, for performance
> stats reporting support of nvdimm devices. It can be used to add
> nvdimm pmu data such as supported events and pmu event functions
> like event_init/add/read/del with cpu hotplug support.
>
> Acked-by: Peter Zijlstra (Intel) 
> Reviewed-by: Madhavan Srinivasan 
> Tested-by: Nageswara R Sastry 
> Signed-off-by: Kajol Jain 
> ---
>  include/linux/nd.h | 43 +++
>  1 file changed, 43 insertions(+)
>
> diff --git a/include/linux/nd.h b/include/linux/nd.h
> index ee9ad76afbba..712499cf7335 100644
> --- a/include/linux/nd.h
> +++ b/include/linux/nd.h
> @@ -8,6 +8,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>
>  enum nvdimm_event {
> NVDIMM_REVALIDATE_POISON,
> @@ -23,6 +25,47 @@ enum nvdimm_claim_class {
> NVDIMM_CCLASS_UNKNOWN,
>  };
>
> +/* Event attribute array index */
> +#define NVDIMM_PMU_FORMAT_ATTR 0
> +#define NVDIMM_PMU_EVENT_ATTR  1
> +#define NVDIMM_PMU_CPUMASK_ATTR2
> +#define NVDIMM_PMU_NULL_ATTR   3
> +
> +/**
> + * struct nvdimm_pmu - data structure for nvdimm perf driver
> + *
> + * @name: name of the nvdimm pmu device.
> + * @pmu: pmu data structure for nvdimm performance stats.
> + * @dev: nvdimm device pointer.
> + * @functions(event_init/add/del/read): platform specific pmu functions.

This is not valid kernel-doc:

include/linux/nd.h:67: warning: Function parameter or member
'event_init' not described in 'nvdimm_pmu'
include/linux/nd.h:67: warning: Function parameter or member 'add' not
described in 'nvdimm_pmu'
include/linux/nd.h:67: warning: Function parameter or member 'del' not
described in 'nvdimm_pmu'
include/linux/nd.h:67: warning: Function parameter or member 'read'
not described in 'nvdimm_pmu'

...but I think rather than fixing those up 'struct nvdimm_pmu' should be pruned.

It's not clear to me that it is worth the effort to describe these
details to the nvdimm core which is just going to turn around and call
the pmu core. I'd just as soon have the driver call the pmu core
directly, optionally passing in attributes and callbacks that come
from the nvdimm core and/or the nvdimm provider.

Otherwise it's also not clear which of these structure members are
used at runtime vs purely used as temporary storage to pass parameters
to the pmu core.

> + * @attr_groups: data structure for events, formats and cpumask
> + * @cpu: designated cpu for counter access.
> + * @node: node for cpu hotplug notifier link.
> + * @cpuhp_state: state for cpu hotplug notification.
> + * @arch_cpumask: cpumask to get designated cpu for counter access.
> + */
> +struct nvdimm_pmu {
> +   const char *name;
> +   struct pmu pmu;
> +   struct device *dev;
> +   int (*event_init)(struct perf_event *event);
> +   int  (*add)(struct perf_event *event, int flags);
> +   void (*del)(struct perf_event *event, int flags);
> +   void (*read)(struct perf_event *event);
> +   /*
> +* Attribute groups for the nvdimm pmu. Index 0 used for
> +* format attribute, index 1 used for event attribute,
> +* index 2 used for cpusmask attribute and index 3 kept as NULL.
> +*/
> +   const struct attribute_group *attr_groups[4];

Following from above, I'd rather this was organized as static
attributes with an is_visible() helper for the groups for any dynamic
aspects. That mirrors the behavior of nvdimm_create() and allows for
device drivers to compose the attribute groups from a core set and /
or a provider specific set.

> +   int cpu;
> +   struct hlist_node node;
> +   enum cpuhp_state cpuhp_state;
> +
> +   /* cpumask provided by arch/platform specific code */
> +   struct cpumask arch_cpumask;
> +};
> +
>  struct nd_device_driver {
> struct device_driver drv;
> unsigned long type;
> --
> 2.26.2
>


Re: [PATCH 1/5] KVM: rseq: Update rseq when processing NOTIFY_RESUME on xfer to KVM guest

2021-09-07 Thread Sean Christopherson
On Mon, Sep 06, 2021, Paolo Bonzini wrote:
> On 20/08/21 20:51, Mathieu Desnoyers wrote:
> > > Ah, or is it the case that rseq_cs is non-NULL if and only if userspace 
> > > is in an
> > > rseq critical section, and because syscalls in critical sections are 
> > > illegal, by
> > > definition clearing rseq_cs is a nop unless userspace is misbehaving.
> > Not quite, as I described above. But we want it to stay set so the 
> > CONFIG_DEBUG_RSEQ
> > code executed when returning from ioctl to userspace will be able to 
> > validate that
> > it is not nested within a rseq critical section.
> > 
> > > If that's true, what about explicitly checking that at NOTIFY_RESUME?  Or 
> > > is it
> > > not worth the extra code to detect an error that will likely be caught 
> > > anyways?
> > The error will indeed already be caught on return from ioctl to userspace, 
> > so I
> > don't see any added value in duplicating this check.
> 
> Sean, can you send a v2 (even for this patch only would be okay)?

Made it all the way to v3 while you were out :-)

https://lkml.kernel.org/r/20210901203030.1292304-1-sea...@google.com


RE: [PATCH v3] ftrace: Cleanup ftrace_dyn_arch_init()

2021-09-07 Thread LEROY Christophe



> -Message d'origine-
> De : Linuxppc-dev  bounces+christophe.leroy=csgroup...@lists.ozlabs.org> De la part de Weizhao
> Ouyang
>
> Most of ARCHs use empty ftrace_dyn_arch_init(), introduce a weak common
> ftrace_dyn_arch_init() to cleanup them.
>
> Signed-off-by: Weizhao Ouyang 
> Acked-by: Heiko Carstens  (s390)
> Acked-by: Helge Deller  (parisc)
>
> ---
> Changes in v3:
> -- fix unrecognized opcode on PowerPC
>
> Changes in v2:
> -- correct CONFIG_DYNAMIC_FTRACE on PowerPC
> -- add Acked-by tag
>
> ---
>  arch/arm/kernel/ftrace.c  | 5 -
>  arch/arm64/kernel/ftrace.c| 5 -
>  arch/csky/kernel/ftrace.c | 5 -
>  arch/ia64/kernel/ftrace.c | 6 --
>  arch/microblaze/kernel/ftrace.c   | 5 -
>  arch/mips/include/asm/ftrace.h| 2 ++
>  arch/nds32/kernel/ftrace.c| 5 -
>  arch/parisc/kernel/ftrace.c   | 5 -
>  arch/powerpc/include/asm/ftrace.h | 4 
>  arch/riscv/kernel/ftrace.c| 5 -
>  arch/s390/kernel/ftrace.c | 5 -
>  arch/sh/kernel/ftrace.c   | 5 -
>  arch/sparc/kernel/ftrace.c| 5 -
>  arch/x86/kernel/ftrace.c  | 5 -
>  include/linux/ftrace.h| 1 -
>  kernel/trace/ftrace.c | 5 +
>  16 files changed, 11 insertions(+), 62 deletions(-)
>
> diff --git a/arch/mips/include/asm/ftrace.h b/arch/mips/include/asm/ftrace.h
> index b463f2aa5a61..ed013e767390 100644
> --- a/arch/mips/include/asm/ftrace.h
> +++ b/arch/mips/include/asm/ftrace.h
> @@ -76,6 +76,8 @@ do {\
>
>
>  #ifdef CONFIG_DYNAMIC_FTRACE
> +int __init ftrace_dyn_arch_init(void);
> +

Why ?


>  static inline unsigned long ftrace_call_adjust(unsigned long addr)
>  {
>   return addr;
> diff --git a/arch/powerpc/include/asm/ftrace.h
> b/arch/powerpc/include/asm/ftrace.h
> index debe8c4f7062..b05c43f13a4d 100644
> --- a/arch/powerpc/include/asm/ftrace.h
> +++ b/arch/powerpc/include/asm/ftrace.h
> @@ -126,6 +126,10 @@ static inline void this_cpu_enable_ftrace(void) { }
>  static inline void this_cpu_set_ftrace_enabled(u8 ftrace_enabled) { }
>  static inline u8 this_cpu_get_ftrace_enabled(void) { return 1; }
>  #endif /* CONFIG_PPC64 */
> +
> +#ifdef CONFIG_DYNAMIC_FTRACE
> +int __init ftrace_dyn_arch_init(void);
> +#endif /* CONFIG_DYNAMIC_FTRACE */

Why ?

>  #endif /* !__ASSEMBLY__ */
>
>  #endif /* _ASM_POWERPC_FTRACE */
> diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
> index 832e65f06754..f1eca123d89d 100644
> --- a/include/linux/ftrace.h
> +++ b/include/linux/ftrace.h
> @@ -573,7 +573,6 @@ ftrace_set_early_filter(struct ftrace_ops *ops, char
> *buf, int enable);
>
>  /* defined in arch */
>  extern int ftrace_ip_converted(unsigned long ip);
> -extern int ftrace_dyn_arch_init(void);

Why removing that ?

Have you tried to build kernel/trace/ftrace.o with C=2 ? It will likely tell 
you that the function is not declared and that it should be static

We could eventually consider that in the past, this generic declaration was 
unrelevant because the definitions where in the arch specific sections.
Now that you are implementing a generic weak version of this function, it would 
make sense to have a generic declaration as well.

I really don't see the point in duplicating the declaration of the function in 
the arch specific headers.

>  extern void ftrace_replace_code(int enable);
>  extern int ftrace_update_ftrace_func(ftrace_func_t func);
>  extern void ftrace_caller(void);

Christophe

CS Group - Document Interne


[PATCH 3/9] xen/x86: make "earlyprintk=xen" work better for PVH Dom0

2021-09-07 Thread Jan Beulich
The xen_hvm_early_write() path better wouldn't be taken in this case;
while port 0xE9 can be used, the hypercall path is quite a bit more
efficient. Put that first, as it may also work for DomU-s (see also
xen_raw_console_write()).

While there also bail from the function when the first
domU_write_console() failed - later ones aren't going to succeed.

Signed-off-by: Jan Beulich 

--- a/drivers/tty/hvc/hvc_xen.c
+++ b/drivers/tty/hvc/hvc_xen.c
@@ -632,17 +632,16 @@ static void xenboot_write_console(struct
unsigned int linelen, off = 0;
const char *pos;
 
+   if (dom0_write_console(0, string, len) >= 0)
+   return;
+
if (!xen_pv_domain()) {
xen_hvm_early_write(0, string, len);
return;
}
 
-   dom0_write_console(0, string, len);
-
-   if (xen_initial_domain())
+   if (domU_write_console(0, "(early) ", 8) < 0)
return;
-
-   domU_write_console(0, "(early) ", 8);
while (off < len && NULL != (pos = strchr(string+off, '\n'))) {
linelen = pos-string+off;
if (off + linelen > len)



[PATCH 5/9] xen/x86: make "earlyprintk=xen" work for HVM/PVH DomU

2021-09-07 Thread Jan Beulich
xenboot_write_console() is dealing with these quite fine so I don't see
why xenboot_console_setup() would return -ENOENT in this case.

Adjust documentation accordingly.

Signed-off-by: Jan Beulich 

--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1266,7 +1266,7 @@
The VGA and EFI output is eventually overwritten by
the real console.
 
-   The xen output can only be used by Xen PV guests.
+   The xen option can only be used in Xen domains.
 
The sclp output can only be used on s390.
 
--- a/drivers/tty/hvc/hvc_xen.c
+++ b/drivers/tty/hvc/hvc_xen.c
@@ -618,10 +618,8 @@ static int __init xenboot_console_setup(
 {
static struct xencons_info xenboot;
 
-   if (xen_initial_domain())
+   if (xen_initial_domain() || !xen_pv_domain())
return 0;
-   if (!xen_pv_domain())
-   return -ENODEV;
 
return xencons_info_pv_init(&xenboot, 0);
 }



Re: [PATCH] powerpc/mce: Fix access error in mce handler

2021-09-07 Thread Ganesh


On 9/6/21 6:03 PM, Michael Ellerman wrote:


Ganesh Goudar  writes:

We queue an irq work for deferred processing of mce event
in realmode mce handler, where translation is disabled.
Queuing of the work may result in accessing memory outside
RMO region, such access needs the translation to be enabled
for an LPAR running with hash mmu else the kernel crashes.

So enable the translation before queuing the work.

Without this change following trace is seen on injecting machine
check error in an LPAR running with hash mmu.

What type of error are you injecting?


SLB multihit in kernel mode.




Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
CPU: 5 PID: 1883 Comm: insmod Tainted: GOE 5.14.0-mce+ #137
NIP:  c0735d60 LR: c0318640 CTR: 
REGS: c0001ebff9a0 TRAP: 0300   Tainted: G   OE  (5.14.0-mce+)
MSR:  80001003   CR: 28008228  XER: 0001
CFAR: c031863c DAR: c0027fa8fe08 DSISR: 4000 IRQMASK: 0
GPR00: c03186d0 c0001ebffc40 c1b0df00 c16337e8
GPR04: c16337e8 c0027fa8fe08 0023 c16337f0
GPR08: 0023 c12ffe08  c00801460240
GPR12:  c0001ec9a900 c0002ac4bd00 
GPR16: 05a0 c008006b c008006b05a0 c0ff3068
GPR20: c0002ac4bbc0 0001 c0002ac4bbc0 c00801490298
GPR24: c00801490108 c1636198 c00801470090 c00801470058
GPR28: 0510 c0080100 c0080819 0019
NIP [c0735d60] llist_add_batch+0x0/0x40
LR [c0318640] __irq_work_queue_local+0x70/0xc0
Call Trace:
[c0001ebffc40] [c0001ebffc0c] 0xc0001ebffc0c (unreliable)
[c0001ebffc60] [c03186d0] irq_work_queue+0x40/0x70
[c0001ebffc80] [c004425c] machine_check_queue_event+0xbc/0xd0
[c0001ebffcf0] [c000838c] machine_check_early_common+0x16c/0x1f4

Fixes: 74c3354bc1d89 ("powerpc/pseries/mce: restore msr before returning from 
handler")

Please explain in more detail why that commit caused this breakage.


After enabling translation in mce_handle_error() we used to leave it enabled to 
avoid
crashing here, but now with this commit we are restoring the MSR to disable 
translation.
Missed to mention it in commit log, I will add it.




diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
index 47a683cd00d2..9d1e39d42e3e 100644
--- a/arch/powerpc/kernel/mce.c
+++ b/arch/powerpc/kernel/mce.c
@@ -249,6 +249,7 @@ void machine_check_queue_event(void)
  {
int index;
struct machine_check_event evt;
+   unsigned long msr;
  
  	if (!get_mce_event(&evt, MCE_EVENT_RELEASE))

return;
@@ -262,8 +263,19 @@ void machine_check_queue_event(void)
memcpy(&local_paca->mce_info->mce_event_queue[index],
   &evt, sizeof(evt));
  
-	/* Queue irq work to process this event later. */

-   irq_work_queue(&mce_event_process_work);
+   /* Queue irq work to process this event later. Before
+* queuing the work enable translation for non radix LPAR,
+* as irq_work_queue may try to access memory outside RMO
+* region.
+*/
+   if (!radix_enabled() && firmware_has_feature(FW_FEATURE_LPAR)) {
+   msr = mfmsr();
+   mtmsr(msr | MSR_IR | MSR_DR);
+   irq_work_queue(&mce_event_process_work);
+   mtmsr(msr);
+   } else {
+   irq_work_queue(&mce_event_process_work);
+   }
  }

We already went to virtual mode and queued (different) irq work in
arch/powerpc/platforms/pseries/ras.c:mce_handle_error()

We also called save_mce_event() which also might have queued irq work,
via machine_check_ue_event().

So it really feels like something about the design is wrong if we have
to go to virtual mode again and queue more irq work here.

I guess we can probably merge this as a backportable fix, doing anything
else would be a bigger change.


I agree.



Looking at ras.c there's the comment:

 * Enable translation as we will be accessing per-cpu variables
 * in save_mce_event() which may fall outside RMO region, also

But AFAICS it's only irq_work_queue() that touches anything percpu?


Yeah, we left the comment unchanged after doing some modifications around it,
It needs to be updated, ill send a separate patch for it.



So maybe we should just not be using irq_work_queue(). It's a pretty
thin wrapper around set_dec(1), perhaps we just need to hand-roll some
real-mode friendly way of doing that.


You mean, have separate queue and run the work from timer handler?



cheers


Re: [PATCH 0/5] s390/pci: automatic error recovery

2021-09-07 Thread Niklas Schnelle
On Tue, 2021-09-07 at 10:45 +0200, Niklas Schnelle wrote:
> On Tue, 2021-09-07 at 12:04 +1000, Oliver O'Halloran wrote:
> > On Mon, Sep 6, 2021 at 7:49 PM Niklas Schnelle  
> > wrote:
> > > Patch 3 I already sent separately resulting in the discussion below but 
> > > without
> > > a final conclusion.
> > > 
> > > https://lore.kernel.org/lkml/20210720150145.640727-1-schne...@linux.ibm.com/
> > > 
> > > I believe even though there were some doubts about the use of
> > > pci_dev_is_added() by arch code the existing uses as well as the use in 
> > > the
> > > final patch of this series warrant this export.
> > 
> > The use of pci_dev_is_added() in arch/powerpc was because in the past
> > pci_bus_add_device() could be called before pci_device_add(). That was
> > fixed a while ago so It should be safe to remove those calls now.
> 
> Hmm, ok that confirms Bjorns suspicion and explains how it came to be.
> I can certainly sent a patch for that. This would then leave only the
> existing use in s390 which I added because of a dead lock prevention
> and explained here:
> https://lore.kernel.org/lkml/87d15d5eead35c9eaa667958d057cf4a81a8bf13.ca...@linux.ibm.com/
> 
> Plus the need to use it in the recovery code of this series. I think in
> the EEH code the need for a similar check is alleviated by the checks
> in the beginning of
> arch/powerpc/kernel/eeh_driver.c:eeh_handle_normal_event() especially
> eeh_slot_presence_check() which checks presence via the hotplug slot.
> I guess we could use our own state tracking in a similar way but felt
> like pci_dev_is_added() is the more logical choice.

Looking into this again, I think we actually can't easily track this
state ourselves outside struct pci_dev. The reason for this is that
when e.g. arch/s390/pci/pci_sysfs.c:recover_store() removes the struct
pci_dev and scans it again the new struct pci_dev re-uses the same
struct zpci_dev because from a platform point of view the PCI device
was never removed but only disabled and re-enabled. Thus we can only
distinguish the stale struct pci_dev by looking at things stored in
struct pci_dev itself.

That said, I think for the recovery case we might be able to drop the
pci_dev_is_added() and rely on pdev->driver != NULL which we check
anyway and that should catch any PCI device that was already removed.



Re: [PATCH] pci/hotplug/pnv-php: Remove probable double put

2021-09-07 Thread Bjorn Helgaas
Make your subject line follow the previous convention.

Figure out if this is a "probable" or a real double put.  If it's a
real double put, we should fix it.  If it's only "probable," that
means we don't understand the problem yet.

On Tue, Sep 07, 2021 at 08:59:46AM +, Xu Wang wrote:
> Device node iterators put the previous value of the index variable,
> so an explicit put causes a double put.
> 
> Signed-off-by: Xu Wang 
> ---
>  drivers/pci/hotplug/pnv_php.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/drivers/pci/hotplug/pnv_php.c b/drivers/pci/hotplug/pnv_php.c
> index 04565162a449..ed4d1a2c3f22 100644
> --- a/drivers/pci/hotplug/pnv_php.c
> +++ b/drivers/pci/hotplug/pnv_php.c
> @@ -158,7 +158,6 @@ static void pnv_php_detach_device_nodes(struct 
> device_node *parent)
>   for_each_child_of_node(parent, dn) {
>   pnv_php_detach_device_nodes(dn);
>  
> - of_node_put(dn);
>   of_detach_node(dn);
>   }
>  }
> -- 
> 2.17.1
> 


[PATCH v3] ftrace: Cleanup ftrace_dyn_arch_init()

2021-09-07 Thread Weizhao Ouyang
Most of ARCHs use empty ftrace_dyn_arch_init(), introduce a weak common
ftrace_dyn_arch_init() to cleanup them.

Signed-off-by: Weizhao Ouyang 
Acked-by: Heiko Carstens  (s390)
Acked-by: Helge Deller  (parisc)

---
Changes in v3:
-- fix unrecognized opcode on PowerPC

Changes in v2:
-- correct CONFIG_DYNAMIC_FTRACE on PowerPC
-- add Acked-by tag

---
 arch/arm/kernel/ftrace.c  | 5 -
 arch/arm64/kernel/ftrace.c| 5 -
 arch/csky/kernel/ftrace.c | 5 -
 arch/ia64/kernel/ftrace.c | 6 --
 arch/microblaze/kernel/ftrace.c   | 5 -
 arch/mips/include/asm/ftrace.h| 2 ++
 arch/nds32/kernel/ftrace.c| 5 -
 arch/parisc/kernel/ftrace.c   | 5 -
 arch/powerpc/include/asm/ftrace.h | 4 
 arch/riscv/kernel/ftrace.c| 5 -
 arch/s390/kernel/ftrace.c | 5 -
 arch/sh/kernel/ftrace.c   | 5 -
 arch/sparc/kernel/ftrace.c| 5 -
 arch/x86/kernel/ftrace.c  | 5 -
 include/linux/ftrace.h| 1 -
 kernel/trace/ftrace.c | 5 +
 16 files changed, 11 insertions(+), 62 deletions(-)

diff --git a/arch/arm/kernel/ftrace.c b/arch/arm/kernel/ftrace.c
index 3c83b5d29697..a006585e1c09 100644
--- a/arch/arm/kernel/ftrace.c
+++ b/arch/arm/kernel/ftrace.c
@@ -193,11 +193,6 @@ int ftrace_make_nop(struct module *mod,
 
return ret;
 }
-
-int __init ftrace_dyn_arch_init(void)
-{
-   return 0;
-}
 #endif /* CONFIG_DYNAMIC_FTRACE */
 
 #ifdef CONFIG_FUNCTION_GRAPH_TRACER
diff --git a/arch/arm64/kernel/ftrace.c b/arch/arm64/kernel/ftrace.c
index 7f467bd9db7a..fc62dfe73f93 100644
--- a/arch/arm64/kernel/ftrace.c
+++ b/arch/arm64/kernel/ftrace.c
@@ -236,11 +236,6 @@ void arch_ftrace_update_code(int command)
command |= FTRACE_MAY_SLEEP;
ftrace_modify_all_code(command);
 }
-
-int __init ftrace_dyn_arch_init(void)
-{
-   return 0;
-}
 #endif /* CONFIG_DYNAMIC_FTRACE */
 
 #ifdef CONFIG_FUNCTION_GRAPH_TRACER
diff --git a/arch/csky/kernel/ftrace.c b/arch/csky/kernel/ftrace.c
index b4a7ec1517ff..50bfcf129078 100644
--- a/arch/csky/kernel/ftrace.c
+++ b/arch/csky/kernel/ftrace.c
@@ -133,11 +133,6 @@ int ftrace_update_ftrace_func(ftrace_func_t func)
(unsigned long)func, true, true);
return ret;
 }
-
-int __init ftrace_dyn_arch_init(void)
-{
-   return 0;
-}
 #endif /* CONFIG_DYNAMIC_FTRACE */
 
 #ifdef CONFIG_DYNAMIC_FTRACE_WITH_REGS
diff --git a/arch/ia64/kernel/ftrace.c b/arch/ia64/kernel/ftrace.c
index b2ab2d58fb30..d6360fd404ab 100644
--- a/arch/ia64/kernel/ftrace.c
+++ b/arch/ia64/kernel/ftrace.c
@@ -194,9 +194,3 @@ int ftrace_update_ftrace_func(ftrace_func_t func)
flush_icache_range(addr, addr + 16);
return 0;
 }
-
-/* run from kstop_machine */
-int __init ftrace_dyn_arch_init(void)
-{
-   return 0;
-}
diff --git a/arch/microblaze/kernel/ftrace.c b/arch/microblaze/kernel/ftrace.c
index 224eea40e1ee..188749d62709 100644
--- a/arch/microblaze/kernel/ftrace.c
+++ b/arch/microblaze/kernel/ftrace.c
@@ -163,11 +163,6 @@ int ftrace_make_call(struct dyn_ftrace *rec, unsigned long 
addr)
return ret;
 }
 
-int __init ftrace_dyn_arch_init(void)
-{
-   return 0;
-}
-
 int ftrace_update_ftrace_func(ftrace_func_t func)
 {
unsigned long ip = (unsigned long)(&ftrace_call);
diff --git a/arch/mips/include/asm/ftrace.h b/arch/mips/include/asm/ftrace.h
index b463f2aa5a61..ed013e767390 100644
--- a/arch/mips/include/asm/ftrace.h
+++ b/arch/mips/include/asm/ftrace.h
@@ -76,6 +76,8 @@ do {  \
 
 
 #ifdef CONFIG_DYNAMIC_FTRACE
+int __init ftrace_dyn_arch_init(void);
+
 static inline unsigned long ftrace_call_adjust(unsigned long addr)
 {
return addr;
diff --git a/arch/nds32/kernel/ftrace.c b/arch/nds32/kernel/ftrace.c
index 0e23e3a8df6b..f0ef4842d191 100644
--- a/arch/nds32/kernel/ftrace.c
+++ b/arch/nds32/kernel/ftrace.c
@@ -84,11 +84,6 @@ void _ftrace_caller(unsigned long parent_ip)
/* restore all state needed by the compiler epilogue */
 }
 
-int __init ftrace_dyn_arch_init(void)
-{
-   return 0;
-}
-
 static unsigned long gen_sethi_insn(unsigned long addr)
 {
unsigned long opcode = 0x4600;
diff --git a/arch/parisc/kernel/ftrace.c b/arch/parisc/kernel/ftrace.c
index 0a1e75af5382..01581f715737 100644
--- a/arch/parisc/kernel/ftrace.c
+++ b/arch/parisc/kernel/ftrace.c
@@ -94,11 +94,6 @@ int ftrace_disable_ftrace_graph_caller(void)
 #endif
 
 #ifdef CONFIG_DYNAMIC_FTRACE
-
-int __init ftrace_dyn_arch_init(void)
-{
-   return 0;
-}
 int ftrace_update_ftrace_func(ftrace_func_t func)
 {
return 0;
diff --git a/arch/powerpc/include/asm/ftrace.h 
b/arch/powerpc/include/asm/ftrace.h
index debe8c4f7062..b05c43f13a4d 100644
--- a/arch/powerpc/include/asm/ftrace.h
+++ b/arch/powerpc/include/asm/ftrace.h
@@ -126,6 +126,10 @@ static inline void this_cpu_enable_ftrace(void) { }
 static inline void this_cpu_set_ftrace_enabled(u8 ftrace_enabled) { 

[PATCH] pci/hotplug/pnv-php: Remove probable double put

2021-09-07 Thread Xu Wang
Device node iterators put the previous value of the index variable,
so an explicit put causes a double put.

Signed-off-by: Xu Wang 
---
 drivers/pci/hotplug/pnv_php.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/pci/hotplug/pnv_php.c b/drivers/pci/hotplug/pnv_php.c
index 04565162a449..ed4d1a2c3f22 100644
--- a/drivers/pci/hotplug/pnv_php.c
+++ b/drivers/pci/hotplug/pnv_php.c
@@ -158,7 +158,6 @@ static void pnv_php_detach_device_nodes(struct device_node 
*parent)
for_each_child_of_node(parent, dn) {
pnv_php_detach_device_nodes(dn);
 
-   of_node_put(dn);
of_detach_node(dn);
}
 }
-- 
2.17.1



Re: [PATCH 0/5] s390/pci: automatic error recovery

2021-09-07 Thread Niklas Schnelle
On Tue, 2021-09-07 at 12:04 +1000, Oliver O'Halloran wrote:
> On Mon, Sep 6, 2021 at 7:49 PM Niklas Schnelle  wrote:
> > Patch 3 I already sent separately resulting in the discussion below but 
> > without
> > a final conclusion.
> > 
> > https://lore.kernel.org/lkml/20210720150145.640727-1-schne...@linux.ibm.com/
> > 
> > I believe even though there were some doubts about the use of
> > pci_dev_is_added() by arch code the existing uses as well as the use in the
> > final patch of this series warrant this export.
> 
> The use of pci_dev_is_added() in arch/powerpc was because in the past
> pci_bus_add_device() could be called before pci_device_add(). That was
> fixed a while ago so It should be safe to remove those calls now.

Hmm, ok that confirms Bjorns suspicion and explains how it came to be.
I can certainly sent a patch for that. This would then leave only the
existing use in s390 which I added because of a dead lock prevention
and explained here:
https://lore.kernel.org/lkml/87d15d5eead35c9eaa667958d057cf4a81a8bf13.ca...@linux.ibm.com/

Plus the need to use it in the recovery code of this series. I think in
the EEH code the need for a similar check is alleviated by the checks
in the beginning of
arch/powerpc/kernel/eeh_driver.c:eeh_handle_normal_event() especially
eeh_slot_presence_check() which checks presence via the hotplug slot.
I guess we could use our own state tracking in a similar way but felt
like pci_dev_is_added() is the more logical choice.

> 
> > Patch 4 "PCI: Export pci_dev_lock()" is basically an extension to commit
> > e3a9b1212b9d ("PCI: Export pci_dev_trylock() and pci_dev_unlock()") which
> > already exported pci_dev_trylock(). In the final patch we make use of
> > pci_dev_lock() to wait for any other exclusive uses of the pdev to be 
> > finished
> > before starting recovery.
> 
> Hmm, I noticed the EEH
> (arch/powerpc/kernel/eeh_driver.c:eeh_pe_report_edev())  and the
> generic PCIe error recovery code (see
> drivers/pci/pcie/err.c:report_error_detected()) only call
> device_lock() before entering the driver's error handling callbacks. I
> wonder if they should be using pci_dev_lock() instead. The only real
> difference is that pci_dev_lock() will also block user space from
> accessing the device's config space while error recovery is in
> progress which seems sensible enough.

I agree that sounds reasonable.



Re: [PATCH 3/5] PCI: Move pci_dev_is/assign_added() to pci.h

2021-09-07 Thread Niklas Schnelle
On Tue, 2021-09-07 at 10:51 +0300, Andy Shevchenko wrote:
> On Tue, Sep 7, 2021 at 3:26 AM kernel test robot  wrote:
> > Hi Niklas,
> > 
> > I love your patch! Yet something to improve:
> > 
> > [auto build test ERROR on s390/features]
> > [also build test ERROR on next-20210906]
> > [cannot apply to pci/next powerpc/next v5.14]
> > [If your patch is applied to the wrong git tree, kindly drop us a note.
> > And when submitting patch, we suggest to use '--base' as documented in
> > https://git-scm.com/docs/git-format-patch]
> > 
> > url:
> > https://github.com/0day-ci/linux/commits/Niklas-Schnelle/s390-pci-automatic-error-recovery/20210906-175309
> > base:   https://git.kernel.org/pub/scm/linux/kernel/git/s390/linux.git 
> > features
> > config: i386-allyesconfig (attached as .config)
> > compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
> > reproduce (this is a W=1 build):
> > # 
> > https://github.com/0day-ci/linux/commit/404ed8c00a612e7ae31c50557c80c6726c464863
> > git remote add linux-review https://github.com/0day-ci/linux
> > git fetch --no-tags linux-review 
> > Niklas-Schnelle/s390-pci-automatic-error-recovery/20210906-175309
> > git checkout 404ed8c00a612e7ae31c50557c80c6726c464863
> > # save the attached .config to linux build tree
> > make W=1 ARCH=i386
> > 
> > If you fix the issue, kindly add following tag as appropriate
> > Reported-by: kernel test robot 
> > 
> > All errors (new ones prefixed by >>):
> 
> Obviously drivers/pci/pci.h is not only for the above.
> 
> When play with headers always do two test builds: allyesconfig and 
> allmodconfig.

You're right and additionally have to built on some other architectures
as well because allyesconfig and allmodconfig both run through fine on
s390. 

I'll look into it but at first glance it looks like I was over reaching
removing the include from drivers/pci/hotplug/acpiphp_glue.c in
addition it's not even the same kind of awkward relative include from
drivers into arch code. Sorry about that.

> 



Re: [PATCH 3/5] PCI: Move pci_dev_is/assign_added() to pci.h

2021-09-07 Thread Andy Shevchenko
On Tue, Sep 7, 2021 at 3:26 AM kernel test robot  wrote:
>
> Hi Niklas,
>
> I love your patch! Yet something to improve:
>
> [auto build test ERROR on s390/features]
> [also build test ERROR on next-20210906]
> [cannot apply to pci/next powerpc/next v5.14]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch]
>
> url:
> https://github.com/0day-ci/linux/commits/Niklas-Schnelle/s390-pci-automatic-error-recovery/20210906-175309
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/s390/linux.git 
> features
> config: i386-allyesconfig (attached as .config)
> compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
> reproduce (this is a W=1 build):
> # 
> https://github.com/0day-ci/linux/commit/404ed8c00a612e7ae31c50557c80c6726c464863
> git remote add linux-review https://github.com/0day-ci/linux
> git fetch --no-tags linux-review 
> Niklas-Schnelle/s390-pci-automatic-error-recovery/20210906-175309
> git checkout 404ed8c00a612e7ae31c50557c80c6726c464863
> # save the attached .config to linux build tree
> make W=1 ARCH=i386
>
> If you fix the issue, kindly add following tag as appropriate
> Reported-by: kernel test robot 
>
> All errors (new ones prefixed by >>):

Obviously drivers/pci/pci.h is not only for the above.

When play with headers always do two test builds: allyesconfig and allmodconfig.

-- 
With Best Regards,
Andy Shevchenko


Re: [PATCH 0/5] s390/pci: automatic error recovery

2021-09-07 Thread Niklas Schnelle
On Mon, 2021-09-06 at 21:05 -0500, Linas Vepstas wrote:
> On Mon, Sep 6, 2021 at 4:49 AM Niklas Schnelle 
> wrote:
> 
> >  I believe we might be the first
> > implementation of PCI device recovery in a virtualized setting requiring
> > us to
> > coordinate the device reset with the hypervisor platform by issuing a
> > disable
> > and re-enable to the platform as well as starting the recovery following
> > a platform event.
> > 
> 
> I recall none of the details, but SRIOV is a standardized system for
> sharing a PCI device across multiple virtual machines. It has detailed info
> on what the hypervisor must do, and what the local OS instance must do to
> accomplish this.  

Yes and in fact on s390 we make heavy use of SR-IOV.

> It's part of the PCI standard, and its more than a decade
> old now, maybe two. Being a part of the PCI standard, it was interoperable
> with error recovery, to the best of my recollection. 

Maybe I worded things with a bit too much sensationalism and it might
even be that POWER supports error recovery also with virtualization,
though I'm not sure how far that goes.

I believe you are right in that SR-IOV supports the error recovery,
after all this patch set also has to work together with SRIOV enabled
devices. At least on s390 though until this patch set the error
recovery performed by the hypervisor stopped in the hypervisor.

The missing part added by this patch set is coordinating with device
drivers in Linux to determine where use of a recovered device can pick
up after the PCIe level error recovery is done.

As for virtualization this coordination of course needs to cross the
hypervisor/guest boundary and at least for KVM+QEMU I know for a fact
that reporting a PCI error to the guest is currently just a stub that
actually completely stops the guest, so you definitely don't get smooth
error recovery there yet.

> At the time it was
> introduced, it got pushed very aggressively.  The x86 hypervisor vendors
> were aiming at the heart of zseries, and were militant about it.

And yet we're still here, use SR-IOV ourselves and even support Linux +
KVM as a hypervisor you can use just the same on a mainframe, an x86,
POWER, or ARM system.

> 
> -- Linas
>