from:"Greg Kurz"

Re: [PATCH kernel] KVM: PPC: Book3s: Fix warning about xics_rm_h_xirr_x

2022-06-22 Thread Greg Kurz

On Wed, 22 Jun 2022 15:52:35 +1000
Alexey Kardashevskiy  wrote:

> This fixes "no previous prototype":
> 
> arch/powerpc/kvm/book3s_hv_rm_xics.c:482:15:
> warning: no previous prototype for 'xics_rm_h_xirr_x' [-Wmissing-prototypes]
> 
> Reported by the kernel test robot.
> 
> Fixes: b22af9041927 ("KVM: PPC: Book3s: Remove real mode interrupt controller hcalls handlers")
> Signed-off-by: Alexey Kardashevskiy 
> ---

FWIW

Reviewed-by: Greg Kurz 

>  arch/powerpc/kvm/book3s_xics.h | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/arch/powerpc/kvm/book3s_xics.h b/arch/powerpc/kvm/book3s_xics.h
> index 8e4c79e2fcd8..08fb0843faf5 100644
> --- a/arch/powerpc/kvm/book3s_xics.h
> +++ b/arch/powerpc/kvm/book3s_xics.h
> @@ -143,6 +143,7 @@ static inline struct kvmppc_ics *kvmppc_xics_find_ics(struct kvmppc_xics *xics,
>  }
>  
>  extern unsigned long xics_rm_h_xirr(struct kvm_vcpu *vcpu);
> +extern unsigned long xics_rm_h_xirr_x(struct kvm_vcpu *vcpu);
>  extern int xics_rm_h_ipi(struct kvm_vcpu *vcpu, unsigned long server,
>unsigned long mfrr);
>  extern int xics_rm_h_cppr(struct kvm_vcpu *vcpu, unsigned long cppr);
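
The warning and its fix can be illustrated outside the kernel. The following is a minimal sketch, assuming a hypothetical `struct demo_vcpu` and a hypothetical function `demo_h_xirr_x()` that stand in for the kernel's `struct kvm_vcpu` and `xics_rm_h_xirr_x()`. A non-static function defined without a prior declaration triggers `-Wmissing-prototypes`; declaring it first, as the patch does in the header, silences the warning and lets the compiler check the definition against the declaration.

```c
/* Hypothetical stand-in for struct kvm_vcpu; not the kernel definition. */
struct demo_vcpu {
	unsigned long pending_xirr;
};

/*
 * Prototype that would normally live in a shared header. Without it,
 * the non-static definition below is reported by -Wmissing-prototypes.
 */
unsigned long demo_h_xirr_x(struct demo_vcpu *vcpu);

/* Definition: the compiler can now check it against the prototype. */
unsigned long demo_h_xirr_x(struct demo_vcpu *vcpu)
{
	return vcpu->pending_xirr;
}
```

Building this with `-Wmissing-prototypes` and then again with the prototype line removed should show the difference.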

Re: [PATCH] powerpc/xive: Change IRQ domain to a tree domain

2021-11-16 Thread Greg Kurz

On Tue, 16 Nov 2021 15:49:13 +0100
Cédric Le Goater  wrote:

> On 11/16/21 15:23, Greg Kurz wrote:
> > On Tue, 16 Nov 2021 14:40:22 +0100
> > Cédric Le Goater  wrote:
> > 
> >> Commit 4f86a06e2d6e ("irqdomain: Make normal and nomap irqdomains
> >> exclusive") introduced an IRQ_DOMAIN_FLAG_NO_MAP flag to isolate the
> >> 'nomap' domains still in use under the powerpc arch. With this new
> >> flag, the revmap_tree of the IRQ domain is not used anymore. This
> >> change broke the support of shared LSIs [1] in the XIVE driver because
> >> it was relying on a lookup in the revmap_tree to query previously
> >> mapped interrupts. Linux now creates two distinct IRQ mappings on the
> >> same HW IRQ which can lead to unexpected behavior in the drivers.
> >>
> >> The XIVE IRQ domain is not a direct mapping domain and its HW IRQ
> >> interrupt number space is rather large : 1M/socket on POWER9 and
> >> POWER10, change the XIVE driver to use a 'tree' domain type instead.
> >>
> >> [1] For instance, a linux KVM guest with virtio-rng and virtio-balloon
> >>  devices.
> >>
> >> Cc: Marc Zyngier 
> >> Cc: sta...@vger.kernel.org # v5.14+
> >> Fixes: 4f86a06e2d6e ("irqdomain: Make normal and nomap irqdomains exclusive")
> >> Signed-off-by: Cédric Le Goater 
> >> ---
> >>
> > 
> > Tested-by: Greg Kurz 
> > 
> > with a KVM guest + virtio-rng + virtio-balloon on a POWER9 host.
> 
> Did you test on a 5.14 backport or mainline ?
> 

I've tested on a 5.14 backport only.

> I am asking because a large change adding support for MSI domains
> to XIVE was merged in 5.15.
> 
> Thanks,
> 
> C.
> 
> 
> > 
> >>   Marc,
> >>
> >>   The Fixes tag is there because the patch in question revealed that
> >>   something was broken in XIVE. genirq is not in cause. However, I
> >>   don't know for PS3 and Cell. May be less critical for now.
> >>   
> >>   arch/powerpc/sysdev/xive/common.c | 3 +--
> >>   arch/powerpc/sysdev/xive/Kconfig  | 1 -
> >>   2 files changed, 1 insertion(+), 3 deletions(-)
> >>
> >> diff --git a/arch/powerpc/sysdev/xive/common.c 
> >> b/arch/powerpc/sysdev/xive/common.c
> >> index fed6fd16c8f4..9d0f0fe25598 100644
> >> --- a/arch/powerpc/sysdev/xive/common.c
> >> +++ b/arch/powerpc/sysdev/xive/common.c
> >> @@ -1536,8 +1536,7 @@ static const struct irq_domain_ops xive_irq_domain_ops = {
> >>   
> >>   static void __init xive_init_host(struct device_node *np)
> >>   {
> >> -  xive_irq_domain = irq_domain_add_nomap(np, XIVE_MAX_IRQ,
> >> -                                         &xive_irq_domain_ops, NULL);
> >> +  xive_irq_domain = irq_domain_add_tree(np, &xive_irq_domain_ops, NULL);
> >>    if (WARN_ON(xive_irq_domain == NULL))
> >>            return;
> >>    irq_set_default_host(xive_irq_domain);
> >> diff --git a/arch/powerpc/sysdev/xive/Kconfig 
> >> b/arch/powerpc/sysdev/xive/Kconfig
> >> index 97796c6b63f0..785c292d104b 100644
> >> --- a/arch/powerpc/sysdev/xive/Kconfig
> >> +++ b/arch/powerpc/sysdev/xive/Kconfig
> >> @@ -3,7 +3,6 @@ config PPC_XIVE
> >>bool
> >>select PPC_SMP_MUXED_IPI
> >>select HARDIRQS_SW_RESEND
> >> -  select IRQ_DOMAIN_NOMAP
> >>   
> >>   config PPC_XIVE_NATIVE
> >>bool
> > 
>
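
The shared-LSI breakage described in the commit message can be modelled in a few lines. This is a simplified sketch, not the genirq implementation: `struct demo_domain`, `demo_find_mapping()` and `demo_create_mapping()` are hypothetical names, and the `has_revmap` flag is an assumption used to contrast a domain that looks up existing mappings (as a 'tree' domain does) with one that skips the lookup.

```c
#include <stdbool.h>
#include <stddef.h>

#define DEMO_MAX_MAPPINGS 8

struct demo_mapping {
	unsigned long hwirq;
	unsigned int virq;
};

/* Hypothetical model of an IRQ domain; not the kernel's struct irq_domain. */
struct demo_domain {
	bool has_revmap;	/* true models a 'tree' domain, false one without lookup */
	struct demo_mapping map[DEMO_MAX_MAPPINGS];
	size_t count;
	unsigned int next_virq;
};

/* Return the Linux IRQ already mapped to hwirq, or 0 if none is found. */
static unsigned int demo_find_mapping(const struct demo_domain *d,
				      unsigned long hwirq)
{
	size_t i;

	if (!d->has_revmap)
		return 0;	/* no reverse map consulted: lookup always misses */

	for (i = 0; i < d->count; i++)
		if (d->map[i].hwirq == hwirq)
			return d->map[i].virq;
	return 0;
}

/* Map hwirq, reusing an existing mapping when the lookup finds one. */
unsigned int demo_create_mapping(struct demo_domain *d, unsigned long hwirq)
{
	unsigned int virq = demo_find_mapping(d, hwirq);

	if (virq)
		return virq;
	if (d->count == DEMO_MAX_MAPPINGS)
		return 0;

	virq = d->next_virq++;
	d->map[d->count].hwirq = hwirq;
	d->map[d->count].virq = virq;
	d->count++;
	return virq;
}
```

In this model, two devices sharing one hardware interrupt get the same Linux IRQ only when the lookup is in place; without it, the second request creates a second, distinct mapping.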

Re: [PATCH] powerpc/xive: Change IRQ domain to a tree domain

2021-11-16 Thread Greg Kurz

On Tue, 16 Nov 2021 14:40:22 +0100
Cédric Le Goater  wrote:

> Commit 4f86a06e2d6e ("irqdomain: Make normal and nomap irqdomains
> exclusive") introduced an IRQ_DOMAIN_FLAG_NO_MAP flag to isolate the
> 'nomap' domains still in use under the powerpc arch. With this new
> flag, the revmap_tree of the IRQ domain is not used anymore. This
> change broke the support of shared LSIs [1] in the XIVE driver because
> it was relying on a lookup in the revmap_tree to query previously
> mapped interrupts. Linux now creates two distinct IRQ mappings on the
> same HW IRQ which can lead to unexpected behavior in the drivers.
> 
> The XIVE IRQ domain is not a direct mapping domain and its HW IRQ
> interrupt number space is rather large : 1M/socket on POWER9 and
> POWER10, change the XIVE driver to use a 'tree' domain type instead.
> 
> [1] For instance, a linux KVM guest with virtio-rng and virtio-balloon
> devices.
> 
> Cc: Marc Zyngier 
> Cc: sta...@vger.kernel.org # v5.14+
> Fixes: 4f86a06e2d6e ("irqdomain: Make normal and nomap irqdomains exclusive")
> Signed-off-by: Cédric Le Goater 
> ---
> 

Tested-by: Greg Kurz 

with a KVM guest + virtio-rng + virtio-balloon on a POWER9 host.

>  Marc,
> 
>  The Fixes tag is there because the patch in question revealed that
>  something was broken in XIVE. genirq is not in cause. However, I
>  don't know for PS3 and Cell. May be less critical for now. 
>  
>  arch/powerpc/sysdev/xive/common.c | 3 +--
>  arch/powerpc/sysdev/xive/Kconfig  | 1 -
>  2 files changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/sysdev/xive/common.c 
> b/arch/powerpc/sysdev/xive/common.c
> index fed6fd16c8f4..9d0f0fe25598 100644
> --- a/arch/powerpc/sysdev/xive/common.c
> +++ b/arch/powerpc/sysdev/xive/common.c
> @@ -1536,8 +1536,7 @@ static const struct irq_domain_ops xive_irq_domain_ops = {
>  
>  static void __init xive_init_host(struct device_node *np)
>  {
> - xive_irq_domain = irq_domain_add_nomap(np, XIVE_MAX_IRQ,
> -                                        &xive_irq_domain_ops, NULL);
> + xive_irq_domain = irq_domain_add_tree(np, &xive_irq_domain_ops, NULL);
>   if (WARN_ON(xive_irq_domain == NULL))
>           return;
>   irq_set_default_host(xive_irq_domain);
> diff --git a/arch/powerpc/sysdev/xive/Kconfig 
> b/arch/powerpc/sysdev/xive/Kconfig
> index 97796c6b63f0..785c292d104b 100644
> --- a/arch/powerpc/sysdev/xive/Kconfig
> +++ b/arch/powerpc/sysdev/xive/Kconfig
> @@ -3,7 +3,6 @@ config PPC_XIVE
>   bool
>   select PPC_SMP_MUXED_IPI
>   select HARDIRQS_SW_RESEND
> - select IRQ_DOMAIN_NOMAP
>  
>  config PPC_XIVE_NATIVE
>   bool

Re: [PATCH v2] KVM: PPC: Defer vtime accounting 'til after IRQ handling

2021-10-07 Thread Greg Kurz

On Thu,  7 Oct 2021 16:28:56 +0200
Laurent Vivier  wrote:

> Commit 112665286d08 moved guest_exit() in the interrupt protected
> area to avoid wrong context warning (or worse), but the tick counter
> cannot be updated and the guest time is accounted to the system time.
> 
> To fix the problem port to POWER the x86 fix
> 160457140187 ("Defer vtime accounting 'til after IRQ handling"):
> 
> "Defer the call to account guest time until after servicing any IRQ(s)
>  that happened in the guest or immediately after VM-Exit.  Tick-based
>  accounting of vCPU time relies on PF_VCPU being set when the tick IRQ
>  handler runs, and IRQs are blocked throughout the main sequence of
>  vcpu_enter_guest(), including the call into vendor code to actually
>  enter and exit the guest."
> 
> Fixes: 112665286d08 ("KVM: PPC: Book3S HV: Context tracking exit guest context before enabling irqs")
> Cc: npig...@gmail.com
> Cc:  # 5.12
> Signed-off-by: Laurent Vivier 
> ---
> 
> Notes:
> v2: remove reference to commit 61bd0f66ff92
> cc stable 5.12
> add the same comment in the code as for x86
> 

Works for me. As you stated in your answer, someone can polish the
code later on.

Reviewed-by: Greg Kurz 

>  arch/powerpc/kvm/book3s_hv.c | 24 
>  1 file changed, 20 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 2acb1c96cfaf..a694d1a8f6ce 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -3695,6 +3695,8 @@ static noinline void kvmppc_run_core(struct kvmppc_vcore *vc)
>  
>   srcu_read_unlock(&vc->kvm->srcu, srcu_idx);
>  
> + context_tracking_guest_exit();
> +
>   set_irq_happened(trap);
>  
>   spin_lock(&vc->lock);
> @@ -3726,9 +3728,15 @@ static noinline void kvmppc_run_core(struct kvmppc_vcore *vc)
>  
>   kvmppc_set_host_core(pcpu);
>  
> - guest_exit_irqoff();
> -
>   local_irq_enable();
> + /*
> +  * Wait until after servicing IRQs to account guest time so that any
> +  * ticks that occurred while running the guest are properly accounted
> +  * to the guest.  Waiting until IRQs are enabled degrades the accuracy
> +  * of accounting via context tracking, but the loss of accuracy is
> +  * acceptable for all known use cases.
> +  */
> + vtime_account_guest_exit();
>  
>   /* Let secondaries go back to the offline loop */
>   for (i = 0; i < controlled_threads; ++i) {
> @@ -4506,13 +4514,21 @@ int kvmhv_run_single_vcpu(struct kvm_vcpu *vcpu, u64 time_limit,
>  
>   srcu_read_unlock(&kvm->srcu, srcu_idx);
>  
> + context_tracking_guest_exit();
> +
>   set_irq_happened(trap);
>  
>   kvmppc_set_host_core(pcpu);
>  
> - guest_exit_irqoff();
> -
>   local_irq_enable();
> + /*
> +  * Wait until after servicing IRQs to account guest time so that any
> +  * ticks that occurred while running the guest are properly accounted
> +  * to the guest.  Waiting until IRQs are enabled degrades the accuracy
> +  * of accounting via context tracking, but the loss of accuracy is
> +  * acceptable for all known use cases.
> +  */
> + vtime_account_guest_exit();
>  
>   cpumask_clear_cpu(pcpu, &kvm->arch.cpu_in_guest);
>
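
The ordering issue behind this patch can be sketched with a toy model. Everything here is an assumption made for illustration: `struct demo_cpu` and the `demo_*` functions are hypothetical, the `pf_vcpu` field stands in for the `PF_VCPU` task flag, and a single pending tick stands in for a timer interrupt that fired while IRQs were disabled around guest execution.

```c
#include <stdbool.h>

/* Hypothetical model of the per-CPU state relevant to tick accounting. */
struct demo_cpu {
	bool pf_vcpu;			/* models PF_VCPU on the current task */
	bool irqs_enabled;
	bool tick_pending;		/* tick raised while IRQs were off */
	unsigned long guest_ticks;
	unsigned long system_ticks;
};

/* Enabling IRQs delivers the pending tick, charged according to the flag. */
static void demo_local_irq_enable(struct demo_cpu *cpu)
{
	cpu->irqs_enabled = true;
	if (cpu->tick_pending) {
		cpu->tick_pending = false;
		if (cpu->pf_vcpu)
			cpu->guest_ticks++;
		else
			cpu->system_ticks++;
	}
}

/* Ordering before the patch: the flag is cleared while IRQs are still off. */
void demo_exit_old(struct demo_cpu *cpu)
{
	cpu->pf_vcpu = false;		/* models guest_exit_irqoff() */
	demo_local_irq_enable(cpu);
}

/* Ordering after the patch: the flag is cleared only after the tick ran. */
void demo_exit_new(struct demo_cpu *cpu)
{
	/* context tracking exit would happen here; it leaves the flag alone */
	demo_local_irq_enable(cpu);
	cpu->pf_vcpu = false;		/* models vtime_account_guest_exit() */
}
```

With the old ordering the pending tick is charged to system time; with the new ordering it is charged to the guest, which is the behaviour the quoted x86 commit message describes.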

Re: [PATCH] KVM: PPC: Defer vtime accounting 'til after IRQ handling

2021-10-06 Thread Greg Kurz

On Wed,  6 Oct 2021 09:37:45 +0200
Laurent Vivier  wrote:

> Commit 61bd0f66ff92 has moved guest_enter() out of the interrupt
> protected area to be able to have updated tick counters, but
> commit 112665286d08 moved back to this area to avoid wrong
> context warning (or worse).
> 
> None of them are correct, to fix the problem port to POWER
> the x86 fix 160457140187 ("KVM: x86: Defer vtime accounting 'til
> after IRQ handling"):
> 
> "Defer the call to account guest time until after servicing any IRQ(s)
>  that happened in the guest or immediately after VM-Exit.  Tick-based
>  accounting of vCPU time relies on PF_VCPU being set when the tick IRQ
>  handler runs, and IRQs are blocked throughout the main sequence of
>  vcpu_enter_guest(), including the call into vendor code to actually
>  enter and exit the guest."
> 
> Link: https://bugzilla.redhat.com/show_bug.cgi?id=2009312
> Fixes: 61bd0f66ff92 ("KVM: PPC: Book3S HV: Fix guest time accounting with VIRT_CPU_ACCOUNTING_GEN")

This patch was merged in linux 4.16 and thus is on the 4.16.y
stable branch and it was backported to stable 4.14.y. It would
make sense to Cc: stable # v4.14 also, but...

> Fixes: 112665286d08 ("KVM: PPC: Book3S HV: Context tracking exit guest context before enabling irqs")

... this one, which was merged in linux 5.12, was never backported
anywhere because it wasn't considered as a fix, as commented here:

https://lore.kernel.org/linuxppc-dev/1610793296.fjhomer31g.astr...@bobo.none/

AFAICT commit 61bd0f66ff92 was never mentioned anywhere in a bug. The
first Fixes: tag thus looks a bit misleading. I'd personally drop it
and Cc: stable # v5.12.

> Cc: npig...@gmail.com
> 
> Signed-off-by: Laurent Vivier 
> ---
>  arch/powerpc/kvm/book3s_hv.c | 10 ++
>  1 file changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 2acb1c96cfaf..43e1ce853785 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -3695,6 +3695,8 @@ static noinline void kvmppc_run_core(struct kvmppc_vcore *vc)
>  
>   srcu_read_unlock(&vc->kvm->srcu, srcu_idx);
>  
> + context_tracking_guest_exit();
> +
>   set_irq_happened(trap);
>  
>   spin_lock(&vc->lock);
> @@ -3726,9 +3728,8 @@ static noinline void kvmppc_run_core(struct kvmppc_vcore *vc)
>  
>   kvmppc_set_host_core(pcpu);
>  
> - guest_exit_irqoff();
> -


Change looks ok but it might be a bit confusing for the
occasional reader that guest_enter_irqoff() isn't matched
by a guest_exit_irqoff().

>   local_irq_enable();
> + vtime_account_guest_exit();
>  

Maybe add a comment like in x86 ?

>   /* Let secondaries go back to the offline loop */
>   for (i = 0; i < controlled_threads; ++i) {
> @@ -4506,13 +4507,14 @@ int kvmhv_run_single_vcpu(struct kvm_vcpu *vcpu, u64 time_limit,
>  
>   srcu_read_unlock(&kvm->srcu, srcu_idx);
>  
> + context_tracking_guest_exit();
> +
>   set_irq_happened(trap);
>  
>   kvmppc_set_host_core(pcpu);
>  
> - guest_exit_irqoff();
> -
>   local_irq_enable();
> + vtime_account_guest_exit();
>  
>   cpumask_clear_cpu(pcpu, &kvm->arch.cpu_in_guest);
>  

Same remarks. FWIW a followup was immediately added to x86 to
encapsulate the enter/exit logic in common helpers :

commit bc908e091b3264672889162733020048901021fb
Author: Sean Christopherson 
Date:   Tue May 4 17:27:35 2021 -0700

KVM: x86: Consolidate guest enter/exit logic to common helpers

This makes the code nicer. Maybe do something similar for POWER ?

Cheers,

--
Greg

Re: [PATCH] memblock: make for_each_mem_range() traverse MEMBLOCK_HOTPLUG regions

2021-07-12 Thread Greg Kurz

On Mon, 12 Jul 2021 10:11:32 +0300
Mike Rapoport  wrote:

> From: Mike Rapoport 
> 
> Commit b10d6bca8720 ("arch, drivers: replace for_each_membock() with
> for_each_mem_range()") didn't take into account that when there is
> movable_node parameter in the kernel command line, for_each_mem_range()
> would skip ranges marked with MEMBLOCK_HOTPLUG.
> 
> The page table setup code in POWER uses for_each_mem_range() to create the
> linear mapping of the physical memory and since the regions marked as
> MEMORY_HOTPLUG are skipped, they never make it to the linear map.
> 
> A later access to the memory in those ranges will fail:
> 
> [2.271743] BUG: Unable to handle kernel data access on write at 
> 0xc004
> [2.271984] Faulting instruction address: 0xc008a3c0
> [2.272568] Oops: Kernel access of bad area, sig: 11 [#1]
> [2.272683] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
> [2.273063] Modules linked in:
> [2.273435] CPU: 0 PID: 53 Comm: kworker/u2:0 Not tainted 5.13.0 #7
> [2.273832] NIP:  c008a3c0 LR: c03c1ed8 CTR: 
> 0040
> [2.273918] REGS: c8a57770 TRAP: 0300   Not tainted  (5.13.0)
> [2.274036] MSR:  82009033   CR: 
> 8402  XER: 2004
> [2.274454] CFAR: c03c1ed4 DAR: c004 DSISR: 4200 
> IRQMASK: 0
> [2.274454] GPR00: c03c1ed8 c8a57a10 c19da700 
> c004
> [2.274454] GPR04: 0280 0180 0400 
> 0200
> [2.274454] GPR08: 0100 0080 0040 
> 0300
> [2.274454] GPR12: 0380 c1bc c01660c8 
> c6337e00
> [2.274454] GPR16:    
> 
> [2.274454] GPR20: 4000 2000 c1a81990 
> c8c3
> [2.274454] GPR24: c8c2 c1a81998 000f 
> c1a819a0
> [2.274454] GPR28: c1a81908 c00c0100 c8c4 
> c8a64680
> [2.275520] NIP [c008a3c0] clear_user_page+0x50/0x80
> [2.276333] LR [c03c1ed8] __handle_mm_fault+0xc88/0x1910
> [2.276688] Call Trace:
> [2.276839] [c8a57a10] [c03c1e94] 
> __handle_mm_fault+0xc44/0x1910 (unreliable)
> [2.277142] [c8a57af0] [c03c2c90] 
> handle_mm_fault+0x130/0x2a0
> [2.277331] [c8a57b40] [c03b5f08] 
> __get_user_pages+0x248/0x610
> [2.277541] [c8a57c40] [c03b848c] 
> __get_user_pages_remote+0x12c/0x3e0
> [2.277768] [c8a57cd0] [c0473f24] get_arg_page+0x54/0xf0
> [2.277959] [c8a57d10] [c0474a7c] 
> copy_string_kernel+0x11c/0x210
> [2.278159] [c8a57d80] [c047663c] kernel_execve+0x16c/0x220
> [2.278361] [c8a57dd0] [c0166270] 
> call_usermodehelper_exec_async+0x1b0/0x2f0
> [2.278543] [c8a57e10] [c000d5ec] 
> ret_from_kernel_thread+0x5c/0x70
> [2.278870] Instruction dump:
> [2.279214] 79280fa4 79271764 79261f24 794ae8e2 7ca94214 7d683a14 7c893a14 
> 7d893050
> [2.279416] 7d4903a6 6000 6000 6000 <7c001fec> 7c091fec 
> 7c081fec 7c051fec
> [2.280193] ---[ end trace 490b8c67e6075e09 ]---
> 
> Making for_each_mem_range() include MEMBLOCK_HOTPLUG regions in the
> traversal fixes this issue.
> 
> Link: https://bugzilla.redhat.com/show_bug.cgi?id=1976100
> Fixes: b10d6bca8720 ("arch, drivers: replace for_each_membock() with for_each_mem_range()")
> Signed-off-by: Mike Rapoport 
> ---

This fixes the issue I was observing with both radix and hash.

Thanks !

Tested-by: Greg Kurz 

Cc'ing linuxppc-dev so that POWER folks know about the fix
and stable.

Cc: sta...@vger.kernel.org # v5.10

>  include/linux/memblock.h | 4 ++--
>  mm/memblock.c| 3 ++-
>  2 files changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index cbf46f56d105..4a53c3ca86bd 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> @@ -209,7 +209,7 @@ static inline void __next_physmem_range(u64 *idx, struct memblock_type *type,
>   */
>  #define for_each_mem_range(i, p_start, p_end) \
>   __for_each_mem_range(i, &memblock.memory, NULL, NUMA_NO_NODE,   \
> -  MEMBLOCK_NONE, p_start, p_end, NULL)
> +  MEMBLOCK_HOTPLUG, p_start, p_end, NULL)
>  
>  /**
>   * for_each_mem_range_rev - reverse iterate through memblock areas from
> @@ -220,7 +220,7 @@ static inline void __next_physmem_ra
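
The effect of the flags argument can be sketched as follows. This is a simplified model under stated assumptions, not the memblock code: `struct demo_region`, `demo_should_skip()` and `demo_mapped_bytes()` are hypothetical names, and the skip rule is reduced to the single condition the commit message describes (hotpluggable regions are skipped when `movable_node` is set and the caller did not ask for them).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag values modelled on MEMBLOCK_NONE / MEMBLOCK_HOTPLUG. */
enum demo_flags {
	DEMO_NONE = 0x0,
	DEMO_HOTPLUG = 0x1,
};

struct demo_region {
	unsigned long long base;
	unsigned long long size;
	unsigned int flags;
};

/* Skip hotpluggable regions unless the caller passed DEMO_HOTPLUG. */
static bool demo_should_skip(const struct demo_region *r, unsigned int wanted,
			     bool movable_node)
{
	return movable_node && (r->flags & DEMO_HOTPLUG) &&
	       !(wanted & DEMO_HOTPLUG);
}

/* Sum the sizes of all regions the iteration would visit. */
unsigned long long demo_mapped_bytes(const struct demo_region *regions,
				     size_t n, unsigned int wanted,
				     bool movable_node)
{
	unsigned long long total = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (demo_should_skip(&regions[i], wanted, movable_node))
			continue;
		total += regions[i].size;
	}
	return total;
}
```

In this model, a caller that builds the linear mapping with `DEMO_NONE` misses the hotpluggable region whenever `movable_node` is set, while passing `DEMO_HOTPLUG` visits every region.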

Re: [PATCH] powerpc: Fix initrd corruption with relative jump labels

2021-06-15 Thread Greg Kurz

On Mon, 14 Jun 2021 17:57:40 +0200
Greg Kurz  wrote:

> On Mon, 14 Jun 2021 23:14:40 +1000
> Michael Ellerman  wrote:
> 
> > Commit b0b3b2c78ec0 ("powerpc: Switch to relative jump labels") switched
> > us to using relative jump labels. That involves changing the code,
> > target and key members in struct jump_entry to be relative to the
> > address of the jump_entry, rather than absolute addresses.
> > 
> > We have two static inlines that create a struct jump_entry,
> > arch_static_branch() and arch_static_branch_jump(), as well as an asm
> > macro ARCH_STATIC_BRANCH, which is used by the pseries-only hypervisor
> > tracing code.
> > 
> > Unfortunately we missed updating the key to be a relative reference in
> > ARCH_STATIC_BRANCH.
> > 
> > That causes a pseries kernel to have a handful of jump_entry structs
> > with bad key values. Instead of being a relative reference they instead
> > hold the full address of the key.
> > 
> > However the code doesn't expect that, it still adds the key value to the
> > address of the jump_entry (see jump_entry_key()) expecting to get a
> > pointer to a key somewhere in kernel data.
> > 
> > The table of jump_entry structs sits in rodata, which comes after the
> > kernel text. In a typical build this will be somewhere around 15MB. The
> > address of the key will be somewhere in data, typically around 20MB.
> > Adding the two values together gets us a pointer somewhere around 45MB.
> > 
> > We then call static_key_set_entries() with that bad pointer and modify
> > some members of the struct static_key we think we are pointing at.
> > 
> > A pseries kernel is typically ~30MB in size, so writing to ~45MB won't
> > corrupt the kernel itself. However if we're booting with an initrd,
> > depending on the size and exact location of the initrd, we can corrupt
> > the initrd. Depending on how exactly we corrupt the initrd it can either
> > cause the system to not boot, or just corrupt one of the files in the
> > initrd.
> > 
> > The fix is simply to make the key value relative to the jump_entry
> > struct in the ARCH_STATIC_BRANCH macro.
> > 
> > Fixes: b0b3b2c78ec0 ("powerpc: Switch to relative jump labels")
> > Reported-by: Anastasia Kovaleva 
> > Reported-by: Roman Bolshakov 
> > Reported-by: Greg Kurz 
> > Reported-by: Daniel Axtens 
> > Signed-off-by: Michael Ellerman 
> > ---
> 
> Great thanks for debugging this issue ! I'll try it out tomorrow morning.
> 

This fixes the issue. Great thanks again :)

Tested-by: Greg Kurz 

> Cheers,
> 
> --
> Greg
> 
> >  arch/powerpc/include/asm/jump_label.h | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/arch/powerpc/include/asm/jump_label.h 
> > b/arch/powerpc/include/asm/jump_label.h
> > index 2d5c6bec2b4f..93ce3ec25387 100644
> > --- a/arch/powerpc/include/asm/jump_label.h
> > +++ b/arch/powerpc/include/asm/jump_label.h
> > @@ -50,7 +50,7 @@ static __always_inline bool arch_static_branch_jump(struct static_key *key, bool
> >  1098:  nop;\
> > .pushsection __jump_table, "aw";\
> > .long 1098b - ., LABEL - .; \
> > -   FTR_ENTRY_LONG KEY; \
> > +   FTR_ENTRY_LONG KEY - .; \
> > .popsection
> >  #endif
> >  
>
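
The pointer arithmetic that goes wrong can be sketched with plain integers. This is a simplified model, not the kernel's `struct jump_entry` or `jump_entry_key()`: the `demo_*` names are hypothetical, the `self` field is an assumption standing in for the address of the entry, and the addresses used with it are arbitrary illustration values.

```c
#include <stdint.h>

/* Hypothetical model of one jump table entry. */
struct demo_jump_entry {
	uint64_t self;	/* address the stored key is resolved against */
	int64_t key;	/* stored value: expected to be key_addr - self */
};

/* Resolve the key the way relative jump labels expect: base plus offset. */
uint64_t demo_entry_key(const struct demo_jump_entry *e)
{
	return e->self + (uint64_t)e->key;
}

/* Correct emission: store the key as an offset from the entry. */
struct demo_jump_entry demo_emit_relative(uint64_t self, uint64_t key_addr)
{
	struct demo_jump_entry e = { self, (int64_t)(key_addr - self) };

	return e;
}

/* Buggy emission: store the full key address where an offset is expected. */
struct demo_jump_entry demo_emit_absolute(uint64_t self, uint64_t key_addr)
{
	struct demo_jump_entry e = { self, (int64_t)key_addr };

	return e;
}
```

Resolving an entry built by `demo_emit_absolute()` yields the key address plus the entry address, which points past the real key. That is the kind of stray pointer the commit message describes being written through.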

Re: [PATCH] powerpc: Fix initrd corruption with relative jump labels

2021-06-14 Thread Greg Kurz

On Mon, 14 Jun 2021 23:14:40 +1000
Michael Ellerman  wrote:

> Commit b0b3b2c78ec0 ("powerpc: Switch to relative jump labels") switched
> us to using relative jump labels. That involves changing the code,
> target and key members in struct jump_entry to be relative to the
> address of the jump_entry, rather than absolute addresses.
> 
> We have two static inlines that create a struct jump_entry,
> arch_static_branch() and arch_static_branch_jump(), as well as an asm
> macro ARCH_STATIC_BRANCH, which is used by the pseries-only hypervisor
> tracing code.
> 
> Unfortunately we missed updating the key to be a relative reference in
> ARCH_STATIC_BRANCH.
> 
> That causes a pseries kernel to have a handful of jump_entry structs
> with bad key values. Instead of being a relative reference they instead
> hold the full address of the key.
> 
> However the code doesn't expect that, it still adds the key value to the
> address of the jump_entry (see jump_entry_key()) expecting to get a
> pointer to a key somewhere in kernel data.
> 
> The table of jump_entry structs sits in rodata, which comes after the
> kernel text. In a typical build this will be somewhere around 15MB. The
> address of the key will be somewhere in data, typically around 20MB.
> Adding the two values together gets us a pointer somewhere around 45MB.
> 
> We then call static_key_set_entries() with that bad pointer and modify
> some members of the struct static_key we think we are pointing at.
> 
> A pseries kernel is typically ~30MB in size, so writing to ~45MB won't
> corrupt the kernel itself. However if we're booting with an initrd,
> depending on the size and exact location of the initrd, we can corrupt
> the initrd. Depending on how exactly we corrupt the initrd it can either
> cause the system to not boot, or just corrupt one of the files in the
> initrd.
> 
> The fix is simply to make the key value relative to the jump_entry
> struct in the ARCH_STATIC_BRANCH macro.
> 
> Fixes: b0b3b2c78ec0 ("powerpc: Switch to relative jump labels")
> Reported-by: Anastasia Kovaleva 
> Reported-by: Roman Bolshakov 
> Reported-by: Greg Kurz 
> Reported-by: Daniel Axtens 
> Signed-off-by: Michael Ellerman 
> ---

Great thanks for debugging this issue ! I'll try it out tomorrow morning.

Cheers,

--
Greg

>  arch/powerpc/include/asm/jump_label.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/include/asm/jump_label.h 
> b/arch/powerpc/include/asm/jump_label.h
> index 2d5c6bec2b4f..93ce3ec25387 100644
> --- a/arch/powerpc/include/asm/jump_label.h
> +++ b/arch/powerpc/include/asm/jump_label.h
> @@ -50,7 +50,7 @@ static __always_inline bool arch_static_branch_jump(struct static_key *key, bool
>  1098:nop;\
>   .pushsection __jump_table, "aw";\
>   .long 1098b - ., LABEL - .; \
> - FTR_ENTRY_LONG KEY; \
> + FTR_ENTRY_LONG KEY - .; \
>   .popsection
>  #endif
>

Re: [PATCH] Revert "powerpc: Switch to relative jump labels"

2021-06-07 Thread Greg Kurz

On Tue, 01 Jun 2021 17:36:15 +1000
Michael Ellerman  wrote:

> Roman Bolshakov  writes:
> > On Sat, May 29, 2021 at 09:39:49AM +1000, Michael Ellerman wrote:
> >> Roman Bolshakov  writes:
> >> > This reverts commit b0b3b2c78ec075cec4721986a95abbbac8c3da4f.
> >> >
> >> > Otherwise, direct kernel boot with initramfs no longer works in QEMU.
> >> > It's broken in some bizarre way because a valid initramfs is not
> >> > recognized anymore:
> >> >
> >> >   Found initrd at 0xc1f7:0xc3d61d64
> >> >   rootfs image is not initramfs (XZ-compressed data is corrupt); looks 
> >> > like an initrd
> >> >
> >> > The issue is observed on v5.13-rc3 if the kernel is built with
> >> > defconfig, GCC 7.5.0 and GNU ld 2.32.0.
> >> 
> >> Are you able to try a different compiler?
> >
> > Hi Michael,
> >
> > I've just tried GCC 9.3.1 and the result is the same.
> >
> > The offending patch has assembly inlines, they typically go through
> > binutils/GAS and it might also be a case when older binutils doesn't
> > implement something properly (i've seen this on x86 and arm).
> 
> Jump labels use asm goto, which is a compiler feature, but you're right
> that the binutils version could also be important.
> 
> What ld versions have you tried?
> 
> And are those the toolchains from kernel.org or somewhere else?
> 
> >> I test booting qemu constantly, but I don't use GCC 7.5.
> >>
> >> And what qemu version are you using?
> >> 
> >
> > QEMU 3.1.1, but I've also tried 6.0.50 (QEMU master, 62c0ac5041e913) and
> > it fails the same way.
> 
> OK.
> 
> >> I assume your initramfs is compressed with XZ? How large is it
> >> compressed?
> >> 
> >
> > Yes, XZ. initramfs size is 30 MB (around 100 MB cpio size).
> >
> > It's interesting that the issue doesn't happen if I pass initramfs from
> > host (11MB), then the initramfs can be recognized. It might be related
> > to initramfs size then and bigger initramfs that used to work no longer
> > work with v5.13-rc3.
> 
> Are you using qemu's -initrd option to pass the initramfs, or are you
> building the initramfs into the kernel?
> 

Hi Michael,

I'm hitting the same issue while trying to boot a RHEL9 guest with
the distro's default kernel/initramfs and grub.

Interestingly this doesn't happen with older QEMU, e.g. 4.2.0 that
is shipped with RHEL8. I've bisected to this commit from the
QEMU 5.0 era :


commit 8897ea5a9fc0aafa5ed7eee1e0c49893b91a2d87
Author: David Gibson 
Date:   Thu Nov 28 16:37:04 2019 +1100

spapr: Don't attempt to clamp RMA to VRMA constraint


This mostly changes how memory is presented in the FDT.

Before 8897ea5a9fc, for a VM with 1 gig of RAM, we had several nodes,
first one being the VRMA (limited to 256 megs).

memory@2000 {
ibm,associativity = <0x04 0x00 0x00 0x00 0x00>;
reg = <0x00 0x2000 0x00 0x2000>;
device_type = "memory";
};

memory@1000 {
ibm,associativity = <0x04 0x00 0x00 0x00 0x00>;
reg = <0x00 0x1000 0x00 0x1000>;
device_type = "memory";
};

memory@0 {
ibm,associativity = <0x04 0x00 0x00 0x00 0x00>;
reg = <0x00 0x00 0x00 0x1000>;
device_type = "memory";
};


Now we have a single node for all RAM:

memory@0 {
ibm,associativity = <0x04 0x00 0x00 0x00 0x00>;
reg = <0x00 0x00 0x00 0x4000>;
device_type = "memory";
};

If I set an arbitrary constraint again on the VRMA, I get the
multiple memory nodes back and, depending on the value, the
boot succeeds. In my 1 gig RHEL9 guest case, I need to set
a VRMA size <= 0x3200.

Not sure how this can relate to the initramfs though. I just see
that grub doesn't map it at the same place:

0x0310 when boot fails

0x0f00 when boot succeeds

In case this rings a bell...

> > So, I've created a small initramfs using only static busybox (2.7M
> > uncompressed, 960K compressed with xz). No error is produced and it
> > boots fine.
> >
> > If I add a dummy file (11M off /dev/urandom) to the small busybox
> > initramfs, it boots and the init is started but I'm seeing the error:
> >
> >   rootfs image is not initramfs (XZ-compressed data is corrupt); looks like 
> > an initrd
> >
> > sha1sum of the file inside initramfs doesn't match sha1sum on the host.
> >
> >   guest # sha1sum dummy
> >   407c347e671ddd00f69df12b3368048bad0ebf0c  dummy
> >   # QEMU: Terminated
> >   host $ sha1sum dummy
> >   ed8494b3eecab804960ceba2c497270eed0b0cd1  dummy
> >
> > sha1sum is the same in the guest and on the host for 10M dummy file:
> >
> >   guest # sha1sum dummy
> >   43855f7a772a28cce91da9eb8f86f53bc807631f  dummy
> >   # QEMU: Terminated
> >   host $ sha1sum dummy
> >   43855f7a772a28cce91da9eb8f86f53bc807631f  dummy
> >
> > That might explain why bigger initramfs (or initramfs with bigger files)
> > doesn't boot -

Re: [PATCH] Revert "powerpc: Switch to relative jump labels"

2021-05-28 Thread Greg Kurz

On Fri, 28 May 2021 04:29:43 +0300
Roman Bolshakov  wrote:

> This reverts commit b0b3b2c78ec075cec4721986a95abbbac8c3da4f.
> 
> Otherwise, direct kernel boot with initramfs no longer works in QEMU.
> It's broken in some bizarre way because a valid initramfs is not
> recognized anymore:
> 
>   Found initrd at 0xc1f7:0xc3d61d64
>   rootfs image is not initramfs (XZ-compressed data is corrupt); looks like an initrd
> 
> The issue is observed on v5.13-rc3 if the kernel is built with
> defconfig, GCC 7.5.0 and GNU ld 2.32.0.
> 
> Cc: Christophe Leroy 
> Reported-by: Anastasia Kovaleva 
> Signed-off-by: Roman Bolshakov 
> ---

I'm observing the very same issue and reverting the offending commit
fixes it indeed. Until someone has investigated the root cause, this
looks like a reasonable bug fix to me.

Reviewed-by: Greg Kurz 

and

Tested-by: Greg Kurz 

>  arch/powerpc/Kconfig  |  1 -
>  arch/powerpc/include/asm/jump_label.h | 21 +++--
>  arch/powerpc/kernel/jump_label.c  |  4 ++--
>  3 files changed, 17 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index 088dd2afcfe4..59e0d55ee01d 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -189,7 +189,6 @@ config PPC
>   select HAVE_ARCH_HUGE_VMALLOC   if HAVE_ARCH_HUGE_VMAP
>   select HAVE_ARCH_HUGE_VMAP  if PPC_BOOK3S_64 && 
> PPC_RADIX_MMU
>   select HAVE_ARCH_JUMP_LABEL
> - select HAVE_ARCH_JUMP_LABEL_RELATIVE
>   select HAVE_ARCH_KASAN  if PPC32 && PPC_PAGE_SHIFT <= 14
>   select HAVE_ARCH_KASAN_VMALLOC  if PPC32 && PPC_PAGE_SHIFT <= 14
>   select HAVE_ARCH_KFENCE if PPC32
> diff --git a/arch/powerpc/include/asm/jump_label.h 
> b/arch/powerpc/include/asm/jump_label.h
> index 2d5c6bec2b4f..09297ec9fa52 100644
> --- a/arch/powerpc/include/asm/jump_label.h
> +++ b/arch/powerpc/include/asm/jump_label.h
> @@ -20,8 +20,7 @@ static __always_inline bool arch_static_branch(struct 
> static_key *key, bool bran
>   asm_volatile_goto("1:\n\t"
>"nop # arch_static_branch\n\t"
>".pushsection __jump_table,  \"aw\"\n\t"
> -  ".long 1b - ., %l[l_yes] - .\n\t"
> -  JUMP_ENTRY_TYPE "%c0 - .\n\t"
> +  JUMP_ENTRY_TYPE "1b, %l[l_yes], %c0\n\t"
>".popsection \n\t"
>: :  "i" (&((char *)key)[branch]) : : l_yes);
>  
> @@ -35,8 +34,7 @@ static __always_inline bool arch_static_branch_jump(struct 
> static_key *key, bool
>   asm_volatile_goto("1:\n\t"
>"b %l[l_yes] # arch_static_branch_jump\n\t"
>".pushsection __jump_table,  \"aw\"\n\t"
> -  ".long 1b - ., %l[l_yes] - .\n\t"
> -  JUMP_ENTRY_TYPE "%c0 - .\n\t"
> +  JUMP_ENTRY_TYPE "1b, %l[l_yes], %c0\n\t"
>".popsection \n\t"
>: :  "i" (&((char *)key)[branch]) : : l_yes);
>  
> @@ -45,12 +43,23 @@ static __always_inline bool 
> arch_static_branch_jump(struct static_key *key, bool
>   return true;
>  }
>  
> +#ifdef CONFIG_PPC64
> +typedef u64 jump_label_t;
> +#else
> +typedef u32 jump_label_t;
> +#endif
> +
> +struct jump_entry {
> + jump_label_t code;
> + jump_label_t target;
> + jump_label_t key;
> +};
> +
>  #else
>  #define ARCH_STATIC_BRANCH(LABEL, KEY)   \
>  1098:nop;\
>   .pushsection __jump_table, "aw";\
> - .long 1098b - ., LABEL - .; \
> - FTR_ENTRY_LONG KEY; \
> + FTR_ENTRY_LONG 1098b, LABEL, KEY;   \
>   .popsection
>  #endif
>  
> diff --git a/arch/powerpc/kernel/jump_label.c 
> b/arch/powerpc/kernel/jump_label.c
> index ce87dc5ea23c..144858027fa3 100644
> --- a/arch/powerpc/kernel/jump_label.c
> +++ b/arch/powerpc/kernel/jump_label.c
> @@ -11,10 +11,10 @@
>  void arch_jump_label_transform(struct jump_entry *entry,
>  enum jump_label_type type)
>  {
> - struct ppc_inst *addr = (struct ppc_inst *)jump_entry_code(entry);
> + struct ppc_inst *addr = (struct ppc_inst *)(unsigned long)entry->code;
>  
>   if (type == JUMP_LABEL_JMP)
> - patch_branch(addr, jump_entry_target(entry), 0);
> + patch_branch(addr, entry->target, 0);
>   else
>   patch_instruction(addr, ppc_inst(PPC_INST_NOP));
>  }

Re: [PATCH] vfio/pci: Revert nvlink removal uAPI breakage

2021-05-04 Thread Greg Kurz

On Tue, 04 May 2021 09:52:02 -0600
Alex Williamson  wrote:

> Revert the uAPI changes from the below commit with notice that these
> regions and capabilities are no longer provided.
> 
> Fixes: b392a1989170 ("vfio/pci: remove vfio_pci_nvlink2")
> Reported-by: Greg Kurz 
> Signed-off-by: Alex Williamson 
> ---
> 
> Greg (Kurz), please double check this resolves the issue.  Thanks!
> 

It does. Feel free to add:

Reviewed-by: Greg Kurz 

and

Tested-by: Greg Kurz 

Thanks for the quick fix.

Cheers,

--
Greg

>  include/uapi/linux/vfio.h |   46 
> +
>  1 file changed, 42 insertions(+), 4 deletions(-)
> 
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 34b1f53a3901..ef33ea002b0b 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -333,10 +333,21 @@ struct vfio_region_info_cap_type {
>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG(3)
>  
>  /* 10de vendor PCI sub-types */
> -/* subtype 1 was VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2_RAM, don't use */
> +/*
> + * NVIDIA GPU NVlink2 RAM is coherent RAM mapped onto the host address space.
> + *
> + * Deprecated, region no longer provided
> + */
> +#define VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2_RAM   (1)
>  
>  /* 1014 vendor PCI sub-types */
> -/* subtype 1 was VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD, don't use */
> +/*
> + * IBM NPU NVlink2 ATSD (Address Translation Shootdown) register of NPU
> + * to do TLB invalidation on a GPU.
> + *
> + * Deprecated, region no longer provided
> + */
> +#define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD (1)
>  
>  /* sub-types for VFIO_REGION_TYPE_GFX */
>  #define VFIO_REGION_SUBTYPE_GFX_EDID(1)
> @@ -630,9 +641,36 @@ struct vfio_device_migration_info {
>   */
>  #define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE   3
>  
> -/* subtype 4 was VFIO_REGION_INFO_CAP_NVLINK2_SSATGT, don't use */
> +/*
> + * Capability with compressed real address (aka SSA - small system address)
> + * where GPU RAM is mapped on a system bus. Used by a GPU for DMA routing
> + * and by the userspace to associate a NVLink bridge with a GPU.
> + *
> + * Deprecated, capability no longer provided
> + */
> +#define VFIO_REGION_INFO_CAP_NVLINK2_SSATGT  4
> +
> +struct vfio_region_info_cap_nvlink2_ssatgt {
> + struct vfio_info_cap_header header;
> + __u64 tgt;
> +};
>  
> -/* subtype 5 was VFIO_REGION_INFO_CAP_NVLINK2_LNKSPD, don't use */
> +/*
> + * Capability with an NVLink link speed. The value is read by
> + * the NVlink2 bridge driver from the bridge's "ibm,nvlink-speed"
> + * property in the device tree. The value is fixed in the hardware
> + * and failing to provide the correct value results in the link
> + * not working with no indication from the driver why.
> + *
> + * Deprecated, capability no longer provided
> + */
> +#define VFIO_REGION_INFO_CAP_NVLINK2_LNKSPD  5
> +
> +struct vfio_region_info_cap_nvlink2_lnkspd {
> + struct vfio_info_cap_header header;
> + __u32 link_speed;
> + __u32 __pad;
> +};
>  
>  /**
>   * VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9,
> 
>

Re: remove the nvlink2 pci_vfio subdriver v2

2021-05-04 Thread Greg Kurz

On Tue, 4 May 2021 15:30:15 +0200
Greg Kroah-Hartman  wrote:

> On Tue, May 04, 2021 at 03:20:34PM +0200, Greg Kurz wrote:
> > On Tue, 4 May 2021 14:59:07 +0200
> > Greg Kroah-Hartman  wrote:
> > 
> > > On Tue, May 04, 2021 at 02:22:36PM +0200, Greg Kurz wrote:
> > > > On Fri, 26 Mar 2021 07:13:09 +0100
> > > > Christoph Hellwig  wrote:
> > > > 
> > > > > Hi all,
> > > > > 
> > > > > the nvlink2 vfio subdriver is a weird beast.  It supports a hardware
> > > > > feature without any open source component - what would normally be
> > > > > the normal open source userspace that we require for kernel drivers,
> > > > > although in this particular case user space could of course be a
> > > > > kernel driver in a VM.  It also happens to be a complete mess that
> > > > > does not properly bind to PCI IDs, is hacked into the vfio_pci driver
> > > > > and also pulles in over 1000 lines of code always build into powerpc
> > > > > kernels that have Power NV support enabled.  Because of all these
> > > > > issues and the lack of breaking userspace when it is removed I think
> > > > > the best idea is to simply kill.
> > > > > 
> > > > > Changes since v1:
> > > > >  - document the removed subtypes as reserved
> > > > >  - add the ACK from Greg
> > > > > 
> > > > > Diffstat:
> > > > >  arch/powerpc/platforms/powernv/npu-dma.c |  705 
> > > > > ---
> > > > >  b/arch/powerpc/include/asm/opal.h|3 
> > > > >  b/arch/powerpc/include/asm/pci-bridge.h  |1 
> > > > >  b/arch/powerpc/include/asm/pci.h |7 
> > > > >  b/arch/powerpc/platforms/powernv/Makefile|2 
> > > > >  b/arch/powerpc/platforms/powernv/opal-call.c |2 
> > > > >  b/arch/powerpc/platforms/powernv/pci-ioda.c  |  185 ---
> > > > >  b/arch/powerpc/platforms/powernv/pci.c   |   11 
> > > > >  b/arch/powerpc/platforms/powernv/pci.h   |   17 
> > > > >  b/arch/powerpc/platforms/pseries/pci.c   |   23 
> > > > >  b/drivers/vfio/pci/Kconfig   |6 
> > > > >  b/drivers/vfio/pci/Makefile  |1 
> > > > >  b/drivers/vfio/pci/vfio_pci.c|   18 
> > > > >  b/drivers/vfio/pci/vfio_pci_private.h|   14 
> > > > >  b/include/uapi/linux/vfio.h  |   38 -
> > > > 
> > > > 
> > > > Hi Christoph,
> > > > 
> > > > FYI, these uapi changes break build of QEMU.
> > > 
> > > What uapi changes?
> > > 
> > 
> > All macros and structure definitions that are being removed
> > from include/uapi/linux/vfio.h by patch 1.
> > 
> > > What exactly breaks?
> > > 
> > 
> > These macros and types are used by the current QEMU code base.
> > Next time the QEMU source tree updates its copy of the kernel
> > headers, the compilation of affected code will fail.
> 
> So does QEMU use this api that is being removed, or does it just have
> some odd build artifacts of the uapi things?
> 

These are region subtype definitions and the associated capabilities.
QEMU basically gets information on VFIO regions from the kernel
driver and for those regions with a nvlink2 subtype, it tries
to extract some more nvlink2 related info.
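
For illustration, extracting such info boils down to walking a chain of capabilities attached to the region info. The sketch below assumes the usual VFIO layout, where every capability starts with a header holding an id and the offset of the next capability from the start of the info buffer (0 ends the chain). The struct is a local mirror of struct vfio_info_cap_header so that the example builds without kernel headers, and find_cap() is a hypothetical helper, not QEMU's actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of struct vfio_info_cap_header (layout assumed). */
struct cap_header {
    uint16_t id;      /* capability id, e.g. 4 or 5 for the nvlink2 ones */
    uint16_t version;
    uint32_t next;    /* offset of the next capability, 0 ends the chain */
};

/*
 * Hypothetical helper: return the offset of capability 'id' in 'buf',
 * starting the walk at offset 'first'. Returns 0 when the capability is
 * absent or the chain runs outside the buffer.
 */
static size_t find_cap(const uint8_t *buf, size_t len, size_t first,
                       uint16_t id)
{
    size_t off = first;
    size_t hops = 0;

    assert(buf != NULL);

    while (off != 0 && off <= len &&
           len - off >= sizeof(struct cap_header)) {
        struct cap_header hdr;

        /* Copy out the header to avoid unaligned accesses. */
        memcpy(&hdr, buf + off, sizeof(hdr));
        if (hdr.id == id)
            return off;

        /* A sane chain cannot have more headers than fit in the buffer. */
        if (++hops > len / sizeof(hdr))
            break;
        off = hdr.next;
    }
    return 0;
}
```

A caller would pass the buffer returned by the region info ioctl together with its cap_offset field as 'first', then cast the result to the capability-specific structure.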

> What exactly is the error messages here?
> 

[55/143] Compiling C object libqemu-ppc64-softmmu.fa.p/hw_vfio_pci-quirks.c.o
FAILED: libqemu-ppc64-softmmu.fa.p/hw_vfio_pci-quirks.c.o 
cc -Ilibqemu-ppc64-softmmu.fa.p -I. -I../.. -Itarget/ppc -I../../target/ppc 
-I../../capstone/include/capstone -Iqapi -Itrace -Iui -Iui/shader 
-I/usr/include/pixman-1 -I/usr/include/glib-2.0 -I/usr/lib64/glib-2.0/include 
-fdiagnostics-color=auto -pipe -Wall -Winvalid-pch -Werror -std=gnu99 -O2 -g 
-isystem /home/greg/Work/qemu/qemu-virtiofs/linux-headers -isystem 
linux-headers -iquote . -iquote /home/greg/Work/qemu/qemu-virtiofs -iquote 
/home/greg/Work/qemu/qemu-virtiofs/include -iquote 
/home/greg/Work/qemu/qemu-virtiofs/disas/libvixl -iquote 
/home/greg/Work/qemu/qemu-virtiofs/tcg/ppc -iquote 
/home/greg/Work/qemu/qemu-virtiofs/accel/tcg -pthread -U_FORTIFY_SOURCE 
-D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE 
-Wstrict-prototypes -Wredundant-decls -Wundef -Wwrite-strings 
-Wmissing-prototypes -fno-strict-aliasing -fno-common -fwrapv 
-Wold-style-declaration -Wold-style-defin

Re: remove the nvlink2 pci_vfio subdriver v2

2021-05-04 Thread Greg Kurz

On Tue, 4 May 2021 14:59:07 +0200
Greg Kroah-Hartman  wrote:

> On Tue, May 04, 2021 at 02:22:36PM +0200, Greg Kurz wrote:
> > On Fri, 26 Mar 2021 07:13:09 +0100
> > Christoph Hellwig  wrote:
> > 
> > > Hi all,
> > > 
> > > the nvlink2 vfio subdriver is a weird beast.  It supports a hardware
> > > feature without any open source component - what would normally be
> > > the normal open source userspace that we require for kernel drivers,
> > > although in this particular case user space could of course be a
> > > kernel driver in a VM.  It also happens to be a complete mess that
> > > does not properly bind to PCI IDs, is hacked into the vfio_pci driver
> > > and also pulles in over 1000 lines of code always build into powerpc
> > > kernels that have Power NV support enabled.  Because of all these
> > > issues and the lack of breaking userspace when it is removed I think
> > > the best idea is to simply kill.
> > > 
> > > Changes since v1:
> > >  - document the removed subtypes as reserved
> > >  - add the ACK from Greg
> > > 
> > > Diffstat:
> > >  arch/powerpc/platforms/powernv/npu-dma.c |  705 
> > > ---
> > >  b/arch/powerpc/include/asm/opal.h|3 
> > >  b/arch/powerpc/include/asm/pci-bridge.h  |1 
> > >  b/arch/powerpc/include/asm/pci.h |7 
> > >  b/arch/powerpc/platforms/powernv/Makefile|2 
> > >  b/arch/powerpc/platforms/powernv/opal-call.c |2 
> > >  b/arch/powerpc/platforms/powernv/pci-ioda.c  |  185 ---
> > >  b/arch/powerpc/platforms/powernv/pci.c   |   11 
> > >  b/arch/powerpc/platforms/powernv/pci.h   |   17 
> > >  b/arch/powerpc/platforms/pseries/pci.c   |   23 
> > >  b/drivers/vfio/pci/Kconfig   |6 
> > >  b/drivers/vfio/pci/Makefile  |1 
> > >  b/drivers/vfio/pci/vfio_pci.c|   18 
> > >  b/drivers/vfio/pci/vfio_pci_private.h|   14 
> > >  b/include/uapi/linux/vfio.h  |   38 -
> > 
> > 
> > Hi Christoph,
> > 
> > FYI, these uapi changes break build of QEMU.
> 
> What uapi changes?
> 

All macros and structure definitions that are being removed
from include/uapi/linux/vfio.h by patch 1.

> What exactly breaks?
> 

These macros and types are used by the current QEMU code base.
Next time the QEMU source tree updates its copy of the kernel
headers, the compilation of affected code will fail.
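
As an illustration of the breakage, one way a userspace consumer could keep building against headers that no longer carry these identifiers is to provide fallback definitions. This is only a sketch under that assumption, not what QEMU does: the numeric values are those of the original uAPI (subtype 1, capabilities 4 and 5), while the guard pattern and the helper names are hypothetical.

```c
#include <assert.h>

/* Fallbacks for identifiers dropped from include/uapi/linux/vfio.h. */
#ifndef VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2_RAM
#define VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2_RAM  (1)
#endif

#ifndef VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD
#define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD    (1)
#endif

#ifndef VFIO_REGION_INFO_CAP_NVLINK2_SSATGT
#define VFIO_REGION_INFO_CAP_NVLINK2_SSATGT     4
#endif

#ifndef VFIO_REGION_INFO_CAP_NVLINK2_LNKSPD
#define VFIO_REGION_INFO_CAP_NVLINK2_LNKSPD     5
#endif

/* Sanity check: the ids must match the values of the original uAPI. */
static void check_nvlink2_ids(void)
{
    assert(VFIO_REGION_INFO_CAP_NVLINK2_SSATGT == 4);
    assert(VFIO_REGION_INFO_CAP_NVLINK2_LNKSPD == 5);
}

/* Hypothetical helper: is this capability id one of the nvlink2 ones? */
static int is_nvlink2_cap(unsigned int cap_id)
{
    return cap_id == VFIO_REGION_INFO_CAP_NVLINK2_SSATGT ||
           cap_id == VFIO_REGION_INFO_CAP_NVLINK2_LNKSPD;
}
```

Note that this trick only covers macros. The removed structure definitions cannot be guarded with #ifndef, so a consumer would have to carry private copies of them or drop the code that uses them.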

> Why does QEMU require kernel driver stuff?
> 

Not sure I understand the question... is there a problem
with QEMU using an already published uapi?

> thanks,
> 
> greg k-h

Re: remove the nvlink2 pci_vfio subdriver v2

2021-05-04 Thread Greg Kurz

On Fri, 26 Mar 2021 07:13:09 +0100
Christoph Hellwig  wrote:

> Hi all,
> 
> the nvlink2 vfio subdriver is a weird beast.  It supports a hardware
> feature without any open source component - what would normally be
> the normal open source userspace that we require for kernel drivers,
> although in this particular case user space could of course be a
> kernel driver in a VM.  It also happens to be a complete mess that
> does not properly bind to PCI IDs, is hacked into the vfio_pci driver
> and also pulles in over 1000 lines of code always build into powerpc
> kernels that have Power NV support enabled.  Because of all these
> issues and the lack of breaking userspace when it is removed I think
> the best idea is to simply kill.
> 
> Changes since v1:
>  - document the removed subtypes as reserved
>  - add the ACK from Greg
> 
> Diffstat:
>  arch/powerpc/platforms/powernv/npu-dma.c |  705 
> ---
>  b/arch/powerpc/include/asm/opal.h|3 
>  b/arch/powerpc/include/asm/pci-bridge.h  |1 
>  b/arch/powerpc/include/asm/pci.h |7 
>  b/arch/powerpc/platforms/powernv/Makefile|2 
>  b/arch/powerpc/platforms/powernv/opal-call.c |2 
>  b/arch/powerpc/platforms/powernv/pci-ioda.c  |  185 ---
>  b/arch/powerpc/platforms/powernv/pci.c   |   11 
>  b/arch/powerpc/platforms/powernv/pci.h   |   17 
>  b/arch/powerpc/platforms/pseries/pci.c   |   23 
>  b/drivers/vfio/pci/Kconfig   |6 
>  b/drivers/vfio/pci/Makefile  |1 
>  b/drivers/vfio/pci/vfio_pci.c|   18 
>  b/drivers/vfio/pci/vfio_pci_private.h|   14 
>  b/include/uapi/linux/vfio.h  |   38 -


Hi Christoph,

FYI, these uapi changes break the build of QEMU.

I guess QEMU people should take some action before this percolates
to the QEMU source tree.

Cc'ing relevant QEMU lists to bring the discussion there.

Cheers,

--
Greg

>  drivers/vfio/pci/vfio_pci_nvlink2.c  |  490 --
>  16 files changed, 12 insertions(+), 1511 deletions(-)

Re: [PATCH v3 0/9] powerpc/xive: Map one IPI interrupt per node

2021-04-01 Thread Greg Kurz

On Thu, 1 Apr 2021 11:18:10 +0200
Cédric Le Goater  wrote:

> Hello,
> 
> On 4/1/21 10:04 AM, Greg Kurz wrote:
> > On Wed, 31 Mar 2021 16:45:05 +0200
> > Cédric Le Goater  wrote:
> > 
> >>
> >> Hello,
> >>
> >> ipistorm [*] can be used to benchmark the raw interrupt rate of an
> >> interrupt controller by measuring the number of IPIs a system can
> >> sustain. When applied to the XIVE interrupt controller of POWER9 and
> >> POWER10 systems, a significant drop of the interrupt rate can be
> >> observed when crossing the second node boundary.
> >>
> >> This is due to the fact that a single IPI interrupt is used for all
> >> CPUs of the system. The structure is shared and the cache line updates
> >> impact greatly the traffic between nodes and the overall IPI
> >> performance.
> >>
> >> As a workaround, the impact can be reduced by deactivating the IRQ
> >> lockup detector ("noirqdebug") which does a lot of accounting in the
> >> Linux IRQ descriptor structure and is responsible for most of the
> >> performance penalty.
> >>
> >> As a fix, this proposal allocates an IPI interrupt per node, to be
> >> shared by all CPUs of that node. It solves the scaling issue, the IRQ
> >> lockup detector still has an impact but the XIVE interrupt rate scales
> >> linearly. It also improves the "noirqdebug" case as showed in the
> >> tables below. 
> >>
> > 
> > As explained by David and others, NUMA nodes happen to match sockets
> > with current POWER CPUs but these are really different concepts. NUMA
> > is about CPU memory accesses latency, 
> 
> This is exactly our problem. we have cache issues because hw threads 
> on different chips are trying to access the same structure in memory.
> It happens on virtual platforms and baremetal platforms. This is not
> restricted to pseries.
> 

Ok, I get it... the XIVE HW accesses structures in RAM, just like HW threads
do, so the closer, the better. This definitely looks NUMA related. So
yes, the idea of having the XIVE HW only access local in-RAM data when
handling IPIs between vCPUs in the same NUMA node makes sense.

What is less clear is the exact role of ibm,chip-id actually. This is
currently used on PowerNV only to pick up a default target on the same
"chip" as the source if possible. What is the detailed motivation behind
this ?

> > while in the case of XIVE you
> > really need to identify a XIVE chip localized in a given socket.
> > 
> > PAPR doesn't know about sockets, only cores. In other words, a PAPR
> > compliant guest sees all vCPUs like they all sit in a single socket.
> 
> There are also NUMA nodes on PAPR.
> 

Yes, but nothing prevents a NUMA node from spanning multiple sockets,
or several NUMA nodes from sitting within the same socket, even if this
isn't the case in practice with current POWER hardware.

> > Same for the XIVE. Trying to introduce a concept of socket, either
> > by hijacking OPAL's ibm,chip-id or NUMA node ids, is a kind of
> > spec violation in this context. If the user cares for locality of
> > the vCPUs and XIVE on the same socket, then it should bind vCPU
> > threads to host CPUs from the same socket in the first place.
> 
> Yes. that's a must have of course. You need to reflect the real HW
> topology in the guest or LPAR if you are after performance, or 
> restrict the virtual machine to be on a single socket/chip/node.  
> 
> And this is not only a XIVE problem. XICS has the same problem with
> a shared single IPI interrupt descriptor but XICS doesn't scale well 
> by design, so it doesn't show.
> 
> 
> > Isn't this enough to solve the performance issues this series
> > want to fix, without the need for virtual socket ids ?
> what are virtual socket ids ? A new concept ? 
> 

For now, we have virtual CPUs identified by a virtual CPU id.
It thus seems natural to speak of a virtual socket id, but
anyway, the wording isn't really important here and you
don't answer the question ;-)

> Thanks,
> 
> C.
> 
> > 
> >>  * P9 DD2.2 - 2s * 64 threads
> >>
> >>                                         "noirqdebug"
> >>                       Mint/s               Mint/s
> >>  chips  cpus     IPI/sys   IPI/chip   IPI/chip    IPI/sys
> >>  --------------------------------------------------------
> >>  1      0-15    4.984023   4.875405   4.996536   5.048892
> >>         0-31   10.879164  10.544040  10.757632  11.037859
> >>         0-47   15.345301  14.688764  14.

Re: [PATCH v3 0/9] powerpc/xive: Map one IPI interrupt per node

2021-04-01 Thread Greg Kurz

On Wed, 31 Mar 2021 16:45:05 +0200
Cédric Le Goater  wrote:

> 
> Hello,
> 
> ipistorm [*] can be used to benchmark the raw interrupt rate of an
> interrupt controller by measuring the number of IPIs a system can
> sustain. When applied to the XIVE interrupt controller of POWER9 and
> POWER10 systems, a significant drop of the interrupt rate can be
> observed when crossing the second node boundary.
> 
> This is due to the fact that a single IPI interrupt is used for all
> CPUs of the system. The structure is shared and the cache line updates
> impact greatly the traffic between nodes and the overall IPI
> performance.
> 
> As a workaround, the impact can be reduced by deactivating the IRQ
> lockup detector ("noirqdebug") which does a lot of accounting in the
> Linux IRQ descriptor structure and is responsible for most of the
> performance penalty.
> 
> As a fix, this proposal allocates an IPI interrupt per node, to be
> shared by all CPUs of that node. It solves the scaling issue, the IRQ
> lockup detector still has an impact but the XIVE interrupt rate scales
> linearly. It also improves the "noirqdebug" case as showed in the
> tables below. 
> 

As explained by David and others, NUMA nodes happen to match sockets
with current POWER CPUs but these are really different concepts. NUMA
is about CPU memory access latency, while in the case of XIVE you
really need to identify a XIVE chip localized in a given socket.

PAPR doesn't know about sockets, only cores. In other words, a PAPR
compliant guest sees all vCPUs as if they all sit in a single socket.
Same for the XIVE. Trying to introduce a concept of socket, either
by hijacking OPAL's ibm,chip-id or NUMA node ids, is a kind of
spec violation in this context. If the user cares about locality of
the vCPUs and XIVE on the same socket, then they should bind vCPU
threads to host CPUs from the same socket in the first place.
Isn't this enough to solve the performance issues this series
wants to fix, without the need for virtual socket ids?
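
For illustration, binding a thread to a set of host CPUs on Linux can be done with sched_setaffinity(2). This is a minimal sketch of the mechanism only: pin_to_cpus() is a hypothetical helper, and the list of CPUs that belong to a given socket has to come from the host topology, which is not shown here.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <sched.h>
#include <stddef.h>

/*
 * Hypothetical helper: restrict the calling thread to the CPUs listed in
 * 'cpus'. Returns 0 on success, -1 on an out-of-range CPU number or when
 * the system call fails.
 */
static int pin_to_cpus(const int *cpus, size_t n)
{
    cpu_set_t set;
    size_t i;

    assert(cpus != NULL || n == 0);

    CPU_ZERO(&set);
    for (i = 0; i < n; i++) {
        if (cpus[i] < 0 || cpus[i] >= CPU_SETSIZE)
            return -1;
        CPU_SET(cpus[i], &set);
    }

    /* A pid of 0 designates the calling thread. */
    return sched_setaffinity(0, sizeof(set), &set);
}
```

A vCPU thread would call this with the host CPUs of one socket before it starts running guest code, so that the vCPUs and the XIVE structures they touch stay local to that socket.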

>  * P9 DD2.2 - 2s * 64 threads
> 
>                                         "noirqdebug"
>                       Mint/s               Mint/s
>  chips  cpus     IPI/sys   IPI/chip   IPI/chip    IPI/sys
>  --------------------------------------------------------
>  1      0-15    4.984023   4.875405   4.996536   5.048892
>         0-31   10.879164  10.544040  10.757632  11.037859
>         0-47   15.345301  14.688764  14.926520  15.310053
>         0-63   17.064907  17.066812  17.613416  17.874511
>  2      0-79   11.768764  21.650749  22.689120  22.566508
>         0-95   10.616812  26.878789  28.434703  28.320324
>         0-111  10.151693  31.397803  31.771773  32.388122
>         0-127   9.948502  33.139336  34.875716  35.224548
> 
> 
>  * P10 DD1 - 4s (not homogeneous) 352 threads
> 
>                                         "noirqdebug"
>                       Mint/s               Mint/s
>  chips  cpus     IPI/sys   IPI/chip   IPI/chip    IPI/sys
>  --------------------------------------------------------
>  1      0-15    2.409402   2.364108   2.383303   2.395091
>         0-31    6.028325   6.046075   6.08       6.073750
>         0-47    8.655178   8.644531   8.712830   8.724702
>         0-63   11.629652  11.735953  12.088203  12.055979
>         0-79   14.392321  14.729959  14.986701  14.973073
>         0-95   12.604158  13.004034  17.528748  17.568095
>  2      0-111   9.767753  13.719831  19.968606  20.024218
>         0-127   6.744566  16.418854  22.898066  22.995110
>         0-143   6.005699  19.174421  25.425622  25.417541
>         0-159   5.649719  21.938836  27.952662  28.059603
>         0-175   5.441410  24.109484  31.133915  31.127996
>  3      0-191   5.318341  24.405322  33.999221  33.775354
>         0-207   5.191382  26.449769  36.050161  35.867307
>         0-223   5.102790  29.356943  39.544135  39.508169
>         0-239   5.035295  31.933051  42.135075  42.071975
>         0-255   4.969209  34.477367  44.655395  44.757074
>  4      0-271   4.907652  35.887016  47.080545  47.318537
>         0-287   4.839581  38.076137  50.464307  50.636219
>         0-303   4.786031  40.881319  53.478684  53.310759
>         0-319   4.743750  43.448424  56.388102  55.973969
>         0-335   4.709936  45.623532  59.400930  58.926857
>         0-351   4.681413  45.646151  62.035804  61.830057
> 
> [*] https://github.com/antonblanchard/ipistorm
> 
> Thanks,
> 
> C.
> 
> Changes in v3:
> 
>   - improved commit log for the misuse of "ibm,chip-id"
>   - better error handling of xive_request_ipi()
>   - use of a fwnode_handle to name the new domain 
>   - increased IPI name length
>   - use of early_cpu_to_node() for hotplugged CPUs
>   - filter CPU-less nodes
> 
> Changes in v2:
> 
>   - extra simplification

Re: [PATCH] powerpc/numa: Fix topology_physical_package_id() on pSeries

2021-03-15 Thread Greg Kurz

On Fri, 12 Mar 2021 15:31:54 +0100
Cédric Le Goater  wrote:

> Initial commit 15863ff3b8da ("powerpc: Make chip-id information
> available to userspace") introduce a cpu_to_chip_id() routine for the
> PowerNV platform using the "ibm,chip-id" property to query the chip id
> of a CPU. But PAPR does not specify such a property and the node id
> query is broken.
> 
> Use cpu_to_node() instead which guarantees to have a correct value on
> all platforms, PowerNV an pSeries.
> 
> Cc: Nathan Lynch 
> Cc: Srikar Dronamraju 
> Cc: Vasant Hegde 
> Signed-off-by: Cédric Le Goater 
> ---

Makes sense.

FWIW

Reviewed-by: Greg Kurz 

>  arch/powerpc/include/asm/topology.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/include/asm/topology.h 
> b/arch/powerpc/include/asm/topology.h
> index 3beeb030cd78..887c42a4e43d 100644
> --- a/arch/powerpc/include/asm/topology.h
> +++ b/arch/powerpc/include/asm/topology.h
> @@ -123,7 +123,7 @@ static inline int cpu_to_coregroup_id(int cpu)
>  #ifdef CONFIG_PPC64
>  #include 
>  
> -#define topology_physical_package_id(cpu)(cpu_to_chip_id(cpu))
> +#define topology_physical_package_id(cpu)(cpu_to_node(cpu))
>  
>  #define topology_sibling_cpumask(cpu)(per_cpu(cpu_sibling_map, cpu))
>  #define topology_core_cpumask(cpu)   (cpu_cpu_mask(cpu))

Re: [PATCH v2 1/8] powerpc/xive: Use cpu_to_node() instead of ibm,chip-id property

2021-03-12 Thread Greg Kurz

On Fri, 12 Mar 2021 09:18:39 -0300
Daniel Henrique Barboza  wrote:

> 
> 
> On 3/12/21 6:53 AM, Cédric Le Goater wrote:
> > On 3/12/21 2:55 AM, David Gibson wrote:
> >> On Tue, 9 Mar 2021 18:26:35 +0100
> >> Cédric Le Goater  wrote:
> >>
> >>> On 3/9/21 6:08 PM, Daniel Henrique Barboza wrote:
> >>>>
> >>>>
> >>>> On 3/9/21 12:33 PM, Cédric Le Goater wrote:
> >>>>> On 3/8/21 6:13 PM, Greg Kurz wrote:
> >>>>>> On Wed, 3 Mar 2021 18:48:50 +0100
> >>>>>> Cédric Le Goater  wrote:
> >>>>>>   
> >>>>>>> The 'chip_id' field of the XIVE CPU structure is used to choose a
> >>>>>>> target for a source located on the same chip when possible. This field
> >>>>>>> is assigned on the PowerNV platform using the "ibm,chip-id" property
> >>>>>>> on pSeries under KVM when NUMA nodes are defined but it is undefined
> >>>>>>
> >>>>>> This sentence seems to have a syntax problem... like it is missing an
> >>>>>> 'and' before 'on pSeries'.
> >>>>>
> >>>>> ah yes, or simply a comma.
> >>>>>   
> >>>>>>> under PowerVM. The XIVE source structure has a similar field
> >>>>>>> 'src_chip' which is only assigned on the PowerNV platform.
> >>>>>>>
> >>>>>>> cpu_to_node() returns a compatible value on all platforms, 0 being the
> >>>>>>> default node. It will also give us the opportunity to set the affinity
> >>>>>>> of a source on pSeries when we can localize them.
> >>>>>>>   
> >>>>>>
> >>>>>> IIUC this relies on the fact that the NUMA node id is == to chip id
> >>>>>> on PowerNV, i.e. xc->chip_id which is passed to OPAL remain stable
> >>>>>> with this change.
> >>>>>
> >>>>> Linux sets the NUMA node in numa_setup_cpu(). On pseries, the hcall
> >>>>> H_HOME_NODE_ASSOCIATIVITY returns the node id if I am correct (Daniel
> >>>>> in Cc:)
> >>>   [...]
> >>>>>
> >>>>> On PowerNV, Linux uses "ibm,associativity" property of the CPU to find
> >>>>> the node id. This value is built from the chip id in OPAL, so the
> >>>>> value returned by cpu_to_node(cpu) and the value of the "ibm,chip-id"
> >>>>> property are unlikely to be different.
> >>>>>
> >>>>> cpu_to_node(cpu) is used in many places to allocate the structures
> >>>>> locally to the owning node. XIVE is not an exception (see below in the
> >>>>> same patch), it is better to be consistent and get the same information
> >>>>> (node id) using the same routine.
> >>>>>
> >>>>>
> >>>>> In Linux, "ibm,chip-id" is only used in low level PowerNV drivers :
> >>>>> LPC, XSCOM, RNG, VAS, NX. XIVE should be in that list also but skiboot
> >>>>> unifies the controllers of the system to only expose one the OS. This
> >>>>> is problematic and should be changed but it's another topic.
> >>>>>
> >>>>>   
> >>>>>> On the other hand, you have the pSeries case under PowerVM that
> >>>>>> doesn't xc->chip_id, which isn't passed to any hcall AFAICT.
> >>>>>
> >>>>> yes "ibm,chip-id" is an OPAL concept unfortunately and it has no meaning
> >>>>> under PAPR. xc->chip_id on pseries (PowerVM) will contains an invalid
> >>>>> chip id.
> >>>>>
> >>>>> QEMU/KVM exposes "ibm,chip-id" but it's not used. (its value is not
> >>>>> always correct btw)
> >>>>
> >>>>
> >>>> If you have a way to reliably reproduce this, let me know and I'll fix it
> >>>> up in QEMU.
> >>>
> >>> with :
> >>>
> >>> -smp 4,cores=1,maxcpus=8 -object 
> >>> memory-backend-ram,id=ram-node0,size=2G -numa 
> >>> node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object 
> >>> memory-backend-ram,id=ram-node1,size=2G -numa 
> >>> node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1
> >>>
> >>> # dmes

[PATCH] powerpc/xmon: Check cpu id in commands "c#", "dp#" and "dx#"

2021-03-09 Thread Greg Kurz

All these commands end up peeking into the PACA using the user-supplied
cpu id as an index. Check that the cpu id is valid in order to prevent xmon
from crashing. Instead of printing an error, this follows the same behavior
as the "lp s #" command: ignore the buggy cpu id parameter and fall back to
the #-less version of the command.

Signed-off-by: Greg Kurz 
---
 arch/powerpc/xmon/xmon.c |6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index 80fbf8968f77..d3d6e044228e 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -1248,7 +1248,7 @@ static int cpu_cmd(void)
unsigned long cpu, first_cpu, last_cpu;
int timeout;
 
-   if (!scanhex(&cpu)) {
+   if (!scanhex(&cpu) || cpu >= num_possible_cpus()) {
/* print cpus waiting or in xmon */
printf("cpus stopped:");
last_cpu = first_cpu = NR_CPUS;
@@ -2678,7 +2678,7 @@ static void dump_pacas(void)
 
termch = c; /* Put c back, it wasn't 'a' */
 
-   if (scanhex(&num))
+   if (scanhex(&num) && num < num_possible_cpus())
dump_one_paca(num);
else
dump_one_paca(xmon_owner);
@@ -2751,7 +2751,7 @@ static void dump_xives(void)
 
termch = c; /* Put c back, it wasn't 'a' */
 
-   if (scanhex(&num))
+   if (scanhex(&num) && num < num_possible_cpus())
dump_one_xive(num);
else
dump_one_xive(xmon_owner);
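
The fallback logic shared by the three hunks above can be sketched in plain C as follows. This is a user-space illustration only: NPOSSIBLE, owner_cpu and pick_cpu() are hypothetical stand-ins for num_possible_cpus(), xmon_owner and the inline checks of the patch.

```c
#include <assert.h>
#include <stdbool.h>

#define NPOSSIBLE 8UL                   /* stand-in for num_possible_cpus() */

static unsigned long owner_cpu = 2;     /* stand-in for xmon_owner */

/*
 * Return the CPU whose data should be dumped: the user-supplied id when
 * one was parsed and it is in range, the current owner otherwise.
 */
static unsigned long pick_cpu(bool parsed, unsigned long num)
{
    assert(owner_cpu < NPOSSIBLE);

    if (parsed && num < NPOSSIBLE)
        return num;

    /* No id given, or a bogus one: behave like the #-less command. */
    return owner_cpu;
}
```

With these stand-ins, an in-range id such as 5 is used as is, while an id like 0xbad silently falls back to the owner CPU instead of indexing past the end of the per-CPU data.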

Re: [PATCH v2 8/8] powerpc/xive: Map one IPI interrupt per node

2021-03-09 Thread Greg Kurz

+++ b/arch/powerpc/sysdev/xive/common.c
> @@ -65,8 +65,16 @@ static struct irq_domain *xive_irq_domain;
>  #ifdef CONFIG_SMP
>  static struct irq_domain *xive_ipi_irq_domain;
>  
> -/* The IPIs all use the same logical irq number */
> -static u32 xive_ipi_irq;
> +/* The IPIs use the same logical irq number when on the same chip */
> +static struct xive_ipi_desc {
> + unsigned int irq;
> + char name[8]; /* enough bytes to fit IPI-XXX */

So this assumes that the node number is <= 999? This is certainly
the case for now since CONFIG_NODES_SHIFT is 8 on ppc64 (at most 256
nodes), but starting with a shift of 10 you'd have truncated names.
What about deriving the size of name[] from CONFIG_NODES_SHIFT?

Apart from that, LGTM. Probably not worth a respin just for
this.
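
To make the sizing concern concrete, here is a small user-space sketch that derives the buffer length from the node shift and detects truncation. The helper names are hypothetical; in the kernel the input would be CONFIG_NODES_SHIFT and the size would have to be a compile-time constant.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Number of decimal digits needed to print 'v'. */
static unsigned int dec_digits(unsigned int v)
{
    unsigned int n = 1;

    while (v >= 10) {
        v /= 10;
        n++;
    }
    return n;
}

/* Bytes needed for "IPI-<node>" including the terminating NUL. */
static unsigned int ipi_name_len(unsigned int nodes_shift)
{
    unsigned int max_node;

    assert(nodes_shift < 32);
    max_node = (1U << nodes_shift) - 1;

    return 4 + dec_digits(max_node) + 1;
}

/* Return 1 if "IPI-<node>" does not fit in a buffer of 'size' bytes. */
static int ipi_name_truncated(unsigned int node, size_t size)
{
    char buf[32];
    int n = snprintf(buf, sizeof(buf), "IPI-%u", node);

    return n < 0 || (size_t)n >= size;
}
```

With a shift of 8 the largest node id is 255 and 8 bytes are enough. With a shift of 10 the largest id is 1023, which needs 9 bytes, so name[8] would truncate it.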

I was also able to give it a try in a KVM guest.

Topology passed to QEMU:

  -smp 8,maxcpus=8,cores=2,threads=2,sockets=2 \
  -numa node,nodeid=0,cpus=0-3 \
  -numa node,nodeid=1,cpus=4-7

Topology observed in guest with lstopo :

  Package L#0
NUMANode L#0 (P#0 30GB)
L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
  PU L#0 (P#0)
  PU L#1 (P#1)
L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
  PU L#2 (P#2)
  PU L#3 (P#3)
  Package L#1
NUMANode L#1 (P#1 32GB)
L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
  PU L#4 (P#4)
  PU L#5 (P#5)
L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
  PU L#6 (P#6)
  PU L#7 (P#7)

Interrupts in guest:

$ cat /proc/interrupts 
           CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
 16:       1023        871       1042        749          0          0          0          0  XIVE-IPI   0 Edge  IPI-0
 17:          0          0          0          0       2123       1019       1263       1288  XIVE-IPI   1 Edge  IPI-1

IPIs are mapped to the appropriate nodes, and the numbers indicate
that everything is working as expected.
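
The mapping seen in /proc/interrupts above can be modelled with a tiny lookup table. This is a user-space sketch only: the cpu-to-node array and the IRQ numbers are copied from this test guest, and the names are hypothetical stand-ins for cpu_to_node() and xive_ipi_cpu_to_irq().

```c
#include <assert.h>

#define NR_TEST_CPUS 8

/* Topology of the test guest: CPUs 0-3 on node 0, CPUs 4-7 on node 1. */
static const unsigned int cpu_node[NR_TEST_CPUS] = { 0, 0, 0, 0, 1, 1, 1, 1 };

/* One IPI descriptor per node, as in the patch. */
struct ipi_desc {
    unsigned int irq;
    const char *name;
};

static const struct ipi_desc ipis[2] = {
    { 16, "IPI-0" },
    { 17, "IPI-1" },
};

/* Stand-in for xive_ipi_cpu_to_irq(): IRQ backing the IPI of 'cpu'. */
static unsigned int ipi_cpu_to_irq(unsigned int cpu)
{
    assert(cpu < NR_TEST_CPUS);

    return ipis[cpu_node[cpu]].irq;
}
```

All CPUs of a node resolve to the same descriptor, which is why each row of /proc/interrupts only counts interrupts on the CPUs of its own node.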

Reviewed-and-tested-by: Greg Kurz 

> +} *xive_ipis;
> +
> +static unsigned int xive_ipi_cpu_to_irq(unsigned int cpu)
> +{
> + return xive_ipis[cpu_to_node(cpu)].irq;
> +}
>  #endif
>  
>  /* Xive state for each CPU */
> @@ -1106,25 +1114,36 @@ static const struct irq_domain_ops 
> xive_ipi_irq_domain_ops = {
>  
>  static void __init xive_request_ipi(void)
>  {
> - unsigned int virq;
> + unsigned int node;
>  
> - xive_ipi_irq_domain = irq_domain_add_linear(NULL, 1,
> + xive_ipi_irq_domain = irq_domain_add_linear(NULL, nr_node_ids,
>   &xive_ipi_irq_domain_ops, NULL);
>   if (WARN_ON(xive_ipi_irq_domain == NULL))
>   return;
>  
> - /* Initialize it */
> - virq = irq_create_mapping(xive_ipi_irq_domain, XIVE_IPI_HW_IRQ);
> - xive_ipi_irq = virq;
> + xive_ipis = kcalloc(nr_node_ids, sizeof(*xive_ipis), GFP_KERNEL | 
> __GFP_NOFAIL);
> + for_each_node(node) {
> + struct xive_ipi_desc *xid = &xive_ipis[node];
> + irq_hw_number_t node_ipi_hwirq = node;
> +
> + /*
> +  * Map one IPI interrupt per node for all cpus of that node.
> +  * Since the HW interrupt number doesn't have any meaning,
> +  * simply use the node number.
> +  */
> + xid->irq = irq_create_mapping(xive_ipi_irq_domain, 
> node_ipi_hwirq);
> + snprintf(xid->name, sizeof(xid->name), "IPI-%d", node);
>  
> - WARN_ON(request_irq(virq, xive_muxed_ipi_action,
> - IRQF_PERCPU | IRQF_NO_THREAD, "IPI", NULL));
> + WARN_ON(request_irq(xid->irq, xive_muxed_ipi_action,
> + IRQF_PERCPU | IRQF_NO_THREAD, xid->name, 
> NULL));
> + }
>  }
>  
>  static int xive_setup_cpu_ipi(unsigned int cpu)
>  {
>   struct xive_cpu *xc;
>   int rc;
> + unsigned int xive_ipi_irq = xive_ipi_cpu_to_irq(cpu);
>  
>   pr_debug("Setting up IPI for CPU %d\n", cpu);
>  
> @@ -1165,6 +1184,8 @@ static int xive_setup_cpu_ipi(unsigned int cpu)
>  
>  static void xive_cleanup_cpu_ipi(unsigned int cpu, struct xive_cpu *xc)
>  {
> + unsigned int xive_ipi_irq = xive_ipi_cpu_to_irq(cpu);
> +
>   /* Disable the IPI and free the IRQ data */
>  
>   /* Already cleaned up ? */

Re: [PATCH v2 7/8] powerpc/xive: Fix xmon command "dxi"

2021-03-09 Thread Greg Kurz

On Wed, 3 Mar 2021 18:48:56 +0100
Cédric Le Goater  wrote:

> When under xmon, the "dxi" command dumps the state of the XIVE
> interrupts. If an interrupt number is specified, only the state of
> the associated XIVE interrupt is dumped. This form of the command
> lacks an irq_data parameter which is nevertheless used by
> xmon_xive_get_irq_config(), leading to an xmon crash.
> 
> Fix that by doing a lookup in the system IRQ mapping to query the IRQ
> descriptor data. Invalid interrupt numbers, or not belonging to the
> XIVE IRQ domain, OPAL event interrupt number for instance, should be
> caught by the previous query done at the firmware level.
> 
> Reported-by: kernel test robot 
> Reported-by: Dan Carpenter 
> Fixes: 97ef27507793 ("powerpc/xive: Fix xmon support on the PowerNV platform")
> Signed-off-by: Cédric Le Goater 
> ---

I've tested this in a KVM guest and it seems to do the job.

6:mon> dxi 1201
IRQ 0x1201 : target=0xfc00 prio=ff lirq=0x0 flags= LH PQ=-Q

Bad HW irq numbers are filtered by the hypervisor:

6:mon> dxi bad
[  696.390577] xive: H_INT_GET_SOURCE_CONFIG lisn=2989 failed -55
IRQ 0x0bad : no config rc=-6

Note that this also makes it possible to show IPIs:

6:mon> dxi 0
IRQ 0x : target=0x0 prio=06 lirq=0x10 

This is a bit inconsistent with output of the 0-argument form of "dxi",
which filters them out for a reason that isn't obvious to me. No big
deal though, this should be addressed in another patch anyway.

Reviewed-and-tested-by: Greg Kurz 

>  arch/powerpc/sysdev/xive/common.c | 14 ++
>  1 file changed, 10 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/powerpc/sysdev/xive/common.c 
> b/arch/powerpc/sysdev/xive/common.c
> index f6b7b15bbb3a..8eefd152b947 100644
> --- a/arch/powerpc/sysdev/xive/common.c
> +++ b/arch/powerpc/sysdev/xive/common.c
> @@ -255,17 +255,20 @@ notrace void xmon_xive_do_dump(int cpu)
>   xmon_printf("\n");
>  }
>  
> +static struct irq_data *xive_get_irq_data(u32 hw_irq)
> +{
> + unsigned int irq = irq_find_mapping(xive_irq_domain, hw_irq);
> +
> + return irq ? irq_get_irq_data(irq) : NULL;
> +}
> +
>  int xmon_xive_get_irq_config(u32 hw_irq, struct irq_data *d)
>  {
> - struct irq_chip *chip = irq_data_get_irq_chip(d);
>   int rc;
>   u32 target;
>   u8 prio;
>   u32 lirq;
>  
> - if (!is_xive_irq(chip))
> - return -EINVAL;
> -
>   rc = xive_ops->get_irq_config(hw_irq, &target, &prio, &lirq);
>   if (rc) {
>   xmon_printf("IRQ 0x%08x : no config rc=%d\n", hw_irq, rc);
> @@ -275,6 +278,9 @@ int xmon_xive_get_irq_config(u32 hw_irq, struct irq_data 
> *d)
>   xmon_printf("IRQ 0x%08x : target=0x%x prio=%02x lirq=0x%x ",
>   hw_irq, target, prio, lirq);
>  
> + if (!d)
> + d = xive_get_irq_data(hw_irq);
> +
>   if (d) {
>   struct xive_irq_data *xd = irq_data_get_irq_handler_data(d);
>   u64 val = xive_esb_read(xd, XIVE_ESB_GET);

Re: [PATCH v2 4/8] powerpc/xive: Simplify xive_core_debug_show()

2021-03-09 Thread Greg Kurz

On Tue, 9 Mar 2021 10:13:39 +0100
Greg Kurz  wrote:

> On Mon, 8 Mar 2021 19:11:11 +0100
> Cédric Le Goater  wrote:
> 
> > On 3/8/21 7:07 PM, Greg Kurz wrote:
> > > On Wed, 3 Mar 2021 18:48:53 +0100
> > > Cédric Le Goater  wrote:
> > > 
> > >> Now that the IPI interrupt has its own domain, the checks on the HW
> > >> interrupt number XIVE_IPI_HW_IRQ and on the chip can be replaced by a
> > >> check on the domain.
> > >>
> > >> Signed-off-by: Cédric Le Goater 
> > >> ---
> > > 
> > > Shouldn't this have the following tags ?
> > > 
> > > Reported-by: kernel test robot 
> > > Reported-by: Dan Carpenter 
> > > Fixes: 930914b7d528 ("powerpc/xive: Add a debugfs file to dump internal 
> > > XIVE state")
> > 
> > The next patch has because it removes the useless check on irq_data.
> >  
> 
> Ok I get it. This report isn't about an actual crash. Just a false
> positive because of the not needed check in the caller.
> 

Hrm... I meant because of the check in xive_debug_show_irq(). On the
contrary, the check removed by this patch in xive_core_debug_show()
was rather an explicit hint that xive_debug_show_irq() couldn't be
called with d being NULL.
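
To make the point concrete, here is a minimal standalone sketch of that
caller/callee contract. The struct and function names are made-up
stand-ins, not the real kernel ones: the callee dereferences its argument
unconditionally and the caller is the only place where NULL gets filtered.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct irq_data, for illustration only. */
struct fake_irq_data {
	unsigned int hwirq;
};

/*
 * Callee: dereferences d right away, so it relies on the caller never
 * passing NULL. A later "if (d)" in here would be dead code.
 */
static unsigned int fake_show_irq(const struct fake_irq_data *d)
{
	return d->hwirq;
}

/*
 * Caller: the "if (!d) continue;" filter is what guarantees that the
 * callee is never reached with a NULL pointer.
 * Returns the number of entries actually shown.
 */
static size_t fake_show_all(struct fake_irq_data *const *descs, size_t n)
{
	size_t shown = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		const struct fake_irq_data *d = descs[i];

		if (!d)
			continue;

		(void)fake_show_irq(d);
		shown++;
	}
	return shown;
}
```

With such a layout, a second NULL test inside the callee adds nothing and
only contradicts the dereference that precedes it.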

> > C.
> > 
> > > 
> > > Anyway,
> > > 
> > > Reviewed-by: Greg Kurz 
> > > 
> > >>  arch/powerpc/sysdev/xive/common.c | 18 --
> > >>  1 file changed, 4 insertions(+), 14 deletions(-)
> > >>
> > >> diff --git a/arch/powerpc/sysdev/xive/common.c 
> > >> b/arch/powerpc/sysdev/xive/common.c
> > >> index 678680531d26..7581cb12bb53 100644
> > >> --- a/arch/powerpc/sysdev/xive/common.c
> > >> +++ b/arch/powerpc/sysdev/xive/common.c
> > >> @@ -1579,17 +1579,14 @@ static void xive_debug_show_cpu(struct seq_file 
> > >> *m, int cpu)
> > >>  seq_puts(m, "\n");
> > >>  }
> > >>  
> > >> -static void xive_debug_show_irq(struct seq_file *m, u32 hw_irq, struct 
> > >> irq_data *d)
> > >> +static void xive_debug_show_irq(struct seq_file *m, struct irq_data *d)
> > >>  {
> > >> -struct irq_chip *chip = irq_data_get_irq_chip(d);
> > >> +unsigned int hw_irq = (unsigned int)irqd_to_hwirq(d);
> > >>  int rc;
> > >>  u32 target;
> > >>  u8 prio;
> > >>  u32 lirq;
> > >>  
> > >> -if (!is_xive_irq(chip))
> > >> -return;
> > >> -
> > >>  rc = xive_ops->get_irq_config(hw_irq, , , );
> > >>  if (rc) {
> > >>  seq_printf(m, "IRQ 0x%08x : no config rc=%d\n", hw_irq, 
> > >> rc);
> > >> @@ -1627,16 +1624,9 @@ static int xive_core_debug_show(struct seq_file 
> > >> *m, void *private)
> > >>  
> > >>  for_each_irq_desc(i, desc) {
> > >>  struct irq_data *d = irq_desc_get_irq_data(desc);
> > >> -unsigned int hw_irq;
> > >> -
> > >> -if (!d)
> > >> -continue;
> > >> -
> > >> -hw_irq = (unsigned int)irqd_to_hwirq(d);
> > >>  
> > >> -/* IPIs are special (HW number 0) */
> > >> -if (hw_irq != XIVE_IPI_HW_IRQ)
> > >> -xive_debug_show_irq(m, hw_irq, d);
> > >> +if (d->domain == xive_irq_domain)
> > >> +xive_debug_show_irq(m, d);
> > >>  }
> > >>  return 0;
> > >>  }
> > > 
> > 
>

Re: [PATCH v2 6/8] powerpc/xive: Simplify the dump of XIVE interrupts under xmon

2021-03-09 Thread Greg Kurz

On Wed, 3 Mar 2021 18:48:55 +0100
Cédric Le Goater  wrote:

> Move the xmon routine under XIVE subsystem and rework the loop on the
> interrupts taking into account the xive_irq_domain to filter out IPIs.
> 
> Signed-off-by: Cédric Le Goater 
> ---

Nice again ! :)

Reviewed-by: Greg Kurz 

>  arch/powerpc/include/asm/xive.h   |  1 +
>  arch/powerpc/sysdev/xive/common.c | 14 ++
>  arch/powerpc/xmon/xmon.c  | 28 ++--
>  3 files changed, 17 insertions(+), 26 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h
> index 9a312b975ca8..aa094a8655b0 100644
> --- a/arch/powerpc/include/asm/xive.h
> +++ b/arch/powerpc/include/asm/xive.h
> @@ -102,6 +102,7 @@ void xive_flush_interrupt(void);
>  /* xmon hook */
>  void xmon_xive_do_dump(int cpu);
>  int xmon_xive_get_irq_config(u32 hw_irq, struct irq_data *d);
> +void xmon_xive_get_irq_all(void);
>  
>  /* APIs used by KVM */
>  u32 xive_native_default_eq_shift(void);
> diff --git a/arch/powerpc/sysdev/xive/common.c 
> b/arch/powerpc/sysdev/xive/common.c
> index 60ebd6f4b31d..f6b7b15bbb3a 100644
> --- a/arch/powerpc/sysdev/xive/common.c
> +++ b/arch/powerpc/sysdev/xive/common.c
> @@ -291,6 +291,20 @@ int xmon_xive_get_irq_config(u32 hw_irq, struct irq_data 
> *d)
>   return 0;
>  }
>  
> +void xmon_xive_get_irq_all(void)
> +{
> + unsigned int i;
> + struct irq_desc *desc;
> +
> + for_each_irq_desc(i, desc) {
> + struct irq_data *d = irq_desc_get_irq_data(desc);
> + unsigned int hwirq = (unsigned int)irqd_to_hwirq(d);
> +
> + if (d->domain == xive_irq_domain)
> + xmon_xive_get_irq_config(hwirq, d);
> + }
> +}
> +
>  #endif /* CONFIG_XMON */
>  
>  static unsigned int xive_get_irq(void)
> diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
> index 3fe37495f63d..80fbf8968f77 100644
> --- a/arch/powerpc/xmon/xmon.c
> +++ b/arch/powerpc/xmon/xmon.c
> @@ -2727,30 +2727,6 @@ static void dump_all_xives(void)
>   dump_one_xive(cpu);
>  }
>  
> -static void dump_one_xive_irq(u32 num, struct irq_data *d)
> -{
> - xmon_xive_get_irq_config(num, d);
> -}
> -
> -static void dump_all_xive_irq(void)
> -{
> - unsigned int i;
> - struct irq_desc *desc;
> -
> - for_each_irq_desc(i, desc) {
> - struct irq_data *d = irq_desc_get_irq_data(desc);
> - unsigned int hwirq;
> -
> - if (!d)
> - continue;
> -
> - hwirq = (unsigned int)irqd_to_hwirq(d);
> - /* IPIs are special (HW number 0) */
> - if (hwirq)
> - dump_one_xive_irq(hwirq, d);
> - }
> -}
> -
>  static void dump_xives(void)
>  {
>   unsigned long num;
> @@ -2767,9 +2743,9 @@ static void dump_xives(void)
>   return;
>   } else if (c == 'i') {
>   if (scanhex())
> - dump_one_xive_irq(num, NULL);
> + xmon_xive_get_irq_config(num, NULL);
>   else
> - dump_all_xive_irq();
> + xmon_xive_get_irq_all();
>   return;
>   }
>

Re: [PATCH v2 5/8] powerpc/xive: Drop check on irq_data in xive_core_debug_show()

2021-03-09 Thread Greg Kurz

On Wed, 3 Mar 2021 18:48:54 +0100
Cédric Le Goater  wrote:

> When looping on IRQ descriptor, irq_data is always valid.
> 
> Reported-by: kernel test robot 
> Reported-by: Dan Carpenter 
> Fixes: 930914b7d528 ("powerpc/xive: Add a debugfs file to dump internal XIVE 
> state")
> Signed-off-by: Cédric Le Goater 
> ---

Reviewed-by: Greg Kurz 

>  arch/powerpc/sysdev/xive/common.c | 21 ++---
>  1 file changed, 10 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/powerpc/sysdev/xive/common.c 
> b/arch/powerpc/sysdev/xive/common.c
> index 7581cb12bb53..60ebd6f4b31d 100644
> --- a/arch/powerpc/sysdev/xive/common.c
> +++ b/arch/powerpc/sysdev/xive/common.c
> @@ -1586,6 +1586,8 @@ static void xive_debug_show_irq(struct seq_file *m, 
> struct irq_data *d)
>   u32 target;
>   u8 prio;
>   u32 lirq;
> + struct xive_irq_data *xd;
> + u64 val;
>  
>   rc = xive_ops->get_irq_config(hw_irq, &target, &prio, &lirq);
>   if (rc) {
> @@ -1596,17 +1598,14 @@ static void xive_debug_show_irq(struct seq_file *m, 
> struct irq_data *d)
>   seq_printf(m, "IRQ 0x%08x : target=0x%x prio=%02x lirq=0x%x ",
>  hw_irq, target, prio, lirq);
>  
> - if (d) {
> - struct xive_irq_data *xd = irq_data_get_irq_handler_data(d);
> - u64 val = xive_esb_read(xd, XIVE_ESB_GET);
> -
> - seq_printf(m, "flags=%c%c%c PQ=%c%c",
> -xd->flags & XIVE_IRQ_FLAG_STORE_EOI ? 'S' : ' ',
> -xd->flags & XIVE_IRQ_FLAG_LSI ? 'L' : ' ',
> -xd->flags & XIVE_IRQ_FLAG_H_INT_ESB ? 'H' : ' ',
> -val & XIVE_ESB_VAL_P ? 'P' : '-',
> -val & XIVE_ESB_VAL_Q ? 'Q' : '-');
> - }
> + xd = irq_data_get_irq_handler_data(d);
> + val = xive_esb_read(xd, XIVE_ESB_GET);
> + seq_printf(m, "flags=%c%c%c PQ=%c%c",
> +xd->flags & XIVE_IRQ_FLAG_STORE_EOI ? 'S' : ' ',
> +xd->flags & XIVE_IRQ_FLAG_LSI ? 'L' : ' ',
> +xd->flags & XIVE_IRQ_FLAG_H_INT_ESB ? 'H' : ' ',
> +val & XIVE_ESB_VAL_P ? 'P' : '-',
> +val & XIVE_ESB_VAL_Q ? 'Q' : '-');
>   seq_puts(m, "\n");
>  }
>

Re: [PATCH v2 4/8] powerpc/xive: Simplify xive_core_debug_show()

2021-03-09 Thread Greg Kurz

On Mon, 8 Mar 2021 19:11:11 +0100
Cédric Le Goater  wrote:

> On 3/8/21 7:07 PM, Greg Kurz wrote:
> > On Wed, 3 Mar 2021 18:48:53 +0100
> > Cédric Le Goater  wrote:
> > 
> >> Now that the IPI interrupt has its own domain, the checks on the HW
> >> interrupt number XIVE_IPI_HW_IRQ and on the chip can be replaced by a
> >> check on the domain.
> >>
> >> Signed-off-by: Cédric Le Goater 
> >> ---
> > 
> > Shouldn't this have the following tags ?
> > 
> > Reported-by: kernel test robot 
> > Reported-by: Dan Carpenter 
> > Fixes: 930914b7d528 ("powerpc/xive: Add a debugfs file to dump internal 
> > XIVE state")
> 
> The next patch has because it removes the useless check on irq_data.
>  

Ok I get it. This report isn't about an actual crash. Just a false
positive because of the not needed check in the caller.

> C.
> 
> > 
> > Anyway,
> > 
> > Reviewed-by: Greg Kurz 
> > 
> >>  arch/powerpc/sysdev/xive/common.c | 18 --
> >>  1 file changed, 4 insertions(+), 14 deletions(-)
> >>
> >> diff --git a/arch/powerpc/sysdev/xive/common.c 
> >> b/arch/powerpc/sysdev/xive/common.c
> >> index 678680531d26..7581cb12bb53 100644
> >> --- a/arch/powerpc/sysdev/xive/common.c
> >> +++ b/arch/powerpc/sysdev/xive/common.c
> >> @@ -1579,17 +1579,14 @@ static void xive_debug_show_cpu(struct seq_file 
> >> *m, int cpu)
> >>seq_puts(m, "\n");
> >>  }
> >>  
> >> -static void xive_debug_show_irq(struct seq_file *m, u32 hw_irq, struct 
> >> irq_data *d)
> >> +static void xive_debug_show_irq(struct seq_file *m, struct irq_data *d)
> >>  {
> >> -  struct irq_chip *chip = irq_data_get_irq_chip(d);
> >> +  unsigned int hw_irq = (unsigned int)irqd_to_hwirq(d);
> >>int rc;
> >>u32 target;
> >>u8 prio;
> >>u32 lirq;
> >>  
> >> -  if (!is_xive_irq(chip))
> >> -  return;
> >> -
> >>rc = xive_ops->get_irq_config(hw_irq, &target, &prio, &lirq);
> >>if (rc) {
> >>seq_printf(m, "IRQ 0x%08x : no config rc=%d\n", hw_irq, rc);
> >> @@ -1627,16 +1624,9 @@ static int xive_core_debug_show(struct seq_file *m, 
> >> void *private)
> >>  
> >>for_each_irq_desc(i, desc) {
> >>struct irq_data *d = irq_desc_get_irq_data(desc);
> >> -  unsigned int hw_irq;
> >> -
> >> -  if (!d)
> >> -  continue;
> >> -
> >> -  hw_irq = (unsigned int)irqd_to_hwirq(d);
> >>  
> >> -  /* IPIs are special (HW number 0) */
> >> -  if (hw_irq != XIVE_IPI_HW_IRQ)
> >> -  xive_debug_show_irq(m, hw_irq, d);
> >> +  if (d->domain == xive_irq_domain)
> >> +  xive_debug_show_irq(m, d);
> >>}
> >>return 0;
> >>  }
> > 
>

Re: [PATCH v2 2/8] powerpc/xive: Introduce an IPI interrupt domain

2021-03-08 Thread Greg Kurz

On Wed, 3 Mar 2021 18:48:51 +0100
Cédric Le Goater  wrote:

> The IPI interrupt is a special case of the XIVE IRQ domain. When
> mapping and unmapping the interrupts in the Linux interrupt number
> space, the HW interrupt number 0 (XIVE_IPI_HW_IRQ) is checked to
> distinguish the IPI interrupt from other interrupts of the system.
> 
> Simplify the XIVE interrupt domain by introducing a specific domain
> for the IPI.
> 
> Signed-off-by: Cédric Le Goater 
> ---

Nice !

Reviewed-by: Greg Kurz 

>  arch/powerpc/sysdev/xive/common.c | 51 +--
>  1 file changed, 22 insertions(+), 29 deletions(-)
> 
> diff --git a/arch/powerpc/sysdev/xive/common.c 
> b/arch/powerpc/sysdev/xive/common.c
> index b8e456da28aa..e7783760d278 100644
> --- a/arch/powerpc/sysdev/xive/common.c
> +++ b/arch/powerpc/sysdev/xive/common.c
> @@ -63,6 +63,8 @@ static const struct xive_ops *xive_ops;
>  static struct irq_domain *xive_irq_domain;
>  
>  #ifdef CONFIG_SMP
> +static struct irq_domain *xive_ipi_irq_domain;
> +
>  /* The IPIs all use the same logical irq number */
>  static u32 xive_ipi_irq;
>  #endif
> @@ -1067,20 +1069,32 @@ static struct irq_chip xive_ipi_chip = {
>   .irq_unmask = xive_ipi_do_nothing,
>  };
>  
> +/*
> + * IPIs are marked per-cpu. We use separate HW interrupts under the
> + * hood but associated with the same "linux" interrupt
> + */
> +static int xive_ipi_irq_domain_map(struct irq_domain *h, unsigned int virq,
> +irq_hw_number_t hw)
> +{
> + irq_set_chip_and_handler(virq, &xive_ipi_chip, handle_percpu_irq);
> + return 0;
> +}
> +
> +static const struct irq_domain_ops xive_ipi_irq_domain_ops = {
> + .map = xive_ipi_irq_domain_map,
> +};
> +
>  static void __init xive_request_ipi(void)
>  {
>   unsigned int virq;
>  
> - /*
> -  * Initialization failed, move on, we might manage to
> -  * reach the point where we display our errors before
> -  * the system falls appart
> -  */
> - if (!xive_irq_domain)
> + xive_ipi_irq_domain = irq_domain_add_linear(NULL, 1,
> + &xive_ipi_irq_domain_ops, 
> NULL);
> + if (WARN_ON(xive_ipi_irq_domain == NULL))
>   return;
>  
>   /* Initialize it */
> - virq = irq_create_mapping(xive_irq_domain, XIVE_IPI_HW_IRQ);
> + virq = irq_create_mapping(xive_ipi_irq_domain, XIVE_IPI_HW_IRQ);
>   xive_ipi_irq = virq;
>  
>   WARN_ON(request_irq(virq, xive_muxed_ipi_action,
> @@ -1178,19 +1192,6 @@ static int xive_irq_domain_map(struct irq_domain *h, 
> unsigned int virq,
>*/
>   irq_clear_status_flags(virq, IRQ_LEVEL);
>  
> -#ifdef CONFIG_SMP
> - /* IPIs are special and come up with HW number 0 */
> - if (hw == XIVE_IPI_HW_IRQ) {
> - /*
> -  * IPIs are marked per-cpu. We use separate HW interrupts under
> -  * the hood but associated with the same "linux" interrupt
> -  */
> - irq_set_chip_and_handler(virq, &xive_ipi_chip,
> -  handle_percpu_irq);
> - return 0;
> - }
> -#endif
> -
>   rc = xive_irq_alloc_data(virq, hw);
>   if (rc)
>   return rc;
> @@ -1202,15 +1203,7 @@ static int xive_irq_domain_map(struct irq_domain *h, 
> unsigned int virq,
>  
>  static void xive_irq_domain_unmap(struct irq_domain *d, unsigned int virq)
>  {
> - struct irq_data *data = irq_get_irq_data(virq);
> - unsigned int hw_irq;
> -
> - /* XXX Assign BAD number */
> - if (!data)
> - return;
> - hw_irq = (unsigned int)irqd_to_hwirq(data);
> - if (hw_irq != XIVE_IPI_HW_IRQ)
> - xive_irq_free_data(virq);
> + xive_irq_free_data(virq);
>  }
>  
>  static int xive_irq_domain_xlate(struct irq_domain *h, struct device_node 
> *ct,

Re: [PATCH v2 4/8] powerpc/xive: Simplify xive_core_debug_show()

2021-03-08 Thread Greg Kurz

On Wed, 3 Mar 2021 18:48:53 +0100
Cédric Le Goater  wrote:

> Now that the IPI interrupt has its own domain, the checks on the HW
> interrupt number XIVE_IPI_HW_IRQ and on the chip can be replaced by a
> check on the domain.
> 
> Signed-off-by: Cédric Le Goater 
> ---

Shouldn't this have the following tags ?

Reported-by: kernel test robot 
Reported-by: Dan Carpenter 
Fixes: 930914b7d528 ("powerpc/xive: Add a debugfs file to dump internal XIVE 
state")


Anyway,

Reviewed-by: Greg Kurz 

>  arch/powerpc/sysdev/xive/common.c | 18 --
>  1 file changed, 4 insertions(+), 14 deletions(-)
> 
> diff --git a/arch/powerpc/sysdev/xive/common.c 
> b/arch/powerpc/sysdev/xive/common.c
> index 678680531d26..7581cb12bb53 100644
> --- a/arch/powerpc/sysdev/xive/common.c
> +++ b/arch/powerpc/sysdev/xive/common.c
> @@ -1579,17 +1579,14 @@ static void xive_debug_show_cpu(struct seq_file *m, 
> int cpu)
>   seq_puts(m, "\n");
>  }
>  
> -static void xive_debug_show_irq(struct seq_file *m, u32 hw_irq, struct 
> irq_data *d)
> +static void xive_debug_show_irq(struct seq_file *m, struct irq_data *d)
>  {
> - struct irq_chip *chip = irq_data_get_irq_chip(d);
> + unsigned int hw_irq = (unsigned int)irqd_to_hwirq(d);
>   int rc;
>   u32 target;
>   u8 prio;
>   u32 lirq;
>  
> - if (!is_xive_irq(chip))
> - return;
> -
>   rc = xive_ops->get_irq_config(hw_irq, &target, &prio, &lirq);
>   if (rc) {
>   seq_printf(m, "IRQ 0x%08x : no config rc=%d\n", hw_irq, rc);
> @@ -1627,16 +1624,9 @@ static int xive_core_debug_show(struct seq_file *m, 
> void *private)
>  
>   for_each_irq_desc(i, desc) {
>   struct irq_data *d = irq_desc_get_irq_data(desc);
> - unsigned int hw_irq;
> -
> - if (!d)
> - continue;
> -
> - hw_irq = (unsigned int)irqd_to_hwirq(d);
>  
> - /* IPIs are special (HW number 0) */
> - if (hw_irq != XIVE_IPI_HW_IRQ)
> - xive_debug_show_irq(m, hw_irq, d);
> + if (d->domain == xive_irq_domain)
> + xive_debug_show_irq(m, d);
>   }
>   return 0;
>  }

Re: [PATCH v2 3/8] powerpc/xive: Remove useless check on XIVE_IPI_HW_IRQ

2021-03-08 Thread Greg Kurz

On Wed, 3 Mar 2021 18:48:52 +0100
Cédric Le Goater  wrote:

> The IPI interrupt has its own domain now. Testing the HW interrupt
> number is not needed anymore.
> 
> Signed-off-by: Cédric Le Goater 
> ---

Reviewed-by: Greg Kurz 

>  arch/powerpc/sysdev/xive/common.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/sysdev/xive/common.c 
> b/arch/powerpc/sysdev/xive/common.c
> index e7783760d278..678680531d26 100644
> --- a/arch/powerpc/sysdev/xive/common.c
> +++ b/arch/powerpc/sysdev/xive/common.c
> @@ -1396,13 +1396,12 @@ static void xive_flush_cpu_queue(unsigned int cpu, 
> struct xive_cpu *xc)
>   struct irq_desc *desc = irq_to_desc(irq);
>   struct irq_data *d = irq_desc_get_irq_data(desc);
>   struct xive_irq_data *xd;
> - unsigned int hw_irq = (unsigned int)irqd_to_hwirq(d);
>  
>   /*
>* Ignore anything that isn't a XIVE irq and ignore
>* IPIs, so can just be dropped.
>*/
> - if (d->domain != xive_irq_domain || hw_irq == XIVE_IPI_HW_IRQ)
> + if (d->domain != xive_irq_domain)
>   continue;
>  
>   /*

Re: [PATCH v2 1/8] powerpc/xive: Use cpu_to_node() instead of ibm,chip-id property

2021-03-08 Thread Greg Kurz

On Wed, 3 Mar 2021 18:48:50 +0100
Cédric Le Goater  wrote:

> The 'chip_id' field of the XIVE CPU structure is used to choose a
> target for a source located on the same chip when possible. This field
> is assigned on the PowerNV platform using the "ibm,chip-id" property
> on pSeries under KVM when NUMA nodes are defined but it is undefined

This sentence seems to have a syntax problem... like it is missing an
'and' before 'on pSeries'.

> under PowerVM. The XIVE source structure has a similar field
> 'src_chip' which is only assigned on the PowerNV platform.
> 
> cpu_to_node() returns a compatible value on all platforms, 0 being the
> default node. It will also give us the opportunity to set the affinity
> of a source on pSeries when we can localize them.
> 

IIUC this relies on the fact that the NUMA node id is equal to the chip id
on PowerNV, i.e. xc->chip_id, which is passed to OPAL, remains stable
with this change.

On the other hand, you have the pSeries case under PowerVM that
doesn't set xc->chip_id, which isn't passed to any hcall AFAICT. It
looks like the chip id is only used for localization purposes in
this case, right ?

In this case, what about doing this change for pSeries only,
somewhere in spapr.c ?

> Signed-off-by: Cédric Le Goater 
> ---
>  arch/powerpc/sysdev/xive/common.c | 7 +--
>  1 file changed, 1 insertion(+), 6 deletions(-)
> 
> diff --git a/arch/powerpc/sysdev/xive/common.c 
> b/arch/powerpc/sysdev/xive/common.c
> index 595310e056f4..b8e456da28aa 100644
> --- a/arch/powerpc/sysdev/xive/common.c
> +++ b/arch/powerpc/sysdev/xive/common.c
> @@ -1335,16 +1335,11 @@ static int xive_prepare_cpu(unsigned int cpu)
>  
>   xc = per_cpu(xive_cpu, cpu);
>   if (!xc) {
> - struct device_node *np;
> -
>   xc = kzalloc_node(sizeof(struct xive_cpu),
> GFP_KERNEL, cpu_to_node(cpu));
>   if (!xc)
>   return -ENOMEM;
> - np = of_get_cpu_node(cpu, NULL);
> - if (np)
> - xc->chip_id = of_get_ibm_chip_id(np);
> - of_node_put(np);
> + xc->chip_id = cpu_to_node(cpu);
>   xc->hw_ipi = XIVE_BAD_IRQ;
>  
>   per_cpu(xive_cpu, cpu) = xc;

[PATCH v2] powerpc/pseries: Don't enforce MSI affinity with kdump

2021-02-15 Thread Greg Kurz

Depending on the number of online CPUs in the original kernel, it is
likely for CPU #0 to be offline in a kdump kernel. The associated IRQs
in the affinity mappings provided by irq_create_affinity_masks() are
thus not started by irq_startup(), as per design with managed IRQs.

This can be a problem with multi-queue block devices driven by blk-mq :
such a non-started IRQ is very likely paired with the single queue
enforced by blk-mq during kdump (see blk_mq_alloc_tag_set()). This
causes the device to remain silent and likely hangs the guest at
some point.

This is a regression caused by commit 9ea69a55b3b9 ("powerpc/pseries:
Pass MSI affinity to irq_create_mapping()"). Note that this only happens
with the XIVE interrupt controller because XICS has a workaround to bypass
affinity, which is activated during kdump with the "noirqdistrib" kernel
parameter.

The issue comes from a combination of factors:
- discrepancy between the number of queues detected by the multi-queue
  block driver, that was used to create the MSI vectors, and the single
  queue mode enforced later on by blk-mq because of kdump (i.e. keeping
  all queues fixes the issue)
- CPU#0 offline (i.e. kdump always succeeds with CPU#0 online)

Given that I couldn't reproduce on x86, which seems to always have CPU#0
online even during kdump, I'm not sure where this should be fixed. Hence
going for another approach : fine-grained affinity is for performance
and we don't really care about that during kdump. Simply revert to the
previous working behavior of ignoring affinity masks in this case only.

Fixes: 9ea69a55b3b9 ("powerpc/pseries: Pass MSI affinity to 
irq_create_mapping()")
Cc: lviv...@redhat.com
Cc: sta...@vger.kernel.org
Reviewed-by: Laurent Vivier 
Reviewed-by: Cédric Le Goater 
Signed-off-by: Greg Kurz 
---

v2: - added missing #include 

 arch/powerpc/platforms/pseries/msi.c | 25 +++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/msi.c 
b/arch/powerpc/platforms/pseries/msi.c
index b3ac2455faad..637300330507 100644
--- a/arch/powerpc/platforms/pseries/msi.c
+++ b/arch/powerpc/platforms/pseries/msi.c
@@ -4,6 +4,7 @@
  * Copyright 2006-2007 Michael Ellerman, IBM Corp.
  */
 
+#include 
 #include 
 #include 
 #include 
@@ -458,8 +459,28 @@ static int rtas_setup_msi_irqs(struct pci_dev *pdev, int 
nvec_in, int type)
return hwirq;
}
 
-   virq = irq_create_mapping_affinity(NULL, hwirq,
-  entry->affinity);
+   /*
+* Depending on the number of online CPUs in the original
+* kernel, it is likely for CPU #0 to be offline in a kdump
+* kernel. The associated IRQs in the affinity mappings
+* provided by irq_create_affinity_masks() are thus not
+* started by irq_startup(), as per-design for managed IRQs.
+* This can be a problem with multi-queue block devices driven
+* by blk-mq : such a non-started IRQ is very likely paired
+* with the single queue enforced by blk-mq during kdump (see
+* blk_mq_alloc_tag_set()). This causes the device to remain
+* silent and likely hangs the guest at some point.
+*
+* We don't really care for fine-grained affinity when doing
+* kdump actually : simply ignore the pre-computed affinity
+* masks in this case and let the default mask with all CPUs
+* be used when creating the IRQ mappings.
+*/
+   if (is_kdump_kernel())
+   virq = irq_create_mapping(NULL, hwirq);
+   else
+   virq = irq_create_mapping_affinity(NULL, hwirq,
+  entry->affinity);
 
if (!virq) {
pr_debug("rtas_msi: Failed mapping hwirq %d\n", hwirq);
-- 
2.26.2

[PATCH] powerpc/pseries: Don't enforce MSI affinity with kdump

2021-02-12 Thread Greg Kurz

Depending on the number of online CPUs in the original kernel, it is
likely for CPU #0 to be offline in a kdump kernel. The associated IRQs
in the affinity mappings provided by irq_create_affinity_masks() are
thus not started by irq_startup(), as per design with managed IRQs.

This can be a problem with multi-queue block devices driven by blk-mq :
such a non-started IRQ is very likely paired with the single queue
enforced by blk-mq during kdump (see blk_mq_alloc_tag_set()). This
causes the device to remain silent and likely hangs the guest at
some point.

This is a regression caused by commit 9ea69a55b3b9 ("powerpc/pseries:
Pass MSI affinity to irq_create_mapping()"). Note that this only happens
with the XIVE interrupt controller because XICS has a workaround to bypass
affinity, which is activated during kdump with the "noirqdistrib" kernel
parameter.

The issue comes from a combination of factors:
- discrepancy between the number of queues detected by the multi-queue
  block driver, that was used to create the MSI vectors, and the single
  queue mode enforced later on by blk-mq because of kdump (i.e. keeping
  all queues fixes the issue)
- CPU#0 offline (i.e. kdump always succeeds with CPU#0 online)

Given that I couldn't reproduce on x86, which seems to always have CPU#0
online even during kdump, I'm not sure where this should be fixed. Hence
going for another approach : fine-grained affinity is for performance
and we don't really care about that during kdump. Simply revert to the
previous working behavior of ignoring affinity masks in this case only.

Fixes: 9ea69a55b3b9 ("powerpc/pseries: Pass MSI affinity to 
irq_create_mapping()")
Cc: lviv...@redhat.com
Cc: sta...@vger.kernel.org
Signed-off-by: Greg Kurz 
---
 arch/powerpc/platforms/pseries/msi.c | 24 ++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/msi.c 
b/arch/powerpc/platforms/pseries/msi.c
index b3ac2455faad..29d04b83288d 100644
--- a/arch/powerpc/platforms/pseries/msi.c
+++ b/arch/powerpc/platforms/pseries/msi.c
@@ -458,8 +458,28 @@ static int rtas_setup_msi_irqs(struct pci_dev *pdev, int 
nvec_in, int type)
return hwirq;
}
 
-   virq = irq_create_mapping_affinity(NULL, hwirq,
-  entry->affinity);
+   /*
+* Depending on the number of online CPUs in the original
+* kernel, it is likely for CPU #0 to be offline in a kdump
+* kernel. The associated IRQs in the affinity mappings
+* provided by irq_create_affinity_masks() are thus not
+* started by irq_startup(), as per-design for managed IRQs.
+* This can be a problem with multi-queue block devices driven
+* by blk-mq : such a non-started IRQ is very likely paired
+* with the single queue enforced by blk-mq during kdump (see
+* blk_mq_alloc_tag_set()). This causes the device to remain
+* silent and likely hangs the guest at some point.
+*
+* We don't really care for fine-grained affinity when doing
+* kdump actually : simply ignore the pre-computed affinity
+* masks in this case and let the default mask with all CPUs
+* be used when creating the IRQ mappings.
+*/
+   if (is_kdump_kernel())
+   virq = irq_create_mapping(NULL, hwirq);
+   else
+   virq = irq_create_mapping_affinity(NULL, hwirq,
+  entry->affinity);
 
if (!virq) {
pr_debug("rtas_msi: Failed mapping hwirq %d\n", hwirq);
-- 
2.26.2

Re: [RFC Qemu PATCH v2 1/2] spapr: drc: Add support for async hcalls at the drc level

2020-12-21 Thread Greg Kurz

On Mon, 21 Dec 2020 13:08:53 +0100
Greg Kurz  wrote:

> Hi Shiva,
> 
> On Mon, 30 Nov 2020 09:16:39 -0600
> Shivaprasad G Bhat  wrote:
> 
> > The patch adds support for async hcalls at the DRC level for the
> > spapr devices. To be used by spapr-scm devices in the patch/es to follow.
> > 
> > Signed-off-by: Shivaprasad G Bhat 
> > ---
> 
> The overall idea looks good but I think you should consider using
> a thread pool to implement it. See below.
> 

Some more comments.

> >  hw/ppc/spapr_drc.c |  149 
> > 
> >  include/hw/ppc/spapr_drc.h |   25 +++
> >  2 files changed, 174 insertions(+)
> > 
> > diff --git a/hw/ppc/spapr_drc.c b/hw/ppc/spapr_drc.c
> > index 77718cde1f..4ecd04f686 100644
> > --- a/hw/ppc/spapr_drc.c
> > +++ b/hw/ppc/spapr_drc.c
> > @@ -15,6 +15,7 @@
> >  #include "qapi/qmp/qnull.h"
> >  #include "cpu.h"
> >  #include "qemu/cutils.h"
> > +#include "qemu/guest-random.h"
> >  #include "hw/ppc/spapr_drc.h"
> >  #include "qom/object.h"
> >  #include "migration/vmstate.h"
> > @@ -421,6 +422,148 @@ void spapr_drc_detach(SpaprDrc *drc)
> >  spapr_drc_release(drc);
> >  }
> >  
> > +
> > +/*
> > + * @drc : device DRC targetting which the async hcalls to be made.
> > + *
> > + * All subsequent requests to run/query the status should use the
> > + * unique token returned here.
> > + */
> > +uint64_t spapr_drc_get_new_async_hcall_token(SpaprDrc *drc)
> > +{
> > +Error *err = NULL;
> > +uint64_t token;
> > +SpaprDrcDeviceAsyncHCallState *tmp, *next, *state;
> > +
> > +state = g_malloc0(sizeof(*state));
> > +state->pending = true;
> > +
> > +qemu_mutex_lock(&drc->async_hcall_states_lock);
> > +retry:
> > +if (qemu_guest_getrandom(&token, sizeof(token), &err) < 0) {
> > +error_report_err(err);
> > +g_free(state);
> > +qemu_mutex_unlock(&drc->async_hcall_states_lock);
> > +return 0;
> > +}
> > +
> > +if (!token) /* Token should be non-zero */
> > +goto retry;
> > +
> > +if (!QLIST_EMPTY(&drc->async_hcall_states)) {
> > +QLIST_FOREACH_SAFE(tmp, &drc->async_hcall_states, node, next) {
> > +if (tmp->continue_token == token) {
> > +/* If the token already in use, get a new one */
> > +goto retry;
> > +}
> > +}
> > +}
> > +
> > +state->continue_token = token;
> > +QLIST_INSERT_HEAD(&drc->async_hcall_states, state, node);
> > +
> > +qemu_mutex_unlock(&drc->async_hcall_states_lock);
> > +
> > +return state->continue_token;
> > +}
> > +
> > +static void *spapr_drc_async_hcall_runner(void *opaque)
> > +{
> > +int response = -1;

It feels wrong since the return value of func() is supposed to be
opaque to this function. And anyway it isn't needed since response
is always set a few lines below.

> > +SpaprDrcDeviceAsyncHCallState *state = opaque;
> > +
> > +/*
> > + * state is freed only after this thread finishes(after 
> > pthread_join()),
> > + * don't worry about it becoming NULL.
> > + */
> > +
> > +response = state->func(state->data);
> > +
> > +state->hcall_ret = response;
> > +state->pending = 0;

s/0/false/
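
Putting the two remarks above together, the runner could look something
like the sketch below. This is only an illustration: the state structure
is a trimmed-down, hypothetical stand-in for SpaprDrcDeviceAsyncHCallState,
sample_worker() exists only to exercise it, and any synchronisation
between threads is left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the per-hcall state. */
typedef int (*fake_worker_fn)(void *data);

struct fake_async_state {
	fake_worker_fn func;
	void *data;
	int hcall_ret;
	bool pending;
};

/* Example worker, only here to exercise the runner. */
static int sample_worker(void *data)
{
	return *(int *)data;
}

static void *fake_async_hcall_runner(void *opaque)
{
	struct fake_async_state *state = opaque;

	/*
	 * No local default value: whatever func() returns is stored
	 * as is, without the runner interpreting it.
	 */
	state->hcall_ret = state->func(state->data);
	state->pending = false;

	return NULL;
}
```

The return value of the worker stays opaque to the runner, and the
boolean field is only ever assigned true or false.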

> > +
> > +return NULL;
> > +}
> > +
> > +/*
> > + * @drc  : device DRC targetting which the async hcalls to be made.
> > + * token : The continue token to be used for tracking as recived from
> > + * spapr_drc_get_new_async_hcall_token
> > + * @func() : the worker function which needs to be executed asynchronously
> > + * @data : data to be passed to the asynchronous function. Worker is supposed
> > + * to free/cleanup the data that is passed here
> 
> It'd be cleaner to pass a completion callback and have free/cleanup handled 
> there.
> 
> > + */
> > +void spapr_drc_run_async_hcall(SpaprDrc *drc, uint64_t token,
> > +   SpaprDrcAsyncHcallWorkerFunc *func, void 
> > *data)
> > +{
> > +SpaprDrcDeviceAsyncHCallState *state;
> > +
> > +qemu_mutex_lock(&drc->async_hcall_states_lock);
> > +QLIST_FOREACH(state, &drc->async_hcall_states, node) {
> > +if (state->continue_token == token) {
> > +

Re: [RFC Qemu PATCH v2 0/2] spapr: nvdimm: Asynchronus flush hcall support

2020-12-21 Thread Greg Kurz

On Mon, 30 Nov 2020 09:16:14 -0600
Shivaprasad G Bhat  wrote:

> The nvdimm devices are expected to ensure write persistent during power
> failure kind of scenarios.
> 
> The libpmem has architecture specific instructions like dcbf on power
> to flush the cache data to backend nvdimm device during normal writes.
> 
> Qemu - virtual nvdimm devices are memory mapped. The dcbf in the guest
> doesn't translate to actual flush to the backend file on the host in case
> of file backed vnvdimms. This is addressed by virtio-pmem in case of x86_64
> by making asynchronous flushes.
> 
> On PAPR, the issue is addressed by adding a new hcall to
> request explicit asynchronous flushes from the guest ndctl
> driver when the backend nvdimm cannot ensure write persistence with dcbf
> alone. So, the approach here is to convey when the asynchronous flush is
> required in a device tree property. The guest makes the hcall when the
> property is found, instead of relying on dcbf.
> 
> The first patch adds the necessary asynchronous hcall support infrastructure
> code at the DRC level. Second patch implements the hcall using the
> infrastructure.
> 
> Hcall semantics are in review and not final.
> 
> A new device property sync-dax is added to the nvdimm device. When the 
> sync-dax is off(default), the asynchronous hcalls will be called.
> 
> With respect to save from new qemu to restore on old qemu, having the
> sync-dax by default off(when not specified) causes IO errors in guests as
> the async-hcall would not be supported on old qemu. The new hcall
> implementation being supported only on the new  pseries machine version,
> the current machine version checks may be sufficient to prevent
> such migration. Please suggest what should be done.
> 

First, all requests that are still not completed from the guest POV,
i.e. the hcall hasn't returned H_SUCCESS yet, are state that we should
migrate in theory. In this case, I guess we'd rather drain all
pending requests on the source in some pre-save handler.

Then, as explained in another mail, you should enforce stable behavior
for existing machine types with some hw_compat magic.

> The below demonstration shows the map_sync behavior with sync-dax on & off.
> (https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master/memory/ndctl.py.data/map_sync.c)
> 
> The pmem0 is from nvdimm with sync-dax=on, and pmem1 is from nvdimm with
> sync-dax=off, mounted as
> /dev/pmem0 on /mnt1 type xfs 
> (rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize=32k,noquota)
> /dev/pmem1 on /mnt2 type xfs 
> (rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize=32k,noquota)
> 
> [root@atest-guest ~]# ./mapsync /mnt1/newfile> When sync-dax=off
> [root@atest-guest ~]# ./mapsync /mnt2/newfile> when sync-dax=on
> Failed to mmap  with Operation not supported
> 
> ---
> v1 - https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg06330.html
> Changes from v1
>   - Fixed a missed-out unlock
>   - using QLIST_FOREACH instead of QLIST_FOREACH_SAFE while generating 
> token
> 
> Shivaprasad G Bhat (2):
>   spapr: drc: Add support for async hcalls at the drc level
>   spapr: nvdimm: Implement async flush hcalls
> 
> 
>  hw/mem/nvdimm.c|1
>  hw/ppc/spapr_drc.c |  146 
> 
>  hw/ppc/spapr_nvdimm.c  |   79 
>  include/hw/mem/nvdimm.h|   10 +++
>  include/hw/ppc/spapr.h |3 +
>  include/hw/ppc/spapr_drc.h |   25 
>  6 files changed, 263 insertions(+), 1 deletion(-)
> 
> --
> Signature
> 
>

Re: [RFC Qemu PATCH v2 2/2] spapr: nvdimm: Implement async flush hcalls

2020-12-21 Thread Greg Kurz

On Mon, 30 Nov 2020 09:17:24 -0600
Shivaprasad G Bhat  wrote:

> When the persistent memory is backed by a file, a cpu cache flush instruction
> is not sufficient to ensure the stores are correctly flushed to the media.
> 
> The patch implements the async hcalls for flush operation on demand from the
> guest kernel.
> 
> The device option sync-dax is off by default, which enables explicit asynchronous
> flush requests from the guest. They can be disabled by setting sync-dax=on.
> 
> Signed-off-by: Shivaprasad G Bhat 
> ---
>  hw/mem/nvdimm.c |1 +
>  hw/ppc/spapr_nvdimm.c   |   79 
> +++
>  include/hw/mem/nvdimm.h |   10 ++
>  include/hw/ppc/spapr.h  |3 +-
>  4 files changed, 92 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c
> index 03c2201b56..37a4db0135 100644
> --- a/hw/mem/nvdimm.c
> +++ b/hw/mem/nvdimm.c
> @@ -220,6 +220,7 @@ static void nvdimm_write_label_data(NVDIMMDevice *nvdimm, 
> const void *buf,
>  
>  static Property nvdimm_properties[] = {
>  DEFINE_PROP_BOOL(NVDIMM_UNARMED_PROP, NVDIMMDevice, unarmed, false),
> +DEFINE_PROP_BOOL(NVDIMM_SYNC_DAX_PROP, NVDIMMDevice, sync_dax, false),
>  DEFINE_PROP_END_OF_LIST(),
>  };
>  
> diff --git a/hw/ppc/spapr_nvdimm.c b/hw/ppc/spapr_nvdimm.c
> index a833a63b5e..557e36aa98 100644
> --- a/hw/ppc/spapr_nvdimm.c
> +++ b/hw/ppc/spapr_nvdimm.c
> @@ -22,6 +22,7 @@
>   * THE SOFTWARE.
>   */
>  #include "qemu/osdep.h"
> +#include "qemu/cutils.h"
>  #include "qapi/error.h"
>  #include "hw/ppc/spapr_drc.h"
>  #include "hw/ppc/spapr_nvdimm.h"
> @@ -155,6 +156,11 @@ static int spapr_dt_nvdimm(SpaprMachineState *spapr, 
> void *fdt,
>   "operating-system")));
>  _FDT(fdt_setprop(fdt, child_offset, "ibm,cache-flush-required", NULL, 
> 0));
>  
> +if (!nvdimm->sync_dax) {

So this is done unconditionally for all machine types. This means that a
guest started on a newer QEMU cannot be migrated to an older QEMU. This
is annoying because people legitimately expect an existing machine type
to be migratable to any QEMU version that supports it.

This means that something like the following should be added in hw_compat_5_2[]
to fix the property for pre-6.0 machine types:

{ "nvdimm", "sync-dax", "on" },
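To illustrate the effect of such a compat entry, here is a minimal
standalone sketch. It does not use QEMU's real GlobalProperty machinery:
CompatProp, compat_5_2_sketch and effective_sync_dax() are hypothetical
names that only mimic how a per-machine-type override takes precedence
over the device default.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a compat entry: driver, property, value. */
typedef struct {
    const char *driver;
    const char *property;
    const char *value;
} CompatProp;

/* Hypothetical compat table applied to pre-6.0 machine types only. */
static const CompatProp compat_5_2_sketch[] = {
    { "nvdimm", "sync-dax", "on" },
};

/*
 * Return the effective sync-dax setting: the compat override when the
 * table has one for nvdimm/sync-dax, else the device default.
 */
static bool effective_sync_dax(const CompatProp *compat, size_t n,
                               bool dev_default)
{
    for (size_t i = 0; i < n; i++) {
        if (!strcmp(compat[i].driver, "nvdimm") &&
            !strcmp(compat[i].property, "sync-dax")) {
            return !strcmp(compat[i].value, "on");
        }
    }
    return dev_default;
}
```

With the table applied, an older machine type ends up with sync-dax=on
and so never advertises "ibm,async-flush-required"; a new machine type
(empty table) keeps the patch's default of off.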

> +_FDT(fdt_setprop(fdt, child_offset, "ibm,async-flush-required",
> + NULL, 0));
> +}
> +
>  return child_offset;
>  }
>  
> @@ -370,6 +376,78 @@ static target_ulong h_scm_bind_mem(PowerPCCPU *cpu, 
> SpaprMachineState *spapr,
>  return H_SUCCESS;
>  }
>  
> +typedef struct SCMAsyncFlushData {
> +int fd;
> +uint64_t token;
> +} SCMAsyncFlushData;
> +
> +static int flush_worker_cb(void *opaque)
> +{
> +int ret = H_SUCCESS;
> +SCMAsyncFlushData *req_data = opaque;
> +
> +/* flush raw backing image */
> +if (qemu_fdatasync(req_data->fd) < 0) {
> +error_report("papr_scm: Could not sync nvdimm to backend file: %s",
> + strerror(errno));
> +ret = H_HARDWARE;
> +}
> +
> +g_free(req_data);
> +
> +return ret;
> +}
> +
> +static target_ulong h_scm_async_flush(PowerPCCPU *cpu, SpaprMachineState 
> *spapr,
> +  target_ulong opcode, target_ulong 
> *args)
> +{
> +int ret;
> +uint32_t drc_index = args[0];
> +uint64_t continue_token = args[1];
> +SpaprDrc *drc = spapr_drc_by_index(drc_index);
> +PCDIMMDevice *dimm;
> +HostMemoryBackend *backend = NULL;
> +SCMAsyncFlushData *req_data = NULL;
> +
> +if (!drc || !drc->dev ||
> +spapr_drc_type(drc) != SPAPR_DR_CONNECTOR_TYPE_PMEM) {
> +return H_PARAMETER;
> +}
> +
> +if (continue_token != 0) {
> +ret = spapr_drc_get_async_hcall_status(drc, continue_token);
> +if (ret == H_BUSY) {
> +args[0] = continue_token;
> +return H_LONG_BUSY_ORDER_1_SEC;
> +}
> +
> +return ret;
> +}
> +
> +dimm = PC_DIMM(drc->dev);
> +backend = MEMORY_BACKEND(dimm->hostmem);
> +
> +req_data = g_malloc0(sizeof(SCMAsyncFlushData));
> +req_data->fd = memory_region_get_fd(&backend->mr);
> +
> +continue_token = spapr_drc_get_new_async_hcall_token(drc);
> +if (!continue_token) {
> +g_free(req_data);
> +return H_P2;
> +}
> +req_data->token = continue_token;
> +
> +spapr_drc_run_async_hcall(drc, continue_token, &flush_worker_cb, req_data);
> +
> +ret = spapr_drc_get_async_hcall_status(drc, continue_token);
> +if (ret == H_BUSY) {
> +args[0] = req_data->token;
> +return ret;
> +}
> +
> +return ret;
> +}
> +
>  static target_ulong h_scm_unbind_mem(PowerPCCPU *cpu, SpaprMachineState 
> *spapr,
>   target_ulong opcode, target_ulong *args)
>  {
> @@ -486,6 +564,7 @@ static void spapr_scm_register_types(void)
>  spapr_register_hypercall(H_SCM_BIND_MEM, h_scm_bind_mem);
>

Re: [RFC Qemu PATCH v2 1/2] spapr: drc: Add support for async hcalls at the drc level

2020-12-21 Thread Greg Kurz

Hi Shiva,

On Mon, 30 Nov 2020 09:16:39 -0600
Shivaprasad G Bhat  wrote:

> The patch adds support for async hcalls at the DRC level for the
> spapr devices. To be used by spapr-scm devices in the patch/es to follow.
> 
> Signed-off-by: Shivaprasad G Bhat 
> ---

The overall idea looks good but I think you should consider using
a thread pool to implement it. See below.

>  hw/ppc/spapr_drc.c |  149 
> 
>  include/hw/ppc/spapr_drc.h |   25 +++
>  2 files changed, 174 insertions(+)
> 
> diff --git a/hw/ppc/spapr_drc.c b/hw/ppc/spapr_drc.c
> index 77718cde1f..4ecd04f686 100644
> --- a/hw/ppc/spapr_drc.c
> +++ b/hw/ppc/spapr_drc.c
> @@ -15,6 +15,7 @@
>  #include "qapi/qmp/qnull.h"
>  #include "cpu.h"
>  #include "qemu/cutils.h"
> +#include "qemu/guest-random.h"
>  #include "hw/ppc/spapr_drc.h"
>  #include "qom/object.h"
>  #include "migration/vmstate.h"
> @@ -421,6 +422,148 @@ void spapr_drc_detach(SpaprDrc *drc)
>  spapr_drc_release(drc);
>  }
>  
> +
> +/*
> + * @drc : device DRC targetting which the async hcalls to be made.
> + *
> + * All subsequent requests to run/query the status should use the
> + * unique token returned here.
> + */
> +uint64_t spapr_drc_get_new_async_hcall_token(SpaprDrc *drc)
> +{
> +Error *err = NULL;
> +uint64_t token;
> +SpaprDrcDeviceAsyncHCallState *tmp, *next, *state;
> +
> +state = g_malloc0(sizeof(*state));
> +state->pending = true;
> +
> +qemu_mutex_lock(&drc->async_hcall_states_lock);
> +retry:
> +if (qemu_guest_getrandom(&token, sizeof(token), &err) < 0) {
> +error_report_err(err);
> +g_free(state);
> +qemu_mutex_unlock(&drc->async_hcall_states_lock);
> +return 0;
> +}
> +
> +if (!token) /* Token should be non-zero */
> +goto retry;
> +
> +if (!QLIST_EMPTY(&drc->async_hcall_states)) {
> +QLIST_FOREACH_SAFE(tmp, &drc->async_hcall_states, node, next) {
> +if (tmp->continue_token == token) {
> +/* If the token already in use, get a new one */
> +goto retry;
> +}
> +}
> +}
> +
> +state->continue_token = token;
> +QLIST_INSERT_HEAD(&drc->async_hcall_states, state, node);
> +
> +qemu_mutex_unlock(&drc->async_hcall_states_lock);
> +
> +return state->continue_token;
> +}
> +
> +static void *spapr_drc_async_hcall_runner(void *opaque)
> +{
> +int response = -1;
> +SpaprDrcDeviceAsyncHCallState *state = opaque;
> +
> +/*
> + * state is freed only after this thread finishes(after pthread_join()),
> + * don't worry about it becoming NULL.
> + */
> +
> +response = state->func(state->data);
> +
> +state->hcall_ret = response;
> +state->pending = 0;
> +
> +return NULL;
> +}
> +
> +/*
> + * @drc  : device DRC targetting which the async hcalls to be made.
> + * token : The continue token to be used for tracking as recived from
> + * spapr_drc_get_new_async_hcall_token
> + * @func() : the worker function which needs to be executed asynchronously
> + * @data : data to be passed to the asynchronous function. Worker is supposed
> + * to free/cleanup the data that is passed here

It'd be cleaner to pass a completion callback and have free/cleanup handled 
there.

> + */
> +void spapr_drc_run_async_hcall(SpaprDrc *drc, uint64_t token,
> +   SpaprDrcAsyncHcallWorkerFunc *func, void 
> *data)
> +{
> +SpaprDrcDeviceAsyncHCallState *state;
> +
> +qemu_mutex_lock(&drc->async_hcall_states_lock);
> +QLIST_FOREACH(state, &drc->async_hcall_states, node) {
> +if (state->continue_token == token) {
> +state->func = func;
> +state->data = data;
> +qemu_thread_create(&state->thread, "sPAPR Async HCALL",
> +   spapr_drc_async_hcall_runner, state,
> +   QEMU_THREAD_JOINABLE);

qemu_thread_create() exits on failure, so it shouldn't be called on
a guest-triggerable path: e.g. a buggy guest could call it up to
the point that pthread_create() returns EAGAIN, which would kill QEMU.

Please use a thread pool (see thread_pool_submit_aio()). This takes care
of all the thread housekeeping for you in a safe way, and it provides a
completion callback API. The implementation could then be just about
having two lists: one for pending requests (fed here) and one for
completed requests (fed by the completion callback).
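A minimal standalone sketch of that two-list bookkeeping follows. It
deliberately leaves out the thread pool itself and any locking: nothing
here calls thread_pool_submit_aio(), and AsyncReq, AsyncReqLists,
req_submit(), req_complete() and req_query() are hypothetical names
used for illustration only. req_complete() stands in for the completion
callback.

```c
#include <stdint.h>
#include <stdlib.h>

/* One in-flight or finished request, identified by its continue token. */
typedef struct AsyncReq {
    uint64_t token;
    int ret;                    /* worker result, valid once completed */
    struct AsyncReq *next;
} AsyncReq;

typedef struct {
    AsyncReq *pending;          /* fed when the hcall is first issued */
    AsyncReq *completed;        /* fed by the completion callback */
} AsyncReqLists;

static void list_push(AsyncReq **head, AsyncReq *req)
{
    req->next = *head;
    *head = req;
}

/* Unlink and return the request carrying @token, or NULL if absent. */
static AsyncReq *list_take(AsyncReq **head, uint64_t token)
{
    for (AsyncReq **p = head; *p; p = &(*p)->next) {
        if ((*p)->token == token) {
            AsyncReq *req = *p;
            *p = req->next;
            req->next = NULL;
            return req;
        }
    }
    return NULL;
}

/* Called where the work would be handed to the pool. */
static int req_submit(AsyncReqLists *l, uint64_t token)
{
    AsyncReq *req = calloc(1, sizeof(*req));

    if (!req) {
        return -1;
    }
    req->token = token;
    list_push(&l->pending, req);
    return 0;
}

/* Stand-in for the completion callback: move pending -> completed. */
static void req_complete(AsyncReqLists *l, uint64_t token, int ret)
{
    AsyncReq *req = list_take(&l->pending, token);

    if (req) {
        req->ret = ret;
        list_push(&l->completed, req);
    }
}

/*
 * Status query: 1 if still pending, 0 if completed (result stored in
 * *ret and the request freed), -1 if the token is unknown.
 */
static int req_query(AsyncReqLists *l, uint64_t token, int *ret)
{
    AsyncReq *req = list_take(&l->completed, token);

    if (req) {
        *ret = req->ret;
        free(req);
        return 0;
    }
    for (req = l->pending; req; req = req->next) {
        if (req->token == token) {
            return 1;
        }
    }
    return -1;
}
```

In this shape the hcall handler would map "pending" to a busy return
code and hand back the stored result once the request has moved to the
completed list; no thread is ever created or joined on the guest
triggerable path.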

> +break;
> +}
> +}
> +qemu_mutex_unlock(&drc->async_hcall_states_lock);
> +}
> +
> +/*
> + * spapr_drc_finish_async_hcalls
> + *  Waits for all pending async requests to complete
> + *  thier execution and free the states
> + */
> +static void spapr_drc_finish_async_hcalls(SpaprDrc *drc)
> +{
> +SpaprDrcDeviceAsyncHCallState *state, *next;
> +
> +if (QLIST_EMPTY(&drc->async_hcall_states)) {
> +return;
> +}
> +
> +qemu_mutex_lock(&drc->async_hcall_states_lock);
> +QLIST_FOREACH_SAFE(state, &drc->async_hcall_states, node, next) {

Re: [PATCH 12/13] powerpc/xive: Simplify xive_do_source_eoi()

2020-12-09 Thread Greg Kurz

On Tue, 8 Dec 2020 16:11:23 +0100
Cédric Le Goater  wrote:

> Previous patches removed the need for the first argument, which was a
> hack for Firmware EOI. Remove it and flatten the routine, which has
> become simpler.
> 
> Signed-off-by: Cédric Le Goater 
> ---

Much nicer indeed.

Reviewed-by: Greg Kurz 

>  arch/powerpc/sysdev/xive/common.c | 72 ++-
>  1 file changed, 33 insertions(+), 39 deletions(-)
> 
> diff --git a/arch/powerpc/sysdev/xive/common.c 
> b/arch/powerpc/sysdev/xive/common.c
> index fe6229dd3241..fb438203d5ee 100644
> --- a/arch/powerpc/sysdev/xive/common.c
> +++ b/arch/powerpc/sysdev/xive/common.c
> @@ -348,39 +348,40 @@ static void xive_do_queue_eoi(struct xive_cpu *xc)
>   * EOI an interrupt at the source. There are several methods
>   * to do this depending on the HW version and source type
>   */
> -static void xive_do_source_eoi(u32 hw_irq, struct xive_irq_data *xd)
> +static void xive_do_source_eoi(struct xive_irq_data *xd)
>  {
> + u8 eoi_val;
> +
>   xd->stale_p = false;
> +
>   /* If the XIVE supports the new "store EOI facility, use it */
> - if (xd->flags & XIVE_IRQ_FLAG_STORE_EOI)
> + if (xd->flags & XIVE_IRQ_FLAG_STORE_EOI) {
>   xive_esb_write(xd, XIVE_ESB_STORE_EOI, 0);
> - else {
> - u8 eoi_val;
> + return;
> + }
>  
> - /*
> -  * Otherwise for EOI, we use the special MMIO that does
> -  * a clear of both P and Q and returns the old Q,
> -  * except for LSIs where we use the "EOI cycle" special
> -  * load.
> -  *
> -  * This allows us to then do a re-trigger if Q was set
> -  * rather than synthesizing an interrupt in software
> -  *
> -  * For LSIs the HW EOI cycle is used rather than PQ bits,
> -  * as they are automatically re-triggred in HW when still
> -  * pending.
> -  */
> - if (xd->flags & XIVE_IRQ_FLAG_LSI)
> - xive_esb_read(xd, XIVE_ESB_LOAD_EOI);
> - else {
> - eoi_val = xive_esb_read(xd, XIVE_ESB_SET_PQ_00);
> - DBG_VERBOSE("eoi_val=%x\n", eoi_val);
> -
> - /* Re-trigger if needed */
> - if ((eoi_val & XIVE_ESB_VAL_Q) && xd->trig_mmio)
> - out_be64(xd->trig_mmio, 0);
> - }
> + /*
> +  * For LSIs, we use the "EOI cycle" special load rather than
> +  * PQ bits, as they are automatically re-triggered in HW when
> +  * still pending.
> +  */
> + if (xd->flags & XIVE_IRQ_FLAG_LSI) {
> + xive_esb_read(xd, XIVE_ESB_LOAD_EOI);
> + return;
>   }
> +
> + /*
> +  * Otherwise, we use the special MMIO that does a clear of
> +  * both P and Q and returns the old Q. This allows us to then
> +  * do a re-trigger if Q was set rather than synthesizing an
> +  * interrupt in software
> +  */
> + eoi_val = xive_esb_read(xd, XIVE_ESB_SET_PQ_00);
> + DBG_VERBOSE("eoi_val=%x\n", eoi_val);
> +
> + /* Re-trigger if needed */
> + if ((eoi_val & XIVE_ESB_VAL_Q) && xd->trig_mmio)
> + out_be64(xd->trig_mmio, 0);
>  }
>  
>  /* irq_chip eoi callback, called with irq descriptor lock held */
> @@ -398,7 +399,7 @@ static void xive_irq_eoi(struct irq_data *d)
>*/
>   if (!irqd_irq_disabled(d) && !irqd_is_forwarded_to_vcpu(d) &&
>   !(xd->flags & XIVE_IRQ_FLAG_NO_EOI))
> - xive_do_source_eoi(irqd_to_hwirq(d), xd);
> + xive_do_source_eoi(xd);
>   else
>   xd->stale_p = true;
>  
> @@ -788,14 +789,7 @@ static int xive_irq_retrigger(struct irq_data *d)
>* 11, then perform an EOI.
>*/
>   xive_esb_read(xd, XIVE_ESB_SET_PQ_11);
> -
> - /*
> -  * Note: We pass "0" to the hw_irq argument in order to
> -  * avoid calling into the backend EOI code which we don't
> -  * want to do in the case of a re-trigger. Backends typically
> -  * only do EOI for LSIs anyway.
> -  */
> - xive_do_source_eoi(0, xd);
> + xive_do_source_eoi(xd);
>  
>   return 1;
>  }
> @@ -910,7 +904,7 @@ static int xive_irq_set_vcpu_affinity(struct irq_data *d, 
> void *state)
>* while masked, the generic code will re-mask it anyway.
>*/
>   if (!xd->saved_p)
> -

Re: [PATCH 07/13] powerpc/xive: Add a debug_show handler to the XIVE irq_domain

2020-12-09 Thread Greg Kurz

On Tue, 8 Dec 2020 16:11:18 +0100
Cédric Le Goater  wrote:

> Full state of the Linux interrupt descriptors can be dumped under
> debugfs when compiled with CONFIG_GENERIC_IRQ_DEBUGFS. Add support for
> the XIVE interrupt controller.
> 
> Signed-off-by: Cédric Le Goater 
> ---
>  arch/powerpc/sysdev/xive/common.c | 58 +++
>  1 file changed, 58 insertions(+)
> 
> diff --git a/arch/powerpc/sysdev/xive/common.c 
> b/arch/powerpc/sysdev/xive/common.c
> index 721617f0f854..411cba12d73b 100644
> --- a/arch/powerpc/sysdev/xive/common.c
> +++ b/arch/powerpc/sysdev/xive/common.c
> @@ -1303,11 +1303,69 @@ static int xive_irq_domain_match(struct irq_domain 
> *h, struct device_node *node,
>   return xive_ops->match(node);
>  }
>  
> +#ifdef CONFIG_GENERIC_IRQ_DEBUGFS
> +static const char * const esb_names[] = { "RESET", "OFF", "PENDING", 
> "QUEUED" };
> +
> +static const struct {
> + u64  mask;
> + char *name;
> +} xive_irq_flags[] = {
> + { XIVE_IRQ_FLAG_STORE_EOI, "STORE_EOI" },
> + { XIVE_IRQ_FLAG_LSI,   "LSI"   },

> + { XIVE_IRQ_FLAG_SHIFT_BUG, "SHIFT_BUG" },
> + { XIVE_IRQ_FLAG_MASK_FW,   "MASK_FW"   },
> + { XIVE_IRQ_FLAG_EOI_FW,"EOI_FW"},

It seems that you don't even need these ^^ if you move this patch after
patch 11 actually.

> + { XIVE_IRQ_FLAG_H_INT_ESB, "H_INT_ESB" },
> + { XIVE_IRQ_FLAG_NO_EOI,"NO_EOI"},
> +};
> +
> +static void xive_irq_domain_debug_show(struct seq_file *m, struct irq_domain 
> *d,
> +struct irq_data *irqd, int ind)
> +{
> + struct xive_irq_data *xd;
> + u64 val;
> + int i;
> +
> + /* No IRQ domain level information. To be done */
> + if (!irqd)
> + return;
> +
> + if (!is_xive_irq(irq_data_get_irq_chip(irqd)))

Wouldn't it be a bug to get anything else but the XIVE irqchip here ?

WARN_ON_ONCE() ?
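For reference, a userspace sketch of the pattern being suggested. The
macro below is only a stand-in that mimics the kernel's WARN_ON_ONCE()
semantics (evaluates to the condition, reports the first hit only); it
relies on GNU C statement expressions, and WARN_ON_ONCE_SKETCH,
warn_count and debug_show_guard() are hypothetical names.

```c
#include <stdbool.h>
#include <stdio.h>

/* Counts how many warnings were actually emitted (illustration only). */
static int warn_count;

/*
 * Userspace stand-in for WARN_ON_ONCE(): yields the truth value of
 * @cond and prints a warning the first time it is true at this site.
 */
#define WARN_ON_ONCE_SKETCH(cond) ({                    \
    static bool warned_;                                \
    bool c_ = !!(cond);                                 \
    if (c_ && !warned_) {                               \
        warned_ = true;                                 \
        warn_count++;                                   \
        fprintf(stderr, "WARNING: %s\n", #cond);        \
    }                                                   \
    c_;                                                 \
})

/* Returns true when it is fine to go on and dump XIVE state. */
static bool debug_show_guard(bool is_xive_chip)
{
    /* Any other irqchip here would be a bug: warn once and bail out. */
    if (WARN_ON_ONCE_SKETCH(!is_xive_chip)) {
        return false;
    }
    return true;
}
```

The early return is unchanged from the patch; the only difference is
that an unexpected irqchip gets logged once instead of being silently
ignored.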

> + return;
> +
> + seq_printf(m, "%*sXIVE:\n", ind, "");
> + ind++;
> +
> + xd = irq_data_get_irq_handler_data(irqd);
> + if (!xd) {
> + seq_printf(m, "%*snot assigned\n", ind, "");
> + return;
> + }
> +
> + val = xive_esb_read(xd, XIVE_ESB_GET);
> + seq_printf(m, "%*sESB:  %s\n", ind, "", esb_names[val & 0x3]);
> + seq_printf(m, "%*sPstate:   %s %s\n", ind, "", xd->stale_p ? "stale" : 
> "",
> +xd->saved_p ? "saved" : "");
> + seq_printf(m, "%*sTarget:   %d\n", ind, "", xd->target);
> + seq_printf(m, "%*sChip: %d\n", ind, "", xd->src_chip);
> + seq_printf(m, "%*sTrigger:  0x%016llx\n", ind, "", xd->trig_page);
> + seq_printf(m, "%*sEOI:  0x%016llx\n", ind, "", xd->eoi_page);
> + seq_printf(m, "%*sFlags:0x%llx\n", ind, "", xd->flags);
> + for (i = 0; i < ARRAY_SIZE(xive_irq_flags); i++) {
> + if (xd->flags & xive_irq_flags[i].mask)
> + seq_printf(m, "%*s%s\n", ind + 12, "", 
> xive_irq_flags[i].name);
> + }
> +}
> +#endif
> +
>  static const struct irq_domain_ops xive_irq_domain_ops = {
>   .match = xive_irq_domain_match,
>   .map = xive_irq_domain_map,
>   .unmap = xive_irq_domain_unmap,
>   .xlate = xive_irq_domain_xlate,
> +#ifdef CONFIG_GENERIC_IRQ_DEBUGFS
> + .debug_show = xive_irq_domain_debug_show,
> +#endif
>  };
>  
>  static void __init xive_init_host(struct device_node *np)

Re: [PATCH 13/13] powerpc/xive: Improve error reporting of OPAL calls

2020-12-09 Thread Greg Kurz

On Tue, 8 Dec 2020 16:11:24 +0100
Cédric Le Goater  wrote:

> Introduce a vp_err() macro to standardize error reporting.
> 
> Signed-off-by: Cédric Le Goater 
> ---

Reviewed-by: Greg Kurz 
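As a quick illustration of what the macro expands to, here is a
userspace sketch. The vp_err() definition is the one from the patch;
pr_err_sketch(), log_buf and report_enable_failure() are hypothetical
stand-ins so the result can be inspected outside the kernel. Note that
##__VA_ARGS__ is a GNU extension.

```c
#include <stdio.h>
#include <string.h>

/* Captures the last message instead of sending it to the kernel log. */
static char log_buf[128];

/* Hypothetical userspace replacement for pr_err(). */
#define pr_err_sketch(fmt, ...) \
    snprintf(log_buf, sizeof(log_buf), fmt, ##__VA_ARGS__)

/* Same shape as the patch: prefix every message with the VP id. */
#define vp_err(vp, fmt, ...) \
    pr_err_sketch("VP[0x%x]: " fmt, vp, ##__VA_ARGS__)

/* Formats an "enable VP" failure and returns the captured message. */
static const char *report_enable_failure(unsigned int vp_id, long long rc)
{
    vp_err(vp_id, "Failed to enable VP : %lld\n", rc);
    return log_buf;
}
```

String literal concatenation glues the "VP[0x%x]: " prefix onto each
format, so every call site gets the same prefix without repeating it.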

>  arch/powerpc/sysdev/xive/native.c | 28 
>  1 file changed, 16 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/powerpc/sysdev/xive/native.c 
> b/arch/powerpc/sysdev/xive/native.c
> index 4902d05ebbd1..42297a131a6e 100644
> --- a/arch/powerpc/sysdev/xive/native.c
> +++ b/arch/powerpc/sysdev/xive/native.c
> @@ -122,6 +122,8 @@ static int xive_native_get_irq_config(u32 hw_irq, u32 
> *target, u8 *prio,
>   return rc == 0 ? 0 : -ENXIO;
>  }
>  
> +#define vp_err(vp, fmt, ...) pr_err("VP[0x%x]: " fmt, vp, ##__VA_ARGS__)
> +
>  /* This can be called multiple time to change a queue configuration */
>  int xive_native_configure_queue(u32 vp_id, struct xive_q *q, u8 prio,
>   __be32 *qpage, u32 order, bool can_escalate)
> @@ -149,7 +151,7 @@ int xive_native_configure_queue(u32 vp_id, struct xive_q 
> *q, u8 prio,
> _irq_be,
> NULL);
>   if (rc) {
> - pr_err("Error %lld getting queue info prio %d\n", rc, prio);
> + vp_err(vp_id, "Failed to get queue %d info : %lld\n", prio, rc);
>   rc = -EIO;
>   goto fail;
>   }
> @@ -172,7 +174,7 @@ int xive_native_configure_queue(u32 vp_id, struct xive_q 
> *q, u8 prio,
>   msleep(OPAL_BUSY_DELAY_MS);
>   }
>   if (rc) {
> - pr_err("Error %lld setting queue for prio %d\n", rc, prio);
> + vp_err(vp_id, "Failed to set queue %d info: %lld\n", prio, rc);
>   rc = -EIO;
>   } else {
>   /*
> @@ -199,7 +201,7 @@ static void __xive_native_disable_queue(u32 vp_id, struct 
> xive_q *q, u8 prio)
>   msleep(OPAL_BUSY_DELAY_MS);
>   }
>   if (rc)
> - pr_err("Error %lld disabling queue for prio %d\n", rc, prio);
> + vp_err(vp_id, "Failed to disable queue %d : %lld\n", prio, rc);
>  }
>  
>  void xive_native_disable_queue(u32 vp_id, struct xive_q *q, u8 prio)
> @@ -698,6 +700,8 @@ int xive_native_enable_vp(u32 vp_id, bool 
> single_escalation)
>   break;
>   msleep(OPAL_BUSY_DELAY_MS);
>   }
> + if (rc)
> + vp_err(vp_id, "Failed to enable VP : %lld\n", rc);
>   return rc ? -EIO : 0;
>  }
>  EXPORT_SYMBOL_GPL(xive_native_enable_vp);
> @@ -712,6 +716,8 @@ int xive_native_disable_vp(u32 vp_id)
>   break;
>   msleep(OPAL_BUSY_DELAY_MS);
>   }
> + if (rc)
> + vp_err(vp_id, "Failed to disable VP : %lld\n", rc);
>   return rc ? -EIO : 0;
>  }
>  EXPORT_SYMBOL_GPL(xive_native_disable_vp);
> @@ -723,8 +729,10 @@ int xive_native_get_vp_info(u32 vp_id, u32 *out_cam_id, 
> u32 *out_chip_id)
>   s64 rc;
>  
>   rc = opal_xive_get_vp_info(vp_id, NULL, &vp_cam_be, NULL, &vp_chip_id_be);
> - if (rc)
> + if (rc) {
> + vp_err(vp_id, "Failed to get VP info : %lld\n", rc);
>   return -EIO;
> + }
>   *out_cam_id = be64_to_cpu(vp_cam_be) & 0xffffffffu;
>   *out_chip_id = be32_to_cpu(vp_chip_id_be);
>  
> @@ -755,8 +763,7 @@ int xive_native_get_queue_info(u32 vp_id, u32 prio,
>   rc = opal_xive_get_queue_info(vp_id, prio, , ,
> _page, _irq, );
>   if (rc) {
> - pr_err("OPAL failed to get queue info for VCPU %d/%d : %lld\n",
> -vp_id, prio, rc);
> + vp_err(vp_id, "failed to get queue %d info : %lld\n", prio, rc);
>   return -EIO;
>   }
>  
> @@ -784,8 +791,7 @@ int xive_native_get_queue_state(u32 vp_id, u32 prio, u32 
> *qtoggle, u32 *qindex)
>   rc = opal_xive_get_queue_state(vp_id, prio, _qtoggle,
>  _qindex);
>   if (rc) {
> - pr_err("OPAL failed to get queue state for VCPU %d/%d : %lld\n",
> -vp_id, prio, rc);
> + vp_err(vp_id, "failed to get queue %d state : %lld\n", prio, 
> rc);
>   return -EIO;
>   }
>  
> @@ -804,8 +810,7 @@ int xive_native_set_queue_state(u32 vp_id, u32 prio, u32 
> qtoggle, u32 qindex)
>  
>   rc = opal_xive_set_queue_state(vp_id, prio, qtoggle, qindex);
>   if (rc) {
> - pr_err("OPAL failed to set queue state for VCPU %d/%d : %lld\n

Re: [PATCH 11/13] powerpc/xive: Remove P9 DD1 flag XIVE_IRQ_FLAG_EOI_FW

2020-12-09 Thread Greg Kurz

On Tue, 8 Dec 2020 16:11:22 +0100
Cédric Le Goater  wrote:

> This flag was used to support the P9 DD1 and we have stopped
> supporting this CPU when DD2 came out. See skiboot commit:
> 
>   https://github.com/open-power/skiboot/commit/0b0d15e3c170
> 
> Also, remove eoi handler which is now unused.
> 
> Signed-off-by: Cédric Le Goater 
> ---

Reviewed-by: Greg Kurz 

Same suggestion as with previous patch.

>  arch/powerpc/include/asm/opal-api.h  |  2 +-
>  arch/powerpc/include/asm/xive.h  |  2 +-
>  arch/powerpc/sysdev/xive/xive-internal.h |  1 -
>  arch/powerpc/kvm/book3s_xive_template.c  |  2 --
>  arch/powerpc/sysdev/xive/common.c| 13 +
>  arch/powerpc/sysdev/xive/native.c| 12 
>  arch/powerpc/sysdev/xive/spapr.c |  6 --
>  7 files changed, 3 insertions(+), 35 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/opal-api.h 
> b/arch/powerpc/include/asm/opal-api.h
> index 0455b679c050..0b63ba7d5917 100644
> --- a/arch/powerpc/include/asm/opal-api.h
> +++ b/arch/powerpc/include/asm/opal-api.h
> @@ -1093,7 +1093,7 @@ enum {
>   OPAL_XIVE_IRQ_LSI   = 0x0004,
>   OPAL_XIVE_IRQ_SHIFT_BUG = 0x0008, /* P9 DD1.0 workaround */
>   OPAL_XIVE_IRQ_MASK_VIA_FW   = 0x0010, /* P9 DD1.0 workaround */
> - OPAL_XIVE_IRQ_EOI_VIA_FW= 0x0020,
> + OPAL_XIVE_IRQ_EOI_VIA_FW= 0x0020, /* P9 DD1.0 workaround */
>  };
>  
>  /* Flags for OPAL_XIVE_GET/SET_QUEUE_INFO */
> diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h
> index d62368d0ba91..f6150d7a757a 100644
> --- a/arch/powerpc/include/asm/xive.h
> +++ b/arch/powerpc/include/asm/xive.h
> @@ -62,7 +62,7 @@ struct xive_irq_data {
>  #define XIVE_IRQ_FLAG_LSI0x02
>  #define XIVE_IRQ_FLAG_SHIFT_BUG  0x04 /* P9 DD1.0 workaround */
>  #define XIVE_IRQ_FLAG_MASK_FW0x08 /* P9 DD1.0 workaround */
> -#define XIVE_IRQ_FLAG_EOI_FW 0x10
> +#define XIVE_IRQ_FLAG_EOI_FW 0x10 /* P9 DD1.0 workaround */
>  #define XIVE_IRQ_FLAG_H_INT_ESB  0x20
>  
>  /* Special flag set by KVM for excalation interrupts */
> diff --git a/arch/powerpc/sysdev/xive/xive-internal.h 
> b/arch/powerpc/sysdev/xive/xive-internal.h
> index 066d6fe3dc1d..3b7dd2cba9db 100644
> --- a/arch/powerpc/sysdev/xive/xive-internal.h
> +++ b/arch/powerpc/sysdev/xive/xive-internal.h
> @@ -52,7 +52,6 @@ struct xive_ops {
>   void(*shutdown)(void);
>  
>   void(*update_pending)(struct xive_cpu *xc);
> - void(*eoi)(u32 hw_irq);
>   void(*sync_source)(u32 hw_irq);
>   u64 (*esb_rw)(u32 hw_irq, u32 offset, u64 data, bool write);
>  #ifdef CONFIG_SMP
> diff --git a/arch/powerpc/kvm/book3s_xive_template.c 
> b/arch/powerpc/kvm/book3s_xive_template.c
> index ece36e024a8f..b0015e05d99a 100644
> --- a/arch/powerpc/kvm/book3s_xive_template.c
> +++ b/arch/powerpc/kvm/book3s_xive_template.c
> @@ -74,8 +74,6 @@ static void GLUE(X_PFX,source_eoi)(u32 hw_irq, struct 
> xive_irq_data *xd)
>   /* If the XIVE supports the new "store EOI facility, use it */
>   if (xd->flags & XIVE_IRQ_FLAG_STORE_EOI)
>   __x_writeq(0, __x_eoi_page(xd) + XIVE_ESB_STORE_EOI);
> - else if (hw_irq && xd->flags & XIVE_IRQ_FLAG_EOI_FW)
> - opal_int_eoi(hw_irq);
>   else if (xd->flags & XIVE_IRQ_FLAG_LSI) {
>   /*
>* For LSIs the HW EOI cycle is used rather than PQ bits,
> diff --git a/arch/powerpc/sysdev/xive/common.c 
> b/arch/powerpc/sysdev/xive/common.c
> index a71412fefb65..fe6229dd3241 100644
> --- a/arch/powerpc/sysdev/xive/common.c
> +++ b/arch/powerpc/sysdev/xive/common.c
> @@ -354,18 +354,7 @@ static void xive_do_source_eoi(u32 hw_irq, struct 
> xive_irq_data *xd)
>   /* If the XIVE supports the new "store EOI facility, use it */
>   if (xd->flags & XIVE_IRQ_FLAG_STORE_EOI)
>   xive_esb_write(xd, XIVE_ESB_STORE_EOI, 0);
> - else if (hw_irq && xd->flags & XIVE_IRQ_FLAG_EOI_FW) {
> - /*
> -  * The FW told us to call it. This happens for some
> -  * interrupt sources that need additional HW whacking
> -  * beyond the ESB manipulation. For example LPC interrupts
> -  * on P9 DD1.0 needed a latch to be clared in the LPC bridge
> -  * itself. The Firmware will take care of it.
> -  */
> - if (WARN_ON_ONCE(!xive_ops->eoi))
> - return;
> - xive_ops->eoi(hw_irq);
> - } else {
> + else {
>   u8 eoi_val;
>  
>   /*
> diff --git a/arch/po

Re: [PATCH 10/13] powerpc/xive: Remove P9 DD1 flag XIVE_IRQ_FLAG_MASK_FW

2020-12-09 Thread Greg Kurz

On Tue, 8 Dec 2020 16:11:21 +0100
Cédric Le Goater  wrote:

> This flag was used to support the PHB4 LSIs on P9 DD1 and we have
> stopped supporting this CPU when DD2 came out. See skiboot commit:
> 
>   https://github.com/open-power/skiboot/commit/0b0d15e3c170
> 
> Signed-off-by: Cédric Le Goater 
> ---

Reviewed-by: Greg Kurz 

In case a v2 is required, same suggestion to comment out the removed
items entirely, plus fix an indent nit

>  arch/powerpc/include/asm/opal-api.h |  2 +-
>  arch/powerpc/include/asm/xive.h |  2 +-
>  arch/powerpc/kvm/book3s_xive.c  | 54 +
>  arch/powerpc/sysdev/xive/common.c   | 39 +
>  arch/powerpc/sysdev/xive/native.c   |  2 --
>  5 files changed, 11 insertions(+), 88 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/opal-api.h 
> b/arch/powerpc/include/asm/opal-api.h
> index 48ee604ca39a..0455b679c050 100644
> --- a/arch/powerpc/include/asm/opal-api.h
> +++ b/arch/powerpc/include/asm/opal-api.h
> @@ -1092,7 +1092,7 @@ enum {
>   OPAL_XIVE_IRQ_STORE_EOI = 0x0002,
>   OPAL_XIVE_IRQ_LSI   = 0x0004,
>   OPAL_XIVE_IRQ_SHIFT_BUG = 0x0008, /* P9 DD1.0 workaround */
> - OPAL_XIVE_IRQ_MASK_VIA_FW   = 0x0010,
> + OPAL_XIVE_IRQ_MASK_VIA_FW   = 0x0010, /* P9 DD1.0 workaround */
>   OPAL_XIVE_IRQ_EOI_VIA_FW= 0x0020,
>  };
>  
> diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h
> index ff805885a028..d62368d0ba91 100644
> --- a/arch/powerpc/include/asm/xive.h
> +++ b/arch/powerpc/include/asm/xive.h
> @@ -61,7 +61,7 @@ struct xive_irq_data {
>  #define XIVE_IRQ_FLAG_STORE_EOI  0x01
>  #define XIVE_IRQ_FLAG_LSI0x02
>  #define XIVE_IRQ_FLAG_SHIFT_BUG  0x04 /* P9 DD1.0 workaround */
> -#define XIVE_IRQ_FLAG_MASK_FW0x08
> +#define XIVE_IRQ_FLAG_MASK_FW0x08 /* P9 DD1.0 workaround */
>  #define XIVE_IRQ_FLAG_EOI_FW 0x10
>  #define XIVE_IRQ_FLAG_H_INT_ESB  0x20
>  
> diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
> index fae1c2e8da29..59a986ae640b 100644
> --- a/arch/powerpc/kvm/book3s_xive.c
> +++ b/arch/powerpc/kvm/book3s_xive.c
> @@ -419,37 +419,16 @@ static u8 xive_lock_and_mask(struct kvmppc_xive *xive,
>   /* Get the right irq */
>   kvmppc_xive_select_irq(state, &hw_num, &xd);
>  
> - /*
> -  * If the interrupt is marked as needing masking via
> -  * firmware, we do it here. Firmware masking however
> -  * is "lossy", it won't return the old p and q bits
> -  * and won't set the interrupt to a state where it will
> -  * record queued ones. If this is an issue we should do
> -  * lazy masking instead.
> -  *
> -  * For now, we work around this in unmask by forcing
> -  * an interrupt whenever we unmask a non-LSI via FW
> -  * (if ever).
> -  */
> - if (xd->flags & OPAL_XIVE_IRQ_MASK_VIA_FW) {
> - xive_native_configure_irq(hw_num,
> - kvmppc_xive_vp(xive, state->act_server),
> - MASKED, state->number);
> - /* set old_p so we can track if an H_EOI was done */
> - state->old_p = true;
> - state->old_q = false;
> - } else {
> - /* Set PQ to 10, return old P and old Q and remember them */
> - val = xive_vm_esb_load(xd, XIVE_ESB_SET_PQ_10);
> - state->old_p = !!(val & 2);
> - state->old_q = !!(val & 1);
> + /* Set PQ to 10, return old P and old Q and remember them */
> + val = xive_vm_esb_load(xd, XIVE_ESB_SET_PQ_10);
> + state->old_p = !!(val & 2);
> + state->old_q = !!(val & 1);
>  
> - /*
> -  * Synchronize hardware to sensure the queues are updated
> -  * when masking
> + /*
> +  * Synchronize hardware to sensure the queues are updated
> +  * when masking
>*/

... here ^^

> - xive_native_sync_source(hw_num);
> - }
> + xive_native_sync_source(hw_num);
>  
>   return old_prio;
>  }
> @@ -483,23 +462,6 @@ static void xive_finish_unmask(struct kvmppc_xive *xive,
>   /* Get the right irq */
>   kvmppc_xive_select_irq(state, &hw_num, &xd);
>  
> - /*
> -  * See comment in xive_lock_and_mask() concerning masking
> -  * via firmware.
> -  */
> - if (xd->flags & OPAL_XIVE_IRQ_MASK_VIA_FW) {
> - xive_native_configure_irq(hw_num,
> - kvmppc_xive_vp(xive, state->act_server),
> -

Re: [PATCH 09/13] powerpc/xive: Remove P9 DD1 flag XIVE_IRQ_FLAG_SHIFT_BUG

2020-12-08 Thread Greg Kurz

On Tue, 8 Dec 2020 16:11:20 +0100
Cédric Le Goater  wrote:

> This flag was used to support the PHB4 LSIs on P9 DD1 and we have
> stopped supporting this CPU when DD2 came out. See skiboot commit:
> 
>   https://github.com/open-power/skiboot/commit/0b0d15e3c170
> 
> Signed-off-by: Cédric Le Goater 
> ---

Reviewed-by: Greg Kurz 

Just a minor suggestion in case you need to post a v2. See below.

>  arch/powerpc/include/asm/opal-api.h | 2 +-
>  arch/powerpc/include/asm/xive.h | 2 +-
>  arch/powerpc/kvm/book3s_xive_native.c   | 3 ---
>  arch/powerpc/kvm/book3s_xive_template.c | 3 ---
>  arch/powerpc/sysdev/xive/common.c   | 8 
>  arch/powerpc/sysdev/xive/native.c   | 2 --
>  6 files changed, 2 insertions(+), 18 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/opal-api.h 
> b/arch/powerpc/include/asm/opal-api.h
> index 1dffa3cb16ba..48ee604ca39a 100644
> --- a/arch/powerpc/include/asm/opal-api.h
> +++ b/arch/powerpc/include/asm/opal-api.h
> @@ -1091,7 +1091,7 @@ enum {
>   OPAL_XIVE_IRQ_TRIGGER_PAGE  = 0x0001,
>   OPAL_XIVE_IRQ_STORE_EOI = 0x0002,
>   OPAL_XIVE_IRQ_LSI   = 0x0004,
> - OPAL_XIVE_IRQ_SHIFT_BUG = 0x0008,
> + OPAL_XIVE_IRQ_SHIFT_BUG = 0x0008, /* P9 DD1.0 workaround */

Maybe you can even comment out the entire line so that any future
attempt to use that flag breaks the build?

>   OPAL_XIVE_IRQ_MASK_VIA_FW   = 0x0010,
>   OPAL_XIVE_IRQ_EOI_VIA_FW= 0x0020,
>  };
> diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h
> index d332dd9a18de..ff805885a028 100644
> --- a/arch/powerpc/include/asm/xive.h
> +++ b/arch/powerpc/include/asm/xive.h
> @@ -60,7 +60,7 @@ struct xive_irq_data {
>  };
>  #define XIVE_IRQ_FLAG_STORE_EOI  0x01
>  #define XIVE_IRQ_FLAG_LSI0x02
> -#define XIVE_IRQ_FLAG_SHIFT_BUG  0x04
> +#define XIVE_IRQ_FLAG_SHIFT_BUG  0x04 /* P9 DD1.0 workaround */

Same here, with an extra cleanup to stop using it when initializing 
xive_irq_flags[] in common.c.

>  #define XIVE_IRQ_FLAG_MASK_FW0x08
>  #define XIVE_IRQ_FLAG_EOI_FW 0x10
>  #define XIVE_IRQ_FLAG_H_INT_ESB  0x20
> diff --git a/arch/powerpc/kvm/book3s_xive_native.c 
> b/arch/powerpc/kvm/book3s_xive_native.c
> index 9b395381179d..170d1d04e1d1 100644
> --- a/arch/powerpc/kvm/book3s_xive_native.c
> +++ b/arch/powerpc/kvm/book3s_xive_native.c
> @@ -37,9 +37,6 @@ static u8 xive_vm_esb_load(struct xive_irq_data *xd, u32 
> offset)
>* ordering.
>*/
>  
> - if (xd->flags & XIVE_IRQ_FLAG_SHIFT_BUG)
> - offset |= offset << 4;
> -
>   val = in_be64(xd->eoi_mmio + offset);
>   return (u8)val;
>  }
> diff --git a/arch/powerpc/kvm/book3s_xive_template.c 
> b/arch/powerpc/kvm/book3s_xive_template.c
> index 4ad3c0279458..ece36e024a8f 100644
> --- a/arch/powerpc/kvm/book3s_xive_template.c
> +++ b/arch/powerpc/kvm/book3s_xive_template.c
> @@ -61,9 +61,6 @@ static u8 GLUE(X_PFX,esb_load)(struct xive_irq_data *xd, 
> u32 offset)
>   if (offset == XIVE_ESB_SET_PQ_10 && xd->flags & XIVE_IRQ_FLAG_STORE_EOI)
>   offset |= XIVE_ESB_LD_ST_MO;
>  
> - if (xd->flags & XIVE_IRQ_FLAG_SHIFT_BUG)
> - offset |= offset << 4;
> -
>   val =__x_readq(__x_eoi_page(xd) + offset);
>  #ifdef __LITTLE_ENDIAN__
>   val >>= 64-8;
> diff --git a/arch/powerpc/sysdev/xive/common.c 
> b/arch/powerpc/sysdev/xive/common.c
> index 411cba12d73b..a9259470bf9f 100644
> --- a/arch/powerpc/sysdev/xive/common.c
> +++ b/arch/powerpc/sysdev/xive/common.c
> @@ -200,10 +200,6 @@ static notrace u8 xive_esb_read(struct xive_irq_data 
> *xd, u32 offset)
>   if (offset == XIVE_ESB_SET_PQ_10 && xd->flags & XIVE_IRQ_FLAG_STORE_EOI)
>   offset |= XIVE_ESB_LD_ST_MO;
>  
> - /* Handle HW errata */
> - if (xd->flags & XIVE_IRQ_FLAG_SHIFT_BUG)
> - offset |= offset << 4;
> -
>   if ((xd->flags & XIVE_IRQ_FLAG_H_INT_ESB) && xive_ops->esb_rw)
>   val = xive_ops->esb_rw(xd->hw_irq, offset, 0, 0);
>   else
> @@ -214,10 +210,6 @@ static notrace u8 xive_esb_read(struct xive_irq_data 
> *xd, u32 offset)
>  
>  static void xive_esb_write(struct xive_irq_data *xd, u32 offset, u64 data)
>  {
> - /* Handle HW errata */
> - if (xd->flags & XIVE_IRQ_FLAG_SHIFT_BUG)
> - offset |= offset << 4;
> -
>   if ((xd->flags & XIVE_IRQ_FLAG_H_INT_ESB) && xive_ops->esb_rw)
>   xive_ops->esb_rw(xd->hw_irq, offset, data, 1);
>

Re: [PATCH 08/13] powerpc: Increase NR_IRQS range to support more KVM guests

2020-12-08 Thread Greg Kurz

On Tue, 8 Dec 2020 16:11:19 +0100
Cédric Le Goater  wrote:

> PowerNV systems can handle up to 4K guests and 1M interrupt numbers
> per chip. Increase the range of allowed interrupts to support a larger
> number of guests.
> 
> Signed-off-by: Cédric Le Goater 
> ---

Reviewed-by: Greg Kurz 

>  arch/powerpc/Kconfig | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index 5181872f9452..c250fbd430d1 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -66,7 +66,7 @@ config NEED_PER_CPU_PAGE_FIRST_CHUNK
>  
>  config NR_IRQS
>   int "Number of virtual interrupt numbers"
> - range 32 32768
> + range 32 1048576
>   default "512"
>   help
> This defines the number of virtual interrupt numbers the kernel

Re: [PATCH 03/13] powerpc/xive: Introduce XIVE_IPI_HW_IRQ

2020-12-08 Thread Greg Kurz

On Tue, 8 Dec 2020 16:11:14 +0100
Cédric Le Goater  wrote:

> The XIVE driver deals with CPU IPIs in a peculiar way. Each CPU has
> its own XIVE IPI interrupt allocated at the HW level, for PowerNV, or
> at the hypervisor level for pSeries. In practice, these interrupts are
> not always used. pSeries/PowerVM prefers local doorbells for local
> threads since they are faster. On PowerNV, global doorbells are also
> preferred for the same reason.
> 
> The mapping in Linux is reduced to a single interrupt using HW
> interrupt number 0 and a custom irq_chip to handle EOI. This can cause
> performance issues in some benchmarks (ipistorm) on multichip systems.
> 
> Clarify the use of the 0 value, it will help in improving multichip
> support.
> 
> Signed-off-by: Cédric Le Goater 
> ---

Reviewed-by: Greg Kurz 

>  arch/powerpc/sysdev/xive/xive-internal.h |  2 ++
>  arch/powerpc/sysdev/xive/common.c| 10 +-
>  2 files changed, 7 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/powerpc/sysdev/xive/xive-internal.h 
> b/arch/powerpc/sysdev/xive/xive-internal.h
> index b7b901da2168..d701af7fb48c 100644
> --- a/arch/powerpc/sysdev/xive/xive-internal.h
> +++ b/arch/powerpc/sysdev/xive/xive-internal.h
> @@ -5,6 +5,8 @@
>  #ifndef __XIVE_INTERNAL_H
>  #define __XIVE_INTERNAL_H
>  
> +#define XIVE_IPI_HW_IRQ  0 /* interrupt source # for IPIs */
> +
>  /*
>   * A "disabled" interrupt should never fire, to catch problems
>   * we set its logical number to this
> diff --git a/arch/powerpc/sysdev/xive/common.c 
> b/arch/powerpc/sysdev/xive/common.c
> index 65af34ac1fa2..ee375daf8114 100644
> --- a/arch/powerpc/sysdev/xive/common.c
> +++ b/arch/powerpc/sysdev/xive/common.c
> @@ -1142,7 +1142,7 @@ static void __init xive_request_ipi(void)
>   return;
>  
>   /* Initialize it */
> - virq = irq_create_mapping(xive_irq_domain, 0);
> + virq = irq_create_mapping(xive_irq_domain, XIVE_IPI_HW_IRQ);
>   xive_ipi_irq = virq;
>  
>   WARN_ON(request_irq(virq, xive_muxed_ipi_action,
> @@ -1242,7 +1242,7 @@ static int xive_irq_domain_map(struct irq_domain *h, 
> unsigned int virq,
>  
>  #ifdef CONFIG_SMP
>   /* IPIs are special and come up with HW number 0 */
> - if (hw == 0) {
> + if (hw == XIVE_IPI_HW_IRQ) {
>   /*
>* IPIs are marked per-cpu. We use separate HW interrupts under
>* the hood but associated with the same "linux" interrupt
> @@ -1271,7 +1271,7 @@ static void xive_irq_domain_unmap(struct irq_domain *d, 
> unsigned int virq)
>   if (!data)
>   return;
>   hw_irq = (unsigned int)irqd_to_hwirq(data);
> - if (hw_irq)
> + if (hw_irq != XIVE_IPI_HW_IRQ)
>   xive_irq_free_data(virq);
>  }
>  
> @@ -1421,7 +1421,7 @@ static void xive_flush_cpu_queue(unsigned int cpu, 
> struct xive_cpu *xc)
>* Ignore anything that isn't a XIVE irq and ignore
>* IPIs, so can just be dropped.
>*/
> - if (d->domain != xive_irq_domain || hw_irq == 0)
> + if (d->domain != xive_irq_domain || hw_irq == XIVE_IPI_HW_IRQ)
>   continue;
>  
>   /*
> @@ -1655,7 +1655,7 @@ static int xive_core_debug_show(struct seq_file *m, 
> void *private)
>   hw_irq = (unsigned int)irqd_to_hwirq(d);
>  
>   /* IPIs are special (HW number 0) */
> - if (hw_irq)
> + if (hw_irq != XIVE_IPI_HW_IRQ)
>   xive_debug_show_irq(m, hw_irq, d);
>   }
>   return 0;

Re: [PATCH 02/13] powerpc/xive: Rename XIVE_IRQ_NO_EOI to show its a flag

2020-12-08 Thread Greg Kurz

On Tue, 8 Dec 2020 16:11:13 +0100
Cédric Le Goater  wrote:

> This is a simple cleanup to identify easily all flags of the XIVE
> interrupt structure. The interrupts flagged with XIVE_IRQ_FLAG_NO_EOI
> are the escalations used to wake up vCPUs in KVM. They are handled
> very differently from the rest.
> 
> Signed-off-by: Cédric Le Goater 
> ---

Reviewed-by: Greg Kurz 

>  arch/powerpc/include/asm/xive.h   | 2 +-
>  arch/powerpc/kvm/book3s_xive.c| 4 ++--
>  arch/powerpc/sysdev/xive/common.c | 2 +-
>  3 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h
> index 309b4d65b74f..d332dd9a18de 100644
> --- a/arch/powerpc/include/asm/xive.h
> +++ b/arch/powerpc/include/asm/xive.h
> @@ -66,7 +66,7 @@ struct xive_irq_data {
>  #define XIVE_IRQ_FLAG_H_INT_ESB  0x20
>  
>  /* Special flag set by KVM for excalation interrupts */
> -#define XIVE_IRQ_NO_EOI  0x80
> +#define XIVE_IRQ_FLAG_NO_EOI 0x80
>  
>  #define XIVE_INVALID_CHIP_ID -1
>  
> diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
> index 18a6b75a3bfd..fae1c2e8da29 100644
> --- a/arch/powerpc/kvm/book3s_xive.c
> +++ b/arch/powerpc/kvm/book3s_xive.c
> @@ -219,7 +219,7 @@ int kvmppc_xive_attach_escalation(struct kvm_vcpu *vcpu, 
> u8 prio,
>   /* In single escalation mode, we grab the ESB MMIO of the
>* interrupt and mask it. Also populate the VCPU v/raddr
>* of the ESB page for use by asm entry/exit code. Finally
> -  * set the XIVE_IRQ_NO_EOI flag which will prevent the
> +  * set the XIVE_IRQ_FLAG_NO_EOI flag which will prevent the
>* core code from performing an EOI on the escalation
>* interrupt, thus leaving it effectively masked after
>* it fires once.
> @@ -231,7 +231,7 @@ int kvmppc_xive_attach_escalation(struct kvm_vcpu *vcpu, 
> u8 prio,
>   xive_vm_esb_load(xd, XIVE_ESB_SET_PQ_01);
>   vcpu->arch.xive_esc_raddr = xd->eoi_page;
>   vcpu->arch.xive_esc_vaddr = (__force u64)xd->eoi_mmio;
> - xd->flags |= XIVE_IRQ_NO_EOI;
> + xd->flags |= XIVE_IRQ_FLAG_NO_EOI;
>   }
>  
>   return 0;
> diff --git a/arch/powerpc/sysdev/xive/common.c 
> b/arch/powerpc/sysdev/xive/common.c
> index a80440af491a..65af34ac1fa2 100644
> --- a/arch/powerpc/sysdev/xive/common.c
> +++ b/arch/powerpc/sysdev/xive/common.c
> @@ -416,7 +416,7 @@ static void xive_irq_eoi(struct irq_data *d)
>* been passed-through to a KVM guest
>*/
>   if (!irqd_irq_disabled(d) && !irqd_is_forwarded_to_vcpu(d) &&
> - !(xd->flags & XIVE_IRQ_NO_EOI))
> + !(xd->flags & XIVE_IRQ_FLAG_NO_EOI))
>   xive_do_source_eoi(irqd_to_hwirq(d), xd);
>   else
>   xd->stale_p = true;

[PATCH] KVM: PPC: Book3S HV: XIVE: Fix vCPU id sanity check

2020-11-30 Thread Greg Kurz

Commit 062cfab7069f ("KVM: PPC: Book3S HV: XIVE: Make VP block size
configurable") updated kvmppc_xive_vcpu_id_valid() in a way that
allows userspace to trigger an assertion in skiboot and crash the host:

[  696.186248988,3] XIVE[ IC 08  ] eq_blk != vp_blk (0 vs. 1) for target 
0x438c/0
[  696.186314757,0] Assert fail: hw/xive.c:2370:0
[  696.186342458,0] Aborting!
xive-kvCPU 0043 Backtrace:
 S: 31e2b8f0 R: 30013840   .backtrace+0x48
 S: 31e2b990 R: 3001b2d0   ._abort+0x4c
 S: 31e2ba10 R: 3001b34c   .assert_fail+0x34
 S: 31e2ba90 R: 30058984   .xive_eq_for_target.part.20+0xb0
 S: 31e2bb40 R: 30059fdc   .xive_setup_silent_gather+0x2c
 S: 31e2bc20 R: 3005a334   .opal_xive_set_vp_info+0x124
 S: 31e2bd20 R: 300051a4   opal_entry+0x134
 --- OPAL call token: 0x8a caller R1: 0xc01f28563850 ---

XIVE maintains the interrupt context state of non-dispatched vCPUs in
an internal VP structure. We allocate a bunch of those on startup to
accommodate all possible vCPUs. Each VP has an id, that we derive from
the vCPU id for efficiency:

static inline u32 kvmppc_xive_vp(struct kvmppc_xive *xive, u32 server)
{
return xive->vp_base + kvmppc_pack_vcpu_id(xive->kvm, server);
}

The KVM XIVE device used to allocate KVM_MAX_VCPUS VPs. This was
limiting the number of concurrent VMs because the VP space is
limited on the HW. Since most of the time, VMs run with a lot fewer
vCPUs, commit 062cfab7069f ("KVM: PPC: Book3S HV: XIVE: Make VP
block size configurable") gave the possibility for userspace to
tune the size of the VP block through the KVM_DEV_XIVE_NR_SERVERS
attribute.

The check in kvmppc_xive_vcpu_id_valid() was changed from

cpu < KVM_MAX_VCPUS * xive->kvm->arch.emul_smt_mode

to

cpu < xive->nr_servers * xive->kvm->arch.emul_smt_mode

The previous check was based on the fact that the VP block had
KVM_MAX_VCPUS entries and that kvmppc_pack_vcpu_id() guarantees
that packed vCPU ids are below KVM_MAX_VCPUS. We've changed the
size of the VP block, but kvmppc_pack_vcpu_id() has nothing to
do with it and it certainly doesn't ensure that the packed vCPU
ids are below xive->nr_servers. kvmppc_xive_vcpu_id_valid() might
thus return true when the VM was configured with a non-standard
VSMT mode, even if the packed vCPU id is higher than what we
expect. We end up using an unallocated VP id, which confuses
OPAL. The assert in OPAL is probably abusive and should be
converted to a regular error that the kernel can handle, but
we shouldn't really use broken VP ids in the first place.

Fix kvmppc_xive_vcpu_id_valid() so that it checks the packed
vCPU id is below xive->nr_servers, which is explicitly what we
want.
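
The difference between the two checks can be illustrated with a small
user-space model. This is only a sketch: the constants, the model_*
names and the simplified packing routine are assumptions made for the
example (loosely following kvmppc_pack_vcpu_id()), not the kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values for this model; the kernel's constants may differ. */
#define MODEL_KVM_MAX_VCPUS   2048u
#define MODEL_MAX_SMT_THREADS 8u

/* Hypothetical stand-in for the fields of struct kvmppc_xive used here. */
struct model_xive {
	uint32_t nr_servers;    /* size of the VP block */
	uint32_t emul_smt_mode; /* guest core stride (VSMT) */
};

/*
 * Simplified model of vCPU id packing: ids below MODEL_KVM_MAX_VCPUS
 * are returned unchanged, higher ids are folded back into the gaps
 * left by the core stride.
 */
uint32_t model_pack_vcpu_id(const struct model_xive *xive, uint32_t id)
{
	static const uint32_t block_offsets[MODEL_MAX_SMT_THREADS] = {
		0, 4, 2, 6, 1, 5, 3, 7
	};
	uint32_t block;

	assert(xive->emul_smt_mode >= 1 &&
	       xive->emul_smt_mode <= MODEL_MAX_SMT_THREADS);

	block = (id / MODEL_KVM_MAX_VCPUS) *
		(MODEL_MAX_SMT_THREADS / xive->emul_smt_mode);
	if (block >= MODEL_MAX_SMT_THREADS)
		return 0; /* id too large to pack */

	return (id % MODEL_KVM_MAX_VCPUS) + block_offsets[block];
}

/* Check on the raw vCPU id, scaled by the core stride. */
bool model_check_raw(const struct model_xive *xive, uint32_t cpu)
{
	return cpu < xive->nr_servers * xive->emul_smt_mode;
}

/* Check on the packed vCPU id, i.e. the index into the VP block. */
bool model_check_packed(const struct model_xive *xive, uint32_t cpu)
{
	return model_pack_vcpu_id(xive, cpu) < xive->nr_servers;
}
```

With nr_servers = 4 and emul_smt_mode = 4 in this model, raw vCPU id 12
passes the raw check (12 < 16) but packs to 12, which is outside a
4-entry VP block, so only the packed check rejects it.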

Fixes: 062cfab7069f ("KVM: PPC: Book3S HV: XIVE: Make VP block size 
configurable")
Cc: sta...@vger.kernel.org # v5.5+
Signed-off-by: Greg Kurz 
---
 arch/powerpc/kvm/book3s_xive.c |7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
index 85215e79db42..a0ebc29f30b2 100644
--- a/arch/powerpc/kvm/book3s_xive.c
+++ b/arch/powerpc/kvm/book3s_xive.c
@@ -1214,12 +1214,9 @@ void kvmppc_xive_cleanup_vcpu(struct kvm_vcpu *vcpu)
 static bool kvmppc_xive_vcpu_id_valid(struct kvmppc_xive *xive, u32 cpu)
 {
/* We have a block of xive->nr_servers VPs. We just need to check
-* raw vCPU ids are below the expected limit for this guest's
-* core stride ; kvmppc_pack_vcpu_id() will pack them down to an
-* index that can be safely used to compute a VP id that belongs
-* to the VP block.
+* packed vCPU ids are below that.
 */
-   return cpu < xive->nr_servers * xive->kvm->arch.emul_smt_mode;
+   return kvmppc_pack_vcpu_id(xive->kvm, cpu) < xive->nr_servers;
 }
 
 int kvmppc_xive_compute_vp_id(struct kvmppc_xive *xive, u32 cpu, u32 *vp)

Re: [PATCH v3 2/2] powerpc/pseries: pass MSI affinity to irq_create_mapping()

2020-11-25 Thread Greg Kurz

On Wed, 25 Nov 2020 16:42:30 +
Marc Zyngier  wrote:

> On 2020-11-25 16:24, Laurent Vivier wrote:
> > On 25/11/2020 17:05, Denis Kirjanov wrote:
> >> On 11/25/20, Laurent Vivier  wrote:
> >>> With virtio multiqueue, normally each queue IRQ is mapped to a CPU.
> >>> 
> >>> But since commit 0d9f0a52c8b9f ("virtio_scsi: use virtio IRQ 
> >>> affinity")
> >>> this is broken on pseries.
> >> 
> >> Please add "Fixes" tag.
> > 
> > In fact, the code in commit 0d9f0a52c8b9f is correct.
> > 
> > The problem is with MSI/X irq affinity and pseries. So this patch
> > fixes more than virtio_scsi. I put this information because this
> > commit allows to clearly show the problem. Perhaps I should remove
> > this line in fact?
> 
> This patch does not fix virtio_scsi at all, which as you noticed, is
> correct. It really fixes the PPC MSI setup, which is starting to show
> its age. So getting rid of the reference seems like the right thing to 
> do.
> 
> I'm also not keen on the BugId thing. It should really be a lore link.
> I also cannot find any such tag in the kernel, nor is it a documented
> practice. The last reference to a Bugzilla entry seems to have happened
> with 786b5219081ff16 (five years ago).
> 

My bad, I suggested BugId to Laurent but the intent was actually BugLink,
which seems to be commonly used in the kernel.

Cheers,

--
Greg

> Thanks,
> 
>  M.

Re: [PATCH v2 1/2] genirq: add an irq_create_mapping_affinity() function

2020-11-25 Thread Greg Kurz

On Wed, 25 Nov 2020 12:16:56 +0100
Laurent Vivier  wrote:

> This function adds an affinity parameter to irq_create_mapping().
> This parameter is needed to pass it to irq_domain_alloc_descs().
> 
> irq_create_mapping() is a wrapper around irq_create_mapping_affinity()
> to pass NULL for the affinity parameter.
> 
> No functional change.
> 
> Signed-off-by: Laurent Vivier 
> ---

Reviewed-by: Greg Kurz 

>  include/linux/irqdomain.h | 12 ++--
>  kernel/irq/irqdomain.c| 13 -
>  2 files changed, 18 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/irqdomain.h b/include/linux/irqdomain.h
> index 71535e87109f..ea5a337e0f8b 100644
> --- a/include/linux/irqdomain.h
> +++ b/include/linux/irqdomain.h
> @@ -384,11 +384,19 @@ extern void irq_domain_associate_many(struct irq_domain 
> *domain,
>  extern void irq_domain_disassociate(struct irq_domain *domain,
>   unsigned int irq);
>  
> -extern unsigned int irq_create_mapping(struct irq_domain *host,
> -irq_hw_number_t hwirq);
> +extern unsigned int irq_create_mapping_affinity(struct irq_domain *host,
> +   irq_hw_number_t hwirq,
> +   const struct irq_affinity_desc *affinity);
>  extern unsigned int irq_create_fwspec_mapping(struct irq_fwspec *fwspec);
>  extern void irq_dispose_mapping(unsigned int virq);
>  
> +static inline unsigned int irq_create_mapping(struct irq_domain *host,
> +   irq_hw_number_t hwirq)
> +{
> + return irq_create_mapping_affinity(host, hwirq, NULL);
> +}
> +
> +
>  /**
>   * irq_linear_revmap() - Find a linux irq from a hw irq number.
>   * @domain: domain owning this hardware interrupt
> diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c
> index cf8b374b892d..e4ca69608f3b 100644
> --- a/kernel/irq/irqdomain.c
> +++ b/kernel/irq/irqdomain.c
> @@ -624,17 +624,19 @@ unsigned int irq_create_direct_mapping(struct 
> irq_domain *domain)
>  EXPORT_SYMBOL_GPL(irq_create_direct_mapping);
>  
>  /**
> - * irq_create_mapping() - Map a hardware interrupt into linux irq space
> + * irq_create_mapping_affinity() - Map a hardware interrupt into linux irq 
> space
>   * @domain: domain owning this hardware interrupt or NULL for default domain
>   * @hwirq: hardware irq number in that domain space
> + * @affinity: irq affinity
>   *
>   * Only one mapping per hardware interrupt is permitted. Returns a linux
>   * irq number.
>   * If the sense/trigger is to be specified, set_irq_type() should be called
>   * on the number returned from that call.
>   */
> -unsigned int irq_create_mapping(struct irq_domain *domain,
> - irq_hw_number_t hwirq)
> +unsigned int irq_create_mapping_affinity(struct irq_domain *domain,
> +irq_hw_number_t hwirq,
> +const struct irq_affinity_desc *affinity)
>  {
>   struct device_node *of_node;
>   int virq;
> @@ -660,7 +662,8 @@ unsigned int irq_create_mapping(struct irq_domain *domain,
>   }
>  
>   /* Allocate a virtual interrupt number */
> - virq = irq_domain_alloc_descs(-1, 1, hwirq, of_node_to_nid(of_node), 
> NULL);
> + virq = irq_domain_alloc_descs(-1, 1, hwirq, of_node_to_nid(of_node),
> +   affinity);
>   if (virq <= 0) {
>   pr_debug("-> virq allocation failed\n");
>   return 0;
> @@ -676,7 +679,7 @@ unsigned int irq_create_mapping(struct irq_domain *domain,
>  
>   return virq;
>  }
> -EXPORT_SYMBOL_GPL(irq_create_mapping);
> +EXPORT_SYMBOL_GPL(irq_create_mapping_affinity);
>  
>  /**
>   * irq_create_strict_mappings() - Map a range of hw irqs to fixed linux irqs

Re: [PATCH v2 2/2] powerpc/pseries: pass MSI affinity to irq_create_mapping()

2020-11-25 Thread Greg Kurz

On Wed, 25 Nov 2020 12:16:57 +0100
Laurent Vivier  wrote:

> With virtio multiqueue, normally each queue IRQ is mapped to a CPU.
> 
> But since commit 0d9f0a52c8b9f ("virtio_scsi: use virtio IRQ affinity")
> this is broken on pseries.
> 
> The affinity is correctly computed in msi_desc but this is not applied
> to the system IRQs.
> 
> It appears the affinity is correctly passed to rtas_setup_msi_irqs() but
> lost at this point and never passed to irq_domain_alloc_descs()
> (see commit 06ee6d571f0e ("genirq: Add affinity hint to irq allocation"))
> because irq_create_mapping() doesn't take an affinity parameter.
> 
> As the previous patch has added the affinity parameter to
> irq_create_mapping() we can forward the affinity from rtas_setup_msi_irqs()
> to irq_domain_alloc_descs().
> 
> With this change, the virtqueues are correctly dispatched between the CPUs
> on pseries.
> 

Since it is public, maybe add:

BugId: https://bugzilla.redhat.com/show_bug.cgi?id=1702939

?

> Signed-off-by: Laurent Vivier 
> ---

Anyway,

Reviewed-by: Greg Kurz 

>  arch/powerpc/platforms/pseries/msi.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/platforms/pseries/msi.c 
> b/arch/powerpc/platforms/pseries/msi.c
> index 133f6adcb39c..b3ac2455faad 100644
> --- a/arch/powerpc/platforms/pseries/msi.c
> +++ b/arch/powerpc/platforms/pseries/msi.c
> @@ -458,7 +458,8 @@ static int rtas_setup_msi_irqs(struct pci_dev *pdev, int 
> nvec_in, int type)
>   return hwirq;
>   }
>  
> - virq = irq_create_mapping(NULL, hwirq);
> + virq = irq_create_mapping_affinity(NULL, hwirq,
> +entry->affinity);
>  
>   if (!virq) {
>   pr_debug("rtas_msi: Failed mapping hwirq %d\n", hwirq);

Re: [PATCH] KVM: PPC: Book3S: Assign boolean values to a bool variable

2020-11-06 Thread Greg Kurz

On Sat,  7 Nov 2020 14:26:22 +0800
xiakaixu1...@gmail.com wrote:

> From: Kaixu Xia 
> 
> Fix the following coccinelle warnings:
> 
> ./arch/powerpc/kvm/book3s_xics.c:476:3-15: WARNING: Assignment of 0/1 to bool 
> variable
> ./arch/powerpc/kvm/book3s_xics.c:504:3-15: WARNING: Assignment of 0/1 to bool 
> variable
> 
> Reported-by: Tosk Robot 
> Signed-off-by: Kaixu Xia 
> ---

Reviewed-by: Greg Kurz 

>  arch/powerpc/kvm/book3s_xics.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_xics.c b/arch/powerpc/kvm/book3s_xics.c
> index 5fee5a11550d..303e3cb096db 100644
> --- a/arch/powerpc/kvm/book3s_xics.c
> +++ b/arch/powerpc/kvm/book3s_xics.c
> @@ -473,7 +473,7 @@ static void icp_deliver_irq(struct kvmppc_xics *xics, 
> struct kvmppc_icp *icp,
>   arch_spin_unlock(&ics->lock);
>   local_irq_restore(flags);
>   new_irq = reject;
> - check_resend = 0;
> + check_resend = false;
>   goto again;
>   }
>   } else {
> @@ -501,7 +501,7 @@ static void icp_deliver_irq(struct kvmppc_xics *xics, 
> struct kvmppc_icp *icp,
>   state->resend = 0;
>   arch_spin_unlock(&ics->lock);
>   local_irq_restore(flags);
> - check_resend = 0;
> + check_resend = false;
>   goto again;
>   }
>   }

Re: [PATCH] KVM: PPC: Book3S HV: XIVE: Fix possible oops when accessing ESB page

2020-11-05 Thread Greg Kurz

On Thu, 5 Nov 2020 14:47:13 +0100
Cédric Le Goater  wrote:

> When accessing the ESB page of a source interrupt, the fault handler
> will retrieve the page address from the XIVE interrupt 'xive_irq_data'
> structure. If the associated KVM XIVE interrupt is not valid, that is
> not allocated at the HW level for some reason, the fault handler will
> dereference a NULL pointer leading to the oops below :
> 
> WARNING: CPU: 40 PID: 59101 at arch/powerpc/kvm/book3s_xive_native.c:259 
> xive_native_esb_fault+0xe4/0x240 [kvm]
> CPU: 40 PID: 59101 Comm: qemu-system-ppc Kdump: loaded Tainted: G
> W- -  - 4.18.0-240.el8.ppc64le #1
> NIP:  c0080e949fac LR: c044b164 CTR: c0080e949ec8
> REGS: c01f69617840 TRAP: 0700   Tainted: GW- 
> -  -  (4.18.0-240.el8.ppc64le)
> MSR:  90029033   CR: 44044282  XER: 
> 
> CFAR: c044b160 IRQMASK: 0
> GPR00: c044b164 c01f69617ac0 c0080e96e000 c01f69617c10
> GPR04: 05faa2b21e80  0005 
> GPR08:  0001  0001
> GPR12: c0080e949ec8 c01d3400  
> GPR16:    
> GPR20:   c01f5c065160 c1c76f90
> GPR24: c01f06f2 c01f5c065100 0008 c01f0eb98c78
> GPR28: c01dcab4 c01dcab403d8 c01f69617c10 0011
> NIP [c0080e949fac] xive_native_esb_fault+0xe4/0x240 [kvm]
> LR [c044b164] __do_fault+0x64/0x220
> Call Trace:
> [c01f69617ac0] [000137a5dc20] 0x137a5dc20 (unreliable)
> [c01f69617b50] [c044b164] __do_fault+0x64/0x220
> [c01f69617b90] [c0453838] do_fault+0x218/0x930
> [c01f69617bf0] [c0456f50] __handle_mm_fault+0x350/0xdf0
> [c01f69617cd0] [c0457b1c] handle_mm_fault+0x12c/0x310
> [c01f69617d10] [c007ef44] __do_page_fault+0x264/0xbb0
> [c01f69617df0] [c007f8c8] do_page_fault+0x38/0xd0
> [c01f69617e30] [c000a714] handle_page_fault+0x18/0x38
> Instruction dump:
> 40c2fff0 7c2004ac 2fa9 409e0118 73e90001 41820080 e8bd0008 7c2004ac
> 7ca90074 3940 915c 7929d182 <0b09> 2fa5 419e0080 e89e0018
> ---[ end trace 66c6ff034c53f64f ]---
> xive-kvm: xive_native_esb_fault: accessing invalid ESB page for source 8 !
> 
> Fix that by checking the validity of the KVM XIVE interrupt structure.
> 
> Reported-by: Greg Kurz 
> Signed-off-by: Cédric Le Goater 
> ---

Looks sane to me. QEMU still crashes on SIGBUS but no more oops at least.

Tested-by: Greg Kurz 

>  arch/powerpc/kvm/book3s_xive_native.c | 7 +++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/arch/powerpc/kvm/book3s_xive_native.c 
> b/arch/powerpc/kvm/book3s_xive_native.c
> index d0c2db0e07fa..a59a94f02733 100644
> --- a/arch/powerpc/kvm/book3s_xive_native.c
> +++ b/arch/powerpc/kvm/book3s_xive_native.c
> @@ -251,6 +251,13 @@ static vm_fault_t xive_native_esb_fault(struct vm_fault 
> *vmf)
>   }
>  
>   state = &sb->irq_state[src];
> +
> + /* Some sanity checking */
> + if (!state->valid) {
> + pr_devel("%s: source %lx invalid !\n", __func__, irq);
> + return VM_FAULT_SIGBUS;
> + }
> +
>   kvmppc_xive_select_irq(state, &hw_num, &xd);
>  
>   arch_spin_lock(&sb->lock);

Re: [PATCH] powerpc/pci: Fix PHB removal/rescan on PowerNV

2020-10-14 Thread Greg Kurz

On Thu, 8 Oct 2020 06:37:02 +0200
Cédric Le Goater  wrote:

> On 10/8/20 4:23 AM, Oliver O'Halloran wrote:
> > On Fri, Sep 25, 2020 at 7:23 PM Cédric Le Goater  wrote:
> >>
> >> To fix an issue with PHB hotplug on pSeries machine (HPT/XIVE), commit
> >> 3a3181e16fbd introduced a PPC specific pcibios_remove_bus() routine to
> >> clear all interrupt mappings when the bus is removed. This routine
> >> frees an array allocated in pcibios_scan_phb().
> >>
> >> This broke PHB hotplug on PowerNV because, when a PHB is removed and
> >> re-scanned through sysfs, the PCI layer un-assigns and re-assigns
> >> resources to the PHB but does not destroy and recreate the PCI
> >> controller structure. Since pcibios_remove_bus() does not clear the
> >> 'irq_map' array pointer, a second removal of the PHB will try to free
> >> the array a second time and corrupt memory.
> > 
> > "PHB hotplug" and "hot-plugging devices under a PHB" are different
> > things. What you're saying here doesn't make a whole lot of sense to
> > me unless you're conflating the two. The distinction is important
> > since on pseries we can use DLPAR to add and remove actual PHBs (i.e.
> > the pci_controller) at runtime, but there's no corresponding mechanism
> > on PowerNV.
> 
> And it's even different on QEMU. 
> 

If the real HW doesn't have the notion of adding/removing a PHB at
runtime, then QEMU should stick to that, ie. setting dc->hotpluggable
to false for PNV PHB device types.

> >> Free the 'irq_map' array in pcibios_free_controller() to fix
> >> corruption and clear interrupt mapping after it has been
> >> disposed. This to avoid filling up the array with successive
> >> remove/rescan of a bus.
> > 
> > Even with this patch I think we're still broken. With this patch
> > applied the init path is something like:
> > 
> > per-phb init:
> > allocate phb->irq_map array
> > per-bus init:
> > nothing
> > per-device init:
> > pcibios_bus_add_device()
> >pci_read_irq_line()
> > pci_irq_map_register(pci_dev, virq)
> >*record the device's interrupt in phb->irq_map*
> > 
> > And the teardown path:
> > 
> > per-device teardown:
> > nothing
> > per-bus teardown:
> > pcibios_remove_bus()
> > pci_irq_map_dispose()
> > *walk phb->irq_map and dispose of each mapped interrupt*
> > per-phb teardown:
> > free(phb->irq_map)
> > 
> > There's a lot of asymmetry here, which is a problem in itself, but the
> > real issue is that when removing *any* pci_bus under a PHB we dispose
> > of the LSI for *every* device under that PHB. Not good.
> > 
> > Ideally we should be fixing this by having the per-device teardown
> > handle disposing the mapping. Unfortunately, there's no pcibios hook
> > that's called when removing a pci_dev. However, we can register a bus
> > notifier which will be called when the pci_dev is removed from its bus
> > and we already do that for the per-device EEH teardown and to handle
> > IOMMU TCE invalidation when the device is removed.
> 
> I lack the knowledge here and I think someone else should take over,
> as I am not doing a good job. 
> 
> Michael, can you drop the initial patch again :/ It is better not
> to merge anything.
> 
> Thanks,
> 
> C. 
> 
>
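
The asymmetry described in the quoted init/teardown paths can be
sketched with a small user-space model. Everything here is hypothetical
(the model_* names, the fixed-size map, the dev_id field); it only
contrasts a per-bus teardown that disposes of every mapping under the
PHB with a per-device teardown that disposes of one device's mapping.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_IRQ_MAP_SIZE 8

struct model_irq_entry {
	unsigned int virq; /* 0 marks a free slot */
	int dev_id;        /* hypothetical id of the owning device */
};

struct model_phb {
	struct model_irq_entry irq_map[MODEL_IRQ_MAP_SIZE];
};

/* per-phb init: start with an empty map */
void model_phb_init(struct model_phb *phb)
{
	for (size_t i = 0; i < MODEL_IRQ_MAP_SIZE; i++) {
		phb->irq_map[i].virq = 0;
		phb->irq_map[i].dev_id = -1;
	}
}

/* per-device init: record the device's interrupt in the PHB map */
int model_irq_map_register(struct model_phb *phb, int dev_id,
			   unsigned int virq)
{
	assert(virq != 0);
	for (size_t i = 0; i < MODEL_IRQ_MAP_SIZE; i++) {
		if (phb->irq_map[i].virq == 0) {
			phb->irq_map[i].virq = virq;
			phb->irq_map[i].dev_id = dev_id;
			return 0;
		}
	}
	return -1; /* map is full */
}

/* number of interrupts currently recorded */
size_t model_irq_map_count(const struct model_phb *phb)
{
	size_t n = 0;

	for (size_t i = 0; i < MODEL_IRQ_MAP_SIZE; i++)
		if (phb->irq_map[i].virq != 0)
			n++;
	return n;
}

/* per-bus teardown as described: every mapping under the PHB goes away */
size_t model_dispose_all(struct model_phb *phb)
{
	size_t n = model_irq_map_count(phb);

	model_phb_init(phb);
	return n;
}

/* per-device teardown: only the removed device's mappings go away */
size_t model_dispose_device(struct model_phb *phb, int dev_id)
{
	size_t n = 0;

	for (size_t i = 0; i < MODEL_IRQ_MAP_SIZE; i++) {
		if (phb->irq_map[i].virq != 0 &&
		    phb->irq_map[i].dev_id == dev_id) {
			phb->irq_map[i].virq = 0;
			phb->irq_map[i].dev_id = -1;
			n++;
		}
	}
	return n;
}
```

In this model, removing one device with model_dispose_device() leaves
the other devices' entries intact, whereas model_dispose_all() drops
them regardless of which bus was actually removed.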

[PATCH] KVM: PPC: Don't return -ENOTSUPP to userspace in ioctls

2020-09-11 Thread Greg Kurz

ENOTSUPP is a linux only thingy, the value of which is unknown to
userspace, not to be confused with ENOTSUP which linux maps to
EOPNOTSUPP, as permitted by POSIX [1]:

[EOPNOTSUPP]
Operation not supported on socket. The type of socket (address family
or protocol) does not support the requested operation. A conforming
implementation may assign the same values for [EOPNOTSUPP] and [ENOTSUP].

Return -EOPNOTSUPP instead of -ENOTSUPP for the following ioctls:
- KVM_GET_FPU for Book3s and BookE
- KVM_SET_FPU for Book3s and BookE
- KVM_GET_DIRTY_LOG for BookE

This doesn't affect QEMU which doesn't call the KVM_GET_FPU and
KVM_SET_FPU ioctls on POWER anyway since they are not supported,
and _buggily_ ignores anything but -EPERM for KVM_GET_DIRTY_LOG.
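
As an illustration of why the distinction matters to a caller, here is
a minimal user-space sketch. The helper names are hypothetical, and 524
is assumed to be the kernel-internal ENOTSUPP value, which userspace
<errno.h> does not define.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Kernel-internal ENOTSUPP value (assumption: 524); not in <errno.h>. */
#define MODEL_KERNEL_ENOTSUPP 524

/* Non-zero if errno from a failed ioctl means "operation not supported". */
int ioctl_errno_is_unsupported(int err)
{
	/* POSIX allows ENOTSUP and EOPNOTSUPP to share a value. */
	return err == EOPNOTSUPP || err == ENOTSUP;
}

/* Human-readable description of an ioctl errno value. */
const char *describe_ioctl_errno(int err)
{
	if (ioctl_errno_is_unsupported(err))
		return "not supported";
	if (err == MODEL_KERNEL_ENOTSUPP)
		return "kernel-internal ENOTSUPP leaked to userspace";
	return strerror(err);
}
```

A caller that only tests for EOPNOTSUPP or ENOTSUP would not recognize
524 as "not supported", which is the situation the change avoids.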

[1] https://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html

Signed-off-by: Greg Kurz 
---
 arch/powerpc/kvm/book3s.c |4 ++--
 arch/powerpc/kvm/booke.c  |6 +++---
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index 1fce9777af1c..44bf567b6589 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -558,12 +558,12 @@ int kvm_arch_vcpu_ioctl_set_regs(struct kvm_vcpu *vcpu, 
struct kvm_regs *regs)
 
 int kvm_arch_vcpu_ioctl_get_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)
 {
-   return -ENOTSUPP;
+   return -EOPNOTSUPP;
 }
 
 int kvm_arch_vcpu_ioctl_set_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)
 {
-   return -ENOTSUPP;
+   return -EOPNOTSUPP;
 }
 
 int kvmppc_get_one_reg(struct kvm_vcpu *vcpu, u64 id,
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index 3e1c9f08e302..b1abcb816439 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -1747,12 +1747,12 @@ int kvmppc_set_one_reg(struct kvm_vcpu *vcpu, u64 id,
 
 int kvm_arch_vcpu_ioctl_get_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)
 {
-   return -ENOTSUPP;
+   return -EOPNOTSUPP;
 }
 
 int kvm_arch_vcpu_ioctl_set_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)
 {
-   return -ENOTSUPP;
+   return -EOPNOTSUPP;
 }
 
 int kvm_arch_vcpu_ioctl_translate(struct kvm_vcpu *vcpu,
@@ -1773,7 +1773,7 @@ void kvm_arch_sync_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot)
 
 int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)
 {
-   return -ENOTSUPP;
+   return -EOPNOTSUPP;
 }
 
 void kvmppc_core_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
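
Seen from the caller's side, the distinction the changelog draws can be
sketched as below. This is a minimal userspace illustration, not part of the
patch. It assumes a Linux/glibc build; KERNEL_ENOTSUPP and is_not_supported()
are names made up here, and 524 is the value the kernel assigns to ENOTSUPP
in include/linux/errno.h, which userspace headers do not export.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Kernel-internal ENOTSUPP. Illustrative local name only: no userspace
 * header defines it, which is why it should not leak out of an ioctl.
 */
#define KERNEL_ENOTSUPP 524

/*
 * Hypothetical helper: true if an errno value returned by an ioctl means
 * "operation not supported" in terms userspace can recognize. On Linux,
 * ENOTSUP and EOPNOTSUPP share one value, as POSIX permits.
 */
static bool is_not_supported(int err)
{
	return err == EOPNOTSUPP || err == ENOTSUP;
}
```

A caller that only tests for EOPNOTSUPP/ENOTSUP would treat 524 as an unknown
failure, which is the situation the patch above avoids.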

Re: [PATCH] KVM: PPC: Book3S HV: Do not allocate HPT for a nested guest

2020-09-11 Thread Greg Kurz

On Fri, 11 Sep 2020 10:08:10 +0200
Michal Suchánek  wrote:

> On Fri, Sep 11, 2020 at 10:01:33AM +0200, Greg Kurz wrote:
> > On Fri, 11 Sep 2020 09:45:36 +0200
> > Greg Kurz  wrote:
> > 
> > > On Fri, 11 Sep 2020 01:16:07 -0300
> > > Fabiano Rosas  wrote:
> > > 
> > > > The current nested KVM code does not support HPT guests. This is
> > > > informed/enforced in some ways:
> > > > 
> > > > - Hosts < P9 will not be able to enable the nested HV feature;
> > > > 
> > > > - The nested hypervisor MMU capabilities will not contain
> > > >   KVM_CAP_PPC_MMU_HASH_V3;
> > > > 
> > > > - QEMU reflects the MMU capabilities in the
> > > >   'ibm,arch-vec-5-platform-support' device-tree property;
> > > > 
> > > > - The nested guest, at 'prom_parse_mmu_model' ignores the
> > > >   'disable_radix' kernel command line option if HPT is not supported;
> > > > 
> > > > - The KVM_PPC_CONFIGURE_V3_MMU ioctl will fail if trying to use HPT.
> > > > 
> > > > There is, however, still a way to start an HPT guest by using
> > > > max-cpu-compat=power8 at the QEMU machine options. This leads to the
> > > > guest being set to use hash after QEMU calls the KVM_PPC_ALLOCATE_HTAB
> > > > ioctl.
> > > > 
> > > > With the guest set to hash, the nested hypervisor goes through the
> > > > entry path that has no knowledge of nesting (kvmppc_run_vcpu) and
> > > > crashes when it tries to execute a hypervisor-privileged (mtspr
> > > > HDEC) instruction at __kvmppc_vcore_entry:
> > > > 
> > > > root@L1:~ $ qemu-system-ppc64 -machine pseries,max-cpu-compat=power8 ...
> > > > 
> > > > 
> > > > [  538.543303] CPU: 83 PID: 25185 Comm: CPU 0/KVM Not tainted 5.9.0-rc4 
> > > > #1
> > > > [  538.543355] NIP:  c0080753f388 LR: c0080753f368 CTR: 
> > > > c01e5ec0
> > > > [  538.543417] REGS: c013e91e33b0 TRAP: 0700   Not tainted  
> > > > (5.9.0-rc4)
> > > > [  538.543470] MSR:  82843033   
> > > > CR: 22422882  XER: 2004
> > > > [  538.543546] CFAR: c0080753f4b0 IRQMASK: 3
> > > >GPR00: c008075397a0 c013e91e3640 
> > > > c0080755e600 8000
> > > >GPR04:  c013eab19800 
> > > > c01394de 0043a054db72
> > > >GPR08: 003b1652  
> > > >  c008075502e0
> > > >GPR12: c01e5ec0 c007ffa74200 
> > > > c013eab19800 0008
> > > >GPR16:  c0139676c6c0 
> > > > c1d23948 c013e91e38b8
> > > >GPR20: 0053  
> > > > 0001 
> > > >GPR24: 0001 0001 
> > > >  0001
> > > >GPR28: 0001 0053 
> > > > c013eab19800 0001
> > > > [  538.544067] NIP [c0080753f388] __kvmppc_vcore_entry+0x90/0x104 
> > > > [kvm_hv]
> > > > [  538.544121] LR [c0080753f368] __kvmppc_vcore_entry+0x70/0x104 
> > > > [kvm_hv]
> > > > [  538.544173] Call Trace:
> > > > [  538.544196] [c013e91e3640] [c013e91e3680] 0xc013e91e3680 
> > > > (unreliable)
> > > > [  538.544260] [c013e91e3820] [c008075397a0] 
> > > > kvmppc_run_core+0xbc8/0x19d0 [kvm_hv]
> > > > [  538.544325] [c013e91e39e0] [c0080753d99c] 
> > > > kvmppc_vcpu_run_hv+0x404/0xc00 [kvm_hv]
> > > > [  538.544394] [c013e91e3ad0] [c008072da4fc] 
> > > > kvmppc_vcpu_run+0x34/0x48 [kvm]
> > > > [  538.544472] [c013e91e3af0] [c008072d61b8] 
> > > > kvm_arch_vcpu_ioctl_run+0x310/0x420 [kvm]
> > > > [  538.544539] [c013e91e3b80] [c008072c7450] 
> > > > kvm_vcpu_ioctl+0x298/0x778 [kvm]
> > > > [  538.544605] [c013e91e3ce0] [c04b8c2c] 
> > > > sys_ioctl+0x1dc/0xc90
> > > > [  538.544662] [c013e91e3dc0] [c002f9a4] 
> > > > system_call_exception+0xe4/0x1c0
> > > > [  538.544726] [c013e91e3e20] [c000d140] 
> > > > system_call_common+0xf0/0

Re: [PATCH] KVM: PPC: Book3S HV: Do not allocate HPT for a nested guest

2020-09-11 Thread Greg Kurz

On Fri, 11 Sep 2020 09:45:36 +0200
Greg Kurz  wrote:

> On Fri, 11 Sep 2020 01:16:07 -0300
> Fabiano Rosas  wrote:
> 
> > The current nested KVM code does not support HPT guests. This is
> > informed/enforced in some ways:
> > 
> > - Hosts < P9 will not be able to enable the nested HV feature;
> > 
> > - The nested hypervisor MMU capabilities will not contain
> >   KVM_CAP_PPC_MMU_HASH_V3;
> > 
> > - QEMU reflects the MMU capabilities in the
> >   'ibm,arch-vec-5-platform-support' device-tree property;
> > 
> > - The nested guest, at 'prom_parse_mmu_model' ignores the
> >   'disable_radix' kernel command line option if HPT is not supported;
> > 
> > - The KVM_PPC_CONFIGURE_V3_MMU ioctl will fail if trying to use HPT.
> > 
> > There is, however, still a way to start an HPT guest by using
> > max-cpu-compat=power8 at the QEMU machine options. This leads to the
> > guest being set to use hash after QEMU calls the KVM_PPC_ALLOCATE_HTAB
> > ioctl.
> > 
> > With the guest set to hash, the nested hypervisor goes through the
> > entry path that has no knowledge of nesting (kvmppc_run_vcpu) and
> > crashes when it tries to execute a hypervisor-privileged (mtspr
> > HDEC) instruction at __kvmppc_vcore_entry:
> > 
> > root@L1:~ $ qemu-system-ppc64 -machine pseries,max-cpu-compat=power8 ...
> > 
> > 
> > [  538.543303] CPU: 83 PID: 25185 Comm: CPU 0/KVM Not tainted 5.9.0-rc4 #1
> > [  538.543355] NIP:  c0080753f388 LR: c0080753f368 CTR: 
> > c01e5ec0
> > [  538.543417] REGS: c013e91e33b0 TRAP: 0700   Not tainted  (5.9.0-rc4)
> > [  538.543470] MSR:  82843033   CR: 
> > 22422882  XER: 2004
> > [  538.543546] CFAR: c0080753f4b0 IRQMASK: 3
> >GPR00: c008075397a0 c013e91e3640 c0080755e600 
> > 8000
> >GPR04:  c013eab19800 c01394de 
> > 0043a054db72
> >GPR08: 003b1652   
> > c008075502e0
> >GPR12: c01e5ec0 c007ffa74200 c013eab19800 
> > 0008
> >GPR16:  c0139676c6c0 c1d23948 
> > c013e91e38b8
> >GPR20: 0053  0001 
> > 
> >GPR24: 0001 0001  
> > 0001
> >GPR28: 0001 0053 c013eab19800 
> > 0001
> > [  538.544067] NIP [c0080753f388] __kvmppc_vcore_entry+0x90/0x104 
> > [kvm_hv]
> > [  538.544121] LR [c0080753f368] __kvmppc_vcore_entry+0x70/0x104 
> > [kvm_hv]
> > [  538.544173] Call Trace:
> > [  538.544196] [c013e91e3640] [c013e91e3680] 0xc013e91e3680 
> > (unreliable)
> > [  538.544260] [c013e91e3820] [c008075397a0] 
> > kvmppc_run_core+0xbc8/0x19d0 [kvm_hv]
> > [  538.544325] [c013e91e39e0] [c0080753d99c] 
> > kvmppc_vcpu_run_hv+0x404/0xc00 [kvm_hv]
> > [  538.544394] [c013e91e3ad0] [c008072da4fc] 
> > kvmppc_vcpu_run+0x34/0x48 [kvm]
> > [  538.544472] [c013e91e3af0] [c008072d61b8] 
> > kvm_arch_vcpu_ioctl_run+0x310/0x420 [kvm]
> > [  538.544539] [c013e91e3b80] [c008072c7450] 
> > kvm_vcpu_ioctl+0x298/0x778 [kvm]
> > [  538.544605] [c013e91e3ce0] [c04b8c2c] sys_ioctl+0x1dc/0xc90
> > [  538.544662] [c013e91e3dc0] [c002f9a4] 
> > system_call_exception+0xe4/0x1c0
> > [  538.544726] [c013e91e3e20] [c000d140] 
> > system_call_common+0xf0/0x27c
> > [  538.544787] Instruction dump:
> > [  538.544821] f86d1098 6000 6000 4899 e8ad0fe8 e8c500a0 
> > e9264140 75290002
> > [  538.544886] 7d1602a6 7cec42a6 40820008 7d0807b4 <7d164ba6> 7d083a14 
> > f90d10a0 480104fd
> > [  538.544953] ---[ end trace 74423e2b948c2e0c ]---
> > 
> > This patch makes the KVM_PPC_ALLOCATE_HTAB ioctl fail when running in
> > the nested hypervisor, causing QEMU to abort.
> > 
> > Reported-by: Satheesh Rajendran 
> > Signed-off-by: Fabiano Rosas 
> > ---
> 
> LGTM
> 
> Reviewed-by: Greg Kurz 
> 
> >  arch/powerpc/kvm/book3s_hv.c | 6 ++
> >  1 file changed, 6 insertions(+)
> > 
> > diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> > index 4ba06a2a306c..764b6239ef72 100644
> > --- a/arch/powerpc/kvm/book3s_hv.c
> > +++ b/arch/powerpc/kvm/book3s_hv.c
> > @@ -5250,6 +5250,12

Re: [PATCH] KVM: PPC: Book3S HV: Do not allocate HPT for a nested guest

2020-09-11 Thread Greg Kurz

On Fri, 11 Sep 2020 01:16:07 -0300
Fabiano Rosas  wrote:

> The current nested KVM code does not support HPT guests. This is
> informed/enforced in some ways:
> 
> - Hosts < P9 will not be able to enable the nested HV feature;
> 
> - The nested hypervisor MMU capabilities will not contain
>   KVM_CAP_PPC_MMU_HASH_V3;
> 
> - QEMU reflects the MMU capabilities in the
>   'ibm,arch-vec-5-platform-support' device-tree property;
> 
> - The nested guest, at 'prom_parse_mmu_model' ignores the
>   'disable_radix' kernel command line option if HPT is not supported;
> 
> - The KVM_PPC_CONFIGURE_V3_MMU ioctl will fail if trying to use HPT.
> 
> There is, however, still a way to start an HPT guest by using
> max-cpu-compat=power8 at the QEMU machine options. This leads to the
> guest being set to use hash after QEMU calls the KVM_PPC_ALLOCATE_HTAB
> ioctl.
> 
> With the guest set to hash, the nested hypervisor goes through the
> entry path that has no knowledge of nesting (kvmppc_run_vcpu) and
> crashes when it tries to execute a hypervisor-privileged (mtspr
> HDEC) instruction at __kvmppc_vcore_entry:
> 
> root@L1:~ $ qemu-system-ppc64 -machine pseries,max-cpu-compat=power8 ...
> 
> 
> [  538.543303] CPU: 83 PID: 25185 Comm: CPU 0/KVM Not tainted 5.9.0-rc4 #1
> [  538.543355] NIP:  c0080753f388 LR: c0080753f368 CTR: 
> c01e5ec0
> [  538.543417] REGS: c013e91e33b0 TRAP: 0700   Not tainted  (5.9.0-rc4)
> [  538.543470] MSR:  82843033   CR: 
> 22422882  XER: 2004
> [  538.543546] CFAR: c0080753f4b0 IRQMASK: 3
>GPR00: c008075397a0 c013e91e3640 c0080755e600 
> 8000
>GPR04:  c013eab19800 c01394de 
> 0043a054db72
>GPR08: 003b1652   
> c008075502e0
>GPR12: c01e5ec0 c007ffa74200 c013eab19800 
> 0008
>GPR16:  c0139676c6c0 c1d23948 
> c013e91e38b8
>GPR20: 0053  0001 
> 
>GPR24: 0001 0001  
> 0001
>GPR28: 0001 0053 c013eab19800 
> 0001
> [  538.544067] NIP [c0080753f388] __kvmppc_vcore_entry+0x90/0x104 [kvm_hv]
> [  538.544121] LR [c0080753f368] __kvmppc_vcore_entry+0x70/0x104 [kvm_hv]
> [  538.544173] Call Trace:
> [  538.544196] [c013e91e3640] [c013e91e3680] 0xc013e91e3680 
> (unreliable)
> [  538.544260] [c013e91e3820] [c008075397a0] 
> kvmppc_run_core+0xbc8/0x19d0 [kvm_hv]
> [  538.544325] [c013e91e39e0] [c0080753d99c] 
> kvmppc_vcpu_run_hv+0x404/0xc00 [kvm_hv]
> [  538.544394] [c013e91e3ad0] [c008072da4fc] 
> kvmppc_vcpu_run+0x34/0x48 [kvm]
> [  538.544472] [c013e91e3af0] [c008072d61b8] 
> kvm_arch_vcpu_ioctl_run+0x310/0x420 [kvm]
> [  538.544539] [c013e91e3b80] [c008072c7450] 
> kvm_vcpu_ioctl+0x298/0x778 [kvm]
> [  538.544605] [c013e91e3ce0] [c04b8c2c] sys_ioctl+0x1dc/0xc90
> [  538.544662] [c013e91e3dc0] [c002f9a4] 
> system_call_exception+0xe4/0x1c0
> [  538.544726] [c013e91e3e20] [c000d140] 
> system_call_common+0xf0/0x27c
> [  538.544787] Instruction dump:
> [  538.544821] f86d1098 6000 6000 4899 e8ad0fe8 e8c500a0 e9264140 
> 75290002
> [  538.544886] 7d1602a6 7cec42a6 40820008 7d0807b4 <7d164ba6> 7d083a14 
> f90d10a0 480104fd
> [  538.544953] ---[ end trace 74423e2b948c2e0c ]---
> 
> This patch makes the KVM_PPC_ALLOCATE_HTAB ioctl fail when running in
> the nested hypervisor, causing QEMU to abort.
> 
> Reported-by: Satheesh Rajendran 
> Signed-off-by: Fabiano Rosas 
> ---

LGTM

Reviewed-by: Greg Kurz 

>  arch/powerpc/kvm/book3s_hv.c | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 4ba06a2a306c..764b6239ef72 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -5250,6 +5250,12 @@ static long kvm_arch_vm_ioctl_hv(struct file *filp,
>   case KVM_PPC_ALLOCATE_HTAB: {
>   u32 htab_order;
>  
> + /* If we're a nested hypervisor, we currently only support radix */
> + if (kvmhv_on_pseries()) {
> + r = -EOPNOTSUPP;
> + break;
> + }
> +
>   r = -EFAULT;
>   if (get_user(htab_order, (u32 __user *)argp))
>   break;

Re: [PATCH v2] powerpc/pseries/hotplug-cpu: wait indefinitely for vCPU death

2020-08-12 Thread Greg Kurz

ck_tftp 
> tun bridge stp llc nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib 
> nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nf_tables_set 
> nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6_tables 
> nft_compat ip_set nf_tables nfnetlink sunrpc xts vmx_crypto ip_tables xfs 
> libcrc32c sd_mod sg virtio_net net_failover virtio_scsi failover dm_mirror 
> dm_region_hash dm_log dm_mod
>   [ 1538.362613] CPU: 0 PID: 548 Comm: kworker/u768:3 Kdump: loaded Not 
> tainted 4.18.0-224.el8.bz1856588.ppc64le #1
>   [ 1538.362687] Workqueue: pseries hotplug workque pseries_hp_work_fn
>   [ 1538.362725] Call Trace:
>   [ 1538.362743] [c009d4adf590] [c0e0e0fc] dump_stack+0xb0/0xf4 
> (unreliable)
>   [ 1538.362789] [c009d4adf5d0] [c0475dfc] bad_page+0x12c/0x1b0
>   [ 1538.362827] [c009d4adf660] [c04784bc] 
> free_pcppages_bulk+0x5bc/0x940
>   [ 1538.362871] [c009d4adf760] [c0478c38] 
> page_alloc_cpu_dead+0x118/0x120
>   [ 1538.362918] [c009d4adf7b0] [c015b898] 
> cpuhp_invoke_callback.constprop.5+0xb8/0x760
>   [ 1538.362969] [c009d4adf820] [c015eee8] _cpu_down+0x188/0x340
>   [ 1538.363007] [c009d4adf890] [c015d75c] cpu_down+0x5c/0xa0
>   [ 1538.363045] [c009d4adf8c0] [c092c544] 
> cpu_subsys_offline+0x24/0x40
>   [ 1538.363091] [c009d4adf8e0] [c09212f0] 
> device_offline+0xf0/0x130
>   [ 1538.363129] [c009d4adf920] [c010aee4] 
> dlpar_offline_cpu+0x1c4/0x2a0
>   [ 1538.363174] [c009d4adf9e0] [c010b2f8] 
> dlpar_cpu_remove+0xb8/0x190
>   [ 1538.363219] [c009d4adfa60] [c010b4fc] 
> dlpar_cpu_remove_by_index+0x12c/0x150
>   [ 1538.363264] [c009d4adfaf0] [c010ca24] dlpar_cpu+0x94/0x800
>   [ 1538.363302] [c009d4adfc00] [c0102cc8] 
> pseries_hp_work_fn+0x128/0x1e0
>   [ 1538.363347] [c009d4adfc70] [c018aa84] 
> process_one_work+0x304/0x5d0
>   [ 1538.363394] [c009d4adfd10] [c018b5cc] 
> worker_thread+0xcc/0x7a0
>   [ 1538.363433] [c009d4adfdc0] [c019567c] kthread+0x1ac/0x1c0
>   [ 1538.363469] [c009d4adfe30] [c000b7dc] 
> ret_from_kernel_thread+0x5c/0x80
> 
> The latter trace is due to the following sequence:
> 
>   page_alloc_cpu_dead
> drain_pages
>   drain_pages_zone
> free_pcppages_bulk
> 
> where drain_pages() in this case is called under the assumption that
> the unplugged cpu is no longer executing. To ensure that is the case,
> an early call is made to __cpu_die()->pseries_cpu_die(), which runs
> a loop that waits for the cpu to reach a halted state by polling its
> status via query-cpu-stopped-state RTAS calls. It only polls for
> 25 iterations before giving up, however, and in the trace above this
> results in the following being printed only 0.1 seconds after the
> hotplug worker thread begins processing the unplug request:
> 
>   [ 1538.253044] pseries-hotplug-cpu: Attempting to remove CPU , drc 
> index: 113a
>   [ 1538.360259] Querying DEAD? cpu 314 (314) shows 2
> 
> At that point the worker thread assumes the unplugged CPU is in some
> unknown/dead state and proceeds with the cleanup, causing the race with
> the XIVE cleanup code executed by the unplugged CPU.
> 
> Fix this by waiting indefinitely, but also making an effort to avoid
> spurious lockup messages by allowing for rescheduling after polling
> the CPU status and printing a warning if we wait for longer than 120s.
> 
> Fixes: eac1e731b59ee ("powerpc/xive: guest exploitation of the XIVE interrupt controller")
> Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1856588
> Suggested-by: Michael Ellerman 
> Cc: Thiago Jung Bauermann 
> Cc: Michael Ellerman 
> Cc: Cedric Le Goater 
> Cc: Greg Kurz 
> Cc: Nathan Lynch 
> Signed-off-by: Michael Roth 
> ---
> changes from v1:
>  - renamed from "powerpc/pseries/hotplug-cpu: increase wait time for vCPU death"
>  - wait indefinitely, but issue cond_resched() when RTAS reports that CPU
>hasn't stopped (Michael)
>  - print a warning after 120s of waiting (Michael)
>  - use pr_warn() instead of default printk() level
> ---
>  arch/powerpc/platforms/pseries/hotplug-cpu.c | 18 --
>  1 file changed, 12 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> index c6e0d8abf75e..7a974ed6b240 100644
> --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
> +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> @@ -107,22 +107,28 @@ static int pseries_cpu_disable(void)
>   */
>  static void pseries_cpu_die(unsigned int

Re: [PATCH] powerpc/pseries/hotplug-cpu: increase wait time for vCPU death

2020-08-04 Thread Greg Kurz

On Tue, 04 Aug 2020 23:35:10 +1000
Michael Ellerman  wrote:

> Hi Mike,
> 
> There is a bit of history to this code, but not in a good way :)
> 
> Michael Roth  writes:
> > For a power9 KVM guest with XIVE enabled, running a test loop
> > where we hotplug 384 vcpus and then unplug them, the following traces
> > can be seen (generally within a few loops) either from the unplugged
> > vcpu:
> >
> >   [ 1767.353447] cpu 65 (hwid 65) Ready to die...
> >   [ 1767.952096] Querying DEAD? cpu 66 (66) shows 2
> >   [ 1767.952311] list_del corruption. next->prev should be 
> > c00a02470208, but was c00a02470048
> ...
> >
> > At that point the worker thread assumes the unplugged CPU is in some
> > unknown/dead state and proceeds with the cleanup, causing the race with
> > the XIVE cleanup code executed by the unplugged CPU.
> >
> > Fix this by inserting an msleep() after each RTAS call to avoid
> 
> We previously had an msleep(), but it was removed:
> 
>   b906cfa397fd ("powerpc/pseries: Fix cpu hotplug")
> 

Ah, I hadn't seen that one...

> > pseries_cpu_die() returning prematurely, and double the number of
> > attempts so we wait at least a total of 5 seconds. While this isn't an
> > ideal solution, it is similar to how we dealt with a similar issue for
> > cede_offline mode in the past (940ce422a3).
> 
> Thiago tried to fix this previously but there was a bit of discussion
> that didn't quite resolve:
> 
>   
> https://lore.kernel.org/linuxppc-dev/20190423223914.3882-1-bauer...@linux.ibm.com/
> 

Yeah, it appears that the motivation at the time was to make the "Querying DEAD?"
messages disappear and to avoid potentially concurrent calls to rtas-stop-self,
which are prohibited by PAPR... not to fix actual crashes.

> 
> Spinning forever seems like a bad idea, but as has been demonstrated at
> least twice now, continuing when we don't know the state of the other
> CPU can lead to straight up crashes.
> 
> So I think I'm persuaded that it's preferable to have the kernel stuck
> spinning rather than oopsing.
> 

+1

> I'm 50/50 on whether we should have a cond_resched() in the loop. My
> first instinct is no, if we're stuck here for 20s a stack trace would be
> good. But then we will probably hit that on some big and/or heavily
> loaded machine.
> 
> So possibly we should call cond_resched() but have some custom logic in
> the loop to print a warning if we are stuck for more than some
> sufficiently long amount of time.
> 

How long should that be ?

> 
> > Fixes: eac1e731b59ee ("powerpc/xive: guest exploitation of the XIVE interrupt controller")
> > Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1856588
> 
> This is not public.
> 

I'll have a look at changing that.

> I tend to trim Bugzilla links from the change log, because I'm not
> convinced they will last forever, but it is good to have them in the
> mail archive.
> 
> cheers
> 

Cheers,

--
Greg

> > Cc: Michael Ellerman 
> > Cc: Cedric Le Goater 
> > Cc: Greg Kurz 
> > Cc: Nathan Lynch 
> > Signed-off-by: Michael Roth 
> > ---
> >  arch/powerpc/platforms/pseries/hotplug-cpu.c | 5 ++---
> >  1 file changed, 2 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> > index c6e0d8abf75e..3cb172758052 100644
> > --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
> > +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> > @@ -111,13 +111,12 @@ static void pseries_cpu_die(unsigned int cpu)
> > int cpu_status = 1;
> > unsigned int pcpu = get_hard_smp_processor_id(cpu);
> >  
> > -   for (tries = 0; tries < 25; tries++) {
> > +   for (tries = 0; tries < 50; tries++) {
> > cpu_status = smp_query_cpu_stopped(pcpu);
> > if (cpu_status == QCSS_STOPPED ||
> > cpu_status == QCSS_HARDWARE_ERROR)
> > break;
> > -   cpu_relax();
> > -
> > +   msleep(100);
> > }
> >  
> > if (cpu_status != 0) {
> > -- 
> > 2.17.1
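
The alternative discussed in this thread (keep polling until the CPU reports
stopped, yield between polls, and warn when the wait gets long) can be
sketched in plain C as below. This is a userspace model under stated
assumptions, not the kernel code: struct fake_cpu, fake_query() and
wait_for_cpu_stop() are hypothetical names, the fake clock advances one
second per unsuccessful poll, the QCSS_* values are local stand-ins, and
repeating the warning at every further interval is a choice made for this
sketch.

```c
/* Local stand-ins for the query-cpu-stopped-state results. */
enum {
	QCSS_HARDWARE_ERROR = -1,
	QCSS_STOPPED = 0,
	QCSS_NOT_STOPPED = 2,
};

struct wait_result {
	int status;		/* last status returned by the query */
	unsigned int warnings;	/* how many "still waiting" warnings fired */
};

/* Hypothetical simulated CPU: stops after a fixed number of polls. */
struct fake_cpu {
	unsigned long now;			/* fake clock, in seconds */
	unsigned long polls_until_stopped;
};

static int fake_query(struct fake_cpu *cpu)
{
	if (cpu->polls_until_stopped == 0)
		return QCSS_STOPPED;
	cpu->polls_until_stopped--;
	cpu->now++;	/* pretend each unsuccessful poll costs one second */
	return QCSS_NOT_STOPPED;
}

/* Poll without an iteration limit; count a warning every warn_after seconds. */
static struct wait_result wait_for_cpu_stop(struct fake_cpu *cpu,
					    unsigned long warn_after)
{
	struct wait_result res = { QCSS_NOT_STOPPED, 0 };
	unsigned long next_warn = cpu->now + warn_after;

	for (;;) {
		res.status = fake_query(cpu);
		if (res.status == QCSS_STOPPED ||
		    res.status == QCSS_HARDWARE_ERROR)
			break;
		if (cpu->now >= next_warn) {
			res.warnings++;	/* a kernel loop would print a warning here */
			next_warn = cpu->now + warn_after;
		}
		/* a kernel loop would call cond_resched() at this point */
	}
	return res;
}
```

The point of the shape is that the loop never falls through with an unknown
CPU state: it exits only on a definite "stopped" or "hardware error" answer.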

Re: [PATCH -next] powerpc/xive: Remove unused inline function xive_kexec_teardown_cpu()

2020-07-15 Thread Greg Kurz

On Wed, 15 Jul 2020 10:50:40 +0800
YueHaibing  wrote:

> commit e27e0a94651e ("powerpc/xive: Remove xive_kexec_teardown_cpu()")
> left behind this, remove it.
> 
> Signed-off-by: YueHaibing 
> ---

Reviewed-by: Greg Kurz 

>  arch/powerpc/include/asm/xive.h | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h
> index d08ea11b271c..309b4d65b74f 100644
> --- a/arch/powerpc/include/asm/xive.h
> +++ b/arch/powerpc/include/asm/xive.h
> @@ -155,7 +155,6 @@ static inline void xive_smp_probe(void) { }
>  static inline int  xive_smp_prepare_cpu(unsigned int cpu) { return -EINVAL; }
>  static inline void xive_smp_setup_cpu(void) { }
>  static inline void xive_smp_disable_cpu(void) { }
> -static inline void xive_kexec_teardown_cpu(int secondary) { }
>  static inline void xive_shutdown(void) { }
>  static inline void xive_flush_interrupt(void) { }
>

Re: [PATCH] KVM: PPC: Book3S HV: relax check on H_SVM_INIT_ABORT

2020-05-20 Thread Greg Kurz

On Wed, 20 May 2020 18:51:10 +0200
Laurent Dufour  wrote:

> The commit 8c47b6ff29e3 ("KVM: PPC: Book3S HV: Check caller of H_SVM_*
> Hcalls") added checks of secure bit of SRR1 to filter out the Hcall
> reserved to the Ultravisor.
> 
> However, the Hcall H_SVM_INIT_ABORT is made by the Ultravisor passing the
> context of the VM calling UV_ESM. This allows the Hypervisor to return to
> the guest without going through the Ultravisor. Thus the Secure bit of SRR1
> is not set in that particular case.
> 
> In the case a regular VM is calling H_SVM_INIT_ABORT, this hcall will be
> filtered out in kvmppc_h_svm_init_abort() because kvm->arch.secure_guest is
> not set in that case.
> 

Why not check vcpu->kvm->arch.secure_guest then?

> Fixes: 8c47b6ff29e3 ("KVM: PPC: Book3S HV: Check caller of H_SVM_* Hcalls")
> Signed-off-by: Laurent Dufour 
> ---
>  arch/powerpc/kvm/book3s_hv.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 93493f0cbfe8..eb1f96cb7b72 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -1099,9 +1099,7 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
>   ret = kvmppc_h_svm_init_done(vcpu->kvm);
>   break;
>   case H_SVM_INIT_ABORT:
> - ret = H_UNSUPPORTED;
> - if (kvmppc_get_srr1(vcpu) & MSR_S)
> - ret = kvmppc_h_svm_init_abort(vcpu->kvm);

or at least put a comment to explain why H_SVM_INIT_ABORT
doesn't have the same sanity check as the other SVM hcalls.

> + ret = kvmppc_h_svm_init_abort(vcpu->kvm);
>   break;
>  
>   default:

Re: [PATCH v2] KVM: PPC: Book3S HV: relax check on H_SVM_INIT_ABORT

2020-05-20 Thread Greg Kurz

On Wed, 20 May 2020 19:43:08 +0200
Laurent Dufour  wrote:

> The commit 8c47b6ff29e3 ("KVM: PPC: Book3S HV: Check caller of H_SVM_*
> Hcalls") added checks of secure bit of SRR1 to filter out the Hcall
> reserved to the Ultravisor.
> 
> However, the Hcall H_SVM_INIT_ABORT is made by the Ultravisor passing the
> context of the VM calling UV_ESM. This allows the Hypervisor to return to
> the guest without going through the Ultravisor. Thus the Secure bit of SRR1
> is not set in that particular case.
> 
> In the case a regular VM is calling H_SVM_INIT_ABORT, this hcall will be
> filtered out in kvmppc_h_svm_init_abort() because kvm->arch.secure_guest is
> not set in that case.
> 
> Fixes: 8c47b6ff29e3 ("KVM: PPC: Book3S HV: Check caller of H_SVM_* Hcalls")
> Signed-off-by: Laurent Dufour 
> ---

Reviewed-by: Greg Kurz 

>  arch/powerpc/kvm/book3s_hv.c | 9 ++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 93493f0cbfe8..6ad1a3b14300 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -1099,9 +1099,12 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
>   ret = kvmppc_h_svm_init_done(vcpu->kvm);
>   break;
>   case H_SVM_INIT_ABORT:
> - ret = H_UNSUPPORTED;
> - if (kvmppc_get_srr1(vcpu) & MSR_S)
> - ret = kvmppc_h_svm_init_abort(vcpu->kvm);
> + /*
> +  * Even if that call is made by the Ultravisor, the SRR1 value
> +  * is the guest context one, with the secure bit clear as it has
> +  * not yet been secured. So we can't check it here.
> +  */
> + ret = kvmppc_h_svm_init_abort(vcpu->kvm);
>   break;
>  
>   default:

Re: [PATCH 4/4] ocxl: Remove custom service to allocate interrupts

2020-04-03 Thread Greg Kurz

On Thu,  2 Apr 2020 17:43:52 +0200
Frederic Barrat  wrote:

> We now allocate interrupts through xive directly.
> 
> Signed-off-by: Frederic Barrat 
> ---
>  arch/powerpc/include/asm/pnv-ocxl.h   |  3 ---
>  arch/powerpc/platforms/powernv/ocxl.c | 30 ---

Nice diffstat :)

Reviewed-by: Greg Kurz 

>  2 files changed, 33 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/pnv-ocxl.h b/arch/powerpc/include/asm/pnv-ocxl.h
> index 7de82647e761..e90650328c9c 100644
> --- a/arch/powerpc/include/asm/pnv-ocxl.h
> +++ b/arch/powerpc/include/asm/pnv-ocxl.h
> @@ -30,7 +30,4 @@ extern int pnv_ocxl_spa_setup(struct pci_dev *dev, void *spa_mem, int PE_mask,
>  extern void pnv_ocxl_spa_release(void *platform_data);
>  extern int pnv_ocxl_spa_remove_pe_from_cache(void *platform_data, int pe_handle);
>  
> -extern int pnv_ocxl_alloc_xive_irq(u32 *irq, u64 *trigger_addr);
> -extern void pnv_ocxl_free_xive_irq(u32 irq);
> -
>  #endif /* _ASM_PNV_OCXL_H */
> diff --git a/arch/powerpc/platforms/powernv/ocxl.c b/arch/powerpc/platforms/powernv/ocxl.c
> index 8c65aacda9c8..ecdad219d704 100644
> --- a/arch/powerpc/platforms/powernv/ocxl.c
> +++ b/arch/powerpc/platforms/powernv/ocxl.c
> @@ -2,7 +2,6 @@
>  // Copyright 2017 IBM Corp.
>  #include 
>  #include 
> -#include 
>  #include 
>  #include "pci.h"
>  
> @@ -484,32 +483,3 @@ int pnv_ocxl_spa_remove_pe_from_cache(void *platform_data, int pe_handle)
>   return rc;
>  }
>  EXPORT_SYMBOL_GPL(pnv_ocxl_spa_remove_pe_from_cache);
> -
> -int pnv_ocxl_alloc_xive_irq(u32 *irq, u64 *trigger_addr)
> -{
> - __be64 flags, trigger_page;
> - s64 rc;
> - u32 hwirq;
> -
> - hwirq = xive_native_alloc_irq();
> - if (!hwirq)
> - return -ENOENT;
> -
> - rc = opal_xive_get_irq_info(hwirq, &flags, NULL, &trigger_page, NULL,
> - NULL);
> - if (rc || !trigger_page) {
> - xive_native_free_irq(hwirq);
> - return -ENOENT;
> - }
> - *irq = hwirq;
> - *trigger_addr = be64_to_cpu(trigger_page);
> - return 0;
> -
> -}
> -EXPORT_SYMBOL_GPL(pnv_ocxl_alloc_xive_irq);
> -
> -void pnv_ocxl_free_xive_irq(u32 irq)
> -{
> - xive_native_free_irq(irq);
> -}
> -EXPORT_SYMBOL_GPL(pnv_ocxl_free_xive_irq);

Re: [PATCH 3/4] ocxl: Don't return trigger page when allocating an interrupt

2020-04-03 Thread Greg Kurz

On Thu,  2 Apr 2020 17:43:51 +0200
Frederic Barrat  wrote:

> Existing users of ocxl_link_irq_alloc() have been converted to obtain
> the trigger page of an interrupt through xive directly, we therefore
> have no need to return the trigger page when allocating an interrupt.
> 
> It also allows ocxl to use the xive native interface to allocate
> interrupts, instead of its custom service.
> 
> Signed-off-by: Frederic Barrat 
> ---

Reviewed-by: Greg Kurz 

>  drivers/misc/ocxl/Kconfig   |  2 +-
>  drivers/misc/ocxl/afu_irq.c |  4 +---
>  drivers/misc/ocxl/link.c| 15 +++
>  drivers/scsi/cxlflash/ocxl_hw.c |  3 +--
>  include/misc/ocxl.h | 10 ++
>  5 files changed, 12 insertions(+), 22 deletions(-)
> 
> diff --git a/drivers/misc/ocxl/Kconfig b/drivers/misc/ocxl/Kconfig
> index 2d2266c1439e..e65773f5cf59 100644
> --- a/drivers/misc/ocxl/Kconfig
> +++ b/drivers/misc/ocxl/Kconfig
> @@ -9,7 +9,7 @@ config OCXL_BASE
>  
>  config OCXL
>   tristate "OpenCAPI coherent accelerator support"
> - depends on PPC_POWERNV && PCI && EEH
> + depends on PPC_POWERNV && PCI && EEH && PPC_XIVE_NATIVE
>   select OCXL_BASE
>   select HOTPLUG_PCI_POWERNV
>   default m
> diff --git a/drivers/misc/ocxl/afu_irq.c b/drivers/misc/ocxl/afu_irq.c
> index b30ec0ef7be7..ecdcfae025b7 100644
> --- a/drivers/misc/ocxl/afu_irq.c
> +++ b/drivers/misc/ocxl/afu_irq.c
> @@ -11,7 +11,6 @@ struct afu_irq {
>   int hw_irq;
>   unsigned int virq;
>   char *name;
> - u64 trigger_page;
>   irqreturn_t (*handler)(void *private);
>   void (*free_private)(void *private);
>   void *private;
> @@ -125,8 +124,7 @@ int ocxl_afu_irq_alloc(struct ocxl_context *ctx, int *irq_id)
>   goto err_unlock;
>   }
>  
> - rc = ocxl_link_irq_alloc(ctx->afu->fn->link, &irq->hw_irq,
> - &irq->trigger_page);
> + rc = ocxl_link_irq_alloc(ctx->afu->fn->link, &irq->hw_irq);
>   if (rc)
>   goto err_idr;
>  
> diff --git a/drivers/misc/ocxl/link.c b/drivers/misc/ocxl/link.c
> index 58d111afd9f6..fd73d3bc0eb6 100644
> --- a/drivers/misc/ocxl/link.c
> +++ b/drivers/misc/ocxl/link.c
> @@ -6,6 +6,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include "ocxl_internal.h"
>  #include "trace.h"
> @@ -682,23 +683,21 @@ int ocxl_link_remove_pe(void *link_handle, int pasid)
>  }
>  EXPORT_SYMBOL_GPL(ocxl_link_remove_pe);
>  
> -int ocxl_link_irq_alloc(void *link_handle, int *hw_irq, u64 *trigger_addr)
> +int ocxl_link_irq_alloc(void *link_handle, int *hw_irq)
>  {
>   struct ocxl_link *link = (struct ocxl_link *) link_handle;
> - int rc, irq;
> - u64 addr;
> + int irq;
>  
>   if (atomic_dec_if_positive(&link->irq_available) < 0)
>   return -ENOSPC;
>  
> - rc = pnv_ocxl_alloc_xive_irq(&irq, &addr);
> - if (rc) {
> + irq = xive_native_alloc_irq();
> + if (!irq) {
>   atomic_inc(&link->irq_available);
> - return rc;
> + return -ENXIO;
>   }
>  
>   *hw_irq = irq;
> - *trigger_addr = addr;
>   return 0;
>  }
>  EXPORT_SYMBOL_GPL(ocxl_link_irq_alloc);
> @@ -707,7 +706,7 @@ void ocxl_link_free_irq(void *link_handle, int hw_irq)
>  {
>   struct ocxl_link *link = (struct ocxl_link *) link_handle;
>  
> - pnv_ocxl_free_xive_irq(hw_irq);
> + xive_native_free_irq(hw_irq);
>   atomic_inc(&link->irq_available);
>  }
>  EXPORT_SYMBOL_GPL(ocxl_link_free_irq);
> diff --git a/drivers/scsi/cxlflash/ocxl_hw.c b/drivers/scsi/cxlflash/ocxl_hw.c
> index 59452850f71c..03bff0cae658 100644
> --- a/drivers/scsi/cxlflash/ocxl_hw.c
> +++ b/drivers/scsi/cxlflash/ocxl_hw.c
> @@ -613,7 +613,6 @@ static int alloc_afu_irqs(struct ocxlflash_context *ctx, int num)
>   struct ocxl_hw_afu *afu = ctx->hw_afu;
>   struct device *dev = afu->dev;
>   struct ocxlflash_irqs *irqs;
> - u64 addr;
>   int rc = 0;
>   int hwirq;
>   int i;
> @@ -638,7 +637,7 @@ static int alloc_afu_irqs(struct ocxlflash_context *ctx, int num)
>   }
>  
>   for (i = 0; i < num; i++) {
> - rc = ocxl_link_irq_alloc(afu->link_token, &hwirq, &addr);
> + rc = ocxl_link_irq_alloc(afu->link_token, &hwirq);
>   if (unlikely(rc)) {
>   dev_err(dev, "%s: ocxl_link_irq_alloc failed rc=%d\n",
>   __func__, rc);
> diff --git a/include/misc/ocxl.h b/include/misc/ocxl.h
> index 06dd5839e
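
The allocation pattern in the link.c hunk above (reserve a slot from a
counter, try the hardware allocation, hand the slot back on failure) can be
modelled in userspace C11 as below. This is only a sketch: dec_if_positive()
is a stand-in written here to mimic the kernel's atomic_dec_if_positive(),
and struct irq_budget, budget_irq_alloc(), alloc_ok() and alloc_fail() are
hypothetical names that do not exist in the driver.

```c
#include <errno.h>
#include <stdatomic.h>

/*
 * Userspace stand-in for atomic_dec_if_positive(): decrement only if the
 * result stays >= 0, and return the old value minus one either way.
 */
static int dec_if_positive(atomic_int *v)
{
	int old = atomic_load(v);

	while (old > 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(v, &old, old - 1))
			break;
	}
	return old - 1;
}

struct irq_budget {
	atomic_int available;	/* interrupts still allowed on this link */
};

/* Hypothetical hardware allocators: nonzero is a valid interrupt number. */
static int alloc_ok(void)   { return 42; }
static int alloc_fail(void) { return 0; }

static int budget_irq_alloc(struct irq_budget *b, int (*hw_alloc)(void),
			    int *hw_irq)
{
	int irq;

	if (dec_if_positive(&b->available) < 0)
		return -ENOSPC;		/* budget exhausted */

	irq = hw_alloc();
	if (!irq) {
		atomic_fetch_add(&b->available, 1);	/* give the slot back */
		return -ENXIO;
	}

	*hw_irq = irq;
	return 0;
}
```

Reserving from the counter first keeps the budget check and the hardware
call from racing: a failed hardware allocation only has to undo one
increment.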

Re: [PATCH 2/4] ocxl: Access interrupt trigger page from xive directly

2020-04-03 Thread Greg Kurz

On Thu,  2 Apr 2020 17:43:50 +0200
Frederic Barrat  wrote:

> We can access the trigger page through standard APIs so let's use it
> and avoid saving it when allocating the interrupt. It will also allow
> allocation to be simplified in a later patch.
> 
> Signed-off-by: Frederic Barrat 
> ---

Reviewed-by: Greg Kurz 

>  drivers/misc/ocxl/afu_irq.c | 8 ++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/misc/ocxl/afu_irq.c b/drivers/misc/ocxl/afu_irq.c
> index 70f8f1c3929d..b30ec0ef7be7 100644
> --- a/drivers/misc/ocxl/afu_irq.c
> +++ b/drivers/misc/ocxl/afu_irq.c
> @@ -2,6 +2,7 @@
>  // Copyright 2017 IBM Corp.
>  #include 
>  #include 
> +#include 
>  #include "ocxl_internal.h"
>  #include "trace.h"
>  
> @@ -196,13 +197,16 @@ void ocxl_afu_irq_free_all(struct ocxl_context *ctx)
>  
>  u64 ocxl_afu_irq_get_addr(struct ocxl_context *ctx, int irq_id)
>  {
> + struct xive_irq_data *xd;
>   struct afu_irq *irq;
>   u64 addr = 0;
>  
>   mutex_lock(&ctx->irq_lock);
>   irq = idr_find(&ctx->irq_idr, irq_id);
> - if (irq)
> - addr = irq->trigger_page;
> + if (irq) {
> + xd = irq_get_handler_data(irq->virq);
> + addr = xd ? xd->trig_page : 0;
> + }
>   mutex_unlock(&ctx->irq_lock);
>   return addr;
>  }

Re: [PATCH v4 00/25] Add support for OpenCAPI Persistent Memory devices

2020-04-02 Thread Greg Kurz

On Thu, 02 Apr 2020 21:06:01 +1100
Michael Ellerman  wrote:

> "Oliver O'Halloran"  writes:
> > On Thu, Apr 2, 2020 at 2:42 PM Michael Ellerman  wrote:
> >> "Alastair D'Silva"  writes:
> >> >> -Original Message-
> >> >> From: Dan Williams 
> >> >>
> >> >> On Sun, Mar 29, 2020 at 10:23 PM Alastair D'Silva 
> >> >> wrote:
> >> >> >
> >> >> > *snip*
> >> >> Are OPAL calls similar to ACPI DSMs? I.e. methods for the OS to invoke
> >> >> platform firmware services? What's Skiboot?
> >> >
> >> > Yes, OPAL is the interface to firmware for POWER. Skiboot is the 
> >> > open-source (and only) implementation of OPAL.
> >>
> >>   https://github.com/open-power/skiboot
> >>
> >> In particular the tokens for calls are defined here:
> >>
> >>   https://github.com/open-power/skiboot/blob/master/include/opal-api.h#L220
> >>
> >> And you can grep for the token to find the implementation:
> >>
> >>   https://github.com/open-power/skiboot/blob/master/hw/npu2-opencapi.c#L2328
> >
> > I'm not sure I'd encourage anyone to read npu2-opencapi.c. I find it
> > hard enough to follow even with access to the workbooks.
> 
> Compared to certain firmwares that run on certain other platforms it's
> actually pretty readable code ;)
> 

Forth rocks ! ;-)

> > There's an OPAL call API reference here:
> > http://open-power.github.io/skiboot/doc/opal-api/index.html
> 
> Even better.
> 
> cheers

Re: [PATCH v2] powerpc/XIVE: SVM: share the event-queue page with the Hypervisor.

2020-03-26 Thread Greg Kurz

On Thu, 26 Mar 2020 01:38:47 -0700
Ram Pai  wrote:

> The XIVE interrupt controller uses an Event Queue (EQ) to enqueue event
> notifications when an exception occurs. The EQ is a single memory page
> provided by the O/S defining a circular buffer, one per server and
> priority couple.
> 
> On baremetal, the EQ page is configured with an OPAL call. On pseries,
> an extra hop is necessary and the guest OS uses the hcall
> H_INT_SET_QUEUE_CONFIG to configure the XIVE interrupt controller.
> 
> The XIVE controller being Hypervisor privileged, it will not be allowed
> to enqueue event notifications for a Secure VM unless the EQ pages are
> shared by the Secure VM.
> 
> Hypervisor/Ultravisor still requires support for the TIMA and ESB page
> fault handlers. Until this is complete, QEMU can use the emulated XIVE
> device for Secure VMs, option "kernel_irqchip=off" on the QEMU pseries
> machine.
> 
> Cc: kvm-...@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: Michael Ellerman 
> Cc: Thiago Jung Bauermann 
> Cc: Michael Anderson 
> Cc: Sukadev Bhattiprolu 
> Cc: Alexey Kardashevskiy 
> Cc: Paul Mackerras 
> Cc: Greg Kurz 
> Cc: Cedric Le Goater 
> Cc: David Gibson 
> Signed-off-by: Ram Pai 
> 
> v2: better description of the patch from Cedric.
> ---

Reviewed-by: Greg Kurz 

>  arch/powerpc/sysdev/xive/spapr.c | 7 +++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/arch/powerpc/sysdev/xive/spapr.c 
> b/arch/powerpc/sysdev/xive/spapr.c
> index 55dc61c..608b52f 100644
> --- a/arch/powerpc/sysdev/xive/spapr.c
> +++ b/arch/powerpc/sysdev/xive/spapr.c
> @@ -26,6 +26,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>  
>  #include "xive-internal.h"
>  
> @@ -501,6 +503,9 @@ static int xive_spapr_configure_queue(u32 target, struct 
> xive_q *q, u8 prio,
>   rc = -EIO;
>   } else {
>   q->qpage = qpage;
> + if (is_secure_guest())
> + uv_share_page(PHYS_PFN(qpage_phys),
> + 1 << xive_alloc_order(order));
>   }
>  fail:
>   return rc;
> @@ -534,6 +539,8 @@ static void xive_spapr_cleanup_queue(unsigned int cpu, 
> struct xive_cpu *xc,
>  hw_cpu, prio);
>  
>   alloc_order = xive_alloc_order(xive_queue_shift);
> + if (is_secure_guest())
> + uv_unshare_page(PHYS_PFN(__pa(q->qpage)), 1 << alloc_order);
>   free_pages((unsigned long)q->qpage, alloc_order);
>   q->qpage = NULL;
>  }
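
As a side note on the page count passed to uv_share_page() and
uv_unshare_page(): it is 1 << alloc_order, so a queue no larger than one
page shares exactly one page. The sketch below is stand-alone and only
illustrative. The 64K page shift, the body of alloc_order_sketch() and
both function names are assumptions made for the example, not the
kernel's definitions.

```c
/* Assumed page size for this sketch: 64K pages. */
#define PAGE_SHIFT_SKETCH	16

/* Assumed rule: order 0 unless the queue is larger than a page. */
static unsigned int alloc_order_sketch(unsigned int queue_shift)
{
	if (queue_shift > PAGE_SHIFT_SKETCH)
		return queue_shift - PAGE_SHIFT_SKETCH;
	return 0;
}

/* Number of pages to share with (or unshare from) the hypervisor. */
static unsigned long queue_npages_sketch(unsigned int queue_shift)
{
	return 1UL << alloc_order_sketch(queue_shift);
}
```

With these assumptions a 64K queue maps to a single page, and the same
count is used on both the share and the unshare side.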

Re: [PATCH] powerpc/prom_init: Include the termination message in ibm,os-term RTAS call

2020-03-25 Thread Greg Kurz

On Wed, 25 Mar 2020 21:06:22 +1100
Michael Ellerman  wrote:

> Fabiano Rosas  writes:
> 
> > QEMU can now print the ibm,os-term message[1], so let's include it in
> > the RTAS call. E.g.:
> >
> >   qemu-system-ppc64: OS terminated: Switch to secure mode failed.
> >
> > 1- https://git.qemu.org/?p=qemu.git;a=commitdiff;h=a4c3791ae0
> >
> > Signed-off-by: Fabiano Rosas 
> > ---
> >  arch/powerpc/kernel/prom_init.c | 3 +++
> >  1 file changed, 3 insertions(+)
> 
> I have this queued:
>   https://patchwork.ozlabs.org/patch/1253390/
> 
> Which I think does the same thing?
> 

Alexey's patch also sets os_term_args.nret as indicated in PAPR.
Even if QEMU's handler for "ibm,os-term" doesn't seem to have
a use for nret, I think it's better to stick to the spec.
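
For illustration, here is a minimal user-space sketch of an argument
buffer filled with nret set as well. The struct only mirrors the
token/nargs/nret/args fields used in the patch. The 16-entry args array,
the helper name fill_os_term_args(), the use of htonl() as a stand-in for
cpu_to_be32() and the value 1 for nret (a single status word) are
assumptions made for the example, not the kernel code.

```c
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>	/* htonl()/ntohl(): stand-ins for cpu_to_be32()/be32_to_cpu() */

/* Simplified RTAS argument buffer; every field is stored big-endian. */
struct rtas_args_sketch {
	uint32_t token;
	uint32_t nargs;
	uint32_t nret;
	uint32_t args[16];	/* inputs first, then return words */
};

/* Hypothetical helper: prepare an ibm,os-term call carrying one message pointer. */
static void fill_os_term_args(struct rtas_args_sketch *a, uint32_t token,
			      uint32_t msg_phys)
{
	memset(a, 0, sizeof(*a));
	a->token = htonl(token);
	a->nargs = htonl(1);		/* one input: address of the message */
	a->nret = htonl(1);		/* one output: the status word (assumed) */
	a->args[0] = htonl(msg_phys);
}
```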

Cheers,

--
Greg

> cheers
> 
> > diff --git a/arch/powerpc/kernel/prom_init.c 
> > b/arch/powerpc/kernel/prom_init.c
> > index 577345382b23..d543fb6d29c5 100644
> > --- a/arch/powerpc/kernel/prom_init.c
> > +++ b/arch/powerpc/kernel/prom_init.c
> > @@ -1773,6 +1773,9 @@ static void __init prom_rtas_os_term(char *str)
> > if (token == 0)
> > prom_panic("Could not get token for ibm,os-term\n");
> > os_term_args.token = cpu_to_be32(token);
> > +   os_term_args.nargs = cpu_to_be32(1);
> > +   os_term_args.args[0] = cpu_to_be32(__pa(str));
> > +
> > prom_rtas_hcall((uint64_t)&os_term_args);
> >  }
> >  #endif /* CONFIG_PPC_SVM */
> > -- 
> > 2.23.0

Re: [PATCH 1/2] KVM: PPC: Book3S HV: check caller of H_SVM_* Hcalls

2020-03-24 Thread Greg Kurz

On Tue, 24 Mar 2020 10:43:23 +1100
Paul Mackerras  wrote:

> On Fri, Mar 20, 2020 at 01:22:48PM +0100, Greg Kurz wrote:
> > On Fri, 20 Mar 2020 11:26:42 +0100
> > Laurent Dufour  wrote:
> > 
> > > The Hcall named H_SVM_* are reserved to the Ultravisor. However, nothing
> > > prevent a malicious VM or SVM to call them. This could lead to weird 
> > > result
> > > and should be filtered out.
> > > 
> > > Checking the Secure bit of the calling MSR ensure that the call is coming
> > > from either the Ultravisor or a SVM. But any system call made from a SVM
> > > are going through the Ultravisor, and the Ultravisor should filter out
> > > these malicious call. This way, only the Ultravisor is able to make such a
> > > Hcall.
> > 
> > "Ultravisor should filter" ? And what if it doesn't (eg. because of a bug) ?
> > 
> > Shouldn't we also check the HV bit of the calling MSR as well to
> > disambiguate SVM and UV ?
> 
> The trouble with doing that (checking the HV bit) is that KVM does not
> expect to see the HV bit set on an interrupt that occurred while we
> were in the guest, and if it is set, it indicates a serious problem,
> i.e. that an interrupt occurred while we were in the code that
> transitions from host context to guest context, or from guest context
> to host context.  In those cases we don't know how much of the
> transition has been completed and therefore whether we have guest
> values or host values in the CPU registers (GPRs, FPRs/VSRs, SPRs).
> If we do see HV set then KVM reports a severe error to userspace which
> should cause userspace to terminate the guest.
> 
> Therefore the UV should *always* have the HV bit clear in HSRR1/SRR1
> when transitioning to KVM.
> 

Indeed... thanks for the clarification. So I guess we'll just assume
that the UV doesn't reflect these SVM-specific hcalls if they happen
to be issued by the guest.

Cheers,

--
Greg

> Paul.

Re: [PATCH 1/2] KVM: PPC: Book3S HV: check caller of H_SVM_* Hcalls

2020-03-20 Thread Greg Kurz

On Fri, 20 Mar 2020 11:26:42 +0100
Laurent Dufour  wrote:

> The Hcall named H_SVM_* are reserved to the Ultravisor. However, nothing
> prevent a malicious VM or SVM to call them. This could lead to weird result
> and should be filtered out.
> 
> Checking the Secure bit of the calling MSR ensure that the call is coming
> from either the Ultravisor or a SVM. But any system call made from a SVM
> are going through the Ultravisor, and the Ultravisor should filter out
> these malicious call. This way, only the Ultravisor is able to make such a
> Hcall.

"Ultravisor should filter"? And what if it doesn't (e.g. because of a bug)?

Shouldn't we also check the HV bit of the calling MSR to
disambiguate SVM and UV?

> 
> Cc: Bharata B Rao 
> Cc: Paul Mackerras 
> Cc: Benjamin Herrenschmidt 
> Cc: Michael Ellerman 
> Signed-off-by: Laurent Dufour 
> ---
>  arch/powerpc/kvm/book3s_hv.c | 32 +---
>  1 file changed, 21 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 33be4d93248a..43773182a737 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -1074,25 +1074,35 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
>kvmppc_get_gpr(vcpu, 6));
>   break;
>   case H_SVM_PAGE_IN:
> - ret = kvmppc_h_svm_page_in(vcpu->kvm,
> -kvmppc_get_gpr(vcpu, 4),
> -kvmppc_get_gpr(vcpu, 5),
> -kvmppc_get_gpr(vcpu, 6));
> + ret = H_UNSUPPORTED;
> + if (kvmppc_get_srr1(vcpu) & MSR_S)
> + ret = kvmppc_h_svm_page_in(vcpu->kvm,
> +kvmppc_get_gpr(vcpu, 4),
> +kvmppc_get_gpr(vcpu, 5),
> +kvmppc_get_gpr(vcpu, 6));

If calling kvmppc_h_svm_page_in() produces a "weird result" when
the MSR_S bit isn't set, then I think it should do the checking
itself, ie. pass vcpu.

This would also prevent adding that many lines in kvmppc_pseries_do_hcall()
which is a big enough function already. The checking could be done in a
helper in book3s_hv_uvmem.c and used by all UV specific hcalls.
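
Something along these lines, as a stand-alone sketch rather than actual
kernel code: the struct, the MSR_S bit position, the H_UNSUPPORTED value
and the names svm_hcall_allowed() and h_svm_init_start_sketch() are
placeholders chosen so that the fragment compiles on its own.

```c
#include <stdbool.h>
#include <stdint.h>

#define MSR_S		(1ULL << 22)	/* placeholder for the Secure state bit */
#define H_SUCCESS	0L
#define H_UNSUPPORTED	(-67L)		/* placeholder return code */

/* Stand-in for struct kvm_vcpu: only the saved SRR1 matters here. */
struct vcpu_sketch {
	uint64_t srr1;
};

/* Hypothetical common check shared by all UV-specific hcalls. */
static bool svm_hcall_allowed(const struct vcpu_sketch *vcpu)
{
	return (vcpu->srr1 & MSR_S) != 0;
}

/* Each handler takes the vcpu and does the check itself. */
static long h_svm_init_start_sketch(const struct vcpu_sketch *vcpu)
{
	if (!svm_hcall_allowed(vcpu))
		return H_UNSUPPORTED;

	/* ... the real work would go here ... */
	return H_SUCCESS;
}
```

The switch statement would then keep one call per case, and the policy
would live in a single place.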

>   break;
>   case H_SVM_PAGE_OUT:
> - ret = kvmppc_h_svm_page_out(vcpu->kvm,
> - kvmppc_get_gpr(vcpu, 4),
> - kvmppc_get_gpr(vcpu, 5),
> - kvmppc_get_gpr(vcpu, 6));
> + ret = H_UNSUPPORTED;
> + if (kvmppc_get_srr1(vcpu) & MSR_S)
> + ret = kvmppc_h_svm_page_out(vcpu->kvm,
> + kvmppc_get_gpr(vcpu, 4),
> + kvmppc_get_gpr(vcpu, 5),
> + kvmppc_get_gpr(vcpu, 6));
>   break;
>   case H_SVM_INIT_START:
> - ret = kvmppc_h_svm_init_start(vcpu->kvm);
> + ret = H_UNSUPPORTED;
> + if (kvmppc_get_srr1(vcpu) & MSR_S)
> + ret = kvmppc_h_svm_init_start(vcpu->kvm);
>   break;
>   case H_SVM_INIT_DONE:
> - ret = kvmppc_h_svm_init_done(vcpu->kvm);
> + ret = H_UNSUPPORTED;
> + if (kvmppc_get_srr1(vcpu) & MSR_S)
> + ret = kvmppc_h_svm_init_done(vcpu->kvm);
>   break;
>   case H_SVM_INIT_ABORT:
> - ret = kvmppc_h_svm_init_abort(vcpu->kvm);
> + ret = H_UNSUPPORTED;
> + if (kvmppc_get_srr1(vcpu) & MSR_S)
> + ret = kvmppc_h_svm_init_abort(vcpu->kvm);
>   break;
>  
>   default:

Re: [PATCH] KVM: PPC: Book3S HV: Skip kvmppc_uvmem_free if Ultravisor is not supported

2020-03-20 Thread Greg Kurz

On Thu, 19 Mar 2020 19:55:10 -0300
Fabiano Rosas  wrote:

> kvmppc_uvmem_init checks for Ultravisor support and returns early if
> it is not present. Calling kvmppc_uvmem_free at module exit will cause
> an Oops:
> 
> $ modprobe -r kvm-hv
> 
>   Oops: Kernel access of bad area, sig: 11 [#1]
>   
>   NIP:  c0789e90 LR: c0789e8c CTR: c0401030
>   REGS: c03fa7bab9a0 TRAP: 0300   Not tainted  
> (5.6.0-rc6-00033-g6c90b86a745a-dirty)
>   MSR:  90009033   CR: 24002282  XER: 
> 
>   CFAR: c0dae880 DAR: 0008 DSISR: 4000 IRQMASK: 1
>   GPR00: c0789e8c c03fa7babc30 c16fe500 
>   GPR04:  0006  c03faf205c00
>   GPR08:  0001 802d c0080ddde140
>   GPR12: c0401030 c03d9080 0001 
>   GPR16:   00013aad0074 00013aaac978
>   GPR20: 00013aad0070  7fffd1b37158 
>   GPR24: 00014fef0d58  00014fef0cf0 0001
>   GPR28:   c18b2a60 
>   NIP [c0789e90] percpu_ref_kill_and_confirm+0x40/0x170
>   LR [c0789e8c] percpu_ref_kill_and_confirm+0x3c/0x170
>   Call Trace:
>   [c03fa7babc30] [c03faf2064d4] 0xc03faf2064d4 (unreliable)
>   [c03fa7babcb0] [c0400e8c] dev_pagemap_kill+0x6c/0x80
>   [c03fa7babcd0] [c0401064] memunmap_pages+0x34/0x2f0
>   [c03fa7babd50] [c0080548] kvmppc_uvmem_free+0x30/0x80 [kvm_hv]
>   [c03fa7babd80] [c0080ddcef18] kvmppc_book3s_exit_hv+0x20/0x78 
> [kvm_hv]
>   [c03fa7babda0] [c02084d0] sys_delete_module+0x1d0/0x2c0
>   [c03fa7babe20] [c000b9d0] system_call+0x5c/0x68
>   Instruction dump:
>   3fc2001b fb81ffe0 fba1ffe8 fbe1fff8 7c7f1b78 7c9c2378 3bde4560 7fc3f378
>   f8010010 f821ff81 486249a1 6000  7c7d1b78 712a0002 40820084
>   ---[ end trace 5774ef4dc2c98279 ]---
> 
> So this patch checks if kvmppc_uvmem_init actually allocated anything
> before running kvmppc_uvmem_free.
> 
> Fixes: ca9f4942670c ("KVM: PPC: Book3S HV: Support for running secure guests")
> Reported-by: Greg Kurz 
> Signed-off-by: Fabiano Rosas 
> ---

Thanks for the quick fix :)

Tested-by: Greg Kurz 

>  arch/powerpc/kvm/book3s_hv_uvmem.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
> b/arch/powerpc/kvm/book3s_hv_uvmem.c
> index 79b1202b1c62..9d26614b2a77 100644
> --- a/arch/powerpc/kvm/book3s_hv_uvmem.c
> +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
> @@ -806,6 +806,9 @@ int kvmppc_uvmem_init(void)
>  
>  void kvmppc_uvmem_free(void)
>  {
> + if (!kvmppc_uvmem_bitmap)
> + return;
> +
>   memunmap_pages(&kvmppc_uvmem_pgmap);
>   release_mem_region(kvmppc_uvmem_pgmap.res.start,
>  resource_size(&kvmppc_uvmem_pgmap.res));
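
The pattern in this fix, reduced to a user-space sketch: an init routine
that may legitimately return early without allocating anything, paired
with a free routine that checks whether there is anything to tear down.
All names here (uvmem_init_sketch, uvmem_free_sketch, bitmap_sketch,
teardown_calls) are made up for the example and calloc()/free() merely
stand in for the real allocation and teardown.

```c
#include <stdbool.h>
#include <stdlib.h>

static unsigned long *bitmap_sketch;	/* stays NULL if init returned early */
static int teardown_calls;		/* counts real teardowns, for the example */

static int uvmem_init_sketch(bool uv_supported)
{
	if (!uv_supported)
		return 0;		/* nothing allocated, still a success */

	bitmap_sketch = calloc(4, sizeof(*bitmap_sketch));
	return bitmap_sketch ? 0 : -1;
}

static void uvmem_free_sketch(void)
{
	if (!bitmap_sketch)
		return;			/* init never allocated anything */

	teardown_calls++;
	free(bitmap_sketch);
	bitmap_sketch = NULL;
}
```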

[PATCH 3/3] KVM: PPC: Kill kvmppc_ops::mmu_destroy() and kvmppc_mmu_destroy()

2020-03-18 Thread Greg Kurz

These are only used by HV KVM and BookE, and in both cases they are
nops.

Signed-off-by: Greg Kurz 
---
 arch/powerpc/include/asm/kvm_ppc.h |2 --
 arch/powerpc/kvm/book3s.c  |5 -
 arch/powerpc/kvm/book3s_hv.c   |6 --
 arch/powerpc/kvm/book3s_pr.c   |1 -
 arch/powerpc/kvm/booke.c   |5 -
 arch/powerpc/kvm/booke.h   |2 --
 arch/powerpc/kvm/e500.c|1 -
 arch/powerpc/kvm/e500_mmu.c|4 
 arch/powerpc/kvm/e500mc.c  |1 -
 9 files changed, 27 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index 399a657c1bf3..be627367e3bd 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -107,7 +107,6 @@ extern void kvmppc_mmu_map(struct kvm_vcpu *vcpu, u64 
gvaddr, gpa_t gpaddr,
unsigned int gtlb_idx);
 extern void kvmppc_mmu_priv_switch(struct kvm_vcpu *vcpu, int usermode);
 extern void kvmppc_mmu_switch_pid(struct kvm_vcpu *vcpu, u32 pid);
-extern void kvmppc_mmu_destroy(struct kvm_vcpu *vcpu);
 extern int kvmppc_mmu_dtlb_index(struct kvm_vcpu *vcpu, gva_t eaddr);
 extern int kvmppc_mmu_itlb_index(struct kvm_vcpu *vcpu, gva_t eaddr);
 extern gpa_t kvmppc_mmu_xlate(struct kvm_vcpu *vcpu, unsigned int gtlb_index,
@@ -290,7 +289,6 @@ struct kvmppc_ops {
int (*age_hva)(struct kvm *kvm, unsigned long start, unsigned long end);
int (*test_age_hva)(struct kvm *kvm, unsigned long hva);
void (*set_spte_hva)(struct kvm *kvm, unsigned long hva, pte_t pte);
-   void (*mmu_destroy)(struct kvm_vcpu *vcpu);
void (*free_memslot)(struct kvm_memory_slot *free,
 struct kvm_memory_slot *dont);
int (*create_memslot)(struct kvm_memory_slot *slot,
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index d07a8e12fa15..19ccb019eb3b 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -858,11 +858,6 @@ int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, 
pte_t pte)
return 0;
 }
 
-void kvmppc_mmu_destroy(struct kvm_vcpu *vcpu)
-{
-   vcpu->kvm->arch.kvm_ops->mmu_destroy(vcpu);
-}
-
 int kvmppc_core_init_vm(struct kvm *kvm)
 {
 
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 2cefd071b848..48d0bce16164 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -4558,11 +4558,6 @@ void kvmppc_update_lpcr(struct kvm *kvm, unsigned long 
lpcr, unsigned long mask)
}
 }
 
-static void kvmppc_mmu_destroy_hv(struct kvm_vcpu *vcpu)
-{
-   return;
-}
-
 void kvmppc_setup_partition_table(struct kvm *kvm)
 {
unsigned long dw0, dw1;
@@ -5526,7 +5521,6 @@ static struct kvmppc_ops kvm_ops_hv = {
.age_hva  = kvm_age_hva_hv,
.test_age_hva = kvm_test_age_hva_hv,
.set_spte_hva = kvm_set_spte_hva_hv,
-   .mmu_destroy  = kvmppc_mmu_destroy_hv,
.free_memslot = kvmppc_core_free_memslot_hv,
.create_memslot = kvmppc_core_create_memslot_hv,
.init_vm =  kvmppc_core_init_vm_hv,
diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c
index 28e63a68a3dc..5cc88203f435 100644
--- a/arch/powerpc/kvm/book3s_pr.c
+++ b/arch/powerpc/kvm/book3s_pr.c
@@ -2098,7 +2098,6 @@ static struct kvmppc_ops kvm_ops_pr = {
.age_hva  = kvm_age_hva_pr,
.test_age_hva = kvm_test_age_hva_pr,
.set_spte_hva = kvm_set_spte_hva_pr,
-   .mmu_destroy  = kvmppc_mmu_destroy_pr,
.free_memslot = kvmppc_core_free_memslot_pr,
.create_memslot = kvmppc_core_create_memslot_pr,
.init_vm = kvmppc_core_init_vm_pr,
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index 7b27604adadf..8a516d947405 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -2074,11 +2074,6 @@ void kvmppc_booke_vcpu_put(struct kvm_vcpu *vcpu)
kvmppc_clear_dbsr();
 }
 
-void kvmppc_mmu_destroy(struct kvm_vcpu *vcpu)
-{
-   vcpu->kvm->arch.kvm_ops->mmu_destroy(vcpu);
-}
-
 int kvmppc_core_init_vm(struct kvm *kvm)
 {
return kvm->arch.kvm_ops->init_vm(kvm);
diff --git a/arch/powerpc/kvm/booke.h b/arch/powerpc/kvm/booke.h
index 9d3169fbce55..65b4d337d337 100644
--- a/arch/powerpc/kvm/booke.h
+++ b/arch/powerpc/kvm/booke.h
@@ -94,7 +94,6 @@ enum int_class {
 
 void kvmppc_set_pending_interrupt(struct kvm_vcpu *vcpu, enum int_class type);
 
-extern void kvmppc_mmu_destroy_e500(struct kvm_vcpu *vcpu);
 extern int kvmppc_core_emulate_op_e500(struct kvm_run *run,
   struct kvm_vcpu *vcpu,
   unsigned int inst, int *advance);
@@ -102,7 +101,6 @@ extern int kvmppc_core_emulate_mtspr_e500(struct kvm_vcpu 
*vcpu, int sprn,
  ulong spr_val);
 extern int kvmppc_core_emulate_mfspr_e500(struct

[PATCH 2/3] KVM: PPC: Move kvmppc_mmu_init() PR KVM

2020-03-18 Thread Greg Kurz

This is only relevant to PR KVM. Make it obvious by moving the
function declaration to the Book3s header and rename it with
a _pr suffix.

Signed-off-by: Greg Kurz 
---
 arch/powerpc/include/asm/kvm_ppc.h|1 -
 arch/powerpc/kvm/book3s.h |1 +
 arch/powerpc/kvm/book3s_32_mmu_host.c |2 +-
 arch/powerpc/kvm/book3s_64_mmu_host.c |2 +-
 arch/powerpc/kvm/book3s_pr.c  |2 +-
 5 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index bc2494e5710a..399a657c1bf3 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -108,7 +108,6 @@ extern void kvmppc_mmu_map(struct kvm_vcpu *vcpu, u64 
gvaddr, gpa_t gpaddr,
 extern void kvmppc_mmu_priv_switch(struct kvm_vcpu *vcpu, int usermode);
 extern void kvmppc_mmu_switch_pid(struct kvm_vcpu *vcpu, u32 pid);
 extern void kvmppc_mmu_destroy(struct kvm_vcpu *vcpu);
-extern int kvmppc_mmu_init(struct kvm_vcpu *vcpu);
 extern int kvmppc_mmu_dtlb_index(struct kvm_vcpu *vcpu, gva_t eaddr);
 extern int kvmppc_mmu_itlb_index(struct kvm_vcpu *vcpu, gva_t eaddr);
 extern gpa_t kvmppc_mmu_xlate(struct kvm_vcpu *vcpu, unsigned int gtlb_index,
diff --git a/arch/powerpc/kvm/book3s.h b/arch/powerpc/kvm/book3s.h
index 3a4613985949..eae259ee49af 100644
--- a/arch/powerpc/kvm/book3s.h
+++ b/arch/powerpc/kvm/book3s.h
@@ -16,6 +16,7 @@ extern int kvm_age_hva_hv(struct kvm *kvm, unsigned long 
start,
 extern int kvm_test_age_hva_hv(struct kvm *kvm, unsigned long hva);
 extern void kvm_set_spte_hva_hv(struct kvm *kvm, unsigned long hva, pte_t pte);
 
+extern int kvmppc_mmu_init_pr(struct kvm_vcpu *vcpu);
 extern void kvmppc_mmu_destroy_pr(struct kvm_vcpu *vcpu);
 extern int kvmppc_core_emulate_op_pr(struct kvm_run *run, struct kvm_vcpu 
*vcpu,
 unsigned int inst, int *advance);
diff --git a/arch/powerpc/kvm/book3s_32_mmu_host.c 
b/arch/powerpc/kvm/book3s_32_mmu_host.c
index d4cb3bcf41b6..e8e7b2c530d1 100644
--- a/arch/powerpc/kvm/book3s_32_mmu_host.c
+++ b/arch/powerpc/kvm/book3s_32_mmu_host.c
@@ -356,7 +356,7 @@ void kvmppc_mmu_destroy_pr(struct kvm_vcpu *vcpu)
 /* From mm/mmu_context_hash32.c */
 #define CTX_TO_VSID(c, id) ((((c) * (897 * 16)) + (id * 0x111)) & 0xffffff)
 
-int kvmppc_mmu_init(struct kvm_vcpu *vcpu)
+int kvmppc_mmu_init_pr(struct kvm_vcpu *vcpu)
 {
struct kvmppc_vcpu_book3s *vcpu3s = to_book3s(vcpu);
int err;
diff --git a/arch/powerpc/kvm/book3s_64_mmu_host.c 
b/arch/powerpc/kvm/book3s_64_mmu_host.c
index 044dd49eeb9d..e452158a18d7 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_host.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_host.c
@@ -384,7 +384,7 @@ void kvmppc_mmu_destroy_pr(struct kvm_vcpu *vcpu)
__destroy_context(to_book3s(vcpu)->context_id[0]);
 }
 
-int kvmppc_mmu_init(struct kvm_vcpu *vcpu)
+int kvmppc_mmu_init_pr(struct kvm_vcpu *vcpu)
 {
struct kvmppc_vcpu_book3s *vcpu3s = to_book3s(vcpu);
int err;
diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c
index db3a87319642..28e63a68a3dc 100644
--- a/arch/powerpc/kvm/book3s_pr.c
+++ b/arch/powerpc/kvm/book3s_pr.c
@@ -1795,7 +1795,7 @@ static int kvmppc_core_vcpu_create_pr(struct kvm_vcpu 
*vcpu)
 
vcpu->arch.shadow_msr = MSR_USER64 & ~MSR_LE;
 
-   err = kvmppc_mmu_init(vcpu);
+   err = kvmppc_mmu_init_pr(vcpu);
if (err < 0)
goto free_shared_page;

[PATCH 1/3] KVM: PPC: Fix kernel crash with PR KVM

2020-03-18 Thread Greg Kurz

With PR KVM, shutting down a VM causes the host kernel to crash:

[  314.219284] BUG: Unable to handle kernel data access on read at 
0xc0080176c638
[  314.219299] Faulting instruction address: 0xc00800d4ddb0
cpu 0x0: Vector: 300 (Data Access) at [c0036da077a0]
pc: c00800d4ddb0: kvmppc_mmu_pte_flush_all+0x68/0xd0 [kvm_pr]
lr: c00800d4dd94: kvmppc_mmu_pte_flush_all+0x4c/0xd0 [kvm_pr]
sp: c0036da07a30
   msr: 90010280b033
   dar: c0080176c638
 dsisr: 4000
  current = 0xc0036d4c
  paca= 0xc1a0   irqmask: 0x03   irq_happened: 0x01
pid   = 1992, comm = qemu-system-ppc
Linux version 5.6.0-master-gku+ (greg@palmb) (gcc version 7.5.0 (Ubuntu 
7.5.0-3ubuntu1~18.04)) #17 SMP Wed Mar 18 13:49:29 CET 2020
enter ? for help
[c0036da07ab0] c00800d4fbe0 kvmppc_mmu_destroy_pr+0x28/0x60 [kvm_pr]
[c0036da07ae0] c008009eab8c kvmppc_mmu_destroy+0x34/0x50 [kvm]
[c0036da07b00] c008009e50c0 kvm_arch_vcpu_destroy+0x108/0x140 [kvm]
[c0036da07b30] c008009d1b50 kvm_vcpu_destroy+0x28/0x80 [kvm]
[c0036da07b60] c008009e4434 kvm_arch_destroy_vm+0xbc/0x190 [kvm]
[c0036da07ba0] c008009d9c2c kvm_put_kvm+0x1d4/0x3f0 [kvm]
[c0036da07c00] c008009da760 kvm_vm_release+0x38/0x60 [kvm]
[c0036da07c30] c0420be0 __fput+0xe0/0x310
[c0036da07c90] c01747a0 task_work_run+0x150/0x1c0
[c0036da07cf0] c014896c do_exit+0x44c/0xd00
[c0036da07dc0] c01492f4 do_group_exit+0x64/0xd0
[c0036da07e00] c0149384 sys_exit_group+0x24/0x30
[c0036da07e20] c000b9d0 system_call+0x5c/0x68

This is caused by a use-after-free in kvmppc_mmu_pte_flush_all()
which dereferences vcpu->arch.book3s which was previously freed by
kvmppc_core_vcpu_free_pr(). This happens because kvmppc_mmu_destroy()
is called after kvmppc_core_vcpu_free() since commit ff030fdf5573
("KVM: PPC: Move kvm_vcpu_init() invocation to common code").

The kvmppc_mmu_destroy() helper calls one of the following depending
on the KVM backend:

- kvmppc_mmu_destroy_hv() which does nothing (Book3s HV)

- kvmppc_mmu_destroy_pr() which undoes the effects of
  kvmppc_mmu_init() (Book3s PR 32-bit)

- kvmppc_mmu_destroy_pr() which undoes the effects of
  kvmppc_mmu_init() (Book3s PR 64-bit)

- kvmppc_mmu_destroy_e500() which does nothing (BookE e500/e500mc)

It turns out that this is only relevant to PR KVM actually. And both
32 and 64 backends need vcpu->arch.book3s to be valid when calling
kvmppc_mmu_destroy_pr(). So instead of calling kvmppc_mmu_destroy()
from kvm_arch_vcpu_destroy(), call kvmppc_mmu_destroy_pr() at the
beginning of kvmppc_core_vcpu_free_pr(). This is consistent with
kvmppc_mmu_init() being the last call in kvmppc_core_vcpu_create_pr().

For the same reason, if kvmppc_core_vcpu_create_pr() returns an
error then this means that kvmppc_mmu_init() was either not called
or failed, in which case kvmppc_mmu_destroy() should not be called.
Drop the line in the error path of kvm_arch_vcpu_create().

Fixes: ff030fdf5573 ("KVM: PPC: Move kvm_vcpu_init() invocation to common code")
Signed-off-by: Greg Kurz 
---
 arch/powerpc/kvm/book3s_pr.c |1 +
 arch/powerpc/kvm/powerpc.c   |2 --
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c
index 729a0f12a752..db3a87319642 100644
--- a/arch/powerpc/kvm/book3s_pr.c
+++ b/arch/powerpc/kvm/book3s_pr.c
@@ -1817,6 +1817,7 @@ static void kvmppc_core_vcpu_free_pr(struct kvm_vcpu 
*vcpu)
 {
struct kvmppc_vcpu_book3s *vcpu_book3s = to_book3s(vcpu);
 
+   kvmppc_mmu_destroy_pr(vcpu);
free_page((unsigned long)vcpu->arch.shared & PAGE_MASK);
 #ifdef CONFIG_KVM_BOOK3S_32_HANDLER
kfree(vcpu->arch.shadow_vcpu);
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 1af96fb5dc6f..302e9dccdd6d 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -759,7 +759,6 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
return 0;
 
 out_vcpu_uninit:
-   kvmppc_mmu_destroy(vcpu);
kvmppc_subarch_vcpu_uninit(vcpu);
return err;
 }
@@ -792,7 +791,6 @@ void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
 
kvmppc_core_vcpu_free(vcpu);
 
-   kvmppc_mmu_destroy(vcpu);
kvmppc_subarch_vcpu_uninit(vcpu);
 }
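
For readers who want the ordering issue in miniature, here is a
user-space sketch, not kernel code. The types and function names
(vcpu_sketch, book3s_sketch, mmu_destroy_sketch, vcpu_free_sketch) are
invented for the example. It only shows that the MMU teardown must run
while the per-vCPU data is still allocated, i.e. as the first step of
the free path, mirroring MMU init being the last step of the create path.

```c
#include <stdlib.h>

struct book3s_sketch {
	int context_id;			/* per-vCPU MMU state */
};

struct vcpu_sketch {
	struct book3s_sketch *book3s;
	int mmu_torn_down;
};

static int vcpu_create_sketch(struct vcpu_sketch *vcpu)
{
	vcpu->book3s = calloc(1, sizeof(*vcpu->book3s));
	if (!vcpu->book3s)
		return -1;
	vcpu->book3s->context_id = 42;	/* MMU init comes last on create */
	vcpu->mmu_torn_down = 0;
	return 0;
}

/* Needs vcpu->book3s to still be valid when it runs. */
static void mmu_destroy_sketch(struct vcpu_sketch *vcpu)
{
	vcpu->book3s->context_id = 0;
	vcpu->mmu_torn_down = 1;
}

/* MMU teardown first, then release the memory it depends on. */
static void vcpu_free_sketch(struct vcpu_sketch *vcpu)
{
	mmu_destroy_sketch(vcpu);
	free(vcpu->book3s);
	vcpu->book3s = NULL;
}
```

Calling mmu_destroy_sketch() after free() instead would dereference
freed memory, which is the situation the backtrace above shows.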

[PATCH 0/3] KVM: PPC: Fix host kernel crash with PR KVM

2020-03-18 Thread Greg Kurz

Recent cleanup from Sean Christopherson introduced a use-after-free
condition that crashes the kernel when shutting down the VM with
PR KVM. It went unnoticed so far because PR isn't tested/used much
these days (mostly used for nested on POWER8, not supported on POWER9
where HV should be used for nested), and other KVM implementations for
ppc are unaffected.

This all boils down to the fact that the path that frees the per-vCPU
MMU data goes through a complex set of indirections. This obfuscates
the code to the point that we didn't realize that the MMU data was
now being freed too early. And worse, most of the indirection isn't
needed because only PR KVM has some MMU data to free when the vCPU is
destroyed.

Fix the issue (patch 1) and simplify the code (patches 2 and 3).

--
Greg

---

Greg Kurz (3):
  KVM: PPC: Fix kernel crash with PR KVM
  KVM: PPC: Move kvmppc_mmu_init() PR KVM
  KVM: PPC: Kill kvmppc_ops::mmu_destroy() and kvmppc_mmu_destroy()


 arch/powerpc/include/asm/kvm_ppc.h|3 ---
 arch/powerpc/kvm/book3s.c |5 -
 arch/powerpc/kvm/book3s.h |1 +
 arch/powerpc/kvm/book3s_32_mmu_host.c |2 +-
 arch/powerpc/kvm/book3s_64_mmu_host.c |2 +-
 arch/powerpc/kvm/book3s_hv.c  |6 --
 arch/powerpc/kvm/book3s_pr.c  |4 ++--
 arch/powerpc/kvm/booke.c  |5 -
 arch/powerpc/kvm/booke.h  |2 --
 arch/powerpc/kvm/e500.c   |1 -
 arch/powerpc/kvm/e500_mmu.c   |4 
 arch/powerpc/kvm/e500mc.c |1 -
 arch/powerpc/kvm/powerpc.c|2 --
 13 files changed, 5 insertions(+), 33 deletions(-)

Re: [PATCH 2/4] powerpc/xive: Fix xmon support on the PowerNV platform

2020-03-10 Thread Greg Kurz

On Fri,  6 Mar 2020 16:01:41 +0100
Cédric Le Goater  wrote:

> The PowerNV platform has multiple IRQ chips and the xmon command
> dumping the state of the XIVE interrupt should only operate on the
> XIVE IRQ chip.
> 
> Fixes: 5896163f7f91 ("powerpc/xmon: Improve output of XIVE interrupts")
> Cc: sta...@vger.kernel.org # v5.4+
> Signed-off-by: Cédric Le Goater 
> ---

Reviewed-by: Greg Kurz 

>  arch/powerpc/sysdev/xive/common.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/arch/powerpc/sysdev/xive/common.c 
> b/arch/powerpc/sysdev/xive/common.c
> index 550baba98ec9..8155adc2225a 100644
> --- a/arch/powerpc/sysdev/xive/common.c
> +++ b/arch/powerpc/sysdev/xive/common.c
> @@ -261,11 +261,15 @@ notrace void xmon_xive_do_dump(int cpu)
>  
>  int xmon_xive_get_irq_config(u32 hw_irq, struct irq_data *d)
>  {
> + struct irq_chip *chip = irq_data_get_irq_chip(d);
>   int rc;
>   u32 target;
>   u8 prio;
>   u32 lirq;
>  
> + if (!is_xive_irq(chip))
> + return -EINVAL;
> +
>   rc = xive_ops->get_irq_config(hw_irq, &target, &prio, &lirq);
>   if (rc) {
>   xmon_printf("IRQ 0x%08x : no config rc=%d\n", hw_irq, rc);

Re: [PATCH 3/4] powerpc/xmon: Add source flags to output of XIVE interrupts

2020-03-10 Thread Greg Kurz

On Fri,  6 Mar 2020 16:01:42 +0100
Cédric Le Goater  wrote:

> Some firmwares or hypervisors can advertise different source
> characteristics. Track their value under XMON. What we are mostly
> interested in is the StoreEOI flag.
> 
> Signed-off-by: Cédric Le Goater 
> ---

Reviewed-by: Greg Kurz 

>  arch/powerpc/sysdev/xive/common.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/sysdev/xive/common.c 
> b/arch/powerpc/sysdev/xive/common.c
> index 8155adc2225a..c865ae554605 100644
> --- a/arch/powerpc/sysdev/xive/common.c
> +++ b/arch/powerpc/sysdev/xive/common.c
> @@ -283,7 +283,10 @@ int xmon_xive_get_irq_config(u32 hw_irq, struct irq_data 
> *d)
>   struct xive_irq_data *xd = irq_data_get_irq_handler_data(d);
>   u64 val = xive_esb_read(xd, XIVE_ESB_GET);
>  
> - xmon_printf("PQ=%c%c",
> + xmon_printf("flags=%c%c%c PQ=%c%c",
> + xd->flags & XIVE_IRQ_FLAG_STORE_EOI ? 'S' : ' ',
> + xd->flags & XIVE_IRQ_FLAG_LSI ? 'L' : ' ',
> + xd->flags & XIVE_IRQ_FLAG_H_INT_ESB ? 'H' : ' ',
>   val & XIVE_ESB_VAL_P ? 'P' : '-',
>   val & XIVE_ESB_VAL_Q ? 'Q' : '-');
>   }

Re: [PATCH 1/4] powerpc/xive: Use XIVE_BAD_IRQ instead of zero to catch non configured IPIs

2020-03-10 Thread Greg Kurz

On Fri,  6 Mar 2020 16:01:40 +0100
Cédric Le Goater  wrote:

> When a CPU is brought up, an IPI number is allocated and recorded
> under the XIVE CPU structure. Invalid IPI numbers are tracked with
> interrupt number 0x0.
> 
> On the PowerNV platform, the interrupt number space starts at 0x10 and
> this works fine. However, on the sPAPR platform, it is possible to
> allocate the interrupt number 0x0 and this raises an issue when CPU 0
> is unplugged. The XIVE spapr driver tracks allocated interrupt numbers
> in a bitmask and it is not correctly updated when interrupt number 0x0
> is freed. It stays allocated and it is then impossible to reallocate.
> 
> Fix by using the XIVE_BAD_IRQ value instead of zero on both platforms.
> 
> Reported-by: David Gibson 
> Fixes: eac1e731b59e ("powerpc/xive: guest exploitation of the XIVE interrupt 
> controller")
> Cc: sta...@vger.kernel.org # v4.14+
> Signed-off-by: Cédric Le Goater 
> ---

This looks mostly good. I'm just wondering about potential oversights:

$ git grep 'if.*hw_i' arch/powerpc/ | egrep -v 'xics|XIVE_BAD_IRQ'
arch/powerpc/kvm/book3s_xive.h: if (out_hw_irq)
arch/powerpc/kvm/book3s_xive.h: if (out_hw_irq)
arch/powerpc/kvm/book3s_xive_template.c:else if (hw_irq && xd->flags & 
XIVE_IRQ_FLAG_EOI_FW)
arch/powerpc/sysdev/xive/common.c:  else if (hw_irq && xd->flags & 
XIVE_IRQ_FLAG_EOI_FW) {

This hw_irq check in xive_do_source_eoi() for example is related to:

/*
 * Note: We pass "0" to the hw_irq argument in order to
 * avoid calling into the backend EOI code which we don't
 * want to do in the case of a re-trigger. Backends typically
 * only do EOI for LSIs anyway.
 */
xive_do_source_eoi(0, xd);

but it can get hw_irq from:

xive_do_source_eoi(xc->hw_ipi, &xc->ipi_data);

It seems that these should use XIVE_BAD_IRQ as well, or am I missing
something ?

arch/powerpc/sysdev/xive/common.c:  if (hw_irq)
arch/powerpc/sysdev/xive/common.c:  if (d->domain != 
xive_irq_domain || hw_irq == 0)
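To illustrate the concern, here is a minimal standalone sketch (not kernel
code) of why a test against zero stops being a reliable "is this IPI
configured ?" check once 0x0 is a valid interrupt number. The struct and
helper names are made up for the example; only XIVE_BAD_IRQ comes from the
patch, and its value is copied as quoted below.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Value as quoted in the patch under review. */
#define XIVE_BAD_IRQ 0x7fff

/* Hypothetical stand-in for the per-CPU state; only hw_ipi matters here. */
struct cpu_state_model {
	uint32_t hw_ipi;
};

/* Old-style check: treats interrupt number 0x0 as "no IPI". */
static bool ipi_configured_zero_check(const struct cpu_state_model *xc)
{
	return xc->hw_ipi != 0;
}

/* Check based on the dedicated sentinel value. */
static bool ipi_configured_sentinel_check(const struct cpu_state_model *xc)
{
	return xc->hw_ipi != XIVE_BAD_IRQ;
}

/* A CPU that was legitimately handed interrupt number 0x0 as its IPI. */
static void demo_ipi_zero(void)
{
	struct cpu_state_model xc = { .hw_ipi = 0 };

	assert(!ipi_configured_zero_check(&xc));    /* wrongly seen as unset */
	assert(ipi_configured_sentinel_check(&xc)); /* correctly seen as set */
}
```

Any remaining "if (hw_irq)" style test has the same shape as the first
helper, which is why the leftover call sites above look worth a second look.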



>  arch/powerpc/sysdev/xive/xive-internal.h |  7 +++
>  arch/powerpc/sysdev/xive/common.c| 12 +++-
>  arch/powerpc/sysdev/xive/native.c|  4 ++--
>  arch/powerpc/sysdev/xive/spapr.c |  4 ++--
>  4 files changed, 14 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/powerpc/sysdev/xive/xive-internal.h 
> b/arch/powerpc/sysdev/xive/xive-internal.h
> index 59cd366e7933..382980f4de2d 100644
> --- a/arch/powerpc/sysdev/xive/xive-internal.h
> +++ b/arch/powerpc/sysdev/xive/xive-internal.h
> @@ -5,6 +5,13 @@
>  #ifndef __XIVE_INTERNAL_H
>  #define __XIVE_INTERNAL_H
>  
> +/*
> + * A "disabled" interrupt should never fire, to catch problems
> + * we set its logical number to this
> + */
> +#define XIVE_BAD_IRQ 0x7fff
> +#define XIVE_MAX_IRQ (XIVE_BAD_IRQ - 1)
> +
>  /* Each CPU carry one of these with various per-CPU state */
>  struct xive_cpu {
>  #ifdef CONFIG_SMP
> diff --git a/arch/powerpc/sysdev/xive/common.c 
> b/arch/powerpc/sysdev/xive/common.c
> index fa49193206b6..550baba98ec9 100644
> --- a/arch/powerpc/sysdev/xive/common.c
> +++ b/arch/powerpc/sysdev/xive/common.c
> @@ -68,13 +68,6 @@ static u32 xive_ipi_irq;
>  /* Xive state for each CPU */
>  static DEFINE_PER_CPU(struct xive_cpu *, xive_cpu);
>  
> -/*
> - * A "disabled" interrupt should never fire, to catch problems
> - * we set its logical number to this
> - */
> -#define XIVE_BAD_IRQ 0x7fff
> -#define XIVE_MAX_IRQ (XIVE_BAD_IRQ - 1)
> -
>  /* An invalid CPU target */
>  #define XIVE_INVALID_TARGET  (-1)
>  
> @@ -1153,7 +1146,7 @@ static int xive_setup_cpu_ipi(unsigned int cpu)
>   xc = per_cpu(xive_cpu, cpu);
>  
>   /* Check if we are already setup */
> - if (xc->hw_ipi != 0)
> + if (xc->hw_ipi != XIVE_BAD_IRQ)
>   return 0;
>  
>   /* Grab an IPI from the backend, this will populate xc->hw_ipi */
> @@ -1190,7 +1183,7 @@ static void xive_cleanup_cpu_ipi(unsigned int cpu, 
> struct xive_cpu *xc)
>   /* Disable the IPI and free the IRQ data */
>  
>   /* Already cleaned up ? */
> - if (xc->hw_ipi == 0)
> + if (xc->hw_ipi == XIVE_BAD_IRQ)
>   return;
>  
>   /* Mask the IPI */
> @@ -1346,6 +1339,7 @@ static int xive_prepare_cpu(unsigned int cpu)
>   if (np)
>   xc->chip_id = of_get_ibm_chip_id(np);
>   of_node_put(np);
> + xc->hw_ipi = XIVE_BAD_IRQ;
>  
>   per_cpu(xive_cpu, cpu) = xc;
>   }
> diff --git a/arch/powerpc/sysdev/xive/native.c 
> b/arch/powerpc/sysdev/xive/native.c
> index 0ff6b739052c..50e1a8e02497 100644
> --- a/arch/powerpc/sysdev/xive/native.c
> +++ b/arch/powerpc/sysdev/xive/native.c
> @@ -312,7 +312,7 @@ static void xive_native_put_ipi(unsigned int cpu, struct 
> xive_cpu *xc)
>   s64 rc;
>  
>   /* Free the IPI */
> - if (!xc->hw_ipi)
> + if (xc->hw_ipi ==

Re: [RFC PATCH v1] powerpc/prom_init: disable XIVE in Secure VM.

2020-03-04 Thread Greg Kurz

On Tue, 3 Mar 2020 10:56:45 -0800
Ram Pai  wrote:

> On Tue, Mar 03, 2020 at 06:45:20PM +0100, Greg Kurz wrote:
> > On Tue, 3 Mar 2020 09:02:05 -0800
> > Ram Pai  wrote:
> > 
> > > On Tue, Mar 03, 2020 at 07:50:08AM +0100, Cédric Le Goater wrote:
> > > > On 3/3/20 12:32 AM, David Gibson wrote:
> > > > > On Fri, Feb 28, 2020 at 11:54:04PM -0800, Ram Pai wrote:
> > > > >> XIVE is not correctly enabled for Secure VM in the KVM Hypervisor 
> > > > >> yet.
> > > > >>
> > > > >> Hence Secure VM, must always default to XICS interrupt controller.
> > > > >>
> > > > >> If XIVE is requested through kernel command line option "xive=on",
> > > > >> override and turn it off.
> > > > >>
> > > > >> If XIVE is the only supported platform interrupt controller; 
> > > > >> specified
> > > > >> through qemu option "ic-mode=xive", simply abort. Otherwise default 
> > > > >> to
> > > > >> XICS.
> > > > > 
> > > > > Uh... the discussion thread here seems to have gotten oddly off
> > > > > track.  
> > > > 
> > > > There seem to be multiple issues. It is difficult to have a clear 
> > > > status.
> > > > 
> > > > > So, to try to clean up some misunderstandings on both sides:
> > > > > 
> > > > >   1) The guest is the main thing that knows that it will be in secure
> > > > >  mode, so it's reasonable for it to conditionally use XIVE based
> > > > >  on that
> > > > 
> > > > FW support is required AFAIUI.
> > > > >   2) The mechanism by which we do it here isn't quite right.  Here the
> > > > >  guest is checking itself that the host only allows XIVE, but we
> > > > >  can't do XIVE and is panic()ing.  Instead, in the SVM case we
> > > > >  should force support->xive to false, and send that in the CAS
> > > > >  request to the host.  We expect the host to just terminate
> > > > >  us because of the mismatch, but this will interact better with
> > > > >  host side options setting policy for panic states and the like.
> > > > >  Essentially an SVM kernel should behave like an old kernel with
> > > > >  no XIVE support at all, at least w.r.t. the CAS irq mode flags.
> > > > 
> > > > Yes. XIVE shouldn't be requested by the guest.
> > > 
> > >   Ok.
> > > 
> > > > This is the last option 
> > > > I proposed but I thought there was some negotiation with the hypervisor
> > > > which is not the case. 
> > > > 
> > > > >   3) Although there are means by which the hypervisor can kind of know
> > > > >  a guest is in secure mode, there's not really an "svm=on" option
> > > > >  on the host side.  For the most part secure mode is based on
> > > > >  discussion directly between the guest and the ultravisor with
> > > > >  almost no hypervisor intervention.
> > > > 
> > > > Is there a negotiation with the ultravisor ? 
> > > 
> > >   The VM has no negotiation with the ultravisor w.r.t CAS.
> > > 
> > > > 
> > > > >   4) I'm guessing the problem with XIVE in SVM mode is that XIVE needs
> > > > >  to write to event queues in guest memory, which would have to be
> > > > >  explicitly shared for secure mode.  That's true whether it's KVM
> > > > >  or qemu accessing the guest memory, so kernel_irqchip=on/off is
> > > > >  entirely irrelevant.
> > > > 
> > > > This problem should be already fixed.
> > > > The XIVE event queues are shared 
> > >   
> > > Yes i have a patch for the guest kernel that shares the event 
> > > queue page with the hypervisor. This is done using the
> > > > UV_SHARE_PAGE ultracall. This patch is not sent out to any mailing
> > > lists yet.
> > 
> > Why ?
> 
> At this point I am not sure if this is the only change, I need to the
> guest kernel.

Maybe but we're already sure that this change is needed. I don't really see
the point in holding this any longer.

>  I also need changes to KVM and to the ultravisor. Its bit
> premature to send the patch without having figured out everything
> to get xive working on a Secure VM.
> 

I'm a bit confused... why did you send this workaround patch in
the first place then ? I mean, this raises a concern and we're
just trying to move forward.

> > 
> > > However the patch by itself does not solve the xive problem
> > > for secure VM.
> > > 
> > 
> > This patch would allow at least to answer Cedric's question about
> > kernel_irqchip=off, since this looks like the only thing needed
> > to make it work.
> 
> hmm.. I am not sure. Are you saying
> (a) patch the guest kernel to share the event queue page
> (b) run the qemu with "kernel_irqchip=off"
> (c) and the guest kernel with "svm=on"
> 
> and it should all work?
> 

Yes.

> RP
>

Re: [EXTERNAL] Re: [RFC PATCH v1] powerpc/prom_init: disable XIVE in Secure VM.

2020-03-04 Thread Greg Kurz

On Tue, 3 Mar 2020 20:18:18 +0100
Cédric Le Goater  wrote:

> >>  BTW: I figured, I don't need this interim patch to disable xive for
> >> secure VM.  Just doing "svm=on xive=off" on the kernel command line is
> >> sufficient for now. *
> >>
> > 
> > No it is not. If the hypervisor doesn't propose XIVE (ie. ic-mode=xive
> > on the QEMU command line), the kernel simply ignores "xive=off".
> 

Ah... sorry for the typo... "doesn't propose XICS" of course :)

> If I am correct, with the option ic-mode=xive, the hypervisor will 
> propose only 'xive' in OV5 and not both 'xive' and 'xics'. But the
> result is the same because xive can not be turned off and "xive=off" 
> is ignored.
> 
> Anyway, it's not the most common case of usage of the QEMU command
> line. I think it's OK to use "xive=off" on the kernel command line 
> for now.
> 

Sure, I just wanted to make things clear. Like you said it's a chicken
switch introduced for distro testing. I think it should not be used
to do anything else. If "svm=1" needs to enforce supported.xive == false
as a temporary workaround, it should do it explicitly.

> C.

Re: [RFC PATCH v1] powerpc/prom_init: disable XIVE in Secure VM.

2020-03-03 Thread Greg Kurz

On Tue, 3 Mar 2020 09:02:05 -0800
Ram Pai  wrote:

> On Tue, Mar 03, 2020 at 07:50:08AM +0100, Cédric Le Goater wrote:
> > On 3/3/20 12:32 AM, David Gibson wrote:
> > > On Fri, Feb 28, 2020 at 11:54:04PM -0800, Ram Pai wrote:
> > >> XIVE is not correctly enabled for Secure VM in the KVM Hypervisor yet.
> > >>
> > >> Hence Secure VM, must always default to XICS interrupt controller.
> > >>
> > >> If XIVE is requested through kernel command line option "xive=on",
> > >> override and turn it off.
> > >>
> > >> If XIVE is the only supported platform interrupt controller; specified
> > >> through qemu option "ic-mode=xive", simply abort. Otherwise default to
> > >> XICS.
> > > 
> > > Uh... the discussion thread here seems to have gotten oddly off
> > > track.  
> > 
> > There seem to be multiple issues. It is difficult to have a clear status.
> > 
> > > So, to try to clean up some misunderstandings on both sides:
> > > 
> > >   1) The guest is the main thing that knows that it will be in secure
> > >  mode, so it's reasonable for it to conditionally use XIVE based
> > >  on that
> > 
> > FW support is required AFAIUI.
> > >   2) The mechanism by which we do it here isn't quite right.  Here the
> > >  guest is checking itself that the host only allows XIVE, but we
> > >  can't do XIVE and is panic()ing.  Instead, in the SVM case we
> > >  should force support->xive to false, and send that in the CAS
> > >  request to the host.  We expect the host to just terminate
> > >  us because of the mismatch, but this will interact better with
> > >  host side options setting policy for panic states and the like.
> > >  Essentially an SVM kernel should behave like an old kernel with
> > >  no XIVE support at all, at least w.r.t. the CAS irq mode flags.
> > 
> > Yes. XIVE shouldn't be requested by the guest.
> 
>   Ok.
> 
> > This is the last option 
> > I proposed but I thought there was some negotiation with the hypervisor
> > which is not the case. 
> > 
> > >   3) Although there are means by which the hypervisor can kind of know
> > >  a guest is in secure mode, there's not really an "svm=on" option
> > >  on the host side.  For the most part secure mode is based on
> > >  discussion directly between the guest and the ultravisor with
> > >  almost no hypervisor intervention.
> > 
> > Is there a negotiation with the ultravisor ? 
> 
>   The VM has no negotiation with the ultravisor w.r.t CAS.
> 
> > 
> > >   4) I'm guessing the problem with XIVE in SVM mode is that XIVE needs
> > >  to write to event queues in guest memory, which would have to be
> > >  explicitly shared for secure mode.  That's true whether it's KVM
> > >  or qemu accessing the guest memory, so kernel_irqchip=on/off is
> > >  entirely irrelevant.
> > 
> > This problem should be already fixed.
> > The XIVE event queues are shared 
>   
> Yes i have a patch for the guest kernel that shares the event 
> queue page with the hypervisor. This is done using the
> UV_SHARE_PAGE ultracall. This patch is not sent out to any mailing
> lists yet.

Why ?

> However the patch by itself does not solve the xive problem
> for secure VM.
> 

This patch would allow at least to answer Cedric's question about
kernel_irqchip=off, since this looks like the only thing needed
to make it work.

> > and the remaining problem with XIVE is the KVM page fault handler 
> > populating the TIMA and ESB pages. Ultravisor doesn't seem to support
> > this feature and this breaks interrupt management in the guest. 
> 
> Yes. This is the bigger issue that needs to be fixed. When the secure guest
> accesses the page associated with the xive memslot, a page fault is
> generated, which the ultravisor reflects to the hypervisor. Hypervisor
> seems to be mapping Hardware-page to that GPA. Unfortunately it is not
> informing the ultravisor of that map.  I am trying to understand the
> root cause. But since I am not sure what more issues I might run into
> after chasing down that issue, I figured its better to disable xive
> support in SVM in the interim.
> 
>  BTW: I figured, I don't need this interim patch to disable xive for
> secure VM.  Just doing "svm=on xive=off" on the kernel command line is
> sufficient for now. *
> 

No it is not. If the hypervisor doesn't propose XIVE (ie. ic-mode=xive
on the QEMU command line), the kernel simply ignores "xive=off".

> 
> > 
> > But, kernel_irqchip=off should work out of the box. It seems it doesn't. 
> > Something to investigate.
> 
> Dont know why. 
> 
> Does this option, disable the chip from interrupting the
> guest directly; instead mediates the interrupt through the hypervisor?
> 
> > 
> > > 
> > >   5) All the above said, having to use XICS is pretty crappy.  You
> > >  should really get working on XIVE support for secure VMs.
> > 
> > Yes. 
> 
> and yes too.
> 
> 
> Summary:  I am dropping this patch for now.
> 
> > 
> > Thanks,
> > 
> > C.
>

Re: [RFC PATCH v1] powerpc/prom_init: disable XIVE in Secure VM.

2020-03-02 Thread Greg Kurz

On Fri, 28 Feb 2020 23:54:04 -0800
Ram Pai  wrote:

> XIVE is not correctly enabled for Secure VM in the KVM Hypervisor yet.
> 

What exactly is "not correctly enabled" ?

> Hence Secure VM, must always default to XICS interrupt controller.
> 

So this is a temporary workaround until whatever isn't working with
XIVE and the Secure VM gets fixed. Maybe worth mentioning this in
some comment.

> If XIVE is requested through kernel command line option "xive=on",
> override and turn it off.
> 

There's no such thing as requesting XIVE with "xive=on". XIVE is
on by default if the platform and CPU support it BUT it can be
disabled with "xive=off" in which case the guest won't request
XIVE except if it's the only available mode.
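The rule above can be sketched as a small standalone decision function.
This is only a model of the behaviour described here, not the prom_init.c
code: the enum and the function name are invented for the example, and
"platform and CPU support" is folded into what the platform offers.

```c
#include <assert.h>
#include <stdbool.h>

/* Interrupt modes offered by the platform (hypothetical encoding). */
enum platform_irq_modes {
	PLATFORM_XICS_ONLY,
	PLATFORM_XIVE_ONLY,
	PLATFORM_DUAL,
};

/*
 * Returns true if the guest ends up requesting XIVE.
 * xive_off models "xive=off" on the kernel command line.
 */
static bool guest_requests_xive(enum platform_irq_modes modes, bool xive_off)
{
	switch (modes) {
	case PLATFORM_XICS_ONLY:
		return false;     /* XIVE is not offered at all */
	case PLATFORM_XIVE_ONLY:
		return true;      /* only mode available, "xive=off" is ignored */
	case PLATFORM_DUAL:
		return !xive_off; /* on by default, the option can disable it */
	}
	return false;
}

static void demo_xive_off(void)
{
	/* There is no "xive=on": without any option XIVE is already used. */
	assert(guest_requests_xive(PLATFORM_DUAL, false));
	/* "xive=off" only has an effect when XICS is also available. */
	assert(!guest_requests_xive(PLATFORM_DUAL, true));
	assert(guest_requests_xive(PLATFORM_XIVE_ONLY, true));
}
```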

> If XIVE is the only supported platform interrupt controller; specified
> through qemu option "ic-mode=xive", simply abort. Otherwise default to
> XICS.
> 

If XIVE is the only option and the guest requests XICS anyway, QEMU is
supposed to print an error message and terminate:

if (!spapr->irq->xics) {
error_report(
"Guest requested unavailable interrupt mode (XICS), either don't set the 
ic-mode machine property or try ic-mode=xics or ic-mode=dual");
exit(EXIT_FAILURE);
}

I think it would be better to end up there rather than aborting.

> Cc: kvm-...@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: Michael Ellerman 
> Cc: Thiago Jung Bauermann 
> Cc: Michael Anderson 
> Cc: Sukadev Bhattiprolu 
> Cc: Alexey Kardashevskiy 
> Cc: Paul Mackerras 
> Cc: Greg Kurz 
> Cc: Cedric Le Goater 
> Cc: David Gibson 
> Signed-off-by: Ram Pai 
> ---
>  arch/powerpc/kernel/prom_init.c | 43 
> -
>  1 file changed, 30 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
> index 5773453..dd96c82 100644
> --- a/arch/powerpc/kernel/prom_init.c
> +++ b/arch/powerpc/kernel/prom_init.c
> @@ -805,6 +805,18 @@ static void __init early_cmdline_parse(void)
>  #endif
>   }
>  
> +#ifdef CONFIG_PPC_SVM
> + opt = prom_strstr(prom_cmd_line, "svm=");
> + if (opt) {
> + bool val;
> +
> + opt += sizeof("svm=") - 1;
> + if (!prom_strtobool(opt, &val))
> + prom_svm_enable = val;
> + prom_printf("svm =%d\n", prom_svm_enable);
> + }
> +#endif /* CONFIG_PPC_SVM */
> +
>  #ifdef CONFIG_PPC_PSERIES
>   prom_radix_disable = !IS_ENABLED(CONFIG_PPC_RADIX_MMU_DEFAULT);
>   opt = prom_strstr(prom_cmd_line, "disable_radix");
> @@ -823,23 +835,22 @@ static void __init early_cmdline_parse(void)
>   if (prom_radix_disable)
>   prom_debug("Radix disabled from cmdline\n");
>  
> - opt = prom_strstr(prom_cmd_line, "xive=off");
> - if (opt) {

A comment to explain why we currently need to limit ourselves to using
XICS would be appreciated.

> +#ifdef CONFIG_PPC_SVM
> + if (prom_svm_enable) {
>   prom_xive_disable = true;
> - prom_debug("XIVE disabled from cmdline\n");
> + prom_debug("XIVE disabled in Secure VM\n");
>   }
> -#endif /* CONFIG_PPC_PSERIES */
> -
> -#ifdef CONFIG_PPC_SVM
> - opt = prom_strstr(prom_cmd_line, "svm=");
> - if (opt) {
> - bool val;
> +#endif /* CONFIG_PPC_SVM */
>  
> - opt += sizeof("svm=") - 1;
> - if (!prom_strtobool(opt, &val))
> - prom_svm_enable = val;
> + if (!prom_xive_disable) {
> + opt = prom_strstr(prom_cmd_line, "xive=off");
> + if (opt) {
> + prom_xive_disable = true;
> + prom_debug("XIVE disabled from cmdline\n");
> + }
>   }
> -#endif /* CONFIG_PPC_SVM */
> +
> +#endif /* CONFIG_PPC_PSERIES */
>  }
>  
>  #ifdef CONFIG_PPC_PSERIES
> @@ -1251,6 +1262,12 @@ static void __init prom_parse_xive_model(u8 val,
>   break;
>   case OV5_FEAT(OV5_XIVE_EXPLOIT): /* Only Exploitation mode */
>   prom_debug("XIVE - exploitation mode supported\n");
> +
> +#ifdef CONFIG_PPC_SVM
> + if (prom_svm_enable)
> + prom_panic("WARNING: xive unsupported in Secure VM\n");

Change the prom_panic() line into a break. The guest will ask for XICS and QEMU
will terminate nicely. Maybe still print out a warning since QEMU won't mention
the Secure VM aspect of things.
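A rough standalone model of the suggested flow, assuming hypothetical
names throughout (this is neither the prom_init.c nor the QEMU code): in
the exploitation-only case a Secure VM leaves the XIVE support flag
cleared, so the request falls back to XICS and the hypervisor side is the
one that refuses when XICS is not available.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the support flags sent in the CAS request. */
struct irq_support_model {
	bool xive;
};

/* Guest side, "only exploitation mode" case. */
static void parse_xive_exploit_only(struct irq_support_model *support,
				    bool svm_enabled)
{
	if (svm_enabled)
		return; /* the suggested "break": support->xive stays false */

	support->xive = true;
}

/* Hypervisor side: a guest asking for XICS needs XICS to be available. */
static bool hypervisor_accepts(const struct irq_support_model *support,
			       bool xics_available)
{
	return support->xive || xics_available;
}

static void demo_svm_on_xive_only_machine(void)
{
	struct irq_support_model support = { .xive = false };

	parse_xive_exploit_only(&support, true);
	assert(!support.xive);                        /* guest asks for XICS */
	assert(!hypervisor_accepts(&support, false)); /* hypervisor terminates */
}
```

The point of the design is that the guest never has to panic by itself:
the mismatch is reported where the machine configuration is known.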

> +#endif /* CONFIG_PPC_SVM */
> +
>   if (prom_xive_disable) {
>   /*
>* If we __have__ to do XIVE, we're better off ignoring

Re: [PATCH v3 04/27] ocxl: Remove unnecessary externs

2020-02-26 Thread Greg Kurz

On Wed, 26 Feb 2020 22:15:23 +0800
'Baoquan He'  wrote:

> On 02/26/20 at 10:01am, Greg Kurz wrote:
> > On Wed, 26 Feb 2020 19:26:34 +1100
> > "Alastair D'Silva"  wrote:
> > 
> > > > -Original Message-
> > > > From: Baoquan He 
> > > > Sent: Wednesday, 26 February 2020 7:15 PM
> > > > To: Alastair D'Silva 
> > > > Cc: alast...@d-silva.org; Aneesh Kumar K . V
> > > > ; Oliver O'Halloran ;
> > > > Benjamin Herrenschmidt ; Paul Mackerras
> > > > ; Michael Ellerman ; Frederic
> > > > Barrat ; Andrew Donnellan ;
> > > > Arnd Bergmann ; Greg Kroah-Hartman
> > > > ; Dan Williams ;
> > > > Vishal Verma ; Dave Jiang
> > > > ; Ira Weiny ; Andrew Morton
> > > > ; Mauro Carvalho Chehab
> > > > ; David S. Miller ;
> > > > Rob Herring ; Anton Blanchard ;
> > > > Krzysztof Kozlowski ; Mahesh Salgaonkar
> > > > ; Madhavan Srinivasan
> > > > ; Cédric Le Goater ; Anju T
> > > > Sudhakar ; Hari Bathini
> > > > ; Thomas Gleixner ; Greg
> > > > Kurz ; Nicholas Piggin ; Masahiro
> > > > Yamada ; Alexey Kardashevskiy
> > > > ; linux-ker...@vger.kernel.org; linuxppc-
> > > > d...@lists.ozlabs.org; linux-nvd...@lists.01.org; linux...@kvack.org
> > > > Subject: Re: [PATCH v3 04/27] ocxl: Remove unnecessary externs
> > > > 
> > > > On 02/21/20 at 02:26pm, Alastair D'Silva wrote:
> > > > > From: Alastair D'Silva 
> > > > >
> > > > > Function declarations don't need externs, remove the existing ones so
> > > > > they are consistent with newer code
> > > > >
> > > > > Signed-off-by: Alastair D'Silva 
> > > > > ---
> > > > >  arch/powerpc/include/asm/pnv-ocxl.h | 32 
> > > > > ++---
> > > > >  include/misc/ocxl.h |  6 +++---
> > > > >  2 files changed, 18 insertions(+), 20 deletions(-)
> > > > >
> > > > > diff --git a/arch/powerpc/include/asm/pnv-ocxl.h
> > > > > b/arch/powerpc/include/asm/pnv-ocxl.h
> > > > > index 0b2a6707e555..b23c99bc0c84 100644
> > > > > --- a/arch/powerpc/include/asm/pnv-ocxl.h
> > > > > +++ b/arch/powerpc/include/asm/pnv-ocxl.h
> > > > > @@ -9,29 +9,27 @@
> > > > >  #define PNV_OCXL_TL_BITS_PER_RATE   4
> > > > >  #define PNV_OCXL_TL_RATE_BUF_SIZE
> > > > ((PNV_OCXL_TL_MAX_TEMPLATE+1) * PNV_OCXL_TL_BITS_PER_RATE / 8)
> > > > >
> > > > > -extern int pnv_ocxl_get_actag(struct pci_dev *dev, u16 *base, u16
> > > > *enabled,
> > > > > - u16 *supported);
> > > > 
> > > > It works w or w/o extern when declare functions. Searching 'extern'
> > > > under include can find so many functions with 'extern' adding. Do we 
> > > > have
> > > a
> > > > explicit standard if we should add or remove 'exter' in function
> > > declaration?
> > > > 
> > > > I have no objection to this patch, just want to make clear so that I can
> > > handle
> > > > it w/o confusion.
> > > > 
> > > > Thanks
> > > > Baoquan
> > > > 
> > > 
> > > For the OpenCAPI driver, we have settled on not having 'extern' on
> > > functions.
> > > 
> > > I don't think I've seen a standard that supports or refutes this, but it
> > > does not value add.
> > > 
> > 
> > FWIW this is a warning condition for checkpatch:
> > 
> > $ ./scripts/checkpatch.pl --strict -f include/misc/ocxl.h
> 
> Good to know, thanks.
> 
> I didn't know checkpatch.pl can run on header file directly. Tried to
> check patch with '--strict -f', the below info doesn't appear. But it

Hmm... -f is to check a source file, not a patch... What did you try
exactly ?

> does give out below information when run on header file.
> 
> > 
> > [...]
> > 
> > CHECK: extern prototypes should be avoided in .h files
> > #176: FILE: include/misc/ocxl.h:176:
> > +extern int ocxl_afu_irq_alloc(struct ocxl_context *ctx, int *irq_id);
> > 
> > [...]
> > 
>

Re: [PATCH v3 04/27] ocxl: Remove unnecessary externs

2020-02-26 Thread Greg Kurz

On Wed, 26 Feb 2020 19:26:34 +1100
"Alastair D'Silva"  wrote:

> > -Original Message-
> > From: Baoquan He 
> > Sent: Wednesday, 26 February 2020 7:15 PM
> > To: Alastair D'Silva 
> > Cc: alast...@d-silva.org; Aneesh Kumar K . V
> > ; Oliver O'Halloran ;
> > Benjamin Herrenschmidt ; Paul Mackerras
> > ; Michael Ellerman ; Frederic
> > Barrat ; Andrew Donnellan ;
> > Arnd Bergmann ; Greg Kroah-Hartman
> > ; Dan Williams ;
> > Vishal Verma ; Dave Jiang
> > ; Ira Weiny ; Andrew Morton
> > ; Mauro Carvalho Chehab
> > ; David S. Miller ;
> > Rob Herring ; Anton Blanchard ;
> > Krzysztof Kozlowski ; Mahesh Salgaonkar
> > ; Madhavan Srinivasan
> > ; Cédric Le Goater ; Anju T
> > Sudhakar ; Hari Bathini
> > ; Thomas Gleixner ; Greg
> > Kurz ; Nicholas Piggin ; Masahiro
> > Yamada ; Alexey Kardashevskiy
> > ; linux-ker...@vger.kernel.org; linuxppc-
> > d...@lists.ozlabs.org; linux-nvd...@lists.01.org; linux...@kvack.org
> > Subject: Re: [PATCH v3 04/27] ocxl: Remove unnecessary externs
> > 
> > On 02/21/20 at 02:26pm, Alastair D'Silva wrote:
> > > From: Alastair D'Silva 
> > >
> > > Function declarations don't need externs, remove the existing ones so
> > > they are consistent with newer code
> > >
> > > Signed-off-by: Alastair D'Silva 
> > > ---
> > >  arch/powerpc/include/asm/pnv-ocxl.h | 32 ++---
> > >  include/misc/ocxl.h |  6 +++---
> > >  2 files changed, 18 insertions(+), 20 deletions(-)
> > >
> > > diff --git a/arch/powerpc/include/asm/pnv-ocxl.h
> > > b/arch/powerpc/include/asm/pnv-ocxl.h
> > > index 0b2a6707e555..b23c99bc0c84 100644
> > > --- a/arch/powerpc/include/asm/pnv-ocxl.h
> > > +++ b/arch/powerpc/include/asm/pnv-ocxl.h
> > > @@ -9,29 +9,27 @@
> > >  #define PNV_OCXL_TL_BITS_PER_RATE   4
> > >  #define PNV_OCXL_TL_RATE_BUF_SIZE
> > ((PNV_OCXL_TL_MAX_TEMPLATE+1) * PNV_OCXL_TL_BITS_PER_RATE / 8)
> > >
> > > -extern int pnv_ocxl_get_actag(struct pci_dev *dev, u16 *base, u16
> > *enabled,
> > > - u16 *supported);
> > 
> > It works w or w/o extern when declare functions. Searching 'extern'
> > under include can find so many functions with 'extern' adding. Do we have
> a
> > explicit standard if we should add or remove 'exter' in function
> declaration?
> > 
> > I have no objection to this patch, just want to make clear so that I can
> handle
> > it w/o confusion.
> > 
> > Thanks
> > Baoquan
> > 
> 
> For the OpenCAPI driver, we have settled on not having 'extern' on
> functions.
> 
> I don't think I've seen a standard that supports or refutes this, but it
> does not value add.
> 

FWIW this is a warning condition for checkpatch:

$ ./scripts/checkpatch.pl --strict -f include/misc/ocxl.h

[...]

CHECK: extern prototypes should be avoided in .h files
#176: FILE: include/misc/ocxl.h:176:
+extern int ocxl_afu_irq_alloc(struct ocxl_context *ctx, int *irq_id);

[...]
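As a small standalone illustration of why the keyword adds nothing on
function prototypes: the two declarations below are compatible
redeclarations of the same function, so the compiler treats them
identically. The function name is made up for the example.

```c
#include <assert.h>

/* Both lines declare the same function with external linkage. */
extern int example_irq_alloc(int ctx_id);
int example_irq_alloc(int ctx_id);

/* Single definition matching both declarations above. */
int example_irq_alloc(int ctx_id)
{
	return ctx_id + 1;
}

static void demo_extern(void)
{
	assert(example_irq_alloc(41) == 42);
}
```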

Re: QEMU/KVM snapshot restore bug

2020-02-12 Thread Greg Kurz

On Tue, 11 Feb 2020 04:57:52 +0100
dftxbs3e  wrote:

> Hello,
> 
> I took a snapshot of a ppc64 (big endian) VM from a ppc64 (little endian) 
> host using `virsh snapshot-create-as --domain  --name `
> 

A big endian guest doing XIVE ?!? I'm pretty sure we didn't do much testing, if
any, on such a setup... What distro is used in the VM ?

> Then I restarted my system and tried restoring the snapshot:
> 
> # virsh snapshot-revert --domain  --snapshotname 
> error: internal error: process exited while connecting to monitor: 
> 2020-02-11T03:18:08.110582Z qemu-system-ppc64: KVM_SET_DEVICE_ATTR failed: 
> Group 3 attr 0x1309: Device or resource busy
> 2020-02-11T03:18:08.110605Z qemu-system-ppc64: error while loading state for 
> instance 0x0 of device 'spapr'
> 2020-02-11T03:18:08.112843Z qemu-system-ppc64: Error -1 while loading VM state
> 

This indicates that QEMU failed to configure the source targeting
for the HW interrupt 0x1309, which is an MSI interrupt used by
a PCI device plugged in the default PHB. In particular, -EBUSY means

-EBUSY:  No CPU available to serve interrupt

> And dmesg shows each time the restore command is executed:
> 
> [  180.176606] WARNING: CPU: 16 PID: 5528 at 
> arch/powerpc/kvm/book3s_xive.c:345 xive_try_pick_queue+0x40/0xb8 [kvm]

This warning means that we have a vCPU without a configured event queue.

Since kvmppc_xive_select_target() is trying all vCPUs before bailing out
with -EBUSY, you might be seeing several WARNINGs (1 per vCPU) in dmesg,
correct ?

Anyway, this looks wrong since QEMU is supposed to have already configured
the event queues at this point... Not sure what's happening here...
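The "try all vCPUs, then give up with -EBUSY" behaviour can be modelled
as below. This is a simplified standalone sketch, not the book3s_xive.c
code: the types and function names are invented, and the exact error codes
of the real per-vCPU helper are not modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-vCPU state: only whether an event queue is configured. */
struct vcpu_model {
	bool queue_configured;
};

/* Returns 0 if this vCPU can take the interrupt, non-zero otherwise. */
static int try_pick_queue_model(const struct vcpu_model *v)
{
	return v->queue_configured ? 0 : -1;
}

/*
 * Try the requested server first, then every vCPU in turn.
 * On success *server holds the vCPU that was picked.
 */
static int select_target_model(const struct vcpu_model *vcpus, size_t n,
			       size_t *server)
{
	size_t i;

	if (*server < n && try_pick_queue_model(&vcpus[*server]) == 0)
		return 0;

	for (i = 0; i < n; i++) {
		if (try_pick_queue_model(&vcpus[i]) == 0) {
			*server = i;
			return 0;
		}
	}

	return -EBUSY; /* no CPU available to serve the interrupt */
}

static void demo_no_queue_anywhere(void)
{
	struct vcpu_model vcpus[4] = { { false }, { false }, { false }, { false } };
	size_t server = 0;

	assert(select_target_model(vcpus, 4, &server) == -EBUSY);
}
```

In this model every failed attempt corresponds to one WARNING in dmesg,
hence the question about seeing one per vCPU.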

> [  180.176608] Modules linked in: vhost_net vhost tap kvm_hv kvm xt_CHECKSUM 
> xt_MASQUERADE nf_nat_tftp nf_conntrack_tftp tun bridge 8021q garp mrp stp llc 
> rfkill nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip6t_REJECT 
> nf_reject_ipv6 ip6t_rpfilter ipt_REJECT nf_reject_ipv4 xt_conntrack 
> ebtable_nat ebtable_broute ip6table_nat ip6table_mangle ip6table_raw 
> ip6table_security iptable_nat nf_nat iptable_mangle iptable_raw 
> iptable_security nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nfnetlink 
> ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter sunrpc 
> raid1 at24 regmap_i2c snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg 
> joydev snd_hda_codec snd_hda_core ofpart snd_hwdep crct10dif_vpmsum snd_seq 
> ipmi_powernv powernv_flash ipmi_devintf snd_seq_device mtd ipmi_msghandler 
> rtc_opal snd_pcm opal_prd i2c_opal snd_timer snd soundcore lz4 lz4_compress 
> zram ip_tables xfs libcrc32c dm_crypt amdgpu ast drm_vram_helper mfd_core 
> i2c_algo_bit gpu_sched drm_kms_helper mpt3sas
> [  180.176652]  syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm 
> vmx_crypto tg3 crc32c_vpmsum nvme raid_class scsi_transport_sas nvme_core 
> drm_panel_orientation_quirks i2c_core fuse
> [  180.176663] CPU: 16 PID: 5528 Comm: qemu-system-ppc Not tainted 
> 5.4.17-200.fc31.ppc64le #1
> [  180.176665] NIP:  c0080a883c80 LR: c0080a886db8 CTR: 
> c0080a88a9e0
> [  180.176667] REGS: c00767a17890 TRAP: 0700   Not tainted  
> (5.4.17-200.fc31.ppc64le)
> [  180.176668] MSR:  90029033   CR: 48224248 
>  XER: 2004
> [  180.176673] CFAR: c0080a886db4 IRQMASK: 0 
>GPR00: c0080a886db8 c00767a17b20 c0080a8aed00 
> c0002005468a4480 
>GPR04:    
> 0001 
>GPR08: c0002007142b2400 c0002007142b2400  
> c0080a8910f0 
>GPR12: c0080a88a488 c007fffed000  
>  
>GPR16: 000149524180 739bda78 739bda30 
> 025c 
>GPR20:  0003 c0002006f13a 
>  
>GPR24: 1359  c002f8c96c38 
> c002f8c8 
>GPR28:  c0002006f13a c0002006f13a4038 
> c00767a17be4 
> [  180.176688] NIP [c0080a883c80] xive_try_pick_queue+0x40/0xb8 [kvm]
> [  180.176693] LR [c0080a886db8] kvmppc_xive_select_target+0x100/0x210 
> [kvm]
> [  180.176694] Call Trace:
> [  180.176696] [c00767a17b20] [c00767a17b70] 0xc00767a17b70 
> (unreliable)
> [  180.176701] [c00767a17b70] [c0080a88b420] 
> kvmppc_xive_native_set_attr+0xf98/0x1760 [kvm]
> [  180.176705] [c00767a17cc0] [c0080a86392c] 
> kvm_device_ioctl+0xf4/0x180 [kvm]
> [  180.176710] [c00767a17d10] [c05380b0] do_vfs_ioctl+0xaa0/0xd90
> [  180.176712] [c00767a17dd0] [c0538464] sys_ioctl+0xc4/0x110
> [  180.176716] [c00767a17e20] [c000b9d0] system_call+0x5c/0x68
> [  180.176717] Instruction dump:
> [  180.176719] 794ad182 0b0a 2c29 41820080 89490010 2c0a 41820074 
> 78883664 
> [  180.176723] 7d094214 e9480070 7d470074 78e7d182 <0b07> 2c2a 
> 41820054

Re: [PATCH 18/18] powerpc/fault: Use analyse_instr() to check for store with updates to sp

2020-02-07 Thread Greg Kurz

On Thu, 19 Dec 2019 01:11:33 +1100
Daniel Axtens  wrote:

> Jordan Niethe  writes:
> 
> > A user-mode access to an address a long way below the stack pointer is
> > only valid if the instruction is one that would update the stack pointer
> > to the address accessed. This is checked by directly looking at the
> > instructions op-code. As a result is does not take into account prefixed
> > instructions. Instead of looking at the instruction our self, use
> > analyse_instr() determine if this a store instruction that will update
> > the stack pointer.
> >
> > Something to note is that there currently are not any store with update
> > prefixed instructions. Actually there is no plan for prefixed
> > update-form loads and stores. So this patch is probably not needed but
> > it might be preferable to use analyse_instr() rather than open coding
> > the test anyway.
> 
> Yes please. I was looking through this code recently and was
> horrified. This improves things a lot and I think is justification
> enough as-is.
> 

Except it doesn't work... I'm now experiencing a systematic crash of
systemd at boot in my fedora31 guest:

[3.322912] systemd[1]: segfault (11) at 73eaf550 nip 7ce4d42f8d78 lr 
9d82c098fc0 code 1 in libsystemd-shared-243.so[7ce4d415+2e]
[3.323112] systemd[1]: code: 0480 6042 3c4c001e 3842edb0 7c0802a6 
3d81fff0 fb81ffe0 fba1ffe8 
[3.323244] systemd[1]: code: fbc1fff0 fbe1fff8 f8010010 7c200b78  
7c216000 4082fff8 f801ff71 

f801f001 is

0x1a8d78 : stdu    r0,-4096(r1)

which analyse_instr() is supposed to decode as a STORE that
updates r1 so we should be good... Unfortunately analyse_instr()
forbids partial register sets, since it might return op->val
based on some register content depending on the instruction:

/* Following cases refer to regs->gpr[], so we need all regs */
if (!FULL_REGS(regs))
return -1;

analyse_instr() was introduced with instruction emulation in mind, which
goes far beyond the need we have in store_updates_sp(). In particular, the
fault path doesn't care about the register content at all...

Not sure how to cope with that correctly (refactor analyse_instr() ? ) but
until someone comes up with a solution, please don't merge this patch.
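For reference, the faulting word f801f001 can be decoded by hand with a
check that needs no register content at all. The sketch below is a
standalone userspace approximation of the open-coded test that the patch
removes, limited to the D-form and DS-form stores (the X-form cases under
opcode 31 are left out for brevity). The helper names are mine.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Major opcodes of the store-with-update instructions handled here. */
#define OP_STWU  37
#define OP_STBU  39
#define OP_STHU  45
#define OP_STFSU 53
#define OP_STFDU 55
#define OP_STD   62 /* std or stdu, told apart by the low two bits */

static unsigned int insn_major(uint32_t inst)
{
	return inst >> 26;
}

static unsigned int insn_ra(uint32_t inst)
{
	return (inst >> 16) & 0x1f;
}

/* Sign-extended DS-form displacement (low two bits are not part of it). */
static int32_t insn_ds_disp(uint32_t inst)
{
	int32_t d = inst & 0xfffc;

	return (d & 0x8000) ? d - 0x10000 : d;
}

/* True if inst is a D/DS-form store with update using r1 as base. */
static bool store_updates_sp_sketch(uint32_t inst)
{
	if (insn_ra(inst) != 1)
		return false;

	switch (insn_major(inst)) {
	case OP_STWU:
	case OP_STBU:
	case OP_STHU:
	case OP_STFSU:
	case OP_STFDU:
		return true;
	case OP_STD:
		return (inst & 3) == 1; /* 0 is std, 1 is stdu */
	}
	return false;
}

static void demo_decode_faulting_insn(void)
{
	/* stdu r0,-4096(r1), the instruction quoted above */
	assert(store_updates_sp_sketch(0xf801f001u));
	assert(insn_ds_disp(0xf801f001u) == -4096);
}
```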

Cheers,

--
Greg

> Regards,
> Daniel
> >
> > Signed-off-by: Jordan Niethe 
> > ---
> >  arch/powerpc/mm/fault.c | 39 +++
> >  1 file changed, 11 insertions(+), 28 deletions(-)
> >
> > diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> > index b5047f9b5dec..cb78b3ca1800 100644
> > --- a/arch/powerpc/mm/fault.c
> > +++ b/arch/powerpc/mm/fault.c
> > @@ -41,37 +41,17 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  
> >  /*
> >   * Check whether the instruction inst is a store using
> >   * an update addressing form which will update r1.
> >   */
> > -static bool store_updates_sp(unsigned int inst)
> > +static bool store_updates_sp(struct instruction_op *op)
> >  {
> > -   /* check for 1 in the rA field */
> > -   if (((inst >> 16) & 0x1f) != 1)
> > -   return false;
> > -   /* check major opcode */
> > -   switch (inst >> 26) {
> > -   case OP_STWU:
> > -   case OP_STBU:
> > -   case OP_STHU:
> > -   case OP_STFSU:
> > -   case OP_STFDU:
> > -   return true;
> > -   case OP_STD:/* std or stdu */
> > -   return (inst & 3) == 1;
> > -   case OP_31:
> > -   /* check minor opcode */
> > -   switch ((inst >> 1) & 0x3ff) {
> > -   case OP_31_XOP_STDUX:
> > -   case OP_31_XOP_STWUX:
> > -   case OP_31_XOP_STBUX:
> > -   case OP_31_XOP_STHUX:
> > -   case OP_31_XOP_STFSUX:
> > -   case OP_31_XOP_STFDUX:
> > +   if (GETTYPE(op->type) == STORE) {
> > +   if ((op->type & UPDATE) && (op->update_reg == 1))
> > return true;
> > -   }
> > }
> > return false;
> >  }
> > @@ -278,14 +258,17 @@ static bool bad_stack_expansion(struct pt_regs *regs, 
> > unsigned long address,
> >  
> > if ((flags & FAULT_FLAG_WRITE) && (flags & FAULT_FLAG_USER) &&
> > access_ok(nip, sizeof(*nip))) {
> > -   unsigned int inst;
> > +   unsigned int inst, sufx;
> > +   struct instruction_op op;
> > int res;
> >  
> > pagefault_disable();
> > -   res = __get_user_inatomic(inst, nip);
> > +   res = __get_user_instr_inatomic(inst, sufx, nip);
> > pagefault_enable();
> > -   if (!res)
> > -   return !store_updates_sp(inst);
> > +   if (!res) {
> > +   analyse_instr(&op, uregs, inst, sufx);
> > +   return !store_updates_sp(&op);
> > +   }
> > *must_retry = true;
> > }
> > return true;
> > -- 
> > 2.20.1

Re: [PATCH v2 05/27] powerpc: Map & release OpenCAPI LPC memory

2020-01-21 Thread Greg Kurz

On Tue, 21 Jan 2020 17:46:12 +1100
Andrew Donnellan  wrote:

> On 3/12/19 2:46 pm, Alastair D'Silva wrote:
> > From: Alastair D'Silva 
> > 
> > This patch adds platform support to map & release LPC memory.
> 
> Might want to explain what LPC is.
> 
> Otherwise:
> 
> Reviewed-by: Andrew Donnellan 
> 
> > 
> > Signed-off-by: Alastair D'Silva 
> > ---
> >   arch/powerpc/include/asm/pnv-ocxl.h   |  2 ++
> >   arch/powerpc/platforms/powernv/ocxl.c | 42 +++
> >   2 files changed, 44 insertions(+)
> > 
> > diff --git a/arch/powerpc/include/asm/pnv-ocxl.h 
> > b/arch/powerpc/include/asm/pnv-ocxl.h
> > index 7de82647e761..f8f8ffb48aa8 100644
> > --- a/arch/powerpc/include/asm/pnv-ocxl.h
> > +++ b/arch/powerpc/include/asm/pnv-ocxl.h
> > @@ -32,5 +32,7 @@ extern int pnv_ocxl_spa_remove_pe_from_cache(void 
> > *platform_data, int pe_handle)
> >   
> >   extern int pnv_ocxl_alloc_xive_irq(u32 *irq, u64 *trigger_addr);
> >   extern void pnv_ocxl_free_xive_irq(u32 irq);
> > +extern u64 pnv_ocxl_platform_lpc_setup(struct pci_dev *pdev, u64 size);
> > +extern void pnv_ocxl_platform_lpc_release(struct pci_dev *pdev);
> 
> nit: I don't think these need to be extern?
> 
> 

And even if they were, as verified by checkpatch:

"extern prototypes should be avoided in .h files"
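
To illustrate why the qualifier is redundant: function declarations have
external linkage by default in C, so both forms below declare the same
function. A tiny standalone sketch (demo_setup is a made-up name, not one
of the pnv_ocxl functions):

```c
/* Both lines declare the same function; "extern" adds nothing here. */
extern int demo_setup(int size);
int demo_setup(int size);

/* Single definition, compatible with both declarations above. */
int demo_setup(int size)
{
	return size * 2;
}
```

The compiler accepts the two declarations as compatible redeclarations,
which is why dropping extern from a header does not change anything for
the callers.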

Re: [PATCH] powerpc/xive: discard ESB load value when interrupt is invalid

2020-01-14 Thread Greg Kurz

On Tue, 14 Jan 2020 08:44:54 +0100
Cédric Le Goater  wrote:

> On 1/14/20 2:14 AM, Michael Ellerman wrote:
> > Cédric Le Goater  writes:
> >> On 1/13/20 2:01 PM, Cédric Le Goater wrote:
> >>> From: Frederic Barrat 
> >>>
> >>> A load on an ESB page returning all 1's means that the underlying
> >>> device has invalidated the access to the PQ state of the interrupt
> >>> through mmio. It may happen, for example when querying a PHB interrupt
> >>> while the PHB is in an error state.
> >>>
> >>> In that case, we should consider the interrupt to be invalid when
> >>> checking its state in the irq_get_irqchip_state() handler.
> >>
> >>
> >> and we need also these tags :
> >>
> >> Fixes: da15c03b047d ("powerpc/xive: Implement get_irqchip_state method for 
> >> XIVE to fix shutdown race")
> >> Cc: sta...@vger.kernel.org # v5.3+
> > 
> > I added those, although it's v5.4+, as the offending commit was first
> > included in v5.4-rc1.
> 
> Ah yes. I mistook the merge tag of the branch used for the PR (v5.3-rc2)
> 

You might want to use 'git tag --contains':

[greg@bahia kernel-linus]$ git tag --contains da15c03b047d
for-linus
kvm-5.4-2
next-20191118
next-20191126
tags/kvm-5.4-1
tags/kvm-5.4-2
v5.4
v5.4-rc1
v5.4-rc2
v5.4-rc3
v5.4-rc4
v5.4-rc5
v5.4-rc6
v5.4-rc7
v5.4-rc8
v5.5-rc1

> Thanks,
> 
> C. 
>

Re: [PATCH v3] ocxl: Fix potential memory leak on context creation

2019-12-09 Thread Greg Kurz

On Mon,  9 Dec 2019 11:55:13 +0100
Frederic Barrat  wrote:

> If we couldn't fully init a context, we were leaking memory.
> 
> Fixes: b9721d275cc2 ("ocxl: Allow external drivers to use OpenCAPI contexts")
> Signed-off-by: Frederic Barrat 
> ---

Reviewed-by: Greg Kurz 

> Changelog:
> v3:
>   code cleanup (Greg)
> v2:
>   reset context pointer in case of allocation failure (Andrew)
> 
> 
>  drivers/misc/ocxl/context.c | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/misc/ocxl/context.c b/drivers/misc/ocxl/context.c
> index 994563a078eb..de8a66b9d76b 100644
> --- a/drivers/misc/ocxl/context.c
> +++ b/drivers/misc/ocxl/context.c
> @@ -10,18 +10,17 @@ int ocxl_context_alloc(struct ocxl_context **context, 
> struct ocxl_afu *afu,
>   int pasid;
>   struct ocxl_context *ctx;
>  
> - *context = kzalloc(sizeof(struct ocxl_context), GFP_KERNEL);
> - if (!*context)
> + ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
> + if (!ctx)
>   return -ENOMEM;
>  
> - ctx = *context;
> -
>   ctx->afu = afu;
>   mutex_lock(&afu->contexts_lock);
>   pasid = idr_alloc(&afu->contexts_idr, ctx, afu->pasid_base,
>   afu->pasid_base + afu->pasid_max, GFP_KERNEL);
>   if (pasid < 0) {
>   mutex_unlock(&afu->contexts_lock);
> + kfree(ctx);
>   return pasid;
>   }
>   afu->pasid_count++;
> @@ -43,6 +42,7 @@ int ocxl_context_alloc(struct ocxl_context **context, 
> struct ocxl_afu *afu,
>* duration of the life of the context
>*/
>   ocxl_afu_get(afu);
> + *context = ctx;
>   return 0;
>  }
>  EXPORT_SYMBOL_GPL(ocxl_context_alloc);
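
The shape of the fix, as I read it: build the object through a local
pointer, free it on every error path, and only store it in the caller's
pointer once nothing can fail anymore. A userspace sketch of that pattern,
with calloc()/free() standing in for kzalloc()/kfree(), and a made-up
demo_ctx type and "fail" flag in place of the real idr_alloc() failure:

```c
#include <errno.h>
#include <stdlib.h>

struct demo_ctx {
	int id;
};

/* Hypothetical allocator; "fail" simulates a late initialisation error. */
static int demo_ctx_alloc(struct demo_ctx **out, int id, int fail)
{
	struct demo_ctx *ctx;

	ctx = calloc(1, sizeof(*ctx));
	if (!ctx)
		return -ENOMEM;

	ctx->id = id;
	if (fail) {
		/* error path: release the object, leave *out untouched */
		free(ctx);
		return -EINVAL;
	}

	/* success: publish the fully initialised object */
	*out = ctx;
	return 0;
}
```

On failure the caller's pointer keeps whatever value it had before the
call, so there is no dangling or half-initialised context to clean up.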

Re: [PATCH] powerpc/xive: skip ioremap() of ESB pages for LSI interrupts

2019-12-04 Thread Greg Kurz

On Thu,  5 Dec 2019 00:30:56 +1100 (AEDT)
Michael Ellerman  wrote:

> On Tue, 2019-12-03 at 16:36:42 UTC, Cédric Le Goater wrote:
> > The PCI INTx interrupts and other LSI interrupts are handled differently
> > under a sPAPR platform. When the interrupt source characteristics are
> > queried, the hypervisor returns an H_INT_ESB flag to inform the OS
> > that it should be using the H_INT_ESB hcall for interrupt management
> > and not loads and stores on the interrupt ESB pages.
> > 
> > A default -1 value is returned for the addresses of the ESB pages. The
> > driver ignores this condition today and performs a bogus IO mapping.
> > Recent changes and the DEBUG_VM configuration option make the bug
> > visible with :
> > 
> > [0.015518] kernel BUG at 
> > arch/powerpc/include/asm/book3s/64/pgtable.h:612!
> > [0.015578] Oops: Exception in kernel mode, sig: 5 [#1]
> > [0.015627] LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP NR_CPUS=1024 NUMA 
> > pSeries
> > [0.015697] Modules linked in:
> > [0.015739] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 
> > 5.4.0-0.rc6.git0.1.fc32.ppc64le #1
> > [0.015812] NIP:  c0f63294 LR: c0f62e44 CTR: 
> > 
> > [0.015889] REGS: c000fa45f0d0 TRAP: 0700   Not tainted  
> > (5.4.0-0.rc6.git0.1.fc32.ppc64le)
> > [0.015971] MSR:  82029033   CR: 
> > 44000424  XER: 
> > [0.016050] CFAR: c0f63128 IRQMASK: 0
> > [0.016050] GPR00: c0f62e44 c000fa45f360 c1be5400 
> > 
> > [0.016050] GPR04: c19c7d38 c000fa340030 fa330009 
> > c1c15e18
> > [0.016050] GPR08: 0040 ffe0  
> > 8418dd352dbd190f
> > [0.016050] GPR12:  c1e0 c00a8006 
> > c00a8006
> > [0.016050] GPR16:  81ae c1c24d98 
> > 
> > [0.016050] GPR20: c00a8007 c1cafca0 c00a8007 
> > 
> > [0.016050] GPR24: c00a8008 c00a8008 c1cafca8 
> > c00a8008
> > [0.016050] GPR28: c000fa32e010 c00a8006  
> > c000fa33
> > [0.016711] NIP [c0f63294] ioremap_page_range+0x4c4/0x6e0
> > [0.016778] LR [c0f62e44] ioremap_page_range+0x74/0x6e0
> > [0.016846] Call Trace:
> > [0.016876] [c000fa45f360] [c0f62e44] 
> > ioremap_page_range+0x74/0x6e0 (unreliable)
> > [0.016969] [c000fa45f460] [c00934bc] do_ioremap+0x8c/0x120
> > [0.017037] [c000fa45f4b0] [c00938e8] 
> > __ioremap_caller+0x128/0x140
> > [0.017116] [c000fa45f500] [c00931a0] ioremap+0x30/0x50
> > [0.017184] [c000fa45f520] [c00d1380] 
> > xive_spapr_populate_irq_data+0x170/0x260
> > [0.017263] [c000fa45f5c0] [c00cc90c] 
> > xive_irq_domain_map+0x8c/0x170
> > [0.017344] [c000fa45f600] [c0219124] 
> > irq_domain_associate+0xb4/0x2d0
> > [0.017424] [c000fa45f690] [c0219fe0] 
> > irq_create_mapping+0x1e0/0x3b0
> > [0.017506] [c000fa45f730] [c021ad6c] 
> > irq_create_fwspec_mapping+0x27c/0x3e0
> > [0.017586] [c000fa45f7c0] [c021af68] 
> > irq_create_of_mapping+0x98/0xb0
> > [0.017666] [c000fa45f830] [c08d4e48] 
> > of_irq_parse_and_map_pci+0x168/0x230
> > [0.017746] [c000fa45f910] [c0075428] 
> > pcibios_setup_device+0x88/0x250
> > [0.017826] [c000fa45f9a0] [c0077b84] 
> > pcibios_setup_bus_devices+0x54/0x100
> > [0.017906] [c000fa45fa10] [c00793f0] 
> > __of_scan_bus+0x160/0x310
> > [0.017973] [c000fa45faf0] [c0075fc0] 
> > pcibios_scan_phb+0x330/0x390
> > [0.018054] [c000fa45fba0] [c139217c] pcibios_init+0x8c/0x128
> > [0.018121] [c000fa45fc20] [c00107b0] 
> > do_one_initcall+0x60/0x2c0
> > [0.018201] [c000fa45fcf0] [c1384624] 
> > kernel_init_freeable+0x290/0x378
> > [0.018280] [c000fa45fdb0] [c0010d24] kernel_init+0x2c/0x148
> > [0.018348] [c000fa45fe20] [c000bdbc] 
> > ret_from_kernel_thread+0x5c/0x80
> > [0.018427] Instruction dump:
> > [0.018468] 41820014 3920fe7f 7d494838 7d290074 7929d182 f8e10038 
> > 69290001 0b09
> > [0.018552] 7a098420 0b09 7bc95960 7929a802 <0b09> 7fc68b78 
> > e8610048 7dc47378
> > 
> > Cc: sta...@vger.kernel.org # v4.14+
> > Fixes: bed81ee181dd ("powerpc/xive: introduce H_INT_ESB hcall")
> > Signed-off-by: Cédric Le Goater 
> 
> Applied to powerpc fixes, thanks.
> 
> https://git.kernel.org/powerpc/c/b67a95f2abff0c34e5667c15ab8900de73d8d087
> 

My R-b tag is missing... I guess I didn't review it quickly enough :)

> cheers

Re: [PATCH] powerpc/xive: skip ioremap() of ESB pages for LSI interrupts

2019-12-04 Thread Greg Kurz

r.c
> @@ -392,20 +392,28 @@ static int xive_spapr_populate_irq_data(u32 hw_irq, 
> struct xive_irq_data *data)
>   data->esb_shift = esb_shift;
>   data->trig_page = trig_page;
>  
> + data->hw_irq = hw_irq;
> +

This is a side effect in the case where the XIVE_IRQ_FLAG_H_INT_ESB flag
isn't set and ioremap() fails. But I guess a sane caller shouldn't look
at data->hw_irq if this function fails in the first place, so:

Reviewed-by: Greg Kurz 

>   /*
>* No chip-id for the sPAPR backend. This has an impact how we
>* pick a target. See xive_pick_irq_target().
>*/
>   data->src_chip = XIVE_INVALID_CHIP_ID;
>  
> + /*
> +  * When the H_INT_ESB flag is set, the H_INT_ESB hcall should
> +  * be used for interrupt management. Skip the remapping of the
> +  * ESB pages which are not available.
> +  */
> + if (data->flags & XIVE_IRQ_FLAG_H_INT_ESB)
> + return 0;
> +
>   data->eoi_mmio = ioremap(data->eoi_page, 1u << data->esb_shift);
>   if (!data->eoi_mmio) {
>   pr_err("Failed to map EOI page for irq 0x%x\n", hw_irq);
>   return -ENOMEM;
>   }
>  
> - data->hw_irq = hw_irq;
> -
>   /* Full function page supports trigger */
>   if (flags & XIVE_SRC_TRIGGER) {
>   data->trig_mmio = data->eoi_mmio;

Re: [Very RFC 40/46] powernv/npu: Don't drop refcount when looking up GPU pci_devs

2019-11-27 Thread Greg Kurz

On Wed, 27 Nov 2019 10:47:45 +0100
Frederic Barrat  wrote:

> 
> 
> Le 27/11/2019 à 10:33, Greg Kurz a écrit :
> > On Wed, 27 Nov 2019 10:10:13 +0100
> > Frederic Barrat  wrote:
> > 
> >>
> >>
> >> Le 27/11/2019 à 09:24, Greg Kurz a écrit :
> >>> On Wed, 27 Nov 2019 18:09:40 +1100
> >>> Alexey Kardashevskiy  wrote:
> >>>
> >>>>
> >>>>
> >>>> On 20/11/2019 12:28, Oliver O'Halloran wrote:
> >>>>> The comment here implies that we don't need to take a ref to the pci_dev
> >>>>> because the ioda_pe will always have one. This implies that the current
> >>>>> expectation is that the pci_dev for an NPU device will *never* be torn
> >>>>> down since the ioda_pe having a ref to the device will prevent the
> >>>>> release function from being called.
> >>>>>
> >>>>> In other words, the desired behaviour here appears to be leaking a ref.
> >>>>>
> >>>>> Nice!
> >>>>
> >>>>
> >>>> There is a history: https://patchwork.ozlabs.org/patch/1088078/
> >>>>
> >>>> We did not fix anything in particular then, we do not seem to be fixing
> >>>> anything now (in other words - we cannot test it in a normal natural
> >>>> way). I'd drop this one.
> >>>>
> >>>
> >>> Yeah, I didn't fix anything at the time. Just reverted to the ref
> >>> count behavior we had before:
> >>>
> >>> https://patchwork.ozlabs.org/patch/829172/
> >>>
> >>> Frederic recently posted his take on the same topic from the OpenCAPI
> >>> point of view:
> >>>
> >>> http://patchwork.ozlabs.org/patch/1198947/
> >>>
> >>> He seems to indicate the NPU devices as the real culprit because
> >>> nobody ever cared for them to be removable. Fixing that seems to be
> >>> a chore nobody really wants to address obviously... :-\
> >>
> >>
> >> I had taken a stab at not leaking a ref for the nvlink devices and do
> >> the proper thing regarding ref counting (i.e. fixing all the callers of
> >> get_pci_dev() to drop the reference when they were done). With that, I
> >> could see that the ref count of the nvlink devices could drop to 0
> >> (calling remove for the device in /sys) and that the devices could go away.
> >>
> >> But then, I realized it's not necessarily desirable at this point. There
> >> are several comments in the code saying the npu devices (for nvlink)
> >> don't go away, there's no device release callback defined when it seems
> >> there should be, at least to handle releasing PEs. All in all, it
> >> seems that some work would be needed. And if it hasn't been required by
> >> now...
> >>
> > 
> > If everyone is ok with leaking a reference in the NPU case, I guess
> > this isn't a problem. But if we move forward with Oliver's patch, a
> > pci_dev_put() would be needed for OpenCAPI, correct ?
> 
> 
> No, these code paths are nvlink-only.
> 

Oh yes indeed. Then this patch and yours fit well together :)

>Fred
> 
> 
> 
> >> Fred
> >>
> >>
> >>>>
> >>>>
> >>>>>
> >>>>> Signed-off-by: Oliver O'Halloran 
> >>>>> ---
> >>>>>arch/powerpc/platforms/powernv/npu-dma.c | 11 +++
> >>>>>1 file changed, 3 insertions(+), 8 deletions(-)
> >>>>>
> >>>>> diff --git a/arch/powerpc/platforms/powernv/npu-dma.c 
> >>>>> b/arch/powerpc/platforms/powernv/npu-dma.c
> >>>>> index 72d3749da02c..2eb6e6d45a98 100644
> >>>>> --- a/arch/powerpc/platforms/powernv/npu-dma.c
> >>>>> +++ b/arch/powerpc/platforms/powernv/npu-dma.c
> >>>>> @@ -28,15 +28,10 @@ static struct pci_dev *get_pci_dev(struct 
> >>>>> device_node *dn)
> >>>>> break;
> >>>>>
> >>>>> /*
> >>>>> -* pci_get_domain_bus_and_slot() increased the reference count 
> >>>>> of
> >>>>> -* the PCI device, but callers don't need that actually as the 
> >>>>> PE
> >>>>> -* already holds a reference to the device. Since callers aren't
> >>>>> -* aware of the reference count change, call pci_dev_put() now 
> >>>>> to
> >>>>> -* avoid leaks.
> >>>>> +* NB: for_each_pci_dev() elevates the pci_dev refcount.
> >>>>> +* Caller is responsible for dropping the ref when it's
> >>>>> +* finished with it.
> >>>>>  */
> >>>>> -   if (pdev)
> >>>>> -   pci_dev_put(pdev);
> >>>>> -
> >>>>> return pdev;
> >>>>>}
> >>>>>
> >>>>>
> >>>>
> >>>
> >>
> > 
>

Re: [Very RFC 40/46] powernv/npu: Don't drop refcount when looking up GPU pci_devs

2019-11-27 Thread Greg Kurz

On Wed, 27 Nov 2019 10:10:13 +0100
Frederic Barrat  wrote:

> 
> 
> Le 27/11/2019 à 09:24, Greg Kurz a écrit :
> > On Wed, 27 Nov 2019 18:09:40 +1100
> > Alexey Kardashevskiy  wrote:
> > 
> >>
> >>
> >> On 20/11/2019 12:28, Oliver O'Halloran wrote:
> >>> The comment here implies that we don't need to take a ref to the pci_dev
> >>> because the ioda_pe will always have one. This implies that the current
> >>> expectation is that the pci_dev for an NPU device will *never* be torn
> >>> down since the ioda_pe having a ref to the device will prevent the
> >>> release function from being called.
> >>>
> >>> In other words, the desired behaviour here appears to be leaking a ref.
> >>>
> >>> Nice!
> >>
> >>
> >> There is a history: https://patchwork.ozlabs.org/patch/1088078/
> >>
> >> We did not fix anything in particular then, we do not seem to be fixing
> >> anything now (in other words - we cannot test it in a normal natural
> >> way). I'd drop this one.
> >>
> > 
> > Yeah, I didn't fix anything at the time. Just reverted to the ref
> > count behavior we had before:
> > 
> > https://patchwork.ozlabs.org/patch/829172/
> > 
> > Frederic recently posted his take on the same topic from the OpenCAPI
> > point of view:
> > 
> > http://patchwork.ozlabs.org/patch/1198947/
> > 
> > He seems to indicate the NPU devices as the real culprit because
> > nobody ever cared for them to be removable. Fixing that seems to be
> > a chore nobody really wants to address obviously... :-\
> 
> 
> I had taken a stab at not leaking a ref for the nvlink devices and do 
> the proper thing regarding ref counting (i.e. fixing all the callers of 
> get_pci_dev() to drop the reference when they were done). With that, I 
> could see that the ref count of the nvlink devices could drop to 0 
> (calling remove for the device in /sys) and that the devices could go away.
> 
> But then, I realized it's not necessarily desirable at this point. There 
> are several comments in the code saying the npu devices (for nvlink) 
> don't go away, there's no device release callback defined when it seems 
> there should be, at least to handle releasing PEs. All in all, it 
> seems that some work would be needed. And if it hasn't been required by 
> now...
> 

If everyone is ok with leaking a reference in the NPU case, I guess
this isn't a problem. But if we move forward with Oliver's patch, a
pci_dev_put() would be needed for OpenCAPI, correct ?

>Fred
> 
> 
> >>
> >>
> >>>
> >>> Signed-off-by: Oliver O'Halloran 
> >>> ---
> >>>   arch/powerpc/platforms/powernv/npu-dma.c | 11 +++
> >>>   1 file changed, 3 insertions(+), 8 deletions(-)
> >>>
> >>> diff --git a/arch/powerpc/platforms/powernv/npu-dma.c 
> >>> b/arch/powerpc/platforms/powernv/npu-dma.c
> >>> index 72d3749da02c..2eb6e6d45a98 100644
> >>> --- a/arch/powerpc/platforms/powernv/npu-dma.c
> >>> +++ b/arch/powerpc/platforms/powernv/npu-dma.c
> >>> @@ -28,15 +28,10 @@ static struct pci_dev *get_pci_dev(struct device_node 
> >>> *dn)
> >>>   break;
> >>>   
> >>>   /*
> >>> -  * pci_get_domain_bus_and_slot() increased the reference count of
> >>> -  * the PCI device, but callers don't need that actually as the PE
> >>> -  * already holds a reference to the device. Since callers aren't
> >>> -  * aware of the reference count change, call pci_dev_put() now to
> >>> -  * avoid leaks.
> >>> +  * NB: for_each_pci_dev() elevates the pci_dev refcount.
> >>> +  * Caller is responsible for dropping the ref when it's
> >>> +  * finished with it.
> >>>*/
> >>> - if (pdev)
> >>> - pci_dev_put(pdev);
> >>> -
> >>>   return pdev;
> >>>   }
> >>>   
> >>>
> >>
> > 
>

Re: [Very RFC 40/46] powernv/npu: Don't drop refcount when looking up GPU pci_devs

2019-11-27 Thread Greg Kurz

On Wed, 27 Nov 2019 20:40:00 +1100
"Oliver O'Halloran"  wrote:

> On Wed, Nov 27, 2019 at 8:34 PM Greg Kurz  wrote:
> >
> >
> > If everyone is ok with leaking a reference in the NPU case, I guess
> > this isn't a problem. But if we move forward with Oliver's patch, a
> > pci_dev_put() would be needed for OpenCAPI, correct ?
> 
> Yes, but I think that's fair enough. By convention it's the callers
> responsibility to drop the ref when it calls a function that returns a
> refcounted object. Doing anything else creates a race condition since
> the object's count could drop to zero before the caller starts using
> it.
> 

Sure, you're right, especially with Frederic's patch that drops
the pci_dev_get(dev) in pnv_ioda_setup_dev_PE().
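
Just to spell out the convention with a toy model (plain userspace C,
nothing to do with the real struct pci_dev; demo_dev, demo_get, demo_put
and demo_lookup are invented names): a lookup that returns a counted
reference hands ownership of that reference to the caller, who must put
it when done.

```c
#include <stddef.h>

struct demo_dev {
	int refcount;
};

/* Take a reference; tolerates NULL for convenience. */
static struct demo_dev *demo_get(struct demo_dev *dev)
{
	if (dev)
		dev->refcount++;
	return dev;
}

/* Drop a reference; tolerates NULL for convenience. */
static void demo_put(struct demo_dev *dev)
{
	if (dev)
		dev->refcount--;
}

/* Returns the device with an elevated refcount; caller must demo_put(). */
static struct demo_dev *demo_lookup(struct demo_dev *dev)
{
	return demo_get(dev);
}
```

If the lookup dropped the reference itself before returning, the count
could reach zero before the caller gets to use the object, which is the
race described above.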

> Oliver

Re: [Very RFC 40/46] powernv/npu: Don't drop refcount when looking up GPU pci_devs

2019-11-27 Thread Greg Kurz

On Wed, 27 Nov 2019 18:09:40 +1100
Alexey Kardashevskiy  wrote:

> 
> 
> On 20/11/2019 12:28, Oliver O'Halloran wrote:
> > The comment here implies that we don't need to take a ref to the pci_dev
> > because the ioda_pe will always have one. This implies that the current
> > expectation is that the pci_dev for an NPU device will *never* be torn
> > down since the ioda_pe having a ref to the device will prevent the
> > release function from being called.
> > 
> > In other words, the desired behaviour here appears to be leaking a ref.
> > 
> > Nice!
> 
> 
> There is a history: https://patchwork.ozlabs.org/patch/1088078/
> 
> We did not fix anything in particular then, we do not seem to be fixing
> anything now (in other words - we cannot test it in a normal natural
> way). I'd drop this one.
> 

Yeah, I didn't fix anything at the time. Just reverted to the ref
count behavior we had before:

https://patchwork.ozlabs.org/patch/829172/

Frederic recently posted his take on the same topic from the OpenCAPI
point of view:

http://patchwork.ozlabs.org/patch/1198947/

He seems to indicate the NPU devices as the real culprit because
nobody ever cared for them to be removable. Fixing that seems to be
a chore nobody really wants to address obviously... :-\

> 
> 
> > 
> > Signed-off-by: Oliver O'Halloran 
> > ---
> >  arch/powerpc/platforms/powernv/npu-dma.c | 11 +++
> >  1 file changed, 3 insertions(+), 8 deletions(-)
> > 
> > diff --git a/arch/powerpc/platforms/powernv/npu-dma.c 
> > b/arch/powerpc/platforms/powernv/npu-dma.c
> > index 72d3749da02c..2eb6e6d45a98 100644
> > --- a/arch/powerpc/platforms/powernv/npu-dma.c
> > +++ b/arch/powerpc/platforms/powernv/npu-dma.c
> > @@ -28,15 +28,10 @@ static struct pci_dev *get_pci_dev(struct device_node 
> > *dn)
> > break;
> >  
> > /*
> > -* pci_get_domain_bus_and_slot() increased the reference count of
> > -* the PCI device, but callers don't need that actually as the PE
> > -* already holds a reference to the device. Since callers aren't
> > -* aware of the reference count change, call pci_dev_put() now to
> > -* avoid leaks.
> > +* NB: for_each_pci_dev() elevates the pci_dev refcount.
> > +* Caller is responsible for dropping the ref when it's
> > +* finished with it.
> >  */
> > -   if (pdev)
> > -   pci_dev_put(pdev);
> > -
> > return pdev;
> >  }
> >  
> > 
>

[PATCH] powerpc/xive: Drop extern qualifiers from header function prototypes

2019-11-15 Thread Greg Kurz

As reported by ./scripts/checkpatch.pl --strict:

CHECK: extern prototypes should be avoided in .h files

Signed-off-by: Greg Kurz 
---
 arch/powerpc/include/asm/xive.h |   92 ---
 1 file changed, 46 insertions(+), 46 deletions(-)

diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h
index 24cdf97376c4..93f982dbb3d4 100644
--- a/arch/powerpc/include/asm/xive.h
+++ b/arch/powerpc/include/asm/xive.h
@@ -87,56 +87,56 @@ extern bool __xive_enabled;
 
 static inline bool xive_enabled(void) { return __xive_enabled; }
 
-extern bool xive_spapr_init(void);
-extern bool xive_native_init(void);
-extern void xive_smp_probe(void);
-extern int  xive_smp_prepare_cpu(unsigned int cpu);
-extern void xive_smp_setup_cpu(void);
-extern void xive_smp_disable_cpu(void);
-extern void xive_teardown_cpu(void);
-extern void xive_shutdown(void);
-extern void xive_flush_interrupt(void);
+bool xive_spapr_init(void);
+bool xive_native_init(void);
+void xive_smp_probe(void);
+int  xive_smp_prepare_cpu(unsigned int cpu);
+void xive_smp_setup_cpu(void);
+void xive_smp_disable_cpu(void);
+void xive_teardown_cpu(void);
+void xive_shutdown(void);
+void xive_flush_interrupt(void);
 
 /* xmon hook */
-extern void xmon_xive_do_dump(int cpu);
-extern int xmon_xive_get_irq_config(u32 hw_irq, struct irq_data *d);
+void xmon_xive_do_dump(int cpu);
+int xmon_xive_get_irq_config(u32 hw_irq, struct irq_data *d);
 
 /* APIs used by KVM */
-extern u32 xive_native_default_eq_shift(void);
-extern u32 xive_native_alloc_vp_block(u32 max_vcpus);
-extern void xive_native_free_vp_block(u32 vp_base);
-extern int xive_native_populate_irq_data(u32 hw_irq,
-struct xive_irq_data *data);
-extern void xive_cleanup_irq_data(struct xive_irq_data *xd);
-extern u32 xive_native_alloc_irq(void);
-extern void xive_native_free_irq(u32 irq);
-extern int xive_native_configure_irq(u32 hw_irq, u32 target, u8 prio, u32 
sw_irq);
-
-extern int xive_native_configure_queue(u32 vp_id, struct xive_q *q, u8 prio,
-  __be32 *qpage, u32 order, bool 
can_escalate);
-extern void xive_native_disable_queue(u32 vp_id, struct xive_q *q, u8 prio);
-
-extern void xive_native_sync_source(u32 hw_irq);
-extern void xive_native_sync_queue(u32 hw_irq);
-extern bool is_xive_irq(struct irq_chip *chip);
-extern int xive_native_enable_vp(u32 vp_id, bool single_escalation);
-extern int xive_native_disable_vp(u32 vp_id);
-extern int xive_native_get_vp_info(u32 vp_id, u32 *out_cam_id, u32 
*out_chip_id);
-extern bool xive_native_has_single_escalation(void);
-
-extern int xive_native_get_queue_info(u32 vp_id, uint32_t prio,
- u64 *out_qpage,
- u64 *out_qsize,
- u64 *out_qeoi_page,
- u32 *out_escalate_irq,
- u64 *out_qflags);
-
-extern int xive_native_get_queue_state(u32 vp_id, uint32_t prio, u32 *qtoggle,
-  u32 *qindex);
-extern int xive_native_set_queue_state(u32 vp_id, uint32_t prio, u32 qtoggle,
-  u32 qindex);
-extern int xive_native_get_vp_state(u32 vp_id, u64 *out_state);
-extern bool xive_native_has_queue_state_support(void);
+u32 xive_native_default_eq_shift(void);
+u32 xive_native_alloc_vp_block(u32 max_vcpus);
+void xive_native_free_vp_block(u32 vp_base);
+int xive_native_populate_irq_data(u32 hw_irq,
+ struct xive_irq_data *data);
+void xive_cleanup_irq_data(struct xive_irq_data *xd);
+u32 xive_native_alloc_irq(void);
+void xive_native_free_irq(u32 irq);
+int xive_native_configure_irq(u32 hw_irq, u32 target, u8 prio, u32 sw_irq);
+
+int xive_native_configure_queue(u32 vp_id, struct xive_q *q, u8 prio,
+   __be32 *qpage, u32 order, bool can_escalate);
+void xive_native_disable_queue(u32 vp_id, struct xive_q *q, u8 prio);
+
+void xive_native_sync_source(u32 hw_irq);
+void xive_native_sync_queue(u32 hw_irq);
+bool is_xive_irq(struct irq_chip *chip);
+int xive_native_enable_vp(u32 vp_id, bool single_escalation);
+int xive_native_disable_vp(u32 vp_id);
+int xive_native_get_vp_info(u32 vp_id, u32 *out_cam_id, u32 *out_chip_id);
+bool xive_native_has_single_escalation(void);
+
+int xive_native_get_queue_info(u32 vp_id, uint32_t prio,
+  u64 *out_qpage,
+  u64 *out_qsize,
+  u64 *out_qeoi_page,
+  u32 *out_escalate_irq,
+  u64 *out_qflags);
+
+int xive_native_get_queue_state(u32 vp_id, uint32_t prio, u32 *qtoggle,
+   u32 *qindex);
+int xive_native_set_queue_state(u32 vp_id, uint32_t prio, u32 qtoggle,
+   u32 qindex);
+int xive_native_get_vp_state(u32 vp_id, u64

[PATCH v2 2/2] KVM: PPC: Book3S HV: XIVE: Fix potential page leak on error path

2019-11-13 Thread Greg Kurz

We need to check the host page size is big enough to accommodate the
EQ. Let's do this before taking a reference on the EQ page to avoid
a potential leak if the check fails.

Cc: sta...@vger.kernel.org # v5.2
Fixes: 13ce3297c576 ("KVM: PPC: Book3S HV: XIVE: Add controls for the EQ 
configuration")
Signed-off-by: Greg Kurz 
---
 arch/powerpc/kvm/book3s_xive_native.c |   13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_xive_native.c 
b/arch/powerpc/kvm/book3s_xive_native.c
index 0e1fc5a16729..d83adb1e1490 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -630,12 +630,6 @@ static int kvmppc_xive_native_set_queue_config(struct 
kvmppc_xive *xive,
 
    srcu_idx = srcu_read_lock(&kvm->srcu);
gfn = gpa_to_gfn(kvm_eq.qaddr);
-   page = gfn_to_page(kvm, gfn);
-   if (is_error_page(page)) {
-   srcu_read_unlock(&kvm->srcu, srcu_idx);
-   pr_err("Couldn't get queue page %llx!\n", kvm_eq.qaddr);
-   return -EINVAL;
-   }
 
page_size = kvm_host_page_size(kvm, gfn);
if (1ull << kvm_eq.qshift > page_size) {
@@ -644,6 +638,13 @@ static int kvmppc_xive_native_set_queue_config(struct 
kvmppc_xive *xive,
return -EINVAL;
}
 
+   page = gfn_to_page(kvm, gfn);
+   if (is_error_page(page)) {
+   srcu_read_unlock(&kvm->srcu, srcu_idx);
+   pr_err("Couldn't get queue page %llx!\n", kvm_eq.qaddr);
+   return -EINVAL;
+   }
+
qaddr = page_to_virt(page) + (kvm_eq.qaddr & ~PAGE_MASK);
    srcu_read_unlock(&kvm->srcu, srcu_idx);

[PATCH v2 1/2] KVM: PPC: Book3S HV: XIVE: Free previous EQ page when setting up a new one

2019-11-13 Thread Greg Kurz

The EQ page is allocated by the guest and then passed to the hypervisor
with the H_INT_SET_QUEUE_CONFIG hcall. A reference is taken on the page
before handing it over to the HW. This reference is dropped either when
the guest issues the H_INT_RESET hcall or when the KVM device is released.
But, the guest can legitimately call H_INT_SET_QUEUE_CONFIG several times,
either to reset the EQ (vCPU hot unplug) or to set a new EQ (guest reboot).
In both cases the existing EQ page reference is leaked because we simply
overwrite it in the XIVE queue structure without calling put_page().

This is especially visible when the guest memory is backed with huge pages:
start a VM up to the guest userspace, either reboot it or unplug a vCPU,
quit QEMU. The leak is observed by comparing the value of HugePages_Free in
/proc/meminfo before and after the VM is run.

Ideally we'd want the XIVE code to handle the EQ page de-allocation at the
platform level. This isn't the case right now because the various XIVE
drivers have different allocation needs. It might be worth introducing
hooks for this purpose instead of exposing XIVE internals to the drivers,
but this is certainly a large piece of work to be done later.

In the meantime, for easier backport, fix both vCPU unplug and guest reboot
leaks by introducing a wrapper around xive_native_configure_queue() that
does the necessary cleanup.

Reported-by: Satheesh Rajendran 
Cc: sta...@vger.kernel.org # v5.2
Fixes: 13ce3297c576 ("KVM: PPC: Book3S HV: XIVE: Add controls for the EQ 
configuration")
Signed-off-by: Cédric Le Goater 
Signed-off-by: Greg Kurz 
---
v2: use wrapper as suggested by Cedric
---
 arch/powerpc/kvm/book3s_xive_native.c |   31 ++-
 1 file changed, 22 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
index 34bd123fa024..0e1fc5a16729 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -50,6 +50,24 @@ static void kvmppc_xive_native_cleanup_queue(struct kvm_vcpu *vcpu, int prio)
 	}
 }
 
+static int kvmppc_xive_native_configure_queue(u32 vp_id, struct xive_q *q,
+					      u8 prio, __be32 *qpage,
+					      u32 order, bool can_escalate)
+{
+	int rc;
+	__be32 *qpage_prev = q->qpage;
+
+	rc = xive_native_configure_queue(vp_id, q, prio, qpage, order,
+					 can_escalate);
+	if (rc)
+		return rc;
+
+	if (qpage_prev)
+		put_page(virt_to_page(qpage_prev));
+
+	return rc;
+}
+
 void kvmppc_xive_native_cleanup_vcpu(struct kvm_vcpu *vcpu)
 {
 	struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
@@ -575,19 +593,14 @@ static int kvmppc_xive_native_set_queue_config(struct kvmppc_xive *xive,
 		q->guest_qaddr  = 0;
 		q->guest_qshift = 0;
 
-		rc = xive_native_configure_queue(xc->vp_id, q, priority,
-						 NULL, 0, true);
+		rc = kvmppc_xive_native_configure_queue(xc->vp_id, q, priority,
+							NULL, 0, true);
 		if (rc) {
 			pr_err("Failed to reset queue %d for VCPU %d: %d\n",
 			       priority, xc->server_num, rc);
 			return rc;
 		}
 
-		if (q->qpage) {
-			put_page(virt_to_page(q->qpage));
-			q->qpage = NULL;
-		}
-
 		return 0;
 	}
 
@@ -646,8 +659,8 @@ static int kvmppc_xive_native_set_queue_config(struct kvmppc_xive *xive,
 	 * OPAL level because the use of END ESBs is not supported by
 	 * Linux.
 	 */
-	rc = xive_native_configure_queue(xc->vp_id, q, priority,
-					 (__be32 *) qaddr, kvm_eq.qshift, true);
+	rc = kvmppc_xive_native_configure_queue(xc->vp_id, q, priority,
+					(__be32 *) qaddr, kvm_eq.qshift, true);
 	if (rc) {
 		pr_err("Failed to configure queue %d for VCPU %d: %d\n",
 		       priority, xc->server_num, rc);

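The ordering used by the wrapper in the patch above (remember the old page, reconfigure, drop the old reference only on success) can be sketched in plain userspace C. This is only an illustrative model, not kernel code: `mock_page`, `mock_queue`, `mock_configure_queue()`, `mock_put_page()` and `configure_queue_and_release()` are hypothetical stand-ins, and the mock is assumed to leave the queue untouched when it fails.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct page refcounting and struct xive_q. */
struct mock_page { int refcount; };
struct mock_queue { struct mock_page *qpage; };

/*
 * Assumed behaviour of the configure call for this sketch: on failure the
 * queue is left untouched, on success qpage is replaced (NULL means reset).
 */
int mock_configure_queue(struct mock_queue *q, struct mock_page *new_page,
                         int fail)
{
    if (fail)
        return -1;
    q->qpage = new_page;
    return 0;
}

/* Drop one reference; the assert guards against refcount underflow. */
void mock_put_page(struct mock_page *p)
{
    assert(p->refcount > 0);
    p->refcount--;
}

/* Same ordering as the wrapper: save, configure, release on success only. */
int configure_queue_and_release(struct mock_queue *q,
                                struct mock_page *new_page, int fail)
{
    struct mock_page *prev = q->qpage;
    int rc = mock_configure_queue(q, new_page, fail);

    if (rc)
        return rc;          /* old page is still in use: keep its reference */

    if (prev)
        mock_put_page(prev);

    return 0;
}
```

In this model a failed call keeps the previous reference, while both a reset (new page is NULL) and a replacement drop it exactly once.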
Re: [PATCH] KVM: PPC: Book3S HV: XIVE: Free previous EQ page when setting up a new one

2019-11-12 Thread Greg Kurz

On Mon, 11 Nov 2019 12:26:25 +0100
Cédric Le Goater  wrote:

> On 11/11/2019 10:49, Greg Kurz wrote:
> > The EQ page is allocated by the guest and then passed to the hypervisor
> > with the H_INT_SET_QUEUE_CONFIG hcall. A reference is taken on the page
> > before handing it over to the HW. This reference is dropped either when
> > the guest issues the H_INT_RESET hcall or when the KVM device is released.
> > But, the guest can legitimately call H_INT_SET_QUEUE_CONFIG several times
> > to reset the EQ (vCPU hot unplug) or set a new EQ (guest reboot). In both
> > cases the EQ page reference is leaked. This is especially visible when
> > the guest memory is backed with huge pages: start a VM up to the guest
> > userspace, either reboot it or unplug a vCPU, quit QEMU. The leak is
> > observed by comparing the value of HugePages_Free in /proc/meminfo before
> > and after the VM is run.
> > 
> > Note that the EQ reset path seems to be calling put_page() but this is
> > done after xive_native_configure_queue() which clears the qpage field
> > in the XIVE queue structure, i.e. the put_page() block is a nop and the
> > previous page pointer was just overwritten anyway. In the other case of
> > configuring a new EQ page, nothing seems to be done to release the old
> > one.
> 
> Yes. Nice catch. I think we should try to fix the problem differently. 
> 
> The routine xive_native_configure_queue() is only suited for XIVE 
> drivers doing their own EQ page allocation: Linux PowerNV and the 
> KVM XICS-over-XIVE device. The KVM XIVE device acts as a proxy for 
> the guest OS doing the allocation and it has different needs.
> 

Well xive_native_configure_queue() is at least partially suited for all three
drivers since they use it to configure the EQ. But it doesn't address the page
allocation/de-allocation which is indeed different.

> Having a specific xive_native_configure_queue() for the KVM XIVE 
> device seems overkill. Maybe we could introduce a helper routine 
> in the KVM XIVE device that calls xive_native_configure_queue() and 
> handles the page reference as it should be? That is, drop the previous
> page reference when q->qpage changes.
> 

Yes, that seems better. I'll post a v2 with the helper you've mailed
me.

> 
> Also, we should try to preserve the previous setting until the whole 
> configuration is in place. That seems possible up to the call to 
> xive_native_configure_queue(). If kvmppc_xive_attach_escalation()
> fails, I think it is too late, as the HW has been configured by 
> xive_native_configure_queue(), and we should just clean up everything. 
> 
> Thanks,
> 
> C. 
> 
> 
> > Fix both cases by always calling put_page() on the existing EQ page in
> > kvmppc_xive_native_set_queue_config(). This is a seamless change for the
> > EQ reset case. However this causes xive_native_configure_queue() to be
> > called twice for the new EQ page case: one time to reset the EQ and another
> > time to configure the new page. This is needed because we cannot release
> > the EQ page before calling xive_native_configure_queue() since it may still
> > be used by the HW. We cannot modify xive_native_configure_queue() to drop
> > the reference either because this function is also used by the XICS-on-XIVE
> > device which requires free_pages() instead of put_page(). This isn't a big
> > deal anyway since H_INT_SET_QUEUE_CONFIG isn't a hot path.
> > 
> > Reported-by: Satheesh Rajendran 
> > Cc: sta...@vger.kernel.org # v5.2
> > Fixes: 13ce3297c576 ("KVM: PPC: Book3S HV: XIVE: Add controls for the EQ configuration")
> > Signed-off-by: Greg Kurz 
> > ---
> >  arch/powerpc/kvm/book3s_xive_native.c |   21 -
> >  1 file changed, 12 insertions(+), 9 deletions(-)
> > 
> > diff --git a/arch/powerpc/kvm/book3s_xive_native.c 
> > b/arch/powerpc/kvm/book3s_xive_native.c
> > index 34bd123fa024..8ab908d23dc2 100644
> > --- a/arch/powerpc/kvm/book3s_xive_native.c
> > +++ b/arch/powerpc/kvm/book3s_xive_native.c
> > @@ -570,10 +570,12 @@ static int kvmppc_xive_native_set_queue_config(struct 
> > kvmppc_xive *xive,
> >  __func__, server, priority, kvm_eq.flags,
> >  kvm_eq.qshift, kvm_eq.qaddr, kvm_eq.qtoggle, kvm_eq.qindex);
> >  
> > -   /* reset queue and disable queueing */
> > -   if (!kvm_eq.qshift) {
> > -   q->guest_qaddr  = 0;
> > -   q->guest_qshift = 0;
> > +   /*
> > +* Reset queue and disable queueing. It will be re-enabled
> > +* later on if the guest is configuring a new EQ page.
> > +*/
> > +

[PATCH] KVM: PPC: Book3S HV: XIVE: Free previous EQ page when setting up a new one

2019-11-11 Thread Greg Kurz

The EQ page is allocated by the guest and then passed to the hypervisor
with the H_INT_SET_QUEUE_CONFIG hcall. A reference is taken on the page
before handing it over to the HW. This reference is dropped either when
the guest issues the H_INT_RESET hcall or when the KVM device is released.
But, the guest can legitimately call H_INT_SET_QUEUE_CONFIG several times
to reset the EQ (vCPU hot unplug) or set a new EQ (guest reboot). In both
cases the EQ page reference is leaked. This is especially visible when
the guest memory is backed with huge pages: start a VM up to the guest
userspace, either reboot it or unplug a vCPU, quit QEMU. The leak is
observed by comparing the value of HugePages_Free in /proc/meminfo before
and after the VM is run.
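For the before/after comparison described above, a small helper along these lines could extract the counter. It is a sketch only: `read_hugepages_free()` is a hypothetical name, and it assumes the usual `HugePages_Free:` line format of /proc/meminfo.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Scan a meminfo-style stream and return the HugePages_Free value,
 * or -1 if no such line is present.
 */
long read_hugepages_free(FILE *f)
{
    char line[256];
    long value;

    assert(f != NULL);

    while (fgets(line, sizeof(line), f)) {
        /* Only the HugePages_Free line matches the literal prefix. */
        if (sscanf(line, "HugePages_Free: %ld", &value) == 1)
            return value;
    }

    return -1;
}
```

Calling it on a stream opened with fopen("/proc/meminfo", "r") once before starting the VM and once after quitting QEMU gives the two values to compare; a lower value after the run suggests huge pages are still pinned.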

Note that the EQ reset path seems to be calling put_page() but this is
done after xive_native_configure_queue() which clears the qpage field
in the XIVE queue structure, i.e. the put_page() block is a nop and the
previous page pointer was just overwritten anyway. In the other case of
configuring a new EQ page, nothing seems to be done to release the old
one.
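To make the nop concrete, here is a minimal userspace model of the reset path as it stood before the fix. It is not kernel code: `mock_page`, `mock_queue`, `mock_reset_queue()`, `mock_put_page()` and `reset_queue_old_order()` are hypothetical, and the mock only reproduces the one side effect that matters here, namely that the configure call clears the qpage field.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct page refcounting and struct xive_q. */
struct mock_page { int refcount; };
struct mock_queue { struct mock_page *qpage; };

/* Models a successful reset: the queue forgets its page. */
int mock_reset_queue(struct mock_queue *q)
{
    q->qpage = NULL;
    return 0;
}

/* Drop one reference; the assert guards against refcount underflow. */
void mock_put_page(struct mock_page *p)
{
    assert(p->refcount > 0);
    p->refcount--;
}

/* Old ordering: the qpage test runs after the pointer was already cleared. */
int reset_queue_old_order(struct mock_queue *q)
{
    int rc = mock_reset_queue(q);

    if (rc)
        return rc;

    if (q->qpage) {                 /* always NULL at this point */
        mock_put_page(q->qpage);    /* never reached */
        q->qpage = NULL;
    }

    return 0;
}
```

Starting from a queue that holds a page with one reference, the call succeeds but the reference count is unchanged afterwards, which is the leak described above.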

Fix both cases by always calling put_page() on the existing EQ page in
kvmppc_xive_native_set_queue_config(). This is a seamless change for the
EQ reset case. However this causes xive_native_configure_queue() to be
called twice for the new EQ page case: one time to reset the EQ and another
time to configure the new page. This is needed because we cannot release
the EQ page before calling xive_native_configure_queue() since it may still
be used by the HW. We cannot modify xive_native_configure_queue() to drop
the reference either because this function is also used by the XICS-on-XIVE
device which requires free_pages() instead of put_page(). This isn't a big
deal anyway since H_INT_SET_QUEUE_CONFIG isn't a hot path.

Reported-by: Satheesh Rajendran 
Cc: sta...@vger.kernel.org # v5.2
Fixes: 13ce3297c576 ("KVM: PPC: Book3S HV: XIVE: Add controls for the EQ configuration")
Signed-off-by: Greg Kurz 
---
 arch/powerpc/kvm/book3s_xive_native.c |   21 -
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
index 34bd123fa024..8ab908d23dc2 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -570,10 +570,12 @@ static int kvmppc_xive_native_set_queue_config(struct kvmppc_xive *xive,
 		 __func__, server, priority, kvm_eq.flags,
 		 kvm_eq.qshift, kvm_eq.qaddr, kvm_eq.qtoggle, kvm_eq.qindex);
 
-	/* reset queue and disable queueing */
-	if (!kvm_eq.qshift) {
-		q->guest_qaddr  = 0;
-		q->guest_qshift = 0;
+	/*
+	 * Reset queue and disable queueing. It will be re-enabled
+	 * later on if the guest is configuring a new EQ page.
+	 */
+	if (q->guest_qshift) {
+		page = virt_to_page(q->qpage);
 
 		rc = xive_native_configure_queue(xc->vp_id, q, priority,
 						 NULL, 0, true);
@@ -583,12 +585,13 @@ static int kvmppc_xive_native_set_queue_config(struct kvmppc_xive *xive,
 			return rc;
 		}
 
-		if (q->qpage) {
-			put_page(virt_to_page(q->qpage));
-			q->qpage = NULL;
-		}
+		put_page(page);
 
-		return 0;
+		if (!kvm_eq.qshift) {
+			q->guest_qaddr  = 0;
+			q->guest_qshift = 0;
+			return 0;
+		}
 	}
 
 	/*

Re: [PATCH 3/3] powerpc/pseries: Fixup config space size of OpenCAPI devices

2019-11-09 Thread Greg Kurz

On Thu, 7 Nov 2019 09:46:25 +0100
christophe lombard  wrote:

> On 05/11/2019 06:01, Andrew Donnellan wrote:
> > On 22/10/19 6:52 pm, christophe lombard wrote:
> >> Fix up the pci config size of the OpenCAPI PCIe devices in the pseries
> >> environment.
> >> Most of OpenCAPI PCIe devices have 4096 bytes of configuration space.
> > 
> > It's not "most of", it's "all" - the OpenCAPI Discovery and 
> > Configuration Spec requires the use of extended capabilities that fall 
> > in the 0x100-0xFFF range.
> > 
> >>
> >> Signed-off-by: Christophe Lombard 
> >> ---
> >>   arch/powerpc/platforms/pseries/pci.c | 9 +
> >>   1 file changed, 9 insertions(+)
> >>
> >> diff --git a/arch/powerpc/platforms/pseries/pci.c 
> >> b/arch/powerpc/platforms/pseries/pci.c
> >> index 1eae1d09980c..3397784767b0 100644
> >> --- a/arch/powerpc/platforms/pseries/pci.c
> >> +++ b/arch/powerpc/platforms/pseries/pci.c
> >> @@ -291,6 +291,15 @@ static void fixup_winbond_82c105(struct pci_dev* 
> >> dev)
> >>   DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_WINBOND, 
> >> PCI_DEVICE_ID_WINBOND_82C105,
> >>    fixup_winbond_82c105);
> >> +static void fixup_opencapi_cfg_size(struct pci_dev *pdev)
> >> +{
> >> +    if (!machine_is(pseries))
> >> +    return;
> >> +
> >> +    pdev->cfg_size = PCI_CFG_SPACE_EXP_SIZE;
> >> +}
> >> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_IBM, 0x062b, 
> >> fixup_opencapi_cfg_size);
> > 
> > An OpenCAPI device can have any PCI ID, is there a particular reason 
> > we're limiting this to 1014:062b? On PowerNV, we check the PHB type to 
> > determine whether the device is OpenCAPI or not, what's the equivalent 
> > for pseries?
> > 
> 
> Thanks for the review. For pseries, there is no specific OpenCAPI PHB 
> type that constrains this kind of request.
> We are working on finding another solution.
> 

Well... we have an old PAPR+ addendum draft that mentions an "open-capi"
PHB type. The specification was never finalized and AFAIK PowerVM doesn't
support the OpenCAPI interface, so we didn't stick to the addendum during
our in-house prototyping. But now that we want to upstream things, I think
we should probably come up with a dedicated PHB type.

> >> +
> >>   int pseries_root_bridge_prepare(struct pci_host_bridge *bridge)
> >>   {
> >>   struct device_node *dn, *pdn;
> >>
> > 
>

Re: [PATCH] powerpc/xive: Prevent page fault issues in the machine crash handler

2019-10-31 Thread Greg Kurz

On Thu, 31 Oct 2019 07:31:00 +0100
Cédric Le Goater  wrote:

> When the machine crash handler is invoked, all interrupts are masked
> but interrupts which have not been started yet do not have an ESB page
> mapped in the Linux address space. This crashes the 'crash kexec'
> sequence on sPAPR guests.
> 
> To fix, force the mapping of the ESB page when an interrupt is being
> mapped in the Linux IRQ number space. This is done by setting the
> initial state of the interrupt to OFF which is not necessarily the
> case on PowerNV.
> 
> Signed-off-by: Cédric Le Goater 
> ---

Reviewed-by: Greg Kurz 

>  arch/powerpc/sysdev/xive/common.c | 9 +
>  1 file changed, 9 insertions(+)
> 
> diff --git a/arch/powerpc/sysdev/xive/common.c 
> b/arch/powerpc/sysdev/xive/common.c
> index df832b09e3e9..f5fadbd2533a 100644
> --- a/arch/powerpc/sysdev/xive/common.c
> +++ b/arch/powerpc/sysdev/xive/common.c
> @@ -1035,6 +1035,15 @@ static int xive_irq_alloc_data(unsigned int virq, 
> irq_hw_number_t hw)
>   xd->target = XIVE_INVALID_TARGET;
>   irq_set_handler_data(virq, xd);
>  
> + /*
> +  * Turn OFF by default the interrupt being mapped. A side
> +  * effect of this check is the mapping the ESB page of the
> +  * interrupt in the Linux address space. This prevents page
> +  * fault issues in the crash handler which masks all
> +  * interrupts.
> +  */
> + xive_esb_read(xd, XIVE_ESB_SET_PQ_01);
> +
>   return 0;
>  }
>
