Re: [PATCH v3 1/2] KVM: PPC: Book3S HV: Sanitise vcpu registers in nested path

2021-04-30 Thread Nicholas Piggin
Excerpts from Fabiano Rosas's message of April 16, 2021 9:09 am:
> As one of the arguments of the H_ENTER_NESTED hypercall, the nested
> hypervisor (L1) prepares a structure containing the values of various
> hypervisor-privileged registers with which it wants the nested guest
> (L2) to run. Since the nested HV runs in supervisor mode it needs the
> host to write to these registers.
> 
> To stop a nested HV manipulating this mechanism and using a nested
> guest as a proxy to access a facility that has been made unavailable
> to it, we have a routine that sanitises the values of the HV registers
> before copying them into the nested guest's vcpu struct.
> 
> However, when coming out of the guest, the values are copied back into
> L1 memory as they were, which means that any sanitisation we did
> during guest entry will be exposed to L1 after H_ENTER_NESTED returns.
> 
> This patch alters this sanitisation to have effect on the vcpu->arch
> registers directly before entering and after exiting the guest,
> leaving the structure that is copied back into L1 unchanged (except
> when we really want L1 to access the value, e.g. the Cause bits of
> HFSCR).
> 
> Signed-off-by: Fabiano Rosas 
> ---
>  arch/powerpc/kvm/book3s_hv_nested.c | 55 ++---
>  1 file changed, 34 insertions(+), 21 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv_nested.c 
> b/arch/powerpc/kvm/book3s_hv_nested.c
> index 0cd0e7aad588..270552dd42c5 100644
> --- a/arch/powerpc/kvm/book3s_hv_nested.c
> +++ b/arch/powerpc/kvm/book3s_hv_nested.c
> @@ -102,8 +102,17 @@ static void save_hv_return_state(struct kvm_vcpu *vcpu, 
> int trap,
>  {
>   struct kvmppc_vcore *vc = vcpu->arch.vcore;
>  
> + /*
> +  * When loading the hypervisor-privileged registers to run L2,
> +  * we might have used bits from L1 state to restrict what the
> +  * L2 state is allowed to be. Since L1 is not allowed to read
> +  * the HV registers, do not include these modifications in the
> +  * return state.
> +  */
> + hr->hfscr = ((~HFSCR_INTR_CAUSE & hr->hfscr) |
> +  (HFSCR_INTR_CAUSE & vcpu->arch.hfscr));
> +
>   hr->dpdes = vc->dpdes;
> - hr->hfscr = vcpu->arch.hfscr;
>   hr->purr = vcpu->arch.purr;
>   hr->spurr = vcpu->arch.spurr;
>   hr->ic = vcpu->arch.ic;

The below parts of the patch I have no problem with, I think it's good to 
be able to restore the hv_guest_state for return, e.g., for cases where 
the L0 might emulate some HV behaviour transparently it will be useful,
at least.

Thanks,
Nick

> @@ -132,24 +141,7 @@ static void save_hv_return_state(struct kvm_vcpu *vcpu, 
> int trap,
>   }
>  }
>  
> -static void sanitise_hv_regs(struct kvm_vcpu *vcpu, struct hv_guest_state 
> *hr)
> -{
> - /*
> -  * Don't let L1 enable features for L2 which we've disabled for L1,
> -  * but preserve the interrupt cause field.
> -  */
> - hr->hfscr &= (HFSCR_INTR_CAUSE | vcpu->arch.hfscr);
> -
> - /* Don't let data address watchpoint match in hypervisor state */
> - hr->dawrx0 &= ~DAWRX_HYP;
> - hr->dawrx1 &= ~DAWRX_HYP;
> -
> - /* Don't let completed instruction address breakpt match in HV state */
> - if ((hr->ciabr & CIABR_PRIV) == CIABR_PRIV_HYPER)
> - hr->ciabr &= ~CIABR_PRIV;
> -}
> -
> -static void restore_hv_regs(struct kvm_vcpu *vcpu, struct hv_guest_state *hr)
> +static void restore_hv_regs(struct kvm_vcpu *vcpu, const struct 
> hv_guest_state *hr)
>  {
>   struct kvmppc_vcore *vc = vcpu->arch.vcore;
>  
> @@ -261,6 +253,27 @@ static int kvmhv_write_guest_state_and_regs(struct 
> kvm_vcpu *vcpu,
>sizeof(struct pt_regs));
>  }
>  
> +static void load_l2_hv_regs(struct kvm_vcpu *vcpu,
> + const struct hv_guest_state *l2_hv,
> + const struct hv_guest_state *l1_hv)
> +{
> + restore_hv_regs(vcpu, l2_hv);
> +
> + /*
> +  * Don't let L1 enable features for L2 which we've disabled for L1,
> +  * but preserve the interrupt cause field.
> +  */
> + vcpu->arch.hfscr = l2_hv->hfscr & (HFSCR_INTR_CAUSE | l1_hv->hfscr);
> +
> + /* Don't let data address watchpoint match in hypervisor state */
> + vcpu->arch.dawrx0 = l2_hv->dawrx0 & ~DAWRX_HYP;
> + vcpu->arch.dawrx1 = l2_hv->dawrx1 & ~DAWRX_HYP;
> +
> + /* Don't let completed instruction address breakpt match in HV state */
> + if ((l2_hv->ciabr & CIABR_PRIV) == CIABR_PRIV_HYPER)
> + vcpu->arch.ciabr = l2_hv->ciabr & ~CIABR_PRIV;
> +}
> +
>  long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu)
>  {
>   long int err, r;
> @@ -324,8 +337,8 @@ long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu)
>   mask = LPCR_DPFD | LPCR_ILE | LPCR_TC | LPCR_AIL | LPCR_LD |
>   LPCR_LPES | LPCR_MER;
>   lpcr = (vc->lpcr & ~mask) | (l2_hv.lpcr & mask);
> - sanitise_hv_regs(vcpu, &l2_hv);
> - restore_hv_regs(vcpu, &l2_hv);
> +
> +

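A minimal sketch of the entry/exit split described in the patch, with
simplified types and only the HFSCR handling shown (the mask layout follows
the cause-byte usage visible in the hunks; treat names other than
HFSCR_INTR_CAUSE as illustrative):

    /* On entry, sanitise into the vcpu copy only; the L1-visible struct
     * keeps whatever L1 wrote. l1_hfscr is the facility set granted to L1. */
    #define HFSCR_INTR_CAUSE (0xffUL << 56)   /* top byte: interrupt cause */

    static void enter_l2_hfscr(unsigned long *vcpu_hfscr,
                               unsigned long l2_hfscr, unsigned long l1_hfscr)
    {
        *vcpu_hfscr = l2_hfscr & (HFSCR_INTR_CAUSE | l1_hfscr);
    }

    /* On exit, return only the Cause bits to L1; every other bit is the
     * value L1 originally supplied, so the sanitisation stays invisible. */
    static void exit_l2_hfscr(unsigned long *hr_hfscr, unsigned long vcpu_hfscr)
    {
        *hr_hfscr = (*hr_hfscr & ~HFSCR_INTR_CAUSE) |
                    (vcpu_hfscr & HFSCR_INTR_CAUSE);
    }
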
Re: [PATCH v3 2/2] KVM: PPC: Book3S HV: Stop forwarding all HFSCR cause bits to L1

2021-04-30 Thread Nicholas Piggin
Oh sorry, I didn't skim this one before replying to the first.

Excerpts from Fabiano Rosas's message of April 16, 2021 9:09 am:
> Since commit 73937deb4b2d ("KVM: PPC: Book3S HV: Sanitise hv_regs on
> nested guest entry") we have been disabling for the nested guest the
> hypervisor facility bits that its nested hypervisor don't have access
> to.
> 
> If the nested guest tries to use one of those facilities, the hardware
> will cause a Hypervisor Facility Unavailable interrupt. The HFSCR
> register is modified by the hardware to contain information about the
> cause of the interrupt.
> 
> We have been returning the cause bits to the nested hypervisor but
> since commit 549e29b458c5 ("KVM: PPC: Book3S HV: Sanitise vcpu
> registers in nested path") we are reducing the amount of information
> exposed to L1, so it seems like a good idea to restrict some of the
> cause bits as well.
> 
> With this patch the L1 guest will be allowed to handle only the
> interrupts caused by facilities it has disabled for L2. The interrupts
> caused by facilities that L0 denied will cause a Program Interrupt in
> L1.

I'm not sure if this is a good solution. This would be randomly killing 
guest processes or kernels with no way for them to understand what's going
on or deal with it.

The problem is really a nested hypervisor mismatch / configuration 
error, so it should be handled between the L0 and L1. Returning failure
from H_ENTER_NESTED, for example (which is probe-able, but not really 
any less probe-able than this approach).

Thanks,
Nick

> 
> Signed-off-by: Fabiano Rosas 
> ---
>  arch/powerpc/kvm/book3s_hv_nested.c | 17 +
>  1 file changed, 17 insertions(+)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv_nested.c 
> b/arch/powerpc/kvm/book3s_hv_nested.c
> index 270552dd42c5..912a2bcdf7b0 100644
> --- a/arch/powerpc/kvm/book3s_hv_nested.c
> +++ b/arch/powerpc/kvm/book3s_hv_nested.c
> @@ -138,6 +138,23 @@ static void save_hv_return_state(struct kvm_vcpu *vcpu, 
> int trap,
>   case BOOK3S_INTERRUPT_H_EMUL_ASSIST:
>   hr->heir = vcpu->arch.emul_inst;
>   break;
> + case BOOK3S_INTERRUPT_H_FAC_UNAVAIL:
> + {
> + u8 cause = vcpu->arch.hfscr >> 56;
> +
> + WARN_ON_ONCE(cause >= BITS_PER_LONG);
> +
> + if (hr->hfscr & (1UL << cause)) {
> + hr->hfscr &= ~HFSCR_INTR_CAUSE;
> + /*
> +  * We have not restored L1 state yet, so queue
> +  * this interrupt instead of delivering it
> +  * immediately.
> +  */
> + kvmppc_book3s_queue_irqprio(vcpu, 
> BOOK3S_INTERRUPT_PROGRAM);
> + }
> + break;
> + }
>   }
>  }
>  
> -- 
> 2.29.2
> 
> 

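The cause byte sits in HFSCR bits 56-63 and names the facility bit that
raised the HFAC interrupt; patch 2/2 forwards it to L1 only when L1 itself
disabled that facility. A standalone sketch of that rule (hypothetical
helper; the real code operates on hr->hfscr as shown in the hunk):

    #include <stdbool.h>

    #define HFSCR_INTR_CAUSE (0xffUL << 56)

    /* Forward the HFAC interrupt to L1 only when the causing facility is
     * one L1 disabled for L2. If L1 had it enabled, the denial came from
     * L0's sanitisation, so hide the cause and raise a Program interrupt
     * in L1 instead. */
    static bool forward_hfac_to_l1(unsigned long l1_hfscr,
                                   unsigned long vcpu_hfscr)
    {
        unsigned int cause = vcpu_hfscr >> 56;  /* facility bit number */

        if (cause >= 64)        /* mirrors the patch's WARN_ON_ONCE() */
            return false;

        return !(l1_hfscr & (1UL << cause));
    }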

Re: [PATCH v3 1/2] KVM: PPC: Book3S HV: Sanitise vcpu registers in nested path

2021-04-30 Thread Nicholas Piggin
Excerpts from Fabiano Rosas's message of April 16, 2021 9:09 am:
> As one of the arguments of the H_ENTER_NESTED hypercall, the nested
> hypervisor (L1) prepares a structure containing the values of various
> hypervisor-privileged registers with which it wants the nested guest
> (L2) to run. Since the nested HV runs in supervisor mode it needs the
> host to write to these registers.
> 
> To stop a nested HV manipulating this mechanism and using a nested
> guest as a proxy to access a facility that has been made unavailable
> to it, we have a routine that sanitises the values of the HV registers
> before copying them into the nested guest's vcpu struct.
> 
> However, when coming out of the guest, the values are copied back into
> L1 memory as they were, which means that any sanitisation we did
> during guest entry will be exposed to L1 after H_ENTER_NESTED returns.
> 
> This patch alters this sanitisation to have effect on the vcpu->arch
> registers directly before entering and after exiting the guest,
> leaving the structure that is copied back into L1 unchanged (except
> when we really want L1 to access the value, e.g. the Cause bits of
> HFSCR).
> 
> Signed-off-by: Fabiano Rosas 
> ---
>  arch/powerpc/kvm/book3s_hv_nested.c | 55 ++---
>  1 file changed, 34 insertions(+), 21 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv_nested.c 
> b/arch/powerpc/kvm/book3s_hv_nested.c
> index 0cd0e7aad588..270552dd42c5 100644
> --- a/arch/powerpc/kvm/book3s_hv_nested.c
> +++ b/arch/powerpc/kvm/book3s_hv_nested.c
> @@ -102,8 +102,17 @@ static void save_hv_return_state(struct kvm_vcpu *vcpu, 
> int trap,
>  {
>   struct kvmppc_vcore *vc = vcpu->arch.vcore;
>  
> + /*
> +  * When loading the hypervisor-privileged registers to run L2,
> +  * we might have used bits from L1 state to restrict what the
> +  * L2 state is allowed to be. Since L1 is not allowed to read
> +  * the HV registers, do not include these modifications in the
> +  * return state.
> +  */
> + hr->hfscr = ((~HFSCR_INTR_CAUSE & hr->hfscr) |
> +  (HFSCR_INTR_CAUSE & vcpu->arch.hfscr));
> +
>   hr->dpdes = vc->dpdes;
> - hr->hfscr = vcpu->arch.hfscr;
>   hr->purr = vcpu->arch.purr;
>   hr->spurr = vcpu->arch.spurr;
>   hr->ic = vcpu->arch.ic;

Do we still have the problem here that hfac interrupts due to bits cleared
by the hfscr sanitisation would have the cause bits returned to the L1,
so in theory it could probe hfscr directly that way? I don't see a good
solution to this except either have the L0 intercept these faults and do
"something" transparent, or return error from H_ENTER_NESTED (which would
also allow trivial probing of the facilities).

Returning an hfac interrupt to a hypervisor that thought it enabled the 
bit would be strange. But so would appearing to modify the register 
underneath it and then returning a fault.

I think the sanest thing would actually be to return failure from the 
hcall.

> @@ -132,24 +141,7 @@ static void save_hv_return_state(struct kvm_vcpu *vcpu, 
> int trap,
>   }
>  }
>  
> -static void sanitise_hv_regs(struct kvm_vcpu *vcpu, struct hv_guest_state 
> *hr)
> -{
> - /*
> -  * Don't let L1 enable features for L2 which we've disabled for L1,
> -  * but preserve the interrupt cause field.
> -  */
> - hr->hfscr &= (HFSCR_INTR_CAUSE | vcpu->arch.hfscr);
> -
> - /* Don't let data address watchpoint match in hypervisor state */
> - hr->dawrx0 &= ~DAWRX_HYP;
> - hr->dawrx1 &= ~DAWRX_HYP;
> -
> - /* Don't let completed instruction address breakpt match in HV state */
> - if ((hr->ciabr & CIABR_PRIV) == CIABR_PRIV_HYPER)
> - hr->ciabr &= ~CIABR_PRIV;
> -}
> -
> -static void restore_hv_regs(struct kvm_vcpu *vcpu, struct hv_guest_state *hr)
> +static void restore_hv_regs(struct kvm_vcpu *vcpu, const struct 
> hv_guest_state *hr)
>  {
>   struct kvmppc_vcore *vc = vcpu->arch.vcore;
>  
> @@ -261,6 +253,27 @@ static int kvmhv_write_guest_state_and_regs(struct 
> kvm_vcpu *vcpu,
>sizeof(struct pt_regs));
>  }
>  
> +static void load_l2_hv_regs(struct kvm_vcpu *vcpu,
> + const struct hv_guest_state *l2_hv,
> + const struct hv_guest_state *l1_hv)
> +{
> + restore_hv_regs(vcpu, l2_hv);
> +
> + /*
> +  * Don't let L1 enable features for L2 which we've disabled for L1,
> +  * but preserve the interrupt cause field.
> +  */
> + vcpu->arch.hfscr = l2_hv->hfscr & (HFSCR_INTR_CAUSE | l1_hv->hfscr);
> +
> + /* Don't let data address watchpoint match in hypervisor state */
> + vcpu->arch.dawrx0 = l2_hv->dawrx0 & ~DAWRX_HYP;
> + vcpu->arch.dawrx1 = l2_hv->dawrx1 & ~DAWRX_HYP;
> +
> + /* Don't let completed instruction address breakpt match in HV state */
> + if ((l2_hv->ciabr & CIABR_PRIV) == CIABR_PRIV_HYPER)
> + vcpu->arch.ciabr = l2_hv->ciabr & ~CIABR_PRIV;

Re: [PATCH v7] powerpc/irq: Inline call_do_irq() and call_do_softirq()

2021-04-30 Thread Nick Desaulniers
On Tue, Apr 27, 2021 at 1:42 PM Nick Desaulniers
 wrote:
>
> On Mon, Apr 26, 2021 at 11:39 PM Christophe Leroy
>  wrote:
> >
> > As you can see, CLANG doesn't save/restore 'lr' although 'lr' is
> > explicitly listed in the
> > registers clobbered by the inline assembly:
>
> Ah, thanks for debugging this. Will follow up in
> https://bugs.llvm.org/show_bug.cgi?id=50147.

Looks like there's a fix posted for LLVM in: https://reviews.llvm.org/D101657

Though trying to test it in QEMU, I'm hitting some assertion failure
booting a kernel (even without that patch to LLVM):
qemu-system-ppc: ../../hw/pci/pci.c:253: pci_bus_change_irq_level:
Assertion `irq_num >= 0' failed.
That's with
QEMU emulator version 5.2.0 (Debian 1:5.2+dfsg-9)

I didn't see anything in https://bugs.launchpad.net/qemu/ about it,
but figured I'd share in case that assertion failure looked familiar
to anyone.
-- 
Thanks,
~Nick Desaulniers

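For readers following the LLVM bug, the construct at issue is an inline-asm
indirect call that lists the link register as clobbered; a reduced sketch
(not the actual call_do_irq()/call_do_softirq() implementation):

    /* bctrl stores the return address in the link register, so "lr" must
     * be in the clobber list and the compiler has to save/restore it
     * around the asm block -- the step clang was skipping. */
    static inline void indirect_call(void (*fn)(void))
    {
        asm volatile("mtctr %0\n\t"
                     "bctrl"
                     : : "r" (fn)
                     : "lr", "ctr", "memory");
    }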

Re: [PATCH 4/4] powerpc/pseries: warn if recursing into the hcall tracing code

2021-04-30 Thread Nicholas Piggin
Excerpts from Naveen N. Rao's message of April 27, 2021 11:59 pm:
> Nicholas Piggin wrote:
>> ---
>>  arch/powerpc/platforms/pseries/lpar.c | 11 +++
>>  1 file changed, 7 insertions(+), 4 deletions(-)
>> 
>> diff --git a/arch/powerpc/platforms/pseries/lpar.c 
>> b/arch/powerpc/platforms/pseries/lpar.c
>> index 835e7f661a05..a961a7ebeab3 100644
>> --- a/arch/powerpc/platforms/pseries/lpar.c
>> +++ b/arch/powerpc/platforms/pseries/lpar.c
>> @@ -1828,8 +1828,11 @@ void hcall_tracepoint_unregfunc(void)
>> 
>>  /*
>>   * Since the tracing code might execute hcalls we need to guard against
>> - * recursion. H_CONFER from spin locks must be treated separately though
>> - * and use _notrace plpar_hcall variants, see yield_to_preempted().
>> + * recursion, but this always seems risky -- __trace_hcall_entry might be
>> + * ftraced, for example. So warn in this case.
> 
> __trace_hcall_[entry|exit] aren't traced anymore since they now have the 
> 'notrace' annotation.

Yes that's true I went back and added the other patch, so I should fix 
this comment.

>> + *
>> + * H_CONFER from spin locks must be treated separately though and use 
>> _notrace
>> + * plpar_hcall variants, see yield_to_preempted().
>>   */
>>  static DEFINE_PER_CPU(unsigned int, hcall_trace_depth);
>> 
>> @@ -1843,7 +1846,7 @@ notrace void __trace_hcall_entry(unsigned long opcode, 
>> unsigned long *args)
>> 
>>  depth = this_cpu_ptr(&hcall_trace_depth);
>> 
>> -if (*depth)
>> +if (WARN_ON_ONCE(*depth))
>>  goto out;
> 
> I don't think this will be helpful. The hcall trace depth tracking is 
> for the tracepoint and I suspect that this warning will be triggered 
> quite easily. Since we have recursion protection, I don't think we 
> should warn here.

What would trigger recursion?

Thanks,
Nick

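For context, the guard under discussion has the usual per-CPU depth shape;
roughly (simplified from the hunks above, irq/preempt handling omitted):

    static DEFINE_PER_CPU(unsigned int, hcall_trace_depth);

    notrace void __trace_hcall_entry(unsigned long opcode, unsigned long *args)
    {
        unsigned int *depth = this_cpu_ptr(&hcall_trace_depth);

        if (*depth)             /* already inside the tracer: bail out */
            return;

        (*depth)++;
        trace_hcall_entry(opcode, args);  /* may itself make hcalls */
        (*depth)--;
    }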

Re: [PATCH 1/4] powerpc/pseries: Fix hcall tracing recursion in pv queued spinlocks

2021-04-30 Thread Nicholas Piggin
Excerpts from Naveen N. Rao's message of April 27, 2021 11:43 pm:
> Nicholas Piggin wrote:
>> The paravirt queued spinlock slow path adds itself to the queue then
>> calls pv_wait to wait for the lock to become free. This is implemented
>> by calling H_CONFER to donate cycles.
>> 
>> When hcall tracing is enabled, this H_CONFER call can lead to a spin
>> lock being taken in the tracing code, which results in the lock being
>> taken again; this also goes to the slow path because it queues
>> behind itself and so won't ever make progress.
>> 
>> An example trace of a deadlock:
>> 
>>   __pv_queued_spin_lock_slowpath
>>   trace_clock_global
>>   ring_buffer_lock_reserve
>>   trace_event_buffer_lock_reserve
>>   trace_event_buffer_reserve
>>   trace_event_raw_event_hcall_exit
>>   __trace_hcall_exit
>>   plpar_hcall_norets_trace
>>   __pv_queued_spin_lock_slowpath
>>   trace_clock_global
>>   ring_buffer_lock_reserve
>>   trace_event_buffer_lock_reserve
>>   trace_event_buffer_reserve
>>   trace_event_raw_event_rcu_dyntick
>>   rcu_irq_exit
>>   irq_exit
>>   __do_irq
>>   call_do_irq
>>   do_IRQ
>>   hardware_interrupt_common_virt
>> 
>> Fix this by introducing plpar_hcall_norets_notrace(), and using that to
>> make SPLPAR virtual processor dispatching hcalls by the paravirt
>> spinlock code.
>> 
>> Fixes: 20c0e8269e9d ("powerpc/pseries: Implement paravirt qspinlocks for 
>> SPLPAR")
>> Signed-off-by: Nicholas Piggin 
>> ---
>>  arch/powerpc/include/asm/hvcall.h   |  3 +++
>>  arch/powerpc/include/asm/paravirt.h | 22 +++---
>>  arch/powerpc/platforms/pseries/hvCall.S | 10 ++
>>  arch/powerpc/platforms/pseries/lpar.c   |  4 ++--
>>  4 files changed, 34 insertions(+), 5 deletions(-)
> 
> Thanks for the fix! Some very minor nits below, but none the less:
> Reviewed-by: Naveen N. Rao 
> 
>> 
>> diff --git a/arch/powerpc/include/asm/hvcall.h 
>> b/arch/powerpc/include/asm/hvcall.h
>> index ed6086d57b22..0c92b01a3c3c 100644
>> --- a/arch/powerpc/include/asm/hvcall.h
>> +++ b/arch/powerpc/include/asm/hvcall.h
>> @@ -446,6 +446,9 @@
>>   */
>>  long plpar_hcall_norets(unsigned long opcode, ...);
>> 
>> +/* Variant which does not do hcall tracing */
>> +long plpar_hcall_norets_notrace(unsigned long opcode, ...);
>> +
>>  /**
>>   * plpar_hcall: - Make a pseries hypervisor call
>>   * @opcode: The hypervisor call to make.
>> diff --git a/arch/powerpc/include/asm/paravirt.h 
>> b/arch/powerpc/include/asm/paravirt.h
>> index 5d1726bb28e7..3c13c2ec70a9 100644
>> --- a/arch/powerpc/include/asm/paravirt.h
>> +++ b/arch/powerpc/include/asm/paravirt.h
>> @@ -30,17 +30,33 @@ static inline u32 yield_count_of(int cpu)
>> 
>>  static inline void yield_to_preempted(int cpu, u32 yield_count)
>>  {
> 
> It looks like yield_to_preempted() is only used by simple spin locks 
> today. I wonder if it makes more sense to put the below comment in 
> yield_to_any() which is used by the qspinlock code.

Yeah, I just put it above the functions entirely because it refers to 
all of them.

> 
>> -plpar_hcall_norets(H_CONFER, get_hard_smp_processor_id(cpu), 
>> yield_count);
>> +/*
>> + * Spinlock code yields and prods, so don't trace the hcalls because
>> + * tracing code takes spinlocks which could recurse.
>> + *
>> + * These calls are made while the lock is not held, the lock slowpath
>> + * yields if it can not acquire the lock, and unlock slow path might
>> + * prod if a waiter has yielded). So this did not seem to be a problem
>> + * for simple spin locks because technically it didn't recuse on the
>  ^^
>  recurse
> 
>> + * lock. However the queued spin lock contended path is more strictly
>> + * ordered: the H_CONFER hcall is made after the task has queued itself
>> + * on the lock, so then recursing on the lock will queue up behind that
>> + * (or worse: queued spinlocks uses tricks that assume a context never
>> + * waits on more than one spinlock, so that may cause random
>> + * corruption).
>> + */
>> +plpar_hcall_norets_notrace(H_CONFER,
>> +   get_hard_smp_processor_id(cpu), yield_count);
> 
> This can all be on a single line.

Should it though? Linux in general allegedly changed to 100 column 
lines for checkpatch, but it seems to still be frowned upon to go
beyond 80 deliberately. What about arch/powerpc?

Thanks,
Nick

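For comparison, the qspinlock wait path that motivated the fix confers
through the untraced variant; a sketch based on the patch and the
yield_to_any() mentioned above (the exact H_CONFER arguments are an
assumption):

    /* Confer our cycles to any other vCPU of the partition while queued
     * on a lock. The _notrace variant means no tracepoint fires here, so
     * the tracing code cannot take a spinlock and queue behind us. */
    static inline void yield_to_any(void)
    {
        plpar_hcall_norets_notrace(H_CONFER, -1, 0);
    }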

Re: [PATCH v3] pseries/drmem: update LMBs after LPM

2021-04-30 Thread Tyrel Datwyler
On 4/30/21 9:13 AM, Laurent Dufour wrote:
> On 29/04/2021 at 21:12, Tyrel Datwyler wrote:
>> On 4/29/21 3:27 AM, Aneesh Kumar K.V wrote:
>>> Laurent Dufour  writes:
>>>
 After an LPM, the device tree node ibm,dynamic-reconfiguration-memory may be
 updated by the hypervisor in the case the NUMA topology of the LPAR's
 memory is updated.

 This is caught by the kernel, but the memory's node is not updated because
 there is no way to move a memory block between nodes.

 If later a memory block is added or removed, drmem_update_dt() is called
 and it is overwriting the DT node to match the added or removed LMB. But
 the LMB's associativity node has not been updated after the DT node update
 and thus the node is overwritten by the Linux's topology instead of the
 hypervisor one.

 Introduce a hook called when the ibm,dynamic-reconfiguration-memory node is
 updated to force an update of the LMB's associativity.

 Cc: Tyrel Datwyler 
 Signed-off-by: Laurent Dufour 
 ---

 V3:
   - Check rd->dn->name instead of rd->dn->full_name
 V2:
   - Take Tyrel's idea to rely on OF_RECONFIG_UPDATE_PROPERTY instead of
   introducing a new hook mechanism.
 ---
   arch/powerpc/include/asm/drmem.h  |  1 +
   arch/powerpc/mm/drmem.c   | 35 +++
   .../platforms/pseries/hotplug-memory.c    |  4 +++
   3 files changed, 40 insertions(+)

 diff --git a/arch/powerpc/include/asm/drmem.h
 b/arch/powerpc/include/asm/drmem.h
 index bf2402fed3e0..4265d5e95c2c 100644
 --- a/arch/powerpc/include/asm/drmem.h
 +++ b/arch/powerpc/include/asm/drmem.h
 @@ -111,6 +111,7 @@ int drmem_update_dt(void);
   int __init
   walk_drmem_lmbs_early(unsigned long node, void *data,
     int (*func)(struct drmem_lmb *, const __be32 **, void *));
 +void drmem_update_lmbs(struct property *prop);
   #endif
     static inline void invalidate_lmb_associativity_index(struct drmem_lmb
 *lmb)
 diff --git a/arch/powerpc/mm/drmem.c b/arch/powerpc/mm/drmem.c
 index 9af3832c9d8d..f0a6633132af 100644
 --- a/arch/powerpc/mm/drmem.c
 +++ b/arch/powerpc/mm/drmem.c
 @@ -307,6 +307,41 @@ int __init walk_drmem_lmbs_early(unsigned long node,
 void *data,
   return ret;
   }
   +/*
 + * Update the LMB associativity index.
 + */
 +static int update_lmb(struct drmem_lmb *updated_lmb,
 +  __maybe_unused const __be32 **usm,
 +  __maybe_unused void *data)
 +{
 +    struct drmem_lmb *lmb;
 +
 +    /*
 +    * Brute force: there may be a better way to fetch the LMB
 + */
 +    for_each_drmem_lmb(lmb) {
 +    if (lmb->drc_index != updated_lmb->drc_index)
 +    continue;
 +
 +    lmb->aa_index = updated_lmb->aa_index;
 +    break;
 +    }
 +    return 0;
 +}
 +
 +/*
 + * Update the LMB associativity index.
 + *
 + * This needs to be called when the hypervisor is updating the
 + * dynamic-reconfiguration-memory node property.
 + */
 +void drmem_update_lmbs(struct property *prop)
 +{
 +    if (!strcmp(prop->name, "ibm,dynamic-memory"))
 +    __walk_drmem_v1_lmbs(prop->value, NULL, NULL, update_lmb);
 +    else if (!strcmp(prop->name, "ibm,dynamic-memory-v2"))
 +    __walk_drmem_v2_lmbs(prop->value, NULL, NULL, update_lmb);
 +}
   #endif
     static int init_drmem_lmb_size(struct device_node *dn)
 diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c
 b/arch/powerpc/platforms/pseries/hotplug-memory.c
 index 8377f1f7c78e..672ffbee2e78 100644
 --- a/arch/powerpc/platforms/pseries/hotplug-memory.c
 +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
 @@ -949,6 +949,10 @@ static int pseries_memory_notifier(struct
 notifier_block *nb,
   case OF_RECONFIG_DETACH_NODE:
   err = pseries_remove_mem_node(rd->dn);
   break;
 +    case OF_RECONFIG_UPDATE_PROPERTY:
 +    if (!strcmp(rd->dn->name,
 +    "ibm,dynamic-reconfiguration-memory"))
 +    drmem_update_lmbs(rd->prop);
   }
   return notifier_from_errno(err);
>>>
>>> How will this interact with DLPAR memory? When we dlpar memory,
>>> ibm,configure-connector is used to fetch the new associativity details
>>> and set drmem_lmb->aa_index correctly there. Once that is done kernel
>>> then call drmem_update_dt() which will result in the above notifier
>>> callback?
>>>
>>> IIUC, the call back then will update drmem_lmb->aa_index again?
>>
>> After digging through some of this code I'm a bit concerned about all the 
>> kernel
>> device tree manipulation around memory DLPAR both with the assoc-lookup-array
>> prop update and post dynamic-memory prop updating. We build a 

[PATCH v3 2/3] audit: add support for the openat2 syscall

2021-04-30 Thread Richard Guy Briggs
The openat2(2) syscall was added in kernel v5.6 with commit fddb5d430ad9
("open: introduce openat2(2) syscall")

Add the openat2(2) syscall to the audit syscall classifier.

See the github issue
https://github.com/linux-audit/audit-kernel/issues/67

Signed-off-by: Richard Guy Briggs 
---
 arch/alpha/kernel/audit.c  | 2 ++
 arch/ia64/kernel/audit.c   | 2 ++
 arch/parisc/kernel/audit.c | 2 ++
 arch/parisc/kernel/compat_audit.c  | 2 ++
 arch/powerpc/kernel/audit.c| 2 ++
 arch/powerpc/kernel/compat_audit.c | 2 ++
 arch/s390/kernel/audit.c   | 2 ++
 arch/s390/kernel/compat_audit.c| 2 ++
 arch/sparc/kernel/audit.c  | 2 ++
 arch/sparc/kernel/compat_audit.c   | 2 ++
 arch/x86/ia32/audit.c  | 2 ++
 arch/x86/kernel/audit_64.c | 2 ++
 include/linux/auditscm.h   | 1 +
 kernel/auditsc.c   | 3 +++
 lib/audit.c| 4 
 lib/compat_audit.c | 4 
 16 files changed, 36 insertions(+)

diff --git a/arch/alpha/kernel/audit.c b/arch/alpha/kernel/audit.c
index 81cbd804e375..3ab04709784a 100644
--- a/arch/alpha/kernel/audit.c
+++ b/arch/alpha/kernel/audit.c
@@ -42,6 +42,8 @@ int audit_classify_syscall(int abi, unsigned syscall)
return AUDITSC_OPENAT;
case __NR_execve:
return AUDITSC_EXECVE;
+   case __NR_openat2:
+   return AUDITSC_OPENAT2;
default:
return AUDITSC_NATIVE;
}
diff --git a/arch/ia64/kernel/audit.c b/arch/ia64/kernel/audit.c
index dba6a74c9ab3..ec61f20ca61f 100644
--- a/arch/ia64/kernel/audit.c
+++ b/arch/ia64/kernel/audit.c
@@ -43,6 +43,8 @@ int audit_classify_syscall(int abi, unsigned syscall)
return AUDITSC_OPENAT;
case __NR_execve:
return AUDITSC_EXECVE;
+   case __NR_openat2:
+   return AUDITSC_OPENAT2;
default:
return AUDITSC_NATIVE;
}
diff --git a/arch/parisc/kernel/audit.c b/arch/parisc/kernel/audit.c
index 14244e83db75..f420b5552140 100644
--- a/arch/parisc/kernel/audit.c
+++ b/arch/parisc/kernel/audit.c
@@ -52,6 +52,8 @@ int audit_classify_syscall(int abi, unsigned syscall)
return AUDITSC_OPENAT;
case __NR_execve:
return AUDITSC_EXECVE;
+   case __NR_openat2:
+   return AUDITSC_OPENAT2;
default:
return AUDITSC_NATIVE;
}
diff --git a/arch/parisc/kernel/compat_audit.c 
b/arch/parisc/kernel/compat_audit.c
index 0c181bb39f34..02cfd9d1ebeb 100644
--- a/arch/parisc/kernel/compat_audit.c
+++ b/arch/parisc/kernel/compat_audit.c
@@ -36,6 +36,8 @@ int parisc32_classify_syscall(unsigned syscall)
return AUDITSC_OPENAT;
case __NR_execve:
return AUDITSC_EXECVE;
+   case __NR_openat2:
+   return AUDITSC_OPENAT2;
default:
return AUDITSC_COMPAT;
}
diff --git a/arch/powerpc/kernel/audit.c b/arch/powerpc/kernel/audit.c
index 6eb18ef77dff..1bcfca5fdf67 100644
--- a/arch/powerpc/kernel/audit.c
+++ b/arch/powerpc/kernel/audit.c
@@ -54,6 +54,8 @@ int audit_classify_syscall(int abi, unsigned syscall)
return AUDITSC_SOCKETCALL;
case __NR_execve:
return AUDITSC_EXECVE;
+   case __NR_openat2:
+   return AUDITSC_OPENAT2;
default:
return AUDITSC_NATIVE;
}
diff --git a/arch/powerpc/kernel/compat_audit.c 
b/arch/powerpc/kernel/compat_audit.c
index f250777f6365..1fa0c902be8a 100644
--- a/arch/powerpc/kernel/compat_audit.c
+++ b/arch/powerpc/kernel/compat_audit.c
@@ -39,6 +39,8 @@ int ppc32_classify_syscall(unsigned syscall)
return AUDITSC_SOCKETCALL;
case __NR_execve:
return AUDITSC_EXECVE;
+   case __NR_openat2:
+   return AUDITSC_OPENAT2;
default:
return AUDITSC_COMPAT;
}
diff --git a/arch/s390/kernel/audit.c b/arch/s390/kernel/audit.c
index 7e331e1831d4..02051a596b87 100644
--- a/arch/s390/kernel/audit.c
+++ b/arch/s390/kernel/audit.c
@@ -54,6 +54,8 @@ int audit_classify_syscall(int abi, unsigned syscall)
return AUDITSC_SOCKETCALL;
case __NR_execve:
return AUDITSC_EXECVE;
+   case __NR_openat2:
+   return AUDITSC_OPENAT2;
default:
return AUDITSC_NATIVE;
}
diff --git a/arch/s390/kernel/compat_audit.c b/arch/s390/kernel/compat_audit.c
index b2a2ed5d605a..320b5e7d96f0 100644
--- a/arch/s390/kernel/compat_audit.c
+++ b/arch/s390/kernel/compat_audit.c
@@ -40,6 +40,8 @@ int s390_classify_syscall(unsigned syscall)
return AUDITSC_SOCKETCALL;
case __NR_execve:
return AUDITSC_EXECVE;
+   case __NR_openat2:
+   return AUDITSC_OPENAT2;
default:
return AUDITSC_COMPAT;
}
diff --git a/arch/sparc/kernel/audit.c 

[PATCH v3 1/3] audit: replace magic audit syscall class numbers with macros

2021-04-30 Thread Richard Guy Briggs
Replace audit syscall class magic numbers with macros.

This required putting the macros into new header file
include/linux/auditscm.h since the syscall macros were included for both 64
bit and 32 bit in any compat code, causing redefinition warnings.

Signed-off-by: Richard Guy Briggs 
---
 MAINTAINERS|  1 +
 arch/alpha/kernel/audit.c  |  8 
 arch/ia64/kernel/audit.c   |  8 
 arch/parisc/kernel/audit.c |  8 
 arch/parisc/kernel/compat_audit.c  |  9 +
 arch/powerpc/kernel/audit.c| 10 +-
 arch/powerpc/kernel/compat_audit.c | 11 ++-
 arch/s390/kernel/audit.c   | 10 +-
 arch/s390/kernel/compat_audit.c| 11 ++-
 arch/sparc/kernel/audit.c  | 10 +-
 arch/sparc/kernel/compat_audit.c   | 11 ++-
 arch/x86/ia32/audit.c  | 11 ++-
 arch/x86/kernel/audit_64.c |  8 
 include/linux/audit.h  |  1 +
 include/linux/auditscm.h   | 23 +++
 kernel/auditsc.c   | 12 ++--
 lib/audit.c| 10 +-
 lib/compat_audit.c | 11 ++-
 18 files changed, 102 insertions(+), 71 deletions(-)
 create mode 100644 include/linux/auditscm.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 1249655459d3..2db1dc94888f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2981,6 +2981,7 @@ W:https://github.com/linux-audit
 T: git git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit.git
 F: include/asm-generic/audit_*.h
 F: include/linux/audit.h
+F: include/linux/auditscm.h
 F: include/uapi/linux/audit.h
 F: kernel/audit*
 F: lib/*audit.c
diff --git a/arch/alpha/kernel/audit.c b/arch/alpha/kernel/audit.c
index 96a9d18ff4c4..81cbd804e375 100644
--- a/arch/alpha/kernel/audit.c
+++ b/arch/alpha/kernel/audit.c
@@ -37,13 +37,13 @@ int audit_classify_syscall(int abi, unsigned syscall)
 {
switch(syscall) {
case __NR_open:
-   return 2;
+   return AUDITSC_OPEN;
case __NR_openat:
-   return 3;
+   return AUDITSC_OPENAT;
case __NR_execve:
-   return 5;
+   return AUDITSC_EXECVE;
default:
-   return 0;
+   return AUDITSC_NATIVE;
}
 }
 
diff --git a/arch/ia64/kernel/audit.c b/arch/ia64/kernel/audit.c
index 5192ca899fe6..dba6a74c9ab3 100644
--- a/arch/ia64/kernel/audit.c
+++ b/arch/ia64/kernel/audit.c
@@ -38,13 +38,13 @@ int audit_classify_syscall(int abi, unsigned syscall)
 {
switch(syscall) {
case __NR_open:
-   return 2;
+   return AUDITSC_OPEN;
case __NR_openat:
-   return 3;
+   return AUDITSC_OPENAT;
case __NR_execve:
-   return 5;
+   return AUDITSC_EXECVE;
default:
-   return 0;
+   return AUDITSC_NATIVE;
}
 }
 
diff --git a/arch/parisc/kernel/audit.c b/arch/parisc/kernel/audit.c
index 9eb47b2225d2..14244e83db75 100644
--- a/arch/parisc/kernel/audit.c
+++ b/arch/parisc/kernel/audit.c
@@ -47,13 +47,13 @@ int audit_classify_syscall(int abi, unsigned syscall)
 #endif
switch (syscall) {
case __NR_open:
-   return 2;
+   return AUDITSC_OPEN;
case __NR_openat:
-   return 3;
+   return AUDITSC_OPENAT;
case __NR_execve:
-   return 5;
+   return AUDITSC_EXECVE;
default:
-   return 0;
+   return AUDITSC_NATIVE;
}
 }
 
diff --git a/arch/parisc/kernel/compat_audit.c 
b/arch/parisc/kernel/compat_audit.c
index 20c39c9d86a9..0c181bb39f34 100644
--- a/arch/parisc/kernel/compat_audit.c
+++ b/arch/parisc/kernel/compat_audit.c
@@ -1,4 +1,5 @@
 // SPDX-License-Identifier: GPL-2.0
+#include <linux/auditscm.h>
#include <asm/unistd.h>
 
 unsigned int parisc32_dir_class[] = {
@@ -30,12 +31,12 @@ int parisc32_classify_syscall(unsigned syscall)
 {
switch (syscall) {
case __NR_open:
-   return 2;
+   return AUDITSC_OPEN;
case __NR_openat:
-   return 3;
+   return AUDITSC_OPENAT;
case __NR_execve:
-   return 5;
+   return AUDITSC_EXECVE;
default:
-   return 1;
+   return AUDITSC_COMPAT;
}
 }
diff --git a/arch/powerpc/kernel/audit.c b/arch/powerpc/kernel/audit.c
index a27f3d09..6eb18ef77dff 100644
--- a/arch/powerpc/kernel/audit.c
+++ b/arch/powerpc/kernel/audit.c
@@ -47,15 +47,15 @@ int audit_classify_syscall(int abi, unsigned syscall)
 #endif
switch(syscall) {
case __NR_open:
-   return 2;
+   return AUDITSC_OPEN;
case __NR_openat:
-   return 3;
+   return AUDITSC_OPENAT;
case __NR_socketcall:
-   return 4;
+   return AUDITSC_SOCKETCALL;

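The diff is cut off here in the archive, but the class values are pinned
down by the one-for-one replacements above (0 native, 1 compat, 2 open,
3 openat, 4 socketcall, 5 execve), so the new header plausibly reduces to a
sketch like this (guards and comments in the posted auditscm.h may differ):

    /* include/linux/auditscm.h -- reconstructed sketch */
    #ifndef _LINUX_AUDITSCM_H_
    #define _LINUX_AUDITSCM_H_

    enum auditsc_class_t {
        AUDITSC_NATIVE = 0,
        AUDITSC_COMPAT,         /* 1 */
        AUDITSC_OPEN,           /* 2 */
        AUDITSC_OPENAT,         /* 3 */
        AUDITSC_SOCKETCALL,     /* 4 */
        AUDITSC_EXECVE,         /* 5 */
        AUDITSC_OPENAT2,        /* added by patch 2/3 */
        AUDITSC_NVALS           /* count */
    };

    #endif
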
[PATCH v3 0/3] audit: add support for openat2

2021-04-30 Thread Richard Guy Briggs
The openat2(2) syscall was added in v5.6.  Add support for openat2 to the
audit syscall classifier and for recording openat2 parameters that cannot
be captured in the syscall parameters of the SYSCALL record.

Supporting userspace code can be found in
https://github.com/rgbriggs/audit-userspace/tree/ghau-openat2

Supporting test case can be found in
https://github.com/linux-audit/audit-testsuite/pull/103

Changelog:
v3:
- re-add commit descriptions that somehow got dropped
- add new file to MAINTAINERS

v2:
- add include/linux/auditscm.h for audit syscall class macros due to syscall 
redefinition warnings:
arch/x86/ia32/audit.c:3:
./include/linux/audit.h:12,
./include/linux/sched.h:22,
./include/linux/seccomp.h:21,
./arch/x86/include/asm/seccomp.h:5,
./arch/x86/include/asm/unistd.h:20,
./arch/x86/include/generated/uapi/asm/unistd_64.h:4: warning: 
"__NR_read" redefined #define __NR_read 0
...
./arch/x86/include/generated/uapi/asm/unistd_64.h:338: warning: 
"__NR_rseq" redefined #define __NR_rseq 334
previous:
arch/x86/ia32/audit.c:2:
./arch/x86/include/generated/uapi/asm/unistd_32.h:7: note: this is the 
location of the previous definition #define __NR_read 3 
 
...
./arch/x86/include/generated/uapi/asm/unistd_32.h:386: note: this is 
the location of the previous definition #define __NR_rseq 386

Richard Guy Briggs (3):
  audit: replace magic audit syscall class numbers with macros
  audit: add support for the openat2 syscall
  audit: add OPENAT2 record to list how

 MAINTAINERS|  1 +
 arch/alpha/kernel/audit.c  | 10 ++
 arch/ia64/kernel/audit.c   | 10 ++
 arch/parisc/kernel/audit.c | 10 ++
 arch/parisc/kernel/compat_audit.c  | 11 +++
 arch/powerpc/kernel/audit.c| 12 +++-
 arch/powerpc/kernel/compat_audit.c | 13 -
 arch/s390/kernel/audit.c   | 12 +++-
 arch/s390/kernel/compat_audit.c| 13 -
 arch/sparc/kernel/audit.c  | 12 +++-
 arch/sparc/kernel/compat_audit.c   | 13 -
 arch/x86/ia32/audit.c  | 13 -
 arch/x86/kernel/audit_64.c | 10 ++
 fs/open.c  |  2 ++
 include/linux/audit.h  | 11 +++
 include/linux/auditscm.h   | 24 +++
 include/uapi/linux/audit.h |  1 +
 kernel/audit.h |  2 ++
 kernel/auditsc.c   | 31 --
 lib/audit.c| 14 +-
 lib/compat_audit.c | 15 ++-
 21 files changed, 169 insertions(+), 71 deletions(-)
 create mode 100644 include/linux/auditscm.h

-- 
2.27.0



Re: [GIT PULL] Please pull powerpc/linux.git powerpc-5.13-1 tag

2021-04-30 Thread pr-tracker-bot
The pull request you sent on Fri, 30 Apr 2021 14:02:32 +1000:

> https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git 
> tags/powerpc-5.13-1

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/c70a4be130de333ea079c59da41cc959712bb01c

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html


Re: [PATCH v2 0/3] audit: add support for openat2

2021-04-30 Thread Richard Guy Briggs
On 2021-04-30 13:29, Richard Guy Briggs wrote:
> The openat2(2) syscall was added in v5.6.  Add support for openat2 to the
> audit syscall classifier and for recording openat2 parameters that cannot
> be captured in the syscall parameters of the SYSCALL record.

Well, that was a bit premature...  Commit descriptions in each of the
patches might be a good idea...  Somehow they got dropped from V1.  I
guess they seemed obvious to me.  :-)  Changelog might be a nice
addition too...  Sorry for the noise.

> Supporting userspace code can be found in
> https://github.com/rgbriggs/audit-userspace/tree/ghau-openat2
> 
> Supporting test case can be found in
> https://github.com/linux-audit/audit-testsuite/pull/103
> 
> Richard Guy Briggs (3):
>   audit: replace magic audit syscall class numbers with macros
>   audit: add support for the openat2 syscall
>   audit: add OPENAT2 record to list how
> 
>  arch/alpha/kernel/audit.c  | 10 ++
>  arch/ia64/kernel/audit.c   | 10 ++
>  arch/parisc/kernel/audit.c | 10 ++
>  arch/parisc/kernel/compat_audit.c  | 11 +++
>  arch/powerpc/kernel/audit.c| 12 +++-
>  arch/powerpc/kernel/compat_audit.c | 13 -
>  arch/s390/kernel/audit.c   | 12 +++-
>  arch/s390/kernel/compat_audit.c| 13 -
>  arch/sparc/kernel/audit.c  | 12 +++-
>  arch/sparc/kernel/compat_audit.c   | 13 -
>  arch/x86/ia32/audit.c  | 13 -
>  arch/x86/kernel/audit_64.c | 10 ++
>  fs/open.c  |  2 ++
>  include/linux/audit.h  | 11 +++
>  include/linux/auditscm.h   | 24 +++
>  include/uapi/linux/audit.h |  1 +
>  kernel/audit.h |  2 ++
>  kernel/auditsc.c   | 31 --
>  lib/audit.c| 14 +-
>  lib/compat_audit.c | 15 ++-
>  20 files changed, 168 insertions(+), 71 deletions(-)
>  create mode 100644 include/linux/auditscm.h
> 
> -- 
> 2.27.0
> 

- RGB

--
Richard Guy Briggs 
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635



[PATCH v2 2/3] audit: add support for the openat2 syscall

2021-04-30 Thread Richard Guy Briggs
Signed-off-by: Richard Guy Briggs 
---
 arch/alpha/kernel/audit.c  | 2 ++
 arch/ia64/kernel/audit.c   | 2 ++
 arch/parisc/kernel/audit.c | 2 ++
 arch/parisc/kernel/compat_audit.c  | 2 ++
 arch/powerpc/kernel/audit.c| 2 ++
 arch/powerpc/kernel/compat_audit.c | 2 ++
 arch/s390/kernel/audit.c   | 2 ++
 arch/s390/kernel/compat_audit.c| 2 ++
 arch/sparc/kernel/audit.c  | 2 ++
 arch/sparc/kernel/compat_audit.c   | 2 ++
 arch/x86/ia32/audit.c  | 2 ++
 arch/x86/kernel/audit_64.c | 2 ++
 include/linux/auditscm.h   | 1 +
 kernel/auditsc.c   | 3 +++
 lib/audit.c| 4 
 lib/compat_audit.c | 4 
 16 files changed, 36 insertions(+)

diff --git a/arch/alpha/kernel/audit.c b/arch/alpha/kernel/audit.c
index 81cbd804e375..3ab04709784a 100644
--- a/arch/alpha/kernel/audit.c
+++ b/arch/alpha/kernel/audit.c
@@ -42,6 +42,8 @@ int audit_classify_syscall(int abi, unsigned syscall)
return AUDITSC_OPENAT;
case __NR_execve:
return AUDITSC_EXECVE;
+   case __NR_openat2:
+   return AUDITSC_OPENAT2;
default:
return AUDITSC_NATIVE;
}
diff --git a/arch/ia64/kernel/audit.c b/arch/ia64/kernel/audit.c
index dba6a74c9ab3..ec61f20ca61f 100644
--- a/arch/ia64/kernel/audit.c
+++ b/arch/ia64/kernel/audit.c
@@ -43,6 +43,8 @@ int audit_classify_syscall(int abi, unsigned syscall)
return AUDITSC_OPENAT;
case __NR_execve:
return AUDITSC_EXECVE;
+   case __NR_openat2:
+   return AUDITSC_OPENAT2;
default:
return AUDITSC_NATIVE;
}
diff --git a/arch/parisc/kernel/audit.c b/arch/parisc/kernel/audit.c
index 14244e83db75..f420b5552140 100644
--- a/arch/parisc/kernel/audit.c
+++ b/arch/parisc/kernel/audit.c
@@ -52,6 +52,8 @@ int audit_classify_syscall(int abi, unsigned syscall)
return AUDITSC_OPENAT;
case __NR_execve:
return AUDITSC_EXECVE;
+   case __NR_openat2:
+   return AUDITSC_OPENAT2;
default:
return AUDITSC_NATIVE;
}
diff --git a/arch/parisc/kernel/compat_audit.c 
b/arch/parisc/kernel/compat_audit.c
index 0c181bb39f34..02cfd9d1ebeb 100644
--- a/arch/parisc/kernel/compat_audit.c
+++ b/arch/parisc/kernel/compat_audit.c
@@ -36,6 +36,8 @@ int parisc32_classify_syscall(unsigned syscall)
return AUDITSC_OPENAT;
case __NR_execve:
return AUDITSC_EXECVE;
+   case __NR_openat2:
+   return AUDITSC_OPENAT2;
default:
return AUDITSC_COMPAT;
}
diff --git a/arch/powerpc/kernel/audit.c b/arch/powerpc/kernel/audit.c
index 6eb18ef77dff..1bcfca5fdf67 100644
--- a/arch/powerpc/kernel/audit.c
+++ b/arch/powerpc/kernel/audit.c
@@ -54,6 +54,8 @@ int audit_classify_syscall(int abi, unsigned syscall)
return AUDITSC_SOCKETCALL;
case __NR_execve:
return AUDITSC_EXECVE;
+   case __NR_openat2:
+   return AUDITSC_OPENAT2;
default:
return AUDITSC_NATIVE;
}
diff --git a/arch/powerpc/kernel/compat_audit.c 
b/arch/powerpc/kernel/compat_audit.c
index f250777f6365..1fa0c902be8a 100644
--- a/arch/powerpc/kernel/compat_audit.c
+++ b/arch/powerpc/kernel/compat_audit.c
@@ -39,6 +39,8 @@ int ppc32_classify_syscall(unsigned syscall)
return AUDITSC_SOCKETCALL;
case __NR_execve:
return AUDITSC_EXECVE;
+   case __NR_openat2:
+   return AUDITSC_OPENAT2;
default:
return AUDITSC_COMPAT;
}
diff --git a/arch/s390/kernel/audit.c b/arch/s390/kernel/audit.c
index 7e331e1831d4..02051a596b87 100644
--- a/arch/s390/kernel/audit.c
+++ b/arch/s390/kernel/audit.c
@@ -54,6 +54,8 @@ int audit_classify_syscall(int abi, unsigned syscall)
return AUDITSC_SOCKETCALL;
case __NR_execve:
return AUDITSC_EXECVE;
+   case __NR_openat2:
+   return AUDITSC_OPENAT2;
default:
return AUDITSC_NATIVE;
}
diff --git a/arch/s390/kernel/compat_audit.c b/arch/s390/kernel/compat_audit.c
index b2a2ed5d605a..320b5e7d96f0 100644
--- a/arch/s390/kernel/compat_audit.c
+++ b/arch/s390/kernel/compat_audit.c
@@ -40,6 +40,8 @@ int s390_classify_syscall(unsigned syscall)
return AUDITSC_SOCKETCALL;
case __NR_execve:
return AUDITSC_EXECVE;
+   case __NR_openat2:
+   return AUDITSC_OPENAT2;
default:
return AUDITSC_COMPAT;
}
diff --git a/arch/sparc/kernel/audit.c b/arch/sparc/kernel/audit.c
index 50fab35bdaba..b092274eca79 100644
--- a/arch/sparc/kernel/audit.c
+++ b/arch/sparc/kernel/audit.c
@@ -55,6 +55,8 @@ int audit_classify_syscall(int abi, unsigned int syscall)
return AUDITSC_SOCKETCALL;

[PATCH v2 1/3] audit: replace magic audit syscall class numbers with macros

2021-04-30 Thread Richard Guy Briggs
Replace the magic numbers used to indicate audit syscall classes with macros.

Signed-off-by: Richard Guy Briggs 
---
 arch/alpha/kernel/audit.c  |  8 
 arch/ia64/kernel/audit.c   |  8 
 arch/parisc/kernel/audit.c |  8 
 arch/parisc/kernel/compat_audit.c  |  9 +
 arch/powerpc/kernel/audit.c| 10 +-
 arch/powerpc/kernel/compat_audit.c | 11 ++-
 arch/s390/kernel/audit.c   | 10 +-
 arch/s390/kernel/compat_audit.c| 11 ++-
 arch/sparc/kernel/audit.c  | 10 +-
 arch/sparc/kernel/compat_audit.c   | 11 ++-
 arch/x86/ia32/audit.c  | 11 ++-
 arch/x86/kernel/audit_64.c |  8 
 include/linux/audit.h  |  1 +
 include/linux/auditscm.h   | 23 +++
 kernel/auditsc.c   | 12 ++--
 lib/audit.c| 10 +-
 lib/compat_audit.c | 11 ++-
 17 files changed, 101 insertions(+), 71 deletions(-)
 create mode 100644 include/linux/auditscm.h

diff --git a/arch/alpha/kernel/audit.c b/arch/alpha/kernel/audit.c
index 96a9d18ff4c4..81cbd804e375 100644
--- a/arch/alpha/kernel/audit.c
+++ b/arch/alpha/kernel/audit.c
@@ -37,13 +37,13 @@ int audit_classify_syscall(int abi, unsigned syscall)
 {
switch(syscall) {
case __NR_open:
-   return 2;
+   return AUDITSC_OPEN;
case __NR_openat:
-   return 3;
+   return AUDITSC_OPENAT;
case __NR_execve:
-   return 5;
+   return AUDITSC_EXECVE;
default:
-   return 0;
+   return AUDITSC_NATIVE;
}
 }
 
diff --git a/arch/ia64/kernel/audit.c b/arch/ia64/kernel/audit.c
index 5192ca899fe6..dba6a74c9ab3 100644
--- a/arch/ia64/kernel/audit.c
+++ b/arch/ia64/kernel/audit.c
@@ -38,13 +38,13 @@ int audit_classify_syscall(int abi, unsigned syscall)
 {
switch(syscall) {
case __NR_open:
-   return 2;
+   return AUDITSC_OPEN;
case __NR_openat:
-   return 3;
+   return AUDITSC_OPENAT;
case __NR_execve:
-   return 5;
+   return AUDITSC_EXECVE;
default:
-   return 0;
+   return AUDITSC_NATIVE;
}
 }
 
diff --git a/arch/parisc/kernel/audit.c b/arch/parisc/kernel/audit.c
index 9eb47b2225d2..14244e83db75 100644
--- a/arch/parisc/kernel/audit.c
+++ b/arch/parisc/kernel/audit.c
@@ -47,13 +47,13 @@ int audit_classify_syscall(int abi, unsigned syscall)
 #endif
switch (syscall) {
case __NR_open:
-   return 2;
+   return AUDITSC_OPEN;
case __NR_openat:
-   return 3;
+   return AUDITSC_OPENAT;
case __NR_execve:
-   return 5;
+   return AUDITSC_EXECVE;
default:
-   return 0;
+   return AUDITSC_NATIVE;
}
 }
 
diff --git a/arch/parisc/kernel/compat_audit.c 
b/arch/parisc/kernel/compat_audit.c
index 20c39c9d86a9..0c181bb39f34 100644
--- a/arch/parisc/kernel/compat_audit.c
+++ b/arch/parisc/kernel/compat_audit.c
@@ -1,4 +1,5 @@
 // SPDX-License-Identifier: GPL-2.0
+#include <linux/auditscm.h>
#include <asm/unistd.h>
 
 unsigned int parisc32_dir_class[] = {
@@ -30,12 +31,12 @@ int parisc32_classify_syscall(unsigned syscall)
 {
switch (syscall) {
case __NR_open:
-   return 2;
+   return AUDITSC_OPEN;
case __NR_openat:
-   return 3;
+   return AUDITSC_OPENAT;
case __NR_execve:
-   return 5;
+   return AUDITSC_EXECVE;
default:
-   return 1;
+   return AUDITSC_COMPAT;
}
 }
diff --git a/arch/powerpc/kernel/audit.c b/arch/powerpc/kernel/audit.c
index a27f3d09..6eb18ef77dff 100644
--- a/arch/powerpc/kernel/audit.c
+++ b/arch/powerpc/kernel/audit.c
@@ -47,15 +47,15 @@ int audit_classify_syscall(int abi, unsigned syscall)
 #endif
switch(syscall) {
case __NR_open:
-   return 2;
+   return AUDITSC_OPEN;
case __NR_openat:
-   return 3;
+   return AUDITSC_OPENAT;
case __NR_socketcall:
-   return 4;
+   return AUDITSC_SOCKETCALL;
case __NR_execve:
-   return 5;
+   return AUDITSC_EXECVE;
default:
-   return 0;
+   return AUDITSC_NATIVE;
}
 }
 
diff --git a/arch/powerpc/kernel/compat_audit.c 
b/arch/powerpc/kernel/compat_audit.c
index 55c6ccda0a85..f250777f6365 100644
--- a/arch/powerpc/kernel/compat_audit.c
+++ b/arch/powerpc/kernel/compat_audit.c
@@ -1,5 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0
 #undef __powerpc64__
+#include <linux/auditscm.h>
#include <asm/unistd.h>
 
 unsigned ppc32_dir_class[] = {
@@ -31,14 +32,14 @@ int ppc32_classify_syscall(unsigned syscall)
 {
switch(syscall) {
  

[PATCH v2 0/3] audit: add support for openat2

2021-04-30 Thread Richard Guy Briggs
The openat2(2) syscall was added in v5.6.  Add support for openat2 to the
audit syscall classifier and for recording openat2 parameters that cannot
be captured in the syscall parameters of the SYSCALL record.

Supporting userspace code can be found in
https://github.com/rgbriggs/audit-userspace/tree/ghau-openat2

Supporting test case can be found in
https://github.com/linux-audit/audit-testsuite/pull/103

Richard Guy Briggs (3):
  audit: replace magic audit syscall class numbers with macros
  audit: add support for the openat2 syscall
  audit: add OPENAT2 record to list how

 arch/alpha/kernel/audit.c  | 10 ++
 arch/ia64/kernel/audit.c   | 10 ++
 arch/parisc/kernel/audit.c | 10 ++
 arch/parisc/kernel/compat_audit.c  | 11 +++
 arch/powerpc/kernel/audit.c| 12 +++-
 arch/powerpc/kernel/compat_audit.c | 13 -
 arch/s390/kernel/audit.c   | 12 +++-
 arch/s390/kernel/compat_audit.c| 13 -
 arch/sparc/kernel/audit.c  | 12 +++-
 arch/sparc/kernel/compat_audit.c   | 13 -
 arch/x86/ia32/audit.c  | 13 -
 arch/x86/kernel/audit_64.c | 10 ++
 fs/open.c  |  2 ++
 include/linux/audit.h  | 11 +++
 include/linux/auditscm.h   | 24 +++
 include/uapi/linux/audit.h |  1 +
 kernel/audit.h |  2 ++
 kernel/auditsc.c   | 31 --
 lib/audit.c| 14 +-
 lib/compat_audit.c | 15 ++-
 20 files changed, 168 insertions(+), 71 deletions(-)
 create mode 100644 include/linux/auditscm.h

-- 
2.27.0



[PATCH v4 11/11] powerpc/pseries/iommu: Rename "direct window" to "dma window"

2021-04-30 Thread Leonardo Bras
A previous change introduced the usage of DDW as a bigger indirect DMA
mapping when the DDW available size does not map the whole partition.

As most of the code that manipulates direct mappings was reused for
indirect mappings, it's necessary to rename all names and debug/info
messages to reflect that it can be used for both kinds of mapping.

This should cause no behavioural change, just adjust naming.

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/platforms/pseries/iommu.c | 93 +-
 1 file changed, 48 insertions(+), 45 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index 572879af0211..ce7b841fb10f 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -355,7 +355,7 @@ struct dynamic_dma_window_prop {
__be32  window_shift;   /* ilog2(tce_window_size) */
 };
 
-struct direct_window {
+struct dma_win {
struct device_node *device;
const struct dynamic_dma_window_prop *prop;
struct list_head list;
@@ -375,11 +375,11 @@ struct ddw_create_response {
u32 addr_lo;
 };
 
-static LIST_HEAD(direct_window_list);
+static LIST_HEAD(dma_win_list);
 /* prevents races between memory on/offline and window creation */
-static DEFINE_SPINLOCK(direct_window_list_lock);
+static DEFINE_SPINLOCK(dma_win_list_lock);
 /* protects initializing window twice for same device */
-static DEFINE_MUTEX(direct_window_init_mutex);
+static DEFINE_MUTEX(dma_win_init_mutex);
 #define DIRECT64_PROPNAME "linux,direct64-ddr-window-info"
 #define DMA64_PROPNAME "linux,dma64-ddr-window-info"
 
@@ -712,7 +712,10 @@ static void pci_dma_bus_setup_pSeriesLP(struct pci_bus 
*bus)
pr_debug("pci_dma_bus_setup_pSeriesLP: setting up bus %pOF\n",
 dn);
 
-   /* Find nearest ibm,dma-window, walking up the device tree */
+   /*
+* Find nearest ibm,dma-window (default DMA window), walking up the
+* device tree
+*/
for (pdn = dn; pdn != NULL; pdn = pdn->parent) {
dma_window = of_get_property(pdn, "ibm,dma-window", NULL);
if (dma_window != NULL)
@@ -816,11 +819,11 @@ static void remove_dma_window(struct device_node *np, u32 
*ddw_avail,
 
ret = rtas_call(ddw_avail[DDW_REMOVE_PE_DMA_WIN], 1, 1, NULL, liobn);
if (ret)
-   pr_warn("%pOF: failed to remove direct window: rtas returned "
+   pr_warn("%pOF: failed to remove DMA window: rtas returned "
"%d to ibm,remove-pe-dma-window(%x) %llx\n",
np, ret, ddw_avail[DDW_REMOVE_PE_DMA_WIN], liobn);
else
-   pr_debug("%pOF: successfully removed direct window: rtas 
returned "
+   pr_debug("%pOF: successfully removed DMA window: rtas returned "
"%d to ibm,remove-pe-dma-window(%x) %llx\n",
np, ret, ddw_avail[DDW_REMOVE_PE_DMA_WIN], liobn);
 }
@@ -848,37 +851,37 @@ static int remove_ddw(struct device_node *np, bool 
remove_prop, const char *win_
 
ret = of_remove_property(np, win);
if (ret)
-   pr_warn("%pOF: failed to remove direct window property: %d\n",
+   pr_warn("%pOF: failed to remove DMA window property: %d\n",
np, ret);
return 0;
 }
 
 static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr, int 
*window_shift)
 {
-   struct direct_window *window;
-   const struct dynamic_dma_window_prop *direct64;
+   struct dma_win *window;
+   const struct dynamic_dma_window_prop *dma64;
bool found = false;
 
-   spin_lock(&direct_window_list_lock);
+   spin_lock(&dma_win_list_lock);
/* check if we already created a window and dupe that config if so */
-   list_for_each_entry(window, &direct_window_list, list) {
+   list_for_each_entry(window, &dma_win_list, list) {
if (window->device == pdn) {
-   direct64 = window->prop;
-   *dma_addr = be64_to_cpu(direct64->dma_base);
-   *window_shift = be32_to_cpu(direct64->window_shift);
+   dma64 = window->prop;
+   *dma_addr = be64_to_cpu(dma64->dma_base);
+   *window_shift = be32_to_cpu(dma64->window_shift);
found = true;
break;
}
}
-   spin_unlock(&direct_window_list_lock);
+   spin_unlock(&dma_win_list_lock);
 
return found;
 }
 
-static struct direct_window *ddw_list_new_entry(struct device_node *pdn,
-   const struct 
dynamic_dma_window_prop *dma64)
+static struct dma_win *ddw_list_new_entry(struct device_node *pdn,
+ const struct dynamic_dma_window_prop 
*dma64)
 {
-   struct direct_window *window;
+   struct dma_win *window;
 
window = 

[PATCH v4 10/11] powerpc/pseries/iommu: Make use of DDW for indirect mapping

2021-04-30 Thread Leonardo Bras
So far it's assumed possible to map the guest RAM 1:1 to the bus, which
works with a small number of devices. SRIOV changes this, as the user can
configure hundreds of VFs, and since phyp preallocates TCEs and does not
allow IOMMU pages bigger than 64K, it has to limit the number of TCEs
per PE to limit waste of physical pages.

As of today, if the assumed direct mapping is not possible, DDW creation
is skipped and the default DMA window "ibm,dma-window" is used instead.

By using DDW, indirect mapping can get more TCEs than available for the
default DMA window, and also gain access to much larger pagesizes
(16MB as implemented in qemu vs 4k from default DMA window), causing a
significant increase on the maximum amount of memory that can be IOMMU
mapped at the same time.

Indirect mapping will only be used if direct mapping is not a
possibility.

For indirect mapping, it's necessary to re-create the iommu_table with
the new DMA window parameters, so iommu_alloc() can use it.

Removing the default DMA window for using DDW with indirect mapping
is only allowed if there is no current IOMMU memory allocated in
the iommu_table. enable_ddw() is aborted otherwise.

Even though there won't be both direct and indirect mappings at the
same time, we can't reuse the DIRECT64_PROPNAME property name, or else
an older kexec()ed kernel can assume direct mapping, and skip
iommu_alloc(), causing undesirable behavior.
So a new property name DMA64_PROPNAME "linux,dma64-ddr-window-info"
was created to represent a DDW that does not allow direct mapping.

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/platforms/pseries/iommu.c | 87 +-
 1 file changed, 72 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index de54ddd9decd..572879af0211 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -53,6 +53,7 @@ enum {
DDW_EXT_QUERY_OUT_SIZE = 2
 };
 
+static phys_addr_t ddw_memory_hotplug_max(void);
 #ifdef CONFIG_IOMMU_API
 static int tce_exchange_pseries(struct iommu_table *tbl, long index, unsigned 
long *tce,
enum dma_data_direction *direction, bool 
realmode);
@@ -380,6 +381,7 @@ static DEFINE_SPINLOCK(direct_window_list_lock);
 /* protects initializing window twice for same device */
 static DEFINE_MUTEX(direct_window_init_mutex);
 #define DIRECT64_PROPNAME "linux,direct64-ddr-window-info"
+#define DMA64_PROPNAME "linux,dma64-ddr-window-info"
 
 static int tce_clearrange_multi_pSeriesLP(unsigned long start_pfn,
unsigned long num_pfn, const void *arg)
@@ -918,6 +920,7 @@ static int find_existing_ddw_windows(void)
return 0;
 
find_existing_ddw_windows_named(DIRECT64_PROPNAME);
+   find_existing_ddw_windows_named(DMA64_PROPNAME);
 
return 0;
 }
@@ -1207,10 +1210,13 @@ static bool enable_ddw(struct pci_dev *dev, struct 
device_node *pdn)
struct device_node *dn;
u32 ddw_avail[DDW_APPLICABLE_SIZE];
struct direct_window *window;
+   const char *win_name;
struct property *win64 = NULL;
struct failed_ddw_pdn *fpdn;
-   bool default_win_removed = false;
+   bool default_win_removed = false, direct_mapping = false;
bool pmem_present;
+   struct pci_dn *pci = PCI_DN(pdn);
+   struct iommu_table *tbl = pci->table_group->tables[0];
 
dn = of_find_node_by_type(NULL, "ibm,pmemory");
pmem_present = dn != NULL;
@@ -1218,8 +1224,12 @@ static bool enable_ddw(struct pci_dev *dev, struct 
device_node *pdn)
 
mutex_lock(&direct_window_init_mutex);
 
-   if (find_existing_ddw(pdn, &dev->dev.archdata.dma_offset, &len))
-   goto out_unlock;
+   if (find_existing_ddw(pdn, &dev->dev.archdata.dma_offset, &len)) {
+   direct_mapping = (len >= max_ram_len);
+
+   mutex_unlock(&direct_window_init_mutex);
+   return direct_mapping;
+   }
 
/*
 * If we already went through this for a previous function of
@@ -1298,7 +1308,6 @@ static bool enable_ddw(struct pci_dev *dev, struct 
device_node *pdn)
goto out_failed;
}
/* verify the window * number of ptes will map the partition */
-   /* check largest block * page size > max memory hotplug addr */
/*
 * The "ibm,pmemory" can appear anywhere in the address space.
 * Assuming it is still backed by page structs, try MAX_PHYSMEM_BITS
@@ -1320,6 +1329,17 @@ static bool enable_ddw(struct pci_dev *dev, struct 
device_node *pdn)
1ULL << len,
query.largest_available_block,
1ULL << page_shift);
+
+   len = order_base_2(query.largest_available_block << page_shift);
+   win_name = DMA64_PROPNAME;
+   } else {
+   direct_mapping = true;
+   win_name = DIRECT64_PROPNAME;

[PATCH v4 09/11] powerpc/pseries/iommu: Find existing DDW with given property name

2021-04-30 Thread Leonardo Bras
At the moment pseries stores information about the created directly-mapped
DDW window in DIRECT64_PROPNAME.

With the objective of implementing indirect DMA mapping with DDW, it's
necessary to have another property name to make sure kexec'ing into older
kernels does not break, as it would if we reuse DIRECT64_PROPNAME.

In order to have this, find_existing_ddw_windows() needs to be able to
look for different property names.

Extract find_existing_ddw_windows() into find_existing_ddw_windows_named()
and call it with the current property name.

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/platforms/pseries/iommu.c | 25 +++--
 1 file changed, 15 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index f8922fcf34b6..de54ddd9decd 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -888,24 +888,21 @@ static struct direct_window *ddw_list_new_entry(struct device_node *pdn,
return window;
 }
 
-static int find_existing_ddw_windows(void)
+static void find_existing_ddw_windows_named(const char *name)
 {
int len;
struct device_node *pdn;
struct direct_window *window;
-   const struct dynamic_dma_window_prop *direct64;
-
-   if (!firmware_has_feature(FW_FEATURE_LPAR))
-   return 0;
+   const struct dynamic_dma_window_prop *dma64;
 
-   for_each_node_with_property(pdn, DIRECT64_PROPNAME) {
-		direct64 = of_get_property(pdn, DIRECT64_PROPNAME, &len);
-   if (!direct64 || len < sizeof(*direct64)) {
-   remove_ddw(pdn, true, DIRECT64_PROPNAME);
+   for_each_node_with_property(pdn, name) {
+		dma64 = of_get_property(pdn, name, &len);
+   if (!dma64 || len < sizeof(*dma64)) {
+   remove_ddw(pdn, true, name);
continue;
}
 
-   window = ddw_list_new_entry(pdn, direct64);
+   window = ddw_list_new_entry(pdn, dma64);
if (!window)
break;
 
@@ -913,6 +910,14 @@ static int find_existing_ddw_windows(void)
 		list_add(&window->list, &direct_window_list);
 		spin_unlock(&direct_window_list_lock);
}
+}
+
+static int find_existing_ddw_windows(void)
+{
+   if (!firmware_has_feature(FW_FEATURE_LPAR))
+   return 0;
+
+   find_existing_ddw_windows_named(DIRECT64_PROPNAME);
 
return 0;
 }
-- 
2.30.2



[PATCH v4 08/11] powerpc/pseries/iommu: Update remove_dma_window() to accept property name

2021-04-30 Thread Leonardo Bras
Update remove_dma_window() so it can be used to remove DDW with a given
property name.

This enables the creation of new property names for DDW, so we can
have different usage for it, like indirect mapping.

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/platforms/pseries/iommu.c | 21 +++--
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 89cb6e9e9f31..f8922fcf34b6 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -823,31 +823,32 @@ static void remove_dma_window(struct device_node *np, u32 *ddw_avail,
np, ret, ddw_avail[DDW_REMOVE_PE_DMA_WIN], liobn);
 }
 
-static void remove_ddw(struct device_node *np, bool remove_prop)
+static int remove_ddw(struct device_node *np, bool remove_prop, const char *win_name)
 {
struct property *win;
u32 ddw_avail[DDW_APPLICABLE_SIZE];
int ret = 0;
 
+   win = of_find_property(np, win_name, NULL);
+   if (!win)
+   return -EINVAL;
+
 	ret = of_property_read_u32_array(np, "ibm,ddw-applicable",
 					 &ddw_avail[0], DDW_APPLICABLE_SIZE);
if (ret)
-   return;
-
-   win = of_find_property(np, DIRECT64_PROPNAME, NULL);
-   if (!win)
-   return;
+   return 0;
 
if (win->length >= sizeof(struct dynamic_dma_window_prop))
remove_dma_window(np, ddw_avail, win);
 
if (!remove_prop)
-   return;
+   return 0;
 
ret = of_remove_property(np, win);
if (ret)
pr_warn("%pOF: failed to remove direct window property: %d\n",
np, ret);
+   return 0;
 }
 
static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr, int *window_shift)
@@ -900,7 +901,7 @@ static int find_existing_ddw_windows(void)
for_each_node_with_property(pdn, DIRECT64_PROPNAME) {
 		direct64 = of_get_property(pdn, DIRECT64_PROPNAME, &len);
if (!direct64 || len < sizeof(*direct64)) {
-   remove_ddw(pdn, true);
+   remove_ddw(pdn, true, DIRECT64_PROPNAME);
continue;
}
 
@@ -1372,7 +1373,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
win64 = NULL;
 
 out_remove_win:
-   remove_ddw(pdn, true);
+   remove_ddw(pdn, true, DIRECT64_PROPNAME);
 
 out_failed:
if (default_win_removed)
@@ -1536,7 +1537,7 @@ static int iommu_reconfig_notifier(struct notifier_block *nb, unsigned long acti
 * we have to remove the property when releasing
 * the device node.
 */
-   remove_ddw(np, false);
+   remove_ddw(np, false, DIRECT64_PROPNAME);
if (pci && pci->table_group)
iommu_pseries_free_group(pci->table_group,
np->full_name);
-- 
2.30.2



[PATCH v4 07/11] powerpc/pseries/iommu: Reorganize iommu_table_setparms*() with new helper

2021-04-30 Thread Leonardo Bras
Add a new helper _iommu_table_setparms(), and use it in
iommu_table_setparms() and iommu_table_setparms_lpar() to avoid duplicated
code.

Also, setting tbl->it_ops was happening outside iommu_table_setparms*(),
so move it to the new helper. Since we need the iommu_table_ops to be
declared before used, move iommu_table_lpar_multi_ops and
iommu_table_pseries_ops to before their respective iommu_table_setparms*().

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/platforms/pseries/iommu.c | 100 -
 1 file changed, 50 insertions(+), 50 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 5a70ecd579b8..89cb6e9e9f31 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -53,6 +53,11 @@ enum {
DDW_EXT_QUERY_OUT_SIZE = 2
 };
 
+#ifdef CONFIG_IOMMU_API
+static int tce_exchange_pseries(struct iommu_table *tbl, long index, unsigned long *tce,
+				enum dma_data_direction *direction, bool realmode);
+#endif
+
 static struct iommu_table *iommu_pseries_alloc_table(int node)
 {
struct iommu_table *tbl;
@@ -501,6 +506,28 @@ static int tce_setrange_multi_pSeriesLP_walk(unsigned long start_pfn,
return tce_setrange_multi_pSeriesLP(start_pfn, num_pfn, arg);
 }
 
+static inline void _iommu_table_setparms(struct iommu_table *tbl, unsigned long busno,
+					 unsigned long liobn, unsigned long win_addr,
+					 unsigned long window_size, unsigned long page_shift,
+					 unsigned long base, struct iommu_table_ops *table_ops)
+{
+   tbl->it_busno = busno;
+   tbl->it_index = liobn;
+   tbl->it_offset = win_addr >> page_shift;
+   tbl->it_size = window_size >> page_shift;
+   tbl->it_page_shift = page_shift;
+   tbl->it_base = base;
+   tbl->it_blocksize = 16;
+   tbl->it_type = TCE_PCI;
+   tbl->it_ops = table_ops;
+}
+
+struct iommu_table_ops iommu_table_pseries_ops = {
+   .set = tce_build_pSeries,
+   .clear = tce_free_pSeries,
+   .get = tce_get_pseries
+};
+
 static void iommu_table_setparms(struct pci_controller *phb,
 struct device_node *dn,
 struct iommu_table *tbl)
@@ -509,8 +536,13 @@ static void iommu_table_setparms(struct pci_controller *phb,
const unsigned long *basep;
const u32 *sizep;
 
-   node = phb->dn;
+   /* Test if we are going over 2GB of DMA space */
+   if (phb->dma_window_base_cur + phb->dma_window_size > SZ_2G) {
+		udbg_printf("PCI_DMA: Unexpected number of IOAs under this PHB.\n");
+		panic("PCI_DMA: Unexpected number of IOAs under this PHB.\n");
+   }
 
+   node = phb->dn;
basep = of_get_property(node, "linux,tce-base", NULL);
sizep = of_get_property(node, "linux,tce-size", NULL);
if (basep == NULL || sizep == NULL) {
@@ -519,33 +551,25 @@ static void iommu_table_setparms(struct pci_controller *phb,
return;
}
 
-   tbl->it_base = (unsigned long)__va(*basep);
+	_iommu_table_setparms(tbl, phb->bus->number, 0, phb->dma_window_base_cur,
+			      phb->dma_window_size, IOMMU_PAGE_SHIFT_4K,
+			      (unsigned long)__va(*basep), &iommu_table_pseries_ops);
 
if (!is_kdump_kernel())
memset((void *)tbl->it_base, 0, *sizep);
 
-   tbl->it_busno = phb->bus->number;
-   tbl->it_page_shift = IOMMU_PAGE_SHIFT_4K;
-
-   /* Units of tce entries */
-   tbl->it_offset = phb->dma_window_base_cur >> tbl->it_page_shift;
-
-   /* Test if we are going over 2GB of DMA space */
-	if (phb->dma_window_base_cur + phb->dma_window_size > 0x80000000ul) {
-		udbg_printf("PCI_DMA: Unexpected number of IOAs under this PHB.\n");
-		panic("PCI_DMA: Unexpected number of IOAs under this PHB.\n");
-   }
-
phb->dma_window_base_cur += phb->dma_window_size;
-
-   /* Set the tce table size - measured in entries */
-   tbl->it_size = phb->dma_window_size >> tbl->it_page_shift;
-
-   tbl->it_index = 0;
-   tbl->it_blocksize = 16;
-   tbl->it_type = TCE_PCI;
 }
 
+struct iommu_table_ops iommu_table_lpar_multi_ops = {
+   .set = tce_buildmulti_pSeriesLP,
+#ifdef CONFIG_IOMMU_API
+   .xchg_no_kill = tce_exchange_pseries,
+#endif
+   .clear = tce_freemulti_pSeriesLP,
+   .get = tce_get_pSeriesLP
+};
+
 /*
  * iommu_table_setparms_lpar
  *
@@ -557,28 +581,17 @@ static void iommu_table_setparms_lpar(struct pci_controller *phb,
  struct iommu_table_group *table_group,
  const __be32 *dma_window)
 {
-   unsigned long offset, size;
+   unsigned long offset, size, liobn;
 
-   of_parse_dma_window(dn, 

[PATCH v4 06/11] powerpc/pseries/iommu: Add ddw_property_create() and refactor enable_ddw()

2021-04-30 Thread Leonardo Bras
Code used to create a ddw property that was previously scattered in
enable_ddw() is now gathered in ddw_property_create(), which deals with
allocation and filling of the property, leaving it ready for
of_add_property(), which now occurs in sequence.

This created an opportunity to reorganize the second part of enable_ddw():

Without this patch enable_ddw() does, in order:
kzalloc() property & members, create_ddw(), fill ddwprop inside property,
ddw_list_new_entry(), do tce_setrange_multi_pSeriesLP_walk in all memory,
of_add_property(), and list_add().

With this patch enable_ddw() does, in order:
create_ddw(), ddw_property_create(), of_add_property(),
ddw_list_new_entry(), do tce_setrange_multi_pSeriesLP_walk in all memory,
and list_add().

This change requires of_remove_property() in case anything fails after
of_add_property(), but we get to do tce_setrange_multi_pSeriesLP_walk
in all memory, which looks like the most expensive operation, only if
everything else succeeds.
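
In outline, the reordered tail of enable_ddw() becomes (a sketch only; the
error labels approximate the goto tags used by the patch):

	ret = create_ddw(dev, ddw_avail, &create, page_shift, len);
	if (ret != 0)
		goto out_failed;

	win64 = ddw_property_create(DIRECT64_PROPNAME, create.liobn, win_addr,
				    page_shift, len);
	if (!win64)
		goto out_remove_win;		/* undo create_ddw() */

	ret = of_add_property(pdn, win64);
	if (ret)
		goto out_remove_win;

	/*
	 * Only now: ddw_list_new_entry(), the expensive
	 * tce_setrange_multi_pSeriesLP_walk over all memory, and list_add().
	 * A failure past this point also unwinds with of_remove_property().
	 */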

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/platforms/pseries/iommu.c | 93 --
 1 file changed, 57 insertions(+), 36 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 955cf095416c..5a70ecd579b8 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -1122,6 +1122,35 @@ static void reset_dma_window(struct pci_dev *dev, struct device_node *par_dn)
 ret);
 }
 
+static struct property *ddw_property_create(const char *propname, u32 liobn, u64 dma_addr,
+					    u32 page_shift, u32 window_shift)
+{
+   struct dynamic_dma_window_prop *ddwprop;
+   struct property *win64;
+
+   win64 = kzalloc(sizeof(*win64), GFP_KERNEL);
+   if (!win64)
+   return NULL;
+
+   win64->name = kstrdup(propname, GFP_KERNEL);
+   ddwprop = kzalloc(sizeof(*ddwprop), GFP_KERNEL);
+   win64->value = ddwprop;
+   win64->length = sizeof(*ddwprop);
+   if (!win64->name || !win64->value) {
+   kfree(win64->name);
+   kfree(win64->value);
+   kfree(win64);
+   return NULL;
+   }
+
+   ddwprop->liobn = cpu_to_be32(liobn);
+   ddwprop->dma_base = cpu_to_be64(dma_addr);
+   ddwprop->tce_shift = cpu_to_be32(page_shift);
+   ddwprop->window_shift = cpu_to_be32(window_shift);
+
+   return win64;
+}
+
 /* Return largest page shift based on "IO Page Sizes" output of ibm,query-pe-dma-window. */
 static int iommu_get_page_shift(u32 query_page_size)
 {
@@ -1167,11 +1196,11 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
struct ddw_query_response query;
struct ddw_create_response create;
int page_shift;
+   u64 win_addr;
struct device_node *dn;
u32 ddw_avail[DDW_APPLICABLE_SIZE];
struct direct_window *window;
struct property *win64 = NULL;
-   struct dynamic_dma_window_prop *ddwprop;
struct failed_ddw_pdn *fpdn;
bool default_win_removed = false;
bool pmem_present;
@@ -1286,65 +1315,54 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
1ULL << page_shift);
goto out_failed;
}
-   win64 = kzalloc(sizeof(struct property), GFP_KERNEL);
-   if (!win64) {
-		dev_info(&dev->dev,
-   "couldn't allocate property for 64bit dma window\n");
-   goto out_failed;
-   }
-   win64->name = kstrdup(DIRECT64_PROPNAME, GFP_KERNEL);
-   win64->value = ddwprop = kmalloc(sizeof(*ddwprop), GFP_KERNEL);
-   win64->length = sizeof(*ddwprop);
-   if (!win64->name || !win64->value) {
-		dev_info(&dev->dev,
-   "couldn't allocate property name and value\n");
-   goto out_free_prop;
-   }
 
 	ret = create_ddw(dev, ddw_avail, &create, page_shift, len);
if (ret != 0)
-   goto out_free_prop;
-
-   ddwprop->liobn = cpu_to_be32(create.liobn);
-   ddwprop->dma_base = cpu_to_be64(((u64)create.addr_hi << 32) |
-   create.addr_lo);
-   ddwprop->tce_shift = cpu_to_be32(page_shift);
-   ddwprop->window_shift = cpu_to_be32(len);
+   goto out_failed;
 
 	dev_dbg(&dev->dev, "created tce table LIOBN 0x%x for %pOF\n",
  create.liobn, dn);
 
-   window = ddw_list_new_entry(pdn, ddwprop);
+   win_addr = ((u64)create.addr_hi << 32) | create.addr_lo;
+   win64 = ddw_property_create(DIRECT64_PROPNAME, create.liobn, win_addr,
+   page_shift, len);
+   if (!win64) {
+		dev_info(&dev->dev,
+			 "couldn't allocate property, property name, or value\n");
+   goto out_remove_win;
+   }
+
+   ret = of_add_property(pdn, win64);
+   if (ret) {
+   

[PATCH v4 05/11] powerpc/pseries/iommu: Allow DDW windows starting at 0x00

2021-04-30 Thread Leonardo Bras
enable_ddw() currently returns the address of the DMA window, which is
considered invalid if it has the value 0x00.

Also, it only considers an address returned from find_existing_ddw()
valid if it's not 0x00.

Changing this behavior makes sense, given the users of enable_ddw() only
need to know if direct mapping is possible. It can also allow a DMA window
starting at 0x00 to be used.

This will be helpful for using a DDW with indirect mapping, as the window
address will be different from 0x00, but it will not map the whole
partition.

Signed-off-by: Leonardo Bras 
Reviewed-by: Alexey Kardashevskiy 
---
 arch/powerpc/platforms/pseries/iommu.c | 35 --
 1 file changed, 16 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 6f14894d2d04..955cf095416c 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -849,25 +849,26 @@ static void remove_ddw(struct device_node *np, bool remove_prop)
np, ret);
 }
 
-static u64 find_existing_ddw(struct device_node *pdn, int *window_shift)
+static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr, int *window_shift)
 {
struct direct_window *window;
const struct dynamic_dma_window_prop *direct64;
-   u64 dma_addr = 0;
+   bool found = false;
 
 	spin_lock(&direct_window_list_lock);
 	/* check if we already created a window and dupe that config if so */
 	list_for_each_entry(window, &direct_window_list, list) {
if (window->device == pdn) {
direct64 = window->prop;
-   dma_addr = be64_to_cpu(direct64->dma_base);
+   *dma_addr = be64_to_cpu(direct64->dma_base);
*window_shift = be32_to_cpu(direct64->window_shift);
+   found = true;
break;
}
}
 	spin_unlock(&direct_window_list_lock);
 
-   return dma_addr;
+   return found;
 }
 
 static struct direct_window *ddw_list_new_entry(struct device_node *pdn,
@@ -1157,20 +1158,19 @@ static int iommu_get_page_shift(u32 query_page_size)
  * pdn: the parent pe node with the ibm,dma_window property
  * Future: also check if we can remap the base window for our base page size
  *
- * returns the dma offset for use by the direct mapped DMA code.
+ * returns true if it can map all pages (direct mapping), false otherwise.
  */
-static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
+static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 {
int len = 0, ret;
int max_ram_len = order_base_2(ddw_memory_hotplug_max());
struct ddw_query_response query;
struct ddw_create_response create;
int page_shift;
-   u64 dma_addr;
struct device_node *dn;
u32 ddw_avail[DDW_APPLICABLE_SIZE];
struct direct_window *window;
-   struct property *win64;
+   struct property *win64 = NULL;
struct dynamic_dma_window_prop *ddwprop;
struct failed_ddw_pdn *fpdn;
bool default_win_removed = false;
@@ -1182,8 +1182,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 
 	mutex_lock(&direct_window_init_mutex);
 
-	dma_addr = find_existing_ddw(pdn, &len);
-	if (dma_addr != 0)
+	if (find_existing_ddw(pdn, &dev->dev.archdata.dma_offset, &len))
goto out_unlock;
 
/*
@@ -1338,7 +1337,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 	list_add(&window->list, &direct_window_list);
 	spin_unlock(&direct_window_list_lock);
 
-   dma_addr = be64_to_cpu(ddwprop->dma_base);
+   dev->dev.archdata.dma_offset = be64_to_cpu(ddwprop->dma_base);
goto out_unlock;
 
 out_free_window:
@@ -1351,6 +1350,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
kfree(win64->name);
kfree(win64->value);
kfree(win64);
+   win64 = NULL;
 
 out_failed:
if (default_win_removed)
@@ -1370,10 +1370,10 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 * as RAM, then we failed to create a window to cover persistent
 * memory and need to set the DMA limit.
 */
-   if (pmem_present && dma_addr && (len == max_ram_len))
-   dev->dev.bus_dma_limit = dma_addr + (1ULL << len);
+   if (pmem_present && win64 && (len == max_ram_len))
+		dev->dev.bus_dma_limit = dev->dev.archdata.dma_offset + (1ULL << len);
 
-   return dma_addr;
+   return win64;
 }
 
 static void pci_dma_dev_setup_pSeriesLP(struct pci_dev *dev)
@@ -1452,11 +1452,8 @@ static bool iommu_bypass_supported_pSeriesLP(struct pci_dev *pdev, u64 dma_mask)
break;
}
 
-   if (pdn && PCI_DN(pdn)) {
-   pdev->dev.archdata.dma_offset = enable_ddw(pdev, pdn);
-   if 

[PATCH v4 04/11] powerpc/pseries/iommu: Add ddw_list_new_entry() helper

2021-04-30 Thread Leonardo Bras
There are two functions creating direct_window_list entries in a
similar way, so create a ddw_list_new_entry() to avoid duplication and
simplify those functions.

Signed-off-by: Leonardo Bras 
Reviewed-by: Alexey Kardashevskiy 
---
 arch/powerpc/platforms/pseries/iommu.c | 32 +-
 1 file changed, 21 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index d02359ca1f9f..6f14894d2d04 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -870,6 +870,21 @@ static u64 find_existing_ddw(struct device_node *pdn, int *window_shift)
return dma_addr;
 }
 
+static struct direct_window *ddw_list_new_entry(struct device_node *pdn,
+						const struct dynamic_dma_window_prop *dma64)
+{
+   struct direct_window *window;
+
+   window = kzalloc(sizeof(*window), GFP_KERNEL);
+   if (!window)
+   return NULL;
+
+   window->device = pdn;
+   window->prop = dma64;
+
+   return window;
+}
+
 static int find_existing_ddw_windows(void)
 {
int len;
@@ -882,18 +897,15 @@ static int find_existing_ddw_windows(void)
 
for_each_node_with_property(pdn, DIRECT64_PROPNAME) {
 		direct64 = of_get_property(pdn, DIRECT64_PROPNAME, &len);
-   if (!direct64)
-   continue;
-
-   window = kzalloc(sizeof(*window), GFP_KERNEL);
-   if (!window || len < sizeof(struct dynamic_dma_window_prop)) {
-   kfree(window);
+   if (!direct64 || len < sizeof(*direct64)) {
remove_ddw(pdn, true);
continue;
}
 
-   window->device = pdn;
-   window->prop = direct64;
+   window = ddw_list_new_entry(pdn, direct64);
+   if (!window)
+   break;
+
 		spin_lock(&direct_window_list_lock);
 		list_add(&window->list, &direct_window_list);
 		spin_unlock(&direct_window_list_lock);
@@ -1303,7 +1315,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
dev_dbg(>dev, "created tce table LIOBN 0x%x for %pOF\n",
  create.liobn, dn);
 
-   window = kzalloc(sizeof(*window), GFP_KERNEL);
+   window = ddw_list_new_entry(pdn, ddwprop);
if (!window)
goto out_clear_window;
 
@@ -1322,8 +1334,6 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
goto out_free_window;
}
 
-   window->device = pdn;
-   window->prop = ddwprop;
 	spin_lock(&direct_window_list_lock);
 	list_add(&window->list, &direct_window_list);
 	spin_unlock(&direct_window_list_lock);
-- 
2.30.2



[PATCH v4 03/11] powerpc/pseries/iommu: Add iommu_pseries_alloc_table() helper

2021-04-30 Thread Leonardo Bras
Create a helper to allow allocating a new iommu_table without the need
to reallocate the iommu_group.

This will be helpful for replacing the iommu_table for the new DMA window,
after we remove the old one with iommu_tce_table_put().
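
A rough sketch of the intended use, with illustrative variable names (nid,
table_group) that are not taken from this patch:

	struct iommu_table *tbl;

	/* Drop the old table; the iommu_group in the table_group survives. */
	iommu_tce_table_put(table_group->tables[0]);

	/* Allocate a fresh table for the new DMA window. */
	tbl = iommu_pseries_alloc_table(nid);
	if (!tbl)
		return false;
	table_group->tables[0] = tbl;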

Signed-off-by: Leonardo Bras 
Reviewed-by: Alexey Kardashevskiy 
---
 arch/powerpc/platforms/pseries/iommu.c | 25 ++---
 1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 796ab356341c..d02359ca1f9f 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -53,28 +53,31 @@ enum {
DDW_EXT_QUERY_OUT_SIZE = 2
 };
 
-static struct iommu_table_group *iommu_pseries_alloc_group(int node)
+static struct iommu_table *iommu_pseries_alloc_table(int node)
 {
-   struct iommu_table_group *table_group;
struct iommu_table *tbl;
 
-   table_group = kzalloc_node(sizeof(struct iommu_table_group), GFP_KERNEL,
-  node);
-   if (!table_group)
-   return NULL;
-
tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL, node);
if (!tbl)
-   goto free_group;
+   return NULL;
 
INIT_LIST_HEAD_RCU(>it_group_list);
kref_init(>it_kref);
+   return tbl;
+}
 
-   table_group->tables[0] = tbl;
+static struct iommu_table_group *iommu_pseries_alloc_group(int node)
+{
+   struct iommu_table_group *table_group;
+
+   table_group = kzalloc_node(sizeof(*table_group), GFP_KERNEL, node);
+   if (!table_group)
+   return NULL;
 
-   return table_group;
+   table_group->tables[0] = iommu_pseries_alloc_table(node);
+   if (table_group->tables[0])
+   return table_group;
 
-free_group:
kfree(table_group);
return NULL;
 }
-- 
2.30.2



[PATCH v4 02/11] powerpc/kernel/iommu: Add new iommu_table_in_use() helper

2021-04-30 Thread Leonardo Bras
Having a function to check if the iommu table has any allocation helps
decide whether a tbl can be reset to use a new DMA window.

It should be enough to replace all instances of !bitmap_empty(tbl...).

iommu_table_in_use() skips reserved memory, so we don't need to worry about
releasing it before testing. This causes iommu_table_release_pages() to
become unnecessary, given it is only used to remove reserved memory for
testing.

Also, only allow storing reserved memory values in tbl if they are valid
in the table, so there is no need to check them in the new helper.

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/include/asm/iommu.h |  1 +
 arch/powerpc/kernel/iommu.c  | 65 
 2 files changed, 34 insertions(+), 32 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index deef7c94d7b6..bf3b84128525 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -154,6 +154,7 @@ extern int iommu_tce_table_put(struct iommu_table *tbl);
  */
 extern struct iommu_table *iommu_init_table(struct iommu_table *tbl,
int nid, unsigned long res_start, unsigned long res_end);
+bool iommu_table_in_use(struct iommu_table *tbl);
 
 #define IOMMU_TABLE_GROUP_MAX_TABLES   2
 
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index ad82dda81640..5e168bd91401 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -691,32 +691,24 @@ static void iommu_table_reserve_pages(struct iommu_table *tbl,
if (tbl->it_offset == 0)
set_bit(0, tbl->it_map);
 
-   tbl->it_reserved_start = res_start;
-   tbl->it_reserved_end = res_end;
-
-   /* Check if res_start..res_end isn't empty and overlaps the table */
-	if (res_start && res_end &&
-	    (tbl->it_offset + tbl->it_size < res_start ||
-	     res_end < tbl->it_offset))
-		return;
+   if (res_start < tbl->it_offset)
+   res_start = tbl->it_offset;
 
-   for (i = tbl->it_reserved_start; i < tbl->it_reserved_end; ++i)
-   set_bit(i - tbl->it_offset, tbl->it_map);
-}
+   if (res_end > (tbl->it_offset + tbl->it_size))
+   res_end = tbl->it_offset + tbl->it_size;
 
-static void iommu_table_release_pages(struct iommu_table *tbl)
-{
-   int i;
+   /* Check if res_start..res_end is a valid range in the table */
+   if (res_start >= res_end) {
+   tbl->it_reserved_start = tbl->it_offset;
+   tbl->it_reserved_end = tbl->it_offset;
+   return;
+   }
 
-   /*
-* In case we have reserved the first bit, we should not emit
-* the warning below.
-*/
-   if (tbl->it_offset == 0)
-   clear_bit(0, tbl->it_map);
+   tbl->it_reserved_start = res_start;
+   tbl->it_reserved_end = res_end;
 
for (i = tbl->it_reserved_start; i < tbl->it_reserved_end; ++i)
-   clear_bit(i - tbl->it_offset, tbl->it_map);
+   set_bit(i - tbl->it_offset, tbl->it_map);
 }
 
 /*
@@ -781,6 +773,22 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid,
return tbl;
 }
 
+bool iommu_table_in_use(struct iommu_table *tbl)
+{
+   unsigned long start = 0, end;
+
+   /* ignore reserved bit0 */
+   if (tbl->it_offset == 0)
+   start = 1;
+   end = tbl->it_reserved_start - tbl->it_offset;
+   if (find_next_bit(tbl->it_map, end, start) != end)
+   return true;
+
+   start = tbl->it_reserved_end - tbl->it_offset;
+   end = tbl->it_size;
+   return find_next_bit(tbl->it_map, end, start) != end;
+}
+
 static void iommu_table_free(struct kref *kref)
 {
unsigned long bitmap_sz;
@@ -799,10 +807,8 @@ static void iommu_table_free(struct kref *kref)
 
iommu_debugfs_del(tbl);
 
-   iommu_table_release_pages(tbl);
-
/* verify that table contains no entries */
-   if (!bitmap_empty(tbl->it_map, tbl->it_size))
+   if (iommu_table_in_use(tbl))
pr_warn("%s: Unexpected TCEs\n", __func__);
 
/* calculate bitmap size in bytes */
@@ -1108,18 +1114,13 @@ int iommu_take_ownership(struct iommu_table *tbl)
for (i = 0; i < tbl->nr_pools; i++)
spin_lock(>pools[i].lock);
 
-   iommu_table_release_pages(tbl);
-
-   if (!bitmap_empty(tbl->it_map, tbl->it_size)) {
+   if (iommu_table_in_use(tbl)) {
pr_err("iommu_tce: it_map is not empty");
ret = -EBUSY;
-   /* Undo iommu_table_release_pages, i.e. restore bit#0, etc */
-   iommu_table_reserve_pages(tbl, tbl->it_reserved_start,
-   tbl->it_reserved_end);
-   } else {
-   memset(tbl->it_map, 0xff, sz);
}
 
+   memset(tbl->it_map, 0xff, sz);
+
for (i = 0; i < tbl->nr_pools; i++)

[PATCH v4 01/11] powerpc/pseries/iommu: Replace hard-coded page shift

2021-04-30 Thread Leonardo Bras
Some functions assume IOMMU page size can only be 4K (pageshift == 12).
Update them to accept any page size passed, so we can use 64K pages.

In the process, some defines like TCE_SHIFT were made obsolete, and then
removed.

IODA3 Revision 3.0_prd1 (OpenPowerFoundation), Figures 3.4 and 3.5 show
a 52-bit RPN and consider a 12-bit pageshift, so there should be no need
to use TCE_RPN_MASK, which masks out any bit after 40 in the rpn. Its
usage is removed from tce_build_pSeries(), tce_build_pSeriesLP(), and
tce_buildmulti_pSeriesLP().

Most places had a tbl struct, so using tbl->it_page_shift was simple.
tce_free_pSeriesLP() was a special case, since callers do not always have
a tbl struct, so adding a tceshift parameter seems the right thing to do.

Signed-off-by: Leonardo Bras 
Reviewed-by: Alexey Kardashevskiy 
---
 arch/powerpc/include/asm/tce.h |  8 --
 arch/powerpc/platforms/pseries/iommu.c | 39 +++---
 2 files changed, 23 insertions(+), 24 deletions(-)

diff --git a/arch/powerpc/include/asm/tce.h b/arch/powerpc/include/asm/tce.h
index db5fc2f2262d..0c34d2756d92 100644
--- a/arch/powerpc/include/asm/tce.h
+++ b/arch/powerpc/include/asm/tce.h
@@ -19,15 +19,7 @@
 #define TCE_VB 0
 #define TCE_PCI1
 
-/* TCE page size is 4096 bytes (1 << 12) */
-
-#define TCE_SHIFT  12
-#define TCE_PAGE_SIZE  (1 << TCE_SHIFT)
-
 #define TCE_ENTRY_SIZE 8   /* each TCE is 64 bits */
-
-#define TCE_RPN_MASK	0xfffffffffful	/* 40-bit RPN (4K pages) */
-#define TCE_RPN_SHIFT  12
 #define TCE_VALID  0x800   /* TCE valid */
 #define TCE_ALLIO  0x400   /* TCE valid for all lpars */
 #define TCE_PCI_WRITE  0x2 /* write from PCI allowed */
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 67c9953a6503..796ab356341c 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -107,6 +107,8 @@ static int tce_build_pSeries(struct iommu_table *tbl, long index,
u64 proto_tce;
__be64 *tcep;
u64 rpn;
+   const unsigned long tceshift = tbl->it_page_shift;
+   const unsigned long pagesize = IOMMU_PAGE_SIZE(tbl);
 
proto_tce = TCE_PCI_READ; // Read allowed
 
@@ -117,10 +119,10 @@ static int tce_build_pSeries(struct iommu_table *tbl, long index,
 
while (npages--) {
/* can't move this out since we might cross MEMBLOCK boundary */
-   rpn = __pa(uaddr) >> TCE_SHIFT;
-		*tcep = cpu_to_be64(proto_tce | (rpn & TCE_RPN_MASK) << TCE_RPN_SHIFT);
+   rpn = __pa(uaddr) >> tceshift;
+   *tcep = cpu_to_be64(proto_tce | rpn << tceshift);
 
-   uaddr += TCE_PAGE_SIZE;
+   uaddr += pagesize;
tcep++;
}
return 0;
@@ -146,7 +148,7 @@ static unsigned long tce_get_pseries(struct iommu_table *tbl, long index)
return be64_to_cpu(*tcep);
 }
 
-static void tce_free_pSeriesLP(unsigned long liobn, long, long);
+static void tce_free_pSeriesLP(unsigned long liobn, long, long, long);
 static void tce_freemulti_pSeriesLP(struct iommu_table*, long, long);
 
 static int tce_build_pSeriesLP(unsigned long liobn, long tcenum, long tceshift,
@@ -166,12 +168,12 @@ static int tce_build_pSeriesLP(unsigned long liobn, long tcenum, long tceshift,
proto_tce |= TCE_PCI_WRITE;
 
while (npages--) {
-   tce = proto_tce | (rpn & TCE_RPN_MASK) << tceshift;
+   tce = proto_tce | rpn << tceshift;
rc = plpar_tce_put((u64)liobn, (u64)tcenum << tceshift, tce);
 
if (unlikely(rc == H_NOT_ENOUGH_RESOURCES)) {
ret = (int)rc;
-   tce_free_pSeriesLP(liobn, tcenum_start,
+   tce_free_pSeriesLP(liobn, tcenum_start, tceshift,
   (npages_start - (npages + 1)));
break;
}
@@ -205,10 +207,11 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
long tcenum_start = tcenum, npages_start = npages;
int ret = 0;
unsigned long flags;
+   const unsigned long tceshift = tbl->it_page_shift;
 
if ((npages == 1) || !firmware_has_feature(FW_FEATURE_PUT_TCE_IND)) {
return tce_build_pSeriesLP(tbl->it_index, tcenum,
-  tbl->it_page_shift, npages, uaddr,
+  tceshift, npages, uaddr,
   direction, attrs);
}
 
@@ -225,13 +228,13 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
if (!tcep) {
local_irq_restore(flags);
return tce_build_pSeriesLP(tbl->it_index, tcenum,
-  

[PATCH v4 00/11] DDW + Indirect Mapping

2021-04-30 Thread Leonardo Bras
So far it's assumed possible to map the guest RAM 1:1 to the bus, which
works with a small number of devices. SRIOV changes it, as the user can
configure hundreds of VFs, and since phyp preallocates TCEs and does not
allow IOMMU pages bigger than 64K, it has to limit the number of TCEs
per PE to limit the waste of physical pages.

As of today, if the assumed direct mapping is not possible, DDW creation
is skipped and the default DMA window "ibm,dma-window" is used instead.

Using the DDW instead of the default DMA window may allow expanding the
amount of memory that can be DMA-mapped, given the number of pages (TCEs)
may stay the same (or increase) and the default DMA window offers only
4k pages while DDW may offer larger pages (4k, 64k, 16M ...).
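
As rough arithmetic, assuming a fixed budget of 2^21 TCEs per PE (a figure
chosen only for illustration):

	 4k pages: 2^21 TCEs * 2^12 bytes =   8 GB of DMA space
	64k pages: 2^21 TCEs * 2^16 bytes = 128 GB of DMA space
	16M pages: 2^21 TCEs * 2^24 bytes =  32 TB of DMA space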

Patch #1 replaces hard-coded 4K page size with a variable containing the
correct page size for the window.

Patch #2 introduces iommu_table_in_use(), and replace manual bit-field
checking where it's used. It will be used for aborting enable_ddw() if
there is any current iommu allocation and we are trying single window
indirect mapping.

Patch #3 introduces iommu_pseries_alloc_table() that will be helpful
when indirect mapping needs to replace the iommu_table.

Patch #4 adds helpers for adding DDWs in the list.

Patch #5 refactors enable_ddw() so it returns if direct mapping is
possible, instead of DMA offset. It helps for next patches on
indirect DMA mapping and also allows DMA windows starting at 0x00.

Patch #6 bring new helper to simplify enable_ddw(), allowing
some reorganization for introducing indirect mapping DDW.

Patch #7 adds new helper _iommu_table_setparms() and use it in other
*setparams*() to fill iommu_table. It will also be used for creating a
new iommu_table for indirect mapping.

Patch #8 updates remove_dma_window() to accept different property names,
so we can introduce a new property for indirect mapping.

Patch #9 extracts find_existing_ddw_windows() into
find_existing_ddw_windows_named(), and calls it with the property name.
This will be useful when the property for indirect mapping is created,
so we can search the device-tree for both properties.

Patch #10:
Instead of destroying the created DDW if it doesn't map the whole
partition, make use of it instead of the default DMA window as it improves
performance. Also, update the iommu_table and re-generate the pools.
It introduces a new property name for DDW with indirect DMA mapping.

Patch #11:
Does some renaming of 'direct window' to 'dma window', given the DDW
created can now be also used in indirect mapping if direct mapping is not
available.

All patches were tested on an LPAR with a virtio-net interface that
allows the default DMA window and DDW to coexist.

Changes since v3:
- Fixed inverted free order at ddw_property_create()
- Updated goto tag naming

Changes since v2:
- Some patches got removed from the series and sent by themselves,
- New tbl created for DDW + indirect mapping reserves MMIO32 space,
- Improved reserved area algorithm,
- Improved commit messages,
- Removed define for default DMA window prop name,
- Avoided some unnecessary renaming,
- Removed some unnecessary empty lines,
- Changed some code moving to forward declarations.
v2
Link: 
http://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=201210&state=%2A&archive=both
 

Leonardo Bras (11):
  powerpc/pseries/iommu: Replace hard-coded page shift
  powerpc/kernel/iommu: Add new iommu_table_in_use() helper
  powerpc/pseries/iommu: Add iommu_pseries_alloc_table() helper
  powerpc/pseries/iommu: Add ddw_list_new_entry() helper
  powerpc/pseries/iommu: Allow DDW windows starting at 0x00
  powerpc/pseries/iommu: Add ddw_property_create() and refactor
enable_ddw()
  powerpc/pseries/iommu: Reorganize iommu_table_setparms*() with new
helper
  powerpc/pseries/iommu: Update remove_dma_window() to accept property
name
  powerpc/pseries/iommu: Find existing DDW with given property name
  powerpc/pseries/iommu: Make use of DDW for indirect mapping
  powerpc/pseries/iommu: Rename "direct window" to "dma window"

 arch/powerpc/include/asm/iommu.h   |   1 +
 arch/powerpc/include/asm/tce.h |   8 -
 arch/powerpc/kernel/iommu.c|  65 ++--
 arch/powerpc/platforms/pseries/iommu.c | 504 +++--
 4 files changed, 338 insertions(+), 240 deletions(-)

-- 
2.30.2



Re: Radeon NI: GIT kernel with the nislands_smc commit doesn't boot on a Freescale P5040 board and P.A.Semi Nemo board

2021-04-30 Thread Gustavo A. R. Silva



On 4/30/21 10:26, Deucher, Alexander wrote:
> [AMD Public Use]
> 
> + Gustavo, amd-gfx
> 
>> -Original Message-
>> From: Christian Zigotzky 
>> Sent: Friday, April 30, 2021 8:00 AM
>> To: gustavo...@kernel.org; Deucher, Alexander 
>> 
>> Cc: R.T.Dickinson; Darren Stevens; mad skateman; linuxppc-dev;
>> Olof Johansson; Maling list - DRI developers; Michel Dänzer;
>> Christian Zigotzky
>> Subject: Radeon NI: GIT kernel with the nislands_smc commit doesn't 
>> boot on a Freescale P5040 board and P.A.Semi Nemo board
>>
>> Hello,
>>
>> The Nemo board (A-EON AmigaOne X1000) [1] and the FSL P5040 Cyrus+ 
>> board (A-EON AmigaOne X5000) [2] with installed AMD Radeon HD6970 NI 
>> graphics cards (Cayman XT) [3] don't boot with the latest git kernel 
>> anymore after the commit "drm/radeon/nislands_smc.h: Replace 
>> one-element array with flexible-array member in struct NISLANDS_SMC_SWSTATE 
>> branch" [4].
>> This git kernel boots in a virtual e5500 QEMU machine with a VirtIO-GPU [5].
>>
>> I bisected today [6].
>>
>> Result: drm/radeon/nislands_smc.h: Replace one-element array with 
>> flexible-array member in struct NISLANDS_SMC_SWSTATE branch
>> (434fb1e7444a2efc3a4ebd950c7f771ebfcffa31) [4] is the first bad commit.
>>
>> I was able to revert this commit [7] and after a new compiling, the 
>> kernel boots without any problems on my AmigaOnes.
>>
>> After that I created a patch for reverting this commit for new git test 
>> kernels.
>> [3]
>>
>> The kernel compiles and boots with this patch on my AmigaOnes. Please 
>> find attached the kernel config files.
>>
>> Please check the first bad commit.

I'll have a look.

Thanks for the report!
--
Gustavo

>>
>> Thanks,
>> Christian
>>
>> [1] https://en.wikipedia.org/wiki/AmigaOne_X1000
>> [2] http://wiki.amiga.org/index.php?title=X5000
>> [3] https://forum.hyperion-entertainment.com/viewtopic.php?f=35&t=4377
>> [4] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=434fb1e7444a2efc3a4ebd950c7f771ebfcffa31
>> [5] qemu-system-ppc64 -M ppce500 -cpu e5500 -m 1024 -kernel uImage -drive
>> format=raw,file=MintPPC32-X5000.img,index=0,if=virtio -netdev
>> user,id=mynet0 -device virtio-net-pci,netdev=mynet0 -append "rw
>> root=/dev/vda" -device virtio-vga -usb -device usb-ehci,id=ehci -device
>> usb-tablet -device virtio-keyboard-pci -smp 4 -vnc :1
>> [6] https://forum.hyperion-entertainment.com/viewtopic.php?p=53074#p53074
>> [7] git revert 434fb1e7444a2efc3a4ebd950c7f771ebfcffa31
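
For readers following the bisect: the suspect commit converts a trailing
one-element array into a C99 flexible-array member, roughly like this
(illustrative struct and member names, not the exact radeon definitions):

	/* before: sizeof(struct swstate_old) includes one trailing level */
	struct swstate_old {
		u8 level_count;
		struct level levels[1];
	};

	/* after: sizeof(struct swstate_new) no longer covers levels[] */
	struct swstate_new {
		u8 level_count;
		struct level levels[];
	};

Any size or offset arithmetic built on sizeof() changes with such a
conversion, which is a plausible source of the boot regression reported
here.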


Re: [PATCH v3] pseries/drmem: update LMBs after LPM

2021-04-30 Thread Laurent Dufour

On 29/04/2021 at 21:12, Tyrel Datwyler wrote:

On 4/29/21 3:27 AM, Aneesh Kumar K.V wrote:

Laurent Dufour  writes:


After an LPM, the device tree node ibm,dynamic-reconfiguration-memory may be
updated by the hypervisor in case the NUMA topology of the LPAR's
memory is updated.

This is caught by the kernel, but the memory's node is not updated, because
there is no way to move a memory block between nodes.

If a memory block is later added or removed, drmem_update_dt() is called
and it overwrites the DT node to match the added or removed LMB. But
the LMB's associativity node has not been updated after the DT node update,
and thus the node is overwritten by Linux's topology instead of the
hypervisor's.

Introduce a hook called when the ibm,dynamic-reconfiguration-memory node is
updated to force an update of the LMB's associativity.

Cc: Tyrel Datwyler 
Signed-off-by: Laurent Dufour 
---

V3:
  - Check rd->dn->name instead of rd->dn->full_name
V2:
  - Take Tyrel's idea to rely on OF_RECONFIG_UPDATE_PROPERTY instead of
  introducing a new hook mechanism.
---
  arch/powerpc/include/asm/drmem.h  |  1 +
  arch/powerpc/mm/drmem.c   | 35 +++
  .../platforms/pseries/hotplug-memory.c|  4 +++
  3 files changed, 40 insertions(+)

diff --git a/arch/powerpc/include/asm/drmem.h b/arch/powerpc/include/asm/drmem.h
index bf2402fed3e0..4265d5e95c2c 100644
--- a/arch/powerpc/include/asm/drmem.h
+++ b/arch/powerpc/include/asm/drmem.h
@@ -111,6 +111,7 @@ int drmem_update_dt(void);
  int __init
  walk_drmem_lmbs_early(unsigned long node, void *data,
  int (*func)(struct drmem_lmb *, const __be32 **, void *));
+void drmem_update_lmbs(struct property *prop);
  #endif
  
  static inline void invalidate_lmb_associativity_index(struct drmem_lmb *lmb)

diff --git a/arch/powerpc/mm/drmem.c b/arch/powerpc/mm/drmem.c
index 9af3832c9d8d..f0a6633132af 100644
--- a/arch/powerpc/mm/drmem.c
+++ b/arch/powerpc/mm/drmem.c
@@ -307,6 +307,41 @@ int __init walk_drmem_lmbs_early(unsigned long node, void *data,
return ret;
  }
  
+/*
+ * Update the LMB associativity index.
+ */
+static int update_lmb(struct drmem_lmb *updated_lmb,
+ __maybe_unused const __be32 **usm,
+ __maybe_unused void *data)
+{
+   struct drmem_lmb *lmb;
+
+	/*
+	 * Brute force: there may be a better way to fetch the LMB.
+	 */
+   for_each_drmem_lmb(lmb) {
+   if (lmb->drc_index != updated_lmb->drc_index)
+   continue;
+
+   lmb->aa_index = updated_lmb->aa_index;
+   break;
+   }
+   return 0;
+}
+
+/*
+ * Update the LMB associativity index.
+ *
+ * This needs to be called when the hypervisor is updating the
+ * dynamic-reconfiguration-memory node property.
+ */
+void drmem_update_lmbs(struct property *prop)
+{
+   if (!strcmp(prop->name, "ibm,dynamic-memory"))
+   __walk_drmem_v1_lmbs(prop->value, NULL, NULL, update_lmb);
+   else if (!strcmp(prop->name, "ibm,dynamic-memory-v2"))
+   __walk_drmem_v2_lmbs(prop->value, NULL, NULL, update_lmb);
+}
  #endif
  
  static int init_drmem_lmb_size(struct device_node *dn)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 8377f1f7c78e..672ffbee2e78 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -949,6 +949,10 @@ static int pseries_memory_notifier(struct notifier_block *nb,
case OF_RECONFIG_DETACH_NODE:
err = pseries_remove_mem_node(rd->dn);
break;
+   case OF_RECONFIG_UPDATE_PROPERTY:
+   if (!strcmp(rd->dn->name,
+   "ibm,dynamic-reconfiguration-memory"))
+   drmem_update_lmbs(rd->prop);
}
return notifier_from_errno(err);


How will this interact with DLPAR memory? When we dlpar memory,
ibm,configure-connector is used to fetch the new associativity details
and set drmem_lmb->aa_index correctly there. Once that is done kernel
then call drmem_update_dt() which will result in the above notifier
callback?

IIUC, the call back then will update drmem_lmb->aa_index again?


After digging through some of this code I'm a bit concerned about all the kernel
device tree manipulation around memory DLPAR both with the assoc-lookup-array
prop update and post dynamic-memory prop updating. We build a drmem_info array
of the LMBs from the device-tree at boot. I don't really understand why we are
manipulating the device tree property every time we add/remove an LMB. Not sure
the reasoning was to write back in particular the aa_index and flags for each
LMB into the device tree when we already have them in the drmem_info array. On
the other hand the assoc-lookup-array I suppose would need to have an in kernel
representation to avoid updating the device 

Re: [PATCH v3 06/11] powerpc/pseries/iommu: Add ddw_property_create() and refactor enable_ddw()

2021-04-30 Thread Leonardo Bras
On Fri, 2021-04-23 at 19:04 +1000, Alexey Kardashevskiy wrote:
> 
> > +   win64->name = kstrdup(propname, GFP_KERNEL);
> > +   ddwprop = kzalloc(sizeof(*ddwprop), GFP_KERNEL);
> > +   win64->value = ddwprop;
> > +   win64->length = sizeof(*ddwprop);
> > +   if (!win64->name || !win64->value) {
> > +   kfree(win64);
> > +   kfree(win64->name);
> > +   kfree(win64->value);
> 
> 
> Wrong order.
> 

Right! Sorry about that. 
Changed for next version!
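
For reference, the corrected cleanup in v4 frees the members before the
containing struct:

	if (!win64->name || !win64->value) {
		kfree(win64->name);
		kfree(win64->value);
		kfree(win64);
		return NULL;
	}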

> > 
> > 
> > +out_del_win:
> 
> 
> (I would not bother but since I am commenting on the patch)
> 
> nit: the new name is not that much better than the old 
> "out_clear_window:" ("out_remove_win" would be a bit better) and it does 
> make reviewing a little bit harder. Thanks,

Replaced by out_remove_win
Thanks!






Re: [PATCH v3 01/11] powerpc/pseries/iommu: Replace hard-coded page shift

2021-04-30 Thread Leonardo Bras
Thanks Alexey!

On Fri, 2021-04-23 at 17:27 +1000, Alexey Kardashevskiy wrote:
> 
> On 22/04/2021 17:07, Leonardo Bras wrote:
> > Some functions assume IOMMU page size can only be 4K (pageshift == 12).
> > Update them to accept any page size passed, so we can use 64K pages.
> > 
> > In the process, some defines like TCE_SHIFT were made obsolete, and then
> > removed.
> > 
> > IODA3 Revision 3.0_prd1 (OpenPowerFoundation), Figures 3.4 and 3.5 show
> > a RPN of 52-bit, and considers a 12-bit pageshift, so there should be
> > no need of using TCE_RPN_MASK, which masks out any bit after 40 in rpn.
> > It's usage removed from tce_build_pSeries(), tce_build_pSeriesLP(), and
> > tce_buildmulti_pSeriesLP().
> 
> 
> After rereading the patch, I wonder why we had this TCE_RPN_MASK at all 
> but what is certain is that this has nothing to do with IODA3 as these 
> TCEs are guest phys addresses in pseries and IODA3 is bare metal. Except...
> 
> 
> > Most places had a tbl struct, so using tbl->it_page_shift was simple.
> > tce_free_pSeriesLP() was a special case, since callers not always have a
> > tbl struct, so adding a tceshift parameter seems the right thing to do.
> > 
> > Signed-off-by: Leonardo Bras 
> > Reviewed-by: Alexey Kardashevskiy 
> > ---
> >   arch/powerpc/include/asm/tce.h |  8 --
> >   arch/powerpc/platforms/pseries/iommu.c | 39 +++---
> >   2 files changed, 23 insertions(+), 24 deletions(-)
> > 
> > diff --git a/arch/powerpc/include/asm/tce.h b/arch/powerpc/include/asm/tce.h
> > index db5fc2f2262d..0c34d2756d92 100644
> > --- a/arch/powerpc/include/asm/tce.h
> > +++ b/arch/powerpc/include/asm/tce.h
> > @@ -19,15 +19,7 @@
> >   #define TCE_VB0
> >   #define TCE_PCI   1
> >   
> > 
> > -/* TCE page size is 4096 bytes (1 << 12) */
> > -
> > -#define TCE_SHIFT  12
> > -#define TCE_PAGE_SIZE  (1 << TCE_SHIFT)
> > -
> >   #define TCE_ENTRY_SIZE8   /* each TCE is 64 bits 
> > */
> > -
> > -#define TCE_RPN_MASK   0xfffffffffful  /* 40-bit RPN (4K pages) */
> > -#define TCE_RPN_SHIFT  12
> >   #define TCE_VALID 0x800   /* TCE valid */
> >   #define TCE_ALLIO 0x400   /* TCE valid for all lpars */
> >   #define TCE_PCI_WRITE 0x2 /* write from PCI 
> > allowed */
> > diff --git a/arch/powerpc/platforms/pseries/iommu.c 
> > b/arch/powerpc/platforms/pseries/iommu.c
> > index 67c9953a6503..796ab356341c 100644
> > --- a/arch/powerpc/platforms/pseries/iommu.c
> > +++ b/arch/powerpc/platforms/pseries/iommu.c
> > @@ -107,6 +107,8 @@ static int tce_build_pSeries(struct iommu_table *tbl, 
> > long index,
> >     u64 proto_tce;
> >     __be64 *tcep;
> >     u64 rpn;
> > +   const unsigned long tceshift = tbl->it_page_shift;
> > +   const unsigned long pagesize = IOMMU_PAGE_SIZE(tbl);
> 
> (nit: only used once)
> 
> >   
> > 
> >     proto_tce = TCE_PCI_READ; // Read allowed
> >   
> > 
> > @@ -117,10 +119,10 @@ static int tce_build_pSeries(struct iommu_table *tbl, 
> > long index,
> 
> 
> ... this pseries which is not pseriesLP, i.e. no LPAR == bare metal 
> pseries such as ancient power5 or cellbe (I guess) and for those 
> TCE_RPN_MASK may actually make sense, keep it.
> 
> The rest of the patch looks good. Thanks,
> 
> 
> >   
> > 
> >     while (npages--) {
> >     /* can't move this out since we might cross MEMBLOCK boundary */
> > -   rpn = __pa(uaddr) >> TCE_SHIFT;
> > -   *tcep = cpu_to_be64(proto_tce | (rpn & TCE_RPN_MASK) << 
> > TCE_RPN_SHIFT);
> > +   rpn = __pa(uaddr) >> tceshift;
> > +   *tcep = cpu_to_be64(proto_tce | rpn << tceshift);
> >   
> > 
> > -   uaddr += TCE_PAGE_SIZE;
> > +   uaddr += pagesize;
> >     tcep++;
> >     }
> >     return 0;
> > @@ -146,7 +148,7 @@ static unsigned long tce_get_pseries(struct iommu_table 
> > *tbl, long index)
> >     return be64_to_cpu(*tcep);
> >   }
> >   
> > 
> > -static void tce_free_pSeriesLP(unsigned long liobn, long, long);
> > +static void tce_free_pSeriesLP(unsigned long liobn, long, long, long);
> >   static void tce_freemulti_pSeriesLP(struct iommu_table*, long, long);
> >   
> > 
> >   static int tce_build_pSeriesLP(unsigned long liobn, long tcenum, long 
> > tceshift,
> > @@ -166,12 +168,12 @@ static int tce_build_pSeriesLP(unsigned long liobn, 
> > long tcenum, long tceshift,
> >     proto_tce |= TCE_PCI_WRITE;
> >   
> > 
> >     while (npages--) {
> > -   tce = proto_tce | (rpn & TCE_RPN_MASK) << tceshift;
> > +   tce = proto_tce | rpn << tceshift;
> >     rc = plpar_tce_put((u64)liobn, (u64)tcenum << tceshift, tce);
> >   
> > 
> >     if (unlikely(rc == H_NOT_ENOUGH_RESOURCES)) {
> >     ret = (int)rc;
> > -   tce_free_pSeriesLP(liobn, tcenum_start,
> > +   tce_free_pSeriesLP(liobn, tcenum_start, tceshift,
> >    

Re: [PATCH v2 0/3] powerpc/mm/hash: Time improvements for memory hot(un)plug

2021-04-30 Thread Leonardo Bras
CC: David Gibson

http://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=241574&state=%2A&archive=both



[PATCH v2 3/3] powerpc/mm/hash: Avoid multiple HPT resize-downs on memory hotunplug

2021-04-30 Thread Leonardo Bras
During memory hotunplug, after each LMB is removed, the HPT may be
resized-down if it would map a max of 4 times the current amount of memory
(2 shifts, due to the introduced hysteresis).

It usually is not an issue, but it can take a lot of time if HPT
resizing-down fails. This happens because resize-down failures
usually repeat at each LMB removal, until no bolted entries conflict
anymore, which can take a while to happen.

This can be solved by doing a single HPT resize at the end of memory
hotunplug, after all requested entries are removed.

To make this happen, it's necessary to temporarily disable all HPT
resize-downs before hotunplug, re-enable them after hotunplug ends,
and then resize the HPT down to the current memory size.

As an example, hotunplugging 256GB from a 385GB guest took 621s without
this patch, and 100s after it was applied.
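
The resulting usage pattern in the hot-unplug paths is, in sketch form:

	if (!radix_enabled())
		hash_batch_shrink_begin();	/* blocks per-LMB resize-downs */

	for_each_drmem_lmb(lmb)
		rc = dlpar_remove_lmb(lmb);	/* no HPT resize per LMB now */

	if (!radix_enabled())
		hash_batch_shrink_end();	/* one resize-down to final size */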

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/include/asm/book3s/64/hash.h |  2 +
 arch/powerpc/mm/book3s64/hash_utils.c | 45 +--
 .../platforms/pseries/hotplug-memory.c| 26 +++
 3 files changed, 70 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/hash.h b/arch/powerpc/include/asm/book3s/64/hash.h
index fad4af8b8543..6cd66e7e98c9 100644
--- a/arch/powerpc/include/asm/book3s/64/hash.h
+++ b/arch/powerpc/include/asm/book3s/64/hash.h
@@ -256,6 +256,8 @@ int hash__create_section_mapping(unsigned long start, unsigned long end,
 int hash__remove_section_mapping(unsigned long start, unsigned long end);
 
 void hash_batch_expand_prepare(unsigned long newsize);
+void hash_batch_shrink_begin(void);
+void hash_batch_shrink_end(void);
 
 #endif /* !__ASSEMBLY__ */
 #endif /* __KERNEL__ */
diff --git a/arch/powerpc/mm/book3s64/hash_utils.c b/arch/powerpc/mm/book3s64/hash_utils.c
index 3fa395b3fe57..73ecd0f61acd 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -795,6 +795,9 @@ static unsigned long __init htab_get_table_size(void)
 }
 
 #ifdef CONFIG_MEMORY_HOTPLUG
+
+static DEFINE_MUTEX(hpt_resize_down_lock);
+
 static int resize_hpt_for_hotplug(unsigned long new_mem_size, bool shrinking)
 {
unsigned target_hpt_shift;
@@ -805,7 +808,7 @@ static int resize_hpt_for_hotplug(unsigned long new_mem_size, bool shrinking)
target_hpt_shift = htab_shift_for_mem_size(new_mem_size);
 
if (shrinking) {
-
+   int ret;
/*
 * To avoid lots of HPT resizes if memory size is fluctuating
 * across a boundary, we deliberately have some hysterisis
@@ -818,10 +821,20 @@ static int resize_hpt_for_hotplug(unsigned long new_mem_size, bool shrinking)
if (target_hpt_shift >= ppc64_pft_size - 1)
return 0;
 
-   } else if (target_hpt_shift <= ppc64_pft_size) {
-   return 0;
+	/* When batch removing entries, only resize the HPT at the end. */
+
+	if (!mutex_trylock(&hpt_resize_down_lock))
+		return 0;
+
+	ret = mmu_hash_ops.resize_hpt(target_hpt_shift);
+
+	mutex_unlock(&hpt_resize_down_lock);
+   return ret;
}
 
+   if (target_hpt_shift <= ppc64_pft_size)
+   return 0;
+
return mmu_hash_ops.resize_hpt(target_hpt_shift);
 }
 
@@ -879,6 +892,32 @@ void hash_batch_expand_prepare(unsigned long newsize)
break;
}
 }
+
+void hash_batch_shrink_begin(void)
+{
+   /* Disable HPT resize-down during hot-unplug */
+	mutex_lock(&hpt_resize_down_lock);
+}
+
+void hash_batch_shrink_end(void)
+{
+   const u64 starting_size = ppc64_pft_size;
+   unsigned long newsize;
+
+   newsize = memblock_phys_mem_size();
+   /* Resize to smallest SHIFT possible */
+   while (resize_hpt_for_hotplug(newsize, true) == -ENOSPC) {
+   newsize *= 2;
+   pr_warn("Hash collision while resizing HPT\n");
+
+   /* Do not try to resize to the starting size, or bigger value */
+   if (htab_shift_for_mem_size(newsize) >= starting_size)
+   break;
+   }
+
+   /* Re-enables HPT resize-down after hot-unplug */
+	mutex_unlock(&hpt_resize_down_lock);
+}
 #endif /* CONFIG_MEMORY_HOTPLUG */
 
 static void __init hash_init_partition_table(phys_addr_t hash_table,
diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 48b2cfe4ce69..44bc50d72353 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -426,6 +426,9 @@ static int dlpar_memory_remove_by_count(u32 lmbs_to_remove)
return -EINVAL;
}
 
+   if (!radix_enabled())
+   hash_batch_shrink_begin();
+
for_each_drmem_lmb(lmb) {
rc = dlpar_remove_lmb(lmb);
if (rc)
@@ -471,6 +474,9 @@ static int 

[PATCH v2 2/3] powerpc/mm/hash: Avoid multiple HPT resize-ups on memory hotplug

2021-04-30 Thread Leonardo Bras
Every time a memory hotplug happens, and the memory limit crosses a 2^n
value, it may be necessary to perform HPT resizing-up, which can take
some time (over 100ms in my tests).

It usually is not an issue, but it can take some time if a lot of memory
is added to a guest with little starting memory:
Adding 256G to a 2GB guest, for example, will require 8 HPT resizes.

Perform an HPT resize before memory hotplug, updating HPT to its
final size (considering a successful hotplug), taking the number of
HPT resizes to at most one per memory hotplug action.
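
In sketch form, each hotplug path now performs a single resize-up before
touching any LMB:

	if (!radix_enabled())
		hash_batch_expand_prepare(memblock_phys_mem_size() +
					  lmbs_to_add * drmem_lmb_size());

Without this, every power-of-two boundary crossed while growing from 2GB
towards 258GB can trigger its own resize-up, which is where the 8 resizes
in the example above come from.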

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/include/asm/book3s/64/hash.h |  2 ++
 arch/powerpc/mm/book3s64/hash_utils.c | 20 +++
 .../platforms/pseries/hotplug-memory.c|  9 +
 3 files changed, 31 insertions(+)

diff --git a/arch/powerpc/include/asm/book3s/64/hash.h b/arch/powerpc/include/asm/book3s/64/hash.h
index d959b0195ad9..fad4af8b8543 100644
--- a/arch/powerpc/include/asm/book3s/64/hash.h
+++ b/arch/powerpc/include/asm/book3s/64/hash.h
@@ -255,6 +255,8 @@ int hash__create_section_mapping(unsigned long start, unsigned long end,
 int nid, pgprot_t prot);
 int hash__remove_section_mapping(unsigned long start, unsigned long end);
 
+void hash_batch_expand_prepare(unsigned long newsize);
+
 #endif /* !__ASSEMBLY__ */
 #endif /* __KERNEL__ */
 #endif /* _ASM_POWERPC_BOOK3S_64_HASH_H */
diff --git a/arch/powerpc/mm/book3s64/hash_utils.c b/arch/powerpc/mm/book3s64/hash_utils.c
index 608e4ed397a9..3fa395b3fe57 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -859,6 +859,26 @@ int hash__remove_section_mapping(unsigned long start, unsigned long end)
 
return rc;
 }
+
+void hash_batch_expand_prepare(unsigned long newsize)
+{
+   const u64 starting_size = ppc64_pft_size;
+
+	/*
+	 * Resizing-up the HPT should never fail, but there are some cases
+	 * where the system starts with a higher SHIFT than required, and we
+	 * go through the funny case of resizing the HPT down while adding
+	 * memory.
+	 */
+
+   while (resize_hpt_for_hotplug(newsize, false) == -ENOSPC) {
+   newsize *= 2;
+   pr_warn("Hash collision while resizing HPT\n");
+
+   /* Do not try to resize to the starting size, or bigger value */
+   if (htab_shift_for_mem_size(newsize) >= starting_size)
+   break;
+   }
+}
 #endif /* CONFIG_MEMORY_HOTPLUG */
 
 static void __init hash_init_partition_table(phys_addr_t hash_table,
diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 8377f1f7c78e..48b2cfe4ce69 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -671,6 +672,10 @@ static int dlpar_memory_add_by_count(u32 lmbs_to_add)
if (lmbs_available < lmbs_to_add)
return -EINVAL;
 
+   if (!radix_enabled())
+   hash_batch_expand_prepare(memblock_phys_mem_size() +
+ lmbs_to_add * drmem_lmb_size());
+
for_each_drmem_lmb(lmb) {
if (lmb->flags & DRCONF_MEM_ASSIGNED)
continue;
@@ -788,6 +793,10 @@ static int dlpar_memory_add_by_ic(u32 lmbs_to_add, u32 drc_index)
if (lmbs_available < lmbs_to_add)
return -EINVAL;
 
+   if (!radix_enabled())
+   hash_batch_expand_prepare(memblock_phys_mem_size() +
+ lmbs_to_add * drmem_lmb_size());
+
for_each_drmem_lmb_in_range(lmb, start_lmb, end_lmb) {
if (lmb->flags & DRCONF_MEM_ASSIGNED)
continue;
-- 
2.30.2



[PATCH v2 1/3] powerpc/mm/hash: Avoid resizing-down HPT on first memory hotplug

2021-04-30 Thread Leonardo Bras
Because hypervisors may need to create HPTs without knowing the guest
page size, the smallest supported page size (4k) may be chosen, resulting
in an HPT that is possibly bigger than needed.

On a guest with bigger page sizes, the number of entries in the HPT may
be too high, causing the guest to ask for an HPT resize-down on the
first hotplug.

This becomes a problem when the HPT resize-down fails and causes the
HPT resize to be attempted on every LMB added, until the HPT size is
compatible with the guest memory size, causing a major slowdown.

So, avoiding HPT resizing-down on hot-add significantly improves memory
hotplug times.

As an example, hotplugging 256GB on a 129GB guest took 710s without this
patch, and 21s with it applied.

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/mm/book3s64/hash_utils.c | 36 ---
 1 file changed, 21 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/hash_utils.c b/arch/powerpc/mm/book3s64/hash_utils.c
index 581b20a2feaf..608e4ed397a9 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -795,7 +795,7 @@ static unsigned long __init htab_get_table_size(void)
 }
 
 #ifdef CONFIG_MEMORY_HOTPLUG
-static int resize_hpt_for_hotplug(unsigned long new_mem_size)
+static int resize_hpt_for_hotplug(unsigned long new_mem_size, bool shrinking)
 {
unsigned target_hpt_shift;
 
@@ -804,19 +804,25 @@ static int resize_hpt_for_hotplug(unsigned long new_mem_size)
 
target_hpt_shift = htab_shift_for_mem_size(new_mem_size);
 
-   /*
-* To avoid lots of HPT resizes if memory size is fluctuating
-* across a boundary, we deliberately have some hysterisis
-* here: we immediately increase the HPT size if the target
-* shift exceeds the current shift, but we won't attempt to
-* reduce unless the target shift is at least 2 below the
-* current shift
-*/
-   if (target_hpt_shift > ppc64_pft_size ||
-   target_hpt_shift < ppc64_pft_size - 1)
-   return mmu_hash_ops.resize_hpt(target_hpt_shift);
+   if (shrinking) {
 
-   return 0;
+   /*
+* To avoid lots of HPT resizes if memory size is fluctuating
+* across a boundary, we deliberately have some hysterisis
+* here: we immediately increase the HPT size if the target
+* shift exceeds the current shift, but we won't attempt to
+* reduce unless the target shift is at least 2 below the
+* current shift
+*/
+
+   if (target_hpt_shift >= ppc64_pft_size - 1)
+   return 0;
+
+   } else if (target_hpt_shift <= ppc64_pft_size) {
+   return 0;
+   }
+
+   return mmu_hash_ops.resize_hpt(target_hpt_shift);
 }
 
 int hash__create_section_mapping(unsigned long start, unsigned long end,
@@ -829,7 +835,7 @@ int hash__create_section_mapping(unsigned long start, unsigned long end,
return -1;
}
 
-   resize_hpt_for_hotplug(memblock_phys_mem_size());
+   resize_hpt_for_hotplug(memblock_phys_mem_size(), false);
 
rc = htab_bolt_mapping(start, end, __pa(start),
   pgprot_val(prot), mmu_linear_psize,
@@ -848,7 +854,7 @@ int hash__remove_section_mapping(unsigned long start, unsigned long end)
int rc = htab_remove_mapping(start, end, mmu_linear_psize,
 mmu_kernel_ssize);
 
-   if (resize_hpt_for_hotplug(memblock_phys_mem_size()) == -ENOSPC)
+   if (resize_hpt_for_hotplug(memblock_phys_mem_size(), true) == -ENOSPC)
pr_warn("Hash collision while resizing HPT\n");
 
return rc;
-- 
2.30.2



[PATCH v2 0/3] powerpc/mm/hash: Time improvements for memory hot(un)plug

2021-04-30 Thread Leonardo Bras
This patchset intends to reduce time needed for processing memory
hotplug/hotunplug in hash guests.

The first one makes sure guests with a page size over 4k don't need to
go through HPT resize-downs after memory hotplug.

The second and third patches make hotplug / hotunplug perform a single
HPT resize per operation, instead of one for each shift change, or one
for each LMB in case of resize-down error.

Why hasn't the same mechanism been used for both memory hotplug and
hotunplug? They have different requirements:

Memory hotplug usually causes HPT resize-ups, which are fine to perform
at the start of hotplug, but resize-ups should never be disabled, as
other mechanisms may try to increase memory and hit issues with an HPT
that is too small.

Memory hotunplug causes HPT resize-downs, which can be disabled (the HPT
will just remain larger for a while), but they need to happen at the end
of a hotunplug operation. If we want to batch them, we need to disable
resize-downs and perform a single one at the end, as sketched below.
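
A userspace sketch of the bracketing this turns into (my illustration,
not part of the series; stub functions stand in for the kernel
primitives, and hash_batch_shrink_begin()/end() are the helpers patch
3/3 adds):

#include <stdio.h>

static int resize_down_disabled;

static void hash_batch_shrink_begin(void)
{
	resize_down_disabled = 1;	/* mutex_lock() in the real code */
}

static void hash_batch_shrink_end(void)
{
	printf("single HPT resize-down\n");
	resize_down_disabled = 0;	/* mutex_unlock() in the real code */
}

static void dlpar_remove_lmb(int lmb)
{
	if (!resize_down_disabled)
		printf("HPT resize-down attempt for LMB %d\n", lmb);
	else
		printf("removed LMB %d, resize deferred\n", lmb);
}

int main(void)
{
	hash_batch_shrink_begin();
	for (int lmb = 0; lmb < 4; lmb++)
		dlpar_remove_lmb(lmb);
	hash_batch_shrink_end();
	return 0;
}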

Tests done with this patchset in the same machine / guest config:
Starting memory: 129GB, DIMM: 256GB
Before patchset: hotplug = 710s, hotunplug = 621s.
After patchset: hotplug = 21s, hotunplug = 100s.

Any feedback will be appreciated!

Changes since v1:
- Atomic used to disable resize was replaced by a mutex
- Removed wrappers, testing for !radix directly in hot(un)plug routine
- Added bounds to HPT resize loop
- Removed batching from dlpar_memory_*_by_index, as it adds a single LMB 

Best regards,
Leonardo Bras (3):
  powerpc/mm/hash: Avoid resizing-down HPT on first memory hotplug
  powerpc/mm/hash: Avoid multiple HPT resize-ups on memory hotplug
  powerpc/mm/hash: Avoid multiple HPT resize-downs on memory hotunplug

 arch/powerpc/include/asm/book3s/64/hash.h |  4 +
 arch/powerpc/mm/book3s64/hash_utils.c | 95 ---
 .../platforms/pseries/hotplug-memory.c| 35 +++
 3 files changed, 119 insertions(+), 15 deletions(-)

-- 
2.30.2



[PATCH 3/3] pseries/hotplug-memory.c: adding dlpar_memory_remove_lmbs_common()

2021-04-30 Thread Daniel Henrique Barboza
One difference between dlpar_memory_remove_by_count() and
dlpar_memory_remove_by_ic() is that the latter, added in commit
753843471cbb, removes the LMBs in a contiguous block. This was done
because QEMU works with DIMMs, which are nothing more than sets of LMBs
that must be added or removed together. Failing to remove one LMB must
fail the removal of the entire set. Another difference is that
dlpar_memory_remove_by_ic() sets the LMB DRC to unisolate on a removal
error, which is a no-op for hypervisors that don't care about this
error-handling knob, and could be called by remove_by_count() without
issues.

Aside from that, the logic is the same for both functions, and yet we
keep them separated and having to duplicate LMB removal logic in both.

This patch introduces a helper called dlpar_memory_remove_lmbs_common()
to be used by both functions. The helper handles the block-removal case
of remove_by_ic() by failing earlier in the validation and removal steps
when it is called with a drc_index, while preserving the more relaxed
behavior of remove_by_count() when drc_index is 0.

Signed-off-by: Daniel Henrique Barboza 
---
 .../platforms/pseries/hotplug-memory.c| 163 +++---
 1 file changed, 67 insertions(+), 96 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 4e6d162c3f1a..a031993725ca 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -399,25 +399,43 @@ static int dlpar_remove_lmb(struct drmem_lmb *lmb)
return 0;
 }
 
-static int dlpar_memory_remove_by_count(u32 lmbs_to_remove)
+static int dlpar_memory_remove_lmbs_common(u32 lmbs_to_remove, u32 drc_index)
 {
-   struct drmem_lmb *lmb;
-   int lmbs_removed = 0;
+   struct drmem_lmb *lmb, *start_lmb, *end_lmb;
int lmbs_available = 0;
+   int lmbs_removed = 0;
int rc;
-
-   pr_info("Attempting to hot-remove %d LMB(s)\n", lmbs_to_remove);
+   /*
+* dlpar_memory_remove_by_ic() wants to remove all
+* 'lmbs_to_remove' LMBs, starting from drc_index, in a
+* contiguous block.
+*/
+   bool block_removal;
 
if (lmbs_to_remove == 0)
return -EINVAL;
 
+   block_removal = drc_index != 0;
+
+   if (block_removal) {
+   rc = get_lmb_range(drc_index, lmbs_to_remove, &start_lmb,
+  &end_lmb);
+   if (rc)
+   return -EINVAL;
+   } else {
+   start_lmb = &drmem_info->lmbs[0];
+   end_lmb = &drmem_info->lmbs[drmem_info->n_lmbs];
+   }
+
/* Validate that there are enough LMBs to satisfy the request */
-   for_each_drmem_lmb(lmb) {
-   if (lmb_is_removable(lmb))
+   for_each_drmem_lmb_in_range(lmb, start_lmb, end_lmb) {
+   if (lmb_is_removable(lmb)) {
lmbs_available++;
 
-   if (lmbs_available == lmbs_to_remove)
+   if (lmbs_available == lmbs_to_remove)
+   break;
+   } else if (block_removal) {
break;
+   }
}
 
if (lmbs_available < lmbs_to_remove) {
@@ -426,28 +444,40 @@ static int dlpar_memory_remove_by_count(u32 lmbs_to_remove)
return -EINVAL;
}
 
-   for_each_drmem_lmb(lmb) {
+   for_each_drmem_lmb_in_range(lmb, start_lmb, end_lmb) {
rc = dlpar_remove_lmb(lmb);
-   if (rc)
-   continue;
 
-   /* Mark this lmb so we can add it later if all of the
-* requested LMBs cannot be removed.
-*/
-   drmem_mark_lmb_reserved(lmb);
+   if (!rc) {
+   /* Mark this lmb so we can add it later if all of the
+* requested LMBs cannot be removed.
+*/
+   drmem_mark_lmb_reserved(lmb);
 
-   lmbs_removed++;
-   if (lmbs_removed == lmbs_to_remove)
+   lmbs_removed++;
+   if (lmbs_removed == lmbs_to_remove)
+   break;
+   } else if (block_removal) {
break;
+   }
}
 
if (lmbs_removed != lmbs_to_remove) {
-   pr_err("Memory hot-remove failed, adding LMB's back\n");
+   if (block_removal)
+   pr_err("Memory indexed-count-remove failed, adding any 
removed LMBs\n");
+   else
+   pr_err("Memory hot-remove failed, adding LMB's back\n");
 
-   for_each_drmem_lmb(lmb) {
+   for_each_drmem_lmb_in_range(lmb, start_lmb, end_lmb) {
if (!drmem_lmb_reserved(lmb))
continue;
 
+

[PATCH 1/3] powerpc/pseries: Set UNISOLATE on dlpar_memory_remove_by_ic() error

2021-04-30 Thread Daniel Henrique Barboza
As previously done in dlpar_cpu_remove() for CPUs, this patch changes
dlpar_memory_remove_by_ic() to unisolate the LMB DRC when the LMB
fails to be removed. The hypervisor, seeing an LMB DRC that was supposed
to be removed being unisolated instead, can do error recovery on its
side.

This change is done in dlpar_memory_remove_by_ic() only because, as of
today, only QEMU is using this code path for error recovery (via the
PSERIES_HP_ELOG_ID_DRC_IC event). phyp treats it as a no-op.

Signed-off-by: Daniel Henrique Barboza 
---
 arch/powerpc/platforms/pseries/hotplug-memory.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 8377f1f7c78e..bb98574a84a2 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -551,6 +551,13 @@ static int dlpar_memory_remove_by_ic(u32 lmbs_to_remove, u32 drc_index)
if (!drmem_lmb_reserved(lmb))
continue;
 
+   /*
+    * Setting the isolation state of an UNISOLATED/CONFIGURED
+    * device to UNISOLATE is a no-op, but the hypervisor can
+    * use it as a hint that the LMB removal failed.
+    */
+   dlpar_unisolate_drc(lmb->drc_index);
+
rc = dlpar_add_lmb(lmb);
if (rc)
pr_err("Failed to add LMB, drc index %x\n",
-- 
2.30.2



[PATCH 2/3] hotplug-memory.c: enhance dlpar_memory_remove* LMB checks

2021-04-30 Thread Daniel Henrique Barboza
dlpar_memory_remove_by_ic() validates the number of LMBs to be removed
by checking !DRCONF_MEM_RESERVED, and in the following loop, before
dlpar_remove_lmb(), a check for DRCONF_MEM_ASSIGNED is made before
removing it. This means that an LMB that is both !DRCONF_MEM_RESERVED
and !DRCONF_MEM_ASSIGNED will be counted as valid but then not removed.
The function will end up not removing all 'lmbs_to_remove' LMBs while
also not reporting any errors.
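
Spelled out (an editorial summary of the above; the DRCONF_MEM_*
prefixes are dropped for brevity), the pre-patch behavior of
remove_by_ic() is:

  RESERVED  ASSIGNED  | counted as removable?  actually removed?
  no        yes       | yes                    yes
  no        no        | yes                    no   <- silent failure
  yes       any       | no                     no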

Comparing it to dlpar_memory_remove_by_count(), the validation is done
via lmb_is_removable(), which checks for DRCONF_MEM_ASSIGNED and fadump
constraints. No additional check is made afterwards, and
DRCONF_MEM_RESERVED is never checked before dlpar_remove_lmb(). The
function doesn't have the same 'check A for validation, then B for
removal' issue as remove_by_ic(), but it's not checking if the LMB is
reserved.

There is no reason for these functions to validate the same operation in
two different manners. This patch addresses that by changing
lmb_is_removable() to also check for DRCONF_MEM_RESERVED to tell if an
LMB is removable, making dlpar_memory_remove_by_count() take the
reservation state into account when counting the LMBs.
lmb_is_removable() is then used in the validation step of
dlpar_memory_remove_by_ic(), which is already checking for both states
but in different stages, to avoid counting a LMB that is not assigned as
eligible for removal. We can then skip the check before
dlpar_remove_lmb() since we're validating all LMBs beforehand.

Signed-off-by: Daniel Henrique Barboza 
---
 arch/powerpc/platforms/pseries/hotplug-memory.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
index bb98574a84a2..4e6d162c3f1a 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -348,7 +348,8 @@ static int pseries_remove_mem_node(struct device_node *np)
 
 static bool lmb_is_removable(struct drmem_lmb *lmb)
 {
-   if (!(lmb->flags & DRCONF_MEM_ASSIGNED))
+   if ((lmb->flags & DRCONF_MEM_RESERVED) ||
+   !(lmb->flags & DRCONF_MEM_ASSIGNED))
return false;
 
 #ifdef CONFIG_FA_DUMP
@@ -523,7 +524,7 @@ static int dlpar_memory_remove_by_ic(u32 lmbs_to_remove, u32 drc_index)
 
/* Validate that there are enough LMBs to satisfy the request */
for_each_drmem_lmb_in_range(lmb, start_lmb, end_lmb) {
-   if (lmb->flags & DRCONF_MEM_RESERVED)
+   if (!lmb_is_removable(lmb))
break;
 
lmbs_available++;
@@ -533,9 +534,6 @@ static int dlpar_memory_remove_by_ic(u32 lmbs_to_remove, u32 drc_index)
return -EINVAL;
 
for_each_drmem_lmb_in_range(lmb, start_lmb, end_lmb) {
-   if (!(lmb->flags & DRCONF_MEM_ASSIGNED))
-   continue;
-
rc = dlpar_remove_lmb(lmb);
if (rc)
break;
-- 
2.30.2



[PATCH 0/3] Unisolate LMBs DRC on removal error + cleanups

2021-04-30 Thread Daniel Henrique Barboza
Hi,

This is a follow-up of the work done in dlpar_cpu_remove() to
report CPU removal error by unisolating the DRC. This time I'm
doing it for LMBs. Patch 01 handles this.

Patches 2 and 3 are cleanups I consider worth posting.


Daniel Henrique Barboza (3):
  powerpc/pseries: Set UNISOLATE on dlpar_memory_remove_by_ic() error
  hotplug-memory.c: enhance dlpar_memory_remove* LMB checks
  pseries/hotplug-memory.c: adding dlpar_memory_remove_lmbs_common()

 .../platforms/pseries/hotplug-memory.c| 162 --
 1 file changed, 69 insertions(+), 93 deletions(-)

-- 
2.30.2



Re: [PATCH 2/3] powerpc: prom_init: switch to early string functions

2021-04-30 Thread Christophe Leroy




On 30/04/2021 at 06:22, Daniel Walker wrote:

This converts the prom_init string users to the early string functions,
which don't suffer from KASAN or any other debugging instrumentation.

Cc: xe-linux-exter...@cisco.com
Signed-off-by: Daniel Walker 
---
  arch/powerpc/kernel/prom_init.c| 185 ++---
  arch/powerpc/kernel/prom_init_check.sh |   9 +-
  2 files changed, 51 insertions(+), 143 deletions(-)

diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index ccf77b985c8f..4d4343da1280 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -225,105 +225,6 @@ static bool  __prombss rtas_has_query_cpu_stopped;
  #define PHANDLE_VALID(p)  ((p) != 0 && (p) != PROM_ERROR)
  #define IHANDLE_VALID(i)  ((i) != 0 && (i) != PROM_ERROR)
  
-/* Copied from lib/string.c and lib/kstrtox.c */


Please leave the second part of the comment as you have not removed 
prom_strtobool()



Re: [PATCH 1/3] lib: early_string: allow early usage of some string functions

2021-04-30 Thread Christophe Leroy




On 30/04/2021 at 10:47, Christophe Leroy wrote:



On 30/04/2021 at 06:22, Daniel Walker wrote:

This system allows some string functions to be moved into
lib/early_string.c, where they will be prepended with "early_" and
compiled without debugging instrumentation such as KASAN.

This is already done on x86 for
"AMD Secure Memory Encryption (SME) support",

on powerpc's prom_init.c, and in EFI's libstub.

The AMD memory feature disables KASAN for all string functions, while
prom_init.c and EFI's libstub implement their own versions of the
functions.

This implementation allows sharing of the string functions without
removing the debugging features for the whole system.


This looks good. I prefer this to the way you proposed to do it two
years ago.

Only one problem, see below.


+size_t strlcat(char *dest, const char *src, size_t count)
+{
+    size_t dsize = strlen(dest);
+    size_t len = strlen(src);
+    size_t res = dsize + len;
+
+    /* This would be a bug */
+    BUG_ON(dsize >= count);


powerpc is not ready to handle BUG_ON() when in prom_init.

Can you do:

#ifndef __EARLY_STRING_ENABLED
 BUG_ON(dsize >= count);
#endif


In fact, it should be like in prom_init today:

#ifdef __EARLY_STRING_ENABLED
if (dsize >= count)
return count;
#else
BUG_ON(dsize >= count);
#endif
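

Put together, the split being suggested would make the shared strlcat()
look something like this (a sketch only, assembled from the quoted patch
and the suggestion above; __EARLY_STRING_ENABLED is the guard name
proposed in this thread):

size_t strlcat(char *dest, const char *src, size_t count)
{
	size_t dsize = strlen(dest);
	size_t len = strlen(src);
	size_t res = dsize + len;

#ifdef __EARLY_STRING_ENABLED
	/* prom_init cannot BUG(): fail gracefully as its copy does today */
	if (dsize >= count)
		return count;
#else
	/* This would be a bug */
	BUG_ON(dsize >= count);
#endif

	dest += dsize;
	count -= dsize;
	if (len >= count)
		len = count - 1;
	memcpy(dest, src, len);
	dest[len] = 0;
	return res;
}
EXPORT_SYMBOL(strlcat);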





+
+    dest += dsize;
+    count -= dsize;
+    if (len >= count)
+    len = count-1;
+    memcpy(dest, src, len);
+    dest[len] = 0;
+    return res;
+}
+EXPORT_SYMBOL(strlcat);
+#endif
+


Re: [PATCH 1/3] lib: early_string: allow early usage of some string functions

2021-04-30 Thread Christophe Leroy




On 30/04/2021 at 06:22, Daniel Walker wrote:

This system allows some string functions to be moved into
lib/early_string.c, where they will be prepended with "early_" and
compiled without debugging instrumentation such as KASAN.

This is already done on x86 for
"AMD Secure Memory Encryption (SME) support",

on powerpc's prom_init.c, and in EFI's libstub.

The AMD memory feature disables KASAN for all string functions, while
prom_init.c and EFI's libstub implement their own versions of the
functions.

This implementation allows sharing of the string functions without
removing the debugging features for the whole system.


This looks good. I prefer this to the way you proposed to do it two
years ago.

Only one problem, see below.


+size_t strlcat(char *dest, const char *src, size_t count)
+{
+   size_t dsize = strlen(dest);
+   size_t len = strlen(src);
+   size_t res = dsize + len;
+
+   /* This would be a bug */
+   BUG_ON(dsize >= count);


powerpc is not ready to handle BUG_ON() when in prom_init.

Can you do:

#ifndef __EARLY_STRING_ENABLED
BUG_ON(dsize >= count);
#endif



+
+   dest += dsize;
+   count -= dsize;
+   if (len >= count)
+   len = count-1;
+   memcpy(dest, src, len);
+   dest[len] = 0;
+   return res;
+}
+EXPORT_SYMBOL(strlcat);
+#endif
+


Re: [PATCH] powerpc/powernv/memtrace: Fix dcache flushing

2021-04-30 Thread Christophe Leroy




On 30/04/2021 at 09:55, Sandipan Das wrote:

Trace memory is cleared and the corresponding dcache lines
are flushed after allocation. However, this should not be
done using the PFN. This adds the missing __va() conversion.

Fixes: 2ac02e5ecec0 ("powerpc/mm: Remove dcache flush from memory remove.")
Signed-off-by: Sandipan Das 
---
  arch/powerpc/platforms/powernv/memtrace.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/memtrace.c b/arch/powerpc/platforms/powernv/memtrace.c
index 71c1262589fe..a31f13814f2e 100644
--- a/arch/powerpc/platforms/powernv/memtrace.c
+++ b/arch/powerpc/platforms/powernv/memtrace.c
@@ -104,8 +104,8 @@ static void memtrace_clear_range(unsigned long start_pfn,
 * Before we go ahead and use this range as cache inhibited range
 * flush the cache.
 */
-   flush_dcache_range_chunked(PFN_PHYS(start_pfn),
-  PFN_PHYS(start_pfn + nr_pages),
+   flush_dcache_range_chunked((unsigned long)__va(PFN_PHYS(start_pfn)),
+  (unsigned long)__va(PFN_PHYS(start_pfn + nr_pages)),


Can you use pfn_to_virt() instead?


   FLUSH_CHUNK_SIZE);
  }
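
For reference, that suggestion would turn the call into something like
the following (a sketch, assuming pfn_to_virt() is usable in this
context on powerpc):

	flush_dcache_range_chunked((unsigned long)pfn_to_virt(start_pfn),
				   (unsigned long)pfn_to_virt(start_pfn + nr_pages),
				   FLUSH_CHUNK_SIZE);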
  



Re: [PATCH] ppc64/numa: consider the max numa node for migratable LPAR

2021-04-30 Thread Laurent Dufour

On 29/04/2021 at 21:29, Tyrel Datwyler wrote:

On 4/29/21 11:19 AM, Laurent Dufour wrote:

When an LPAR is migratable, we should consider the maximum possible
number of NUMA nodes instead of the number of NUMA nodes on the actual
system.

The DT property 'ibm,current-associativity-domains' defines the maximum
number of nodes the LPAR can see when running on that box. But if the
LPAR is migrated to another box, it may see up to the number of nodes
defined by 'ibm,max-associativity-domains'. So if an LPAR is migratable,
that value should be used.

Unfortunately, there is no easy way to know if an LPAR is migratable or
not. The hypervisor exports the property 'ibm,migratable-partition' when
the partition is set up to support migration, but that does not mean
that the current partition is migratable.


The wording is a little hard to follow for me here. From PAPR, the
presence of the 'ibm,migratable-partition' property indicates that the
platform supports the potential migration of the partition. I guess the
point is that all migratable partitions define
'ibm,migratable-partition', but not all partitions that define
'ibm,migratable-partition' are migratable.


That's what I meant.

Laurent


-Tyrel



Without that patch, when an LPAR is started on a 2-node box and then
migrated to a 3-node box, the hypervisor may spread the LPAR's CPUs over
the 3rd node. In that case, if a CPU from that 3rd node is added to the
LPAR, it will be wrongly assigned to a node because the kernel has been
set to use up to 2 nodes (the configuration of the departure box). With
that patch applied, the CPU is correctly added to the 3rd node.

Cc: Srikar Dronamraju 
Signed-off-by: Laurent Dufour 
---
  arch/powerpc/mm/numa.c | 14 +++---
  1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index f2bf98bdcea2..673fa6e47850 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -893,7 +893,7 @@ static void __init setup_node_data(int nid, u64 start_pfn, u64 end_pfn)
  static void __init find_possible_nodes(void)
  {
struct device_node *rtas;
-   const __be32 *domains;
+   const __be32 *domains = NULL;
int prop_length, max_nodes;
u32 i;

@@ -909,9 +909,14 @@ static void __init find_possible_nodes(void)
 * it doesn't exist, then fallback on ibm,max-associativity-domains.
 * Current denotes what the platform can support compared to max
 * which denotes what the Hypervisor can support.
+*
+* If the LPAR is migratable, new nodes might be activated after a LPM,
+* so we should consider the max number in that case.
 */
-   domains = of_get_property(rtas, "ibm,current-associativity-domains",
-   &prop_length);
+   if (!of_get_property(of_root, "ibm,migratable-partition", NULL))
+   domains = of_get_property(rtas,
+ "ibm,current-associativity-domains",
+ &prop_length);
if (!domains) {
domains = of_get_property(rtas, "ibm,max-associativity-domains",
&prop_length);
@@ -920,6 +925,9 @@ static void __init find_possible_nodes(void)
}

max_nodes = of_read_number(&domains[min_common_depth], 1);
+   printk(KERN_INFO "Partition configured for %d NUMA nodes.\n",
+  max_nodes);
+
for (i = 0; i < max_nodes; i++) {
if (!node_possible(i))
node_set(i, node_possible_map);







[PATCH 19/31] powerpc/xics: Add debug logging to the set_irq_affinity handlers

2021-04-30 Thread Cédric Le Goater
It really helps to know how the HW is configured when tweaking the IRQ
subsystem.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/sysdev/xics/ics-opal.c | 2 +-
 arch/powerpc/sysdev/xics/ics-rtas.c | 3 +++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/sysdev/xics/ics-opal.c b/arch/powerpc/sysdev/xics/ics-opal.c
index 8c7ddcc718b6..bf26cae1b982 100644
--- a/arch/powerpc/sysdev/xics/ics-opal.c
+++ b/arch/powerpc/sysdev/xics/ics-opal.c
@@ -133,7 +133,7 @@ static int ics_opal_set_affinity(struct irq_data *d,
}
server = ics_opal_mangle_server(wanted_server);
 
-   pr_devel("ics-hal: set-affinity irq %d [hw 0x%x] server: 0x%x/0x%x\n",
+   pr_debug("ics-hal: set-affinity irq %d [hw 0x%x] server: 0x%x/0x%x\n",
 d->irq, hw_irq, wanted_server, server);
 
rc = opal_set_xive(hw_irq, server, priority);
diff --git a/arch/powerpc/sysdev/xics/ics-rtas.c b/arch/powerpc/sysdev/xics/ics-rtas.c
index 6d19d711ed35..b50c6341682e 100644
--- a/arch/powerpc/sysdev/xics/ics-rtas.c
+++ b/arch/powerpc/sysdev/xics/ics-rtas.c
@@ -133,6 +133,9 @@ static int ics_rtas_set_affinity(struct irq_data *d,
return -1;
}
 
+   pr_debug("%s: irq %d [hw 0x%x] server: 0x%x\n", __func__, d->irq,
+hw_irq, irq_server);
+
status = rtas_call_reentrant(ibm_set_xive, 3, 1, NULL,
 hw_irq, irq_server, xics_status[1]);
 
-- 
2.26.3



[PATCH 28/31] powerpc/powernv/pci: Set the IRQ chip data for P8/CXL devices

2021-04-30 Thread Cédric Le Goater
Before MSI domains, the default IRQ chip of PHB3 MSIs was patched by
pnv_set_msi_irq_chip() with the custom EOI handler pnv_ioda2_msi_eoi()
and the owning PHB was deduced from the 'ioda.irq_chip' field. This
path has been deprecated by the MSI domains but it is still in use by
the P8 CAPI 'cxl' driver.

Rewriting this driver to support MSI would be a waste of time.
Nevertheless, we can still remove the IRQ chip patch and set the IRQ
chip data instead. This is cleaner.

Cc: Frederic Barrat 
Cc: Christophe Lombard 
Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index c1598ab730c3..d496d5b1b45a 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2115,19 +2115,23 @@ int64_t pnv_opal_pci_msi_eoi(struct irq_chip *chip, unsigned int hw_irq)
return opal_pci_msi_eoi(phb->opal_id, hw_irq);
 }
 
+/*
+ * The IRQ data is mapped in the XICS domain, with OPAL HW IRQ numbers
+ */
 static void pnv_ioda2_msi_eoi(struct irq_data *d)
 {
int64_t rc;
unsigned int hw_irq = (unsigned int)irqd_to_hwirq(d);
-   struct irq_chip *chip = irq_data_get_irq_chip(d);
+   struct pci_controller *hose = irq_data_get_irq_chip_data(d);
+   struct pnv_phb *phb = hose->private_data;
 
-   rc = pnv_opal_pci_msi_eoi(chip, hw_irq);
+   rc = opal_pci_msi_eoi(phb->opal_id, hw_irq);
WARN_ON_ONCE(rc);
 
icp_native_eoi(d);
 }
 
-
+/* P8/CXL only */
 void pnv_set_msi_irq_chip(struct pnv_phb *phb, unsigned int virq)
 {
struct irq_data *idata;
@@ -2149,6 +2153,7 @@ void pnv_set_msi_irq_chip(struct pnv_phb *phb, unsigned int virq)
phb->ioda.irq_chip.irq_eoi = pnv_ioda2_msi_eoi;
}
irq_set_chip(virq, &phb->ioda.irq_chip);
+   irq_set_chip_data(virq, phb->hose);
 }
 
 static struct irq_chip pnv_pci_msi_irq_chip;
-- 
2.26.3



[PATCH 13/31] KVM: PPC: Book3S HV: Use the new IRQ chip to detect passthrough interrupts

2021-04-30 Thread Cédric Le Goater
Passthrough PCI MSI interrupts are detected in KVM with a check on a
specific EOI handler (P8) or on XIVE (P9). We can now check the
PCI-MSI IRQ chip which is cleaner.

Cc: Paul Mackerras 
Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/kvm/book3s_hv.c  | 2 +-
 arch/powerpc/platforms/powernv/pci-ioda.c | 4 +++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index deb450e4289e..86a0f8b0e6da 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -5153,7 +5153,7 @@ static int kvmppc_set_passthru_irq(struct kvm *kvm, int host_irq, int guest_gsi)
 * what our real-mode EOI code does, or a XIVE interrupt
 */
chip = irq_data_get_irq_chip(&desc->irq_data);
-   if (!chip || !(is_pnv_opal_msi(chip) || is_xive_irq(chip))) {
+   if (!chip || !is_pnv_opal_msi(chip)) {
pr_warn("kvmppc_set_passthru_irq_hv: Could not assign IRQ map 
for (%d,%d)\n",
host_irq, guest_gsi);
mutex_unlock(&kvm->lock);
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 3886ca6e2ed3..7b75af17dc59 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2151,13 +2151,15 @@ void pnv_set_msi_irq_chip(struct pnv_phb *phb, unsigned int virq)
irq_set_chip(virq, >ioda.irq_chip);
 }
 
+static struct irq_chip pnv_pci_msi_irq_chip;
+
 /*
  * Returns true iff chip is something that we could call
  * pnv_opal_pci_msi_eoi for.
  */
 bool is_pnv_opal_msi(struct irq_chip *chip)
 {
-   return chip->irq_eoi == pnv_ioda2_msi_eoi;
+   return chip->irq_eoi == pnv_ioda2_msi_eoi || chip == &pnv_pci_msi_irq_chip;
 }
 EXPORT_SYMBOL_GPL(is_pnv_opal_msi);
 
-- 
2.26.3



[PATCH 07/31] powerpc/xive: Fix xive_irq_set_affinity for MSI

2021-04-30 Thread Cédric Le Goater
The MSI affinity is automanaged and it can be set before starting the
associated IRQ.

( Should we simply remove the irqd_is_started() test ? )

Cc: Thomas Gleixner 
Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/sysdev/xive/common.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
index 96737938e8e3..3485baf9ec8c 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -710,7 +710,7 @@ static int xive_irq_set_affinity(struct irq_data *d,
return -EINVAL;
 
/* Don't do anything if the interrupt isn't started */
-   if (!irqd_is_started(d))
+   if (!irqd_is_started(d) && !irqd_affinity_is_managed(d))
return IRQ_SET_MASK_OK;
 
/*
-- 
2.26.3



[PATCH 29/31] powerpc/powernv/pci: Rework pnv_opal_pci_msi_eoi()

2021-04-30 Thread Cédric Le Goater
pnv_opal_pci_msi_eoi() is called from KVM to EOI passthrough interrupts
when in real mode. Adding the MSI domain broke the hack that used the
'ioda.irq_chip' field to deduce the owning PHB. Fix that by using the
IRQ chip data in the MSI domain.

The 'ioda.irq_chip' field is now unused and could be removed from the
pnv_phb struct.

Cc: Paul Mackerras 
Cc: Alexey Kardashevskiy 
Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/include/asm/pnv-pci.h|  2 +-
 arch/powerpc/kvm/book3s_hv_rm_xics.c  |  8 
 arch/powerpc/platforms/powernv/pci-ioda.c | 17 +
 3 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/pnv-pci.h b/arch/powerpc/include/asm/pnv-pci.h
index d0ee0ede5767..b3f480799352 100644
--- a/arch/powerpc/include/asm/pnv-pci.h
+++ b/arch/powerpc/include/asm/pnv-pci.h
@@ -33,7 +33,7 @@ int pnv_cxl_alloc_hwirqs(struct pci_dev *dev, int num);
 void pnv_cxl_release_hwirqs(struct pci_dev *dev, int hwirq, int num);
 int pnv_cxl_get_irq_count(struct pci_dev *dev);
 struct device_node *pnv_pci_get_phb_node(struct pci_dev *dev);
-int64_t pnv_opal_pci_msi_eoi(struct irq_chip *chip, unsigned int hw_irq);
+int64_t pnv_opal_pci_msi_eoi(struct irq_data *d);
 bool is_pnv_opal_msi(struct irq_chip *chip);
 
 #ifdef CONFIG_CXL_BASE
diff --git a/arch/powerpc/kvm/book3s_hv_rm_xics.c b/arch/powerpc/kvm/book3s_hv_rm_xics.c
index c2c9c733f359..1772d53526e2 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_xics.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_xics.c
@@ -713,6 +713,7 @@ static int ics_rm_eoi(struct kvm_vcpu *vcpu, u32 irq)
icp->rm_eoied_irq = irq;
}
 
+   /* Handle passthrough interrupts */
if (state->host_irq) {
++vcpu->stat.pthru_all;
if (state->intr_cpu != -1) {
@@ -766,7 +767,7 @@ int xics_rm_h_eoi(struct kvm_vcpu *vcpu, unsigned long xirr)
 
 static unsigned long eoi_rc;
 
-static void icp_eoi(struct irq_chip *c, u32 hwirq, __be32 xirr, bool *again)
+static void icp_eoi(struct irq_data *d, u32 hwirq, __be32 xirr, bool *again)
 {
void __iomem *xics_phys;
int64_t rc;
@@ -779,7 +780,7 @@ static void icp_eoi(struct irq_data *d, u32 hwirq, __be32 xirr, bool *again)
return;
}
 
-   rc = pnv_opal_pci_msi_eoi(c, hwirq);
+   rc = pnv_opal_pci_msi_eoi(d);
 
if (rc)
eoi_rc = rc;
@@ -887,8 +888,7 @@ long kvmppc_deliver_irq_passthru(struct kvm_vcpu *vcpu,
icp_rm_deliver_irq(xics, icp, irq, false);
 
/* EOI the interrupt */
-   icp_eoi(irq_desc_get_chip(irq_map->desc), irq_map->r_hwirq, xirr,
-   again);
+   icp_eoi(irq_desc_get_irq_data(irq_map->desc), irq_map->r_hwirq, xirr, again);
 
if (check_too_hard(xics, icp) == H_TOO_HARD)
return 2;
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index d496d5b1b45a..8406b94cbfca 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2107,12 +2107,21 @@ void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
pe->dma_setup_done = true;
 }
 
-int64_t pnv_opal_pci_msi_eoi(struct irq_chip *chip, unsigned int hw_irq)
+/*
+ * Called from KVM in real mode to EOI passthru interrupts. The ICP
+ * EOI is handled directly in KVM in kvmppc_deliver_irq_passthru().
+ *
+ * The IRQ data is mapped in the PCI-MSI domain and the EOI OPAL call
+ * needs an HW IRQ number mapped in the XICS IRQ domain. The HW IRQ
+ * numbers of the in-the-middle MSI domain are vector numbers and it's
+ * good enough for OPAL. Use that.
+ */
+int64_t pnv_opal_pci_msi_eoi(struct irq_data *d)
 {
-   struct pnv_phb *phb = container_of(chip, struct pnv_phb,
-  ioda.irq_chip);
+   struct pci_controller *hose = irq_data_get_irq_chip_data(d->parent_data);
+   struct pnv_phb *phb = hose->private_data;
 
-   return opal_pci_msi_eoi(phb->opal_id, hw_irq);
+   return opal_pci_msi_eoi(phb->opal_id, d->parent_data->hwirq);
 }
 
 /*
-- 
2.26.3



[PATCH 22/31] powerpc/pci: Drop XIVE restriction on MSI domains

2021-04-30 Thread Cédric Le Goater
The PowerNV and pSeries platforms now have support for both the XICS
and XIVE IRQ domains.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 4 +---
 arch/powerpc/platforms/pseries/msi.c  | 4 
 2 files changed, 1 insertion(+), 7 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 7035be271c34..13b56de92d85 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2476,9 +2476,7 @@ static void pnv_pci_init_ioda_msis(struct pnv_phb *phb)
pr_info("  Allocated bitmap for %d MSIs (base IRQ 0x%x)\n",
count, phb->msi_base);
 
-   /* Only supported by the XIVE driver */
-   if (xive_enabled())
-   pnv_msi_allocate_domains(phb->hose, count);
+   pnv_msi_allocate_domains(phb->hose, count);
 }
 
 static void pnv_ioda_setup_pe_res(struct pnv_ioda_pe *pe,
diff --git a/arch/powerpc/platforms/pseries/msi.c b/arch/powerpc/platforms/pseries/msi.c
index d1470941cadf..1886cb5ca4df 100644
--- a/arch/powerpc/platforms/pseries/msi.c
+++ b/arch/powerpc/platforms/pseries/msi.c
@@ -720,10 +720,6 @@ int pseries_msi_allocate_domains(struct pci_controller *phb)
 {
int count;
 
-   /* Only supported by the XIVE driver */
-   if (!xive_enabled())
-   return -ENODEV;
-
if (!__find_pe_total_msi(phb->dn, &count)) {
pr_err("PCI: failed to find MSIs for bridge %pOF (domain %d)\n",
   phb->dn, phb->global_number);
-- 
2.26.3



[PATCH 20/31] powerpc/xics: Add support for IRQ domain hierarchy

2021-04-30 Thread Cédric Le Goater
XICS doesn't have any state associated with the IRQ. The support is
straightforward and simpler than for XIVE.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/sysdev/xics/xics-common.c | 37 ++
 1 file changed, 37 insertions(+)

diff --git a/arch/powerpc/sysdev/xics/xics-common.c b/arch/powerpc/sysdev/xics/xics-common.c
index 981587c7..05d21005dc79 100644
--- a/arch/powerpc/sysdev/xics/xics-common.c
+++ b/arch/powerpc/sysdev/xics/xics-common.c
@@ -406,7 +406,44 @@ int xics_retrigger(struct irq_data *data)
return 0;
 }
 
+static int xics_host_domain_translate(struct irq_domain *d, struct irq_fwspec *fwspec,
+ unsigned long *hwirq, unsigned int *type)
+{
+   return xics_host_xlate(d, to_of_node(fwspec->fwnode), fwspec->param,
+  fwspec->param_count, hwirq, type);
+}
+
+static int xics_host_domain_alloc(struct irq_domain *domain, unsigned int virq,
+ unsigned int nr_irqs, void *arg)
+{
+   struct irq_fwspec *fwspec = arg;
+   irq_hw_number_t hwirq;
+   unsigned int type = IRQ_TYPE_NONE;
+   int i, rc;
+
+   rc = xics_host_domain_translate(domain, fwspec, &hwirq, &type);
+   if (rc)
+   return rc;
+
+   pr_debug("%s %d/%lx #%d\n", __func__, virq, hwirq, nr_irqs);
+
+   for (i = 0; i < nr_irqs; i++)
+   irq_domain_set_info(domain, virq + i, hwirq + i, xics_ics->chip,
+   xics_ics, handle_fasteoi_irq, NULL, NULL);
+
+   return 0;
+}
+
+static void xics_host_domain_free(struct irq_domain *domain,
+ unsigned int virq, unsigned int nr_irqs)
+{
+   pr_debug("%s %d #%d\n", __func__, virq, nr_irqs);
+}
+
 static const struct irq_domain_ops xics_host_ops = {
+   .alloc  = xics_host_domain_alloc,
+   .free   = xics_host_domain_free,
+   .translate = xics_host_domain_translate,
.match = xics_host_match,
.map = xics_host_map,
.xlate = xics_host_xlate,
-- 
2.26.3



[PATCH 23/31] powerpc/xics: Drop unmask of MSIs at startup

2021-04-30 Thread Cédric Le Goater
That was a workaround in the XICS domain because of the lack of MSI
domain. This is now handled.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/sysdev/xics/ics-opal.c | 11 ---
 arch/powerpc/sysdev/xics/ics-rtas.c |  9 -
 2 files changed, 20 deletions(-)

diff --git a/arch/powerpc/sysdev/xics/ics-opal.c b/arch/powerpc/sysdev/xics/ics-opal.c
index bf26cae1b982..c4d95d8beb6f 100644
--- a/arch/powerpc/sysdev/xics/ics-opal.c
+++ b/arch/powerpc/sysdev/xics/ics-opal.c
@@ -62,17 +62,6 @@ static void ics_opal_unmask_irq(struct irq_data *d)
 
 static unsigned int ics_opal_startup(struct irq_data *d)
 {
-#ifdef CONFIG_PCI_MSI
-   /*
-* The generic MSI code returns with the interrupt disabled on the
-* card, using the MSI mask bits. Firmware doesn't appear to unmask
-* at that level, so we do it here by hand.
-*/
-   if (irq_data_get_msi_desc(d))
-   pci_msi_unmask_irq(d);
-#endif
-
-   /* unmask it */
ics_opal_unmask_irq(d);
return 0;
 }
diff --git a/arch/powerpc/sysdev/xics/ics-rtas.c b/arch/powerpc/sysdev/xics/ics-rtas.c
index b50c6341682e..b9da317b7a2d 100644
--- a/arch/powerpc/sysdev/xics/ics-rtas.c
+++ b/arch/powerpc/sysdev/xics/ics-rtas.c
@@ -57,15 +57,6 @@ static void ics_rtas_unmask_irq(struct irq_data *d)
 
 static unsigned int ics_rtas_startup(struct irq_data *d)
 {
-#ifdef CONFIG_PCI_MSI
-   /*
-* The generic MSI code returns with the interrupt disabled on the
-* card, using the MSI mask bits. Firmware doesn't appear to unmask
-* at that level, so we do it here by hand.
-*/
-   if (irq_data_get_msi_desc(d))
-   pci_msi_unmask_irq(d);
-#endif
/* unmask it */
ics_rtas_unmask_irq(d);
return 0;
-- 
2.26.3



[PATCH 25/31] powerpc/powernv/pci: Drop unused MSI code

2021-04-30 Thread Cédric Le Goater
MSIs should be fully managed by the PCI and IRQ subsystems now.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/platforms/powernv/pci.h  |  6 --
 arch/powerpc/platforms/powernv/pci-ioda.c | 29 --
 arch/powerpc/platforms/powernv/pci.c  | 67 ---
 3 files changed, 102 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 36d22920f5a3..a075012788df 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -127,11 +127,7 @@ struct pnv_phb {
 #endif
 
unsigned intmsi_base;
-   unsigned intmsi32_support;
struct msi_bitmap   msi_bmp;
-   int (*msi_setup)(struct pnv_phb *phb, struct pci_dev *dev,
-unsigned int hwirq, unsigned int virq,
-unsigned int is_64, struct msi_msg *msg);
int (*init_m64)(struct pnv_phb *phb);
int (*get_pe_state)(struct pnv_phb *phb, int pe_no);
void (*freeze_pe)(struct pnv_phb *phb, int pe_no);
@@ -295,8 +291,6 @@ extern void pnv_npu2_map_lpar(struct pnv_ioda_pe *gpe, unsigned long msr);
 extern void pnv_pci_reset_secondary_bus(struct pci_dev *dev);
 extern int pnv_eeh_phb_reset(struct pci_controller *hose, int option);
 
-extern int pnv_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type);
-extern void pnv_teardown_msi_irqs(struct pci_dev *pdev);
 extern struct pnv_ioda_pe *pnv_pci_bdfn_to_pe(struct pnv_phb *phb, u16 bdfn);
 extern struct pnv_ioda_pe *pnv_ioda_get_pe(struct pci_dev *dev);
 extern void pnv_set_msi_irq_chip(struct pnv_phb *phb, unsigned int virq);
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 13b56de92d85..c5acd85a9144 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2224,29 +2224,6 @@ static int __pnv_pci_ioda_msi_setup(struct pnv_phb *phb, struct pci_dev *dev,
return 0;
 }
 
-static int pnv_pci_ioda_msi_setup(struct pnv_phb *phb, struct pci_dev *dev,
- unsigned int hwirq, unsigned int virq,
- unsigned int is_64, struct msi_msg *msg)
-{
-   struct pnv_ioda_pe *pe = pnv_ioda_get_pe(dev);
-   unsigned int xive_num = hwirq - phb->msi_base;
-   int rc;
-
-   rc = __pnv_pci_ioda_msi_setup(phb, dev, xive_num, is_64, msg);
-   if (rc)
-   return rc;
-
-   /* P8 only */
-   pnv_set_msi_irq_chip(phb, virq);
-
-   pr_devel("%s: %s-bit MSI on hwirq %x (xive #%d),"
-" address=%x_%08x data=%x PE# %x\n",
-pci_name(dev), is_64 ? "64" : "32", hwirq, xive_num,
-msg->address_hi, msg->address_lo, msg->data, pe->pe_number);
-
-   return 0;
-}
-
 /*
  * The msi_free() op is called before irq_domain_free_irqs_top() when
  * the handler data is still available. Use that to clear the XIVE
@@ -2471,8 +2448,6 @@ static void pnv_pci_init_ioda_msis(struct pnv_phb *phb)
return;
}
 
-   phb->msi_setup = pnv_pci_ioda_msi_setup;
-   phb->msi32_support = 1;
pr_info("  Allocated bitmap for %d MSIs (base IRQ 0x%x)\n",
count, phb->msi_base);
 
@@ -3090,8 +3065,6 @@ static const struct pci_controller_ops pnv_pci_ioda_controller_ops = {
.dma_dev_setup  = pnv_pci_ioda_dma_dev_setup,
.dma_bus_setup  = pnv_pci_ioda_dma_bus_setup,
.iommu_bypass_supported = pnv_pci_ioda_iommu_bypass_supported,
-   .setup_msi_irqs = pnv_setup_msi_irqs,
-   .teardown_msi_irqs  = pnv_teardown_msi_irqs,
.enable_device_hook = pnv_pci_enable_device_hook,
.release_device = pnv_pci_release_device,
.window_alignment   = pnv_pci_window_alignment,
@@ -3101,8 +3074,6 @@ static const struct pci_controller_ops pnv_pci_ioda_controller_ops = {
 };
 
 static const struct pci_controller_ops pnv_npu_ioda_controller_ops = {
-   .setup_msi_irqs = pnv_setup_msi_irqs,
-   .teardown_msi_irqs  = pnv_teardown_msi_irqs,
.enable_device_hook = pnv_pci_enable_device_hook,
.window_alignment   = pnv_pci_window_alignment,
.reset_secondary_bus= pnv_pci_reset_secondary_bus,
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index 9b9bca169275..397b3d7eb150 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -160,73 +160,6 @@ int pnv_pci_set_power_state(uint64_t id, uint8_t state, struct opal_msg *msg)
 }
 EXPORT_SYMBOL_GPL(pnv_pci_set_power_state);
 
-int pnv_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type)
-{
-   struct pnv_phb *phb = pci_bus_to_pnvhb(pdev->bus);
-   struct msi_desc *entry;
-   struct msi_msg msg;
-   int hwirq;
-   unsigned int virq;
-   int rc;
-
-   if (WARN_ON(!phb) || 

[PATCH 31/31] genirq: Improve "hwirq" output in /proc and /sys/

2021-04-30 Thread Cédric Le Goater
The HW IRQ numbers generated by the PCI MSI layer can be quite large
on a pSeries machine when running under the IBM Hypervisor and they
appear as negative. Use '%u' to show them correctly.
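
A tiny userspace illustration of the problem (my own example; the hwirq
value is a made-up large number of the kind the PCI-MSI layer composes):

#include <stdio.h>

int main(void)
{
	unsigned long hwirq = 0x80000000UL;	/* hypothetical large MSI HW IRQ */

	printf("%d\n", (int)hwirq);		/* -2147483648: looks negative */
	printf("%u\n", (unsigned int)hwirq);	/* 2147483648: as intended */
	return 0;
}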

Cc: Thomas Gleixner 
Signed-off-by: Cédric Le Goater 
---
 kernel/irq/irqdesc.c | 2 +-
 kernel/irq/proc.c| 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index cc1a09406c6e..85054eb2ae51 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -188,7 +188,7 @@ static ssize_t hwirq_show(struct kobject *kobj,
 
raw_spin_lock_irq(&desc->lock);
if (desc->irq_data.domain)
-   ret = sprintf(buf, "%d\n", (int)desc->irq_data.hwirq);
+   ret = sprintf(buf, "%u\n", (int)desc->irq_data.hwirq);
raw_spin_unlock_irq(&desc->lock);
 
return ret;
diff --git a/kernel/irq/proc.c b/kernel/irq/proc.c
index 98138788cb04..e2392f05da04 100644
--- a/kernel/irq/proc.c
+++ b/kernel/irq/proc.c
@@ -513,7 +513,7 @@ int show_interrupts(struct seq_file *p, void *v)
seq_printf(p, " %8s", "None");
}
if (desc->irq_data.domain)
-   seq_printf(p, " %*d", prec, (int) desc->irq_data.hwirq);
+   seq_printf(p, " %*u", prec, (int)desc->irq_data.hwirq);
else
seq_printf(p, " %*s", prec, "");
 #ifdef CONFIG_GENERIC_IRQ_SHOW_LEVEL
-- 
2.26.3



[PATCH 08/31] powerpc/pseries/pci: Add a domain_free_irqs handler

2021-04-30 Thread Cédric Le Goater
The RTAS firmware cannot disable one MSI at a time. It's all or
nothing. We need a custom free IRQ handler for that.

Cc: Thomas Gleixner 
Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/platforms/pseries/msi.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/msi.c b/arch/powerpc/platforms/pseries/msi.c
index a9bd1e991df5..a41c448520d4 100644
--- a/arch/powerpc/platforms/pseries/msi.c
+++ b/arch/powerpc/platforms/pseries/msi.c
@@ -529,8 +529,24 @@ static int pseries_msi_ops_prepare(struct irq_domain *domain, struct device *dev
return rtas_prepare_msi_irqs(pdev, nvec, type, arg);
 }
 
+/*
+ * RTAS can not disable one MSI at a time. It's all or nothing. Do it
+ * at the end after all IRQs have been freed.
+ */
+static void pseries_msi_domain_free_irqs(struct irq_domain *domain,
+struct device *dev)
+{
+   if (WARN_ON_ONCE(!dev_is_pci(dev)))
+   return;
+
+   __msi_domain_free_irqs(domain, dev);
+
+   rtas_disable_msi(to_pci_dev(dev));
+}
+
 static struct msi_domain_ops pseries_pci_msi_domain_ops = {
.msi_prepare= pseries_msi_ops_prepare,
+   .domain_free_irqs = pseries_msi_domain_free_irqs,
 };
 
 static void pseries_msi_shutdown(struct irq_data *d)
-- 
2.26.3



[PATCH 09/31] powerpc/pseries/pci: Add a msi_free() handler to clear XIVE data

2021-04-30 Thread Cédric Le Goater
The MSI domain clears the IRQ with msi_domain_free(), which calls
irq_domain_free_irqs_top(), which clears the handler data. This is a
problem for the XIVE controller since we need to unmap MMIO pages and
free a specific XIVE structure.

The 'msi_free()' handler is called before irq_domain_free_irqs_top()
when the handler data is still available. Use that to clear the XIVE
controller data.

Cc: Thomas Gleixner 
Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/include/asm/xive.h  |  1 +
 arch/powerpc/platforms/pseries/msi.c | 16 +++-
 arch/powerpc/sysdev/xive/common.c|  5 -
 3 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h
index aa094a8655b0..20ae50ab083c 100644
--- a/arch/powerpc/include/asm/xive.h
+++ b/arch/powerpc/include/asm/xive.h
@@ -111,6 +111,7 @@ void xive_native_free_vp_block(u32 vp_base);
 int xive_native_populate_irq_data(u32 hw_irq,
  struct xive_irq_data *data);
 void xive_cleanup_irq_data(struct xive_irq_data *xd);
+void xive_irq_free_data(unsigned int virq);
 void xive_native_free_irq(u32 irq);
 int xive_native_configure_irq(u32 hw_irq, u32 target, u8 prio, u32 sw_irq);
 
diff --git a/arch/powerpc/platforms/pseries/msi.c b/arch/powerpc/platforms/pseries/msi.c
index a41c448520d4..da9d63a088bb 100644
--- a/arch/powerpc/platforms/pseries/msi.c
+++ b/arch/powerpc/platforms/pseries/msi.c
@@ -529,6 +529,19 @@ static int pseries_msi_ops_prepare(struct irq_domain *domain, struct device *dev
return rtas_prepare_msi_irqs(pdev, nvec, type, arg);
 }
 
+/*
+ * ->msi_free() is called before irq_domain_free_irqs_top() when the
+ * handler data is still available. Use that to clear the XIVE
+ * controller data.
+ */
+static void pseries_msi_ops_msi_free(struct irq_domain *domain,
+struct msi_domain_info *info,
+unsigned int irq)
+{
+   if (xive_enabled())
+   xive_irq_free_data(irq);
+}
+
 /*
  * RTAS can not disable one MSI at a time. It's all or nothing. Do it
  * at the end after all IRQs have been freed.
@@ -546,6 +559,7 @@ static void pseries_msi_domain_free_irqs(struct irq_domain *domain,
 
 static struct msi_domain_ops pseries_pci_msi_domain_ops = {
.msi_prepare= pseries_msi_ops_prepare,
+   .msi_free   = pseries_msi_ops_msi_free,
.domain_free_irqs = pseries_msi_domain_free_irqs,
 };
 
@@ -660,7 +674,7 @@ static void pseries_irq_domain_free(struct irq_domain *domain, unsigned int virq
 
pr_debug("%s bridge %pOF %d #%d\n", __func__, phb->dn, virq, nr_irqs);
 
-   irq_domain_free_irqs_parent(domain, virq, nr_irqs);
+   /* XIVE domain data is cleared through ->msi_free() */
 }
 
 static const struct irq_domain_ops pseries_irq_domain_ops = {
diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
index 3485baf9ec8c..191cd80ec534 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -980,6 +980,8 @@ EXPORT_SYMBOL_GPL(is_xive_irq);
 
 void xive_cleanup_irq_data(struct xive_irq_data *xd)
 {
+   pr_debug("%s for HW %x\n", __func__, xd->hw_irq);
+
if (xd->eoi_mmio) {
unmap_kernel_range((unsigned long)xd->eoi_mmio,
   1u << xd->esb_shift);
@@ -1025,7 +1027,7 @@ static int xive_irq_alloc_data(unsigned int virq, irq_hw_number_t hw)
return 0;
 }
 
-static void xive_irq_free_data(unsigned int virq)
+void xive_irq_free_data(unsigned int virq)
 {
struct xive_irq_data *xd = irq_get_handler_data(virq);
 
@@ -1035,6 +1037,7 @@ static void xive_irq_free_data(unsigned int virq)
xive_cleanup_irq_data(xd);
kfree(xd);
 }
+EXPORT_SYMBOL_GPL(xive_irq_free_data);
 
 #ifdef CONFIG_SMP
 
-- 
2.26.3



[PATCH 15/31] KVM: PPC: Book3S HV: XIVE: Fix mapping of passthrough interrupts

2021-04-30 Thread Cédric Le Goater
PCI MSI interrupt numbers are now mapped in a PCI-MSI domain but the
underlying calls handling the passthrough of the interrupt in the
guest need a number in the XIVE IRQ domain.

Use the IRQ data mapped in the XIVE IRQ domain and not the one in the
PCI-MSI domain.

Exporting irq_get_default_host() might not be the best solution.

Cc: Thomas Gleixner 
Cc: Paul Mackerras 
Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/kvm/book3s_xive.c | 3 ++-
 kernel/irq/irqdomain.c | 1 +
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
index 3a7da42bed57..81b9f4fc3978 100644
--- a/arch/powerpc/kvm/book3s_xive.c
+++ b/arch/powerpc/kvm/book3s_xive.c
@@ -861,7 +861,8 @@ int kvmppc_xive_set_mapped(struct kvm *kvm, unsigned long guest_irq,
struct kvmppc_xive *xive = kvm->arch.xive;
struct kvmppc_xive_src_block *sb;
struct kvmppc_xive_irq_state *state;
-   struct irq_data *host_data = irq_get_irq_data(host_irq);
+   struct irq_data *host_data =
+   irq_domain_get_irq_data(irq_get_default_host(), host_irq);
unsigned int hw_irq = (unsigned int)irqd_to_hwirq(host_data);
u16 idx;
u8 prio;
diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c
index d10ab1d689d5..8a073d1ce611 100644
--- a/kernel/irq/irqdomain.c
+++ b/kernel/irq/irqdomain.c
@@ -481,6 +481,7 @@ struct irq_domain *irq_get_default_host(void)
 {
return irq_default_domain;
 }
+EXPORT_SYMBOL_GPL(irq_get_default_host);
 
 static void irq_domain_clear_mapping(struct irq_domain *domain,
 irq_hw_number_t hwirq)
-- 
2.26.3



[PATCH 05/31] powerpc/pseries/pci: Add MSI domains

2021-04-30 Thread Cédric Le Goater
Two IRQ domains are added on top of the default machine IRQ domain.

First, the top level "PCI-MSI" domain deals with the MSI specificities.
In this domain, the HW IRQ numbers are generated by the PCI MSI layer,
they compose a unique ID for an MSI source with the PCI device
identifier and the MSI vector number.

These numbers can be quite large on a pSeries machine running under
the IBM Hypervisor and /sys/kernel/irq/ and /proc/interrupts will
require small fixes to show them correctly.

Then, the in-the-middle "MSI" domain acts as a proxy between the PCI
MSI subsystem and the machine IRQ subsystem. It usually handles the
MSI allocator but on pSeries machines, this is done by the RTAS
FW. RTAS returns IRQ numbers in the IRQ number space of the machine.
This is why this in-the-middle "Pseries-MSI" domain has the same HW
IRQ numbers as its parent domain.
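
As a picture, the resulting hierarchy for a pSeries PCI device looks
roughly like this (my own summary of the above):

  "PCI-MSI" domain       HW IRQs composed from PCI device ID + MSI vector
         |
  "Pseries-MSI" domain   proxy; HW IRQs identical to its parent's,
         |               since the RTAS FW does the MSI allocation
  XIVE machine domain    HW IRQs in the machine IRQ number space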

Only the XIVE (P9/P10) parent domain is supported for now. We still
need to add support for IRQ domain hierarchy under XICS.

Cc: Thomas Gleixner 
Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/include/asm/pci-bridge.h|   5 +
 arch/powerpc/platforms/pseries/pseries.h |   1 +
 arch/powerpc/kernel/pci-common.c |   6 +
 arch/powerpc/platforms/pseries/msi.c | 185 +++
 arch/powerpc/platforms/pseries/setup.c   |   2 +
 5 files changed, 199 insertions(+)

diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
index d2a2a14e56f9..fb35d340a739 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -127,6 +127,11 @@ struct pci_controller {
 
void *private_data;
struct npu *npu;
+
+   /* IRQ domain hierarchy */
+   struct irq_domain   *dev_domain;
+   struct irq_domain   *msi_domain;
+   struct fwnode_handle*fwnode;
 };
 
 /* These are used for config access before all the PCI probing
diff --git a/arch/powerpc/platforms/pseries/pseries.h b/arch/powerpc/platforms/pseries/pseries.h
index 4fe48c04c6c2..91cf2afcf423 100644
--- a/arch/powerpc/platforms/pseries/pseries.h
+++ b/arch/powerpc/platforms/pseries/pseries.h
@@ -87,6 +87,7 @@ struct pci_host_bridge;
 int pseries_root_bridge_prepare(struct pci_host_bridge *bridge);
 
 extern struct pci_controller_ops pseries_pci_controller_ops;
+int pseries_msi_allocate_domains(struct pci_controller *phb);
 
 unsigned long pseries_memory_block_size(void);
 
diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index 001e90cd8948..c3573430919d 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -1060,11 +1061,16 @@ void pcibios_bus_add_device(struct pci_dev *dev)
 
 int pcibios_add_device(struct pci_dev *dev)
 {
+   struct irq_domain *d;
+
 #ifdef CONFIG_PCI_IOV
if (ppc_md.pcibios_fixup_sriov)
ppc_md.pcibios_fixup_sriov(dev);
 #endif /* CONFIG_PCI_IOV */
 
+   d = dev_get_msi_domain(&dev->bus->dev);
+   if (d)
+   dev_set_msi_domain(&dev->dev, d);
return 0;
 }
 
diff --git a/arch/powerpc/platforms/pseries/msi.c b/arch/powerpc/platforms/pseries/msi.c
index 4bf14f27e1aa..a9bd1e991df5 100644
--- a/arch/powerpc/platforms/pseries/msi.c
+++ b/arch/powerpc/platforms/pseries/msi.c
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "pseries.h"
 
@@ -518,6 +519,190 @@ static int rtas_setup_msi_irqs(struct pci_dev *pdev, int nvec_in, int type)
return 0;
 }
 
+static int pseries_msi_ops_prepare(struct irq_domain *domain, struct device *dev,
+  int nvec, msi_alloc_info_t *arg)
+{
+   struct pci_dev *pdev = to_pci_dev(dev);
+   struct msi_desc *desc = first_pci_msi_entry(pdev);
+   int type = desc->msi_attrib.is_msix ? PCI_CAP_ID_MSIX : PCI_CAP_ID_MSI;
+
+   return rtas_prepare_msi_irqs(pdev, nvec, type, arg);
+}
+
+static struct msi_domain_ops pseries_pci_msi_domain_ops = {
+   .msi_prepare= pseries_msi_ops_prepare,
+};
+
+static void pseries_msi_shutdown(struct irq_data *d)
+{
+   d = d->parent_data;
+   if (d->chip->irq_shutdown)
+   d->chip->irq_shutdown(d);
+}
+
+static void pseries_msi_mask(struct irq_data *d)
+{
+   pci_msi_mask_irq(d);
+   irq_chip_mask_parent(d);
+}
+
+static void pseries_msi_unmask(struct irq_data *d)
+{
+   pci_msi_unmask_irq(d);
+   irq_chip_unmask_parent(d);
+}
+
+static struct irq_chip pseries_pci_msi_irq_chip = {
+   .name   = "Pseries-PCI-MSI",
+   .irq_shutdown   = pseries_msi_shutdown,
+   .irq_mask   = pseries_msi_mask,
+   .irq_unmask = pseries_msi_unmask,
+   .irq_eoi= irq_chip_eoi_parent,
+};
+
+static struct msi_domain_info pseries_msi_domain_info = {
+   .flags = (MSI_FLAG_USE_DEF_DOM_OPS | MSI_FLAG_USE_DEF_CHIP_OPS |
+ MSI_FLAG_MULTI_PCI_MSI  | MSI_FLAG_PCI_MSIX),
+ 

[PATCH 27/31] powerpc/xics: Fix IRQ migration

2021-04-30 Thread Cédric Le Goater
desc->irq_data points to the top-level IRQ data descriptor, which is
not necessarily in the XICS IRQ domain. MSIs, for instance, live in
another domain. Fix that by looking for a mapping on the low-level
XICS IRQ domain.

TODO: Why not use irq_migrate_all_off_this_cpu() instead?

Cc: Thomas Gleixner 
Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/sysdev/xics/xics-common.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/sysdev/xics/xics-common.c 
b/arch/powerpc/sysdev/xics/xics-common.c
index 05d21005dc79..2a3ad7f5c331 100644
--- a/arch/powerpc/sysdev/xics/xics-common.c
+++ b/arch/powerpc/sysdev/xics/xics-common.c
@@ -183,6 +183,8 @@ void xics_migrate_irqs_away(void)
unsigned int irq, virq;
struct irq_desc *desc;
 
+   pr_debug("%s: CPU %u\n", __func__, cpu);
+
/* If we used to be the default server, move to the new "boot_cpuid" */
if (hw_cpu == xics_default_server)
xics_update_irq_servers();
@@ -197,6 +199,7 @@ void xics_migrate_irqs_away(void)
struct irq_chip *chip;
long server;
unsigned long flags;
+   struct irq_data *irqd;
 
/* We can't set affinity on ISA interrupts */
if (virq < NUM_ISA_INTERRUPTS)
@@ -204,9 +207,11 @@ void xics_migrate_irqs_away(void)
/* We only need to migrate enabled IRQS */
if (!desc->action)
continue;
-   if (desc->irq_data.domain != xics_host)
+   /* We need a mapping in the XICS IRQ domain */
+   irqd = irq_domain_get_irq_data(xics_host, virq);
+   if (!irqd)
continue;
-   irq = desc->irq_data.hwirq;
+   irq = irqd_to_hwirq(irqd);
/* We need to get IPIs still. */
if (irq == XICS_IPI || irq == XICS_IRQ_SPURIOUS)
continue;
-- 
2.26.3



[PATCH 21/31] powerpc/powernv/pci: Customize the MSI EOI handler to support PHB3

2021-04-30 Thread Cédric Le Goater
PHB3s need an extra OPAL call to EOI the interrupt. The call takes an
OPAL HW IRQ number, but it is translated into a vector number in OPAL.
Here, we directly use the vector number of the in-the-middle "MSI"
domain instead of grabbing the OPAL HW IRQ number from the XICS parent
domain.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 23 ++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 7b75af17dc59..7035be271c34 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2313,12 +2313,33 @@ static void pnv_msi_compose_msg(struct irq_data *d, 
struct msi_msg *msg)
entry->msi_attrib.is_64 ? "64" : "32", d->hwirq, rc);
 }
 
+/*
+ * The IRQ data is mapped in the MSI domain in which HW IRQ numbers
+ * correspond to vector numbers.
+ */
+static void pnv_msi_eoi(struct irq_data *d)
+{
+   struct pci_controller *hose = irq_data_get_irq_chip_data(d);
+   struct pnv_phb *phb = hose->private_data;
+
+   if (phb->model == PNV_PHB_MODEL_PHB3) {
+   /*
+* The EOI OPAL call takes an OPAL HW IRQ number but
+* since it is translated into a vector number in
+* OPAL, use that directly.
+*/
+   WARN_ON_ONCE(opal_pci_msi_eoi(phb->opal_id, d->hwirq));
+   }
+
+   irq_chip_eoi_parent(d);
+}
+
 static struct irq_chip pnv_msi_irq_chip = {
.name   = "PNV-MSI",
.irq_shutdown   = pnv_msi_shutdown,
.irq_mask   = irq_chip_mask_parent,
.irq_unmask = irq_chip_unmask_parent,
-   .irq_eoi= irq_chip_eoi_parent,
+   .irq_eoi= pnv_msi_eoi,
.irq_set_affinity   = irq_chip_set_affinity_parent,
.irq_compose_msi_msg= pnv_msi_compose_msg,
 };
-- 
2.26.3



[PATCH 30/31] KVM: PPC: Book3S HV: XICS: Fix mapping of passthrough interrupts

2021-04-30 Thread Cédric Le Goater
PCI MSIs now live in an MSI domain but the underlying calls, which
will EOI the interrupt in real mode, need an HW IRQ number mapped in
the XICS IRQ domain. Grab it there.

Cc: Paul Mackerras 
Cc: Alexey Kardashevskiy 
Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/kvm/book3s_hv.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 9f4eb74a11cc..6058bcc5b61e 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -5126,6 +5126,7 @@ static int kvmppc_set_passthru_irq(struct kvm *kvm, int 
host_irq, int guest_gsi)
struct kvmppc_passthru_irqmap *pimap;
struct irq_chip *chip;
int i, rc = 0;
+   struct irq_data *host_data;
 
if (!kvm_irq_bypass)
return 1;
@@ -5190,7 +5191,14 @@ static int kvmppc_set_passthru_irq(struct kvm *kvm, int 
host_irq, int guest_gsi)
 * the KVM real mode handler.
 */
smp_wmb();
-   irq_map->r_hwirq = desc->irq_data.hwirq;
+
+   /*
+* The 'host_irq' number is mapped in the PCI-MSI domain but
+* the underlying calls, which will EOI the interrupt in real
+* mode, need an HW IRQ number mapped in the XICS IRQ domain.
+*/
+   host_data = irq_domain_get_irq_data(irq_get_default_host(), host_irq);
+   irq_map->r_hwirq = (unsigned int)irqd_to_hwirq(host_data);
 
if (i == pimap->n_mapped)
pimap->n_mapped++;
@@ -5198,7 +5206,7 @@ static int kvmppc_set_passthru_irq(struct kvm *kvm, int 
host_irq, int guest_gsi)
if (xics_on_xive())
rc = kvmppc_xive_set_mapped(kvm, guest_gsi, host_irq);
else
-   kvmppc_xics_set_mapped(kvm, guest_gsi, desc->irq_data.hwirq);
+   kvmppc_xics_set_mapped(kvm, guest_gsi, irq_map->r_hwirq);
if (rc)
irq_map->r_hwirq = 0;
 
-- 
2.26.3



[PATCH 17/31] powerpc/xics: Rename the map handler into a check handler

2021-04-30 Thread Cédric Le Goater
This moves the IRQ initialization done under the OPAL and RTAS backends
into the common part of XICS. The 'map' handler becomes a simple 'check'
of the HW IRQ at the FW level.

As we don't need an ICS anymore in xics_migrate_irqs_away(), the XICS
domain no longer sets chip data for the IRQ.
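
With this, the common xics_host_map() can install the irq_chip provided
by the ICS, along the lines of (sketch):

	if (xics_ics->check(xics_ics, hwirq))
		return -EINVAL;

	/* The ICS now provides the irq_chip to install */
	irq_set_chip_and_handler(virq, xics_ics->chip, handle_fasteoi_irq);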

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/include/asm/xics.h        |  3 ++-
 arch/powerpc/sysdev/xics/ics-opal.c    | 27 +
 arch/powerpc/sysdev/xics/ics-rtas.c    | 28 +-
 arch/powerpc/sysdev/xics/xics-common.c | 15 --
 4 files changed, 31 insertions(+), 42 deletions(-)

diff --git a/arch/powerpc/include/asm/xics.h b/arch/powerpc/include/asm/xics.h
index 8e903b3f9c24..01b51a926f56 100644
--- a/arch/powerpc/include/asm/xics.h
+++ b/arch/powerpc/include/asm/xics.h
@@ -85,10 +85,11 @@ static inline int ics_opal_init(void) { return -ENODEV; }
 /* ICS instance, hooked up to chip_data of an irq */
 struct ics {
struct list_head link;
-   int (*map)(struct ics *ics, unsigned int virq);
+   int (*check)(struct ics *ics, unsigned int hwirq);
void (*mask_unknown)(struct ics *ics, unsigned long vec);
long (*get_server)(struct ics *ics, unsigned long vec);
int (*host_match)(struct ics *ics, struct device_node *node);
+   struct irq_chip *chip;
char data[];
 };
 
diff --git a/arch/powerpc/sysdev/xics/ics-opal.c 
b/arch/powerpc/sysdev/xics/ics-opal.c
index 823f6c9664cd..8c7ddcc718b6 100644
--- a/arch/powerpc/sysdev/xics/ics-opal.c
+++ b/arch/powerpc/sysdev/xics/ics-opal.c
@@ -157,26 +157,13 @@ static struct irq_chip ics_opal_irq_chip = {
.irq_retrigger = xics_retrigger,
 };
 
-static int ics_opal_map(struct ics *ics, unsigned int virq);
-static void ics_opal_mask_unknown(struct ics *ics, unsigned long vec);
-static long ics_opal_get_server(struct ics *ics, unsigned long vec);
-
 static int ics_opal_host_match(struct ics *ics, struct device_node *node)
 {
return 1;
 }
 
-/* Only one global & state struct ics */
-static struct ics ics_hal = {
-   .map= ics_opal_map,
-   .mask_unknown   = ics_opal_mask_unknown,
-   .get_server = ics_opal_get_server,
-   .host_match = ics_opal_host_match,
-};
-
-static int ics_opal_map(struct ics *ics, unsigned int virq)
+static int ics_opal_check(struct ics *ics, unsigned int hw_irq)
 {
-   unsigned int hw_irq = (unsigned int)virq_to_hw(virq);
int64_t rc;
__be16 server;
int8_t priority;
@@ -189,9 +176,6 @@ static int ics_opal_map(struct ics *ics, unsigned int virq)
if (rc != OPAL_SUCCESS)
return -ENXIO;
 
-   irq_set_chip_and_handler(virq, &ics_opal_irq_chip, handle_fasteoi_irq);
-   irq_set_chip_data(virq, &ics_hal);
-
return 0;
 }
 
@@ -222,6 +206,15 @@ static long ics_opal_get_server(struct ics *ics, unsigned 
long vec)
return ics_opal_unmangle_server(be16_to_cpu(server));
 }
 
+/* Only one global & state struct ics */
+static struct ics ics_hal = {
+   .check  = ics_opal_check,
+   .mask_unknown   = ics_opal_mask_unknown,
+   .get_server = ics_opal_get_server,
+   .host_match = ics_opal_host_match,
+   .chip   = &ics_opal_irq_chip,
+};
+
 int __init ics_opal_init(void)
 {
if (!firmware_has_feature(FW_FEATURE_OPAL))
diff --git a/arch/powerpc/sysdev/xics/ics-rtas.c 
b/arch/powerpc/sysdev/xics/ics-rtas.c
index 4cf18000f07c..6d19d711ed35 100644
--- a/arch/powerpc/sysdev/xics/ics-rtas.c
+++ b/arch/powerpc/sysdev/xics/ics-rtas.c
@@ -24,19 +24,6 @@ static int ibm_set_xive;
 static int ibm_int_on;
 static int ibm_int_off;
 
-static int ics_rtas_map(struct ics *ics, unsigned int virq);
-static void ics_rtas_mask_unknown(struct ics *ics, unsigned long vec);
-static long ics_rtas_get_server(struct ics *ics, unsigned long vec);
-static int ics_rtas_host_match(struct ics *ics, struct device_node *node);
-
-/* Only one global & state struct ics */
-static struct ics ics_rtas = {
-   .map= ics_rtas_map,
-   .mask_unknown   = ics_rtas_mask_unknown,
-   .get_server = ics_rtas_get_server,
-   .host_match = ics_rtas_host_match,
-};
-
 static void ics_rtas_unmask_irq(struct irq_data *d)
 {
unsigned int hw_irq = (unsigned int)irqd_to_hwirq(d);
@@ -169,9 +156,8 @@ static struct irq_chip ics_rtas_irq_chip = {
.irq_retrigger = xics_retrigger,
 };
 
-static int ics_rtas_map(struct ics *ics, unsigned int virq)
+static int ics_rtas_check(struct ics *ics, unsigned int hw_irq)
 {
-   unsigned int hw_irq = (unsigned int)virq_to_hw(virq);
int status[2];
int rc;
 
@@ -183,9 +169,6 @@ static int ics_rtas_map(struct ics *ics, unsigned int virq)
if (rc)
return -ENXIO;
 
-   irq_set_chip_and_handler(virq, &ics_rtas_irq_chip, handle_fasteoi_irq);
-   irq_set_chip_data(virq, &ics_rtas);
-
return 0;
 }
 
@@ -213,6 +196,15 @@ static int 

[PATCH 26/31] powerpc/powernv/pci: Adapt is_pnv_opal_msi() to detect passthrough interrupt

2021-04-30 Thread Cédric Le Goater
The pnv_ioda2_msi_eoi chip handler is no longer used for MSIs.
Simply check against the PNV-PCI-MSI chip instead.

Cc: Alexey Kardashevskiy 
Cc: Paul Mackerras 
Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index c5acd85a9144..c1598ab730c3 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2159,7 +2159,7 @@ static struct irq_chip pnv_pci_msi_irq_chip;
  */
 bool is_pnv_opal_msi(struct irq_chip *chip)
 {
-   return chip->irq_eoi == pnv_ioda2_msi_eoi || chip == &pnv_pci_msi_irq_chip;
+   return chip == &pnv_pci_msi_irq_chip;
 }
 EXPORT_SYMBOL_GPL(is_pnv_opal_msi);
 
-- 
2.26.3



[PATCH 24/31] powerpc/pseries/pci: Drop unused MSI code

2021-04-30 Thread Cédric Le Goater
MSIs should be fully managed by the PCI and IRQ subsystems now.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/platforms/pseries/msi.c | 87 
 1 file changed, 87 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/msi.c 
b/arch/powerpc/platforms/pseries/msi.c
index 1886cb5ca4df..7ddce65edb88 100644
--- a/arch/powerpc/platforms/pseries/msi.c
+++ b/arch/powerpc/platforms/pseries/msi.c
@@ -111,21 +111,6 @@ static int rtas_query_irq_number(struct pci_dn *pdn, int 
offset)
return rtas_ret[0];
 }
 
-static void rtas_teardown_msi_irqs(struct pci_dev *pdev)
-{
-   struct msi_desc *entry;
-
-   for_each_pci_msi_entry(entry, pdev) {
-   if (!entry->irq)
-   continue;
-
-   irq_set_msi_desc(entry->irq, NULL);
-   irq_dispose_mapping(entry->irq);
-   }
-
-   rtas_disable_msi(pdev);
-}
-
 static int check_req(struct pci_dev *pdev, int nvec, char *prop_name)
 {
struct device_node *dn;
@@ -459,66 +444,6 @@ static int rtas_prepare_msi_irqs(struct pci_dev *pdev, int 
nvec_in, int type,
return 0;
 }
 
-static int rtas_setup_msi_irqs(struct pci_dev *pdev, int nvec_in, int type)
-{
-   struct pci_dn *pdn;
-   int hwirq, virq, i;
-   int rc;
-   struct msi_desc *entry;
-   struct msi_msg msg;
-
-   rc = rtas_prepare_msi_irqs(pdev, nvec_in, type, NULL);
-   if (rc)
-   return rc;
-
-   pdn = pci_get_pdn(pdev);
-   i = 0;
-   for_each_pci_msi_entry(entry, pdev) {
-   hwirq = rtas_query_irq_number(pdn, i++);
-   if (hwirq < 0) {
-   pr_debug("rtas_msi: error (%d) getting hwirq\n", rc);
-   return hwirq;
-   }
-
-   /*
-* Depending on the number of online CPUs in the original
-* kernel, it is likely for CPU #0 to be offline in a kdump
-* kernel. The associated IRQs in the affinity mappings
-* provided by irq_create_affinity_masks() are thus not
-* started by irq_startup(), as per-design for managed IRQs.
-* This can be a problem with multi-queue block devices driven
-* by blk-mq : such a non-started IRQ is very likely paired
-* with the single queue enforced by blk-mq during kdump (see
-* blk_mq_alloc_tag_set()). This causes the device to remain
-* silent and likely hangs the guest at some point.
-*
-* We don't really care for fine-grained affinity when doing
-* kdump actually : simply ignore the pre-computed affinity
-* masks in this case and let the default mask with all CPUs
-* be used when creating the IRQ mappings.
-*/
-   if (is_kdump_kernel())
-   virq = irq_create_mapping(NULL, hwirq);
-   else
-   virq = irq_create_mapping_affinity(NULL, hwirq,
-  entry->affinity);
-
-   if (!virq) {
-   pr_debug("rtas_msi: Failed mapping hwirq %d\n", hwirq);
-   return -ENOSPC;
-   }
-
-   dev_dbg(>dev, "rtas_msi: allocated virq %d\n", virq);
-   irq_set_msi_desc(virq, entry);
-
-   /* Read config space back so we can restore after reset */
-   __pci_read_msi_msg(entry, &msg);
-   entry->msg = msg;
-   }
-
-   return 0;
-}
-
 static int pseries_msi_ops_prepare(struct irq_domain *domain, struct device 
*dev,
   int nvec, msi_alloc_info_t *arg)
 {
@@ -759,8 +684,6 @@ static void rtas_msi_pci_irq_fixup(struct pci_dev *pdev)
 
 static int rtas_msi_init(void)
 {
-   struct pci_controller *phb;
-
query_token  = rtas_token("ibm,query-interrupt-source-number");
change_token = rtas_token("ibm,change-msi");
 
@@ -772,16 +695,6 @@ static int rtas_msi_init(void)
 
pr_debug("rtas_msi: Registering RTAS MSI callbacks.\n");
 
-   WARN_ON(pseries_pci_controller_ops.setup_msi_irqs);
-   pseries_pci_controller_ops.setup_msi_irqs = rtas_setup_msi_irqs;
-   pseries_pci_controller_ops.teardown_msi_irqs = rtas_teardown_msi_irqs;
-
-   list_for_each_entry(phb, &hose_list, list_node) {
-   WARN_ON(phb->controller_ops.setup_msi_irqs);
-   phb->controller_ops.setup_msi_irqs = rtas_setup_msi_irqs;
-   phb->controller_ops.teardown_msi_irqs = rtas_teardown_msi_irqs;
-   }
-
WARN_ON(ppc_md.pci_irq_fixup);
ppc_md.pci_irq_fixup = rtas_msi_pci_irq_fixup;
 
-- 
2.26.3



[PATCH 10/31] powerpc/pseries/pci: Add support of MSI domains to PHB hotplug

2021-04-30 Thread Cédric Le Goater
Simply allocate or release the MSI domains when a PHB is inserted in
or removed from the machine.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/platforms/pseries/pseries.h   |  1 +
 arch/powerpc/platforms/pseries/msi.c       | 10 ++
 arch/powerpc/platforms/pseries/pci_dlpar.c |  4 
 3 files changed, 15 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/pseries.h 
b/arch/powerpc/platforms/pseries/pseries.h
index 91cf2afcf423..57bf4c2091e1 100644
--- a/arch/powerpc/platforms/pseries/pseries.h
+++ b/arch/powerpc/platforms/pseries/pseries.h
@@ -88,6 +88,7 @@ int pseries_root_bridge_prepare(struct pci_host_bridge 
*bridge);
 
 extern struct pci_controller_ops pseries_pci_controller_ops;
 int pseries_msi_allocate_domains(struct pci_controller *phb);
+void pseries_msi_free_domains(struct pci_controller *phb);
 
 unsigned long pseries_memory_block_size(void);
 
diff --git a/arch/powerpc/platforms/pseries/msi.c 
b/arch/powerpc/platforms/pseries/msi.c
index da9d63a088bb..d1470941cadf 100644
--- a/arch/powerpc/platforms/pseries/msi.c
+++ b/arch/powerpc/platforms/pseries/msi.c
@@ -733,6 +733,16 @@ int pseries_msi_allocate_domains(struct pci_controller 
*phb)
return __pseries_msi_allocate_domains(phb, count);
 }
 
+void pseries_msi_free_domains(struct pci_controller *phb)
+{
+   if (phb->msi_domain)
+   irq_domain_remove(phb->msi_domain);
+   if (phb->dev_domain)
+   irq_domain_remove(phb->dev_domain);
+   if (phb->fwnode)
+   irq_domain_free_fwnode(phb->fwnode);
+}
+
 static void rtas_msi_pci_irq_fixup(struct pci_dev *pdev)
 {
/* No LSI -> leave MSIs (if any) configured */
diff --git a/arch/powerpc/platforms/pseries/pci_dlpar.c 
b/arch/powerpc/platforms/pseries/pci_dlpar.c
index f9ae17e8a0f4..cf8a2e7a0f2c 100644
--- a/arch/powerpc/platforms/pseries/pci_dlpar.c
+++ b/arch/powerpc/platforms/pseries/pci_dlpar.c
@@ -33,6 +33,8 @@ struct pci_controller *init_phb_dynamic(struct device_node 
*dn)
 
pci_devs_phb_init_dynamic(phb);
 
+   pseries_msi_allocate_domains(phb);
+
/* Create EEH devices for the PHB */
eeh_phb_pe_create(phb);
 
@@ -73,6 +75,8 @@ int remove_phb_dynamic(struct pci_controller *phb)
}
}
 
+   pseries_msi_free_domains(phb);
+
/* Remove the PCI bus and unregister the bridge device from sysfs */
phb->bus = NULL;
pci_remove_bus(b);
-- 
2.26.3



[PATCH 18/31] powerpc/xics: Give a name to the default XICS IRQ domain

2021-04-30 Thread Cédric Le Goater
and clean up the error path.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/sysdev/xics/xics-common.c | 21 +
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/sysdev/xics/xics-common.c 
b/arch/powerpc/sysdev/xics/xics-common.c
index 2fa45cd12a82..981587c7 100644
--- a/arch/powerpc/sysdev/xics/xics-common.c
+++ b/arch/powerpc/sysdev/xics/xics-common.c
@@ -412,11 +412,22 @@ static const struct irq_domain_ops xics_host_ops = {
.xlate = xics_host_xlate,
 };
 
-static void __init xics_init_host(void)
+static int __init xics_allocate_domain(void)
 {
-   xics_host = irq_domain_add_tree(NULL, &xics_host_ops, NULL);
-   BUG_ON(xics_host == NULL);
+   struct fwnode_handle *fn;
+
+   fn = irq_domain_alloc_named_fwnode("XICS");
+   if (!fn)
+   return -ENOMEM;
+
+   xics_host = irq_domain_create_tree(fn, &xics_host_ops, NULL);
+   if (!xics_host) {
+   irq_domain_free_fwnode(fn);
+   return -ENOMEM;
+   }
+
irq_set_default_host(xics_host);
+   return 0;
 }
 
 void __init xics_register_ics(struct ics *ics)
@@ -478,6 +489,8 @@ void __init xics_init(void)
/* Initialize common bits */
xics_get_server_size();
xics_update_irq_servers();
-   xics_init_host();
+   rc = xics_allocate_domain();
+   if (rc < 0)
+   pr_err("XICS: Failed to create IRQ domain");
xics_setup_cpu();
 }
-- 
2.26.3



[PATCH 02/31] powerpc/pseries/pci: Introduce rtas_prepare_msi_irqs()

2021-04-30 Thread Cédric Le Goater
This splits the routine setting up the MSIs into two parts: allocation
of the MSIs for the PCI device at the FW level (RTAS) and the actual
mapping and activation of the IRQs.

rtas_prepare_msi_irqs() will serve as a handler for the MSI domain.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/platforms/pseries/msi.c | 23 +++
 1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/msi.c 
b/arch/powerpc/platforms/pseries/msi.c
index d2d090e04745..4bf14f27e1aa 100644
--- a/arch/powerpc/platforms/pseries/msi.c
+++ b/arch/powerpc/platforms/pseries/msi.c
@@ -373,12 +373,11 @@ static void rtas_hack_32bit_msi_gen2(struct pci_dev *pdev)
pci_write_config_dword(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_HI, 0);
 }
 
-static int rtas_setup_msi_irqs(struct pci_dev *pdev, int nvec_in, int type)
+static int rtas_prepare_msi_irqs(struct pci_dev *pdev, int nvec_in, int type,
+msi_alloc_info_t *arg)
 {
struct pci_dn *pdn;
-   int hwirq, virq, i, quota, rc;
-   struct msi_desc *entry;
-   struct msi_msg msg;
+   int quota, rc;
int nvec = nvec_in;
int use_32bit_msi_hack = 0;
 
@@ -456,6 +455,22 @@ static int rtas_setup_msi_irqs(struct pci_dev *pdev, int 
nvec_in, int type)
return rc;
}
 
+   return 0;
+}
+
+static int rtas_setup_msi_irqs(struct pci_dev *pdev, int nvec_in, int type)
+{
+   struct pci_dn *pdn;
+   int hwirq, virq, i;
+   int rc;
+   struct msi_desc *entry;
+   struct msi_msg msg;
+
+   rc = rtas_prepare_msi_irqs(pdev, nvec_in, type, NULL);
+   if (rc)
+   return rc;
+
+   pdn = pci_get_pdn(pdev);
i = 0;
for_each_pci_msi_entry(entry, pdev) {
hwirq = rtas_query_irq_number(pdn, i++);
-- 
2.26.3



[PATCH 01/31] powerpc/pseries/pci: Introduce __find_pe_total_msi()

2021-04-30 Thread Cédric Le Goater
It will help to size the PCI MSI domain.
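
A later patch can then size the per-PHB MSI domain from the device
tree, roughly (sketch, starting from the PHB's device node):

	int count = 0;

	/* Walk up from the PHB node looking for "ibm,pe-total-#msi" */
	if (!__find_pe_total_msi(phb->dn, &count))
		return -ENOSPC;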

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/platforms/pseries/msi.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/msi.c 
b/arch/powerpc/platforms/pseries/msi.c
index 637300330507..d2d090e04745 100644
--- a/arch/powerpc/platforms/pseries/msi.c
+++ b/arch/powerpc/platforms/pseries/msi.c
@@ -164,12 +164,12 @@ static int check_req_msix(struct pci_dev *pdev, int nvec)
 
 /* Quota calculation */
 
-static struct device_node *find_pe_total_msi(struct pci_dev *dev, int *total)
+static struct device_node *__find_pe_total_msi(struct device_node *node, int 
*total)
 {
struct device_node *dn;
const __be32 *p;
 
-   dn = of_node_get(pci_device_to_OF_node(dev));
+   dn = of_node_get(node);
while (dn) {
p = of_get_property(dn, "ibm,pe-total-#msi", NULL);
if (p) {
@@ -185,6 +185,11 @@ static struct device_node *find_pe_total_msi(struct 
pci_dev *dev, int *total)
return NULL;
 }
 
+static struct device_node *find_pe_total_msi(struct pci_dev *dev, int *total)
+{
+   return __find_pe_total_msi(pci_device_to_OF_node(dev), total);
+}
+
 static struct device_node *find_pe_dn(struct pci_dev *dev, int *total)
 {
struct device_node *dn;
-- 
2.26.3



[PATCH 16/31] powerpc/xics: Remove ICS list

2021-04-30 Thread Cédric Le Goater
We always had only one ICS per machine. Simplify the XICS driver by
removing the ICS list.

The ICS stored in the chip data of the XICS domain becomes useless; we
don't need it anymore to migrate IRQs away from a CPU. It will be
removed in a subsequent patch.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/sysdev/xics/xics-common.c | 45 +++---
 1 file changed, 19 insertions(+), 26 deletions(-)

diff --git a/arch/powerpc/sysdev/xics/xics-common.c 
b/arch/powerpc/sysdev/xics/xics-common.c
index 7e4305c01bac..509b9432c368 100644
--- a/arch/powerpc/sysdev/xics/xics-common.c
+++ b/arch/powerpc/sysdev/xics/xics-common.c
@@ -38,7 +38,7 @@ DEFINE_PER_CPU(struct xics_cppr, xics_cppr);
 
 struct irq_domain *xics_host;
 
-static LIST_HEAD(ics_list);
+static struct ics *xics_ics;
 
 void xics_update_irq_servers(void)
 {
@@ -111,12 +111,11 @@ void xics_setup_cpu(void)
 
 void xics_mask_unknown_vec(unsigned int vec)
 {
-   struct ics *ics;
-
pr_err("Interrupt 0x%x (real) is invalid, disabling it.\n", vec);
 
-   list_for_each_entry(ics, &ics_list, link)
-   ics->mask_unknown(ics, vec);
+   if (WARN_ON(!xics_ics))
+   return;
+   xics_ics->mask_unknown(xics_ics, vec);
 }
 
 
@@ -198,7 +197,6 @@ void xics_migrate_irqs_away(void)
struct irq_chip *chip;
long server;
unsigned long flags;
-   struct ics *ics;
 
/* We can't set affinity on ISA interrupts */
if (virq < NUM_ISA_INTERRUPTS)
@@ -219,13 +217,10 @@ void xics_migrate_irqs_away(void)
raw_spin_lock_irqsave(&desc->lock, flags);
 
/* Locate interrupt server */
-   server = -1;
-   ics = irq_desc_get_chip_data(desc);
-   if (ics)
-   server = ics->get_server(ics, irq);
+   server = xics_ics->get_server(xics_ics, irq);
if (server < 0) {
-   printk(KERN_ERR "%s: Can't find server for irq %d\n",
-  __func__, irq);
+   pr_err("%s: Can't find server for irq %d/%x\n",
+  __func__, virq, irq);
goto unlock;
}
 
@@ -307,13 +302,9 @@ int xics_get_irq_server(unsigned int virq, const struct 
cpumask *cpumask,
 static int xics_host_match(struct irq_domain *h, struct device_node *node,
   enum irq_domain_bus_token bus_token)
 {
-   struct ics *ics;
-
-   list_for_each_entry(ics, &ics_list, link)
-   if (ics->host_match(ics, node))
-   return 1;
-
-   return 0;
+   if (WARN_ON(!xics_ics))
+   return 0;
+   return xics_ics->host_match(xics_ics, node) ? 1 : 0;
 }
 
 /* Dummies */
@@ -330,8 +321,6 @@ static struct irq_chip xics_ipi_chip = {
 static int xics_host_map(struct irq_domain *h, unsigned int virq,
 irq_hw_number_t hw)
 {
-   struct ics *ics;
-
pr_devel("xics: map virq %d, hwirq 0x%lx\n", virq, hw);
 
/*
@@ -348,12 +337,14 @@ static int xics_host_map(struct irq_domain *h, unsigned 
int virq,
return 0;
}
 
+   if (WARN_ON(!xics_ics))
+   return -EINVAL;
+
/* Let the ICS setup the chip data */
-   list_for_each_entry(ics, &ics_list, link)
-   if (ics->map(ics, virq) == 0)
-   return 0;
+   if (xics_ics->map(xics_ics, virq))
+   return -EINVAL;
 
-   return -EINVAL;
+   return 0;
 }
 
 static int xics_host_xlate(struct irq_domain *h, struct device_node *ct,
@@ -427,7 +418,9 @@ static void __init xics_init_host(void)
 
 void __init xics_register_ics(struct ics *ics)
 {
-   list_add(&ics->link, &ics_list);
+   if (WARN_ONCE(xics_ics, "XICS: Source Controller is already defined !"))
+   return;
+   xics_ics = ics;
 }
 
 static void __init xics_get_server_size(void)
-- 
2.26.3



[PATCH 14/31] KVM: PPC: Book3S HV: XIVE: Change interface of passthrough interrupt routines

2021-04-30 Thread Cédric Le Goater
The routine kvmppc_set_passthru_irq() calls kvmppc_xive_set_mapped()
and kvmppc_xive_clr_mapped() with an IRQ descriptor. Use the host IRQ
number directly to remove a useless conversion.

Add some debug output.

Cc: Paul Mackerras 
Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/include/asm/kvm_ppc.h |  4 ++--
 arch/powerpc/kvm/book3s_hv.c       |  4 ++--
 arch/powerpc/kvm/book3s_xive.c     | 17 -
 3 files changed, 12 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index 8aacd76bb702..d6c52a0ec687 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -663,9 +663,9 @@ extern int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
struct kvm_vcpu *vcpu, u32 cpu);
 extern void kvmppc_xive_cleanup_vcpu(struct kvm_vcpu *vcpu);
 extern int kvmppc_xive_set_mapped(struct kvm *kvm, unsigned long guest_irq,
- struct irq_desc *host_desc);
+ unsigned long host_irq);
 extern int kvmppc_xive_clr_mapped(struct kvm *kvm, unsigned long guest_irq,
- struct irq_desc *host_desc);
+ unsigned long host_irq);
 extern u64 kvmppc_xive_get_icp(struct kvm_vcpu *vcpu);
 extern int kvmppc_xive_set_icp(struct kvm_vcpu *vcpu, u64 icpval);
 
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 86a0f8b0e6da..9f4eb74a11cc 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -5196,7 +5196,7 @@ static int kvmppc_set_passthru_irq(struct kvm *kvm, int 
host_irq, int guest_gsi)
pimap->n_mapped++;
 
if (xics_on_xive())
-   rc = kvmppc_xive_set_mapped(kvm, guest_gsi, desc);
+   rc = kvmppc_xive_set_mapped(kvm, guest_gsi, host_irq);
else
kvmppc_xics_set_mapped(kvm, guest_gsi, desc->irq_data.hwirq);
if (rc)
@@ -5237,7 +5237,7 @@ static int kvmppc_clr_passthru_irq(struct kvm *kvm, int 
host_irq, int guest_gsi)
}
 
if (xics_on_xive())
-   rc = kvmppc_xive_clr_mapped(kvm, guest_gsi, 
pimap->mapped[i].desc);
+   rc = kvmppc_xive_clr_mapped(kvm, guest_gsi, host_irq);
else
kvmppc_xics_clr_mapped(kvm, guest_gsi, 
pimap->mapped[i].r_hwirq);
 
diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
index e7219b6f5f9a..3a7da42bed57 100644
--- a/arch/powerpc/kvm/book3s_xive.c
+++ b/arch/powerpc/kvm/book3s_xive.c
@@ -856,13 +856,12 @@ int kvmppc_xive_set_icp(struct kvm_vcpu *vcpu, u64 icpval)
 }
 
 int kvmppc_xive_set_mapped(struct kvm *kvm, unsigned long guest_irq,
-  struct irq_desc *host_desc)
+  unsigned long host_irq)
 {
struct kvmppc_xive *xive = kvm->arch.xive;
struct kvmppc_xive_src_block *sb;
struct kvmppc_xive_irq_state *state;
-   struct irq_data *host_data = irq_desc_get_irq_data(host_desc);
-   unsigned int host_irq = irq_desc_get_irq(host_desc);
+   struct irq_data *host_data = irq_get_irq_data(host_irq);
unsigned int hw_irq = (unsigned int)irqd_to_hwirq(host_data);
u16 idx;
u8 prio;
@@ -871,7 +870,8 @@ int kvmppc_xive_set_mapped(struct kvm *kvm, unsigned long 
guest_irq,
if (!xive)
return -ENODEV;
 
-   pr_devel("set_mapped girq 0x%lx host HW irq 0x%x...\n",guest_irq, 
hw_irq);
+   pr_debug("%s: GIRQ 0x%lx host IRQ %ld XIVE HW IRQ 0x%x\n",
+__func__, guest_irq, host_irq, hw_irq);
 
sb = kvmppc_xive_find_source(xive, guest_irq, &idx);
if (!sb)
@@ -893,7 +893,7 @@ int kvmppc_xive_set_mapped(struct kvm *kvm, unsigned long 
guest_irq,
 */
rc = irq_set_vcpu_affinity(host_irq, state);
if (rc) {
-   pr_err("Failed to set VCPU affinity for irq %d\n", host_irq);
+   pr_err("Failed to set VCPU affinity for host IRQ %ld\n", 
host_irq);
return rc;
}
 
@@ -953,12 +953,11 @@ int kvmppc_xive_set_mapped(struct kvm *kvm, unsigned long 
guest_irq,
 EXPORT_SYMBOL_GPL(kvmppc_xive_set_mapped);
 
 int kvmppc_xive_clr_mapped(struct kvm *kvm, unsigned long guest_irq,
-  struct irq_desc *host_desc)
+  unsigned long host_irq)
 {
struct kvmppc_xive *xive = kvm->arch.xive;
struct kvmppc_xive_src_block *sb;
struct kvmppc_xive_irq_state *state;
-   unsigned int host_irq = irq_desc_get_irq(host_desc);
u16 idx;
u8 prio;
int rc;
@@ -966,7 +965,7 @@ int kvmppc_xive_clr_mapped(struct kvm *kvm, unsigned long 
guest_irq,
if (!xive)
return -ENODEV;
 
-   pr_devel("clr_mapped girq 0x%lx...\n", guest_irq);
+   pr_debug("%s: GIRQ 0x%lx host IRQ %ld\n", __func__, guest_irq, 
host_irq);
 
sb = kvmppc_xive_find_source(xive, 

[PATCH 12/31] powerpc/powernv/pci: Add MSI domains

2021-04-30 Thread Cédric Le Goater
This is very similar to the MSI domains of the pSeries platform. The
MSI allocator is directly handled under the Linux PHB in the
in-the-middle "MSI" domain.

Only the XIVE (P9/P10) parent domain is supported for now. We still
need to add support for IRQ domain hierarchy under XICS.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 188 ++
 1 file changed, 188 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index b2a8da6114b5..3886ca6e2ed3 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -36,6 +36,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -2244,6 +2245,189 @@ static int pnv_pci_ioda_msi_setup(struct pnv_phb *phb, 
struct pci_dev *dev,
return 0;
 }
 
+/*
+ * The msi_free() op is called before irq_domain_free_irqs_top() when
+ * the handler data is still available. Use that to clear the XIVE
+ * controller.
+ */
+static void pnv_msi_ops_msi_free(struct irq_domain *domain,
+struct msi_domain_info *info,
+unsigned int irq)
+{
+   if (xive_enabled())
+   xive_irq_free_data(irq);
+}
+
+static struct msi_domain_ops pnv_pci_msi_domain_ops = {
+   .msi_free   = pnv_msi_ops_msi_free,
+};
+
+static void pnv_msi_shutdown(struct irq_data *d)
+{
+   d = d->parent_data;
+   if (d->chip->irq_shutdown)
+   d->chip->irq_shutdown(d);
+}
+
+static void pnv_msi_mask(struct irq_data *d)
+{
+   pci_msi_mask_irq(d);
+   irq_chip_mask_parent(d);
+}
+
+static void pnv_msi_unmask(struct irq_data *d)
+{
+   pci_msi_unmask_irq(d);
+   irq_chip_unmask_parent(d);
+}
+
+static struct irq_chip pnv_pci_msi_irq_chip = {
+   .name   = "PNV-PCI-MSI",
+   .irq_shutdown   = pnv_msi_shutdown,
+   .irq_mask   = pnv_msi_mask,
+   .irq_unmask = pnv_msi_unmask,
+   .irq_eoi= irq_chip_eoi_parent,
+};
+
+static struct msi_domain_info pnv_msi_domain_info = {
+   .flags = (MSI_FLAG_USE_DEF_DOM_OPS | MSI_FLAG_USE_DEF_CHIP_OPS |
+ MSI_FLAG_MULTI_PCI_MSI  | MSI_FLAG_PCI_MSIX),
+   .ops   = _pci_msi_domain_ops,
+   .chip  = _pci_msi_irq_chip,
+};
+
+static void pnv_msi_compose_msg(struct irq_data *d, struct msi_msg *msg)
+{
+   struct msi_desc *entry = irq_data_get_msi_desc(d);
+   struct pci_dev *pdev = msi_desc_to_pci_dev(entry);
+   struct pci_controller *hose = irq_data_get_irq_chip_data(d);
+   struct pnv_phb *phb = hose->private_data;
+   int rc;
+
+   rc = __pnv_pci_ioda_msi_setup(phb, pdev, d->hwirq,
+ entry->msi_attrib.is_64, msg);
+   if (rc)
+   dev_err(&pdev->dev, "Failed to setup %s-bit MSI #%ld : %d\n",
+   entry->msi_attrib.is_64 ? "64" : "32", d->hwirq, rc);
+}
+
+static struct irq_chip pnv_msi_irq_chip = {
+   .name   = "PNV-MSI",
+   .irq_shutdown   = pnv_msi_shutdown,
+   .irq_mask   = irq_chip_mask_parent,
+   .irq_unmask = irq_chip_unmask_parent,
+   .irq_eoi= irq_chip_eoi_parent,
+   .irq_set_affinity   = irq_chip_set_affinity_parent,
+   .irq_compose_msi_msg= pnv_msi_compose_msg,
+};
+
+static int pnv_irq_parent_domain_alloc(struct irq_domain *domain,
+  unsigned int virq, int hwirq)
+{
+   struct irq_fwspec parent_fwspec;
+   int ret;
+
+   parent_fwspec.fwnode = domain->parent->fwnode;
+   parent_fwspec.param_count = 2;
+   parent_fwspec.param[0] = hwirq;
+   parent_fwspec.param[1] = IRQ_TYPE_EDGE_RISING;
+
+   ret = irq_domain_alloc_irqs_parent(domain, virq, 1, &parent_fwspec);
+   if (ret)
+   return ret;
+
+   return 0;
+}
+
+static int pnv_irq_domain_alloc(struct irq_domain *domain, unsigned int virq,
+   unsigned int nr_irqs, void *arg)
+{
+   struct pci_controller *hose = domain->host_data;
+   struct pnv_phb *phb = hose->private_data;
+   msi_alloc_info_t *info = arg;
+   struct pci_dev *pdev = msi_desc_to_pci_dev(info->desc);
+   int hwirq;
+   int i, ret;
+
+   hwirq = msi_bitmap_alloc_hwirqs(&phb->msi_bmp, nr_irqs);
+   if (hwirq < 0) {
+   dev_warn(&pdev->dev, "failed to find a free MSI\n");
+   return -ENOSPC;
+   }
+
+   dev_dbg(&pdev->dev, "%s bridge %pOF %d/%x #%d\n", __func__,
+   hose->dn, virq, hwirq, nr_irqs);
+
+   for (i = 0; i < nr_irqs; i++) {
+   ret = pnv_irq_parent_domain_alloc(domain, virq + i,
+ phb->msi_base + hwirq + i);
+   if (ret)
+   goto out;
+
+   irq_domain_set_hwirq_and_chip(domain, virq + i, hwirq + i,
+ 

[PATCH 11/31] powerpc/powernv/pci: Introduce __pnv_pci_ioda_msi_setup()

2021-04-30 Thread Cédric Le Goater
It will be used as a 'compose_msg' handler of the MSI domain
introduced later.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 28 +++
 1 file changed, 23 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index f0f901683a2f..b2a8da6114b5 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2160,15 +2160,17 @@ bool is_pnv_opal_msi(struct irq_chip *chip)
 }
 EXPORT_SYMBOL_GPL(is_pnv_opal_msi);
 
-static int pnv_pci_ioda_msi_setup(struct pnv_phb *phb, struct pci_dev *dev,
- unsigned int hwirq, unsigned int virq,
- unsigned int is_64, struct msi_msg *msg)
+static int __pnv_pci_ioda_msi_setup(struct pnv_phb *phb, struct pci_dev *dev,
+   unsigned int xive_num,
+   unsigned int is_64, struct msi_msg *msg)
 {
struct pnv_ioda_pe *pe = pnv_ioda_get_pe(dev);
-   unsigned int xive_num = hwirq - phb->msi_base;
__be32 data;
int rc;
 
+   dev_dbg(&dev->dev, "%s: setup %s-bit MSI for vector #%d\n", __func__,
+   is_64 ? "64" : "32", xive_num);
+
/* No PE assigned ? bail out ... no MSI for you ! */
if (pe == NULL)
return -ENXIO;
@@ -2216,12 +2218,28 @@ static int pnv_pci_ioda_msi_setup(struct pnv_phb *phb, 
struct pci_dev *dev,
}
msg->data = be32_to_cpu(data);
 
+   return 0;
+}
+
+static int pnv_pci_ioda_msi_setup(struct pnv_phb *phb, struct pci_dev *dev,
+ unsigned int hwirq, unsigned int virq,
+ unsigned int is_64, struct msi_msg *msg)
+{
+   struct pnv_ioda_pe *pe = pnv_ioda_get_pe(dev);
+   unsigned int xive_num = hwirq - phb->msi_base;
+   int rc;
+
+   rc = __pnv_pci_ioda_msi_setup(phb, dev, xive_num, is_64, msg);
+   if (rc)
+   return rc;
+
+   /* P8 only */
pnv_set_msi_irq_chip(phb, virq);
 
pr_devel("%s: %s-bit MSI on hwirq %x (xive #%d),"
 " address=%x_%08x data=%x PE# %x\n",
 pci_name(dev), is_64 ? "64" : "32", hwirq, xive_num,
-msg->address_hi, msg->address_lo, data, pe->pe_number);
+msg->address_hi, msg->address_lo, msg->data, pe->pe_number);
 
return 0;
 }
-- 
2.26.3



[PATCH 00/31] powerpc: Modernize the PCI/MSI support

2021-04-30 Thread Cédric Le Goater
Hello,

This series adds support for MSI IRQ domains on top of the XICS (P8)
and XIVE (P9/P10) IRQ domains for the PowerNV (baremetal) and pSeries
(VM) platforms. It should greatly improve IRQ affinity of PCI MSIs on
these PowerPC platforms. Data locality can still be improved with a
machine IRQ domain per chip but this requires FW changes.
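
The IRQ domain stacking is roughly:

  PCI-MSI domain (per PHB)
          |
  "MSI" allocator domain (per PHB)
          |
  XICS (P8) or XIVE (P9/P10) parent domain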

The patchset has a large impact but it is well contained under the MSI
support. Initial tests were done on the P8, P9 and P10 PowerNV and
pSeries platforms, under the KVM and PowerVM hypervisors. PCI passthrough
was tested on P8/KVM, P9/KVM and P9/pVM.

P8 passthrough adds an optimization to EOI MSIs when under real mode,
but I didn't see any performance improvement with a passthrough 10G
Ethernet adapter. If someone has faster adapters, I would be interested
in the results.

The P8/CAPI driver is also impacted. Tests were done on a Firestone
system with a memory AFU.

Thanks,

C.

Cédric Le Goater (31):
  powerpc/pseries/pci: Introduce __find_pe_total_msi()
  powerpc/pseries/pci: Introduce rtas_prepare_msi_irqs()
  powerpc/xive: Add support for IRQ domain hierarchy
  powerpc/xive: Ease debugging of xive_irq_set_affinity()
  powerpc/pseries/pci: Add MSI domains
  powerpc/xive: Drop unmask of MSIs at startup
  powerpc/xive: Fix xive_irq_set_affinity for MSI
  powerpc/pseries/pci: Add a domain_free_irqs handler
  powerpc/pseries/pci: Add a msi_free() handler to clear XIVE data
  powerpc/pseries/pci: Add support of MSI domains to PHB hotplug
  powerpc/powernv/pci: Introduce __pnv_pci_ioda_msi_setup()
  powerpc/powernv/pci: Add MSI domains
  KVM: PPC: Book3S HV: Use the new IRQ chip to detect passthrough
interrupts
  KVM: PPC: Book3S HV: XIVE: Change interface of passthrough interrupt
routines
  KVM: PPC: Book3S HV: XIVE: Fix mapping of passthrough interrupts
  powerpc/xics: Remove ICS list
  powerpc/xics: Rename the map handler into a check handler
  powerpc/xics: Give a name to the default XICS IRQ domain
  powerpc/xics: Add debug logging to the set_irq_affinity handlers
  powerpc/xics: Add support for IRQ domain hierarchy
  powerpc/powernv/pci: Customize the MSI EOI handler to support PHB3
  powerpc/pci: Drop XIVE restriction on MSI domains
  powerpc/xics: Drop unmask of MSIs at startup
  powerpc/pseries/pci: Drop unused MSI code
  powerpc/powernv/pci: Drop unused MSI code
  powerpc/powernv/pci: Adapt is_pnv_opal_msi() to detect passthrough
interrupt
  powerpc/xics: Fix IRQ migration
  powerpc/powernv/pci: Set the IRQ chip data for P8/CXL devices
  powerpc/powernv/pci: Rework pnv_opal_pci_msi_eoi()
  KVM: PPC: Book3S HV: XICS: Fix mapping of passthrough interrupts
  genirq: Improve "hwirq" output in /proc and /sys/

 arch/powerpc/include/asm/kvm_ppc.h         |   4 +-
 arch/powerpc/include/asm/pci-bridge.h      |   5 +
 arch/powerpc/include/asm/pnv-pci.h         |   2 +-
 arch/powerpc/include/asm/xics.h            |   3 +-
 arch/powerpc/include/asm/xive.h            |   1 +
 arch/powerpc/platforms/powernv/pci.h       |   6 -
 arch/powerpc/platforms/pseries/pseries.h   |   2 +
 arch/powerpc/kernel/pci-common.c           |   6 +
 arch/powerpc/kvm/book3s_hv.c               |  18 +-
 arch/powerpc/kvm/book3s_hv_rm_xics.c       |   8 +-
 arch/powerpc/kvm/book3s_xive.c             |  18 +-
 arch/powerpc/platforms/powernv/pci-ioda.c  | 258 --
 arch/powerpc/platforms/powernv/pci.c       |  67 -
 arch/powerpc/platforms/pseries/msi.c       | 296 -
 arch/powerpc/platforms/pseries/pci_dlpar.c |   4 +
 arch/powerpc/platforms/pseries/setup.c     |   2 +
 arch/powerpc/sysdev/xics/ics-opal.c        |  40 +--
 arch/powerpc/sysdev/xics/ics-rtas.c        |  40 +--
 arch/powerpc/sysdev/xics/xics-common.c     | 125 ++---
 arch/powerpc/sysdev/xive/common.c          |  81 +-
 kernel/irq/irqdesc.c                       |   2 +-
 kernel/irq/irqdomain.c                     |   1 +
 kernel/irq/proc.c                          |   2 +-
 23 files changed, 693 insertions(+), 298 deletions(-)

-- 
2.26.3



[PATCH 04/31] powerpc/xive: Ease debugging of xive_irq_set_affinity()

2021-04-30 Thread Cédric Le Goater
pr_debug() is easier to activate and it helps to know how the HW is
configured when tweaking the IRQ subsystem.
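
Assuming CONFIG_DYNAMIC_DEBUG, they can then be turned on at run time
with something like:

  # echo 'file arch/powerpc/sysdev/xive/common.c +p' > \
        /sys/kernel/debug/dynamic_debug/control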

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/sysdev/xive/common.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/common.c 
b/arch/powerpc/sysdev/xive/common.c
index 6ad26243bc33..9cb7ae728b46 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -713,7 +713,7 @@ static int xive_irq_set_affinity(struct irq_data *d,
u32 target, old_target;
int rc = 0;
 
-   pr_devel("xive_irq_set_affinity: irq %d\n", d->irq);
+   pr_debug("%s: irq %d/%x\n", __func__, d->irq, hw_irq);
 
/* Is this valid ? */
if (cpumask_any_and(cpumask, cpu_online_mask) >= nr_cpu_ids)
@@ -758,7 +758,7 @@ static int xive_irq_set_affinity(struct irq_data *d,
return rc;
}
 
-   pr_devel("  target: 0x%x\n", target);
+   pr_debug("  target: 0x%x\n", target);
xd->target = target;
 
/* Give up previous target */
-- 
2.26.3



[PATCH 03/31] powerpc/xive: Add support for IRQ domain hierarchy

2021-04-30 Thread Cédric Le Goater
This adds handlers to allocate/free IRQs in a domain hierarchy. We
could try to use xive_irq_domain_map() in xive_irq_domain_alloc(), but
since we rely on xive_irq_alloc_data() to set the IRQ handler data,
duplicating the code is simpler.

xive_irq_free_data() needs to be called when IRQs are freed to clear
the MMIO mappings and free the XIVE handler data, the xive_irq_data
structure. This is going to be a problem with MSI domains, which we
will address later.
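
A child domain then allocates into XIVE with a two-cell fwspec, along
the lines of what the PowerNV MSI patch does later in the series
(sketch):

	struct irq_fwspec fwspec = {
		.fwnode      = domain->parent->fwnode,
		.param_count = 2,
		.param       = { hwirq, IRQ_TYPE_EDGE_RISING },
	};

	ret = irq_domain_alloc_irqs_parent(domain, virq, 1, &fwspec);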

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/sysdev/xive/common.c | 60 +++
 1 file changed, 60 insertions(+)

diff --git a/arch/powerpc/sysdev/xive/common.c 
b/arch/powerpc/sysdev/xive/common.c
index 5acd76403ee7..6ad26243bc33 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -1372,7 +1372,67 @@ static void xive_irq_domain_debug_show(struct seq_file 
*m, struct irq_domain *d,
 }
 #endif
 
+static int xive_irq_domain_translate(struct irq_domain *d,
+struct irq_fwspec *fwspec,
+unsigned long *hwirq,
+unsigned int *type)
+{
+   return xive_irq_domain_xlate(d, to_of_node(fwspec->fwnode),
+fwspec->param, fwspec->param_count,
+hwirq, type);
+}
+
+static int xive_irq_domain_alloc(struct irq_domain *domain, unsigned int virq,
+unsigned int nr_irqs, void *arg)
+{
+   struct irq_fwspec *fwspec = arg;
+   irq_hw_number_t hwirq;
+   unsigned int type = IRQ_TYPE_NONE;
+   int i, rc;
+
+   rc = xive_irq_domain_translate(domain, fwspec, &hwirq, &type);
+   if (rc)
+   return rc;
+
+   pr_debug("%s %d/%lx #%d\n", __func__, virq, hwirq, nr_irqs);
+
+   for (i = 0; i < nr_irqs; i++) {
+   /* TODO: call xive_irq_domain_map() */
+
+   /*
+* Mark interrupts as edge sensitive by default so that resend
+* actually works. Will fix that up below if needed.
+*/
+   irq_clear_status_flags(virq, IRQ_LEVEL);
+
+   /* allocates and sets handler data */
+   rc = xive_irq_alloc_data(virq + i, hwirq + i);
+   if (rc)
+   return rc;
+
+   irq_domain_set_hwirq_and_chip(domain, virq + i, hwirq + i,
+ &xive_irq_chip, domain->host_data);
+   irq_set_handler(virq + i, handle_fasteoi_irq);
+   }
+
+   return 0;
+}
+
+static void xive_irq_domain_free(struct irq_domain *domain,
+unsigned int virq, unsigned int nr_irqs)
+{
+   int i;
+
+   pr_debug("%s %d #%d\n", __func__, virq, nr_irqs);
+
+   for (i = 0; i < nr_irqs; i++)
+   xive_irq_free_data(virq + i);
+}
+
 static const struct irq_domain_ops xive_irq_domain_ops = {
+   .alloc  = xive_irq_domain_alloc,
+   .free   = xive_irq_domain_free,
+   .translate = xive_irq_domain_translate,
.match = xive_irq_domain_match,
.map = xive_irq_domain_map,
.unmap = xive_irq_domain_unmap,
-- 
2.26.3



[PATCH 06/31] powerpc/xive: Drop unmask of MSIs at startup

2021-04-30 Thread Cédric Le Goater
That was a workaround in the XIVE domain because of the lack of an
MSI domain. This is now handled.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/sysdev/xive/common.c | 10 --
 1 file changed, 10 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/common.c 
b/arch/powerpc/sysdev/xive/common.c
index 9cb7ae728b46..96737938e8e3 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -616,16 +616,6 @@ static unsigned int xive_irq_startup(struct irq_data *d)
pr_devel("xive_irq_startup: irq %d [0x%x] data @%p\n",
 d->irq, hw_irq, d);
 
-#ifdef CONFIG_PCI_MSI
-   /*
-* The generic MSI code returns with the interrupt disabled on the
-* card, using the MSI mask bits. Firmware doesn't appear to unmask
-* at that level, so we do it here by hand.
-*/
-   if (irq_data_get_msi_desc(d))
-   pci_msi_unmask_irq(d);
-#endif
-
/* Pick a target */
target = xive_pick_irq_target(d, irq_data_get_affinity_mask(d));
if (target == XIVE_INVALID_TARGET) {
-- 
2.26.3



Re: [PATCH] powerpc/powernv/memtrace: Fix dcache flushing

2021-04-30 Thread Aneesh Kumar K.V
Sandipan Das  writes:

> Trace memory is cleared and the corresponding dcache lines
> are flushed after allocation. However, this should not be
> done using the PFN. This adds the missing __va() conversion.

Reviewed-by: Aneesh Kumar K.V 

>
> Fixes: 2ac02e5ecec0 ("powerpc/mm: Remove dcache flush from memory remove.")
> Signed-off-by: Sandipan Das 
> ---
>  arch/powerpc/platforms/powernv/memtrace.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/platforms/powernv/memtrace.c 
> b/arch/powerpc/platforms/powernv/memtrace.c
> index 71c1262589fe..a31f13814f2e 100644
> --- a/arch/powerpc/platforms/powernv/memtrace.c
> +++ b/arch/powerpc/platforms/powernv/memtrace.c
> @@ -104,8 +104,8 @@ static void memtrace_clear_range(unsigned long start_pfn,
>* Before we go ahead and use this range as cache inhibited range
>* flush the cache.
>*/
> - flush_dcache_range_chunked(PFN_PHYS(start_pfn),
> -PFN_PHYS(start_pfn + nr_pages),
> + flush_dcache_range_chunked((unsigned long)__va(PFN_PHYS(start_pfn)),
> +(unsigned long)__va(PFN_PHYS(start_pfn + 
> nr_pages)),
>  FLUSH_CHUNK_SIZE);
>  }
>  
> -- 
> 2.25.1


[PATCH] powerpc/powernv/memtrace: Fix dcache flushing

2021-04-30 Thread Sandipan Das
Trace memory is cleared and the corresponding dcache lines
are flushed after allocation. However, this should not be
done using the PFN. This adds the missing __va() conversion.
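
As a sketch, the conversion chain is:

	phys_addr_t pa = PFN_PHYS(start_pfn);         /* pfn << PAGE_SHIFT */
	unsigned long va = (unsigned long)__va(pa);   /* linear-map virtual */

flush_dcache_range_chunked() walks kernel virtual addresses, so it must
be given the linear-map address, not the physical one.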

Fixes: 2ac02e5ecec0 ("powerpc/mm: Remove dcache flush from memory remove.")
Signed-off-by: Sandipan Das 
---
 arch/powerpc/platforms/powernv/memtrace.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/memtrace.c 
b/arch/powerpc/platforms/powernv/memtrace.c
index 71c1262589fe..a31f13814f2e 100644
--- a/arch/powerpc/platforms/powernv/memtrace.c
+++ b/arch/powerpc/platforms/powernv/memtrace.c
@@ -104,8 +104,8 @@ static void memtrace_clear_range(unsigned long start_pfn,
 * Before we go ahead and use this range as cache inhibited range
 * flush the cache.
 */
-   flush_dcache_range_chunked(PFN_PHYS(start_pfn),
-  PFN_PHYS(start_pfn + nr_pages),
+   flush_dcache_range_chunked((unsigned long)__va(PFN_PHYS(start_pfn)),
+  (unsigned long)__va(PFN_PHYS(start_pfn + 
nr_pages)),
   FLUSH_CHUNK_SIZE);
 }
 
-- 
2.25.1



Re: [PATCH v2] kbuild: replace LANG=C with LC_ALL=C

2021-04-30 Thread Greg KH
On Fri, Apr 30, 2021 at 10:56:27AM +0900, Masahiro Yamada wrote:
> LANG gives a weak default to each LC_* in case it is not explicitly
> defined. LC_ALL, if set, overrides all other LC_* variables.
> 
>   LANG  <  LC_CTYPE, LC_COLLATE, LC_MONETARY, LC_NUMERIC, ...  <  LC_ALL
> 
> This is why documentation such as [1] suggests to set LC_ALL in build
> scripts to get the deterministic result.
> 
> LANG=C is not strong enough to override LC_* that may be set by end
> users.
> 
> [1]: https://reproducible-builds.org/docs/locales/
> 
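The precedence is easy to see with coreutils sort, assuming a glibc
system where en_US.UTF-8 collation differs from the C locale:

  $ export LC_COLLATE=en_US.UTF-8
  $ printf 'B\na\n' | LANG=C sort      # explicitly set LC_COLLATE beats LANG
  a
  B
  $ printf 'B\na\n' | LC_ALL=C sort    # LC_ALL beats LC_COLLATE
  B
  a
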
> Signed-off-by: Masahiro Yamada 
> Acked-by: Michael Ellerman  (powerpc)
> Reviewed-by: Matthias Maennich 
> Acked-by: Matthieu Baerts  (mptcp)

Reviewed-by: Greg Kroah-Hartman