RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-12 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Monday, December 13, 2021 7:37 AM
> 
> On Sun, Dec 12, 2021 at 09:55:32PM +0100, Thomas Gleixner wrote:
> > Kevin,
> >
> > On Sun, Dec 12 2021 at 01:56, Kevin Tian wrote:
> > >> From: Thomas Gleixner 
> > >> All I can find is drivers/iommu/virtio-iommu.c but I can't find anything
> > >> vIR related there.
> > >
> > > Well, virtio-iommu is a para-virtualized vIOMMU implementations.
> > >
> > > In reality there are also fully emulated vIOMMU implementations (e.g.
> > > Qemu fully emulates Intel/AMD/ARM IOMMUs). In those configurations
> > > the IR logic in existing iommu drivers just apply:
> > >
> > >   drivers/iommu/intel/irq_remapping.c
> > >   drivers/iommu/amd/iommu.c
> >
> > thanks for the explanation. So that's a full IOMMU emulation. I was more
> > expecting a paravirtualized lightweight one.
> 
> Kevin can you explain what on earth vIR is for and how does it work??
> 
> Obviously we don't expose the IR machinery to userspace, so at best
> this is somehow changing what the MSI trap does?
> 

Initially vIR was introduced to support more than 255 vCPUs. Since it is
fully emulated, the capability can certainly support other IR usages
observed on bare metal as well.

vIR doesn't rely on the physical IR presence.

First, if the guest doesn't have a vfio device then the physical capability
doesn't matter.

Even with a vfio device, IR by definition is about remapping rather than
injection (more on this later). The interrupts are always routed
to the host handler first (vfio_msihandler() in this case), which then 
signals the irqfd to invoke the virtual interrupt injection handler
(irqfd_wakeup()) in kvm.

This suggests a clear role split between vfio and kvm:

  - vfio is responsible for irq allocation/startup as it is the device driver;
  - kvm takes care of virtual interrupt injection, being a VMM;

The two are connected via irqfd.

Following this split, vIR information is kept entirely in userspace.
Qemu calculates the routing information between vGSI and vCPU
(with or without vIR, and for whatever trapped interrupt storage)
and then registers it with kvm.

When kvm receives a notification via irqfd, it follows irqfd->vGSI->vCPU
and injects a virtual interrupt into the target vCPU.
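A minimal userspace-side sketch of this wiring (assumptions: error handling
is omitted, and the vGSI number plus the virtual addr/data values are
whatever the VMM computed from the guest-visible entry; this is not actual
Qemu code):

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/eventfd.h>
    #include <linux/vfio.h>
    #include <linux/kvm.h>

    static void wire_one_vector(int device_fd, int vm_fd, uint32_t gsi,
                                uint32_t vaddr_lo, uint32_t vaddr_hi,
                                uint32_t vdata)
    {
        int32_t efd = eventfd(0, EFD_CLOEXEC);

        /* vfio side: allocate/startup the physical IRQ, signal efd on it */
        struct vfio_irq_set *set = calloc(1, sizeof(*set) + sizeof(efd));
        set->argsz = sizeof(*set) + sizeof(efd);
        set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
        set->index = VFIO_PCI_MSIX_IRQ_INDEX;
        set->start = 0;
        set->count = 1;
        memcpy(set->data, &efd, sizeof(efd));
        ioctl(device_fd, VFIO_DEVICE_SET_IRQS, set);
        free(set);

        /* kvm side: vGSI -> vCPU routing, from the virtual addr/data pair */
        struct kvm_irq_routing *r = calloc(1, sizeof(*r) + sizeof(r->entries[0]));
        r->nr = 1;
        r->entries[0].gsi = gsi;
        r->entries[0].type = KVM_IRQ_ROUTING_MSI;
        r->entries[0].u.msi.address_lo = vaddr_lo;
        r->entries[0].u.msi.address_hi = vaddr_hi;
        r->entries[0].u.msi.data = vdata;
        ioctl(vm_fd, KVM_SET_GSI_ROUTING, r);
        free(r);

        /* connect the two: kvm injects on the vGSI when vfio signals efd */
        struct kvm_irqfd irqfd = { .fd = (uint32_t)efd, .gsi = gsi };
        ioctl(vm_fd, KVM_IRQFD, &irqfd);
    }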

Then comes an interesting scenario around IOMMU posted interrupts (PI).
This capability allows the IR engine to convert a physical interrupt
directly into a virtual one and inject it into the guest, in effect
offloading the virtual routing information into the hardware.

This is currently achieved via the IRQ bypass manager, which connects
vfio (IRQ producer) to kvm (IRQ consumer) around a specific Linux irq
number. Once the connection is established, kvm calls
irq_set_vcpu_affinity() to update the IRTE with the virtual routing
information for that irq number.
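A conceptual sketch of that connection (the structs and the helper below are
simplified stand-ins, not the real code in virt/lib/irqbypass.c, kvm and
vfio; only irq_set_vcpu_affinity() is the actual kernel API being described):

    /* producer: registered by vfio for the host Linux irq it allocated */
    struct irq_bypass_producer_sketch { unsigned int irq; };
    /* consumer: registered by kvm with the vCPU routing for that irq */
    struct irq_bypass_consumer_sketch { void *vcpu_info; };

    static int connect_posted_interrupt(struct irq_bypass_producer_sketch *prod,
                                        struct irq_bypass_consumer_sketch *cons)
    {
        /* Once producer and consumer are matched on the same irq number,
         * kvm pushes the virtual routing information into the IRTE so the
         * IOMMU can post the interrupt directly into the guest. */
        return irq_set_vcpu_affinity(prod->irq, cons->vcpu_info);
    }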

With that design Qemu doesn't know whether IR or PI is enabled 
physically. It always talks to vfio to have the IRQ resource allocated and
to kvm to register the virtual routing information.

Then adding the new hypercall machinery into this picture:

  1) The hypercall needs to carry all necessary virtual routing 
 information, since there is no trap;

  2) Before querying the IRTE addr/data pair, Qemu needs to complete the
 operations it performs today to have the IRTE ready:

    a) Register the irqfd and related GSI routing info with kvm;
    b) Allocate/start up IRQs via vfio;

 When PI is enabled, the IRTE is ready only after both are completed.

  3) Qemu gets the IRTE addr/data pair from the kernel and returns it to the guest.
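Purely as an illustration of 1)-3), the kind of information such a hypercall
would have to carry and return (this is not a defined ABI; all field names
here are made up):

    struct ims_hypercall_req {            /* guest -> host */
        uint32_t bdf;          /* originating (virtual) device             */
        uint32_t index;        /* which MSI/MSI-X/IMS entry on that device */
        uint32_t virt_dest;    /* guest routing: target vAPIC ...          */
        uint32_t virt_vector;  /* ... and vector                           */
    };

    struct ims_hypercall_resp {           /* host -> guest */
        uint32_t addr_lo;      /* addr/data pair the guest then programs   */
        uint32_t addr_hi;      /* into the device-side storage (MSI-X slot,*/
        uint32_t data;         /* IMS entry, ...)                          */
    };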

Thanks
Kevin


Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-12 Thread Jason Gunthorpe via iommu
On Sun, Dec 12, 2021 at 01:12:05AM +0100, Thomas Gleixner wrote:

>  PCI/MSI and PCI/MSI-X are just implementations of IMS
> 
> Not more, not less. The fact that they have very strict rules about the
> storage space and the fact that they are mutually exclusive does not
> change that at all.

And the mess we have is that virtualization broke this
design. Virtualization made MSI/MSI-X special!

I am wondering if we just need to bite the bullet and force the
introduction of a new ACPI flag for the APIC that says one of:
- message addr/data pairs work correctly (baremetal)
- creating message addr/data pairs needs to use a hypercall protocol
- property not defined so assume only MSI/MSI-X/etc work.
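Purely illustrative, the three states above as a hypothetical enum (no such
ACPI flag exists today):

    enum msi_addr_data_policy {
        MSI_ADDR_DATA_NATIVE,     /* addr/data pairs work as on bare metal  */
        MSI_ADDR_DATA_HYPERCALL,  /* pairs must be obtained via a hypercall */
        MSI_ADDR_DATA_UNDEFINED,  /* not stated: assume only MSI/MSI-X work */
    };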

Intel was originally trying to do this with the 'IMS enabled' PCI
Capability block, but a per PCI device capability is in the wrong
layer.

Jason


Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-12 Thread Jason Gunthorpe via iommu
On Sun, Dec 12, 2021 at 09:55:32PM +0100, Thomas Gleixner wrote:
> Kevin,
> 
> On Sun, Dec 12 2021 at 01:56, Kevin Tian wrote:
> >> From: Thomas Gleixner 
> >> All I can find is drivers/iommu/virtio-iommu.c but I can't find anything
> >> vIR related there.
> >
> > Well, virtio-iommu is a para-virtualized vIOMMU implementations.
> >
> > In reality there are also fully emulated vIOMMU implementations (e.g.
> > Qemu fully emulates Intel/AMD/ARM IOMMUs). In those configurations
> > the IR logic in existing iommu drivers just apply:
> >
> > drivers/iommu/intel/irq_remapping.c
> > drivers/iommu/amd/iommu.c
> 
> thanks for the explanation. So that's a full IOMMU emulation. I was more
> expecting a paravirtualized lightweight one.

Kevin can you explain what on earth vIR is for and how does it work??

Obviously we don't expose the IR machinery to userspace, so at best
this is somehow changing what the MSI trap does?

Jason


Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-12 Thread Jason Gunthorpe via iommu
On Sun, Dec 12, 2021 at 08:44:46AM +0200, Mika Penttilä wrote:

> > /*
> >   * The MSIX mappable capability informs that MSIX data of a BAR can be 
> > mmapped
> >   * which allows direct access to non-MSIX registers which happened to be 
> > within
> >   * the same system page.
> >   *
> >   * Even though the userspace gets direct access to the MSIX data, the 
> > existing
> >   * VFIO_DEVICE_SET_IRQS interface must still be used for MSIX 
> > configuration.
> >   */
> > #define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE  3
> > 
> > IIRC this was introduced for PPC when a device has MSI-X in the same BAR as
> > other MMIO registers. Trapping MSI-X leads to performance downgrade on
> > accesses to adjacent registers. MSI-X can be mapped by userspace because
> > PPC already uses a hypercall mechanism for interrupt. Though unclear about
> > the detail it sounds a similar usage as proposed here.
> > 
> > Thanks
> > Kevin
>
> I see  VFIO_REGION_INFO_CAP_MSIX_MAPPABLE is always set so if msix table is
> in its own bar, qemu never traps/emulates the access. 

It is some backwards compat; the kernel always sets it to indicate a
new kernel. That doesn't mean qemu doesn't trap.

As the comment says, "the VFIO_DEVICE_SET_IRQS interface must still be
used for MSIX configuration", so there is no way qemu can meet that
without either trapping the MSI page or using a special hypercall
(ppc).
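For reference, a hedged sketch of how userspace can check for that capability
today (error handling omitted; 'bar' is the region index of interest):

    #include <stdlib.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    static int msix_bar_is_mappable(int device_fd, unsigned int bar)
    {
        struct vfio_region_info hdr = { .argsz = sizeof(hdr), .index = bar };
        struct vfio_region_info *info;
        struct vfio_info_cap_header *cap;
        int found = 0;

        ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &hdr);
        if (!(hdr.flags & VFIO_REGION_INFO_FLAG_CAPS))
            return 0;

        /* re-query with room for the capability chain */
        info = calloc(1, hdr.argsz);
        info->argsz = hdr.argsz;
        info->index = bar;
        ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, info);

        if (info->cap_offset) {
            for (cap = (void *)info + info->cap_offset; ;
                 cap = (void *)info + cap->next) {
                if (cap->id == VFIO_REGION_INFO_CAP_MSIX_MAPPABLE) {
                    found = 1;
                    break;
                }
                if (!cap->next)
                    break;
            }
        }
        free(info);
        return found;
    }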

Jason

RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-12 Thread Thomas Gleixner
Kevin,

On Sun, Dec 12 2021 at 01:56, Kevin Tian wrote:
>> From: Thomas Gleixner 
>> All I can find is drivers/iommu/virtio-iommu.c but I can't find anything
>> vIR related there.
>
> Well, virtio-iommu is a para-virtualized vIOMMU implementations.
>
> In reality there are also fully emulated vIOMMU implementations (e.g.
> Qemu fully emulates Intel/AMD/ARM IOMMUs). In those configurations
> the IR logic in existing iommu drivers just apply:
>
>   drivers/iommu/intel/irq_remapping.c
>   drivers/iommu/amd/iommu.c

thanks for the explanation. So that's a full IOMMU emulation. I was more
expecting a paravirtualized lightweight one.

Thanks,

tglx


RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-12 Thread Thomas Gleixner
Kevin,

On Sun, Dec 12 2021 at 02:14, Kevin Tian wrote:
>> From: Thomas Gleixner 
> I just continue the thought practice along that direction to see what
> the host flow will be like (step by step). Looking at the current 
> implementation is just one necessary step in my thought practice to 
> help refine the picture. When I found something which may be 
> worth being aligned then I shared to avoid follow a wrong direction 
> too far.
>
> If both of your think it simply adds noise to this discussion, I can
> surely hold back and focus on 'concept' only.

All good. We _want_ your participation for sure. Comparing and
contrasting it to the existing flow is fine.

Thanks,

tglx


Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-12 Thread Mika Penttilä



On 10.12.2021 9.36, Tian, Kevin wrote:

From: Jason Gunthorpe 
Sent: Friday, December 10, 2021 4:59 AM

On Thu, Dec 09, 2021 at 09:32:42PM +0100, Thomas Gleixner wrote:

On Thu, Dec 09 2021 at 12:21, Jason Gunthorpe wrote:

On Thu, Dec 09, 2021 at 09:37:06AM +0100, Thomas Gleixner wrote:
If we keep the MSI emulation in the hypervisor then MSI != IMS.  The
MSI code needs to write a addr/data pair compatible with the emulation
and the IMS code needs to write an addr/data pair from the
hypercall. Seems like this scenario is best avoided!

 From this perspective I haven't connected how virtual interrupt
remapping helps in the guest? Is this a way to provide the hypercall
I'm imagining above?

That was my thought to avoid having different mechanisms.

The address/data pair is computed in two places:

   1) Activation of an interrupt
   2) Affinity setting on an interrupt

Both configure the IRTE when interrupt remapping is in place.

In both cases a vector is allocated in the vector domain and based on
the resulting target APIC / vector number pair the IRTE is
(re)configured.

So putting the hypercall into the vIRTE update is the obvious
place. Both activation and affinity setting can fail and propagate an
error code down to the originating caller.

Hmm?

Okay, I think I get it. Would be nice to have someone from intel
familiar with the vIOMMU protocols and qemu code remark what the
hypervisor side can look like.

There is a bit more work here, we'd have to change VFIO to somehow
entirely disconnect the kernel IRQ logic from the MSI table and
directly pass control of it to the guest after the hypervisor IOMMU IR
secures it. ie directly mmap the msi-x table into the guest


It's supported already:

/*
  * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
  * which allows direct access to non-MSIX registers which happened to be within
  * the same system page.
  *
  * Even though the userspace gets direct access to the MSIX data, the existing
  * VFIO_DEVICE_SET_IRQS interface must still be used for MSIX configuration.
  */
#define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE  3

IIRC this was introduced for PPC when a device has MSI-X in the same BAR as
other MMIO registers. Trapping MSI-X leads to performance downgrade on
accesses to adjacent registers. MSI-X can be mapped by userspace because
PPC already uses a hypercall mechanism for interrupt. Though unclear about
the detail it sounds a similar usage as proposed here.

Thanks
Kevin


I see VFIO_REGION_INFO_CAP_MSIX_MAPPABLE is always set, so if the msix table 
is in its own bar, qemu never traps/emulates the access. On the other 
hand, qemu is said to depend on emulating masking. So how is this 
supposed to work, in case the table is not in the config bar?


Thanks,
Mika



RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-11 Thread Tian, Kevin
Hi, Thomas,

> From: Thomas Gleixner 
> Sent: Sunday, December 12, 2021 8:12 AM
> 
> Kevin,
> 
> On Sat, Dec 11 2021 at 07:52, Kevin Tian wrote:
> >> From: Jason Gunthorpe 
> >> > Then Qemu needs to find out the GSI number for the vIRTE handle.
> >> > Again Qemu doesn't have such information since it doesn't know
> >> > which MSI[-X] entry points to this handle due to no trap.
> >>
> >> No this is already going wrong. qemu *cannot* know the MSI information
> >> because there is no MSI information for IMS.
> >
> > I haven't thought of IMS at this step. The IR approach applies to
> > all types of interrupt storages, thus I'm more interested in how it
> > affect the storages which are already virtualized today (MSI[-X]
> > in my thought practice).
> 
> They are not any different. As I explained several times now IMS is
> nothing new at all. It existed since the invention of Message Signaled
> interrupts. Why?
> 
> The principle behind Message Signaled Interrupts is:
> 
> Device writes DATA to ADDRESS which raises an interrupt in a CPU
> 
> Message Signaled Interrupts obviously need some place to store the
> ADDRESS/DATA pair so that the device can use it for raising an
> interrupt, i.e. an
> 
>Interrupt Message Store, short IMS.
> 
> PCI/MSI was the first implementation of this and the storage was defined
> to be at a specified and therefore uniform and device independent place.
> 
> PCI/MSI-X followed the same approch. While it solved quite some of the
> shortcomings of PCI/MSI it still has a specificed and uniform and device
> independent place to store the message (ADDRESS/DATA pair)
> 
> Now the PCI wizards figured out that PCI/MSI[-X] is not longer up to the
> task for various reasons and came up with the revolutionary new concept
> of IMS, aka Interrupt Message Store. where the device defines where the
> message is stored.
> 
> IOW, this is coming back full circle to the original problem of where to
> store the message, i.e. the ADDRESS/DATA pair so that the device can
> raise an interrupt in a CPU, which requires - drum roll - an
> 
>Interrupt Message Store, short IMS.
> 
> So you simply have to look at it from a pure MSI (not PCI/MSI) point
> of view:
> 
>MSI at the conceptual level requires storage for the ADDRESS/DATA
>pair at some place so that the device or the compute unit embedded in
>the device can write DATA to ADDRESS.
> 
> That's it. Not more, not less.
> 
> When you look at it from this perspective, then you'll realize that
> 
>  PCI/MSI and PCI/MSI-X are just implementations of IMS
> 
> Not more, not less. The fact that they have very strict rules about the
> storage space and the fact that they are mutually exclusive does not
> change that at all.
> 
> That's where a lot of the confusion comes from. If you go back to all
> the IDXD/IMS discussions which happened over time then you'll figure out
> that _all_ of us where coming from the same wrong assumption:
> 
> IMS is new and it's just another exclusive variant of PCI/MSI and
> PCi/MSI-X.
> 
> It took _all_ of us quite some time to realize that we need to look at
> it from the other way around.
> 
> There was surely some other conceptual confusion vs. subdevices, queues
> and whatever involved which contributed to that. Water under the bridge.
> 
> Coming back to your initial question:
> 
> > I haven't thought of IMS at this step. The IR approach applies to
> > all types of interrupt storages, thus I'm more interested in how it
> > affect the storages which are already virtualized today (MSI[-X]
> > in my thought practice).
> 
> Stop focussing on implementation details. Focus on the general concept
> instead. See above.
> 

I have no problem understanding the above relation. This has
been acked multiple times in my previous replies.

I just continue the thought practice along that direction to see what
the host flow will be like (step by step). Looking at the current
implementation is just one necessary step in that thought practice to
help refine the picture. When I find something which may be
worth aligning on, I share it to avoid following a wrong direction
too far.

If both of you think it simply adds noise to this discussion, I can
surely hold back and focus on 'concept' only.

Thanks
Kevin


RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-11 Thread Tian, Kevin
> From: Thomas Gleixner 
> Sent: Saturday, December 11, 2021 9:05 PM
> 
> Kevin,
> 
> On Sat, Dec 11 2021 at 07:44, Kevin Tian wrote:
> >> From: Thomas Gleixner 
> >> On Fri, Dec 10 2021 at 08:39, Jason Gunthorpe wrote:
> >> > It is clever, we don't have an vIOMMU that supplies vIR today, so by
> >> > definition all guests are excluded and only bare metal works.
> >>
> >> Dammit. Now you spilled the beans. :)
> >
> > Unfortunately we do have that today. Qemu supports IR for
> > both AMD and Intel vIOMMU.
> 
> can you point me to the code?
> 
> All I can find is drivers/iommu/virtio-iommu.c but I can't find anything
> vIR related there.
> 

Well, virtio-iommu is a para-virtualized vIOMMU implementation.

In reality there are also fully emulated vIOMMU implementations (e.g.
Qemu fully emulates the Intel/AMD/ARM IOMMUs). In those configurations
the IR logic in the existing iommu drivers just applies:

drivers/iommu/intel/irq_remapping.c
drivers/iommu/amd/iommu.c
...

As I replied in another mail, the first vIR implementation was introduced
to Qemu back in 2016:

commit 1121e0afdcfa0cd40e36bd3acff56a3fac4f70fd
Author: Peter Xu 
Date:   Thu Jul 14 13:56:13 2016 +0800

x86-iommu: introduce "intremap" property

Adding one property for intel-iommu devices to specify whether we should
support interrupt remapping. By default, IR is disabled. To enable it,
we should use (take Intel IOMMU as example):

  -device intel_iommu,intremap=on

This property can be shared by Intel and future AMD IOMMUs.

Signed-off-by: Peter Xu 
Reviewed-by: Michael S. Tsirkin 
Signed-off-by: Michael S. Tsirkin 

Thanks
Kevin


RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-11 Thread Thomas Gleixner
Kevin,

On Sat, Dec 11 2021 at 07:52, Kevin Tian wrote:
>> From: Jason Gunthorpe 
>> > Then Qemu needs to find out the GSI number for the vIRTE handle.
>> > Again Qemu doesn't have such information since it doesn't know
>> > which MSI[-X] entry points to this handle due to no trap.
>> 
>> No this is already going wrong. qemu *cannot* know the MSI information
>> because there is no MSI information for IMS.
>
> I haven't thought of IMS at this step. The IR approach applies to
> all types of interrupt storages, thus I'm more interested in how it
> affect the storages which are already virtualized today (MSI[-X] 
> in my thought practice).

They are not any different. As I explained several times now, IMS is
nothing new at all. It has existed since the invention of Message Signaled
Interrupts. Why?

The principle behind Message Signaled Interrupts is:

Device writes DATA to ADDRESS which raises an interrupt in a CPU

Message Signaled Interrupts obviously need some place to store the
ADDRESS/DATA pair so that the device can use it for raising an
interrupt, i.e. an

   Interrupt Message Store, short IMS.

PCI/MSI was the first implementation of this and the storage was defined
to be at a specified and therefore uniform and device independent place.

PCI/MSI-X followed the same approach. While it solved quite some of the
shortcomings of PCI/MSI it still has a specified, uniform and device
independent place to store the message (ADDRESS/DATA pair).

Now the PCI wizards figured out that PCI/MSI[-X] is no longer up to the
task for various reasons and came up with the revolutionary new concept
of IMS, aka Interrupt Message Store, where the device defines where the
message is stored.

IOW, this is coming back full circle to the original problem of where to
store the message, i.e. the ADDRESS/DATA pair so that the device can
raise an interrupt in a CPU, which requires - drum roll - an

   Interrupt Message Store, short IMS.

So you simply have to look at it from a pure MSI (not PCI/MSI) point
of view:

   MSI at the conceptual level requires storage for the ADDRESS/DATA
   pair at some place so that the device or the compute unit embedded in
   the device can write DATA to ADDRESS.

That's it. Not more, not less.

When you look at it from this perspective, then you'll realize that

 PCI/MSI and PCI/MSI-X are just implementations of IMS

Not more, not less. The fact that they have very strict rules about the
storage space and the fact that they are mutually exclusive does not
change that at all.

That's where a lot of the confusion comes from. If you go back to all
the IDXD/IMS discussions which happened over time then you'll figure out
that _all_ of us were coming from the same wrong assumption:

IMS is new and it's just another exclusive variant of PCI/MSI and
PCI/MSI-X.

It took _all_ of us quite some time to realize that we need to look at
it from the other way around.

There was surely some other conceptual confusion vs. subdevices, queues
and whatever involved which contributed to that. Water under the bridge.

Coming back to your initial question:

> I haven't thought of IMS at this step. The IR approach applies to
> all types of interrupt storages, thus I'm more interested in how it
> affect the storages which are already virtualized today (MSI[-X] 
> in my thought practice).

Stop focussing on implementation details. Focus on the general concept
instead. See above.

Thanks,

tglx



RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-11 Thread Thomas Gleixner
Kevin,

On Sat, Dec 11 2021 at 07:44, Kevin Tian wrote:
>> From: Thomas Gleixner 
>> On Fri, Dec 10 2021 at 08:39, Jason Gunthorpe wrote:
>> > It is clever, we don't have an vIOMMU that supplies vIR today, so by
>> > definition all guests are excluded and only bare metal works.
>> 
>> Dammit. Now you spilled the beans. :)
>
> Unfortunately we do have that today. Qemu supports IR for
> both AMD and Intel vIOMMU.

can you point me to the code?

All I can find is drivers/iommu/virtio-iommu.c but I can't find anything
vIR related there.

Thanks,

tglx


RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-11 Thread Tian, Kevin
> From: Thomas Gleixner 
> Sent: Friday, December 10, 2021 8:13 PM
> 
> >>   5) It's not possible for the kernel to reliably detect whether it is
> >>  running on bare metal or not. Yes we talked about heuristics, but
> >>  that's something I really want to avoid.
> >
> > How would the hypercall mechanism avoid such heuristics?
> 
> The availability of IR remapping where the irqdomain which is provided
> by the remapping unit signals that it supports this new scheme:
> 
>                 |--IO/APIC
>                 |--MSI
>  vector -- IR --|--MSI-X
>                 |--IMS
> 
> while the current scheme is:
> 
>                 |--IO/APIC
>  vector -- IR --|--PCI/MSI[-X]
> 
> or
> 
>                 |--IO/APIC
>  vector         |--PCI/MSI[-X]
> 
> So in the new scheme the IR domain will advertise new features which are
> not available on older kernels. The availability of these new features
> is the indicator for the interrupt subsystem and subsequently for PCI
> whether IMS is supported or not.
> 
> Bootup either finds an IR unit or not. In the bare metal case that's the
> usual hardware/firmware detection. In the guest case it's the
> availability of vIR including the required hypercall protocol.

Given we have vIR already, there are three scenarios:

1) Bare metal: IR (no hypercall, for sure)
2) VM: vIR (no hypercall, today)
3) VM: vIR (hypercall, tomorrow)

IMS should be allowed only for 1) and 3).

But how to differentiate 2) from 1) without guest heuristics?

btw I checked Qemu history to find vIR was introduced in 2016:

commit 1121e0afdcfa0cd40e36bd3acff56a3fac4f70fd
Author: Peter Xu 
Date:   Thu Jul 14 13:56:13 2016 +0800

x86-iommu: introduce "intremap" property

Adding one property for intel-iommu devices to specify whether we should
support interrupt remapping. By default, IR is disabled. To enable it,
we should use (take Intel IOMMU as example):

  -device intel_iommu,intremap=on

This property can be shared by Intel and future AMD IOMMUs.

Signed-off-by: Peter Xu 
Reviewed-by: Michael S. Tsirkin 
Signed-off-by: Michael S. Tsirkin 

> 
> > Then Qemu needs to find out the GSI number for the vIRTE handle.
> > Again Qemu doesn't have such information since it doesn't know
> > which MSI[-X] entry points to this handle due to no trap.
> >
> > This implies that we may also need carry device ID, #msi entry, etc.
> > in the hypercall, so Qemu can associate the virtual routing info
> > to the right [irqfd, gsi].
> >
> > In your model the hypercall is raised by IR domain. Do you see
> > any problem of finding those information within IR domain?
> 
> IR has the following information available:
> 
>Interrupt type
> - MSI:   Device, index and number of vectors
> - MSI-X: Device, index
> - IMS:   Device, index
> 
>   Target APIC/vector pair
> 
> IMS: The index depends on the storage type:
> 
>  For storage in device memory, e.g. IDXD, it's the array index.
> 
>  For storage in system memory, the index is a software artifact.
> 
> Does that answer your question?
> 

Yes.

Thanks
Kevin


RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-10 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Friday, December 10, 2021 8:40 PM
> 
> 
> > Then Qemu needs to find out the GSI number for the vIRTE handle.
> > Again Qemu doesn't have such information since it doesn't know
> > which MSI[-X] entry points to this handle due to no trap.
> 
> No this is already going wrong. qemu *cannot* know the MSI information
> because there is no MSI information for IMS.

I haven't thought of IMS at this step. The IR approach applies to
all types of interrupt storages, thus I'm more interested in how it
affects the storages which are already virtualized today (MSI[-X] 
in my thought practice).

> 
> All qemu should get is the origin device information and data about
> how the guest wants the interrupt setup.

Yes, this is what I concluded in my previous mail. The hypercall should
provide that information to the host and then get the addr/data pair
back from the host.

> 
> Forget about guests and all of this complexity, design how to make
> VFIO work with IMS in pure userspace like DPDK.
> 
> We must have a VFIO ioctl to acquire a addr/data pair and link it to
> an event fd.

No question on it. Still need more thought down this road.

Thanks
Kevin


RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-10 Thread Tian, Kevin
> From: Thomas Gleixner 
> Sent: Saturday, December 11, 2021 3:00 AM
> 
> Jason,
> 
> On Fri, Dec 10 2021 at 08:39, Jason Gunthorpe wrote:
> 
> > On Fri, Dec 10, 2021 at 07:29:01AM +, Tian, Kevin wrote:
> >> >   5) It's not possible for the kernel to reliably detect whether it is
> >> >  running on bare metal or not. Yes we talked about heuristics, but
> >> >  that's something I really want to avoid.
> >>
> >> How would the hypercall mechanism avoid such heuristics?
> >
> > It is clever, we don't have an vIOMMU that supplies vIR today, so by
> > definition all guests are excluded and only bare metal works.
> 
> Dammit. Now you spilled the beans. :)
> 

Unfortunately we do have that today. Qemu supports IR for
both AMD and Intel vIOMMU.

Thanks
Kevin


Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-10 Thread Thomas Gleixner
Jason,

On Fri, Dec 10 2021 at 08:39, Jason Gunthorpe wrote:

> On Fri, Dec 10, 2021 at 07:29:01AM +, Tian, Kevin wrote:
>> >   5) It's not possible for the kernel to reliably detect whether it is
>> >  running on bare metal or not. Yes we talked about heuristics, but
>> >  that's something I really want to avoid.
>> 
>> How would the hypercall mechanism avoid such heuristics?
>
> It is clever, we don't have an vIOMMU that supplies vIR today, so by
> definition all guests are excluded and only bare metal works.

Dammit. Now you spilled the beans. :)

>> > The charm is that his works for everything from INTx to IMS because all
>> > of them go through the same procedure, except that INTx (IO/APIC) does
>> > not support the reservation mode dance.
>
> Do we even have vIOAPIC?

It does not matter much. INTx via IOAPIC is different anyway because
INTx is shared so it's unclear from which device it originates.

> It seems reasonable - do you have any idea how this all would work on
> ARM too? IMS on baremetal ARM is surely interesting. I assume they
> have a similar issue with trapping the MSI

On baremetal it should just work once ARM is converted over. No idea
about guests. Marc should know.

>> Then Qemu needs to find out the GSI number for the vIRTE handle. 
>> Again Qemu doesn't have such information since it doesn't know 
>> which MSI[-X] entry points to this handle due to no trap.
>
> No this is already going wrong. qemu *cannot* know the MSI information
> because there is no MSI information for IMS.
>
> All qemu should get is the origin device information and data about
> how the guest wants the interrupt setup.
>
> Forget about guests and all of this complexity, design how to make
> VFIO work with IMS in pure userspace like DPDK.
>
> We must have a VFIO ioctl to acquire a addr/data pair and link it to
> an event fd.
>
> I'm not sure exactly how this should be done, it is 90% of what IMS
> is, except the VFIO irq_chip cannot touch any real HW and certainly
> cannot do mask/unmask..
>
> Maybe that is OK now that it requires IR?

IR unfortunately does not allow masking, but we surely can come up with some
emergency button to press when e.g. an interrupt storm is detected.

Thanks,

tglx


Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-10 Thread Jason Gunthorpe via iommu
On Fri, Dec 10, 2021 at 07:29:01AM +, Tian, Kevin wrote:
> >   5) It's not possible for the kernel to reliably detect whether it is
> >  running on bare metal or not. Yes we talked about heuristics, but
> >  that's something I really want to avoid.
> 
> How would the hypercall mechanism avoid such heuristics?

It is clever, we don't have an vIOMMU that supplies vIR today, so by
definition all guests are excluded and only bare metal works.

> > The charm is that his works for everything from INTx to IMS because all
> > of them go through the same procedure, except that INTx (IO/APIC) does
> > not support the reservation mode dance.

Do we even have vIOAPIC?

> > Thoughts?

It seems reasonable - do you have any idea how this all would work on
ARM too? IMS on baremetal ARM is surely interesting. I assume they
have a similar issue with trapping the MSI

> Then Qemu needs to find out the GSI number for the vIRTE handle. 
> Again Qemu doesn't have such information since it doesn't know 
> which MSI[-X] entry points to this handle due to no trap.

No this is already going wrong. qemu *cannot* know the MSI information
because there is no MSI information for IMS.

All qemu should get is the origin device information and data about
how the guest wants the interrupt setup.

Forget about guests and all of this complexity, design how to make
VFIO work with IMS in pure userspace like DPDK.

We must have a VFIO ioctl to acquire a addr/data pair and link it to
an event fd.

I'm not sure exactly how this should be done, it is 90% of what IMS
is, except the VFIO irq_chip cannot touch any real HW and certainly
cannot do mask/unmask..

Maybe that is OK now that it requires IR?
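To make that ioctl idea concrete, a purely hypothetical sketch of what the
payload could look like (no such interface exists; all names below are made
up for illustration):

    struct vfio_ims_vector {    /* hypothetical */
        uint32_t argsz;
        uint32_t flags;
        int32_t  eventfd;   /* in:  signalled when the interrupt fires      */
        uint32_t index;     /* in:  which vector of the (sub)device         */
        uint64_t msg_addr;  /* out: addr/data pair owned by the kernel,     */
        uint32_t msg_data;  /*      handed to DPDK-style userspace or to a  */
        uint32_t pad;       /*      guest via a hypercall                   */
    };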

Jason


Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-10 Thread Jason Gunthorpe via iommu
On Fri, Dec 10, 2021 at 07:36:12AM +, Tian, Kevin wrote:

> /*
>  * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
>  * which allows direct access to non-MSIX registers which happened to be 
> within
>  * the same system page.
>  *
>  * Even though the userspace gets direct access to the MSIX data, the existing
>  * VFIO_DEVICE_SET_IRQS interface must still be used for MSIX configuration.
 ^

It is this I think we don't want; there should be no hypervisor
involvement in installing the addr/data pair.

Guessing by what I saw in PPC there is a hypercall to install an MSI
and the guest never touches the page even though it is
mapped. Presumably there is some IR-like thing making this secure.

PPC has the same basic problem; I think they have a hypercall to
install an MSI, not a hypercall to provision a generic addr/data pair.

Jason


RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-10 Thread Thomas Gleixner
Kevin,

On Fri, Dec 10 2021 at 07:29, Kevin Tian wrote:
>> From: Thomas Gleixner 
>>   4) For the guest side we agreed that we need an hypercall because the
>>  host can't trap the write to the MSI[-X] entry anymore.
>
> To be accurate I'd like to not call it "can't trap". The host still traps the 
> MSI/MSI-X entry if the hypercall is not used. This is for current guest 
> OS which doesn't have this hypercall mechanism. For future new guest
> OS which will support this machinery then a handshake process from
> such guest will disable the trap for MSI-X and map it for direct guest
> access in the fly.

Right. What I'm suggesting is a clear cut between the current approach,
which obviously needs to be preserved, and a consistent new approach
which handles MSI, MSI-X and IMS in the exactly same way.

> MSI has to be always trapped although the guest has acquired the right 
> data/addr pair via the hypercall, since it's a PCI config capability.
>
>> 
>>  Aside of the fact that this creates a special case for IMS which is
>>  undesirable in my opinion, it's not really obvious where the
>>  hypercall should be placed to work for all scenarios so that it can
>>  also solve the existing issue of silent failures.
>> 
>>   5) It's not possible for the kernel to reliably detect whether it is
>>  running on bare metal or not. Yes we talked about heuristics, but
>>  that's something I really want to avoid.
>
> How would the hypercall mechanism avoid such heuristics?

The availability of interrupt remapping, where the irqdomain provided
by the remapping unit signals that it supports this new scheme:

                |--IO/APIC
                |--MSI
 vector -- IR --|--MSI-X
                |--IMS

while the current scheme is:

                |--IO/APIC
 vector -- IR --|--PCI/MSI[-X]

or 

                |--IO/APIC
 vector         |--PCI/MSI[-X]

So in the new scheme the IR domain will advertise new features which are
not available on older kernels. The availability of these new features
is the indicator for the interrupt subsystem and subsequently for PCI
whether IMS is supported or not.

Bootup either finds an IR unit or not. In the bare metal case that's the
usual hardware/firmware detection. In the guest case it's the
availability of vIR including the required hypercall protocol.

So for the guest case the initialization would look like this:

   qemu starts guest
        Method of interrupt management defaults to the current scheme
        restricted to MSI/MSI-X

   guest initializes
        older guests do not check for the hypercall so everything
        continues as of today

        new guests initialize vIR, detect the hypercall and request
        that the hypervisor switch over to the new scheme.

        The hypervisor grants or rejects the request, i.e. either both
        switch to the new scheme or stay with the old one.

The new scheme means, that all variants, MSI, MSI-X, IMS are handled in
the same way. Staying on the old scheme means that IMS is not available
to the guest.

Having that clear separation is in my opinion way better than trying to
make all of that a big maze of conditionals.

I'm going to make that clear cut in the PCI/MSI management layer as
well. Trying to do the hybrid we do today for irqdomain and non-irqdomain
based backends is just not feasible. The decision which way to
go will be made very close to the driver-exposed API, i.e.:

   pci_alloc_ims_vector()
        if (new_scheme())
                return new_scheme_alloc_ims();
        else
                return -ENOTSUPP;

and new_scheme_alloc_ims() will have:

   new_scheme_alloc_ims()
        if (!system_supports_ims())
                return -ENOTSUPP;


system_supports_ims() makes its decision based on the feature flags
exposed by the underlying base MSI irqdomain, i.e. either vector or IR
on x86.

Vector domain will not have that feature flag set, but IR will have on
bare metal and eventually in the guest when the vIR setup and hypercall
detection and negotiation succeeds.
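A hedged sketch of that decision (MSI_FLAG_SUPPORTS_IMS is a made-up flag
name here and the domain lookup is simplified; it only illustrates "the IR
domain advertises a feature, the PCI core keys off it"):

    static bool system_supports_ims(struct pci_dev *pdev)
    {
        struct irq_domain *d = dev_get_msi_domain(&pdev->dev);
        struct msi_domain_info *info = d ? msi_get_domain_info(d) : NULL;

        /* vector domain: flag absent; IR domain (bare metal, or a guest
         * after vIR + hypercall negotiation): flag present */
        return info && (info->flags & MSI_FLAG_SUPPORTS_IMS);
    }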

> Then Qemu needs to find out the GSI number for the vIRTE handle. 
> Again Qemu doesn't have such information since it doesn't know 
> which MSI[-X] entry points to this handle due to no trap.
>
> This implies that we may also need carry device ID, #msi entry, etc. 
> in the hypercall, so Qemu can associate the virtual routing info
> to the right [irqfd, gsi].
>
> In your model the hypercall is raised by IR domain. Do you see
> any problem of finding those information within IR domain?

IR has the following information available:

   Interrupt type
     - MSI:   Device, index and number of vectors
     - MSI-X: Device, index
     - IMS:   Device, index

   Target APIC/vector pair

IMS: The index depends on the storage type:

 For storage in device memory, e.g. IDXD, it's the array index.

 For storage in system memory, the index is a software artifact.

Does that answer 

RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-09 Thread Tian, Kevin
> From: Thomas Gleixner 
> Sent: Thursday, December 9, 2021 11:58 PM
> 
> On Thu, Dec 09 2021 at 12:17, Kevin Tian wrote:
> >> From: Thomas Gleixner 
> >> I think you are looking at that from the internal implementation details
> >> of IDXD. But you can just model it in an IDXD implementation agnostic
> >> way:
> >>
> >>  ENQCMD(PASID, IMS-ENTRY,.)
> >
> > Not exactly IMS-ENTRY. MSI-ENTRY also works.
> 
> Sure.
> 
> >>
> >> implies an on demand allocation of a virtual queue, which is deallocated
> >> when the command completes. The PASID and IMS-ENTRY act as the
> 'queue'
> >> identifier.
> >>
> >> The implementation detail of IDXD that it executes these computations on
> >> an internal shared workqueue does not change that.
> >>
> >> Such a workqueue can only execute one enqueued command at a time,
> >> which
> >> means that during the execution of a particular command that IDXD
> >> internal workqueue represents the 'virtual queue' which is identified by
> >> the unique PASID/IMS-ENTRY pair.
> >
> > While it's one way of looking at this model do we want to actually
> > create some objects to represent this 'virtual queue' concept? that
> > implies each ENQCMD must be moderated to create such short-lifespan
> > objects and I'm not sure the benefit of doing so.
> 
> You don't have to create anything. The PASID/ENTRY pair represents that
> 'virtual queue', right?

Yes

> 
> > If not then from driver p.o.v it's still one queue resource and driver
> > needs to manage its association with multiple interrupt entries and
> > PASIDs when it's connected to multiple clients.
> 
> That's correct, but there is nothing problematic with it. It's like
> allocating multiple interrupts for any other hardware device or
> subdevice, right?

No question on this. I just wanted to point out such a usage example
since Jason mentioned it. 

> 
> What's probably more interresting is how the PASID/interrupt/RID
> relations are managed.
> 

Yes, that's something we need to think further about.

Thanks
Kevin

RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-09 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Friday, December 10, 2021 4:59 AM
> 
> On Thu, Dec 09, 2021 at 09:32:42PM +0100, Thomas Gleixner wrote:
> > On Thu, Dec 09 2021 at 12:21, Jason Gunthorpe wrote:
> > > On Thu, Dec 09, 2021 at 09:37:06AM +0100, Thomas Gleixner wrote:
> > > If we keep the MSI emulation in the hypervisor then MSI != IMS.  The
> > > MSI code needs to write a addr/data pair compatible with the emulation
> > > and the IMS code needs to write an addr/data pair from the
> > > hypercall. Seems like this scenario is best avoided!
> > >
> > > From this perspective I haven't connected how virtual interrupt
> > > remapping helps in the guest? Is this a way to provide the hypercall
> > > I'm imagining above?
> >
> > That was my thought to avoid having different mechanisms.
> >
> > The address/data pair is computed in two places:
> >
> >   1) Activation of an interrupt
> >   2) Affinity setting on an interrupt
> >
> > Both configure the IRTE when interrupt remapping is in place.
> >
> > In both cases a vector is allocated in the vector domain and based on
> > the resulting target APIC / vector number pair the IRTE is
> > (re)configured.
> >
> > So putting the hypercall into the vIRTE update is the obvious
> > place. Both activation and affinity setting can fail and propagate an
> > error code down to the originating caller.
> >
> > Hmm?
> 
> Okay, I think I get it. Would be nice to have someone from intel
> familiar with the vIOMMU protocols and qemu code remark what the
> hypervisor side can look like.
> 
> There is a bit more work here, we'd have to change VFIO to somehow
> entirely disconnect the kernel IRQ logic from the MSI table and
> directly pass control of it to the guest after the hypervisor IOMMU IR
> secures it. ie directly mmap the msi-x table into the guest
> 

It's supported already:

/*
 * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
 * which allows direct access to non-MSIX registers which happened to be within
 * the same system page.
 *
 * Even though the userspace gets direct access to the MSIX data, the existing
 * VFIO_DEVICE_SET_IRQS interface must still be used for MSIX configuration.
 */
#define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE  3

IIRC this was introduced for PPC when a device has MSI-X in the same BAR as
other MMIO registers. Trapping MSI-X leads to a performance downgrade on
accesses to adjacent registers. MSI-X can be mapped by userspace because
PPC already uses a hypercall mechanism for interrupts. Though I'm unclear
about the details, it sounds like a similar usage to what is proposed here.

Thanks
Kevin


RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-09 Thread Tian, Kevin
> From: Thomas Gleixner 
> Sent: Friday, December 10, 2021 8:26 AM
> 
> On Thu, Dec 09 2021 at 23:09, Thomas Gleixner wrote:
> > On Thu, Dec 09 2021 at 16:58, Jason Gunthorpe wrote:
> >> Okay, I think I get it. Would be nice to have someone from intel
> >> familiar with the vIOMMU protocols and qemu code remark what the
> >> hypervisor side can look like.
> >>
> >> There is a bit more work here, we'd have to change VFIO to somehow
> >> entirely disconnect the kernel IRQ logic from the MSI table and
> >> directly pass control of it to the guest after the hypervisor IOMMU IR
> >> secures it. ie directly mmap the msi-x table into the guest
> >
> > That makes everything consistent and a clear cut on all levels, right?
> 
> Let me give a bit more rationale here, why I think this is the right
> thing to do. There are several problems with IMS both on the host and on
> the guest side:
> 
>   1) Contrary to MSI/MSI-X the address/data pair is not completely
>  managed by the core. It's handed off to driver writers in the
>  hope they get it right.
> 
>   2) Without interrupt remapping there is a fundamental issue on x86
>  for the affinity setting case, as there is no guarantee that
>  the magic protocol which we came up with (see msi_set_affinity()
>  in the x86 code) is correctly implemented at the driver level or
>  that the update is truly atomic so that the problem does not
>  arise. My interrest in chasing these things is exactly zero.
> 
>  With interrupt remapping the affinity change happens at the IRTE
>  level and not at the device level. It's a one time setup for the
>  device.

This is a good rationale for making IMS depend on IR on native.

> 
>  Just for the record:
> 
>  The ATH11 thing does not have that problem by pure luck because
>  multi-vector MSI is not supported on X86 unless interrupt
>  remapping is enabled.
> 
>  The switchtec NTB thing will fall apart w/o remapping AFAICT.
> 
>   3) With remapping the message for the device is constructed at
>  allocation time. It does not change after that because the affinity
>  change happens at the remapping level, which eliminates #2 above.
> 
>  That has another advantage for IMS because it does not require any
>  synchronization with the queue or whatever is involved. The next
>  interrupt after the change at the remapping level ends up on the
>  new target.

Yes

> 
>   4) For the guest side we agreed that we need an hypercall because the
>  host can't trap the write to the MSI[-X] entry anymore.

To be accurate, I'd rather not call it "can't trap". The host still traps the 
MSI/MSI-X entry if the hypercall is not used; this covers current guest 
OSes which don't have this hypercall mechanism. For a future guest
OS which supports this machinery, a handshake process from
such a guest will disable the trap for MSI-X and map it for direct guest
access on the fly.

MSI always has to be trapped, even though the guest has acquired the right 
data/addr pair via the hypercall, since it's a PCI config capability.

> 
>  Aside of the fact that this creates a special case for IMS which is
>  undesirable in my opinion, it's not really obvious where the
>  hypercall should be placed to work for all scenarios so that it can
>  also solve the existing issue of silent failures.
> 
>   5) It's not possible for the kernel to reliably detect whether it is
>  running on bare metal or not. Yes we talked about heuristics, but
>  that's something I really want to avoid.

How would the hypercall mechanism avoid such heuristics?

> 
> When looking at the above I came to the conclusion that the consistent
> way is to make IMS depend on IR both on the host and the guest as this
> solves all of the above in one go.
> 
> How would that work? With IR the irqdomain hierarchy looks like this:
> 
>                |--IO/APIC
>                |--MSI
> vector -- IR --|--MSI-X
>                |--IMS
> 
> There are several context where this matters:
> 
>   1) Allocation of an interrupt, e.g. pci_alloc_irq_vectors().
> 
>   2) Activation of an interrupt which happens during allocation and/or
>  at request_irq() time
> 
>   3) Interrupt affinity setting
> 
> #1 Allocation
> 
>That allocates an IRTE, which can fail
> 
> #2 Activation
> 
>That's the step where actually a CPU vector is allocated, where the
>IRTE is updated and where the device message is composed to target
>the IRTE.
> 
>On X86 activation is happening twice:
> 
>1) During allocation it allocates a special CPU vector which is
>   handed out to all allocated interrupts. That's called reservation
>   mode. This was introduced to prevent vector exhaustion for two
>   cases:
> 
>- Devices allocating tons of MSI-X vectors without using
>  them. That obviously needs to be fixed at the device driver
>  level, but due to the fact that 

Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-09 Thread Thomas Gleixner
On Thu, Dec 09 2021 at 23:09, Thomas Gleixner wrote:
> On Thu, Dec 09 2021 at 16:58, Jason Gunthorpe wrote:
>> Okay, I think I get it. Would be nice to have someone from intel
>> familiar with the vIOMMU protocols and qemu code remark what the
>> hypervisor side can look like.
>>
>> There is a bit more work here, we'd have to change VFIO to somehow
>> entirely disconnect the kernel IRQ logic from the MSI table and
>> directly pass control of it to the guest after the hypervisor IOMMU IR
>> secures it. ie directly mmap the msi-x table into the guest
>
> That makes everything consistent and a clear cut on all levels, right?

Let me give a bit more rationale here, why I think this is the right
thing to do. There are several problems with IMS both on the host and on
the guest side:

  1) Contrary to MSI/MSI-X the address/data pair is not completely
 managed by the core. It's handed off to driver writers in the
 hope they get it right.

  2) Without interrupt remapping there is a fundamental issue on x86
 for the affinity setting case, as there is no guarantee that
 the magic protocol which we came up with (see msi_set_affinity()
 in the x86 code) is correctly implemented at the driver level or
 that the update is truly atomic so that the problem does not
 arise. My interest in chasing these things is exactly zero.

 With interrupt remapping the affinity change happens at the IRTE
 level and not at the device level. It's a one time setup for the
 device.

 Just for the record:

 The ATH11 thing does not have that problem by pure luck because
 multi-vector MSI is not supported on X86 unless interrupt
 remapping is enabled. 

 The switchtec NTB thing will fall apart w/o remapping AFAICT.

  3) With remapping the message for the device is constructed at
 allocation time. It does not change after that because the affinity
 change happens at the remapping level, which eliminates #2 above.

 That has another advantage for IMS because it does not require any
 synchronization with the queue or whatever is involved. The next
 interrupt after the change at the remapping level ends up on the
 new target.

  4) For the guest side we agreed that we need an hypercall because the
 host can't trap the write to the MSI[-X] entry anymore.

 Aside of the fact that this creates a special case for IMS which is
 undesirable in my opinion, it's not really obvious where the
 hypercall should be placed to work for all scenarios so that it can
 also solve the existing issue of silent failures.

  5) It's not possible for the kernel to reliably detect whether it is
 running on bare metal or not. Yes we talked about heuristics, but
 that's something I really want to avoid.

When looking at the above I came to the conclusion that the consistent
way is to make IMS depend on IR both on the host and the guest as this
solves all of the above in one go.

How would that work? With IR the irqdomain hierarchy looks like this:

               |--IO/APIC
               |--MSI
vector -- IR --|--MSI-X
               |--IMS

There are several context where this matters:

  1) Allocation of an interrupt, e.g. pci_alloc_irq_vectors().

  2) Activation of an interrupt which happens during allocation and/or
 at request_irq() time

  3) Interrupt affinity setting

#1 Allocation

   That allocates an IRTE, which can fail

#2 Activation

   That's the step where actually a CPU vector is allocated, where the
   IRTE is updated and where the device message is composed to target
   the IRTE.

   On X86 activation is happening twice:

   1) During allocation it allocates a special CPU vector which is
  handed out to all allocated interrupts. That's called reservation
  mode. This was introduced to prevent vector exhaustion for two
  cases:
  
   - Devices allocating tons of MSI-X vectors without using
 them. That obviously needs to be fixed at the device driver
 level, but due to the fact that post probe() allocation is not
 supported, that's not always possible

   - CPU hotunplug

 All vectors targeting the outgoing CPU need to be migrated to a
 new target CPU, which can result in exhaustion of the vector
 space.

 Reservation mode avoids that because it just uses a unique
 vector for all interrupts which are allocated but not
 requested.

2) On request_irq()

   As the vector assigned during allocation is just a place holder
   to make the MSI hardware happy it needs to be replaced by a
   real vector.

   Both can fail and the error is propagated through the call chain

#3 Changing the interrupt affinity

   This obviously needs to allocate a new target CPU vector and update
   the IRTE.

   Allocating a new target CPU vector can fail.

When looking at it from the host side, then the host needs to do the
same things:


Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-09 Thread Thomas Gleixner
On Thu, Dec 09 2021 at 16:58, Jason Gunthorpe wrote:
> On Thu, Dec 09, 2021 at 09:32:42PM +0100, Thomas Gleixner wrote:
>> That was my thought to avoid having different mechanisms.
>> 
>> The address/data pair is computed in two places:
>> 
>>   1) Activation of an interrupt
>>   2) Affinity setting on an interrupt
>> 
>> Both configure the IRTE when interrupt remapping is in place.
>> 
>> In both cases a vector is allocated in the vector domain and based on
>> the resulting target APIC / vector number pair the IRTE is
>> (re)configured.
>> 
>> So putting the hypercall into the vIRTE update is the obvious
>> place. Both activation and affinity setting can fail and propagate an
>> error code down to the originating caller.
>> 
>> Hmm?
>
> Okay, I think I get it. Would be nice to have someone from intel
> familiar with the vIOMMU protocols and qemu code remark what the
> hypervisor side can look like.
>
> There is a bit more work here, we'd have to change VFIO to somehow
> entirely disconnect the kernel IRQ logic from the MSI table and
> directly pass control of it to the guest after the hypervisor IOMMU IR
> secures it. ie directly mmap the msi-x table into the guest

That makes everything consistent and a clear cut on all levels, right?

Thanks,

tglx



Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-09 Thread Jason Gunthorpe via iommu
On Thu, Dec 09, 2021 at 09:32:42PM +0100, Thomas Gleixner wrote:
> On Thu, Dec 09 2021 at 12:21, Jason Gunthorpe wrote:
> > On Thu, Dec 09, 2021 at 09:37:06AM +0100, Thomas Gleixner wrote:
> > If we keep the MSI emulation in the hypervisor then MSI != IMS.  The
> > MSI code needs to write a addr/data pair compatible with the emulation
> > and the IMS code needs to write an addr/data pair from the
> > hypercall. Seems like this scenario is best avoided!
> >
> > From this perspective I haven't connected how virtual interrupt
> > remapping helps in the guest? Is this a way to provide the hypercall
> > I'm imagining above?
> 
> That was my thought to avoid having different mechanisms.
> 
> The address/data pair is computed in two places:
> 
>   1) Activation of an interrupt
>   2) Affinity setting on an interrupt
> 
> Both configure the IRTE when interrupt remapping is in place.
> 
> In both cases a vector is allocated in the vector domain and based on
> the resulting target APIC / vector number pair the IRTE is
> (re)configured.
> 
> So putting the hypercall into the vIRTE update is the obvious
> place. Both activation and affinity setting can fail and propagate an
> error code down to the originating caller.
> 
> Hmm?

Okay, I think I get it. Would be nice to have someone from intel
familiar with the vIOMMU protocols and qemu code remark what the
hypervisor side can look like.

There is a bit more work here, we'd have to change VFIO to somehow
entirely disconnect the kernel IRQ logic from the MSI table and
directly pass control of it to the guest after the hypervisor IOMMU IR
secures it. ie directly mmap the msi-x table into the guest

Jason


Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-09 Thread Thomas Gleixner
On Thu, Dec 09 2021 at 12:21, Jason Gunthorpe wrote:
> On Thu, Dec 09, 2021 at 09:37:06AM +0100, Thomas Gleixner wrote:
> If we keep the MSI emulation in the hypervisor then MSI != IMS.  The
> MSI code needs to write a addr/data pair compatible with the emulation
> and the IMS code needs to write an addr/data pair from the
> hypercall. Seems like this scenario is best avoided!
>
> From this perspective I haven't connected how virtual interrupt
> remapping helps in the guest? Is this a way to provide the hypercall
> I'm imagining above?

That was my thought to avoid having different mechanisms.

The address/data pair is computed in two places:

  1) Activation of an interrupt
  2) Affinity setting on an interrupt

Both configure the IRTE when interrupt remapping is in place.

In both cases a vector is allocated in the vector domain and based on
the resulting target APIC / vector number pair the IRTE is
(re)configured.

So putting the hypercall into the vIRTE update is the obvious
place. Both activation and affinity setting can fail and propagate an
error code down to the originating caller.

Hmm?
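
Roughly, as a sketch only (nothing below is existing kernel code:
hv_update_virte() is an assumed hypercall and irqd_cfg() stands in for
whatever yields the target APIC/vector pair the parent vector domain
selected):

static int guest_ir_update_virte(struct irq_data *irqd)
{
	struct irq_cfg *cfg = irqd_cfg(irqd);	/* target APIC/vector pair */

	/* Assumed hypercall; may fail, e.g. on host-side vector shortage */
	return hv_update_virte(irqd_to_hwirq(irqd), cfg->dest_apicid,
			       cfg->vector);
}

static int guest_ir_activate(struct irq_domain *domain, struct irq_data *irqd,
			     bool reserve)
{
	/* The parent (vector) domain has been activated at this point */
	return guest_ir_update_virte(irqd);
}

static int guest_ir_set_affinity(struct irq_data *irqd,
				 const struct cpumask *mask, bool force)
{
	struct irq_data *parent = irqd->parent_data;
	int ret;

	/* Let the vector domain pick the new target first */
	ret = parent->chip->irq_set_affinity(parent, mask, force);
	if (ret < 0)
		return ret;

	return guest_ir_update_virte(irqd);
}

Either callback returning an error makes the failure visible to the
originating request_irq()/affinity caller.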

Thanks,

tglx




Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-09 Thread Jason Gunthorpe via iommu
On Thu, Dec 09, 2021 at 09:37:06AM +0100, Thomas Gleixner wrote:
> On Thu, Dec 09 2021 at 05:23, Kevin Tian wrote:
> >> From: Thomas Gleixner 
> >> I don't see anything wrong with that. A subdevice is its own entity and
> >> VFIO can choose the most convenient representation of it to the guest
> >> obviously.
> >> 
> >> How that is backed on the host does not really matter. You can expose
> >> MSI-X to the guest with an INTx backing as well.
> >> 
> >
> > Agree with this point. How the interrupts are represented to the guest
> > is orthogonal to how the backend resource is allocated. Physically MSI-X 
> > and IMS can be enabled simultaneously on an IDXD device. Once 
> > dynamic allocation is allowed for both, either one can be allocated for
> > a subdevice (with only difference on supported #subdevices). 
> >
> > When an interrupt resource is exposed to the guest with the same type 
> > (e.g. MSI-on-MSI or IMS-on-IMS), it can be also passed through to the 
> > guest as long as a hypercall machinery is in place to get addr/data pair 
> > from the host (as you suggested earlier).
> 
> As I pointed out in the conclusion of this thread, IMS is only going to
> be supported with interrupt remapping in place on both host and guest.
> 
> As these devices are requiring a vIOMMU on the guest anyway (PASID, User
> IO page tables), the required hypercalls are part of the vIOMMU/IR
> implementation. If you look at it from the irqdomain hierarchy view:

It is true for IDXD, but mlx5 will work without a PASID or vIOMMU in a
guest today, and there is no reason to imagine some future IMS would
have any different device requirements from MSI-X.

Though, vIOMMU operating in bypass mode seems like it is fine if it
helps the solution.

> If you look at the above hierarchy view then MSI/MSI-X/IMS are all
> treated in exactly the same way. It all becomes the common case.

Unfortunately in a guest they are not all the same - it is like the
PPC code I saw messing with MSI today - MSI setup is a hypercall,
either explicitly or implicitly by trapping device registers.

So MSI is special compared to everything else because MSI has that
hypervisor intrusion.

My ideal outcome would be to have the guest negotiate some new
capability with the hypervisor where the guest promises to use new
hypercalls to get addr/data pairs and the hypervisor removes all the
emulation around PCI MSI. Then MSI == IMS again and we can have
everything treated in exactly the same way. The hypervisor doesn't care
how the guest tells the origin device what the addr/data pair is.

This moves the hypervisor trap to setup the interrupt remapping from
the MSI emulation to the new hypercall.
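
As a rough sketch of that idea (all names made up, this is not an existing
API): once the capability is negotiated, the guest composes its messages via
a hypercall instead of a locally fabricated value:

static int guest_compose_msi_msg(struct irq_data *irqd, struct msi_msg *msg)
{
	/*
	 * Assumed hypercall: the hypervisor sets up/updates the remapping
	 * entry for this interrupt and hands back the real addr/data pair
	 * which the guest should program into the device.
	 */
	return hv_get_msi_msg(irqd_to_hwirq(irqd), msg);
}

Whatever comes back is then written to the PCI MSI/MSI-X registers or to the
device-specific IMS storage by the same unmodified write path, so MSI, MSI-X
and IMS are handled identically from the guest's point of view.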

If we keep the MSI emulation in the hypervisor then MSI != IMS.  The
MSI code needs to write an addr/data pair compatible with the emulation
and the IMS code needs to write an addr/data pair from the
hypercall. Seems like this scenario is best avoided!

From this perspective I haven't connected how virtual interrupt
remapping helps in the guest? Is this a way to provide the hypercall
I'm imagining above?

Jason


RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-09 Thread Thomas Gleixner
On Thu, Dec 09 2021 at 12:17, Kevin Tian wrote:
>> From: Thomas Gleixner 
>> I think you are looking at that from the internal implementation details
>> of IDXD. But you can just model it in an IDXD implementation agnostic
>> way:
>> 
>>  ENQCMD(PASID, IMS-ENTRY,.)
>
> Not exactly IMS-ENTRY. MSI-ENTRY also works.

Sure.

>> 
>> implies an on demand allocation of a virtual queue, which is deallocated
>> when the command completes. The PASID and IMS-ENTRY act as the 'queue'
>> identifier.
>> 
>> The implementation detail of IDXD that it executes these computations on
>> an internal shared workqueue does not change that.
>> 
>> Such a workqueue can only execute one enqueued command at a time,
>> which
>> means that during the execution of a particular command that IDXD
>> internal workqueue represents the 'virtual queue' which is identified by
>> the unique PASID/IMS-ENTRY pair.
>
> While it's one way of looking at this model do we want to actually
> create some objects to represent this 'virtual queue' concept? that
> implies each ENQCMD must be moderated to create such short-lifespan
> objects and I'm not sure the benefit of doing so.

You don't have to create anything. The PASID/ENTRY pair represents that
'virtual queue', right?

> If not then from driver p.o.v it's still one queue resource and driver 
> needs to manage its association with multiple interrupt entries and 
> PASIDs when it's connected to multiple clients.

That's correct, but there is nothing problematic with it. It's like
allocating multiple interrupts for any other hardware device or
subdevice, right?

What's probably more interesting is how the PASID/interrupt/RID
relations are managed.

Thanks,

tglx



RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-09 Thread Tian, Kevin
> From: Thomas Gleixner 
> Sent: Thursday, December 9, 2021 4:37 PM
> 
> On Thu, Dec 09 2021 at 05:23, Kevin Tian wrote:
> >> From: Thomas Gleixner 
> >> I don't see anything wrong with that. A subdevice is its own entity and
> >> VFIO can choose the most convenient representation of it to the guest
> >> obviously.
> >>
> >> How that is backed on the host does not really matter. You can expose
> >> MSI-X to the guest with an INTx backing as well.
> >>
> >
> > Agree with this point. How the interrupts are represented to the guest
> > is orthogonal to how the backend resource is allocated. Physically MSI-X
> > and IMS can be enabled simultaneously on an IDXD device. Once
> > dynamic allocation is allowed for both, either one can be allocated for
> > a subdevice (with only difference on supported #subdevices).
> >
> > When an interrupt resource is exposed to the guest with the same type
> > (e.g. MSI-on-MSI or IMS-on-IMS), it can be also passed through to the
> > guest as long as a hypercall machinery is in place to get addr/data pair
> > from the host (as you suggested earlier).
> 
> As I pointed out in the conclusion of this thread, IMS is only going to
> be supported with interrupt remapping in place on both host and guest.

I still need to read the last few mails but thanks for pointing it out now.

> 
> As these devices are requiring a vIOMMU on the guest anyway (PASID, User
> IO page tables), the required hypercalls are part of the vIOMMU/IR
> implementation. If you look at it from the irqdomain hierarchy view:
> 
>  |- PCI-MSI
>   VECTOR -- [v]IOMMU/IR -|- PCI-MSI-X
>  |- PCI-IMS
> 
> So host and guest use just the same representation which makes a ton of
> sense.
> 
> There are two places where this matters:
> 
>   1) The activate() callback of the IR domain
> 
>   2) The irq_set_affinity() callback of the irqchip associated with the
>  IR domain
> 
> Both callbacks are allowed to fail and the error code is handed back to
> the originating call site.
> 
> If you look at the above hierarchy view then MSI/MSI-X/IMS are all
> treated in exactly the same way. It all becomes the common case.
> 
> No?
> 

Yes, I think the above makes sense.

For a new guest OS which supports this enlightened hierarchy, the same
machinery works for all types of interrupt storage and we have a
failure path from host to guest in case of host-side resource shortage.
And no trap is required on guest access to the interrupt storage.

A legacy guest OS which doesn't support the enlightened hierarchy
can only use MSI/MSI-X, which is still trapped. But with the vector
reallocation support from your work the situation already improves
a lot over the current awkward way in VFIO (free all previous vectors
and then re-allocate).

Overall I think this is a good model.

Thanks
Kevin


RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-09 Thread Tian, Kevin
> From: Thomas Gleixner 
> Sent: Thursday, December 9, 2021 5:03 PM
> 
> Kevin,
> 
> On Thu, Dec 09 2021 at 06:26, Kevin Tian wrote:
> >> From: Jason Gunthorpe 
> >> I don't know of any use case for more than one interrupt on a queue,
> >> and if it did come up I'd probably approach it by making the queue
> >> handle above also specify the 'queue relative HW index'
> >
> > We have such use case with IDXD.
> >
> > Basically the IDXD queue allows software to put an interrupt handle
> > (the index of MSI-X or IMS entry) in the submitted descriptor. Upon
> > completion of the descriptor the hardware finds the specified entry
> > and then generate interrupt to notify software.
> >
> > Conceptually descriptors submitted to a same queue can use different
> > handles, implying one queue can be associated to multiple interrupts.
> 
> I think you are looking at that from the internal implementation details
> of IDXD. But you can just model it in an IDXD implementation agnostic
> way:
> 
>  ENQCMD(PASID, IMS-ENTRY,.)

Not exactly IMS-ENTRY. MSI-ENTRY also works.

> 
> implies an on demand allocation of a virtual queue, which is deallocated
> when the command completes. The PASID and IMS-ENTRY act as the 'queue'
> identifier.
> 
> The implementation detail of IDXD that it executes these computations on
> an internal shared workqueue does not change that.
> 
> Such a workqueue can only execute one enqueued command at a time,
> which
> means that during the execution of a particular command that IDXD
> internal workqueue represents the 'virtual queue' which is identified by
> the unique PASID/IMS-ENTRY pair.
> 
> No?
> 

While that is one way of looking at this model, do we want to actually
create objects to represent this 'virtual queue' concept? That implies
every ENQCMD must be mediated to create such short-lived objects, and
I'm not sure of the benefit of doing so.

If not, then from the driver's point of view it's still one queue resource,
and the driver needs to manage its association with multiple interrupt
entries and PASIDs when it's connected to multiple clients.

Thanks
Kevin


RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-09 Thread Thomas Gleixner
Kevin,

On Thu, Dec 09 2021 at 06:26, Kevin Tian wrote:
>> From: Jason Gunthorpe 
>> I don't know of any use case for more than one interrupt on a queue,
>> and if it did come up I'd probably approach it by making the queue
>> handle above also specify the 'queue relative HW index'
>
> We have such use case with IDXD.
>
> Basically the IDXD queue allows software to put an interrupt handle
> (the index of MSI-X or IMS entry) in the submitted descriptor. Upon
> completion of the descriptor the hardware finds the specified entry 
> and then generate interrupt to notify software.
>
> Conceptually descriptors submitted to a same queue can use different
> handles, implying one queue can be associated to multiple interrupts.

I think you are looking at that from the internal implementation details
of IDXD. But you can just model it in an IDXD implementation agnostic
way:

 ENQCMD(PASID, IMS-ENTRY,.)

implies an on demand allocation of a virtual queue, which is deallocated
when the command completes. The PASID and IMS-ENTRY act as the 'queue'
identifier.

The implementation detail of IDXD that it executes these computations on
an internal shared workqueue does not change that.

Such a workqueue can only execute one enqueued command at a time, which
means that during the execution of a particular command that IDXD
internal workqueue represents the 'virtual queue' which is identified by
the unique PASID/IMS-ENTRY pair.

No?

Thanks,

tglx


RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-09 Thread Thomas Gleixner
On Thu, Dec 09 2021 at 05:23, Kevin Tian wrote:
>> From: Thomas Gleixner 
>> I don't see anything wrong with that. A subdevice is its own entity and
>> VFIO can choose the most convenient representation of it to the guest
>> obviously.
>> 
>> How that is backed on the host does not really matter. You can expose
>> MSI-X to the guest with an INTx backing as well.
>> 
>
> Agree with this point. How the interrupts are represented to the guest
> is orthogonal to how the backend resource is allocated. Physically MSI-X 
> and IMS can be enabled simultaneously on an IDXD device. Once 
> dynamic allocation is allowed for both, either one can be allocated for
> a subdevice (with only difference on supported #subdevices). 
>
> When an interrupt resource is exposed to the guest with the same type 
> (e.g. MSI-on-MSI or IMS-on-IMS), it can be also passed through to the 
> guest as long as a hypercall machinery is in place to get addr/data pair 
> from the host (as you suggested earlier).

As I pointed out in the conclusion of this thread, IMS is only going to
be supported with interrupt remapping in place on both host and guest.

As these devices are requiring a vIOMMU on the guest anyway (PASID, User
IO page tables), the required hypercalls are part of the vIOMMU/IR
implementation. If you look at it from the irqdomain hierarchy view:

                         |- PCI-MSI
  VECTOR -- [v]IOMMU/IR -|- PCI-MSI-X
                         |- PCI-IMS

So host and guest use just the same representation which makes a ton of
sense.

There are two places where this matters:

  1) The activate() callback of the IR domain

  2) The irq_set_affinity() callback of the irqchip associated with the
 IR domain

Both callbacks are allowed to fail and the error code is handed back to
the originating call site.

If you look at the above hierarchy view then MSI/MSI-X/IMS are all
treated in exactly the same way. It all becomes the common case.

No?

Thanks,

tglx




RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-08 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Saturday, December 4, 2021 12:41 AM
> 
> > Or has each queue and controlblock and whatever access to a shared large
> > array where the messages are stored and the indices are handed out to
> > the queues and controlblocks?
> 
> > If each of them have their own small array, then queue relative indexing
> > makes a ton of sense, no?
> 
> Okay, I see.
> 
> I don't know of any use case for more than one interrupt on a queue,
> and if it did come up I'd probably approach it by making the queue
> handle above also specify the 'queue relative HW index'
> 

We have such a use case with IDXD.

Basically the IDXD queue allows software to put an interrupt handle
(the index of an MSI-X or IMS entry) in the submitted descriptor. Upon
completion of the descriptor the hardware looks up the specified entry
and generates an interrupt to notify software.

Conceptually, descriptors submitted to the same queue can use different
handles, implying that one queue can be associated with multiple interrupts.

One example is the shared work queue usage, which allows multiple
clients to submit descriptors directly and simultaneously to the
same queue using the ENQCMD(pasid, descriptor) instruction. In this
case each client can be allocated an interrupt entry (including the
client's PASID for a permission check when the HW generates the
completion interrupt) and then use this entry for all descriptors
submitted by that client.
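
As a rough illustration only (this is not the exact IDXD descriptor layout,
just a sketch of the idea), the relevant part of such a descriptor could
look like:

struct example_wq_desc {
	u32	opcode_flags;	/* operation + "interrupt on completion" flag */
	u16	int_handle;	/* MSI-X or IMS entry to signal on completion */
	u64	completion_addr;/* where HW writes the completion record */
	/* ... operation-specific payload ... */
};

The PASID itself travels implicitly with ENQCMD, so two clients submitting
to the same shared queue end up with distinct (PASID, interrupt handle)
pairs even though the queue is shared.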

I haven't finished reading this thread, but would like to point out
this usage so it is not ignored in the final rework. It basically
means one queue might be associated with multiple interrupt entries
and multiple PASIDs.

Thanks
Kevin

Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-08 Thread Jason Wang
On Thu, Dec 9, 2021 at 1:41 PM Tian, Kevin  wrote:
>
> > From: Jason Gunthorpe 
> > Sent: Thursday, December 2, 2021 9:55 PM
> >
> > Further, there is no reason why IMS should be reserved exclusively for
> > VFIO!
>
> This is correct. Just as what you agreed with Thomas, the only difference
> between IMS and MSI is on where the messages are stored. Physically
> it is unreasonable for the HW to check whether an interrupt is used for
> a specific software usage (e.g. virtualization) since it doesn't have such
> knowledge. At most an entry is associated to PASID, but again the PASID
> can be used anywhere.

Note that vDPA has planned to use IMS for the parent, which can have a
huge number of instances.

Thanks

>
> The wording in current IDXD spec is not clear on this part. We'll talk to
> the spec owner to have it fixed.
>
> Thanks
> Kevin
>



RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-08 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Thursday, December 2, 2021 9:55 PM
> 
> Further, there is no reason why IMS should be reserved exclusively for
> VFIO! 

This is correct. Just as you agreed with Thomas, the only difference
between IMS and MSI is where the messages are stored. Physically
it is unreasonable for the HW to check whether an interrupt is used for
a specific software purpose (e.g. virtualization) since it doesn't have
such knowledge. At most an entry is associated with a PASID, but again
the PASID can be used anywhere.

The wording in the current IDXD spec is not clear on this part. We'll talk to
the spec owner to have it fixed.

Thanks
Kevin


RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-08 Thread Tian, Kevin
> From: Thomas Gleixner 
> Sent: Thursday, December 2, 2021 5:45 AM
> 
> On Wed, Dec 01 2021 at 14:21, Dave Jiang wrote:
> > On 12/1/2021 1:25 PM, Thomas Gleixner wrote:
> >>> The hardware implementation does not have enough MSIX vectors for
> >>> guests. There are only 9 MSIX vectors total (8 for queues) and 2048 IMS
> >>> vectors. So if we are to do MSI-X for all of them, then we need to do
> >>> the IMS backed MSIX scheme rather than passthrough IMS to guests.
> >> Confused. Are you talking about passing a full IDXD device to the guest
> >> or about passing a carved out subdevice, aka. queue?
> >
> > I'm talking about carving out a subdevice. I had the impression of you
> > wanting IMS passed through for all variations. But it sounds like for a
> > sub-device, you are ok with the implementation of MSIX backed by IMS?
> 
> I don't see anything wrong with that. A subdevice is its own entity and
> VFIO can choose the most convenient representation of it to the guest
> obviously.
> 
> How that is backed on the host does not really matter. You can expose
> MSI-X to the guest with an INTx backing as well.
> 

Agreed on this point. How the interrupts are represented to the guest
is orthogonal to how the backend resource is allocated. Physically, MSI-X
and IMS can be enabled simultaneously on an IDXD device. Once
dynamic allocation is allowed for both, either one can be allocated for
a subdevice (the only difference being the supported number of subdevices).

When an interrupt resource is exposed to the guest with the same type
(e.g. MSI-on-MSI or IMS-on-IMS), it can also be passed through to the
guest as long as hypercall machinery is in place to get the addr/data pair
from the host (as you suggested earlier).

Thanks
Kevin


Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-06 Thread Thomas Gleixner
On Mon, Dec 06 2021 at 17:06, Jason Gunthorpe wrote:
> On Mon, Dec 06, 2021 at 09:28:47PM +0100, Thomas Gleixner wrote:
>> I wish I could mask underneath for some stuff on x86. Though that would
>> not help with the worst problem vs. affinity settings. See the horrible
>> dance in:
>
> My thinking here is that this stuff in ARM is one of the different
> cases (ie not using MSI_FLAG_USE_DEF_CHIP_OPS), and I guess we can
> just handle it cleanly by having the core call both the irq_chip->mask
> and the msi_storage_ops->mask and we don't need ARM to be different,
> x86 just won't provide a mask at destination op.
>
>> x86/kernel/apic/msi.c::msi_set_affinity()
>
> Okay, so it is complicated, but it is just calling
>irq_data_get_irq_chip(irqd)->irq_write_msi_msg(irqd, msg);
>
> So, from a msi_storage_ops perspective, things are still clean.

Yes.

>> You forgot IO/APIC which is a MSI endpoint too, just more convoluted but
>> it's not using MSI domains so it's not in the way. I'm not going to
>> touch that with a ten foot pole. :)
>
> I left off IOAPIC because I view it as conceptually different. I used
> the phrasse "device that originated the interrupt" deliberately,
> IOAPIC is just a middle box that converts from a physical interrupt
> line to a message world, it belongs with the physical interrupt
> infrastructure.

I mentioned it because there is mbigen on arm64 which is the same thing:
it translates hundreds of wire inputs into MSI. It's even worse than
IO/APIC. There is a horrible hack to make it "work", which Marc and I are
looking at to see whether we can kill it along the way.

> Possibly the IOAPIC considerations is what motivated some of this to
> look the way it does today, because it really was trying to hide MSI
> under normal PCI INTX physical pins with full compatability. We kind
> of kept doing that as MSI grew into its own thing.

Not really. It was more to avoid having a completely separate
infrastructure for irqdomain based MSI[X]. Laziness and lack of time
added the rest of the non-motivation :)

> I'm curious to see if you end up with irq_domains and irq_chips along
> with what I labeled as the msi_storage above, or if those turn out to
> be unnecesary for the driver to provide MSI programming.

I can't avoid irq chips because, from the interrupt handling point of view,
that's unavoidable unless we create a duplicate zoo there. What I have
in mind is to convert the msi ops provided by the device driver into a
real chip, as that just falls into place without further changes.

The irqdomain will be real as well just to make things consistent and to
share as much code as possible.

> Also, if msi_storage_ops can be robust enough you'd be comfortable
> with it in a driver .c file and just a regex match in the MAINTAINERS
> file :)

That might work. Let's see when we are there.

>>- Have a transition mechanism to convert one part at a time to keep
>>  the patch sizes reviewable and the whole mess bisectable.
>
> This seems difficult all on its own..

I've done that before. It just needs some thought.

Thanks,

tglx


Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-06 Thread Jason Gunthorpe via iommu
On Mon, Dec 06, 2021 at 09:28:47PM +0100, Thomas Gleixner wrote:

> That's already the plan in some form, but there's a long way towards
> that. See below.

Okay, then I think we are thinking the same sorts of things, it is
good to see
 
> Also there will be a question of how many different callbacks we're
> going to create just to avoid one conditional. At some point this might
> become silly.

Yes, but I think the overall point is that we could have ops
that do exactly what they need, and we can choose when we get there
which makes the most sense. ie the 64 vs 32 is probably silly.

> > Surprised! These are actually IMS. The HPET and DMAR devices both
> > have device-specific message storage! So these could use
> > msi_storage_ops. And WTF is IOMMU DMAR driver code doing in
> > apic/msi.c ???
> 
> Historical reasons coming from the pre irqdomain aera. Also DMAR needs
> direct access to the x86 low level composer which we didn't want to
> expose. Plus DMAR is shared with ia64 to make it more interesting.
>
> Yes, they can be converted. But that's the least of my worries. Those
> are straight forward and not really relevant for the design.
> 
> > arch/powerpc/platforms/pseries/msi.c:   .irq_write_msi_msg  = 
> > pseries_msi_write_msg,
> >
> > AFAICT this is really like virtualization? The hypervisor is
> > controlling the real MSI table and what the OS sees is faked out
> > somewhat.
> >
> > This is more of quirk in the PCI MSI implementation (do not touch the
> > storage) and a block on non-PCI uses of MSI similar to what x86 needs?
> 
> There is an underlying hypervisor of some sorts and that stuff needs to
> deal with it. I leave that to the powerpc wizards to sort out.
> 
> > drivers/irqchip/irq-gic-v2m.c:  .irq_write_msi_msg  = 
> > pci_msi_domain_write_msg,
> > drivers/irqchip/irq-gic-v3-its-pci-msi.c:   .irq_write_msi_msg  = 
> > pci_msi_domain_write_msg,
> > drivers/irqchip/irq-gic-v3-mbi.c:   .irq_write_msi_msg  = 
> > pci_msi_domain_write_msg,
> >
> > ARM seems to be replacing the 'mask at source' with 'mask at
> > destination' - I wonder why?
> 
> Because the majority of PCI/MSI endpoint implementations do not provide
> masking...
> 
> We're telling hardware people for 15+ years that this is a horrible
> idea, but it's as effective as talking to a wall. Sure the spec grants
> them to make my life miserable...
> 
> > Should this really be hierarchical where we mask *both* the MSI
> > originating device (storage_ops->mask) and at the CPU IRQ controller?
> > (gicv2m_mask_msi_irq ?) if it can?
> 
> I wish I could mask underneath for some stuff on x86. Though that would
> not help with the worst problem vs. affinity settings. See the horrible
> dance in:

My thinking here is that this stuff in ARM is one of the different
cases (ie not using MSI_FLAG_USE_DEF_CHIP_OPS), and I guess we can
just handle it cleanly by having the core call both the irq_chip->mask
and the msi_storage_ops->mask and we don't need ARM to be different,
x86 just won't provide a mask at destination op.

> x86/kernel/apic/msi.c::msi_set_affinity()

Okay, so it is complicated, but it is just calling
   irq_data_get_irq_chip(irqd)->irq_write_msi_msg(irqd, msg);

So, from a msi_storage_ops perspective, things are still clean.

> > PCI, HPET, DMAR move to msi_storage_ops instead of using irq_chip
> 
> With different parent domains. DMAR hangs always directly off the vector
> domain. HPET has its own IOMMU zone.
> 
> You forgot IO/APIC which is a MSI endpoint too, just more convoluted but
> it's not using MSI domains so it's not in the way. I'm not going to
> touch that with a ten foot pole. :)

I left off IOAPIC because I view it as conceptually different. I used
the phrasse "device that originated the interrupt" deliberately,
IOAPIC is just a middle box that converts from a physical interrupt
line to a message world, it belongs with the physical interrupt
infrastructure.

Possibly the IOAPIC considerations are what motivated some of this to
look the way it does today, because it really was trying to hide MSI
under normal PCI INTx physical pins with full compatibility. We kind
of kept doing that as MSI grew into its own thing.

> > Seems like a nice uniform solution?
> 
> That's where I'm heading.

Okay, it looks like a lot TBH, but I think the direction is clean and
general.

I'm curious to see if you end up with irq_domains and irq_chips along
with what I labeled as the msi_storage above, or if those turn out to
be unnecessary for the driver to provide MSI programming.

Also, if msi_storage_ops can be robust enough you'd be comfortable
with it in a driver .c file and just a regex match in the MAINTAINERS
file :)

>- Have a transition mechanism to convert one part at a time to keep
>  the patch sizes reviewable and the whole mess bisectable.

This seems difficult all on its own..

> I have a pretty good picture in my head and notes already, which needs
> to be dumped into code. But let me post 

Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-06 Thread Thomas Gleixner
Jason,

On Mon, Dec 06 2021 at 13:00, Jason Gunthorpe wrote:
> On Mon, Dec 06, 2021 at 04:47:58PM +0100, Thomas Gleixner wrote:
>> It will need some more than that, e.g. mask/unmask and as we discussed
>> quite some time ago something like the irq_buslock/unlock pair, so you
>> can handle updates to the state from thread context via a command queue
>> (IIRC).
>
> pci_msi_create_irq_domain() hooks into the platforms irq_chip as an
> alternative to hierarchical irq domains (?)

It's based on hierarchical irqdomains. It's the outermost domain and irq
chip. On X86:

  [VECTOR chip=apic]->[PCI chip=PCI]
or
  [VECTOR chip=apic]->[IOMMU chip=IR]->[PCI chip=PCI-IR]

The chips are different because of quirks. See below :)

>   chip->irq_write_msi_msg = pci_msi_domain_write_msg;
> In almost all cases 'ops' will come along with a 'state', so lets
> create one:
>
> struct msi_storage {  // Look, I avoided the word table!

I already made a note that I need to smuggle a table in somewhere :)

> And others for the different cases. Look no ifs!
>
> OK?

That's already the plan in some form, but there's a long way towards
that. See below.

Also there will be a question of how many different callbacks we're
going to create just to avoid one conditional. At some point this might
become silly.

> Now, we have some duplication between the struct msi_storage_ops and
> the struct irq_chip. Let's see what that is about:
>
> arch/x86/kernel/apic/msi.c: .irq_write_msi_msg  = dmar_msi_write_msg,
> arch/x86/kernel/hpet.c: .irq_write_msi_msg = hpet_msi_write_msg,
>
> Surprised! These are actually IMS. The HPET and DMAR devices both have
> device-specific message storage! So these could use
> msi_storage_ops. And WTF is IOMMU DMAR driver code doing in
> apic/msi.c ???

Historical reasons coming from the pre-irqdomain era. Also DMAR needs
direct access to the x86 low level composer which we didn't want to
expose. Plus DMAR is shared with ia64 to make it more interesting.

Yes, they can be converted. But that's the least of my worries. Those
are straight forward and not really relevant for the design.

> arch/powerpc/platforms/pseries/msi.c:   .irq_write_msi_msg  = 
> pseries_msi_write_msg,
>
> AFAICT this is really like virtualization? The hypervisor is
> controlling the real MSI table and what the OS sees is faked out
> somewhat.
>
> This is more of quirk in the PCI MSI implementation (do not touch the
> storage) and a block on non-PCI uses of MSI similar to what x86 needs?

There is an underlying hypervisor of some sort and that stuff needs to
deal with it. I leave that to the powerpc wizards to sort out.

> drivers/irqchip/irq-gic-v2m.c:  .irq_write_msi_msg  = 
> pci_msi_domain_write_msg,
> drivers/irqchip/irq-gic-v3-its-pci-msi.c:   .irq_write_msi_msg  = 
> pci_msi_domain_write_msg,
> drivers/irqchip/irq-gic-v3-mbi.c:   .irq_write_msi_msg  = 
> pci_msi_domain_write_msg,
>
> ARM seems to be replacing the 'mask at source' with 'mask at
> destination' - I wonder why?

Because the majority of PCI/MSI endpoint implementations do not provide
masking...

We've been telling hardware people for 15+ years that this is a horrible
idea, but it's as effective as talking to a wall. Sure, the spec allows
them to make my life miserable...

> Should this really be hierarchical where we mask *both* the MSI
> originating device (storage_ops->mask) and at the CPU IRQ controller?
> (gicv2m_mask_msi_irq ?) if it can?

I wish I could mask underneath for some stuff on x86. Though that would
not help with the worst problem vs. affinity settings. See the horrible
dance in:

x86/kernel/apic/msi.c::msi_set_affinity()

So this will end up with a shim as the base domain for !IOMMU systems:

                     |--[HPET]
  [VECTOR chip=apic]-|--[x86-msi chip=x86-msi]-|--[PCI/MSI]
                     |--[DMAR]                 |--[PCI/MSI-X]

That nonsense can't move into the apic domain set_affinity() callback as
this is not required when interrupt remapping is enabled.

With IOMMU it then looks like this:

                     |--[HPET]
  [VECTOR chip=apic]-|--[IOMMU chip=IR]-|--[PCI/MSI]
                     |--[DMAR]          |--[PCI/MSI-X]
                                        |--[PCI/IMS]

> drivers/base/platform-msi.c:chip->irq_write_msi_msg = 
> platform_msi_write_msg;
>
> Oh! this is doing what I kind of just suggested, just non-generically
> and hacked into platform bus drivers the same as PCI does:

Correct. It's a hack and it's on the list of things which need to
vanish. I was already discussing that with Marc on the side for quite a
while.

> PCI, HPET, DMAR move to msi_storage_ops instead of using irq_chip

With different parent domains. DMAR always hangs directly off the vector
domain. HPET has its own IOMMU zone.

You forgot IO/APIC which is an MSI endpoint too, just more convoluted, but
it's not using MSI domains so it's not in the way. I'm not going to touch
that with a ten foot pole. :)

Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-06 Thread Jason Gunthorpe via iommu
On Mon, Dec 06, 2021 at 04:47:58PM +0100, Thomas Gleixner wrote:

> >>- The irqchip callbacks which can be implemented by these top
> >>  level domains are going to be restricted.
> >
> > OK - I think it is great that the driver will see a special ops struct
> > that is 'ops for device's MSI addr/data pair storage'. It makes it
> > really clear what it is
> 
> It will need some more than that, e.g. mask/unmask and as we discussed
> quite some time ago something like the irq_buslock/unlock pair, so you
> can handle updates to the state from thread context via a command queue
> (IIRC).

Yes, I was thinking about all of that in here.

Let me ask a slightly different question.

pci_msi_create_irq_domain() hooks into the platform's irq_chip as an
alternative to hierarchical irq domains (?)

eg:

chip->irq_write_msi_msg = pci_msi_domain_write_msg;

And we see:

void pci_msi_domain_write_msg(struct irq_data *irq_data, struct msi_msg *msg)
{
struct msi_desc *desc = irq_data_get_msi_desc(irq_data);

Now, if we have your idea to have some:

struct msi_storage_ops {
   write_msg
   mask;
   unmask;
   lock;
   unlock;
};

Which is how a driver plugs in its storage operations.

In almost all cases 'ops' will come along with a 'state', so let's
create one:

struct msi_storage {  // Look, I avoided the word table!
   const struct msi_storage_ops *ops;
};

ie:

struct msi_storage_ops {
   void (*write_msg)(struct msi_storage *msi, struct msi_desc *desc,
                     struct msi_msg *msg);


Now, what if we made a 'generic_msi_create_irq_domain()' that hooks
the irq_chip with something like this:

void generic_msi_domain_write_msg(struct irq_data *irq_data,
				  struct msi_msg *msg)
{
	struct msi_desc *desc = irq_data_get_msi_desc(irq_data);
	struct msi_storage *msi = desc->msi;

	msi->ops->write_msg(msi, desc, msg);
}

And then have what pci_msi_domain_write_msg() does today accomplished by
having it set desc->storage to a pci_msi_storage or pci_msix_storage
whose msi_storage_ops point at pci_msi_domain_write_msg/etc.

Then we can transform:

void __pci_write_msi_msg(struct msi_desc *entry, struct msi_msg *msg)
{
	struct pci_dev *dev = msi_desc_to_pci_dev(entry);

into

struct pci_msi_storage {
	struct msi_storage storage;
	struct pci_dev *dev;
	unsigned int msi_cap_off;
};

void pci_write_msi64_msg(struct msi_storage *storage, struct msi_desc *desc,
			 struct msi_msg *msg)
{
	struct pci_msi_storage *msi = container_of(storage,
						   struct pci_msi_storage,
						   storage);
	unsigned int pos = msi->msi_cap_off;
	struct pci_dev *dev = msi->dev;
	u16 msgctl;

	pci_read_config_word(dev, pos + PCI_MSI_FLAGS, &msgctl);
	msgctl &= ~PCI_MSI_FLAGS_QSIZE;
	msgctl |= desc->msi_attrib.multiple << 4;
	pci_write_config_word(dev, pos + PCI_MSI_FLAGS, msgctl);

	pci_write_config_dword(dev, pos + PCI_MSI_ADDRESS_LO, msg->address_lo);
	pci_write_config_dword(dev, pos + PCI_MSI_ADDRESS_HI, msg->address_hi);
	pci_write_config_word(dev, pos + PCI_MSI_DATA_64, msg->data);
	/* Ensure that the writes are visible in the device */
	pci_read_config_word(dev, pos + PCI_MSI_FLAGS, &msgctl);
	desc->msg = *msg; // in core code instead ?
}

And others for the different cases. Look no ifs!
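
For instance, the MSI-X sibling under the same scheme might look roughly like
this (struct and field names invented here; the entry stride and offsets
follow the PCI MSI-X spec):

struct pci_msix_storage {
	struct msi_storage storage;
	void __iomem *table_base;	/* mapped MSI-X vector table */
};

void pci_write_msix_msg(struct msi_storage *storage, struct msi_desc *desc,
			struct msi_msg *msg)
{
	struct pci_msix_storage *msix = container_of(storage,
					struct pci_msix_storage, storage);
	void __iomem *entry = msix->table_base + desc->msi_attrib.entry_nr * 16;

	writel(msg->address_lo, entry + 0x0);	/* Message Address low  */
	writel(msg->address_hi, entry + 0x4);	/* Message Address high */
	writel(msg->data,       entry + 0x8);	/* Message Data         */
	readl(entry + 0x8);			/* flush posted writes  */
	desc->msg = *msg; // in core code instead ?
}

Same msi_storage_ops shape, just with the MMIO vector table as the storage
instead of config space.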

OK?

Now, we have some duplication between the struct msi_storage_ops and
the struct irq_chip. Let's see what that is about:

arch/x86/kernel/apic/msi.c: .irq_write_msi_msg  = dmar_msi_write_msg,
arch/x86/kernel/hpet.c: .irq_write_msi_msg = hpet_msi_write_msg,

Surprised! These are actually IMS. The HPET and DMAR devices both have
device-specific message storage! So these could use
msi_storage_ops. And WTF is IOMMU DMAR driver code doing in
apic/msi.c ???

arch/powerpc/platforms/pseries/msi.c:   .irq_write_msi_msg  = pseries_msi_write_msg,

AFAICT this is really like virtualization? The hypervisor is
controlling the real MSI table and what the OS sees is faked out
somewhat.

This is more of a quirk in the PCI MSI implementation (do not touch the
storage) and a block on non-PCI uses of MSI similar to what x86 needs?

drivers/irqchip/irq-gic-v2m.c:  .irq_write_msi_msg  = pci_msi_domain_write_msg,
drivers/irqchip/irq-gic-v3-its-pci-msi.c:   .irq_write_msi_msg  = pci_msi_domain_write_msg,
drivers/irqchip/irq-gic-v3-mbi.c:   .irq_write_msi_msg  = pci_msi_domain_write_msg,

ARM seems to be replacing the 'mask at source' with 'mask at
destination' - I wonder why?

Should this really be hierarchical where we mask *both* the MSI
originating device (storage_ops->mask) and at the CPU IRQ controller?
(gicv2m_mask_msi_irq ?) if it can?

drivers/base/platform-msi.c:chip->irq_write_msi_msg = platform_msi_write_msg;

Oh! this is doing what I kind of just suggested, just non-generically
and hacked into platform bus drivers the same as PCI does:

static void platform_msi_write_msg(struct irq_data *data, struct msi_msg *msg)

Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-06 Thread Thomas Gleixner
Jason,

On Mon, Dec 06 2021 at 10:43, Jason Gunthorpe wrote:
> On Sun, Dec 05, 2021 at 03:16:40PM +0100, Thomas Gleixner wrote:
>> > That's not really a good idea because dev->irqdomain is a generic
>> > mechanism and not restricted to the PCI use case. Special casing it for
>> > PCI is just wrong. Special casing it for all use cases just to please
>> > PCI is equally wrong. There is a world outside of PCI and x86. 
>> 
>> That argument is actually only partially correct.
>
> I'm not sure I understood your reply? I think we are both agreeing
> that dev->irqdomain wants to be a generic mechanism?

Yes. I managed to confuse myself there by being too paranoid about how
to distinguish things on platforms which need to support both ways, i.e.
x86 when XEN is enabled.

> I'd say that today we've special cased it to handle PCI. IMHO that is
> exactly what pci_msi_create_irq_domain() is doing - it replaces the
> chip ops with ops that can *ONLY* do PCI MSI and so dev->irqdomain
> becomes PCI only and non-generic.

Right. See above. That's why I went back to my notes, did some more
research ...

>>   2) Guest support is strictly opt-in
>> 
>>  The underlying architecture/subarchitecture specific irqdomain has
>>  to detect at setup time (eventually early boot), whether the
>>  underlying hypervisor supports it.
>> 
>>  The only reasonable way to support that is the availability of
>>  interrupt remapping via vIOMMU, as we discussed before.
>
> This is talking about IMS specifically because of the legacy issue
> where the MSI addr/data pair inside a guest is often completely fake?

This is about IMS, right. PCI/MSI[x] is handled today because the writes
to the MSI/MSI-X message store can be trapped.

>>  That does not work in all cases due to architecture and host
>>  controller constraints, so we might end up with:
>> 
>>VECTOR -> IOMMU -> SHIM -> PCI/[MSI/MSI-X/IMS] domains
>
> OK - I dont' know enough about the architecture/controller details to
> imagine what SHIM is, but if it allows keeping the PCI code as purely
> PCI code, then great

It's today part of the arch/subarch specific PCI/MSI domain to deal with
quirks above the IOMMU level. As we can't proliferate that into the new
endpoint domain, that needs to be done as a shim layer in between which
has no real other functionality than applying the quirks. Yes, it's all
pretty. Welcome to my wonderful world.

>>- The irqchip callbacks which can be implemented by these top
>>  level domains are going to be restricted.
>
> OK - I think it is great that the driver will see a special ops struct
> that is 'ops for device's MSI addr/data pair storage'. It makes it
> really clear what it is

It will need some more than that, e.g. mask/unmask and as we discussed
quite some time ago something like the irq_buslock/unlock pair, so you
can handle updates to the state from thread context via a command queue
(IIRC).

>>- For the irqchip callbacks which are allowed/required the rules
>>  vs. following down the hierarchy need to be defined and
>>  enforced.
>
> The driver should be the ultimate origin of the interrupt so it is
> always end-point in the hierarchy, opposite the CPU?
>
> I would hope the driver doesn't have an exposure to hierarchy?

No.
  
> So we have a new concept: 'device MSI storage ops'
>
> Put them along with the xarray holding the msi_descs and you've got my
> msi_table :)

Hehe.
  
>>  Sorry Jason, no tables for you. :)
>
> How does the driver select with 'device MSI storage ops' it is
> requesting a MSI for ?

Via some cookie, reference whatever as discussed in the other
mail. We'll bikeshed the naming once I get there :)

>>   1) I'm going to post part 1-3 of the series once more with the fallout
>>  and review comments addressed.
>
> OK, I didn't see anything in there that was making anything harder in
> this direction

It's helping to keep the existing stuff including the !irqdomain parts
sufficiently self contained so I can actually change the inner workings
of msi domains without going back to any of these places (hopefully).
  
>>   5) Implement an IMS user.
>> 
>>  The obvious candidate which should be halfways accessible is the
>>  ath11 PCI driver which falls into that category.
>
> Aiiee:

Yes.

> drivers/net/wireless/ath/ath11k/pci.c:  ab_pci->msi_ep_base_data = msi_desc->msg.data;

That's only one part of it. Look how the address is retrieved.

Thanks,

tglx


Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-06 Thread Thomas Gleixner
Jason,

On Mon, Dec 06 2021 at 10:19, Jason Gunthorpe wrote:
> On Sat, Dec 04, 2021 at 03:20:36PM +0100, Thomas Gleixner wrote:
>> even try to make the irqchip/domain code mangled into the other device
>> code. It should create the irqdomain with the associated chip and that
>> creation process returns a cookie which is passed to the actual device
>> specific code. Allocation then can use that cookie and not the irqdomain
>> pointer itself.
>
> Sounds like your cookie == my msi_table? Maybe we are all agreeing at
> this point then?

I think so. It's going to be something the driver can use as a
reference. Same for the actual interrupt allocated through this domain
reference.

>> So thanks for being patient in educating me here.
>
> I'm happy you got something out of all these words!

Definitely so. That's why we are having these discussions, right?

The shiny irq subsystem is not so shiny when the drivers cannot use
it. OTOH, it's often enough the case that driver folks want to use it
the wrong way just because.

> Yes, it is amazing how many patches this is at already.

Don't worry. You'll get a few more patch bombs in your inbox until IMS
is supported, unless you want to be "unsubscribed".

Thanks,

tglx


Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-06 Thread Jason Gunthorpe via iommu
On Sun, Dec 05, 2021 at 03:16:40PM +0100, Thomas Gleixner wrote:
> On Sat, Dec 04 2021 at 15:20, Thomas Gleixner wrote:
> > On Fri, Dec 03 2021 at 12:41, Jason Gunthorpe wrote:
> > So I need to break that up in a way which caters for both cases, but
> > does neither create a special case for PCI nor for the rest of the
> > universe, i.e. the 1:1 case has to be a subset of the 1:2 case which
> > means all of it is common case. With that solved the storage question
> > becomes a nobrainer.
> >
> > When I looked at it again yesterday while writing mail, I went back to
> > my notes and found the loose ends where I left off. Let me go back and
> > start over from there.
> 
> I found out why I stopped looking at it. I came from a similar point of
> view what you were suggesting:
> 
> >> If IMS == MSI, then couldn't we conceptually have the dev->irqdomain
> >> completely unable to handle IMS/MSI/MSI-X at all, and instead, when
> >> the driver asks for PCI MSI access we create a new hierarchical
> >> irqdomain and link it to a MSI chip_ops or a MSI-X chip_ops - just as
> >> you outlined for IMS?  (again, not saying to do this, but let's ask if
> >> that makes more sense than the current configuration)
> 
> Which I shot down with:
> 
> > That's not really a good idea because dev->irqdomain is a generic
> > mechanism and not restricted to the PCI use case. Special casing it for
> > PCI is just wrong. Special casing it for all use cases just to please
> > PCI is equally wrong. There is a world outside of PCI and x86. 
> 
> That argument is actually only partially correct.

I'm not sure I understood your reply? I think we are both agreeing
that dev->irqdomain wants to be a generic mechanism?

I'd say that today we've special cased it to handle PCI. IMHO that is
exactly what pci_msi_create_irq_domain() is doing - it replaces the
chip ops with ops that can *ONLY* do PCI MSI and so dev->irqdomain
becomes PCI only and non-generic.

> After studying my notes and some more code (sigh), it looks feasible
> under certain assumptions, constraints and consequences.
> 
> Assumptions:
> 
>   1) The irqdomain pointer of PCI devices which is set up during device
>  discovery is not used by anything else than infrastructure which
>  knows how to handle it.
> 
>  Of course there is no guarantee, but I'm not that horrified about
>  it anymore after chasing the exposure with yet more coccinelle
>  scripts.

OK


> Constraints:
> 
>   1) This is strictly opt-in and depends on hierarchical irqdomains.

OK

>  That means that devices which depend on IMS won't work on anything
>  which is not up to date.

OK - I think any pressure to get platforms to update their code is
only positive.
 
>   2) Guest support is strictly opt-in
> 
>  The underlying architecture/subarchitecture specific irqdomain has
>  to detect at setup time (eventually early boot), whether the
>  underlying hypervisor supports it.
> 
>  The only reasonable way to support that is the availability of
>  interrupt remapping via vIOMMU, as we discussed before.

This is talking about IMS specifically because of the legacy issue
where the MSI addr/data pair inside a guest is often completely fake?

>   4) The resulting irqdomain hierarchy would ideally look like this:
> 
>  VECTOR -> [IOMMU, ROUTING, ...] -> PCI/[MSI/MSI-X/IMS] domains

OK, this matches where I have come from as well
 
>  That does not work in all cases due to architecture and host
>  controller constraints, so we might end up with:
> 
>VECTOR -> IOMMU -> SHIM -> PCI/[MSI/MSI-X/IMS] domains

OK - I don't know enough about the architecture/controller details to
imagine what SHIM is, but if it allows keeping the PCI code as purely
PCI code, then great

>   5) The design rules for the device specific IMS irqdomains have to be
>  documented and enforced to the extent possible.
> 
>  Rules which I have in my notes as of today:
> 
>- The device specific IMS irq chip / irqdomain has to be strictly
>  separated from the rest of the driver code and can only
>  interact via the irq chip data which is either per interrupt or
>  per device.

It seems OK with the observation that IDXD and mlx5 will both need to
set 'per-interrupt' data before provisioning the IRQ

>  I have some ideas how to enforce these things to go into
>  drivers/irqchip/ so they are exposed to scrutiny and not
>  burried in some "my device is special" driver code and applied
>  by subsystem maintainers before anyone can even look at it. 

Means more modules, but OK
 
>- The irqchip callbacks which can be implemented by these top
>  level domains are going to be restricted.

OK - I think it is great that the driver will see a special ops struct
that is 'ops for device's MSI addr/data pair storage'. It makes it
really clear what it is

>- For the irqchip callbacks which are 

Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-06 Thread Jason Gunthorpe via iommu
On Sat, Dec 04, 2021 at 03:20:36PM +0100, Thomas Gleixner wrote:
> Jason,
> 
> On Fri, Dec 03 2021 at 12:41, Jason Gunthorpe wrote:
> > On Fri, Dec 03, 2021 at 04:07:58PM +0100, Thomas Gleixner wrote:
> > Lets do a thought experiment, lets say we forget about the current PCI
> > MSI API.

I've read both your emails, and I'll make a few small remarks on this
one, mainly to try to clearly communicate what I was thinking.

Just want to call out the above paragraph: not suggesting we do it,
but the thought exercise of what we should do if not weighed down by
endless engineering cruft is usually informative.

> > What if it worked more like this:
> >
> > probe()
> >  // Access the real PCI SIG MSI/MSI-X table
> >  mystruct->msi_table = pci_allocate_msi_table(pci_dev);
> >
> >  // Special interrupt 0
> >  mstruct->desc0 = msi_table_alloc_vector(mystruct->msi_table, hw_label=0);
> >  request_irq(mystruct->desc0, ..)
> 
> A device driver should not even know about msi_desc. Period.

Here, this wasn't really about the msi_desc, it was just about some
pointer-based handle. Call it what you will, and force it to be opaque
to the drivers.

eg this 'msi table' layer can just have a 

struct msi_handle;

In headers

and in a C file it can be:

struct msi_handle {
   struct msi_desc desc;
};

And achieve what you've asked for

> >  - msi_table is a general API for accessing MSIs. Each bus type
> >would provide some kind of logical creation function for their
> >bus standardized MSI table type. eg MSI/MSI-X/etc
> 
> You can't give up on that table thing, right? We just established that
> for devices like yours IMS is not necessarily a table and does not even
> need the index. The fact that MSI-X uses a table does not make
> everything a nail^Wtable. :)

It is just a name for the idea of a handle to a device's MSI
mechanism.

Call it 'msi_hw' or something.

My concept is that each device has, integral to the device, some way
to operate the device's MSI storage (MSI/MSI-X/IMS/Platform), so let's
give that a name and give it a struct and an API. It is all still core
code.

Think of it as a clean separation between 'determining the addr/data
pair' and 'writing the addr/data pair to HW'.

Today you are thinking about this object as an irqdomain, irqchip, and
msi xarray - I think?

> >It is a logical handle for the physical resource the holds the MSI
> >addr/data paris.
> >
> >  - Use a pointer instead of idx as the API for referring to a MSI
> >irq. Most drivers probably only use msi_desc->irq?
> 
> No. pci_irq_vector() translates the hw index to linux irq number.  The
> hardware index is known by the driver when it is doing a batched
> allocation, right?

Yes

> I'm fine with using the info struct I described above as the reference
> cookie.

IMHO an opaque pointer to refer to the MSI is cleaner
 
> >  - We do not have device->msi, it is held in the driver instead.
> 
> No. First of all drivers have no business with that, really.

I'm reading your second email and am still not entirely clear what your
thinking is for device->msi?

> >dev->irqdomain is always the bus/platform originated irqdomain of
> >the physical device.
> 
> This is a guarantee for subtle bugs and I'm not even remotely interested
> going there. See also below.

Not sure I follow this? My suggestion is that it is either as-is today
or NULL and we don't try to use eg mdev->irqdomain for anything.

> >Thus we break the 1:1 of the device and irqdomain. A device can
> >spawn many msi_tables, but only one would be on the dev->irqdomain
> 
> Why are you trying brute force to push things into device drivers?
> That's really the wrong direction. We want common infrastructure to be
> managed by generic code. And all of this is about common infrastructure.

The driver needs some kind of handle to tell the core code what MSI
'namespace' it is talking about in every API - eg MSI or IMS.

Pointers are the usual way to make such a handle.

Going along with the IMS == MSI principle, there should be a
driver-facing API with a pointer handle for the MSI and a pointer
handle for the IMS.

Yes, this means the existing API is some compat wrapper around a
pointer API and would have to store the pointer handle in the device,
but the idea would be to make that pci->dev ONLY for the compat API,
not used in the IRQ infrastructure.

> > Is it sane? What really needs device->msi anyhow?
> 
> device->msi is a container for interrupt management and that's done by
> the interrupt code and not by random device drivers. Again, struct
> msi_desc is a software construct which the interrupt core code and the
> irqdomains and irq chip implementation use for various purposes. Nothing
> outside of this realm has to even know about the fact that this data
> structure exists in the first place. See below.

I imagined that msi_desc remains as part of the interrupt core code
and the 'msi_table' object is another interrupt core code object for
dealing with device's MSI


Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-05 Thread Thomas Gleixner
On Sat, Dec 04 2021 at 15:20, Thomas Gleixner wrote:
> On Fri, Dec 03 2021 at 12:41, Jason Gunthorpe wrote:
> So I need to break that up in a way which caters for both cases, but
> does neither create a special case for PCI nor for the rest of the
> universe, i.e. the 1:1 case has to be a subset of the 1:2 case which
> means all of it is common case. With that solved the storage question
> becomes a nobrainer.
>
> When I looked at it again yesterday while writing mail, I went back to
> my notes and found the loose ends where I left off. Let me go back and
> start over from there.

I found out why I stopped looking at it. I came from a similar point of
view what you were suggesting:

>> If IMS == MSI, then couldn't we conceptually have the dev->irqdomain
>> completely unable to handle IMS/MSI/MSI-X at all, and instead, when
>> the driver asks for PCI MSI access we create a new hierarchical
>> irqdomain and link it to a MSI chip_ops or a MSI-X chip_ops - just as
>> you outlined for IMS?  (again, not saying to do this, but let's ask if
>> that makes more sense than the current configuration)

Which I shot down with:

> That's not really a good idea because dev->irqdomain is a generic
> mechanism and not restricted to the PCI use case. Special casing it for
> PCI is just wrong. Special casing it for all use cases just to please
> PCI is equally wrong. There is a world outside of PCI and x86. 

That argument is actually only partially correct.

After studying my notes and some more code (sigh), it looks feasible
under certain assumptions, constraints and consequences.

Assumptions:

  1) The irqdomain pointer of PCI devices which is set up during device
 discovery is not used by anything else than infrastructure which
 knows how to handle it.

 Of course there is no guarantee, but I'm not that horrified about
 it anymore after chasing the exposure with yet more coccinelle
 scripts.

Constraints:

  1) This is strictly opt-in and depends on hierarchical irqdomains.

 If an architecture/subarchitecture wants to support it then it
 needs to rework their PCI/MSI backend to hierarchical irqdomains or
 make their PCI/MSI irqdomain ready for the task.

 From my inspection of 30+ PCI/MSI irqdomains, most of them should
 be trivial to convert. The hard ones are powerpc, XEN and VMD,
 where XEN is definitely the most convoluted one.

 That means that devices which depend on IMS won't work on anything
 which is not up to date.

  2) Guest support is strictly opt-in

 The underlying architecture/subarchitecture specific irqdomain has
 to detect at setup time (eventually early boot), whether the
 underlying hypervisor supports it.

 The only reasonable way to support that is the availability of
 interrupt remapping via vIOMMU, as we discussed before.

  3) IOMMU/Interrupt remapping dependency

 While IMS just works without interrupt remapping on bare metal the
 fact that there is no reliable way to figure out whether the kernel
 runs on bare metal or not, makes it pretty much mandatory, at least
 on x86.

 That's not a hardcoded constraint. It's a decision made during the
 setup of the underlying architecture/subarchitecture specific
 irqdomain.

  4) The resulting irqdomain hierarchy would ideally look like this:

 VECTOR -> [IOMMU, ROUTING, ...] -> PCI/[MSI/MSI-X/IMS] domains

 That does not work in all cases due to architecture and host
 controller constraints, so we might end up with:

   VECTOR -> IOMMU -> SHIM -> PCI/[MSI/MSI-X/IMS] domains

 The nice thing about the irqdomain hierarchy concept is that this
 does not create any runtime special cases as the base hierarchy is
 established at boot or device detection time. It's just another
 layer of indirection.

  5) The design rules for the device specific IMS irqdomains have to be
 documented and enforced to the extent possible.

 Rules which I have in my notes as of today:

   - The device specific IMS irq chip / irqdomain has to be strictly
 separated from the rest of the driver code and can only
 interact via the irq chip data which is either per interrupt or
 per device.

 I have some ideas how to enforce these things to go into
 drivers/irqchip/ so they are exposed to scrutiny and not
 buried in some "my device is special" driver code and applied
 by subsystem maintainers before anyone can even look at it. 

   - The irqchip callbacks which can be implemented by these top
 level domains are going to be restricted.

   - For the irqchip callbacks which are allowed/required the rules
 vs. following down the hierarchy need to be defined and
 enforced.

   - To achieve that the registration interface will not be based on
 struct irq_chip. This will be a new representation and the core
 will convert that into a 
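
To give an idea of the kind of strictly separated irq chip such rules aim
at, here is a sketch for an IDXD-style IOMEM slot array. The register
layout, the offsets and every name except irq_chip, irq_data and msi_msg
are invented for the example; the point is only that the chip touches
nothing but its per-interrupt chip data:

    struct ims_slot {
            void __iomem *regs;     /* this interrupt's slot in the array */
    };

    static void ims_write_msi_msg(struct irq_data *d, struct msi_msg *msg)
    {
            struct ims_slot *slot = irq_data_get_irq_chip_data(d);

            writel(msg->address_lo, slot->regs + IMS_ADDR_LO);
            writel(msg->address_hi, slot->regs + IMS_ADDR_HI);
            writel(msg->data,       slot->regs + IMS_DATA);
    }

    static void ims_mask(struct irq_data *d)
    {
            struct ims_slot *slot = irq_data_get_irq_chip_data(d);

            writel(IMS_VECTOR_MASKBIT, slot->regs + IMS_CTRL);
    }

    static void ims_unmask(struct irq_data *d)
    {
            struct ims_slot *slot = irq_data_get_irq_chip_data(d);

            writel(0, slot->regs + IMS_CTRL);
    }

    static struct irq_chip ims_irq_chip = {
            .name              = "IMS",
            .irq_mask          = ims_mask,
            .irq_unmask        = ims_unmask,
            .irq_write_msi_msg = ims_write_msi_msg,
    };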

Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-04 Thread Thomas Gleixner
Jason,

On Fri, Dec 03 2021 at 12:41, Jason Gunthorpe wrote:
> On Fri, Dec 03, 2021 at 04:07:58PM +0100, Thomas Gleixner wrote:
> Lets do a thought experiment, lets say we forget about the current PCI
> MSI API.
>
> What if it worked more like this:
>
> probe()
>  // Access the real PCI SIG MSI/MSI-X table
>  mystruct->msi_table = pci_allocate_msi_table(pci_dev);
>
>  // Special interrupt 0
>  mystruct->desc0 = msi_table_alloc_vector(mystruct->msi_table, hw_label=0);
>  request_irq(mystruct->desc0, ..)

A device driver should not even know about msi_desc. Period.

The code which gets the msi_desc pointer handed in is the irqdomain code
and eventually the irqchip callbacks which need to deal with it.

I just cleaned up quite some places again which fiddled with msi_desc
for the very wrong reasons. We've had subtle bugs due to that in the
past.

The time I wasted in the past to fix drivers which fiddled with msi_desc
and even with irq descriptors just because they can, makes me think hard
about encapsulations because I have absolutely no interest to do full
tree cleanups every time I need to change something in the core
code. Neither do I have interest to deal with the resulting bugs which
end up on my desk because the usual consequence is that the abuse makes
the core go belly up.

If I need to change a dozen irq chip implementations, that's a limited
scope, but chasing the abuse all over the place is insane. I'm dead
tired of it.

The driver can get some info struct which has the index and the linux
irq number in it and that is all it needs, seriously.
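
Something along these lines, presumably (the struct and its name are made
up purely for illustration):

    /* Illustrative only: the driver sees a hardware index and a Linux
     * irq number, nothing else */
    struct msi_irq_info {
            unsigned int index;     /* hardware index within the namespace */
            unsigned int virq;      /* Linux interrupt number */
    };

    ret = request_irq(info->virq, mydev_handler, 0, "mydev", mydev);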

If C would provide access protection scopes in objects then the msi
descriptor pointer could become a cookie. But we only can document that
the driver writer has to treat it like a cookie which is equally
effective to not documenting it at all. That's unfortunately the sad
reality. So the goal is to make it harder to screw up and not to make it
easier. The only way to do that is strict encapsulation.

>  - msi_table is a general API for accessing MSIs. Each bus type
>would provide some kind of logical creation function for their
>bus standardized MSI table type. eg MSI/MSI-X/etc

You can't give up on that table thing, right? We just established that
for devices like yours IMS is not necessarily a table and does not even
need the index. The fact that MSI-X uses a table does not make
everything a nail^Wtable. :)

>It is a logical handle for the physical resource that holds the MSI
>addr/data pairs.
>
>  - Use a pointer instead of idx as the API for referring to a MSI
>irq. Most drivers probably only use msi_desc->irq?

No. pci_irq_vector() translates the hw index to linux irq number.  The
hardware index is known by the driver when it is doing a batched
allocation, right?

There is also that magic interface which hands in an array of struct
msi_entry.  msi_entry contains the hardware index and the linux irq
number. The hardware index is filled by the driver and the linux irq
number by the core code when the allocation was successful.
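
This presumably refers to the legacy struct msix_entry based PCI
interface. As a reminder of how that looks (struct msix_entry and
pci_enable_msix_exact() are existing PCI APIs; the indices and names
around them are invented for the example):

    struct msix_entry entries[] = {
            { .entry = 0 },         /* hardware index, filled by the driver */
            { .entry = 4 },
    };

    ret = pci_enable_msix_exact(pdev, entries, ARRAY_SIZE(entries));
    if (!ret)
            /* .vector was filled in by the core on success */
            ret = request_irq(entries[1].vector, queue_handler, 0,
                              "mydev-queue", q);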

Yes, it's awful from an interface point of view, but that has nothing to
do with the underlying infrastructure and the problem at hand.

I'm fine with using the info struct I described above as the reference
cookie.

>  - We do not have device->msi, it is held in the driver instead.

No. First of all drivers have no business with that, really.

It's infrastructure which the driver utilizes and that infrastructure
manages the data it requires and uses it within the infrastructure, in
this case irqdomains and the associated irq chips.

Aside of that. Even if it would be a good idea, then someone has to
rewrite the PCI/MSI backends of powerpc, mips-octeon, s390, xen and some
more first. Along with the MSI and the PCI/MSI core.

Then create either nasty wrappers around the existing interfaces which
create those magic tables automatically and store them where? In struct
device->msi perhaps? :)

Alternatively cleanup ~360 usage sites throughout drivers and add the
same silly boilerplate code all over the place, which is wrong to begin
with. We want less boilerplate not more.

I'm sure you're volunteering or can provide the engineering capacity to
do that. Depending on the capacity this might be done in 1-10 years from
now.

>dev->irqdomain is always the bus/platform originated irqdomain of
>the physical device.

This is a guarantee for subtle bugs and I'm not even remotely interested
going there. See also below.

>Thus we break the 1:1 of the device and irqdomain. A device can
>spawn many msi_tables, but only one would be on the dev->irqdomain

Why are you trying brute force to push things into device drivers?
That's really the wrong direction. We want common infrastructure to be
managed by generic code. And all of this is about common infrastructure.

The only device specific thing which is required due to IMS is the
irqchip plus its data. Whether that irq chip implementation is part of
the device 

Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-03 Thread Jason Gunthorpe via iommu
On Fri, Dec 03, 2021 at 04:07:58PM +0100, Thomas Gleixner wrote:
> Jason,
> 
> On Thu, Dec 02 2021 at 20:37, Jason Gunthorpe wrote:
> > On Thu, Dec 02, 2021 at 11:31:11PM +0100, Thomas Gleixner wrote:
> >> >> Of course we can store them in pci_dev.dev.msi.data.store. Either with a
> >> >> dedicated xarray or by partitioning the xarray space. Both have their
> >> >> pro and cons.
> >> >
> >> > This decision seems to drive the question of how many 'struct devices'
> >> > do we need, and where do we get them..
> >> 
> >> Not really. There is nothing what enforces to make the MSI irqdomain
> >> storage strictly hang off struct device. There has to be a connection to
> >> a struct device in some way obviously to make IOMMU happy.
> >
> > Yes, I thought this too, OK we agree
> 
> One thing which just occured to me and I missed so far is the
> association of the irqdomain.

Oh? I thought this was part of what you were thinking when talking
about using the mdev struct device.

> but I really need to look at all of the MSI infrastructure code whether
> it makes assumptions about this:
> 
>msi_desc->dev->irqdomain
> 
> being the correct one. Which would obviously just be correct when we'd
> have a per queue struct device :)

Right

I'd half expect the msi_desc to gain a pointer to the 'lfunc'

Then msi_desc->dev becomes msi_desc->origin_physical_device and is
only used by IOMMU code to figure out MSI remapping/etc. Maybe that
allows solve your aliasing problem (Is this the RID aliasing we see
with iommu_groups too?)

> > Lets imagine a simple probe function for a simple timer device. It has
> > no cdev, vfio, queues, etc. However, the device only supports IMS. No
> > MSI, MSI-X, nothing else.
> >
> > What does the probe function look like to get to request_irq()?
> 
> The timer device would be represented by what? A struct device?

Lets say it is a pci_device and probe is a pci_driver probe function

> If so then something needs to set device->irqdomain and then you can
> just do:
> 
>  msi_irq_domain_alloc_irqs(dev->irqdomain, dev, ...);

Okay, but that doesn't work if it is a PCI device and dev->irqdomain
is already the PCI MSI and INTx irqdomain?

> To do that, I need to understand the bigger picture and the intended
> usage because otherwise we end up with an utter mess.

Sure
 
> >> The idea is to have a common representation for these type of things
> >> which allows:
> >> 
> >>  1) Have common code for exposing queues to VFIO, cdev, sysfs...
> >> 
> >> You still need myqueue specific code, but the common stuff which is
> >> in struct logical_function can be generic and device independent.
> >
> > I will quote you: "Seriously, you cannot make something uniform which
> > is by definition non-uniform." :)
> >
> > We will find there is no common stuff here, we did this exercise
> > already when we designed struct auxiliary_device, and came up
> > empty.
> 
> Really?

Yes.. It was a long drawn out affair, and auxiliary_device is just a
simple container

> >>  2) Having the MSI storage per logical function (queue) allows to have
> >> a queue relative 0 based MSI index space.
> >
> > Can you explain a bit what you see 0 meaning in this idx numbering?
> >
> > I also don't understand what 'queue relative' means?
> >
> >> The actual index in the physical table (think IMS) would be held in
> >> the msi descriptor itself.
> >
> > This means that a device like IDXD would store the physical IMS table
> > entry # in the msi descriptor? What is idx then?
> >
> > For a device like IDXD with a simple linear table, how does the driver
> > request a specific entry in the IMS table? eg maybe IMS table entry #0
> > is special like we often see in MSI?
> 
> If there is a hardwired entry then this hardwired entry belongs to the
> controlblock (status, error or whatever), but certainly not to a
> dynamically allocatable queue as that would defeat the whole purpose.
> 
> That hardwired entry could surely exist in a IDXD type setup where the
> message storage is in an defined array on the device.

Agree - so how does it get accessed?

> But with the use case you described where you store the message at some
> place in per queue system memory, the MSI index is not relevant at all,
> because there is no table. So it's completely meaningless for the device
> and just useful for finding the related MSI descriptor again.

Yes

What I don't see is how exactly we make this situation work. The big
picture sequence would be:

 - Device allocates a queue and gets a queue handle
 - Device finds a free msi_desc and associates that msi_desc with the
   queue handle
 - Device specific irq_chip uses the msi_desc to get the queue handle
   and programs the addr/data
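
A minimal sketch of that last step, assuming the queue handle is carried
in the irq chip data and the device reads the message from per-queue
system memory (all names except irq_chip, irq_data and msi_msg are
invented):

    static void myqueue_write_msi_msg(struct irq_data *d, struct msi_msg *msg)
    {
            struct myqueue *q = irq_data_get_irq_chip_data(d);

            /* The device fetches the message from the queue context memory */
            q->ctx->msi_addr = ((u64)msg->address_hi << 32) | msg->address_lo;
            q->ctx->msi_data = msg->data;
    }

    static struct irq_chip myqueue_ims_chip = {
            .name              = "myqueue-ims",
            .irq_write_msi_msg = myqueue_write_msi_msg,
            /* mask/unmask as the device requires */
    };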

In this regard the msi_desc is just an artifact of the API, the real
work is in the irq_chip.

> Or has each queue and controlblock and whatever access to a shared large
> array where the messages are stored and the indices are handed out to
> the queues and 

Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-03 Thread Thomas Gleixner
Jason,

On Thu, Dec 02 2021 at 20:37, Jason Gunthorpe wrote:
> On Thu, Dec 02, 2021 at 11:31:11PM +0100, Thomas Gleixner wrote:
>> >> Of course we can store them in pci_dev.dev.msi.data.store. Either with a
>> >> dedicated xarray or by partitioning the xarray space. Both have their
>> >> pro and cons.
>> >
>> > This decision seems to drive the question of how many 'struct devices'
>> > do we need, and where do we get them..
>> 
>> Not really. There is nothing what enforces to make the MSI irqdomain
>> storage strictly hang off struct device. There has to be a connection to
>> a struct device in some way obviously to make IOMMU happy.
>
> Yes, I thought this too, OK we agree

One thing which just occured to me and I missed so far is the
association of the irqdomain.

Right now we know for each device which irqdomain it belongs to. That's
part of device discovery (PCI, device tree or whatever) which sets up

 device->irqdomain

That obviously cannot work for that device specific IMS domain. Why?

Because the PCIe device provides MSI[x] which obviously requires that
device->irqdomain is pointing to the PCI/MSI irqdomain. Otherwise the
PCI/MSI allocation mechanism won't work.

Sure each driver can have

 mystruct->ims_irqdomain

and then do the allocation via

msi_irq_domain_alloc(mystruct->ims_irqdomain)

but I really need to look at all of the MSI infrastructure code whether
it makes assumptions about this:

   msi_desc->dev->irqdomain

being the correct one. Which would obviously just be correct when we'd
have a per queue struct device :)
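
For illustration, the ims_irqdomain variant above could be sketched like
this. msi_create_irq_domain() and msi_domain_alloc_irqs() are existing
core interfaces; mystruct, mydev_ims_info and the way the parent domain
is obtained are assumptions for the example:

    mystruct->ims_irqdomain = msi_create_irq_domain(fwnode, &mydev_ims_info,
                                                    parent_irqdomain);

    /* allocate nvec interrupts from the IMS domain, not from dev->irqdomain */
    ret = msi_domain_alloc_irqs(mystruct->ims_irqdomain, &pdev->dev, nvec);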

>> Just badly named because the table itself is where the resulting message
>> is stored, which is composed with the help of the relevant MSI
>> descriptor. See above.
>
> I picked the name because it looks like it will primarily contain an
> xarray and the API you suggested to query it is idx based. To me that
> is a table. A table of msi_desc storage to be specific.
>
> It seems we have some agreement here as your lfunc also included the
> same xarray and uses an idx as part of the API, right?

It's a per lfunc xarray, yes.

>> We really should not try to make up an artificial table representation
>> for something which does not necessarily have a table at all, i.e. the
>> devices you talk about which store the message in queue specific system
>> memory. Pretending that this is a table is just silly.
>
> Now I'm a bit confused, what is the idx in the lfunc? 
>
> I think this is language again, I would call idx an artificial table
> representation?

Ok. I prefer the term indexed storage, but yeah.

>> I really meant a container like this:
>> 
>> struct logical_function {
>> /* Pointer to the physical device */
>> struct device    *phys_device;
>> /* MSI descriptor storage */
>>  struct msi_data msi;
>
> Up to here is basically what I thought the "message table" would
> contain. Possibly a pointer to the iommu_domain too?
>
>> /* The queue number */
>> unsigned int queue_nr;
>> /* Add more information which is common to these things */
>
> Not sure why do we need this?
>
> Lets imagine a simple probe function for a simple timer device. It has
> no cdev, vfio, queues, etc. However, the device only supports IMS. No
> MSI, MSI-X, nothing else.
>
> What does the probe function look like to get to request_irq()?

The timer device would be represented by what? A struct device?

If so then something needs to set device->irqdomain and then you can
just do:

 msi_irq_domain_alloc_irqs(dev->irqdomain, dev, ...);

> Why does this simple driver create something called a 'logical
> function' to access its only interrupt?

It does not have to when it's struct device based.

> I think you have the right idea here, just forget about VFIO, the IDXD
> use case, all of it. Provide a way to use IMS cleanly and concurrently
> with MSI.
>
> Do that and everything else should have sane solutions too.

To do that, I need to understand the bigger picture and the intended
usage because otherwise we end up with an utter mess.

>> The idea is to have a common representation for these type of things
>> which allows:
>> 
>>  1) Have common code for exposing queues to VFIO, cdev, sysfs...
>> 
>> You still need myqueue specific code, but the common stuff which is
>> in struct logical_function can be generic and device independent.
>
> I will quote you: "Seriously, you cannot make something uniform which
> is by definition non-uniform." :)
>
> We will find there is no common stuff here, we did this exercise
> already when we designed struct auxiliary_device, and came up
> empty.

Really?

>>  2) Having the MSI storage per logical function (queue) allows to have
>> a queue relative 0 based MSI index space.
>
> Can you explain a bit what you see 0 meaning in this idx numbering?
>
> I also don't understand what 'queue relative' means?
>
>> The actual index in the physical table (think IMS) 

Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-02 Thread Jason Gunthorpe via iommu
On Thu, Dec 02, 2021 at 11:31:11PM +0100, Thomas Gleixner wrote:

> The software representation aka struct msi_desc is a different
> story. That's what we are debating.

Okay, I did mean msi_desc storage, so we are talking about the same things

> >> Of course we can store them in pci_dev.dev.msi.data.store. Either with a
> >> dedicated xarray or by partitioning the xarray space. Both have their
> >> pro and cons.
> >
> > This decision seems to drive the question of how many 'struct devices'
> > do we need, and where do we get them..
> 
> Not really. There is nothing what enforces to make the MSI irqdomain
> storage strictly hang off struct device. There has to be a connection to
> a struct device in some way obviously to make IOMMU happy.

Yes, I thought this too, OK we agree

> Just badly named because the table itself is where the resulting message
> is stored, which is composed with the help of the relevant MSI
> descriptor. See above.

I picked the name because it looks like it will primarily contain an
xarray and the API you suggested to query it is idx based. To me that
is a table. A table of msi_desc storage to be specific.

It seems we have some agreement here as your lfunc also included the
same xarray and uses an idx as part of the API, right?

> We really should not try to make up an artifical table representation
> for something which does not necessarily have a table at all, i.e. the
> devices you talk about which store the message in queue specific system
> memory. Pretending that this is a table is just silly.

Now I'm a bit confused, what is the idx in the lfunc? 

I think this is language again, I would call idx an artificial table
representation?

> Also I disagree that this has to be tied to a PCI specific interface,
> except for creating a PCI specific wrapper for it to not make a driver
> developer have to write '&pdev->dev', which is the very least of our
> problems.

Agreed, just trying to be illustrative at a higher level
 
> Aside of that 'my_irq_chip' does not cut it at all because of the way
> how the resulting messages are stored. IDXD has IOMEM storage and a
> storage space limitation while your device uses system memory storage
> and has other limitations, i.e. system memory and the number of queues
> the device can provide.

Okay, so the device must setup the domain because it must provide
information during setup. Makes sense

> Not really. I was really reasoning about an abstract representation for
> a functional queue, which is more than just a queue allocated from the
> PF or VF device.
> 
> I really meant a container like this:
> 
> struct logical_function {
> /* Pointer to the physical device */
> struct device *phys_device;
> /* MSI descriptor storage */
>   struct msi_data msi;

Up to here is basically what I thought the "message table" would
contain. Possibly a pointer to the iommu_domain too?

> /* The queue number */
> unsigned int  queue_nr;
> /* Add more information which is common to these things */

Not sure why do we need this?

Lets imagine a simple probe function for a simple timer device. It has
no cdev, vfio, queues, etc. However, the device only supports IMS. No
MSI, MSI-X, nothing else.

What does the probe function look like to get to request_irq()?

Why does this simple driver create something called a 'logical
function' to access its only interrupt?

I think you have the right idea here, just forget about VFIO, the IDXD
use case, all of it. Provide a way to use IMS cleanly and concurrently
with MSI.

Do that and everything else should have sane solutions too.

> The idea is to have a common representation for these type of things
> which allows:
> 
>  1) Have common code for exposing queues to VFIO, cdev, sysfs...
> 
> You still need myqueue specific code, but the common stuff which is
> in struct logical_function can be generic and device independent.

I will quote you: "Seriously, you cannot make something uniform which
is by definition non-uniform." :)

We will find there is no common stuff here, we did this exercise
already when we designed struct auxiliary_device, and came up
empty. 

>  2) Having the MSI storage per logical function (queue) allows to have
> a queue relative 0 based MSI index space.

Can you explain a bit what you see 0 meaning in this idx numbering?

I also don't understand what 'queue relative' means?

> The actual index in the physical table (think IMS) would be held in
> the msi descriptor itself.

This means that a device like IDXD would store the physical IMS table
entry # in the msi descriptor? What is idx then?

For a device like IDXD with a simple linear table, how does the driver
request a specific entry in the IMS table? eg maybe IMS table entry #0
is special like we often see in MSI?

> msi_get_virq(>lfunc.msi, idx = 0)
>
> v.s.
>
> idx = myqueue->msidx[0];
> msi_get_virq(pcidev->dev, idx);

Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-02 Thread Thomas Gleixner
Jason,

On Thu, Dec 02 2021 at 16:00, Jason Gunthorpe wrote:
> On Thu, Dec 02, 2021 at 08:25:48PM +0100, Thomas Gleixner wrote:
>> We seem to have a serious problem of terminology and the understanding
>> of topology which is why we continue to talk past each other forever.
>
> I think I understand and agree with everything you said below.

Good!

> The point we diverge is where to put the vector storage:

Kinda. The vector, i.e. message storage is either:

  - MSI entry in the PCI config space
  - MSI-X table in the PCI config space
  - Device specific IMS storage

The software representation aka struct msi_desc is a different
story. That's what we are debating.

>> Of course we can store them in pci_dev.dev.msi.data.store. Either with a
>> dedicated xarray or by partitioning the xarray space. Both have their
>> pro and cons.
>
> This decision seems to drive the question of how many 'struct devices'
> do we need, and where do we get them..

Not really. There is nothing what enforces to make the MSI irqdomain
storage strictly hang off struct device. There has to be a connection to
a struct device in some way obviously to make IOMMU happy.

>> Such a logical function would be the entity to hand out for VFIO or
>> cdev.
>
> What is a logical function, concretely?

That's a name I came up with for an abstract representation of such a
queue container. I came up with that as an obvious consequence of my
previous reasoning about PF -> VF -> XF.

> Does it have struct device?

It does not necessarily have to.

> Can I instead suggest a name like 'message interrupt table' ?

Well yes, but that's not what I meant. See below.

> Ie a device has two linearly indexed message interrupt tables - the
> PCI SIG defined MSI/MSI-X one created by the PCI core and the IMS one
> created by the driver.
>
> Both start at 0 index and they have different irq_domains.
>
> Instead of asking the driver to create a domain we ask the driver to
> create a new 'message interrupt table'. The driver provides the
> irq_chip to program the messages and the pci_device. The core code
> manages the irq domain setup.
>
> Using what you say below:
>
>> If this is not split out, then every driver and wrapper has to come up
>> with it's own representation of this instead of being able to do:
>> 
>>  request_irq(msi_get_virq(lfunc, idx=0), handler0, ...);
>>  request_irq(msi_get_virq(lfunc, idx=1), handler1, ...);
>
> We could say:
>   msi_get_virq(device.pci_msi_table, index=0)
>
> Is the 0th PCI SIG MSI vector
>
> Something like:
>
>  ims_table = pci_create_msi_table(pci_dev, my_irq_chip,..)
>  msi_get_virq(ims_table, index=0)

Which is pretty much a wrapper around two different irqdomains for the
device and either partitioned index space in the xarray or two xarrays.
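
For illustration, "partitioning the xarray space" could look roughly like
this; the base offsets are made up, the storage location follows the
pci_dev.dev.msi.data.store naming used earlier in the thread:

    #define MSI_XA_BASE_MSIX    0
    #define MSI_XA_BASE_IMS     0x10000

    /* MSI[X] descriptors live at their hardware index, IMS descriptors
     * at an offset, all in one per-device xarray */
    xa_store(&dev->msi.data.store, MSI_XA_BASE_MSIX + hwidx, desc, GFP_KERNEL);
    xa_store(&dev->msi.data.store, MSI_XA_BASE_IMS  + hwidx, desc, GFP_KERNEL);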

Just badly named because the table itself is where the resulting message
is stored, which is composed with the help of the relevant MSI
descriptor. See above.

We really should not try to make up an artificial table representation
for something which does not necessarily have a table at all, i.e. the
devices you talk about which store the message in queue specific system
memory. Pretending that this is a table is just silly.

Also I disagree that this has to be tied to a PCI specific interface,
except for creating a PCI specific wrapper for it to not make a driver
developer have to write '&pdev->dev', which is the very least of our
problems.

IMS as a technical concept is absolutely not PCI specific at all and not
invented by PCI/SIG. It's a marketing brandname for something which
existed way before they thought about it: Message signaled interrupts.

Aside of that 'my_irq_chip' does not cut it at all because of the way
how the resulting messages are stored. IDXD has IOMEM storage and a
storage space limitation while your device uses system memory storage
and has other limitations, i.e. system memory and the number of queues
the device can provide.

An irqchip is a just set of functions to talk to hardware either
directly or via some indirect transport (I2C, SPI, MLX queue management
magic...). These functions require irqdomain and/or device specific
information to function.

Trying to create a universal pci_create_foo() wrapper around this is
going to be like the 13th Herculean task.

Seriously, you cannot make something uniform which is by definition
non-uniform.

Let's not even try to pretend that it is possible.

> Is the 0th IMS vector
>
> Is it close to what you are thinking with lfunc?

Not really. I was really reasoning about an abstract representation for
a functional queue, which is more than just a queue allocated from the
PF or VF device.

I really meant a container like this:

struct logical_function {
/* Pointer to the physical device */
struct device   *phys_device;
/* MSI descriptor storage */
struct msi_data msi;
/* The queue number */
unsigned int    queue_nr;
/* Add more information 

Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-02 Thread Jason Gunthorpe via iommu
On Thu, Dec 02, 2021 at 08:25:48PM +0100, Thomas Gleixner wrote:
> Jason,
> 
> On Thu, Dec 02 2021 at 09:55, Jason Gunthorpe wrote:
> > On Thu, Dec 02, 2021 at 01:01:42AM +0100, Thomas Gleixner wrote:
> >> On Wed, Dec 01 2021 at 21:21, Thomas Gleixner wrote:
> >> > On Wed, Dec 01 2021 at 14:14, Jason Gunthorpe wrote:
> >> > Which in turn is consistent all over the place and does not require any
> >> > special case for anything. Neither for interrupts nor for anything else.
> >> 
> >> that said, feel free to tell me that I'm getting it all wrong.
> >> 
> >> The reason I'm harping on this is that we are creating ABIs on several
> >> ends and we all know that getting that wrong is a major pain.
> >
> > I don't really like coupling the method to fetch IRQs with needing
> > special struct devices. Struct devices have a sysfs presence and it is
> > not always appropriate to create sysfs stuff just to allocate some
> > IRQs.
> >
> > A queue is simply not a device, it doesn't make any sense. A queue is
> > more like a socket().
> 
> Let's put the term 'device' aside for a bit please.
> 
> > That said, we often have enough struct devices floating about to make
> > this work. Between netdev/ib_device/aux device/mdev we can use them to
> > do this.
> >
> > I think it is conceptual nonsense to attach an IMS IRQ domain to a
> > netdev or a cdev, but it will solve this problem.
> 
> The IMS irqdomain is _NOT_ part of the netdev or cdev or whatever. I
> explained that several times now.
> 
> We seem to have a serious problem of terminology and the understanding
> of topology which is why we continue to talk past each other forever.

I think I understand and agree with everything you said below.

The point we diverge is where to put the vector storage:
 
> Of course we can store them in pci_dev.dev.msi.data.store. Either with a
> dedicated xarray or by partitioning the xarray space. Both have their
> pro and cons.

This decision seems to drive the question of how many 'struct devices'
do we need, and where do we get them..

> But what I really struggle with is how this is further represented when
> the queues are allocated for VFIO, cdev, whatever usage.

Yes, this seems to be the primary question

> The fact that the irqdomain is instantiated by the device driver does
> not make it any different. As explained above it's just an
> implementation detail which makes it possible to handle the device
> specific message storage requirement in a sane way. The actual interrupt
> resources come still from the underlying irqdomains as for PCI/MSI.

Yes! This is not under debate

> Now I was looking for a generic representation of such a container and
> my initial reaction was to bind it to a struct device, which also makes
> it trivial to store these MSI descriptors in that struct device.

Right, I've been trying to argue around just this choice.
 
> I can understand your resistance against that to some extent, but I
> think we really have to come up with a proper abstract representation
> for these.

Ok

> Such a logical function would be the entity to hand out for VFIO or
> cdev.

What is a logical function, concretely? 

Does it have struct device?

Can I instead suggest a name like 'message interrupt table' ?

Ie a device has two linearly indexed message interrupt tables - the
PCI SIG defined MSI/MSI-X one created by the PCI core and the IMS one
created by the driver.

Both start at 0 index and they have different irq_domains.

Instead of asking the driver to create a domain we ask the driver to
create a new 'message interrupt table'. The driver provides the
irq_chip to program the messages and the pci_device. The core code
manages the irq domain setup.

Using what you say below:

> If this is not split out, then every driver and wrapper has to come up
> with it's own representation of this instead of being able to do:
> 
>  request_irq(msi_get_virq(lfunc, idx=0), handler0, ...);
>  request_irq(msi_get_virq(lfunc, idx=1), handler1, ...);

We could say:
  msi_get_virq(device.pci_msi_table, index=0)

Is the 0th PCI SIG MSI vector

Something like:

 ims_table = pci_create_msi_table(pci_dev, my_irq_chip,..)
 msi_get_virq(ims_table, index=0)

Is the 0th IMS vector

Is it close to what you are thinking with lfunc?

Regards,
Jason


Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-02 Thread Thomas Gleixner
Jason,

On Thu, Dec 02 2021 at 09:55, Jason Gunthorpe wrote:
> On Thu, Dec 02, 2021 at 01:01:42AM +0100, Thomas Gleixner wrote:
>> On Wed, Dec 01 2021 at 21:21, Thomas Gleixner wrote:
>> > On Wed, Dec 01 2021 at 14:14, Jason Gunthorpe wrote:
>> > Which in turn is consistent all over the place and does not require any
>> > special case for anything. Neither for interrupts nor for anything else.
>> 
>> that said, feel free to tell me that I'm getting it all wrong.
>> 
>> The reason I'm harping on this is that we are creating ABIs on several
>> ends and we all know that getting that wrong is a major pain.
>
> I don't really like coupling the method to fetch IRQs with needing
> special struct devices. Struct devices have a sysfs presence and it is
> not always appropriate to create sysfs stuff just to allocate some
> IRQs.
>
> A queue is simply not a device, it doesn't make any sense. A queue is
> more like a socket().

Let's put the term 'device' aside for a bit please.

> That said, we often have enough struct devices floating about to make
> this work. Between netdev/ib_device/aux device/mdev we can use them to
> do this.
>
> I think it is conceptual nonsense to attach an IMS IRQ domain to a
> netdev or a cdev, but it will solve this problem.

The IMS irqdomain is _NOT_ part of the netdev or cdev or whatever. I
explained that several times now.

We seem to have a serious problem of terminology and the understanding
of topology which is why we continue to talk past each other forever.

Let's take a big step back and clarify these things first.

An irq domain is an entity which has two parts:

   1) The resource management part
   
   2) An irqchip which knows how to access the actual hardware
  implementation for the various operations related to irq chips
  (mask, unmask, ack, eoi, )

irq domains are used for uniform resource spaces which have a uniform
hardware interrupt chip. Modern computer systems have several resource
levels which form a hierarchy of several irqdomain levels.

The root of that hierarchy is at the CPU level:

 |-----------|
 | CPU | BUS |
 |     |-----|
 |     | IRQ |
 |-----------|

The CPU exposes the bus and a mechanism to raise interrupts in the CPU
from external components. Whether those interrupts are raised by wire or
by writing a message to a given address is an implementation detail.

So the root IRQ domain is the one which handles the CPU level interrupt
controller and manages the associated resources: The per CPU vector
space on x86, the global vector space on ARM64.

If there's an IOMMU in the system them the IOMMU is the next level
interrupt controller:

1) It manages the translation table resources

2) Provides an interrupt chip which knows how to access the
   table entries

Of course there can be several IOMMUs depending on the system topology,
but from an irq domain view they are at the same level.

On top of that we have on x86 the next level irq domains:

   IO/APIC, HPET and PCI/MSI

which again do resource management and provide an irqchip which
handles the underlying hardware implementation.

So the full hierarchy so far looks like this on X86

   VECTOR --|
            |-- IOMMU --|
                        |-- IO/APIC
                        |-- HPET
                        |-- PCI/MSI

So allocating an interrupt on the outer level which is exposed to
devices requires allocation through the hierarchy down to the root
domain (VECTOR). The reason why we have this hierarchy is that this
makes it trivial to support systems with and without IOMMU because on a
system w/o IOMMU the outer domains have the vector domain as their
parent. ARM(64) has even deeper hierarchies, but operates on the same
principle.
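
As a rough illustration of how that stacking is expressed in code on x86:
pci_msi_create_irq_domain() and x86_vector_domain are existing kernel
symbols, but the info structure, the fwnode handling and the way the
remapping domain is obtained are elided or schematic here:

    /* Schematic only. With an IOMMU the remapping domain sits between
     * the vector domain and the PCI/MSI domain; without one the PCI/MSI
     * domain is parented directly to the vector domain.
     */
    struct irq_domain *parent = ir_domain ? ir_domain : x86_vector_domain;

    pci_domain = pci_msi_create_irq_domain(fwnode, &pci_msi_domain_info,
                                           parent);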

Let's look at the PCI/MSI domain more detailed:

  It's either one irqdomain per system or one irqdomain per IOMMU domain

  The PCI/MSI irqdomain does not really do resource allocations, it's
  all delegated down the hierarchy.

  It provides also an interrupt chip which knows to access the MSI[X]
  entries and can handle masking etc.

Now you might ask how the MSI descriptors come into play. The MSI
descriptors are allocated by the PCI/MSI interface functions in order to
enable the PCI/MSI irqdomain to actually allocate resources.

They could be allocated as part of the irqdomain allocation mechanism
itself. That's what I do now for simple irq domains which only have a
linear MSI descriptor space where the only information in the descriptor
is the index and not all the PCI/MSI tidbits.

It's kinda blurry, but that's mostly due to the fact that we still have
to support the legacy PCI/MSI implementations on architectures which do
not use hierarchical irq domains.

The MSI descriptors are stored in device->msi.data.store. That makes
sense because they are allocated on behalf of that device. But there is
no requirement to store them there. We could just store a cookie in the
device and have some central storage. It's an 

Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-02 Thread Jason Gunthorpe via iommu
On Thu, Dec 02, 2021 at 03:23:38PM +0100, Greg Kroah-Hartman wrote:
> On Thu, Dec 02, 2021 at 09:55:02AM -0400, Jason Gunthorpe wrote:
> > Further, there is no reason why IMS should be reserved exclusively for
> > VFIO! Why shouldn't the cdev be able to use IMS vectors too? It is
> > just a feature of the PCI device like MSI. If the queue has a PASID it
> > can use IDXD's IMS.
> 
> No, sorry, but a cdev is not for anything resembling any real resource
> at all.

My point is that when the user asks the driver to allocate a queue
through a cdev ioctl it should be able to get the queue attached to an
IMS, today it can only get a queue attached to a MSI.

> It is ONLY for the /dev/NODE interface that controls the character
> device api to userspace.  The struct device involved in it is ONLY for
> that, nothing else.  Any attempt to add things to it will be gleefully
> rejected.

I agree with you!
 
> > If we really need a 2nd struct device to turn on IMS then, I'd suggest
> > picking the cdev, as it keeps IMS and its allocator inside the IDXD
> > PCIe driver and not in the VFIO world.
> 
> No!  Again, a cdev is to control the lifespan/lifecycle of the /dev/NODE
> only.  Anything other than that is not ok to do at all.

Said the same thing in a prior email - which is why I think the only
logical choice here is to make IMS work on the pci_device

FWIW I feel the same way about the VFIO mdev - its *ONLY* purpose is
to control the lifecycle and we are close to stripping away all the
other abuses using it for other things.

Jason


Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-02 Thread Greg Kroah-Hartman
On Thu, Dec 02, 2021 at 09:55:02AM -0400, Jason Gunthorpe wrote:
> Further, there is no reason why IMS should be reserved exclusively for
> VFIO! Why shouldn't the cdev be able to use IMS vectors too? It is
> just a feature of the PCI device like MSI. If the queue has a PASID it
> can use IDXD's IMS.

No, sorry, but a cdev is not for anything resembling any real resource
at all.

It is ONLY for the /dev/NODE interface that controls the character
device api to userspace.  The struct device involved in it is ONLY for
that, nothing else.  Any attempt to add things to it will be gleefully
rejected.

The cdev api today (in the kernel) exposes too much mess and there's at
least 4 or 5 different ways to use it.  It's on my long-term TODO list
to fix this up to not even allow abuses like you are considering here,
so please don't do that.

> If we really need a 2nd struct device to turn on IMS then, I'd suggest
> picking the cdev, as it keeps IMS and its allocator inside the IDXD
> PCIe driver and not in the VFIO world.

No!  Again, a cdev is to control the lifespan/lifecycle of the /dev/NODE
only.  Anything other than that is not ok to do at all.

thanks,

greg k-h


Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-02 Thread Jason Gunthorpe via iommu
On Thu, Dec 02, 2021 at 01:01:42AM +0100, Thomas Gleixner wrote:
> Jason,
> 
> On Wed, Dec 01 2021 at 21:21, Thomas Gleixner wrote:
> > On Wed, Dec 01 2021 at 14:14, Jason Gunthorpe wrote:
> > Which in turn is consistent all over the place and does not require any
> > special case for anything. Neither for interrupts nor for anything else.
> 
> that said, feel free to tell me that I'm getting it all wrong.
> 
> The reason I'm harping on this is that we are creating ABIs on several
> ends and we all know that getting that wrong is a major pain.

I don't really like coupling the method to fetch IRQs with needing
special struct devices. Struct devices have a sysfs presence and it is
not always appropriate to create sysfs stuff just to allocate some
IRQs.

A queue is simply not a device, it doesn't make any sense. A queue is
more like a socket().

That said, we often have enough struct devices floating about to make
this work. Between netdev/ib_device/aux device/mdev we can use them to
do this.

I think it is conceptual nonsense to attach an IMS IRQ domain to a
netdev or a cdev, but it will solve this problem.

However, I would really prefer that there be no uAPI here. 

I looked at the msi_irqs/ stuff and could not find a user. Debian code
search found nothing, Greg redid the entire uAPI in 2013
(1c51b50c2995), so I think it is just dead. (maybe delete it?)

So lets not create any sysfs for IMS with the msi_irqs/ dir.  We can
revise the in-kenel mechanism someday if it turns out to be a problem.

As to your question:

> So again, why would we want to make software managed subdevices look
> exactly the opposite way like hardware/firmware managed subdevices?

That isn't my thinking at all.

Something like mlx5 has a hw/fw managed VF and there is an RPC call
from driver to device to 'create a queue'. The device has no hard
division inside along a VF, the device simply checks resource limits
and security properties and returns one of the >>10k queues. Again
think more like socket() than a hard partitioning.

It is the same as I suggest for IDXD & VFIO where the PCIe IDXD layer
takes the place of hw/fw and has a 'create a queue' API call for the
VFIO layer to use. Instead of using a VF as the security identity, it
uses a PASID.

This is a logical partitioning and it matches the partitioning we'd have
if it was a real device.

> So if a queue is represented as a subdevice, then VFIO can just build
> a wrapper around that subdevice.

I think that oversimplifies the picture.

IDXD is a multi queue device that uses PASID as a security context. It
has a cdev /dev/idxd interface where userspace can use an IOCTL and
get a queue to use. The queue is bound to a PASID that is linked to an
IO Page table that mirrors the process page table. Userspace operates
the queue and does whatever with it.

VFIO is just another interface that should logically be considered a
peer of the cdev. Using VFIO userspace can get a queue, bind it to a
PASID and operate it. The primary difference between the cdev and the
VFIO mdev is user programming API - VFIO uses IOCTLs that carry
emulated MMIO read/write operations.

I consider *neither* to be a subdevice. They are just a user API,
however convoluted, to create a queue, associate it with a PASID
security context and allow userspace to operate the queue. It is much
closer to socket() than a PCI VF subdevice.

Internally the driver should be built so that the PCI driver is doing
all the device operation and the two uAPI layers are only concerned
with translating their respective uAPIs to the internal device API.

Further, there is no reason why IMS should be reserved exclusively for
VFIO! Why shouldn't the cdev be able to use IMS vectors too? It is
just a feature of the PCI device like MSI. If the queue has a PASID it
can use IDXD's IMS.
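
Purely to illustrate that split (every name below is invented for the
example, this is not the actual IDXD driver API), both uAPI layers would
call into the same primitive owned by the PCIe driver:

    /* Hypothetical internal primitive of the IDXD PCIe driver */
    struct idxd_wq *idxd_alloc_wq(struct idxd_device *idxd, u32 pasid,
                                  bool want_ims);

    /* cdev ioctl path and VFIO/mdev path both end up here */
    wq = idxd_alloc_wq(idxd, pasid, true);   /* queue + IMS vector */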

If we really need a 2nd struct device to turn on IMS then, I'd suggest
picking the cdev, as it keeps IMS and its allocator inside the IDXD
PCIe driver and not in the VFIO world.

Regards,
Jason


Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-01 Thread Thomas Gleixner
Jason,

On Wed, Dec 01 2021 at 21:21, Thomas Gleixner wrote:
> On Wed, Dec 01 2021 at 14:14, Jason Gunthorpe wrote:
> Which in turn is consistent all over the place and does not require any
> special case for anything. Neither for interrupts nor for anything else.

that said, feel free to tell me that I'm getting it all wrong.

The reason I'm harping on this is that we are creating ABIs on several
ends and we all know that getting that wrong is a major pain.

Thanks,

tglx


Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-01 Thread Thomas Gleixner
dave,

On Wed, Dec 01 2021 at 15:53, Dave Jiang wrote:
> On 12/1/2021 3:03 PM, Thomas Gleixner wrote:
>> This still depends on how this overall discussion about representation
>> of all of this stuff is resolved.
>>
 What needs a subdevice to expose?
>> Can you answer that too please?
>
> Sorry. So the initial version of the IDXD sub-device is represented by a
> single queue. It needs a command irq (emulated) and a completion irq
> that is backed by a device vector (currently IMS).

thanks for clarification! Let me think about it some more.

   tglx


Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-01 Thread Dave Jiang



On 12/1/2021 3:03 PM, Thomas Gleixner wrote:

On Wed, Dec 01 2021 at 14:49, Dave Jiang wrote:

On 12/1/2021 2:44 PM, Thomas Gleixner wrote:

How that is backed on the host does not really matter. You can expose
MSI-X to the guest with a INTx backing as well.

I'm still failing to see the connection between the 9 MSIX vectors and
the 2048 IMS vectors which I assume that this is the limitation of the
physical device, right?

I think I was confused with what you were asking and was thinking you
are saying why can't we just have MSIX on guest backed by the MSIX on
the physical device and thought there would not be enough vectors to
service the many guests. I think I understand what your position is now
with the clarification above.

This still depends on how this overall discussion about representation
of all of this stuff is resolved.


What needs a subdevice to expose?

Can you answer that too please?


Sorry. So the initial version of the IDXD sub-device is represented by a
single queue. It needs a command irq (emulated) and a completion irq
that is backed by a device vector (currently IMS).





Thanks,

 tglx



Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-01 Thread Thomas Gleixner
On Wed, Dec 01 2021 at 14:49, Dave Jiang wrote:
> On 12/1/2021 2:44 PM, Thomas Gleixner wrote:
>> How that is backed on the host does not really matter. You can expose
>> MSI-X to the guest with a INTx backing as well.
>>
>> I'm still failing to see the connection between the 9 MSIX vectors and
>> the 2048 IMS vectors which I assume that this is the limitation of the
>> physical device, right?
>
> I think I was confused with what you were asking and was thinking you 
> are saying why can't we just have MSIX on guest backed by the MSIX on 
> the physical device and thought there would not be enough vectors to 
> service the many guests. I think I understand what your position is now 
> with the clarification above.

This still depends on how this overall discussion about representation
of all of this stuff is resolved.

>> What needs a subdevice to expose?

Can you answer that too please?

Thanks,

tglx


Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-01 Thread Dave Jiang



On 12/1/2021 2:44 PM, Thomas Gleixner wrote:

On Wed, Dec 01 2021 at 14:21, Dave Jiang wrote:

On 12/1/2021 1:25 PM, Thomas Gleixner wrote:

The hardware implementation does not have enough MSIX vectors for
guests. There are only 9 MSIX vectors total (8 for queues) and 2048 IMS
vectors. So if we are to do MSI-X for all of them, then we need to do
the IMS backed MSIX scheme rather than passthrough IMS to guests.

Confused. Are you talking about passing a full IDXD device to the guest
or about passing a carved out subdevice, aka. queue?

I'm talking about carving out a subdevice. I had the impression of you
wanting IMS passed through for all variations. But it sounds like for a
sub-device, you are ok with the implementation of MSIX backed by IMS?

I don't see anything wrong with that. A subdevice is its own entity and
VFIO can choose the most convenient representation of it to the guest
obviously.

How that is backed on the host does not really matter. You can expose
MSI-X to the guest with a INTx backing as well.

I'm still failing to see the connection between the 9 MSIX vectors and
the 2048 IMS vectors which I assume that this is the limitation of the
physical device, right?


I think I was confused with what you were asking and was thinking you 
are saying why can't we just have MSIX on guest backed by the MSIX on 
the physical device and thought there would not be enough vectors to 
service the many guests. I think I understand what your position is now 
with the clarification above.





What needs a subdevice to expose?

Thanks,

 tglx






Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-01 Thread Thomas Gleixner
On Wed, Dec 01 2021 at 14:21, Dave Jiang wrote:
> On 12/1/2021 1:25 PM, Thomas Gleixner wrote:
>>> The hardware implementation does not have enough MSIX vectors for
>>> guests. There are only 9 MSIX vectors total (8 for queues) and 2048 IMS
>>> vectors. So if we are to do MSI-X for all of them, then we need to do
>>> the IMS backed MSIX scheme rather than passthrough IMS to guests.
>> Confused. Are you talking about passing a full IDXD device to the guest
>> or about passing a carved out subdevice, aka. queue?
>
> I'm talking about carving out a subdevice. I had the impression of you 
> wanting IMS passed through for all variations. But it sounds like for a 
> sub-device, you are ok with the implementation of MSIX backed by IMS?

I don't see anything wrong with that. A subdevice is its own entity and
VFIO can choose the most convenient representation of it to the guest
obviously.

How that is backed on the host does not really matter. You can expose
MSI-X to the guest with a INTx backing as well.

I'm still failing to see the connection between the 9 MSIX vectors and
the 2048 IMS vectors which I assume that this is the limitation of the
physical device, right?

What needs a subdevice to expose?

Thanks,

tglx





Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-01 Thread Dave Jiang



On 12/1/2021 1:25 PM, Thomas Gleixner wrote:

On Wed, Dec 01 2021 at 11:47, Dave Jiang wrote:

On 12/1/2021 11:41 AM, Thomas Gleixner wrote:

Hi Thomas. This is actually the IDXD usage for a mediated device passed
to a guest kernel when we plumb the pass through of IMS to the guest
rather than the previous implementation of having a MSIX vector on the
guest backed by IMS.

Which makes a lot of sense.


The control block for the mediated device is emulated and therefore an
emulated MSIX vector will be surfaced as vector 0. However the queues
will be backed by IMS vectors. So we end up needing MSIX and IMS to coexist
running on the guest kernel for the same device.

Why? What's wrong with using straight MSI-X for all of them?

The hardware implementation does not have enough MSIX vectors for
guests. There are only 9 MSIX vectors total (8 for queues) and 2048 IMS
vectors. So if we are to do MSI-X for all of them, then we need to do
the IMS backed MSIX scheme rather than passthrough IMS to guests.

Confused. Are you talking about passing a full IDXD device to the guest
or about passing a carved out subdevice, aka. queue?


I'm talking about carving out a subdevice. I had the impression of you 
wanting IMS passed through for all variations. But it sounds like for a 
sub-device, you are ok with the implementation of MSIX backed by IMS?





Thanks,

 tglx




Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-01 Thread Thomas Gleixner
On Wed, Dec 01 2021 at 11:47, Dave Jiang wrote:
> On 12/1/2021 11:41 AM, Thomas Gleixner wrote:
>>> Hi Thomas. This is actually the IDXD usage for a mediated device passed
>>> to a guest kernel when we plumb the pass through of IMS to the guest
>>> rather than the previous implementation of having a MSIX vector on the
>>> guest backed by IMS.
>> Which makes a lot of sense.
>>
>>> The control block for the mediated device is emulated and therefore an
>>> emulated MSIX vector will be surfaced as vector 0. However the queues
>> will be backed by IMS vectors. So we end up needing MSIX and IMS to coexist
>>> running on the guest kernel for the same device.
>> Why? What's wrong with using straight MSI-X for all of them?
>
> The hardware implementation does not have enough MSIX vectors for 
> guests. There are only 9 MSIX vectors total (8 for queues) and 2048 IMS 
> vectors. So if we are to do MSI-X for all of them, then we need to do 
> the IMS backed MSIX scheme rather than passthrough IMS to guests.

Confused. Are you talking about passing a full IDXD device to the guest
or about passing a carved out subdevice, aka. queue?

Thanks,

tglx


Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-01 Thread Thomas Gleixner
Jason,

On Wed, Dec 01 2021 at 14:14, Jason Gunthorpe wrote:
> On Wed, Dec 01, 2021 at 06:35:35PM +0100, Thomas Gleixner wrote:
>> On Wed, Dec 01 2021 at 09:00, Jason Gunthorpe wrote:
>> But NTB is operating through an abstraction layer and is not a direct
>> PCIe device driver.
>
> I'm not sure exactly how NTB seems to be split between switchtec and
> the ntb code, but since the ntbd code seems to be doing MMIO touches,
> it feels like part of a PCIe driver?

It's a maze of magic PCIe driver with callbacks left and right into the
underlying NTB hardware driver which itself does some stuff on its own
and also calls back into the underlying PCIe driver.

Decomposing that thing feels like being trapped in the Labyrinth of
Knossos. But let's ignore that for a moment.

>> The VFIO driver does not own the irq chip ever. The irq chip is of
>> course part of the underlying infrastructure. I never asked for that.
>
> That isn't quite what I ment.. I ment the PCIe driver cannot create
> the domain or make use of the irq_chip until the VFIO layer comes
> along and provides the struct device. To me this is backwards
> layering, the interrupts come from the PCIe layer and should exist
> independently from VFIO.

See below.

>>  When it allocates a slice for whatever usage then it also
>>  allocates the IMS interrupts (though the VFIO people want to
>>  have only one and do the allocations later on demand).
>> 
>>  That allocation cannot be part of the PCI/MSIx interrupt
>>  domain as we already agreed on.
>
> Yes, it is just an open question of where the new irq_domain need to
> reside

The irqdomain is created by and part of the physical device. But that
does not mean that the interrupts which are allocated from that irq
domain are stored in the physical device representation.

Going by that logic, the PCI/MSI domain would store all MSI[X]
interrupts which are allocated on some root bridge or wherever.

They are obviously stored per PCI device, but the irqdomain is owned
e.g. by the underlying IOMMU zone.

The irqdomain is managing and handing out resources. Like any other
resource manager does.

See?
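
For illustration only (not from this thread, and all names below are made
up): roughly how that split looks in code. The parent PCIe driver owns the
IMS irqdomain, i.e. the resource manager, while the descriptors allocated
from it are stored on whatever struct device asked for them. The
msi_domain_alloc_irqs() call is the interface of that kernel generation;
treat the exact signature as an assumption.

  #include <linux/device.h>
  #include <linux/irqdomain.h>
  #include <linux/msi.h>
  #include <linux/pci.h>

  /* Hypothetical parent PCIe driver: creates and owns the device
   * specific IMS irqdomain, i.e. the resource manager. */
  struct foo_device {
          struct pci_dev    *pdev;
          struct irq_domain *ims_domain;
  };

  /* Hypothetical software subdevice/slice: has its own struct device,
   * so the MSI descriptors allocated for it live here, not in the
   * parent's representation. */
  struct foo_slice {
          struct device      dev;
          struct foo_device *parent;
  };

  static int foo_slice_request_irqs(struct foo_slice *slice, int nvec)
  {
          /* Allocation goes through the parent's domain; the storage
           * ends up per slice->dev. */
          return msi_domain_alloc_irqs(slice->parent->ims_domain,
                                       &slice->dev, nvec);
  }

Whether that subdevice is an mdev, an aux device or something else is the
orthogonal question.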

>> 1) Storage
>> 
>>A) Having "subdevices" solves the storage problem nicely and
>>   makes everything just fall in place. Even for a purely
>>   physical multiqueue device one can argue that each queue is a
>>   "subdevice" of the physical device. The fact that we lump them
>>   all together today is not an argument against that.
>
> I don't like the idea that a queue is a device, that is trying to force
> a struct device centric world onto a queue which doesn't really want
> it..

Here we are at the point where we agree to disagree.

Of course a queue is a resource, but what prevents us from representing a
queue as a carved out subdevice of the physical device?

Look at it from the VF point of view. If VFs are disabled then all
resources belong to the physical device. If VFs are enabled then the
hardware/firmware splits the resources into separate subdevices which
have their own device representation, interrupt storage etc.

As VFs have a scalability limitation due to the underlying PCIe
restrictions, the step we are talking about now is to split the queues up
in software, which means nothing other than creating a software
representation of finer grained and more scalable subdevices.

So why would we want to pretend that these are not devices?

From a conceptual and topological view they are a subdevice of the
physical device. Just because they are named queues does not make it any
different.

Let's look at VFIO again. If VFIO passes a VF through then it builds a
wrapper around the VF device.

So if a queue is represented as a subdevice, then VFIO can just build
a wrapper around that subdevice.

But with your model, VFIO has to create a device, request a queue,
wrestle the interrupts in place, etc. Which is exactly the opposite of
how VFs are handled.

So again, why would we want to make software managed subdevices look
exactly the opposite way like hardware/firmware managed subdevices?

Let me also look at the cdev which is exposed by the physical device.

The cdev is nothing other than a software vehicle to create a /dev/
node. (Ignore the switchtec case, which (ab)uses the cdev to connect to
NTB.) The cdev allows user space to allocate/release a resource.

Of course it can just allocate a queue and stick the necessary
interrupts into the physical device MSI descriptor storage.

But there is no reason why the queue allocation cannot allocate a
subdevice representing the queue.

That makes the VFIO and the cdev case use the same underlying
representation, which can expose its properties via the underlying
device.

Which in turn is consistent all over the place and does not require any
special case for anything. Neither for interrupts nor for anything else.

Thanks,

tglx




Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-01 Thread Dave Jiang



On 12/1/2021 11:41 AM, Thomas Gleixner wrote:

> Dave,
>
> please trim your replies.
>
> On Wed, Dec 01 2021 at 09:28, Dave Jiang wrote:
>
>> On 12/1/2021 3:16 AM, Thomas Gleixner wrote:
>>
>>> Jason,
>>>
>>> CC+ IOMMU folks
>>>
>>> On Tue, Nov 30 2021 at 20:17, Jason Gunthorpe wrote:
>>>
>>>> On Tue, Nov 30, 2021 at 10:23:16PM +0100, Thomas Gleixner wrote:
>>>
>>> Though I fear there is also a use case for MSI-X and IMS tied to the
>>> same device. That network card you are talking about might end up using
>>> MSI-X for a control block and then IMS for the actual network queues
>>> when it is used as physical function device as a whole, but that's
>>> conceptually a different case.
>>
>> Hi Thomas. This is actually the IDXD usage for a mediated device passed
>> to a guest kernel when we plumb the passthrough of IMS to the guest
>> rather than the previous implementation of having an MSIX vector on the
>> guest backed by IMS.
>
> Which makes a lot of sense.
>
>> The control block for the mediated device is emulated and therefore an
>> emulated MSIX vector will be surfaced as vector 0. However the queues
>> will be backed by IMS vectors. So we end up needing MSIX and IMS to
>> coexist on the guest kernel for the same device.
>
> Why? What's wrong with using straight MSI-X for all of them?

The hardware implementation does not have enough MSIX vectors for
guests. There are only 9 MSIX vectors total (8 for queues) and 2048 IMS
vectors. So if we are to do MSI-X for all of them, then we need to do
the IMS-backed MSIX scheme rather than pass IMS through to the guests.

> Thanks,
>
>  tglx




Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-01 Thread Logan Gunthorpe



On 2021-12-01 11:14 a.m., 'Jason Gunthorpe' via linux-ntb wrote:
> On Wed, Dec 01, 2021 at 06:35:35PM +0100, Thomas Gleixner wrote:
>> On Wed, Dec 01 2021 at 09:00, Jason Gunthorpe wrote:
>>> On Wed, Dec 01, 2021 at 11:16:47AM +0100, Thomas Gleixner wrote:
>>>> Looking at the device slices as subdevices with their own struct device
>>>> makes a lot of sense from the conceptual level.
>>>
>>> Except IMS is not just for subdevices, it should be usable for any
>>> driver in any case as a general interrupt mechanism, as you alluded to
>>> below about ethernet queues. ntb seems to be the current example of
>>> this need..
>>
>> But NTB is operating through an abstraction layer and is not a direct
>> PCIe device driver.
> 
> I'm not sure exactly how NTB seems to be split between switchtec and
> the ntb code, but since the ntb code seems to be doing MMIO touches,
> it feels like part of a PCIe driver?

Eh, sort of. NTB has lots of layers. At the top you'll find ntb_netdev
which is a network interface. Below that is ntb_transport() which is a
generic queue pair. Below that is the hardware driver itself (i.e.
switchtec) through the abstraction layer. The switchtec driver is split
in two: the main driver, which just allows for information and
administration of the switch itself, and switchtec_ntb, which is the
module that provides the NTB abstractions to twiddle its registers.

ntb_transport() doesn't directly do MMIO touches (as it doesn't know
what the underlying hardware registers are), except for the memory
windows, which are usually set up to be a specific BAR (or parts of a
BAR). ntb_transport will do MMIO writes to those specific BAR addresses,
which correspond to writing into buffers on the peer.
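
Purely as an illustration of that last point (this is not the actual
ntb_transport code, the names are made up): writing through a memory
window is just an MMIO copy into the ioremapped BAR region that the
switch forwards to the peer's memory, followed by a doorbell to tell the
peer to look at it.

  #include <linux/io.h>
  #include <linux/types.h>

  /* mw_base: ioremapped local view of the memory window BAR;
   * offset/buf/len: where and what to place in the peer's buffer. */
  static void demo_mw_copy_to_peer(void __iomem *mw_base, size_t offset,
                                   const void *buf, size_t len)
  {
          /* The MMIO write lands in the peer's receive buffer. */
          memcpy_toio(mw_base + offset, buf, len);
          /* A doorbell/MSI (not shown) then notifies the peer. */
  }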

Logan


Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-01 Thread Thomas Gleixner
Dave,

please trim your replies.

On Wed, Dec 01 2021 at 09:28, Dave Jiang wrote:

> On 12/1/2021 3:16 AM, Thomas Gleixner wrote:
>> Jason,
>>
>> CC+ IOMMU folks
>>
>> On Tue, Nov 30 2021 at 20:17, Jason Gunthorpe wrote:
>>> On Tue, Nov 30, 2021 at 10:23:16PM +0100, Thomas Gleixner wrote:
>>
>> Though I fear there is also a use case for MSI-X and IMS tied to the
>> same device. That network card you are talking about might end up using
>> MSI-X for a control block and then IMS for the actual network queues
>> when it is used as physical function device as a whole, but that's
>> conceptually a different case.
>
> Hi Thomas. This is actually the IDXD usage for a mediated device passed
> to a guest kernel when we plumb the passthrough of IMS to the guest
> rather than the previous implementation of having an MSIX vector on the
> guest backed by IMS.

Which makes a lot of sense.

> The control block for the mediated device is emulated and therefore an
> emulated MSIX vector will be surfaced as vector 0. However the queues
> will be backed by IMS vectors. So we end up needing MSIX and IMS to
> coexist on the guest kernel for the same device.

Why? What's wrong with using straight MSI-X for all of them?

Thanks,

tglx


Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-01 Thread Jason Gunthorpe via iommu
On Wed, Dec 01, 2021 at 06:35:35PM +0100, Thomas Gleixner wrote:
> On Wed, Dec 01 2021 at 09:00, Jason Gunthorpe wrote:
> > On Wed, Dec 01, 2021 at 11:16:47AM +0100, Thomas Gleixner wrote:
> >> Looking at the device slices as subdevices with their own struct device
> >> makes a lot of sense from the conceptual level.
> >
> > Except IMS is not just for subdevices, it should be usable for any
> > driver in any case as a general interrupt mechanism, as you alluded to
> > below about ethernet queues. ntb seems to be the current example of
> > this need..
> 
> But NTB is operating through an abstraction layer and is not a direct
> PCIe device driver.

I'm not sure exactly how NTB seems to be split between switchtec and
the ntb code, but since the ntb code seems to be doing MMIO touches,
it feels like part of a PCIe driver?

> > IDXD is not so much making device "slices", but virtualizing and
> > sharing a PCI device. The IDXD hardware is multi-queue like the NIC I
> > described and the VFIO driver is simply allocating queues from a PCI
> > device for specific usages and assigning them interrupts.
> 
> Right.
> 
> But what is the representation for that resulting device? Some VFIO
> specific homebrewn muck or something which is based on struct device?

Why is there a device? A queue is a queue, not a device.

If the task is to make some struct device (eg mdev, or cdev, or
whatever) then queues may be allocated to do this, but the queue is
logically a resource handed out by the PCIe driver and there should
not be a requirement to have an external struct device just to create
a queue.
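
To sketch the shape of that argument (all names here are invented): the
PCIe driver would own a plain queue-allocation interface, handing out a
queue object together with its interrupt, and the cdev/VFIO/mdev
consumers are just callers with no struct device of their own.

  #include <linux/interrupt.h>
  #include <linux/pci.h>

  /* Hypothetical PCI-driver-internal queue object: no struct device. */
  struct foo_queue {
          int           id;
          int           irq;      /* Linux irq backing this queue */
          void __iomem *portal;   /* queue portal/doorbell MMIO */
  };

  /* One allocation path for all consumers (cdev, VFIO, in-kernel);
   * MMIO and interrupt ownership stay with the PCI driver. */
  struct foo_queue *foo_pci_alloc_queue(struct pci_dev *pdev,
                                        irq_handler_t handler, void *ctx);
  void foo_pci_free_queue(struct foo_queue *q);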

> Right now with VF passthrough, I can see the interrupts which are
> associated to the device.
> 
> How is that going to be with something which is just made up? Does that
> expose its own msi properties then somewhere hidden in the VFIO layer?

For sysfs, I think all interrupts should be on the PCI directory.

> > There is already a char dev interface that equally allocates queues
> > from the same IDXD device, why shouldn't it be able to access IMS
> > interrupt pools too?
> 
> Why wouldn't it be able to do so?

The only 'struct device' there is a cdev and I really don't think
cdevs should have interrupts. It is a bit hacky as an in-kernel thing
and downright wrong as a sysfs ABI.

> The VFIO driver does not own the irq chip ever. The irq chip is of
> course part of the underlying infrastructure. I never asked for that.

That isn't quite what I meant... I meant the PCIe driver cannot create
the domain or make use of the irq_chip until the VFIO layer comes
along and provides the struct device. To me this is backwards
layering; the interrupts come from the PCIe layer and should exist
independently from VFIO.

>  When it allocates a slice for whatever usage then it also
>  allocates the IMS interrupts (though the VFIO people want to
>  have only one and do the allocations later on demand).
> 
>  That allocation cannot be part of the PCI/MSIx interrupt
>  domain as we already agreed on.

Yes, it is just an open question of where the new irq_domain needs to
reside

> 1) Storage
> 
>A) Having "subdevices" solves the storage problem nicely and
>   makes everything just fall in place. Even for a purely
>   physical multiqueue device one can argue that each queue is a
>   "subdevice" of the physical device. The fact that we lump them
>   all together today is not an argument against that.

I don't like the idea that a queue is a device, that is trying to force
a struct device centric world onto a queue which doesn't really want
it..
 
>B) Requires extra storage in the PCIe device and extra storage
>   per subdevice, queue to keep track of the interrupts which
>   are associated to it.

Yes

> 2) Exposure of VFIO interrupts via sysfs
> 
>A) Just works

I would say this is flawed: in sysfs I expect all the interrupts for
the PCIe device to be in the PCIe sysfs, not strewn over subsystem-owned
sub-directories.

For instance, today in mlx5, when a subdevice allocates a queue for a
slice (which is modeled as an aux device) the queue's assigned MSI-X
interrupt shows up on the PCIe sysfs, not the aux.

It should be uniform: if I assign a queue a legacy INTx, an MSI or an IMS
interrupt, it should show in sysfs in the same way. Leaking this kernel
implementation detail as sysfs ABI does not seem good.

> 3) On demand expansion of the vectors for VFIO
> 
>A) Just works because the device has an irqdomain assigned.
> 
>B) Requires extra indirections to do that

Yes.
 
> 5) PASID
> 
>While an Intel IDXD specific issue, it wants to be solved
>without any nasty hacks.
> 
>A) Having a "subdevice" allows associating the PASID with the
>   underlying struct device which makes IOMMU integration trivial
> 
>B) Needs some other custom hackery to get that solved

Yes

> > Any possibility that the 'IMS' xarray could 

Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-01 Thread Thomas Gleixner
On Wed, Dec 01 2021 at 09:00, Jason Gunthorpe wrote:
> On Wed, Dec 01, 2021 at 11:16:47AM +0100, Thomas Gleixner wrote:
>> Looking at the device slices as subdevices with their own struct device
>> makes a lot of sense from the conceptual level.
>
> Except IMS is not just for subdevices, it should be usable for any
> driver in any case as a general interrupt mechanism, as you alluded to
> below about ethernet queues. ntb seems to be the current example of
> this need..

But NTB is operating through an abstraction layer and is not a direct
PCIe device driver.

> IDXD is not so much making device "slices", but virtualizing and
> sharing a PCI device. The IDXD hardware is multi-queue like the NIC I
> described and the VFIO driver is simply allocating queues from a PCI
> device for specific usages and assigning them interrupts.

Right.

But what is the representation for that resulting device? Some VFIO
specific homebrewn muck or something which is based on struct device?

Right now with VF passthrough, I can see the interrupts which are
associated to the device.

How is that going to be with something which is just made up? Does that
expose its own msi properties then somewhere hidden in the VFIO layer?

See below.

> There is already a char dev interface that equally allocates queues
> from the same IDXD device, why shouldn't it be able to access IMS
> interrupt pools too?

Why wouldn't it be able to do so?

> IMHO a well designed IDXD driver should put all the PCI MMIO, queue
> management, interrupts, etc. in the PCI driver layer, and the VFIO
> driver layer should only deal with the MMIO trapping and VFIO
> interfacing.
>
> From this view it is conceptually wrong to have the VFIO driver
> directly talking to MMIO registers in the PCI device or owning the
> irq_chip.

The VFIO driver does not own the irq chip ever. The irq chip is of
course part of the underlying infrastructure. I never asked for that.

PCIe driver
 Owns the PCI/MSI[x] interrupts for the control block

 Has a control mechanism which handles queues or whatever the
 device is partitioned into; that's what I called a slice.

 The irqdomain is part of that control mechanism and the irqchip
 implementation belongs to that as well. It has to because the
 message store is device specific.

 Whether the domain and chip implementation is in a separate
 drivers/irqchip/foo.c file for sharing or directly built into the
 PCIe driver itself does not matter.

 When it allocates a slice for whatever usage then it also
 allocates the IMS interrupts (though the VFIO people want to
 have only one and do the allocations later on demand).

 That allocation cannot be part of the PCI/MSIx interrupt
 domain as we already agreed on.

We have several problems to solve. Let me look at it from both points of
view:

1) Storage

   A) Having "subdevices" solves the storage problem nicely and
  makes everything just fall in place. Even for a purely
  physical multiqueue device one can argue that each queue is a
  "subdevice" of the physical device. The fact that we lump them
  all together today is not an argument against that.

   B) Requires extra storage in the PCIe device and extra storage
  per subdevice, queue to keep track of the interrupts which
  are associated to it.

2) Exposure of VFIO interrupts via sysfs

   A) Just works

   B) Requires extra mechanisms to expose it

3) On demand expansion of the vectors for VFIO

   A) Just works because the device has an irqdomain assigned.

   B) Requires extra indirections to do that

4) IOMMU msi_desc::dev

   A) I think that's reasonably simple to address by having the
  relationship to the underlying PCIe device stored in struct
  device, which is not really adding any complexity to the IOMMU
  code.

  Quite the contrary it could be used to make the DMA aliasing a
  problem of device setup time and not a lookup per interrupt
  allocation in the IOMMU code.

   B) No change required.

5) PASID

   While an Intel IDXD specific issue, it wants to be solved
   without any nasty hacks.

   A) Having a "subdevice" allows associating the PASID with the
  underlying struct device which makes IOMMU integration trivial

   B) Needs some other custom hackery to get that solved

So both variants come with their ups and downs.

IMO A) is the right thing to do when looking at all the involved moving
pieces.

> It would be very odd for the PCI driver to allocate interrupts from
> some random external struct device when it is creating queues on the
> PCI device.

No. That's not what I want.

>> and then have a store index for each allocation domain. With the
>> proposed encapsulation of the xarray handling that's definitely
>> feasible. Whether that buys much is a different question. Let me think
>> about 

Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-01 Thread Dave Jiang



On 12/1/2021 3:16 AM, Thomas Gleixner wrote:

> Jason,
>
> CC+ IOMMU folks
>
> On Tue, Nov 30 2021 at 20:17, Jason Gunthorpe wrote:
>
>> On Tue, Nov 30, 2021 at 10:23:16PM +0100, Thomas Gleixner wrote:
>>
>>> The real problem is where to store the MSI descriptors because the PCI
>>> device has its own real PCI/MSI-X interrupts which means it still shares
>>> the storage space.
>>
>> Er.. I never realized that just looking at the patches :|
>>
>> That is relevant to all real "IMS" users. IDXD escaped this because
>> it, IMHO, wrongly used the mdev with the IRQ layer. The mdev is purely
>> a messy artifact of VFIO, it should not be required to make the IRQ
>> layers work.
>>
>> I don't think it makes sense that the msi_desc would point to a mdev,
>> the iommu layer consumes the msi_desc_to_dev(), it really should point
>> to the physical device that originates the message with a proper
>> iommu ops/data/etc.
>
> Looking at the device slices as subdevices with their own struct device
> makes a lot of sense from the conceptual level. That makes it pretty
> much obvious to manage the MSIs of those devices at this level like we
> do for any other device.
>
> Whether mdev is the right encapsulation for these subdevices is an
> orthogonal problem.
>
> I surely agree that msi_desc::dev is an interesting question, but we
> already have this disconnect of msi_desc::dev and DMA today due to DMA
> aliasing. I haven't looked at that in detail yet, but of course the
> alias handling is substantially different across the various IOMMU
> implementations.
>
> Though I fear there is also a use case for MSI-X and IMS tied to the
> same device. That network card you are talking about might end up using
> MSI-X for a control block and then IMS for the actual network queues
> when it is used as physical function device as a whole, but that's
> conceptually a different case.

Hi Thomas. This is actually the IDXD usage for a mediated device passed
to a guest kernel when we plumb the passthrough of IMS to the guest
rather than the previous implementation of having an MSIX vector on the
guest backed by IMS. The control block for the mediated device is
emulated and therefore an emulated MSIX vector will be surfaced as
vector 0. However the queues will be backed by IMS vectors. So we end up
needing MSIX and IMS to coexist on the guest kernel for the same
device.

DJ

>>> I'm currently tending to partition the index space in the xarray:
>>>
>>>  0x00000000 - 0x0000ffff  PCI/MSI-X
>>>  0x00010000 - 0x0001ffff  NTB
>>
>> It is OK, with some xarray work it can be range allocating & reserving
>> so that the msi_domain_alloc_irqs() flows can carve out chunks of the
>> number space..
>>
>> Another view is the msi_domain_alloc_irqs() flows should have their
>> own xarrays..
>
> Yes, I was thinking about that as well. The trivial way would be:
>
>  struct xarray store[MSI_MAX_STORES];
>
> and then have a store index for each allocation domain. With the
> proposed encapsulation of the xarray handling that's definitely
> feasible. Whether that buys much is a different question. Let me think
> about it some more.
>
>>> which is feasible now with the range modifications and way simpler to do
>>> with xarray than with the linked list.
>>
>> Indeed!
>
> I'm glad you like the approach.
>
> Thanks,
>
>  tglx





Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-01 Thread Jason Gunthorpe via iommu
On Wed, Dec 01, 2021 at 11:16:47AM +0100, Thomas Gleixner wrote:

> Looking at the device slices as subdevices with their own struct device
> makes a lot of sense from the conceptual level.

Except IMS is not just for subdevices, it should be usable for any
driver in any case as a general interrupt mechanism, as you alluded to
below about ethernet queues. ntb seems to be the current example of
this need..

If it works properly on the physical PCI dev there is no reason to try
to also make it work on the mdev and add complexity in iommu land.

> Though I fear there is also a use case for MSI-X and IMS tied to the
> same device. That network card you are talking about might end up using
> MSI-X for a control block and then IMS for the actual network queues
> when it is used as physical function device as a whole, but that's
> conceptually a different case.

Is it? Why?

IDXD is not so much making device "slices", but virtualizing and
sharing a PCI device. The IDXD hardware is multi-queue like the NIC I
described and the VFIO driver is simply allocating queues from a PCI
device for specific usages and assigning them interrupts.

There is already a char dev interface that equally allocates queues
from the same IDXD device, why shouldn't it be able to access IMS
interrupt pools too?

IMHO a well designed IDXD driver should put all the PCI MMIO, queue
management, interrupts, etc. in the PCI driver layer, and the VFIO
driver layer should only deal with the MMIO trapping and VFIO
interfacing.

From this view it is conceptually wrong to have the VFIO driver
directly talking to MMIO registers in the PCI device or owning the
irq_chip. It would be very odd for the PCI driver to allocate
interrupts from some random external struct device when it is creating
queues on the PCI device.

> > Another view is the msi_domain_alloc_irqs() flows should have their
> > own xarrays..
> 
> Yes, I was thinking about that as well. The trivial way would be:
> 
> struct xarray store[MSI_MAX_STORES];
> 
> and then have a store index for each allocation domain. With the
> proposed encapsulation of the xarray handling that's definitely
> feasible. Whether that buys much is a different question. Let me think
> about it some more.

Any possibility that the 'IMS' xarray could be outside the struct
device?

Thanks,
Jason


Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-01 Thread Thomas Gleixner
Jason,

CC+ IOMMU folks

On Tue, Nov 30 2021 at 20:17, Jason Gunthorpe wrote:
> On Tue, Nov 30, 2021 at 10:23:16PM +0100, Thomas Gleixner wrote:
>> The real problem is where to store the MSI descriptors because the PCI
>> device has its own real PCI/MSI-X interrupts which means it still shares
>> the storage space.
>
> Er.. I never realized that just looking at the patches :|
>
> That is relevant to all real "IMS" users. IDXD escaped this because
> it, IMHO, wrongly used the mdev with the IRQ layer. The mdev is purely
> a messy artifact of VFIO, it should not be required to make the IRQ
> layers work.

> I don't think it makes sense that the msi_desc would point to a mdev,
> the iommu layer consumes the msi_desc_to_dev(), it really should point
> to the physical device that originates the message with a proper
> iommu ops/data/etc.

Looking at the device slices as subdevices with their own struct device
makes a lot of sense from the conceptual level. That makes it pretty
much obvious to manage the MSIs of those devices at this level like we
do for any other device.

Whether mdev is the right encapsulation for these subdevices is an
orthogonal problem.

I surely agree that msi_desc::dev is an interesting question, but we
already have this disconnect of msi_desc::dev and DMA today due to DMA
aliasing. I haven't looked at that in detail yet, but of course the
alias handling is substantially different across the various IOMMU
implementations.

Though I fear there is also a use case for MSI-X and IMS tied to the
same device. That network card you are talking about might end up using
MSI-X for a control block and then IMS for the actual network queues
when it is used as physical function device as a whole, but that's
conceptually a different case.

>> I'm currently tending to partition the index space in the xarray:
>> 
>>  0x00000000 - 0x0000ffff  PCI/MSI-X
>>  0x00010000 - 0x0001ffff  NTB
>
> It is OK, with some xarray work it can be range allocating & reserving
> so that the msi_domain_alloc_irqs() flows can carve out chunks of the
> number space..
>
> Another view is the msi_domain_alloc_irqs() flows should have their
> own xarrays..

Yes, I was thinking about that as well. The trivial way would be:

struct xarray store[MSI_MAX_STORES];

and then have a store index for each allocation domain. With the
proposed encapsulation of the xarray handling that's definitely
feasible. Whether that buys much is a different question. Let me think
about it some more.

>> which is feasible now with the range modifications and way simpler to do
>> with xarray than with the linked list.
>
> Indeed!

I'm glad you like the approach.
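
As a tiny illustration of the partitioned-index idea above (the constants
are invented for the example, not taken from the patches): each
allocation domain gets its own sub-range of the per-device xarray index
space, so PCI/MSI-X and e.g. NTB descriptors can never collide.

  #define MSI_XA_DOMAIN_SHIFT  16
  #define MSI_XA_INDEX_MASK    ((1UL << MSI_XA_DOMAIN_SHIFT) - 1)

  /* domid 0: 0x00000000..0x0000ffff  (PCI/MSI-X)
   * domid 1: 0x00010000..0x0001ffff  (e.g. NTB) */
  static inline unsigned long msi_xa_index(unsigned int domid,
                                           unsigned int index)
  {
          return ((unsigned long)domid << MSI_XA_DOMAIN_SHIFT) |
                 (index & MSI_XA_INDEX_MASK);
  }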

Thanks,

tglx

