RE: [RFC 1/3] virtio-iommu: firmware description of the virtual topology

2017-04-21 Thread Tian, Kevin
> From: Jean-Philippe Brucker [mailto:jean-philippe.bruc...@arm.com]
> Sent: Wednesday, April 19, 2017 2:41 AM
> 
> On 18/04/17 10:51, Tian, Kevin wrote:
> >> From: Jean-Philippe Brucker
> >> Sent: Saturday, April 8, 2017 3:18 AM
> >>
> >> Unlike other virtio devices, the virtio-iommu doesn't work independently:
> >> it is linked to other virtual or assigned devices. So before jumping into
> >> device operations, we need to define a way for the guest to discover the
> >> virtual IOMMU and the devices it translates.
> >>
> >> The host must describe the relation between IOMMU and devices to the
> >> guest using either device-tree or ACPI. The virtual IOMMU identifies each
> >
> > Do you plan to support both device tree and ACPI?
> 
> Yes, with ACPI the topology would be described using IORT nodes. I didn't
> include an example in my driver because DT is sufficient for a prototype
> and is readily available (both in Linux and kvmtool), whereas IORT would
> be quite easy to reuse in Linux, but isn't present in kvmtool at the
> moment. However, both interfaces have to be supported for the
> virtio-iommu to be portable.

Does 'portable' here mean working regardless of whether the guest enables ACPI?

> 
> >> virtual device with a 32-bit ID, that we will call "Device ID" in this
> >> document. Device IDs are not necessarily unique system-wide, but they
> >> may not overlap within a single virtual IOMMU. Device IDs of
> >> passed-through devices do not need to match IDs seen by the physical
> >> IOMMU.
> >>
> >> The virtual IOMMU uses virtio-mmio transport exclusively, not virtio-pci,
> >> because with PCI the IOMMU interface would itself be an endpoint, and
> >> existing firmware interfaces don't allow describing IOMMU<->master
> >> relations between PCI endpoints.
> >
> > I'm not familiar with virtio-mmio mechanism. Curious how devices in
> > virtio-mmio are enumerated today? Could we use that mechanism to
> > identify vIOMMUs and then invent a purely para-virtualized method to
> > enumerate devices behind each vIOMMU?
> 
> Using DT, virtio-mmio devices are described with "virtio-mmio" compatible
> node, and with ACPI they use _HID LNRO0005. Since the host already
> describes available devices to a guest using a firmware interface, I think
> we should reuse the tools provided by that interface for describing
> relations between DMA masters and IOMMU.

OK, I didn't realize virtio-mmio is defined to rely on DT for enumeration.

> 
> > Asking this is because each vendor has its own enumeration methods.
> > ARM has device tree and ACPI IORT. AMD has ACPI IVRS and device
> > tree (same format as ARM?). Intel has ACPI DMAR and sub-tables. Your
> > current proposal looks to follow ARM definitions, which I'm not sure
> > are extensible enough to cover features defined only in other vendors'
> > structures.
> 
> ACPI IORT can be extended to incorporate para-virtualized IOMMUs,
> regardless of the underlying architecture. It isn't defined solely for the
> ARM SMMU, but serves a more general purpose of describing a map of
> device identifiers communicated from one component to another. Both DMAR
> and IVRS have such a description (respectively DRHD and IVHD), but they
> are designed for a specific IOMMU, whereas IORT could host other kinds.

I'll take a look at the IORT definition. DRHD includes more information
than device mappings.

> 
> It seems that all we really need is an interface that says "there is a
> virtio-iommu at address X, here are the devices it translates and their
> corresponding IDs", and both DT and ACPI IORT are able to fulfill this role.
> 
> > Since the purpose of this series is to go para-virtualized, why not also
> > para-virtualize and simplify the enumeration method? For example,
> > we may define a query interface through vIOMMU registers to allow
> > the guest to query whether a device belongs to that vIOMMU. Then we
> > can even remove use of any enumeration structure completely...
> > Just a quick example for which I may not have thought through all the
> > pros and cons. :-)
> 
> I don't think adding a brand new topology description mechanism is worth
> the effort, we're better off reusing what already exists and is
> implemented by operating systems. Adding a query interface inside the
> vIOMMU may work (though might be very painful to integrate with fwspec in
> Linux), but would be redundant since the host has to provide a firmware
> description of the system anyway.
> 
> >> The following diagram describes a situation where two virtual IOMMUs

RE: [RFC 2/3] virtio-iommu: device probing and operations

2017-04-21 Thread Tian, Kevin
> From: Jean-Philippe Brucker [mailto:jean-philippe.bruc...@arm.com]
> Sent: Wednesday, April 19, 2017 2:46 AM
> 
> On 18/04/17 11:26, Tian, Kevin wrote:
> >> From: Jean-Philippe Brucker
> >> Sent: Saturday, April 8, 2017 3:18 AM
> >>
> > [...]
> >>   II. Feature bits
> >>   
> >>
> >> VIRTIO_IOMMU_F_INPUT_RANGE (0)
> >>  Available range of virtual addresses is described in input_range
> >
> > Usually only the maximum supported address bits are important.
> > Curious whether you see a situation where the low end of the address
> > space is not usable (since you have both start/end defined later)?
> 
> A start address would make it possible to provide something resembling a
> GART to the guest: an IOMMU with one address space (ioasid_bits=0) and a
> small IOVA aperture. I'm not sure how useful that would be in practice.

Intel VT-d has no such limitation, as far as I can tell. :-)

> 
> On a related note, the virtio-iommu itself doesn't provide a
> per-address-space aperture as it stands. For example, attaching a device
> to an address space might restrict the available IOVA range for the whole
> AS if that device cannot write to high memory (above 32-bit). If the guest
> attempts to map an IOVA outside this window into the device's address
> space, it should expect the MAP request to fail. And when attaching, if
> the address space already has mappings outside this window, then ATTACH
> should fail.
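
A sketch of the two checks described above, assuming the device tracks a
per-address-space aperture (the struct and field names are invented here
for illustration, they are not part of the proposal):

	#include <errno.h>
	#include <stdint.h>

	struct viommu_as {
		uint64_t aperture_start, aperture_end; /* current IOVA window */
		uint64_t highest_mapped_iova;          /* top of existing maps */
	};

	/* MAP: reject requests that fall outside the address space's window. */
	static int check_map(struct viommu_as *as, uint64_t virt_start,
			     uint64_t virt_end)
	{
		if (virt_start < as->aperture_start || virt_end > as->aperture_end)
			return -ERANGE;
		return 0;
	}

	/* ATTACH: shrink the window to the new endpoint's DMA limit, and fail
	 * if existing mappings already lie outside the shrunk window. */
	static int check_attach(struct viommu_as *as, uint64_t dev_dma_limit)
	{
		if (as->highest_mapped_iova > dev_dma_limit)
			return -EINVAL;
		if (dev_dma_limit < as->aperture_end)
			as->aperture_end = dev_dma_limit;
		return 0;
	}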
> 
> This too seems to be something that ought to be communicated by firmware,
> but bits are missing (I can't find anything equivalent to DT's dma-ranges
> for PCI root bridges in ACPI tables, for example). In addition VFIO
> doesn't communicate any DMA mask for devices, and doesn't check them
> itself. I guess that the host could find out the DMA mask of devices one
> way or another, but it is tricky to enforce, so I didn't make this a hard
> requirement. Although I should probably add a few words about it.

If there is no such communication on bare metal, then the same applies to the pvIOMMU.

> 
> > [...]
> >>   1. Attach device
> >>   
> >>
> >> struct virtio_iommu_req_attach {
> >>         le32    address_space;
> >>         le32    device;
> >>         le32    flags/reserved;
> >> };
> >>
> >> Attach a device to an address space. 'address_space' is an identifier
> >> unique to the guest. If the address space doesn't exist in the IOMMU
> >
> > Based on your description this address space ID is per operation right?
> > MAP/UNMAP and page-table sharing should have different ID spaces...
> 
> I think it's simpler if we keep a single IOASID space per virtio-iommu
> device, because the maximum number of address spaces (described by
> ioasid_bits) might be a restriction of the pIOMMU. For page-table sharing
> you still need to define which devices will share a page directory using
> ATTACH requests, though that interface is not set in stone.
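
To illustrate the single IOASID space: putting two endpoints behind the
same mappings is just two ATTACH requests reusing one identifier (a sketch
against the struct quoted above; send_req() and the device IDs are
hypothetical stand-ins for the driver's virtqueue plumbing):

	struct virtio_iommu_req_attach attach = {
		.address_space	= cpu_to_le32(42),	/* one guest-chosen IOASID */
		.device		= cpu_to_le32(dev_a_id),
	};

	send_req(viommu, &attach, sizeof(attach));	/* endpoint A joins AS 42 */

	attach.device = cpu_to_le32(dev_b_id);
	send_req(viommu, &attach, sizeof(attach));	/* endpoint B shares AS 42 */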

Got you. Yes, the VM is supposed to consume fewer IOASIDs than are
physically available. It doesn't hurt to have one IOASID space for both
IOVA map/unmap usages (one IOASID per device) and SVM usages (multiple
IOASIDs per device). The former is digested by software and the latter
will be bound to hardware.

Thanks
Kevin


RE: [RFC 3/3] virtio-iommu: future work

2017-04-21 Thread Tian, Kevin
> From: Jean-Philippe Brucker
> Sent: Saturday, April 8, 2017 3:18 AM
> 
> Here I propose a few ideas for extensions and optimizations. This is all
> very exploratory, feel free to correct mistakes and suggest more things.

[...]
> 
>   II. Page table sharing
>   ======================
> 
>   1. Sharing IOMMU page tables
>   ----------------------------
> 
> VIRTIO_IOMMU_F_PT_SHARING
> 
> This is independent of the nested mode described in I.2, but relies on a
> similar feature in the physical IOMMU: having two stages of page tables,
> one for the host and one for the guest.
> 
> When this is supported, the guest can manage its own s1 page directory, to
> avoid sending MAP/UNMAP requests. Feature VIRTIO_IOMMU_F_PT_SHARING
> allows a driver to give a page directory pointer (pgd) to the host and
> send invalidations when removing or changing a mapping. In this mode,
> three requests are used: probe, attach and invalidate. An address space
> cannot be using the MAP/UNMAP interface and PT_SHARING at the same time.
> 
> Device and driver first need to negotiate which page table format they
> will be using. This depends on the physical IOMMU, so the request contains
> a negotiation part to probe the device capabilities.
> 
> (1) Driver attaches devices to address spaces as usual, but a flag
> VIRTIO_IOMMU_ATTACH_F_PRIVATE (working title) tells the device not to
> create page tables for use with the MAP/UNMAP API. The driver intends
> to manage the address space itself.
> 
> (2) Driver sends a PROBE_TABLE request. It sets len > 0 with the size of
> pg_format array.
> 
>   VIRTIO_IOMMU_T_PROBE_TABLE
> 
>   struct virtio_iommu_req_probe_table {
>           le32    address_space;
>           le32    flags;
>           le32    len;
> 
>           le32    nr_contexts;
>           struct {
>                   le32    model;
>                   u8      format[64];
>           } pg_format[len];
>   };
> 
> Introducing a probe request is more flexible than advertising those
> features in virtio config, because capabilities are dynamic, and depend on
> which devices are attached to an address space. Within a single address
> space, devices may support different numbers of contexts (PASIDs), and
> some may not support recoverable faults.
> 
> (3) Device responds success with all page table formats implemented by the
> physical IOMMU in pg_format. 'model' 0 is invalid, so driver can
> initialize the array to 0 and deduce from there which entries have
> been filled by the device.
> 
> Using a probe method seems preferable over trying to attach every possible
> format until one sticks. For instance, with an ARM guest running on an x86
> host, PROBE_TABLE would return the Intel IOMMU page table format, and
> the
> guest could use that page table code to handle its mappings, hidden behind
> the IOMMU API. This requires that the page-table code is reasonably
> abstracted from the architecture, as is done with drivers/iommu/io-pgtable
> (an x86 guest could use any format implemented by io-pgtable, for example.)
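
As an illustration of steps (2) and (3), the driver side could look
roughly like the following (a sketch against the struct above;
viommu_send_req_sync() is a hypothetical stand-in for the request
plumbing):

	static int viommu_probe_formats(struct viommu_dev *viommu, u32 as_id)
	{
		struct virtio_iommu_req_probe_table *probe;
		size_t len = 4;	/* accept up to four formats */
		int i, ret;

		/* kzalloc zero-fills, so every pg_format[i].model starts invalid */
		probe = kzalloc(sizeof(*probe) + len * sizeof(probe->pg_format[0]),
				GFP_KERNEL);
		if (!probe)
			return -ENOMEM;

		probe->address_space = cpu_to_le32(as_id);
		probe->len = cpu_to_le32(len);

		ret = viommu_send_req_sync(viommu, probe);	/* hypothetical */

		for (i = 0; !ret && i < len; i++) {
			if (!le32_to_cpu(probe->pg_format[i].model))
				break;	/* model 0: device filled no more entries */
			/* match pg_format[i].model against known io-pgtable formats */
		}

		kfree(probe);
		return ret;
	}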

So essentially you need to modify all existing IOMMU drivers to support
page table sharing in the pvIOMMU. After the abstraction is done, the core
pvIOMMU files can be kept vendor agnostic. But if we talk about the whole
pvIOMMU module, it actually includes vendor-specific logic and is thus,
unlike typical para-virtualized virtio drivers, not completely vendor
agnostic. Is this understanding accurate?

It also means the host-side pIOMMU driver needs to propagate all
supported formats through VFIO to the Qemu vIOMMU, meaning
such format definitions need to be consistently agreed upon across all
those components.

[...]

> 
>   2. Sharing MMU page tables
>   --
> 
> The guest can share process page-tables with the physical IOMMU. To do
> that, it sends PROBE_TABLE with (F_INDIRECT | F_NATIVE | F_FAULT). The
> page table format is implicit, so the pg_format array can be empty (unless
> the guest wants to query some specific property, e.g. number of levels
> supported by the pIOMMU?). If the host answers with success, guest can
> send its MMU page table details with ATTACH_TABLE and (F_NATIVE |
> F_INDIRECT | F_FAULT) flags.
> 
> F_FAULT means that the host communicates page requests from the device
> to the guest, and the guest can handle them by mapping the virtual
> address in the fault to pages. It is only available with
> VIRTIO_IOMMU_F_FAULT_QUEUE (see below.)
> 
> F_NATIVE means that the pIOMMU pgtable format is the same as the guest
> MMU pgtable format.
> 
> F_INDIRECT means that 'table' pointer is a context table, instead of a
> page directory. Each slot in the context table points to a page directory:
> 
>          64                2 1 0
>   table ----> +---------------+-+-+
>               |      pgd      |0|1| <--- context 0
>               |      ---      |0|0| <--- context 1
>               |      

RE: [RFC 0/3] virtio-iommu: a paravirtualized IOMMU

2017-04-13 Thread Tian, Kevin
> From: Jean-Philippe Brucker
> Sent: Saturday, April 8, 2017 3:18 AM
> 
> This is the initial proposal for a paravirtualized IOMMU device using
> virtio transport. It contains a description of the device, a Linux driver,
> and a toy implementation in kvmtool. With this prototype, you can
> translate DMA to guest memory from emulated (virtio), or passed-through
> (VFIO) devices.
> 
> In its simplest form, implemented here, the device handles map/unmap
> requests from the guest. Future extensions proposed in "RFC 3/3" should
> allow binding page tables to devices.
> 
> There are a number of advantages in a paravirtualized IOMMU over a full
> emulation. It is portable and could be reused on different architectures.
> It is easier to implement than a full emulation, with less state tracking.
> It might be more efficient in some cases, with fewer context switches to
> the host and the possibility of in-kernel emulation.
> 
> When designing it and writing the kvmtool device, I considered two main
> scenarios, illustrated below.
> 
> Scenario 1: a hardware device passed through twice via VFIO
> 
>    MEM          pIOMMU          PCI device             HARDWARE
>     |    (2b)                       \
>   --|------------+---------+---------\-------------------------
>     |            :   KVM   :          \
>     |            :         :           \
>   pIOMMU drv     :    ___virtio-iommu drv \            KERNEL
>     |            :   |     :      |        \
>    VFIO          :   |     :     VFIO       \
>     |            :   |     :      |          \
>     |            :   |     :      |          /
>   --|------------+---|-----+------|---------/------------------
>     |                |     :      |        /
>     |  (1c)     (1b) |     : (1a) |       / (2a)
>     |                |     :      |      /
>     |                |     :      |     /              USERSPACE
>     |___virtio-iommu dev___|      : net drv___/
>                             :
>   --------------------------+----------------------------------
>               HOST          :        GUEST
> 

Usually people draw such layers in reverse order, e.g. hw in the
bottom then kernel in the middle then user in the top. :-)

> (1) a. Guest userspace is running a net driver (e.g. DPDK). It allocates a
>        buffer with mmap, obtaining virtual address VA. It then sends a
>        VFIO_IOMMU_MAP_DMA request to map VA to an IOVA (possibly
>        VA=IOVA).
>     b. The mapping request is relayed to the host through virtio
>        (VIRTIO_IOMMU_T_MAP).
>     c. The mapping request is relayed to the physical IOMMU through VFIO.
> 
> (2) a. The guest userspace driver can now instruct the device to directly
>        access the buffer at IOVA
>     b. IOVA accesses from the device are translated into physical
>        addresses by the IOMMU.
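
For step (1c), the host's virtio-iommu device would relay the guest's MAP
to the physical IOMMU through VFIO, roughly as below (a sketch;
translating the guest PA into the backing host VA is left to the caller):

	#include <stdint.h>
	#include <sys/ioctl.h>
	#include <linux/vfio.h>

	/* Relay one guest MAP request to the physical IOMMU via VFIO. */
	static int relay_map(int container_fd, uint64_t iova, void *host_va,
			     uint64_t size)
	{
		struct vfio_iommu_type1_dma_map map = {
			.argsz = sizeof(map),
			.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
			.vaddr = (uintptr_t)host_va, /* host VA backing the guest PA */
			.iova  = iova,               /* IOVA chosen by the guest */
			.size  = size,
		};

		return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
	}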
> 
> Scenario 2: a virtual net device behind a virtual IOMMU.
> 
>   MEM__pIOMMU___PCI device                             HARDWARE
>          |          |
>   -------|----------|-------+---------+------------------------
>          |          |       :   KVM   :
>          |          |       :         :
>     pIOMMU drv      |       :         :
>          \          |       :    _____virtio-net drv   KERNEL
>           \____net drv      :   |     :   / (1a)
>                 |           :   |     :  /
>                tap          :   |  virtio-iommu drv
>                 |           :   |     |      : (1b)
>   --------------|-----------+---|-----|------+-----------------
>                 |               |     |      :
>                 |__virtio-net___|     |      :
>                      /  (2)           |      :
>                     /                 |      :         USERSPACE
>        virtio-iommu dev_______________|      :
>                                              :
>   -------------------------------------------+-----------------
>                     HOST                     :   GUEST
> 
> (1) a. Guest virtio-net driver maps the virtio ring and a buffer
>     b. The mapping requests are relayed to the host through virtio.
> (2) The virtio-net device now needs to access any guest memory via the
>     IOMMU.
> 
> Physical and virtual IOMMUs are completely dissociated. The net driver is
> mapping its own buffers via the DMA/IOMMU API, and buffers are copied
> between virtio-net and tap.
> 
> 
> The description itself seemed too long for a single email, so I split it
> into three documents, and will attach Linux and kvmtool patches to this
> email.
> 
>   1. Firmware note,
>   2. device operations (draft for the virtio specification),
>   3. future work.

RE: [RFC 0/3] virtio-iommu: a paravirtualized IOMMU

2017-04-13 Thread Tian, Kevin
> From: Jason Wang
> Sent: Wednesday, April 12, 2017 5:07 PM
> 
> On 2017-04-08 03:17, Jean-Philippe Brucker wrote:
> > This is the initial proposal for a paravirtualized IOMMU device using
> > virtio transport. It contains a description of the device, a Linux driver,
> > and a toy implementation in kvmtool. With this prototype, you can
> > translate DMA to guest memory from emulated (virtio), or passed-through
> > (VFIO) devices.
> >
> > In its simplest form, implemented here, the device handles map/unmap
> > requests from the guest. Future extensions proposed in "RFC 3/3" should
> > allow binding page tables to devices.
> >
> > There are a number of advantages in a paravirtualized IOMMU over a full
> > emulation. It is portable and could be reused on different architectures.
> > It is easier to implement than a full emulation, with less state tracking.
> > It might be more efficient in some cases, with fewer context switches to
> > the host and the possibility of in-kernel emulation.
> 
> I like the idea. Considering the complexity of IOMMU hardware, I believe
> we don't want to have, and fight bugs in, three or more different IOMMU
> implementations in either userspace or kernel.
> 

Though there are definitely positive things about the pvIOMMU approach,
it also has some limitations:

- Existing IOMMU implementations have been in old distros for quite some
time, while the pvIOMMU driver will only land in future distros. Doing
pvIOMMU only means we completely drop support for old distros in VMs;

- A similar situation exists for other guest OSes, e.g. Windows. The IOMMU
is a key kernel component, and I'm not sure a pvIOMMU through virtio can
be recognized in those OSes (unlike a virtio device driver);

I would imagine both full-emulated IOMMUs and the pvIOMMU will co-exist
for some time due to the above reasons. Someday, when the pvIOMMU is
mature/widespread enough in the eco-system (and feature-wise comparable to
full-emulated IOMMUs for all vendors), we may make a call.

Thanks,
Kevin

RE: [RFC 1/3] virtio-iommu: firmware description of the virtual topology

2017-04-18 Thread Tian, Kevin
> From: Jean-Philippe Brucker
> Sent: Saturday, April 8, 2017 3:18 AM
> 
> Unlike other virtio devices, the virtio-iommu doesn't work independently:
> it is linked to other virtual or assigned devices. So before jumping into
> device operations, we need to define a way for the guest to discover the
> virtual IOMMU and the devices it translates.
> 
> The host must describe the relation between IOMMU and devices to the
> guest using either device-tree or ACPI. The virtual IOMMU identifies each

Do you plan to support both device tree and ACPI?

> virtual device with a 32-bit ID, that we will call "Device ID" in this
> document. Device IDs are not necessarily unique system-wide, but they may
> not overlap within a single virtual IOMMU. Device IDs of passed-through
> devices do not need to match IDs seen by the physical IOMMU.
> 
> The virtual IOMMU uses virtio-mmio transport exclusively, not virtio-pci,
> because with PCI the IOMMU interface would itself be an endpoint, and
> existing firmware interfaces don't allow describing IOMMU<->master
> relations between PCI endpoints.

I'm not familiar with virtio-mmio mechanism. Curious how devices in
virtio-mmio are enumerated today? Could we use that mechanism to
identify vIOMMUs and then invent a purely para-virtualized method to
enumerate devices behind each vIOMMU? 

Asking this is because each vendor has its own enumeration methods.
ARM has device tree and ACPI IORT. AMD has ACPI IVRS and device
tree (same format as ARM?). Intel has ACPI DMAR and sub-tables. Your
current proposal looks to follow ARM definitions, which I'm not sure
are extensible enough to cover features defined only in other vendors'
structures.

Since the purpose of this series is to go para-virtualized, why not also
para-virtualize and simplify the enumeration method? For example,
we may define a query interface through vIOMMU registers to allow
the guest to query whether a device belongs to that vIOMMU. Then we
can even remove use of any enumeration structure completely...
Just a quick example for which I may not have thought through all the
pros and cons. :-)

> 
> The following diagram describes a situation where two virtual IOMMUs
> translate traffic from devices in the system. vIOMMU 1 translates two PCI
> domains, in which each function has a 16-bit requester ID. In order for
> the vIOMMU to differentiate guest requests targeted at devices in each
> domain, their Device ID ranges cannot overlap. vIOMMU 2 translates two PCI
> domains and a collection of platform devices.
> 
>            Device ID                     Requester ID
>           /    0x0           0x0     \
>          /      |             |       PCI domain 1
>         /    0xffff         0xffff   /
> vIOMMU 1
>         \   0x10000          0x0     \
>          \      |             |       PCI domain 2
>           \  0x1ffff        0xffff   /
> 
>           /    0x0                   \
>          /      |                     platform devices
>         /    0x1fff                  /
> vIOMMU 2
>         \   0x2000           0x0     \
>          \      |             |       PCI domain 3
>           \  0x11fff        0xffff   /
> 

shouldn't the above be (0x3, 3) for PCI domain 3, given the device ID is 16-bit?

Thanks
Kevin


RE: [RFC 2/3] virtio-iommu: device probing and operations

2017-04-18 Thread Tian, Kevin
> From: Jean-Philippe Brucker
> Sent: Saturday, April 8, 2017 3:18 AM
> 
[...]
>   II. Feature bits
>   
> 
> VIRTIO_IOMMU_F_INPUT_RANGE (0)
>  Available range of virtual addresses is described in input_range

Usually only the maximum supported address bits are important.
Curious whether you see a situation where the low end of the address
space is not usable (since you have both start/end defined later)?

[...]
>   1. Attach device
>   
> 
> struct virtio_iommu_req_attach {
>         le32    address_space;
>         le32    device;
>         le32    flags/reserved;
> };
> 
> Attach a device to an address space. 'address_space' is an identifier
> unique to the guest. If the address space doesn't exist in the IOMMU

Based on your description this address space ID is per operation right?
MAP/UNMAP and page-table sharing should have different ID spaces...

> device, it is created. 'device' is an identifier unique to the IOMMU. The
> host communicates unique device IDs to the guest during boot. The method
> used to communicate this ID is outside the scope of this specification,
> but the following rules must apply:
> 
> * The device ID is unique from the IOMMU point of view. Multiple devices
>   whose DMA transactions are not translated by the same IOMMU may have
>   the same device ID. Devices whose DMA transactions may be translated
>   by the same IOMMU must have different device IDs.
> 
> * Sometimes the host cannot completely isolate two devices from each
>   other. For example on a legacy PCI bus, devices can snoop DMA
>   transactions from their neighbours. In this case, the host must
>   communicate to the guest that it cannot isolate these devices from each
>   other. The method used to communicate this is outside the scope of this
>   specification. The IOMMU device must ensure that devices that cannot be
"IOMMU device" -> "IOMMU driver"

>   isolated by the host have the same address spaces.
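
A sketch of that last rule (on whichever side ends up enforcing it):
endpoints sharing an isolation group must resolve to the same address
space. All names here are invented for illustration:

	#include <stdint.h>

	struct endpoint {
		uint32_t as;	/* attached address space, 0 = detached */
		int      group;	/* host-assigned isolation group */
	};

	/* Refuse an ATTACH that would split non-isolatable endpoints. */
	static int check_isolation(const struct endpoint *eps, int nr,
				   int group, uint32_t as)
	{
		for (int i = 0; i < nr; i++)
			if (eps[i].group == group && eps[i].as && eps[i].as != as)
				return -1; /* grouped endpoints must share the AS */
		return 0;
	}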
> 

Thanks
Kevin


RE: [RFC] virtio-iommu version 0.4

2017-08-14 Thread Tian, Kevin
> From: Jean-Philippe Brucker [mailto:jean-philippe.bruc...@arm.com]
> Sent: Saturday, August 5, 2017 2:19 AM
> 
> This is the continuation of my proposal for virtio-iommu, the
> paravirtualized IOMMU. Here is a summary of the changes since last time [1]:
> 
> * The virtio-iommu document now resembles an actual specification. It is
>   split into a formal description of the virtio device, and implementation
>   notes. Please find sources and binaries at [2].
> 
> * Added a probe request to describe to the guest different properties that
>   do not fit in firmware or in the virtio config space. This is a
>   necessary stepping stone for extending the virtio-iommu.
> 
> * There is a working Qemu prototype [3], thanks to Eric Auger and Bharat
>   Bhushan.
> 
> You can find the Linux driver and kvmtool device at [4] and [5]. I
> plan to rework driver and kvmtool device slightly before sending the
> patches.
> 
> To understand the virtio-iommu, I advise reading the introduction and
> motivation first, then skimming through the implementation notes and
> finally looking at the device specification.
> 
> I wasn't sure how to organize the review. For those who prefer to comment
> inline, I attached v0.4 of device-operations.tex and topology.tex+MSI.tex
> to this thread. They are the biggest chunks of the document. But LaTeX
> isn't very pleasant to read, so you can simply send a list of comments in
> relation to section numbers and a few words of context, we'll manage.
> 
> ---
> Version numbers 0.1-0.4 are arbitrary. I'm hoping they make it easier to
> compare differences since the RFC (see [6]), but they haven't been made
> public so far. This is the first public posting since initial proposal
> [1], and the following describes all changes.
> 
> ## v0.1 ##
> 
> Content is the same as the RFC, but formatted to LaTeX. 'make' generates
> one PDF and one HTML document.
> 
> ## v0.2 ##
> 
> Add introductions, improve the topology example and firmware description
> based on feedback and a number of useful discussions.
> 
> ## v0.3 ##
> 
> Add normative sections (MUST, SHOULD, etc). Clarify some things, tighten
> the device and driver behaviour. Unmap semantics are consolidated; they
> are now closer to VFIO Type1 v2 semantics.
> 
> ## v0.4 ##
> 
> Introduce PROBE requests. They provide per-endpoint information to the
> driver that couldn't be described otherwise.
> 
> For the moment, they make it possible to handle MSIs on x86 virtual
> platforms (see 3.2). To do that we communicate reserved IOVA regions,
> which will also be useful for describing regions that cannot be mapped
> for a given endpoint, for instance addresses that correspond to a PCI
> bridge window.
> 
> Introducing such a large framework for this tiny feature may seem
> overkill, but it is needed for future extensions of the virtio-iommu and I
> believe it really is worth the effort.
> 
> ## Future ##
> 
> Other extensions are in preparation. I won't detail them here because v0.4
> already is a lot to digest, but in short, building on top of PROBE:
> 
> * First, since the IOMMU is paravirtualized, the device can expose some
>   properties of the physical topology to the guest, and let it allocate
>   resources more efficiently. For example, when the virtio-iommu manages
>   both physical and emulated endpoints, with different underlying IOMMUs,
>   we now have a way to describe multiple page and block granularities,
>   instead of forcing the guest to use the most restricted one for all
>   endpoints. This will most likely be in v0.5.

An emulated IOMMU has similar requirements, e.g. available PASID bits,
address widths, etc., which may break guest usage if not aligned with
physical limitations. I suppose we can introduce a general interface
through VFIO for all vIOMMU incarnations.

> 
> * Then on top of that, a major improvement will describe hardware
>   acceleration features available to the guest. There is what I call "Page
>   Table Handover" (or simply, from the host POV, "Nested"), the ability
>   for the guest to manipulate its own page tables instead of sending
>   MAP/UNMAP requests to the host. This, along with IO Page Fault
>   reporting, will also permit SVM virtualization on different platforms.

what's your planned cadence for future versions? :-)

> 
> Thanks,
> Jean
> 
> [1] http://www.spinics.net/lists/kvm/msg147990.html
> [2] git://linux-arm.org/virtio-iommu.git branch viommu/v0.4
> http://www.linux-arm.org/git?p=virtio-iommu.git;a=blob;f=dist/v0.4/virtio-iommu-v0.4.pdf
> I reiterate the disclaimers: don't use this document as a reference,
> it's a draft. It's also not an OASIS document yet. It may be riddled
> with mistakes. As this is a working draft, it is unstable and I do not
> guarantee backward compatibility of future versions.
> [3] https://lists.gnu.org/archive/html/qemu-arm/2017-08/msg4.html
> [4] git://linux-arm.org/linux-jpb.git virtio-iommu/v0.4
> Warning: UAPI headers have changed! They didn't follow the spec,
> please 

RE: [virtio-dev] [RFC] virtio-iommu version 0.4

2017-09-21 Thread Tian, Kevin
> From: Jean-Philippe Brucker [mailto:jean-philippe.bruc...@arm.com]
> Sent: Wednesday, September 6, 2017 7:49 PM
> 
> 
> > 2.6.8.2.1
> > Multiple overlapping RESV_MEM properties are merged together. Device
> > requirement? If same types, I assume?
> 
> Combination rules apply to device and driver. When the driver puts
> multiple endpoints in the same domain, combination rules apply. They will
> become important once the guest attempts to do things like combining
> endpoints with different PASID capabilities in the same domain. I'm
> considering these endpoints to be behind different physical IOMMUs.
> 
> For reserved regions, we specify what the driver should do and what the
> device should enforce with regard to mapping IOVAs of overlapping regions.
> For example, if a device has some virtual address RESV_T_MSI and an other
> device has the same virtual address RESV_T_IDENTITY, what should the
> driver do?
> 
> I think it should apply the RESV_T_IDENTITY. RESV_T_MSI is just a special
> case of RESV_T_RESERVED, it's a hint for the IRQ subsystem and doesn't
> have a meaning within a domain. From the DMA mapping point of view, it is
> effectively the same as RESV_T_RESERVED. When merging RESV_T_RESERVED
> and RESV_T_IDENTITY, we should make it RESV_T_IDENTITY, because it is
> required for one endpoint to work (the one with IDENTITY) and is not
> accessed by the other (the driver must not allocate this IOVA range for
> DMA).
> 
> More generally, I'd like to add the following combination table to the
> spec, that shows the resulting reserved region type after merging regions
> from two endpoints. N: no reserved region, R: RESV_T_RESERVED, M:
> RESV_T_MSI, I: RESV_T_IDENTITY. It is symmetric so I didn't fill the
> bottom half.
> 
>       | N | R | M | I
>   ----+---+---+---+---
>     N | N | R | M | ?
>   ----+---+---+---+---
>     R |   | R | R | I
>   ----+---+---+---+---
>     M |   |   | M | I
>   ----+---+---+---+---
>     I |   |   |   | I
> 
> The 'I' column is problematic. If one endpoint has an IDENTITY region and
> the other doesn't have anything at that virtual address, then making that
> region an identity mapping will result in the second endpoint being able
> to access firmware memory. If using nested translation, stage-2 can help
> us here. But in general we should allow the device to reject an attach
> that would result in a N+I combination if the host considers it too dodgy.
> I think the other combinations are safe enough.
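
A sketch of those merge rules as code (the enum is invented for
illustration; the error return reflects the problematic N+I combination,
which the device may reject):

	enum resv_type { RESV_NONE, RESV_RESERVED, RESV_MSI, RESV_IDENTITY };

	static int merge_resv(enum resv_type a, enum resv_type b,
			      enum resv_type *out)
	{
		if (a == RESV_IDENTITY || b == RESV_IDENTITY) {
			if (a == RESV_NONE || b == RESV_NONE)
				return -1;		/* N+I: may be rejected */
			*out = RESV_IDENTITY;		/* R+I, M+I, I+I -> I */
		} else if (a == RESV_RESERVED || b == RESV_RESERVED) {
			*out = RESV_RESERVED;		/* N+R, R+R, R+M -> R */
		} else if (a == RESV_MSI || b == RESV_MSI) {
			*out = RESV_MSI;		/* N+M, M+M -> M */
		} else {
			*out = RESV_NONE;
		}
		return 0;
	}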
> 

Will overlap happen in reality? Can we simplify the spec by having the
device not report overlapping regions?

Thanks
Kevin


RE: [RFC] virtio-iommu version 0.4

2017-09-21 Thread Tian, Kevin
> From: Jean-Philippe Brucker
> Sent: Wednesday, September 6, 2017 7:55 PM
> 
> Hi Kevin,
> 
> On 28/08/17 08:39, Tian, Kevin wrote:
> > Here comes some comments:
> >
> > 1.1 Motivation
> >
> > You describe I/O page fault handling as future work. It seems you
> > considered only recoverable faults (since "aka. PCI PRI" is used). What
> > about other, unrecoverable faults, e.g. what to do if a virtual DMA
> > request doesn't find a valid mapping? Even when there is no PRI support,
> > we need some basic form of fault reporting mechanism to indicate such
> > errors to the guest.
> 
> I am considering recoverable faults as the end goal, but reporting
> unrecoverable faults should use the same queue, with slightly different
> fields and no need for the driver to reply to the device.

What about adding a placeholder for now? Though the same mechanism
can be reused, it's an essential part of making the virtio-iommu
architecture complete, even before talking about support for recoverable
faults. :-)

> 
> > 2.6.8.2 Property RESV_MEM
> >
> > I'm not immediately clear on when VIRTIO_IOMMU_PROBE_RESV_MEM_T_ABORT
> > should be explicitly reported. Is there any real example on a bare metal
> > IOMMU? Usually reserved memory is reported to the CPU through other
> > methods (e.g. e820 on x86 platforms). Of course MSI is a special case
> > which is covered by BYPASS and the MSI flag... If yes, maybe you can
> > also include an example in the implementation notes.
> 
> The RESV_MEM regions only describe IOVA space for the moment, not
> guest-physical, so I guess they provide different information than e820.
> 
> I think a useful example is the PCI bridge windows reported by the Linux
> host to userspace using RESV_RESERVED regions (see
> iommu_dma_get_resv_regions). If I understand correctly, they represent
> DMA addresses that shouldn't be accessed by endpoints because they won't
> reach the IOMMU. These are specific to the physical topology: a device
> will have different reserved regions depending on the PCI slot it
> occupies.
> 
> When handled properly, PCI bridge windows quickly become a nuisance. With
> kvmtool we observed that carving out their addresses globally removes a
> lot of useful GPA space from the guest. Without a virtual IOMMU we can
> either ignore them and hope everything will be fine, or remove all
> reserved regions from the GPA space (which currently means editing the
> static guest-physical map by hand...)
> 
> That's where RESV_MEM_T_ABORT comes in handy with virtio-iommu. It
> describes reserved IOVAs for a specific endpoint, and therefore removes
> the need to carve the window out of the whole guest.

Understand and thanks for elaboration.

> 
> > Another thing I want to ask your opinion, about whether there is value of
> > adding another subtype (MEM_T_IDENTITY), asking for identity mapping
> > in the address space. It's similar to Reserved Memory Region Reporting
> > (RMRR) structure defined in VT-d, to indicate BIOS allocated reserved
> > memory ranges which may be DMA target and has to be identity mapped
> > when DMA remapping is enabled. I'm not sure whether ARM has similar
> > capability and whether there might be a general usage beyond VT-d. For
> > now the only usage in my mind is to assign a device with RMRR associated
> > on VT-d (Intel GPU, or some USB controllers) where the RMRR info needs
> > propagated to the guest (since identity mapping also means reservation
> > of virtual address space).
> 
> Yes I think adding MEM_T_IDENTITY will be necessary. I can see they are
> used for both iGPU and USB controllers on my x86 machines. Do you know
> more precisely what they are used for by the firmware?

The VT-d spec has a clear description:

3.14 Handling Requests to Reserved System Memory
Reserved system memory regions are typically allocated by BIOS at boot 
time and reported to OS as reserved address ranges in the system memory 
map. Requests-without-PASID to these reserved regions may either occur 
as a result of operations performed by the system software driver (for 
example in the case of DMA from unified memory access (UMA) graphics 
controllers to graphics reserved memory), or may be initiated by non 
system software (for example in case of DMA performed by a USB 
controller under BIOS SMM control for legacy keyboard emulation). 
For proper functioning of these legacy reserved memory usages, when 
system software enables DMA remapping, the second-level translation 
structures for the respective devices are expected to be set up to provide
identity mapping for the specified reserved memory regions with read 
and write permissions.

(one specific example for GPU happens in legacy VGA usage in earl
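
For reference, propagating such a region boils down to an identity mapping
through the IOMMU API, roughly as follows (using the Linux iommu_map() of
that era; error handling omitted):

	/* Identity-map a reserved (RMRR-style) region: IOVA == physical address. */
	static int map_identity(struct iommu_domain *domain, phys_addr_t start,
				phys_addr_t end)
	{
		return iommu_map(domain, start, start, end - start + 1,
				 IOMMU_READ | IOMMU_WRITE);
	}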

RE: [RFC 2/3] virtio-iommu: device probing and operations

2017-08-21 Thread Tian, Kevin
> From: Jean-Philippe Brucker [mailto:jean-philippe.bruc...@arm.com]
> Sent: Monday, April 24, 2017 11:06 PM
>    1. Attach device
>    
> 
>  struct virtio_iommu_req_attach {
>          le32    address_space;
>          le32    device;
>          le32    flags/reserved;
>  };
> 
>  Attach a device to an address space. 'address_space' is an identifier
>  unique to the guest. If the address space doesn't exist in the IOMMU
> >>>
> >>> Based on your description this address space ID is per operation right?
> >>> MAP/UNMAP and page-table sharing should have different ID spaces...
> >>
> >> I think it's simpler if we keep a single IOASID space per virtio-iommu
> >> device, because the maximum number of address spaces (described by
> >> ioasid_bits) might be a restriction of the pIOMMU. For page-table
> >> sharing you still need to define which devices will share a page
> >> directory using ATTACH requests, though that interface is not set in
> >> stone.
> >
> > got you. yes VM is supposed to consume less IOASIDs than physically
> > available. It doesn’t hurt to have one IOASID space for both IOVA
> > map/unmap usages (one IOASID per device) and SVM usages (multiple
> > IOASIDs per device). The former is digested by software and the latter
> > will be bound to hardware.
> >
> 
> Hmm, I'm using address space indexed by IOASID for "classic" IOMMU, and
> then contexts indexed by PASID when talking about SVM. So in my mind an
> address space can have multiple sub-address-spaces (contexts). Number of
> IOASIDs is a limitation of the pIOMMU, and number of PASIDs is a
> limitation of the device. Therefore attaching devices to address spaces
> would update the number of available contexts in that address space. The
> terminology is not ideal, and I'd be happy to change it for something more
> clear.
> 

(sorry to pick up this old thread, as the .tex one is not good for review
and this thread provides necessary background for IOASID).

Hi, Jean,

I'd like to hear more clarification regarding the relationship between
IOASID and PASID. When reading back the above explanation, it looks
confusing to me now (though I might have gotten the meaning months ago :/).
At least Intel VT-d only understands PASID (or you can think IOASID =
PASID). There is no such layered address space concept. Then, for
map/unmap type requests, do you intend to steal some PASIDs for
that purpose on such an architecture (since IOASID is a mandatory field
in map/unmap requests)?

Thanks
Kevin

RE: [RFC 2/3] virtio-iommu: device probing and operations

2017-08-22 Thread Tian, Kevin
> From: Jean-Philippe Brucker [mailto:jean-philippe.bruc...@arm.com]
> Sent: Monday, August 21, 2017 8:00 PM
> 
> On 21/08/17 08:59, Tian, Kevin wrote:
> >> From: Jean-Philippe Brucker [mailto:jean-philippe.bruc...@arm.com]
> >> Sent: Monday, April 24, 2017 11:06 PM
> >>>>>>   1. Attach device
> >>>>>>   
> >>>>>>
> >>>>>> struct virtio_iommu_req_attach {
> >>>>>>         le32    address_space;
> >>>>>>         le32    device;
> >>>>>>         le32    flags/reserved;
> >>>>>> };
> >>>>>>
> >>>>>> Attach a device to an address space. 'address_space' is an identifier
> >>>>>> unique to the guest. If the address space doesn't exist in the
> >>>>>> IOMMU
> >>>>>
> >>>>> Based on your description this address space ID is per operation
> >>>>> right? MAP/UNMAP and page-table sharing should have different ID
> >>>>> spaces...
> >>>>
> >>>> I think it's simpler if we keep a single IOASID space per virtio-iommu
> >>>> device, because the maximum number of address spaces (described by
> >>>> ioasid_bits) might be a restriction of the pIOMMU. For page-table
> >>>> sharing you still need to define which devices will share a page
> >>>> directory using ATTACH requests, though that interface is not set in
> >>>> stone.
> >>>
> >>> got you. yes VM is supposed to consume less IOASIDs than physically
> >>> available. It doesn’t hurt to have one IOASID space for both IOVA
> >>> map/unmap usages (one IOASID per device) and SVM usages (multiple
> >>> IOASIDs per device). The former is digested by software and the latter
> >>> will be bound to hardware.
> >>>
> >>
> >> Hmm, I'm using address space indexed by IOASID for "classic" IOMMU,
> >> and then contexts indexed by PASID when talking about SVM. So in my
> >> mind an address space can have multiple sub-address-spaces (contexts).
> >> Number of IOASIDs is a limitation of the pIOMMU, and number of PASIDs
> >> is a limitation of the device. Therefore attaching devices to address
> >> spaces would update the number of available contexts in that address
> >> space. The terminology is not ideal, and I'd be happy to change it for
> >> something more clear.
> >>
> >
> > (sorry to pick up this old thread, as the .tex one is not good for review
> > and this thread provides necessary background for IOASID).
> >
> > Hi, Jean,
> >
> > I'd like to hear more clarification regarding the relationship between
> > IOASID and PASID. When reading back above explanation, it looks
> > confusing to me now (though I might get the meaning months ago :/).
> > At least Intel VT-d only understands PASID (or you can think IOASID
> > =PASID). There is no such layered address space concept. Then for
> > map/unmap type request, do you intend to steal some PASIDs for
> > that purpose on such architecture (since IOASID is a mandatory field
> > in map/unmap request)?
> 
> IOASID is a logical ID, it isn't used by hardware. The address space
> concept in virtio-iommu allows grouping endpoints together, so that they
> have the same address space. I thought it was pretty much the same as
> "domains" in VT-d? In any case, it is the same as domains in Linux. An
> IOASID provides a handle for communication between virtio-iommu device
> and driver, but unlike PASID, the IOASID number doesn't mean anything
> outside of virtio-iommu.

Thanks. It's clear to me then.

Btw, does it make more sense to use "domain id" instead of "IO address
space id"? For one, when talking about layered address spaces,
usually a parent address space is a superset of all child address spaces,
which doesn't apply to this case, since the anonymous address
space and the PASID-tagged address spaces are completely isolated. Instead,
'domain' is a more inclusive terminology to embrace multiple address
spaces. For two, 'domain' is better aligned with software terminology (e.g.
iommu_domain) and is easier for people to catch up with. :-)

> 
> I haven't introduced PASIDs in public virtio-iommu documents yet, but the
> way I intend it, PASID != IOASID. We will still have a logical address
> space identified by IOASID, that can contain multiple contexts identified
> by PASID. At the moment, after the ATTACH request, an address space
> contains a single anonymous context (NO PASID) 

RE: [RFC] virtio-iommu version 0.4

2017-08-28 Thread Tian, Kevin
> From: Jean-Philippe Brucker [mailto:jean-philippe.bruc...@arm.com]
> Sent: Wednesday, August 23, 2017 6:01 PM
> 
> On 04/08/17 19:19, Jean-Philippe Brucker wrote:
> > Other extensions are in preparation. I won't detail them here because
> > v0.4 already is a lot to digest, but in short, building on top of PROBE:
> >
> > * First, since the IOMMU is paravirtualized, the device can expose some
> >   properties of the physical topology to the guest, and let it allocate
> >   resources more efficiently. For example, when the virtio-iommu
> >   manages both physical and emulated endpoints, with different
> >   underlying IOMMUs, we now have a way to describe multiple page and
> >   block granularities, instead of forcing the guest to use the most
> >   restricted one for all endpoints. This will most likely be in v0.5.
> 
> In order to extend requests with PASIDs and (later) nested mode, I intend
> to rename the "address_space" field to "domain", since it is a lot more
> precise about what the field is referring to, and the current name would
> make these extensions confusing. Please find the rationale at [1].
> "ioasid_bits" will be "domain_bits" and "VIRTIO_IOMMU_F_IOASID_BITS"
> will be "VIRTIO_IOMMU_F_DOMAIN_BITS".
> 
> For those that had time to read this version, do you have other comments
> and suggestions about v0.4? Otherwise it is the only update I have for
> v0.5 (along with fine-grained address range and page size properties from
> the quoted text) and I will send it soon.
> 
> In particular, please tell me now if you see the need for other
> destructive changes like this one. They will be impossible to introduce
> once a driver or device is upstream.
> 
> Thanks,
> Jean
> 
> [1] https://www.spinics.net/lists/kvm/msg154573.html

Here comes some comments:

1.1 Motivation

You describe I/O page fault handling as future work. It seems you
considered only recoverable faults (since "aka. PCI PRI" is used). What
about other, unrecoverable faults, e.g. what to do if a virtual DMA
request doesn't find a valid mapping? Even when there is no PRI support,
we need some basic form of fault reporting mechanism to indicate such
errors to the guest.

2.6.8.2 Property RESV_MEM

I'm not immediately clear on when VIRTIO_IOMMU_PROBE_RESV_MEM_T_ABORT
should be explicitly reported. Is there any real example on a bare metal
IOMMU? Usually reserved memory is reported to the CPU through other methods
(e.g. e820 on x86 platforms). Of course MSI is a special case which is
covered by BYPASS and the MSI flag... If yes, maybe you can also include an
example in the implementation notes.

Another thing I want to ask your opinion about: whether there is value in
adding another subtype (MEM_T_IDENTITY), asking for identity mapping
in the address space. It's similar to the Reserved Memory Region Reporting
(RMRR) structure defined in VT-d, to indicate BIOS-allocated reserved
memory ranges which may be DMA targets and have to be identity mapped
when DMA remapping is enabled. I'm not sure whether ARM has a similar
capability and whether there might be a general usage beyond VT-d. For
now the only usage in my mind is to assign a device with an RMRR associated
on VT-d (Intel GPU, or some USB controllers), where the RMRR info needs to
be propagated to the guest (since identity mapping also means reservation
of virtual address space).

2.6.8.2.3 Device Requirements: Property RESV_MEM

--citation start--
If an endpoint is attached to an address space, the device SHOULD leave 
any access targeting one of its VIRTIO_IOMMU_PROBE_RESV_MEM_T_BYPASS 
regions pass through untranslated. In other words, the device SHOULD 
handle such a region as if it was identity-mapped (virtual address equal to
physical address). If the endpoint is not attached to any address space, 
then the device MAY abort the transaction.
--citation end

I have a question about the last sentence. From the definition of BYPASS,
it's orthogonal to whether there is an address space attached, so should
we still allow the "MAY abort" behavior?

Thanks
Kevin 


RE: [virtio-dev] Re: [RFC] vhost: introduce mdev based hardware vhost backend

2018-04-10 Thread Tian, Kevin
> From: Liang, Cunming
> Sent: Tuesday, April 10, 2018 10:24 PM
> 
[...]
> >
> > As others said, we do not need to go overboard. A couple of small
> > vendor-specific quirks in qemu isn't a big deal.
> 
> It's quite challenging to identify whether it's small or not; there's no
> uniform metric.
> 
> Being dependent only on QEMU itself is the obvious benefit; the tradeoff
> is building the entire device driver. We don't object to doing that in
> QEMU, but want to make sure we understand the boundary size clearly.
> 

It might be helpful if you can post some sample code using the proposed
framework - then people would have a clear feeling about the size being
talked about here and whether it falls under the concept of 'small quirks'.

Thanks
Kevin


RE: [virtio-dev] [RFC] virtio-iommu version 0.6

2018-03-19 Thread Tian, Kevin
> From: Jean-Philippe Brucker
> Sent: Wednesday, February 7, 2018 2:11 AM
> 
> Please find version 0.6 of the virtio-iommu specification at the following
> locations.
> 
> Document: http://jpbrucker.net/virtio-iommu/spec/virtio-iommu.pdf
>   http://jpbrucker.net/virtio-iommu/spec/virtio-iommu.html
> 
> Sources: git://linux-arm.org/virtio-iommu.git viommu/v0.6
> 
> I didn't receive any comment for v0.5 [1], so this update is fairly light:
> * Changed range definition in map and unmap requests
> * Use virtio-iommu device ID
> * Update IORT node ID and simplify layout
> 
> The following document shows the difference between v0.5 and v0.6:
> http://jpbrucker.net/virtio-iommu/spec/diffs/virtio-iommu-pdf-diff-v0.5-v0.6.pdf
> 
> Next version will hopefully add vSVA (PASID+PRI) and/or architectural
> optimizations, but I can't provide a timeline yet. I'll probably send a
> small draft first.
> 
> I will send the Linux driver soon. In the meantime you can fetch it on my
> development branches, based on next-20180206.
> 
> git://linux-arm.org/linux-jpb.git virtio-iommu/devel
> git://linux-arm.org/kvmtool-jpb.git virtio-iommu/devel
> 
> Any comments welcome!
> 
> Thanks,
> Jean
> 
> [1] https://www.spinics.net/lists/kvm/msg157402.html

some comments here:

--

The BYPASS feature bit is not covered in 2.3.1/2.3.2/2.3.3. Is that
intended?

--2.6.2 Device Requirements: Device operations--

"If the VIRTIO_IOMMU_F_INPUT_RANGE feature is offered, 
the device MUST truncate the range described by virt_start 
and virt_end in requests to ft in the range described by 
input_range."

"If the VIRTIO_IOMMU_F_DOMAIN_BITS is offered, the device 
MUST ignore bits above domain_bits in field domain of requests."

shouldn't the device return an error in the above situations instead
of continuing the operation with modified values?
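
For reference, the behavior being questioned amounts to the device
silently clamping the fields rather than rejecting the request, along
these lines (a sketch, not spec text):

	static void clamp_request(uint64_t *virt_start, uint64_t *virt_end,
				  uint32_t *domain, uint64_t in_start,
				  uint64_t in_end, unsigned int domain_bits)
	{
		if (*virt_start < in_start)
			*virt_start = in_start;	/* truncate, don't error */
		if (*virt_end > in_end)
			*virt_end = in_end;
		if (domain_bits < 32)
			*domain &= (1U << domain_bits) - 1; /* drop high bits */
	}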

--2.6.4 DETACH request--

" Detach an endpoint from its domain. When this request 
completes, the endpoint cannot access any mapping from that 
domain anymore "

Does it make sense to mention the BYPASS situation for this request?

--2.6.8.2 Property RESV_MEM --

"VIRTIO_IOMMU_RESV_MEM_T_RESERVED (0) Accesses to 
virtual addresses in this region are not translated by the device. 
They may either be aborted by the device (or the underlying 
IOMMU), bypass it, or never even reach it"

In 3.3 (hardware device assignment) you described an example where a
reserved range is stolen by the host to map the MSI
doorbell... such mapping behavior is not described above.

Then comes a question about VIRTIO_IOMMU_RESV_MEM_T_MSI.
I know there was quite some discussion around this flag before,
but my mental picture is still a bit fuzzy on its
usage, based on the examples in the implementation notes:

- for x86, you describe it as indicating MSI bypass for
0xfeex. However, the guest doesn't need to know this fact. The only
requirement is to treat it as a reserved range (as on bare metal),
so T_RESERVED is sufficient for this purpose.

- for ARM, either the guest or the host chooses a virtual
address for mapping to the MSI doorbell. The former doesn't require
a reserved range; for the latter, T_RESERVED is also enough, as in
the example in the hardware device assignment section.

Then what's the real point of differentiating MSI bypass from a
normal reserved range? Is there another example where the guest
may use such info to do something special?

-- 3.2 Message Signaled Interrupts --

" Section 3.2.2 describes the added complexity (from the host 
point of view) of translating the IRQ chip doorbell "

However, 3.2.2 later has only one sentence about host
complexity - "However, this mode of operations may add
significant complexity in the host implementation". If you plan
to elaborate on that part in the future, please add a note. :-)

Thanks
Kevin

RE: [PATCH 1/4] iommu: Add virtio-iommu driver

2018-03-22 Thread Tian, Kevin
> From: Robin Murphy [mailto:robin.mur...@arm.com]
> Sent: Wednesday, March 21, 2018 10:24 PM
> 
> On 21/03/18 13:14, Jean-Philippe Brucker wrote:
> > On 21/03/18 06:43, Tian, Kevin wrote:
> > [...]
> >>> +
> >>> +#include 
> >>> +
> >>> +#define MSI_IOVA_BASE		0x8000000
> >>> +#define MSI_IOVA_LENGTH		0x100000
> >>
> >> this is ARM specific, and according to virtio-iommu spec isn't it
> >> better probed on the endpoint instead of hard-coding here?
> >
> > These values are arbitrary, not really ARM-specific even if ARM is the
> > only user yet: we're just reserving a random IOVA region for mapping
> > MSIs. It is hard-coded because of the way iommu-dma.c works, but I
> > don't quite remember why that allocation isn't dynamic.
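
For context, these constants feed the IOMMU core through the driver's
resv-regions hook, mirroring what arm-smmu does (a sketch of the pattern;
the actual virtio-iommu driver code may differ):

	static void viommu_get_resv_regions(struct device *dev,
					    struct list_head *head)
	{
		int prot = IOMMU_WRITE | IOMMU_NOEXEC | IOMMU_MMIO;
		struct iommu_resv_region *region;

		/* Reserve a software-chosen IOVA window for MSI doorbells. */
		region = iommu_alloc_resv_region(MSI_IOVA_BASE, MSI_IOVA_LENGTH,
						 prot, IOMMU_RESV_SW_MSI);
		if (!region)
			return;

		list_add_tail(&region->list, head);
	}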
> 
> The host kernel needs to have *some* MSI region in place before the
> guest can start configuring interrupts, otherwise it won't know what
> address to give to the underlying hardware. However, as soon as the host
> kernel has picked a region, host userspace needs to know that it can no
> longer use addresses in that region for DMA-able guest memory. It's a
> lot easier when the address is fixed in hardware and the host userspace
> will never be stupid enough to try and VFIO_IOMMU_DMA_MAP it, but in
> the more general case where MSI writes undergo IOMMU address
> translation so it's an arbitrary IOVA, this has the potential to
> conflict with stuff like guest memory hotplug.
> 
> What we currently have is just the simplest option, with the host kernel
> just picking something up-front and pretending to host userspace that
> it's a fixed hardware address. There's certainly scope for it to be a
> bit more dynamic in the sense of adding an interface to let userspace
> move it around (before attaching any devices, at least), but I don't
> think it's feasible for the host kernel to second-guess userspace enough
> to make it entirely transparent like it is in the DMA API domain case.
> 
> Of course, that's all assuming the host itself is using a virtio-iommu
> (e.g. in a nested virt or emulation scenario). When it's purely within a
> guest then an MSI reservation shouldn't matter so much, since the guest
> won't be anywhere near the real hardware configuration anyway.
> 
> Robin.

Curious: since we are defining a new IOMMU architecture anyway, is it
possible to avoid that ARM-specific burden completely?

Thanks
Kevin



RE: [virtio-dev] [RFC] virtio-iommu version 0.6

2018-03-22 Thread Tian, Kevin
> From: Jean-Philippe Brucker [mailto:jean-philippe.bruc...@arm.com]
> Sent: Wednesday, March 21, 2018 9:14 PM
> 
> Hi Kevin,
> 
> Thanks for the comments
> 
> On 19/03/18 10:03, Tian, Kevin wrote:
> > BYPASS feature bit is not covered in "2.3.1/2.3.2/2.3.3"". Is it
> > intended?
> 
> In my opinion BYPASS is a bit different from the other features: while the
> others are needed for correctness, this one is optional and even if the
> guest supports BYPASS, it should be allowed not to accept it. For security
> reasons it may not want to let endpoints access the whole guest-physical
> address space.

OK, possibly because I'm not familiar with virtio spec conventions.
My original feeling was that each feature bit would have a behavior
description regarding whether the device reports it and whether the
driver accepts it. If there is no need to cover optional features, then
it's fine with me. :-)

> 
> > --2.6.2 Device Requirements: Device operations--
> >
> > "If the VIRTIO_IOMMU_F_INPUT_RANGE feature is offered,
> > the device MUST truncate the range described by virt_start
> > and virt_end in requests to fit in the range described by
> > input_range."
> >
> > "If the VIRTIO_IOMMU_F_DOMAIN_BITS is offered, the device
> > MUST ignore bits above domain_bits in field domain of requests."
> >
> > shouldn't the device return an error in the above situations instead
> > of continuing the operation with modified values?
> 
> The intent was to allow device and driver to pass metadata in the top bits
> of pointers and domain IDs, perhaps for a future extension or for
> debugging. I'm not convinced it's useful anymore (and do wonder what my
> initial reasoning was...) Such an extension would be determined by a new
> feature bit and the device would know that the fields contain additional
> info, if the driver accepted that feature.
> 
> > --2.6.4 DETACH request--
> >
> > " Detach an endpoint from its domain. When this request
> > completes, the endpoint cannot access any mapping from that
> > domain anymore "
> >
> > Does it make sense to mention BYPASS situation upon this request?
> 
> Sure, I'll add something to clarify that situation
> 
> > --2.6.8.2 Property RESV_MEM --
> >
> > "VIRTIO_IOMMU_RESV_MEM_T_RESERVED (0) Accesses to
> > virtual addresses in this region are not translated by the device.
> > They may either be aborted by the device (or the underlying
> > IOMMU), bypass it, or never even reach it"
> >
> > in 3.3 hardware device assignment you described an example
> > where a reserved range is stolen by the host to map the MSI
> > doorbell... such map behavior is not described above.
> 
> Right, we can say that accesses to the region may be used for host
> translation
> 
> > Then comes a question for VIRTIO_IOMMU_RESV_MEM_T_MSI.
> > I know there was quite some discussion around this flag before,
> > but I still find it a bit difficult to understand its usage based
> > on the examples in the implementation notes:
> >
> > - for x86, you describe it as indicating MSI bypass for
> > 0xfeex. However the guest doesn't need to know this fact. The only
> > requirement is to treat it as a reserved range (as on bare metal),
> > so T_RESERVED is sufficient for this purpose.
> > - for ARM, either let the guest or let the host choose a virtual
> > address for mapping to the MSI doorbell. The former doesn't require
> > a reserved range; for the latter T_RESERVED is also enough, as in
> > the example in the hardware device assignment section.
> 
> It might be nicer to have the host decide it, but when the physical IOMMU
> is ARM SMMU, nesting translation complicates things, because the guest
> *has* to create a mapping:

Let me confirm one thing first: v0.6 doesn't support binding to the
guest IOVA page table yet. So with the current map/unmap interface
there is no such complexity, right? Stage-1 is just a shadowed
translation (IOVA->HPA) of the guest-side one (IOVA->GPA), with
nesting disabled. In that case the default behavior is the
host-reserved style.
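
Just to double-check we picture the same flow, a minimal sketch of that
shadowing as seen from the Qemu side (gpa_to_hva() is a made-up
stand-in for Qemu's guest-memory lookup; error handling omitted):

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    extern void *gpa_to_hva(uint64_t gpa);      /* assumed helper */

    /* Hypothetical sketch only: shadow a guest MAP request (IOVA->GPA)
     * into the physical IOMMU (IOVA->HPA) through VFIO.
     */
    static int shadow_guest_map(int container_fd, uint64_t iova,
                                uint64_t gpa, uint64_t size)
    {
            struct vfio_iommu_type1_dma_map map = {
                    .argsz = sizeof(map),
                    .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
                    .vaddr = (uintptr_t)gpa_to_hva(gpa), /* GPA -> HVA */
                    .iova  = iova,
                    .size  = size,
            };

            /* the host kernel pins the HVA range and installs IOVA -> HPA */
            return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
    }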

Then comes the nested scenario:

> 
> * The guest is in charge of stage-1. It creates IOVA->GPA mapping for the
>   MSI doorbell. The GPA is mapped to the physical MSI doorbell at
>   stage-2 by the host.

Must the above GPA be mapped to the physical MSI doorbell?

Ideally:

1) Host reserves some IOVA range for mapping the MSI doorbell
2) The range is reported to user space
3) Qemu reports the range as reserved on endpoints, marked
as T_IDENTITY (a new type to be introduced), meaning the guest
needs to set up an identity mapping in stage-1
4) Then the device should be able to ring the physical MSI doorbell
5) I'm not sure 

RE: [PATCH 1/4] iommu: Add virtio-iommu driver

2018-03-21 Thread Tian, Kevin
> From: Jean-Philippe Brucker [mailto:jean-philippe.bruc...@arm.com]
> Sent: Wednesday, February 14, 2018 10:54 PM
> 
> The virtio IOMMU is a para-virtualized device, allowing to send IOMMU
> requests such as map/unmap over virtio-mmio transport without
> emulating
> page tables. This implementation handles ATTACH, DETACH, MAP and
> UNMAP
> requests.
> 
> The bulk of the code transforms calls coming from the IOMMU API into
> corresponding virtio requests. Mappings are kept in an interval tree
> instead of page tables.
> 
> Signed-off-by: Jean-Philippe Brucker 

[...]
> diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
> new file mode 100644
> index ..a9c9245e8ba2
> --- /dev/null
> +++ b/drivers/iommu/virtio-iommu.c
> @@ -0,0 +1,960 @@
> +/*
> + * Virtio driver for the paravirtualized IOMMU
> + *
> + * Copyright (C) 2018 ARM Limited
> + * Author: Jean-Philippe Brucker 
> + *
> + * SPDX-License-Identifier: GPL-2.0
> + */
> +
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include 
> +
> +#define MSI_IOVA_BASE 0x8000000
> +#define MSI_IOVA_LENGTH 0x100000

This is ARM-specific, and according to the virtio-iommu spec isn't it
better probed on the endpoint instead of hard-coded here?
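
To illustrate, a rough sketch (not a patch, and assuming the v0.6 probe
layout with subtype/start/end fields) of deriving the region from a
RESV_MEM property reported on the endpoint instead:

    /* Sketch only: turn a probed VIRTIO_IOMMU_PROBE_T_RESV_MEM property
     * into an iommu_resv_region, rather than hard-coding the MSI window.
     */
    static int viommu_add_resv_mem(struct list_head *regions,
                                   struct virtio_iommu_probe_resv_mem *mem)
    {
            int prot = IOMMU_WRITE | IOMMU_NOEXEC | IOMMU_MMIO;
            u64 start = le64_to_cpu(mem->start);
            u64 size = le64_to_cpu(mem->end) - start + 1;
            enum iommu_resv_type type;
            struct iommu_resv_region *region;

            switch (mem->subtype) {
            case VIRTIO_IOMMU_RESV_MEM_T_MSI:
                    type = IOMMU_RESV_MSI;          /* or IOMMU_RESV_SW_MSI */
                    break;
            case VIRTIO_IOMMU_RESV_MEM_T_RESERVED:
            default:
                    type = IOMMU_RESV_RESERVED;
                    break;
            }

            region = iommu_alloc_resv_region(start, size, prot, type);
            if (!region)
                    return -ENOMEM;
            list_add(&region->list, regions);
            return 0;
    }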

Thanks
Kevin


RE: [PATCH 1/4] iommu: Add virtio-iommu driver

2018-03-23 Thread Tian, Kevin
> From: Tian, Kevin
> Sent: Thursday, March 22, 2018 6:06 PM
> 
> > From: Robin Murphy [mailto:robin.mur...@arm.com]
> > Sent: Wednesday, March 21, 2018 10:24 PM
> >
> > On 21/03/18 13:14, Jean-Philippe Brucker wrote:
> > > On 21/03/18 06:43, Tian, Kevin wrote:
> > > [...]
> > >>> +
> > >>> +#include 
> > >>> +
> > >>> +#define MSI_IOVA_BASE 0x8000000
> > >>> +#define MSI_IOVA_LENGTH 0x100000
> > >>
> > >> This is ARM-specific, and according to the virtio-iommu spec isn't it
> > >> better probed on the endpoint instead of hard-coded here?
> > >
> > > These values are arbitrary, not really ARM-specific even if ARM is the
> > > only user yet: we're just reserving a random IOVA region for mapping
> > MSIs.
> > > It is hard-coded because of the way iommu-dma.c works, but I don't
> > quite
> > > remember why that allocation isn't dynamic.
> >
> > The host kernel needs to have *some* MSI region in place before the
> > guest can start configuring interrupts, otherwise it won't know what
> > address to give to the underlying hardware. However, as soon as the host
> > kernel has picked a region, host userspace needs to know that it can no
> > longer use addresses in that region for DMA-able guest memory. It's a
> > lot easier when the address is fixed in hardware and the host userspace
> > will never be stupid enough to try and VFIO_IOMMU_DMA_MAP it, but in
> > the
> > more general case where MSI writes undergo IOMMU address translation
> > so
> > it's an arbitrary IOVA, this has the potential to conflict with stuff
> > like guest memory hotplug.
> >
> > What we currently have is just the simplest option, with the host kernel
> > just picking something up-front and pretending to host userspace that
> > it's a fixed hardware address. There's certainly scope for it to be a
> > bit more dynamic in the sense of adding an interface to let userspace
> > move it around (before attaching any devices, at least), but I don't
> > think it's feasible for the host kernel to second-guess userspace enough
> > to make it entirely transparent like it is in the DMA API domain case.
> >
> > Of course, that's all assuming the host itself is using a virtio-iommu
> > (e.g. in a nested virt or emulation scenario). When it's purely within a
> > guest then an MSI reservation shouldn't matter so much, since the guest
> > won't be anywhere near the real hardware configuration anyway.
> >
> > Robin.
> 
> Out of curiosity: since we are defining a new IOMMU architecture
> anyway, is it possible to avoid this ARM-specific burden completely?
> 

OK, after some study around those tricks, below is what I learned:

- The MSI_IOVA window is used only on request (iommu_dma_get_msi_page),
and is not meant to take effect on all architectures once initialized,
e.g. the ARM GIC uses it but x86 does not. So it is reasonable for the
virtio-iommu driver to implement such a capability;

- I considered whether the hardware MSI doorbell could always be
reported on virtio-iommu, since it's newly defined. It looks like
there is a problem if the underlying IOMMU uses the sw-managed MSI
style: a valid mapping is expected at all levels of translation,
meaning the guest has to manage the stage-1 mapping in a nested
configuration, since stage-1 is owned by the guest.

Then virtio-iommu is naturally expected to report the same MSI
model as supported by the underlying hardware. Below are some
further thoughts along this route (using 'IOMMU' for the
physical one and 'virtio-iommu' for the virtual one):



In the scope of the current virtio-iommu spec v0.6, there is no nested
consideration yet. The guest driver is expected to use the MAP/UNMAP
interface on assigned endpoints. In this case the MAP requests
(IOVA->GPA) are caught and maintained within Qemu, which then
further talks to VFIO to map IOVA->HPA in the IOMMU.

Qemu can learn the MSI model of the IOMMU from sysfs.
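
For example (rough sketch; one caveat I see is that sysfs currently
prints both IOMMU_RESV_MSI and IOMMU_RESV_SW_MSI as "msi", so fully
distinguishing the two models may need an extra hint):

    #include <stdio.h>
    #include <string.h>

    /* Illustrative only: scan one IOMMU group's reserved_regions file,
     * whose lines look like "0x08000000 0x080fffff msi".
     */
    static int group_has_msi_window(int group)
    {
            char path[64], type[16];
            unsigned long long start, end;
            int found = 0;
            FILE *f;

            snprintf(path, sizeof(path),
                     "/sys/kernel/iommu_groups/%d/reserved_regions", group);
            f = fopen(path, "r");
            if (!f)
                    return -1;
            while (fscanf(f, "%llx %llx %15s", &start, &end, type) == 3)
                    if (!strcmp(type, "msi"))
                            found = 1;
            fclose(f);
            return found;
    }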

For a hardware MSI doorbell (x86 and some ARM):
* Host kernel reports to Qemu as IOMMU_RESV_MSI
* Qemu reports to guest as VIRTIO_IOMMU_RESV_MEM_T_MSI
* Guest takes the range as IOMMU_RESV_MSI, i.e. reserved
* Qemu's MAP database has no mapping for the doorbell
* The physical IOMMU page table has no mapping for the doorbell
* MSIs from a passthrough device bypass the IOMMU
* MSIs from an emulated device bypass the virtio-iommu

For a software MSI doorbell (most ARM):
* Host kernel reports to Qemu as IOMMU_RESV_SW_MSI
* Qemu reports to guest as VIRTIO_IOMMU_RESV_MEM_T_RESERVED
* Guest takes the range as IOMMU_RESV_RESERVED
* vGIC requests to map the 'GPA of the virtual doorbell'
* A map request (IOVA->GPA) is sent on the endpoint
* Qemu maintains the mapping in its MAP database
* but no VFIO_MA

RE: [RFC PATCH 1/2] mdev: device id support

2019-09-17 Thread Tian, Kevin
> From: Jason Wang
> Sent: Thursday, September 12, 2019 5:40 PM
> 
> Mdev bus only support vfio driver right now, so it doesn't implement
> match method. But in the future, we may add drivers other than vfio,
> one example is virtio-mdev[1] driver. This means we need to add device
> id support in bus match method to pair the mdev device and mdev driver
> correctly.

"device id" sound a bit confusing to me - it usually means something
unique to each device, while here it is used to indicate expected driver
types (vfio, virtio, etc.). but using "bus id" is also not good - we have
only one mdev bus here. Then what about "class id"?
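
With that, the bus match could look like the sketch below (using my
suggested naming just for illustration, and assuming
to_mdev_device()/to_mdev_driver() helpers and a zero-terminated
id_table exist):

    static int mdev_match(struct device *dev, struct device_driver *drv)
    {
            struct mdev_device *mdev = to_mdev_device(dev);
            struct mdev_driver *mdrv = to_mdev_driver(drv);
            const struct mdev_class_id *id = mdrv->id_table;

            /* pair the device's class id against the driver's id table */
            for (; id && id->id; id++)
                    if (id->id == mdev->class_id)
                            return 1;
            return 0;
    }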

> 
> So this patch add id_table to mdev_driver and id for mdev parent, and
> implement the match method for mdev bus.
> 
> [1] https://lkml.org/lkml/2019/9/10/135
> 
> Signed-off-by: Jason Wang 
> ---
>  drivers/gpu/drm/i915/gvt/kvmgt.c  |  2 +-
>  drivers/s390/cio/vfio_ccw_ops.c   |  2 +-
>  drivers/s390/crypto/vfio_ap_ops.c |  3 ++-
>  drivers/vfio/mdev/mdev_core.c | 14 --
>  drivers/vfio/mdev/mdev_driver.c   | 14 ++
>  drivers/vfio/mdev/mdev_private.h  |  1 +
>  drivers/vfio/mdev/vfio_mdev.c |  6 ++
>  include/linux/mdev.h  |  6 +-
>  include/linux/mod_devicetable.h   |  6 ++
>  samples/vfio-mdev/mbochs.c|  2 +-
>  samples/vfio-mdev/mdpy.c  |  2 +-
>  samples/vfio-mdev/mtty.c  |  2 +-
>  12 files changed, 51 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gvt/kvmgt.c
> b/drivers/gpu/drm/i915/gvt/kvmgt.c
> index 23aa3e50cbf8..19d51a35f019 100644
> --- a/drivers/gpu/drm/i915/gvt/kvmgt.c
> +++ b/drivers/gpu/drm/i915/gvt/kvmgt.c
> @@ -1625,7 +1625,7 @@ static int kvmgt_host_init(struct device *dev, void
> *gvt, const void *ops)
>   return -EFAULT;
>   intel_vgpu_ops.supported_type_groups = kvm_vgpu_type_groups;
> 
> - return mdev_register_device(dev, &intel_vgpu_ops);
> + return mdev_register_vfio_device(dev, &intel_vgpu_ops);
>  }
> 
>  static void kvmgt_host_exit(struct device *dev)
> diff --git a/drivers/s390/cio/vfio_ccw_ops.c
> b/drivers/s390/cio/vfio_ccw_ops.c
> index 5eb61116ca6f..f87d9409e290 100644
> --- a/drivers/s390/cio/vfio_ccw_ops.c
> +++ b/drivers/s390/cio/vfio_ccw_ops.c
> @@ -578,7 +578,7 @@ static const struct mdev_parent_ops
> vfio_ccw_mdev_ops = {
> 
>  int vfio_ccw_mdev_reg(struct subchannel *sch)
>  {
> - return mdev_register_device(&sch->dev, &vfio_ccw_mdev_ops);
> + return mdev_register_vfio_device(&sch->dev,
> &vfio_ccw_mdev_ops);
>  }
> 
>  void vfio_ccw_mdev_unreg(struct subchannel *sch)
> diff --git a/drivers/s390/crypto/vfio_ap_ops.c
> b/drivers/s390/crypto/vfio_ap_ops.c
> index 0604b49a4d32..eacbde3c7a97 100644
> --- a/drivers/s390/crypto/vfio_ap_ops.c
> +++ b/drivers/s390/crypto/vfio_ap_ops.c
> @@ -1295,7 +1295,8 @@ int vfio_ap_mdev_register(void)
>  {
>   atomic_set(&matrix_dev->available_instances,
> MAX_ZDEV_ENTRIES_EXT);
> 
> - return mdev_register_device(&matrix_dev->device,
> &vfio_ap_matrix_ops);
> + return mdev_register_vfio_device(&matrix_dev->device,
> +  &vfio_ap_matrix_ops);
>  }
> 
>  void vfio_ap_mdev_unregister(void)
> diff --git a/drivers/vfio/mdev/mdev_core.c
> b/drivers/vfio/mdev/mdev_core.c
> index b558d4cfd082..fc07ff3ebe96 100644
> --- a/drivers/vfio/mdev/mdev_core.c
> +++ b/drivers/vfio/mdev/mdev_core.c
> @@ -135,11 +135,14 @@ static int mdev_device_remove_cb(struct device
> *dev, void *data)
>   * mdev_register_device : Register a device
>   * @dev: device structure representing parent device.
>   * @ops: Parent device operation structure to be registered.
> + * @id: device id.
>   *
>   * Add device to list of registered parent devices.
>   * Returns a negative value on error, otherwise 0.
>   */
> -int mdev_register_device(struct device *dev, const struct
> mdev_parent_ops *ops)
> +int mdev_register_device(struct device *dev,
> +  const struct mdev_parent_ops *ops,
> +  u8 id)
>  {
>   int ret;
>   struct mdev_parent *parent;
> @@ -175,6 +178,7 @@ int mdev_register_device(struct device *dev, const
> struct mdev_parent_ops *ops)
> 
>   parent->dev = dev;
>   parent->ops = ops;
> + parent->device_id = id;
> 
>   if (!mdev_bus_compat_class) {
>   mdev_bus_compat_class =
> class_compat_register("mdev_bus");
> @@ -208,7 +212,13 @@ int mdev_register_device(struct device *dev, const
> struct mdev_parent_ops *ops)
>   put_device(dev);
>   return ret;
>  }
> -EXPORT_SYMBOL(mdev_register_device);
> +
> +int mdev_register_vfio_device(struct device *dev,
> +   const struct mdev_parent_ops *ops)
> +{
> + return mdev_register_device(dev, ops, MDEV_ID_VFIO);
> +}
> +EXPORT_SYMBOL(mdev_register_vfio_device);
> 
>  /*
>   * mdev_unregister_device : Unregister a parent device
> diff --git a/drivers/vfio/mdev/mdev_driver.c
> b/drivers/vfio/mdev/mdev_driver.c
> index 0d3223aee20b..fd5e9541d18e 100644
> --- 

RE: [RFC PATCH 2/2] mdev: introduce device specific ops

2019-09-17 Thread Tian, Kevin
> From: Jason Wang
> Sent: Thursday, September 12, 2019 5:40 PM
> 
> Currently, except for the create and remove, the rest of the fields of
> mdev_parent_ops are just designed for the vfio-mdev driver and may not help
> for kernel mdev driver. So follow the device id support by previous
> patch, this patch introduces device specific ops which points to
> device specific ops (e.g vfio ops). This allows the future drivers
> like virtio-mdev to implement its own device specific ops.

Can you give an example of what ops might be required to support a
kernel mdev driver? I know you posted a link earlier, but putting a
small example here can save time and avoid inconsistent understanding.
It will then help to judge whether the proposed split makes sense, or
whether there is a possibility of redefining the callbacks to meet both
requirements. IMO those callbacks fulfill some basic requirements when
mediating a device...

> 
> Signed-off-by: Jason Wang 
> ---
>  drivers/gpu/drm/i915/gvt/kvmgt.c  | 14 +++---
>  drivers/s390/cio/vfio_ccw_ops.c   | 14 +++---
>  drivers/s390/crypto/vfio_ap_ops.c | 10 +++--
>  drivers/vfio/mdev/vfio_mdev.c | 30 +++--
>  include/linux/mdev.h  | 72 ++-
>  samples/vfio-mdev/mbochs.c| 16 ---
>  samples/vfio-mdev/mdpy.c  | 16 ---
>  samples/vfio-mdev/mtty.c  | 14 +++---
>  8 files changed, 113 insertions(+), 73 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gvt/kvmgt.c
> b/drivers/gpu/drm/i915/gvt/kvmgt.c
> index 19d51a35f019..64823998fd58 100644
> --- a/drivers/gpu/drm/i915/gvt/kvmgt.c
> +++ b/drivers/gpu/drm/i915/gvt/kvmgt.c
> @@ -1600,20 +1600,22 @@ static const struct attribute_group
> *intel_vgpu_groups[] = {
>   NULL,
>  };
> 
> -static struct mdev_parent_ops intel_vgpu_ops = {
> - .mdev_attr_groups   = intel_vgpu_groups,
> - .create = intel_vgpu_create,
> - .remove = intel_vgpu_remove,
> -
> +static struct vfio_mdev_parent_ops intel_vfio_vgpu_ops = {
>   .open   = intel_vgpu_open,
>   .release= intel_vgpu_release,
> -
>   .read   = intel_vgpu_read,
>   .write  = intel_vgpu_write,
>   .mmap   = intel_vgpu_mmap,
>   .ioctl  = intel_vgpu_ioctl,
>  };
> 
> +static struct mdev_parent_ops intel_vgpu_ops = {
> + .mdev_attr_groups   = intel_vgpu_groups,
> + .create = intel_vgpu_create,
> + .remove = intel_vgpu_remove,
> + .device_ops = &intel_vfio_vgpu_ops,
> +};
> +
>  static int kvmgt_host_init(struct device *dev, void *gvt, const void *ops)
>  {
>   struct attribute **kvm_type_attrs;
> diff --git a/drivers/s390/cio/vfio_ccw_ops.c
> b/drivers/s390/cio/vfio_ccw_ops.c
> index f87d9409e290..96e9f18100ae 100644
> --- a/drivers/s390/cio/vfio_ccw_ops.c
> +++ b/drivers/s390/cio/vfio_ccw_ops.c
> @@ -564,11 +564,7 @@ static ssize_t vfio_ccw_mdev_ioctl(struct
> mdev_device *mdev,
>   }
>  }
> 
> -static const struct mdev_parent_ops vfio_ccw_mdev_ops = {
> - .owner  = THIS_MODULE,
> - .supported_type_groups  = mdev_type_groups,
> - .create = vfio_ccw_mdev_create,
> - .remove = vfio_ccw_mdev_remove,
> +static const struct vfio_mdev_parent_ops vfio_mdev_ops = {
>   .open   = vfio_ccw_mdev_open,
>   .release= vfio_ccw_mdev_release,
>   .read   = vfio_ccw_mdev_read,
> @@ -576,6 +572,14 @@ static const struct mdev_parent_ops
> vfio_ccw_mdev_ops = {
>   .ioctl  = vfio_ccw_mdev_ioctl,
>  };
> 
> +static const struct mdev_parent_ops vfio_ccw_mdev_ops = {
> + .owner  = THIS_MODULE,
> + .supported_type_groups  = mdev_type_groups,
> + .create = vfio_ccw_mdev_create,
> + .remove = vfio_ccw_mdev_remove,
> + .device_ops = &vfio_mdev_ops,
> +};
> +
>  int vfio_ccw_mdev_reg(struct subchannel *sch)
>  {
>   return mdev_register_vfio_device(&sch->dev,
> &vfio_ccw_mdev_ops);
> diff --git a/drivers/s390/crypto/vfio_ap_ops.c
> b/drivers/s390/crypto/vfio_ap_ops.c
> index eacbde3c7a97..a48282bff066 100644
> --- a/drivers/s390/crypto/vfio_ap_ops.c
> +++ b/drivers/s390/crypto/vfio_ap_ops.c
> @@ -1280,15 +1280,19 @@ static ssize_t vfio_ap_mdev_ioctl(struct
> mdev_device *mdev,
>   return ret;
>  }
> 
> +static const struct vfio_mdev_parent_ops vfio_mdev_ops = {
> + .open   = vfio_ap_mdev_open,
> + .release= vfio_ap_mdev_release,
> + .ioctl  = vfio_ap_mdev_ioctl,
> +};
> +
>  static const struct mdev_parent_ops vfio_ap_matrix_ops = {
>   .owner  = THIS_MODULE,
>   .supported_type_groups  = vfio_ap_mdev_type_groups,
>   .mdev_attr_groups   = vfio_ap_mdev_attr_groups,
>   .create = vfio_ap_mdev_create,
>  

RE: [RFC PATCH 2/2] mdev: introduce device specific ops

2019-09-17 Thread Tian, Kevin
> From: Jason Wang [mailto:jasow...@redhat.com]
> Sent: Tuesday, September 17, 2019 6:17 PM
> 
> On 2019/9/17 4:09 PM, Tian, Kevin wrote:
> >> From: Jason Wang
> >> Sent: Thursday, September 12, 2019 5:40 PM
> >>
> >> Currently, except for the create and remove, the rest of the fields of
> >> mdev_parent_ops are just designed for the vfio-mdev driver and may not
> help
> >> for kernel mdev driver. So follow the device id support by previous
> >> patch, this patch introduces device specific ops which points to
> >> device specific ops (e.g vfio ops). This allows the future drivers
> >> like virtio-mdev to implement its own device specific ops.
> > Can you give an example about what ops might be required to support
> > kernel mdev driver? I know you posted a link earlier, but putting a small
> > example here can save time and avoid inconsistent understanding. Then
> > it will help whether the proposed split makes sense or there is a
> > possibility of redefining the callbacks to meet the both requirements.
> > imo those callbacks fulfill some basic requirements when mediating
> > a device...
> 
> 
> I put it in the cover letter.
> 
> The link is https://lkml.org/lkml/2019/9/10/135 which abuses the current
> VFIO based mdev parent ops.
> 
> Thanks

So the main problem is the handling of userspace pointers vs.
kernel-space pointers. You still implement read/write/ioctl
callbacks, which are a subset of the current parent_ops definition.
In that regard, would it be better to introduce some helper to handle
the pointer difference in the mdev core, while still keeping the
same set of parent ops (in whatever form suits both)?
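
e.g. a minimal sketch of what I mean (helper and the ops field name are
made up): the parent keeps one kernel-pointer read op, and the mdev
core bounces the user buffer for the VFIO case:

    static ssize_t mdev_bounce_read(struct mdev_device *mdev,
                                    char __user *ubuf, size_t count,
                                    loff_t *ppos)
    {
            char *kbuf = kmalloc(count, GFP_KERNEL);
            ssize_t ret;

            if (!kbuf)
                    return -ENOMEM;

            /* the parent's read op always takes a kernel pointer */
            ret = mdev->ops->read(mdev, kbuf, count, ppos);
            if (ret > 0 && copy_to_user(ubuf, kbuf, ret))
                    ret = -EFAULT;

            kfree(kbuf);
            return ret;
    }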

> 
> 
> >> Signed-off-by: Jason Wang 
> >> ---
> >>   drivers/gpu/drm/i915/gvt/kvmgt.c  | 14 +++---
> >>   drivers/s390/cio/vfio_ccw_ops.c   | 14 +++---
> >>   drivers/s390/crypto/vfio_ap_ops.c | 10 +++--
> >>   drivers/vfio/mdev/vfio_mdev.c | 30 +++--
> >>   include/linux/mdev.h  | 72 ++-
> >>   samples/vfio-mdev/mbochs.c| 16 ---
> >>   samples/vfio-mdev/mdpy.c  | 16 ---
> >>   samples/vfio-mdev/mtty.c  | 14 +++---
> >>   8 files changed, 113 insertions(+), 73 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/i915/gvt/kvmgt.c
> >> b/drivers/gpu/drm/i915/gvt/kvmgt.c
> >> index 19d51a35f019..64823998fd58 100644
> >> --- a/drivers/gpu/drm/i915/gvt/kvmgt.c
> >> +++ b/drivers/gpu/drm/i915/gvt/kvmgt.c
> >> @@ -1600,20 +1600,22 @@ static const struct attribute_group
> >> *intel_vgpu_groups[] = {
> >>NULL,
> >>   };
> >>
> >> -static struct mdev_parent_ops intel_vgpu_ops = {
> >> -  .mdev_attr_groups   = intel_vgpu_groups,
> >> -  .create = intel_vgpu_create,
> >> -  .remove = intel_vgpu_remove,
> >> -
> >> +static struct vfio_mdev_parent_ops intel_vfio_vgpu_ops = {
> >>.open   = intel_vgpu_open,
> >>.release= intel_vgpu_release,
> >> -
> >>.read   = intel_vgpu_read,
> >>.write  = intel_vgpu_write,
> >>.mmap   = intel_vgpu_mmap,
> >>.ioctl  = intel_vgpu_ioctl,
> >>   };
> >>
> >> +static struct mdev_parent_ops intel_vgpu_ops = {
> >> +  .mdev_attr_groups   = intel_vgpu_groups,
> >> +  .create = intel_vgpu_create,
> >> +  .remove = intel_vgpu_remove,
> >> +  .device_ops = &intel_vfio_vgpu_ops,
> >> +};
> >> +
> >>   static int kvmgt_host_init(struct device *dev, void *gvt, const void 
> >> *ops)
> >>   {
> >>struct attribute **kvm_type_attrs;
> >> diff --git a/drivers/s390/cio/vfio_ccw_ops.c
> >> b/drivers/s390/cio/vfio_ccw_ops.c
> >> index f87d9409e290..96e9f18100ae 100644
> >> --- a/drivers/s390/cio/vfio_ccw_ops.c
> >> +++ b/drivers/s390/cio/vfio_ccw_ops.c
> >> @@ -564,11 +564,7 @@ static ssize_t vfio_ccw_mdev_ioctl(struct
> >> mdev_device *mdev,
> >>}
> >>   }
> >>
> >> -static const struct mdev_parent_ops vfio_ccw_mdev_ops = {
> >> -  .owner  = THIS_MODULE,
> >> -  .supported_type_groups  = mdev_type_groups,
> >> -  .create = vfio_ccw_mdev_create,
> >> -  .remove = vfio_ccw_mdev_remove,
> >> +static const struct vfio_mdev_parent_ops vfio_mdev_

RE: [RFC PATCH 0/2] Mdev: support mutiple kinds of devices

2019-09-17 Thread Tian, Kevin
> From: Alex Williamson
> Sent: Wednesday, September 18, 2019 1:31 AM
> 
> [cc +Parav]
> 
> On Thu, 12 Sep 2019 17:40:10 +0800
> Jason Wang  wrote:
> 
> > Hi all:
> >
> > During the development of virtio-mdev[1]. I find that mdev needs to be
> > extended to support devices other than vfio mdev device. So this
> > series tries to extend the mdev to be able to differ from different
> > devices by:
> >
> > - device id and matching for mdev bus
> > - device specific callbacks and move vfio callbacks there
> >
> > Sent for early review, compile test only!
> >
> > Thanks
> >
> > [1] https://lkml.org/lkml/2019/9/10/135
> 
> I expect Parav must have something similar in the works for their
> in-kernel networking mdev support.  Link to discussion so far:
> 
> https://lore.kernel.org/kvm/20190912094012.29653-1-
> jasow...@redhat.com/T/#t
> 

It links to the current thread. Is that intended, or do you want
people to look at another thread driven by Parav? :-)

Thanks
Kevin



RE: [PATCH V2 6/8] mdev: introduce virtio device and its device ops

2019-09-25 Thread Tian, Kevin
> From: Jason Wang
> Sent: Wednesday, September 25, 2019 8:45 PM
> 
> 
> On 2019/9/25 5:09 PM, Tian, Kevin wrote:
> >> From: Jason Wang [mailto:jasow...@redhat.com]
> >> Sent: Tuesday, September 24, 2019 9:54 PM
> >>
> >> This patch implements basic support for mdev driver that supports
> >> virtio transport for kernel virtio driver.
> >>
> >> Signed-off-by: Jason Wang 
> >> ---
> >>   include/linux/mdev.h|   2 +
> >>   include/linux/virtio_mdev.h | 145
> >> 
> >>   2 files changed, 147 insertions(+)
> >>   create mode 100644 include/linux/virtio_mdev.h
> >>
> >> diff --git a/include/linux/mdev.h b/include/linux/mdev.h
> >> index 3414307311f1..73ac27b3b868 100644
> >> --- a/include/linux/mdev.h
> >> +++ b/include/linux/mdev.h
> >> @@ -126,6 +126,8 @@ struct mdev_device *mdev_from_dev(struct
> device
> >> *dev);
> >>
> >>   enum {
> >>MDEV_ID_VFIO = 1,
> >> +  MDEV_ID_VIRTIO = 2,
> >> +  MDEV_ID_VHOST = 3,
> >>/* New entries must be added here */
> >>   };
> >>
> >> diff --git a/include/linux/virtio_mdev.h b/include/linux/virtio_mdev.h
> >> new file mode 100644
> >> index ..d1a40a739266
> >> --- /dev/null
> >> +++ b/include/linux/virtio_mdev.h
> >> @@ -0,0 +1,145 @@
> >> +/* SPDX-License-Identifier: GPL-2.0-only */
> >> +/*
> >> + * Virtio mediated device driver
> >> + *
> >> + * Copyright 2019, Red Hat Corp.
> >> + * Author: Jason Wang 
> >> + */
> >> +#ifndef _LINUX_VIRTIO_MDEV_H
> >> +#define _LINUX_VIRTIO_MDEV_H
> >> +
> >> +#include 
> >> +#include 
> >> +#include 
> >> +
> >> +#define VIRTIO_MDEV_DEVICE_API_STRING "virtio-mdev"
> >> +#define VIRTIO_MDEV_VERSION 0x1
> > Just being curious: is this version identical to the virtio spec version
> > that the below callbacks are created for, or is it unrelated?
> 
> 
> It could be a hint but basically it's a way for userspace driver
> compatibility. For kernel we don't need this.
> 
> 
> >
> >> +
> >> +struct virtio_mdev_callback {
> >> +  irqreturn_t (*callback)(void *data);
> >> +  void *private;
> >> +};
> >> +
> >> +/**
> >> + * struct vfio_mdev_device_ops - Structure to be registered for each
> >> + * mdev device to register the device to virtio-mdev module.
> >> + *
> >> + * @set_vq_address:   Set the address of virtqueue
> >> + *@mdev: mediated device
> >> + *@idx: virtqueue index
> >> + *@desc_area: address of desc area
> >> + *@driver_area: address of driver area
> >> + *@device_area: address of device area
> >> + *Returns integer: success (0) or error 
> >> (< 0)
> >> + * @set_vq_num:   Set the size of virtqueue
> >> + *@mdev: mediated device
> >> + *@idx: virtqueue index
> >> + *@num: the size of virtqueue
> >> + * @kick_vq:  Kick the virtqueue
> >> + *@mdev: mediated device
> >> + *@idx: virtqueue index
> >> + * @set_vq_cb: Set the interrupt callback function for
> >> + *a virtqueue
> >> + *@mdev: mediated device
> >> + *@idx: virtqueue index
> >> + *@cb: virtio-mdev interrupt callback
> >> structure
> >> + * @set_vq_ready: Set ready status for a virtqueue
> >> + *@mdev: mediated device
> >> + *@idx: virtqueue index
> >> + *@ready: ready (true) not ready(false)
> >> + * @get_vq_ready: Get ready status for a virtqueue
> >> + *@mdev: mediated device
> >> + *@idx: virtqueue index
> >> + *Returns boolean: ready (true) or not 
> >> (false)
> >> + * @set_vq_state: Se

RE: [PATCH V2 0/8] mdev based hardware virtio offloading support

2019-09-25 Thread Tian, Kevin
> From: Jason Wang [mailto:jasow...@redhat.com]
> Sent: Tuesday, September 24, 2019 9:53 PM
> 
> Hi all:
> 
> There are hardware that can do virtio datapath offloading while having
> its own control path. This patch tries to implement a mdev based
> unified API to support using kernel virtio driver to drive those
> devices. This is done by introducing a new mdev transport for virtio
> (virtio_mdev) and register itself as a new kind of mdev driver. Then
> it provides a unified way for kernel virtio driver to talk with mdev
> device implementation.
> 
> Though the series only contains kernel driver support, the goal is to
> make the transport generic enough to support userspace drivers. This
> means vhost-mdev[1] could be built on top as well by reusing the
> transport.
> 
> A sample driver is also implemented which simulates a virtio-net
> loopback ethernet device on top of vringh + workqueue. This could be
> used as a reference implementation for real hardware driver.
> 
> Consider mdev framework only support VFIO device and driver right now,
> this series also extend it to support other types. This is done
> through introducing class id to the device and pairing it with
> id_talbe claimed by the driver. On top, this seris also decouple

id_table claimed ... this series ...

> device specific parents ops out of the common ones.
> 
> Pktgen test was done with virtio-net + mvnet loopback device.
> 
> Please review.
> 
> [1] https://lkml.org/lkml/2019/9/16/869
> 
> Changes from V1:
> 
> - move virtio_mdev.c to drivers/virtio
> - store class_id in mdev_device instead of mdev_parent
> - store device_ops in mdev_device instead of mdev_parent
> - reorder the patch, vringh fix comes first
> - really silent compiling warnings
> - really switch to use u16 for class_id
> - uevent and modpost support for mdev class_id
> - vraious tweaks per comments from Parav
> 
> Changes from RFC-V2:
> 
> - silent compile warnings on some specific configuration
> - use u16 instead u8 for class id
> - reseve MDEV_ID_VHOST for future vhost-mdev work
> - introduce "virtio" type for mvnet and make "vhost" type for future
>   work
> - add entries in MAINTAINER
> - tweak and typos fixes in commit log
> 
> Changes from RFC-V1:
> 
> - rename device id to class id
> - add docs for class id and device specific ops (device_ops)
> - split device_ops into seperate headers
> - drop the mdev_set_dma_ops()
> - use device_ops to implement the transport API, then it's not a part
>   of UAPI any more
> - use GFP_ATOMIC in mvnet sample device and other tweaks
> - set_vring_base/get_vring_base support for mvnet device
> 
> Jason Wang (8):
>   vringh: fix copy direction of vringh_iov_push_kern()
>   mdev: class id support
>   mdev: bus uevent support
>   modpost: add support for mdev class id
>   mdev: introduce device specific ops
>   mdev: introduce virtio device and its device ops
>   virtio: introduce a mdev based transport
>   docs: sample driver to demonstrate how to implement virtio-mdev
> framework
> 
>  .../driver-api/vfio-mediated-device.rst   |   7 +-
>  MAINTAINERS   |   2 +
>  drivers/gpu/drm/i915/gvt/kvmgt.c  |  18 +-
>  drivers/s390/cio/vfio_ccw_ops.c   |  18 +-
>  drivers/s390/crypto/vfio_ap_ops.c |  14 +-
>  drivers/vfio/mdev/mdev_core.c |  19 +
>  drivers/vfio/mdev/mdev_driver.c   |  22 +
>  drivers/vfio/mdev/mdev_private.h  |   2 +
>  drivers/vfio/mdev/vfio_mdev.c |  45 +-
>  drivers/vhost/vringh.c|   8 +-
>  drivers/virtio/Kconfig|   7 +
>  drivers/virtio/Makefile   |   1 +
>  drivers/virtio/virtio_mdev.c  | 417 +++
>  include/linux/mdev.h  |  52 +-
>  include/linux/mod_devicetable.h   |   8 +
>  include/linux/vfio_mdev.h |  52 ++
>  include/linux/virtio_mdev.h   | 145 
>  samples/Kconfig   |   7 +
>  samples/vfio-mdev/Makefile|   1 +
>  samples/vfio-mdev/mbochs.c|  20 +-
>  samples/vfio-mdev/mdpy.c  |  20 +-
>  samples/vfio-mdev/mtty.c  |  18 +-
>  samples/vfio-mdev/mvnet.c | 692 ++
>  scripts/mod/devicetable-offsets.c |   3 +
>  scripts/mod/file2alias.c  |  10 +
>  25 files changed, 1524 insertions(+), 84 deletions(-)
>  create mode 100644 drivers/virtio/virtio_mdev.c
>  create mode 100644 include/linux/vfio_mdev.h
>  create mode 100644 include/linux/virtio_mdev.h
>  create mode 100644 samples/vfio-mdev/mvnet.c
> 
> --
> 2.19.1



RE: [PATCH V2 2/8] mdev: class id support

2019-09-25 Thread Tian, Kevin
> From: Jason Wang
> Sent: Tuesday, September 24, 2019 9:53 PM
> 
> Mdev bus only supports vfio driver right now, so it doesn't implement
> match method. But in the future, we may add drivers other than vfio,
> the first driver could be virtio-mdev. This means we need to add
> device class id support in bus match method to pair the mdev device
> and mdev driver correctly.
> 
> So this patch adds id_table to mdev_driver and class_id for mdev
> parent with the match method for mdev bus.
> 
> Signed-off-by: Jason Wang 
> ---
>  Documentation/driver-api/vfio-mediated-device.rst |  3 +++
>  drivers/gpu/drm/i915/gvt/kvmgt.c  |  1 +
>  drivers/s390/cio/vfio_ccw_ops.c   |  1 +
>  drivers/s390/crypto/vfio_ap_ops.c |  1 +
>  drivers/vfio/mdev/mdev_core.c |  7 +++
>  drivers/vfio/mdev/mdev_driver.c   | 14 ++
>  drivers/vfio/mdev/mdev_private.h  |  1 +
>  drivers/vfio/mdev/vfio_mdev.c |  6 ++
>  include/linux/mdev.h  |  8 
>  include/linux/mod_devicetable.h   |  8 
>  samples/vfio-mdev/mbochs.c|  1 +
>  samples/vfio-mdev/mdpy.c  |  1 +
>  samples/vfio-mdev/mtty.c  |  1 +
>  13 files changed, 53 insertions(+)
> 
> diff --git a/Documentation/driver-api/vfio-mediated-device.rst
> b/Documentation/driver-api/vfio-mediated-device.rst
> index 25eb7d5b834b..a5bdc60d62a1 100644
> --- a/Documentation/driver-api/vfio-mediated-device.rst
> +++ b/Documentation/driver-api/vfio-mediated-device.rst
> @@ -102,12 +102,14 @@ structure to represent a mediated device's
> driver::
>* @probe: called when new device created
>* @remove: called when device removed
>* @driver: device driver structure
> +  * @id_table: the ids serviced by this driver
>*/
>   struct mdev_driver {
>const char *name;
>int  (*probe)  (struct device *dev);
>void (*remove) (struct device *dev);
>struct device_driverdriver;
> +  const struct mdev_class_id *id_table;
>   };
> 
>  A mediated bus driver for mdev should use this structure in the function
> calls
> @@ -165,6 +167,7 @@ register itself with the mdev core driver::
>   extern int  mdev_register_device(struct device *dev,
>const struct mdev_parent_ops *ops);
> 
> +
>  However, the mdev_parent_ops structure is not required in the function
> call
>  that a driver should use to unregister itself with the mdev core driver::
> 
> diff --git a/drivers/gpu/drm/i915/gvt/kvmgt.c
> b/drivers/gpu/drm/i915/gvt/kvmgt.c
> index 23aa3e50cbf8..f793252a3d2a 100644
> --- a/drivers/gpu/drm/i915/gvt/kvmgt.c
> +++ b/drivers/gpu/drm/i915/gvt/kvmgt.c
> @@ -678,6 +678,7 @@ static int intel_vgpu_create(struct kobject *kobj,
> struct mdev_device *mdev)
>dev_name(mdev_dev(mdev)));
>   ret = 0;
> 
> + mdev_set_class_id(mdev, MDEV_ID_VFIO);
>  out:
>   return ret;
>  }
> diff --git a/drivers/s390/cio/vfio_ccw_ops.c
> b/drivers/s390/cio/vfio_ccw_ops.c
> index f0d71ab77c50..d258ef1fedb9 100644
> --- a/drivers/s390/cio/vfio_ccw_ops.c
> +++ b/drivers/s390/cio/vfio_ccw_ops.c
> @@ -129,6 +129,7 @@ static int vfio_ccw_mdev_create(struct kobject
> *kobj, struct mdev_device *mdev)
>  private->sch->schid.ssid,
>  private->sch->schid.sch_no);
> 
> + mdev_set_class_id(mdev, MDEV_ID_VFIO);
>   return 0;
>  }
> 
> diff --git a/drivers/s390/crypto/vfio_ap_ops.c
> b/drivers/s390/crypto/vfio_ap_ops.c
> index 5c0f53c6dde7..2cfd96112aa0 100644
> --- a/drivers/s390/crypto/vfio_ap_ops.c
> +++ b/drivers/s390/crypto/vfio_ap_ops.c
> @@ -343,6 +343,7 @@ static int vfio_ap_mdev_create(struct kobject *kobj,
> struct mdev_device *mdev)
>   list_add(&matrix_mdev->node, &matrix_dev->mdev_list);
>   mutex_unlock(&matrix_dev->lock);
> 
> + mdev_set_class_id(mdev, MDEV_ID_VFIO);
>   return 0;
>  }
> 
> diff --git a/drivers/vfio/mdev/mdev_core.c
> b/drivers/vfio/mdev/mdev_core.c
> index b558d4cfd082..8764cf4a276d 100644
> --- a/drivers/vfio/mdev/mdev_core.c
> +++ b/drivers/vfio/mdev/mdev_core.c
> @@ -45,6 +45,12 @@ void mdev_set_drvdata(struct mdev_device *mdev,
> void *data)
>  }
>  EXPORT_SYMBOL(mdev_set_drvdata);
> 
> +void mdev_set_class_id(struct mdev_device *mdev, u16 id)
> +{
> + mdev->class_id = id;
> +}
> +EXPORT_SYMBOL(mdev_set_class_id);
> +
>  struct device *mdev_dev(struct mdev_device *mdev)
>  {
>   return &mdev->dev;
> @@ -135,6 +141,7 @@ static int mdev_device_remove_cb(struct device
> *dev, void *data)
>   * mdev_register_device : Register a device
>   * @dev: device structure representing parent device.
>   * @ops: Parent device operation structure to be registered.
> + * @id: device id.

class id.

>   *
>   * Add device to list of registered parent devices.

RE: [PATCH V2 5/8] mdev: introduce device specific ops

2019-09-25 Thread Tian, Kevin
> From: Alex Williamson
> Sent: Wednesday, September 25, 2019 7:07 AM
> 
> On Tue, 24 Sep 2019 21:53:29 +0800
> Jason Wang  wrote:
> 
> > Currently, except for the create and remove, the rest of
> > mdev_parent_ops is designed for vfio-mdev driver only and may not help
> > for kernel mdev driver. With the help of class id, this patch
> > introduces device specific callbacks inside mdev_device
> > structure. This allows different set of callback to be used by
> > vfio-mdev and virtio-mdev.
> >
> > Signed-off-by: Jason Wang 
> > ---
> >  .../driver-api/vfio-mediated-device.rst   |  4 +-
> >  MAINTAINERS   |  1 +
> >  drivers/gpu/drm/i915/gvt/kvmgt.c  | 17 +++---
> >  drivers/s390/cio/vfio_ccw_ops.c   | 17 --
> >  drivers/s390/crypto/vfio_ap_ops.c | 13 +++--
> >  drivers/vfio/mdev/mdev_core.c | 12 +
> >  drivers/vfio/mdev/mdev_private.h  |  1 +
> >  drivers/vfio/mdev/vfio_mdev.c | 37 ++---
> >  include/linux/mdev.h  | 42 ---
> >  include/linux/vfio_mdev.h | 52 +++
> >  samples/vfio-mdev/mbochs.c| 19 ---
> >  samples/vfio-mdev/mdpy.c  | 19 ---
> >  samples/vfio-mdev/mtty.c  | 17 --
> >  13 files changed, 168 insertions(+), 83 deletions(-)
> >  create mode 100644 include/linux/vfio_mdev.h
> >
> > diff --git a/Documentation/driver-api/vfio-mediated-device.rst
> b/Documentation/driver-api/vfio-mediated-device.rst
> > index a5bdc60d62a1..d50425b368bb 100644
> > --- a/Documentation/driver-api/vfio-mediated-device.rst
> > +++ b/Documentation/driver-api/vfio-mediated-device.rst
> > @@ -152,7 +152,9 @@ callbacks per mdev parent device, per mdev type,
> or any other categorization.
> >  Vendor drivers are expected to be fully asynchronous in this respect or
> >  provide their own internal resource protection.)
> >
> > -The callbacks in the mdev_parent_ops structure are as follows:
> > +The device specific callbacks are referred through device_ops pointer
> > +in mdev_parent_ops. For vfio-mdev device, its callbacks in device_ops
> > +are as follows:
> 
> This is not accurate.  device_ops is now on the mdev_device and is an
> mdev bus driver specific structure of callbacks that must be registered
> for each mdev device by the parent driver during the create callback.
> There's a one to one mapping of class_id to mdev_device_ops callbacks.

There is also a mistake in include/linux/mdev.h, where device_ops is
still part of mdev_parent_ops in the comment line.

> 
> That also suggests to me that we could be more clever in registering
> both of these with mdev-core.  Can we embed the class_id in the ops
> structure in a common way so that the core can extract it and the bus
> drivers can access their specific callbacks?
> 
> >  * open: open callback of mediated device
> >  * close: close callback of mediated device
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index b2326dece28e..89832b316500 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -17075,6 +17075,7 @@ S:  Maintained
> >  F: Documentation/driver-api/vfio-mediated-device.rst
> >  F: drivers/vfio/mdev/
> >  F: include/linux/mdev.h
> > +F: include/linux/vfio_mdev.h
> >  F: samples/vfio-mdev/
> >
> >  VFIO PLATFORM DRIVER
> > diff --git a/drivers/gpu/drm/i915/gvt/kvmgt.c
> b/drivers/gpu/drm/i915/gvt/kvmgt.c
> > index f793252a3d2a..b274f5ee481f 100644
> > --- a/drivers/gpu/drm/i915/gvt/kvmgt.c
> > +++ b/drivers/gpu/drm/i915/gvt/kvmgt.c
> > @@ -42,6 +42,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include 
> >
> >  #include 
> > @@ -643,6 +644,8 @@ static void kvmgt_put_vfio_device(void *vgpu)
> > vfio_device_put(((struct intel_vgpu *)vgpu)->vdev.vfio_device);
> >  }
> >
> > +static struct vfio_mdev_device_ops intel_vfio_vgpu_dev_ops;
> > +
> >  static int intel_vgpu_create(struct kobject *kobj, struct mdev_device
> *mdev)
> >  {
> > struct intel_vgpu *vgpu = NULL;
> > @@ -679,6 +682,7 @@ static int intel_vgpu_create(struct kobject *kobj,
> struct mdev_device *mdev)
> > ret = 0;
> >
> > mdev_set_class_id(mdev, MDEV_ID_VFIO);
> > +   mdev_set_dev_ops(mdev, &intel_vfio_vgpu_dev_ops);
> 
> This seems rather unrefined.  We're registering interdependent data in
> separate calls.  All drivers need to make both of these calls.  I'm not
> sure if this is a good idea, but what if we had:
> 
> static const struct vfio_mdev_device_ops intel_vfio_vgpu_dev_ops = {
>   .id = MDEV_ID_VFIO,
>   .open   = intel_vgpu_open,
>   .release= intel_vgpu_release,
> ...
> 
> And the set function passed &intel_vfio_vgpu_dev_ops.id and the mdev
> bus drivers used container_of to get to their callbacks?

or just make it explicit? e.g.

mdev_set_class(mdev, MDEV_ID_VFIO, &intel_vfio_vgpu_dev_ops);
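
For comparison, my reading of the container_of variant (sketch only,
names assumed):

    /* common header embedded in each bus-specific ops structure */
    struct mdev_device_ops {
            u16 id;                         /* MDEV_ID_VFIO, MDEV_ID_VIRTIO, ... */
    };

    struct vfio_mdev_device_ops {
            struct mdev_device_ops hdr;     /* hdr.id = MDEV_ID_VFIO */
            int (*open)(struct mdev_device *mdev);
            void (*release)(struct mdev_device *mdev);
            /* ... */
    };

    static inline struct vfio_mdev_device_ops *
    to_vfio_ops(struct mdev_device_ops *hdr)
    {
            return container_of(hdr, struct vfio_mdev_device_ops, hdr);
    }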

> 
> >  out:
> > return 

RE: [PATCH V2 6/8] mdev: introduce virtio device and its device ops

2019-09-25 Thread Tian, Kevin
> From: Jason Wang [mailto:jasow...@redhat.com]
> Sent: Tuesday, September 24, 2019 9:54 PM
> 
> This patch implements basic support for mdev driver that supports
> virtio transport for kernel virtio driver.
> 
> Signed-off-by: Jason Wang 
> ---
>  include/linux/mdev.h|   2 +
>  include/linux/virtio_mdev.h | 145
> 
>  2 files changed, 147 insertions(+)
>  create mode 100644 include/linux/virtio_mdev.h
> 
> diff --git a/include/linux/mdev.h b/include/linux/mdev.h
> index 3414307311f1..73ac27b3b868 100644
> --- a/include/linux/mdev.h
> +++ b/include/linux/mdev.h
> @@ -126,6 +126,8 @@ struct mdev_device *mdev_from_dev(struct device
> *dev);
> 
>  enum {
>   MDEV_ID_VFIO = 1,
> + MDEV_ID_VIRTIO = 2,
> + MDEV_ID_VHOST = 3,
>   /* New entries must be added here */
>  };
> 
> diff --git a/include/linux/virtio_mdev.h b/include/linux/virtio_mdev.h
> new file mode 100644
> index ..d1a40a739266
> --- /dev/null
> +++ b/include/linux/virtio_mdev.h
> @@ -0,0 +1,145 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Virtio mediated device driver
> + *
> + * Copyright 2019, Red Hat Corp.
> + * Author: Jason Wang 
> + */
> +#ifndef _LINUX_VIRTIO_MDEV_H
> +#define _LINUX_VIRTIO_MDEV_H
> +
> +#include 
> +#include 
> +#include 
> +
> +#define VIRTIO_MDEV_DEVICE_API_STRING "virtio-mdev"
> +#define VIRTIO_MDEV_VERSION 0x1

Just being curious: is this version identical to the virtio spec version
that the below callbacks are created for, or is it unrelated?

> +
> +struct virtio_mdev_callback {
> + irqreturn_t (*callback)(void *data);
> + void *private;
> +};
> +
> +/**
> + * struct vfio_mdev_device_ops - Structure to be registered for each
> + * mdev device to register the device to virtio-mdev module.
> + *
> + * @set_vq_address:  Set the address of virtqueue
> + *   @mdev: mediated device
> + *   @idx: virtqueue index
> + *   @desc_area: address of desc area
> + *   @driver_area: address of driver area
> + *   @device_area: address of device area
> + *   Returns integer: success (0) or error (< 0)
> + * @set_vq_num:  Set the size of virtqueue
> + *   @mdev: mediated device
> + *   @idx: virtqueue index
> + *   @num: the size of virtqueue
> + * @kick_vq: Kick the virtqueue
> + *   @mdev: mediated device
> + *   @idx: virtqueue index
> + * @set_vq_cb:   Set the interrupt callback function for
> + *   a virtqueue
> + *   @mdev: mediated device
> + *   @idx: virtqueue index
> + *   @cb: virtio-mdev interrupt callback
> + *   structure
> + * @set_vq_ready: Set ready status for a virtqueue
> + *   @mdev: mediated device
> + *   @idx: virtqueue index
> + *   @ready: ready (true) or not ready (false)
> + * @get_vq_ready: Get ready status for a virtqueue
> + *   @mdev: mediated device
> + *   @idx: virtqueue index
> + *   Returns boolean: ready (true) or not (false)
> + * @set_vq_state: Set the state for a virtqueue
> + *   @mdev: mediated device
> + *   @idx: virtqueue index
> + *   @state: virtqueue state (last_avail_idx)
> + *   Returns integer: success (0) or error (< 0)
> + * @get_vq_state: Get the state for a virtqueue
> + *   @mdev: mediated device
> + *   @idx: virtqueue index
> + *   Returns virtqueue state (last_avail_idx)
> + * @get_vq_align: Get the virtqueue align requirement
> + *   for the device
> + *   @mdev: mediated device
> + *   Returns the virtqueue align requirement
> + * @get_features: Get virtio features supported by the device
> + *   @mdev: mediated device
> + *   Returns the features supported by the
> + *   device
> + * @set_features: Set virtio features supported by the driver
> + *   @mdev: mediated device
> + *   @features: feature support by the driver
> + *   Returns integer: success (0) or error (< 0)
> + * @set_config_cb:   Set the config interrupt callback
> + *   @mdev: mediated device
> + *   @cb: virtio-mdev interrupt callback
> structure
> + * 

RE: [PATCH 3/5] vDPA: introduce vDPA bus

2020-01-21 Thread Tian, Kevin
> From: Jason Wang 
> Sent: Friday, January 17, 2020 11:03 AM
> 
> 
> On 2020/1/16 11:22 PM, Jason Gunthorpe wrote:
> > On Thu, Jan 16, 2020 at 08:42:29PM +0800, Jason Wang wrote:
> >> vDPA device is a device that uses a datapath which complies with the
> >> virtio specifications with vendor specific control path. vDPA devices
> >> can be both physically located on the hardware or emulated by
> >> software. vDPA hardware devices are usually implemented through PCIE
> >> with the following types:
> >>
> >> - PF (Physical Function) - A single Physical Function
> >> - VF (Virtual Function) - Device that supports single root I/O
> >>virtualization (SR-IOV). Its Virtual Function (VF) represents a
> >>virtualized instance of the device that can be assigned to different
> >>partitions
> >> - VDEV (Virtual Device) - With technologies such as Intel Scalable
> >>IOV, a virtual device composed by host OS utilizing one or more
> >>ADIs.

The concept of VDEV includes both software bits and ADIs. If you
are only talking about hardware types, using ADI is more accurate.

> >> - SF (Sub function) - Vendor specific interface to slice the Physical
> >>Function to multiple sub functions that can be assigned to different
> >>partitions as virtual devices.
> > I really hope we don't end up with two different ways to spell this
> > same thing.
> 
> 
> I think you meant ADI vs SF. It looks to me that ADI is limited to the
> scope of scalable IOV but SF not.

ADI is just a term for the minimal assignable resource in Scalable IOV.
'Assignable' implies several things, e.g. the resource can be independently
mapped to/accessed by user space or a guest, DMAs between two
ADIs are isolated, operating one ADI doesn't affect another ADI,
etc. I'm not clear on other vendor-specific interfaces, but I suppose
they need to match similar requirements. Then do we really want to
differentiate ADI vs. SF? What about merging them, with ADI as just
one example of finer-grained slicing?

> 
> 
> >
> >> @@ -0,0 +1,2 @@
> >> +# SPDX-License-Identifier: GPL-2.0
> >> +obj-$(CONFIG_VDPA) += vdpa.o
> >> diff --git a/drivers/virtio/vdpa/vdpa.c b/drivers/virtio/vdpa/vdpa.c
> >> new file mode 100644
> >> index ..2b0e4a9f105d
> >> +++ b/drivers/virtio/vdpa/vdpa.c
> >> @@ -0,0 +1,141 @@
> >> +// SPDX-License-Identifier: GPL-2.0-only
> >> +/*
> >> + * vDPA bus.
> >> + *
> >> + * Copyright (c) 2019, Red Hat. All rights reserved.
> >> + * Author: Jason Wang 
> > 2020 these days
> 
> 
> Will fix.
> 
> 
> >
> >> + *
> >> + */
> >> +
> >> +#include 
> >> +#include 
> >> +#include 
> >> +
> >> +#define MOD_VERSION  "0.1"
> > I think module versions are discouraged these days
> 
> 
> Will remove.
> 
> 
> >
> >> +#define MOD_DESC "vDPA bus"
> >> +#define MOD_AUTHOR   "Jason Wang "
> >> +#define MOD_LICENSE  "GPL v2"
> >> +
> >> +static DEFINE_IDA(vdpa_index_ida);
> >> +
> >> +struct device *vdpa_get_parent(struct vdpa_device *vdpa)
> >> +{
> >> +  return vdpa->dev.parent;
> >> +}
> >> +EXPORT_SYMBOL(vdpa_get_parent);
> >> +
> >> +void vdpa_set_parent(struct vdpa_device *vdpa, struct device *parent)
> >> +{
> >> +  vdpa->dev.parent = parent;
> >> +}
> >> +EXPORT_SYMBOL(vdpa_set_parent);
> >> +
> >> +struct vdpa_device *dev_to_vdpa(struct device *_dev)
> >> +{
> >> +  return container_of(_dev, struct vdpa_device, dev);
> >> +}
> >> +EXPORT_SYMBOL_GPL(dev_to_vdpa);
> >> +
> >> +struct device *vdpa_to_dev(struct vdpa_device *vdpa)
> >> +{
> >> +  return &vdpa->dev;
> >> +}
> >> +EXPORT_SYMBOL_GPL(vdpa_to_dev);
> > Why these trivial assessors? Seems unnecessary, or should at least be
> > static inlines in a header
> 
> 
> Will fix.
> 
> 
> >
> >> +int register_vdpa_device(struct vdpa_device *vdpa)
> >> +{
> > Usually we want to see symbols consistently prefixed with vdpa_*, is
> > there a reason why register/unregister are swapped?
> 
> 
> I follow the name from virtio. I will switch to vdpa_*.
> 
> 
> >
> >> +  int err;
> >> +
> >> +  if (!vdpa_get_parent(vdpa))
> >> +  return -EINVAL;
> >> +
> >> +  if (!vdpa->config)
> >> +  return -EINVAL;
> >> +
> >> +  err = ida_simple_get(&vdpa_index_ida, 0, 0, GFP_KERNEL);
> >> +  if (err < 0)
> >> +  return -EFAULT;
> >> +
> >> +  vdpa->dev.bus = &vdpa_bus;
> >> +  device_initialize(>dev);
> > IMHO device_initialize should not be called inside something called
> > register, toooften we find out that the caller drivers need the device
> > to be initialized earlier, ie to use the kref, or something.
> >
> > I find the best flow is to have some init function that does the
> > device_initialize and sets the device_name that the driver can call
> > early.
> 
> 
> Ok, will do.
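
(For the record, I read the suggestion as the usual two-step split; a
rough sketch with assumed names:)

    void vdpa_device_init(struct vdpa_device *vdpa)
    {
            vdpa->dev.bus = &vdpa_bus;
            device_initialize(&vdpa->dev);  /* kref usable from here on */
    }

    int vdpa_device_register(struct vdpa_device *vdpa)
    {
            /* parent, config ops and device name are set by the caller by now */
            return device_add(&vdpa->dev);
    }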
> 
> 
> >
> > Shouldn't there be a device/driver matching process of some kind?
> 
> 
> The question is what we want to match here.
> 
> 1) "virtio" vs "vhost", I implemented matching method for this in mdev
> series, but it looks unnecessary for vDPA device driver to know about
> this. Anyway we can use 

RE: [PATCH 0/5] vDPA support

2020-01-21 Thread Tian, Kevin
> From: Jason Wang
> Sent: Thursday, January 16, 2020 8:42 PM
> 
> Hi all:
> 
> Based on the comments and discussion for mdev based hardware virtio
> offloading support[1]. A different approach to support vDPA device is
> proposed in this series.

Can you point to the actual link that triggered the direction change?
A quick glance at that thread doesn't reveal such information...

> 
> Instead of leveraging VFIO/mdev which may not work for some
> vendors. This series tries to introduce a dedicated vDPA bus and
> leverage vhost for userspace drivers. This help for the devices that
> are not fit for VFIO and may reduce the conflict when try to propose a
> bus template for virtual devices in [1].
> 
> The vDPA support is split into following parts:
> 
> 1) vDPA core (bus, device and driver abstraction)
> 2) virtio vDPA transport for kernel virtio driver to control vDPA
>device
> 3) vhost vDPA bus driver for userspace vhost driver to control vDPA
>device
> 4) vendor vDPA drivers
> 5) management API
> 
> Both 1) and 2) are included in this series. Tiwei will work on part
> 3). For 4), Ling Shan will work and post IFCVF driver. For 5) we leave
> it to vendor to implement, but it's better to come into an agreement
> for management to create/configure/destroy vDPA device.
> 
> The sample driver is kept but renamed to vdpa_sim. An on-chip IOMMU
> implementation is added to sample device to make it work for both
> kernel virtio driver and userspace vhost driver. It implements a sysfs
> based management API, but it can switch to any other (e.g devlink) if
> necessary.
> 
> Please refer each patch for more information.
> 
> Comments are welcomed.
> 
> [1] https://lkml.org/lkml/2019/11/18/261
> 
> Jason Wang (5):
>   vhost: factor out IOTLB
>   vringh: IOTLB support
>   vDPA: introduce vDPA bus
>   virtio: introduce a vDPA based transport
>   vdpasim: vDPA device simulator
> 
>  MAINTAINERS|   2 +
>  drivers/vhost/Kconfig  |   7 +
>  drivers/vhost/Kconfig.vringh   |   1 +
>  drivers/vhost/Makefile |   2 +
>  drivers/vhost/net.c|   2 +-
>  drivers/vhost/vhost.c  | 221 +++--
>  drivers/vhost/vhost.h  |  36 +-
>  drivers/vhost/vhost_iotlb.c| 171 +++
>  drivers/vhost/vringh.c | 434 +-
>  drivers/virtio/Kconfig |  15 +
>  drivers/virtio/Makefile|   2 +
>  drivers/virtio/vdpa/Kconfig|  26 ++
>  drivers/virtio/vdpa/Makefile   |   3 +
>  drivers/virtio/vdpa/vdpa.c | 141 ++
>  drivers/virtio/vdpa/vdpa_sim.c | 796
> +
>  drivers/virtio/virtio_vdpa.c   | 400 +
>  include/linux/vdpa.h   | 191 
>  include/linux/vhost_iotlb.h|  45 ++
>  include/linux/vringh.h |  36 ++
>  19 files changed, 2327 insertions(+), 204 deletions(-)
>  create mode 100644 drivers/vhost/vhost_iotlb.c
>  create mode 100644 drivers/virtio/vdpa/Kconfig
>  create mode 100644 drivers/virtio/vdpa/Makefile
>  create mode 100644 drivers/virtio/vdpa/vdpa.c
>  create mode 100644 drivers/virtio/vdpa/vdpa_sim.c
>  create mode 100644 drivers/virtio/virtio_vdpa.c
>  create mode 100644 include/linux/vdpa.h
>  create mode 100644 include/linux/vhost_iotlb.h
> 
> --
> 2.19.1



RE: [PATCH v2 1/3] iommu/virtio: Add topology description to virtio-iommu config space

2020-03-05 Thread Tian, Kevin
> From: Jean-Philippe Brucker
> Sent: Saturday, February 29, 2020 1:26 AM
> 
> Platforms without device-tree do not currently have a method for
> describing the vIOMMU topology. Provide a topology description embedded
> into the virtio device.
> 
> Use PCI FIXUP to probe the config space early, because we need to
> discover the topology before any DMA configuration takes place, and the
> virtio driver may be loaded much later. Since we discover the topology
> description when probing the PCI hierarchy, the virtual IOMMU cannot
> manage other platform devices discovered earlier.
> 
> This solution isn't elegant nor foolproof, but is the best we can do at

Can you elaborate on the "isn't elegant nor foolproof" part? Is there
any other limitation (besides the PCI fixup) along this route, when
comparing it to the ACPI approach?

> the moment and works with existing virtio-iommu implementations. It also
> enables an IOMMU for lightweight hypervisors that do not rely on
> firmware methods for booting.
> 
> Signed-off-by: Eric Auger 
> Signed-off-by: Jean-Philippe Brucker 
> ---
>  MAINTAINERS   |   2 +
>  drivers/iommu/Kconfig |  10 +
>  drivers/iommu/Makefile|   1 +
>  drivers/iommu/virtio-iommu-topology.c | 343
> ++
>  drivers/iommu/virtio-iommu.c  |   3 +
>  include/linux/virt_iommu.h|  19 ++
>  include/uapi/linux/virtio_iommu.h |  26 ++
>  7 files changed, 404 insertions(+)
>  create mode 100644 drivers/iommu/virtio-iommu-topology.c
>  create mode 100644 include/linux/virt_iommu.h
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index fcd79fc38928..65a03ce53096 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -17781,6 +17781,8 @@ M:Jean-Philippe Brucker  phili...@linaro.org>
>  L:   virtualization@lists.linux-foundation.org
>  S:   Maintained
>  F:   drivers/iommu/virtio-iommu.c
> +F:   drivers/iommu/virtio-iommu-topology.c
> +F:   include/linux/virt_iommu.h
>  F:   include/uapi/linux/virtio_iommu.h
> 
>  VIRTUAL BOX GUEST DEVICE DRIVER
> diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
> index c5df570ef84a..f8cb45d84bb0 100644
> --- a/drivers/iommu/Kconfig
> +++ b/drivers/iommu/Kconfig
> @@ -516,4 +516,14 @@ config VIRTIO_IOMMU
> 
> Say Y here if you intend to run this kernel as a guest.
> 
> +config VIRTIO_IOMMU_TOPOLOGY
> + bool "Topology properties for the virtio-iommu"
> + depends on VIRTIO_IOMMU
> + default y
> + help
> +   Enable early probing of the virtio-iommu device, to detect the
> +   built-in topology description.
> +
> +   Say Y here if you intend to run this kernel as a guest.
> +
>  endif # IOMMU_SUPPORT
> diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
> index 9f33fdb3bb05..5da24280d08c 100644
> --- a/drivers/iommu/Makefile
> +++ b/drivers/iommu/Makefile
> @@ -37,3 +37,4 @@ obj-$(CONFIG_S390_IOMMU) += s390-iommu.o
>  obj-$(CONFIG_QCOM_IOMMU) += qcom_iommu.o
>  obj-$(CONFIG_HYPERV_IOMMU) += hyperv-iommu.o
>  obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
> +obj-$(CONFIG_VIRTIO_IOMMU_TOPOLOGY) += virtio-iommu-topology.o
> diff --git a/drivers/iommu/virtio-iommu-topology.c b/drivers/iommu/virtio-
> iommu-topology.c
> new file mode 100644
> index ..2188624ef216
> --- /dev/null
> +++ b/drivers/iommu/virtio-iommu-topology.c
> @@ -0,0 +1,343 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +struct viommu_cap_config {
> + u8 bar;
> + u32 length; /* structure size */
> + u32 offset; /* structure offset within the bar */
> +};
> +
> +union viommu_topo_cfg {
> + __le16  type;
> + struct virtio_iommu_topo_pci_range  pci;
> + struct virtio_iommu_topo_endpoint   ep;
> +};
> +
> +struct viommu_spec {
> + struct device   *dev; /* transport device */
> + struct fwnode_handle*fwnode;
> + struct iommu_ops*ops;
> + struct list_headlist;
> + size_t  num_items;
> + /* The config array of length num_items follows */
> + union viommu_topo_cfg   cfg[];
> +};
> +
> +static LIST_HEAD(viommus);
> +static DEFINE_MUTEX(viommus_lock);
> +
> +#define VPCI_FIELD(field) offsetof(struct virtio_pci_cap, field)
> +
> +static inline int viommu_pci_find_capability(struct pci_dev *dev, u8 
> cfg_type,
> +  struct viommu_cap_config *cap)
> +{
> + int pos;
> + u8 bar;
> +
> + for (pos = pci_find_capability(dev, PCI_CAP_ID_VNDR);
> +  pos > 0;
> +  pos = pci_find_next_capability(dev, pos, PCI_CAP_ID_VNDR)) {
> + u8 type;
> +
> + pci_read_config_byte(dev, pos + VPCI_FIELD(cfg_type), &type);
> + if (type 

RE: [PATCH v2 1/3] iommu/virtio: Add topology description to virtio-iommu config space

2020-04-21 Thread Tian, Kevin
> From: Jean-Philippe Brucker
> Sent: Saturday, February 29, 2020 1:26 AM
> 
> Platforms without device-tree do not currently have a method for
> describing the vIOMMU topology. Provide a topology description embedded
> into the virtio device.
> 
> Use PCI FIXUP to probe the config space early, because we need to
> discover the topology before any DMA configuration takes place, and the
> virtio driver may be loaded much later. Since we discover the topology
> description when probing the PCI hierarchy, the virtual IOMMU cannot
> manage other platform devices discovered earlier.
> 
> This solution isn't elegant nor foolproof, but is the best we can do at
> the moment and works with existing virtio-iommu implementations. It also
> enables an IOMMU for lightweight hypervisors that do not rely on
> firmware methods for booting.
> 
> Signed-off-by: Eric Auger 
> Signed-off-by: Jean-Philippe Brucker 
> ---
>  MAINTAINERS   |   2 +
>  drivers/iommu/Kconfig |  10 +
>  drivers/iommu/Makefile|   1 +
>  drivers/iommu/virtio-iommu-topology.c | 343
> ++
>  drivers/iommu/virtio-iommu.c  |   3 +
>  include/linux/virt_iommu.h|  19 ++
>  include/uapi/linux/virtio_iommu.h |  26 ++
>  7 files changed, 404 insertions(+)
>  create mode 100644 drivers/iommu/virtio-iommu-topology.c
>  create mode 100644 include/linux/virt_iommu.h
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index fcd79fc38928..65a03ce53096 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -17781,6 +17781,8 @@ M:Jean-Philippe Brucker  phili...@linaro.org>
>  L:   virtualization@lists.linux-foundation.org
>  S:   Maintained
>  F:   drivers/iommu/virtio-iommu.c
> +F:   drivers/iommu/virtio-iommu-topology.c
> +F:   include/linux/virt_iommu.h
>  F:   include/uapi/linux/virtio_iommu.h
> 
>  VIRTUAL BOX GUEST DEVICE DRIVER
> diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
> index c5df570ef84a..f8cb45d84bb0 100644
> --- a/drivers/iommu/Kconfig
> +++ b/drivers/iommu/Kconfig
> @@ -516,4 +516,14 @@ config VIRTIO_IOMMU
> 
> Say Y here if you intend to run this kernel as a guest.
> 
> +config VIRTIO_IOMMU_TOPOLOGY
> + bool "Topology properties for the virtio-iommu"
> + depends on VIRTIO_IOMMU
> + default y
> + help
> +   Enable early probing of the virtio-iommu device, to detect the
> +   built-in topology description.
> +
> +   Say Y here if you intend to run this kernel as a guest.
> +
>  endif # IOMMU_SUPPORT
> diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
> index 9f33fdb3bb05..5da24280d08c 100644
> --- a/drivers/iommu/Makefile
> +++ b/drivers/iommu/Makefile
> @@ -37,3 +37,4 @@ obj-$(CONFIG_S390_IOMMU) += s390-iommu.o
>  obj-$(CONFIG_QCOM_IOMMU) += qcom_iommu.o
>  obj-$(CONFIG_HYPERV_IOMMU) += hyperv-iommu.o
>  obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
> +obj-$(CONFIG_VIRTIO_IOMMU_TOPOLOGY) += virtio-iommu-topology.o
> diff --git a/drivers/iommu/virtio-iommu-topology.c b/drivers/iommu/virtio-
> iommu-topology.c
> new file mode 100644
> index ..2188624ef216
> --- /dev/null
> +++ b/drivers/iommu/virtio-iommu-topology.c
> @@ -0,0 +1,343 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +struct viommu_cap_config {
> + u8 bar;
> + u32 length; /* structure size */
> + u32 offset; /* structure offset within the bar */
> +};
> +
> +union viommu_topo_cfg {
> + __le16  type;
> + struct virtio_iommu_topo_pci_range  pci;
> + struct virtio_iommu_topo_endpoint   ep;
> +};
> +
> +struct viommu_spec {
> + struct device   *dev; /* transport device */
> + struct fwnode_handle*fwnode;
> + struct iommu_ops*ops;
> + struct list_headlist;
> + size_t  num_items;

Intel DMAR allows an IOMMU to claim INCLUDE_ALL and thus avoid listing
every endpoint one by one. It is especially useful when there is only
one IOMMU device in the system. Do you think it makes sense to allow
such an optimization in this spec? It doesn't work for ARM, since
you need the ID mapping to find the MSI doorbell. But for architectures
where only topology info is required, it makes the enumeration process
much simpler.
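
To illustrate the idea, a match-all entry could replace the per-endpoint
list with a single descriptor. The struct below is purely hypothetical
(names and the type value are mine, not from the proposed spec):

/* Hypothetical match-all topology type, analogous to the INCLUDE_PCI_ALL
 * flag of a DMAR DRHD structure. */
#define VIRTIO_IOMMU_TOPO_PCI_ALL	0x3

struct virtio_iommu_topo_pci_all {
	__le16	type;		/* VIRTIO_IOMMU_TOPO_PCI_ALL */
	__le16	hierarchy;	/* PCI segment covered by this IOMMU */
	__le32	endpoint_start;	/* endpoint ID of the first device;
				 * subsequent IDs follow BDF order */
};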

Thanks
Kevin

> + /* The config array of length num_items follows */
> + union viommu_topo_cfg   cfg[];
> +};
> +
> +static LIST_HEAD(viommus);
> +static DEFINE_MUTEX(viommus_lock);
> +
> +#define VPCI_FIELD(field) offsetof(struct virtio_pci_cap, field)
> +
> +static inline int viommu_pci_find_capability(struct pci_dev *dev, u8 
> cfg_type,
> +  struct viommu_cap_config *cap)
> +{
> + int pos;
> + u8 bar;
> +
> + 

RE: [RFC 3/3] vfio: Share the KVM instance with Vdmabuf

2021-01-19 Thread Tian, Kevin
> From: Alex Williamson
> Sent: Wednesday, January 20, 2021 8:51 AM
> 
> On Wed, 20 Jan 2021 00:14:49 +
> "Kasireddy, Vivek"  wrote:
> 
> > Hi Alex,
> >
> > > -Original Message-
> > > From: Alex Williamson 
> > > Sent: Tuesday, January 19, 2021 7:40 AM
> > > To: Kasireddy, Vivek 
> > > Cc: virtualization@lists.linux-foundation.org; Kim, Dongwon
> 
> > > Subject: Re: [RFC 3/3] vfio: Share the KVM instance with Vdmabuf
> > >
> > > On Tue, 19 Jan 2021 00:28:12 -0800
> > > Vivek Kasireddy  wrote:
> > >
> > > > Getting a copy of the KVM instance is necessary for mapping Guest
> > > > pages in the Host.
> > > >
> > > > TODO: Instead of invoking the symbol directly, there needs to be a
> > > > better way of getting a copy of the KVM instance probably by using
> > > > other notifiers. However, currently, KVM shares its instance only
> > > > with VFIO and therefore we are compelled to bind the passthrough'd
> > > > device to vfio-pci.
> > >
> > > Yeah, this is a bad solution, sorry, vfio is not going to gratuitously
> > > call out to vhost to share a kvm pointer.  I'd prefer to get rid of
> > > vfio having any knowledge or visibility of the kvm pointer.  Thanks,
> >
> > [Kasireddy, Vivek] I agree that this is definitely not ideal as I recognize 
> > it
> > in the TODO. However, it looks like VFIO also gets a copy of the KVM
> > pointer in a similar manner:
> >
> > virt/kvm/vfio.c
> >
> > static void kvm_vfio_group_set_kvm(struct vfio_group *group, struct kvm *kvm)
> > {
> > void (*fn)(struct vfio_group *, struct kvm *);
> >
> > fn = symbol_get(vfio_group_set_kvm);
> > if (!fn)
> > return;
> >
> > fn(group, kvm);
> >
> > symbol_put(vfio_group_set_kvm);
> > }
> 
> You're equating the mechanism with the architecture.  We use symbols
> here to avoid module dependencies between kvm and vfio, but this is
> just propagating data that userspace is specifically registering
> between kvm and vfio.  vhost doesn't get to piggyback on that channel.
> 
> > With this patch, I am not suggesting that this is a precedent that should be
> followed
> > but it appears there doesn't seem to be an alternative way of getting a copy
> of the KVM
> > pointer that is clean and elegant -- unless I have not looked hard enough. I
> guess we
> > could create a notifier chain with callbacks for VFIO and Vhost that KVM
> would call
> > but this would mean modifying KVM.
> >
> > Also, if I understand correctly, if VFIO does not want to share the KVM
> pointer with
> > VFIO groups, then I think it would break stuff like mdev which counts on it.
> 
> Only kvmgt requires the kvm pointer and the use case there is pretty
> questionable, I wonder if it actually still exists now that we have the
> DMA r/w interface through vfio.  Thanks,
> 

IIRC, kvmgt still needs the kvm pointer to use the kvm page tracking
interface for write-protecting guest page tables.

Thanks
Kevin


RE: [PATCH 15/17] iommu: remove DOMAIN_ATTR_NESTING

2021-03-25 Thread Tian, Kevin
> From: Auger Eric
> Sent: Monday, March 15, 2021 3:52 PM
> To: Christoph Hellwig 
> Cc: k...@vger.kernel.org; Will Deacon ; linuxppc-
> d...@lists.ozlabs.org; dri-de...@lists.freedesktop.org; Li Yang
> ; io...@lists.linux-foundation.org;
> 
> Hi Christoph,
> 
> On 3/14/21 4:58 PM, Christoph Hellwig wrote:
> > On Sun, Mar 14, 2021 at 11:44:52AM +0100, Auger Eric wrote:
> >> As mentionned by Robin, there are series planning to use
> >> DOMAIN_ATTR_NESTING to get info about the nested caps of the iommu
> (ARM
> >> and Intel):
> >>
> >> [Patch v8 00/10] vfio: expose virtual Shared Virtual Addressing to VMs
> >> patches 1, 2, 3
> >>
> >> Is the plan to introduce a new domain_get_nesting_info ops then?
> >
> > The plan as usual would be to add it the series adding that support.
> > Not sure what the merge plans are - if the series is ready to be
> > merged I could rebase on top of it, otherwise that series will need
> > to add the method.
> OK I think your series may be upstreamed first.
> 

Agree. The vSVA series is still undergoing a refactor according to Jason's
comments and thus won't be ready in the short term. It's better to let this
one go in first.

Thanks
Kevin


RE: [PATCH RFC v1 12/15] iommu/virtio: Add support for INVALIDATE request

2021-03-03 Thread Tian, Kevin
> From: Jacob Pan 
> Sent: Thursday, March 4, 2021 2:29 AM
> 
> Hi Vivek,
> 
> On Fri, 15 Jan 2021 17:43:39 +0530, Vivek Gautam 
> wrote:
> 
> > From: Jean-Philippe Brucker 
> >
> > Add support for tlb invalidation ops that can send invalidation
> > requests to back-end virtio-iommu when stage-1 page tables are
> > supported.
> >
> Just curious if it possible to reuse the iommu uapi for invalidation and 
> others.
> When we started out designing the iommu uapi, the intention was to support
> both emulated and virtio iommu.

IIUC this patch is about the protocol between the virtio-iommu frontend and
backend. After the virtio-iommu backend receives invalidation ops, it then
needs to forward the request to the host IOMMU driver through the existing
iommu uapi that you referred to, as an emulated VT-d or SMMU would do.
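
As a rough sketch of that forwarding path in the VMM, assuming the
VFIO_IOMMU_CACHE_INVALIDATE ioctl proposed in the vSVA series (not merged;
the field translation is elided):

#include <sys/ioctl.h>
#include <linux/iommu.h>	/* struct iommu_cache_invalidate_info */
#include <linux/vfio.h>		/* VFIO_IOMMU_CACHE_INVALIDATE (proposed, unmerged) */

/* Userspace backend sketch: translate a virtio-iommu invalidation request
 * into the proposed cache-invalidation uapi and hand it to the host IOMMU
 * driver through the VFIO container. */
static void backend_forward_invalidate(int container_fd,
				       struct virtio_iommu_req_invalidate *req)
{
	struct iommu_cache_invalidate_info info = {
		.argsz	 = sizeof(info),
		.version = IOMMU_CACHE_INVALIDATE_INFO_VERSION_1,
		.cache	 = IOMMU_CACHE_INV_TYPE_IOTLB,
		/* granularity, pasid/archid and the address range would be
		 * translated from req here */
	};

	if (ioctl(container_fd, VFIO_IOMMU_CACHE_INVALIDATE, &info))
		perror("cache invalidate");
}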

Thanks
Kevin

> 
> > Signed-off-by: Jean-Philippe Brucker 
> > [Vivek: Refactoring the iommu_flush_ops, and adding only one pasid sync
> > op that's needed with current iommu-pasid-table infrastructure.
> > Also updating uapi defines as required by latest changes]
> > Signed-off-by: Vivek Gautam 
> > Cc: Joerg Roedel 
> > Cc: Will Deacon 
> > Cc: Michael S. Tsirkin 
> > Cc: Robin Murphy 
> > Cc: Jean-Philippe Brucker 
> > Cc: Eric Auger 
> > Cc: Alex Williamson 
> > Cc: Kevin Tian 
> > Cc: Jacob Pan 
> > Cc: Liu Yi L 
> > Cc: Lorenzo Pieralisi 
> > Cc: Shameerali Kolothum Thodi 
> > ---
> >  drivers/iommu/virtio-iommu.c | 95
> 
> >  1 file changed, 95 insertions(+)
> >
> > diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
> > index ae5dfd3f8269..004ea94e3731 100644
> > --- a/drivers/iommu/virtio-iommu.c
> > +++ b/drivers/iommu/virtio-iommu.c
> > @@ -13,6 +13,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include 
> >  #include 
> >  #include 
> > @@ -63,6 +64,8 @@ struct viommu_mapping {
> >  };
> >
> >  struct viommu_mm {
> > +   int pasid;
> > +   u64 archid;
> > struct io_pgtable_ops   *ops;
> > struct viommu_domain*domain;
> >  };
> > @@ -692,6 +695,98 @@ static void viommu_event_handler(struct virtqueue *vq)
> > 	virtqueue_kick(vq);
> >  }
> >
> > +/* PASID and pgtable APIs */
> > +
> > +static void __viommu_flush_pasid_tlb_all(struct viommu_domain *vdomain,
> > +					 int pasid, u64 arch_id, int type)
> > +{
> > +   struct virtio_iommu_req_invalidate req = {
> > +   .head.type  = VIRTIO_IOMMU_T_INVALIDATE,
> > +   .inv_gran   =
> > cpu_to_le32(VIRTIO_IOMMU_INVAL_G_PASID),
> > +   .flags  =
> > cpu_to_le32(VIRTIO_IOMMU_INVAL_F_PASID),
> > +   .inv_type   = cpu_to_le32(type),
> > +
> > +   .domain = cpu_to_le32(vdomain->id),
> > +   .pasid  = cpu_to_le32(pasid),
> > +   .archid = cpu_to_le64(arch_id),
> > +   };
> > +
> > +   if (viommu_send_req_sync(vdomain->viommu, &req, sizeof(req)))
> > +   pr_debug("could not send invalidate request\n");
> > +}
> > +
> > +static void viommu_flush_tlb_add(struct iommu_iotlb_gather *gather,
> > +unsigned long iova, size_t granule,
> > +void *cookie)
> > +{
> > +   struct viommu_mm *viommu_mm = cookie;
> > +   struct viommu_domain *vdomain = viommu_mm->domain;
> > +   struct iommu_domain *domain = &vdomain->domain;
> > +
> > +   iommu_iotlb_gather_add_page(domain, gather, iova, granule);
> > +}
> > +
> > +static void viommu_flush_tlb_walk(unsigned long iova, size_t size,
> > + size_t granule, void *cookie)
> > +{
> > +   struct viommu_mm *viommu_mm = cookie;
> > +   struct viommu_domain *vdomain = viommu_mm->domain;
> > +   struct virtio_iommu_req_invalidate req = {
> > +   .head.type  = VIRTIO_IOMMU_T_INVALIDATE,
> > +   .inv_gran   = cpu_to_le32(VIRTIO_IOMMU_INVAL_G_VA),
> > +   .inv_type   =
> cpu_to_le32(VIRTIO_IOMMU_INV_T_IOTLB),
> > +   .flags  =
> > cpu_to_le32(VIRTIO_IOMMU_INVAL_F_ARCHID), +
> > +   .domain = cpu_to_le32(vdomain->id),
> > +   .pasid  = cpu_to_le32(viommu_mm->pasid),
> > +   .archid = cpu_to_le64(viommu_mm->archid),
> > +   .virt_start = cpu_to_le64(iova),
> > +   .nr_pages   = cpu_to_le64(size / granule),
> > +   .granule= ilog2(granule),
> > +   };
> > +
> > +   if (viommu_add_req(vdomain->viommu, &req, sizeof(req)))
> > +   pr_debug("could not add invalidate request\n");
> > +}
> > +
> > +static void viommu_flush_tlb_all(void *cookie)
> > +{
> > +   struct viommu_mm *viommu_mm = cookie;
> > +
> > +   if (!viommu_mm->archid)
> > +   return;
> > +
> > +   __viommu_flush_pasid_tlb_all(viommu_mm->domain,
> viommu_mm->pasid,
> > +viommu_mm->archid,
> > + 

RE: [PATCH 0/5] iommu/virtio: Add identity domains

2021-10-13 Thread Tian, Kevin
> From: Jean-Philippe Brucker 
> Sent: Wednesday, October 13, 2021 8:11 PM
> 
> Support identity domains, allowing to only enable IOMMU protection for a
> subset of endpoints (those assigned to userspace, for example). Users
> may enable identity domains at compile time
> (CONFIG_IOMMU_DEFAULT_PASSTHROUGH), boot time
> (iommu.passthrough=1) or
> runtime (/sys/kernel/iommu_groups/*/type = identity).

Do we want to use consistent terms between spec (bypass domain) 
and code (identity domain)? 

> 
> Patches 1-2 support identity domains using the optional
> VIRTIO_IOMMU_F_BYPASS_CONFIG feature. The feature bit is not yet in the
> spec, see [1] for the latest proposal.
> 
> Patches 3-5 add a fallback to identity mappings, when the feature is not
> supported.
> 
> Note that this series doesn't touch the global bypass bit added by
> VIRTIO_IOMMU_F_BYPASS_CONFIG. All endpoints managed by the IOMMU
> should
> be attached to a domain, so global bypass isn't in use after endpoints

I saw a concept of deferred attach in the iommu core, see
iommu_is_attach_deferred(). Currently this is vendor specific and I haven't
looked into the exact reason why some vendors set it. I'm just curious
whether the same reason might apply to virtio-iommu.

> are probed. Before that, the global bypass policy is decided by the
> hypervisor and firmware. So I don't think Linux needs to touch the

This reminds me of one thing. The spec says that the global bypass
bit is sticky and not affected by reset. This implies that, in the case
of rebooting the VM into a different OS, the previous OS actually
has the right to override this setting for the next OS. Is that the right
design? Even the firmware itself is unable to identify the original
setting enforced by the hypervisor after reboot. I feel the hypervisor
setting should be recovered after reset, since it reflects the
security measure enforced by the virtual platform.

> global bypass bit, but there are some patches available on my
> virtio-iommu/bypass branch [2] to test it.
> 
> QEMU patches are on my virtio-iommu/bypass branch [3] (and the list)
> 
> [1] https://www.mail-archive.com/virtio-dev@lists.oasis-
> open.org/msg07898.html
> [2] https://jpbrucker.net/git/linux/log/?h=virtio-iommu/bypass
> [3] https://jpbrucker.net/git/qemu/log/?h=virtio-iommu/bypass
> 
> Jean-Philippe Brucker (5):
>   iommu/virtio: Add definitions for VIRTIO_IOMMU_F_BYPASS_CONFIG
>   iommu/virtio: Support bypass domains
>   iommu/virtio: Sort reserved regions
>   iommu/virtio: Pass end address to viommu_add_mapping()
>   iommu/virtio: Support identity-mapped domains
> 
>  include/uapi/linux/virtio_iommu.h |   8 ++-
>  drivers/iommu/virtio-iommu.c  | 113 +-
>  2 files changed, 101 insertions(+), 20 deletions(-)
> 
> --
> 2.33.0



RE: [PATCH 2/5] iommu/virtio: Support bypass domains

2021-10-13 Thread Tian, Kevin
> From: Tian, Kevin
> Sent: Thursday, October 14, 2021 11:25 AM
> 
> > From: Jean-Philippe Brucker 
> > Sent: Wednesday, October 13, 2021 8:11 PM
> >
> > The VIRTIO_IOMMU_F_BYPASS_CONFIG feature adds a new flag to the
> > ATTACH
> > request, that creates a bypass domain. Use it to enable identity
> > domains.
> >
> > When VIRTIO_IOMMU_F_BYPASS_CONFIG is not supported by the device,
> > we
> > currently fail attaching to an identity domain. Future patches will
> > instead create identity mappings in this case.
> >
> > Signed-off-by: Jean-Philippe Brucker 
> > ---
> >  drivers/iommu/virtio-iommu.c | 20 +++-
> >  1 file changed, 19 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
> > index 80930ce04a16..ee8a7afd667b 100644
> > --- a/drivers/iommu/virtio-iommu.c
> > +++ b/drivers/iommu/virtio-iommu.c
> > @@ -71,6 +71,7 @@ struct viommu_domain {
> > struct rb_root_cached   mappings;
> >
> > unsigned long   nr_endpoints;
> > +   boolbypass;
> >  };
> >
> >  struct viommu_endpoint {
> > @@ -587,7 +588,9 @@ static struct iommu_domain
> > *viommu_domain_alloc(unsigned type)
> >  {
> > struct viommu_domain *vdomain;
> >
> > -   if (type != IOMMU_DOMAIN_UNMANAGED && type !=
> > IOMMU_DOMAIN_DMA)
> > +   if (type != IOMMU_DOMAIN_UNMANAGED &&
> > +   type != IOMMU_DOMAIN_DMA &&
> > +   type != IOMMU_DOMAIN_IDENTITY)
> > return NULL;
> >
> > vdomain = kzalloc(sizeof(*vdomain), GFP_KERNEL);
> > @@ -630,6 +633,17 @@ static int viommu_domain_finalise(struct
> > viommu_endpoint *vdev,
> > vdomain->map_flags  = viommu->map_flags;
> > vdomain->viommu = viommu;
> >
> > +   if (domain->type == IOMMU_DOMAIN_IDENTITY) {
> > +   if (!virtio_has_feature(viommu->vdev,
> > +   VIRTIO_IOMMU_F_BYPASS_CONFIG))
> > {
> > + ida_free(&viommu->domain_ids, vdomain->id);
> > +   vdomain->viommu = 0;
> > +   return -EOPNOTSUPP;
> > +   }
> > +
> > +   vdomain->bypass = true;
> > +   }
> > +
> 
> Move this to the start of the function; then there's no need for the above cleanup.
> 

Forget it, as I see the reason now when looking at patch 5.


RE: [PATCH 2/5] iommu/virtio: Support bypass domains

2021-10-13 Thread Tian, Kevin
> From: Jean-Philippe Brucker 
> Sent: Wednesday, October 13, 2021 8:11 PM
> 
> The VIRTIO_IOMMU_F_BYPASS_CONFIG feature adds a new flag to the
> ATTACH
> request, that creates a bypass domain. Use it to enable identity
> domains.
> 
> When VIRTIO_IOMMU_F_BYPASS_CONFIG is not supported by the device,
> we
> currently fail attaching to an identity domain. Future patches will
> instead create identity mappings in this case.
> 
> Signed-off-by: Jean-Philippe Brucker 
> ---
>  drivers/iommu/virtio-iommu.c | 20 +++-
>  1 file changed, 19 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
> index 80930ce04a16..ee8a7afd667b 100644
> --- a/drivers/iommu/virtio-iommu.c
> +++ b/drivers/iommu/virtio-iommu.c
> @@ -71,6 +71,7 @@ struct viommu_domain {
>   struct rb_root_cached   mappings;
> 
>   unsigned long   nr_endpoints;
> + boolbypass;
>  };
> 
>  struct viommu_endpoint {
> @@ -587,7 +588,9 @@ static struct iommu_domain
> *viommu_domain_alloc(unsigned type)
>  {
>   struct viommu_domain *vdomain;
> 
> - if (type != IOMMU_DOMAIN_UNMANAGED && type !=
> IOMMU_DOMAIN_DMA)
> + if (type != IOMMU_DOMAIN_UNMANAGED &&
> + type != IOMMU_DOMAIN_DMA &&
> + type != IOMMU_DOMAIN_IDENTITY)
>   return NULL;
> 
>   vdomain = kzalloc(sizeof(*vdomain), GFP_KERNEL);
> @@ -630,6 +633,17 @@ static int viommu_domain_finalise(struct
> viommu_endpoint *vdev,
>   vdomain->map_flags  = viommu->map_flags;
>   vdomain->viommu = viommu;
> 
> + if (domain->type == IOMMU_DOMAIN_IDENTITY) {
> + if (!virtio_has_feature(viommu->vdev,
> + VIRTIO_IOMMU_F_BYPASS_CONFIG))
> {
> + ida_free(&viommu->domain_ids, vdomain->id);
> + vdomain->viommu = 0;
> + return -EOPNOTSUPP;
> + }
> +
> + vdomain->bypass = true;
> + }
> +

Move this to the start of the function; then there's no need for the above cleanup.

>   return 0;
>  }
> 
> @@ -691,6 +705,9 @@ static int viommu_attach_dev(struct iommu_domain
> *domain, struct device *dev)
>   .domain = cpu_to_le32(vdomain->id),
>   };
> 
> + if (vdomain->bypass)
> + req.flags |=
> cpu_to_le32(VIRTIO_IOMMU_ATTACH_F_BYPASS);
> +
>   for (i = 0; i < fwspec->num_ids; i++) {
>   req.endpoint = cpu_to_le32(fwspec->ids[i]);
> 
> @@ -1132,6 +1149,7 @@ static unsigned int features[] = {
>   VIRTIO_IOMMU_F_DOMAIN_RANGE,
>   VIRTIO_IOMMU_F_PROBE,
>   VIRTIO_IOMMU_F_MMIO,
> + VIRTIO_IOMMU_F_BYPASS_CONFIG,
>  };
> 
>  static struct virtio_device_id id_table[] = {
> --
> 2.33.0



RE: [PATCH 0/5] iommu/virtio: Add identity domains

2021-10-21 Thread Tian, Kevin
> From: j...@8bytes.org 
> Sent: Monday, October 18, 2021 7:38 PM
> 
> On Thu, Oct 14, 2021 at 03:00:38AM +, Tian, Kevin wrote:
> > I saw a concept of deferred attach in iommu core. See
> > iommu_is_attach_deferred(). Currently this is vendor specific and I haven't
> > looked into the exact reason why some vendor sets it now. Just
> > be curious whether the same reason might be applied to virtio-iommu.
> 
> The reason for attach_deferred is kdump support, where the IOMMU driver
> needs to keep the mappings from the old kernel until the device driver
> of the new kernel takes over.
> 

OK. Then there is no problem with the original statement:

All endpoints managed by the IOMMU should be attached to a 
domain, so global bypass isn't in use after endpoints are probed.

Thanks
Kevin


RE: [PATCH 0/5] iommu/virtio: Add identity domains

2021-10-21 Thread Tian, Kevin
> From: Jean-Philippe Brucker 
> Sent: Wednesday, October 13, 2021 8:11 PM
> 
> Support identity domains, allowing to only enable IOMMU protection for a
> subset of endpoints (those assigned to userspace, for example). Users
> may enable identity domains at compile time
> (CONFIG_IOMMU_DEFAULT_PASSTHROUGH), boot time
> (iommu.passthrough=1) or
> runtime (/sys/kernel/iommu_groups/*/type = identity).
> 
> Patches 1-2 support identity domains using the optional
> VIRTIO_IOMMU_F_BYPASS_CONFIG feature. The feature bit is not yet in the
> spec, see [1] for the latest proposal.
> 
> Patches 3-5 add a fallback to identity mappings, when the feature is not
> supported.
> 
> Note that this series doesn't touch the global bypass bit added by
> VIRTIO_IOMMU_F_BYPASS_CONFIG. All endpoints managed by the IOMMU
> should
> be attached to a domain, so global bypass isn't in use after endpoints
> are probed. Before that, the global bypass policy is decided by the
> hypervisor and firmware. So I don't think Linux needs to touch the
> global bypass bit, but there are some patches available on my
> virtio-iommu/bypass branch [2] to test it.
> 
> QEMU patches are on my virtio-iommu/bypass branch [3] (and the list)
> 
> [1] https://www.mail-archive.com/virtio-dev@lists.oasis-
> open.org/msg07898.html
> [2] https://jpbrucker.net/git/linux/log/?h=virtio-iommu/bypass
> [3] https://jpbrucker.net/git/qemu/log/?h=virtio-iommu/bypass
> 
> Jean-Philippe Brucker (5):
>   iommu/virtio: Add definitions for VIRTIO_IOMMU_F_BYPASS_CONFIG
>   iommu/virtio: Support bypass domains
>   iommu/virtio: Sort reserved regions
>   iommu/virtio: Pass end address to viommu_add_mapping()
>   iommu/virtio: Support identity-mapped domains
> 
>  include/uapi/linux/virtio_iommu.h |   8 ++-
>  drivers/iommu/virtio-iommu.c  | 113 +-
>  2 files changed, 101 insertions(+), 20 deletions(-)
> 

For this series:

Reviewed-by: Kevin Tian 


RE: [PATCH 0/5] iommu/virtio: Add identity domains

2021-10-21 Thread Tian, Kevin
> From: Jean-Philippe Brucker 
> Sent: Monday, October 18, 2021 11:24 PM
> 
> On Thu, Oct 14, 2021 at 03:00:38AM +, Tian, Kevin wrote:
> > > From: Jean-Philippe Brucker 
> > > Sent: Wednesday, October 13, 2021 8:11 PM
> > >
> > > Support identity domains, allowing to only enable IOMMU protection for
> a
> > > subset of endpoints (those assigned to userspace, for example). Users
> > > may enable identity domains at compile time
> > > (CONFIG_IOMMU_DEFAULT_PASSTHROUGH), boot time
> > > (iommu.passthrough=1) or
> > > runtime (/sys/kernel/iommu_groups/*/type = identity).
> >
> > Do we want to use consistent terms between spec (bypass domain)
> > and code (identity domain)?
> 
> I don't think we have to. Linux uses "identity" domains and "passthrough"
> IOMMU. The old virtio-iommu feature was "bypass" so we should keep that
> for the new one, to be consistent. And then I've used "bypass" for domains
> as well, in the spec, because it would look strange to use a different
> term for the same concept. I find that it sort of falls into place: Linux'
> identity domains can be implemented either with bypass or identity-mapped
> virtio-iommu domains.

make sense.

> 
> > >
> > > Patches 1-2 support identity domains using the optional
> > > VIRTIO_IOMMU_F_BYPASS_CONFIG feature. The feature bit is not yet in
> the
> > > spec, see [1] for the latest proposal.
> > >
> > > Patches 3-5 add a fallback to identity mappings, when the feature is not
> > > supported.
> > >
> > > Note that this series doesn't touch the global bypass bit added by
> > > VIRTIO_IOMMU_F_BYPASS_CONFIG. All endpoints managed by the
> IOMMU
> > > should
> > > be attached to a domain, so global bypass isn't in use after endpoints
> >
> > I saw a concept of deferred attach in iommu core. See
> > iommu_is_attach_deferred(). Currently this is vendor specific and I haven't
> > looked into the exact reason why some vendor sets it now. Just
> > be curious whether the same reason might be applied to virtio-iommu.
> >
> > > are probed. Before that, the global bypass policy is decided by the
> > > hypervisor and firmware. So I don't think Linux needs to touch the
> >
> > This reminds me one thing. The spec says that the global bypass
> > bit is sticky and not affected by reset.
> 
> The spec talks about *device* reset, triggered by software writing 0 to
> the status register, but it doesn't mention system reset. Would be good to
> clarify that. Something like:
> 
> If the device offers the VIRTIO_IOMMU_F_BYPASS_CONFIG feature, it MAY
> initialize the \field{bypass} field to 1. Field \field{bypass} SHOULD
> NOT change on device reset, but SHOULD be restored to its initial
> value on system reset.

looks good to me

> 
> > This implies that in the case
> > of rebooting the VM into a different OS, the previous OS actually
> > has the right to override this setting for the next OS. Is it a right
> > design? Even the firmware itself is unable to identify the original
> > setting enforced by the hypervisor after reboot. I feel the hypervisor
> > setting should be recovered after reset since it reflects the
> > security measure enforced by the virtual platform?
> 
> So I think clarifying system reset should address your questions.
> I believe we should leave bypass sticky across device reset, so a FW->OS
> transition, where the OS resets the device, does not open a vulnerability
> (if bypass was enabled at boot and then left disabled by FW.)
> 
> It's still a good idea for the OS to restore on shutdown the bypass value
> it was booted with. So it can kexec into an OS that doesn't support
> virtio-iommu, for example.
> 
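
A minimal sketch of that restore, assuming the probe path saved the
initial config value (the boot_bypass variable and the hook placement are
my assumptions, not part of this series):

/* Restore the boot-time bypass value on shutdown so a kexec'd kernel
 * without virtio-iommu support still gets working DMA. Assumes probe
 * saved the initial value in boot_bypass. */
static u8 boot_bypass;

static void viommu_shutdown(struct virtio_device *vdev)
{
	if (virtio_has_feature(vdev, VIRTIO_IOMMU_F_BYPASS_CONFIG))
		virtio_cwrite8(vdev,
			       offsetof(struct virtio_iommu_config, bypass),
			       boot_bypass);
}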

Thanks
Kevin


RE: [PATCH 3/5] iommu: Introduce the domain op enforce_cache_coherency()

2022-04-05 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Wednesday, April 6, 2022 6:58 AM
> 
> On Tue, Apr 05, 2022 at 01:50:36PM -0600, Alex Williamson wrote:
> > >
> > > +static bool intel_iommu_enforce_cache_coherency(struct
> iommu_domain *domain)
> > > +{
> > > + struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> > > +
> > > + if (!dmar_domain->iommu_snooping)
> > > + return false;
> > > + dmar_domain->enforce_no_snoop = true;
> > > + return true;
> > > +}
> >
> > Don't we have issues if we try to set DMA_PTE_SNP on DMARs that don't
> > support it, ie. reserved register bit set in pte faults?
> 
> The way the Intel driver is setup that is not possible. Currently it
> does:
> 
>  static bool intel_iommu_capable(enum iommu_cap cap)
>  {
>   if (cap == IOMMU_CAP_CACHE_COHERENCY)
>   return domain_update_iommu_snooping(NULL);
> 
> Which is a global property unrelated to any device.
> 
> Thus either all devices and all domains support iommu_snooping, or
> none do.
> 
> It is unclear because for some reason the driver recalculates this
> almost constant value on every device attach..

The reason is simply that the iommu capability is a global flag.

When the cap is removed by this series I don't think we need to keep
that global recalculation thing.

Thanks
Kevin

RE: [PATCH 3/5] iommu: Introduce the domain op enforce_cache_coherency()

2022-04-05 Thread Tian, Kevin
> From: Tian, Kevin
> Sent: Wednesday, April 6, 2022 7:32 AM
> 
> > From: Jason Gunthorpe 
> > Sent: Wednesday, April 6, 2022 6:58 AM
> >
> > On Tue, Apr 05, 2022 at 01:50:36PM -0600, Alex Williamson wrote:
> > > >
> > > > +static bool intel_iommu_enforce_cache_coherency(struct
> > iommu_domain *domain)
> > > > +{
> > > > +   struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> > > > +
> > > > +   if (!dmar_domain->iommu_snooping)
> > > > +   return false;
> > > > +   dmar_domain->enforce_no_snoop = true;
> > > > +   return true;
> > > > +}
> > >
> > > Don't we have issues if we try to set DMA_PTE_SNP on DMARs that don't
> > > support it, ie. reserved register bit set in pte faults?
> >
> > The way the Intel driver is setup that is not possible. Currently it
> > does:
> >
> >  static bool intel_iommu_capable(enum iommu_cap cap)
> >  {
> > if (cap == IOMMU_CAP_CACHE_COHERENCY)
> > return domain_update_iommu_snooping(NULL);
> >
> > Which is a global property unrelated to any device.
> >
> > Thus either all devices and all domains support iommu_snooping, or
> > none do.
> >
> > It is unclear because for some reason the driver recalculates this
> > almost constant value on every device attach..
> 
> The reason is simply that the iommu capability is a global flag.

...and the intel iommu driver supports hotplugged iommus. But in reality
this recalculation is almost a no-op, because the only iommu that doesn't
support force snoop is the one for the Intel GPU, which is built into the
platform.

> 
> when the cap is removed by this series I don't think we need keep that
> global recalculation thing.
> 
> Thanks
> Kevin

RE: [PATCH 3/5] iommu: Introduce the domain op enforce_cache_coherency()

2022-04-06 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Wednesday, April 6, 2022 12:16 AM
> 
> This new mechanism will replace using IOMMU_CAP_CACHE_COHERENCY
> and
> IOMMU_CACHE to control the no-snoop blocking behavior of the IOMMU.
> 
> Currently only Intel and AMD IOMMUs are known to support this
> feature. They both implement it as an IOPTE bit, that when set, will cause
> PCIe TLPs to that IOVA with the no-snoop bit set to be treated as though
> the no-snoop bit was clear.
> 
> The new API is triggered by calling enforce_cache_coherency() before
> mapping any IOVA to the domain which globally switches on no-snoop
> blocking. This allows other implementations that might block no-snoop
> globally and outside the IOPTE - AMD also documents such an HW capability.
> 
> Leave AMD out of sync with Intel and have it block no-snoop even for
> in-kernel users. This can be trivially resolved in a follow up patch.
> 
> Only VFIO will call this new API.

Is it too restrictive? In theory vdpa may also implement a contract with
KVM and then want to call this new API too.



RE: [PATCH 0/5] Make the iommu driver no-snoop block feature consistent

2022-04-06 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Wednesday, April 6, 2022 12:16 AM
> 
> PCIe defines a 'no-snoop' bit in each the TLP which is usually implemented
> by a platform as bypassing elements in the DMA coherent CPU cache
> hierarchy. A driver can command a device to set this bit on some of its
> transactions as a micro-optimization.
> 
> However, the driver is now responsible to synchronize the CPU cache with
> the DMA that bypassed it. On x86 this is done through the wbinvd
> instruction, and the i915 GPU driver is the only Linux DMA driver that
> calls it.

More accurately, x86 supports both an unprivileged clflush instruction to
invalidate one cacheline and a privileged wbinvd instruction to invalidate
the entire cache. Replacing 'this is done' with 'this may be done' would
be clearer.
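
For illustration, the two options contrast roughly like this on the
kernel side (a sketch, not code from this series):

#include <asm/cacheflush.h>	/* clflush_cache_range() */
#include <asm/special_insns.h>	/* wbinvd() */

/* Flush just the affected lines with clflush, or write back and
 * invalidate the entire cache with wbinvd. */
static void sync_cpu_cache_after_no_snoop_dma(void *buf, unsigned int size,
					      bool whole_cache)
{
	if (whole_cache)
		wbinvd();			/* privileged, entire cache */
	else
		clflush_cache_range(buf, size);	/* unprivileged per-line flush */
}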

> 
> The problem comes that KVM on x86 will normally disable the wbinvd
> instruction in the guest and render it a NOP. As the driver running in the
> guest is not aware the wbinvd doesn't work it may still cause the device
> to set the no-snoop bit and the platform will bypass the CPU cache.
> Without a working wbinvd there is no way to re-synchronize the CPU cache
> and the driver in the VM has data corruption.
> 
> Thus, we see a general direction on x86 that the IOMMU HW is able to block
> the no-snoop bit in the TLP. This NOP's the optimization and allows KVM to
> to NOP the wbinvd without causing any data corruption.
> 
> This control for Intel IOMMU was exposed by using IOMMU_CACHE and
> IOMMU_CAP_CACHE_COHERENCY, however these two values now have
> multiple
> meanings and usages beyond blocking no-snoop and the whole thing has
> become confused.

Also point out your finding about AMD IOMMU?

> 
> Change it so that:
>  - IOMMU_CACHE is only about the DMA coherence of normal DMAs from a
>device. It is used by the DMA API and set when the DMA API will not be
>doing manual cache coherency operations.
> 
>  - dev_is_dma_coherent() indicates if IOMMU_CACHE can be used with the
>device
> 
>  - The new optional domain op enforce_cache_coherency() will cause the
>entire domain to block no-snoop requests - ie there is no way for any
>device attached to the domain to opt out of the IOMMU_CACHE behavior.
> 
> An iommu driver should implement enforce_cache_coherency() so that by
> default domains allow the no-snoop optimization. This leaves it available
> to kernel drivers like i915. VFIO will call enforce_cache_coherency()
> before establishing any mappings and the domain should then permanently
> block no-snoop.
> 
> If enforce_cache_coherency() fails VFIO will communicate back through to
> KVM into the arch code via kvm_arch_register_noncoherent_dma()
> (only implemented by x86) which triggers a working wbinvd to be made
> available to the VM.
> 
> While other arches are certainly welcome to implement
> enforce_cache_coherency(), it is not clear there is any benefit in doing
> so.
> 
> After this series there are only two calls left to iommu_capable() with a
> bus argument which should help Robin's work here.
> 
> This is on github:
> https://github.com/jgunthorpe/linux/commits/intel_no_snoop
> 
> Cc: "Tian, Kevin" 
> Cc: Robin Murphy 
> Cc: Alex Williamson 
> Cc: Christoph Hellwig 
> Signed-off-by: Jason Gunthorpe 
> 
> Jason Gunthorpe (5):
>   iommu: Replace uses of IOMMU_CAP_CACHE_COHERENCY with
> dev_is_dma_coherent()
>   vfio: Require that devices support DMA cache coherence
>   iommu: Introduce the domain op enforce_cache_coherency()
>   vfio: Move the Intel no-snoop control off of IOMMU_CACHE
>   iommu: Delete IOMMU_CAP_CACHE_COHERENCY
> 
>  drivers/infiniband/hw/usnic/usnic_uiom.c| 16 +--
>  drivers/iommu/amd/iommu.c   |  9 +--
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c |  2 --
>  drivers/iommu/arm/arm-smmu/arm-smmu.c   |  6 -
>  drivers/iommu/arm/arm-smmu/qcom_iommu.c |  6 -
>  drivers/iommu/fsl_pamu_domain.c |  6 -
>  drivers/iommu/intel/iommu.c | 15 ---
>  drivers/iommu/s390-iommu.c  |  2 --
>  drivers/vfio/vfio.c |  6 +
>  drivers/vfio/vfio_iommu_type1.c | 30 +
>  drivers/vhost/vdpa.c|  3 ++-
>  include/linux/intel-iommu.h |  1 +
>  include/linux/iommu.h   |  6 +++--
>  13 files changed, 58 insertions(+), 50 deletions(-)
> 
> 
> base-commit: 3123109284176b1532874591f7c81f3837bbdc17
> --
> 2.35.1



RE: [PATCH 2/5] vfio: Require that devices support DMA cache coherence

2022-04-06 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Wednesday, April 6, 2022 3:29 AM
> 
> On Tue, Apr 05, 2022 at 01:10:44PM -0600, Alex Williamson wrote:
> > On Tue,  5 Apr 2022 13:16:01 -0300
> > Jason Gunthorpe  wrote:
> >
> > > dev_is_dma_coherent() is the control to determine if IOMMU_CACHE can
> be
> > > supported.
> > >
> > > IOMMU_CACHE means that normal DMAs do not require any additional
> coherency
> > > mechanism and is the basic uAPI that VFIO exposes to userspace. For
> > > instance VFIO applications like DPDK will not work if additional coherency
> > > operations are required.
> > >
> > > Therefore check dev_is_dma_coherent() before allowing a device to join
> a
> > > domain. This will block device/platform/iommu combinations from using
> VFIO
> > > that do not support cache coherent DMA.
> > >
> > > Signed-off-by: Jason Gunthorpe 
> > >  drivers/vfio/vfio.c | 6 ++
> > >  1 file changed, 6 insertions(+)
> > >
> > > diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> > > index a4555014bd1e72..2a3aa3e742d943 100644
> > > +++ b/drivers/vfio/vfio.c
> > > @@ -32,6 +32,7 @@
> > >  #include 
> > >  #include 
> > >  #include 
> > > +#include 
> > >  #include "vfio.h"
> > >
> > >  #define DRIVER_VERSION   "0.3"
> > > @@ -1348,6 +1349,11 @@ static int vfio_group_get_device_fd(struct
> vfio_group *group, char *buf)
> > >   if (IS_ERR(device))
> > >   return PTR_ERR(device);
> > >
> > > + if (group->type == VFIO_IOMMU && !dev_is_dma_coherent(device->dev)) {
> > > + ret = -ENODEV;
> > > + goto err_device_put;
> > > + }
> > > +
> >
> > Failing at the point where the user is trying to gain access to the
> > device seems a little late in the process and opaque, wouldn't we
> > rather have vfio bus drivers fail to probe such devices?  I'd expect
> > this to occur in the vfio_register_group_dev() path.  Thanks,
> 
> Yes, that is a good point.
> 
> So like this:
> 
>  int vfio_register_group_dev(struct vfio_device *device)
>  {
> +   if (!dev_is_dma_coherent(device->dev))
> +   return -EINVAL;
> +
> return __vfio_register_dev(device,
> vfio_group_find_or_alloc(device->dev));
>  }
> 
> I fixed it up.
> 

If that is the case, should it also apply to usnic and vdpa in the first
patch (i.e. fail the probe)?


RE: [PATCH 1/5] iommu: Replace uses of IOMMU_CAP_CACHE_COHERENCY with dev_is_dma_coherent()

2022-04-07 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Thursday, April 7, 2022 1:17 AM
> 
> On Wed, Apr 06, 2022 at 06:10:31PM +0200, Christoph Hellwig wrote:
> > On Wed, Apr 06, 2022 at 01:06:23PM -0300, Jason Gunthorpe wrote:
> > > On Wed, Apr 06, 2022 at 05:50:56PM +0200, Christoph Hellwig wrote:
> > > > On Wed, Apr 06, 2022 at 12:18:23PM -0300, Jason Gunthorpe wrote:
> > > > > > Oh, I didn't know about device_get_dma_attr()..
> > > >
> > > > Which is completely broken for any non-OF, non-ACPI plaform.
> > >
> > > I saw that, but I spent some time searching and could not find an
> > > iommu driver that would load independently of OF or ACPI. ie no IOMMU
> > > platform drivers are created by board files. Things like Intel/AMD
> > > discover only from ACPI, etc.

Intel discovers IOMMUs (and optionally ACPI namespace devices) from
ACPI, but there is no ACPI description for PCI devices, i.e. the current
logic of device_get_dma_attr() cannot be used on PCI devices.

> >
> > s390?
> 
> Ah, I missed looking in s390, hyperv and virtio..
> 
> hyperv is not creating iommu_domains, just IRQ remapping
> 
> virtio is using OF
> 
> And s390 indeed doesn't obviously have OF or ACPI parts..
> 
> This seems like it would be consistent with other things:
> 
> enum dev_dma_attr device_get_dma_attr(struct device *dev)
> {
>   const struct fwnode_handle *fwnode = dev_fwnode(dev);
>   struct acpi_device *adev = to_acpi_device_node(fwnode);
> 
>   if (is_of_node(fwnode)) {
>   if (of_dma_is_coherent(to_of_node(fwnode)))
>   return DEV_DMA_COHERENT;
>   return DEV_DMA_NON_COHERENT;
>   } else if (adev) {
>   return acpi_get_dma_attr(adev);
>   }
> 
>   /* Platform is always DMA coherent */
>   if (!IS_ENABLED(CONFIG_ARCH_HAS_SYNC_DMA_FOR_DEVICE) &&
>   !IS_ENABLED(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU) &&
>   !IS_ENABLED(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL) &&
>   device_iommu_mapped(dev))
>   return DEV_DMA_COHERENT;
>   return DEV_DMA_NOT_SUPPORTED;
> }
> EXPORT_SYMBOL_GPL(device_get_dma_attr);
> 
> ie s390 has no of or acpi but the entire platform is known DMA
> coherent at config time so allow it. Not sure we need the
> device_iommu_mapped() or not.

Probably not. If adding an iommu may change the coherency, that would
come from specific IOPTE attributes for a device. The mere fact that the
device is mapped by an iommu doesn't tell that information.

> 
> We could alternatively use existing device_get_dma_attr() as a default
> with an iommu wrapper and push the exception down through the iommu
> driver and s390 can override it.
> 

If going this way, should device_get_dma_attr() be renamed to
device_fwnode_get_dma_attr() instead, to make it clearer?
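
Roughly like this, combining your wrapper idea with the rename (names
illustrative):

/* The fwnode-only logic keeps a clearer name; the wrapper adds the
 * "platform known DMA coherent at build time" fallback (e.g. s390). */
enum dev_dma_attr device_fwnode_get_dma_attr(struct device *dev); /* today's body */

enum dev_dma_attr device_get_dma_attr(struct device *dev)
{
	enum dev_dma_attr attr = device_fwnode_get_dma_attr(dev);

	if (attr != DEV_DMA_NOT_SUPPORTED)
		return attr;

	if (!IS_ENABLED(CONFIG_ARCH_HAS_SYNC_DMA_FOR_DEVICE) &&
	    !IS_ENABLED(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU) &&
	    !IS_ENABLED(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL))
		return DEV_DMA_COHERENT;

	return DEV_DMA_NOT_SUPPORTED;
}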

Thanks
Kevin


RE: [PATCH v2 0/6] IOMMUFD: Deliver IO page faults to user space

2023-11-07 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Thursday, November 2, 2023 8:48 PM
>
> On Thu, Oct 26, 2023 at 10:49:24AM +0800, Lu Baolu wrote:
> > Hi folks,
> >
> > This series implements the functionality of delivering IO page faults to
> > user space through the IOMMUFD framework for nested translation.
> Nested
> > translation is a hardware feature that supports two-stage translation
> > tables for IOMMU. The second-stage translation table is managed by the
> > host VMM, while the first-stage translation table is owned by user
> > space. This allows user space to control the IOMMU mappings for its
> > devices.
> 
> Having now looked more closely at the ARM requirements it seems we
> will need generic events, not just page fault events to have a
> complete emulation.

Can you elaborate?

> 
> So I'd like to see this generalized into a channel to carry any
> events..
> 
> > User space indicates its capability of handling IO page faults by
> > setting the IOMMU_HWPT_ALLOC_IOPF_CAPABLE flag when allocating a
> > hardware page table (HWPT). IOMMUFD will then set up its infrastructure
> > for page fault delivery. On a successful return of HWPT allocation, the
> > user can retrieve and respond to page faults by reading and writing to
> > the file descriptor (FD) returned in out_fault_fd.
> 
> This is the right way to approach it, and more broadly this shouldn't
> be an iommufd specific thing. Kernel drivers will also need to create
> fault capable PAGING iommu domains.
> 

Are you suggesting a common interface used by both iommufd and
kernel drivers?

But I didn't get the last piece. If those domains are created by kernel
drivers, why would they require a uAPI for userspace to specify the fault
capability?
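
For the kernel-driver case I'd expect no uAPI at all; something like the
sketch below, where the handler hooks are hypothetical and only meant to
illustrate the question:

/* Hypothetical: a kernel driver creating a fault-capable paging domain
 * and supplying its own handler, with no userspace fault channel. The
 * iopf_handler/fault_data fields are illustrative assumptions. */
static int my_iopf_handler(struct iommu_fault *fault, void *data)
{
	/* resolve the fault, e.g. map the faulting page, then respond */
	return 0;
}

static struct iommu_domain *alloc_faultable_domain(struct device *dev,
						   void *ctx)
{
	struct iommu_domain *dom = iommu_domain_alloc(dev->bus);

	if (dom) {
		dom->iopf_handler = my_iopf_handler;	/* assumed hook */
		dom->fault_data   = ctx;
	}
	return dom;
}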


RE: [PATCH V1 vfio 0/9] Introduce a vfio driver over virtio devices

2023-10-25 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Monday, October 23, 2023 11:43 PM
> 
> On Mon, Oct 23, 2023 at 09:33:23AM -0600, Alex Williamson wrote:
> 
> > > Alex,
> > > Are you fine to leave the provisioning of the VF including the control
> > > of its transitional capability in the device hands as was suggested by
> > > Jason ?
> >
> > If this is the standard we're going to follow, ie. profiling of a
> > device is expected to occur prior to the probe of the vfio-pci variant
> > driver, then we should get the out-of-tree NVIDIA vGPU driver on board
> > with this too.
> 
> Those GPU drivers are using mdev not vfio-pci..
> 
> mdev doesn't have a way in its uapi to configure the mdev before it is
> created.
> 
> I'm hopeful that the SIOV work will develop something better because
> we clearly need it for the general use cases of SIOV beyond VFIO.
> 

The internal idxd driver version which I looked at last time leaves
provisioning to idxd's own config interface. Sure, let's brainstorm what
would be (if possible) a general provisioning framework after it's sent
out for review.


RE: [PATCH v3 5/5] vfio/iommu_type1: Simplify group attachment

2022-06-27 Thread Tian, Kevin
> From: Nicolin Chen 
> Sent: Friday, June 24, 2022 4:00 AM
> 
> Un-inline the domain specific logic from the attach/detach_group ops into
> two paired functions vfio_iommu_alloc_attach_domain() and
> vfio_iommu_detach_destroy_domain() that strictly deal with creating and
> destroying struct vfio_domains.
> 
> Add the logic to check for EMEDIUMTYPE return code of
> iommu_attach_group()
> and avoid the extra domain allocations and attach/detach sequences of the
> old code. This allows properly detecting an actual attach error, like
> -ENOMEM, vs treating all attach errors as an incompatible domain.
> 
> Co-developed-by: Jason Gunthorpe 
> Signed-off-by: Jason Gunthorpe 
> Signed-off-by: Nicolin Chen 

Reviewed-by: Kevin Tian 


RE: [PATCH v3 1/5] iommu: Return -EMEDIUMTYPE for incompatible domain and device/group

2022-06-30 Thread Tian, Kevin
> From: Robin Murphy 
> Sent: Thursday, June 30, 2022 4:22 PM
> 
> On 2022-06-29 20:47, Nicolin Chen wrote:
> > On Fri, Jun 24, 2022 at 03:19:43PM -0300, Jason Gunthorpe wrote:
> >> On Fri, Jun 24, 2022 at 06:35:49PM +0800, Yong Wu wrote:
> >>
> > It's not used in VFIO context. "return 0" just satisfy the iommu
> > framework to go ahead. and yes, here we only allow the shared
> > "mapping-domain" (All the devices share a domain created
> > internally).
> >>
> >> What part of the iommu framework is trying to attach a domain and
> >> wants to see success when the domain was not actually attached ?
> >>
>  What prevent this driver from being used in VFIO context?
> >>>
> >>> Nothing prevent this. Just I didn't test.
> >>
> >> This is why it is wrong to return success here.
> >
> > Hi Yong, would you or someone you know be able to confirm whether
> > this "return 0" is still a must or not?
> 
>  From memory, it is unfortunately required, due to this driver being in
> the rare position of having to support multiple devices in a single
> address space on 32-bit ARM. Since the old ARM DMA code doesn't
> understand groups, the driver sets up its own canonical
> dma_iommu_mapping to act like a default domain, but then has to politely
> say "yeah OK" to arm_setup_iommu_dma_ops() for each device so that they
> do all end up with the right DMA ops rather than dying in screaming
> failure (the ARM code's per-device mappings then get leaked, but we
> can't really do any better).
> 
> The whole mess disappears in the proper default domain conversion, but
> in the meantime, it's still safe to assume that nobody's doing VFIO with
> embedded display/video codec/etc. blocks that don't even have reset drivers.
> 

Probably the above is worth a comment in the mtk code, so we don't
always need to dig it out from memory when a similar question arises
in the future.

RE: [PATCH v3 1/5] iommu: Return -EMEDIUMTYPE for incompatible domain and device/group

2022-06-24 Thread Tian, Kevin
> From: Yong Wu
> Sent: Friday, June 24, 2022 1:39 PM
> 
> On Thu, 2022-06-23 at 19:44 -0700, Nicolin Chen wrote:
> > On Fri, Jun 24, 2022 at 09:35:49AM +0800, Baolu Lu wrote:
> > > External email: Use caution opening links or attachments
> > >
> > >
> > > On 2022/6/24 04:00, Nicolin Chen wrote:
> > > > diff --git a/drivers/iommu/mtk_iommu_v1.c
> > > > b/drivers/iommu/mtk_iommu_v1.c
> > > > index e1cb51b9866c..5386d889429d 100644
> > > > --- a/drivers/iommu/mtk_iommu_v1.c
> > > > +++ b/drivers/iommu/mtk_iommu_v1.c
> > > > @@ -304,7 +304,7 @@ static int mtk_iommu_v1_attach_device(struct
> > > > iommu_domain *domain, struct device
> > > >   /* Only allow the domain created internally. */
> > > >   mtk_mapping = data->mapping;
> > > >   if (mtk_mapping->domain != domain)
> > > > - return 0;
> > > > + return -EMEDIUMTYPE;
> > > >
> > > >   if (!data->m4u_dom) {
> > > >   data->m4u_dom = dom;
> > >
> > > This change looks odd. It turns the return value from success to
> > > failure. Is it a bug? If so, it should go through a separated fix
> > > patch.
> 
> Thanks for the review:)
> 
> >
> > Makes sense.
> >
> > I read the commit log of the original change:
> >
> https://lore.kernel.org/r/1589530123-30240-1-git-send-email-
> yong...@mediatek.com
> >
> > It doesn't seem to allow devices to get attached to different
> > domains other than the shared mapping->domain, created in the
> > in the mtk_iommu_probe_device(). So it looks like returning 0
> > is intentional. Though I am still very confused by this return
> > value here, I doubt it has ever been used in a VFIO context.
> 
> It's not used in VFIO context. "return 0" just satisfy the iommu
> framework to go ahead. and yes, here we only allow the shared "mapping-
> >domain" (All the devices share a domain created internally).
> 
> thus I think we should still keep "return 0" here.
> 

What prevents this driver from being used in a VFIO context?

And why would we want to go ahead when an obvious error occurs,
i.e. when a device is attached to an unexpected domain?


RE: [PATCH v2 3/5] vfio/iommu_type1: Remove the domain->ops comparison

2022-06-22 Thread Tian, Kevin
> From: Robin Murphy 
> Sent: Wednesday, June 22, 2022 3:55 PM
> 
> On 2022-06-16 23:23, Nicolin Chen wrote:
> > On Thu, Jun 16, 2022 at 06:40:14AM +, Tian, Kevin wrote:
> >
> >>> The domain->ops validation was added, as a precaution, for mixed-
> driver
> >>> systems. However, at this moment only one iommu driver is possible. So
> >>> remove it.
> >>
> >> It's true on a physical platform. But I'm not sure whether a virtual
> platform
> >> is allowed to include multiple e.g. one virtio-iommu alongside a virtual 
> >> VT-
> d
> >> or a virtual smmu. It might be clearer to claim that (as Robin pointed out)
> >> there is plenty more significant problems than this to solve instead of
> simply
> >> saying that only one iommu driver is possible if we don't have explicit
> code
> >> to reject such configuration. 
> >
> > Will edit this part. Thanks!
> 
> Oh, physical platforms with mixed IOMMUs definitely exist already. The
> main point is that while bus_set_iommu still exists, the core code
> effectively *does* prevent multiple drivers from registering - even in
> emulated cases like the example above, virtio-iommu and VT-d would both
> try to bus_set_iommu(_bus_type), and one of them will lose. The
> aspect which might warrant clarification is that there's no combination
> of supported drivers which claim non-overlapping buses *and* could
> appear in the same system - even if you tried to contrive something by
> emulating, say, VT-d (PCI) alongside rockchip-iommu (platform), you
> could still only describe one or the other due to ACPI vs. Devicetree.
> 

This explanation is much clearer, thanks!

RE: [PATCH 1/5] iommu: Return -EMEDIUMTYPE for incompatible domain and device/group

2022-06-08 Thread Tian, Kevin
> From: Nicolin Chen
> Sent: Monday, June 6, 2022 2:19 PM
> 
> Cases like VFIO wish to attach a device to an existing domain that was
> not allocated specifically from the device. This raises a condition
> where the IOMMU driver can fail the domain attach because the domain and
> device are incompatible with each other.
> 
> This is a soft failure that can be resolved by using a different domain.
> 
> Provide a dedicated errno from the IOMMU driver during attach that the
> reason attached failed is because of domain incompatability. EMEDIUMTYPE
> is chosen because it is never used within the iommu subsystem today and
> evokes a sense that the 'medium' aka the domain is incompatible.
> 
> VFIO can use this to know attach is a soft failure and it should continue
> searching. Otherwise the attach will be a hard failure and VFIO will
> return the code to userspace.
> 
> Update all drivers to return EMEDIUMTYPE in their failure paths that are
> related to domain incompatability.

It seems not all drivers are converted, e.g.:

mtk_iommu_v1_attach_device():
/* Only allow the domain created internally. */
mtk_mapping = data->mapping;
if (mtk_mapping->domain != domain)
return 0;
** the current code looks incorrect; it should return an error


s390_iommu_attach_device():
/* Allow only devices with identical DMA range limits */
} else if (domain->geometry.aperture_start != zdev->start_dma ||
domain->geometry.aperture_end != zdev->end_dma) {
rc = -EINVAL;


sprd_iommu_attach_device():
if (dom->sdev) {
pr_err("There's already a device attached to this domain.\n");
return -EINVAL;
}


gart_iommu_attach_dev():
if (gart->active_domain && gart->active_domain != domain) {
ret = -EBUSY;


arm_smmu_attach_dev():
if (!fwspec || fwspec->ops != &arm_smmu_ops) {
dev_err(dev, "cannot attach to SMMU, is it on the same bus?\n");
return -ENXIO;
}
** probably this check can be covered by the next patch, which moves the
bus ops check into the iommu core?

Thanks
Kevin


RE: [PATCH 4/5] vfio/iommu_type1: Clean up update_dirty_scope in detach_group()

2022-06-08 Thread Tian, Kevin
> From: Nicolin Chen
> Sent: Monday, June 6, 2022 2:19 PM
> 
> All devices in emulated_iommu_groups have pinned_page_dirty_scope
> set, so the update_dirty_scope in the first list_for_each_entry
> is always false. Clean it up, and move the "if update_dirty_scope"
> part from the detach_group_done routine to the domain_list part.
> 
> Rename the "detach_group_done" goto label accordingly.
> 
> Suggested-by: Jason Gunthorpe 
> Signed-off-by: Nicolin Chen 
> ---
>  drivers/vfio/vfio_iommu_type1.c | 27 ---
>  1 file changed, 12 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c
> b/drivers/vfio/vfio_iommu_type1.c
> index f4e3b423a453..b45b1cc118ef 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -2463,14 +2463,12 @@ static void
> vfio_iommu_type1_detach_group(void *iommu_data,
>   struct vfio_iommu *iommu = iommu_data;
>   struct vfio_domain *domain;
>   struct vfio_iommu_group *group;
> - bool update_dirty_scope = false;
>   LIST_HEAD(iova_copy);
> 
>   mutex_lock(&iommu->lock);
>   list_for_each_entry(group, &iommu->emulated_iommu_groups, next) {
>   if (group->iommu_group != iommu_group)
>   continue;
> - update_dirty_scope = !group->pinned_page_dirty_scope;
>   list_del(&group->next);
>   kfree(group);
> 
> @@ -2479,7 +2477,7 @@ static void vfio_iommu_type1_detach_group(void
> *iommu_data,
>   WARN_ON(iommu->notifier.head);
>   vfio_iommu_unmap_unpin_all(iommu);
>   }
> - goto detach_group_done;
> + goto out_unlock;
>   }
> 
>   /*
> @@ -2495,9 +2493,7 @@ static void vfio_iommu_type1_detach_group(void
> *iommu_data,
>   continue;
> 
>   iommu_detach_group(domain->domain, group->iommu_group);
> - update_dirty_scope = !group->pinned_page_dirty_scope;
>   list_del(&group->next);
> - kfree(group);
>   /*
>* Group ownership provides privilege, if the group list is
>* empty, the domain goes away. If it's the last domain with
> @@ -2519,7 +2515,17 @@ static void vfio_iommu_type1_detach_group(void
> *iommu_data,
>   kfree(domain);
>   vfio_iommu_aper_expand(iommu, &iova_copy);
>   vfio_update_pgsize_bitmap(iommu);
> + /*
> +  * Removal of a group without dirty tracking may
> allow
> +  * the iommu scope to be promoted.
> +  */
> + if (!group->pinned_page_dirty_scope) {
> + iommu->num_non_pinned_groups--;
> + if (iommu->dirty_page_tracking)
> + vfio_iommu_populate_bitmap_full(iommu);

This doesn't look correct. The old code decrements
num_non_pinned_groups for every detached group without dirty
tracking, but now it's only done when the domain is about to
be released...
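
To keep the old semantics, the scope update would need to stay with the
group removal rather than the domain destruction, e.g. (a sketch only):

	list_del(&group->next);
	/* every detached group without dirty tracking updates the scope */
	if (!group->pinned_page_dirty_scope) {
		iommu->num_non_pinned_groups--;
		if (iommu->dirty_page_tracking)
			vfio_iommu_populate_bitmap_full(iommu);
	}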

> + }
>   }
> + kfree(group);
>   break;
>   }
> 
> @@ -2528,16 +2534,7 @@ static void vfio_iommu_type1_detach_group(void
> *iommu_data,
>   else
>   vfio_iommu_iova_free(&iova_copy);
> 
> -detach_group_done:
> - /*
> -  * Removal of a group without dirty tracking may allow the iommu
> scope
> -  * to be promoted.
> -  */
> - if (update_dirty_scope) {
> - iommu->num_non_pinned_groups--;
> - if (iommu->dirty_page_tracking)
> - vfio_iommu_populate_bitmap_full(iommu);
> - }
> +out_unlock:
>   mutex_unlock(&iommu->lock);
>  }
> 
> --
> 2.17.1
> 


RE: [PATCH 3/5] vfio/iommu_type1: Prefer to reuse domains vs match enforced cache coherency

2022-06-08 Thread Tian, Kevin
> From: Nicolin Chen
> Sent: Monday, June 6, 2022 2:19 PM
> 
> From: Jason Gunthorpe 
> 
> The KVM mechanism for controlling wbinvd is only triggered during
> kvm_vfio_group_add(), meaning it is a one-shot test done once the devices
> are setup.

It's not one-shot. kvm_vfio_update_coherency() is called in both
group_add() and group_del(). Then the coherency property is
checked dynamically in wbinvd emulation:

kvm_emulate_wbinvd()
  kvm_emulate_wbinvd_noskip()
need_emulate_wbinvd()
  kvm_arch_has_noncoherent_dma()

It's also checked when a vCPU is scheduled to a new CPU, to track
dirty CPUs that require a cache flush when emulating wbinvd on that
vCPU. See kvm_arch_vcpu_load().

/* Address WBINVD may be executed by guest */
if (need_emulate_wbinvd(vcpu)) {
if (static_call(kvm_x86_has_wbinvd_exit)())
cpumask_set_cpu(cpu, vcpu->arch.wbinvd_dirty_mask);

In addition, it's also checked when deciding the effective memory
type of EPT entry. See vmx_get_mt_mask().

if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | 
VMX_EPT_IPAT_BIT;

But I doubt the above can work reliably when the property is changed
on the fly, given the above paths are triggered at different points. The
guest may end up in a mixed state where inconsistent coherency
is assumed in different emulation paths.

And in reality I don't think such a niche scenario is even tested,
given the only device imposing such a trick is the integrated Intel GPU,
which IIUC no one would try to hotplug/hot-remove to/from
a guest.

Given that, I'm fine with the change in this patch. Going further, we
probably really want an explicit one-shot model so KVM can lock down
the property once it starts to consume it: further adding a new group
which would change the coherency is explicitly rejected, and removing
an existing group leaves it intact.
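
Roughly (a hypothetical sketch; 'coherency_locked' and the helper below
are made-up names purely to illustrate the idea):

	/* once KVM has consumed the property, lock it down */
	if (kvm->arch.coherency_locked &&
	    kvm_vfio_group_is_noncoherent(group))
		return -EPERM;	/* would flip wbinvd behavior mid-flight */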

> 
> So, there is no value in trying to push a device that could do enforced
> cache coherency to a dedicated domain vs re-using an existing domain since

"an existing domain (even if it doesn't enforce coherency)", otherwise if
it's already compatible there is no question here.

> KVM won't be able to take advantage of it. This just wastes domain memory.
> 
> Simplify this code and eliminate the test. This removes the only logic
> that needed to have a dummy domain attached prior to searching for a
> matching domain and simplifies the next patches.
> 
> If someday we want to try and optimize this further the better approach is
> to update the Intel driver so that enforce_cache_coherency() can work on a
> domain that already has IOPTEs and then call the enforce_cache_coherency()
> after detaching a device from a domain to upgrade the whole domain to
> enforced cache coherency mode.
> 
> Signed-off-by: Jason Gunthorpe 
> Signed-off-by: Nicolin Chen 
> ---
>  drivers/vfio/vfio_iommu_type1.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c
> b/drivers/vfio/vfio_iommu_type1.c
> index c13b9290e357..f4e3b423a453 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -2285,9 +2285,7 @@ static int vfio_iommu_type1_attach_group(void
> *iommu_data,
>* testing if they're on the same bus_type.
>*/
>   list_for_each_entry(d, >domain_list, next) {
> - if (d->domain->ops == domain->domain->ops &&
> - d->enforce_cache_coherency ==
> - domain->enforce_cache_coherency) {
> + if (d->domain->ops == domain->domain->ops) {
>   iommu_detach_group(domain->domain, group->iommu_group);
>   if (!iommu_attach_group(d->domain,
>   group->iommu_group)) {
> --
> 2.17.1
> 


RE: [PATCH 3/5] vfio/iommu_type1: Prefer to reuse domains vs match enforced cache coherency

2022-06-08 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Wednesday, June 8, 2022 7:17 PM
> 
> On Wed, Jun 08, 2022 at 08:28:03AM +, Tian, Kevin wrote:
> > > From: Nicolin Chen
> > > Sent: Monday, June 6, 2022 2:19 PM
> > >
> > > From: Jason Gunthorpe 
> > >
> > > The KVM mechanism for controlling wbinvd is only triggered during
> > > kvm_vfio_group_add(), meaning it is a one-shot test done once the
> devices
> > > are setup.
> >
> > It's not one-shot. kvm_vfio_update_coherency() is called in both
> > group_add() and group_del(). Then the coherency property is
> > checked dynamically in wbinvd emulation:
> 
> From the perspective of managing the domains that is still
> one-shot. It doesn't get updated when individual devices are
> added/removed to domains.

It's unchanged per-domain but dynamic per-VM when multiple
domains are added/removed (i.e. kvm->arch.noncoherent_dma_count).
It's the latter that is checked in KVM.

> 
> > Given that, I'm fine with the change in this patch. Going further, we
> > probably really want an explicit one-shot model so KVM can lock down
> > the property once it starts to consume it: further adding a new group
> > which would change the coherency is explicitly rejected, and removing
> > an existing group leaves it intact.
> 
> Why? Once wbinvd is enabled it is compatible with all domain
> configurations, so just leave it on and ignore everything at that
> point.
> 

More than that. My point was to make it a static policy: even if
wbinvd is disabled at the start, we want to leave it off, unaffected
by adding a device which doesn't have coherency. 'wbinvd off' is not
a compatible configuration, hence IMO we need a way to reject adding
an incompatible device.

Thanks
Kevin


RE: [PATCH v2 2/5] vfio/iommu_type1: Prefer to reuse domains vs match enforced cache coherency

2022-06-16 Thread Tian, Kevin
> From: Nicolin Chen 
> Sent: Thursday, June 16, 2022 8:03 AM
> 
> From: Jason Gunthorpe 
> 
> The KVM mechanism for controlling wbinvd is based on OR of the coherency
> property of all devices attached to a guest, no matter those devices are
> attached to a single domain or multiple domains.
> 
> So, there is no value in trying to push a device that could do enforced
> cache coherency to a dedicated domain vs re-using an existing domain
> which is non-coherent since KVM won't be able to take advantage of it.
> This just wastes domain memory.
> 
> Simplify this code and eliminate the test. This removes the only logic
> that needed to have a dummy domain attached prior to searching for a
> matching domain and simplifies the next patches.
> 
> It's unclear whether we want to further optimize the Intel driver to
> update the domain coherency after a device is detached from it, at
> least not before KVM can be verified to handle such dynamics in related
> emulation paths (wbinvd, vcpu load, write_cr0, ept, etc.). In reality
> we don't see a usage requiring such optimization, as the only device
> which imposes such non-coherency is the Intel GPU, which doesn't even
> support hotplug/hot-remove.
> 
> Signed-off-by: Jason Gunthorpe 
> Signed-off-by: Nicolin Chen 

Reviewed-by: Kevin Tian 


RE: [PATCH v2 1/5] iommu: Return -EMEDIUMTYPE for incompatible domain and device/group

2022-06-16 Thread Tian, Kevin
> From: Nicolin Chen 
> Sent: Thursday, June 16, 2022 8:03 AM
> 
> Cases like VFIO wish to attach a device to an existing domain that was
> not allocated specifically from the device. This raises a condition
> where the IOMMU driver can fail the domain attach because the domain and
> device are incompatible with each other.
> 
> This is a soft failure that can be resolved by using a different domain.
> 
> Provide a dedicated errno from the IOMMU driver during attach to indicate
> that the attach failed because of domain incompatibility. EMEDIUMTYPE is
> chosen because it is never used within the iommu subsystem today and
> evokes a sense that the 'medium', aka the domain, is incompatible.
> 
> VFIO can use this to know attach is a soft failure and it should continue
> searching. Otherwise the attach will be a hard failure and VFIO will
> return the code to userspace.
> 
> Update all drivers to return EMEDIUMTYPE in their failure paths that are
> related to domain incompatibility. Also turn adjacent error prints into
> debug prints, for these soft failures, to prevent a kernel log spam.
> 
> Add kdocs describing this behavior.
> 
> Suggested-by: Jason Gunthorpe 
> Signed-off-by: Nicolin Chen 

Reviewed-by: Kevin Tian 


RE: [PATCH v2 4/5] vfio/iommu_type1: Clean up update_dirty_scope in detach_group()

2022-06-16 Thread Tian, Kevin
> From: Nicolin Chen
> Sent: Thursday, June 16, 2022 8:03 AM
> 
> All devices in emulated_iommu_groups have pinned_page_dirty_scope
> set, so the update_dirty_scope in the first list_for_each_entry
> is always false. Clean it up, and move the "if update_dirty_scope"
> part from the detach_group_done routine to the domain_list part.
> 
> Rename the "detach_group_done" goto label accordingly.
> 
> Suggested-by: Jason Gunthorpe 
> Signed-off-by: Nicolin Chen 

Reviewed-by: Kevin Tian , with one nit:

> @@ -2469,7 +2467,7 @@ static void vfio_iommu_type1_detach_group(void
> *iommu_data,
>   WARN_ON(iommu->notifier.head);
>   vfio_iommu_unmap_unpin_all(iommu);
>   }
> - goto detach_group_done;
> + goto out_unlock;
...
> +out_unlock:
>   mutex_unlock(&iommu->lock);
>  }
> 

I'd just replace the goto with a direct unlock and return there;
the readability is slightly better.
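
i.e. (a sketch):

	mutex_unlock(&iommu->lock);
	return;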



RE: [PATCH v2 5/5] vfio/iommu_type1: Simplify group attachment

2022-06-16 Thread Tian, Kevin
> From: Nicolin Chen
> Sent: Thursday, June 16, 2022 8:03 AM
> 
> Un-inline the domain specific logic from the attach/detach_group ops into
> two paired functions vfio_iommu_alloc_attach_domain() and
> vfio_iommu_detach_destroy_domain() that strictly deal with creating and
> destroying struct vfio_domains.
> 
> Add the logic to check for EMEDIUMTYPE return code of
> iommu_attach_group()
> and avoid the extra domain allocations and attach/detach sequences of the
> old code. This allows properly detecting an actual attach error, like
> -ENOMEM, vs treating all attach errors as an incompatible domain.
> 
> Co-developed-by: Jason Gunthorpe 
> Signed-off-by: Jason Gunthorpe 
> Signed-off-by: Nicolin Chen 
> ---
>  drivers/vfio/vfio_iommu_type1.c | 298 +---
>  1 file changed, 163 insertions(+), 135 deletions(-)
> 

...
> +static struct vfio_domain *
> +vfio_iommu_alloc_attach_domain(struct bus_type *bus, struct vfio_iommu
> *iommu,
> +struct vfio_iommu_group *group)
> +{
> + struct iommu_domain *new_domain;
> + struct vfio_domain *domain;
> + int ret = 0;
> +
> + /* Try to match an existing compatible domain */
> + list_for_each_entry (domain, &iommu->domain_list, next) {
> + ret = iommu_attach_group(domain->domain, group->iommu_group);
> + if (ret == -EMEDIUMTYPE)
> + continue;

Probably good to add a one-line comment here for what EMEDIUMTYPE
represents. It's not a widely-used retry type like EAGAIN. A comment
can save readers the time of digging the fact out of the iommu file.
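
e.g. something like (the wording is only a suggestion):

	/*
	 * -EMEDIUMTYPE is a soft failure meaning this domain and
	 * the group are incompatible; keep trying the remaining
	 * domains instead of bailing out.
	 */
	if (ret == -EMEDIUMTYPE)
		continue;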

...
> - if (resv_msi) {
> + if (resv_msi && !domain->msi_cookie) {
>   ret = iommu_get_msi_cookie(domain->domain, resv_msi_base);
>   if (ret && ret != -ENODEV)
>   goto out_detach;
> + domain->msi_cookie = true;
>   }

Why not move this to alloc_attach_domain()? Then there is no need for
the new domain field; it's required only when a new domain is allocated.

...
> - if (list_empty(&domain->group_list)) {
> - if (list_is_singular(&iommu->domain_list)) {
> - if (list_empty(&iommu->emulated_iommu_groups)) {
> - WARN_ON(iommu->notifier.head);
> - vfio_iommu_unmap_unpin_all(iommu);
> - } else {
> - vfio_iommu_unmap_unpin_reaccount(iommu);
> - }
> - }
> - iommu_domain_free(domain->domain);
> - list_del(&domain->next);
> - kfree(domain);
> - vfio_iommu_aper_expand(iommu, &iova_copy);

Previously the aperture was adjusted only when a domain was freed...

> - vfio_update_pgsize_bitmap(iommu);
> - }
> - /*
> -  * Removal of a group without dirty tracking may allow
> -  * the iommu scope to be promoted.
> -  */
> - if (!group->pinned_page_dirty_scope) {
> - iommu->num_non_pinned_groups--;
> - if (iommu->dirty_page_tracking)
> - vfio_iommu_populate_bitmap_full(iommu);
> - }
> + vfio_iommu_detach_destroy_domain(domain, iommu, group);
>   kfree(group);
>   break;
>   }
> 
> + vfio_iommu_aper_expand(iommu, &iova_copy);

but now it's done for every group detach. The aperture is decided by
the domain geometry, which is not affected by attached groups.



RE: [PATCH v2 3/5] vfio/iommu_type1: Remove the domain->ops comparison

2022-06-16 Thread Tian, Kevin
> From: Nicolin Chen 
> Sent: Thursday, June 16, 2022 8:03 AM
> 
> The domain->ops validation was added, as a precaution, for mixed-driver
> systems. However, at this moment only one iommu driver is possible. So
> remove it.

It's true on a physical platform. But I'm not sure whether a virtual
platform is allowed to include multiple IOMMUs, e.g. one virtio-iommu
alongside a virtual VT-d or a virtual SMMU. It might be clearer to claim
that (as Robin pointed out) there are plenty of more significant problems
than this to solve, instead of simply saying that only one iommu driver
is possible, given we don't have explicit code to reject such a
configuration.

> 
> Per discussion with Robin, in future when many can be permitted we will
> rely on the IOMMU core code to check the domain->ops:
> https://lore.kernel.org/linux-iommu/6575de6d-94ba-c427-5b1e-
> 967750ddf...@arm.com/
> 
> Signed-off-by: Nicolin Chen 

Apart from that,

Reviewed-by: Kevin Tian 

RE: [PATCH v2 5/5] vfio/iommu_type1: Simplify group attachment

2022-06-16 Thread Tian, Kevin
> From: Nicolin Chen 
> Sent: Friday, June 17, 2022 6:41 AM
> 
> > ...
> > > - if (resv_msi) {
> > > + if (resv_msi && !domain->msi_cookie) {
> > >   ret = iommu_get_msi_cookie(domain->domain, resv_msi_base);
> > >   if (ret && ret != -ENODEV)
> > >   goto out_detach;
> > > + domain->msi_cookie = true;
> > >   }
> >
> > why not moving to alloc_attach_domain() then no need for the new
> > domain field? It's required only when a new domain is allocated.
> 
> When reusing an existing domain that doesn't have an msi_cookie,
> we can do iommu_get_msi_cookie() if resv_msi is found. So it is
> not limited to a new domain.

Looks like the msi_cookie requirement is per-platform (currently only
for SMMU; see arm_smmu_get_resv_regions()). If there is no mixed case
then the above check is not required.

But let's hear whether Robin has a different thought here.

> 
> > ...
> > > - if (list_empty(&domain->group_list)) {
> > > - if (list_is_singular(&iommu->domain_list)) {
> > > - if (list_empty(&iommu->emulated_iommu_groups)) {
> > > - WARN_ON(iommu->notifier.head);
> > > - vfio_iommu_unmap_unpin_all(iommu);
> > > - } else {
> > > - vfio_iommu_unmap_unpin_reaccount(iommu);
> > > - }
> > > - }
> > > - iommu_domain_free(domain->domain);
> > > - list_del(&domain->next);
> > > - kfree(domain);
> > > - vfio_iommu_aper_expand(iommu, &iova_copy);
> >
> > Previously the aperture was adjusted only when a domain was freed...
> >
> > > - vfio_update_pgsize_bitmap(iommu);
> > > - }
> > > - /*
> > > -  * Removal of a group without dirty tracking may allow
> > > -  * the iommu scope to be promoted.
> > > -  */
> > > - if (!group->pinned_page_dirty_scope) {
> > > - iommu->num_non_pinned_groups--;
> > > - if (iommu->dirty_page_tracking)
> > > - vfio_iommu_populate_bitmap_full(iommu);
> > > - }
> > > + vfio_iommu_detach_destroy_domain(domain, iommu, group);
> > >   kfree(group);
> > >   break;
> > >   }
> > >
> > > + vfio_iommu_aper_expand(iommu, &iova_copy);
> >
> > but now it's done for every group detach. The aperture is decided by
> > the domain geometry, which is not affected by attached groups.
> 
> Yea, I've noticed this part. Actually Jason did this change for
> simplicity, and I think it'd be safe to do so?

Perhaps detach_destroy() can return a Boolean to indicate whether
a domain is destroyed.
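
Something like (a sketch; the function is from this patch, the bool
return is the suggestion):

	/* returns true if the vfio_domain was actually freed */
	if (vfio_iommu_detach_destroy_domain(domain, iommu, group))
		vfio_iommu_aper_expand(iommu, &iova_copy);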


RE: [PATCH 3/5] vfio/iommu_type1: Prefer to reuse domains vs match enforced cache coherency

2022-06-15 Thread Tian, Kevin
> From: Nicolin Chen 
> Sent: Wednesday, June 15, 2022 4:45 AM
> 
> Hi Kevin,
> 
> On Wed, Jun 08, 2022 at 11:48:27PM +, Tian, Kevin wrote:
> > > > > The KVM mechanism for controlling wbinvd is only triggered during
> > > > > kvm_vfio_group_add(), meaning it is a one-shot test done once the
> > > devices
> > > > > are setup.
> > > >
> > > > It's not one-shot. kvm_vfio_update_coherency() is called in both
> > > > group_add() and group_del(). Then the coherency property is
> > > > checked dynamically in wbinvd emulation:
> > >
> > > From the perspective of managing the domains that is still
> > > one-shot. It doesn't get updated when individual devices are
> > > added/removed to domains.
> >
> > It's unchanged per-domain but dynamic per-VM when multiple
> > domains are added/removed (i.e. kvm->arch.noncoherent_dma_count).
> > It's the latter that is checked in KVM.
> 
> I am going to send a v2, yet not quite getting the point here.
> Meanwhile, Jason is on leave.
> 
> What, in your opinion, would be an accurate description here?
> 

Something like below:
--
The KVM mechanism for controlling wbinvd is based on OR of
the coherency property of all devices attached to a guest, no matter
those devices  are attached to a single domain or multiple domains.

So, there is no value in trying to push a device that could do enforced
cache coherency to a dedicated domain vs re-using an existing domain
which is non-coherent since KVM won't be able to take advantage of it. 
This just wastes domain memory.

Simplify this code and eliminate the test. This removes the only logic
that needed to have a dummy domain attached prior to searching for a
matching domain and simplifies the next patches.

It's unclear whether we want to further optimize the Intel driver to
update the domain coherency after a device is detached from it, at
least not before KVM can be verified to handle such dynamics in related
emulation paths (wbinvd, vcpu load, write_cr0, ept, etc.). In reality
we don't see a usage requiring such optimization, as the only device
which imposes such non-coherency is the Intel GPU, which doesn't even
support hotplug/hot-remove.
--

Thanks
Kevin


RE: [PATCH v3 5/6] iommu: Use EINVAL for incompatible device/domain in ->attach_dev

2022-09-21 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Wednesday, September 21, 2022 2:07 AM
> 
> On Tue, Sep 20, 2022 at 06:38:18AM +, Tian, Kevin wrote:
> 
> > The above lacks a conversion in intel-iommu:
> >
> > intel_iommu_attach_device()
> > if (domain->type == IOMMU_DOMAIN_UNMANAGED &&
> > device_is_rmrr_locked(dev)) {
> dev_warn(dev, "Device is ineligible for IOMMU domain attach due to platform RMRR requirement.  Contact your platform vendor.\n");
> > return -EPERM;
> > }
> >
> > since it's based on the domain type, picking a different domain
> > may work in theory, though it won't apply to VFIO, which always
> > creates the unmanaged type.
> 
> IMHO this test shouldn't even be here, that is why it is so
> strange..
> 
> VFIO should be checking if somehow the device doesn't support
> unmanaged domains at all. We already have several drivers that can't
> support full featured unamanged domains that vfio needs, this is just
> another case of that.
> 
> But it isn't urgent to fix, I would just ignore this branch for this
> series.
> 

OK, then let's leave it as is. Anyway this is a rare path. In reality
most RMRR devices are GPU/USB, which are already exempted
from the above check.


RE: [PATCH v3 5/6] iommu: Use EINVAL for incompatible device/domain in ->attach_dev

2022-09-20 Thread Tian, Kevin
> From: Nicolin Chen 
> Sent: Thursday, September 15, 2022 3:54 PM
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index 1f2cd43cf9bc..51ef42b1bd4e 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -4158,19 +4158,15 @@ static int prepare_domain_attach_device(struct
> iommu_domain *domain,
>   return -ENODEV;
> 
>   if (dmar_domain->force_snooping && !ecap_sc_support(iommu->ecap))
> - return -EOPNOTSUPP;
> + return -EINVAL;
> 
>   /* check if this iommu agaw is sufficient for max mapped address */
>   addr_width = agaw_to_width(iommu->agaw);
>   if (addr_width > cap_mgaw(iommu->cap))
>   addr_width = cap_mgaw(iommu->cap);
> 
> - if (dmar_domain->max_addr > (1LL << addr_width)) {
> - dev_err(dev, "%s: iommu width (%d) is not "
> - "sufficient for the mapped address (%llx)\n",
> - __func__, addr_width, dmar_domain->max_addr);
> - return -EFAULT;
> - }
> + if (dmar_domain->max_addr > (1LL << addr_width))
> + return -EINVAL;
>   dmar_domain->gaw = addr_width;
> 
>   /*

The above lacks a conversion in intel-iommu:

intel_iommu_attach_device()
if (domain->type == IOMMU_DOMAIN_UNMANAGED &&
device_is_rmrr_locked(dev)) {
dev_warn(dev, "Device is ineligible for IOMMU domain attach due to platform RMRR requirement.  Contact your platform vendor.\n");
return -EPERM;
}

since it's based on the domain type, picking a different domain
may work in theory, though it won't apply to VFIO, which always
creates the unmanaged type.


RE: [PATCH v3 3/6] iommu: Add return value rules to attach_dev op and APIs

2022-09-20 Thread Tian, Kevin
> From: Nicolin Chen 
> Sent: Thursday, September 15, 2022 3:54 PM
> 
> +/**
> + * iommu_attach_device - Attach a device to an IOMMU domain
> + * @domain: IOMMU domain to attach
> + * @dev: Device that will be attached
> + *
> + * Returns 0 on success and error code on failure
> + *
> + * Note that EINVAL may be returned as a soft failure if the domain and
> device
> + * are incompatible: if the domain has already been used or configured in
> some

I didn't get the meaning of the 'if' part.

> + * way, attaching the same device to a different domain may succeed.
> Otherwise,
> + * it may still represent some fundamental problem.

I'm not sure what the sentence after 'otherwise' actually adds for the
caller. There is no way to differentiate an incompatibility from a
fundamental problem, hence it's pointless for the caller to know this fact.

IMHO just state that the caller can treat -EINVAL as a soft failure
indicating an incompatibility issue between the domain and the device.

Later, for @attach_dev, you can add that a driver may return (though it
is not recommended) -EINVAL for some fundamental problems.
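
i.e. roughly (suggested wording only):

	/*
	 * Returns 0 on success and error code on failure.
	 *
	 * Note that EINVAL can be treated as a soft failure, indicating
	 * that certain configuration of the domain is incompatible with
	 * the device. In this case attaching a different domain to the
	 * device may succeed.
	 */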



RE: [PATCH v3 6/6] iommu: Propagate ret for a potential soft failure EINVAL

2022-09-20 Thread Tian, Kevin
> From: Nicolin Chen 
> Sent: Thursday, September 15, 2022 3:59 PM
> 
> Following the new rules in include/linux/iommu.h kdocs, EINVAL now can be
> used to indicate that domain and device are incompatible by a caller that
> treats it as a soft failure and tries attaching to another domain.
> 
> Both the mtk_iommu and virtio drivers have a place that returns a hard
> failure instead of the return value from the function call, where an
> incompatibility errno EINVAL could potentially occur.

In both cases there is no EINVAL returned from the calling stack.

IMHO error propagation is the right way even without talking about
EINVAL; otherwise we may miss ENOMEM etc.

> 
> Propagate the real return value to not miss a potential soft failure.
> 
> Signed-off-by: Nicolin Chen 

Apart from that comment,

Reviewed-by: Kevin Tian 


RE: [PATCH v6 1/5] iommu: Return -EMEDIUMTYPE for incompatible domain and device/group

2022-09-12 Thread Tian, Kevin
> From: Nicolin Chen 
> Sent: Sunday, September 11, 2022 7:36 AM
> 
> On Thu, Sep 08, 2022 at 09:08:38AM -0300, Jason Gunthorpe wrote:
> > On Thu, Sep 08, 2022 at 09:30:57AM +, Tian, Kevin wrote:
> 
> > > There are also cases where common kAPIs are called in the attach
> > > path which may return -EINVAL and random errno, e.g.:
> > >
> > > omap_iommu_attach_dev()
> > >   omap_iommu_attach()
> > > iommu_enable()
> > >   pm_runtime_get_sync()
> > > __pm_runtime_resume()
> > >   rpm_resume()
> > >   if (dev->power.runtime_error) {
> > >   retval = -EINVAL;
> 
> > Yes, this is was also on my mind with choosing an unpopular return
> > code, it has a higher chance of not coming out of some other kernel
> > API
> 
> I wonder if it would be safe to just treat a pm_runtime_get_sync()
> failure as -ENODEV, since a PM resume() mostly occurs to the IOMMU
> that an IOMMU client/master device is behind, while an iommu_domain
> rarely intervenes.

Yes, this is a condition preventing the device from being attached to by
any domain, hence converting -EINVAL to -ENODEV probably makes sense.
But as replied in another thread we might want to keep other errnos as is.


RE: [PATCH v6 1/5] iommu: Return -EMEDIUMTYPE for incompatible domain and device/group

2022-09-12 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Friday, September 9, 2022 8:08 PM
> 
> 
> > As discussed in a side thread a note might be added to exempt calling
> > kAPI outside of the iommu driver.
> 
> Sadly, not really.. The driver is responsible to santize this if it is
> relevant. It is the main downside of this approach.
> 

Better to provide a clarification of what sanitization means.

e.g. I don't think we should change the errno in those kAPIs to match the
definition in the iommu subsystem, since e.g. -EINVAL really means
different things in different contexts.

So the sanitization in the iommu driver is probably that:

  - If an external kAPI returns -EINVAL, convert it to -ENODEV, given an
    iommu domain is an iommu-internal object and hence it's unlikely for
    external kAPIs to capture an incompatibility issue between domain and
    device;
  - Otherwise just pass whatever is returned to the caller, following the
    definition of "Same behavior as -ENODEV" above.

Thanks
Kevin


RE: [PATCH v6 1/5] iommu: Return -EMEDIUMTYPE for incompatible domain and device/group

2022-09-08 Thread Tian, Kevin
> From: Tian, Kevin
> Sent: Thursday, September 8, 2022 5:31 PM
> > This mixture of error codes is the basic reason why a new code was
> > used, because none of the existing codes are used with any
> > consistency.
> 
> btw I saw the policy for -EBUSY is also not consistent in this series.
> 
> while it's correct to change -EBUSY to -EMEDIUMTYPE for omap, assuming
> that retrying another fresh domain for the said device should work:
> 
>   if (omap_domain->dev) {
> - dev_err(dev, "iommu domain is already attached\n");
> - ret = -EBUSY;
> + ret = -EMEDIUMTYPE;
>   goto out;
>   }
> 
> the change in tegra-gart doesn't sound correct:
> 
>   if (gart->active_domain && gart->active_domain != domain) {
> - ret = -EBUSY;
> + ret = -EMEDIUMTYPE;
> 
> one device cannot be attached to two domains. This fact doesn't change
> no matter how many domains are tried. In concept this check is
> redundant and should have been done by iommu core, but obviously we
> didn't pay attention to what -EBUSY actually represents in this path.
> 

Oops. The above is actually a valid retry condition. gart is the iommu,
not the device, so in concept retrying gart->active_domain for the
device could work.

So please ignore this comment.


RE: [PATCH v6 1/5] iommu: Return -EMEDIUMTYPE for incompatible domain and device/group

2022-09-08 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Thursday, September 8, 2022 8:43 AM
> 
> On Wed, Sep 07, 2022 at 08:41:13PM +0100, Robin Murphy wrote:
> 
> > Again, not what I was suggesting. In fact the nature of
> iommu_attach_group()
> > already rules out bogus devices getting this far, so all a driver currently
> > has to worry about is compatibility of a device that it definitely probed
> > with a domain that it definitely allocated. Therefore, from a caller's point
> > of view, if attaching to an existing domain returns -EINVAL, try another
> > domain; multiple different existing domains can be tried, and may also
> > return -EINVAL for the same or different reasons; the final attempt is to
> > allocate a fresh domain and attach to that, which should always be
> nominally
> > valid and *never* return -EINVAL. If any attempt returns any other error,
> > bail out down the usual "this should have worked but something went
> wrong"
> > path. Even if any driver did have a nonsensical "nothing went wrong, I just
> > can't attach my device to any of my domains" case, I don't think it would
> > really need distinguishing from any other general error anyway.
> 
> The algorithm you described is exactly what this series does, it just
> used EMEDIUMTYPE instead of EINVAL. Changing it to EINVAL is not a
> fundamental problem, just a bit more work.
> 
> Looking at Nicolin's series there is a bunch of existing errnos that
> would still need converting, ie EXDEV, EBUSY, EOPNOTSUPP, EFAULT, and
> ENXIO are all returned as codes for 'domain incompatible with device'
> in various drivers. So the patch would still look much the same, just
> changing them to EINVAL instead of EMEDIUMTYPE.
> 
> That leaves the question of the remaining EINVAL's that Nicolin did
> not convert to EMEDIUMTYPE.
> 
> eg in the AMD driver:
> 
>   if (!check_device(dev))
>   return -EINVAL;
> 
>   iommu = rlookup_amd_iommu(dev);
>   if (!iommu)
>   return -EINVAL;
> 
> These are all cases of 'something is really wrong with the device or
> iommu, everything will fail'. Other drivers are using ENODEV for this
> already, so we'd probably have an additional patch changing various
> places like that to ENODEV.
> 
> This mixture of error codes is the basic reason why a new code was
> used, because none of the existing codes are used with any
> consistency.

btw I saw the policy for -EBUSY is also not consistent in this series.

while it's correct to change -EBUSY to -EMEDIUMTYPE for omap, assuming
that retrying another fresh domain for the said device should work:

if (omap_domain->dev) {
-   dev_err(dev, "iommu domain is already attached\n");
-   ret = -EBUSY;
+   ret = -EMEDIUMTYPE;
goto out;
}

the change in tegra-gart doesn't sound correct:

if (gart->active_domain && gart->active_domain != domain) {
-   ret = -EBUSY;
+   ret = -EMEDIUMTYPE;

one device cannot be attached to two domains. This fact doesn't change
no matter how many domains are tried. In concept this check is
redundant and should have been done by iommu core, but obviously we
didn't pay attention to what -EBUSY actually represents in this path.

> 
> But OK, I'm on board, lets use more common errnos with specific
> meaning, that can be documented in a comment someplace:
>  ENOMEM - out of memory
>  ENODEV - no domain can attach, device or iommu is messed up
>  EINVAL - the domain is incompatible with the device
>   - Same behavior as ENODEV, use is discouraged.

There are also cases where common kAPIs are called in the attach
path which may return -EINVAL and random errno, e.g.:

omap_iommu_attach_dev()
  omap_iommu_attach()
iommu_enable()
  pm_runtime_get_sync()
__pm_runtime_resume()
  rpm_resume()
if (dev->power.runtime_error) {
retval = -EINVAL;

viommu_attach_dev()
  viommu_domain_finalise()
ida_alloc_range()
if ((int)min < 0)
return -ENOSPC;

> 
> I think achieving consistency of error codes is a generally desirable
> goal, it makes the error code actually useful.
> 
> Joerg this is a good bit of work, will you be OK with it?
> 
> > Thus as long as we can maintain that basic guarantee that attaching
> > a group to a newly allocated domain can only ever fail for resource
> > allocation reasons and not some spurious "incompatibility", then we
> > don't need any obscure trickery, and a single, clear, error code is
> > in fact enough to say all that needs to be said.
> 
> As above, this is not the case, drivers do seem to have error paths
> that are unconditional on the domain. Perhaps they are just protective
> assertions and never happen.
> 
> Regardless, it doesn't matter. If they return ENODEV or EINVAL the
> VFIO side algorithm will continue to work fine, it just does alot more
> work if EINVAL is permanently returned.
> 

I don't see an elegant option here.

If we care 

RE: [PATCH v6 1/5] iommu: Return -EMEDIUMTYPE for incompatible domain and device/group

2022-09-08 Thread Tian, Kevin
> From: Nicolin Chen 
> Sent: Friday, September 9, 2022 11:17 AM
> 
> On Thu, Sep 08, 2022 at 01:14:42PM -0300, Jason Gunthorpe wrote:
> 
> > > I am wondering if this can be solved by better defining what the return
> > > codes mean and adjust the call-back functions to match the definition.
> > > Something like:
> > >
> > >   -ENODEV : Device not mapped my an IOMMU
> > >   -EBUSY  : Device attached and domain can not be changed
> > >   -EINVAL : Device and domain are incompatible
> > >   ...
> >
> > Yes, this was gone over in a side thread the pros/cons, so lets do
> > it. Nicolin will come with something along these lines.
> 
> I have started this effort by combining this list and the one from
> the side thread:
> 
> @@ -266,6 +266,13 @@ struct iommu_ops {
>  /**
>   * struct iommu_domain_ops - domain specific operations
>   * @attach_dev: attach an iommu domain to a device
> + *  Rules of its return errno:
> + *   ENOMEM  - Out of memory
> + *   EINVAL  - Device and domain are incompatible
> + *   EBUSY   - Device is attached to a domain and cannot be changed

With this definition, @attach_dev probably should not return -EBUSY
at all, given it's already checked at the start of __iommu_attach_group():

if (group->domain && group->domain != group->default_domain &&
group->domain != group->blocking_domain)
return -EBUSY;

> + *   ENODEV  - Device or domain is messed up: device is not mapped
> + * to an IOMMU, no domain can attach, and etc.

If the domain is messed up then it should return -EINVAL, given using
another domain might just work. IMHO here -ENODEV should only cover
device-specific problems preventing this device from being attached to
by any domain.

> + *   - Same behavior as ENODEV, use is discouraged

I didn't get the "Same behavior" part. Does it suggest all other errnos
should be converted to ENODEV?

BTW what about -ENOSPC? It's sane to allocate some resource in the attach
path, and the resource might not be available, e.g.:

intel_iommu_attach_device()
  domain_add_dev_info()
domain_attach_iommu():

int num, ret = -ENOSPC;
...
ndomains = cap_ndoms(iommu->cap);
num = find_first_zero_bit(iommu->domain_ids, ndomains);
if (num >= ndomains) {
pr_err("%s: No free domain ids\n", iommu->name);
goto err_unlock;
}

As discussed in a side thread, a note might be added to exempt calls
into kAPIs outside of the iommu driver.

>   * @detach_dev: detach an iommu domain from a device
>   * @map: map a physically contiguous memory region to an iommu domain
>   * @map_pages: map a physically contiguous set of pages of the same size
> to
> 
> I am now going through every single return value of ->attach_dev to
> make sure the list above applies. And I will also incorporate things
> like Robin's comments at the AMD IOMMU driver.
> 
> And if the change occurs to be bigger, I guess that separating it to
> be an IOMMU series from this VFIO one might be better.
> 
> Thanks
> Nic


RE: [PATCH v4 5/6] iommu: Use EINVAL for incompatible device/domain in ->attach_dev

2022-09-22 Thread Tian, Kevin
> From: Nicolin Chen 
> Sent: Wednesday, September 21, 2022 4:24 PM
> 
> Following the new rules in include/linux/iommu.h kdocs, update all drivers
> ->attach_dev callback functions to return EINVAL in the failure paths that
> are related to domain incompatibility.
> 
> Also, drop adjacent error prints to prevent a kernel log spam.
> 
> Signed-off-by: Nicolin Chen 

Reviewed-by: Kevin Tian 


RE: [PATCH v4 3/6] iommu: Add return value rules to attach_dev op and APIs

2022-09-22 Thread Tian, Kevin
> From: Nicolin Chen 
> Sent: Wednesday, September 21, 2022 4:23 PM
> 
> 
> +/**
> + * iommu_attach_device - Attach a device to an IOMMU domain
> + * @domain: IOMMU domain to attach
> + * @dev: Device that will be attached
> + *
> + * Returns 0 on success and error code on failure
> + *
> + * Note that EINVAL may be returned as a soft failure if the domain and
> device
> + * are incompatible due to some previous configuration of the domain, in
> which
> + * case attaching the same device to a different domain may succeed.

Revise a bit:

 * Note that EINVAL can be treated as a soft failure, indicating
 * that certain configuration of the domain is incompatible with
 * the device. In this case attaching a different domain to the
 * device may succeed.

Apart from that:

Reviewed-by: Kevin Tian 


RE: [PATCH 7/8] iommu/intel: Support the gfp argument to the map_pages op

2023-01-16 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Saturday, January 7, 2023 12:43 AM
> 
> @@ -2368,7 +2372,7 @@ static int iommu_domain_identity_map(struct
> dmar_domain *domain,
> 
>   return __domain_mapping(domain, first_vpfn,
>   first_vpfn, last_vpfn - first_vpfn + 1,
> - DMA_PTE_READ|DMA_PTE_WRITE);
> + DMA_PTE_READ|DMA_PTE_WRITE, GFP_KERNEL);
>  }

Baolu, can you help confirm whether switching from GFP_ATOMIC to
GFP_KERNEL is OK in this path? It looks fine to me at a quick glance,
but I want to be conservative here.

> @@ -4333,7 +4337,8 @@ static size_t intel_iommu_unmap(struct
> iommu_domain *domain,
> 
>   /* Cope with horrid API which requires us to unmap more than the
>  size argument if it happens to be a large-page mapping. */
> - BUG_ON(!pfn_to_dma_pte(dmar_domain, iova >> VTD_PAGE_SHIFT, &level));
> + BUG_ON(!pfn_to_dma_pte(dmar_domain, iova >> VTD_PAGE_SHIFT, &level,
> +GFP_ATOMIC));

With level == 0 it implies a lookup only, without pgtable allocation.
From this angle it reads better to use a more relaxed gfp, e.g.
GFP_KERNEL, here.

> @@ -4392,7 +4397,8 @@ static phys_addr_t
> intel_iommu_iova_to_phys(struct iommu_domain *domain,
>   int level = 0;
>   u64 phys = 0;
> 
> - pte = pfn_to_dma_pte(dmar_domain, iova >> VTD_PAGE_SHIFT, &level);
> + pte = pfn_to_dma_pte(dmar_domain, iova >> VTD_PAGE_SHIFT, &level,
> +  GFP_ATOMIC);

ditto


RE: [PATCH 6/8] iommu/intel: Add a gfp parameter to alloc_pgtable_page()

2023-01-16 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Saturday, January 7, 2023 12:43 AM
> 
> @@ -2676,7 +2676,7 @@ static int copy_context_table(struct intel_iommu
> *iommu,
>   if (!old_ce)
>   goto out;
> 
> - new_ce = alloc_pgtable_page(iommu->node);
> + new_ce = alloc_pgtable_page(iommu->node, GFP_KERNEL);

GFP_ATOMIC


RE: [PATCH 6/8] iommu/intel: Add a gfp parameter to alloc_pgtable_page()

2023-01-17 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Tuesday, January 17, 2023 9:30 PM
> 
> On Tue, Jan 17, 2023 at 03:35:08AM +, Tian, Kevin wrote:
> > > From: Jason Gunthorpe 
> > > Sent: Saturday, January 7, 2023 12:43 AM
> > >
> > > @@ -2676,7 +2676,7 @@ static int copy_context_table(struct
> intel_iommu
> > > *iommu,
> > >   if (!old_ce)
> > >   goto out;
> > >
> > > - new_ce = alloc_pgtable_page(iommu->node);
> > > + new_ce = alloc_pgtable_page(iommu->node, GFP_KERNEL);
> >
> > GFP_ATOMIC
> 
> Can't be:
> 
>   old_ce = memremap(old_ce_phys, PAGE_SIZE,
>   MEMREMAP_WB);
>   if (!old_ce)
>   goto out;
> 
>   new_ce = alloc_pgtable_page(iommu->node, GFP_KERNEL);
>   if (!new_ce)
> 
> memremap() is sleeping.
> 
> And the only caller is:
> 
>   ctxt_tbls = kcalloc(ctxt_table_entries, sizeof(void *), GFP_KERNEL);
>   if (!ctxt_tbls)
>   goto out_unmap;
> 
>   for (bus = 0; bus < 256; bus++) {
>   ret = copy_context_table(iommu, &old_rt[bus], ctxt_tbls, bus, ext);
> 

Yes, but the patch description says "Push the GFP_ATOMIC to all
callers.", implying it's purely a refactoring without changing those
semantics.

I'm fine with doing this change in this patch, but it would be worth
a clarification in the patch description.


RE: [PATCH v2 02/10] iommu: Remove iommu_map_atomic()

2023-01-18 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Thursday, January 19, 2023 2:01 AM
> 
> There is only one call site and it can now just pass the GFP_ATOMIC to the
> normal iommu_map().
> 
> Signed-off-by: Jason Gunthorpe 

Reviewed-by: Kevin Tian 


RE: [PATCH v2 01/10] iommu: Add a gfp parameter to iommu_map()

2023-01-18 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Thursday, January 19, 2023 2:01 AM
> 
> The internal mechanisms support this, but instead of exposing the gfp to
> the caller it wraps it in iommu_map() and iommu_map_atomic().
> 
> Fix this instead of adding more variants for GFP_KERNEL_ACCOUNT.
> 
> Signed-off-by: Jason Gunthorpe 

Reviewed-by: Kevin Tian 


RE: [PATCH v2 03/10] iommu: Add a gfp parameter to iommu_map_sg()

2023-01-18 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Thursday, January 19, 2023 2:01 AM
> 
> Follow the pattern for iommu_map() and remove iommu_map_sg_atomic().
> 
> This allows __iommu_dma_alloc_noncontiguous() to use a GFP_KERNEL
> allocation here, based on the provided gfp flags.
> 
> Signed-off-by: Jason Gunthorpe 

Reviewed-by: Kevin Tian 


RE: [PATCH v2 07/10] iommu/intel: Support the gfp argument to the map_pages op

2023-01-18 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Thursday, January 19, 2023 2:01 AM
> 
> Flow it down to alloc_pgtable_page() via pfn_to_dma_pte() and
> __domain_mapping().
> 
> Signed-off-by: Jason Gunthorpe 

Reviewed-by: Kevin Tian 


RE: [PATCH v2 04/10] iommu/dma: Use the gfp parameter in __iommu_dma_alloc_noncontiguous()

2023-01-18 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Thursday, January 19, 2023 2:01 AM
> 
> Change the sg_alloc_table_from_pages() allocation that was hardwired to
> GFP_KERNEL to use the gfp parameter like the other allocations in this
> function.
> 
> Auditing says this is never called from an atomic context, so it is safe
> as is, but reads wrong.
> 
> Signed-off-by: Jason Gunthorpe 

Reviewed-by: Kevin Tian 


RE: [PATCH v2 05/10] iommufd: Use GFP_KERNEL_ACCOUNT for iommu_map()

2023-01-18 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Thursday, January 19, 2023 2:01 AM
> 
> iommufd follows the same design as KVM and uses memory cgroups to limit
> the amount of kernel memory a iommufd file descriptor can pin down. The
> various internal data structures already use GFP_KERNEL_ACCOUNT.
> 
> However, one of the biggest consumers of kernel memory is the IOPTEs
> stored under the iommu_domain. Many drivers will allocate these at
> iommu_map() time and will trivially do the right thing if we pass in
> GFP_KERNEL_ACCOUNT.
> 
> Signed-off-by: Jason Gunthorpe 

Reviewed-by: Kevin Tian 


RE: [PATCH v2 06/10] iommu/intel: Add a gfp parameter to alloc_pgtable_page()

2023-01-18 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Thursday, January 19, 2023 2:01 AM
> 
> This is eventually called by iommufd through intel_iommu_map_pages() and
> it should not be forced to atomic. Push the GFP_ATOMIC to all callers.
> 
> Signed-off-by: Jason Gunthorpe 

Reviewed-by: Kevin Tian 


RE: [PATCH v2 08/10] iommu/intel: Use GFP_KERNEL in sleepable contexts

2023-01-18 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Thursday, January 19, 2023 2:01 AM
> 
> These contexts are sleepable, so use the proper annotation. The
> GFP_ATOMIC
> was added mechanically in the prior patches.
> 
> Signed-off-by: Jason Gunthorpe 

Reviewed-by: Kevin Tian 


RE: [PATCH v3 5/8] iommufd: Associate fault object with iommufd_hw_pgtable

2024-02-07 Thread Tian, Kevin
> From: Lu Baolu 
> Sent: Monday, January 22, 2024 3:39 PM
> 
> +
> +int iommufd_fault_iopf_handler(struct iopf_group *group)
> +{
> + struct iommufd_hw_pagetable *hwpt = group->cookie->domain->fault_data;
> + struct iommufd_fault *fault = hwpt->fault;
> +

Why not directly use iommufd_fault as the fault_data?
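
i.e. (a sketch):

	/* at hwpt setup time, store the iommufd_fault directly */
	hwpt->domain->fault_data = hwpt->fault;

	/* ... then the handler needs no hwpt dereference */
	struct iommufd_fault *fault = group->cookie->domain->fault_data;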



RE: [PATCH v3 1/8] iommu: Add iopf domain attach/detach/replace interface

2024-02-07 Thread Tian, Kevin
> From: Lu Baolu 
> Sent: Monday, January 22, 2024 3:39 PM
> 
> There is a slight difference between iopf domains and non-iopf domains.
> In the latter, references to domains occur between attach and detach;
> While in the former, due to the existence of asynchronous iopf handling
> paths, references to the domain may occur after detach, which leads to
> potential UAF issues.

Does UAF still exist if iommu driver follows the guidance you just added
to iopf_queue_remove_device()?

it clearly says that the driver needs to disable IOMMU PRI reception,
remove device from iopf queue and disable PRI on the device.

presumably those are all about what needs to be done in the detach
operation. Then once detach completes there should be no more
reference to the domain from the iopf path?

> 
> +struct iopf_attach_cookie {
> + struct iommu_domain *domain;
> + struct device *dev;
> + unsigned int pasid;
> + refcount_t users;
> +
> + void *private;
> + void (*release)(struct iopf_attach_cookie *cookie);
> +};

This cookie has nothing specific to iopf.

It may make more sense to build a generic iommu_attach_device_cookie()
helper so the same object can be reused by other future usages too.

Within the iommu core it can check the domain's iopf handler and this
generic cookie to update iopf-specific data, e.g. the pasid_cookie xarray.
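
Roughly (a hypothetical shape; names are illustrative only):

	struct iommu_attach_cookie {
		struct iommu_domain *domain;
		struct device *dev;
		unsigned int pasid;
		void *private;
		void (*release)(struct iommu_attach_cookie *cookie);
	};

	/*
	 * Generic attach helper; the core does the iopf-specific setup
	 * itself when domain->iopf_handler is set.
	 */
	int iommu_attach_device_cookie(struct iommu_domain *domain,
				       struct device *dev,
				       struct iommu_attach_cookie *cookie);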



RE: [PATCH v3 1/8] iommu: Add iopf domain attach/detach/replace interface

2024-02-20 Thread Tian, Kevin
> From: Baolu Lu 
> Sent: Wednesday, February 21, 2024 3:21 PM
> 
> On 2024/2/21 14:49, Tian, Kevin wrote:
> >>>> +struct iopf_attach_cookie {
> >>>> +struct iommu_domain *domain;
> >>>> +struct device *dev;
> >>>> +unsigned int pasid;
> >>>> +refcount_t users;
> >>>> +
> >>>> +void *private;
> >>>> +void (*release)(struct iopf_attach_cookie *cookie);
> >>>> +};
> >>> this cookie has nothing specific to iopf.
> >>>
> >>> it may makes more sense to build a generic
> iommu_attach_device_cookie()
> >>> helper so the same object can be reused in future other usages too.
> >>>
> >>> within iommu core it can check domain iopf handler and this generic
> cookie
> >>> to update iopf specific data e.g. the pasid_cookie xarray.
> >> This means attaching an iopf-capable domain follows two steps:
> >>
> >> 1) Attaching the domain to the device.
> >> 2) Setting up the iopf data, necessary for handling iopf data.
> >>
> >> This creates a time window during which the iopf is enabled, but the
> >> software cannot handle it. Or not?
> >>
> > why two steps? in attach you can setup the iopf data when recognizing
> > that the domain is iopf capable...
> 
> Oh, maybe I misunderstood. So your proposal is to make the new interface
> generic, not for iopf only?
> 

yes.


RE: [PATCH v3 1/8] iommu: Add iopf domain attach/detach/replace interface

2024-02-20 Thread Tian, Kevin
> From: Baolu Lu 
> Sent: Wednesday, February 21, 2024 1:53 PM
> 
> On 2024/2/7 16:11, Tian, Kevin wrote:
> >> From: Lu Baolu 
> >> Sent: Monday, January 22, 2024 3:39 PM
> >>
> >> There is a slight difference between iopf domains and non-iopf domains.
> >> In the latter, references to domains occur between attach and detach;
> >> While in the former, due to the existence of asynchronous iopf handling
> >> paths, references to the domain may occur after detach, which leads to
> >> potential UAF issues.
> >
> > Does UAF still exist if iommu driver follows the guidance you just added
> > to iopf_queue_remove_device()?
> >
> > it clearly says that the driver needs to disable IOMMU PRI reception,
> > remove device from iopf queue and disable PRI on the device.
> 
> The iopf_queue_remove_device() function is only called after the last
> iopf-capable domain is detached from the device. It may not be called
> during domain replacement. Hence, there is no guarantee that
> iopf_queue_remove_device() will be called when a domain is detached from
> the device.

Oh yes. More accurately, even the last detach may not trigger it.

e.g. the idxd driver does it at device/driver unbind.

> 
> >
> > presumably those are all about what needs to be done in the detach
> > operation. Then once detach completes there should be no more
> > reference to the domain from the iopf path?
> 
> The domain pointer stored in the iopf_group structure is only released
> after the iopf response, possibly after the domain is detached from the
> device. Thus, the domain pointer can only be freed after the iopf
> response.

make sense.

> 
> >
> >>
> >> +struct iopf_attach_cookie {
> >> +  struct iommu_domain *domain;
> >> +  struct device *dev;
> >> +  unsigned int pasid;
> >> +  refcount_t users;
> >> +
> >> +  void *private;
> >> +  void (*release)(struct iopf_attach_cookie *cookie);
> >> +};
> >
> > this cookie has nothing specific to iopf.
> >
> > it may makes more sense to build a generic iommu_attach_device_cookie()
> > helper so the same object can be reused in future other usages too.
> >
> > within iommu core it can check domain iopf handler and this generic cookie
> > to update iopf specific data e.g. the pasid_cookie xarray.
> 
> This means attaching an iopf-capable domain follows two steps:
> 
> 1) Attaching the domain to the device.
> 2) Setting up the iopf data, necessary for handling iopf data.
> 
> This creates a time window during which the iopf is enabled, but the
> software cannot handle it. Or not?
> 

why two steps? in attach you can setup the iopf data when recognizing
that the domain is iopf capable...


RE: [PATCH v4 1/9] iommu: Introduce domain attachment handle

2024-04-09 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Wednesday, April 10, 2024 7:38 AM
> 
> On Tue, Apr 09, 2024 at 09:53:26AM +0800, Baolu Lu wrote:
> 
> >The current code base doesn't yet support PASID attach/detach/replace
> >uAPIs. Therefore, above code is safe and reasonable. However, we will
> >need to revisit this code when those APIs become available.
> 
> Okay, I see.
> 
> Can we do the PASID iommufd stuff soon? What is the plan here?
> 

Yi is working on that. He will send it out once the last open issue
about ida is handled.



RE: [PATCH v4 2/9] iommu: Replace sva_iommu with iommu_attach_handle

2024-04-28 Thread Tian, Kevin
> From: Baolu Lu 
> Sent: Sunday, April 28, 2024 6:22 PM
> 
> On 2024/4/10 7:48, Jason Gunthorpe wrote:
> > On Tue, Apr 09, 2024 at 10:11:28AM +0800, Baolu Lu wrote:
> >> On 4/8/24 10:19 PM, Jason Gunthorpe wrote:
> >>> On Sat, Apr 06, 2024 at 02:09:34PM +0800, Baolu Lu wrote:
>  On 4/3/24 7:59 PM, Jason Gunthorpe wrote:
> > On Wed, Apr 03, 2024 at 09:15:12AM +0800, Lu Baolu wrote:
> >> +  /* A bond already exists, just take a reference. */
> >> +  handle = iommu_attach_handle_get(group, iommu_mm->pasid);
> >> +  if (handle) {
> >> +  mutex_unlock(&iommu_sva_lock);
> >> +  return handle;
> >>}
> > At least in this context this is not enough we need to ensure that the
> > domain on the PASID is actually an SVA domain and it was installed by
> > this mechanism, not an iommufd domain for instance.
> >
> > ie you probably need a type field in the iommu_attach_handle to tell
> > what the priv is.
> >
> > Otherwise this seems like a great idea!
>  Yes, you are right. For the SVA case, I will add the following changes.
>  The IOMMUFD path will also need such enhancement. I will update it in
>  the next version.
> >>> The only use for this is the PRI callbacks right? Maybe instead of
> >>> adding a handle type let's just check domain->iopf_handler  ?
> >>>
> >>> Ie SVA will pass &iommu_sva_iopf_handler as its "type"
> >> Sorry that I don't fully understand the proposal here.
> > I was talking specifically about the type field you suggested adding
> > to the handle struct.
> >
> > Instead of adding a type field check the domain->iopf_handler to
> > determine the domain and thus handle type.
> >
> >> The problem is that the context code (SVA, IOMMUFD, etc.) needs to make
> >> sure that the attach handle is really what it has installed during
> >> domain attachment. The context code needs some mechanism to include some
> >> kind of "owner cookie" in the attach handle, so that it could check
> >> against it later for valid use.
> > Right, you have a derived struct for each user and you need a way to
> > check if casting from the general handle struct to the derived struct
> > is OK.
> >
> > I'm suggesting using domain->iopf_handle as the type key.
> 
> After removing the refcount from the attach handle, I am trying to make
> the code look like this,
> 
>  /* A bond already exists, just take a reference. */
>  handle = iommu_attach_handle_get(group, iommu_mm->pasid);
>  if (handle) {
>  if (handle->domain->iopf_handler != iommu_sva_iopf_handler) {
>  ret = -EBUSY;
>  goto out_unlock;
>  }
> 
>  refcount_inc(&handle->users);
>  mutex_unlock(&iommu_sva_lock);
>  return handle;
>  }
> 
> But it appears that this code is not lock safe. If the domain on the
> PASID is not a SVA domain, the check of "handle->domain->iopf_handler !=
> iommu_sva_iopf_handler" could result in a use-after-free issue as the
> other thread might detach the domain in between the fetch and check
> lines.
> 
> Probably we still need to keep the refcount in the attach handle?
> 

What about Jason's other comment in his original reply?

"
Though I'm not convinced the refcount should be elevated into the core
structure. The prior patch I showed you where the caller can provide
the memory for the handle and we don't have a priv would make it easy
to put the refcount in an SVA derived handle struct without more
allocation. Then we don't need this weirdness.
"

That sounds like we'll need an iommu_sva-like structure to hold
its own refcnt. Then we don't need the type check and refcnt
in the core.
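
e.g. something along these lines (a sketch only, assuming the core lets
the caller provide the handle memory as Jason suggested):

struct iommu_sva {
	struct iommu_attach_handle handle;
	struct device *dev;
	refcount_t users;
};

and the bind path, which knows it only ever installs this handle type
at the mm pasid, can do:

	/* valid only because SVA installed this handle itself */
	struct iommu_sva *bond = container_of(handle, struct iommu_sva,
					      handle);

	refcount_inc(&bond->users);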


RE: [PATCH v5 2/9] iommu: Replace sva_iommu with iommu_attach_handle

2024-05-15 Thread Tian, Kevin
> From: Lu Baolu 
> Sent: Tuesday, April 30, 2024 10:57 PM
> 
>  #else
> -static inline struct iommu_sva *
> +static inline struct iommu_attach_handle *
>  iommu_sva_bind_device(struct device *dev, struct mm_struct *mm)
>  {
> - return NULL;
> + return ERR_PTR(-ENODEV);
>  }
> 

this should be a separate fix.

existing drivers (idxd and uacce) only check IS_ERR() on the return
value. A NULL pointer may lead to an error being reported at a later point.
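
The call sites follow this IS_ERR() pattern (a sketch of the shape, not
the exact driver code), so a NULL return from the stub sails straight
through the error check:

	handle = iommu_sva_bind_device(dev, mm);
	if (IS_ERR(handle))
		return PTR_ERR(handle);

	/* a NULL handle reaches here and blows up much later */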



RE: [PATCH v5 4/9] iommufd: Add fault and response message definitions

2024-05-15 Thread Tian, Kevin
> From: Lu Baolu 
> Sent: Tuesday, April 30, 2024 10:57 PM
> 
> iommu_hwpt_pgfaults represent fault messages that the userspace can
> retrieve. Multiple iommu_hwpt_pgfaults might be put in an iopf group,
> with the IOMMU_PGFAULT_FLAGS_LAST_PAGE flag set only for the last
> iommu_hwpt_pgfault.

Do you envision extending the same structure to report unrecoverable
faults in the future?

If yes, this could be named more neutrally, e.g. iommu_hwpt_faults, with
flags to indicate that it's a recoverable PRI request.

If it's only for PRI probably iommu_hwpt_pgreqs is clearer.

> +
> +/**
> + * struct iommu_hwpt_pgfault - iommu page fault data
> + * @size: sizeof(struct iommu_hwpt_pgfault)
> + * @flags: Combination of enum iommu_hwpt_pgfault_flags
> + * @dev_id: id of the originated device
> + * @pasid: Process Address Space ID
> + * @grpid: Page Request Group Index
> + * @perm: Combination of enum iommu_hwpt_pgfault_perm
> + * @addr: Page address

'Fault address'

> + * @length: a hint of how much data the requestor is expecting to fetch. For
> + *  example, if the PRI initiator knows it is going to do a 10MB
> + *  transfer, it could fill in 10MB and the OS could pre-fault in
> + *  10MB of IOVA. It defaults to 0 if there's no such hint.

This is not clear to me, and I don't remember the PCIe spec defining
such a mechanism.

> +/**
> + * enum iommufd_page_response_code - Return status of fault handlers
> + * @IOMMUFD_PAGE_RESP_SUCCESS: Fault has been handled and the page tables
> + * populated, retry the access. This is the
> + * "Success" defined in PCI 10.4.2.1.
> + * @IOMMUFD_PAGE_RESP_INVALID: General error. Drop all subsequent faults
> + * from this device if possible. This is the
> + * "Response Failure" in PCI 10.4.2.1.
> + * @IOMMUFD_PAGE_RESP_FAILURE: Could not handle this fault, don't retry the
> + * access. This is the "Invalid Request" in PCI
> + * 10.4.2.1.

the comments for 'INVALID' and 'FAILURE' are misplaced. Also I'd rather
use the spec wording to be accurate.

> + */
> +enum iommufd_page_response_code {
> + IOMMUFD_PAGE_RESP_SUCCESS = 0,
> + IOMMUFD_PAGE_RESP_INVALID,
> + IOMMUFD_PAGE_RESP_FAILURE,
> +};
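
i.e. swap the two descriptions so each matches its spec term, something
like:

/**
 * enum iommufd_page_response_code - Return status of fault handlers
 * @IOMMUFD_PAGE_RESP_SUCCESS: Fault has been handled and the page tables
 *                             populated, retry the access. This is the
 *                             "Success" defined in PCI 10.4.2.1.
 * @IOMMUFD_PAGE_RESP_INVALID: Could not handle this fault, don't retry the
 *                             access. This is the "Invalid Request" defined
 *                             in PCI 10.4.2.1.
 * @IOMMUFD_PAGE_RESP_FAILURE: General error. Drop all subsequent faults
 *                             from this device if possible. This is the
 *                             "Response Failure" defined in PCI 10.4.2.1.
 */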
> +



RE: [PATCH v5 5/9] iommufd: Add iommufd fault object

2024-05-15 Thread Tian, Kevin
> From: Baolu Lu 
> Sent: Wednesday, May 8, 2024 6:05 PM
> 
> On 2024/5/8 8:11, Jason Gunthorpe wrote:
> > On Tue, Apr 30, 2024 at 10:57:06PM +0800, Lu Baolu wrote:
> >> diff --git a/drivers/iommu/iommu-priv.h b/drivers/iommu/iommu-priv.h
> >> index ae65e0b85d69..1a0450a83bd0 100644
> >> --- a/drivers/iommu/iommu-priv.h
> >> +++ b/drivers/iommu/iommu-priv.h
> >> @@ -36,6 +36,10 @@ struct iommu_attach_handle {
> >>struct device   *dev;
> >>refcount_t  users;
> >>};
> >> +  /* attach data for IOMMUFD */
> >> +  struct {
> >> +  void*idev;
> >> +  };
> > We can use a proper type here, just forward declare it.
> >
> > But this sequence in the other patch:
> >
> > +   ret = iommu_attach_group(hwpt->domain, idev->igroup->group);
> > +   if (ret) {
> > +   iommufd_fault_iopf_disable(idev);
> > +   return ret;
> > +   }
> > +
> > +   handle = iommu_attach_handle_get(idev->igroup->group, IOMMU_NO_PASID, 0);
> > +   handle->idev = idev;
> >
> > Is why I was imagining the caller would allocate, because now we have
> > the issue that a fault capable domain was installed into the IOMMU
> > before it's handle could be fully setup, so we have a race where a
> > fault could come in right between those things. Then what happens?
> > I suppose we can retry the fault and by the time it comes back the
> > race should resolve. A bit ugly I suppose.
> 
> You are right. It makes more sense if the attached data is allocated and
> managed by the caller. I will go in this direction and update my series.
> I will also consider other review comments you have given in other
> places.
> 

Does this direction imply a new iommu_attach_group_handle() helper
to pass in the caller-allocated handle pointer, or exposing a new
iommu_group_set_handle() to set the handle in the group pasid_array
and then having iommu_attach_group() update the domain info in
the handle?
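
The first option could look roughly like this (hypothetical signature;
error unwinding simplified). Installing the fully initialized handle
before the attach closes the race Jason described:

int iommu_attach_group_handle(struct iommu_domain *domain,
			      struct iommu_group *group,
			      struct iommu_attach_handle *handle)
{
	int ret;

	mutex_lock(&group->mutex);
	handle->domain = domain;
	/* Caller-initialized handle goes in before the domain is live,
	 * so an iopf arriving right after attach finds valid data. */
	ret = xa_insert(&group->pasid_array, IOMMU_NO_PASID, handle,
			GFP_KERNEL);
	if (ret)
		goto out_unlock;

	ret = __iommu_attach_group(domain, group);
	if (ret)
		xa_erase(&group->pasid_array, IOMMU_NO_PASID);
out_unlock:
	mutex_unlock(&group->mutex);
	return ret;
}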


RE: [PATCH v5 6/9] iommufd: Fault-capable hwpt attach/detach/replace

2024-05-15 Thread Tian, Kevin
> From: Lu Baolu 
> Sent: Tuesday, April 30, 2024 10:57 PM
> +
> +int iommufd_fault_domain_replace_dev(struct iommufd_device *idev,
> +  struct iommufd_hw_pagetable *hwpt,
> +  struct iommufd_hw_pagetable *old)
> +{
> + struct iommu_attach_handle *handle;
> + int ret;
> +
> + if (hwpt->fault)
> + ret = iommufd_fault_iopf_enable(idev);
> + else
> + iommufd_fault_iopf_disable(idev);
> +
> + ret = iommu_group_replace_domain(idev->igroup->group, hwpt->domain);
> + if (ret)
> + goto out_cleanup;
> +
> + iommufd_auto_response_faults(old, idev);
> + handle = iommu_attach_handle_get(idev->igroup->group, IOMMU_NO_PASID, 0);
> + handle->idev = idev;

why is auto response required in replace? new requests can come
after the auto response anyway...

The user should prepare for faults delivered to the old or new hwpt
in the transition window.



RE: [PATCH v5 1/9] iommu: Introduce domain attachment handle

2024-05-15 Thread Tian, Kevin
> From: Lu Baolu 
> Sent: Tuesday, April 30, 2024 10:57 PM
> 
> +/* Add an attach handle to the group's pasid array. */
> +static struct iommu_attach_handle *
> +iommu_attach_handle_set(struct iommu_domain *domain,
> + struct iommu_group *group, ioasid_t pasid)
> +{
> + struct iommu_attach_handle *handle;
> + void *curr;
> +
> + handle = kzalloc(sizeof(*handle), GFP_KERNEL);
> + if (!handle)
> + return ERR_PTR(-ENOMEM);
> +
> + handle->domain = domain;
> + curr = xa_cmpxchg(&group->pasid_array, pasid, NULL, handle, GFP_KERNEL);

this could be just xa_insert() which returns -EBUSY if the entry isn't NULL.

> + if (curr) {
> + kfree(handle);
> + return xa_err(curr) ? curr : ERR_PTR(-EBUSY);

'curr' is not an error pointer. You need ERR_PTR(xa_err(curr)).
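
i.e. roughly:

	handle->domain = domain;
	ret = xa_insert(&group->pasid_array, pasid, handle, GFP_KERNEL);
	if (ret) {
		kfree(handle);
		return ERR_PTR(ret);
	}

xa_insert() already returns -EBUSY when the slot is occupied and a
negative errno on internal failure, so both issues above disappear.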

>  int iommu_attach_group(struct iommu_domain *domain, struct iommu_group *group)
>  {
> + struct iommu_attach_handle *handle;
>   int ret;
> 
>   mutex_lock(&group->mutex);
> + handle = iommu_attach_handle_set(domain, group, IOMMU_NO_PASID);
> + if (IS_ERR(handle)) {
> + ret = PTR_ERR(handle);
> + goto out_unlock;
> + }
>   ret = __iommu_attach_group(domain, group);
> + if (ret)
> + goto out_remove_handle;
>   mutex_unlock(&group->mutex);

It's slightly better to set the handle after __iommu_attach_group().

Doing so aligns with the sequence in iommu_group_replace_domain().

> @@ -2211,13 +2307,33 @@ EXPORT_SYMBOL_GPL(iommu_attach_group);
>  int iommu_group_replace_domain(struct iommu_group *group,
>  struct iommu_domain *new_domain)
>  {
> + struct iommu_domain *old_domain = group->domain;
> + struct iommu_attach_handle *handle;
>   int ret;
> 
>   if (!new_domain)
>   return -EINVAL;
> 
> + if (new_domain == old_domain)
> + return 0;
> +

__iommu_group_set_domain() already does the check.



RE: [PATCH v5 3/9] iommu: Add attachment handle to struct iopf_group

2024-05-15 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Saturday, May 11, 2024 12:29 AM
> 
> On Fri, May 10, 2024 at 10:30:10PM +0800, Baolu Lu wrote:
> 
> > diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> > index 35ae9a6f73d3..09b4e671dcee 100644
> > --- a/include/linux/iommu.h
> > +++ b/include/linux/iommu.h
> > @@ -173,6 +173,8 @@ struct iommu_domain_geometry {
> >
> >  #define __IOMMU_DOMAIN_NESTED  (1U << 6)  /* User-managed address space
> >                                               nested on a stage-2 translation */
> > +#define __IOMMU_DOMAIN_PASID   (1U << 7)  /* User-managed domain for a
> > +                                             specific PASID */
> >
> >  #define IOMMU_DOMAIN_ALLOC_FLAGS ~__IOMMU_DOMAIN_DMA_FQ
> >  /*
> > @@ -204,6 +206,9 @@ struct iommu_domain_geometry {
> >  #define IOMMU_DOMAIN_SVA   (__IOMMU_DOMAIN_SVA)
> >  #define IOMMU_DOMAIN_PLATFORM  (__IOMMU_DOMAIN_PLATFORM)
> >  #define IOMMU_DOMAIN_NESTED(__IOMMU_DOMAIN_NESTED)
> > +#define IOMMU_DOMAIN_NESTED_PASID  \
> > +   (__IOMMU_DOMAIN_NESTED |\
> > +    __IOMMU_DOMAIN_PASID)
> 
> Yeah, something like that, I don't know about naming though..
> 
> How about
> 
> DOMAIN_NESTED = Attachment to RID or PASID
> DOMAIN_NESTED_PASID_TABLE = RID and the entire PASID space
> 

this sounds clearer



RE: [PATCH v5 5/9] iommufd: Add iommufd fault object

2024-05-15 Thread Tian, Kevin
> From: Lu Baolu 
> Sent: Tuesday, April 30, 2024 10:57 PM
> 
> @@ -131,6 +131,9 @@ struct iopf_group {
>   struct iommu_attach_handle *attach_handle;
>   /* The device's fault data parameter. */
>   struct iommu_fault_param *fault_param;
> + /* Used by handler provider to hook the group on its own lists. */
> + struct list_head node;
> + u32 cookie;

better to put these together with attach_handle.

rename 'node' to 'handle_node'

> @@ -128,6 +128,7 @@ enum iommufd_object_type {
>   IOMMUFD_OBJ_HWPT_NESTED,
>   IOMMUFD_OBJ_IOAS,
>   IOMMUFD_OBJ_ACCESS,
> + IOMMUFD_OBJ_FAULT,

Agree with Jason that 'FAULT_QUEUE' sounds like a clearer object name.

> @@ -395,6 +396,8 @@ struct iommufd_device {
>   /* always the physical device */
>   struct device *dev;
>   bool enforce_cache_coherency;
> + /* outstanding faults awaiting response indexed by fault group id */
> + struct xarray faults;

this...

> +struct iommufd_fault {
> + struct iommufd_object obj;
> + struct iommufd_ctx *ictx;
> + struct file *filep;
> +
> + /* The lists of outstanding faults protected by below mutex. */
> + struct mutex mutex;
> + struct list_head deliver;
> + struct list_head response;

...and here worth a discussion.

First, the response list is not used. If we continue with queueing
faults per device, it should be removed.

But I wonder whether it makes more sense to keep the response
queue per fault object. Sounds simpler to me.

Also it's unclear why we need the response message to carry the
same info as the request while only id/code/cookie are used.

+struct iommu_hwpt_page_response {
+   __u32 size;
+   __u32 flags;
+   __u32 dev_id;
+   __u32 pasid;
+   __u32 grpid;
+   __u32 code;
+   __u32 cookie;
+   __u32 reserved;
+};

If we keep the response queue in the fault object, the response message
only needs to carry size/flags/code/cookie, and the cookie can uniquely
identify the pending message in the response queue.
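
i.e. trimmed down to something like:

struct iommu_hwpt_page_response {
	__u32 size;
	__u32 flags;
	__u32 code;
	__u32 cookie;
};

with the fault object resolving cookie->group itself, instead of going
through dev_id.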

> +static ssize_t iommufd_fault_fops_write(struct file *filep, const char __user *buf,
> + size_t count, loff_t *ppos)
> +{
> + size_t response_size = sizeof(struct iommu_hwpt_page_response);
> + struct iommufd_fault *fault = filep->private_data;
> + struct iommu_hwpt_page_response response;
> + struct iommufd_device *idev = NULL;
> + struct iopf_group *group;
> + size_t done = 0;
> + int rc;
> +
> + if (*ppos || count % response_size)
> + return -ESPIPE;
> +
> + mutex_lock(&fault->mutex);
> + while (count > done) {
> + rc = copy_from_user(&response, buf + done, response_size);
> + if (rc)
> + break;
> +
> + if (!idev || idev->obj.id != response.dev_id)
> + idev = container_of(iommufd_get_object(fault->ictx,
> +                                        response.dev_id,
> +                                        IOMMUFD_OBJ_DEVICE),
> +                     struct iommufd_device, obj);
> + if (IS_ERR(idev))
> + break;
> +
> + group = xa_erase(&idev->faults, response.cookie);
> + if (!group)
> + break;

is 'continue' better?

> +
> + iopf_group_response(group, response.code);

PCIe spec states that a Response Failure disables the PRI interface. For SR-IOV
it'd be dangerous to allow the user to send such a code to a VF, as that
closes the entire shared PRI interface.

Just another example of the lack of coordination for shared capabilities
between PF/VF. But exposing such a gap to userspace makes it worse.

I guess we don't want to make this work depend on that cleanup. The
minimal correct thing is to disallow attaching a VF to a fault-capable
hwpt, with a note here that once we turn on support for VFs the Response
Failure code should not be forwarded to the hardware. Instead it's an
indication that the user cannot serve more requests, and such a situation
waits for a vPRI reset to recover. A sketch of the check follows.
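
Something like this at attach time (sketch; whether pdev->is_virtfn is
the right key is open for discussion):

	if (hwpt->fault && dev_is_pci(idev->dev)) {
		struct pci_dev *pdev = to_pci_dev(idev->dev);

		/* PRI is shared between the PF and its VFs; don't let a
		 * user-issued Response Failure kill it for everyone. */
		if (pdev->is_virtfn)
			return -EOPNOTSUPP;
	}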

> + iopf_free_group(group);
> + done += response_size;
> +
> + iommufd_put_object(fault->ictx, &idev->obj);

get/put is unpaired:

	if (!idev || idev->obj.id != response.dev_id)
		idev = iommufd_get_object();

	...

	iommufd_put_object(idev);

The intention might be to reuse idev when multiple fault responses are
for the same idev. But idev is put in every iteration, so the following
messages will access the idev w/o holding a reference.
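
i.e. the loop needs to look more like this (sketch; unrelated parts
elided):

	while (count > done) {
		...
		if (!idev || idev->obj.id != response.dev_id) {
			/* switching devices: drop the previous reference */
			if (idev)
				iommufd_put_object(fault->ictx, &idev->obj);
			idev = container_of(iommufd_get_object(fault->ictx,
							       response.dev_id,
							       IOMMUFD_OBJ_DEVICE),
					    struct iommufd_device, obj);
			if (IS_ERR(idev)) {
				idev = NULL;
				break;
			}
		}
		...
		done += response_size;
	}
	if (idev)
		iommufd_put_object(fault->ictx, &idev->obj);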



RE: [PATCH v5 3/9] iommu: Add attachment handle to struct iopf_group

2024-05-15 Thread Tian, Kevin
> From: Lu Baolu 
> Sent: Tuesday, April 30, 2024 10:57 PM
> 
> Previously, the domain that a page fault targets is stored in an
> iopf_group, which represents a minimal set of page faults. With the
> introduction of attachment handle, replace the domain with the handle

It's better to use 'attach handle' as the code does.

> + handle = iommu_attach_handle_get(dev->iommu_group, pasid, 0);
> + if (IS_ERR(handle))
> + return PTR_ERR(handle);
> 
> - if (!domain || !domain->iopf_handler) {
> - dev_warn_ratelimited(dev,
> - "iopf (pasid %d) without domain attached or handler
> installed\n",
> -  fault->prm.pasid);
> + group->attach_handle = handle;
> + group->domain = handle->domain;

this change also removes the warning message. Is that desired?



