Re: [RFC 3/3] virtio-iommu: future work

2017-04-26 Thread Michael S. Tsirkin
On Fri, Apr 07, 2017 at 08:17:47PM +0100, Jean-Philippe Brucker wrote:
> Here I propose a few ideas for extensions and optimizations. This is all
> very exploratory, feel free to correct mistakes and suggest more things.
> 
>   I.   Linux host
>     1. vhost-iommu

A QEMU-based implementation would be a first step.
It would allow validating the claim that it's much
simpler to support than e.g. VT-d.

>     2. VFIO nested translation
>   II.  Page table sharing
>     1. Sharing IOMMU page tables
>     2. Sharing MMU page tables (SVM)
>     3. Fault reporting
>     4. Host implementation with VFIO
>   III. Relaxed operations
>   IV.  Misc
> 
> 
>   I. Linux host
>   =============
> 
>   1. vhost-iommu
>   --------------
> 
> An advantage of virtualizing an IOMMU using virtio is that it allows
> hoisting a lot of the emulation code into the kernel using vhost, avoiding
> a return to userspace for each request. The mainline kernel already
> implements vhost-net, vhost-scsi and vhost-vsock, and a lot of core code
> could be reused.
> 
> Introducing vhost in a simplified scenario 1 (with guest userspace
> pass-through removed, as it is irrelevant to this example) gives us the
> following:
> 
>   MEM____pIOMMU________PCI device                       HARDWARE
>             |                \
>   ----------|-----+-------+--\----------------------------------
>             |     :  KVM  :   \
>       pIOMMU drv  :       :    \                         KERNEL
>             |     :       :   net drv
>           VFIO    :       :    /
>             |     :       :   /
>       vhost-iommu _____________ virtio-iommu-drv
>                   :       :
>   ----------------+-------+--------------------------------------
>            HOST   :       :          GUEST
> 
> 
> With vhost introduced in scenario 2, userspace now only handles device
> initialisation, and most runtime communication is handled in the kernel:
> 
>   MEM__pIOMMU___PCI device                               HARDWARE
>          |          |
>   -------|----------|---+-------+--------------------------------
>          |          |   :  KVM  :
>     pIOMMU drv      |   :       :                        KERNEL
>          \______net drv :       :
>                     |   :       :
>                    tap  :       :
>                     |   :       :
>        __________vhost-net ___________ virtio-net drv
>   (2) /                 :       :          / (1a)
>       /                 :       :         /
>   vhost-iommu ________________________ virtio-iommu drv
>                         :       :          (1b)
>   ----------------------+-------+--------------------------------
>          HOST           :       :          GUEST
> 
> (1) a. Guest virtio driver maps ring and buffers
> b. Map requests are relayed to the host the same way.
> (2) To access any guest memory, vhost-net must query the IOMMU. We can
> reuse the existing TLB protocol for this. TLB commands are written to
> and read from the vhost-net fd.
> 
> As defined in Linux/include/uapi/linux/vhost.h, the vhost msg structure
> has everything needed for map/unmap operations:
> 
>   struct vhost_iotlb_msg {
>           __u64   iova;
>           __u64   size;
>           __u64   uaddr;
>           __u8    perm; /* R/W */
>           __u8    type;
>   #define VHOST_IOTLB_MISS        1
>   #define VHOST_IOTLB_UPDATE      2 /* MAP */
>   #define VHOST_IOTLB_INVALIDATE  3 /* UNMAP */
>   #define VHOST_IOTLB_ACCESS_FAIL 4
>   };
> 
>   struct vhost_msg {
>           int type;
>           union {
>                   struct vhost_iotlb_msg iotlb;
>                   __u8 padding[64];
>           };
>   };
> 
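A minimal sketch of what the relay in (2) could look like, assuming the
vhost fd is already set up and the guest mapping has been validated and
translated to a host userspace address (relay_map_to_vhost() is illustrative
only; the structs and VHOST_* constants are from linux/vhost.h):

	#include <errno.h>
	#include <unistd.h>
	#include <linux/vhost.h>

	/* Push a guest MAP request into the vhost IOTLB as an UPDATE */
	static int relay_map_to_vhost(int vhost_fd, __u64 iova, __u64 size,
				      __u64 uaddr, __u8 perm)
	{
		struct vhost_msg msg = {
			.type = VHOST_IOTLB_MSG,
			.iotlb = {
				.iova  = iova,
				.size  = size,
				.uaddr = uaddr,
				.perm  = perm,	/* VHOST_ACCESS_* */
				.type  = VHOST_IOTLB_UPDATE,
			},
		};

		/* TLB commands are written to the vhost fd */
		if (write(vhost_fd, &msg, sizeof(msg)) != sizeof(msg))
			return -errno;
		return 0;
	}
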
> The vhost-iommu device associates a virtual device ID with a TLB fd. We
> should be able to use the same commands for [vhost-net <-> virtio-iommu]
> and [virtio-net <-> vhost-iommu] communication. A virtio-net device
> would open a socketpair and hand one side to vhost-iommu.
> 
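A sketch of that handover, under the assumption of an invented
VHOST_IOMMU_SET_TLB_FD ioctl (only socketpair() and ioctl() themselves are
existing APIs here; everything else is illustrative):

	#include <sys/ioctl.h>
	#include <sys/socket.h>
	#include <linux/vhost.h>

	/* invented for this sketch, mirroring other vhost ioctls */
	struct vhost_iommu_set_tlb_fd {
		__u32	devid;	/* virtual device ID */
		__s32	fd;
	};
	#define VHOST_IOMMU_SET_TLB_FD \
		_IOW(VHOST_VIRTIO, 0x60, struct vhost_iommu_set_tlb_fd)

	int sv[2];

	/* one vhost_msg per datagram, in both directions */
	socketpair(AF_UNIX, SOCK_SEQPACKET, 0, sv);

	/* hand one end to vhost-iommu, tagged with the virtual device ID */
	struct vhost_iommu_set_tlb_fd arg = { .devid = devid, .fd = sv[0] };
	ioctl(iommu_fd, VHOST_IOMMU_SET_TLB_FD, &arg);

	/* the virtio-net device model keeps sv[1] for TLB messages */
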
> If vhost_msg is ever used for a purpose other than the TLB, we'll have some
> trouble, as there will be multiple clients wanting to read/write the
> vhost fd, and a multicast transport method will be needed. Until then, this
> can work.
> 
> Details of operations would be:
> 
> (1) Userspace sets up vhost-iommu as with other vhost devices, by using
> standard vhost ioctls. Userspace starts by describing the system topology
> via ioctl:
> 
>   ioctl(iommu_fd, VHOST_IOMMU_ADD_DEVICE, struct vhost_iommu_add_device)
> 
>   #define VHOST_IOMMU_DEVICE_TYPE_VFIO
>   #define VHOST_IOMMU_DEVICE_TYPE_TLB
> 
>   struct vhost_iommu_add_device {

Re: [RFC 3/3] virtio-iommu: future work

2017-04-24 Thread Jean-Philippe Brucker
On 21/04/17 09:31, Tian, Kevin wrote:
>> From: Jean-Philippe Brucker
>> Sent: Saturday, April 8, 2017 3:18 AM
>>
>> Here I propose a few ideas for extensions and optimizations. This is all
>> very exploratory, feel free to correct mistakes and suggest more things.
> 
> [...]
>>
>>   II. Page table sharing
>>   ======================
>>
>>   1. Sharing IOMMU page tables
>>   ----------------------------
>>
>> VIRTIO_IOMMU_F_PT_SHARING
>>
>> This is independent of the nested mode described in I.2, but relies on a
>> similar feature in the physical IOMMU: having two stages of page tables,
>> one for the host and one for the guest.
>>
>> When this is supported, the guest can manage its own s1 page directory, to
>> avoid sending MAP/UNMAP requests. Feature VIRTIO_IOMMU_F_PT_SHARING allows
>> a driver to give a page directory pointer (pgd) to the host and send
>> invalidations when removing or changing a mapping. In this mode, three
>> requests are used: probe, attach and invalidate. An address space cannot
>> use the MAP/UNMAP interface and PT_SHARING at the same time.
>>
>> Device and driver first need to negotiate which page table format they
>> will be using. This depends on the physical IOMMU, so the request contains
>> a negotiation part to probe the device capabilities.
>>
>> (1) Driver attaches devices to address spaces as usual, but a flag
>> VIRTIO_IOMMU_ATTACH_F_PRIVATE (working title) tells the device not to
>> create page tables for use with the MAP/UNMAP API. The driver intends
>> to manage the address space itself.
>>
>> (2) Driver sends a PROBE_TABLE request, setting len to the size of the
>> pg_format array (len > 0).
>>
>>  VIRTIO_IOMMU_T_PROBE_TABLE
>>
>>      struct virtio_iommu_req_probe_table {
>>              le32    address_space;
>>              le32    flags;
>>              le32    len;
>>
>>              le32    nr_contexts;
>>              struct {
>>                      le32    model;
>>                      u8      format[64];
>>              } pg_format[len];
>>      };
>>
>> Introducing a probe request is more flexible than advertising those
>> features in virtio config, because capabilities are dynamic, and depend on
>> which devices are attached to an address space. Within a single address
>> space, devices may support different numbers of contexts (PASIDs), and
>> some may not support recoverable faults.
>>
>> (3) Device responds with success and returns, in pg_format, all page table
>> formats implemented by the physical IOMMU. 'model' 0 is invalid, so the
>> driver can initialize the array to 0 and deduce from there which entries
>> have been filled in by the device.
>>
>> Using a probe method seems preferable over trying to attach every possible
>> format until one sticks. For instance, with an ARM guest running on an x86
>> host, PROBE_TABLE would return the Intel IOMMU page table format, and the
>> guest could use that page table code to handle its mappings, hidden behind
>> the IOMMU API. This requires that the page-table code is reasonably
>> abstracted from the architecture, as is done with drivers/iommu/io-pgtable
>> (an x86 guest could use any format implemented by io-pgtable, for example.)
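
A hypothetical driver-side flow for steps (2) and (3), in kernel style; the
request layout is the draft structure above, while struct viommu_dev,
viommu_send_req_sync() and NR_FORMATS are made up for the example:

	#define NR_FORMATS	8

	static int viommu_probe_formats(struct viommu_dev *viommu, u32 asid)
	{
		struct virtio_iommu_req_probe_table *req;
		size_t sz = sizeof(*req) + NR_FORMATS * sizeof(req->pg_format[0]);
		int i, ret;

		/* zeroed allocation: 'model' 0 marks entries left unfilled */
		req = kzalloc(sz, GFP_KERNEL);
		if (!req)
			return -ENOMEM;

		req->address_space = cpu_to_le32(asid);
		req->len = cpu_to_le32(NR_FORMATS);

		ret = viommu_send_req_sync(viommu, req, sz); /* hypothetical */

		for (i = 0; !ret && i < NR_FORMATS; i++) {
			if (!le32_to_cpu(req->pg_format[i].model))
				break;	/* device filled no more entries */
			/* pick the first format the page-table code supports */
		}

		kfree(req);
		return ret;
	}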
> 
> So essentially you need to modify all existing IOMMU drivers to support page
> table sharing in pvIOMMU. Once the abstraction is done, the core pvIOMMU
> files can be kept vendor agnostic. But if we talk about the whole pvIOMMU
> module, it actually includes vendor-specific logic and is thus, unlike
> typical para-virtualized virtio drivers, not completely vendor agnostic. Is
> this understanding accurate?

Yes, although kernel modules would be separate. For Linux on ARM we
already have the page-table logic abstracted in the iommu/io-pgtable module,
because multiple IOMMUs share the same PT formats (SMMUv2, SMMUv3, Renesas
IPMMU, Qcom MSM, Mediatek). It offers a simple interface:

* When attaching devices to an IOMMU domain, the IOMMU driver registers
its page table format and provides invalidation callbacks.

* On iommu_map/unmap, the IOMMU driver calls into io_pgtable_ops, which
provide map, unmap and iova_to_phys functions.

* Page table operations call back into the driver via iommu_gather_ops
when they need to invalidate TLB entries.

Currently only a few flavors of ARM PT formats are implemented, but other
page table formats could be added if they fit this model.
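
As a rough illustration, usage looks like this (based on
drivers/iommu/io-pgtable.h as of v4.11; the callbacks are stubs):

	static void my_tlb_flush_all(void *cookie)
	{
		/* invalidate the whole TLB for this domain */
	}

	static void my_tlb_add_flush(unsigned long iova, size_t size,
				     size_t granule, bool leaf, void *cookie)
	{
		/* queue an invalidation for [iova, iova + size) */
	}

	static void my_tlb_sync(void *cookie)
	{
		/* wait for queued invalidations to complete */
	}

	static const struct iommu_gather_ops my_gather_ops = {
		.tlb_flush_all	= my_tlb_flush_all,
		.tlb_add_flush	= my_tlb_add_flush,
		.tlb_sync	= my_tlb_sync,
	};

	struct io_pgtable_cfg cfg = {
		.pgsize_bitmap	= SZ_4K | SZ_2M | SZ_1G,
		.ias		= 48,	/* input (IOVA) address bits */
		.oas		= 48,	/* output (PA) address bits */
		.tlb		= &my_gather_ops,
	};
	struct io_pgtable_ops *ops;

	/* allocate an ARM 64-bit stage-1 table, then map one page */
	ops = alloc_io_pgtable_ops(ARM_64_LPAE_S1, &cfg, domain_cookie);
	ops->map(ops, iova, paddr, SZ_4K, IOMMU_READ | IOMMU_WRITE);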

> It also means the host-side pIOMMU driver needs to propagate all
> supported formats through VFIO to the QEMU vIOMMU, meaning such format
> definitions need to be consistently agreed on across all those components.

Yes, that's the icky part. We need to define a format that every OS and
hypervisor implementing virtio-iommu can understand (similarly to the
PASID table sharing interface that Yi L is working on for VFIO, although
that one is contained in Linux UAPI and doesn't require other OSes to know
about it).

>>   2. Sharing MMU page tables
>>   --------------------------
>>
>> The guest can share process 

RE: [RFC 3/3] virtio-iommu: future work

2017-04-21 Thread Tian, Kevin
> From: Jean-Philippe Brucker
> Sent: Saturday, April 8, 2017 3:18 AM
> 
> Here I propose a few ideas for extensions and optimizations. This is all
> very exploratory, feel free to correct mistakes and suggest more things.

[...]
> 
>   II. Page table sharing
>   ======================
> 
>   1. Sharing IOMMU page tables
>   ----------------------------
> 
> VIRTIO_IOMMU_F_PT_SHARING
> 
> This is independent of the nested mode described in I.2, but relies on a
> similar feature in the physical IOMMU: having two stages of page tables,
> one for the host and one for the guest.
> 
> When this is supported, the guest can manage its own s1 page directory, to
> avoid sending MAP/UNMAP requests. Feature VIRTIO_IOMMU_F_PT_SHARING allows
> a driver to give a page directory pointer (pgd) to the host and send
> invalidations when removing or changing a mapping. In this mode, three
> requests are used: probe, attach and invalidate. An address space cannot
> use the MAP/UNMAP interface and PT_SHARING at the same time.
> 
> Device and driver first need to negotiate which page table format they
> will be using. This depends on the physical IOMMU, so the request contains
> a negotiation part to probe the device capabilities.
> 
> (1) Driver attaches devices to address spaces as usual, but a flag
> VIRTIO_IOMMU_ATTACH_F_PRIVATE (working title) tells the device not to
> create page tables for use with the MAP/UNMAP API. The driver intends
> to manage the address space itself.
> 
> (2) Driver sends a PROBE_TABLE request, setting len to the size of the
> pg_format array (len > 0).
> 
>   VIRTIO_IOMMU_T_PROBE_TABLE
> 
>       struct virtio_iommu_req_probe_table {
>               le32    address_space;
>               le32    flags;
>               le32    len;
>
>               le32    nr_contexts;
>               struct {
>                       le32    model;
>                       u8      format[64];
>               } pg_format[len];
>       };
> 
> Introducing a probe request is more flexible than advertising those
> features in virtio config, because capabilities are dynamic, and depend on
> which devices are attached to an address space. Within a single address
> space, devices may support different numbers of contexts (PASIDs), and
> some may not support recoverable faults.
> 
> (3) Device responds with success and returns, in pg_format, all page table
> formats implemented by the physical IOMMU. 'model' 0 is invalid, so the
> driver can initialize the array to 0 and deduce from there which entries
> have been filled in by the device.
> 
> Using a probe method seems preferable over trying to attach every possible
> format until one sticks. For instance, with an ARM guest running on an x86
> host, PROBE_TABLE would return the Intel IOMMU page table format, and the
> guest could use that page table code to handle its mappings, hidden behind
> the IOMMU API. This requires that the page-table code is reasonably
> abstracted from the architecture, as is done with drivers/iommu/io-pgtable
> (an x86 guest could use any format implemented by io-pgtable, for example.)

So essentially you need to modify all existing IOMMU drivers to support page
table sharing in pvIOMMU. Once the abstraction is done, the core pvIOMMU
files can be kept vendor agnostic. But if we talk about the whole pvIOMMU
module, it actually includes vendor-specific logic and is thus, unlike
typical para-virtualized virtio drivers, not completely vendor agnostic. Is
this understanding accurate?

It also means the host-side pIOMMU driver needs to propagate all
supported formats through VFIO to the QEMU vIOMMU, meaning such format
definitions need to be consistently agreed on across all those components.

[...]

> 
>   2. Sharing MMU page tables
>   --------------------------
> 
> The guest can share process page tables with the physical IOMMU. To do
> that, it sends PROBE_TABLE with (F_INDIRECT | F_NATIVE | F_FAULT). The
> page table format is implicit, so the pg_format array can be empty (unless
> the guest wants to query some specific property, e.g. the number of levels
> supported by the pIOMMU?). If the host answers with success, the guest can
> send its MMU page table details with ATTACH_TABLE and the (F_NATIVE |
> F_INDIRECT | F_FAULT) flags.
> 
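For illustration, the probe in this mode could look as follows; the
VIRTIO_IOMMU_PROBE_F_* values are placeholders, and only the structure and
the flag names come from the RFC:

	struct virtio_iommu_req_probe_table req = {
		.address_space	= cpu_to_le32(asid),
		/* placeholder flag values, not part of the draft */
		.flags		= cpu_to_le32(VIRTIO_IOMMU_PROBE_F_INDIRECT |
					      VIRTIO_IOMMU_PROBE_F_NATIVE |
					      VIRTIO_IOMMU_PROBE_F_FAULT),
		.len		= 0,	/* format implicit, pg_format empty */
	};
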
> F_FAULT means that the host communicates page requests from the device to
> the guest, and the guest can handle them by mapping the virtual address in
> the fault to pages. It is only available with VIRTIO_IOMMU_F_FAULT_QUEUE
> (see below.)
> 
> F_NATIVE means that the pIOMMU pgtable format is the same as the guest MMU
> pgtable format.
> 
> F_INDIRECT means that the 'table' pointer is a context table, instead of a
> page directory. Each slot in the context table points to a page directory:
> 
>              64                          2 1 0
>       table ----> +----------------------+-+-+
>                   |          pgd         |0|1| <--- context 0
>                   |          ---         |0|0| <--- context 1
>                   |