Re: [Qemu-devel] [RFC Design Doc] Enable Shared Virtual Memory feature in pass-through scenarios

Tian, Kevin Sun, 18 Sep 2016 19:28:38 -0700

> From: Liu, Yi L
> Sent: Wednesday, September 14, 2016 7:35 PM
> 
> Hi,
> 
> I'm sending this email to share the enabling design for supporting SVM in
> pass-through scenarios.
> Comments are welcome. Please let me know if anything is unclear. Any
> suggestions regarding the format are welcome as well.


CC Qemu mailing list...

Yi, I think you need to clarify who does what. vIOMMU emulation resides in
Qemu, which is the place to enable SVM virtualization. VFIO then needs an
extension to propagate the guest context entry from Qemu to the shadow context
entry in the underlying IOMMU driver (configured in nested mode). Ideally KVM
doesn't need any change (just reuse the existing interfaces to forward I/O
emulation requests and to inject virtual interrupts). However, your description
looks a bit confusing, especially where it overuses KVM in some places.

Also, Peter is now enhancing the IOMMUNotifier framework. You may take a look
to see how the SVM virtualization requirement can fit there.

btw this design doc looks too high level. It might be clearer if you directly
send out an RFC patch set with the description below spread across the right
places.

> 
> Content
> ===================
> 1. Feature description
> 2. Why use it?
> 3. How to enable it
> 4. How to test
> 
> Details
> ===================
> 1. Feature description
> This feature lets an application program running within an L1 guest share its
> virtual address space with an assigned physical device (e.g. graphics
> processors or accelerators). For SVM (shared virtual memory) details, you may
> refer to section 2.5.1.1 of the Intel VT-d spec and also section 5.6 of the
> OpenCL spec. For details about the SVM address translation structures, please
> refer to section 3 of the Intel VT-d spec. You are also welcome to ask
> directly in this thread.
> 
> http://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/vt-directed-io-spec.pdf
> https://www.khronos.org/registry/cl/specs/opencl-2.0.pdf
> 
> 
> 2. Why use it?
> It is common to pass through devices to a guest and expect performance
> similar to that on the host. With this feature enabled, SVM in the guest
> also lets application programs pass data structures to their assigned
> devices without unnecessary overhead.
> 
> 
> 3. How to enable it
> The work is essentially to virtualize DMAR hardware that is capable of
> translating guest virtual addresses to host physical addresses when the
> assigned device makes use of the SVM feature. The key capability for
> virtualizing remapping hardware is the caching mode. When the CM field is
> reported as Set, any software update to any remapping structure (including
> updates to not-present entries or present entries whose programming resulted
> in translation faults) requires explicit invalidation of the caches. The
> enabling work would include the following items.

Virtualization of the 2nd level translation (GPA->HPA) of VT-d is already
there.

What you require is virtualization of the 1st level translation (GVA->GPA),
plus a way to propagate the guest context entry (or specifically the GPA of
the PASID table) through VFIO to the intel-iommu driver.
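
To make the two stages concrete, here is a toy sketch (Python, purely
illustrative) of nested translation: a GVA goes through a guest-owned first
level table to a GPA, which then goes through the host-owned second level
table to an HPA. Flat dictionaries keyed by page frame number stand in for the
real multi-level VT-d structures, and every name below is made up for
illustration rather than taken from Qemu or the intel-iommu driver.

```python
PAGE_SHIFT = 12
PAGE_MASK = (1 << PAGE_SHIFT) - 1


class TranslationFault(Exception):
    """Raised when a stage has no mapping for the requested page."""


def translate(table, addr, stage):
    # `table` maps an input page frame number to an output page frame number.
    pfn = addr >> PAGE_SHIFT
    if pfn not in table:
        raise TranslationFault(f"{stage}: no mapping for {addr:#x}")
    return (table[pfn] << PAGE_SHIFT) | (addr & PAGE_MASK)


def nested_translate(first_level, second_level, gva):
    # Stage 1 is owned by the guest (GVA -> GPA).
    gpa = translate(first_level, gva, "first level (GVA->GPA)")
    # Stage 2 is owned by the host (GPA -> HPA).
    return translate(second_level, gpa, "second level (GPA->HPA)")


# Hypothetical mappings: guest page 0x10 -> GPA page 0x200 -> HPA page 0x7F000.
first_level = {0x10: 0x200}
second_level = {0x200: 0x7F000}
print(hex(nested_translate(first_level, second_level, 0x10ABC)))
```

The sketch leaves out one detail of real nested mode: the hardware also runs
the first level table walk itself through the second level, because the guest
page tables live at GPAs.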

> 
> a) IOMMU Register Access Emulation
> The register set for each remapping hardware unit in the platform is placed
> at a 4KB-aligned memory-mapped location. For virtual remapping hardware, the
> guest would allocate such a page. KVM could intercept accesses to that page
> and emulate the accesses to different registers accordingly.

Not KVM's business. It's emulated by the vIOMMU in Qemu.
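
As a rough illustration of what "emulated by the vIOMMU" means for items a)
and b), the toy model below (Python, not Qemu code) dispatches MMIO writes by
register offset and drains the invalidation queue when the guest moves the
tail. The offsets 0x80 and 0x88 are assumed to match the Invalidation Queue
Head and Tail registers in the VT-d spec and should be checked against it.
The registers hold plain ring indices here instead of the real encoded
offsets, and the class and function names are hypothetical.

```python
IQH_REG = 0x80  # Invalidation Queue Head (offset assumed, see lead-in)
IQT_REG = 0x88  # Invalidation Queue Tail (offset assumed, see lead-in)


class ToyVIOMMU:
    def __init__(self, queue_size=8):
        self.regs = {IQH_REG: 0, IQT_REG: 0}
        # Stands in for the ring buffer the guest allocates in its own memory.
        self.queue = [None] * queue_size
        # Descriptors the emulation has consumed, in order.
        self.processed = []

    def mmio_read(self, offset):
        return self.regs.get(offset, 0)

    def mmio_write(self, offset, value):
        self.regs[offset] = value
        # A tail update is the guest's signal that new descriptors exist.
        if offset == IQT_REG:
            self._drain_queue()

    def _drain_queue(self):
        head = self.regs[IQH_REG]
        while head != self.regs[IQT_REG]:
            self.processed.append(self.queue[head])
            head = (head + 1) % len(self.queue)
        self.regs[IQH_REG] = head


def guest_submit(viommu, descriptor):
    # What a guest driver would do: fill the slot, then bump the tail via MMIO.
    tail = viommu.mmio_read(IQT_REG)
    viommu.queue[tail] = descriptor
    viommu.mmio_write(IQT_REG, (tail + 1) % len(viommu.queue))


viommu = ToyVIOMMU()
guest_submit(viommu, "context-cache invalidate")
guest_submit(viommu, "iotlb invalidate")
print(viommu.processed, viommu.mmio_read(IQH_REG))
```

In this model the trap lands in the device emulation, and KVM is only the
transport that forwards the MMIO exit.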

> 
> b) QI Handling Emulation
> Queued invalidation lets software send invalidation requests to the IOMMU
> and to devices (with device-IOTLB). Invalidation descriptors are written to
> a ring buffer allocated by the OS. The guest OS would allocate a ring buffer
> for its own DMAR. As designed, software needs to update the Invalidation
> Queue Tail Register after writing a new descriptor to the ring buffer. As
> item a) mentioned, KVM would intercept the access to the Invalidation Queue
> Tail Register and then parse the QI descriptor from the guest. Eventually,
> the guest QI descriptors will be put into the ring buffer of the host so
> that the physical remapping hardware processes them.
> 
> c) Recoverable Fault Handling Emulation
> For a passed-through device, the page request is sent to the host first. If
> the page request carries a PASID, it would be injected into the
> corresponding guest for further processing. The guest would process the
> request and send the response through the guest QI interface. Guest QI
> would be intercepted by KVM as item b) mentioned. Finally, the response
> would reach the host QI and then the device. Requests without a PASID
> should be handled by the host. Page requests with a PASID would be injected
> into the guest through vMSI.
> 
> d) Non-Recoverable Fault Handling Emulation
> Non-recoverable faults would be injected into the guest by vMSI.

Again, KVM doesn't need to know the details of emulating the above faults. They
are emulated by Qemu, which then triggers a virtual MSI to KVM. Once the guest
receives the virtual MSI interrupt, the corresponding fault handler will access
the necessary registers or in-memory structures, which are emulated or provided
by Qemu.
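
A minimal sketch of that split, assuming the PASID routing rule from item c):
requests without a PASID stay with the host, while requests with a PASID are
queued in a structure the emulation owns and announced with a virtual MSI.
This is a Python toy, not Qemu code. `PageRequest`, `ToyFaultEmulator` and the
`inject_msi` callback are hypothetical names, and the callback only stands in
for whatever interface is used to raise the interrupt.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageRequest:
    bdf: int                      # requester ID of the assigned device
    address: int                  # faulting address
    pasid: Optional[int] = None   # None models a request without PASID


class ToyFaultEmulator:
    def __init__(self, inject_msi):
        self.inject_msi = inject_msi
        self.guest_queue = []     # records the guest fault handler will read
        self.host_handled = []    # requests resolved without the guest

    def on_page_request(self, req):
        if req.pasid is None:
            # No PASID: the host is expected to handle it on its own.
            self.host_handled.append(req)
            return "host"
        # With PASID: expose the record to the guest, then notify it.
        self.guest_queue.append(req)
        self.inject_msi()
        return "guest"


msi_count = []
emulator = ToyFaultEmulator(inject_msi=lambda: msi_count.append(1))
print(emulator.on_page_request(PageRequest(bdf=0x0100, address=0x1000)))
print(emulator.on_page_request(PageRequest(bdf=0x0100, address=0x2000, pasid=5)))
```

The interrupt path only sees "raise this MSI". The record and the registers
the guest handler reads afterwards live entirely in the emulation.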

> 
> e) VT-d Page Table Virtualization
> For requests with PASID from an assigned device, this design would use the
> nested mode of VT-d paging. For SVM-capable devices that are assigned to a
> guest, the extended-context-entries used to translate DMA addresses from
> such devices should have the NESTE bit set. Requests without PASID from
> such devices would still be translated by walking the second-level page
> table.
> 
> Another important thing is shadowing the VT-d page table. Need to report 
> cache mode as

You don't need to shadow the VT-d page table. You only need to shadow the
context entry.

> Set for the virtual hardware, so the guest software would explicitly issue
> invalidation operations on the virtual hardware for any/all updates to the
> guest remapping structures. KVM may trap these guest invalidation
> operations to keep the shadow translation structures consistent with guest
> translation structure modifications. In this design, any change to the
> extended-context-entry would be followed by an invalidation (QI). As item
> b) described, KVM would intercept it and parse it. For an
> extended-context-entry modification, KVM would determine whether it is
> necessary to shadow the change to the extended-context-entry used by the
> physical remapping hardware. In nested mode, the physical remapping
> hardware would treat the PASID table pointer in the extended-context-entry
> as a GPA. So in the shadowing, KVM would just copy the PASID table pointer
> from the guest extended-context-entry to the host extended-context-entry.

Please check Peter's work to see how this can fit there.
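
For reference, the shadowing step itself is small. The toy model below
(Python, illustrative only) copies the guest's PASID table pointer into the
host entry and turns on nesting, leaving the host-owned second level pointer
untouched. The field names are simplifications rather than the real
extended-context-entry layout, and clearing nesting when the guest entry is
not present is an assumption of this sketch, not something the design doc
specifies.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ExtContextEntry:
    present: bool = False
    nested: bool = False        # stands in for the NESTE bit
    pasid_table_ptr: int = 0    # treated as a GPA when nested is set
    second_level_ptr: int = 0   # root of the GPA->HPA table, owned by the host


def shadow_context_entry(guest_entry, host_entry):
    """Return the host entry to program after a guest context invalidation."""
    if not guest_entry.present:
        # Assumption: fall back to second level only translation.
        return replace(host_entry, nested=False, pasid_table_ptr=0)
    # The PASID table pointer is copied verbatim because the hardware
    # interprets it as a GPA in nested mode.
    return replace(
        host_entry,
        present=True,
        nested=True,
        pasid_table_ptr=guest_entry.pasid_table_ptr,
    )


guest = ExtContextEntry(present=True, pasid_table_ptr=0x12345000)
host = ExtContextEntry(present=True, second_level_ptr=0xABCDE000)
print(shadow_context_entry(guest, host))
```

Whichever component ends up running this logic, its trigger is the guest's
context-cache invalidation, which caching mode guarantees will be issued.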

> 
> f) QEMU support for this feature
> The initial plan is to support devices assigned through the VFIO mechanism
> on the q35 machine type. A new QEMU option would be added. It would be
> "svm=on|off" and its default value would be off. A new IOCTL command would
> be added for the fds returned by KVM_CREATE_DEVICE. It would be used to
> create an IOMMU with SVM capability. The assigned device would be
> registered to KVM during guest boot so that KVM would be able to map the
> guest BDF to the real BDF. With this map, KVM would be able to distribute
> guest QI descriptors to the Invalidation Queues of different DMAR units.
> The assigned SVM-capable devices would be attached to DMAR0, which is also
> called the virtual remapping hardware in this design. This requires some
> modification to QEMU.

Please specify the exact modifications required in QEMU. Most of the above
description covers existing stuff.

Thanks
Kevin
