Re: [Xen-devel] Xen virtual IOMMU high level design doc
On 11/24/2016 9:37 PM, Edgar E. Iglesias wrote:
> On Thu, Nov 24, 2016 at 02:49:41PM +0800, Lan Tianyu wrote:
>> On 2016年11月24日 12:09, Edgar E. Iglesias wrote:
>>>>> Hi,
>>>>>
>>>>> I have a few questions.
>>>>>
>>>>> If I understand correctly, you'll be emulating an Intel IOMMU in Xen.
>>>>> So guests will essentially create intel iommu style page-tables.
>>>>>
>>>>> If we were to use this on Xen/ARM, we would likely be modelling an
>>>>> ARM SMMU as a vIOMMU. Since Xen on ARM does not use QEMU for
>>>>> emulation, the hypervisor OPs for QEMU's xen dummy IOMMU queries
>>>>> would not really be used. Do I understand this correctly?
>>>>
>>>> I think they could be called from the toolstack. This is why I was
>>>> saying in the other thread that the hypercalls should be general
>>>> enough that QEMU is not the only caller.
>>>>
>>>> For PVH and ARM guests, the toolstack should be able to set up the
>>>> vIOMMU on behalf of the guest without QEMU intervention.
>>>
>>> OK, I see. Or, I think I understand, not sure :-)
>>>
>>> In QEMU, when someone changes mappings in an IOMMU there will be a
>>> notifier to tell caches upstream that mappings have changed. I think
>>> we will need to prepare for that, i.e. when TCG CPUs sit behind an
>>> IOMMU.
>>
>> For the Xen side, we may notify the pIOMMU driver about mapping
>> changes by calling the pIOMMU driver's API in the vIOMMU.
>
> I was referring to the other way around. When a guest modifies the
> mappings for a vIOMMU, the driver domain with QEMU and vDevices needs
> to be notified.
>
> I couldn't find any mention of this in the document...

The Qemu side won't have an IOTLB cache; all DMA translation info is in
the hypervisor. All vDevice DMA requests are passed to the hypervisor,
the hypervisor returns the translated address, and then Qemu finally
finishes the DMA operation. There is a race condition between an IOTLB
invalidation operation and a vDevice's in-flight DMA. We proposed a
solution in "3.2 l2 translation - For virtual PCI device". We hope to
take advantage of the current ioreq mechanism to achieve something like
a notifier. Both the vIOMMU in the hypervisor and the dummy vIOMMU in
Qemu register the same MMIO region. When there is an invalidation MMIO
access and the hypervisor wants to notify Qemu, the vIOMMU's MMIO
handler returns X86EMUL_UNHANDLEABLE, and the IO emulation handler is
supposed to send an IO request to Qemu. The dummy vIOMMU in Qemu
receives the event and starts to drain in-flight DMA operations.

> Another area that may need change is that on ARM we need the map-query
> to return the memory attributes for the given mapping. Today QEMU or
> any emulator doesn't use it much but in the future things may change.
>
> What about the mem attributes? It's very likely we'll add support for
> memory attributes for IOMMUs in QEMU at some point. Emulated IOMMUs
> will thus have the ability to modify attributes (i.e. SourceIDs,
> cacheability, etc). Perhaps we could allocate or reserve a uint64_t
> for attributes TBD later in the query struct.

Sounds like you hope to extend the capability variable in the query
struct to uint64_t to support more future features, right? I have added
a "permission" variable in struct l2_translation to return the vIOMMU's
memory access permission for a vDevice's DMA request. Not sure whether
it can meet your requirement.

>>> For SVM, we will also need to deal with page-table faults by the
>>> IOMMU. So I think there will need to be a channel from Xen to the
>>> guest to report these.
>>
>> Yes, the vIOMMU should forward the page-fault event to the guest. For
>> the VT-d side, we will trigger VT-d's interrupt to notify the guest
>> about the event.
>
> OK, cool. Perhaps you should document how this (and the map/unmap
> notifiers) will work?

This is VT-d specific, dealing with some fault events just as some
other virtual device models emulate their interrupts, so I didn't put
it in this design document. For the mapping change, please see the
first comment.

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel
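[Editorial note: the ioreq-based invalidation flow described above can be sketched as a small state machine. This is a hedged illustration only; `X86EMUL_UNHANDLEABLE` mirrors Xen's emulation return codes, but the structures and helper names here are hypothetical, not the actual Xen or QEMU code.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define X86EMUL_OKAY          0
#define X86EMUL_UNHANDLEABLE  1

#define IVT_BIT (1u << 31)    /* invalidate-IOTLB bit (illustrative layout) */

struct viommu {
    uint32_t iotlb_reg;       /* guest-visible IOTLB invalidate register */
    bool     cache_flushed;   /* hypervisor-side IOTLB cache state */
};

/* Xen-side MMIO write handler: flush the vIOMMU cache, then report
 * "unhandleable" so the ioreq layer forwards the access to QEMU. */
static int viommu_mmio_write(struct viommu *v, uint32_t val)
{
    if (val & IVT_BIT) {
        v->iotlb_reg |= IVT_BIT;     /* invalidation now in flight */
        v->cache_flushed = true;     /* flush hypervisor-side cache */
        return X86EMUL_UNHANDLEABLE; /* punt to QEMU to drain DMA */
    }
    return X86EMUL_OKAY;
}

/* QEMU-side dummy vIOMMU: drain in-flight DMA for virtual devices, then
 * (via a hypercall in the real design) let the hypervisor clear IVT. */
static void qemu_drain_and_complete(struct viommu *v)
{
    /* ... drain in-flight DMA operations here ... */
    v->iotlb_reg &= ~IVT_BIT;        /* completion visible to the guest */
}
```

The point of the sketch is the division of labour: the hypervisor flushes its own translation state synchronously, while the slow drain of in-flight DMA is delegated to QEMU through the existing ioreq forwarding path.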
Re: [Xen-devel] Xen virtual IOMMU high level design doc
On November 24, 2016 9:38 PM, Edgar E. Iglesias wrote:
> On Thu, Nov 24, 2016 at 02:49:41PM +0800, Lan Tianyu wrote:
>> On 2016年11月24日 12:09, Edgar E. Iglesias wrote:
>>>>> Hi,
>>>>>
>>>>> I have a few questions.
>>>>>
>>>>> If I understand correctly, you'll be emulating an Intel IOMMU in Xen.
>>>>> So guests will essentially create intel iommu style page-tables.
>>>>>
>>>>> If we were to use this on Xen/ARM, we would likely be modelling an
>>>>> ARM SMMU as a vIOMMU. Since Xen on ARM does not use QEMU for
>>>>> emulation, the hypervisor OPs for QEMUs xen dummy IOMMU queries
>>>>> would not really be used. Do I understand this correctly?
>>>>
>>>> I think they could be called from the toolstack. This is why I was
>>>> saying in the other thread that the hypercalls should be general
>>>> enough that QEMU is not the only caller.
>>>>
>>>> For PVH and ARM guests, the toolstack should be able to setup the
>>>> vIOMMU on behalf of the guest without QEMU intervention.
>>>
>>> OK, I see. Or, I think I understand, not sure :-)
>>>
>>> In QEMU when someone changes mappings in an IOMMU there will be a
>>> notifier to tell caches upstream that mappings have changed. I think
>>> we will need to prepare for that, i.e. when TCG CPUs sit behind an
>>> IOMMU.
>>
>> For the Xen side, we may notify the pIOMMU driver about mapping
>> changes via calling the pIOMMU driver's API in the vIOMMU.
>
> I was referring to the other way around. When a guest modifies the
> mappings for a vIOMMU, the driver domain with QEMU and vDevices needs
> to be notified.
>
> I couldn't find any mention of this in the document...

Edgar,

As mentioned, it supports VFIO-based user space drivers (e.g. DPDK) in
the guest. I am afraid all of guest memory is pinned... Lan, right?

Quan
Re: [Xen-devel] Xen virtual IOMMU high level design doc
On Thu, Nov 24, 2016 at 02:49:41PM +0800, Lan Tianyu wrote:
> On 2016年11月24日 12:09, Edgar E. Iglesias wrote:
>>>> Hi,
>>>>
>>>> I have a few questions.
>>>>
>>>> If I understand correctly, you'll be emulating an Intel IOMMU in Xen.
>>>> So guests will essentially create intel iommu style page-tables.
>>>>
>>>> If we were to use this on Xen/ARM, we would likely be modelling an
>>>> ARM SMMU as a vIOMMU. Since Xen on ARM does not use QEMU for
>>>> emulation, the hypervisor OPs for QEMUs xen dummy IOMMU queries
>>>> would not really be used. Do I understand this correctly?
>>>
>>> I think they could be called from the toolstack. This is why I was
>>> saying in the other thread that the hypercalls should be general
>>> enough that QEMU is not the only caller.
>>>
>>> For PVH and ARM guests, the toolstack should be able to setup the
>>> vIOMMU on behalf of the guest without QEMU intervention.
>>
>> OK, I see. Or, I think I understand, not sure :-)
>>
>> In QEMU when someone changes mappings in an IOMMU there will be a
>> notifier to tell caches upstream that mappings have changed. I think
>> we will need to prepare for that, i.e. when TCG CPUs sit behind an
>> IOMMU.
>
> For the Xen side, we may notify the pIOMMU driver about mapping
> changes via calling the pIOMMU driver's API in the vIOMMU.

I was referring to the other way around. When a guest modifies the
mappings for a vIOMMU, the driver domain with QEMU and vDevices needs
to be notified.

I couldn't find any mention of this in the document...

>> Another area that may need change is that on ARM we need the
>> map-query to return the memory attributes for the given mapping.
>> Today QEMU or any emulator doesn't use it much but in the future
>> things may change.

What about the mem attributes? It's very likely we'll add support for
memory attributes for IOMMUs in QEMU at some point. Emulated IOMMUs
will thus have the ability to modify attributes (i.e. SourceIDs,
cacheability, etc). Perhaps we could allocate or reserve a uint64_t for
attributes TBD later in the query struct.

>> For SVM, we will also need to deal with page-table faults by the
>> IOMMU. So I think there will need to be a channel from Xen to the
>> guest to report these.
>
> Yes, the vIOMMU should forward the page-fault event to the guest. For
> the VT-d side, we will trigger VT-d's interrupt to notify the guest
> about the event.

OK, cool. Perhaps you should document how this (and the map/unmap
notifiers) will work?

I also think it would be a good idea to add a little more introduction
so that some of the questions we've been asking regarding the general
design are easier to grasp.

Best regards,
Edgar
Re: [Xen-devel] Xen virtual IOMMU high level design doc
On 2016年11月24日 12:09, Edgar E. Iglesias wrote:
>>> Hi,
>>>
>>> I have a few questions.
>>>
>>> If I understand correctly, you'll be emulating an Intel IOMMU in Xen.
>>> So guests will essentially create intel iommu style page-tables.
>>>
>>> If we were to use this on Xen/ARM, we would likely be modelling an
>>> ARM SMMU as a vIOMMU. Since Xen on ARM does not use QEMU for
>>> emulation, the hypervisor OPs for QEMUs xen dummy IOMMU queries
>>> would not really be used. Do I understand this correctly?
>>
>> I think they could be called from the toolstack. This is why I was
>> saying in the other thread that the hypercalls should be general
>> enough that QEMU is not the only caller.
>>
>> For PVH and ARM guests, the toolstack should be able to setup the
>> vIOMMU on behalf of the guest without QEMU intervention.
>
> OK, I see. Or, I think I understand, not sure :-)
>
> In QEMU when someone changes mappings in an IOMMU there will be a
> notifier to tell caches upstream that mappings have changed. I think
> we will need to prepare for that, i.e. when TCG CPUs sit behind an
> IOMMU.

For the Xen side, we may notify the pIOMMU driver about mapping changes
via calling the pIOMMU driver's API in the vIOMMU.

> Another area that may need change is that on ARM we need the map-query
> to return the memory attributes for the given mapping. Today QEMU or
> any emulator doesn't use it much but in the future things may change.
>
> For SVM, we will also need to deal with page-table faults by the
> IOMMU. So I think there will need to be a channel from Xen to the
> guest to report these.

Yes, the vIOMMU should forward the page-fault event to the guest. For
the VT-d side, we will trigger VT-d's interrupt to notify the guest
about the event.

> For example, what happens when a guest-assigned DMA unit page-faults?
> Xen needs to know how to forward this fault back to the guest for
> fixup, and the guest needs to be able to fix it and tell the device
> that it's OK to continue. E.g. PCI PRI or similar.

--
Best regards
Tianyu Lan
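[Editorial note: the fault-forwarding flow discussed above — the vIOMMU records the fault, injects the virtual VT-d fault interrupt, and the guest fixes the mapping so the device can retry, as PCI PRI would allow — can be modelled minimally as below. All structure and field names here are hypothetical and do not reflect the real VT-d fault-recording register layout.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct fault_record {
    uint64_t addr;        /* faulting IOVA */
    uint16_t source_id;   /* requester (BDF) that faulted */
    bool     valid;
};

struct vvtd {
    struct fault_record rec;   /* single-entry fault log for the sketch */
    bool irq_pending;          /* virtual fault-event interrupt */
};

/* vIOMMU side: log the fault and inject the fault-event interrupt. */
static void vvtd_report_fault(struct vvtd *v, uint64_t iova, uint16_t sid)
{
    v->rec = (struct fault_record){ .addr = iova, .source_id = sid,
                                    .valid = true };
    v->irq_pending = true;     /* would be an injected vector in Xen */
}

/* Guest side: service the interrupt, repair the mapping, clear the log. */
static bool guest_handle_fault(struct vvtd *v)
{
    if (!v->irq_pending || !v->rec.valid)
        return false;
    /* ... guest IOMMU driver repairs the page table for rec.addr ... */
    v->rec.valid = false;
    v->irq_pending = false;    /* device may now retry the DMA (PRI-like) */
    return true;
}
```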
Re: [Xen-devel] Xen virtual IOMMU high level design doc
On Thu, Nov 24, 2016 at 02:00:21AM +, Tian, Kevin wrote:
> > From: Stefano Stabellini [mailto:sstabell...@kernel.org]
> > Sent: Thursday, November 24, 2016 3:09 AM
> >
> > On Wed, 23 Nov 2016, Edgar E. Iglesias wrote:
> > > On Wed, Aug 17, 2016 at 08:05:51PM +0800, Lan, Tianyu wrote:
> > > > Hi All:
> > > > The following is our Xen vIOMMU high level design for detail
> > > > discussion. Please have a look. Very appreciate for your comments.
> > > > This design doesn't cover changes when root port is moved to
> > > > hypervisor. We may design it later.
> > >
> > > Hi,
> > >
> > > I have a few questions.
> > >
> > > If I understand correctly, you'll be emulating an Intel IOMMU in Xen.
> > > So guests will essentially create intel iommu style page-tables.
> > >
> > > If we were to use this on Xen/ARM, we would likely be modelling an
> > > ARM SMMU as a vIOMMU. Since Xen on ARM does not use QEMU for
> > > emulation, the hypervisor OPs for QEMUs xen dummy IOMMU queries
> > > would not really be used. Do I understand this correctly?
> >
> > I think they could be called from the toolstack. This is why I was
> > saying in the other thread that the hypercalls should be general
> > enough that QEMU is not the only caller.
> >
> > For PVH and ARM guests, the toolstack should be able to setup the
> > vIOMMU on behalf of the guest without QEMU intervention.

OK, I see. Or, I think I understand, not sure :-)

In QEMU when someone changes mappings in an IOMMU there will be a
notifier to tell caches upstream that mappings have changed. I think we
will need to prepare for that, i.e. when TCG CPUs sit behind an IOMMU.

Another area that may need change is that on ARM we need the map-query
to return the memory attributes for the given mapping. Today QEMU or
any emulator doesn't use it much but in the future things may change.

For SVM, we will also need to deal with page-table faults by the IOMMU.
So I think there will need to be a channel from Xen to the guest to
report these.

For example, what happens when a guest-assigned DMA unit page-faults?
Xen needs to know how to forward this fault back to the guest for
fixup, and the guest needs to be able to fix it and tell the device
that it's OK to continue. E.g. PCI PRI or similar.

> > > Has a platform agnostic PV-IOMMU been considered to support 2-stage
> > > translation (i.e. VFIO in the guest)? Perhaps that would hurt
> > > map/unmap performance too much?
> >
> > That's an interesting idea. I don't know if that's feasible, but if
> > it is not, then we need to be able to specify the PV-IOMMU type in
> > the hypercalls, so that you would get Intel IOMMU on x86 and SMMU on
> > ARM.
>
> Not considered yet. PV is always possible, as we've done for other I/O
> devices. Ideally it could be designed to be more efficient than full
> emulation of a vendor-specific IOMMU, but it also means the
> requirement of maintaining a new guest IOMMU driver and the limitation
> of supporting only newer guest OSes. It's a tradeoff... at least not
> compelling now (may consider when we see a real need in the future).

Agreed. Thanks.

Best regards,
Edgar
Re: [Xen-devel] Xen virtual IOMMU high level design doc V3
On 2016年11月22日 18:24, Jan Beulich wrote:
>>> On 17.11.16 at 16:36,wrote:
>> 2) Build ACPI DMAR table in toolstack
>> Now the tool stack can build the ACPI DMAR table according to the VM
>> configuration and pass it through to hvmloader via the xenstore ACPI
>> PT channel. But the vIOMMU MMIO region is managed by Qemu and it
>> needs to be populated into the DMAR table. We may hardcode an address
>> in both Qemu and the toolstack and use the same address to create the
>> vIOMMU and build the DMAR table.
>
> Let's try to avoid any new hard coding of values. Both tool stack
> and qemu ought to be able to retrieve a suitable address range
> from the hypervisor. Or if the tool stack was to allocate it, it could
> tell qemu.
>
> Jan

Hi Jan:

The address range is allocated by Qemu or the toolstack and passed to
the hypervisor when creating the vIOMMU. The vIOMMU's address range
should be under the PCI address space, so we need to reserve a piece of
the PCI region for the vIOMMU in the toolstack. Then we populate the
base address in the vDMAR table and tell Qemu the region via a new
xenstore interface if we want to create the vIOMMU in the Qemu dummy
hypercall wrapper.

Another point: I am not sure whether we can create/destroy the vIOMMU
directly in the toolstack, because virtual device models are usually
handled by Qemu. If yes, we don't need a new xenstore interface. In
this case, the dummy vIOMMU in Qemu will just cover L2 translation for
virtual devices.

--
Best regards
Tianyu Lan
Re: [Xen-devel] Xen virtual IOMMU high level design doc
> From: Stefano Stabellini [mailto:sstabell...@kernel.org] > Sent: Thursday, November 24, 2016 3:09 AM > > On Wed, 23 Nov 2016, Edgar E. Iglesias wrote: > > On Wed, Aug 17, 2016 at 08:05:51PM +0800, Lan, Tianyu wrote: > > > Hi All: > > > The following is our Xen vIOMMU high level design for detail > > > discussion. Please have a look. Very appreciate for your comments. > > > This design doesn't cover changes when root port is moved to hypervisor. > > > We may design it later. > > > > Hi, > > > > I have a few questions. > > > > If I understand correctly, you'll be emulating an Intel IOMMU in Xen. > > So guests will essentially create intel iommu style page-tables. > > > > If we were to use this on Xen/ARM, we would likely be modelling an ARM > > SMMU as a vIOMMU. Since Xen on ARM does not use QEMU for emulation, the > > hypervisor OPs for QEMUs xen dummy IOMMU queries would not really be used. > > Do I understand this correctly? > > I think they could be called from the toolstack. This is why I was > saying in the other thread that the hypercalls should be general enough > that QEMU is not the only caller. > > For PVH and ARM guests, the toolstack should be able to setup the vIOMMU > on behalf of the guest without QEMU intervention. > > > > Has a platform agnostic PV-IOMMU been considered to support 2-stage > > translation (i.e VFIO in the guest)? Perhaps that would hurt map/unmap > > performance too much? > > That's an interesting idea. I don't know if that's feasible, but if it > is not, then we need to be able to specify the PV-IOMMU type in the > hypercalls, so that you would get Intel IOMMU on x86 and SMMU on ARM. > > Not considered yet. PV is always possible as we've done for other I/O devices. Ideally it could be designed being more efficient than full emulation of vendor specific IOMMU, but also means requirement of maintaining a new guest IOMMU driver and limitation of supporting only newer version guest OSes. It's a tradeoff... 
at least not compelling now (may consider when we see a real need in
the future).

Thanks,
Kevin
Re: [Xen-devel] Xen virtual IOMMU high level design doc
On Wed, 23 Nov 2016, Edgar E. Iglesias wrote: > On Wed, Aug 17, 2016 at 08:05:51PM +0800, Lan, Tianyu wrote: > > Hi All: > > The following is our Xen vIOMMU high level design for detail > > discussion. Please have a look. Very appreciate for your comments. > > This design doesn't cover changes when root port is moved to hypervisor. > > We may design it later. > > Hi, > > I have a few questions. > > If I understand correctly, you'll be emulating an Intel IOMMU in Xen. > So guests will essentially create intel iommu style page-tables. > > If we were to use this on Xen/ARM, we would likely be modelling an ARM > SMMU as a vIOMMU. Since Xen on ARM does not use QEMU for emulation, the > hypervisor OPs for QEMUs xen dummy IOMMU queries would not really be used. > Do I understand this correctly? I think they could be called from the toolstack. This is why I was saying in the other thread that the hypercalls should be general enough that QEMU is not the only caller. For PVH and ARM guests, the toolstack should be able to setup the vIOMMU on behalf of the guest without QEMU intervention. > Has a platform agnostic PV-IOMMU been considered to support 2-stage > translation (i.e VFIO in the guest)? Perhaps that would hurt map/unmap > performance too much? That's an interesting idea. I don't know if that's feasible, but if it is not, then we need to be able to specify the PV-IOMMU type in the hypercalls, so that you would get Intel IOMMU on x86 and SMMU on ARM. > > > > > > Content: > > === > > 1. Motivation of vIOMMU > > 1.1 Enable more than 255 vcpus > > 1.2 Support VFIO-based user space driver > > 1.3 Support guest Shared Virtual Memory (SVM) > > 2. Xen vIOMMU Architecture > > 2.1 2th level translation overview > > 2.2 Interrupt remapping overview > > 3. Xen hypervisor > > 3.1 New vIOMMU hypercall interface > > 3.2 2nd level translation > > 3.3 Interrupt remapping > > 3.4 1st level translation > > 3.5 Implementation consideration > > 4. 
Qemu > > 4.1 Qemu vIOMMU framework > > 4.2 Dummy xen-vIOMMU driver > > 4.3 Q35 vs. i440x > > 4.4 Report vIOMMU to hvmloader > > > > > > 1 Motivation for Xen vIOMMU > > === > > 1.1 Enable more than 255 vcpu support > > HPC virtualization requires more than 255 vcpus support in a single VM > > to meet parallel computing requirement. More than 255 vcpus support > > requires interrupt remapping capability present on vIOMMU to deliver > > interrupt to #vcpu >255 Otherwise Linux guest fails to boot up with >255 > > vcpus if interrupt remapping is absent. > > > > > > 1.2 Support VFIO-based user space driver (e.g. DPDK) in the guest > > It relies on the 2nd level translation capability (IOVA->GPA) on > > vIOMMU. pIOMMU 2nd level becomes a shadowing structure of > > vIOMMU to isolate DMA requests initiated by user space driver. > > > > > > 1.3 Support guest SVM (Shared Virtual Memory) > > It relies on the 1st level translation table capability (GVA->GPA) on > > vIOMMU. pIOMMU needs to enable both 1st level and 2nd level translation > > in nested mode (GVA->GPA->HPA) for passthrough device. IGD passthrough > > is the main usage today (to support OpenCL 2.0 SVM feature). In the > > future SVM might be used by other I/O devices too. > > > > 2. Xen vIOMMU Architecture > > > > > > * vIOMMU will be inside Xen hypervisor for following factors > > 1) Avoid round trips between Qemu and Xen hypervisor > > 2) Ease of integration with the rest of the hypervisor > > 3) HVMlite/PVH doesn't use Qemu > > * Dummy xen-vIOMMU in Qemu as a wrapper of new hypercall to create > > /destory vIOMMU in hypervisor and deal with virtual PCI device's 2th > > level translation. > > > > 2.1 2th level translation overview > > For Virtual PCI device, dummy xen-vIOMMU does translation in the > > Qemu via new hypercall. > > > > For physical PCI device, vIOMMU in hypervisor shadows IO page table from > > IOVA->GPA to IOVA->HPA and load page table to physical IOMMU. 
> > > > The following diagram shows 2th level translation architecture. > > +-+ > > |Qemu++ | > > || Virtual| | > > || PCI device | | > > ||| | > > |++ | > > ||DMA | > > |V| > > | ++ Request ++ | > > | |+<---+| | > > | | Dummy xen vIOMMU | Target GPA |
Re: [Xen-devel] Xen virtual IOMMU high level design doc
On Wed, Aug 17, 2016 at 08:05:51PM +0800, Lan, Tianyu wrote: > Hi All: > The following is our Xen vIOMMU high level design for detail > discussion. Please have a look. Very appreciate for your comments. > This design doesn't cover changes when root port is moved to hypervisor. > We may design it later. Hi, I have a few questions. If I understand correctly, you'll be emulating an Intel IOMMU in Xen. So guests will essentially create intel iommu style page-tables. If we were to use this on Xen/ARM, we would likely be modelling an ARM SMMU as a vIOMMU. Since Xen on ARM does not use QEMU for emulation, the hypervisor OPs for QEMUs xen dummy IOMMU queries would not really be used. Do I understand this correctly? Has a platform agnostic PV-IOMMU been considered to support 2-stage translation (i.e VFIO in the guest)? Perhaps that would hurt map/unmap performance too much? Best regards, Edgar > > > Content: > === > 1. Motivation of vIOMMU > 1.1 Enable more than 255 vcpus > 1.2 Support VFIO-based user space driver > 1.3 Support guest Shared Virtual Memory (SVM) > 2. Xen vIOMMU Architecture > 2.1 2th level translation overview > 2.2 Interrupt remapping overview > 3. Xen hypervisor > 3.1 New vIOMMU hypercall interface > 3.2 2nd level translation > 3.3 Interrupt remapping > 3.4 1st level translation > 3.5 Implementation consideration > 4. Qemu > 4.1 Qemu vIOMMU framework > 4.2 Dummy xen-vIOMMU driver > 4.3 Q35 vs. i440x > 4.4 Report vIOMMU to hvmloader > > > 1 Motivation for Xen vIOMMU > === > 1.1 Enable more than 255 vcpu support > HPC virtualization requires more than 255 vcpus support in a single VM > to meet parallel computing requirement. More than 255 vcpus support > requires interrupt remapping capability present on vIOMMU to deliver > interrupt to #vcpu >255 Otherwise Linux guest fails to boot up with >255 > vcpus if interrupt remapping is absent. > > > 1.2 Support VFIO-based user space driver (e.g. 
DPDK) in the guest > It relies on the 2nd level translation capability (IOVA->GPA) on > vIOMMU. pIOMMU 2nd level becomes a shadowing structure of > vIOMMU to isolate DMA requests initiated by user space driver. > > > 1.3 Support guest SVM (Shared Virtual Memory) > It relies on the 1st level translation table capability (GVA->GPA) on > vIOMMU. pIOMMU needs to enable both 1st level and 2nd level translation > in nested mode (GVA->GPA->HPA) for passthrough device. IGD passthrough > is the main usage today (to support OpenCL 2.0 SVM feature). In the > future SVM might be used by other I/O devices too. > > 2. Xen vIOMMU Architecture > > > * vIOMMU will be inside Xen hypervisor for following factors > 1) Avoid round trips between Qemu and Xen hypervisor > 2) Ease of integration with the rest of the hypervisor > 3) HVMlite/PVH doesn't use Qemu > * Dummy xen-vIOMMU in Qemu as a wrapper of new hypercall to create > /destory vIOMMU in hypervisor and deal with virtual PCI device's 2th > level translation. > > 2.1 2th level translation overview > For Virtual PCI device, dummy xen-vIOMMU does translation in the > Qemu via new hypercall. > > For physical PCI device, vIOMMU in hypervisor shadows IO page table from > IOVA->GPA to IOVA->HPA and load page table to physical IOMMU. > > The following diagram shows 2th level translation architecture. > +-+ > |Qemu++ | > || Virtual| | > || PCI device | | > ||| | > |++ | > ||DMA | > |V| > | ++ Request ++ | > | |+<---+| | > | | Dummy xen vIOMMU | Target GPA | Memory region | | > | |+--->+| | > | +-+--++---++ | > || || > ||Hypercall || > +++ > |Hypervisor | || > || || > |v || > | +--+--+|| > | | vIOMMU||| > | +--+--+|
Re: [Xen-devel] Xen virtual IOMMU high level design doc V3
On 2016年11月21日 15:05, Tian, Kevin wrote:
>> If someone adds the "intel_iommu=on" kernel parameter manually, the
>> IOMMU driver will panic the guest because it can't enable the DMA
>> remapping function via the gcmd register, and the "Translation Enable
>> Status" bit in the gsts register is never set by the vIOMMU. This
>> shows the actual vIOMMU status — that there is no l2 translation
>> support — and warns the user not to enable l2 translation.
>
> The rationale of section 3.5 is confusing. Do you mean sth. like below?
>
> - We can first do IRQ remapping, because DMA remapping (l1/l2) and
> IRQ remapping can be enabled separately according to the VT-d spec.
> Enabling of DMA remapping will first be emulated as a failure, which
> may lead to a guest kernel panic if intel_iommu is turned on in the
> guest. But it's not a big problem because major distributions have DMA
> remapping disabled by default while IRQ remapping is enabled.
>
> - For DMA remapping, likely you'll enable L2 translation first (there
> is no capability bit) with L1 translation disabled (there is an SVM
> capability bit).
>
> If yes, maybe we can break this design into 3 parts too, so both
> design review and implementation can move forward step by step?

Yes, we may implement IRQ remapping first. I will break this design
into 3 parts (interrupt remapping, L2 translation and L1 translation).
IRQ remapping will be the first one sent out for detailed discussion.

--
Best regards
Tianyu Lan
Re: [Xen-devel] Xen virtual IOMMU high level design doc V3
>>> On 17.11.16 at 16:36,wrote:
> 2) Build ACPI DMAR table in toolstack
> Now the tool stack can build the ACPI DMAR table according to the VM
> configuration and pass it through to hvmloader via the xenstore ACPI
> PT channel. But the vIOMMU MMIO region is managed by Qemu and it needs
> to be populated into the DMAR table. We may hardcode an address in
> both Qemu and the toolstack and use the same address to create the
> vIOMMU and build the DMAR table.

Let's try to avoid any new hard coding of values. Both tool stack and
qemu ought to be able to retrieve a suitable address range from the
hypervisor. Or if the tool stack was to allocate it, it could tell
qemu.

Jan
Re: [Xen-devel] Xen virtual IOMMU high level design doc V3
On 2016年11月21日 21:41, Andrew Cooper wrote:
> On 17/11/16 15:36, Lan Tianyu wrote:
>> 3.2 l2 translation
>> 1) For virtual PCI device
>> The dummy xen-vIOMMU in Qemu translates IOVA to target GPA via a new
>> hypercall when a DMA operation happens.
>>
>> When a guest triggers an invalidation operation, there may be
>> in-flight DMA requests for virtual devices that have already been
>> translated by the vIOMMU and returned back to Qemu. Before the
>> vIOMMU reports the invalidation as completed, it's necessary to make
>> sure such in-flight DMA operations are completed.
>>
>> When the IOMMU driver invalidates the IOTLB, it also waits until the
>> invalidation completes. We may use this to drain in-flight DMA
>> operations for virtual devices.
>>
>> The guest triggers an invalidation operation and traps into the
>> vIOMMU in the hypervisor to flush cache data. After this, it should
>> go to Qemu to drain in-flight DMA translations.
>>
>> To do that, the dummy vIOMMU in Qemu registers the same MMIO region
>> as the vIOMMU's, and the emulation part of the invalidation
>> operation in the Xen hypervisor returns X86EMUL_UNHANDLEABLE after
>> flushing the cache. The MMIO emulation part is supposed to send an
>> event to Qemu, and the dummy vIOMMU gets a chance to start a thread
>> to drain in-flight DMA and return emulation done.
>>
>> The guest polls the IVT (invalidate IOTLB) bit in the IOTLB
>> invalidate register until it's cleared after triggering the
>> invalidation. The dummy vIOMMU in Qemu notifies the hypervisor via
>> hypercall that the drain operation completed, the vIOMMU clears the
>> IVT bit, and the guest finishes the invalidation operation.
>
> Having the guest poll will be very inefficient. If the invalidation
> does need to reach qemu, it will be a very long time until it
> completes. Is there no interrupt based mechanism which can be used?
> That way the guest can either handle it asynchronously itself, or
> block waiting on an interrupt, both of which are better than having
> it just spinning.

Hi Andrew:

VT-d provides an interrupt event for queued invalidation completion, so
the guest can select poll or interrupt mode to wait for invalidation
completion. I found the Linux Intel IOMMU driver just uses poll mode,
so I used it as an example. Regardless of poll or interrupt mode, the
guest will wait for invalidation completion, and we just need to make
sure to finish draining in-flight DMA before clearing the invalidation
completion bit.

--
Best regards
Tianyu Lan
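[Editorial note: the guest-visible completion protocol described above — guest sets the IVT bit and polls until the hypervisor clears it, which happens only after QEMU reports that in-flight DMA is drained — can be sketched as below. This is an illustrative simulation, not the real VT-d register layout; `backend_step` stands in for concurrent hypervisor/QEMU progress.]

```c
#include <assert.h>
#include <stdint.h>

#define IVT_BIT (1ull << 63)   /* invalidate-IOTLB bit (illustrative) */

struct iotlb_state {
    uint64_t inv_reg;      /* guest-visible IOTLB invalidate register */
    int dma_in_flight;     /* DMA ops QEMU still has to drain */
};

/* One step of backend progress: QEMU drains one in-flight DMA op; once
 * everything is drained, the hypervisor clears IVT to signal completion. */
static void backend_step(struct iotlb_state *s)
{
    if (s->dma_in_flight > 0)
        s->dma_in_flight--;
    else
        s->inv_reg &= ~IVT_BIT;
}

/* Guest side: trigger the invalidation, then poll IVT until it clears
 * (the polling style the Linux VT-d driver uses, per the discussion). */
static int guest_invalidate(struct iotlb_state *s)
{
    int polls = 0;
    s->inv_reg |= IVT_BIT;
    while (s->inv_reg & IVT_BIT) {
        backend_step(s);   /* stands in for real concurrent progress */
        polls++;
    }
    return polls;
}
```

The invariant Andrew's question is probing sits in `backend_step`: IVT is never cleared while `dma_in_flight` is nonzero, so the guest cannot observe completion before the drain finishes, whether it polls or sleeps on an interrupt.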
Re: [Xen-devel] Xen virtual IOMMU high level design doc V3
> From: Andrew Cooper [mailto:andrew.coop...@citrix.com] > Sent: Monday, November 21, 2016 9:41 PM > > On 17/11/16 15:36, Lan Tianyu wrote: > > 3.2 l2 translation > > 1) For virtual PCI device > > Xen dummy xen-vIOMMU in Qemu translates IOVA to target GPA via new > > hypercall when DMA operation happens. > > > > When guest triggers a invalidation operation, there maybe in-fly DMA > > request for virtual device has been translated by vIOMMU and return back > > Qemu. Before vIOMMU tells invalidation completed, it's necessary to make > > sure in-fly DMA operation is completed. > > > > When IOMMU driver invalidates IOTLB, it also will wait until the > > invalidation completion. We may use this to drain in-fly DMA operation > > for virtual device. > > > > Guest triggers invalidation operation and trip into vIOMMU in > > hypervisor to flush cache data. After this, it should go to Qemu to > > drain in-fly DMA translation. > > > > To do that, dummy vIOMMU in Qemu registers the same MMIO region as > > vIOMMU's and emulation part of invalidation operation in Xen hypervisor > > returns X86EMUL_UNHANDLEABLE after flush cache. MMIO emulation part is > > supposed to send event to Qemu and dummy vIOMMU get a chance to starts a > > thread to drain in-fly DMA and return emulation done. > > > > Guest polls IVT(invalidate IOTLB) bit in the IOTLB invalidate register > > until it's cleared after triggering invalidation. Dummy vIOMMU in Qemu > > notifies hypervisor drain operation completed via hypercall, vIOMMU > > clears IVT bit and guest finish invalidation operation. > > Having the guest poll will be very inefficient. If the invalidation > does need to reach qemu, it will be a very long time until it > completes. Is there no interrupt based mechanism which can be used? > That way the guest can either handle it asynchronous itself, or block > waiting on an interrupt, both of which are better than having it just > spinning. 
> VT-d spec supports both poll and interrupt modes, and it's decided by the guest IOMMU driver. That's not to say this design requires the guest to use poll mode; I guess Tianyu used it as an example flow. Thanks Kevin
Re: [Xen-devel] Xen virtual IOMMU high level design doc V3
On Mon, 21 Nov 2016, Julien Grall wrote: > On 21/11/2016 02:21, Lan, Tianyu wrote: > > On 11/19/2016 3:43 AM, Julien Grall wrote: > > > On 17/11/2016 09:36, Lan Tianyu wrote: > > Hi Julien: > > Hello Lan, > > > Thanks for your input. This interface is just for virtual PCI device > > which is called by Qemu. I am not familiar with ARM. Are there any > > non-PCI emulated devices for arm in Qemu which need to be covered by > > vIOMMU? > > We don't use QEMU on ARM so far, so I guess it should be ok for now. ARM > guests are very similar to hvmlite/pvh. I got confused and thought this design > document was targeting pvh too. > > BTW, in the design document you mention hvmlite/pvh. Does it mean you plan to > bring support of vIOMMU for those guests later on? I quickly went through the document. I don't think we should restrict the design to only one caller: QEMU. In fact it looks like those hypercalls, without any modifications, could be called from the toolstack (xl/libxl) in the case of PVH guests. In other words, PVH guests might work without any additional effort on the hypervisor side. And they might even work on ARM. I have a couple of suggestions to make the hypercalls a bit more "future proof" and architecture agnostic. Imagine a future where two vIOMMU versions are supported. We could have a uint32_t iommu_version field to identify what version of IOMMU we are creating (create_iommu and query_capabilities commands). This could be useful even on Intel platforms. Given that in the future we might support a vIOMMU that takes IDs other than sbdf as input, I would change "u32 vsbdf" into the following:

    #define XENVIOMMUSPACE_vsbdf 0
    uint16_t space;
    uint64_t id;
Re: [Xen-devel] Xen virtual IOMMU high level design doc V3
On 17/11/16 15:36, Lan Tianyu wrote: > 3.2 l2 translation > 1) For virtual PCI device > The dummy xen-vIOMMU in Qemu translates IOVA to target GPA via a new > hypercall when a DMA operation happens. > > When the guest triggers an invalidation operation, there may be in-flight DMA > requests for the virtual device that have already been translated by the vIOMMU and returned back to > Qemu. Before the vIOMMU reports the invalidation as complete, it's necessary to make > sure those in-flight DMA operations have completed. > > When the IOMMU driver invalidates the IOTLB, it also waits for > invalidation completion. We may use this to drain in-flight DMA operations > for the virtual device. > > The guest triggers an invalidation operation and traps into the vIOMMU in the > hypervisor to flush cached data. After this, it should go to Qemu to > drain in-flight DMA translations. > > To do that, the dummy vIOMMU in Qemu registers the same MMIO region as > the vIOMMU's, and the emulation part of the invalidation operation in the Xen hypervisor > returns X86EMUL_UNHANDLEABLE after flushing the cache. The MMIO emulation part is > supposed to send an event to Qemu so the dummy vIOMMU gets a chance to start a > thread to drain in-flight DMA and then report the emulation as done. > > The guest polls the IVT (invalidate IOTLB) bit in the IOTLB invalidate register > until it's cleared after triggering the invalidation. The dummy vIOMMU in Qemu > notifies the hypervisor via hypercall that the drain operation has completed, the vIOMMU > clears the IVT bit, and the guest finishes the invalidation operation. Having the guest poll will be very inefficient. If the invalidation does need to reach qemu, it will be a very long time until it completes. Is there no interrupt-based mechanism which can be used? That way the guest can either handle it asynchronously itself, or block waiting on an interrupt, both of which are better than having it just spin. ~Andrew
Re: [Xen-devel] Xen virtual IOMMU high level design doc V3
On 21/11/2016 02:21, Lan, Tianyu wrote: On 11/19/2016 3:43 AM, Julien Grall wrote: On 17/11/2016 09:36, Lan Tianyu wrote: Hi Julien: Hello Lan, Thanks for your input. This interface is just for virtual PCI device which is called by Qemu. I am not familiar with ARM. Are there any non-PCI emulated devices for arm in Qemu which need to be covered by vIOMMU? We don't use QEMU on ARM so far, so I guess it should be ok for now. ARM guests are very similar to hvmlite/pvh. I got confused and thought this design document was targeting pvh too. BTW, in the design document you mention hvmlite/pvh. Does it mean you plan to bring support of vIOMMU for those guests later on? Regards, -- Julien Grall
Re: [Xen-devel] Xen virtual IOMMU high level design doc V3
> From: Lan, Tianyu > Sent: Thursday, November 17, 2016 11:37 PM > > Change since V2: > 1) Update motivation for Xen vIOMMU - 288 vcpus support part > Add descriptor about plan of increasing vcpu from 128 to 255 and > dependency between X2APIC and interrupt remapping. > 2) Update 3.1 New vIOMMU hypercall interface > Change vIOMMU hypercall from sysctl to dmop, add multi vIOMMU > consideration consideration and drain in-fly DMA subcommand > 3) Update 3.5 implementation consideration > We found it's still safe to enable interrupt remapping function before > adding l2 translation(DMA translation) to increase vcpu number >255. > 4) Update 3.2 l2 translation - virtual device part > Add proposal to deal with race between in-fly DMA and invalidation > operation in hypervisor. > 5) Update 4.4 Report vIOMMU to hvmloader > Add option of building ACPI DMAR table in the toolstack for discussion. > > Change since V1: > 1) Update motivation for Xen vIOMMU - 288 vcpus support part > 2) Change definition of struct xen_sysctl_viommu_op > 3) Update "3.5 Implementation consideration" to explain why we needs to > enable l2 translation first. > 4) Update "4.3 Q35 vs I440x" - Linux/Windows VTD drivers can work on > the emulated I440 chipset. > 5) Remove stale statement in the "3.3 Interrupt remapping" > > Content: > = > == > 1. Motivation of vIOMMU > 1.1 Enable more than 255 vcpus > 1.2 Support VFIO-based user space driver > 1.3 Support guest Shared Virtual Memory (SVM) > 2. Xen vIOMMU Architecture > 2.1 l2 translation overview L2/L1 might be more readable than l2/l1. :-) > 2.2 Interrupt remapping overview to be complete, need an overview of l1 translation here > 3. Xen hypervisor > 3.1 New vIOMMU hypercall interface > 3.2 l2 translation > 3.3 Interrupt remapping > 3.4 l1 translation > 3.5 Implementation consideration > 4. Qemu > 4.1 Qemu vIOMMU framework > 4.2 Dummy xen-vIOMMU driver > 4.3 Q35 vs. 
i440x > 4.4 Report vIOMMU to hvmloader > > > Glossary: > = > === > l1 translation - first-level translation to remap a virtual address to > intermediate (guest) physical address. (GVA->GPA) > l2 translation - second-level translations to remap a intermediate > physical address to machine (host) physical address. (GPA->HPA) If a glossary section is required, please make it complete (interrupt remapping, DMAR, etc.). Also please stick to what the spec says. I don't think 'intermediate' physical address is a widely-used term, and GVA->GPA/GPA->HPA are only partial usages of those structures. You may make them an example, but be careful with the definition. > > 1 Motivation for Xen vIOMMU > = > === > 1.1 Enable more than 255 vcpu support vcpu->vcpus > HPC cloud service requires VM provides high performance parallel > computing and we hope to create a huge VM with >255 vcpu on one machine > to meet such requirement. Pin each vcpu to separate pcpus. > > Now HVM guest can support 128 vcpus at most. We can increase vcpu number > from 128 to 255 via changing some limitations and extending vcpu related > data structure. This also needs to change the rule of allocating vcpu's > APIC ID. Current rule is "(APIC ID) = (vcpu index) * 2". We need to > change it to "(APIC ID) = (vcpu index)". Andrew Cooper's CPUID > improvement work will cover this to improve guest's cpu topology. We > will base on this to increase vcpu number from 128 to 255. > > To support >255 vcpus, X2APIC mode in guest is necessary because legacy > APIC(XAPIC) just supports 8-bit APIC ID and it only can support 255 > vcpus at most. X2APIC mode supports 32-bit APIC ID and it requires > interrupt mapping function of vIOMMU. > > The reason for this is that there is no modification to existing PCI MSI > and IOAPIC with the introduction of X2APIC. PCI MSI/IOAPIC can only send > interrupt message containing 8-bit APIC ID, which cannot address >255 > cpus.
Interrupt remapping supports 32-bit APIC ID and so it's necessary > to enable >255 cpus with x2apic mode. > > Both Linux and Windows requires interrupt remapping when cpu number is >255. > > > 1.2 Support VFIO-based user space driver (e.g. DPDK) in the guest > It relies on the l2 translation capability (IOVA->GPA) on GIOVA->GPA to be consistent > vIOMMU. pIOMMU l2 becomes a shadowing structure of > vIOMMU to isolate DMA requests initiated by user space driver. > You may give more background of how VFIO manages user space driver to make whole picture clearer, like what you did for >255 vcpus support. > > > 1.3 Support guest SVM (Shared Virtual Memory) > It relies on the l1 translation table capability (GVA->GPA) on >
Re: [Xen-devel] Xen virtual IOMMU high level design doc V3
On 11/19/2016 3:43 AM, Julien Grall wrote: Hi Lan, On 17/11/2016 09:36, Lan Tianyu wrote: 1) Definition of "struct xen_dmop_viommu_op" as new hypercall parameter.

    struct xen_dmop_viommu_op {
        u32 cmd;
        u32 domid;
        u32 viommu_id;
        union {
            struct {
                u32 capabilities;
            } query_capabilities;
            struct {
                /* IN parameters. */
                u32 capabilities;
                u64 base_address;
                struct {
                    u32 size;
                    XEN_GUEST_HANDLE_64(uint32) dev_list;
                } dev_scope;
                /* Out parameters. */
                u32 viommu_id;
            } create_iommu;
            struct {
                /* IN parameters. */
                u32 vsbdf;

I only gave a quick look through this design document. The new hypercalls look arch/device agnostic except this part. Having a virtual IOMMU on Xen ARM is something we might consider in the future. In the case of ARM, a device can either be a PCI device or an integrated device. The latter does not have an sbdf. The IOMMU will usually be configured with a stream ID (SID) that can be deduced from the sbdf, and is hardcoded for an integrated device. So I would rather not tie the interface to PCI and use a more generic name for this field. Maybe vdevid, which can then be architecture specific. Hi Julien: Thanks for your input. This interface is just for virtual PCI devices and is called by Qemu. I am not familiar with ARM. Are there any non-PCI emulated devices for ARM in Qemu which need to be covered by vIOMMU?
Re: [Xen-devel] Xen virtual IOMMU high level design doc V3
Hi Lan, On 17/11/2016 09:36, Lan Tianyu wrote: 1) Definition of "struct xen_dmop_viommu_op" as new hypercall parameter.

    struct xen_dmop_viommu_op {
        u32 cmd;
        u32 domid;
        u32 viommu_id;
        union {
            struct {
                u32 capabilities;
            } query_capabilities;
            struct {
                /* IN parameters. */
                u32 capabilities;
                u64 base_address;
                struct {
                    u32 size;
                    XEN_GUEST_HANDLE_64(uint32) dev_list;
                } dev_scope;
                /* Out parameters. */
                u32 viommu_id;
            } create_iommu;
            struct {
                /* IN parameters. */
                u32 vsbdf;

I only gave a quick look through this design document. The new hypercalls look arch/device agnostic except this part. Having a virtual IOMMU on Xen ARM is something we might consider in the future. In the case of ARM, a device can either be a PCI device or an integrated device. The latter does not have an sbdf. The IOMMU will usually be configured with a stream ID (SID) that can be deduced from the sbdf, and is hardcoded for an integrated device. So I would rather not tie the interface to PCI and use a more generic name for this field. Maybe vdevid, which can then be architecture specific. Regards, -- Julien Grall
[Xen-devel] Xen virtual IOMMU high level design doc V3
Change since V2:
1) Update motivation for Xen vIOMMU - 288 vcpus support part. Add description of the plan to increase vcpus from 128 to 255 and the dependency between X2APIC and interrupt remapping.
2) Update 3.1 New vIOMMU hypercall interface. Change the vIOMMU hypercall from sysctl to dmop, add multi-vIOMMU consideration and a drain in-flight DMA subcommand.
3) Update 3.5 Implementation consideration. We found it's still safe to enable the interrupt remapping function before adding l2 translation (DMA translation) to increase the vcpu number >255.
4) Update 3.2 l2 translation - virtual device part. Add a proposal to deal with the race between in-flight DMA and invalidation operations in the hypervisor.
5) Update 4.4 Report vIOMMU to hvmloader. Add the option of building the ACPI DMAR table in the toolstack, for discussion.

Change since V1:
1) Update motivation for Xen vIOMMU - 288 vcpus support part
2) Change definition of struct xen_sysctl_viommu_op
3) Update "3.5 Implementation consideration" to explain why we need to enable l2 translation first.
4) Update "4.3 Q35 vs I440x" - Linux/Windows VT-d drivers can work on the emulated I440 chipset.
5) Remove stale statement in the "3.3 Interrupt remapping"

Content:
===
1. Motivation of vIOMMU
1.1 Enable more than 255 vcpus
1.2 Support VFIO-based user space driver
1.3 Support guest Shared Virtual Memory (SVM)
2. Xen vIOMMU Architecture
2.1 l2 translation overview
2.2 Interrupt remapping overview
3. Xen hypervisor
3.1 New vIOMMU hypercall interface
3.2 l2 translation
3.3 Interrupt remapping
3.4 l1 translation
3.5 Implementation consideration
4. Qemu
4.1 Qemu vIOMMU framework
4.2 Dummy xen-vIOMMU driver
4.3 Q35 vs. i440x
4.4 Report vIOMMU to hvmloader

Glossary:
l1 translation - first-level translation, remapping a virtual address to an intermediate (guest) physical address. (GVA->GPA)
l2 translation - second-level translation, remapping an intermediate physical address to a machine (host) physical address.
(GPA->HPA)

1 Motivation for Xen vIOMMU

1.1 Enable more than 255 vcpus support
HPC cloud service requires that a VM provide high-performance parallel computing, and we hope to create a huge VM with >255 vcpus on one machine to meet such a requirement. Pin each vcpu to separate pcpus.

Now an HVM guest can support 128 vcpus at most. We can increase the vcpu number from 128 to 255 by changing some limitations and extending vcpu-related data structures. This also needs a change to the rule for allocating a vcpu's APIC ID. The current rule is "(APIC ID) = (vcpu index) * 2"; we need to change it to "(APIC ID) = (vcpu index)". Andrew Cooper's CPUID improvement work will cover this to improve the guest's cpu topology. We will build on this to increase the vcpu number from 128 to 255.

To support >255 vcpus, X2APIC mode in the guest is necessary because the legacy APIC (XAPIC) supports only 8-bit APIC IDs and so can support 255 vcpus at most. X2APIC mode supports 32-bit APIC IDs, and it requires the interrupt remapping function of the vIOMMU.

The reason for this is that there is no modification to existing PCI MSI and IOAPIC with the introduction of X2APIC. PCI MSI/IOAPIC can only send interrupt messages containing an 8-bit APIC ID, which cannot address >255 cpus. Interrupt remapping supports 32-bit APIC IDs, so it's necessary to enable >255 cpus with x2apic mode.

Both Linux and Windows require interrupt remapping when the cpu number is >255.

1.2 Support VFIO-based user space driver (e.g. DPDK) in the guest
It relies on the l2 translation capability (IOVA->GPA) of the vIOMMU. The pIOMMU l2 becomes a shadowing structure of the vIOMMU to isolate DMA requests initiated by the user space driver.

1.3 Support guest SVM (Shared Virtual Memory)
It relies on the l1 translation table capability (GVA->GPA) of the vIOMMU. The pIOMMU needs to enable both l1 and l2 translation in nested mode (GVA->GPA->HPA) for the passthrough device. IGD passthrough is the main usage today (to support the OpenCL 2.0 SVM feature). In the future SVM might be used by other I/O devices too.

2.
Xen vIOMMU Architecture

* vIOMMU will be inside the Xen hypervisor for the following reasons:
1) Avoid round trips between Qemu and the Xen hypervisor
2) Ease of integration with the rest of the hypervisor
3) HVMlite/PVH doesn't use Qemu

* Dummy xen-vIOMMU in Qemu acts as a wrapper of the new hypercall to create/destroy the vIOMMU in the hypervisor and deal with the virtual PCI device's l2 translation.

2.1 l2 translation overview
For a virtual PCI device, the dummy xen-vIOMMU does translation in Qemu via new
Re: [Xen-devel] Xen virtual IOMMU high level design doc V2
On 10/26/2016 5:39 PM, Jan Beulich wrote: On 22.10.16 at 09:32, wrote: On 10/21/2016 4:36 AM, Andrew Cooper wrote: 3.5 Implementation consideration VT-d spec doesn't define a capability bit for the l2 translation. Architecturally there is no way to tell the guest that l2 translation capability is not available. The Linux Intel IOMMU driver assumes l2 translation is always available when VT-d exists, and fails to load without l2 translation support even if interrupt remapping and l1 translation are available. So it needs to enable l2 translation first, before other functions. What then is the purpose of the nested translation support bit in the extended capability register? It's to translate the output GPA from first-level translation (IOVA->GPA) to HPA. For detail please see VT-d spec - 3.8 Nested Translation: "When Nesting Enable (NESTE) field is 1 in extended-context-entries, requests-with-PASID translated through first-level translation are also subjected to nested second-level translation. Such extended-context-entries contain both the pointer to the PASID-table (which contains the pointer to the first-level translation structures), and the pointer to the second-level translation structures." I didn't phrase my question very well. I understand what the nested translation bit means, but I don't understand why we have a problem signalling the presence or lack of nested translations to the guest. In other words, why can't we hide l2 translation from the guest by simply clearing the nested translation capability? You mean to report no support for l2 translation via the nested translation bit? But nested translation is a different function from l2 translation, even from the guest's view, and nested translation only works for requests with PASID (l1 translation). The Linux intel iommu driver enables l2 translation unconditionally and frees the iommu instance when it fails to enable l2 translation.
In which case the wording of your description is confusing: Instead of "Linux Intel IOMMU driver thinks l2 translation is always available when VTD exits and fail to be loaded without l2 translation support ..." how about using something closer to what you've replied with last? Jan Hi All: I have some updates about the implementation dependency between l2 translation (DMA translation) and irq remapping. I found there is a kernel parameter "intel_iommu=on" and a kconfig option CONFIG_INTEL_IOMMU_DEFAULT_ON which control the DMA translation function. When they aren't set, the DMA translation function will not be enabled by the IOMMU driver even if some vIOMMU registers show the l2 translation function as available. In the meantime, the irq remapping function can still work to support >255 vcpus. I checked that the RHEL, SLES, Oracle and Ubuntu distributions don't set the kernel parameter or select the kconfig option. So we can emulate irq remapping first, with some capability bits (e.g. SAGAW of the Capability Register) of l2 translation, for >255 vcpus support without l2 translation emulation. Showing the l2 capability bits is to make sure the IOMMU driver probes the ACPI DMAR tables successfully, because the IOMMU driver accesses these bits while reading the ACPI tables. If someone adds the "intel_iommu=on" kernel parameter manually, the IOMMU driver will panic the guest because it can't enable the DMA remapping function via the gcmd register and the "Translation Enable Status" bit in the gsts register is never set by the vIOMMU. This shows the actual vIOMMU status of no l2 translation emulation and warns the user not to enable l2 translation.
Re: [Xen-devel] Xen virtual IOMMU high level design doc V2
On 10/21/2016 04:36, Andrew Cooper wrote: >> >>> u64 iova; >>> /* Out parameters. */ >>> u64 translated_addr; >>> u64 addr_mask; /* Translation page size */ >>> IOMMUAccessFlags permission; >> >> How is this translation intended to be used? How do you plan to avoid >> race conditions where qemu requests a translation, receives one, the >> guest invalidated the mapping, and then qemu tries to use its translated >> address? >> >> There are only two ways I can see of doing this race-free. One is to >> implement a "memcpy with translation" hypercall, and the other is to >> require the use of ATS in the vIOMMU, where the guest OS is required to >> wait for a positive response from the vIOMMU before it can safely reuse >> the mapping. >> >> The former behaves like real hardware in that an intermediate entity >> performs the translation without interacting with the DMA source. The >> latter explicitly exposes the fact that caching is going on at the >> endpoint to the OS. > The former one seems to move the DMA operation into the hypervisor, but the Qemu > vIOMMU framework just passes the IOVA to the dummy xen-vIOMMU without input > data and access length. I will dig more to figure out a solution. Yes - that does in principle actually move the DMA out of Qemu. Hi Andrew: The first solution, "Move the DMA out of Qemu": the Qemu vIOMMU framework just gives the dummy xen-vIOMMU device model a chance to do the DMA translation, and the DMA access operation itself is in the vIOMMU core code. It's hard to move this out. There are a lot of places that call the translation callback, and some of these are not for DMA access (e.g. mapping guest memory in Qemu). The second solution, "Use ATS to sync invalidation operation": this requires enabling ATS for all virtual PCI devices, which is not easy to do. The following is my proposal: When the IOMMU driver invalidates the IOTLB, it also waits for invalidation completion. We may use this to drain in-flight DMA operations.
The guest triggers an invalidation operation and traps into the vIOMMU in the hypervisor to flush cached data. After this, it should go to Qemu to drain in-flight DMA translations. To do that, the dummy vIOMMU in Qemu registers the same MMIO region as the vIOMMU's, and the emulation part of the invalidation operation returns X86EMUL_UNHANDLEABLE after flushing the cache. The MMIO emulation part is supposed to send an event to Qemu so the dummy vIOMMU gets a chance to start a thread to drain in-flight DMA and report the emulation as done. The guest polls the IVT (invalidate IOTLB) bit in the IOTLB invalidate register until it's cleared. The dummy vIOMMU notifies the vIOMMU via hypercall that the drain operation has completed, the vIOMMU clears the IVT bit, and the guest finishes the invalidation operation. -- Best regards Tianyu Lan
Re: [Xen-devel] Xen virtual IOMMU high level design doc V2
On 10/26/2016 5:39 PM, Jan Beulich wrote: On 22.10.16 at 09:32, wrote: On 10/21/2016 4:36 AM, Andrew Cooper wrote: 3.5 Implementation consideration VT-d spec doesn't define a capability bit for the l2 translation. Architecturally there is no way to tell the guest that l2 translation capability is not available. The Linux Intel IOMMU driver assumes l2 translation is always available when VT-d exists, and fails to load without l2 translation support even if interrupt remapping and l1 translation are available. So it needs to enable l2 translation first, before other functions. What then is the purpose of the nested translation support bit in the extended capability register? It's to translate the output GPA from first-level translation (IOVA->GPA) to HPA. For detail please see VT-d spec - 3.8 Nested Translation: "When Nesting Enable (NESTE) field is 1 in extended-context-entries, requests-with-PASID translated through first-level translation are also subjected to nested second-level translation. Such extended-context-entries contain both the pointer to the PASID-table (which contains the pointer to the first-level translation structures), and the pointer to the second-level translation structures." I didn't phrase my question very well. I understand what the nested translation bit means, but I don't understand why we have a problem signalling the presence or lack of nested translations to the guest. In other words, why can't we hide l2 translation from the guest by simply clearing the nested translation capability? You mean to report no support for l2 translation via the nested translation bit? But nested translation is a different function from l2 translation, even from the guest's view, and nested translation only works for requests with PASID (l1 translation). The Linux intel iommu driver enables l2 translation unconditionally and frees the iommu instance when it fails to enable l2 translation.
In which case the wording of your description is confusing: Instead of "Linux Intel IOMMU driver thinks l2 translation is always available when VTD exits and fail to be loaded without l2 translation support ..." how about using something closer to what you've replied with last? Sorry for my poor English. Will update.
Re: [Xen-devel] Xen virtual IOMMU high level design doc V2
On 10/26/2016 5:36 PM, Jan Beulich wrote: On 18.10.16 at 16:14, wrote: 1.1 Enable more than 255 vcpu support HPC cloud service requires that a VM provide high-performance parallel computing, and we hope to create a huge VM with >255 vcpus on one machine to meet such a requirement. Pin each vcpu to separate pcpus. More than 255 vcpus support requires X2APIC, and Linux disables X2APIC mode if there is no interrupt remapping function, which is presented by the vIOMMU. The interrupt remapping function helps to deliver interrupts to vcpu numbers >255. So we need to add the vIOMMU before enabling >255 vcpus. I continue to dislike this completely neglecting that we can't even have >128 vCPU-s at present. Once again - there's other work to be done prior to lack of vIOMMU becoming the limiting factor. Yes, we can increase vcpus from 128 to 255 first without vIOMMU support. We have some draft patches to enable this. Andrew also will rework the CPUID policy and change the rule for allocating vcpus' APIC IDs, so we will base on that to increase the vcpu number. The VLAPIC also needs to be changed to support >255 APIC IDs. These jobs can be implemented in parallel with vIOMMU.
Re: [Xen-devel] Xen virtual IOMMU high level design doc V2
>>> On 22.10.16 at 09:32,wrote: > On 10/21/2016 4:36 AM, Andrew Cooper wrote: > 3.5 Implementation consideration > VT-d spec doesn't define a capability bit for the l2 translation. > Architecturally there is no way to tell guest that l2 translation > capability is not available. Linux Intel IOMMU driver thinks l2 > translation is always available when VTD exits and fail to be loaded > without l2 translation support even if interrupt remapping and l1 > translation are available. So it needs to enable l2 translation first > before other functions. What then is the purpose of the nested translation support bit in the extended capability register? >>> >>> It's to translate output GPA from first level translation(IOVA->GPA) >>> to HPA. >>> >>> Detail please see VTD spec - 3.8 Nested Translation >>> "When Nesting Enable (NESTE) field is 1 in extended-context-entries, >>> requests-with-PASID translated through first-level translation are also >>> subjected to nested second-level translation. Such extendedcontext- >>> entries contain both the pointer to the PASID-table (which contains the >>> pointer to the firstlevel translation structures), and the pointer to >>> the second-level translation structures." >> >> I didn't phrase my question very well. I understand what the nested >> translation bit means, but I don't understand why we have a problem >> signalling the presence or lack of nested translations to the guest. >> >> In other words, why can't we hide l2 translation from the guest by >> simply clearing the nested translation capability? > > You mean to tell no support of l2 translation via nest translation bit? > But the nested translation is a different function with l2 translation > even from guest view and nested translation only works requests with > PASID (l1 translation). > > Linux intel iommu driver enables l2 translation unconditionally and free > iommu instance when failed to enable l2 translation. 
In which cases the wording of your description is confusing: Instead of "Linux Intel IOMMU driver thinks l2 translation is always available when VTD exits and fail to be loaded without l2 translation support ..." how about using something closer to what you've replied with last? Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
Re: [Xen-devel] Xen virtual IOMMU high level design doc V2
>>> On 18.10.16 at 16:14,wrote: > 1.1 Enable more than 255 vcpu support > HPC cloud service requires VM provides high performance parallel > computing and we hope to create a huge VM with >255 vcpu on one machine > to meet such requirement. Pin each vcpu on separate pcpus. More than > 255 vcpus support requires X2APIC and Linux disables X2APIC mode if > there is no interrupt remapping function which is present by vIOMMU. > Interrupt remapping function helps to deliver interrupt to #vcpu >255. > So we need to add vIOMMU before enabling >255 vcpus. I continue to dislike this completely neglecting that we can't even have >128 vCPU-s at present. Once again - there's other work to be done prior to lack of vIOMMU becoming the limiting factor. Jan
Re: [Xen-devel] Xen virtual IOMMU high level design doc V2
On 10/21/2016 4:36 AM, Andrew Cooper wrote: 255 vcpus support requires X2APIC, and Linux disables X2APIC mode if there is no interrupt remapping function, which is presented by the vIOMMU. The interrupt remapping function helps to deliver interrupts to vcpu numbers >255. This is only a requirement for xapic interrupt sources. x2apic interrupt sources already deliver correctly. The key is the APIC ID. There is no modification to existing PCI MSI and IOAPIC with the introduction of x2apic. PCI MSI/IOAPIC can only send interrupt messages containing an 8-bit APIC ID, which cannot address >255 cpus. Interrupt remapping supports 32-bit APIC IDs, so it's necessary to enable >255 cpus with x2apic mode. If the LAPIC is in x2apic mode while interrupt remapping is disabled, the IOAPIC cannot deliver interrupts to all cpus in the system if #cpu > 255. After spending a long time reading up on this, my first observation is that it is very difficult to find consistent information concerning the expected content of MSI address/data fields for x86 hardware. Having said that, this has been very educational. It is now clear that any MSI message can either specify an 8-bit APIC ID directly, or request for the message to be remapped. Apologies for my earlier confusion. Never mind, I will describe this in more detail in the following version. 3 Xen hypervisor == 3.1 New hypercall XEN_SYSCTL_viommu_op This hypercall should also support pv IOMMU, which is still under RFC review. Here we only cover the non-pv part. 1) Definition of "struct xen_sysctl_viommu_op" as new hypercall parameter. Why did you choose sysctl? As these are per-domain, domctl would be a more logical choice. However, neither of these should be usable by Qemu, and we are trying to split out "normal qemu operations" into dmops which can be safely deprivileged. Do you know the status of dmop now? I just found some discussions about its design on the mailing list. We may use domctl first and move to dmop when it's ready?
I believe Paul is looking into respinning the series early in the 4.9 dev cycle. I expect it won't take long until they are submitted. OK, I got it. Thanks for the information. Definition of VIOMMU subops:

    #define XEN_SYSCTL_viommu_query_capability          0
    #define XEN_SYSCTL_viommu_create                    1
    #define XEN_SYSCTL_viommu_destroy                   2
    #define XEN_SYSCTL_viommu_dma_translation_for_vpdev 3

Definition of VIOMMU capabilities:

    #define XEN_VIOMMU_CAPABILITY_l1_translation      (1 << 0)
    #define XEN_VIOMMU_CAPABILITY_l2_translation      (1 << 1)
    #define XEN_VIOMMU_CAPABILITY_interrupt_remapping (1 << 2)

How are vIOMMUs going to be modelled to guests? On real hardware, they all seem to end up associated with a PCI device of some sort, even if it is just the LPC bridge. This design just considers one vIOMMU covering all PCI devices under its specified PCI segment. The "INCLUDE_PCI_ALL" bit of the DRHD struct is set for the vIOMMU. Even if the first implementation only supports a single vIOMMU, please design the interface to cope with multiple. It will save someone having to go and break the API/ABI in the future when support for multiple vIOMMUs is needed. OK, got it. How do we deal with multiple vIOMMUs in a single guest? For multi-vIOMMU, we need to add a new field in the struct iommu_op to designate the device scope of the vIOMMUs if they are under the same PCI segment. This also needs a change to the DMAR table. 2) Design for subops - XEN_SYSCTL_viommu_query_capability Get vIOMMU capabilities (l1/l2 translation and interrupt remapping). - XEN_SYSCTL_viommu_create Create vIOMMU in the Xen hypervisor with dom_id, capabilities and reg base address. - XEN_SYSCTL_viommu_destroy Destroy vIOMMU in the Xen hypervisor with dom_id as parameter. - XEN_SYSCTL_viommu_dma_translation_for_vpdev Translate IOVA to GPA for a specified virtual PCI device with dom_id, the PCI device's bdf and the IOVA; the Xen hypervisor returns the translated GPA, address mask and access permission.
3.2 l2 translation 1) For virtual PCI device The dummy xen-vIOMMU in Qemu translates IOVA to target GPA via the new hypercall when a DMA operation happens. 2) For physical PCI device DMA operations go through the physical IOMMU directly and the IO page table for IOVA->HPA should be loaded into the physical IOMMU. When the guest updates the l2 Page-table pointer field, it provides an IO page table for IOVA->GPA. vIOMMU needs to shadow the l2 translation table, translate GPA->HPA and write the shadow page table (IOVA->HPA) pointer into the l2 Page-table pointer field of the physical IOMMU's context entry. How are you proposing to do this shadowing? Do we need to trap and emulate all writes to the vIOMMU pagetables, or is there a better way to know when the mappings need invalidating? No, we don't need to trap all writes to the IO page table. From VTD spec 6.1, "Reporting the Caching Mode as Set for the virtual hardware requires the guest software to explicitly issue
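The caching-mode answer above means the shadow table only has to be resynchronized when the guest explicitly invalidates, rather than on every PTE write. A toy model of that resync step — flat arrays stand in for the real multi-level tables, and the "pfn << 1 | present" PTE encoding is invented for the example:

```c
#include <assert.h>
#include <stdint.h>

#define PTE_PRESENT 1u
#define NR_PAGES 16

/* Toy tables: guest l2 (IOVA->GPA), p2m (GPA->HPA), shadow (IOVA->HPA).
 * Each entry encodes a target page frame as (pfn << 1) | present. */
static uint64_t guest_l2[NR_PAGES];   /* indexed by IOVA page number */
static uint64_t p2m[NR_PAGES];        /* indexed by GPA page number  */
static uint64_t shadow_l2[NR_PAGES];  /* indexed by IOVA page number */

/* Invoked when the guest issues an IOTLB invalidation for one IOVA
 * page. Because the vIOMMU reports Caching Mode as set, every mapping
 * change arrives here, so only this path must resync the shadow. */
static int shadow_sync_on_invalidation(uint64_t iova_pfn)
{
    uint64_t gpte = guest_l2[iova_pfn];

    if (!(gpte & PTE_PRESENT)) {
        shadow_l2[iova_pfn] = 0;       /* unmap: drop the shadow entry */
        return 0;
    }
    uint64_t gpa_pfn = gpte >> 1;
    if (gpa_pfn >= NR_PAGES || !(p2m[gpa_pfn] & PTE_PRESENT))
        return -1;                     /* guest mapped a bogus GPA */
    /* IOVA->HPA = (IOVA->GPA) composed with (GPA->HPA) */
    shadow_l2[iova_pfn] = (p2m[gpa_pfn] & ~1ull) | PTE_PRESENT;
    return 0;
}
```

The validation step (rejecting a GPA outside the domain's p2m) is also the answer to the "guest mucks with its pagetables" concern raised later in the thread: a bad guest entry never reaches the shadow table the hardware walks.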
Re: [Xen-devel] Xen virtual IOMMU high level design doc V2
> >> >>> 255 vcpus support requires X2APIC and Linux disables X2APIC mode if >>> there is no interrupt remapping function which is present by vIOMMU. >>> Interrupt remapping function helps to deliver interrupt to #vcpu >255. >> >> This is only a requirement for xapic interrupt sources. x2apic >> interrupt sources already deliver correctly. > > The key is the APIC ID. There is no modification to existing PCI MSI and > IOAPIC with the introduction of x2apic. PCI MSI/IOAPIC can only send > interrupt message containing 8bit APIC ID, which cannot address >255 > cpus. Interrupt remapping supports 32bit APIC ID so it's necessary to > enable >255 cpus with x2apic mode. > > If LAPIC is in x2apic while interrupt remapping is disabled, IOAPIC > cannot deliver interrupts to all cpus in the system if #cpu > 255. After spending a long time reading up on this, my first observation is that it is very difficult to find consistent information concerning the expected content of MSI address/data fields for x86 hardware. Having said that, this has been very educational. It is now clear that any MSI message can either specify an 8 bit APIC ID directly, or request for the message to be remapped. Apologies for my earlier confusion.
>>> [Diagram: l2 translation architecture — Qemu's virtual PCI device issues a DMA request to the dummy xen-vIOMMU, which obtains the target GPA via hypercall from the vIOMMU in the hypervisor; the vIOMMU sits above the IOMMU driver, which programs the hardware IOMMU in front of memory and the physical PCI device.]
>>> 2.2 Interrupt remapping overview. >>> Interrupts from virtual devices and physical devices will be delivered >>> to vLAPIC from vIOAPIC and vMSI. vIOMMU will remap interrupt during this >>> procedure.
>>> [Diagram: interrupt remapping path — interrupts from the Qemu virtual device and from the VM's device driver / IRQ subsystem reach the hypervisor as VIRQs and are delivered through the vLAPIC.]
Re: [Xen-devel] Xen virtual IOMMU high level design doc V2
On 20/10/16 10:53, Tian, Kevin wrote: >> From: Andrew Cooper [mailto:andrew.coop...@citrix.com] >> Sent: Wednesday, October 19, 2016 3:18 AM >> >>> 1.2 Support VFIO-based user space driver (e.g. DPDK) in the guest >>> It relies on the l2 translation capability (IOVA->GPA) on >>> vIOMMU. pIOMMU l2 becomes a shadowing structure of >>> vIOMMU to isolate DMA requests initiated by user space driver. >> How is userspace supposed to drive this interface? I can't picture how >> it would function. > Inside a Linux VM, VFIO provides DMA MAP/UNMAP interface to user space > driver so gIOVA->GPA mapping can be setup on vIOMMU. vIOMMU will > export a "caching mode" capability to indicate all guest PTE changes > requiring explicit vIOMMU cache invalidations. Through trapping of those > invalidation requests, Xen can update corresponding shadow PTEs (gIOVA > ->HPA). When DMA mapping is established, user space driver programs > gIOVA addresses as DMA destination to assigned device, and then upstreaming > DMA request out of this device contains gIOVA which is translated to HPA > by pIOMMU shadow page table. Ok. So in this mode, the userspace driver owns the device, and can choose any arbitrary gIOVA layout it chooses? If it also programs the DMA addresses, I guess this setup is fine. > >>> >>> 1.3 Support guest SVM (Shared Virtual Memory) >>> It relies on the l1 translation table capability (GVA->GPA) on >>> vIOMMU. pIOMMU needs to enable both l1 and l2 translation in nested >>> mode (GVA->GPA->HPA) for passthrough device. IGD passthrough >>> is the main usage today (to support OpenCL 2.0 SVM feature). In the >>> future SVM might be used by other I/O devices too. >> As an aside, how is IGD intending to support SVM? Will it be with PCIe >> ATS/PASID, or something rather more magic as IGD is on the same piece of >> silicon? > Although integrated, IGD conforms to standard PCIe PASID convention. Ok. Any idea when hardware with SVM will be available? 
> >>> 3.5 Implementation consideration >>> VT-d spec doesn't define a capability bit for the l2 translation. >>> Architecturally there is no way to tell guest that l2 translation >>> capability is not available. Linux Intel IOMMU driver thinks l2 >>> translation is always available when VTD exits and fail to be loaded >>> without l2 translation support even if interrupt remapping and l1 >>> translation are available. So it needs to enable l2 translation first >>> before other functions. >> What then is the purpose of the nested translation support bit in the >> extended capability register? >> > Nested translation is for SVM virtualization. Given a DMA transaction > containing a PASID, VT-d engine first finds the 1st translation table > through PASID to translate from GVA to GPA, then once nested > translation capability is enabled, further translate GPA to HPA using the > 2nd level translation table. Bare-metal usage is not expected to turn > on this nested bit. Ok, but what happens if a guest sees a PASSID-capable vIOMMU and itself tries to turn on nesting? E.g. nesting KVM inside Xen and trying to use SVM from the L2 guest? If there is no way to indicate to the L1 guest that nesting isn't available (as it is already actually in use), and we can't shadow entries on faults, what is supposed to happen? ~Andrew ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
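Kevin's description of the nested bit amounts to composing two lookups: the PASID-selected l1 table yields GVA->GPA, and the l2 table then yields GPA->HPA. A toy sketch under that reading — flat arrays instead of the real multi-level walks, purely illustrative:

```c
#include <assert.h>
#include <stdint.h>

#define NPAGES 8
#define ABSENT UINT64_MAX

/* l1: per-PASID GVA->GPA table; l2: GPA->HPA table. Real VT-d walks
 * multi-level structures found through the PASID entry; flat arrays
 * of page numbers keep the composition visible. */
static uint64_t l1_gva_to_gpa[NPAGES];
static uint64_t l2_gpa_to_hpa[NPAGES];

/* With the nested bit set, every l1 output is fed through l2. */
static uint64_t nested_translate(uint64_t gva_pfn)
{
    if (gva_pfn >= NPAGES)
        return ABSENT;
    uint64_t gpa_pfn = l1_gva_to_gpa[gva_pfn];
    if (gpa_pfn >= NPAGES)
        return ABSENT;
    return l2_gpa_to_hpa[gpa_pfn];       /* GVA -> GPA -> HPA */
}
```

This also makes Andrew's objection concrete: if the guest itself wants to nest, it needs its own second stage, but the single nested composition in hardware is already consumed by Xen's GPA->HPA stage.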
Re: [Xen-devel] Xen virtual IOMMU high level design doc V2
On 10/19/2016 4:26 AM, Konrad Rzeszutek Wilk wrote: On Tue, Oct 18, 2016 at 10:14:16PM +0800, Lan Tianyu wrote: 1 Motivation for Xen vIOMMU === 1.1 Enable more than 255 vcpu support HPC cloud service requires VM provides high performance parallel computing and we hope to create a huge VM with >255 vcpu on one machine to meet such requirement. Ping each vcpus on separated pcpus. More than 255 vcpus support requires X2APIC and Linux disables X2APIC mode if there is no interrupt remapping function which is present by vIOMMU. Interrupt remapping function helps to deliver interrupt to #vcpu >255. So we need to add vIOMMU before enabling >255 vcpus. What about Windows? Does it care about this? From our tests, a win8 guest crashes when booting up with 288 vcpus without IR, and it can boot up with IR. 3.2 l2 translation 1) For virtual PCI device Xen dummy xen-vIOMMU in Qemu translates IOVA to target GPA via new hypercall when DMA operation happens. 2) For physical PCI device DMA operations go through the physical IOMMU directly and the IO page table for IOVA->HPA should be loaded into the physical IOMMU. When the guest updates the l2 Page-table pointer field, it provides an IO page table for IOVA->GPA. vIOMMU needs to shadow the l2 translation table, translate GPA->HPA and write the shadow page table (IOVA->HPA) pointer into the l2 Page-table pointer field of the physical IOMMU's context entry. Now all PCI devices in the same hvm domain share one IO page table (GPA->HPA) in the physical IOMMU driver of Xen. To support l2 translation of vIOMMU, the IOMMU driver needs to support multiple address spaces per device entry, using the existing IO page table (GPA->HPA) defaultly and switching to the shadow IO page table (IOVA->HPA) when l2 translation is enabled. defaultly? I mean the GPA->HPA mapping will be set in the assigned device's context entry of the pIOMMU when the VM is created. Just like the current code works. 3.3 Interrupt remapping Interrupts from virtual devices and physical devices will be delivered to vlapic from vIOAPIC and vMSI.
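The "multiple address spaces per device entry" requirement can be pictured as a context entry selecting one of two IO page tables. A sketch with invented structure names (not Xen's actual types), omitting the context-entry reprogramming and IOTLB flush a real switch would need:

```c
#include <assert.h>
#include <stdbool.h>

/* Each assigned device's context entry points at exactly one IO page
 * table: the domain-wide GPA->HPA table by default, or a per-device
 * shadow IOVA->HPA table once the guest enables l2 translation. */
struct io_pagetable { const char *kind; /* root pointer etc. elided */ };

struct device_context {
    struct io_pagetable *default_pt;   /* shared GPA->HPA (as today) */
    struct io_pagetable *shadow_pt;    /* per-device IOVA->HPA       */
    bool l2_enabled;
};

static struct io_pagetable *active_pagetable(struct device_context *ctx)
{
    return ctx->l2_enabled ? ctx->shadow_pt : ctx->default_pt;
}

/* Called when the guest enables/disables l2 translation for this
 * device on the vIOMMU; a real implementation would also rewrite the
 * pIOMMU context entry and flush the IOTLB. */
static void set_l2_translation(struct device_context *ctx, bool on)
{
    ctx->l2_enabled = on;
}
```

This matches the reply above: the default GPA->HPA table is installed at VM creation exactly as today, and only a guest that turns on l2 translation moves its devices onto a shadow table.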
It needs to add interrupt remapping hooks in vmsi_deliver() and ioapic_deliver() to find the target vlapic according to the interrupt remapping table. 3.4 l1 translation When nested translation is enabled, any address generated by l1 translation is used as the input address for nesting with l2 translation. Physical IOMMU needs to enable both l1 and l2 translation in nested translation mode (GVA->GPA->HPA) for passthrough device. The VT-d context entry points to the guest l1 translation table, which will be nest-translated by the l2 translation table and so can be directly linked to the context entry of the physical IOMMU. I think this means that the shared_ept will be disabled? The shared_ept (GPA->HPA mapping) is used to do nested translation for any output from l1 translation (GVA->GPA).
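The hooks described for vmsi_deliver()/ioapic_deliver() boil down to an extra table lookup before vLAPIC delivery. A sketch with a hypothetical IRTE layout — loosely following VT-d's present/destination/vector fields, not the actual Xen structures:

```c
#include <assert.h>
#include <stdint.h>

#define IRT_ENTRIES 16

/* Hypothetical interrupt remapping table entry, loosely modelled on a
 * VT-d IRTE: a present bit, a full 32-bit destination APIC ID (this is
 * what lifts the 8-bit limit), and the vector to deliver. */
struct irte {
    uint8_t  present;
    uint32_t dest_apic_id;
    uint8_t  vector;
};

static struct irte remap_table[IRT_ENTRIES];

/* What a remapping hook in vmsi_deliver()/ioapic_deliver() would do:
 * take the handle carried by the remappable-format message, look up
 * the IRTE, and hand the real destination/vector to the vLAPIC code. */
static int remap_interrupt(uint16_t handle, uint32_t *dest, uint8_t *vec)
{
    if (handle >= IRT_ENTRIES || !remap_table[handle].present)
        return -1;                     /* fault: delivery blocked */
    *dest = remap_table[handle].dest_apic_id;
    *vec  = remap_table[handle].vector;
    return 0;
}
```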
Re: [Xen-devel] Xen virtual IOMMU high level design doc V2
Hi Andrew: Thanks for your review. On 2016年10月19日 03:17, Andrew Cooper wrote: On 18/10/16 15:14, Lan Tianyu wrote: Change since V1: 1) Update motivation for Xen vIOMMU - 288 vcpus support part 2) Change definition of struct xen_sysctl_viommu_op 3) Update "3.5 Implementation consideration" to explain why we needs to enable l2 translation first. 4) Update "4.3 Q35 vs I440x" - Linux/Windows VTD drivers can work on the emulated I440 chipset. 5) Remove stale statement in the "3.3 Interrupt remapping" Content: === 1. Motivation of vIOMMU 1.1 Enable more than 255 vcpus 1.2 Support VFIO-based user space driver 1.3 Support guest Shared Virtual Memory (SVM) 2. Xen vIOMMU Architecture 2.1 l2 translation overview 2.2 Interrupt remapping overview 3. Xen hypervisor 3.1 New vIOMMU hypercall interface 3.2 l2 translation 3.3 Interrupt remapping 3.4 l1 translation 3.5 Implementation consideration 4. Qemu 4.1 Qemu vIOMMU framework 4.2 Dummy xen-vIOMMU driver 4.3 Q35 vs. i440x 4.4 Report vIOMMU to hvmloader 1 Motivation for Xen vIOMMU === 1.1 Enable more than 255 vcpu support HPC cloud service requires VM provides high performance parallel computing and we hope to create a huge VM with >255 vcpu on one machine to meet such requirement.Ping each vcpus on separated pcpus. More than Pin ? Sorry, it's a typo. Also, grammatically speaking, I think you mean "each vcpu to separate pcpus". Yes. 255 vcpus support requires X2APIC and Linux disables X2APIC mode if there is no interrupt remapping function which is present by vIOMMU. Interrupt remapping function helps to deliver interrupt to #vcpu >255. This is only a requirement for xapic interrupt sources. x2apic interrupt sources already deliver correctly. The key is the APIC ID. There is no modification to existing PCI MSI and IOAPIC with the introduction of x2apic. PCI MSI/IOAPIC can only send interrupt message containing 8bit APIC ID, which cannot address >255 cpus. 
Interrupt remapping supports 32bit APIC ID so it's necessary to enable >255 cpus with x2apic mode. If LAPIC is in x2apic while interrupt remapping is disabled, IOAPIC cannot deliver interrupts to all cpus in the system if #cpu > 255. 1.3 Support guest SVM (Shared Virtual Memory) It relies on the l1 translation table capability (GVA->GPA) on vIOMMU. pIOMMU needs to enable both l1 and l2 translation in nested mode (GVA->GPA->HPA) for passthrough device. IGD passthrough is the main usage today (to support OpenCL 2.0 SVM feature). In the future SVM might be used by other I/O devices too. As an aside, how is IGD intending to support SVM? Will it be with PCIe ATS/PASID, or something rather more magic as IGD is on the same piece of silicon? IGD on Skylake supports PCIe PASID. 2. Xen vIOMMU Architecture * vIOMMU will be inside Xen hypervisor for following factors 1) Avoid round trips between Qemu and Xen hypervisor 2) Ease of integration with the rest of the hypervisor 3) HVMlite/PVH doesn't use Qemu * Dummy xen-vIOMMU in Qemu as a wrapper of new hypercall to create /destory vIOMMU in hypervisor and deal with virtual PCI device's l2 translation. 2.1 l2 translation overview For Virtual PCI device, dummy xen-vIOMMU does translation in the Qemu via new hypercall. For physical PCI device, vIOMMU in hypervisor shadows IO page table from IOVA->GPA to IOVA->HPA and load page table to physical IOMMU. The following diagram shows l2 translation architecture. Which scenario is this? Is this the passthrough case where the Qemu Virtual PCI device is a shadow of the real PCI device in hardware? No, this is for traditional virtual pci device emulated by Qemu and passthrough PCI device.
[Diagram: l2 translation architecture — Qemu virtual PCI device → DMA request → dummy xen-vIOMMU → target GPA via hypercall → hypervisor vIOMMU → IOMMU driver → hardware IOMMU → memory.]
Re: [Xen-devel] Xen virtual IOMMU high level design doc V2
> From: Konrad Rzeszutek Wilk [mailto:konrad.w...@oracle.com] > Sent: Wednesday, October 19, 2016 4:27 AM > > > > 2) For physical PCI device > > DMA operations go though physical IOMMU directly and IO page table for > > IOVA->HPA should be loaded into physical IOMMU. When guest updates > > l2 Page-table pointer field, it provides IO page table for > > IOVA->GPA. vIOMMU needs to shadow l2 translation table, translate > > GPA->HPA and update shadow page table(IOVA->HPA) pointer to l2 > > Page-table pointer to context entry of physical IOMMU. > > > > Now all PCI devices in same hvm domain share one IO page table > > (GPA->HPA) in physical IOMMU driver of Xen. To support l2 > > translation of vIOMMU, IOMMU driver need to support multiple address > > spaces per device entry. Using existing IO page table(GPA->HPA) > > defaultly and switch to shadow IO page table(IOVA->HPA) when l2 > > defaultly? > > > translation function is enabled. These change will not affect current > > P2M logic. > > What happens if the guests IO page tables have incorrect values? > > For example the guest sets up the pagetables to cover some section > of HPA ranges (which are all good and permitted). But then during execution > the guest kernel decides to muck around with the pagetables and adds an HPA > range that is outside what the guest has been allocated. > > What then? Shadow PTE is controlled by hypervisor. Whatever IOVA->GPA mapping in guest PTE must be validated (IOVA->GPA->HPA) before updating into the shadow PTE. So regardless of when guest mucks its PTE, the operation is always trapped and validated. Why do you think there is a problem? Also guest only sees GPA. All it can operate is GPA ranges. > > > > 3.3 Interrupt remapping > > Interrupts from virtual devices and physical devices will be delivered > > to vlapic from vIOAPIC and vMSI. It needs to add interrupt remapping > > hooks in the vmsi_deliver() and ioapic_deliver() to find target vlapic > > according interrupt remapping table. 
> > > > > > 3.4 l1 translation > > When nested translation is enabled, any address generated by l1 > > translation is used as the input address for nesting with l2 > > translation. Physical IOMMU needs to enable both l1 and l2 translation > > in nested translation mode(GVA->GPA->HPA) for passthrough > > device. > > > > VT-d context entry points to guest l1 translation table which > > will be nest-translated by l2 translation table and so it > > can be directly linked to context entry of physical IOMMU. > > I think this means that the shared_ept will be disabled? > > > What about different versions of contexts? Say the V1 is exposed > to guest but the hardware supports V2? Are there any flags that have > swapped positions? Or is it pretty backwards compatible? yes, backward compatible. > > > > > > 3.5 Implementation consideration > > VT-d spec doesn't define a capability bit for the l2 translation. > > Architecturally there is no way to tell guest that l2 translation > > capability is not available. Linux Intel IOMMU driver thinks l2 > > translation is always available when VTD exits and fail to be loaded > > without l2 translation support even if interrupt remapping and l1 > > translation are available. So it needs to enable l2 translation first > > I am lost on that sentence. Are you saying that it tries to load > the IOVA and if they fail.. then it keeps on going? What is the result > of this? That you can't do IOVA (so can't use vfio ?) It's about VT-d capability. VT-d supports both 1st-level and 2nd-level translation, however only the 1st-level translation can be optionally reported through a capability bit. There is no capability bit to say a version doesn't support 2nd-level translation. The implication is that, as long as a vIOMMU is exposed, guest IOMMU driver always assumes IOVA capability available thru 2nd level translation. 
So we can first emulate a vIOMMU w/ only 2nd-level capability, and then extend it to support 1st-level and interrupt remapping, but cannot do the reverse direction. I think Tianyu's point is more to describe enabling sequence based on this fact. :-) > > 4.1 Qemu vIOMMU framework > > Qemu has a framework to create virtual IOMMU(e.g. virtual intel VTD and > > AMD IOMMU) and report in guest ACPI table. So for Xen side, a dummy > > xen-vIOMMU wrapper is required to connect with actual vIOMMU in Xen. > > Especially for l2 translation of virtual PCI device because > > emulations of virtual PCI devices are in the Qemu. Qemu's vIOMMU > > framework provides callback to deal with l2 translation when > > DMA operations of virtual PCI devices happen. > > You say AMD and Intel. This sounds quite OS agnostic. Does it mean you > could expose an vIOMMU to a guest and actually use the AMD IOMMU > in the hypervisor? Did you mean "expose an Intel vIOMMU to guest and then use physical AMD IOMMU in hypervisor"? I didn't think about this, but what's the value of doing so? :-) Thanks Kevin
Re: [Xen-devel] Xen virtual IOMMU high level design doc V2
> From: Andrew Cooper [mailto:andrew.coop...@citrix.com] > Sent: Wednesday, October 19, 2016 3:18 AM > > > > > 1.2 Support VFIO-based user space driver (e.g. DPDK) in the guest > > It relies on the l2 translation capability (IOVA->GPA) on > > vIOMMU. pIOMMU l2 becomes a shadowing structure of > > vIOMMU to isolate DMA requests initiated by user space driver. > > How is userspace supposed to drive this interface? I can't picture how > it would function. Inside a Linux VM, VFIO provides DMA MAP/UNMAP interface to user space driver so gIOVA->GPA mapping can be setup on vIOMMU. vIOMMU will export a "caching mode" capability to indicate all guest PTE changes requiring explicit vIOMMU cache invalidations. Through trapping of those invalidation requests, Xen can update corresponding shadow PTEs (gIOVA ->HPA). When DMA mapping is established, user space driver programs gIOVA addresses as DMA destination to assigned device, and then upstreaming DMA request out of this device contains gIOVA which is translated to HPA by pIOMMU shadow page table. > > > > > > > 1.3 Support guest SVM (Shared Virtual Memory) > > It relies on the l1 translation table capability (GVA->GPA) on > > vIOMMU. pIOMMU needs to enable both l1 and l2 translation in nested > > mode (GVA->GPA->HPA) for passthrough device. IGD passthrough > > is the main usage today (to support OpenCL 2.0 SVM feature). In the > > future SVM might be used by other I/O devices too. > > As an aside, how is IGD intending to support SVM? Will it be with PCIe > ATS/PASID, or something rather more magic as IGD is on the same piece of > silicon? Although integrated, IGD conforms to standard PCIe PASID convention. > > 3.5 Implementation consideration > > VT-d spec doesn't define a capability bit for the l2 translation. > > Architecturally there is no way to tell guest that l2 translation > > capability is not available. 
Linux Intel IOMMU driver thinks l2 > > translation is always available when VTD exits and fail to be loaded > > without l2 translation support even if interrupt remapping and l1 > > translation are available. So it needs to enable l2 translation first > > before other functions. > > What then is the purpose of the nested translation support bit in the > extended capability register? > Nested translation is for SVM virtualization. Given a DMA transaction containing a PASID, VT-d engine first finds the 1st translation table through PASID to translate from GVA to GPA, then once nested translation capability is enabled, further translate GPA to HPA using the 2nd level translation table. Bare-metal usage is not expected to turn on this nested bit. Thanks Kevin
Re: [Xen-devel] Xen virtual IOMMU high level design doc V2
On Tue, Oct 18, 2016 at 10:14:16PM +0800, Lan Tianyu wrote: > Change since V1: > 1) Update motivation for Xen vIOMMU - 288 vcpus support part > 2) Change definition of struct xen_sysctl_viommu_op > 3) Update "3.5 Implementation consideration" to explain why we needs to > enable l2 translation first. > 4) Update "4.3 Q35 vs I440x" - Linux/Windows VTD drivers can work on the > emulated I440 chipset. > 5) Remove stale statement in the "3.3 Interrupt remapping" > > Content: > === > 1. Motivation of vIOMMU > 1.1 Enable more than 255 vcpus > 1.2 Support VFIO-based user space driver > 1.3 Support guest Shared Virtual Memory (SVM) > 2. Xen vIOMMU Architecture > 2.1 l2 translation overview > 2.2 Interrupt remapping overview > 3. Xen hypervisor > 3.1 New vIOMMU hypercall interface > 3.2 l2 translation > 3.3 Interrupt remapping > 3.4 l1 translation > 3.5 Implementation consideration > 4. Qemu > 4.1 Qemu vIOMMU framework > 4.2 Dummy xen-vIOMMU driver > 4.3 Q35 vs. i440x > 4.4 Report vIOMMU to hvmloader > > > 1 Motivation for Xen vIOMMU > === > 1.1 Enable more than 255 vcpu support > HPC cloud service requires VM provides high performance parallel > computing and we hope to create a huge VM with >255 vcpu on one machine > to meet such requirement.Ping each vcpus on separated pcpus. More than > 255 vcpus support requires X2APIC and Linux disables X2APIC mode if > there is no interrupt remapping function which is present by vIOMMU. > Interrupt remapping function helps to deliver interrupt to #vcpu >255. > So we need to add vIOMMU before enabling >255 vcpus. What about Windows? Does it care about this? > > 1.2 Support VFIO-based user space driver (e.g. DPDK) in the guest > It relies on the l2 translation capability (IOVA->GPA) on > vIOMMU. pIOMMU l2 becomes a shadowing structure of > vIOMMU to isolate DMA requests initiated by user space driver. > > > 1.3 Support guest SVM (Shared Virtual Memory) > It relies on the l1 translation table capability (GVA->GPA) on > vIOMMU. 
pIOMMU needs to enable both l1 and l2 translation in nested > mode (GVA->GPA->HPA) for passthrough device. IGD passthrough > is the main usage today (to support OpenCL 2.0 SVM feature). In the > future SVM might be used by other I/O devices too. > > 2. Xen vIOMMU Architecture > > > * vIOMMU will be inside Xen hypervisor for following factors > 1) Avoid round trips between Qemu and Xen hypervisor > 2) Ease of integration with the rest of the hypervisor > 3) HVMlite/PVH doesn't use Qemu > * Dummy xen-vIOMMU in Qemu as a wrapper of new hypercall to create > /destory vIOMMU in hypervisor and deal with virtual PCI device's l2 destroy > translation. > > 2.1 l2 translation overview > For Virtual PCI device, dummy xen-vIOMMU does translation in the > Qemu via new hypercall. > > For physical PCI device, vIOMMU in hypervisor shadows IO page table from > IOVA->GPA to IOVA->HPA and load page table to physical IOMMU. > > The following diagram shows l2 translation architecture.
> [Diagram: l2 translation architecture — Qemu virtual PCI device → DMA request → dummy xen-vIOMMU → target GPA via hypercall → hypervisor vIOMMU → IOMMU driver → hardware IOMMU → memory.]
Re: [Xen-devel] Xen virtual IOMMU high level design doc V2
On 18/10/16 15:14, Lan Tianyu wrote: > Change since V1: > 1) Update motivation for Xen vIOMMU - 288 vcpus support part > 2) Change definition of struct xen_sysctl_viommu_op > 3) Update "3.5 Implementation consideration" to explain why we > needs to enable l2 translation first. > 4) Update "4.3 Q35 vs I440x" - Linux/Windows VTD drivers can work > on the emulated I440 chipset. > 5) Remove stale statement in the "3.3 Interrupt remapping" > > Content: > === > > 1. Motivation of vIOMMU > 1.1 Enable more than 255 vcpus > 1.2 Support VFIO-based user space driver > 1.3 Support guest Shared Virtual Memory (SVM) > 2. Xen vIOMMU Architecture > 2.1 l2 translation overview > 2.2 Interrupt remapping overview > 3. Xen hypervisor > 3.1 New vIOMMU hypercall interface > 3.2 l2 translation > 3.3 Interrupt remapping > 3.4 l1 translation > 3.5 Implementation consideration > 4. Qemu > 4.1 Qemu vIOMMU framework > 4.2 Dummy xen-vIOMMU driver > 4.3 Q35 vs. i440x > 4.4 Report vIOMMU to hvmloader > > > 1 Motivation for Xen vIOMMU > === > > 1.1 Enable more than 255 vcpu support > HPC cloud service requires VM provides high performance parallel > computing and we hope to create a huge VM with >255 vcpu on one machine > to meet such requirement.Ping each vcpus on separated pcpus. More than Pin ? Also, grammatically speaking, I think you mean "each vcpu to separate pcpus". > 255 vcpus support requires X2APIC and Linux disables X2APIC mode if > there is no interrupt remapping function which is present by vIOMMU. > Interrupt remapping function helps to deliver interrupt to #vcpu >255. This is only a requirement for xapic interrupt sources. x2apic interrupt sources already deliver correctly. > So we need to add vIOMMU before enabling >255 vcpus. > > 1.2 Support VFIO-based user space driver (e.g. DPDK) in the guest > It relies on the l2 translation capability (IOVA->GPA) on > vIOMMU. pIOMMU l2 becomes a shadowing structure of > vIOMMU to isolate DMA requests initiated by user space driver. 
How is userspace supposed to drive this interface? I can't picture how it would function. > > > 1.3 Support guest SVM (Shared Virtual Memory) > It relies on the l1 translation table capability (GVA->GPA) on > vIOMMU. pIOMMU needs to enable both l1 and l2 translation in nested > mode (GVA->GPA->HPA) for passthrough device. IGD passthrough > is the main usage today (to support OpenCL 2.0 SVM feature). In the > future SVM might be used by other I/O devices too. As an aside, how is IGD intending to support SVM? Will it be with PCIe ATS/PASID, or something rather more magic as IGD is on the same piece of silicon? > > 2. Xen vIOMMU Architecture > > > > * vIOMMU will be inside Xen hypervisor for following factors > 1) Avoid round trips between Qemu and Xen hypervisor > 2) Ease of integration with the rest of the hypervisor > 3) HVMlite/PVH doesn't use Qemu > * Dummy xen-vIOMMU in Qemu as a wrapper of new hypercall to create > /destory vIOMMU in hypervisor and deal with virtual PCI device's l2 > translation. > > 2.1 l2 translation overview > For Virtual PCI device, dummy xen-vIOMMU does translation in the > Qemu via new hypercall. > > For physical PCI device, vIOMMU in hypervisor shadows IO page table from > IOVA->GPA to IOVA->HPA and load page table to physical IOMMU. > > The following diagram shows l2 translation architecture. Which scenario is this? Is this the passthrough case where the Qemu Virtual PCI device is a shadow of the real PCI device in hardware?
> [Diagram: l2 translation architecture — Qemu virtual PCI device → DMA request → dummy xen-vIOMMU → target GPA via hypercall → hypervisor vIOMMU → IOMMU driver → hardware IOMMU → memory.]
[Xen-devel] Xen virtual IOMMU high level design doc V2
Change since V1: 1) Update motivation for Xen vIOMMU - 288 vcpus support part 2) Change definition of struct xen_sysctl_viommu_op 3) Update "3.5 Implementation consideration" to explain why we needs to enable l2 translation first. 4) Update "4.3 Q35 vs I440x" - Linux/Windows VTD drivers can work on the emulated I440 chipset. 5) Remove stale statement in the "3.3 Interrupt remapping" Content: === 1. Motivation of vIOMMU 1.1 Enable more than 255 vcpus 1.2 Support VFIO-based user space driver 1.3 Support guest Shared Virtual Memory (SVM) 2. Xen vIOMMU Architecture 2.1 l2 translation overview 2.2 Interrupt remapping overview 3. Xen hypervisor 3.1 New vIOMMU hypercall interface 3.2 l2 translation 3.3 Interrupt remapping 3.4 l1 translation 3.5 Implementation consideration 4. Qemu 4.1 Qemu vIOMMU framework 4.2 Dummy xen-vIOMMU driver 4.3 Q35 vs. i440x 4.4 Report vIOMMU to hvmloader 1 Motivation for Xen vIOMMU === 1.1 Enable more than 255 vcpu support HPC cloud service requires VM provides high performance parallel computing and we hope to create a huge VM with >255 vcpu on one machine to meet such requirement.Ping each vcpus on separated pcpus. More than 255 vcpus support requires X2APIC and Linux disables X2APIC mode if there is no interrupt remapping function which is present by vIOMMU. Interrupt remapping function helps to deliver interrupt to #vcpu >255. So we need to add vIOMMU before enabling >255 vcpus. 1.2 Support VFIO-based user space driver (e.g. DPDK) in the guest It relies on the l2 translation capability (IOVA->GPA) on vIOMMU. pIOMMU l2 becomes a shadowing structure of vIOMMU to isolate DMA requests initiated by user space driver. 1.3 Support guest SVM (Shared Virtual Memory) It relies on the l1 translation table capability (GVA->GPA) on vIOMMU. pIOMMU needs to enable both l1 and l2 translation in nested mode (GVA->GPA->HPA) for passthrough device. IGD passthrough is the main usage today (to support OpenCL 2.0 SVM feature). 
In the future SVM might be used by other I/O devices too. 2. Xen vIOMMU Architecture * vIOMMU will be inside Xen hypervisor for following factors 1) Avoid round trips between Qemu and Xen hypervisor 2) Ease of integration with the rest of the hypervisor 3) HVMlite/PVH doesn't use Qemu * Dummy xen-vIOMMU in Qemu as a wrapper of new hypercall to create /destory vIOMMU in hypervisor and deal with virtual PCI device's l2 translation. 2.1 l2 translation overview For Virtual PCI device, dummy xen-vIOMMU does translation in the Qemu via new hypercall. For physical PCI device, vIOMMU in hypervisor shadows IO page table from IOVA->GPA to IOVA->HPA and load page table to physical IOMMU. The following diagram shows l2 translation architecture.
[Diagram: l2 translation architecture — Qemu virtual PCI device → DMA request → dummy xen-vIOMMU → target GPA via hypercall → hypervisor vIOMMU → IOMMU driver → hardware IOMMU → memory.]
Re: [Xen-devel] Xen virtual IOMMU high level design doc
On 2016年10月06日 02:36, Konrad Rzeszutek Wilk wrote: >>> 3.3 Interrupt remapping >>> > > Interrupts from virtual devices and physical devices will be delivered >>> > > to vlapic from vIOAPIC and vMSI. It needs to add interrupt remapping >>> > > hooks in the vmsi_deliver() and ioapic_deliver() to find target vlapic >>> > > according interrupt remapping table. The following diagram shows the >>> > > logic. >>> > > > Uh? Missing diagram? Sorry. This is a stale statement. The diagram was moved to 2.2 Interrupt remapping overview. > >>> 4.3 Q35 vs i440x >>> > > VT-D is introduced since Q35 chipset. Previous concern was that IOMMU > s/since/with/ >>> > > driver has assumption that VTD only exists on Q35 and newer chipset and >>> > > we have to enable Q35 first. >>> > > >>> > > Consulted with Linux/Windows IOMMU driver experts and get that these >>> > > drivers doesn't have such assumption. So we may skip Q35 implementation >>> > > and can emulate vIOMMU on I440x chipset. KVM already have vIOMMU support >>> > > with virtual PCI device's DMA translation and interrupt remapping. We >>> > > are using KVM to do experiment of adding vIOMMU on the I440x and test >>> > > Linux/Windows guest. Will report back when have some results. > Any results? We have booted up a Win8 guest with virtual VTD and an emulated I440x platform on Xen, and the guest uses the virtual VTD to enable the interrupt remapping function. -- Best regards Tianyu Lan
Re: [Xen-devel] Xen virtual IOMMU high level design doc
On Thu, Sep 15, 2016 at 10:22:36PM +0800, Lan, Tianyu wrote: > Hi Andrew: > Sorry to bother you. To make sure we are on the right direction, it's > better to get feedback from you before we go further step. Could you > have a look? Thanks. > > On 8/17/2016 8:05 PM, Lan, Tianyu wrote: > > Hi All: > > The following is our Xen vIOMMU high level design for detail > > discussion. Please have a look. Very appreciate for your comments. > > This design doesn't cover changes when root port is moved to hypervisor. > > We may design it later. > > > > > > Content: > > === > > > > 1. Motivation of vIOMMU > > 1.1 Enable more than 255 vcpus > > 1.2 Support VFIO-based user space driver > > 1.3 Support guest Shared Virtual Memory (SVM) > > 2. Xen vIOMMU Architecture > > 2.1 2th level translation overview > > 2.2 Interrupt remapping overview > > 3. Xen hypervisor > > 3.1 New vIOMMU hypercall interface > > 3.2 2nd level translation > > 3.3 Interrupt remapping > > 3.4 1st level translation > > 3.5 Implementation consideration > > 4. Qemu > > 4.1 Qemu vIOMMU framework > > 4.2 Dummy xen-vIOMMU driver > > 4.3 Q35 vs. i440x > > 4.4 Report vIOMMU to hvmloader > > > > > > 1 Motivation for Xen vIOMMU > > === > > > > 1.1 Enable more than 255 vcpu support > > HPC virtualization requires more than 255 vcpus support in a single VM > > to meet parallel computing requirement. More than 255 vcpus support > > requires interrupt remapping capability present on vIOMMU to deliver > > interrupt to #vcpu >255 Otherwise Linux guest fails to boot up with >255 > > vcpus if interrupt remapping is absent. > > > > > > 1.2 Support VFIO-based user space driver (e.g. DPDK) in the guest > > It relies on the 2nd level translation capability (IOVA->GPA) on > > vIOMMU. pIOMMU 2nd level becomes a shadowing structure of > > vIOMMU to isolate DMA requests initiated by user space driver. 
> > 1.3 Support guest SVM (Shared Virtual Memory)
> > It relies on the 1st level translation table capability (GVA->GPA) on
> > vIOMMU. pIOMMU needs to enable both 1st level and 2nd level translation
> > in nested mode (GVA->GPA->HPA) for passthrough device. IGD passthrough
> > is the main usage today (to support OpenCL 2.0 SVM feature). In the
> > future SVM might be used by other I/O devices too.
> >
> > 2. Xen vIOMMU Architecture
> >
> > * vIOMMU will be inside Xen hypervisor for following factors
> > 1) Avoid round trips between Qemu and Xen hypervisor
> > 2) Ease of integration with the rest of the hypervisor
> > 3) HVMlite/PVH doesn't use Qemu
> > * Dummy xen-vIOMMU in Qemu as a wrapper of new hypercall to create
> > /destory vIOMMU in hypervisor and deal with virtual PCI device's 2th
> > level translation.
> >
> > 2.1 2th level translation overview
> > For Virtual PCI device, dummy xen-vIOMMU does translation in the
> > Qemu via new hypercall.
> >
> > For physical PCI device, vIOMMU in hypervisor shadows IO page table from
> > IOVA->GPA to IOVA->HPA and load page table to physical IOMMU.
> >
> > The following diagram shows 2th level translation architecture.
> > [ASCII diagram, garbled in this archive: Qemu's virtual PCI device issues
> > a DMA request to the dummy xen-vIOMMU, which returns the translated target
> > GPA for the memory-region access; a hypercall path leads into the
> > hypervisor's vIOMMU and IOMMU driver, and from there to the hardware
> > IOMMU and memory.]
Re: [Xen-devel] Xen virtual IOMMU high level design doc
Hi Andrew: Sorry to bother you. To make sure we are on the right direction, it's better to get feedback from you before we go further step. Could you have a look? Thanks. On 8/17/2016 8:05 PM, Lan, Tianyu wrote: Hi All: The following is our Xen vIOMMU high level design for detail discussion. Please have a look. Very appreciate for your comments. This design doesn't cover changes when root port is moved to hypervisor. We may design it later. Content: === 1. Motivation of vIOMMU 1.1 Enable more than 255 vcpus 1.2 Support VFIO-based user space driver 1.3 Support guest Shared Virtual Memory (SVM) 2. Xen vIOMMU Architecture 2.1 2th level translation overview 2.2 Interrupt remapping overview 3. Xen hypervisor 3.1 New vIOMMU hypercall interface 3.2 2nd level translation 3.3 Interrupt remapping 3.4 1st level translation 3.5 Implementation consideration 4. Qemu 4.1 Qemu vIOMMU framework 4.2 Dummy xen-vIOMMU driver 4.3 Q35 vs. i440x 4.4 Report vIOMMU to hvmloader 1 Motivation for Xen vIOMMU === 1.1 Enable more than 255 vcpu support HPC virtualization requires more than 255 vcpus support in a single VM to meet parallel computing requirement. More than 255 vcpus support requires interrupt remapping capability present on vIOMMU to deliver interrupt to #vcpu >255 Otherwise Linux guest fails to boot up with >255 vcpus if interrupt remapping is absent. 1.2 Support VFIO-based user space driver (e.g. DPDK) in the guest It relies on the 2nd level translation capability (IOVA->GPA) on vIOMMU. pIOMMU 2nd level becomes a shadowing structure of vIOMMU to isolate DMA requests initiated by user space driver. 1.3 Support guest SVM (Shared Virtual Memory) It relies on the 1st level translation table capability (GVA->GPA) on vIOMMU. pIOMMU needs to enable both 1st level and 2nd level translation in nested mode (GVA->GPA->HPA) for passthrough device. IGD passthrough is the main usage today (to support OpenCL 2.0 SVM feature). In the future SVM might be used by other I/O devices too. 2. 
Xen vIOMMU Architecture

* vIOMMU will be inside Xen hypervisor for following factors
1) Avoid round trips between Qemu and Xen hypervisor
2) Ease of integration with the rest of the hypervisor
3) HVMlite/PVH doesn't use Qemu
* Dummy xen-vIOMMU in Qemu as a wrapper of new hypercall to create/destroy vIOMMU in hypervisor and deal with virtual PCI device's 2th level translation.

2.1 2th level translation overview
For Virtual PCI device, dummy xen-vIOMMU does translation in the Qemu via new hypercall.
For physical PCI device, vIOMMU in hypervisor shadows IO page table from IOVA->GPA to IOVA->HPA and load page table to physical IOMMU.

The following diagram shows 2th level translation architecture.
[ASCII diagram, garbled in this archive: Qemu's virtual PCI device issues a DMA request to the dummy xen-vIOMMU, which returns the translated target GPA for the memory-region access; a hypercall path leads into the hypervisor's vIOMMU and IOMMU driver, and from there down to the hardware IOMMU and memory.]
Re: [Xen-devel] Xen virtual IOMMU high level design doc
On 2016年08月31日 20:02, Jan Beulich wrote: On 31.08.16 at 10:39,wrote: >> > On 2016年08月25日 19:11, Jan Beulich wrote: >> > On 17.08.16 at 14:05, wrote: >>> 1 Motivation for Xen vIOMMU >>> >>> === >>> 1.1 Enable more than 255 vcpu support >>> HPC virtualization requires more than 255 vcpus support in a single VM >>> to meet parallel computing requirement. More than 255 vcpus support >>> requires interrupt remapping capability present on vIOMMU to deliver >>> interrupt to #vcpu >255 Otherwise Linux guest fails to boot up with >>> >255 >>> vcpus if interrupt remapping is absent. >>> >> >>> >> I continue to question this as a valid motivation at this point in >>> >> time, for the reasons Andrew has been explaining. >> > >> > If we want to support Linux guest with >255 vcpus, interrupt remapping >> > is necessary. > I don't understand why you keep repeating this, without adding > _why_ you think there is a demand for such guests and _what_ > your plans are to eliminate Andrew's concerns. > The motivation for such a huge VM is the HPC (high-performance computing) cloud service, which requires high-performance parallel computing. We create a single VM on one machine and expose more than 255 pcpus to the VM to ensure highly parallel computing in the VM. Each vcpu is pinned to a pcpu. For performance, we achieved >95% of native performance (stream, dgemm and sgemm benchmarks in the VM) after some tuning and optimizations. We presented these results at this year's Xen Summit. For stability, Andrew found some issues where a huge VM with the watchdog enabled caused a hypervisor reboot. We will reproduce and fix them. -- Best regards Tianyu Lan
Re: [Xen-devel] Xen virtual IOMMU high level design doc
> From: Jan Beulich [mailto:jbeul...@suse.com] > Sent: Wednesday, August 31, 2016 8:03 PM > >>> 3.5 Implementation consideration > >>> Linux Intel IOMMU driver will fail to be loaded without 2th level > >>> translation support even if interrupt remapping and 1th level > >>> translation are available. This means it's needed to enable 2th level > >>> translation first before other functions. > >> > >> Is there a reason for this? I.e. do they unconditionally need that > >> functionality? > > > > Yes, Linux intel IOMMU driver unconditionally needs l2 translation. > > Driver checks whether there is a valid sagaw(supported Adjusted Guest > > Address Widths) during initializing IOMMU data struct and return error > > if not. > > How about my first question then? > > Jan The VT-d spec doesn't define a capability bit for 2nd level translation (for 1st level and interrupt remapping, there are such capability bits to report). So architecturally there is no way to tell the guest that the 2nd level translation capability is unavailable; the existing Linux behavior is therefore correct. Thanks Kevin
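The sagaw check described above can be sketched as follows: the driver reads the Supported Adjusted Guest Address Widths field from the VT-d capability register and fails initialization when no width is advertised, which is why a vIOMMU must always expose l2 translation. The SAGAW field position (bits 12:8) follows my reading of the VT-d spec; treat the helper names as illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* SAGAW field of the VT-d capability register (bits 12:8). */
#define CAP_SAGAW(cap)  (((cap) >> 8) & 0x1f)

/* Return the widest supported guest address width, or -1 when the
 * capability register advertises no 2nd-level support at all --
 * the case in which the Linux driver refuses to load. */
static int widest_gaw(uint64_t cap)
{
    unsigned int sagaw = CAP_SAGAW(cap);
    /* bit 1 = 39-bit (3-level), bit 2 = 48-bit (4-level), ... */
    for (int bit = 4; bit >= 0; bit--)
        if (sagaw & (1u << bit))
            return 30 + 9 * bit;
    return -1;  /* driver initialization fails here */
}
```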
Re: [Xen-devel] Xen virtual IOMMU high level design doc
>>> On 31.08.16 at 10:39,wrote: > On 2016年08月25日 19:11, Jan Beulich wrote: > On 17.08.16 at 14:05, wrote: >>> 1 Motivation for Xen vIOMMU >>> >>> === >>> 1.1 Enable more than 255 vcpu support >>> HPC virtualization requires more than 255 vcpus support in a single VM >>> to meet parallel computing requirement. More than 255 vcpus support >>> requires interrupt remapping capability present on vIOMMU to deliver >>> interrupt to #vcpu >255 Otherwise Linux guest fails to boot up with >255 >>> vcpus if interrupt remapping is absent. >> >> I continue to question this as a valid motivation at this point in >> time, for the reasons Andrew has been explaining. > > If we want to support Linux guest with >255 vcpus, interrupt remapping > is necessary. I don't understand why you keep repeating this, without adding _why_ you think there is a demand for such guests and _what_ your plans are to eliminate Andrew's concerns. >>> 3 Xen hypervisor >>> == >>> >>> 3.1 New hypercall XEN_SYSCTL_viommu_op >>> 1) Definition of "struct xen_sysctl_viommu_op" as new hypercall parameter. >>> >>> struct xen_sysctl_viommu_op { >>> u32 cmd; >>> u32 domid; >>> union { >>> struct { >>> u32 capabilities; >>> } query_capabilities; >>> struct { >>> u32 capabilities; >>> u64 base_address; >>> } create_iommu; >>> struct { >>> u8 bus; >>> u8 devfn; >> >> Please can we avoid introducing any new interfaces without segment/ >> domain value, even if for now it'll be always zero? > > Sure. Will add segment field. > >> >>> u64 iova; >>> u64 translated_addr; >>> u64 addr_mask; /* Translation page size */ >>> IOMMUAccessFlags permisson; >>> } 2th_level_translation; >> >> I suppose "translated_addr" is an output here, but for the following >> fields this already isn't clear. Please add IN and OUT annotations for >> clarity. >> >> Also, may I suggest to name this "l2_translation"? 
(But there are
>> other implementation specific things to be considered here, which
>> I guess don't belong into a design doc discussion.)
>
> How about this?
> struct {
>     /* IN parameters. */
>     u8 segment;
>     u8 bus;
>     u8 devfn;
>     u64 iova;
>     /* Out parameters. */
>     u64 translated_addr;
>     u64 addr_mask;  /* Translation page size */
>     IOMMUAccessFlags permisson;
> } l2_translation;

"segment" clearly needs to be a 16-bit value, but apart from that (and missing padding fields) this looks okay.

>>> 3.5 Implementation consideration
>>> Linux Intel IOMMU driver will fail to be loaded without 2th level
>>> translation support even if interrupt remapping and 1th level
>>> translation are available. This means it's needed to enable 2th level
>>> translation first before other functions.
>>
>> Is there a reason for this? I.e. do they unconditionally need that
>> functionality?
>
> Yes, Linux intel IOMMU driver unconditionally needs l2 translation.
> Driver checks whether there is a valid sagaw(supported Adjusted Guest
> Address Widths) during initializing IOMMU data struct and return error
> if not.

How about my first question then?

Jan
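Folding Jan's feedback in (16-bit segment, plus explicit padding so the layout is identical for 32- and 64-bit callers), the parameter block might end up as below. This is a sketch of one possible layout, not the final ABI:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
    IOMMU_NONE = 0,
    IOMMU_RO   = 1,
    IOMMU_WO   = 2,
    IOMMU_RW   = 3,
} IOMMUAccessFlags;

struct l2_translation {
    /* IN parameters. */
    uint16_t segment;        /* PCI segment/domain; always 0 for now */
    uint8_t  bus;
    uint8_t  devfn;
    uint32_t pad;            /* keep iova 8-byte aligned */
    uint64_t iova;
    /* OUT parameters. */
    uint64_t translated_addr;
    uint64_t addr_mask;      /* translation page size */
    IOMMUAccessFlags permission;
    uint32_t pad2;           /* pad the tail to an 8-byte multiple */
};
```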
Re: [Xen-devel] Xen virtual IOMMU high level design doc
Hi Jan: Sorry for the late response. Thanks a lot for your comments. On 2016年08月25日 19:11, Jan Beulich wrote: On 17.08.16 at 14:05,wrote: >> 1 Motivation for Xen vIOMMU >> >> === >> 1.1 Enable more than 255 vcpu support >> HPC virtualization requires more than 255 vcpus support in a single VM >> to meet parallel computing requirement. More than 255 vcpus support >> requires interrupt remapping capability present on vIOMMU to deliver >> interrupt to #vcpu >255 Otherwise Linux guest fails to boot up with >255 >> vcpus if interrupt remapping is absent. > > I continue to question this as a valid motivation at this point in > time, for the reasons Andrew has been explaining. If we want to support Linux guests with >255 vcpus, interrupt remapping is necessary. The Linux commit introducing x2apic and IR mode says IR is a pre-requisite for enabling x2apic mode in the CPU: https://lwn.net/Articles/289881/ So far, we are not sure about the behavior of other OSes. We may watch Windows guest behavior on KVM later; there is still a bug when running a Windows guest with the IR function on KVM. > >> 2. Xen vIOMMU Architecture >> >> >> >> * vIOMMU will be inside Xen hypervisor for following factors >> 1) Avoid round trips between Qemu and Xen hypervisor >> 2) Ease of integration with the rest of the hypervisor >> 3) HVMlite/PVH doesn't use Qemu >> * Dummy xen-vIOMMU in Qemu as a wrapper of new hypercall to create >> /destory vIOMMU in hypervisor and deal with virtual PCI device's 2th >> level translation. > > How does the create/destroy part of this match up with 3) right > ahead of it? The create/destroy hypercalls will work for both HVM and HVMlite. HVMlite still has a toolstack (e.g. libxl), which can call the new hypercalls to create or destroy the virtual IOMMU in the hypervisor. > >> 3 Xen hypervisor >> == >> >> 3.1 New hypercall XEN_SYSCTL_viommu_op >> 1) Definition of "struct xen_sysctl_viommu_op" as new hypercall parameter.
>> >> struct xen_sysctl_viommu_op { >> u32 cmd; >> u32 domid; >> union { >> struct { >> u32 capabilities; >> } query_capabilities; >> struct { >> u32 capabilities; >> u64 base_address; >> } create_iommu; >> struct { >> u8 bus; >> u8 devfn; > > Please can we avoid introducing any new interfaces without segment/ > domain value, even if for now it'll be always zero? Sure. Will add segment field. > >> u64 iova; >> u64 translated_addr; >> u64 addr_mask; /* Translation page size */ >> IOMMUAccessFlags permisson; >> } 2th_level_translation; > > I suppose "translated_addr" is an output here, but for the following > fields this already isn't clear. Please add IN and OUT annotations for > clarity. > > Also, may I suggest to name this "l2_translation"? (But there are > other implementation specific things to be considered here, which > I guess don't belong into a design doc discussion.) How about this? struct { /* IN parameters. */ u8 segment; u8 bus; u8 devfn; u64 iova; /* Out parameters. */ u64 translated_addr; u64 addr_mask; /* Translation page size */ IOMMUAccessFlags permisson; } l2_translation; > >> }; >> >> typedef enum { >> IOMMU_NONE = 0, >> IOMMU_RO = 1, >> IOMMU_WO = 2, >> IOMMU_RW = 3, >> } IOMMUAccessFlags; >> >> >> Definition of VIOMMU subops: >> #define XEN_SYSCTL_viommu_query_capability 0 >> #define XEN_SYSCTL_viommu_create 1 >> #define XEN_SYSCTL_viommu_destroy2 >> #define XEN_SYSCTL_viommu_dma_translation_for_vpdev 3 >> >> Definition of VIOMMU capabilities >> #define XEN_VIOMMU_CAPABILITY_1nd_level_translation (1 << 0) >> #define XEN_VIOMMU_CAPABILITY_2nd_level_translation (1 << 1) > > l1 and l2 respectively again, please. Will update. > >> 3.3 Interrupt remapping >> Interrupts from virtual devices and physical devices will be delivered >> to vlapic from vIOAPIC and vMSI. It needs to add interrupt remapping >> hooks in the vmsi_deliver() and ioapic_deliver() to find target vlapic >> according interrupt remapping table. The following diagram shows the logic. 
> > Missing diagram or stale sentence? Sorry. It's stale sentence and moved the diagram to 2.2 Interrupt remapping overview. > >> 3.5 Implementation consideration >> Linux Intel IOMMU driver will fail to be loaded without 2th level >> translation support even if
Re: [Xen-devel] Xen virtual IOMMU high level design doc
>>> On 17.08.16 at 14:05,wrote: > 1 Motivation for Xen vIOMMU > > === > 1.1 Enable more than 255 vcpu support > HPC virtualization requires more than 255 vcpus support in a single VM > to meet parallel computing requirement. More than 255 vcpus support > requires interrupt remapping capability present on vIOMMU to deliver > interrupt to #vcpu >255 Otherwise Linux guest fails to boot up with >255 > vcpus if interrupt remapping is absent. I continue to question this as a valid motivation at this point in time, for the reasons Andrew has been explaining. > 2. Xen vIOMMU Architecture > > > > * vIOMMU will be inside Xen hypervisor for following factors > 1) Avoid round trips between Qemu and Xen hypervisor > 2) Ease of integration with the rest of the hypervisor > 3) HVMlite/PVH doesn't use Qemu > * Dummy xen-vIOMMU in Qemu as a wrapper of new hypercall to create > /destory vIOMMU in hypervisor and deal with virtual PCI device's 2th > level translation. How does the create/destroy part of this match up with 3) right ahead of it? > 3 Xen hypervisor > == > > 3.1 New hypercall XEN_SYSCTL_viommu_op > 1) Definition of "struct xen_sysctl_viommu_op" as new hypercall parameter. > > struct xen_sysctl_viommu_op { > u32 cmd; > u32 domid; > union { > struct { > u32 capabilities; > } query_capabilities; > struct { > u32 capabilities; > u64 base_address; > } create_iommu; > struct { > u8 bus; > u8 devfn; Please can we avoid introducing any new interfaces without segment/ domain value, even if for now it'll be always zero? > u64 iova; > u64 translated_addr; > u64 addr_mask; /* Translation page size */ > IOMMUAccessFlags permisson; > } 2th_level_translation; I suppose "translated_addr" is an output here, but for the following fields this already isn't clear. Please add IN and OUT annotations for clarity. Also, may I suggest to name this "l2_translation"? (But there are other implementation specific things to be considered here, which I guess don't belong into a design doc discussion.) 
> };
>
> typedef enum {
>     IOMMU_NONE = 0,
>     IOMMU_RO = 1,
>     IOMMU_WO = 2,
>     IOMMU_RW = 3,
> } IOMMUAccessFlags;
>
> Definition of VIOMMU subops:
> #define XEN_SYSCTL_viommu_query_capability           0
> #define XEN_SYSCTL_viommu_create                     1
> #define XEN_SYSCTL_viommu_destroy                    2
> #define XEN_SYSCTL_viommu_dma_translation_for_vpdev  3
>
> Definition of VIOMMU capabilities
> #define XEN_VIOMMU_CAPABILITY_1nd_level_translation (1 << 0)
> #define XEN_VIOMMU_CAPABILITY_2nd_level_translation (1 << 1)

l1 and l2 respectively again, please.

> 3.3 Interrupt remapping
> Interrupts from virtual devices and physical devices will be delivered
> to vlapic from vIOAPIC and vMSI. It needs to add interrupt remapping
> hooks in the vmsi_deliver() and ioapic_deliver() to find target vlapic
> according interrupt remapping table. The following diagram shows the logic.

Missing diagram or stale sentence?

> 3.5 Implementation consideration
> Linux Intel IOMMU driver will fail to be loaded without 2th level
> translation support even if interrupt remapping and 1th level
> translation are available. This means it's needed to enable 2th level
> translation first before other functions.

Is there a reason for this? I.e. do they unconditionally need that functionality?

Jan
Re: [Xen-devel] Xen virtual IOMMU high level design doc
On 8/17/2016 8:42 PM, Paul Durrant wrote: -Original Message- From: Xen-devel [mailto:xen-devel-boun...@lists.xen.org] On Behalf Of Lan, Tianyu Sent: 17 August 2016 13:06 To: Jan Beulich; Kevin Tian; Andrew Cooper; yang.zhang...@gmail.com; Jun Nakajima; Stefano Stabellini Cc: Anthony Perard; xuqu...@huawei.com; xen- de...@lists.xensource.com; Ian Jackson; Roger Pau Monne Subject: [Xen-devel] Xen virtual IOMMU high level design doc Hi All: The following is our Xen vIOMMU high level design for detail discussion. Please have a look. Very appreciate for your comments. This design doesn't cover changes when root port is moved to hypervisor. We may design it later. Content: === 1. Motivation of vIOMMU 1.1 Enable more than 255 vcpus 1.2 Support VFIO-based user space driver 1.3 Support guest Shared Virtual Memory (SVM) 2. Xen vIOMMU Architecture 2.1 2th level translation overview 2.2 Interrupt remapping overview 3. Xen hypervisor 3.1 New vIOMMU hypercall interface Would it not have been better to build on the previously discussed (and mostly agreed) PV IOMMU interface? (See https://lists.xenproject.org/archives/html/xen-devel/2016-02/msg01428.html). An RFC implementation series was also posted (https://lists.xenproject.org/archives/html/xen-devel/2016-02/msg01441.html). Paul Hi Paul: Thanks for your input. I glanced at the patchset; it introduces the hypercall "HYPERVISOR_iommu_op", which currently works only for the PV IOMMU. We may abstract it and make it work for both PV and virtual IOMMU.
Re: [Xen-devel] Xen virtual IOMMU high level design doc
> -Original Message- > From: Xen-devel [mailto:xen-devel-boun...@lists.xen.org] On Behalf Of > Lan, Tianyu > Sent: 17 August 2016 13:06 > To: Jan Beulich; Kevin Tian; Andrew Cooper; yang.zhang...@gmail.com; Jun > Nakajima; Stefano Stabellini > Cc: Anthony Perard; xuqu...@huawei.com; xen- > de...@lists.xensource.com; Ian Jackson; Roger Pau Monne > Subject: [Xen-devel] Xen virtual IOMMU high level design doc > > Hi All: > The following is our Xen vIOMMU high level design for detail > discussion. Please have a look. Very appreciate for your comments. > This design doesn't cover changes when root port is moved to hypervisor. > We may design it later. > > > Content: > == > = > 1. Motivation of vIOMMU > 1.1 Enable more than 255 vcpus > 1.2 Support VFIO-based user space driver > 1.3 Support guest Shared Virtual Memory (SVM) > 2. Xen vIOMMU Architecture > 2.1 2th level translation overview > 2.2 Interrupt remapping overview > 3. Xen hypervisor > 3.1 New vIOMMU hypercall interface Would it not have been better to build on the previously discussed (and mostly agreed) PV IOMMU interface? (See https://lists.xenproject.org/archives/html/xen-devel/2016-02/msg01428.html). An RFC implementation series was also posted (https://lists.xenproject.org/archives/html/xen-devel/2016-02/msg01441.html). Paul
[Xen-devel] Xen virtual IOMMU high level design doc
Hi All: The following is our Xen vIOMMU high level design for detail discussion. Please have a look. Very appreciate for your comments. This design doesn't cover changes when root port is moved to hypervisor. We may design it later. Content: === 1. Motivation of vIOMMU 1.1 Enable more than 255 vcpus 1.2 Support VFIO-based user space driver 1.3 Support guest Shared Virtual Memory (SVM) 2. Xen vIOMMU Architecture 2.1 2th level translation overview 2.2 Interrupt remapping overview 3. Xen hypervisor 3.1 New vIOMMU hypercall interface 3.2 2nd level translation 3.3 Interrupt remapping 3.4 1st level translation 3.5 Implementation consideration 4. Qemu 4.1 Qemu vIOMMU framework 4.2 Dummy xen-vIOMMU driver 4.3 Q35 vs. i440x 4.4 Report vIOMMU to hvmloader 1 Motivation for Xen vIOMMU === 1.1 Enable more than 255 vcpu support HPC virtualization requires more than 255 vcpus support in a single VM to meet parallel computing requirement. More than 255 vcpus support requires interrupt remapping capability present on vIOMMU to deliver interrupt to #vcpu >255 Otherwise Linux guest fails to boot up with >255 vcpus if interrupt remapping is absent. 1.2 Support VFIO-based user space driver (e.g. DPDK) in the guest It relies on the 2nd level translation capability (IOVA->GPA) on vIOMMU. pIOMMU 2nd level becomes a shadowing structure of vIOMMU to isolate DMA requests initiated by user space driver. 1.3 Support guest SVM (Shared Virtual Memory) It relies on the 1st level translation table capability (GVA->GPA) on vIOMMU. pIOMMU needs to enable both 1st level and 2nd level translation in nested mode (GVA->GPA->HPA) for passthrough device. IGD passthrough is the main usage today (to support OpenCL 2.0 SVM feature). In the future SVM might be used by other I/O devices too. 2. 
Xen vIOMMU Architecture

* vIOMMU will be inside Xen hypervisor for following factors
1) Avoid round trips between Qemu and Xen hypervisor
2) Ease of integration with the rest of the hypervisor
3) HVMlite/PVH doesn't use Qemu
* Dummy xen-vIOMMU in Qemu as a wrapper of new hypercall to create/destroy vIOMMU in hypervisor and deal with virtual PCI device's 2th level translation.

2.1 2th level translation overview
For Virtual PCI device, dummy xen-vIOMMU does translation in the Qemu via new hypercall.
For physical PCI device, vIOMMU in hypervisor shadows IO page table from IOVA->GPA to IOVA->HPA and load page table to physical IOMMU.

The following diagram shows 2th level translation architecture.
[ASCII diagram, garbled in this archive: Qemu's virtual PCI device issues a DMA request to the dummy xen-vIOMMU, which returns the translated target GPA for the memory-region access; a hypercall path leads into the hypervisor's vIOMMU and IOMMU driver, and from there down to the hardware IOMMU and memory.]