Re: [Xen-devel] Xen virtual IOMMU high level design doc

2016-11-24 Thread Lan, Tianyu



On 11/24/2016 9:37 PM, Edgar E. Iglesias wrote:

On Thu, Nov 24, 2016 at 02:49:41PM +0800, Lan Tianyu wrote:

On 2016年11月24日 12:09, Edgar E. Iglesias wrote:

Hi,


I have a few questions.

If I understand correctly, you'll be emulating an Intel IOMMU in Xen.
So guests will essentially create intel iommu style page-tables.

If we were to use this on Xen/ARM, we would likely be modelling an ARM
SMMU as a vIOMMU. Since Xen on ARM does not use QEMU for emulation, the
hypervisor OPs for QEMUs xen dummy IOMMU queries would not really be used.
Do I understand this correctly?


I think they could be called from the toolstack. This is why I was
saying in the other thread that the hypercalls should be general enough
that QEMU is not the only caller.

For PVH and ARM guests, the toolstack should be able to setup the vIOMMU
on behalf of the guest without QEMU intervention.

OK, I see. Or, I think I understand, not sure :-)

In QEMU when someone changes mappings in an IOMMU there will be a notifier
to tell caches upstream that mappings have changed. I think we will need to
prepare for that. I.e when TCG CPUs sit behind an IOMMU.


For Xen side, we may notify pIOMMU driver about mapping change via
calling pIOMMU driver's API in vIOMMU.


I was referring to the other way around. When a guest modifies the mappings
for a vIOMMU, the driver domain with QEMU and vDevices needs to be notified.

I couldn't find any mention of this in the document...


The Qemu side won't have an IOTLB cache; all DMA translation information is
in the hypervisor. All vDevice DMA requests are passed to the hypervisor,
the hypervisor returns the translated address, and then Qemu finally
finishes the DMA operation.


There is a race condition between the IOTLB invalidation operation and the
vDevices' in-flight DMA. We proposed a solution in "3.2 l2 translation -
For virtual PCI device". We hope to take advantage of the current ioreq
mechanism to achieve something like a notifier.


Both the vIOMMU in the hypervisor and the dummy vIOMMU in Qemu register the
same MMIO region. When there is an invalidation MMIO access and the
hypervisor wants to notify Qemu, the vIOMMU's MMIO handler returns
X86EMUL_UNHANDLEABLE, and the IO emulation handler is then supposed to send
an IO request to Qemu. The dummy vIOMMU in Qemu receives the event and
starts to drain the in-flight DMA operations.
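As a rough illustration of the flow described above (a sketch only; the
helper names and the exact handler signature are assumptions, not the
actual Xen code):

    /* Hypothetical vIOMMU MMIO write handler in Xen. */
    static int viommu_mmio_write(struct vcpu *v, unsigned long addr,
                                 unsigned int len, unsigned long val)
    {
        struct viommu *viommu = vcpu_viommu(v);       /* assumed helper */

        if ( !is_invalidation_reg(viommu, addr) )
        {
            viommu_reg_write(viommu, addr, len, val); /* ordinary register */
            return X86EMUL_OKAY;
        }

        /* Flush the hypervisor-side translations for this request first. */
        viommu_flush_iotlb(viommu, val);

        /*
         * Returning X86EMUL_UNHANDLEABLE here makes the IO emulation layer
         * forward the same access as an ioreq to Qemu, whose dummy vIOMMU
         * (registered on the same MMIO range) drains in-flight DMA before
         * completing the access.
         */
        return X86EMUL_UNHANDLEABLE;
    }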

Another area that may need change is that on ARM we need the map-query to return
the memory attributes for the given mapping. Today QEMU or any emulator
doesn't use it much but in the future things may change.


What about the mem attributes?
It's very likely we'll add support for memory attributes for IOMMUs in QEMU
at some point.
Emulated IOMMUs will thus have the ability to modify attributes (e.g.
SourceIDs, cacheability, etc). Perhaps we could allocate or reserve a
uint64_t for attributes TBD later in the query struct.


Sounds like you hope to extend the capability variable in the query struct
to uint64_t to support more future features, right?


I have added a "permission" variable in struct l2_translation to return the
vIOMMU's memory access permission for the vDevice's DMA request. Not sure
whether it can meet your requirement.
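For discussion purposes, a possible layout of that query sub-structure
could look as follows (field names are assumptions based on this thread,
not the actual hypercall ABI; the final uint64_t is the reserved attribute
space suggested above):

    struct viommu_l2_translation {
        /* IN parameters. */
        uint64_t iova;            /* IO virtual address to translate */
        /* OUT parameters. */
        uint64_t translated_addr; /* resulting guest physical address */
        uint64_t addr_mask;       /* translation page size mask */
        uint32_t permission;      /* access permission returned by vIOMMU */
        uint32_t pad;
        uint64_t attributes;      /* reserved for future memory attributes */
    };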


For SVM, we will also need to deal with page-table faults by the IOMMU.
So I think there will need to be a channel from Xen to the guest to report these.


Yes, the vIOMMU should forward the page-fault event to the guest. On the
VT-d side, we will trigger the VT-d fault interrupt to notify the guest
about the event.


OK, Cool.

Perhaps you should document how this (and the map/unmap notifiers) will work?


This is VT-d specific handling of some fault events, and the vIOMMU emulates
its interrupt just like other virtual device models do. So I didn't put this
in this design document.
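For reference, the emulation could look roughly like the sketch below
(the register names follow the VT-d spec; everything else is a hypothetical
illustration of raising the fault event interrupt the guest programmed):

    static void viommu_report_fault(struct viommu *viommu,
                                    const struct viommu_fault *fault)
    {
        /* Log the fault in an emulated fault recording register slot. */
        viommu_record_fault(viommu, fault);

        /* Set Primary Pending Fault in the emulated Fault Status register. */
        viommu->regs.fsts |= VIOMMU_FSTS_PPF;

        /* Inject the fault event interrupt unless the guest masked it. */
        if ( !(viommu->regs.fectl & VIOMMU_FECTL_IM) )
            viommu_inject_msi(viommu->domain,
                              viommu->regs.fault_event_addr,
                              viommu->regs.fault_event_data);
    }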


For the mapping change, please see the first comment above.

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] Xen virtual IOMMU high level design doc

2016-11-24 Thread Xuquan (Quan Xu)
On November 24, 2016 9:38 PM, 
> On Thu, Nov 24, 2016 at 02:49:41PM +0800, Lan Tianyu wrote:
> > On 2016年11月24日 12:09, Edgar E. Iglesias wrote:
> > > > > > Hi,
> > > > > >
> > > > > > I have a few questions.
> > > > > >
> > > > > > If I understand correctly, you'll be emulating an Intel IOMMU in Xen.
> > > > > > So guests will essentially create intel iommu style page-tables.
> > > > > >
> > > > > > If we were to use this on Xen/ARM, we would likely be modelling an ARM
> > > > > > SMMU as a vIOMMU. Since Xen on ARM does not use QEMU for emulation, the
> > > > > > hypervisor OPs for QEMUs xen dummy IOMMU queries would not really be used.
> > > > > > Do I understand this correctly?
> > > > >
> > > > > I think they could be called from the toolstack. This is why I
> > > > > was saying in the other thread that the hypercalls should be
> > > > > general enough that QEMU is not the only caller.
> > > > >
> > > > > For PVH and ARM guests, the toolstack should be able to setup
> > > > > the vIOMMU on behalf of the guest without QEMU intervention.
> > >
> > > OK, I see. Or, I think I understand, not sure :-)
> > >
> > > In QEMU when someone changes mappings in an IOMMU there will be a
> > > notifier to tell caches upstream that mappings have changed. I think
> > > we will need to prepare for that. I.e when TCG CPUs sit behind an IOMMU.
> >
> > For Xen side, we may notify pIOMMU driver about mapping change via
> > calling pIOMMU driver's API in vIOMMU.
>
> I was referring to the other way around. When a guest modifies the
> mappings for a vIOMMU, the driver domain with QEMU and vDevices needs
> to be notified.
>
> I couldn't find any mention of this in the document...
>

Edgar,
As mentioned, it supports VFIO-based user space drivers (e.g. DPDK) in the guest.
I am afraid all of the guest memory is pinned... Lan, right?

Quan


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] Xen virtual IOMMU high level design doc

2016-11-24 Thread Edgar E. Iglesias
On Thu, Nov 24, 2016 at 02:49:41PM +0800, Lan Tianyu wrote:
> On 2016年11月24日 12:09, Edgar E. Iglesias wrote:
> > > > > Hi,
> > > > >
> > > > > I have a few questions.
> > > > >
> > > > > If I understand correctly, you'll be emulating an Intel IOMMU in Xen.
> > > > > So guests will essentially create intel iommu style page-tables.
> > > > >
> > > > > If we were to use this on Xen/ARM, we would likely be modelling an ARM
> > > > > SMMU as a vIOMMU. Since Xen on ARM does not use QEMU for emulation, the
> > > > > hypervisor OPs for QEMUs xen dummy IOMMU queries would not really be used.
> > > > > Do I understand this correctly?
> > > >
> > > > I think they could be called from the toolstack. This is why I was
> > > > saying in the other thread that the hypercalls should be general enough
> > > > that QEMU is not the only caller.
> > > >
> > > > For PVH and ARM guests, the toolstack should be able to setup the vIOMMU
> > > > on behalf of the guest without QEMU intervention.
> >
> > OK, I see. Or, I think I understand, not sure :-)
> >
> > In QEMU when someone changes mappings in an IOMMU there will be a notifier
> > to tell caches upstream that mappings have changed. I think we will need to
> > prepare for that. I.e when TCG CPUs sit behind an IOMMU.
>
> For Xen side, we may notify pIOMMU driver about mapping change via
> calling pIOMMU driver's API in vIOMMU.

I was referring to the other way around. When a guest modifies the mappings
for a vIOMMU, the driver domain with QEMU and vDevices needs to be notified.

I couldn't find any mention of this in the document...


> 
> >
> > Another area that may need change is that on ARM we need the map-query to
> > return the memory attributes for the given mapping. Today QEMU or any
> > emulator doesn't use it much but in the future things may change.

What about the mem attributes?
It's very likely we'll add support for memory attributes for IOMMUs in QEMU
at some point.
Emulated IOMMUs will thus have the ability to modify attributes (e.g.
SourceIDs, cacheability, etc). Perhaps we could allocate or reserve a
uint64_t for attributes TBD later in the query struct.



> >
> > For SVM, we will also need to deal with page-table faults by the IOMMU.
> > So I think there will need to be a channel from Xen to the guest to report
> > these.
> 
> Yes, the vIOMMU should forward the page-fault event to the guest. On the
> VT-d side, we will trigger the VT-d fault interrupt to notify the guest
> about the event.

OK, Cool.

Perhaps you should document how this (and the map/unmap notifiers) will work?

I also think it would be a good idea to add a little more introduction so that
some of the questions we've been asking regarding the general design are easier
to grasp.

Best regards,
Edgar

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] Xen virtual IOMMU high level design doc

2016-11-23 Thread Lan Tianyu
On 2016年11月24日 12:09, Edgar E. Iglesias wrote:
> > > > Hi,
> > > >
> > > > I have a few questions.
> > > >
> > > > If I understand correctly, you'll be emulating an Intel IOMMU in Xen.
> > > > So guests will essentially create intel iommu style page-tables.
> > > >
> > > > If we were to use this on Xen/ARM, we would likely be modelling an ARM
> > > > SMMU as a vIOMMU. Since Xen on ARM does not use QEMU for emulation, the
> > > > hypervisor OPs for QEMUs xen dummy IOMMU queries would not really be used.
> > > > Do I understand this correctly?
> > >
> > > I think they could be called from the toolstack. This is why I was
> > > saying in the other thread that the hypercalls should be general enough
> > > that QEMU is not the only caller.
> > >
> > > For PVH and ARM guests, the toolstack should be able to setup the vIOMMU
> > > on behalf of the guest without QEMU intervention.
>
> OK, I see. Or, I think I understand, not sure :-)
>
> In QEMU when someone changes mappings in an IOMMU there will be a notifier
> to tell caches upstream that mappings have changed. I think we will need to
> prepare for that. I.e when TCG CPUs sit behind an IOMMU.

For Xen side, we may notify pIOMMU driver about mapping change via
calling pIOMMU driver's API in vIOMMU.

>
> Another area that may need change is that on ARM we need the map-query to
> return the memory attributes for the given mapping. Today QEMU or any
> emulator doesn't use it much but in the future things may change.
>
> For SVM, we will also need to deal with page-table faults by the IOMMU.
> So I think there will need to be a channel from Xen to the guest to report
> these.

Yes, the vIOMMU should forward the page-fault event to the guest. On the
VT-d side, we will trigger the VT-d fault interrupt to notify the guest
about the event.

>
> For example, what happens when a guest-assigned DMA unit page-faults?
> Xen needs to know how to forward this fault back to the guest for fixup and
> the guest needs to be able to fix it and tell the device that it's OK to
> continue. E.g. PCI PRI or similar.
> 
> 


-- 
Best regards
Tianyu Lan

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] Xen virtual IOMMU high level design doc

2016-11-23 Thread Edgar E. Iglesias
On Thu, Nov 24, 2016 at 02:00:21AM +, Tian, Kevin wrote:
> > From: Stefano Stabellini [mailto:sstabell...@kernel.org]
> > Sent: Thursday, November 24, 2016 3:09 AM
> > 
> > On Wed, 23 Nov 2016, Edgar E. Iglesias wrote:
> > > On Wed, Aug 17, 2016 at 08:05:51PM +0800, Lan, Tianyu wrote:
> > > > Hi All:
> > > >  The following is our Xen vIOMMU high level design for detail
> > > > discussion. Please have a look. Very appreciate for your comments.
> > > > This design doesn't cover changes when root port is moved to hypervisor.
> > > > We may design it later.
> > >
> > > Hi,
> > >
> > > I have a few questions.
> > >
> > > If I understand correctly, you'll be emulating an Intel IOMMU in Xen.
> > > So guests will essentially create intel iommu style page-tables.
> > >
> > > If we were to use this on Xen/ARM, we would likely be modelling an ARM
> > > SMMU as a vIOMMU. Since Xen on ARM does not use QEMU for emulation, the
> > > hypervisor OPs for QEMUs xen dummy IOMMU queries would not really be used.
> > > Do I understand this correctly?
> > 
> > I think they could be called from the toolstack. This is why I was
> > saying in the other thread that the hypercalls should be general enough
> > that QEMU is not the only caller.
> > 
> > For PVH and ARM guests, the toolstack should be able to setup the vIOMMU
> > on behalf of the guest without QEMU intervention.

OK, I see. Or, I think I understand, not sure :-)

In QEMU when someone changes mappings in an IOMMU there will be a notifier
to tell caches upstream that mappings have changed. I think we will need to
prepare for that. I.e when TCG CPUs sit behind an IOMMU.

Another area that may need change is that on ARM we need the map-query to return
the memory attributes for the given mapping. Today QEMU or any emulator 
doesn't use it much but in the future things may change.

For SVM, we will also need to deal with page-table faults by the IOMMU.
So I think there will need to be a channel from Xen to the guest to report these.

For example, what happens when a guest-assigned DMA unit page-faults?
Xen needs to know how to forward this fault back to the guest for fixup, and the
guest needs to be able to fix it and tell the device that it's OK to continue.
E.g. PCI PRI or similar.


> > > Has a platform agnostic PV-IOMMU been considered to support 2-stage
> > > translation (i.e VFIO in the guest)? Perhaps that would hurt map/unmap
> > > performance too much?
> > 
> > That's an interesting idea. I don't know if that's feasible, but if it
> > is not, then we need to be able to specify the PV-IOMMU type in the
> > hypercalls, so that you would get Intel IOMMU on x86 and SMMU on ARM.
> > 
> > 
> 
> Not considered yet. PV is always possible as we've done for other I/O
> devices. Ideally it could be designed being more efficient than full
> emulation of vendor specific IOMMU, but also means requirement of
> maintaining a new guest IOMMU driver and limitation of supporting
> only newer version guest OSes. It's a tradeoff... at least not compelling 
> now (may consider when we see a real need in the future).

Agreed. Thanks.

Best regards,
Edgar

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] Xen virtual IOMMU high level design doc V3

2016-11-23 Thread Lan Tianyu
On 2016年11月22日 18:24, Jan Beulich wrote:
 On 17.11.16 at 16:36,  wrote:
>> 2) Build ACPI DMAR table in toolstack
>> Now the toolstack can build the ACPI DMAR table according to the VM
>> configuration and pass it through to hvmloader via the xenstore ACPI PT
>> channel. But the vIOMMU MMIO region is managed by Qemu, and it needs to be
>> populated into the DMAR table. We may hardcode an address in both Qemu and
>> the toolstack and use the same address to create the vIOMMU and build the
>> DMAR table.
> 
> Let's try to avoid any new hard coding of values. Both tool stack
> and qemu ought to be able to retrieve a suitable address range
> from the hypervisor. Or if the tool stack was to allocate it, it could
> tell qemu.
> 
> Jan
> 

Hi Jan:
The address range is allocated by Qemu or the toolstack and passed to the
hypervisor when creating the vIOMMU. The vIOMMU's address range should be
under the PCI address space, so we need to reserve a piece of the PCI region
for the vIOMMU in the toolstack. Then we populate the base address in the
vDMAR table and tell Qemu the region via a new xenstore interface if we want
to create the vIOMMU in the Qemu dummy hypercall wrapper.

Another point: I am not sure whether we can create/destroy the vIOMMU
directly in the toolstack, because virtual device models are usually handled
by Qemu. If yes, we don't need a new xenstore interface. In this case, the
dummy vIOMMU in Qemu will just cover l2 translation for virtual devices.
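A minimal sketch of the first option (the toolstack reserves the range,
uses it for the vDMAR table, and publishes it for Qemu over xenstore); the
xenstore key and the helper name are assumptions, only xs_open/xs_write/
xs_close are real libxenstore calls:

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>
    #include <inttypes.h>
    #include <xenstore.h>

    static int publish_viommu_base(int domid, uint64_t base)
    {
        struct xs_handle *xs = xs_open(0);
        char path[64], val[32];
        int rc = -1;

        if ( !xs )
            return -1;

        /* Hypothetical key read later by Qemu (and/or hvmloader). */
        snprintf(path, sizeof(path),
                 "/local/domain/%d/hvmloader/viommu-base", domid);
        snprintf(val, sizeof(val), "0x%" PRIx64, base);

        if ( xs_write(xs, XBT_NULL, path, val, strlen(val)) )
            rc = 0;

        xs_close(xs);
        return rc;
    }

The same base address would then be used by the toolstack when it builds
the vDMAR table.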

-- 
Best regards
Tianyu Lan

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] Xen virtual IOMMU high level design doc

2016-11-23 Thread Tian, Kevin
> From: Stefano Stabellini [mailto:sstabell...@kernel.org]
> Sent: Thursday, November 24, 2016 3:09 AM
> 
> On Wed, 23 Nov 2016, Edgar E. Iglesias wrote:
> > On Wed, Aug 17, 2016 at 08:05:51PM +0800, Lan, Tianyu wrote:
> > > Hi All:
> > >  The following is our Xen vIOMMU high level design for detail
> > > discussion. Please have a look. Very appreciate for your comments.
> > > This design doesn't cover changes when root port is moved to hypervisor.
> > > We may design it later.
> >
> > Hi,
> >
> > I have a few questions.
> >
> > If I understand correctly, you'll be emulating an Intel IOMMU in Xen.
> > So guests will essentially create intel iommu style page-tables.
> >
> > If we were to use this on Xen/ARM, we would likely be modelling an ARM
> > SMMU as a vIOMMU. Since Xen on ARM does not use QEMU for emulation, the
> > hypervisor OPs for QEMUs xen dummy IOMMU queries would not really be used.
> > Do I understand this correctly?
> 
> I think they could be called from the toolstack. This is why I was
> saying in the other thread that the hypercalls should be general enough
> that QEMU is not the only caller.
> 
> For PVH and ARM guests, the toolstack should be able to setup the vIOMMU
> on behalf of the guest without QEMU intervention.
> 
> 
> > Has a platform agnostic PV-IOMMU been considered to support 2-stage
> > translation (i.e VFIO in the guest)? Perhaps that would hurt map/unmap
> > performance too much?
> 
> That's an interesting idea. I don't know if that's feasible, but if it
> is not, then we need to be able to specify the PV-IOMMU type in the
> hypercalls, so that you would get Intel IOMMU on x86 and SMMU on ARM.
> 
> 

Not considered yet. PV is always possible, as we've done for other I/O
devices. Ideally it could be designed to be more efficient than full
emulation of a vendor-specific IOMMU, but it also means the requirement of
maintaining a new guest IOMMU driver and the limitation of supporting only
newer guest OS versions. It's a tradeoff... at least not compelling
now (we may consider it when we see a real need in the future).

Thanks
Kevin

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] Xen virtual IOMMU high level design doc

2016-11-23 Thread Stefano Stabellini
On Wed, 23 Nov 2016, Edgar E. Iglesias wrote:
> On Wed, Aug 17, 2016 at 08:05:51PM +0800, Lan, Tianyu wrote:
> > Hi All:
> >  The following is our Xen vIOMMU high level design for detail
> > discussion. Please have a look. Very appreciate for your comments.
> > This design doesn't cover changes when root port is moved to hypervisor.
> > We may design it later.
> 
> Hi,
> 
> I have a few questions.
> 
> If I understand correctly, you'll be emulating an Intel IOMMU in Xen.
> So guests will essentially create intel iommu style page-tables.
> 
> If we were to use this on Xen/ARM, we would likely be modelling an ARM
> SMMU as a vIOMMU. Since Xen on ARM does not use QEMU for emulation, the
> hypervisor OPs for QEMUs xen dummy IOMMU queries would not really be used.
> Do I understand this correctly?

I think they could be called from the toolstack. This is why I was
saying in the other thread that the hypercalls should be general enough
that QEMU is not the only caller.

For PVH and ARM guests, the toolstack should be able to setup the vIOMMU
on behalf of the guest without QEMU intervention.


> Has a platform agnostic PV-IOMMU been considered to support 2-stage
> translation (i.e VFIO in the guest)? Perhaps that would hurt map/unmap
> performance too much?
 
That's an interesting idea. I don't know if that's feasible, but if it
is not, then we need to be able to specify the PV-IOMMU type in the
hypercalls, so that you would get Intel IOMMU on x86 and SMMU on ARM.


> > 
> > 
> > Content:
> > ===
> > 1. Motivation of vIOMMU
> > 1.1 Enable more than 255 vcpus
> > 1.2 Support VFIO-based user space driver
> > 1.3 Support guest Shared Virtual Memory (SVM)
> > 2. Xen vIOMMU Architecture
> > 2.1 2th level translation overview
> > 2.2 Interrupt remapping overview
> > 3. Xen hypervisor
> > 3.1 New vIOMMU hypercall interface
> > 3.2 2nd level translation
> > 3.3 Interrupt remapping
> > 3.4 1st level translation
> > 3.5 Implementation consideration
> > 4. Qemu
> > 4.1 Qemu vIOMMU framework
> > 4.2 Dummy xen-vIOMMU driver
> > 4.3 Q35 vs. i440x
> > 4.4 Report vIOMMU to hvmloader
> > 
> > 
> > 1 Motivation for Xen vIOMMU
> > ===
> > 1.1 Enable more than 255 vcpu support
> > HPC virtualization requires more than 255 vcpus support in a single VM
> > to meet parallel computing requirement. More than 255 vcpus support
> > requires interrupt remapping capability present on vIOMMU to deliver
> > interrupt to #vcpu >255 Otherwise Linux guest fails to boot up with >255
> > vcpus if interrupt remapping is absent.
> > 
> > 
> > 1.2 Support VFIO-based user space driver (e.g. DPDK) in the guest
> > It relies on the 2nd level translation capability (IOVA->GPA) on
> > vIOMMU. pIOMMU 2nd level becomes a shadowing structure of
> > vIOMMU to isolate DMA requests initiated by user space driver.
> > 
> > 
> > 1.3 Support guest SVM (Shared Virtual Memory)
> > It relies on the 1st level translation table capability (GVA->GPA) on
> > vIOMMU. pIOMMU needs to enable both 1st level and 2nd level translation
> > in nested mode (GVA->GPA->HPA) for passthrough device. IGD passthrough
> > is the main usage today (to support OpenCL 2.0 SVM feature). In the
> > future SVM might be used by other I/O devices too.
> > 
> > 2. Xen vIOMMU Architecture
> > 
> > 
> > * vIOMMU will be inside Xen hypervisor for following factors
> > 1) Avoid round trips between Qemu and Xen hypervisor
> > 2) Ease of integration with the rest of the hypervisor
> > 3) HVMlite/PVH doesn't use Qemu
> > * Dummy xen-vIOMMU in Qemu as a wrapper of new hypercall to create
> > /destory vIOMMU in hypervisor and deal with virtual PCI device's 2th
> > level translation.
> > 
> > 2.1 2th level translation overview
> > For Virtual PCI device, dummy xen-vIOMMU does translation in the
> > Qemu via new hypercall.
> > 
> > For physical PCI device, vIOMMU in hypervisor shadows IO page table from
> > IOVA->GPA to IOVA->HPA and load page table to physical IOMMU.
> > 
> > The following diagram shows the 2nd level translation architecture.
> > [ASCII diagram: the virtual PCI device in Qemu sends its DMA request to
> > the dummy xen-vIOMMU, which returns the target GPA to the memory region
> > and forwards translation requests via hypercall to the vIOMMU inside the
> > Xen hypervisor.]

Re: [Xen-devel] Xen virtual IOMMU high level design doc

2016-11-23 Thread Edgar E. Iglesias
On Wed, Aug 17, 2016 at 08:05:51PM +0800, Lan, Tianyu wrote:
> Hi All:
>  The following is our Xen vIOMMU high level design for detailed
> discussion. Please have a look. Your comments are very much appreciated.
> This design doesn't cover the changes needed when the root port is moved
> into the hypervisor. We may design that later.

Hi,

I have a few questions.

If I understand correctly, you'll be emulating an Intel IOMMU in Xen.
So guests will essentially create intel iommu style page-tables.

If we were to use this on Xen/ARM, we would likely be modelling an ARM
SMMU as a vIOMMU. Since Xen on ARM does not use QEMU for emulation, the
hypervisor OPs for QEMUs xen dummy IOMMU queries would not really be used.
Do I understand this correctly?

Has a platform agnostic PV-IOMMU been considered to support 2-stage
translation (i.e VFIO in the guest)? Perhaps that would hurt map/unmap
performance too much?

Best regards,
Edgar




> 
> 
> Content:
> ===
> 1. Motivation of vIOMMU
>   1.1 Enable more than 255 vcpus
>   1.2 Support VFIO-based user space driver
>   1.3 Support guest Shared Virtual Memory (SVM)
> 2. Xen vIOMMU Architecture
>   2.1 2th level translation overview
>   2.2 Interrupt remapping overview
> 3. Xen hypervisor
>   3.1 New vIOMMU hypercall interface
>   3.2 2nd level translation
>   3.3 Interrupt remapping
>   3.4 1st level translation
>   3.5 Implementation consideration
> 4. Qemu
>   4.1 Qemu vIOMMU framework
>   4.2 Dummy xen-vIOMMU driver
>   4.3 Q35 vs. i440x
>   4.4 Report vIOMMU to hvmloader
> 
> 
> 1 Motivation for Xen vIOMMU
> ===
> 1.1 Enable more than 255 vcpu support
> HPC virtualization requires support for more than 255 vcpus in a single VM
> to meet parallel computing requirements. Support for more than 255 vcpus
> requires the interrupt remapping capability present on the vIOMMU to
> deliver interrupts to vcpus with APIC IDs >255. Otherwise a Linux guest
> fails to boot with >255 vcpus if interrupt remapping is absent.
> 
> 
> 1.2 Support VFIO-based user space driver (e.g. DPDK) in the guest
> It relies on the 2nd level translation capability (IOVA->GPA) on
> vIOMMU. pIOMMU 2nd level becomes a shadowing structure of
> vIOMMU to isolate DMA requests initiated by user space driver.
> 
> 
> 1.3 Support guest SVM (Shared Virtual Memory)
> It relies on the 1st level translation table capability (GVA->GPA) on
> vIOMMU. pIOMMU needs to enable both 1st level and 2nd level translation
> in nested mode (GVA->GPA->HPA) for passthrough device. IGD passthrough
> is the main usage today (to support OpenCL 2.0 SVM feature). In the
> future SVM might be used by other I/O devices too.
> 
> 2. Xen vIOMMU Architecture
> 
> 
> * vIOMMU will be inside Xen hypervisor for following factors
>   1) Avoid round trips between Qemu and Xen hypervisor
>   2) Ease of integration with the rest of the hypervisor
>   3) HVMlite/PVH doesn't use Qemu
> * Dummy xen-vIOMMU in Qemu as a wrapper of the new hypercall to create
> /destroy the vIOMMU in the hypervisor and deal with the virtual PCI
> device's 2nd level translation.
> 
> 2.1 2nd level translation overview
> For a virtual PCI device, the dummy xen-vIOMMU does the translation in
> Qemu via the new hypercall.
> 
> For physical PCI device, vIOMMU in hypervisor shadows IO page table from
> IOVA->GPA to IOVA->HPA and load page table to physical IOMMU.
> 
> The following diagram shows the 2nd level translation architecture.
> [ASCII diagram: the virtual PCI device in Qemu sends its DMA request to the
> dummy xen-vIOMMU, which returns the target GPA to the memory region; the
> dummy xen-vIOMMU issues hypercalls to the vIOMMU inside the Xen hypervisor,
> which performs the translation.]

Re: [Xen-devel] Xen virtual IOMMU high level design doc V3

2016-11-22 Thread Lan Tianyu
On 2016年11月21日 15:05, Tian, Kevin wrote:
>> If someone add "intel_iommu=on" kernel parameter manually, IOMMU driver
>> > will panic guest because it can't enable DMA remapping function via gcmd
>> > register and "Translation Enable Status" bit in gsts register is never
>> > set by vIOMMU. This shows actual vIOMMU status that there is no l2
>> > translation support and warn user should not enable l2 translation.
> The rationale of section 3.5 is confusing. Do you mean sth. like below?
> 
> - We can first do IRQ remapping, because DMA remapping (l1/l2) and 
> IRQ remapping can be enabled separately according to VT-d spec. Enabling 
> of DMA remapping will be first emulated as a failure, which may lead
> to guest kernel panic if intel_iommu is turned on in the guest. But it's
> not a big problem because major distributions have DMA remapping
> disabled by default while IRQ remapping is enabled.
> 
> - For DMA remapping, likely you'll enable L2 translation first (there is
> no capability bit) with L1 translation disabled (there is a SVM capability 
> bit). 
> 
> If yes, maybe we can break this design into 3 parts too, so both
> design review and implementation side can move forward step by
> step?
> 

Yes, we may implement IRQ remapping first. I will break this design into
3 parts (interrupt remapping, L2 translation and L1 translation). IRQ
remapping will be the first one to be sent out for detailed discussion.

-- 
Best regards
Tianyu Lan

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] Xen virtual IOMMU high level design doc V3

2016-11-22 Thread Jan Beulich
>>> On 17.11.16 at 16:36,  wrote:
> 2) Build ACPI DMAR table in toolstack
> Now the toolstack can build the ACPI DMAR table according to the VM
> configuration and pass it through to hvmloader via the xenstore ACPI PT
> channel. But the vIOMMU MMIO region is managed by Qemu, and it needs to be
> populated into the DMAR table. We may hardcode an address in both Qemu and
> the toolstack and use the same address to create the vIOMMU and build the
> DMAR table.

Let's try to avoid any new hard coding of values. Both tool stack
and qemu ought to be able to retrieve a suitable address range
from the hypervisor. Or if the tool stack was to allocate it, it could
tell qemu.

Jan


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] Xen virtual IOMMU high level design doc V3

2016-11-22 Thread Lan Tianyu
On 2016年11月21日 21:41, Andrew Cooper wrote:
> On 17/11/16 15:36, Lan Tianyu wrote:
>> 3.2 l2 translation
>> 1) For virtual PCI device
>> Xen dummy xen-vIOMMU in Qemu translates IOVA to target GPA via new
>> hypercall when DMA operation happens.
>>
>> When guest triggers a invalidation operation, there maybe in-fly DMA
>> request for virtual device has been translated by vIOMMU and return back
>> Qemu. Before vIOMMU tells invalidation completed, it's necessary to make
>> sure in-fly DMA operation is completed.
>>
>> When IOMMU driver invalidates IOTLB, it also will wait until the
>> invalidation completion. We may use this to drain in-fly DMA operation
>> for virtual device.
>>
>> Guest triggers invalidation operation and trip into vIOMMU in
>> hypervisor to flush cache data. After this, it should go to Qemu to
>> drain in-fly DMA translation.
>>
>> To do that, dummy vIOMMU in Qemu registers the same MMIO region as
>> vIOMMU's and emulation part of invalidation operation in Xen hypervisor
>> returns X86EMUL_UNHANDLEABLE after flush cache. MMIO emulation part is
>> supposed to send event to Qemu and dummy vIOMMU get a chance to starts a
>> thread to drain in-fly DMA and return emulation done.
>>
>> Guest polls IVT(invalidate IOTLB) bit in the IOTLB invalidate register
>> until it's cleared after triggering invalidation. Dummy vIOMMU in Qemu
>> notifies hypervisor drain operation completed via hypercall, vIOMMU
>> clears IVT bit and guest finish invalidation operation.
> 
> Having the guest poll will be very inefficient.  If the invalidation
> does need to reach qemu, it will be a very long time until it
> completes.  Is there no interrupt based mechanism which can be used? 
> That way the guest can either handle it asynchronous itself, or block
> waiting on an interrupt, both of which are better than having it just
> spinning.
> 

Hi Andrew:
VT-d provides an interrupt event for queued invalidation completion, so the
guest can select poll or interrupt mode to wait for invalidation completion.
I found the Linux Intel IOMMU driver just uses poll mode, so I used it as an
example. Regardless of poll or interrupt mode, the guest will wait for the
invalidation to complete, and we just need to make sure in-flight DMA is
drained before the invalidation completion bit is cleared.
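As a concrete picture of the poll mode mentioned above, modelled loosely on
the Linux intel-iommu flow (simplified; the bit definitions follow the VT-d
spec, the rest is illustrative):

    #include <stdint.h>

    #define DMA_TLB_IVT           (1ULL << 63)  /* invalidation trigger/busy */
    #define DMA_TLB_GLOBAL_FLUSH  (1ULL << 60)  /* global IOTLB invalidation */

    static void iotlb_global_invalidate(volatile uint64_t *iotlb_reg)
    {
        *iotlb_reg = DMA_TLB_IVT | DMA_TLB_GLOBAL_FLUSH;

        /* The vIOMMU clears IVT only after Qemu has drained in-flight DMA
         * and notified the hypervisor, so leaving this loop guarantees no
         * stale translations are still being used. */
        while ( *iotlb_reg & DMA_TLB_IVT )
            ; /* cpu_relax() in a real driver */
    }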

-- 
Best regards
Tianyu Lan

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] Xen virtual IOMMU high level design doc V3

2016-11-21 Thread Tian, Kevin
> From: Andrew Cooper [mailto:andrew.coop...@citrix.com]
> Sent: Monday, November 21, 2016 9:41 PM
> 
> On 17/11/16 15:36, Lan Tianyu wrote:
> > 3.2 l2 translation
> > 1) For virtual PCI device
> > Xen dummy xen-vIOMMU in Qemu translates IOVA to target GPA via new
> > hypercall when DMA operation happens.
> >
> > When guest triggers a invalidation operation, there maybe in-fly DMA
> > request for virtual device has been translated by vIOMMU and return back
> > Qemu. Before vIOMMU tells invalidation completed, it's necessary to make
> > sure in-fly DMA operation is completed.
> >
> > When IOMMU driver invalidates IOTLB, it also will wait until the
> > invalidation completion. We may use this to drain in-fly DMA operation
> > for virtual device.
> >
> > Guest triggers invalidation operation and trip into vIOMMU in
> > hypervisor to flush cache data. After this, it should go to Qemu to
> > drain in-fly DMA translation.
> >
> > To do that, dummy vIOMMU in Qemu registers the same MMIO region as
> > vIOMMU's and emulation part of invalidation operation in Xen hypervisor
> > returns X86EMUL_UNHANDLEABLE after flush cache. MMIO emulation part is
> > supposed to send event to Qemu and dummy vIOMMU get a chance to starts a
> > thread to drain in-fly DMA and return emulation done.
> >
> > Guest polls IVT(invalidate IOTLB) bit in the IOTLB invalidate register
> > until it's cleared after triggering invalidation. Dummy vIOMMU in Qemu
> > notifies hypervisor drain operation completed via hypercall, vIOMMU
> > clears IVT bit and guest finish invalidation operation.
> 
> Having the guest poll will be very inefficient.  If the invalidation
> does need to reach qemu, it will be a very long time until it
> completes.  Is there no interrupt based mechanism which can be used?
> That way the guest can either handle it asynchronous itself, or block
> waiting on an interrupt, both of which are better than having it just
> spinning.
> 

The VT-d spec supports both poll and interrupt modes, and the choice is made
by the guest IOMMU driver. That's not to say this design requires the guest
to use poll mode; I guess Tianyu used it as an example flow.

Thanks
Kevin
___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] Xen virtual IOMMU high level design doc V3

2016-11-21 Thread Stefano Stabellini
On Mon, 21 Nov 2016, Julien Grall wrote:
> On 21/11/2016 02:21, Lan, Tianyu wrote:
> > On 11/19/2016 3:43 AM, Julien Grall wrote:
> > > On 17/11/2016 09:36, Lan Tianyu wrote:
> > Hi Julien:
> 
> Hello Lan,
> 
> > Thanks for your input. This interface is just for virtual PCI device
> > which is called by Qemu. I am not familiar with ARM. Are there any
> > non-PCI emulated devices for arm in Qemu which need to be covered by
> > vIOMMU?
> 
> We don't use QEMU on ARM so far, so I guess it should be ok for now. ARM
> guests are very similar to hvmlite/pvh. I got confused and thought this design
> document was targeting pvh too.
> 
> BTW, in the design document you mention hvmlite/pvh. Does it mean you plan to
> bring support of vIOMMU for those guests later on?

I quickly went through the document. I don't think we should restrict
the design to only one caller: QEMU. In fact it looks like those
hypercalls, without any modifications, could be called from the
toolstack (xl/libxl) in the case of PVH guests. In other words
PVH guests might work without any addition efforts on the hypervisor
side.

And they might even work on ARM. I have a couple of suggestions to
make the hypercalls a bit more "future proof" and architecture agnostic.

Imagine a future where two vIOMMU versions are supported. We could have
a uint32_t iommu_version field to identify what version of IOMMU we are
creating (create_iommu and query_capabilities commands). This could be
useful even on Intel platforms.

Given that in the future we might support a vIOMMU that take ids other
than sbdf as input, I would change "u32 vsbdf" into the following:

  #define XENVIOMMUSPACE_vsbdf  0
  uint16_t space;
  uint64_t id;
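Put together, the suggestion could look something like this (purely
illustrative; names and values are assumptions, not a proposed ABI):

    #define XENVIOMMUSPACE_vsbdf     0   /* PCI segment:bus:device.function */
    #define XENVIOMMUSPACE_streamid  1   /* e.g. an ARM SMMU stream ID */

    struct xen_viommu_devid {
        uint16_t space;    /* which namespace 'id' belongs to */
        uint16_t pad[3];
        uint64_t id;       /* device identifier in that namespace */
    };

    struct xen_viommu_create {
        /* IN parameters. */
        uint32_t iommu_version;   /* which vIOMMU model/version to create */
        uint32_t capabilities;
        uint64_t base_address;
        /* OUT parameters. */
        uint32_t viommu_id;
    };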

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] Xen virtual IOMMU high level design doc V3

2016-11-21 Thread Andrew Cooper
On 17/11/16 15:36, Lan Tianyu wrote:
> 3.2 l2 translation
> 1) For virtual PCI device
> The Xen dummy xen-vIOMMU in Qemu translates IOVA to the target GPA via a new
> hypercall when a DMA operation happens.
>
> When the guest triggers an invalidation operation, there may be in-flight
> DMA requests for virtual devices that have already been translated by the
> vIOMMU and returned to Qemu. Before the vIOMMU reports the invalidation as
> completed, it's necessary to make sure those in-flight DMA operations are
> completed.
>
> When the IOMMU driver invalidates the IOTLB, it also waits for the
> invalidation to complete. We may use this to drain in-flight DMA operations
> for virtual devices.
>
> The guest triggers an invalidation operation and traps into the vIOMMU in
> the hypervisor to flush cached data. After this, it should go to Qemu to
> drain in-flight DMA translations.
>
> To do that, the dummy vIOMMU in Qemu registers the same MMIO region as the
> vIOMMU's, and the emulation part of the invalidation operation in the Xen
> hypervisor returns X86EMUL_UNHANDLEABLE after flushing the cache. The MMIO
> emulation part is supposed to send an event to Qemu, and the dummy vIOMMU
> gets a chance to start a thread to drain in-flight DMA and then report the
> emulation as done.
>
> The guest polls the IVT (invalidate IOTLB) bit in the IOTLB invalidate
> register until it's cleared after triggering the invalidation. The dummy
> vIOMMU in Qemu notifies the hypervisor that the drain operation has
> completed via hypercall, the vIOMMU clears the IVT bit, and the guest
> finishes the invalidation operation.

Having the guest poll will be very inefficient.  If the invalidation
does need to reach qemu, it will be a very long time until it
completes.  Is there no interrupt based mechanism which can be used? 
That way the guest can either handle it asynchronous itself, or block
waiting on an interrupt, both of which are better than having it just
spinning.

~Andrew

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] Xen virtual IOMMU high level design doc V3

2016-11-21 Thread Julien Grall



On 21/11/2016 02:21, Lan, Tianyu wrote:

On 11/19/2016 3:43 AM, Julien Grall wrote:

On 17/11/2016 09:36, Lan Tianyu wrote:

Hi Julien:


Hello Lan,


Thanks for your input. This interface is just for virtual PCI device
which is called by Qemu. I am not familiar with ARM. Are there any
non-PCI emulated devices for arm in Qemu which need to be covered by
vIOMMU?


We don't use QEMU on ARM so far, so I guess it should be ok for now. ARM 
guests are very similar to hvmlite/pvh. I got confused and thought this 
design document was targeting pvh too.


BTW, in the design document you mention hvmlite/pvh. Does it mean you 
plan to bring support of vIOMMU for those guests later on?


Regards,


--
Julien Grall

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] Xen virtual IOMMU high level design doc V3

2016-11-20 Thread Tian, Kevin
> From: Lan, Tianyu
> Sent: Thursday, November 17, 2016 11:37 PM
> 
> Change since V2:
>   1) Update motivation for Xen vIOMMU - 288 vcpus support part
>   Add descriptor about plan of increasing vcpu from 128 to 255 and
> dependency between X2APIC and interrupt remapping.
>   2) Update 3.1 New vIOMMU hypercall interface
>   Change vIOMMU hypercall from sysctl to dmop, add multi vIOMMU
> consideration consideration and drain in-fly DMA subcommand
>   3) Update 3.5 implementation consideration
>   We found it's still safe to enable interrupt remapping function before
> adding l2 translation(DMA translation) to increase vcpu number >255.
>   4) Update 3.2 l2 translation - virtual device part
>   Add proposal to deal with race between in-fly DMA and invalidation
> operation in hypervisor.
>   5) Update 4.4 Report vIOMMU to hvmloader
>   Add option of building ACPI DMAR table in the toolstack for discussion.
> 
> Change since V1:
>   1) Update motivation for Xen vIOMMU - 288 vcpus support part
>   2) Change definition of struct xen_sysctl_viommu_op
>   3) Update "3.5 Implementation consideration" to explain why we needs to
> enable l2 translation first.
>   4) Update "4.3 Q35 vs I440x" - Linux/Windows VTD drivers can work on
> the emulated I440 chipset.
>   5) Remove stale statement in the "3.3 Interrupt remapping"
> 
> Content:
> =
> ==
> 1. Motivation of vIOMMU
>   1.1 Enable more than 255 vcpus
>   1.2 Support VFIO-based user space driver
>   1.3 Support guest Shared Virtual Memory (SVM)
> 2. Xen vIOMMU Architecture
>   2.1 l2 translation overview

L2/L1 might be more readable than l2/l1. :-)

>   2.2 Interrupt remapping overview

to be complete, need an overview of l1 translation here

> 3. Xen hypervisor
>   3.1 New vIOMMU hypercall interface
>   3.2 l2 translation
>   3.3 Interrupt remapping
>   3.4 l1 translation
>   3.5 Implementation consideration
> 4. Qemu
>   4.1 Qemu vIOMMU framework
>   4.2 Dummy xen-vIOMMU driver
>   4.3 Q35 vs. i440x
>   4.4 Report vIOMMU to hvmloader
> 
> 
> Glossary:
> =
> ===
> l1 translation - first-level translation to remap a virtual address to
> intermediate (guest) physical address. (GVA->GPA)
> l2 translation - second-level translations to remap a intermediate
> physical address to machine (host) physical address. (GPA->HPA)

If a glossary section is required, please make it complete (interrupt remapping,
DMAR, etc.)

Also please stick to what spec says. I don't think 'intermediate' physical
address is a widely-used term, and GVA->GPA/GPA->HPA are only partial
usages of those structures. You may make them an example, but be
careful with the definition.

> 
> 1 Motivation for Xen vIOMMU
> =
> ===
> 1.1 Enable more than 255 vcpu support

vcpu->vcpus

> HPC cloud service requires VM provides high performance parallel
> computing and we hope to create a huge VM with >255 vcpu on one machine
> to meet such requirement. Pin each vcpu to separate pcpus.
> 
> Now HVM guest can support 128 vcpus at most. We can increase vcpu number
> from 128 to 255 via changing some limitations and extending vcpu related
> data structure. This also needs to change the rule of allocating vcpu's
> APIC ID. Current rule is "(APIC ID) = (vcpu index) * 2". We need to
> change it to "(APIC ID) = (vcpu index)". Andrew Cooper's CPUID
> improvement work will cover this to improve guest's cpu topology. We
> will base on this to increase vcpu number from 128 to 255.
> 
> To support >255 vcpus, X2APIC mode in guest is necessary because legacy
> APIC(XAPIC) just supports 8-bit APIC ID and it only can support 255
> vcpus at most. X2APIC mode supports 32-bit APIC ID and it requires
> interrupt mapping function of vIOMMU.
> 
> The reason for this is that there is no modification to existing PCI MSI
> and IOAPIC with the introduction of X2APIC. PCI MSI/IOAPIC can only send
> interrupt message containing 8-bit APIC ID, which cannot address >255
> cpus. Interrupt remapping supports 32-bit APIC ID and so it's necessary
> to enable >255 cpus with x2apic mode.
> 
> Both Linux and Windows requires interrupt remapping when cpu number is >255.
> 
> 
> 1.2 Support VFIO-based user space driver (e.g. DPDK) in the guest
> It relies on the l2 translation capability (IOVA->GPA) on

GIOVA->GPA to be consistent

> vIOMMU. pIOMMU l2 becomes a shadowing structure of
> vIOMMU to isolate DMA requests initiated by user space driver.
> 

You may give more background on how VFIO manages the user space driver
to make the whole picture clearer, like what you did for the >255 vcpus support.

> 
> 
> 1.3 Support guest SVM (Shared Virtual Memory)
> It relies on the l1 translation table capability (GVA->GPA) on
> 

Re: [Xen-devel] Xen virtual IOMMU high level design doc V3

2016-11-20 Thread Lan, Tianyu



On 11/19/2016 3:43 AM, Julien Grall wrote:

Hi Lan,

On 17/11/2016 09:36, Lan Tianyu wrote:


1) Definition of "struct xen_dmop_viommu_op" as new hypercall parameter.

struct xen_dmop_viommu_op {
u32 cmd;
u32 domid;
u32 viommu_id;
union {
struct {
u32 capabilities;
} query_capabilities;
struct {
/* IN parameters. */
u32 capabilities;
u64 base_address;
struct {
u32 size;
XEN_GUEST_HANDLE_64(uint32) dev_list;
} dev_scope;
/* Out parameters. */
u32 viommu_id;
} create_iommu;
struct {
/* IN parameters. */
u32 vsbdf;


I only gave a quick look through this design document. The new
hypercalls looks arch/device agnostic except this part.

Having a virtual IOMMU on Xen ARM is something we might consider in the
future.

In the case of ARM, a device can either be a PCI device or integrated
device. The latter does not have a sbdf. The IOMMU will usually be
configured with a stream ID (SID) that can be deduced from the sbdf and
hardcoded for integrated device.

So I would rather not tie the interface to PCI and use a more generic
name for this field. Maybe vdevid, which then can be architecture specific.


Hi Julien:
	Thanks for your input. This interface is just for virtual PCI devices
and is called by Qemu. I am not familiar with ARM. Are there any non-PCI
emulated devices for ARM in Qemu which need to be covered by the vIOMMU?


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] Xen virtual IOMMU high level design doc V3

2016-11-18 Thread Julien Grall

Hi Lan,

On 17/11/2016 09:36, Lan Tianyu wrote:


1) Definition of "struct xen_dmop_viommu_op" as new hypercall parameter.

struct xen_dmop_viommu_op {
u32 cmd;
u32 domid;
u32 viommu_id;
union {
struct {
u32 capabilities;
} query_capabilities;
struct {
/* IN parameters. */
u32 capabilities;
u64 base_address;
struct {
u32 size;
XEN_GUEST_HANDLE_64(uint32) dev_list;
} dev_scope;
/* Out parameters. */
u32 viommu_id;
} create_iommu;
struct {
/* IN parameters. */
u32 vsbdf;


I only gave a quick look through this design document. The new 
hypercalls looks arch/device agnostic except this part.


Having a virtual IOMMU on Xen ARM is something we might consider in the 
future.


In the case of ARM, a device can either be a PCI device or integrated 
device. The latter does not have a sbdf. The IOMMU will usually be 
configured with a stream ID (SID) that can be deduced from the sbdf and 
hardcoded for integrated device.


So I would rather not tie the interface to PCI and use a more generic 
name for this field. Maybe vdevid, which then can be architecture specific.


Regards,

--
Julien Grall

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


[Xen-devel] Xen virtual IOMMU high level design doc V3

2016-11-17 Thread Lan Tianyu

Change since V2:
1) Update motivation for Xen vIOMMU - 288 vcpus support part
Add a description of the plan to increase vcpus from 128 to 255 and the
dependency between X2APIC and interrupt remapping.
2) Update 3.1 New vIOMMU hypercall interface
Change the vIOMMU hypercall from sysctl to dmop, add multi-vIOMMU
consideration and a drain in-flight DMA subcommand.
3) Update 3.5 implementation consideration
	We found it's still safe to enable the interrupt remapping function
before adding l2 translation (DMA translation) to increase the vcpu number
beyond 255.

4) Update 3.2 l2 translation - virtual device part
	Add a proposal to deal with the race between in-flight DMA and the
invalidation operation in the hypervisor.

5) Update 4.4 Report vIOMMU to hvmloader
Add the option of building the ACPI DMAR table in the toolstack for discussion.

Change since V1:
1) Update motivation for Xen vIOMMU - 288 vcpus support part
2) Change definition of struct xen_sysctl_viommu_op
3) Update "3.5 Implementation consideration" to explain why we need to
enable l2 translation first.
4) Update "4.3 Q35 vs I440x" - Linux/Windows VTD drivers can work on
the emulated I440 chipset.
5) Remove stale statement in the "3.3 Interrupt remapping"

Content:
===
1. Motivation of vIOMMU
1.1 Enable more than 255 vcpus
1.2 Support VFIO-based user space driver
1.3 Support guest Shared Virtual Memory (SVM)
2. Xen vIOMMU Architecture
2.1 l2 translation overview
2.2 Interrupt remapping overview
3. Xen hypervisor
3.1 New vIOMMU hypercall interface
3.2 l2 translation
3.3 Interrupt remapping
3.4 l1 translation
3.5 Implementation consideration
4. Qemu
4.1 Qemu vIOMMU framework
4.2 Dummy xen-vIOMMU driver
4.3 Q35 vs. i440x
4.4 Report vIOMMU to hvmloader


Glossary:

l1 translation - first-level translation to remap a virtual address to an
intermediate (guest) physical address. (GVA->GPA)
l2 translation - second-level translation to remap an intermediate
physical address to a machine (host) physical address. (GPA->HPA)

1 Motivation for Xen vIOMMU

1.1 Enable more than 255 vcpu support
HPC cloud service requires that a VM provide high performance parallel
computing, and we hope to create a huge VM with >255 vcpus on one machine
to meet such a requirement, pinning each vcpu to a separate pcpu.

Now HVM guest can support 128 vcpus at most. We can increase vcpu number
from 128 to 255 via changing some limitations and extending vcpu related
data structure. This also needs to change the rule of allocating vcpu's
APIC ID. Current rule is "(APIC ID) = (vcpu index) * 2". We need to
change it to "(APIC ID) = (vcpu index)". Andrew Cooper's CPUID
improvement work will cover this to improve guest's cpu topology. We
will base on this to increase vcpu number from 128 to 255.
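For illustration, the two allocation rules and their consequence for the
8-bit xAPIC ID:

    static inline uint32_t apic_id_current(uint32_t vcpu_index)
    {
        return vcpu_index * 2;    /* current rule: IDs 0, 2, 4, ... */
    }

    static inline uint32_t apic_id_proposed(uint32_t vcpu_index)
    {
        return vcpu_index;        /* proposed rule: IDs 0, 1, 2, ... */
    }

With the current rule, vcpu 128 would already need APIC ID 256, which does
not fit into the 8-bit xAPIC ID; with the proposed rule, up to 255 vcpus
fit, and going beyond that needs x2APIC plus interrupt remapping, as
described below.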

To support >255 vcpus, X2APIC mode in the guest is necessary because the
legacy APIC (xAPIC) supports only 8-bit APIC IDs and so can support at most
255 vcpus. X2APIC mode supports 32-bit APIC IDs and requires the interrupt
remapping function of the vIOMMU.

The reason for this is that there is no modification to the existing PCI MSI
and IOAPIC with the introduction of X2APIC. PCI MSI/IOAPIC can only send
interrupt messages containing an 8-bit APIC ID, which cannot address >255
cpus. Interrupt remapping supports 32-bit APIC IDs, so it's necessary for
enabling >255 cpus with x2apic mode.

Both Linux and Windows require interrupt remapping when the cpu number is >255.


1.2 Support VFIO-based user space driver (e.g. DPDK) in the guest
It relies on the l2 translation capability (IOVA->GPA) on
vIOMMU. pIOMMU l2 becomes a shadowing structure of
vIOMMU to isolate DMA requests initiated by user space driver.



1.3 Support guest SVM (Shared Virtual Memory)
It relies on the l1 translation table capability (GVA->GPA) on
vIOMMU. pIOMMU needs to enable both l1 and l2 translation in nested
mode (GVA->GPA->HPA) for passthrough device. IGD passthrough
is the main usage today (to support OpenCL 2.0 SVM feature). In the
future SVM might be used by other I/O devices too.
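A one-line way to picture the nested mode (illustrative pseudo-helpers
only, not actual Xen functions):

    /* SVM/nested translation: GVA -> GPA via the guest's first-level table,
     * then GPA -> HPA via the hypervisor-controlled second-level table. */
    static uint64_t nested_translate(const struct pasid_ctx *ctx, uint64_t gva)
    {
        uint64_t gpa = l1_walk(ctx->l1_table, gva);   /* guest-managed      */
        return l2_walk(ctx->l2_table, gpa);           /* hypervisor-managed */
    }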



2. Xen vIOMMU Architecture


* vIOMMU will be inside Xen hypervisor for following factors
1) Avoid round trips between Qemu and Xen hypervisor
2) Ease of integration with the rest of the hypervisor
3) HVMlite/PVH doesn't use Qemu
* Dummy xen-vIOMMU in Qemu as a wrapper of new hypercall to create
/destroy vIOMMU in hypervisor and deal with virtual PCI device's l2
translation.

2.1 l2 translation overview
For Virtual PCI device, dummy xen-vIOMMU does translation in the
Qemu via new 

Re: [Xen-devel] Xen virtual IOMMU high level design doc V2

2016-11-03 Thread Lan, Tianyu



On 10/26/2016 5:39 PM, Jan Beulich wrote:

On 22.10.16 at 09:32,  wrote:

On 10/21/2016 4:36 AM, Andrew Cooper wrote:

3.5 Implementation consideration
VT-d spec doesn't define a capability bit for the l2 translation.
Architecturally there is no way to tell guest that l2 translation
capability is not available. Linux Intel IOMMU driver thinks l2
translation is always available when VTD exits and fail to be loaded
without l2 translation support even if interrupt remapping and l1
translation are available. So it needs to enable l2 translation first
before other functions.


What then is the purpose of the nested translation support bit in the
extended capability register?


It's to translate output GPA from first level translation(IOVA->GPA)
to HPA.

Detail please see VTD spec - 3.8 Nested Translation
"When Nesting Enable (NESTE) field is 1 in extended-context-entries,
requests-with-PASID translated through first-level translation are also
subjected to nested second-level translation. Such extendedcontext-
entries contain both the pointer to the PASID-table (which contains the
pointer to the firstlevel translation structures), and the pointer to
the second-level translation structures."


I didn't phrase my question very well.  I understand what the nested
translation bit means, but I don't understand why we have a problem
signalling the presence or lack of nested translations to the guest.

In other words, why can't we hide l2 translation from the guest by
simply clearing the nested translation capability?


You mean to tell no support of l2 translation via nest translation bit?
But the nested translation is a different function with l2 translation
even from guest view and nested translation only works requests with
PASID (l1 translation).

Linux intel iommu driver enables l2 translation unconditionally and free
iommu instance when failed to enable l2 translation.


In which cases the wording of your description is confusing: Instead of
"Linux Intel IOMMU driver thinks l2 translation is always available when
VTD exits and fail to be loaded without l2 translation support ..." how
about using something closer to what you've replied with last?

Jan



Hi All:
I have some updates about the implementation dependency between l2
translation (DMA translation) and irq remapping.

I find there are a kernel parameter ("intel_iommu=on") and a kconfig option
(CONFIG_INTEL_IOMMU_DEFAULT_ON) which control the DMA translation function.
When they aren't set, the DMA translation function will not be enabled by
the IOMMU driver even if some vIOMMU registers show the L2 translation
function as available. In the meantime, the irq remapping function can
still work to support >255 vcpus.

I checked that the RHEL, SLES, Oracle and Ubuntu distributions don't set the
kernel parameter or select the kconfig option. So we can emulate irq
remapping first, with some capability bits (e.g. SAGAW of the Capability
Register) of l2 translation, for >255 vcpus support without l2 translation
emulation.

Showing the l2 capability bits is to make sure the IOMMU driver probes the
ACPI DMAR tables successfully, because the IOMMU driver accesses these bits
while reading the ACPI tables.

If someone adds the "intel_iommu=on" kernel parameter manually, the IOMMU
driver will panic the guest because it can't enable the DMA remapping
function via the gcmd register: the "Translation Enable Status" bit in the
gsts register is never set by the vIOMMU. This shows the actual vIOMMU
status of no l2 translation emulation and warns the user not to enable l2
translation.
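The failure mode can be pictured with a simplified sketch of what the guest
driver does (modelled on the Linux flow; the register bit definitions
follow the VT-d spec, everything else is illustrative):

    #include <stdint.h>

    #define DMA_GCMD_TE   (1U << 31)  /* Translation Enable (Global Command) */
    #define DMA_GSTS_TES  (1U << 31)  /* Translation Enable Status (Global Status) */

    static int enable_dma_translation(volatile uint32_t *gcmd_reg,
                                      volatile uint32_t *gsts_reg,
                                      uint32_t gcmd_shadow,
                                      unsigned long timeout)
    {
        *gcmd_reg = gcmd_shadow | DMA_GCMD_TE;   /* request DMA remapping */

        while ( timeout-- )
            if ( *gsts_reg & DMA_GSTS_TES )
                return 0;                        /* translation enabled */

        return -1;  /* never set by a vIOMMU without l2 emulation */
    }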




___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] Xen virtual IOMMU high level design doc V2

2016-10-28 Thread Lan Tianyu

On 2016年10月21日 04:36, Andrew Cooper wrote:

>>

>>> u64 iova;
>>> /* Out parameters. */
>>> u64 translated_addr;
>>> u64 addr_mask; /* Translation page size */
>>> IOMMUAccessFlags permisson;

>>
>> How is this translation intended to be used?  How do you plan to avoid
>> race conditions where qemu requests a translation, receives one, the
>> guest invalidated the mapping, and then qemu tries to use its translated
>> address?
>>
>> There are only two ways I can see of doing this race-free.  One is to
>> implement a "memcpy with translation" hypercall, and the other is to
>> require the use of ATS in the vIOMMU, where the guest OS is required to
>> wait for a positive response from the vIOMMU before it can safely reuse
>> the mapping.
>>
>> The former behaves like real hardware in that an intermediate entity
>> performs the translation without interacting with the DMA source.  The
>> latter explicitly exposing the fact that caching is going on at the
>> endpoint to the OS.

>
> The former one seems to move DMA operation into hypervisor but Qemu
> vIOMMU framework just passes IOVA to dummy xen-vIOMMU without input
> data and access length. I will dig more to figure out solution.

Yes - that does in principle actually move the DMA out of Qemu.


Hi Andrew:

The first solution, "Move the DMA out of Qemu": the Qemu vIOMMU framework
just gives the dummy xen-vIOMMU device model a chance to do the DMA
translation, while the DMA access operation itself is in the vIOMMU core
code. It's hard to move this out. There are a lot of places that call the
translation callback, and some of these are not for DMA access (e.g.
mapping guest memory in Qemu).

The second solution, "Use ATS to sync the invalidation operation": this
requires enabling ATS for all virtual PCI devices. This is not easy to do.

The following is my proposal:
When the IOMMU driver invalidates the IOTLB, it also waits for the
invalidation to complete. We may use this to drain in-flight DMA operations.

The guest triggers an invalidation operation and traps into the vIOMMU in
the hypervisor to flush cached data. After this, it should go to Qemu to
drain in-flight DMA translations.

To do that, the dummy vIOMMU in Qemu registers the same MMIO region as the
vIOMMU's, and the emulation part of the invalidation operation returns
X86EMUL_UNHANDLEABLE after flushing the cache. The MMIO emulation part is
supposed to send an event to Qemu, and the dummy vIOMMU gets a chance to
start a thread to drain in-flight DMA and then report the emulation as done.

The guest polls the IVT (invalidate IOTLB) bit in the IOTLB invalidate
register until it's cleared. The dummy vIOMMU notifies the vIOMMU that the
drain operation has completed via hypercall, the vIOMMU clears the IVT bit,
and the guest finishes the invalidation operation.
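On the Qemu side this could be as simple as the following sketch (all names
are hypothetical, including the drain-completion hypercall wrapper):

    #include <pthread.h>

    static void *viommu_drain_thread(void *opaque)
    {
        XenVIOMMUState *s = opaque;               /* hypothetical state type */

        drain_inflight_dma(s);                    /* wait for outstanding vDevice DMA */
        xen_viommu_drain_done(s->viommu_id);      /* proposed hypercall: clear IVT */
        return NULL;
    }

    static void viommu_handle_invalidation_ioreq(XenVIOMMUState *s)
    {
        pthread_t tid;

        /* Don't block the ioreq thread while draining. */
        pthread_create(&tid, NULL, viommu_drain_thread, s);
        pthread_detach(&tid);
    }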

--
Best regards
Tianyu Lan



Re: [Xen-devel] Xen virtual IOMMU high level design doc V2

2016-10-26 Thread Lan, Tianyu



On 10/26/2016 5:39 PM, Jan Beulich wrote:

On 22.10.16 at 09:32,  wrote:

On 10/21/2016 4:36 AM, Andrew Cooper wrote:

3.5 Implementation consideration
VT-d spec doesn't define a capability bit for the l2 translation.
Architecturally there is no way to tell guest that l2 translation
capability is not available. Linux Intel IOMMU driver thinks l2
translation is always available when VTD exits and fail to be loaded
without l2 translation support even if interrupt remapping and l1
translation are available. So it needs to enable l2 translation first
before other functions.


What then is the purpose of the nested translation support bit in the
extended capability register?


It's to translate output GPA from first level translation(IOVA->GPA)
to HPA.

Detail please see VTD spec - 3.8 Nested Translation
"When Nesting Enable (NESTE) field is 1 in extended-context-entries,
requests-with-PASID translated through first-level translation are also
subjected to nested second-level translation. Such extended-context-
entries contain both the pointer to the PASID-table (which contains the
pointer to the first-level translation structures), and the pointer to
the second-level translation structures."


I didn't phrase my question very well.  I understand what the nested
translation bit means, but I don't understand why we have a problem
signalling the presence or lack of nested translations to the guest.

In other words, why can't we hide l2 translation from the guest by
simply clearing the nested translation capability?


Do you mean indicating that l2 translation is not supported via the nested
translation bit? But nested translation is a different function from l2
translation, even from the guest's point of view, and nested translation
only works for requests with PASID (l1 translation).

The Linux Intel IOMMU driver enables l2 translation unconditionally and
frees the IOMMU instance when it fails to enable l2 translation.


In which cases the wording of your description is confusing: Instead of
"Linux Intel IOMMU driver thinks l2 translation is always available when
VTD exits and fail to be loaded without l2 translation support ..." how
about using something closer to what you've replied with last?



Sorry for my poor English. Will update.



Re: [Xen-devel] Xen virtual IOMMU high level design doc V2

2016-10-26 Thread Lan, Tianyu

On 10/26/2016 5:36 PM, Jan Beulich wrote:

On 18.10.16 at 16:14,  wrote:

1.1 Enable more than 255 vcpu support
HPC cloud service requires VM provides high performance parallel
computing and we hope to create a huge VM with >255 vcpu on one machine
to meet such requirement.Ping each vcpus on separated pcpus. More than
255 vcpus support requires X2APIC and Linux disables X2APIC mode if
there is no interrupt remapping function which is present by vIOMMU.
Interrupt remapping function helps to deliver interrupt to #vcpu >255.
So we need to add vIOMMU before enabling >255 vcpus.


I continue to dislike this completely neglecting that we can't even
have >128 vCPU-s at present. Once again - there's other work to
be done prior to lack of vIOMMU becoming the limiting factor.



Yes, we can increase the vcpu count from 128 to 255 first without vIOMMU
support. We have some draft patches to enable this. Andrew will also rework
the CPUID policy and change the rule for allocating vcpu APIC IDs, so we
will base the vcpu count increase on that work. The vLAPIC also needs to be
changed to support >255 APIC IDs. These jobs can be done in parallel with
the vIOMMU work.



Re: [Xen-devel] Xen virtual IOMMU high level design doc V2

2016-10-26 Thread Jan Beulich
>>> On 22.10.16 at 09:32,  wrote:
> On 10/21/2016 4:36 AM, Andrew Cooper wrote:
> 3.5 Implementation consideration
> VT-d spec doesn't define a capability bit for the l2 translation.
> Architecturally there is no way to tell guest that l2 translation
> capability is not available. Linux Intel IOMMU driver thinks l2
> translation is always available when VTD exits and fail to be loaded
> without l2 translation support even if interrupt remapping and l1
> translation are available. So it needs to enable l2 translation first
> before other functions.

 What then is the purpose of the nested translation support bit in the
 extended capability register?
>>>
>>> It's to translate output GPA from first level translation(IOVA->GPA)
>>> to HPA.
>>>
>>> Detail please see VTD spec - 3.8 Nested Translation
>>> "When Nesting Enable (NESTE) field is 1 in extended-context-entries,
>>> requests-with-PASID translated through first-level translation are also
>>> subjected to nested second-level translation. Such extended-context-
>>> entries contain both the pointer to the PASID-table (which contains the
>>> pointer to the first-level translation structures), and the pointer to
>>> the second-level translation structures."
>>
>> I didn't phrase my question very well.  I understand what the nested
>> translation bit means, but I don't understand why we have a problem
>> signalling the presence or lack of nested translations to the guest.
>>
>> In other words, why can't we hide l2 translation from the guest by
>> simply clearing the nested translation capability?
> 
> You mean to tell no support of l2 translation via nest translation bit?
> But the nested translation is a different function with l2 translation
> even from guest view and nested translation only works requests with
> PASID (l1 translation).
> 
> Linux intel iommu driver enables l2 translation unconditionally and free 
> iommu instance when failed to enable l2 translation.

In which cases the wording of your description is confusing: Instead of
"Linux Intel IOMMU driver thinks l2 translation is always available when
VTD exits and fail to be loaded without l2 translation support ..." how
about using something closer to what you've replied with last?

Jan



Re: [Xen-devel] Xen virtual IOMMU high level design doc V2

2016-10-26 Thread Jan Beulich
>>> On 18.10.16 at 16:14,  wrote:
> 1.1 Enable more than 255 vcpu support
> HPC cloud service requires VM provides high performance parallel
> computing and we hope to create a huge VM with >255 vcpu on one machine
> to meet such requirement.Ping each vcpus on separated pcpus. More than
> 255 vcpus support requires X2APIC and Linux disables X2APIC mode if
> there is no interrupt remapping function which is present by vIOMMU.
> Interrupt remapping function helps to deliver interrupt to #vcpu >255.
> So we need to add vIOMMU before enabling >255 vcpus.

I continue to dislike this completely neglecting that we can't even
have >128 vCPU-s at present. Once again - there's other work to
be done prior to lack of vIOMMU becoming the limiting factor.

Jan



Re: [Xen-devel] Xen virtual IOMMU high level design doc V2

2016-10-22 Thread Lan, Tianyu

On 10/21/2016 4:36 AM, Andrew Cooper wrote:







255 vcpus support requires X2APIC and Linux disables X2APIC mode if
there is no interrupt remapping function which is present by vIOMMU.
Interrupt remapping function helps to deliver interrupt to #vcpu >255.


This is only a requirement for xapic interrupt sources.  x2apic
interrupt sources already deliver correctly.


The key is the APIC ID. There is no modification to existing PCI MSI and
IOAPIC with the introduction of x2apic. PCI MSI/IOAPIC can only send
interrupt message containing 8bit APIC ID, which cannot address >255
cpus. Interrupt remapping supports 32bit APIC ID so it's necessary to
enable >255 cpus with x2apic mode.

If LAPIC is in x2apic while interrupt remapping is disabled, IOAPIC
cannot deliver interrupts to all cpus in the system if #cpu > 255.


After spending a long time reading up on this, my first observation is
that it is very difficult to find consistent information concerning the
expected content of MSI address/data fields for x86 hardware.  Having
said that, this has been very educational.

It is now clear that any MSI message can either specify an 8 bit APIC ID
directly, or request for the message to be remapped.  Apologies for my
earlier confusion.


Never mind, I will describe this in more detail in the following version.





3 Xen hypervisor
==


3.1 New hypercall XEN_SYSCTL_viommu_op
This hypercall should also support pv IOMMU which is still under RFC
review. Here only covers non-pv part.

1) Definition of "struct xen_sysctl_viommu_op" as new hypercall
parameter.


Why did you choose sysctl?  As these are per-domain, domctl would be a
more logical choice.  However, neither of these should be usable by
Qemu, and we are trying to split out "normal qemu operations" into dmops
which can be safely deprivileged.



Do you know the status of dmop now? I just found some discussions about
its design on the mailing list. May we use domctl first and move to dmop
when it's ready?


I believe Paul is looking into respin the series early in the 4.9 dev
cycle.  I expect it won't take long until they are submitted.


Ok. I got it. Thanks for the information.








Definition of VIOMMU subops:
#define XEN_SYSCTL_viommu_query_capability          0
#define XEN_SYSCTL_viommu_create                    1
#define XEN_SYSCTL_viommu_destroy                   2
#define XEN_SYSCTL_viommu_dma_translation_for_vpdev 3

Definition of VIOMMU capabilities:
#define XEN_VIOMMU_CAPABILITY_l1_translation        (1 << 0)
#define XEN_VIOMMU_CAPABILITY_l2_translation        (1 << 1)
#define XEN_VIOMMU_CAPABILITY_interrupt_remapping   (1 << 2)


How are vIOMMUs going to be modelled to guests?  On real hardware, they
all seem to end associated with a PCI device of some sort, even if it is
just the LPC bridge.



This design just considers one vIOMMU covering all PCI devices under its
specified PCI segment. The "INCLUDE_PCI_ALL" bit of the DRHD structure is
set for the vIOMMU.


Even if the first implementation only supports a single vIOMMU, please
design the interface to cope with multiple.  It will save someone having
to go and break the API/ABI in the future when support for multiple
vIOMMUs is needed.


OK, got it.







How do we deal with multiple vIOMMUs in a single guest?


For multiple vIOMMUs, we need to add a new field in struct iommu_op to
designate the device scope of each vIOMMU if they are under the same PCI
segment. This also requires changes to the DMAR table.






2) Design for subops
- XEN_SYSCTL_viommu_query_capability
  Get vIOMMU capabilities (l1/l2 translation and interrupt remapping).

- XEN_SYSCTL_viommu_create
  Create vIOMMU in Xen hypervisor with dom_id, capabilities and reg
base address.

- XEN_SYSCTL_viommu_destroy
  Destroy vIOMMU in Xen hypervisor with dom_id as parameter.

- XEN_SYSCTL_viommu_dma_translation_for_vpdev
  Translate IOVA to GPA for a specified virtual PCI device, given dom_id,
the PCI device's BDF and the IOVA; the Xen hypervisor returns the
translated GPA, address mask and access permission (see the sketch below).
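
Purely as an illustration of how the dummy xen-vIOMMU in Qemu might consume
this subop from its translate callback: the hypercall wrapper
xen_viommu_dma_translate() and vdev_sbdf() below are invented placeholder
names, and the callback shape only loosely follows Qemu's IOMMU translate
hook of that era (IOMMUTLBEntry and friends come from Qemu's
include/exec/memory.h).

static IOMMUTLBEntry xen_viommu_translate(MemoryRegion *mr, hwaddr addr,
                                          bool is_write)
{
    IOMMUTLBEntry entry = {
        .target_as = &address_space_memory,
        .iova      = addr,
        .perm      = IOMMU_NONE,
    };
    uint64_t gpa, mask;
    uint32_t perm;

    /* Ask the Xen vIOMMU to translate IOVA -> GPA for this device's BDF
     * via the proposed XEN_SYSCTL_viommu_dma_translation_for_vpdev op. */
    if (xen_viommu_dma_translate(xen_domid, vdev_sbdf(mr), addr,
                                 &gpa, &mask, &perm) == 0) {
        entry.translated_addr = gpa;
        entry.addr_mask       = mask;
        entry.perm            = perm;   /* IOMMUAccessFlags from Xen */
    }
    return entry;
}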


3.2 l2 translation
1) For virtual PCI device
Xen dummy xen-vIOMMU in Qemu translates IOVA to target GPA via new
hypercall when DMA operation happens.

2) For physical PCI device
DMA operations go though physical IOMMU directly and IO page table for
IOVA->HPA should be loaded into physical IOMMU. When guest updates
l2 Page-table pointer field, it provides IO page table for
IOVA->GPA. vIOMMU needs to shadow l2 translation table, translate
GPA->HPA and update shadow page table(IOVA->HPA) pointer to l2
Page-table pointer to context entry of physical IOMMU.


How are you proposing to do this shadowing?  Do we need to trap and
emulate all writes to the vIOMMU pagetables, or is there a better way to
know when the mappings need invalidating?


No, we don't need to trap all writes to the IO page table.
From VTD spec 6.1, "Reporting the Caching Mode as Set for the
virtual hardware requires the guest software to explicitly issue

Re: [Xen-devel] Xen virtual IOMMU high level design doc V2

2016-10-20 Thread Andrew Cooper

>
>>
>>> 255 vcpus support requires X2APIC and Linux disables X2APIC mode if
>>> there is no interrupt remapping function which is present by vIOMMU.
>>> Interrupt remapping function helps to deliver interrupt to #vcpu >255.
>>
>> This is only a requirement for xapic interrupt sources.  x2apic
>> interrupt sources already deliver correctly.
>
> The key is the APIC ID. There is no modification to existing PCI MSI and
> IOAPIC with the introduction of x2apic. PCI MSI/IOAPIC can only send
> interrupt message containing 8bit APIC ID, which cannot address >255
> cpus. Interrupt remapping supports 32bit APIC ID so it's necessary to
> enable >255 cpus with x2apic mode.
>
> If LAPIC is in x2apic while interrupt remapping is disabled, IOAPIC
> cannot deliver interrupts to all cpus in the system if #cpu > 255.

After spending a long time reading up on this, my first observation is
that it is very difficult to find consistent information concerning the
expected content of MSI address/data fields for x86 hardware.  Having
said that, this has been very educational.

It is now clear that any MSI message can either specify an 8 bit APIC ID
directly, or request for the message to be remapped.  Apologies for my
earlier confusion.

>
>>> [l2 translation architecture diagram: a virtual PCI device in Qemu
>>> issues a DMA request to the dummy xen-vIOMMU, which returns the target
>>> GPA used to access the Qemu memory region; the dummy xen-vIOMMU
>>> translates the IOVA via hypercall to the vIOMMU in the hypervisor,
>>> which drives the IOMMU driver and the hardware IOMMU, through which
>>> the physical PCI device's DMA reaches memory.]
>>>
>>> 2.2 Interrupt remapping overview.
>>> Interrupts from virtual devices and physical devices will be delivered
>>> to vLAPIC from vIOAPIC and vMSI. vIOMMU will remap interrupt during
>>> this
>>> procedure.
>>>
>>> [Interrupt remapping overview diagram: interrupts from the Qemu
>>> virtual device and from physical devices reach the guest's IRQ
>>> subsystem and device driver as virtual IRQs delivered through the
>>> vLAPIC, with the vIOMMU remapping them on the way from vIOAPIC/vMSI.]

Re: [Xen-devel] Xen virtual IOMMU high level design doc V2

2016-10-20 Thread Andrew Cooper
On 20/10/16 10:53, Tian, Kevin wrote:
>> From: Andrew Cooper [mailto:andrew.coop...@citrix.com]
>> Sent: Wednesday, October 19, 2016 3:18 AM
>>
>>> 1.2 Support VFIO-based user space driver (e.g. DPDK) in the guest
>>> It relies on the l2 translation capability (IOVA->GPA) on
>>> vIOMMU. pIOMMU l2 becomes a shadowing structure of
>>> vIOMMU to isolate DMA requests initiated by user space driver.
>> How is userspace supposed to drive this interface?  I can't picture how
>> it would function.
> Inside a Linux VM, VFIO provides DMA MAP/UNMAP interface to user space
> driver so gIOVA->GPA mapping can be setup on vIOMMU. vIOMMU will 
> export a "caching mode" capability to indicate all guest PTE changes 
> requiring explicit vIOMMU cache invalidations. Through trapping of those
> invalidation requests, Xen can update corresponding shadow PTEs (gIOVA
> ->HPA). When DMA mapping is established, user space driver programs 
> gIOVA addresses as DMA destination to assigned device, and then upstreaming
> DMA request out of this device contains gIOVA which is translated to HPA
> by pIOMMU shadow page table.

Ok.  So in this mode, the userspace driver owns the device, and can
choose any arbitrary gIOVA layout it chooses?  If it also programs the
DMA addresses, I guess this setup is fine.

>
>>>
>>> 1.3 Support guest SVM (Shared Virtual Memory)
>>> It relies on the l1 translation table capability (GVA->GPA) on
>>> vIOMMU. pIOMMU needs to enable both l1 and l2 translation in nested
>>> mode (GVA->GPA->HPA) for passthrough device. IGD passthrough
>>> is the main usage today (to support OpenCL 2.0 SVM feature). In the
>>> future SVM might be used by other I/O devices too.
>> As an aside, how is IGD intending to support SVM?  Will it be with PCIe
>> ATS/PASID, or something rather more magic as IGD is on the same piece of
>> silicon?
> Although integrated, IGD conforms to standard PCIe PASID convention.

Ok.  Any idea when hardware with SVM will be available?

>
>>> 3.5 Implementation consideration
>>> VT-d spec doesn't define a capability bit for the l2 translation.
>>> Architecturally there is no way to tell guest that l2 translation
>>> capability is not available. Linux Intel IOMMU driver thinks l2
>>> translation is always available when VTD exits and fail to be loaded
>>> without l2 translation support even if interrupt remapping and l1
>>> translation are available. So it needs to enable l2 translation first
>>> before other functions.
>> What then is the purpose of the nested translation support bit in the
>> extended capability register?
>>
> Nested translation is for SVM virtualization. Given a DMA transaction 
> containing a PASID, VT-d engine first finds the 1st translation table 
> through PASID to translate from GVA to GPA, then once nested
> translation capability is enabled, further translate GPA to HPA using the
> 2nd level translation table. Bare-metal usage is not expected to turn
> on this nested bit.

Ok, but what happens if a guest sees a PASSID-capable vIOMMU and itself
tries to turn on nesting?  E.g. nesting KVM inside Xen and trying to use
SVM from the L2 guest?

If there is no way to indicate to the L1 guest that nesting isn't
available (as it is already actually in use), and we can't shadow
entries on faults, what is supposed to happen?

~Andrew



Re: [Xen-devel] Xen virtual IOMMU high level design doc V2

2016-10-20 Thread Lan, Tianyu


On 10/19/2016 4:26 AM, Konrad Rzeszutek Wilk wrote:

On Tue, Oct 18, 2016 at 10:14:16PM +0800, Lan Tianyu wrote:



1 Motivation for Xen vIOMMU
===
1.1 Enable more than 255 vcpu support
HPC cloud service requires VM provides high performance parallel
computing and we hope to create a huge VM with >255 vcpu on one machine
to meet such requirement.Ping each vcpus on separated pcpus. More than
255 vcpus support requires X2APIC and Linux disables X2APIC mode if
there is no interrupt remapping function which is present by vIOMMU.
Interrupt remapping function helps to deliver interrupt to #vcpu >255.
So we need to add vIOMMU before enabling >255 vcpus.


What about Windows? Does it care about this?


From our test, a Win8 guest crashes when booting with 288 vcpus without IR,
and it can boot with IR.



3.2 l2 translation
1) For virtual PCI device
Xen dummy xen-vIOMMU in Qemu translates IOVA to target GPA via new
hypercall when DMA operation happens.

2) For physical PCI device
DMA operations go though physical IOMMU directly and IO page table for
IOVA->HPA should be loaded into physical IOMMU. When guest updates
l2 Page-table pointer field, it provides IO page table for
IOVA->GPA. vIOMMU needs to shadow l2 translation table, translate
GPA->HPA and update shadow page table(IOVA->HPA) pointer to l2
Page-table pointer to context entry of physical IOMMU.

Now all PCI devices in same hvm domain share one IO page table
(GPA->HPA) in physical IOMMU driver of Xen. To support l2
translation of vIOMMU, IOMMU driver need to support multiple address
spaces per device entry. Using existing IO page table(GPA->HPA)
defaultly and switch to shadow IO page table(IOVA->HPA) when l2


defaultly?


I mean the GPA->HPA mapping will be set in the assigned device's context
entry of the pIOMMU when the VM is created, just like the current code works.






3.3 Interrupt remapping
Interrupts from virtual devices and physical devices will be delivered
to vlapic from vIOAPIC and vMSI. It needs to add interrupt remapping
hooks in the vmsi_deliver() and ioapic_deliver() to find target vlapic
according interrupt remapping table.
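
As an illustration of such a hook, a sketch only: all structure and
function names below are simplified stand-ins, with the bit positions
following the VT-d remappable MSI format.

#include <stdint.h>
#include <errno.h>

struct irte_lite { uint32_t dest_id; uint8_t vector; };            /* stand-in IRTE */
struct msi_lite  { uint64_t addr; uint32_t dest; uint8_t vector; }; /* stand-in MSI  */

int viommu_lookup_irte(uint32_t domid, unsigned int idx, struct irte_lite *out);

/* Idea: before vmsi_deliver()/vioapic_deliver() pick the target vlapic,
 * rewrite (dest, vector) from the interrupt remapping table whenever the
 * message is in remappable format. */
int viommu_remap_msi(uint32_t domid, struct msi_lite *msg)
{
    struct irte_lite irte;
    unsigned int idx;

    if ( !(msg->addr & (1u << 4)) )          /* bit 4: remappable format? */
        return 0;                            /* compat format: unchanged  */

    idx  = (msg->addr >> 5) & 0x7fff;        /* interrupt_index[14:0]     */
    idx |= ((msg->addr >> 2) & 1) << 15;     /* interrupt_index[15]       */

    if ( viommu_lookup_irte(domid, idx, &irte) )
        return -EINVAL;                      /* would raise an IR fault   */

    msg->dest   = irte.dest_id;              /* up to 32-bit APIC ID      */
    msg->vector = irte.vector;
    return 0;
}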


3.4 l1 translation
When nested translation is enabled, any address generated by l1
translation is used as the input address for nesting with l2
translation. Physical IOMMU needs to enable both l1 and l2 translation
in nested translation mode(GVA->GPA->HPA) for passthrough
device.

VT-d context entry points to guest l1 translation table which
will be nest-translated by l2 translation table and so it
can be directly linked to context entry of physical IOMMU.


I think this means that the shared_ept will be disabled?


The shared EPT (GPA->HPA mapping) is used to do the nested translation
for any output of the l1 translation (GVA->GPA).






Re: [Xen-devel] Xen virtual IOMMU high level design doc V2

2016-10-20 Thread Lan Tianyu

Hi Andrew:
Thanks for your review.

On 2016年10月19日 03:17, Andrew Cooper wrote:

On 18/10/16 15:14, Lan Tianyu wrote:

Change since V1:
1) Update motivation for Xen vIOMMU - 288 vcpus support part
2) Change definition of struct xen_sysctl_viommu_op
3) Update "3.5 Implementation consideration" to explain why we
needs to enable l2 translation first.
4) Update "4.3 Q35 vs I440x" - Linux/Windows VTD drivers can work
on the emulated I440 chipset.
5) Remove stale statement in the "3.3 Interrupt remapping"

Content:
===

1. Motivation of vIOMMU
1.1 Enable more than 255 vcpus
1.2 Support VFIO-based user space driver
1.3 Support guest Shared Virtual Memory (SVM)
2. Xen vIOMMU Architecture
2.1 l2 translation overview
2.2 Interrupt remapping overview
3. Xen hypervisor
3.1 New vIOMMU hypercall interface
3.2 l2 translation
3.3 Interrupt remapping
3.4 l1 translation
3.5 Implementation consideration
4. Qemu
4.1 Qemu vIOMMU framework
4.2 Dummy xen-vIOMMU driver
4.3 Q35 vs. i440x
4.4 Report vIOMMU to hvmloader


1 Motivation for Xen vIOMMU
===

1.1 Enable more than 255 vcpu support
HPC cloud service requires VM provides high performance parallel
computing and we hope to create a huge VM with >255 vcpu on one machine
to meet such requirement.Ping each vcpus on separated pcpus. More than


Pin ?



Sorry, it's a typo.


Also, grammatically speaking, I think you mean "each vcpu to separate
pcpus".



Yes.




255 vcpus support requires X2APIC and Linux disables X2APIC mode if
there is no interrupt remapping function which is present by vIOMMU.
Interrupt remapping function helps to deliver interrupt to #vcpu >255.


This is only a requirement for xapic interrupt sources.  x2apic
interrupt sources already deliver correctly.


The key is the APIC ID. There is no modification to existing PCI MSI and
IOAPIC with the introduction of x2apic. PCI MSI/IOAPIC can only send
interrupt message containing 8bit APIC ID, which cannot address >255
cpus. Interrupt remapping supports 32bit APIC ID so it's necessary to
enable >255 cpus with x2apic mode.

If LAPIC is in x2apic while interrupt remapping is disabled, IOAPIC
cannot deliver interrupts to all cpus in the system if #cpu > 255.
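
To make the 8-bit limit concrete, a small illustration of the
compatibility (xAPIC) MSI address encoding; the helper is only an example,
with the bit layout as in the Intel SDM:

#include <stdint.h>

/* Compatibility-format MSI address: 0xFEExxxxx with an 8-bit
 * Destination ID in bits 19:12.  APIC IDs above 255 cannot be encoded,
 * which is why >255 vcpus need interrupt remapping -- the remappable
 * format carries an index into the IR table instead, and the IRTE holds
 * a 32-bit destination ID. */
static inline uint32_t msi_addr_compat(uint32_t apic_id)
{
    return 0xFEE00000u                  /* fixed MSI address window  */
         | ((apic_id & 0xFF) << 12);    /* 8-bit Destination ID      */
}

/* Example: msi_addr_compat(288) == msi_addr_compat(32) -- the upper
 * bits of the APIC ID are silently lost without remapping. */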







1.3 Support guest SVM (Shared Virtual Memory)
It relies on the l1 translation table capability (GVA->GPA) on
vIOMMU. pIOMMU needs to enable both l1 and l2 translation in nested
mode (GVA->GPA->HPA) for passthrough device. IGD passthrough
is the main usage today (to support OpenCL 2.0 SVM feature). In the
future SVM might be used by other I/O devices too.


As an aside, how is IGD intending to support SVM?  Will it be with PCIe
ATS/PASID, or something rather more magic as IGD is on the same piece of
silicon?


IGD on Skylake supports PCIe PASID.






2. Xen vIOMMU Architecture



* vIOMMU will be inside Xen hypervisor for following factors
1) Avoid round trips between Qemu and Xen hypervisor
2) Ease of integration with the rest of the hypervisor
3) HVMlite/PVH doesn't use Qemu
* Dummy xen-vIOMMU in Qemu as a wrapper of new hypercall to create
/destory vIOMMU in hypervisor and deal with virtual PCI device's l2
translation.

2.1 l2 translation overview
For Virtual PCI device, dummy xen-vIOMMU does translation in the
Qemu via new hypercall.

For physical PCI device, vIOMMU in hypervisor shadows IO page table from
IOVA->GPA to IOVA->HPA and load page table to physical IOMMU.

The following diagram shows l2 translation architecture.


Which scenario is this?  Is this the passthrough case where the Qemu
Virtual PCI device is a shadow of the real PCI device in hardware?



No, this is for both the traditional virtual PCI device emulated by Qemu
and the passthrough PCI device.



[l2 translation architecture diagram: a virtual PCI device in Qemu issues
a DMA request to the dummy xen-vIOMMU, which returns the target GPA used
to access the Qemu memory region; the dummy xen-vIOMMU translates the IOVA
via hypercall to the vIOMMU in the hypervisor, which drives the IOMMU
driver and the hardware IOMMU, through which the physical PCI device's DMA
reaches memory.]

Re: [Xen-devel] Xen virtual IOMMU high level design doc V2

2016-10-20 Thread Tian, Kevin
> From: Konrad Rzeszutek Wilk [mailto:konrad.w...@oracle.com]
> Sent: Wednesday, October 19, 2016 4:27 AM
> >
> > 2) For physical PCI device
> > DMA operations go though physical IOMMU directly and IO page table for
> > IOVA->HPA should be loaded into physical IOMMU. When guest updates
> > l2 Page-table pointer field, it provides IO page table for
> > IOVA->GPA. vIOMMU needs to shadow l2 translation table, translate
> > GPA->HPA and update shadow page table(IOVA->HPA) pointer to l2
> > Page-table pointer to context entry of physical IOMMU.
> >
> > Now all PCI devices in same hvm domain share one IO page table
> > (GPA->HPA) in physical IOMMU driver of Xen. To support l2
> > translation of vIOMMU, IOMMU driver need to support multiple address
> > spaces per device entry. Using existing IO page table(GPA->HPA)
> > defaultly and switch to shadow IO page table(IOVA->HPA) when l2
> 
> defaultly?
> 
> > translation function is enabled. These change will not affect current
> > P2M logic.
> 
> What happens if the guests IO page tables have incorrect values?
> 
> For example the guest sets up the pagetables to cover some section
> of HPA ranges (which are all good and permitted). But then during execution
> the guest kernel decides to muck around with the pagetables and adds an HPA
> range that is outside what the guest has been allocated.
> 
> What then?

Shadow PTE is controlled by hypervisor. Whatever IOVA->GPA mapping in
guest PTE must be validated (IOVA->GPA->HPA) before updating into the
shadow PTE. So regardless of when guest mucks its PTE, the operation is
always trapped and validated. Why do you think there is a problem?

Also guest only sees GPA. All it can operate is GPA ranges.
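
For illustration, that validation step could be sketched like this;
get_gfn_host_mapping() and shadow_iopt_set_entry() are placeholder names
for the real p2m lookup and shadow IO page-table update:

#include <stdint.h>
#include <errno.h>

#define PAGE_SHIFT      12
#define IOPTE_PERM_MASK 0x3ULL     /* illustrative read/write permission bits */

/* Placeholder prototypes standing in for the real p2m lookup and the
 * shadow IO page-table update (names are hypothetical). */
int get_gfn_host_mapping(uint32_t domid, uint64_t gfn, uint64_t *mfn);
int shadow_iopt_set_entry(uint32_t domid, uint64_t iova, uint64_t pte);

/* Validate one guest l2 IO PTE (IOVA -> GPA) and install the shadow
 * entry (IOVA -> HPA). */
int shadow_io_pte(uint32_t domid, uint64_t iova, uint64_t guest_pte)
{
    uint64_t gfn = guest_pte >> PAGE_SHIFT;
    uint64_t mfn;

    /* The guest PTE can only name GPAs; anything not mapped to this
     * domain is rejected, so a mucked-up PTE never reaches foreign
     * memory. */
    if ( get_gfn_host_mapping(domid, gfn, &mfn) )
        return -EPERM;

    /* Install IOVA -> HPA, preserving the guest's permission bits. */
    return shadow_iopt_set_entry(domid, iova,
                                 (mfn << PAGE_SHIFT) |
                                 (guest_pte & IOPTE_PERM_MASK));
}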

> >
> > 3.3 Interrupt remapping
> > Interrupts from virtual devices and physical devices will be delivered
> > to vlapic from vIOAPIC and vMSI. It needs to add interrupt remapping
> > hooks in the vmsi_deliver() and ioapic_deliver() to find target vlapic
> > according interrupt remapping table.
> >
> >
> > 3.4 l1 translation
> > When nested translation is enabled, any address generated by l1
> > translation is used as the input address for nesting with l2
> > translation. Physical IOMMU needs to enable both l1 and l2 translation
> > in nested translation mode(GVA->GPA->HPA) for passthrough
> > device.
> >
> > VT-d context entry points to guest l1 translation table which
> > will be nest-translated by l2 translation table and so it
> > can be directly linked to context entry of physical IOMMU.
> 
> I think this means that the shared_ept will be disabled?
> >
> What about different versions of contexts? Say the V1 is exposed
> to guest but the hardware supports V2? Are there any flags that have
> swapped positions? Or is it pretty backwards compatible?

yes, backward compatible.

> >
> >
> > 3.5 Implementation consideration
> > VT-d spec doesn't define a capability bit for the l2 translation.
> > Architecturally there is no way to tell guest that l2 translation
> > capability is not available. Linux Intel IOMMU driver thinks l2
> > translation is always available when VTD exits and fail to be loaded
> > without l2 translation support even if interrupt remapping and l1
> > translation are available. So it needs to enable l2 translation first
> 
> I am lost on that sentence. Are you saying that it tries to load
> the IOVA and if they fail.. then it keeps on going? What is the result
> of this? That you can't do IOVA (so can't use vfio ?)

It's about VT-d capability. VT-d supports both 1st-level and 2nd-level 
translation, however only the 1st-level translation can be optionally
reported through a capability bit. There is no capability bit to say
a version doesn't support 2nd-level translation. The implication is
that, as long as a vIOMMU is exposed, guest IOMMU driver always
assumes IOVA capability available thru 2nd level translation. 

So we can first emulate a vIOMMU w/ only 2nd-level capability, and
then extend it to support 1st-level and interrupt remapping, but cannot 
do the reverse direction. I think Tianyu's point is more to describe 
enabling sequence based on this fact. :-)

> > 4.1 Qemu vIOMMU framework
> > Qemu has a framework to create virtual IOMMU(e.g. virtual intel VTD and
> > AMD IOMMU) and report in guest ACPI table. So for Xen side, a dummy
> > xen-vIOMMU wrapper is required to connect with actual vIOMMU in Xen.
> > Especially for l2 translation of virtual PCI device because
> > emulations of virtual PCI devices are in the Qemu. Qemu's vIOMMU
> > framework provides callback to deal with l2 translation when
> > DMA operations of virtual PCI devices happen.
> 
> You say AMD and Intel. This sounds quite OS agnostic. Does it mean you
> could expose an vIOMMU to a guest and actually use the AMD IOMMU
> in the hypervisor?

Did you mean "expose an Intel vIOMMU to guest and then use physical
AMD IOMMU in hypervisor"? I didn't think about this, but what's the value
of doing so? :-)
 
Thanks
Kevin


Re: [Xen-devel] Xen virtual IOMMU high level design doc V2

2016-10-20 Thread Tian, Kevin
> From: Andrew Cooper [mailto:andrew.coop...@citrix.com]
> Sent: Wednesday, October 19, 2016 3:18 AM
> 
> >
> > 1.2 Support VFIO-based user space driver (e.g. DPDK) in the guest
> > It relies on the l2 translation capability (IOVA->GPA) on
> > vIOMMU. pIOMMU l2 becomes a shadowing structure of
> > vIOMMU to isolate DMA requests initiated by user space driver.
> 
> How is userspace supposed to drive this interface?  I can't picture how
> it would function.

Inside a Linux VM, VFIO provides DMA MAP/UNMAP interface to user space
driver so gIOVA->GPA mapping can be setup on vIOMMU. vIOMMU will 
export a "caching mode" capability to indicate all guest PTE changes 
requiring explicit vIOMMU cache invalidations. Through trapping of those
invalidation requests, Xen can update corresponding shadow PTEs (gIOVA
->HPA). When DMA mapping is established, user space driver programs 
gIOVA addresses as DMA destination to assigned device, and then upstreaming
DMA request out of this device contains gIOVA which is translated to HPA
by pIOMMU shadow page table.
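
For reference, the guest-side mapping step uses the standard Linux VFIO
type1 interface, roughly as below (container/group setup omitted); what
Xen actually traps is the vIOMMU invalidation that this triggers, not the
ioctl itself:

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Map one user buffer at a guest-chosen IOVA via VFIO type1.
 * 'container' is an already-configured VFIO container fd. */
static int map_dma(int container, void *buf, uint64_t iova, uint64_t size)
{
    struct vfio_iommu_type1_dma_map map;

    memset(&map, 0, sizeof(map));
    map.argsz = sizeof(map);
    map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
    map.vaddr = (uintptr_t)buf;   /* process virtual address   */
    map.iova  = iova;             /* gIOVA the device will use */
    map.size  = size;

    /* The guest kernel programs gIOVA->GPA into the vIOMMU; with the
     * "caching mode" capability set it must also issue an IOTLB
     * invalidation, which is what Xen traps to update its shadow
     * (gIOVA->HPA) page table. */
    return ioctl(container, VFIO_IOMMU_MAP_DMA, &map);
}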

> 
> >
> >
> > 1.3 Support guest SVM (Shared Virtual Memory)
> > It relies on the l1 translation table capability (GVA->GPA) on
> > vIOMMU. pIOMMU needs to enable both l1 and l2 translation in nested
> > mode (GVA->GPA->HPA) for passthrough device. IGD passthrough
> > is the main usage today (to support OpenCL 2.0 SVM feature). In the
> > future SVM might be used by other I/O devices too.
> 
> As an aside, how is IGD intending to support SVM?  Will it be with PCIe
> ATS/PASID, or something rather more magic as IGD is on the same piece of
> silicon?

Although integrated, IGD conforms to standard PCIe PASID convention.

> > 3.5 Implementation consideration
> > VT-d spec doesn't define a capability bit for the l2 translation.
> > Architecturally there is no way to tell guest that l2 translation
> > capability is not available. Linux Intel IOMMU driver thinks l2
> > translation is always available when VTD exits and fail to be loaded
> > without l2 translation support even if interrupt remapping and l1
> > translation are available. So it needs to enable l2 translation first
> > before other functions.
> 
> What then is the purpose of the nested translation support bit in the
> extended capability register?
> 

Nested translation is for SVM virtualization. Given a DMA transaction 
containing a PASID, VT-d engine first finds the 1st translation table 
through PASID to translate from GVA to GPA, then once nested
translation capability is enabled, further translate GPA to HPA using the
2nd level translation table. Bare-metal usage is not expected to turn
on this nested bit.

Thanks
Kevin



Re: [Xen-devel] Xen virtual IOMMU high level design doc V2

2016-10-18 Thread Konrad Rzeszutek Wilk
On Tue, Oct 18, 2016 at 10:14:16PM +0800, Lan Tianyu wrote:
> Change since V1:
>   1) Update motivation for Xen vIOMMU - 288 vcpus support part
>   2) Change definition of struct xen_sysctl_viommu_op
>   3) Update "3.5 Implementation consideration" to explain why we needs to
> enable l2 translation first.
>   4) Update "4.3 Q35 vs I440x" - Linux/Windows VTD drivers can work on the
> emulated I440 chipset.
>   5) Remove stale statement in the "3.3 Interrupt remapping"
> 
> Content:
> ===
> 1. Motivation of vIOMMU
>   1.1 Enable more than 255 vcpus
>   1.2 Support VFIO-based user space driver
>   1.3 Support guest Shared Virtual Memory (SVM)
> 2. Xen vIOMMU Architecture
>   2.1 l2 translation overview
>   2.2 Interrupt remapping overview
> 3. Xen hypervisor
>   3.1 New vIOMMU hypercall interface
>   3.2 l2 translation
>   3.3 Interrupt remapping
>   3.4 l1 translation
>   3.5 Implementation consideration
> 4. Qemu
>   4.1 Qemu vIOMMU framework
>   4.2 Dummy xen-vIOMMU driver
>   4.3 Q35 vs. i440x
>   4.4 Report vIOMMU to hvmloader
> 
> 
> 1 Motivation for Xen vIOMMU
> ===
> 1.1 Enable more than 255 vcpu support
> HPC cloud service requires VM provides high performance parallel
> computing and we hope to create a huge VM with >255 vcpu on one machine
> to meet such requirement.Ping each vcpus on separated pcpus. More than
> 255 vcpus support requires X2APIC and Linux disables X2APIC mode if
> there is no interrupt remapping function which is present by vIOMMU.
> Interrupt remapping function helps to deliver interrupt to #vcpu >255.
> So we need to add vIOMMU before enabling >255 vcpus.

What about Windows? Does it care about this?

> 
> 1.2 Support VFIO-based user space driver (e.g. DPDK) in the guest
> It relies on the l2 translation capability (IOVA->GPA) on
> vIOMMU. pIOMMU l2 becomes a shadowing structure of
> vIOMMU to isolate DMA requests initiated by user space driver.
> 
> 
> 1.3 Support guest SVM (Shared Virtual Memory)
> It relies on the l1 translation table capability (GVA->GPA) on
> vIOMMU. pIOMMU needs to enable both l1 and l2 translation in nested
> mode (GVA->GPA->HPA) for passthrough device. IGD passthrough
> is the main usage today (to support OpenCL 2.0 SVM feature). In the
> future SVM might be used by other I/O devices too.
> 
> 2. Xen vIOMMU Architecture
> 
> 
> * vIOMMU will be inside Xen hypervisor for following factors
>   1) Avoid round trips between Qemu and Xen hypervisor
>   2) Ease of integration with the rest of the hypervisor
>   3) HVMlite/PVH doesn't use Qemu
> * Dummy xen-vIOMMU in Qemu as a wrapper of new hypercall to create
> /destory vIOMMU in hypervisor and deal with virtual PCI device's l2

destroy
> translation.
> 
> 2.1 l2 translation overview
> For Virtual PCI device, dummy xen-vIOMMU does translation in the
> Qemu via new hypercall.
> 
> For physical PCI device, vIOMMU in hypervisor shadows IO page table from
> IOVA->GPA to IOVA->HPA and load page table to physical IOMMU.
> 
> The following diagram shows l2 translation architecture.
> [l2 translation architecture diagram: a virtual PCI device in Qemu
> issues a DMA request to the dummy xen-vIOMMU, which returns the target
> GPA used to access the Qemu memory region; the dummy xen-vIOMMU
> translates the IOVA via hypercall to the vIOMMU in the hypervisor, which
> drives the IOMMU driver and the hardware IOMMU, through which the
> physical PCI device's DMA reaches memory.]

Re: [Xen-devel] Xen virtual IOMMU high level design doc V2

2016-10-18 Thread Andrew Cooper
On 18/10/16 15:14, Lan Tianyu wrote:
> Change since V1:
> 1) Update motivation for Xen vIOMMU - 288 vcpus support part
> 2) Change definition of struct xen_sysctl_viommu_op
> 3) Update "3.5 Implementation consideration" to explain why we
> needs to enable l2 translation first.
> 4) Update "4.3 Q35 vs I440x" - Linux/Windows VTD drivers can work
> on the emulated I440 chipset.
> 5) Remove stale statement in the "3.3 Interrupt remapping"
>
> Content:
> ===
>
> 1. Motivation of vIOMMU
> 1.1 Enable more than 255 vcpus
> 1.2 Support VFIO-based user space driver
> 1.3 Support guest Shared Virtual Memory (SVM)
> 2. Xen vIOMMU Architecture
> 2.1 l2 translation overview
> 2.2 Interrupt remapping overview
> 3. Xen hypervisor
> 3.1 New vIOMMU hypercall interface
> 3.2 l2 translation
> 3.3 Interrupt remapping
> 3.4 l1 translation
> 3.5 Implementation consideration
> 4. Qemu
> 4.1 Qemu vIOMMU framework
> 4.2 Dummy xen-vIOMMU driver
> 4.3 Q35 vs. i440x
> 4.4 Report vIOMMU to hvmloader
>
>
> 1 Motivation for Xen vIOMMU
> ===
>
> 1.1 Enable more than 255 vcpu support
> HPC cloud service requires VM provides high performance parallel
> computing and we hope to create a huge VM with >255 vcpu on one machine
> to meet such requirement.Ping each vcpus on separated pcpus. More than

Pin ?

Also, grammatically speaking, I think you mean "each vcpu to separate
pcpus".

> 255 vcpus support requires X2APIC and Linux disables X2APIC mode if
> there is no interrupt remapping function which is present by vIOMMU.
> Interrupt remapping function helps to deliver interrupt to #vcpu >255.

This is only a requirement for xapic interrupt sources.  x2apic
interrupt sources already deliver correctly.

> So we need to add vIOMMU before enabling >255 vcpus.
>
> 1.2 Support VFIO-based user space driver (e.g. DPDK) in the guest
> It relies on the l2 translation capability (IOVA->GPA) on
> vIOMMU. pIOMMU l2 becomes a shadowing structure of
> vIOMMU to isolate DMA requests initiated by user space driver.

How is userspace supposed to drive this interface?  I can't picture how
it would function.

>
>
> 1.3 Support guest SVM (Shared Virtual Memory)
> It relies on the l1 translation table capability (GVA->GPA) on
> vIOMMU. pIOMMU needs to enable both l1 and l2 translation in nested
> mode (GVA->GPA->HPA) for passthrough device. IGD passthrough
> is the main usage today (to support OpenCL 2.0 SVM feature). In the
> future SVM might be used by other I/O devices too.

As an aside, how is IGD intending to support SVM?  Will it be with PCIe
ATS/PASID, or something rather more magic as IGD is on the same piece of
silicon?

>
> 2. Xen vIOMMU Architecture
> 
>
>
> * vIOMMU will be inside Xen hypervisor for following factors
> 1) Avoid round trips between Qemu and Xen hypervisor
> 2) Ease of integration with the rest of the hypervisor
> 3) HVMlite/PVH doesn't use Qemu
> * Dummy xen-vIOMMU in Qemu as a wrapper of new hypercall to create
> /destory vIOMMU in hypervisor and deal with virtual PCI device's l2
> translation.
>
> 2.1 l2 translation overview
> For Virtual PCI device, dummy xen-vIOMMU does translation in the
> Qemu via new hypercall.
>
> For physical PCI device, vIOMMU in hypervisor shadows IO page table from
> IOVA->GPA to IOVA->HPA and load page table to physical IOMMU.
>
> The following diagram shows l2 translation architecture.

Which scenario is this?  Is this the passthrough case where the Qemu
Virtual PCI device is a shadow of the real PCI device in hardware?

> [l2 translation architecture diagram: a virtual PCI device in Qemu
> issues a DMA request to the dummy xen-vIOMMU, which returns the target
> GPA used to access the Qemu memory region; the dummy xen-vIOMMU
> translates the IOVA via hypercall to the vIOMMU in the hypervisor, which
> drives the IOMMU driver and the hardware IOMMU, through which the
> physical PCI device's DMA reaches memory.]

[Xen-devel] Xen virtual IOMMU high level design doc V2

2016-10-18 Thread Lan Tianyu

Change since V1:
1) Update motivation for Xen vIOMMU - 288 vcpus support part
2) Change definition of struct xen_sysctl_viommu_op
3) Update "3.5 Implementation consideration" to explain why we need to
enable l2 translation first.
4) Update "4.3 Q35 vs I440x" - Linux/Windows VTD drivers can work on
the emulated I440 chipset.

5) Remove stale statement in the "3.3 Interrupt remapping"

Content:
===
1. Motivation of vIOMMU
1.1 Enable more than 255 vcpus
1.2 Support VFIO-based user space driver
1.3 Support guest Shared Virtual Memory (SVM)
2. Xen vIOMMU Architecture
2.1 l2 translation overview
2.2 Interrupt remapping overview
3. Xen hypervisor
3.1 New vIOMMU hypercall interface
3.2 l2 translation
3.3 Interrupt remapping
3.4 l1 translation
3.5 Implementation consideration
4. Qemu
4.1 Qemu vIOMMU framework
4.2 Dummy xen-vIOMMU driver
4.3 Q35 vs. i440x
4.4 Report vIOMMU to hvmloader


1 Motivation for Xen vIOMMU
===
1.1 Enable more than 255 vcpu support
HPC cloud service requires VM provides high performance parallel
computing and we hope to create a huge VM with >255 vcpu on one machine
to meet such requirement.Ping each vcpus on separated pcpus. More than
255 vcpus support requires X2APIC and Linux disables X2APIC mode if
there is no interrupt remapping function which is present by vIOMMU.
Interrupt remapping function helps to deliver interrupt to #vcpu >255.
So we need to add vIOMMU before enabling >255 vcpus.

1.2 Support VFIO-based user space driver (e.g. DPDK) in the guest
It relies on the l2 translation capability (IOVA->GPA) on
vIOMMU. pIOMMU l2 becomes a shadowing structure of
vIOMMU to isolate DMA requests initiated by user space driver.


1.3 Support guest SVM (Shared Virtual Memory)
It relies on the l1 translation table capability (GVA->GPA) on
vIOMMU. pIOMMU needs to enable both l1 and l2 translation in nested
mode (GVA->GPA->HPA) for passthrough device. IGD passthrough
is the main usage today (to support OpenCL 2.0 SVM feature). In the
future SVM might be used by other I/O devices too.

2. Xen vIOMMU Architecture


* vIOMMU will be inside Xen hypervisor for following factors
1) Avoid round trips between Qemu and Xen hypervisor
2) Ease of integration with the rest of the hypervisor
3) HVMlite/PVH doesn't use Qemu
* Dummy xen-vIOMMU in Qemu as a wrapper of new hypercall to create
/destory vIOMMU in hypervisor and deal with virtual PCI device's l2
translation.

2.1 l2 translation overview
For Virtual PCI device, dummy xen-vIOMMU does translation in the
Qemu via new hypercall.

For physical PCI device, vIOMMU in hypervisor shadows IO page table from
IOVA->GPA to IOVA->HPA and load page table to physical IOMMU.

The following diagram shows l2 translation architecture.
[l2 translation architecture diagram: a virtual PCI device in Qemu issues
a DMA request to the dummy xen-vIOMMU, which returns the target GPA used
to access the Qemu memory region; the dummy xen-vIOMMU translates the IOVA
via hypercall to the vIOMMU in the hypervisor, which drives the IOMMU
driver and the hardware IOMMU, through which the physical PCI device's DMA
reaches memory.]

Re: [Xen-devel] Xen virtual IOMMU high level design doc

2016-10-10 Thread Lan Tianyu
On 2016年10月06日 02:36, Konrad Rzeszutek Wilk wrote:
>>> 3.3 Interrupt remapping
>>> > > Interrupts from virtual devices and physical devices will be delivered
>>> > > to vlapic from vIOAPIC and vMSI. It needs to add interrupt remapping
>>> > > hooks in the vmsi_deliver() and ioapic_deliver() to find target vlapic
>>> > > according interrupt remapping table. The following diagram shows the 
>>> > > logic.
>>> > > 
> Uh? Missing diagram?

Sorry, this is a stale statement. The diagram was moved to 2.2 Interrupt
remapping overview.

> 
>>> 4.3 Q35 vs i440x
>>> > > VT-D is introduced since Q35 chipset. Previous concern was that IOMMU
> s/since/with/
>>> > > driver has assumption that VTD only exists on Q35 and newer chipset and
>>> > > we have to enable Q35 first.
>>> > > 
>>> > > Consulted with Linux/Windows IOMMU driver experts and get that these
>>> > > drivers doesn't have such assumption. So we may skip Q35 implementation
>>> > > and can emulate vIOMMU on I440x chipset. KVM already have vIOMMU support
>>> > > with virtual PCI device's DMA translation and interrupt remapping. We
>>> > > are using KVM to do experiment of adding vIOMMU on the I440x and test
>>> > > Linux/Windows guest. Will report back when have some results.
> Any results?

We have booted up a Win8 guest with virtual VT-d on an emulated I440x
platform on Xen, and the guest uses the virtual VT-d to enable the
interrupt remapping function.

-- 
Best regards
Tianyu Lan



Re: [Xen-devel] Xen virtual IOMMU high level design doc

2016-10-05 Thread Konrad Rzeszutek Wilk
On Thu, Sep 15, 2016 at 10:22:36PM +0800, Lan, Tianyu wrote:
> Hi Andrew:
> Sorry to bother you. To make sure we are on the right direction, it's
> better to get feedback from you before we go further step. Could you
> have a look? Thanks.
> 
> On 8/17/2016 8:05 PM, Lan, Tianyu wrote:
> > Hi All:
> >  The following is our Xen vIOMMU high level design for detail
> > discussion. Please have a look. Very appreciate for your comments.
> > This design doesn't cover changes when root port is moved to hypervisor.
> > We may design it later.
> > 
> > 
> > Content:
> > ===
> > 
> > 1. Motivation of vIOMMU
> > 1.1 Enable more than 255 vcpus
> > 1.2 Support VFIO-based user space driver
> > 1.3 Support guest Shared Virtual Memory (SVM)
> > 2. Xen vIOMMU Architecture
> > 2.1 2th level translation overview
> > 2.2 Interrupt remapping overview
> > 3. Xen hypervisor
> > 3.1 New vIOMMU hypercall interface
> > 3.2 2nd level translation
> > 3.3 Interrupt remapping
> > 3.4 1st level translation
> > 3.5 Implementation consideration
> > 4. Qemu
> > 4.1 Qemu vIOMMU framework
> > 4.2 Dummy xen-vIOMMU driver
> > 4.3 Q35 vs. i440x
> > 4.4 Report vIOMMU to hvmloader
> > 
> > 
> > 1 Motivation for Xen vIOMMU
> > ===
> > 
> > 1.1 Enable more than 255 vcpu support
> > HPC virtualization requires more than 255 vcpus support in a single VM
> > to meet parallel computing requirement. More than 255 vcpus support
> > requires interrupt remapping capability present on vIOMMU to deliver
> > interrupt to #vcpu >255 Otherwise Linux guest fails to boot up with >255
> > vcpus if interrupt remapping is absent.
> > 
> > 
> > 1.2 Support VFIO-based user space driver (e.g. DPDK) in the guest
> > It relies on the 2nd level translation capability (IOVA->GPA) on
> > vIOMMU. pIOMMU 2nd level becomes a shadowing structure of
> > vIOMMU to isolate DMA requests initiated by user space driver.
> > 
> > 
> > 1.3 Support guest SVM (Shared Virtual Memory)
> > It relies on the 1st level translation table capability (GVA->GPA) on
> > vIOMMU. pIOMMU needs to enable both 1st level and 2nd level translation
> > in nested mode (GVA->GPA->HPA) for passthrough device. IGD passthrough
> > is the main usage today (to support OpenCL 2.0 SVM feature). In the
> > future SVM might be used by other I/O devices too.
> > 
> > 2. Xen vIOMMU Architecture
> > 
> > 
> > 
> > * vIOMMU will be inside Xen hypervisor for following factors
> > 1) Avoid round trips between Qemu and Xen hypervisor
> > 2) Ease of integration with the rest of the hypervisor
> > 3) HVMlite/PVH doesn't use Qemu
> > * Dummy xen-vIOMMU in Qemu as a wrapper of new hypercall to create
> > /destory vIOMMU in hypervisor and deal with virtual PCI device's 2th
> > level translation.
> > 
> > 2.1 2th level translation overview
> > For Virtual PCI device, dummy xen-vIOMMU does translation in the
> > Qemu via new hypercall.
> > 
> > For physical PCI device, vIOMMU in hypervisor shadows IO page table from
> > IOVA->GPA to IOVA->HPA and load page table to physical IOMMU.
> > 
> > The following diagram shows 2th level translation architecture.
> > [2nd-level translation architecture diagram: a virtual PCI device in
> > Qemu issues a DMA request to the dummy xen-vIOMMU, which returns the
> > target GPA used to access the Qemu memory region; the dummy xen-vIOMMU
> > translates the IOVA via hypercall to the vIOMMU in the hypervisor,
> > which drives the IOMMU driver and the hardware IOMMU, through which
> > the physical PCI device's DMA reaches memory.]

Re: [Xen-devel] Xen virtual IOMMU high level design doc

2016-09-15 Thread Lan, Tianyu

Hi Andrew:
Sorry to bother you. To make sure we are going in the right direction,
it's better to get feedback from you before we take further steps. Could
you have a look? Thanks.

On 8/17/2016 8:05 PM, Lan, Tianyu wrote:

Hi All:
The following is our Xen vIOMMU high level design for detailed
discussion. Please have a look; your comments are very much appreciated.
This design doesn't cover the changes needed when the root port is moved
into the hypervisor. We may design that part later.


Content:
===

1. Motivation of vIOMMU
1.1 Enable more than 255 vcpus
1.2 Support VFIO-based user space driver
1.3 Support guest Shared Virtual Memory (SVM)
2. Xen vIOMMU Architecture
2.1 2th level translation overview
2.2 Interrupt remapping overview
3. Xen hypervisor
3.1 New vIOMMU hypercall interface
3.2 2nd level translation
3.3 Interrupt remapping
3.4 1st level translation
3.5 Implementation consideration
4. Qemu
4.1 Qemu vIOMMU framework
4.2 Dummy xen-vIOMMU driver
4.3 Q35 vs. i440x
4.4 Report vIOMMU to hvmloader


1 Motivation for Xen vIOMMU
===

1.1 Enable more than 255 vcpu support
HPC virtualization requires more than 255 vcpus support in a single VM
to meet parallel computing requirement. More than 255 vcpus support
requires interrupt remapping capability present on vIOMMU to deliver
interrupt to #vcpu >255 Otherwise Linux guest fails to boot up with >255
vcpus if interrupt remapping is absent.


1.2 Support VFIO-based user space driver (e.g. DPDK) in the guest
It relies on the 2nd level translation capability (IOVA->GPA) on
vIOMMU. pIOMMU 2nd level becomes a shadowing structure of
vIOMMU to isolate DMA requests initiated by user space driver.


1.3 Support guest SVM (Shared Virtual Memory)
It relies on the 1st level translation table capability (GVA->GPA) on
vIOMMU. pIOMMU needs to enable both 1st level and 2nd level translation
in nested mode (GVA->GPA->HPA) for passthrough device. IGD passthrough
is the main usage today (to support OpenCL 2.0 SVM feature). In the
future SVM might be used by other I/O devices too.

2. Xen vIOMMU Architecture



* vIOMMU will be inside Xen hypervisor for following factors
1) Avoid round trips between Qemu and Xen hypervisor
2) Ease of integration with the rest of the hypervisor
3) HVMlite/PVH doesn't use Qemu
* Dummy xen-vIOMMU in Qemu as a wrapper of new hypercall to create
/destory vIOMMU in hypervisor and deal with virtual PCI device's 2th
level translation.

2.1 2th level translation overview
For Virtual PCI device, dummy xen-vIOMMU does translation in the
Qemu via new hypercall.

For physical PCI device, vIOMMU in hypervisor shadows IO page table from
IOVA->GPA to IOVA->HPA and load page table to physical IOMMU.

The following diagram shows 2th level translation architecture.
[2nd-level translation architecture diagram: a virtual PCI device in Qemu
issues a DMA request to the dummy xen-vIOMMU, which returns the target GPA
used to access the Qemu memory region; the dummy xen-vIOMMU translates the
IOVA via hypercall to the vIOMMU in the hypervisor, which drives the IOMMU
driver and the hardware IOMMU, through which the physical PCI device's DMA
reaches memory.]

Re: [Xen-devel] Xen virtual IOMMU high level design doc

2016-08-31 Thread Lan Tianyu
On 2016年08月31日 20:02, Jan Beulich wrote:
 On 31.08.16 at 10:39,  wrote:
>> > On 2016年08月25日 19:11, Jan Beulich wrote:
>> > On 17.08.16 at 14:05,  wrote:
 >>> 1 Motivation for Xen vIOMMU
 >>> 
 >>> ===
 >>> 1.1 Enable more than 255 vcpu support
 >>> HPC virtualization requires more than 255 vcpus support in a single VM
 >>> to meet parallel computing requirement. More than 255 vcpus support
 >>> requires interrupt remapping capability present on vIOMMU to deliver
 >>> interrupt to #vcpu >255 Otherwise Linux guest fails to boot up with 
 >>> >255
 >>> vcpus if interrupt remapping is absent.
>>> >> 
>>> >> I continue to question this as a valid motivation at this point in
>>> >> time, for the reasons Andrew has been explaining.
>> > 
>> > If we want to support Linux guest with >255 vcpus, interrupt remapping
>> > is necessary.
> I don't understand why you keep repeating this, without adding
> _why_ you think there is a demand for such guests and _what_
> your plans are to eliminate Andrew's concerns.
> 

The motivation for such a huge VM is the HPC (High-Performance Computing)
cloud service, which requires high performance parallel computing. We
create a single VM on one machine and expose more than 255 pcpus to the VM
in order to guarantee high performance parallel computing in the VM. Each
vcpu is pinned to a pcpu.

For performance, we achieved good results (>95% of native performance for
the stream, dgemm and sgemm benchmarks in the VM) after some tuning and
optimizations. We presented these at this year's Xen summit.

For stability, Andrew found some issues with a huge VM with the watchdog
enabled that cause hypervisor reboots. We will reproduce and fix them.

-- 
Best regards
Tianyu Lan



Re: [Xen-devel] Xen virtual IOMMU high level design doc

2016-08-31 Thread Tian, Kevin
> From: Jan Beulich [mailto:jbeul...@suse.com]
> Sent: Wednesday, August 31, 2016 8:03 PM
> >>> 3.5 Implementation consideration
> >>> Linux Intel IOMMU driver will fail to be loaded without 2th level
> >>> translation support even if interrupt remapping and 1th level
> >>> translation are available. This means it's needed to enable 2th level
> >>> translation first before other functions.
> >>
> >> Is there a reason for this? I.e. do they unconditionally need that
> >> functionality?
> >
> > Yes, Linux intel IOMMU driver unconditionally needs l2 translation.
> > Driver checks whether there is a valid sagaw(supported Adjusted Guest
> > Address Widths) during initializing IOMMU data struct and return error
> > if not.
> 
> How about my first question then?
> 
> Jan

The VT-d spec doesn't define a capability bit for the 2nd-level
translation (for 1st-level translation or interrupt remapping there are
such capability bits to report). So architecturally there is no way to
tell the guest that the 2nd-level translation capability is not available,
and the existing Linux behavior is simply correct.

Thanks
Kevin


Re: [Xen-devel] Xen virtual IOMMU high level design doc

2016-08-31 Thread Jan Beulich
>>> On 31.08.16 at 10:39,  wrote:
> On 2016年08月25日 19:11, Jan Beulich wrote:
> On 17.08.16 at 14:05,  wrote:
>>> 1 Motivation for Xen vIOMMU
>>> 
>>> ===
>>> 1.1 Enable more than 255 vcpu support
>>> HPC virtualization requires more than 255 vcpus support in a single VM
>>> to meet parallel computing requirement. More than 255 vcpus support
>>> requires interrupt remapping capability present on vIOMMU to deliver
>>> interrupt to #vcpu >255 Otherwise Linux guest fails to boot up with >255
>>> vcpus if interrupt remapping is absent.
>> 
>> I continue to question this as a valid motivation at this point in
>> time, for the reasons Andrew has been explaining.
> 
> If we want to support Linux guest with >255 vcpus, interrupt remapping
> is necessary.

I don't understand why you keep repeating this, without adding
_why_ you think there is a demand for such guests and _what_
your plans are to eliminate Andrew's concerns.

>>> 3 Xen hypervisor
>>> ==
>>>
>>> 3.1 New hypercall XEN_SYSCTL_viommu_op
>>> 1) Definition of "struct xen_sysctl_viommu_op" as new hypercall parameter.
>>>
>>> struct xen_sysctl_viommu_op {
>>>     u32 cmd;
>>>     u32 domid;
>>>     union {
>>>         struct {
>>>             u32 capabilities;
>>>         } query_capabilities;
>>>         struct {
>>>             u32 capabilities;
>>>             u64 base_address;
>>>         } create_iommu;
>>>         struct {
>>>             u8  bus;
>>>             u8  devfn;
>> 
>> Please can we avoid introducing any new interfaces without segment/
>> domain value, even if for now it'll be always zero?
> 
> Sure. Will add segment field.
> 
>> 
>>> u64 iova;
>>> u64 translated_addr;
>>> u64 addr_mask; /* Translation page size */
>>> IOMMUAccessFlags permisson; 
>>> } 2th_level_translation;
>> 
>> I suppose "translated_addr" is an output here, but for the following
>> fields this already isn't clear. Please add IN and OUT annotations for
>> clarity.
>> 
>> Also, may I suggest to name this "l2_translation"? (But there are
>> other implementation specific things to be considered here, which
>> I guess don't belong into a design doc discussion.)
> 
> How about this?
> struct {
>   /* IN parameters. */
>   u8  segment;
> u8  bus;
> u8  devfn;
> u64 iova;
>   /* Out parameters. */
> u64 translated_addr;
> u64 addr_mask; /* Translation page size */
> IOMMUAccessFlags permisson;
> } l2_translation;

"segment" clearly needs to be a 16-bit value, but apart from that
(and missing padding fields) this looks okay.
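
For illustration only, with a 16-bit segment and explicit padding the
layout could end up something like this (the pad fields and the u32
permission are placeholders, not part of the proposal):

struct {
    /* IN parameters. */
    u16 segment;
    u8  bus;
    u8  devfn;
    u32 pad0;              /* keep the 64-bit fields 8-byte aligned */
    u64 iova;
    /* OUT parameters. */
    u64 translated_addr;
    u64 addr_mask;         /* translation page size */
    u32 permission;        /* IOMMUAccessFlags value */
    u32 pad1;
} l2_translation;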

>>> 3.5 Implementation consideration
>>> Linux Intel IOMMU driver will fail to be loaded without 2th level
>>> translation support even if interrupt remapping and 1th level
>>> translation are available. This means it's needed to enable 2th level
>>> translation first before other functions.
>> 
>> Is there a reason for this? I.e. do they unconditionally need that
>> functionality?
> 
> Yes, Linux intel IOMMU driver unconditionally needs l2 translation.
> Driver checks whether there is a valid sagaw(supported Adjusted Guest
> Address Widths) during initializing IOMMU data struct and return error
> if not.

How about my first question then?

Jan



Re: [Xen-devel] Xen virtual IOMMU high level design doc

2016-08-31 Thread Lan Tianyu
Hi Jan:
Sorry for the late response. Thanks a lot for your comments.

On 2016-08-25 19:11, Jan Beulich wrote:
 On 17.08.16 at 14:05,  wrote:
>> 1 Motivation for Xen vIOMMU
>> 
>> ===
>> 1.1 Enable more than 255 vcpu support
>> HPC virtualization requires more than 255 vcpus support in a single VM
>> to meet parallel computing requirement. More than 255 vcpus support
>> requires interrupt remapping capability present on vIOMMU to deliver
>> interrupt to #vcpu >255 Otherwise Linux guest fails to boot up with >255
>> vcpus if interrupt remapping is absent.
> 
> I continue to question this as a valid motivation at this point in
> time, for the reasons Andrew has been explaining.

If we want to support Linux guest with >255 vcpus, interrupt remapping
is necessary.

The Linux commit introducing x2apic and IR mode states that IR is a
prerequisite for enabling x2apic mode in the CPU.
https://lwn.net/Articles/289881/

So far we are not sure about the behavior of other OSes. We may observe
Windows guest behavior on KVM later; there is still a bug when running a
Windows guest with the IR function on KVM.


> 
>> 2. Xen vIOMMU Architecture
>> 
>> 
>>
>> * vIOMMU will be inside Xen hypervisor for following factors
>>  1) Avoid round trips between Qemu and Xen hypervisor
>>  2) Ease of integration with the rest of the hypervisor
>>  3) HVMlite/PVH doesn't use Qemu
>> * Dummy xen-vIOMMU in Qemu as a wrapper of new hypercall to create
>> /destory vIOMMU in hypervisor and deal with virtual PCI device's 2th
>> level translation.
> 
> How does the create/destroy part of this match up with 3) right
> ahead of it?

The create/destroy hypercalls will work for both HVM and HVMlite.
HVMlite is assumed to have a toolstack (e.g. libxl) which can call the new
hypercalls to create or destroy the virtual IOMMU in the hypervisor.
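
For example, the toolstack side could end up doing something like the
following (a sketch against the struct proposed in this thread; the
sysctl wrapper name and the base address value are only placeholders):

/* Create a vIOMMU for a PVH/HVMlite guest without Qemu involvement. */
static int viommu_create_for_domain(uint32_t domid)
{
    struct xen_sysctl_viommu_op op = {
        .cmd   = XEN_SYSCTL_viommu_create,
        .domid = domid,
    };

    /* Capabilities the guest should see (l2 is needed by the Linux driver). */
    op.create_iommu.capabilities = XEN_VIOMMU_CAPABILITY_2nd_level_translation;
    /* Register base of the emulated VT-d unit; the value is just an example. */
    op.create_iommu.base_address = 0xfed90000UL;

    /* Hand the op to the XEN_SYSCTL_viommu_op hypercall through the
     * toolstack's sysctl plumbing (wrapper name is a placeholder). */
    return do_viommu_sysctl(&op);
}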

> 
>> 3 Xen hypervisor
>> ==
>>
>> 3.1 New hypercall XEN_SYSCTL_viommu_op
>> 1) Definition of "struct xen_sysctl_viommu_op" as new hypercall parameter.
>>
>> struct xen_sysctl_viommu_op {
>>  u32 cmd;
>>  u32 domid;
>>  union {
>>  struct {
>>  u32 capabilities;
>>  } query_capabilities;
>>  struct {
>>  u32 capabilities;
>>  u64 base_address;
>>  } create_iommu;
>>  struct {
>>  u8  bus;
>>  u8  devfn;
> 
> Please can we avoid introducing any new interfaces without segment/
> domain value, even if for now it'll be always zero?

Sure. Will add segment field.

> 
>>  u64 iova;
>>  u64 translated_addr;
>>  u64 addr_mask; /* Translation page size */
>>  IOMMUAccessFlags permisson; 
>>  } 2th_level_translation;
> 
> I suppose "translated_addr" is an output here, but for the following
> fields this already isn't clear. Please add IN and OUT annotations for
> clarity.
> 
> Also, may I suggest to name this "l2_translation"? (But there are
> other implementation specific things to be considered here, which
> I guess don't belong into a design doc discussion.)

How about this?
struct {
/* IN parameters. */
u8  segment;
u8  bus;
u8  devfn;
u64 iova;
/* Out parameters. */
u64 translated_addr;
u64 addr_mask; /* Translation page size */
IOMMUAccessFlags permisson;
} l2_translation;

> 
>> };
>>
>> typedef enum {
>>  IOMMU_NONE = 0,
>>  IOMMU_RO   = 1,
>>  IOMMU_WO   = 2,
>>  IOMMU_RW   = 3,
>> } IOMMUAccessFlags;
>>
>>
>> Definition of VIOMMU subops:
>> #define XEN_SYSCTL_viommu_query_capability   0
>> #define XEN_SYSCTL_viommu_create 1
>> #define XEN_SYSCTL_viommu_destroy2
>> #define XEN_SYSCTL_viommu_dma_translation_for_vpdev  3
>>
>> Definition of VIOMMU capabilities
>> #define XEN_VIOMMU_CAPABILITY_1nd_level_translation  (1 << 0)
>> #define XEN_VIOMMU_CAPABILITY_2nd_level_translation  (1 << 1)
> 
> l1 and l2 respectively again, please.

Will update.
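
Presumably they would end up as something like:

#define XEN_VIOMMU_CAPABILITY_l1_translation  (1 << 0)
#define XEN_VIOMMU_CAPABILITY_l2_translation  (1 << 1)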

> 
>> 3.3 Interrupt remapping
>> Interrupts from virtual devices and physical devices will be delivered
>> to vlapic from vIOAPIC and vMSI. It needs to add interrupt remapping
>> hooks in the vmsi_deliver() and ioapic_deliver() to find target vlapic
>> according interrupt remapping table. The following diagram shows the logic.
> 
> Missing diagram or stale sentence?

Sorry, that is a stale sentence; the diagram has been moved to 2.2
Interrupt remapping overview.
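
As a rough illustration of the hook described above (all names, types and
prototypes here are approximations for discussion, not existing Xen code):

struct domain;

/* Hypothetical result of an interrupt remapping table lookup. */
struct viommu_remapped_irq {
    uint8_t vector, dest, dest_mode, delivery_mode, trig_mode;
};

/* Placeholder: resolve a remappable MSI via the guest's IRTE table. */
int viommu_intremap_lookup(struct domain *d, uint32_t msi_addr,
                           uint32_t msi_data, struct viommu_remapped_irq *out);

/* Assumed shape of the existing delivery helper. */
void vmsi_deliver(struct domain *d, int vector, uint8_t dest,
                  uint8_t dest_mode, uint8_t delivery_mode, uint8_t trig_mode);

static int vmsi_deliver_remapped(struct domain *d, uint32_t msi_addr,
                                 uint32_t msi_data)
{
    struct viommu_remapped_irq irq;

    /* Look up the remapping entry first, then deliver as before. */
    if ( viommu_intremap_lookup(d, msi_addr, msi_data, &irq) )
        return -1;

    vmsi_deliver(d, irq.vector, irq.dest, irq.dest_mode,
                 irq.delivery_mode, irq.trig_mode);
    return 0;
}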

> 
>> 3.5 Implementation consideration
>> Linux Intel IOMMU driver will fail to be loaded without 2th level
>> translation support even if 

Re: [Xen-devel] Xen virtual IOMMU high level design doc

2016-08-25 Thread Jan Beulich
>>> On 17.08.16 at 14:05,  wrote:
> 1 Motivation for Xen vIOMMU
> 
> ===
> 1.1 Enable more than 255 vcpu support
> HPC virtualization requires more than 255 vcpus support in a single VM
> to meet parallel computing requirement. More than 255 vcpus support
> requires interrupt remapping capability present on vIOMMU to deliver
> interrupt to #vcpu >255 Otherwise Linux guest fails to boot up with >255
> vcpus if interrupt remapping is absent.

I continue to question this as a valid motivation at this point in
time, for the reasons Andrew has been explaining.

> 2. Xen vIOMMU Architecture
> 
> 
> 
> * vIOMMU will be inside Xen hypervisor for following factors
>   1) Avoid round trips between Qemu and Xen hypervisor
>   2) Ease of integration with the rest of the hypervisor
>   3) HVMlite/PVH doesn't use Qemu
> * Dummy xen-vIOMMU in Qemu as a wrapper of new hypercall to create
> /destory vIOMMU in hypervisor and deal with virtual PCI device's 2th
> level translation.

How does the create/destroy part of this match up with 3) right
ahead of it?

> 3 Xen hypervisor
> ==
> 
> 3.1 New hypercall XEN_SYSCTL_viommu_op
> 1) Definition of "struct xen_sysctl_viommu_op" as new hypercall parameter.
> 
> struct xen_sysctl_viommu_op {
>   u32 cmd;
>   u32 domid;
>   union {
>   struct {
>   u32 capabilities;
>   } query_capabilities;
>   struct {
>   u32 capabilities;
>   u64 base_address;
>   } create_iommu;
>   struct {
>   u8  bus;
>   u8  devfn;

Please can we avoid introducing any new interfaces without segment/
domain value, even if for now it'll be always zero?

>   u64 iova;
>   u64 translated_addr;
>   u64 addr_mask; /* Translation page size */
>   IOMMUAccessFlags permisson; 
>   } 2th_level_translation;

I suppose "translated_addr" is an output here, but for the following
fields this already isn't clear. Please add IN and OUT annotations for
clarity.

Also, may I suggest to name this "l2_translation"? (But there are
other implementation specific things to be considered here, which
I guess don't belong into a design doc discussion.)

> };
> 
> typedef enum {
>   IOMMU_NONE = 0,
>   IOMMU_RO   = 1,
>   IOMMU_WO   = 2,
>   IOMMU_RW   = 3,
> } IOMMUAccessFlags;
> 
> 
> Definition of VIOMMU subops:
> #define XEN_SYSCTL_viommu_query_capability0
> #define XEN_SYSCTL_viommu_create  1
> #define XEN_SYSCTL_viommu_destroy 2
> #define XEN_SYSCTL_viommu_dma_translation_for_vpdev   3
> 
> Definition of VIOMMU capabilities
> #define XEN_VIOMMU_CAPABILITY_1nd_level_translation   (1 << 0)
> #define XEN_VIOMMU_CAPABILITY_2nd_level_translation   (1 << 1)

l1 and l2 respectively again, please.

> 3.3 Interrupt remapping
> Interrupts from virtual devices and physical devices will be delivered
> to vlapic from vIOAPIC and vMSI. It needs to add interrupt remapping
> hooks in the vmsi_deliver() and ioapic_deliver() to find target vlapic
> according interrupt remapping table. The following diagram shows the logic.

Missing diagram or stale sentence?

> 3.5 Implementation consideration
> Linux Intel IOMMU driver will fail to be loaded without 2th level
> translation support even if interrupt remapping and 1th level
> translation are available. This means it's needed to enable 2th level
> translation first before other functions.

Is there a reason for this? I.e. do they unconditionally need that
functionality?

Jan




Re: [Xen-devel] Xen virtual IOMMU high level design doc

2016-08-17 Thread Lan, Tianyu



On 8/17/2016 8:42 PM, Paul Durrant wrote:

-Original Message-
From: Xen-devel [mailto:xen-devel-boun...@lists.xen.org] On Behalf Of
Lan, Tianyu
Sent: 17 August 2016 13:06
To: Jan Beulich; Kevin Tian; Andrew Cooper; yang.zhang...@gmail.com; Jun
Nakajima; Stefano Stabellini
Cc: Anthony Perard; xuqu...@huawei.com; xen-
de...@lists.xensource.com; Ian Jackson; Roger Pau Monne
Subject: [Xen-devel] Xen virtual IOMMU high level design doc

Hi All:
  The following is our Xen vIOMMU high level design for detail
discussion. Please have a look. Very appreciate for your comments.
This design doesn't cover changes when root port is moved to hypervisor.
We may design it later.


Content:
==
=
1. Motivation of vIOMMU
1.1 Enable more than 255 vcpus
1.2 Support VFIO-based user space driver
1.3 Support guest Shared Virtual Memory (SVM)
2. Xen vIOMMU Architecture
2.1 2th level translation overview
2.2 Interrupt remapping overview
3. Xen hypervisor
3.1 New vIOMMU hypercall interface


Would it not have been better to build on the previously discussed (and mostly 
agreed) PV IOMMU interface? (See 
https://lists.xenproject.org/archives/html/xen-devel/2016-02/msg01428.html). An 
RFC implementation series was also posted 
(https://lists.xenproject.org/archives/html/xen-devel/2016-02/msg01441.html).

  Paul



Hi Paul:
Thanks for your input. I glanced through the patchset; it introduces the
hypercall "HYPERVISOR_iommu_op", which currently only works for the PV
IOMMU. We may abstract it and make it work for both the PV and the virtual
IOMMU.





Re: [Xen-devel] Xen virtual IOMMU high level design doc

2016-08-17 Thread Paul Durrant
> -Original Message-
> From: Xen-devel [mailto:xen-devel-boun...@lists.xen.org] On Behalf Of
> Lan, Tianyu
> Sent: 17 August 2016 13:06
> To: Jan Beulich; Kevin Tian; Andrew Cooper; yang.zhang...@gmail.com; Jun
> Nakajima; Stefano Stabellini
> Cc: Anthony Perard; xuqu...@huawei.com; xen-
> de...@lists.xensource.com; Ian Jackson; Roger Pau Monne
> Subject: [Xen-devel] Xen virtual IOMMU high level design doc
> 
> Hi All:
>   The following is our Xen vIOMMU high level design for detail
> discussion. Please have a look. Very appreciate for your comments.
> This design doesn't cover changes when root port is moved to hypervisor.
> We may design it later.
> 
> 
> Content:
> ==
> =
> 1. Motivation of vIOMMU
>   1.1 Enable more than 255 vcpus
>   1.2 Support VFIO-based user space driver
>   1.3 Support guest Shared Virtual Memory (SVM)
> 2. Xen vIOMMU Architecture
>   2.1 2th level translation overview
>   2.2 Interrupt remapping overview
> 3. Xen hypervisor
>   3.1 New vIOMMU hypercall interface

Would it not have been better to build on the previously discussed (and mostly 
agreed) PV IOMMU interface? (See 
https://lists.xenproject.org/archives/html/xen-devel/2016-02/msg01428.html). An 
RFC implementation series was also posted 
(https://lists.xenproject.org/archives/html/xen-devel/2016-02/msg01441.html).

  Paul


[Xen-devel] Xen virtual IOMMU high level design doc

2016-08-17 Thread Lan, Tianyu

Hi All:
The following is our Xen vIOMMU high level design for detailed
discussion. Please have a look; your comments are very much appreciated.
This design doesn't cover the changes needed when the root port is moved to
the hypervisor. We may design that later.


Content:
===
1. Motivation of vIOMMU
1.1 Enable more than 255 vcpus
1.2 Support VFIO-based user space driver
1.3 Support guest Shared Virtual Memory (SVM)
2. Xen vIOMMU Architecture
2.1 2nd level translation overview
2.2 Interrupt remapping overview
3. Xen hypervisor
3.1 New vIOMMU hypercall interface
3.2 2nd level translation
3.3 Interrupt remapping
3.4 1st level translation
3.5 Implementation consideration
4. Qemu
4.1 Qemu vIOMMU framework
4.2 Dummy xen-vIOMMU driver
4.3 Q35 vs. i440x
4.4 Report vIOMMU to hvmloader


1 Motivation for Xen vIOMMU
===
1.1 Enable more than 255 vcpu support
HPC virtualization requires support for more than 255 vcpus in a single VM
to meet parallel computing requirements. Supporting more than 255 vcpus
requires the interrupt remapping capability to be present on the vIOMMU so
that interrupts can be delivered to vcpus with IDs above 255; otherwise a
Linux guest fails to boot with more than 255 vcpus.


1.2 Support VFIO-based user space driver (e.g. DPDK) in the guest
It relies on the 2nd level translation capability (IOVA->GPA) of the
vIOMMU. The pIOMMU's 2nd level table becomes a shadow structure of the
vIOMMU's, isolating DMA requests initiated by the user space driver.
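
As a sketch of the composition the shadowing performs for each mapping
(the lookup helpers are placeholder names, not existing interfaces):

struct domain;
/* Placeholder: guest 2nd level table walk, IOVA -> GPA plus permission. */
int viommu_lookup_l2(struct domain *d, uint64_t iova,
                     uint64_t *gpa, unsigned int *perm);
/* Placeholder: Xen p2m lookup, GPA -> HPA. */
int guest_gpa_to_hpa(struct domain *d, uint64_t gpa, uint64_t *hpa);

/* Build one shadow entry: the table loaded into the pIOMMU maps IOVA
 * directly to HPA, so the user space driver stays confined to its GPAs. */
static int shadow_l2_entry(struct domain *d, uint64_t iova,
                           uint64_t *hpa, unsigned int *perm)
{
    uint64_t gpa;

    if ( viommu_lookup_l2(d, iova, &gpa, perm) )
        return -1;

    if ( guest_gpa_to_hpa(d, gpa, hpa) )
        return -1;

    return 0;
}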


1.3 Support guest SVM (Shared Virtual Memory)
It relies on the 1st level translation table capability (GVA->GPA) on
vIOMMU. pIOMMU needs to enable both 1st level and 2nd level translation
in nested mode (GVA->GPA->HPA) for passthrough device. IGD passthrough
is the main usage today (to support OpenCL 2.0 SVM feature). In the
future SVM might be used by other I/O devices too.

2. Xen vIOMMU Architecture


* vIOMMU will be inside the Xen hypervisor, for the following reasons:
1) Avoid round trips between Qemu and the Xen hypervisor
2) Ease of integration with the rest of the hypervisor
3) HVMlite/PVH doesn't use Qemu
* A dummy xen-vIOMMU in Qemu acts as a wrapper around the new hypercalls to
create/destroy the vIOMMU in the hypervisor and to handle a virtual PCI
device's 2nd level translation.

2.1 2nd level translation overview
For a virtual PCI device, the dummy xen-vIOMMU does the translation in
Qemu via the new hypercall.

For a physical PCI device, the vIOMMU in the hypervisor shadows the IO page
table from IOVA->GPA to IOVA->HPA and loads the shadow page table into the
physical IOMMU.
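
A sketch of the Qemu side of that path, using the l2_translation naming
suggested elsewhere in this thread (the domid variable and the sysctl
wrapper are placeholders):

static int xen_viommu_translate(uint16_t segment, uint8_t bus, uint8_t devfn,
                                uint64_t iova, uint64_t *gpa, uint64_t *mask)
{
    struct xen_sysctl_viommu_op op = {
        .cmd   = XEN_SYSCTL_viommu_dma_translation_for_vpdev,
        .domid = guest_domid,                  /* placeholder */
    };

    /* Identify the virtual device and the DMA address to translate. */
    op.l2_translation.segment = segment;
    op.l2_translation.bus     = bus;
    op.l2_translation.devfn   = devfn;
    op.l2_translation.iova    = iova;

    /* Ask the vIOMMU in Xen for the translation. */
    if ( do_viommu_sysctl(&op) )               /* placeholder wrapper */
        return -1;

    *gpa  = op.l2_translation.translated_addr;
    *mask = op.l2_translation.addr_mask;
    return 0;
}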

The following diagram shows the 2nd level translation architecture.
+--------------------------------------------------------------+
| Qemu                                 +----------------+      |
|                                      |    Virtual     |      |
|                                      |   PCI device   |      |
|                                      +--------+-------+      |
|                                               | DMA          |
|                                               v              |
|  +--------------------+    Request   +--------+---------+    |
|  |                    +<-------------+                  |    |
|  |  Dummy xen vIOMMU  |  Target GPA  |  Memory region   |    |
|  |                    +------------->+                  |    |
|  +---------+----------+              +--------+---------+    |
|            |                                  |              |
|            | Hypercall                        |              |
+------------+----------------------------------+--------------+
| Hypervisor |                                  |              |
|            v                                  |              |
|   +--------+-----------+                      |              |
|   |       vIOMMU       |                      |              |
|   +--------+-----------+                      |              |
|            |                                  |              |
|            v                                  |              |
|   +--------+-----------+                      |              |
|   |    IOMMU driver    |                      |              |
|   +--------+-----------+                      |              |
|            |                                  |              |
+------------+----------------------------------+--------------+
| HW         v                                  v              |
|   +--------+-----------+             +--------+-----+        |
|   |       IOMMU        +------------>+    Memory    |        |
|   +--------+-----------+             +--------------+        |
|            ^                                                 |
|            |                                                 |
|   +--------+-----------+