Re: [Xen-devel] Xen virtual IOMMU high level design doc V2

2016-11-03 Thread Lan, Tianyu



On 10/26/2016 5:39 PM, Jan Beulich wrote:

On 22.10.16 at 09:32,  wrote:

On 10/21/2016 4:36 AM, Andrew Cooper wrote:

3.5 Implementation consideration
VT-d spec doesn't define a capability bit for the l2 translation.
Architecturally there is no way to tell guest that l2 translation
capability is not available. Linux Intel IOMMU driver thinks l2
translation is always available when VTD exits and fail to be loaded
without l2 translation support even if interrupt remapping and l1
translation are available. So it needs to enable l2 translation first
before other functions.


What then is the purpose of the nested translation support bit in the
extended capability register?


It's to translate the output GPA from the first-level translation
(IOVA->GPA) to HPA.

For details, please see VT-d spec 3.8, Nested Translation:
"When Nesting Enable (NESTE) field is 1 in extended-context-entries,
requests-with-PASID translated through first-level translation are also
subjected to nested second-level translation. Such extended-context-
entries contain both the pointer to the PASID-table (which contains the
pointer to the first-level translation structures), and the pointer to
the second-level translation structures."


I didn't phrase my question very well.  I understand what the nested
translation bit means, but I don't understand why we have a problem
signalling the presence or lack of nested translations to the guest.

In other words, why can't we hide l2 translation from the guest by
simply clearing the nested translation capability?


You mean to indicate no support of l2 translation via the nested
translation bit? But nested translation is a different function from l2
translation, even from the guest's view, and nested translation only
works for requests with PASID (l1 translation).

The Linux intel-iommu driver enables l2 translation unconditionally and
frees the iommu instance when it fails to enable l2 translation.


In which cases the wording of your description is confusing: Instead of
"Linux Intel IOMMU driver thinks l2 translation is always available when
VTD exits and fail to be loaded without l2 translation support ..." how
about using something closer to what you've replied with last?

Jan



Hi All:
I have some updates about the implementation dependency between l2
translation (DMA translation) and irq remapping.

I found that the kernel parameter "intel_iommu=on" and the kconfig option
CONFIG_INTEL_IOMMU_DEFAULT_ON control the DMA translation function.
When neither is set, the IOMMU driver will not enable DMA translation
even if some vIOMMU registers show the l2 translation function as
available. In the meantime, the irq remapping function still works to
support >255 vcpus.

I checked that the RHEL, SLES, Oracle and Ubuntu distributions don't set
the kernel parameter or select the kconfig option. So we can emulate irq
remapping first, exposing only some capability bits (e.g. SAGAW in the
Capability Register) of l2 translation, to support >255 vcpus without
emulating l2 translation itself.

Exposing the l2 capability bits makes sure the IOMMU driver probes the
ACPI DMAR tables successfully, because the driver accesses these bits
while parsing the ACPI tables.

If someone adds the "intel_iommu=on" kernel parameter manually, the
IOMMU driver will panic the guest because it can't enable the DMA
remapping function via the GCMD register: the "Translation Enable
Status" bit in the GSTS register is never set by the vIOMMU. This
reflects the actual vIOMMU status (no l2 translation emulation) and
warns the user not to enable l2 translation.
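
For reference, the failure path is just the usual enable-and-poll
sequence on GCMD/GSTS. A simplified sketch of that wait loop (register
offsets and bits follow the VT-d register layout; the helpers and retry
logic are illustrative, not the actual Linux intel-iommu code):

/* Simplified sketch of the driver's "enable DMA translation" step.
 * A real driver would preserve the other GCMD enable bits; this only
 * illustrates why a vIOMMU without l2 emulation makes the driver give
 * up: the TES status bit is never observed. */
#include <stdbool.h>
#include <stdint.h>

#define DMAR_GCMD_REG  0x18          /* Global Command register        */
#define DMAR_GSTS_REG  0x1c          /* Global Status register         */
#define DMA_GCMD_TE    (1u << 31)    /* Translation Enable             */
#define DMA_GSTS_TES   (1u << 31)    /* Translation Enable Status      */

/* Assumed MMIO accessors, provided elsewhere. */
extern uint32_t mmio_read32(volatile uint8_t *base, unsigned int off);
extern void mmio_write32(volatile uint8_t *base, unsigned int off,
                         uint32_t val);

static bool enable_dma_translation(volatile uint8_t *reg_base)
{
    unsigned int retries = 1000000;

    mmio_write32(reg_base, DMAR_GCMD_REG, DMA_GCMD_TE);

    while (retries--) {
        if (mmio_read32(reg_base, DMAR_GSTS_REG) & DMA_GSTS_TES)
            return true;             /* hardware acknowledged enable   */
    }
    /* A vIOMMU that doesn't emulate l2 translation never sets TES, so
     * the driver concludes the hardware is broken and panics.         */
    return false;
}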






Re: [Xen-devel] Xen virtual IOMMU high level design doc V2

2016-10-28 Thread Lan Tianyu

On 2016年10月21日 04:36, Andrew Cooper wrote:

>>

>>> u64 iova;
>>> /* Out parameters. */
>>> u64 translated_addr;
>>> u64 addr_mask; /* Translation page size */
>>> IOMMUAccessFlags permisson;

>>
>> How is this translation intended to be used?  How do you plan to avoid
>> race conditions where qemu requests a translation, receives one, the
>> guest invalidated the mapping, and then qemu tries to use its translated
>> address?
>>
>> There are only two ways I can see of doing this race-free.  One is to
>> implement a "memcpy with translation" hypercall, and the other is to
>> require the use of ATS in the vIOMMU, where the guest OS is required to
>> wait for a positive response from the vIOMMU before it can safely reuse
>> the mapping.
>>
>> The former behaves like real hardware in that an intermediate entity
>> performs the translation without interacting with the DMA source.  The
>> latter explicitly exposing the fact that caching is going on at the
>> endpoint to the OS.

>
> The former one seems to move DMA operation into hypervisor but Qemu
> vIOMMU framework just passes IOVA to dummy xen-vIOMMU without input
> data and access length. I will dig more to figure out solution.

Yes - that does in principle actually move the DMA out of Qemu.


Hi Andrew:

The first solution ("move the DMA out of Qemu"): the Qemu vIOMMU
framework only gives the dummy xen-vIOMMU device model a chance to do
the DMA translation, while the DMA access operation itself lives in the
vIOMMU core code. It's hard to move this out. There are a lot of places
that call the translation callback, and some of these are not for DMA
access (e.g. mapping guest memory in Qemu).

The second solution ("use ATS to sync the invalidation operation"):
this requires enabling ATS for all virtual PCI devices, which is not
easy to do.

The following is my proposal:
When the IOMMU driver invalidates the IOTLB, it also waits for the
invalidation to complete. We may use this to drain in-flight DMA
operations.

The guest triggers an invalidation operation and traps into the vIOMMU
in the hypervisor to flush cached data. After this, the request should
go to Qemu to drain in-flight DMA translations.

To do that, the dummy vIOMMU in Qemu registers the same MMIO region as
the vIOMMU's, and the hypervisor's emulation of the invalidation
operation returns X86EMUL_UNHANDLEABLE after flushing its cache. The
MMIO emulation layer then forwards the access to Qemu, and the dummy
vIOMMU gets a chance to start a thread to drain in-flight DMA before
completing the emulation.

The guest polls the IVT (Invalidate IOTLB) bit in the IOTLB Invalidate
register until it is cleared. The dummy vIOMMU notifies the vIOMMU via
hypercall that the drain operation has completed, the vIOMMU clears the
IVT bit, and the guest finishes the invalidation operation.
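
A standalone model of that handshake (everything below is an
illustrative sketch, not Xen or Qemu code; the real logic would live in
the hypervisor's vIOMMU MMIO emulation and in Qemu's dummy xen-vIOMMU):

/* Standalone model of the proposed invalidation-drain handshake.
 * Register layout, helper names and the completion hypercall are
 * placeholders for illustration only. */
#include <stdint.h>

#define X86EMUL_OKAY          0      /* emulation finished in Xen       */
#define X86EMUL_UNHANDLEABLE  1      /* forward the access to Qemu      */
#define IOTLB_IVT             (1u << 31)

struct viommu {
    uint32_t iotlb_inv_reg;          /* guest-visible IOTLB inv. reg    */
};

/* Hypervisor side: flush the vIOMMU cache, leave IVT set, and decline
 * to complete the emulation so the same MMIO write reaches Qemu. */
static int viommu_iotlb_inv_write(struct viommu *vi, uint32_t val)
{
    if (!(val & IOTLB_IVT))
        return X86EMUL_OKAY;

    vi->iotlb_inv_reg = val;         /* IVT stays set while draining    */
    /* ... flush vIOMMU IOTLB / shadow mappings here ... */
    return X86EMUL_UNHANDLEABLE;     /* dummy vIOMMU in Qemu takes over */
}

/* Qemu side, once its drain thread finishes: report completion (in
 * reality via the new "drain done" hypercall), which lets the vIOMMU
 * clear IVT so the guest's polling loop terminates. */
static void dummy_viommu_drain_done(struct viommu *vi)
{
    vi->iotlb_inv_reg &= ~IOTLB_IVT;
}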

--
Best regards
Tianyu Lan



Re: [Xen-devel] Xen virtual IOMMU high level design doc V2

2016-10-26 Thread Lan, Tianyu



On 10/26/2016 5:39 PM, Jan Beulich wrote:

On 22.10.16 at 09:32,  wrote:

On 10/21/2016 4:36 AM, Andrew Cooper wrote:

3.5 Implementation consideration
VT-d spec doesn't define a capability bit for the l2 translation.
Architecturally there is no way to tell guest that l2 translation
capability is not available. Linux Intel IOMMU driver thinks l2
translation is always available when VTD exits and fail to be loaded
without l2 translation support even if interrupt remapping and l1
translation are available. So it needs to enable l2 translation first
before other functions.


What then is the purpose of the nested translation support bit in the
extended capability register?


It's to translate the output GPA from the first-level translation
(IOVA->GPA) to HPA.

For details, please see VT-d spec 3.8, Nested Translation:
"When Nesting Enable (NESTE) field is 1 in extended-context-entries,
requests-with-PASID translated through first-level translation are also
subjected to nested second-level translation. Such extended-context-
entries contain both the pointer to the PASID-table (which contains the
pointer to the first-level translation structures), and the pointer to
the second-level translation structures."


I didn't phrase my question very well.  I understand what the nested
translation bit means, but I don't understand why we have a problem
signalling the presence or lack of nested translations to the guest.

In other words, why can't we hide l2 translation from the guest by
simply clearing the nested translation capability?


You mean to indicate no support of l2 translation via the nested
translation bit? But nested translation is a different function from l2
translation, even from the guest's view, and nested translation only
works for requests with PASID (l1 translation).

The Linux intel-iommu driver enables l2 translation unconditionally and
frees the iommu instance when it fails to enable l2 translation.


In which cases the wording of your description is confusing: Instead of
"Linux Intel IOMMU driver thinks l2 translation is always available when
VTD exits and fail to be loaded without l2 translation support ..." how
about using something closer to what you've replied with last?



Sorry for my poor English. I will update the wording.



Re: [Xen-devel] Xen virtual IOMMU high level design doc V2

2016-10-26 Thread Lan, Tianyu

On 10/26/2016 5:36 PM, Jan Beulich wrote:

On 18.10.16 at 16:14,  wrote:

1.1 Enable more than 255 vcpu support
HPC cloud service requires VM provides high performance parallel
computing and we hope to create a huge VM with >255 vcpu on one machine
to meet such requirement.Ping each vcpus on separated pcpus. More than
255 vcpus support requires X2APIC and Linux disables X2APIC mode if
there is no interrupt remapping function which is present by vIOMMU.
Interrupt remapping function helps to deliver interrupt to #vcpu >255.
So we need to add vIOMMU before enabling >255 vcpus.


I continue to dislike this completely neglecting that we can't even
have >128 vCPU-s at present. Once again - there's other work to
be done prior to lack of vIOMMU becoming the limiting factor.



Yes, we can increase the vcpu count from 128 to 255 first, without
vIOMMU support. We have some draft patches to enable this. Andrew will
also rework the CPUID policy and change the rule for allocating vcpu
APIC IDs, so we will base the vcpu number increase on that work. The
vlapic also needs to be changed to support >255 APIC IDs. These jobs can
be done in parallel with the vIOMMU work.



Re: [Xen-devel] Xen virtual IOMMU high level design doc V2

2016-10-26 Thread Jan Beulich
>>> On 22.10.16 at 09:32,  wrote:
> On 10/21/2016 4:36 AM, Andrew Cooper wrote:
> 3.5 Implementation consideration
> VT-d spec doesn't define a capability bit for the l2 translation.
> Architecturally there is no way to tell guest that l2 translation
> capability is not available. Linux Intel IOMMU driver thinks l2
> translation is always available when VTD exits and fail to be loaded
> without l2 translation support even if interrupt remapping and l1
> translation are available. So it needs to enable l2 translation first
> before other functions.

 What then is the purpose of the nested translation support bit in the
 extended capability register?
>>>
>>> It's to translate the output GPA from the first-level translation
>>> (IOVA->GPA) to HPA.
>>>
>>> For details, please see VT-d spec 3.8, Nested Translation:
>>> "When Nesting Enable (NESTE) field is 1 in extended-context-entries,
>>> requests-with-PASID translated through first-level translation are also
>>> subjected to nested second-level translation. Such extended-context-
>>> entries contain both the pointer to the PASID-table (which contains the
>>> pointer to the first-level translation structures), and the pointer to
>>> the second-level translation structures."
>>
>> I didn't phrase my question very well.  I understand what the nested
>> translation bit means, but I don't understand why we have a problem
>> signalling the presence or lack of nested translations to the guest.
>>
>> In other words, why can't we hide l2 translation from the guest by
>> simply clearing the nested translation capability?
> 
> You mean to indicate no support of l2 translation via the nested
> translation bit? But nested translation is a different function from
> l2 translation, even from the guest's view, and nested translation
> only works for requests with PASID (l1 translation).
> 
> The Linux intel-iommu driver enables l2 translation unconditionally
> and frees the iommu instance when it fails to enable l2 translation.

In which cases the wording of your description is confusing: Instead of
"Linux Intel IOMMU driver thinks l2 translation is always available when
VTD exits and fail to be loaded without l2 translation support ..." how
about using something closer to what you've replied with last?

Jan



Re: [Xen-devel] Xen virtual IOMMU high level design doc V2

2016-10-26 Thread Jan Beulich
>>> On 18.10.16 at 16:14,  wrote:
> 1.1 Enable more than 255 vcpu support
> HPC cloud service requires VM provides high performance parallel
> computing and we hope to create a huge VM with >255 vcpu on one machine
> to meet such requirement.Ping each vcpus on separated pcpus. More than
> 255 vcpus support requires X2APIC and Linux disables X2APIC mode if
> there is no interrupt remapping function which is present by vIOMMU.
> Interrupt remapping function helps to deliver interrupt to #vcpu >255.
> So we need to add vIOMMU before enabling >255 vcpus.

I continue to dislike this completely neglecting that we can't even
have >128 vCPU-s at present. Once again - there's other work to
be done prior to lack of vIOMMU becoming the limiting factor.

Jan



Re: [Xen-devel] Xen virtual IOMMU high level design doc V2

2016-10-22 Thread Lan, Tianyu

On 10/21/2016 4:36 AM, Andrew Cooper wrote:







255 vcpus support requires X2APIC and Linux disables X2APIC mode if
there is no interrupt remapping function which is present by vIOMMU.
Interrupt remapping function helps to deliver interrupt to #vcpu >255.


This is only a requirement for xapic interrupt sources.  x2apic
interrupt sources already deliver correctly.


The key is the APIC ID. There is no modification to existing PCI MSI and
IOAPIC with the introduction of x2apic. PCI MSI/IOAPIC can only send
interrupt message containing 8bit APIC ID, which cannot address >255
cpus. Interrupt remapping supports 32bit APIC ID so it's necessary to
enable >255 cpus with x2apic mode.

If LAPIC is in x2apic while interrupt remapping is disabled, IOAPIC
cannot deliver interrupts to all cpus in the system if #cpu > 255.


After spending a long time reading up on this, my first observation is
that it is very difficult to find consistent information concerning the
expected content of MSI address/data fields for x86 hardware.  Having
said that, this has been very educational.

It is now clear that any MSI message can either specify an 8 bit APIC ID
directly, or request for the message to be remapped.  Apologies for my
earlier confusion.


Never mind. I will describe this in more detail in the next version.





3 Xen hypervisor
==


3.1 New hypercall XEN_SYSCTL_viommu_op
This hypercall should also support pv IOMMU which is still under RFC
review. Here only covers non-pv part.

1) Definition of "struct xen_sysctl_viommu_op" as new hypercall
parameter.


Why did you choose sysctl?  As these are per-domain, domctl would be a
more logical choice.  However, neither of these should be usable by
Qemu, and we are trying to split out "normal qemu operations" into dmops
which can be safely deprivileged.



Do you know what the status of dmop is now? I just found some
discussions about its design on the mailing list. Could we use domctl
first and move to dmop when it's ready?


I believe Paul is looking into respinning the series early in the 4.9
dev cycle.  I expect it won't take long until they are submitted.


OK, I got it. Thanks for the information.








Definition of VIOMMU subops:
#define XEN_SYSCTL_viommu_query_capability          0
#define XEN_SYSCTL_viommu_create                    1
#define XEN_SYSCTL_viommu_destroy                   2
#define XEN_SYSCTL_viommu_dma_translation_for_vpdev 3

Definition of VIOMMU capabilities:
#define XEN_VIOMMU_CAPABILITY_l1_translation        (1 << 0)
#define XEN_VIOMMU_CAPABILITY_l2_translation        (1 << 1)
#define XEN_VIOMMU_CAPABILITY_interrupt_remapping   (1 << 2)


How are vIOMMUs going to be modelled to guests?  On real hardware, they
all seem to end up associated with a PCI device of some sort, even if it is
just the LPC bridge.



This design just considers a single vIOMMU that covers all PCI devices
under its specified PCI segment. The "INCLUDE_PCI_ALL" bit of the DRHD
structure is set for this vIOMMU.


Even if the first implementation only supports a single vIOMMU, please
design the interface to cope with multiple.  It will save someone having
to go and break the API/ABI in the future when support for multiple
vIOMMUs is needed.


OK, got it.







How do we deal with multiple vIOMMUs in a single guest?


For multiple vIOMMUs, we need to add a new field in struct iommu_op to
designate the device scope of each vIOMMU if they are under the same
PCI segment. This also requires changes to the DMAR table.






2) Design for subops
- XEN_SYSCTL_viommu_query_capability
  Get vIOMMU capabilities (l1/l2 translation and interrupt remapping).

- XEN_SYSCTL_viommu_create
  Create a vIOMMU in the Xen hypervisor with dom_id, capabilities and
  register base address.

- XEN_SYSCTL_viommu_destroy
  Destroy the vIOMMU in the Xen hypervisor with dom_id as parameter.

- XEN_SYSCTL_viommu_dma_translation_for_vpdev
  Translate IOVA to GPA for a specified virtual PCI device, given the
  dom_id, the PCI device's bdf and the IOVA; the Xen hypervisor returns
  the translated GPA, address mask and access permission.
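
For illustration only, one possible shape of the hypercall parameter;
the field names beyond the quoted iova/translated_addr/addr_mask/
permission ones are guesses, not the proposed ABI:

/* Rough sketch only -- not the proposed interface. */
#include <stdint.h>

struct xen_sysctl_viommu_op {
    uint32_t cmd;                    /* XEN_SYSCTL_viommu_* subop         */
    uint32_t domid;                  /* target domain                     */
    union {
        struct {
            uint64_t capabilities;   /* OUT: XEN_VIOMMU_CAPABILITY_* bits */
        } query_capability;
        struct {
            uint64_t capabilities;   /* IN: capabilities to enable        */
            uint64_t base_address;   /* IN: vIOMMU register base (GPA)    */
        } create;
        struct {
            uint32_t vsbdf;          /* IN: virtual PCI device (seg/bdf)  */
            uint64_t iova;           /* IN: address to translate          */
            uint64_t translated_addr;/* OUT: GPA                          */
            uint64_t addr_mask;      /* OUT: translation page size        */
            uint32_t permission;     /* OUT: access flags                 */
        } dma_translation_for_vpdev;
    } u;
};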


3.2 l2 translation
1) For virtual PCI device
Xen dummy xen-vIOMMU in Qemu translates IOVA to target GPA via new
hypercall when DMA operation happens.

2) For physical PCI device
DMA operations go through physical IOMMU directly and IO page table for
IOVA->HPA should be loaded into physical IOMMU. When guest updates
l2 Page-table pointer field, it provides IO page table for
IOVA->GPA. vIOMMU needs to shadow l2 translation table, translate
GPA->HPA and update shadow page table(IOVA->HPA) pointer to l2
Page-table pointer to context entry of physical IOMMU.


How are you proposing to do this shadowing?  Do we need to trap and
emulate all writes to the vIOMMU pagetables, or is there a better way to
know when the mappings need invalidating?


No, we don't need to trap all writes to the IO page table.
From VT-d spec 6.1, "Reporting the Caching Mode as Set for the
virtual hardware requires the guest software to explicitly issue
invalidatio

Re: [Xen-devel] Xen virtual IOMMU high level design doc V2

2016-10-20 Thread Andrew Cooper

>
>>
>>> 255 vcpus support requires X2APIC and Linux disables X2APIC mode if
>>> there is no interrupt remapping function which is present by vIOMMU.
>>> Interrupt remapping function helps to deliver interrupt to #vcpu >255.
>>
>> This is only a requirement for xapic interrupt sources.  x2apic
>> interrupt sources already deliver correctly.
>
> The key is the APIC ID. There is no modification to existing PCI MSI and
> IOAPIC with the introduction of x2apic. PCI MSI/IOAPIC can only send
> interrupt message containing 8bit APIC ID, which cannot address >255
> cpus. Interrupt remapping supports 32bit APIC ID so it's necessary to
> enable >255 cpus with x2apic mode.
>
> If LAPIC is in x2apic while interrupt remapping is disabled, IOAPIC
> cannot deliver interrupts to all cpus in the system if #cpu > 255.

After spending a long time reading up on this, my first observation is
that it is very difficult to find consistent information concerning the
expected content of MSI address/data fields for x86 hardware.  Having
said that, this has been very educational.

It is now clear that any MSI message can either specify an 8 bit APIC ID
directly, or request for the message to be remapped.  Apologies for my
earlier confusion.

>
>>> [l2 translation architecture diagram, garbled in the archive: in
>>>  Qemu, the virtual PCI device's DMA request goes to the dummy xen
>>>  vIOMMU, which returns the target GPA for the memory region and talks
>>>  to the hypervisor's vIOMMU via hypercall; in hardware, the IOMMU,
>>>  programmed through Xen's IOMMU driver, translates DMA from the
>>>  physical PCI device to memory.]
>>>
>>> 2.2 Interrupt remapping overview.
>>> Interrupts from virtual devices and physical devices will be delivered
>>> to vLAPIC from vIOAPIC and vMSI. vIOMMU will remap interrupt during
>>> this
>>> procedure.
>>>
>>> [interrupt remapping diagram, garbled and truncated in the archive:
>>>  interrupts from the virtual device in Qemu and from physical devices
>>>  are delivered as VIRQs through the vLAPIC in the hypervisor to the
>>>  guest's IRQ subsystem and device driver.]

Re: [Xen-devel] Xen virtual IOMMU high level design doc V2

2016-10-20 Thread Andrew Cooper
On 20/10/16 10:53, Tian, Kevin wrote:
>> From: Andrew Cooper [mailto:andrew.coop...@citrix.com]
>> Sent: Wednesday, October 19, 2016 3:18 AM
>>
>>> 1.2 Support VFIO-based user space driver (e.g. DPDK) in the guest
>>> It relies on the l2 translation capability (IOVA->GPA) on
>>> vIOMMU. pIOMMU l2 becomes a shadowing structure of
>>> vIOMMU to isolate DMA requests initiated by user space driver.
>> How is userspace supposed to drive this interface?  I can't picture how
>> it would function.
> Inside a Linux VM, VFIO provides DMA MAP/UNMAP interface to user space
> driver so gIOVA->GPA mapping can be setup on vIOMMU. vIOMMU will 
> export a "caching mode" capability to indicate all guest PTE changes 
> requiring explicit vIOMMU cache invalidations. Through trapping of those
> invalidation requests, Xen can update corresponding shadow PTEs (gIOVA
> ->HPA). When DMA mapping is established, user space driver programs 
> gIOVA addresses as DMA destination to assigned device, and then upstreaming
> DMA request out of this device contains gIOVA which is translated to HPA
> by pIOMMU shadow page table.

Ok.  So in this mode, the userspace driver owns the device, and can
choose any arbitrary gIOVA layout it chooses?  If it also programs the
DMA addresses, I guess this setup is fine.

>
>>>
>>> 1.3 Support guest SVM (Shared Virtual Memory)
>>> It relies on the l1 translation table capability (GVA->GPA) on
>>> vIOMMU. pIOMMU needs to enable both l1 and l2 translation in nested
>>> mode (GVA->GPA->HPA) for passthrough device. IGD passthrough
>>> is the main usage today (to support OpenCL 2.0 SVM feature). In the
>>> future SVM might be used by other I/O devices too.
>> As an aside, how is IGD intending to support SVM?  Will it be with PCIe
>> ATS/PASID, or something rather more magic as IGD is on the same piece of
>> silicon?
> Although integrated, IGD conforms to standard PCIe PASID convention.

Ok.  Any idea when hardware with SVM will be available?

>
>>> 3.5 Implementation consideration
>>> VT-d spec doesn't define a capability bit for the l2 translation.
>>> Architecturally there is no way to tell guest that l2 translation
>>> capability is not available. Linux Intel IOMMU driver thinks l2
>>> translation is always available when VTD exits and fail to be loaded
>>> without l2 translation support even if interrupt remapping and l1
>>> translation are available. So it needs to enable l2 translation first
>>> before other functions.
>> What then is the purpose of the nested translation support bit in the
>> extended capability register?
>>
> Nested translation is for SVM virtualization. Given a DMA transaction 
> containing a PASID, VT-d engine first finds the 1st translation table 
> through PASID to translate from GVA to GPA, then once nested
> translation capability is enabled, further translate GPA to HPA using the
> 2nd level translation table. Bare-metal usage is not expected to turn
> on this nested bit.

Ok, but what happens if a guest sees a PASID-capable vIOMMU and itself
tries to turn on nesting?  E.g. nesting KVM inside Xen and trying to use
SVM from the L2 guest?

If there is no way to indicate to the L1 guest that nesting isn't
available (as it is already actually in use), and we can't shadow
entries on faults, what is supposed to happen?

~Andrew



Re: [Xen-devel] Xen virtual IOMMU high level design doc V2

2016-10-20 Thread Lan, Tianyu


On 10/19/2016 4:26 AM, Konrad Rzeszutek Wilk wrote:

On Tue, Oct 18, 2016 at 10:14:16PM +0800, Lan Tianyu wrote:



1 Motivation for Xen vIOMMU
===
1.1 Enable more than 255 vcpu support
HPC cloud service requires VM provides high performance parallel
computing and we hope to create a huge VM with >255 vcpu on one machine
to meet such requirement.Ping each vcpus on separated pcpus. More than
255 vcpus support requires X2APIC and Linux disables X2APIC mode if
there is no interrupt remapping function which is present by vIOMMU.
Interrupt remapping function helps to deliver interrupt to #vcpu >255.
So we need to add vIOMMU before enabling >255 vcpus.


What about Windows? Does it care about this?


From our test, a win8 guest crashes when booting up with 288 vcpus
without IR, and it can boot up with IR.



3.2 l2 translation
1) For virtual PCI device
Xen dummy xen-vIOMMU in Qemu translates IOVA to target GPA via new
hypercall when DMA operation happens.

2) For physical PCI device
DMA operations go through physical IOMMU directly and IO page table for
IOVA->HPA should be loaded into physical IOMMU. When guest updates
l2 Page-table pointer field, it provides IO page table for
IOVA->GPA. vIOMMU needs to shadow l2 translation table, translate
GPA->HPA and update shadow page table(IOVA->HPA) pointer to l2
Page-table pointer to context entry of physical IOMMU.

Now all PCI devices in same hvm domain share one IO page table
(GPA->HPA) in physical IOMMU driver of Xen. To support l2
translation of vIOMMU, IOMMU driver need to support multiple address
spaces per device entry. Using existing IO page table(GPA->HPA)
defaultly and switch to shadow IO page table(IOVA->HPA) when l2


defaultly?


I mean the GPA->HPA mapping will be set in the assigned device's
context entry of the pIOMMU when the VM is created, just like the
current code works.






3.3 Interrupt remapping
Interrupts from virtual devices and physical devices will be delivered
to vlapic from vIOAPIC and vMSI. It needs to add interrupt remapping
hooks in the vmsi_deliver() and ioapic_deliver() to find target vlapic
according interrupt remapping table.
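
As a very rough illustration of such a hook (the IRTE layout and every
helper below are placeholders, not existing Xen code):

/* Placeholder sketch: on delivery, a remappable-format interrupt
 * carries a handle rather than a destination; the hook looks up the
 * guest's Interrupt Remapping Table Entry and delivers to the vlapic
 * whose (possibly >255) APIC ID the IRTE names. */
#include <errno.h>
#include <stdint.h>

struct domain;                       /* hypervisor's domain handle       */

struct virte {                       /* simplified guest IRTE            */
    uint32_t present;
    uint32_t vector;
    uint32_t delivery_mode;
    uint32_t dest_id;                /* full 32-bit APIC ID              */
};

/* Made-up helpers standing in for "read the guest IRT from guest
 * memory" and "inject into the addressed vlapic". */
extern int viommu_read_irte(struct domain *d, uint16_t handle,
                            struct virte *irte);
extern void vlapic_deliver_by_apic_id(struct domain *d, uint32_t apic_id,
                                      uint32_t vector,
                                      uint32_t delivery_mode);

static int viommu_remap_and_deliver(struct domain *d, uint16_t handle)
{
    struct virte irte;

    if (viommu_read_irte(d, handle, &irte) != 0 || !irte.present)
        return -EINVAL;              /* fault, per the VT-d rules        */

    /* dest_id is 32 bits wide, so vcpus with APIC ID > 255 work too.    */
    vlapic_deliver_by_apic_id(d, irte.dest_id, irte.vector,
                              irte.delivery_mode);
    return 0;
}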


3.4 l1 translation
When nested translation is enabled, any address generated by l1
translation is used as the input address for nesting with l2
translation. Physical IOMMU needs to enable both l1 and l2 translation
in nested translation mode(GVA->GPA->HPA) for passthrough
device.

VT-d context entry points to guest l1 translation table which
will be nest-translated by l2 translation table and so it
can be directly linked to context entry of physical IOMMU.


I think this means that the shared_ept will be disabled?


The shared EPT (GPA->HPA mapping) is used to do nested translation
for any output from l1 translation (GVA->GPA).






Re: [Xen-devel] Xen virtual IOMMU high level design doc V2

2016-10-20 Thread Lan Tianyu

Hi Andrew:
Thanks for your review.

On 2016年10月19日 03:17, Andrew Cooper wrote:

On 18/10/16 15:14, Lan Tianyu wrote:

Change since V1:
1) Update motivation for Xen vIOMMU - 288 vcpus support part
2) Change definition of struct xen_sysctl_viommu_op
3) Update "3.5 Implementation consideration" to explain why we
needs to enable l2 translation first.
4) Update "4.3 Q35 vs I440x" - Linux/Windows VTD drivers can work
on the emulated I440 chipset.
5) Remove stale statement in the "3.3 Interrupt remapping"

Content:
===

1. Motivation of vIOMMU
1.1 Enable more than 255 vcpus
1.2 Support VFIO-based user space driver
1.3 Support guest Shared Virtual Memory (SVM)
2. Xen vIOMMU Architecture
2.1 l2 translation overview
2.2 Interrupt remapping overview
3. Xen hypervisor
3.1 New vIOMMU hypercall interface
3.2 l2 translation
3.3 Interrupt remapping
3.4 l1 translation
3.5 Implementation consideration
4. Qemu
4.1 Qemu vIOMMU framework
4.2 Dummy xen-vIOMMU driver
4.3 Q35 vs. i440x
4.4 Report vIOMMU to hvmloader


1 Motivation for Xen vIOMMU
===

1.1 Enable more than 255 vcpu support
HPC cloud service requires VM provides high performance parallel
computing and we hope to create a huge VM with >255 vcpu on one machine
to meet such requirement.Ping each vcpus on separated pcpus. More than


Pin ?



Sorry, it's a typo.


Also, grammatically speaking, I think you mean "each vcpu to separate
pcpus".



Yes.




255 vcpus support requires X2APIC and Linux disables X2APIC mode if
there is no interrupt remapping function which is present by vIOMMU.
Interrupt remapping function helps to deliver interrupt to #vcpu >255.


This is only a requirement for xapic interrupt sources.  x2apic
interrupt sources already deliver correctly.


The key is the APIC ID. There is no modification to existing PCI MSI and
IOAPIC with the introduction of x2apic. PCI MSI/IOAPIC can only send
interrupt message containing 8bit APIC ID, which cannot address >255
cpus. Interrupt remapping supports 32bit APIC ID so it's necessary to
enable >255 cpus with x2apic mode.

If LAPIC is in x2apic while interrupt remapping is disabled, IOAPIC
cannot deliver interrupts to all cpus in the system if #cpu > 255.
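
To illustrate the address-format point (bit layout abbreviated from the
VT-d spec; the decoder below is purely for illustration):

/* Why xapic-format interrupt sources top out at APIC ID 255: a
 * compatibility-format MSI address carries only an 8-bit destination
 * ID in bits 19:12, while a remappable-format message (bit 4 set)
 * carries an IRTE handle and the full 32-bit destination lives in the
 * interrupt remapping table entry instead.  Illustration only. */
#include <stdint.h>
#include <stdio.h>

static void decode_msi_address(uint32_t addr)
{
    if (addr & (1u << 4)) {                        /* remappable format  */
        uint32_t handle = ((addr >> 5) & 0x7fff) |
                          (((addr >> 2) & 1) << 15);
        printf("remappable: IRTE handle %u (32-bit dest in the IRTE)\n",
               handle);
    } else {                                       /* compatibility      */
        uint8_t dest = (addr >> 12) & 0xff;        /* only 8 bits here   */
        printf("compatibility: dest APIC ID %u (cannot exceed 255)\n",
               dest);
    }
}

int main(void)
{
    decode_msi_address(0xFEE00000u | (42u << 12)); /* xapic, APIC ID 42  */
    decode_msi_address(0xFEE00010u | (300u << 5)); /* remapped, handle 300 */
    return 0;
}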







1.3 Support guest SVM (Shared Virtual Memory)
It relies on the l1 translation table capability (GVA->GPA) on
vIOMMU. pIOMMU needs to enable both l1 and l2 translation in nested
mode (GVA->GPA->HPA) for passthrough device. IGD passthrough
is the main usage today (to support OpenCL 2.0 SVM feature). In the
future SVM might be used by other I/O devices too.


As an aside, how is IGD intending to support SVM?  Will it be with PCIe
ATS/PASID, or something rather more magic as IGD is on the same piece of
silicon?


IGD on Skylake supports PCIe PASID.






2. Xen vIOMMU Architecture



* vIOMMU will be inside Xen hypervisor for following factors
1) Avoid round trips between Qemu and Xen hypervisor
2) Ease of integration with the rest of the hypervisor
3) HVMlite/PVH doesn't use Qemu
* Dummy xen-vIOMMU in Qemu as a wrapper of new hypercall to create
/destory vIOMMU in hypervisor and deal with virtual PCI device's l2
translation.

2.1 l2 translation overview
For Virtual PCI device, dummy xen-vIOMMU does translation in the
Qemu via new hypercall.

For physical PCI device, vIOMMU in hypervisor shadows IO page table from
IOVA->GPA to IOVA->HPA and load page table to physical IOMMU.

The following diagram shows l2 translation architecture.


Which scenario is this?  Is this the passthrough case where the Qemu
Virtual PCI device is a shadow of the real PCI device in hardware?



No, this covers both the traditional virtual PCI devices emulated by
Qemu and passthrough PCI devices.



[l2 translation architecture diagram, garbled and truncated in the
archive: the same picture as in the design doc -- Qemu's virtual PCI
device sends a DMA request to the dummy xen vIOMMU, which returns the
target GPA for the memory region and reaches the hypervisor's vIOMMU
via hypercall.]

Re: [Xen-devel] Xen virtual IOMMU high level design doc V2

2016-10-20 Thread Tian, Kevin
> From: Konrad Rzeszutek Wilk [mailto:konrad.w...@oracle.com]
> Sent: Wednesday, October 19, 2016 4:27 AM
> >
> > 2) For physical PCI device
> > DMA operations go through physical IOMMU directly and IO page table for
> > IOVA->HPA should be loaded into physical IOMMU. When guest updates
> > l2 Page-table pointer field, it provides IO page table for
> > IOVA->GPA. vIOMMU needs to shadow l2 translation table, translate
> > GPA->HPA and update shadow page table(IOVA->HPA) pointer to l2
> > Page-table pointer to context entry of physical IOMMU.
> >
> > Now all PCI devices in same hvm domain share one IO page table
> > (GPA->HPA) in physical IOMMU driver of Xen. To support l2
> > translation of vIOMMU, IOMMU driver need to support multiple address
> > spaces per device entry. Using existing IO page table(GPA->HPA)
> > defaultly and switch to shadow IO page table(IOVA->HPA) when l2
> 
> defaultly?
> 
> > translation function is enabled. These change will not affect current
> > P2M logic.
> 
> What happens if the guests IO page tables have incorrect values?
> 
> For example the guest sets up the pagetables to cover some section
> of HPA ranges (which are all good and permitted). But then during execution
> the guest kernel decides to muck around with the pagetables and adds an HPA
> range that is outside what the guest has been allocated.
> 
> What then?

Shadow PTE is controlled by hypervisor. Whatever IOVA->GPA mapping in
guest PTE must be validated (IOVA->GPA->HPA) before updating into the
shadow PTE. So regardless of when guest mucks its PTE, the operation is
always trapped and validated. Why do you think there is a problem?

Also guest only sees GPA. All it can operate is GPA ranges.

> >
> > 3.3 Interrupt remapping
> > Interrupts from virtual devices and physical devices will be delivered
> > to vlapic from vIOAPIC and vMSI. It needs to add interrupt remapping
> > hooks in the vmsi_deliver() and ioapic_deliver() to find target vlapic
> > according interrupt remapping table.
> >
> >
> > 3.4 l1 translation
> > When nested translation is enabled, any address generated by l1
> > translation is used as the input address for nesting with l2
> > translation. Physical IOMMU needs to enable both l1 and l2 translation
> > in nested translation mode(GVA->GPA->HPA) for passthrough
> > device.
> >
> > VT-d context entry points to guest l1 translation table which
> > will be nest-translated by l2 translation table and so it
> > can be directly linked to context entry of physical IOMMU.
> 
> I think this means that the shared_ept will be disabled?
> >
> What about different versions of contexts? Say the V1 is exposed
> to guest but the hardware supports V2? Are there any flags that have
> swapped positions? Or is it pretty backwards compatible?

yes, backward compatible.

> >
> >
> > 3.5 Implementation consideration
> > VT-d spec doesn't define a capability bit for the l2 translation.
> > Architecturally there is no way to tell guest that l2 translation
> > capability is not available. Linux Intel IOMMU driver thinks l2
> > translation is always available when VTD exits and fail to be loaded
> > without l2 translation support even if interrupt remapping and l1
> > translation are available. So it needs to enable l2 translation first
> 
> I am lost on that sentence. Are you saying that it tries to load
> the IOVA and if they fail.. then it keeps on going? What is the result
> of this? That you can't do IOVA (so can't use vfio ?)

It's about VT-d capability. VT-d supports both 1st-level and 2nd-level 
translation, however only the 1st-level translation can be optionally
reported through a capability bit. There is no capability bit to say
a version doesn't support 2nd-level translation. The implication is
that, as long as a vIOMMU is exposed, guest IOMMU driver always
assumes IOVA capability available thru 2nd level translation. 

So we can first emulate a vIOMMU w/ only 2nd-level capability, and
then extend it to support 1st-level and interrupt remapping, but cannot 
do the reverse direction. I think Tianyu's point is more to describe 
enabling sequence based on this fact. :-)

> > 4.1 Qemu vIOMMU framework
> > Qemu has a framework to create virtual IOMMU(e.g. virtual intel VTD and
> > AMD IOMMU) and report in guest ACPI table. So for Xen side, a dummy
> > xen-vIOMMU wrapper is required to connect with actual vIOMMU in Xen.
> > Especially for l2 translation of virtual PCI device because
> > emulations of virtual PCI devices are in the Qemu. Qemu's vIOMMU
> > framework provides callback to deal with l2 translation when
> > DMA operations of virtual PCI devices happen.
> 
> You say AMD and Intel. This sounds quite OS agnostic. Does it mean you
> could expose an vIOMMU to a guest and actually use the AMD IOMMU
> in the hypervisor?

Did you mean "expose an Intel vIOMMU to guest and then use physical
AMD IOMMU in hypervisor"? I didn't think about this, but what's the value
of doing so? :-)
 
Thanks
Kevin


Re: [Xen-devel] Xen virtual IOMMU high level design doc V2

2016-10-20 Thread Tian, Kevin
> From: Andrew Cooper [mailto:andrew.coop...@citrix.com]
> Sent: Wednesday, October 19, 2016 3:18 AM
> 
> >
> > 1.2 Support VFIO-based user space driver (e.g. DPDK) in the guest
> > It relies on the l2 translation capability (IOVA->GPA) on
> > vIOMMU. pIOMMU l2 becomes a shadowing structure of
> > vIOMMU to isolate DMA requests initiated by user space driver.
> 
> How is userspace supposed to drive this interface?  I can't picture how
> it would function.

Inside a Linux VM, VFIO provides DMA MAP/UNMAP interface to user space
driver so gIOVA->GPA mapping can be setup on vIOMMU. vIOMMU will 
export a "caching mode" capability to indicate all guest PTE changes 
requiring explicit vIOMMU cache invalidations. Through trapping of those
invalidation requests, Xen can update corresponding shadow PTEs (gIOVA
->HPA). When DMA mapping is established, user space driver programs 
gIOVA addresses as DMA destination to assigned device, and then upstreaming
DMA request out of this device contains gIOVA which is translated to HPA
by pIOMMU shadow page table.
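
A minimal sketch of what that trap path could do (all helper names are
assumptions for illustration, not existing Xen functions):

/* On a trapped vIOMMU invalidation: re-walk the guest's IOVA->GPA
 * entry, translate GPA->HPA through the p2m, and install the result in
 * the shadow (IOVA->HPA) table the pIOMMU actually uses.  The helpers
 * guest_slpt_lookup, shadow_slpt_set, shadow_slpt_clear and
 * p2m_lookup_validated are made-up names. */
#include <errno.h>
#include <stdint.h>

#define PAGE_SIZE 4096               /* 4K granularity for the sketch    */

struct domain;                       /* hypervisor's domain handle       */

extern int guest_slpt_lookup(struct domain *d, uint64_t iova,
                             uint64_t *gpa, unsigned int *perm);
extern int p2m_lookup_validated(struct domain *d, uint64_t gpa,
                                uint64_t *hpa);
extern void shadow_slpt_set(struct domain *d, uint64_t iova,
                            uint64_t hpa, unsigned int perm);
extern void shadow_slpt_clear(struct domain *d, uint64_t iova);

static int shadow_update_on_invalidation(struct domain *d,
                                         uint64_t iova, uint64_t npages)
{
    for (uint64_t i = 0; i < npages; i++, iova += PAGE_SIZE) {
        uint64_t gpa, hpa;
        unsigned int perm;

        if (guest_slpt_lookup(d, iova, &gpa, &perm)) {   /* IOVA -> GPA */
            shadow_slpt_clear(d, iova);   /* guest unmapped this IOVA   */
            continue;
        }
        if (p2m_lookup_validated(d, gpa, &hpa))          /* GPA -> HPA  */
            return -EPERM;                /* GPA outside guest memory   */

        shadow_slpt_set(d, iova, hpa, perm);             /* IOVA -> HPA */
    }
    return 0;
}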

> 
> >
> >
> > 1.3 Support guest SVM (Shared Virtual Memory)
> > It relies on the l1 translation table capability (GVA->GPA) on
> > vIOMMU. pIOMMU needs to enable both l1 and l2 translation in nested
> > mode (GVA->GPA->HPA) for passthrough device. IGD passthrough
> > is the main usage today (to support OpenCL 2.0 SVM feature). In the
> > future SVM might be used by other I/O devices too.
> 
> As an aside, how is IGD intending to support SVM?  Will it be with PCIe
> ATS/PASID, or something rather more magic as IGD is on the same piece of
> silicon?

Although integrated, IGD conforms to standard PCIe PASID convention.

> > 3.5 Implementation consideration
> > VT-d spec doesn't define a capability bit for the l2 translation.
> > Architecturally there is no way to tell guest that l2 translation
> > capability is not available. Linux Intel IOMMU driver thinks l2
> > translation is always available when VTD exits and fail to be loaded
> > without l2 translation support even if interrupt remapping and l1
> > translation are available. So it needs to enable l2 translation first
> > before other functions.
> 
> What then is the purpose of the nested translation support bit in the
> extended capability register?
> 

Nested translation is for SVM virtualization. Given a DMA transaction 
containing a PASID, VT-d engine first finds the 1st translation table 
through PASID to translate from GVA to GPA, then once nested
translation capability is enabled, further translate GPA to HPA using the
2nd level translation table. Bare-metal usage is not expected to turn
on this nested bit.

Thanks
Kevin



Re: [Xen-devel] Xen virtual IOMMU high level design doc V2

2016-10-18 Thread Konrad Rzeszutek Wilk
On Tue, Oct 18, 2016 at 10:14:16PM +0800, Lan Tianyu wrote:
> Change since V1:
>   1) Update motivation for Xen vIOMMU - 288 vcpus support part
>   2) Change definition of struct xen_sysctl_viommu_op
>   3) Update "3.5 Implementation consideration" to explain why we needs to
> enable l2 translation first.
>   4) Update "4.3 Q35 vs I440x" - Linux/Windows VTD drivers can work on the
> emulated I440 chipset.
>   5) Remove stale statement in the "3.3 Interrupt remapping"
> 
> Content:
> ===
> 1. Motivation of vIOMMU
>   1.1 Enable more than 255 vcpus
>   1.2 Support VFIO-based user space driver
>   1.3 Support guest Shared Virtual Memory (SVM)
> 2. Xen vIOMMU Architecture
>   2.1 l2 translation overview
>   2.2 Interrupt remapping overview
> 3. Xen hypervisor
>   3.1 New vIOMMU hypercall interface
>   3.2 l2 translation
>   3.3 Interrupt remapping
>   3.4 l1 translation
>   3.5 Implementation consideration
> 4. Qemu
>   4.1 Qemu vIOMMU framework
>   4.2 Dummy xen-vIOMMU driver
>   4.3 Q35 vs. i440x
>   4.4 Report vIOMMU to hvmloader
> 
> 
> 1 Motivation for Xen vIOMMU
> ===
> 1.1 Enable more than 255 vcpu support
> HPC cloud service requires VM provides high performance parallel
> computing and we hope to create a huge VM with >255 vcpu on one machine
> to meet such requirement.Ping each vcpus on separated pcpus. More than
> 255 vcpus support requires X2APIC and Linux disables X2APIC mode if
> there is no interrupt remapping function which is present by vIOMMU.
> Interrupt remapping function helps to deliver interrupt to #vcpu >255.
> So we need to add vIOMMU before enabling >255 vcpus.

What about Windows? Does it care about this?

> 
> 1.2 Support VFIO-based user space driver (e.g. DPDK) in the guest
> It relies on the l2 translation capability (IOVA->GPA) on
> vIOMMU. pIOMMU l2 becomes a shadowing structure of
> vIOMMU to isolate DMA requests initiated by user space driver.
> 
> 
> 1.3 Support guest SVM (Shared Virtual Memory)
> It relies on the l1 translation table capability (GVA->GPA) on
> vIOMMU. pIOMMU needs to enable both l1 and l2 translation in nested
> mode (GVA->GPA->HPA) for passthrough device. IGD passthrough
> is the main usage today (to support OpenCL 2.0 SVM feature). In the
> future SVM might be used by other I/O devices too.
> 
> 2. Xen vIOMMU Architecture
> 
> 
> * vIOMMU will be inside Xen hypervisor for following factors
>   1) Avoid round trips between Qemu and Xen hypervisor
>   2) Ease of integration with the rest of the hypervisor
>   3) HVMlite/PVH doesn't use Qemu
> * Dummy xen-vIOMMU in Qemu as a wrapper of new hypercall to create
> /destory vIOMMU in hypervisor and deal with virtual PCI device's l2

destroy
> translation.
> 
> 2.1 l2 translation overview
> For Virtual PCI device, dummy xen-vIOMMU does translation in the
> Qemu via new hypercall.
> 
> For physical PCI device, vIOMMU in hypervisor shadows IO page table from
> IOVA->GPA to IOVA->HPA and load page table to physical IOMMU.
> 
> The following diagram shows l2 translation architecture.
> [l2 translation architecture diagram, garbled and truncated in the
>  archive: Qemu virtual PCI device -> DMA request -> dummy xen vIOMMU
>  -> target GPA / memory region, with a hypercall path down into the
>  hypervisor's vIOMMU and IOMMU driver.]

Re: [Xen-devel] Xen virtual IOMMU high level design doc V2

2016-10-18 Thread Andrew Cooper
On 18/10/16 15:14, Lan Tianyu wrote:
> Change since V1:
> 1) Update motivation for Xen vIOMMU - 288 vcpus support part
> 2) Change definition of struct xen_sysctl_viommu_op
> 3) Update "3.5 Implementation consideration" to explain why we
> needs to enable l2 translation first.
> 4) Update "4.3 Q35 vs I440x" - Linux/Windows VTD drivers can work
> on the emulated I440 chipset.
> 5) Remove stale statement in the "3.3 Interrupt remapping"
>
> Content:
> ===
>
> 1. Motivation of vIOMMU
> 1.1 Enable more than 255 vcpus
> 1.2 Support VFIO-based user space driver
> 1.3 Support guest Shared Virtual Memory (SVM)
> 2. Xen vIOMMU Architecture
> 2.1 l2 translation overview
> 2.2 Interrupt remapping overview
> 3. Xen hypervisor
> 3.1 New vIOMMU hypercall interface
> 3.2 l2 translation
> 3.3 Interrupt remapping
> 3.4 l1 translation
> 3.5 Implementation consideration
> 4. Qemu
> 4.1 Qemu vIOMMU framework
> 4.2 Dummy xen-vIOMMU driver
> 4.3 Q35 vs. i440x
> 4.4 Report vIOMMU to hvmloader
>
>
> 1 Motivation for Xen vIOMMU
> ===
>
> 1.1 Enable more than 255 vcpu support
> HPC cloud service requires VM provides high performance parallel
> computing and we hope to create a huge VM with >255 vcpu on one machine
> to meet such requirement.Ping each vcpus on separated pcpus. More than

Pin ?

Also, grammatically speaking, I think you mean "each vcpu to separate
pcpus".

> 255 vcpus support requires X2APIC and Linux disables X2APIC mode if
> there is no interrupt remapping function which is present by vIOMMU.
> Interrupt remapping function helps to deliver interrupt to #vcpu >255.

This is only a requirement for xapic interrupt sources.  x2apic
interrupt sources already deliver correctly.

> So we need to add vIOMMU before enabling >255 vcpus.
>
> 1.2 Support VFIO-based user space driver (e.g. DPDK) in the guest
> It relies on the l2 translation capability (IOVA->GPA) on
> vIOMMU. pIOMMU l2 becomes a shadowing structure of
> vIOMMU to isolate DMA requests initiated by user space driver.

How is userspace supposed to drive this interface?  I can't picture how
it would function.

>
>
> 1.3 Support guest SVM (Shared Virtual Memory)
> It relies on the l1 translation table capability (GVA->GPA) on
> vIOMMU. pIOMMU needs to enable both l1 and l2 translation in nested
> mode (GVA->GPA->HPA) for passthrough device. IGD passthrough
> is the main usage today (to support OpenCL 2.0 SVM feature). In the
> future SVM might be used by other I/O devices too.

As an aside, how is IGD intending to support SVM?  Will it be with PCIe
ATS/PASID, or something rather more magic as IGD is on the same piece of
silicon?

>
> 2. Xen vIOMMU Architecture
> 
>
>
> * vIOMMU will be inside Xen hypervisor for following factors
> 1) Avoid round trips between Qemu and Xen hypervisor
> 2) Ease of integration with the rest of the hypervisor
> 3) HVMlite/PVH doesn't use Qemu
> * Dummy xen-vIOMMU in Qemu as a wrapper of new hypercall to create
> /destory vIOMMU in hypervisor and deal with virtual PCI device's l2
> translation.
>
> 2.1 l2 translation overview
> For Virtual PCI device, dummy xen-vIOMMU does translation in the
> Qemu via new hypercall.
>
> For physical PCI device, vIOMMU in hypervisor shadows IO page table from
> IOVA->GPA to IOVA->HPA and load page table to physical IOMMU.
>
> The following diagram shows l2 translation architecture.

Which scenario is this?  Is this the passthrough case where the Qemu
Virtual PCI device is a shadow of the real PCI device in hardware?

> [l2 translation architecture diagram, garbled and truncated in the
>  archive: the same diagram quoted from the design doc.]