Re: [RFC 2/3] virtio-iommu: device probing and operations

2017-04-18 Thread Jean-Philippe Brucker
On 18/04/17 11:26, Tian, Kevin wrote:
>> From: Jean-Philippe Brucker
>> Sent: Saturday, April 8, 2017 3:18 AM
>>
> [...]
>>   II. Feature bits
>>   
>>
>> VIRTIO_IOMMU_F_INPUT_RANGE (0)
>>  Available range of virtual addresses is described in input_range
> 
> Usually only the maximum supported address bits are important. 
> Curious do you see such situation where low end of the address 
> space is not usable (since you have both start/end defined later)?

A start address would allow to provide something resembling a GART to the
guest: an IOMMU with one address space (ioasid_bits=0) and a small IOVA
aperture. I'm not sure how useful that would be in practice.

On a related note, the virtio-iommu itself doesn't provide a
per-address-space aperture as it stands. For example, attaching a device
to an address space might restrict the available IOVA range for the whole
AS if that device cannot write to high memory (above 32-bit). If the guest
attempts to map an IOVA outside this window into the device's address
space, it should expect the MAP request to fail. And when attaching, if
the address space already has mappings outside this window, then ATTACH
should fail.

This too seems to be something that ought to be communicated by firmware,
but bits are missing (I can't find anything equivalent to DT's dma-ranges
for PCI root bridges in ACPI tables, for example). In addition VFIO
doesn't communicate any DMA mask for devices, and doesn't check them
itself. I guess that the host could find out the DMA mask of devices one
way or another, but it is tricky to enforce, so I didn't make this a hard
requirement. Although I should probably add a few words about it.

> [...]
>>   1. Attach device
>>   
>>
>> struct virtio_iommu_req_attach {
>>  le32address_space;
>>  le32device;
>>  le32flags/reserved;
>> };
>>
>> Attach a device to an address space. 'address_space' is an identifier
>> unique to the guest. If the address space doesn't exist in the IOMMU
> 
> Based on your description this address space ID is per operation right?
> MAP/UNMAP and page-table sharing should have different ID spaces...

I think it's simpler if we keep a single IOASID space per virtio-iommu
device, because the maximum number of address spaces (described by
ioasid_bits) might be a restriction of the pIOMMU. For page-table sharing
you still need to define which devices will share a page directory using
ATTACH requests, though that interface is not set in stone.

>> device, it is created. 'device' is an identifier unique to the IOMMU. The
>> host communicates unique device ID to the guest during boot. The method
>> used to communicate this ID is outside the scope of this specification,
>> but the following rules must apply:
>>
>> * The device ID is unique from the IOMMU point of view. Multiple devices
>>   whose DMA transactions are not translated by the same IOMMU may have
>> the
>>   same device ID. Devices whose DMA transactions may be translated by the
>>   same IOMMU must have different device IDs.
>>
>> * Sometimes the host cannot completely isolate two devices from each
>>   others. For example on a legacy PCI bus, devices can snoop DMA
>>   transactions from their neighbours. In this case, the host must
>>   communicate to the guest that it cannot isolate these devices from each
>>   others. The method used to communicate this is outside the scope of this
>>   specification. The IOMMU device must ensure that devices that cannot be
> 
> "IOMMU device" -> "IOMMU driver"

Indeed

Thanks!
Jean-Philippe

>>   isolated by the host have the same address spaces.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC 1/3] virtio-iommu: firmware description of the virtual topology

2017-04-18 Thread Jean-Philippe Brucker
On 18/04/17 10:51, Tian, Kevin wrote:
>> From: Jean-Philippe Brucker
>> Sent: Saturday, April 8, 2017 3:18 AM
>>
>> Unlike other virtio devices, the virtio-iommu doesn't work independently,
>> it is linked to other virtual or assigned devices. So before jumping into
>> device operations, we need to define a way for the guest to discover the
>> virtual IOMMU and the devices it translates.
>>
>> The host must describe the relation between IOMMU and devices to the
>> guest
>> using either device-tree or ACPI. The virtual IOMMU identifies each
> 
> Do you plan to support both device tree and ACPI?

Yes, with ACPI the topology would be described using IORT nodes. I didn't
include an example in my driver because DT is sufficient for a prototype
and is readily available (both in Linux and kvmtool), whereas IORT would
be quite easy to reuse in Linux, but isn't present in kvmtool at the
moment. However, both interfaces have to be supported for the virtio-iommu
to be portable.

>> virtual device with a 32-bit ID, that we will call "Device ID" in this
>> document. Device IDs are not necessarily unique system-wide, but they may
>> not overlap within a single virtual IOMMU. Device ID of passed-through
>> devices do not need to match IDs seen by the physical IOMMU.
>>
>> The virtual IOMMU uses virtio-mmio transport exclusively, not virtio-pci,
>> because with PCI the IOMMU interface would itself be an endpoint, and
>> existing firmware interfaces don't allow to describe IOMMU<->master
>> relations between PCI endpoints.
> 
> I'm not familiar with virtio-mmio mechanism. Curious how devices in
> virtio-mmio are enumerated today? Could we use that mechanism to
> identify vIOMMUs and then invent a purely para-virtualized method to
> enumerate devices behind each vIOMMU?

Using DT, virtio-mmio devices are described with "virtio-mmio" compatible
node, and with ACPI they use _HID LNRO0005. Since the host already
describes available devices to a guest using a firmware interface, I think
we should reuse the tools provided by that interface for describing
relations between DMA masters and IOMMU.

> Asking this is because each vendor has its own enumeration methods.
> ARM has device tree and ACPI IORT. AMR has ACPI IVRS and device
> tree (same format as ARM?). Intel has APCI DMAR and sub-tables. Your 
> current proposal looks following ARM definitions which I'm not sure 
> extensible enough to cover features defined only in other vendors' 
> structures.

ACPI IORT can be extended to incorporate para-virtualized IOMMUs,
regardless of the underlying architecture. It isn't defined solely for the
ARM SMMU, but serves a more general purpose of describing a map of device
identifiers communicated from one components to another. Both DMAR and
IVRS have such description (respectively DRHD and IVHD), but they are
designed for a specific IOMMU, whereas IORT could host other kinds.

It seems that all we really need is an interface that says "there is a
virtio-iommu at address X, here are the devices it translates and their
corresponding IDs", and both DT and ACPI IORT are able to fulfill this role.

> Since the purpose of this series is to go para-virtualize, why not also
> para-virtualize and simplify the enumeration method? For example, 
> we may define a query interface through vIOMMU registers to allow 
> guest query whether a device belonging to that vIOMMU. Then we 
> can even remove use of any enumeration structure completely... 
> Just a quick example which I may not think through all the pros and 
> cons. :-)

I don't think adding a brand new topology description mechanism is worth
the effort, we're better off reusing what already exists and is
implemented by operating systems. Adding a query interface inside the
vIOMMU may work (though might be very painful to integrate with fwspec in
Linux), but would be redundant since the host has to provide a firmware
description of the system anyway.

>> The following diagram describes a situation where two virtual IOMMUs
>> translate traffic from devices in the system. vIOMMU 1 translates two PCI
>> domains, in which each function has a 16-bits requester ID. In order for
>> the vIOMMU to differentiate guest requests targeted at devices in each
>> domain, their Device ID ranges cannot overlap. vIOMMU 2 translates two PCI
>> domains and a collection of platform devices.
>>
>>Device IDRequester ID
>>   /   0x0   0x0  \
>>  / | |PCI domain 1
>> /  0x   0x   /
>> vIOMMU 1
>> \ 0x1   0x0  \
>>  \ | |PCI domain 2
>>   \   0x1   0x   /
>>
>>   /   0x0\
>>  / |  platform devices
>> /  0x1fff/
>> vIOMMU 2
>>   

RE: [RFC 2/3] virtio-iommu: device probing and operations

2017-04-18 Thread Tian, Kevin
> From: Jean-Philippe Brucker
> Sent: Saturday, April 8, 2017 3:18 AM
> 
[...]
>   II. Feature bits
>   
> 
> VIRTIO_IOMMU_F_INPUT_RANGE (0)
>  Available range of virtual addresses is described in input_range

Usually only the maximum supported address bits are important. 
Curious do you see such situation where low end of the address 
space is not usable (since you have both start/end defined later)?

[...]
>   1. Attach device
>   
> 
> struct virtio_iommu_req_attach {
>   le32address_space;
>   le32device;
>   le32flags/reserved;
> };
> 
> Attach a device to an address space. 'address_space' is an identifier
> unique to the guest. If the address space doesn't exist in the IOMMU

Based on your description this address space ID is per operation right?
MAP/UNMAP and page-table sharing should have different ID spaces...

> device, it is created. 'device' is an identifier unique to the IOMMU. The
> host communicates unique device ID to the guest during boot. The method
> used to communicate this ID is outside the scope of this specification,
> but the following rules must apply:
> 
> * The device ID is unique from the IOMMU point of view. Multiple devices
>   whose DMA transactions are not translated by the same IOMMU may have
> the
>   same device ID. Devices whose DMA transactions may be translated by the
>   same IOMMU must have different device IDs.
> 
> * Sometimes the host cannot completely isolate two devices from each
>   others. For example on a legacy PCI bus, devices can snoop DMA
>   transactions from their neighbours. In this case, the host must
>   communicate to the guest that it cannot isolate these devices from each
>   others. The method used to communicate this is outside the scope of this
>   specification. The IOMMU device must ensure that devices that cannot be

"IOMMU device" -> "IOMMU driver"

>   isolated by the host have the same address spaces.
> 

Thanks
Kevin
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


RE: [RFC 1/3] virtio-iommu: firmware description of the virtual topology

2017-04-18 Thread Tian, Kevin
> From: Jean-Philippe Brucker
> Sent: Saturday, April 8, 2017 3:18 AM
> 
> Unlike other virtio devices, the virtio-iommu doesn't work independently,
> it is linked to other virtual or assigned devices. So before jumping into
> device operations, we need to define a way for the guest to discover the
> virtual IOMMU and the devices it translates.
> 
> The host must describe the relation between IOMMU and devices to the
> guest
> using either device-tree or ACPI. The virtual IOMMU identifies each

Do you plan to support both device tree and ACPI?

> virtual device with a 32-bit ID, that we will call "Device ID" in this
> document. Device IDs are not necessarily unique system-wide, but they may
> not overlap within a single virtual IOMMU. Device ID of passed-through
> devices do not need to match IDs seen by the physical IOMMU.
> 
> The virtual IOMMU uses virtio-mmio transport exclusively, not virtio-pci,
> because with PCI the IOMMU interface would itself be an endpoint, and
> existing firmware interfaces don't allow to describe IOMMU<->master
> relations between PCI endpoints.

I'm not familiar with virtio-mmio mechanism. Curious how devices in
virtio-mmio are enumerated today? Could we use that mechanism to
identify vIOMMUs and then invent a purely para-virtualized method to
enumerate devices behind each vIOMMU? 

Asking this is because each vendor has its own enumeration methods.
ARM has device tree and ACPI IORT. AMR has ACPI IVRS and device
tree (same format as ARM?). Intel has APCI DMAR and sub-tables. Your 
current proposal looks following ARM definitions which I'm not sure 
extensible enough to cover features defined only in other vendors' 
structures.

Since the purpose of this series is to go para-virtualize, why not also
para-virtualize and simplify the enumeration method? For example, 
we may define a query interface through vIOMMU registers to allow 
guest query whether a device belonging to that vIOMMU. Then we 
can even remove use of any enumeration structure completely... 
Just a quick example which I may not think through all the pros and 
cons. :-)

> 
> The following diagram describes a situation where two virtual IOMMUs
> translate traffic from devices in the system. vIOMMU 1 translates two PCI
> domains, in which each function has a 16-bits requester ID. In order for
> the vIOMMU to differentiate guest requests targeted at devices in each
> domain, their Device ID ranges cannot overlap. vIOMMU 2 translates two PCI
> domains and a collection of platform devices.
> 
>Device IDRequester ID
>   /   0x0   0x0  \
>  / | |PCI domain 1
> /  0x   0x   /
> vIOMMU 1
> \ 0x1   0x0  \
>  \ | |PCI domain 2
>   \   0x1   0x   /
> 
>   /   0x0\
>  / |  platform devices
> /  0x1fff/
> vIOMMU 2
> \  0x2000   0x0  \
>  \ | |PCI domain 3
>   \   0x11fff   0x   /
> 

isn't above be (0x3, 3) for PCI domain 3 giving device ID is 16bit?

Thanks
Kevin
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH v3 10/11] vmware: set cpu capabilities during platform initialization

2017-04-18 Thread Juergen Gross
There is no need to set the same capabilities for each cpu
individually. This can be done for all cpus in platform initialization.

Cc: Alok Kataria 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: "H. Peter Anvin" 
Cc: x...@kernel.org
Cc: virtualization@lists.linux-foundation.org
Signed-off-by: Juergen Gross 
Reviewed-by: Boris Ostrovsky 
Acked-by: Alok Kataria 
---
 arch/x86/kernel/cpu/vmware.c | 39 ---
 1 file changed, 20 insertions(+), 19 deletions(-)

diff --git a/arch/x86/kernel/cpu/vmware.c b/arch/x86/kernel/cpu/vmware.c
index 22403a28caf5..40ed26852ebd 100644
--- a/arch/x86/kernel/cpu/vmware.c
+++ b/arch/x86/kernel/cpu/vmware.c
@@ -113,6 +113,24 @@ static void __init vmware_paravirt_ops_setup(void)
 #define vmware_paravirt_ops_setup() do {} while (0)
 #endif
 
+/*
+ * VMware hypervisor takes care of exporting a reliable TSC to the guest.
+ * Still, due to timing difference when running on virtual cpus, the TSC can
+ * be marked as unstable in some cases. For example, the TSC sync check at
+ * bootup can fail due to a marginal offset between vcpus' TSCs (though the
+ * TSCs do not drift from each other).  Also, the ACPI PM timer clocksource
+ * is not suitable as a watchdog when running on a hypervisor because the
+ * kernel may miss a wrap of the counter if the vcpu is descheduled for a
+ * long time. To skip these checks at runtime we set these capability bits,
+ * so that the kernel could just trust the hypervisor with providing a
+ * reliable virtual TSC that is suitable for timekeeping.
+ */
+static void __init vmware_set_capabilities(void)
+{
+   setup_force_cpu_cap(X86_FEATURE_CONSTANT_TSC);
+   setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
+}
+
 static void __init vmware_platform_setup(void)
 {
uint32_t eax, ebx, ecx, edx;
@@ -152,6 +170,8 @@ static void __init vmware_platform_setup(void)
 #ifdef CONFIG_X86_IO_APIC
no_timer_check = 1;
 #endif
+
+   vmware_set_capabilities();
 }
 
 /*
@@ -176,24 +196,6 @@ static uint32_t __init vmware_platform(void)
return 0;
 }
 
-/*
- * VMware hypervisor takes care of exporting a reliable TSC to the guest.
- * Still, due to timing difference when running on virtual cpus, the TSC can
- * be marked as unstable in some cases. For example, the TSC sync check at
- * bootup can fail due to a marginal offset between vcpus' TSCs (though the
- * TSCs do not drift from each other).  Also, the ACPI PM timer clocksource
- * is not suitable as a watchdog when running on a hypervisor because the
- * kernel may miss a wrap of the counter if the vcpu is descheduled for a
- * long time. To skip these checks at runtime we set these capability bits,
- * so that the kernel could just trust the hypervisor with providing a
- * reliable virtual TSC that is suitable for timekeeping.
- */
-static void vmware_set_cpu_features(struct cpuinfo_x86 *c)
-{
-   set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC);
-   set_cpu_cap(c, X86_FEATURE_TSC_RELIABLE);
-}
-
 /* Checks if hypervisor supports x2apic without VT-D interrupt remapping. */
 static bool __init vmware_legacy_x2apic_available(void)
 {
@@ -206,7 +208,6 @@ static bool __init vmware_legacy_x2apic_available(void)
 const __refconst struct hypervisor_x86 x86_hyper_vmware = {
.name   = "VMware",
.detect = vmware_platform,
-   .set_cpu_features   = vmware_set_cpu_features,
.init_platform  = vmware_platform_setup,
.x2apic_available   = vmware_legacy_x2apic_available,
 };
-- 
2.12.0

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH v3 00/11] x86: xen cpuid() cleanup

2017-04-18 Thread Juergen Gross
Reduce special casing of xen_cpuid() by using cpu capabilities instead
of faked cpuid nodes.

This cleanup enables us remove the hypervisor specific set_cpu_features
callback as the same effect can be reached via
setup_[clear|force]_cpu_cap().

Removing the rest faked nodes from xen_cpuid() requires some more work
as the remaining cases (mwait leafs and extended topology info) have
to be handled at the consumer sides of this information.

Changes in V3:
- rewrite patch 9 (xsave) to use xgetbv instruction to test for osxsave
  availability (Andrew Cooper)

Changes in V2:
- added several features to this scheme
- removed hypervisor specific set_cpu_features() callbacks

Cc: Alok Kataria  
Cc: Thomas Gleixner  
Cc: Ingo Molnar  
Cc: "H. Peter Anvin"  
Cc: x...@kernel.org 
Cc: virtualization@lists.linux-foundation.org

Juergen Gross (11):
  xen: set cpu capabilities from xen_start_kernel()
  x86/xen: don't indicate DCA support in pv domains
  x86/xen: use capabilities instead of fake cpuid values for aperf
  x86/xen: use capabilities instead of fake cpuid values for mtrr
  x86/xen: use capabilities instead of fake cpuid values for acc
  x86/xen: use capabilities instead of fake cpuid values for acpi
  x86/xen: use capabilities instead of fake cpuid values for mwait
  x86/xen: use capabilities instead of fake cpuid values for x2apic
  x86/xen: use capabilities instead of fake cpuid values for xsave
  vmware: set cpu capabilities during platform initialization
  x86/cpu: remove hypervisor specific set_cpu_features

 arch/x86/include/asm/hypervisor.h |  5 ---
 arch/x86/kernel/cpu/common.c  |  1 -
 arch/x86/kernel/cpu/hypervisor.c  |  8 
 arch/x86/kernel/cpu/vmware.c  | 39 
 arch/x86/xen/enlighten_pv.c   | 95 +--
 5 files changed, 61 insertions(+), 87 deletions(-)

-- 
2.12.0

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization