Re: [RFC 2/3] virtio-iommu: device probing and operations
On 18/04/17 11:26, Tian, Kevin wrote: >> From: Jean-Philippe Brucker >> Sent: Saturday, April 8, 2017 3:18 AM >> > [...] >> II. Feature bits >> >> >> VIRTIO_IOMMU_F_INPUT_RANGE (0) >> Available range of virtual addresses is described in input_range > > Usually only the maximum supported address bits are important. > Curious do you see such situation where low end of the address > space is not usable (since you have both start/end defined later)? A start address would allow to provide something resembling a GART to the guest: an IOMMU with one address space (ioasid_bits=0) and a small IOVA aperture. I'm not sure how useful that would be in practice. On a related note, the virtio-iommu itself doesn't provide a per-address-space aperture as it stands. For example, attaching a device to an address space might restrict the available IOVA range for the whole AS if that device cannot write to high memory (above 32-bit). If the guest attempts to map an IOVA outside this window into the device's address space, it should expect the MAP request to fail. And when attaching, if the address space already has mappings outside this window, then ATTACH should fail. This too seems to be something that ought to be communicated by firmware, but bits are missing (I can't find anything equivalent to DT's dma-ranges for PCI root bridges in ACPI tables, for example). In addition VFIO doesn't communicate any DMA mask for devices, and doesn't check them itself. I guess that the host could find out the DMA mask of devices one way or another, but it is tricky to enforce, so I didn't make this a hard requirement. Although I should probably add a few words about it. > [...] >> 1. Attach device >> >> >> struct virtio_iommu_req_attach { >> le32address_space; >> le32device; >> le32flags/reserved; >> }; >> >> Attach a device to an address space. 'address_space' is an identifier >> unique to the guest. If the address space doesn't exist in the IOMMU > > Based on your description this address space ID is per operation right? > MAP/UNMAP and page-table sharing should have different ID spaces... I think it's simpler if we keep a single IOASID space per virtio-iommu device, because the maximum number of address spaces (described by ioasid_bits) might be a restriction of the pIOMMU. For page-table sharing you still need to define which devices will share a page directory using ATTACH requests, though that interface is not set in stone. >> device, it is created. 'device' is an identifier unique to the IOMMU. The >> host communicates unique device ID to the guest during boot. The method >> used to communicate this ID is outside the scope of this specification, >> but the following rules must apply: >> >> * The device ID is unique from the IOMMU point of view. Multiple devices >> whose DMA transactions are not translated by the same IOMMU may have >> the >> same device ID. Devices whose DMA transactions may be translated by the >> same IOMMU must have different device IDs. >> >> * Sometimes the host cannot completely isolate two devices from each >> others. For example on a legacy PCI bus, devices can snoop DMA >> transactions from their neighbours. In this case, the host must >> communicate to the guest that it cannot isolate these devices from each >> others. The method used to communicate this is outside the scope of this >> specification. The IOMMU device must ensure that devices that cannot be > > "IOMMU device" -> "IOMMU driver" Indeed Thanks! Jean-Philippe >> isolated by the host have the same address spaces. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [RFC 1/3] virtio-iommu: firmware description of the virtual topology
On 18/04/17 10:51, Tian, Kevin wrote: >> From: Jean-Philippe Brucker >> Sent: Saturday, April 8, 2017 3:18 AM >> >> Unlike other virtio devices, the virtio-iommu doesn't work independently, >> it is linked to other virtual or assigned devices. So before jumping into >> device operations, we need to define a way for the guest to discover the >> virtual IOMMU and the devices it translates. >> >> The host must describe the relation between IOMMU and devices to the >> guest >> using either device-tree or ACPI. The virtual IOMMU identifies each > > Do you plan to support both device tree and ACPI? Yes, with ACPI the topology would be described using IORT nodes. I didn't include an example in my driver because DT is sufficient for a prototype and is readily available (both in Linux and kvmtool), whereas IORT would be quite easy to reuse in Linux, but isn't present in kvmtool at the moment. However, both interfaces have to be supported for the virtio-iommu to be portable. >> virtual device with a 32-bit ID, that we will call "Device ID" in this >> document. Device IDs are not necessarily unique system-wide, but they may >> not overlap within a single virtual IOMMU. Device ID of passed-through >> devices do not need to match IDs seen by the physical IOMMU. >> >> The virtual IOMMU uses virtio-mmio transport exclusively, not virtio-pci, >> because with PCI the IOMMU interface would itself be an endpoint, and >> existing firmware interfaces don't allow to describe IOMMU<->master >> relations between PCI endpoints. > > I'm not familiar with virtio-mmio mechanism. Curious how devices in > virtio-mmio are enumerated today? Could we use that mechanism to > identify vIOMMUs and then invent a purely para-virtualized method to > enumerate devices behind each vIOMMU? Using DT, virtio-mmio devices are described with "virtio-mmio" compatible node, and with ACPI they use _HID LNRO0005. Since the host already describes available devices to a guest using a firmware interface, I think we should reuse the tools provided by that interface for describing relations between DMA masters and IOMMU. > Asking this is because each vendor has its own enumeration methods. > ARM has device tree and ACPI IORT. AMR has ACPI IVRS and device > tree (same format as ARM?). Intel has APCI DMAR and sub-tables. Your > current proposal looks following ARM definitions which I'm not sure > extensible enough to cover features defined only in other vendors' > structures. ACPI IORT can be extended to incorporate para-virtualized IOMMUs, regardless of the underlying architecture. It isn't defined solely for the ARM SMMU, but serves a more general purpose of describing a map of device identifiers communicated from one components to another. Both DMAR and IVRS have such description (respectively DRHD and IVHD), but they are designed for a specific IOMMU, whereas IORT could host other kinds. It seems that all we really need is an interface that says "there is a virtio-iommu at address X, here are the devices it translates and their corresponding IDs", and both DT and ACPI IORT are able to fulfill this role. > Since the purpose of this series is to go para-virtualize, why not also > para-virtualize and simplify the enumeration method? For example, > we may define a query interface through vIOMMU registers to allow > guest query whether a device belonging to that vIOMMU. Then we > can even remove use of any enumeration structure completely... > Just a quick example which I may not think through all the pros and > cons. :-) I don't think adding a brand new topology description mechanism is worth the effort, we're better off reusing what already exists and is implemented by operating systems. Adding a query interface inside the vIOMMU may work (though might be very painful to integrate with fwspec in Linux), but would be redundant since the host has to provide a firmware description of the system anyway. >> The following diagram describes a situation where two virtual IOMMUs >> translate traffic from devices in the system. vIOMMU 1 translates two PCI >> domains, in which each function has a 16-bits requester ID. In order for >> the vIOMMU to differentiate guest requests targeted at devices in each >> domain, their Device ID ranges cannot overlap. vIOMMU 2 translates two PCI >> domains and a collection of platform devices. >> >>Device IDRequester ID >> / 0x0 0x0 \ >> / | |PCI domain 1 >> / 0x 0x / >> vIOMMU 1 >> \ 0x1 0x0 \ >> \ | |PCI domain 2 >> \ 0x1 0x / >> >> / 0x0\ >> / | platform devices >> / 0x1fff/ >> vIOMMU 2 >>
RE: [RFC 2/3] virtio-iommu: device probing and operations
> From: Jean-Philippe Brucker > Sent: Saturday, April 8, 2017 3:18 AM > [...] > II. Feature bits > > > VIRTIO_IOMMU_F_INPUT_RANGE (0) > Available range of virtual addresses is described in input_range Usually only the maximum supported address bits are important. Curious do you see such situation where low end of the address space is not usable (since you have both start/end defined later)? [...] > 1. Attach device > > > struct virtio_iommu_req_attach { > le32address_space; > le32device; > le32flags/reserved; > }; > > Attach a device to an address space. 'address_space' is an identifier > unique to the guest. If the address space doesn't exist in the IOMMU Based on your description this address space ID is per operation right? MAP/UNMAP and page-table sharing should have different ID spaces... > device, it is created. 'device' is an identifier unique to the IOMMU. The > host communicates unique device ID to the guest during boot. The method > used to communicate this ID is outside the scope of this specification, > but the following rules must apply: > > * The device ID is unique from the IOMMU point of view. Multiple devices > whose DMA transactions are not translated by the same IOMMU may have > the > same device ID. Devices whose DMA transactions may be translated by the > same IOMMU must have different device IDs. > > * Sometimes the host cannot completely isolate two devices from each > others. For example on a legacy PCI bus, devices can snoop DMA > transactions from their neighbours. In this case, the host must > communicate to the guest that it cannot isolate these devices from each > others. The method used to communicate this is outside the scope of this > specification. The IOMMU device must ensure that devices that cannot be "IOMMU device" -> "IOMMU driver" > isolated by the host have the same address spaces. > Thanks Kevin ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
RE: [RFC 1/3] virtio-iommu: firmware description of the virtual topology
> From: Jean-Philippe Brucker > Sent: Saturday, April 8, 2017 3:18 AM > > Unlike other virtio devices, the virtio-iommu doesn't work independently, > it is linked to other virtual or assigned devices. So before jumping into > device operations, we need to define a way for the guest to discover the > virtual IOMMU and the devices it translates. > > The host must describe the relation between IOMMU and devices to the > guest > using either device-tree or ACPI. The virtual IOMMU identifies each Do you plan to support both device tree and ACPI? > virtual device with a 32-bit ID, that we will call "Device ID" in this > document. Device IDs are not necessarily unique system-wide, but they may > not overlap within a single virtual IOMMU. Device ID of passed-through > devices do not need to match IDs seen by the physical IOMMU. > > The virtual IOMMU uses virtio-mmio transport exclusively, not virtio-pci, > because with PCI the IOMMU interface would itself be an endpoint, and > existing firmware interfaces don't allow to describe IOMMU<->master > relations between PCI endpoints. I'm not familiar with virtio-mmio mechanism. Curious how devices in virtio-mmio are enumerated today? Could we use that mechanism to identify vIOMMUs and then invent a purely para-virtualized method to enumerate devices behind each vIOMMU? Asking this is because each vendor has its own enumeration methods. ARM has device tree and ACPI IORT. AMR has ACPI IVRS and device tree (same format as ARM?). Intel has APCI DMAR and sub-tables. Your current proposal looks following ARM definitions which I'm not sure extensible enough to cover features defined only in other vendors' structures. Since the purpose of this series is to go para-virtualize, why not also para-virtualize and simplify the enumeration method? For example, we may define a query interface through vIOMMU registers to allow guest query whether a device belonging to that vIOMMU. Then we can even remove use of any enumeration structure completely... Just a quick example which I may not think through all the pros and cons. :-) > > The following diagram describes a situation where two virtual IOMMUs > translate traffic from devices in the system. vIOMMU 1 translates two PCI > domains, in which each function has a 16-bits requester ID. In order for > the vIOMMU to differentiate guest requests targeted at devices in each > domain, their Device ID ranges cannot overlap. vIOMMU 2 translates two PCI > domains and a collection of platform devices. > >Device IDRequester ID > / 0x0 0x0 \ > / | |PCI domain 1 > / 0x 0x / > vIOMMU 1 > \ 0x1 0x0 \ > \ | |PCI domain 2 > \ 0x1 0x / > > / 0x0\ > / | platform devices > / 0x1fff/ > vIOMMU 2 > \ 0x2000 0x0 \ > \ | |PCI domain 3 > \ 0x11fff 0x / > isn't above be (0x3, 3) for PCI domain 3 giving device ID is 16bit? Thanks Kevin ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v3 10/11] vmware: set cpu capabilities during platform initialization
There is no need to set the same capabilities for each cpu individually. This can be done for all cpus in platform initialization. Cc: Alok KatariaCc: Thomas Gleixner Cc: Ingo Molnar Cc: "H. Peter Anvin" Cc: x...@kernel.org Cc: virtualization@lists.linux-foundation.org Signed-off-by: Juergen Gross Reviewed-by: Boris Ostrovsky Acked-by: Alok Kataria --- arch/x86/kernel/cpu/vmware.c | 39 --- 1 file changed, 20 insertions(+), 19 deletions(-) diff --git a/arch/x86/kernel/cpu/vmware.c b/arch/x86/kernel/cpu/vmware.c index 22403a28caf5..40ed26852ebd 100644 --- a/arch/x86/kernel/cpu/vmware.c +++ b/arch/x86/kernel/cpu/vmware.c @@ -113,6 +113,24 @@ static void __init vmware_paravirt_ops_setup(void) #define vmware_paravirt_ops_setup() do {} while (0) #endif +/* + * VMware hypervisor takes care of exporting a reliable TSC to the guest. + * Still, due to timing difference when running on virtual cpus, the TSC can + * be marked as unstable in some cases. For example, the TSC sync check at + * bootup can fail due to a marginal offset between vcpus' TSCs (though the + * TSCs do not drift from each other). Also, the ACPI PM timer clocksource + * is not suitable as a watchdog when running on a hypervisor because the + * kernel may miss a wrap of the counter if the vcpu is descheduled for a + * long time. To skip these checks at runtime we set these capability bits, + * so that the kernel could just trust the hypervisor with providing a + * reliable virtual TSC that is suitable for timekeeping. + */ +static void __init vmware_set_capabilities(void) +{ + setup_force_cpu_cap(X86_FEATURE_CONSTANT_TSC); + setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE); +} + static void __init vmware_platform_setup(void) { uint32_t eax, ebx, ecx, edx; @@ -152,6 +170,8 @@ static void __init vmware_platform_setup(void) #ifdef CONFIG_X86_IO_APIC no_timer_check = 1; #endif + + vmware_set_capabilities(); } /* @@ -176,24 +196,6 @@ static uint32_t __init vmware_platform(void) return 0; } -/* - * VMware hypervisor takes care of exporting a reliable TSC to the guest. - * Still, due to timing difference when running on virtual cpus, the TSC can - * be marked as unstable in some cases. For example, the TSC sync check at - * bootup can fail due to a marginal offset between vcpus' TSCs (though the - * TSCs do not drift from each other). Also, the ACPI PM timer clocksource - * is not suitable as a watchdog when running on a hypervisor because the - * kernel may miss a wrap of the counter if the vcpu is descheduled for a - * long time. To skip these checks at runtime we set these capability bits, - * so that the kernel could just trust the hypervisor with providing a - * reliable virtual TSC that is suitable for timekeeping. - */ -static void vmware_set_cpu_features(struct cpuinfo_x86 *c) -{ - set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC); - set_cpu_cap(c, X86_FEATURE_TSC_RELIABLE); -} - /* Checks if hypervisor supports x2apic without VT-D interrupt remapping. */ static bool __init vmware_legacy_x2apic_available(void) { @@ -206,7 +208,6 @@ static bool __init vmware_legacy_x2apic_available(void) const __refconst struct hypervisor_x86 x86_hyper_vmware = { .name = "VMware", .detect = vmware_platform, - .set_cpu_features = vmware_set_cpu_features, .init_platform = vmware_platform_setup, .x2apic_available = vmware_legacy_x2apic_available, }; -- 2.12.0 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v3 00/11] x86: xen cpuid() cleanup
Reduce special casing of xen_cpuid() by using cpu capabilities instead of faked cpuid nodes. This cleanup enables us remove the hypervisor specific set_cpu_features callback as the same effect can be reached via setup_[clear|force]_cpu_cap(). Removing the rest faked nodes from xen_cpuid() requires some more work as the remaining cases (mwait leafs and extended topology info) have to be handled at the consumer sides of this information. Changes in V3: - rewrite patch 9 (xsave) to use xgetbv instruction to test for osxsave availability (Andrew Cooper) Changes in V2: - added several features to this scheme - removed hypervisor specific set_cpu_features() callbacks Cc: Alok KatariaCc: Thomas Gleixner Cc: Ingo Molnar Cc: "H. Peter Anvin" Cc: x...@kernel.org Cc: virtualization@lists.linux-foundation.org Juergen Gross (11): xen: set cpu capabilities from xen_start_kernel() x86/xen: don't indicate DCA support in pv domains x86/xen: use capabilities instead of fake cpuid values for aperf x86/xen: use capabilities instead of fake cpuid values for mtrr x86/xen: use capabilities instead of fake cpuid values for acc x86/xen: use capabilities instead of fake cpuid values for acpi x86/xen: use capabilities instead of fake cpuid values for mwait x86/xen: use capabilities instead of fake cpuid values for x2apic x86/xen: use capabilities instead of fake cpuid values for xsave vmware: set cpu capabilities during platform initialization x86/cpu: remove hypervisor specific set_cpu_features arch/x86/include/asm/hypervisor.h | 5 --- arch/x86/kernel/cpu/common.c | 1 - arch/x86/kernel/cpu/hypervisor.c | 8 arch/x86/kernel/cpu/vmware.c | 39 arch/x86/xen/enlighten_pv.c | 95 +-- 5 files changed, 61 insertions(+), 87 deletions(-) -- 2.12.0 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization