Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications
On 2016年12月08日 10:39, Peter Xu wrote:
> Btw, do you have time to help review my RFC series for "vt-d replay"?
> :) I'd like to receive any further review comments besides the issues
> mentioned above, since IMHO they can be separated from the current
> series; I have noted them down in my pending list.

Sure. I have reviewed some patches. They look good to me :)

--
Best regards
Tianyu Lan
Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications
On Wed, Dec 07, 2016 at 10:04:57PM +0800, Lan Tianyu wrote:

[...]

> > Even if the context entry is cleared and invalidated, IMHO it does not
> > mean that we should be using GPA address space, nor do we need to put
> > it into guest physical address space. Instead, it simply means this
> > device cannot do any IO at that time. If IO comes, the IOMMU should do
> > fault reporting to the guest OS, which should be treated as an error.
>
> Yes, that looks right, and there will be a fault event reported by the
> pIOMMU if the context entry is not present for an untranslated DMA
> request. This goes back to the first gap: reporting pIOMMU fault events
> to the vIOMMU.

Right.

> For disabling DMA translation by clearing the gcmd.TE bit, the assigned
> device should still work after the clearing operation, and it's still
> necessary to restore the GPA->HPA mapping, since we can't assume the
> guest won't clear the bit after enabling DMA translation.
>
> This may be a low-priority gap, since the Linux IOMMU driver doesn't
> disable DMA translation frequently or dynamically. But we should still
> consider the situation.

I see. That's something I missed indeed. Yes, we should support disabling
via gcmd.TE.

Also, it looks like we still don't support passthrough mode translation.
That's something we need too, since without it we don't support iommu=pt.

I agree with you that we can postpone these tasks temporarily, just like
the error handling you mentioned.

> > So I think we are emulating the correct guest behavior here - we don't
> > need to do anything if a device is detached from an existing IOMMU
> > domain in the guest. If we do (e.g., we replay the GPA address space on
> > that device when it is detached, so the shadow page table for that
> > device maps the whole guest memory), that is dangerous, because with
> > that the device can DMA anywhere it wants in guest memory.
>
> If the guest wants to disable DMA translation, this is expected
> behavior, and the device model should follow the guest configuration.
> This is just like how most distributions don't enable VT-d DMA
> translation by default: it's the OS's choice.

My previous comment only suits the case when a device is moved outside an
IOMMU domain. I agree with you on the rest.

Btw, do you have time to help review my RFC series for "vt-d replay"? :)
I'd like to receive any further review comments besides the issues
mentioned above, since IMHO they can be separated from the current
series; I have noted them down in my pending list. (I think I need to
test iommu=pt a bit more to make sure it works asap.)

Thanks,

-- peterx
Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications
On 2016年12月07日 14:43, Peter Xu wrote:
> On Wed, Dec 07, 2016 at 02:09:16PM +0800, Lan Tianyu wrote:
>> On 2016年12月06日 18:59, Peter Xu wrote:
>>> On Tue, Dec 06, 2016 at 04:27:39PM +0800, Lan Tianyu wrote:
>>>
>>> [...]
>>>
>>>>>> User space drivers (e.g. DPDK) also can enable/disable IOVA for a
>>>>>> device dynamically.
>>>>>
>>>>> Could you provide more details (or any pointer) on how to do that?
>>>>> I did try to find it myself; I see a VFIO_IOMMU_ENABLE ioctl, but
>>>>> it looks like it is for ppc only.
>>>>
>>>> No, I just gave an example that user space may do that, but did no
>>>> more research. But since Qemu can already enable a device's IOVA,
>>>> other user applications should also be able to do that with the
>>>> same VFIO interface, right?
>>>
>>> AFAIU we can't do that, at least on x86. We can use the vfio
>>> interface to bind a group into a container, but we should not be
>>> able to dynamically disable IOMMU protection. IIUC that needs to
>>> taint the kernel.
>>>
>>> The only way I know is to probe vfio-pci in no-iommu mode; in that
>>> case we disable the IOMMU, but we can never dynamically enable it
>>> either.
>>>
>>> Please correct me if I am wrong.
>>
>> Actually, disabling a device's IOVA doesn't require disabling the
>> kernel's global DMA protection; it only clears the device's VT-d
>> context entry in the context table. Going through the IOMMU and VFIO
>> code, I find this happens when calling the VFIO_GROUP_UNSET_CONTAINER
>> ioctl, which is called when destroying a VM or unplugging an assigned
>> device in Qemu. Please help to double check.
>>
>> Call trace:
>> __vfio_group_unset_container()
>>   vfio_iommu_type1_detach_group()
>>     iommu_detach_group()
>>       dmar_remove_one_dev_info()
>>         __dmar_remove_one_dev_info()
>>           domain_context_clear()
>>
>> The legacy KVM device assign code also calls iommu_detach_device()
>> when deassigning a device.
>>
>> From the device emulation view, we need to make sure of correct
>> register emulation regardless of guest behavior.
>
> Even if the context entry is cleared and invalidated, IMHO it does not
> mean that we should be using GPA address space, nor do we need to put
> it into guest physical address space. Instead, it simply means this
> device cannot do any IO at that time. If IO comes, the IOMMU should do
> fault reporting to the guest OS, which should be treated as an error.

Yes, that looks right, and there will be a fault event reported by the
pIOMMU if the context entry is not present for an untranslated DMA
request. This goes back to the first gap: reporting pIOMMU fault events
to the vIOMMU.

For disabling DMA translation by clearing the gcmd.TE bit, the assigned
device should still work after the clearing operation, and it's still
necessary to restore the GPA->HPA mapping, since we can't assume the
guest won't clear the bit after enabling DMA translation.

This may be a low-priority gap, since the Linux IOMMU driver doesn't
disable DMA translation frequently or dynamically. But we should still
consider the situation.

> So I think we are emulating the correct guest behavior here - we don't
> need to do anything if a device is detached from an existing IOMMU
> domain in the guest. If we do (e.g., we replay the GPA address space on
> that device when it is detached, so the shadow page table for that
> device maps the whole guest memory), that is dangerous, because with
> that the device can DMA anywhere it wants in guest memory.

If the guest wants to disable DMA translation, this is expected behavior,
and the device model should follow the guest configuration. This is just
like how most distributions don't enable VT-d DMA translation by default:
it's the OS's choice.

--
Best regards
Tianyu Lan
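The semantics discussed above (a cleared context entry means the device simply cannot do I/O, and an untranslated request should raise a fault rather than silently fall back to GPA space, while clearing gcmd.TE legitimately returns the device to untranslated GPA addressing) can be illustrated with a toy model. This is a sketch only, not QEMU's actual intel_iommu code; all class and field names here are invented:

```python
class FaultEvent(Exception):
    """Models a VT-d translation fault reported to the guest OS."""

class ToyVIOMMU:
    def __init__(self):
        self.context = {}   # source-id -> page table (iova -> gpa)
        self.te = True      # models gcmd.TE (DMA translation enable)

    def translate(self, sid, iova):
        if not self.te:
            # Translation disabled: DMA addresses are GPAs (identity map).
            return iova
        entry = self.context.get(sid)
        if entry is None:
            # Context entry not present: the device cannot do I/O;
            # report a fault instead of silently using GPA space.
            raise FaultEvent(f"no context entry for device {sid:#x}")
        return entry[iova]

iommu = ToyVIOMMU()
iommu.context[0x10] = {0x1000: 0x7000}
assert iommu.translate(0x10, 0x1000) == 0x7000

del iommu.context[0x10]          # guest clears + invalidates the entry
try:
    iommu.translate(0x10, 0x1000)
    faulted = False
except FaultEvent:
    faulted = True
assert faulted                   # fault, not a fall-back to GPA space

iommu.te = False                 # guest clears gcmd.TE: back to GPA space
assert iommu.translate(0x10, 0x1000) == 0x1000
```

The two disable paths behave differently on purpose: a missing context entry faults, while gcmd.TE cleared means every address is a GPA again, which is why the GPA->HPA shadow mapping must be restored in that case.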
Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications
On Wed, Dec 07, 2016 at 02:09:16PM +0800, Lan Tianyu wrote:
> On 2016年12月06日 18:59, Peter Xu wrote:
> > On Tue, Dec 06, 2016 at 04:27:39PM +0800, Lan Tianyu wrote:
> >
> > [...]
> >
> >>>> User space drivers (e.g. DPDK) also can enable/disable IOVA for a
> >>>> device dynamically.
> >>>
> >>> Could you provide more details (or any pointer) on how to do that?
> >>> I did try to find it myself; I see a VFIO_IOMMU_ENABLE ioctl, but
> >>> it looks like it is for ppc only.
> >>
> >> No, I just gave an example that user space may do that, but did no
> >> more research. But since Qemu can already enable a device's IOVA,
> >> other user applications should also be able to do that with the
> >> same VFIO interface, right?
> >
> > AFAIU we can't do that, at least on x86. We can use the vfio
> > interface to bind a group into a container, but we should not be
> > able to dynamically disable IOMMU protection. IIUC that needs to
> > taint the kernel.
> >
> > The only way I know is to probe vfio-pci in no-iommu mode; in that
> > case we disable the IOMMU, but we can never dynamically enable it
> > either.
> >
> > Please correct me if I am wrong.
>
> Actually, disabling a device's IOVA doesn't require disabling the
> kernel's global DMA protection; it only clears the device's VT-d
> context entry in the context table. Going through the IOMMU and VFIO
> code, I find this happens when calling the VFIO_GROUP_UNSET_CONTAINER
> ioctl, which is called when destroying a VM or unplugging an assigned
> device in Qemu. Please help to double check.
>
> Call trace:
> __vfio_group_unset_container()
>   vfio_iommu_type1_detach_group()
>     iommu_detach_group()
>       dmar_remove_one_dev_info()
>         __dmar_remove_one_dev_info()
>           domain_context_clear()
>
> The legacy KVM device assign code also calls iommu_detach_device()
> when deassigning a device.
>
> From the device emulation view, we need to make sure of correct
> register emulation regardless of guest behavior.

Even if the context entry is cleared and invalidated, IMHO it does not
mean that we should be using GPA address space, nor do we need to put it
into guest physical address space. Instead, it simply means this device
cannot do any IO at that time. If IO comes, the IOMMU should do fault
reporting to the guest OS, which should be treated as an error.

So I think we are emulating the correct guest behavior here - we don't
need to do anything if a device is detached from an existing IOMMU domain
in the guest. If we do (e.g., we replay the GPA address space on that
device when it is detached, so the shadow page table for that device maps
the whole guest memory), that is dangerous, because with that the device
can DMA anywhere it wants in guest memory.

Thanks,

-- peterx
Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications
On 2016年12月06日 18:59, Peter Xu wrote:
> On Tue, Dec 06, 2016 at 04:27:39PM +0800, Lan Tianyu wrote:
>
> [...]
>
>>>> User space drivers (e.g. DPDK) also can enable/disable IOVA for a
>>>> device dynamically.
>>>
>>> Could you provide more details (or any pointer) on how to do that? I
>>> did try to find it myself; I see a VFIO_IOMMU_ENABLE ioctl, but it
>>> looks like it is for ppc only.
>>
>> No, I just gave an example that user space may do that, but did no
>> more research. But since Qemu can already enable a device's IOVA,
>> other user applications should also be able to do that with the same
>> VFIO interface, right?
>
> AFAIU we can't do that, at least on x86. We can use the vfio interface
> to bind a group into a container, but we should not be able to
> dynamically disable IOMMU protection. IIUC that needs to taint the
> kernel.
>
> The only way I know is to probe vfio-pci in no-iommu mode; in that case
> we disable the IOMMU, but we can never dynamically enable it either.
>
> Please correct me if I am wrong.

Actually, disabling a device's IOVA doesn't require disabling the
kernel's global DMA protection; it only clears the device's VT-d context
entry in the context table. Going through the IOMMU and VFIO code, I find
this happens when calling the VFIO_GROUP_UNSET_CONTAINER ioctl, which is
called when destroying a VM or unplugging an assigned device in Qemu.
Please help to double check.

Call trace:
__vfio_group_unset_container()
  vfio_iommu_type1_detach_group()
    iommu_detach_group()
      dmar_remove_one_dev_info()
        __dmar_remove_one_dev_info()
          domain_context_clear()

The legacy KVM device assign code also calls iommu_detach_device() when
deassigning a device.

From the device emulation view, we need to make sure of correct register
emulation regardless of guest behavior.

> Thanks,
>
> -- peterx

--
Best regards
Tianyu Lan
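The net effect of the call trace above — the VFIO_GROUP_UNSET_CONTAINER path clears only that device's context entry, and leaves global DMA protection in place — can be modeled with a toy sketch. The function names mirror the kernel trace, but the bodies are of course invented for illustration:

```python
# Toy model of the detach path: each function delegates downward, and the
# net effect is clearing one device's context entry -- not disabling the
# kernel's global DMA protection.

context_table = {0x10: "domain-1-page-tables"}   # source-id -> translation
global_dma_protection = True                     # never touched below

def domain_context_clear(sid):
    context_table.pop(sid, None)                 # clear this device's entry

def __dmar_remove_one_dev_info(sid):
    domain_context_clear(sid)

def dmar_remove_one_dev_info(sid):
    __dmar_remove_one_dev_info(sid)

def iommu_detach_group(sid):
    dmar_remove_one_dev_info(sid)

def vfio_iommu_type1_detach_group(sid):
    iommu_detach_group(sid)

def __vfio_group_unset_container(sid):
    # Entered via the VFIO_GROUP_UNSET_CONTAINER ioctl (VM destroy or
    # assigned-device unplug in Qemu).
    vfio_iommu_type1_detach_group(sid)

__vfio_group_unset_container(0x10)
assert 0x10 not in context_table     # device's context entry cleared
assert global_dma_protection         # global protection untouched
```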
Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications
On Tue, 6 Dec 2016 18:59:14 +0800
Peter Xu wrote:
> On Tue, Dec 06, 2016 at 04:27:39PM +0800, Lan Tianyu wrote:
>
> [...]
>
>>>> User space drivers (e.g. DPDK) also can enable/disable IOVA for a
>>>> device dynamically.
>>>
>>> Could you provide more details (or any pointer) on how to do that? I
>>> did try to find it myself; I see a VFIO_IOMMU_ENABLE ioctl, but it
>>> looks like it is for ppc only.
>>
>> No, I just gave an example that user space may do that, but did no
>> more research. But since Qemu can already enable a device's IOVA,
>> other user applications should also be able to do that with the same
>> VFIO interface, right?
>
> AFAIU we can't do that, at least on x86. We can use the vfio interface
> to bind a group into a container, but we should not be able to
> dynamically disable IOMMU protection. IIUC that needs to taint the
> kernel.
>
> The only way I know is to probe vfio-pci in no-iommu mode; in that case
> we disable the IOMMU, but we can never dynamically enable it either.
>
> Please correct me if I am wrong.

A vfio user such as QEMU enabling or disabling translation in the vIOMMU
is "simply" assigning the vfio container to the system AddressSpace (as
we do w/o a vIOMMU) or to the device-specific, IOMMU-domain AddressSpace.
The device must be protected by the pIOMMU domain (container) at all
times. Thanks,

Alex
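Alex's point — that toggling vIOMMU translation amounts to switching the device between the system AddressSpace and the IOMMU-domain AddressSpace, while the physical IOMMU domain protects the device at all times — could be sketched as follows. This is a conceptual toy, not QEMU's memory API; the names are invented:

```python
SYSTEM_AS = "system-address-space"        # GPA space, used w/o vIOMMU translation
IOMMU_AS  = "iommu-domain-address-space"  # per-device IOVA space

class ToyAssignedDevice:
    def __init__(self):
        self.address_space = SYSTEM_AS
        self.piommu_protected = True      # invariant: always in a pIOMMU domain

    def viommu_set_translation(self, enabled):
        # "Simply" reassign which address space backs the vfio container;
        # the device never leaves the physical IOMMU domain.
        self.address_space = IOMMU_AS if enabled else SYSTEM_AS

dev = ToyAssignedDevice()
dev.viommu_set_translation(True)           # guest enables DMA remapping
assert dev.address_space == IOMMU_AS and dev.piommu_protected
dev.viommu_set_translation(False)          # guest disables it again
assert dev.address_space == SYSTEM_AS and dev.piommu_protected
```

The key design point is the invariant: enabling or disabling translation in the guest only changes which mappings are programmed into the (always-active) host-side protection.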
Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications
On Tue, Dec 06, 2016 at 04:27:39PM +0800, Lan Tianyu wrote:

[...]

>>> User space drivers (e.g. DPDK) also can enable/disable IOVA for a
>>> device dynamically.
>>
>> Could you provide more details (or any pointer) on how to do that? I
>> did try to find it myself; I see a VFIO_IOMMU_ENABLE ioctl, but it
>> looks like it is for ppc only.
>
> No, I just gave an example that user space may do that, but did no more
> research. But since Qemu can already enable a device's IOVA, other user
> applications should also be able to do that with the same VFIO
> interface, right?

AFAIU we can't do that, at least on x86. We can use the vfio interface to
bind a group into a container, but we should not be able to dynamically
disable IOMMU protection. IIUC that needs to taint the kernel.

The only way I know is to probe vfio-pci in no-iommu mode; in that case
we disable the IOMMU, but we can never dynamically enable it either.

Please correct me if I am wrong. Thanks,

-- peterx
Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications
On 2016年12月06日 15:22, Peter Xu wrote:
> On Tue, Dec 06, 2016 at 03:06:20PM +0800, Lan Tianyu wrote:
>> On 2016年12月06日 14:51, Peter Xu wrote:
>>> On Tue, Dec 06, 2016 at 02:30:24PM +0800, Lan Tianyu wrote:
>>>
>>> [...]
>>>
>>>>>> 2. How to restore the GPA->HPA mapping when IOVA is disabled by
>>>>>> the guest. When the guest enables IOVA for a device, the vIOMMU
>>>>>> will invalidate all previous GPA->HPA mappings and update the
>>>>>> IOVA->HPA mappings to the pIOMMU via the iommu notifier. But if
>>>>>> IOVA is disabled, I think we should restore the GPA->HPA mapping
>>>>>> for the device, otherwise the device won't work again in the VM.
>>>>>
>>>>> If we can have a workable replay mechanism, this problem will be
>>>>> solved IMHO.
>>>>
>>>> The basic idea is to replay the related memory regions to restore
>>>> the GPA->HPA mapping when the guest disables IOVA.
>>>
>>> Btw, could you help elaborate in what cases we will trigger such a
>>> condition? Or, can we dynamically enable/disable an IOMMU?
>>
>> The first case I think of is nested virtualization, where we assign a
>> physical device to an l2 guest.
>
> If we assign a physical device to an l2 guest, then it will belong to
> an IOMMU domain in both the l1 guest and the host, right? (assuming we
> are using vfio-pci to do device assignment)

Yes.

>> User space drivers (e.g. DPDK) also can enable/disable IOVA for a
>> device dynamically.
>
> Could you provide more details (or any pointer) on how to do that? I
> did try to find it myself; I see a VFIO_IOMMU_ENABLE ioctl, but it
> looks like it is for ppc only.

No, I just gave an example that user space may do that, but did no more
research. But since Qemu can already enable a device's IOVA, other user
applications should also be able to do that with the same VFIO interface,
right?

--
Best regards
Tianyu Lan
Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications
On Tue, Dec 06, 2016 at 03:06:20PM +0800, Lan Tianyu wrote:
> On 2016年12月06日 14:51, Peter Xu wrote:
>> On Tue, Dec 06, 2016 at 02:30:24PM +0800, Lan Tianyu wrote:
>>
>> [...]
>>
>>>>> 2. How to restore the GPA->HPA mapping when IOVA is disabled by
>>>>> the guest. When the guest enables IOVA for a device, the vIOMMU
>>>>> will invalidate all previous GPA->HPA mappings and update the
>>>>> IOVA->HPA mappings to the pIOMMU via the iommu notifier. But if
>>>>> IOVA is disabled, I think we should restore the GPA->HPA mapping
>>>>> for the device, otherwise the device won't work again in the VM.
>>>>
>>>> If we can have a workable replay mechanism, this problem will be
>>>> solved IMHO.
>>>
>>> The basic idea is to replay the related memory regions to restore
>>> the GPA->HPA mapping when the guest disables IOVA.
>>
>> Btw, could you help elaborate in what cases we will trigger such a
>> condition? Or, can we dynamically enable/disable an IOMMU?
>
> The first case I think of is nested virtualization, where we assign a
> physical device to an l2 guest.

If we assign a physical device to an l2 guest, then it will belong to an
IOMMU domain in both the l1 guest and the host, right? (assuming we are
using vfio-pci to do device assignment)

> User space drivers (e.g. DPDK) also can enable/disable IOVA for a
> device dynamically.

Could you provide more details (or any pointer) on how to do that? I did
try to find it myself; I see a VFIO_IOMMU_ENABLE ioctl, but it looks like
it is for ppc only.

Thanks,

-- peterx
Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications
On 2016年12月06日 14:51, Peter Xu wrote:
> On Tue, Dec 06, 2016 at 02:30:24PM +0800, Lan Tianyu wrote:
>
> [...]
>
>>>> 2. How to restore the GPA->HPA mapping when IOVA is disabled by the
>>>> guest. When the guest enables IOVA for a device, the vIOMMU will
>>>> invalidate all previous GPA->HPA mappings and update the IOVA->HPA
>>>> mappings to the pIOMMU via the iommu notifier. But if IOVA is
>>>> disabled, I think we should restore the GPA->HPA mapping for the
>>>> device, otherwise the device won't work again in the VM.
>>>
>>> If we can have a workable replay mechanism, this problem will be
>>> solved IMHO.
>>
>> The basic idea is to replay the related memory regions to restore the
>> GPA->HPA mapping when the guest disables IOVA.
>
> Btw, could you help elaborate in what cases we will trigger such a
> condition? Or, can we dynamically enable/disable an IOMMU?

The first case I think of is nested virtualization, where we assign a
physical device to an l2 guest. User space drivers (e.g. DPDK) also can
enable/disable IOVA for a device dynamically.

--
Best regards
Tianyu Lan
Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications
On Tue, Dec 06, 2016 at 02:30:24PM +0800, Lan Tianyu wrote:

[...]

>>> 2. How to restore the GPA->HPA mapping when IOVA is disabled by the
>>> guest. When the guest enables IOVA for a device, the vIOMMU will
>>> invalidate all previous GPA->HPA mappings and update the IOVA->HPA
>>> mappings to the pIOMMU via the iommu notifier. But if IOVA is
>>> disabled, I think we should restore the GPA->HPA mapping for the
>>> device, otherwise the device won't work again in the VM.
>>
>> If we can have a workable replay mechanism, this problem will be
>> solved IMHO.
>
> The basic idea is to replay the related memory regions to restore the
> GPA->HPA mapping when the guest disables IOVA.

Btw, could you help elaborate in what cases we will trigger such a
condition? Or, can we dynamically enable/disable an IOMMU?

Thanks,

-- peterx
Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications
On 2016年12月02日 14:52, Peter Xu wrote:
> On Thu, Dec 01, 2016 at 02:44:14PM +0800, Lan Tianyu wrote:
>
> [...]
>
>> Hi:
>> I think there are still other gaps to enable a passthrough device
>> with the vIOMMU's DMA translation support.
>>
>> 1. Since this patchset shadows the guest IO page table to the pIOMMU
>> (physical IOMMU) via vfio_dma_map/unmap(), there will be some fault
>> events from the pIOMMU if the guest OS misconfigures things. We
>> should report these fault events to the guest. This means we need to
>> pass the fault event from the pIOMMU driver to the vIOMMU in Qemu. I
>> suppose a channel in VFIO should be added to connect the pIOMMU and
>> the vIOMMU.
>
> Thanks for raising this up - IMHO this is a good question.
>
>> The task should be divided into three parts:
>> 1) The pIOMMU driver reports fault events for the vIOMMU via a new
>> VFIO interface.
>
> Here you mean "how the host kernel captures the DMAR fault", right?

Yes, the VFIO kernel driver should know about the fault event when it's
triggered.

> IMHO we can have something like notifier/notifiee as well in DMAR
> fault reporting - people (like the vfio driver in the kernel) can
> register for fault reports related to a specific device. When DMAR
> receives faults for those devices, it triggers the notification list,
> then vfio can be notified.

Yes, something like this.

>> 2) Add a new channel in the VFIO subsystem to connect the pIOMMU
>> driver and the vIOMMU in Qemu.
>> 3) The vIOMMU in Qemu gets the fault event from the VFIO subsystem in
>> Qemu and injects a virtual fault event into the guest.
>>
>> Such a VFIO channel is also required for a device's PRS (Page Request
>> Services) support. This is also a part of SVM (Shared Virtual Memory)
>> support in a VM. Here is the SVM design doc link:
>> http://marc.info/?l=kvm&m=148049586514120&w=2
>>
>> 2. How to restore the GPA->HPA mapping when IOVA is disabled by the
>> guest. When the guest enables IOVA for a device, the vIOMMU will
>> invalidate all previous GPA->HPA mappings and update the IOVA->HPA
>> mappings to the pIOMMU via the iommu notifier. But if IOVA is
>> disabled, I think we should restore the GPA->HPA mapping for the
>> device, otherwise the device won't work again in the VM.
>
> If we can have a workable replay mechanism, this problem will be
> solved IMHO.

The basic idea is to replay the related memory regions to restore the
GPA->HPA mapping when the guest disables IOVA.

--
Best regards
Tianyu Lan
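The replay idea described above — walk the relevant memory regions and re-emit MAP notifications so the shadow mappings return to GPA->HPA — might look like the following toy sketch. All names here are invented for illustration; the real code would go through the iommu notifier and vfio_dma_map():

```python
# Guest RAM regions as (gpa_start, size); a replay walks them and re-emits
# MAP notifications so the shadow (vfio) mappings are restored to GPA->HPA.
guest_ram_regions = [(0x0, 0x1000), (0x2000, 0x2000)]
shadow_mappings = {}   # stand-in for what vfio_dma_map() has installed

def notify_map(gpa, size):
    shadow_mappings[gpa] = size          # stand-in for vfio_dma_map()

def replay_gpa_space():
    shadow_mappings.clear()              # drop stale IOVA->HPA mappings first
    for gpa, size in guest_ram_regions:
        notify_map(gpa, size)            # identity GPA mapping per RAM region

# The guest had IOVA mappings installed, then disables DMA remapping:
shadow_mappings = {0x10000: 0x1000}      # stale IOVA->HPA entry
replay_gpa_space()
assert shadow_mappings == {0x0: 0x1000, 0x2000: 0x2000}
```

The important ordering detail is clearing the stale IOVA mappings before re-mapping, so the device ends up with exactly the GPA view and nothing left over from the remapped state.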
Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications
On Fri, Dec 02, 2016 at 10:30:34AM -0700, Alex Williamson wrote:

[...]

> On the host though, I'd expect we still have separate IOMMU domains,
> one for each device, and we do the DMA_{UN}MAP ioctls separately per
> container. Thanks,

I have an RFC version of a replay implementation that uses this approach
(each device has its own address space, even if more than one device is
in the same IOMMU domain). I'll arrange the patches and post them soon.
Let's see whether that's a valid approach to fix the existing issues for
vfio enablement on vt-d.

Thanks,

-- peterx
Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications
On 12/3/2016 1:30 AM, Alex Williamson wrote:
> On Fri, 2 Dec 2016 14:08:59 +0800
> Peter Xu wrote:
>
>> On Thu, Dec 01, 2016 at 04:27:52PM +0800, Lan Tianyu wrote:
>>> On 2016年11月30日 17:23, Peter Xu wrote:
>>>> On Mon, Nov 28, 2016 at 05:51:50PM +0200, Aviv B.D wrote:
>>>>> * intel_iommu's replay op is not implemented yet (May come in a
>>>>>   different patch set).
>>>>>   The replay function is required for hotplugging a vfio device
>>>>>   and for moving devices between existing domains.
>>>>
>>>> I am thinking about this replay thing recently and now I start to
>>>> doubt whether the whole vt-d vIOMMU framework suits this...
>>>>
>>>> Generally speaking, the current work is throwing away the IOMMU
>>>> "domain" layer here. We maintain the mapping only per device, and
>>>> we don't care too much about which domain it belongs to. This
>>>> seems problematic.
>>>>
>>>> A simplest wrong case for this is (let's assume cache-mode is
>>>> enabled): if we have two assigned devices A and B, both belonging
>>>> to the same domain 1. Meanwhile, in domain 1 assume we have one
>>>> mapping which is the first page (iova range 0-0xfff). Then, if the
>>>> guest wants to invalidate the page, it'll notify the VT-d vIOMMU
>>>> with an invalidation message. If we do this invalidation
>>>> per-device, we'll need to UNMAP the region twice - once for A,
>>>> once for B (if we have more devices, we will unmap more times),
>>>> and we can never know we have done duplicated work since we don't
>>>> keep domain info, so we don't know they are using the same address
>>>> space. The first unmap will work, and then we'll possibly get some
>>>> errors on the rest of the dma unmap failures.
>>>
>>> Hi Peter:
>>
>> Hi, Tianyu,
>>
>>> According to VT-d spec 6.2.2.1, "Software must ensure that, if
>>> multiple context-entries (or extended-context-entries) are
>>> programmed with the same Domain-id (DID), such entries must be
>>> programmed with same value for the second-level page-table pointer
>>> (SLPTPTR) field, and same value for the PASID Table Pointer
>>> (PASIDTPTR) field."
>>>
>>> So if two assigned devices have different IO page tables, they
>>> should be put into different domains.
>>
>> By default, devices will be put into different domains. However it
>> should be legal that we put two assigned devices into the same IOMMU
>> domain (in the guest), right? And we should handle both cases well
>> IMHO.
>>
>> Actually I just wrote a tool to do it based on vfio-pci:
>>
>> https://github.com/xzpeter/clibs/blob/master/gpl/userspace/vfio-bind-group/vfio-bind-group.c
>>
>> If we run this tool in the guest with parameters like:
>>
>> ./vfio-bind-groups 00:03.0 00:04.0
>>
>> it'll create one single domain, and put PCI devices 00:03.0 and
>> 00:04.0 into the same IOMMU domain.
>
> On the host though, I'd expect we still have separate IOMMU domains,
> one for each device, and we do the DMA_{UN}MAP ioctls separately per
> container. Thanks,

Agree. The guest may use different IO page tables for multiple assigned
devices, and this requires putting the assigned devices into different
VT-d domains on the host. I think we can't put assigned devices into the
same VT-d domain on the host before enabling dynamic containers.
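Peter's duplicated-UNMAP example can be made concrete with a small sketch contrasting per-device bookkeeping with per-domain (DID-keyed) bookkeeping. Per VT-d spec 6.2.2.1, devices sharing a DID share one second-level page table, so one unmap per domain suffices. This is illustrative only; the data structures are invented:

```python
# Two assigned devices A and B share guest IOMMU domain 1, which maps the
# first page (iova range 0-0xfff). Count how many UNMAP operations a
# single guest invalidation triggers under each bookkeeping scheme.

devices = {"A": 1, "B": 1}   # device -> guest domain-id (DID)

def unmaps_per_device(did):
    # Per-device bookkeeping: every device in the domain gets its own
    # unmap; the second one targets an already-removed mapping and may
    # return an error.
    return [dev for dev, d in devices.items() if d == did]

def unmaps_per_domain(did):
    # Per-domain bookkeeping: devices sharing a DID share one shadow
    # page table, so a single unmap covers them all.
    return [did] if did in devices.values() else []

assert len(unmaps_per_device(1)) == 2   # duplicated work for A and B
assert len(unmaps_per_domain(1)) == 1   # one unmap for the shared domain
```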
Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications
On Fri, 2 Dec 2016 14:08:59 +0800
Peter Xu wrote:
> On Thu, Dec 01, 2016 at 04:27:52PM +0800, Lan Tianyu wrote:
>> On 2016年11月30日 17:23, Peter Xu wrote:
>>> On Mon, Nov 28, 2016 at 05:51:50PM +0200, Aviv B.D wrote:
>>>> * intel_iommu's replay op is not implemented yet (May come in a
>>>>   different patch set).
>>>>   The replay function is required for hotplugging a vfio device
>>>>   and for moving devices between existing domains.
>>>
>>> I am thinking about this replay thing recently and now I start to
>>> doubt whether the whole vt-d vIOMMU framework suits this...
>>>
>>> Generally speaking, the current work is throwing away the IOMMU
>>> "domain" layer here. We maintain the mapping only per device, and we
>>> don't care too much about which domain it belongs to. This seems
>>> problematic.
>>>
>>> A simplest wrong case for this is (let's assume cache-mode is
>>> enabled): if we have two assigned devices A and B, both belonging to
>>> the same domain 1. Meanwhile, in domain 1 assume we have one mapping
>>> which is the first page (iova range 0-0xfff). Then, if the guest
>>> wants to invalidate the page, it'll notify the VT-d vIOMMU with an
>>> invalidation message. If we do this invalidation per-device, we'll
>>> need to UNMAP the region twice - once for A, once for B (if we have
>>> more devices, we will unmap more times), and we can never know we
>>> have done duplicated work since we don't keep domain info, so we
>>> don't know they are using the same address space. The first unmap
>>> will work, and then we'll possibly get some errors on the rest of
>>> the dma unmap failures.
>>
>> Hi Peter:
>
> Hi, Tianyu,
>
>> According to VT-d spec 6.2.2.1, "Software must ensure that, if
>> multiple context-entries (or extended-context-entries) are programmed
>> with the same Domain-id (DID), such entries must be programmed with
>> same value for the second-level page-table pointer (SLPTPTR) field,
>> and same value for the PASID Table Pointer (PASIDTPTR) field."
>>
>> So if two assigned devices have different IO page tables, they should
>> be put into different domains.
>
> By default, devices will be put into different domains. However it
> should be legal that we put two assigned devices into the same IOMMU
> domain (in the guest), right? And we should handle both cases well
> IMHO.
>
> Actually I just wrote a tool to do it based on vfio-pci:
>
> https://github.com/xzpeter/clibs/blob/master/gpl/userspace/vfio-bind-group/vfio-bind-group.c
>
> If we run this tool in the guest with parameters like:
>
> ./vfio-bind-groups 00:03.0 00:04.0
>
> it'll create one single domain, and put PCI devices 00:03.0 and
> 00:04.0 into the same IOMMU domain.

On the host though, I'd expect we still have separate IOMMU domains, one
for each device, and we do the DMA_{UN}MAP ioctls separately per
container. Thanks,

Alex
Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications
On Fri, 2 Dec 2016 13:59:25 +0800
Peter Xu wrote:
> On Thu, Dec 01, 2016 at 04:21:38AM +0000, Tian, Kevin wrote:
>>> From: Peter Xu
>>> Sent: Wednesday, November 30, 2016 5:24 PM
>>>
>>> On Mon, Nov 28, 2016 at 05:51:50PM +0200, Aviv B.D wrote:
>>>> * intel_iommu's replay op is not implemented yet (May come in a
>>>>   different patch set).
>>>>   The replay function is required for hotplugging a vfio device
>>>>   and for moving devices between existing domains.
>>>
>>> I am thinking about this replay thing recently and now I start to
>>> doubt whether the whole vt-d vIOMMU framework suits this...
>>>
>>> Generally speaking, the current work is throwing away the IOMMU
>>> "domain" layer here. We maintain the mapping only per device, and we
>>> don't care too much about which domain it belongs to. This seems
>>> problematic.
>>>
>>> A simplest wrong case for this is (let's assume cache-mode is
>>> enabled): if we have two assigned devices A and B, both belonging to
>>> the same domain 1. Meanwhile, in domain 1 assume we have one mapping
>>> which is the first page (iova range 0-0xfff). Then, if the guest
>>> wants to invalidate the page, it'll notify the VT-d vIOMMU with an
>>> invalidation message. If we do this invalidation per-device, we'll
>>> need to UNMAP the region twice - once for A, once for B (if we have
>>> more devices, we will unmap more times), and we can never know we
>>> have done duplicated work since we don't keep domain info, so we
>>> don't know they are using the same address space. The first unmap
>>> will work, and then we'll possibly get some errors on the rest of
>>> the dma unmap failures.
>>
>> Tianyu and I discussed there is a bigger problem: today VFIO assumes
>> only one address space per container, which is fine w/o a vIOMMU (all
>> devices in the same container share the same GPA->HPA translation).
>> However it's not the case when a vIOMMU is enabled, because guest
>> Linux implements per-device IOVA spaces. If a VFIO container includes
>> multiple devices, it means multiple address spaces are required per
>> container...
>
> IIUC the vfio container is created in:
>
> vfio_realize
>   vfio_get_group
>     vfio_connect_container
>
> Along the way (for vfio_get_group()), we have:
>
> group = vfio_get_group(groupid, pci_device_iommu_address_space(pdev), errp);
>
> Here the address space is per device. Without a vIOMMU, they will all
> point to the same system address space. However with a vIOMMU, that
> address space will be per-device, no?

Correct, with VT-d present, there will be a separate AddressSpace per
device, so each device will be placed into a separate container. This is
currently the only way to provide the flexibility that those separate
devices can be attached to different domains in the guest. It also
automatically faults when devices share an iommu group on the host but
the guest attempts to use separate AddressSpaces.

Trouble comes when the guest is booted with iommu=pt, as each container
will need to map the full guest memory, yet each container is accounted
separately for locked memory. libvirt doesn't account for $NUM_HOSTDEVS
x $VM_MEM_SIZE of locked memory. Ideally we could be more flexible with
dynamic containers, but it's not currently an option to move a group from
one container to another w/o first closing all the devices within the
group. Thanks,

Alex
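The locked-memory accounting concern above can be illustrated with a hypothetical worked example (these figures are made up; the point is only the $NUM_HOSTDEVS x $VM_MEM_SIZE multiplication):

```python
# With iommu=pt, each per-device vfio container pins the full guest RAM,
# and locked memory is accounted separately per container.
vm_mem_bytes = 16 * 1024**3          # hypothetical 16 GiB guest
num_hostdevs = 4                     # four assigned devices -> 4 containers

locked_bytes = num_hostdevs * vm_mem_bytes
assert locked_bytes == 64 * 1024**3  # the locked-memory limit must allow 64 GiB
```

So a management layer that only budgets one copy of guest RAM for locked memory would under-allocate by a factor of the number of assigned devices.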
Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications
On Fri, Dec 02, 2016 at 06:23:52AM +0000, Tian, Kevin wrote:

[...]

>> IIUC the vfio container is created in:
>>
>> vfio_realize
>>   vfio_get_group
>>     vfio_connect_container
>>
>> Along the way (for vfio_get_group()), we have:
>>
>> group = vfio_get_group(groupid, pci_device_iommu_address_space(pdev), errp);
>>
>> Here the address space is per device. Without a vIOMMU, they will all
>> point to the same system address space. However with a vIOMMU, that
>> address space will be per-device, no?
>
> yes, I didn't note that fact. Tianyu also pointed it out in his
> reply. :-)

Sorry I didn't see that just now. Thanks for the confirmation. :)

Thanks,

-- peterx
Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications
On Thu, Dec 01, 2016 at 02:44:14PM +0800, Lan Tianyu wrote: [...] > Hi: > I think there are still other gaps to enable passthough device with > vIOMMU's DMA translation support. > > 1. Since this patchset is to shadow guest IO page table to > pIOMMU(physical IOMMU) vfio_dma_map/umap(), there will be some fault > events from pIOMMU if guest os does misconfigurations. We should report > these fault events to guest. This means we need to pass the fault event > from pIOMMU driver to vIOMMU in Qemu. I suppose a channel in VFIO should > be added to connect pIOMMU and vIOMMU. Thanks for raising this up - IMHO this is a good question. > > The task should be divided into three parts > 1) pIOMMU driver reports fault events for vIOMMU via new VFIO interface Here you mean "how the host kernel captures the DMAR fault", right? IMHO we can have something like notifier/notifiee as well in DMAR fault reporting - people (like the vfio driver in the kernel) can register for fault reports related to a specific device. When DMAR receives faults for those devices, it triggers the notification list, and then vfio can be notified. > 2) Add new channel in VFIO subsystem to connect pIOMMU driver and > vIOMMU in Qemu > 3) vIOMMU in Qemu get fault event from VFIO subsystem in Qemu and inject > virtual fault event to guest. > > Such VFIO channel is also required by device's PRS(Page Request > Services) support. This is also a part of SVM(Shared virtual memory) > support in VM. Here is SVM design doc link. > http://marc.info/?l=kvm=148049586514120=2 > > 2. How to restore GPA->HPA mapping when IOVA is disabled by guest. > When guest enables IOVA for device, vIOMMU will invalidate all previous > GPA->HPA mapping and update IOVA->HPA mapping to pIOMMU via iommu > notifier. But if IOVA is disabled, I think we should restore GPA->HPA > mapping for the device otherwise the device won't work again in the VM. If we can have a workable replay mechanism, this problem will be solved IMHO.
A more general issue is when we move one device from its current domain A into an existing domain B (no matter whether B is the system address space, or another existing IOMMU domain) - we need to unmap the pages in A and remap the pages in B. But this is not the best solution IMHO. The issue with this solution is that we will need to maintain some containers that have totally the same shadow page tables. The best solution is that we have a domain layer, and we have an address space (container) for each domain. Then: - when a new domain is added: we create a new container for this new domain, and re-build the shadow page table in that domain if there is any - when a device moves between domains (either into/from the system address space, or into another IOMMU domain): we can just put that device (aka the corresponding group) into the existing container that the new domain belongs to Thanks, -- peterx
Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications
> From: Peter Xu [mailto:pet...@redhat.com] > Sent: Friday, December 02, 2016 1:59 PM > > On Thu, Dec 01, 2016 at 04:21:38AM +, Tian, Kevin wrote: > > > From: Peter Xu > > > Sent: Wednesday, November 30, 2016 5:24 PM > > > > > > On Mon, Nov 28, 2016 at 05:51:50PM +0200, Aviv B.D wrote: > > > > * intel_iommu's replay op is not implemented yet (May come in different > > > > patch > > > > set). > > > > The replay function is required for hotplug vfio device and to move > > > > devices > > > > between existing domains. > > > > > > I am thinking about this replay thing recently and now I start to > > > doubt whether the whole vt-d vIOMMU framework suites this... > > > > > > Generally speaking, current work is throwing away the IOMMU "domain" > > > layer here. We maintain the mapping only per device, and we don't care > > > too much about which domain it belongs. This seems problematic. > > > > > > A simplest wrong case for this is (let's assume cache-mode is > > > enabled): if we have two assigned devices A and B, both belong to the > > > same domain 1. Meanwhile, in domain 1 assume we have one mapping which > > > is the first page (iova range 0-0xfff). Then, if guest wants to > > > invalidate the page, it'll notify VT-d vIOMMU with an invalidation > > > message. If we do this invalidation per-device, we'll need to UNMAP > > > the region twice - once for A, once for B (if we have more devices, we > > > will unmap more times), and we can never know we have done duplicated > > > work since we don't keep domain info, so we don't know they are using > > > the same address space. The first unmap will work, and then we'll > > > possibly get some errors on the rest of dma unmap failures. > > > > Tianyu and I discussed there is a bigger problem: today VFIO assumes > > only one address space per container, which is fine w/o vIOMMU (all devices > > in > > same container share same GPA->HPA translation). 
However it's not the case > > when vIOMMU is enabled, because guest Linux implements per-device > > IOVA space. If a VFIO container includes multiple devices, it means > > multiple address spaces required per container... > > IIUC the vfio container is created in: > > vfio_realize > vfio_get_group > vfio_connect_container > > Along the way (for vfio_get_group()), we have: > > group = vfio_get_group(groupid, pci_device_iommu_address_space(pdev), errp); > > Here the address space is per device. If without vIOMMU, they will be > pointed to the same system address space. However if with vIOMMU, > that address space will be per-device, no? > yes, I didn't note that fact. Tianyu also pointed it out in his reply. :-) Thanks Kevin
Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications
On Thu, Dec 01, 2016 at 08:42:04AM -0700, Alex Williamson wrote: > On Wed, 30 Nov 2016 17:23:59 +0800 > Peter Xu wrote: > > > On Mon, Nov 28, 2016 at 05:51:50PM +0200, Aviv B.D wrote: > > > * intel_iommu's replay op is not implemented yet (May come in different > > > patch > > > set). > > > The replay function is required for hotplug vfio device and to move > > > devices > > > between existing domains. > > > > I am thinking about this replay thing recently and now I start to > > doubt whether the whole vt-d vIOMMU framework suites this... > > > > Generally speaking, current work is throwing away the IOMMU "domain" > > layer here. We maintain the mapping only per device, and we don't care > > too much about which domain it belongs. This seems problematic. > > > > A simplest wrong case for this is (let's assume cache-mode is > > enabled): if we have two assigned devices A and B, both belong to the > > same domain 1. Meanwhile, in domain 1 assume we have one mapping which > > is the first page (iova range 0-0xfff). Then, if guest wants to > > invalidate the page, it'll notify VT-d vIOMMU with an invalidation > > message. If we do this invalidation per-device, we'll need to UNMAP > > the region twice - once for A, once for B (if we have more devices, we > > will unmap more times), and we can never know we have done duplicated > > work since we don't keep domain info, so we don't know they are using > > the same address space. The first unmap will work, and then we'll > > possibly get some errors on the rest of dma unmap failures. > > > > Looks like we just cannot live without knowing this domain layer. > > Because the address space is binded to the domain. If we want to sync > > the address space (here to setup a correct shadow page table), we need > > to do it per-domain. > > > > What I can think of as a solution is that we introduce this "domain" > > layer - like a memory region per domain.
When invalidation happens, > > it's per-domain, not per-device any more (actually I guess that's what > > current vt-d iommu driver in kernel is doing, we just ignored it - we > > fetch the devices that matches the domain ID). We can/need to maintain > > something different, like sid <-> domain mappings (we can do this as > > long as we are notified when context entries changed), per-domain > > mappings (just like per-device mappings that we are trying to build in > > this series, but what we really need is IMHO per domain one), etc. > > When device switches domain, we switch the IOMMU memory region > > accordingly. > > > > Does this make any sense? Comments are greatly welcomed (especially > > from AlexW and DavidG). > > It's been a bit since I've looked at VT-d emulation, but I certainly > remember that it's way more convoluted than I expected. It seems like > a domain should create an AddressSpace and any devices assigned to that > domain should make use of that single address space, but IIRC VT-d > creates an address space per device, ie. per context entry. Yes, I think this idea (one address space per domain) came from one of your replies in the past, and I just found it more essential than I thought before. I'll see whether I can clear the way out before moving on to the replay implementations. Because IIUC the replay will depend on this (introducing the domain layer in VT-d IOMMU emulation). Thanks! -- peterx
Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications
On Thu, Dec 01, 2016 at 04:27:52PM +0800, Lan Tianyu wrote: > On 2016年11月30日 17:23, Peter Xu wrote: > > On Mon, Nov 28, 2016 at 05:51:50PM +0200, Aviv B.D wrote: > >> * intel_iommu's replay op is not implemented yet (May come in different > >> patch > >> set). > >> The replay function is required for hotplug vfio device and to move > >> devices > >> between existing domains. > > > > I am thinking about this replay thing recently and now I start to > > doubt whether the whole vt-d vIOMMU framework suites this... > > > > Generally speaking, current work is throwing away the IOMMU "domain" > > layer here. We maintain the mapping only per device, and we don't care > > too much about which domain it belongs. This seems problematic. > > > > A simplest wrong case for this is (let's assume cache-mode is > > enabled): if we have two assigned devices A and B, both belong to the > > same domain 1. Meanwhile, in domain 1 assume we have one mapping which > > is the first page (iova range 0-0xfff). Then, if guest wants to > > invalidate the page, it'll notify VT-d vIOMMU with an invalidation > > message. If we do this invalidation per-device, we'll need to UNMAP > > the region twice - once for A, once for B (if we have more devices, we > > will unmap more times), and we can never know we have done duplicated > > work since we don't keep domain info, so we don't know they are using > > the same address space. The first unmap will work, and then we'll > > possibly get some errors on the rest of dma unmap failures. > > > Hi Peter: Hi, Tianyu, > According VTD spec 6.2.2.1, "Software must ensure that, if multiple > context-entries (or extended-context-entries) are programmed > with the same Domain-id (DID), such entries must be programmed with same > value for the secondlevel page-table pointer (SLPTPTR) field, and same > value for the PASID Table Pointer (PASIDTPTR) field.". > > So if two assigned device may have different IO page table, they should > be put into different domains. 
By default, devices will be put into different domains. However it should be legal that we put two assigned devices into the same IOMMU domain (in the guest), right? And we should handle both cases well IMHO. Actually I just wrote a tool to do it based on vfio-pci: https://github.com/xzpeter/clibs/blob/master/gpl/userspace/vfio-bind-group/vfio-bind-group.c If we run this tool in the guest with parameter like: ./vfio-bind-groups 00:03.0 00:04.0 It'll create one single domain, and put PCI device 00:03.0, 00:04.0 into the same IOMMU domain. Thanks, -- peterx
Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications
On Thu, Dec 01, 2016 at 04:21:38AM +, Tian, Kevin wrote: > > From: Peter Xu > > Sent: Wednesday, November 30, 2016 5:24 PM > > > > On Mon, Nov 28, 2016 at 05:51:50PM +0200, Aviv B.D wrote: > > > * intel_iommu's replay op is not implemented yet (May come in different > > > patch > > > set). > > > The replay function is required for hotplug vfio device and to move > > > devices > > > between existing domains. > > > > I am thinking about this replay thing recently and now I start to > > doubt whether the whole vt-d vIOMMU framework suites this... > > > > Generally speaking, current work is throwing away the IOMMU "domain" > > layer here. We maintain the mapping only per device, and we don't care > > too much about which domain it belongs. This seems problematic. > > > > A simplest wrong case for this is (let's assume cache-mode is > > enabled): if we have two assigned devices A and B, both belong to the > > same domain 1. Meanwhile, in domain 1 assume we have one mapping which > > is the first page (iova range 0-0xfff). Then, if guest wants to > > invalidate the page, it'll notify VT-d vIOMMU with an invalidation > > message. If we do this invalidation per-device, we'll need to UNMAP > > the region twice - once for A, once for B (if we have more devices, we > > will unmap more times), and we can never know we have done duplicated > > work since we don't keep domain info, so we don't know they are using > > the same address space. The first unmap will work, and then we'll > > possibly get some errors on the rest of dma unmap failures. > > Tianyu and I discussed there is a bigger problem: today VFIO assumes > only one address space per container, which is fine w/o vIOMMU (all devices > in > same container share same GPA->HPA translation). However it's not the case > when vIOMMU is enabled, because guest Linux implements per-device > IOVA space. If a VFIO container includes multiple devices, it means > multiple address spaces required per container... 
IIUC the vfio container is created in: vfio_realize vfio_get_group vfio_connect_container Along the way (for vfio_get_group()), we have: group = vfio_get_group(groupid, pci_device_iommu_address_space(pdev), errp); Here the address space is per device. If without vIOMMU, they will be pointed to the same system address space. However if with vIOMMU, that address space will be per-device, no? -- peterx
Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications
On Wed, 30 Nov 2016 17:23:59 +0800 Peter Xu wrote: > On Mon, Nov 28, 2016 at 05:51:50PM +0200, Aviv B.D wrote: > > * intel_iommu's replay op is not implemented yet (May come in different > > patch > > set). > > The replay function is required for hotplug vfio device and to move > > devices > > between existing domains. > > I am thinking about this replay thing recently and now I start to > doubt whether the whole vt-d vIOMMU framework suites this... > > Generally speaking, current work is throwing away the IOMMU "domain" > layer here. We maintain the mapping only per device, and we don't care > too much about which domain it belongs. This seems problematic. > > A simplest wrong case for this is (let's assume cache-mode is > enabled): if we have two assigned devices A and B, both belong to the > same domain 1. Meanwhile, in domain 1 assume we have one mapping which > is the first page (iova range 0-0xfff). Then, if guest wants to > invalidate the page, it'll notify VT-d vIOMMU with an invalidation > message. If we do this invalidation per-device, we'll need to UNMAP > the region twice - once for A, once for B (if we have more devices, we > will unmap more times), and we can never know we have done duplicated > work since we don't keep domain info, so we don't know they are using > the same address space. The first unmap will work, and then we'll > possibly get some errors on the rest of dma unmap failures. > > Looks like we just cannot live without knowing this domain layer. > Because the address space is binded to the domain. If we want to sync > the address space (here to setup a correct shadow page table), we need > to do it per-domain. > > What I can think of as a solution is that we introduce this "domain" > layer - like a memory region per domain.
When invalidation happens, > it's per-domain, not per-device any more (actually I guess that's what > current vt-d iommu driver in kernel is doing, we just ignored it - we > fetch the devices that matches the domain ID). We can/need to maintain > something different, like sid <-> domain mappings (we can do this as > long as we are notified when context entries changed), per-domain > mappings (just like per-device mappings that we are trying to build in > this series, but what we really need is IMHO per domain one), etc. > When device switches domain, we switch the IOMMU memory region > accordingly. > > Does this make any sense? Comments are greatly welcomed (especially > from AlexW and DavidG). It's been a bit since I've looked at VT-d emulation, but I certainly remember that it's way more convoluted than I expected. It seems like a domain should create an AddressSpace and any devices assigned to that domain should make use of that single address space, but IIRC VT-d creates an address space per device, ie. per context entry.
Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications
On 2016年11月30日 17:23, Peter Xu wrote: > On Mon, Nov 28, 2016 at 05:51:50PM +0200, Aviv B.D wrote: >> * intel_iommu's replay op is not implemented yet (May come in different >> patch >> set). >> The replay function is required for hotplug vfio device and to move >> devices >> between existing domains. > > I am thinking about this replay thing recently and now I start to > doubt whether the whole vt-d vIOMMU framework suites this... > > Generally speaking, current work is throwing away the IOMMU "domain" > layer here. We maintain the mapping only per device, and we don't care > too much about which domain it belongs. This seems problematic. > > A simplest wrong case for this is (let's assume cache-mode is > enabled): if we have two assigned devices A and B, both belong to the > same domain 1. Meanwhile, in domain 1 assume we have one mapping which > is the first page (iova range 0-0xfff). Then, if guest wants to > invalidate the page, it'll notify VT-d vIOMMU with an invalidation > message. If we do this invalidation per-device, we'll need to UNMAP > the region twice - once for A, once for B (if we have more devices, we > will unmap more times), and we can never know we have done duplicated > work since we don't keep domain info, so we don't know they are using > the same address space. The first unmap will work, and then we'll > possibly get some errors on the rest of dma unmap failures. Hi Peter: According to VT-d spec 6.2.2.1, "Software must ensure that, if multiple context-entries (or extended-context-entries) are programmed with the same Domain-id (DID), such entries must be programmed with same value for the second-level page-table pointer (SLPTPTR) field, and same value for the PASID Table Pointer (PASIDTPTR) field.". So if two assigned devices have different IO page tables, they should be put into different domains. > > Looks like we just cannot live without knowing this domain layer. > Because the address space is binded to the domain.
If we want to sync > the address space (here to setup a correct shadow page table), we need > to do it per-domain. > > What I can think of as a solution is that we introduce this "domain" > layer - like a memory region per domain. When invalidation happens, > it's per-domain, not per-device any more (actually I guess that's what > current vt-d iommu driver in kernel is doing, we just ignored it - we > fetch the devices that matches the domain ID). We can/need to maintain > something different, like sid <-> domain mappings (we can do this as > long as we are notified when context entries changed), per-domain > mappings (just like per-device mappings that we are trying to build in > this series, but what we really need is IMHO per domain one), etc. > When device switches domain, we switch the IOMMU memory region > accordingly. > > Does this make any sense? Comments are greatly welcomed (especially > from AlexW and DavidG). > > Thanks, > > -- peterx > -- Best regards Tianyu Lan
Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications
On 2016年12月01日 12:21, Tian, Kevin wrote: >> I am thinking about this replay thing recently and now I start to >> > doubt whether the whole vt-d vIOMMU framework suites this... >> > >> > Generally speaking, current work is throwing away the IOMMU "domain" >> > layer here. We maintain the mapping only per device, and we don't care >> > too much about which domain it belongs. This seems problematic. >> > >> > A simplest wrong case for this is (let's assume cache-mode is >> > enabled): if we have two assigned devices A and B, both belong to the >> > same domain 1. Meanwhile, in domain 1 assume we have one mapping which >> > is the first page (iova range 0-0xfff). Then, if guest wants to >> > invalidate the page, it'll notify VT-d vIOMMU with an invalidation >> > message. If we do this invalidation per-device, we'll need to UNMAP >> > the region twice - once for A, once for B (if we have more devices, we >> > will unmap more times), and we can never know we have done duplicated >> > work since we don't keep domain info, so we don't know they are using >> > the same address space. The first unmap will work, and then we'll >> > possibly get some errors on the rest of dma unmap failures. > Tianyu and I discussed there is a bigger problem: today VFIO assumes > only one address space per container, which is fine w/o vIOMMU (all devices > in > same container share same GPA->HPA translation). However it's not the case > when vIOMMU is enabled, because guest Linux implements per-device > IOVA space. If a VFIO container includes multiple devices, it means > multiple address spaces required per container... > Hi All: Some updates about the relationship between assigned devices and containers. If vIOMMU is disabled, all assigned devices will use the global address_space_memory as their address space (for details please see pci_device_iommu_address_space()). VFIO creates containers according to address space, and so all assigned devices will be put into one container.
If vIOMMU is enabled, Intel vIOMMU will allocate a separate address space for each assigned device and then VFIO will create a different container for each assigned device. In other words, it will be one assigned device per container when vIOMMU is enabled. So the original concern won't arise. This is my understanding; please correct me if something is wrong. -- Best regards Tianyu Lan
Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications
On 2016年11月28日 23:51, Aviv B.D wrote: > From: "Aviv Ben-David" > * Advertize Cache Mode capability in iommu cap register. > This capability is controlled by "cache-mode" property of intel-iommu > device. > To enable this option call QEMU with "-device intel-iommu,cache-mode=true". > > * On page cache invalidation in intel vIOMMU, check if the domain belong to > registered notifier, and notify accordingly. > > Currently this patch still doesn't enabling VFIO devices support with vIOMMU > present. Current problems: > * vfio_iommu_map_notify is not aware about memory range belong to specific > VFIOGuestIOMMU. > * intel_iommu's replay op is not implemented yet (May come in different patch > set). > The replay function is required for hotplug vfio device and to move devices > between existing domains. > > Changes from v1 to v2: > * remove assumption that the cache do not clears > * fix lockup on high load. > > Changes from v2 to v3: > * remove debug leftovers > * split to sepearate commits > * change is_write to flags in vtd_do_iommu_translate, add IOMMU_NO_FAIL > to suppress error propagating to guest. > > Changes from v3 to v4: > * Add property to intel_iommu device to control the CM capability, > default to False. > * Use s->iommu_ops.notify_flag_changed to register notifiers. > > Changes from v4 to v4 RESEND: > * Fix codding style pointed by checkpatch.pl script. > > Changes from v4 to v5: > * Reduce the number of changes in patch 2 and make flags real bitfield. > * Revert deleted debug prints. > * Fix memory leak in patch 3. > > Changes from v5 to v6: > * fix prototype of iommu_translate function for more IOMMU types. > * VFIO will be notified only on the difference, without unmap > before change to maps. > > Changes from v6 to v7: > * Add replay op to iommu_ops, with current behavior as default implementation > (Patch 4). > * Add stub to disable VFIO with intel_iommu support (Patch 5). > * Cosmetic changes to other patches.
> > Aviv Ben-David (5): > IOMMU: add option to enable VTD_CAP_CM to vIOMMU capility exposoed to > guest > IOMMU: change iommu_op->translate's is_write to flags, add support to > NO_FAIL flag mode > IOMMU: enable intel_iommu map and unmap notifiers > IOMMU: add specific replay function with default implemenation > IOMMU: add specific null implementation of iommu_replay to intel_iommu > > exec.c | 3 +- > hw/alpha/typhoon.c | 2 +- > hw/i386/amd_iommu.c| 4 +- > hw/i386/intel_iommu.c | 165 > ++--- > hw/i386/intel_iommu_internal.h | 3 + > hw/pci-host/apb.c | 2 +- > hw/ppc/spapr_iommu.c | 2 +- > hw/s390x/s390-pci-bus.c| 2 +- > include/exec/memory.h | 10 ++- > include/hw/i386/intel_iommu.h | 11 +++ > memory.c | 11 ++- > 11 files changed, 177 insertions(+), 38 deletions(-) > Hi: I think there are still other gaps to enable passthough device with vIOMMU's DMA translation support. 1. Since this patchset is to shadow guest IO page table to pIOMMU(physical IOMMU) vfio_dma_map/umap(), there will be some fault events from pIOMMU if guest os does misconfigurations. We should report these fault events to guest. This means we need to pass the fault event from pIOMMU driver to vIOMMU in Qemu. I suppose a channel in VFIO should be added to connect pIOMMU and vIOMMU. The task should be divided into three parts 1) pIOMMU driver reports fault events for vIOMMU via new VFIO interface 2) Add new channel in VFIO subsystem to connect pIOMMU driver and vIOMMU in Qemu 3) vIOMMU in Qemu get fault event from VFIO subsystem in Qemu and inject virtual fault event to guest. Such VFIO channel is also required by device's PRS(Page Request Services) support. This is also a part of SVM(Shared virtual memory) support in VM. Here is SVM design doc link. http://marc.info/?l=kvm=148049586514120=2 2. How to restore GPA->HPA mapping when IOVA is disabled by guest. 
When the guest enables IOVA for a device, vIOMMU will invalidate all previous GPA->HPA mappings and update IOVA->HPA mappings to pIOMMU via the iommu notifier. But if IOVA is disabled, I think we should restore the GPA->HPA mapping for the device, otherwise the device won't work again in the VM. Another option to enable a passthrough device with Intel vIOMMU: we may prevent the guest from enabling the DMA translation function via gcmd, with vIOMMU never setting the "Translation Enable Status" bit in the gsts register when there is a passthrough device. There are the kernel parameter "intel_iommu=on" and the Kconfig option CONFIG_INTEL_IOMMU_DEFAULT_ON to enable DMA translation for the Intel IOMMU driver, but they aren't set by default in the RHEL, SLES, Oracle and Ubuntu distributions. The iommu driver will trigger a kernel panic when it fails to enable DMA translation via gcmd, with the log "DMAR hardware is malfunctioning". -- Best regards Tianyu Lan
Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications
> From: Peter Xu > Sent: Wednesday, November 30, 2016 5:24 PM > > On Mon, Nov 28, 2016 at 05:51:50PM +0200, Aviv B.D wrote: > > * intel_iommu's replay op is not implemented yet (May come in different > > patch > > set). > > The replay function is required for hotplug vfio device and to move > > devices > > between existing domains. > > I am thinking about this replay thing recently and now I start to > doubt whether the whole vt-d vIOMMU framework suites this... > > Generally speaking, current work is throwing away the IOMMU "domain" > layer here. We maintain the mapping only per device, and we don't care > too much about which domain it belongs. This seems problematic. > > A simplest wrong case for this is (let's assume cache-mode is > enabled): if we have two assigned devices A and B, both belong to the > same domain 1. Meanwhile, in domain 1 assume we have one mapping which > is the first page (iova range 0-0xfff). Then, if guest wants to > invalidate the page, it'll notify VT-d vIOMMU with an invalidation > message. If we do this invalidation per-device, we'll need to UNMAP > the region twice - once for A, once for B (if we have more devices, we > will unmap more times), and we can never know we have done duplicated > work since we don't keep domain info, so we don't know they are using > the same address space. The first unmap will work, and then we'll > possibly get some errors on the rest of dma unmap failures. Tianyu and I discussed there is a bigger problem: today VFIO assumes only one address space per container, which is fine w/o vIOMMU (all devices in same container share same GPA->HPA translation). However it's not the case when vIOMMU is enabled, because guest Linux implements per-device IOVA space. If a VFIO container includes multiple devices, it means multiple address spaces required per container... > > Looks like we just cannot live without knowing this domain layer. > Because the address space is binded to the domain. 
If we want to sync > the address space (here to setup a correct shadow page table), we need > to do it per-domain. > > What I can think of as a solution is that we introduce this "domain" > layer - like a memory region per domain. When invalidation happens, > it's per-domain, not per-device any more (actually I guess that's what > current vt-d iommu driver in kernel is doing, we just ignored it - we > fetch the devices that matches the domain ID). We can/need to maintain > something different, like sid <-> domain mappings (we can do this as > long as we are notified when context entries changed), per-domain > mappings (just like per-device mappings that we are trying to build in > this series, but what we really need is IMHO per domain one), etc. > When device switches domain, we switch the IOMMU memory region > accordingly. > > Does this make any sense? Comments are greatly welcomed (especially > from AlexW and DavidG). > > Thanks, > > -- peterx
Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications
> From: Aviv B.D > Sent: Monday, November 28, 2016 11:52 PM > > From: "Aviv Ben-David" > > * Advertize Cache Mode capability in iommu cap register. > This capability is controlled by "cache-mode" property of intel-iommu > device. > To enable this option call QEMU with "-device intel-iommu,cache-mode=true". > > * On page cache invalidation in intel vIOMMU, check if the domain belong to > registered notifier, and notify accordingly. > > Currently this patch still doesn't enabling VFIO devices support with vIOMMU > present. Current problems: > * vfio_iommu_map_notify is not aware about memory range belong to specific > VFIOGuestIOMMU. > * intel_iommu's replay op is not implemented yet (May come in different patch > set). > The replay function is required for hotplug vfio device and to move devices > between existing domains. > Thanks Aviv for your great work! Extending vIOMMU to support VFIO devices is also an important feature for Intel. We'd like to contribute and collaborate with you in this area to make it fully functional. My colleague (Tianyu Lan) will chime in later with his thoughts. Please don't hesitate to share your TODO list so we can work together to move forward. Another colleague of mine (Yi Liu) is also working on SVM virtualization in the meantime with a draft design already sent out. There'll be more API changes built on top of the notification framework being worked on here, which we can discuss later when your work is stabilized. :-) Thanks Kevin
Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications
On Mon, Nov 28, 2016 at 05:51:50PM +0200, Aviv B.D wrote:
> * intel_iommu's replay op is not implemented yet (May come in different
>   patch set).
>   The replay function is required for hotplug vfio device and to move
>   devices between existing domains.

I have been thinking about this replay thing recently, and I am starting
to doubt whether the whole vt-d vIOMMU framework suits it...

Generally speaking, the current work throws away the IOMMU "domain" layer
here. We maintain the mappings only per device, and we don't care much
about which domain a device belongs to. This seems problematic.

The simplest failure case (let's assume cache-mode is enabled): suppose we
have two assigned devices A and B, both belonging to domain 1, and domain 1
has one mapping covering the first page (iova range 0-0xfff). If the guest
wants to invalidate that page, it notifies the vt-d vIOMMU with an
invalidation message. If we handle the invalidation per device, we need to
UNMAP the region twice - once for A, once for B (with more devices, we
unmap more times) - and we can never tell that we have done duplicate work,
since we keep no domain info and so don't know the devices share an address
space. The first unmap will work, and then we'll likely get errors from the
remaining unmap failures.

It looks like we just cannot live without knowing about this domain layer,
because the address space is bound to the domain. If we want to sync the
address space (here, to set up a correct shadow page table), we need to do
it per domain.

What I can think of as a solution is to introduce this "domain" layer -
like a memory region per domain. When an invalidation happens, it is per
domain, not per device any more (actually I guess that's what the current
vt-d iommu driver in the kernel is doing, and we just ignored it - we fetch
the devices that match the domain ID).
We can/need to maintain something different, like sid <-> domain mappings
(we can do this as long as we are notified when context entries change),
per-domain mappings (just like the per-device mappings we are trying to
build in this series, but what we really need is IMHO a per-domain one),
etc. When a device switches domain, we switch the IOMMU memory region
accordingly.

Does this make any sense? Comments are greatly welcomed (especially from
AlexW and DavidG).

Thanks,

-- peterx
[Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications
From: "Aviv Ben-David"

* Advertise Cache Mode capability in the iommu cap register.
  This capability is controlled by the "cache-mode" property of the
  intel-iommu device.
  To enable this option, call QEMU with "-device intel-iommu,cache-mode=true".

* On page cache invalidation in the intel vIOMMU, check if the domain
  belongs to a registered notifier, and notify accordingly.

Currently this series still doesn't enable VFIO device support with a
vIOMMU present. Current problems:

* vfio_iommu_map_notify is not aware of the memory range belonging to a
  specific VFIOGuestIOMMU.
* intel_iommu's replay op is not implemented yet (may come in a different
  patch set). The replay function is required for hotplugging VFIO devices
  and for moving devices between existing domains.

Changes from v1 to v2:
* Remove the assumption that the cache is never cleared.
* Fix a lockup under high load.

Changes from v2 to v3:
* Remove debug leftovers.
* Split into separate commits.
* Change is_write to flags in vtd_do_iommu_translate; add IOMMU_NO_FAIL to
  suppress errors propagating to the guest.

Changes from v3 to v4:
* Add a property to the intel_iommu device to control the CM capability,
  defaulting to false.
* Use s->iommu_ops.notify_flag_changed to register notifiers.

Changes from v4 to v4 RESEND:
* Fix coding style issues pointed out by the checkpatch.pl script.

Changes from v4 to v5:
* Reduce the number of changes in patch 2 and make flags a real bitfield.
* Restore deleted debug prints.
* Fix a memory leak in patch 3.

Changes from v5 to v6:
* Fix the prototype of the iommu_translate function for more IOMMU types.
* VFIO is notified only about the difference, without unmapping before
  changing maps.

Changes from v6 to v7:
* Add a replay op to iommu_ops, with the current behavior as the default
  implementation (patch 4).
* Add a stub to disable VFIO with intel_iommu support (patch 5).
* Cosmetic changes to other patches.
Aviv Ben-David (5):
  IOMMU: add option to enable VTD_CAP_CM to vIOMMU capability exposed to
    guest
  IOMMU: change iommu_op->translate's is_write to flags, add support to
    NO_FAIL flag mode
  IOMMU: enable intel_iommu map and unmap notifiers
  IOMMU: add specific replay function with default implementation
  IOMMU: add specific null implementation of iommu_replay to intel_iommu

 exec.c                         |   3 +-
 hw/alpha/typhoon.c             |   2 +-
 hw/i386/amd_iommu.c            |   4 +-
 hw/i386/intel_iommu.c          | 165 ++---
 hw/i386/intel_iommu_internal.h |   3 +
 hw/pci-host/apb.c              |   2 +-
 hw/ppc/spapr_iommu.c           |   2 +-
 hw/s390x/s390-pci-bus.c        |   2 +-
 include/exec/memory.h          |  10 ++-
 include/hw/i386/intel_iommu.h  |  11 +++
 memory.c                       |  11 ++-
 11 files changed, 177 insertions(+), 38 deletions(-)

--
1.9.1