Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications

2016-12-07 Thread Lan Tianyu
On 2016年12月08日 10:39, Peter Xu wrote:
> Btw, do you have time to help review my RFC series for "vt-d replay"?
> :) I'd like to receive any further review comments besides the issues
> mentioned above since IMHO they can be seperated from current series,
> I have noted them down in my pending list.
Sure. I have reviewed some of the patches. They look good to me :)

-- 
Best regards
Tianyu Lan



Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications

2016-12-07 Thread Peter Xu
On Wed, Dec 07, 2016 at 10:04:57PM +0800, Lan Tianyu wrote:

[...]

> >Even if the context entry is cleared and invalidated, IMHO it does not
> >mean that we should be using GPA address space, nor do we need to put
> >it into guest physical address space. Instead, it simply means this
> >device cannot do any IO at that time. If IO comes, IOMMU should do
> >fault reporting to guest OS, which should be treated as error.
> 
> Yes, that looks right and there will be fault event reported by pIOMMU
> if context entry is no present for DMA untranslated request. This goes back
> to the first gap to report pIOMMU fault event to vIOMMU.

Right.

> 
> For disabling via clearing DMA translation via gcmd.TE bit, assigned
> device should work after clearing operation and it's still necessary to
> restore GPA->HPA mapping since we can't assume guest won't clear the bit
> after enabling DMA translation.
> 
> This maybe low priority gap since Linux IOMMU driver don't disable DMA
> translation frequently or dynamically. But we also should consider the
> situation.

I see. That's something I missed indeed. Yes, we should support
disabling via gcmd.te. Also, it looks like we still don't support
passthrough mode translation yet. That's something we need too,
since without it we can't support iommu=pt.

I agree with you that we can postpone these tasks temporarily, just
like the error handling you mentioned.

> 
> >
> >So I think we are emulating the correct guest behavior here - we don't
> >need to do anything if a device is detached from an existing IOMMU
> >domain in guest. If we do (e.g., we replay the GPA address space on
> >that device when it is detached, so the shadow page table for that
> >device maps the whole guest memory), that is dangerous, because with
> >that the device can DMA to anywhere it wants to guest memory.
> 
> If guest want to disabling DMA translation, this is expected behavior
> and device model should follow guest configuration. This just likes most
> distributions don't enable VTD DMA translation by default and it's OS
> choice.

My previous comment only applies to the case when a device is moved
outside of an IOMMU domain. I agree with you on the rest.

Btw, do you have time to help review my RFC series for "vt-d replay"?
:) I'd like to receive any further review comments besides the issues
mentioned above, since IMHO they can be separated from the current series;
I have noted them down in my pending list.

(I think I need to test iommu=pt a bit more to make sure it works
 asap)

Thanks,

-- peterx



Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications

2016-12-07 Thread Lan Tianyu

On 2016年12月07日 14:43, Peter Xu wrote:
> On Wed, Dec 07, 2016 at 02:09:16PM +0800, Lan Tianyu wrote:
>> On 2016年12月06日 18:59, Peter Xu wrote:
>>> On Tue, Dec 06, 2016 at 04:27:39PM +0800, Lan Tianyu wrote:
>>>
>>> [...]
>>>
>>>>>> User space driver(E.G DPDK) also can enable/disable
>>>>>> IOVA for device dynamically.
>>>>>
>>>>> Could you provide more detailed (or any pointer) on how to do that? I
>>>>> did try to find it myself, I see an VFIO_IOMMU_ENABLE ioctl, but looks
>>>>> like it is for ppc only.
>>>>
>>>> No, I just give an example that user space may do that but no more
>>>> research. But since Qemu already can enable device's IOVA, other user
>>>> application also should can do that with the same VFIO interface, right?
>>>
>>> AFAIU we can't do that at least on x86. We can use vfio interface to
>>> bind group into container, but we should not be able to dynamically
>>> disable IOMMU protection. IIUC That needs to taint the kernel.
>>>
>>> The only way I know is that we probe vfio-pci with no-iommu mode, in
>>> that case, we disabled IOMMU, but we can never dynamically enable it
>>> as well.
>>>
>>> Please correct me if I am wrong.
>>
>> Actually, disabling device's IOVA doesn't require to disable kernel
>> global DMA protect and just clear device's VTD context entry in the
>> context table. Go though IOMMU and VFIO code, find this will happen when
>> call VFIO_GROUP_UNSET_CONTAINER ioctl and it will be called when destroy
>> VM or unplug assigned device in Qemu. Please help to double check.
>>
>> Call trace:
>> __vfio_group_unset_container()
>> vfio_iommu_type1_detach_group()
>> iommu_detach_group()
>> dmar_remove_one_dev_info()
>> __dmar_remove_one_dev_info()
>> domain_context_clear()
>>
>> The legacy KVM device assign code also will call iommu_detach_device()
>> when deassign a device.
>>
>> From device emulation view, we need to make sure correct register
>> emulation regardless of guest behavior.
>
> Even if the context entry is cleared and invalidated, IMHO it does not
> mean that we should be using GPA address space, nor do we need to put
> it into guest physical address space. Instead, it simply means this
> device cannot do any IO at that time. If IO comes, IOMMU should do
> fault reporting to guest OS, which should be treated as error.

Yes, that looks right, and the pIOMMU will report a fault event if the
context entry is not present for an untranslated DMA request. This goes
back to the first gap: reporting pIOMMU fault events to the vIOMMU.

As for disabling DMA translation by clearing the gcmd.TE bit, the
assigned device should keep working after the clearing operation, so it
is still necessary to restore the GPA->HPA mapping, since we can't
assume the guest won't clear the bit after enabling DMA translation.

This may be a low-priority gap, since the Linux IOMMU driver doesn't
disable DMA translation frequently or dynamically, but we should still
consider the situation.

> So I think we are emulating the correct guest behavior here - we don't
> need to do anything if a device is detached from an existing IOMMU
> domain in guest. If we do (e.g., we replay the GPA address space on
> that device when it is detached, so the shadow page table for that
> device maps the whole guest memory), that is dangerous, because with
> that the device can DMA to anywhere it wants to guest memory.

If the guest wants to disable DMA translation, that is expected behavior
and the device model should follow the guest configuration. This is just
like the fact that most distributions don't enable VT-d DMA translation
by default; it's the OS's choice.

--
Best regards
Tianyu Lan



Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications

2016-12-06 Thread Peter Xu
On Wed, Dec 07, 2016 at 02:09:16PM +0800, Lan Tianyu wrote:
> On 2016年12月06日 18:59, Peter Xu wrote:
> > On Tue, Dec 06, 2016 at 04:27:39PM +0800, Lan Tianyu wrote:
> > 
> > [...]
> > 
> >>>
>  User space driver(E.G DPDK) also can enable/disable
>  IOVA for device dynamically.
> >>>
> >>> Could you provide more detailed (or any pointer) on how to do that? I
> >>> did try to find it myself, I see an VFIO_IOMMU_ENABLE ioctl, but looks
> >>> like it is for ppc only.
> >>
> >> No, I just give an example that user space may do that but no more
> >> research. But since Qemu already can enable device's IOVA, other user
> >> application also should can do that with the same VFIO interface, right?
> > 
> > AFAIU we can't do that at least on x86. We can use vfio interface to
> > bind group into container, but we should not be able to dynamically
> > disable IOMMU protection. IIUC That needs to taint the kernel.
> > 
> > The only way I know is that we probe vfio-pci with no-iommu mode, in
> > that case, we disabled IOMMU, but we can never dynamically enable it
> > as well.
> > 
> > Please correct me if I am wrong.
> 
> 
> Actually, disabling device's IOVA doesn't require to disable kernel
> global DMA protect and just clear device's VTD context entry in the
> context table. Go though IOMMU and VFIO code, find this will happen when
> call VFIO_GROUP_UNSET_CONTAINER ioctl and it will be called when destroy
> VM or unplug assigned device in Qemu. Please help to double check.
> 
> Call trace:
> __vfio_group_unset_container()
> vfio_iommu_type1_detach_group()
> iommu_detach_group()
> dmar_remove_one_dev_info()
> __dmar_remove_one_dev_info()
> domain_context_clear()
> 
> 
> The legacy KVM device assign code also will call iommu_detach_device()
> when deassign a device.
> 
> From device emulation view, we need to make sure correct register
> emulation regardless of guest behavior.

Even if the context entry is cleared and invalidated, IMHO it does not
mean that we should be using the GPA address space, nor do we need to put
it into the guest physical address space. Instead, it simply means this
device cannot do any IO at that time. If IO comes, the IOMMU should do
fault reporting to the guest OS, which should be treated as an error.

So I think we are emulating the correct guest behavior here - we don't
need to do anything if a device is detached from an existing IOMMU
domain in the guest. If we do (e.g., we replay the GPA address space on
that device when it is detached, so the shadow page table for that
device maps the whole guest memory), that is dangerous, because then
the device can DMA anywhere it wants in guest memory.

Thanks,

-- peterx



Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications

2016-12-06 Thread Lan Tianyu
On 2016年12月06日 18:59, Peter Xu wrote:
> On Tue, Dec 06, 2016 at 04:27:39PM +0800, Lan Tianyu wrote:
> 
> [...]
> 
>>>
 User space driver(E.G DPDK) also can enable/disable
 IOVA for device dynamically.
>>>
>>> Could you provide more detailed (or any pointer) on how to do that? I
>>> did try to find it myself, I see an VFIO_IOMMU_ENABLE ioctl, but looks
>>> like it is for ppc only.
>>
>> No, I just give an example that user space may do that but no more
>> research. But since Qemu already can enable device's IOVA, other user
>> application also should can do that with the same VFIO interface, right?
> 
> AFAIU we can't do that at least on x86. We can use vfio interface to
> bind group into container, but we should not be able to dynamically
> disable IOMMU protection. IIUC That needs to taint the kernel.
> 
> The only way I know is that we probe vfio-pci with no-iommu mode, in
> that case, we disabled IOMMU, but we can never dynamically enable it
> as well.
> 
> Please correct me if I am wrong.


Actually, disabling a device's IOVA doesn't require disabling the kernel's
global DMA protection; it only clears the device's VT-d context entry in
the context table. Going through the IOMMU and VFIO code, I find this
happens when the VFIO_GROUP_UNSET_CONTAINER ioctl is called, which QEMU
does when destroying a VM or unplugging an assigned device. Please help
to double check.

Call trace:
__vfio_group_unset_container()
vfio_iommu_type1_detach_group()
iommu_detach_group()
dmar_remove_one_dev_info()
__dmar_remove_one_dev_info()
domain_context_clear()
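
For reference, a minimal userspace sketch of the VFIO sequence whose
teardown step ends in the call trace above (error handling and the usual
status checks omitted; the group number is just an example):

#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

int main(void)
{
    int container = open("/dev/vfio/vfio", O_RDWR);
    int group = open("/dev/vfio/26", O_RDWR);      /* example group number */

    /* Attach: roughly what QEMU's vfio code does at device realize time. */
    ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
    ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);
    /* ... VFIO_IOMMU_MAP_DMA calls then populate the pIOMMU domain ... */

    /* Detach: enters __vfio_group_unset_container() and eventually
     * domain_context_clear(), i.e. the device's context entry is cleared. */
    ioctl(group, VFIO_GROUP_UNSET_CONTAINER);
    return 0;
}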


The legacy KVM device assignment code also calls iommu_detach_device()
when deassigning a device.

From the device emulation point of view, we need to make sure register
emulation is correct regardless of guest behavior.

> 
> Thanks,
> 
> -- peterx
> 


-- 
Best regards
Tianyu Lan



Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications

2016-12-06 Thread Alex Williamson
On Tue, 6 Dec 2016 18:59:14 +0800
Peter Xu  wrote:

> On Tue, Dec 06, 2016 at 04:27:39PM +0800, Lan Tianyu wrote:
> 
> [...]
> 
> > >   
> > >> User space driver(E.G DPDK) also can enable/disable
> > >> IOVA for device dynamically.  
> > > 
> > > Could you provide more detailed (or any pointer) on how to do that? I
> > > did try to find it myself, I see an VFIO_IOMMU_ENABLE ioctl, but looks
> > > like it is for ppc only.  
> > 
> > No, I just give an example that user space may do that but no more
> > research. But since Qemu already can enable device's IOVA, other user
> > application also should can do that with the same VFIO interface, right?  
> 
> AFAIU we can't do that at least on x86. We can use vfio interface to
> bind group into container, but we should not be able to dynamically
> disable IOMMU protection. IIUC That needs to taint the kernel.
> 
> The only way I know is that we probe vfio-pci with no-iommu mode, in
> that case, we disabled IOMMU, but we can never dynamically enable it
> as well.
> 
> Please correct me if I am wrong.

A vfio user such as QEMU enabling or disabling translation in the
vIOMMU is "simply" assigning the vfio container to the system
AddressSpace (as we do w/o vIOMMU) or to the device-specific IOMMU-domain
AddressSpace.  The device must be protected by the pIOMMU domain
(container) at all times.  Thanks,

Alex



Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications

2016-12-06 Thread Peter Xu
On Tue, Dec 06, 2016 at 04:27:39PM +0800, Lan Tianyu wrote:

[...]

> > 
> >> User space driver(E.G DPDK) also can enable/disable
> >> IOVA for device dynamically.
> > 
> > Could you provide more detailed (or any pointer) on how to do that? I
> > did try to find it myself, I see an VFIO_IOMMU_ENABLE ioctl, but looks
> > like it is for ppc only.
> 
> No, I just give an example that user space may do that but no more
> research. But since Qemu already can enable device's IOVA, other user
> application also should can do that with the same VFIO interface, right?

AFAIU we can't do that, at least on x86. We can use the vfio interface to
bind a group into a container, but we should not be able to dynamically
disable IOMMU protection. IIUC that needs to taint the kernel.

The only way I know of is to probe vfio-pci in no-iommu mode; in that
case the IOMMU is disabled, but we can never dynamically enable it
either.

Please correct me if I am wrong.

Thanks,

-- peterx



Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications

2016-12-06 Thread Lan Tianyu
On 2016年12月06日 15:22, Peter Xu wrote:
> On Tue, Dec 06, 2016 at 03:06:20PM +0800, Lan Tianyu wrote:
>> On 2016年12月06日 14:51, Peter Xu wrote:
>>> On Tue, Dec 06, 2016 at 02:30:24PM +0800, Lan Tianyu wrote:
>>>
>>> [...]
>>>
>> 2. How to restore GPA->HPA mapping when IOVA is disabled by guest.
>> When guest enables IOVA for device, vIOMMU will invalidate all previous
>> GPA->HPA mapping and update IOVA->HPA mapping to pIOMMU via iommu
>> notifier. But if IOVA is disabled, I think we should restore GPA->HPA
>> mapping for the device otherwise the device won't work again in the VM.
>
> If we can have a workable replay mechanism, this problem will be
> solved IMHO.

 Basic idea is to replay related memory regions to restore GPA->HPA
 mapping when guest disables IOVA.
>>>
>>> Btw, could you help elaborate in what case we will trigger such a
>>> condition? Or, can we dynamically enable/disable an IOMMU?
>>
>> First case I think of is nest virtualization and we assign physical
>> device to l2 guest.
> 
> If we assign physical device to l2 guest, then it will belongs to an
> IOMMU domain in either l1 guest and host, right? (assuming we are
> using vfio-pci to do device assignment)

Yes.

> 
>> User space driver(E.G DPDK) also can enable/disable
>> IOVA for device dynamically.
> 
> Could you provide more detailed (or any pointer) on how to do that? I
> did try to find it myself, I see an VFIO_IOMMU_ENABLE ioctl, but looks
> like it is for ppc only.

No, I just gave an example that user space may do that; I haven't done
more research. But since QEMU already can enable a device's IOVA, other
user space applications should also be able to do that with the same VFIO
interface, right?

-- 
Best regards
Tianyu Lan



Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications

2016-12-05 Thread Peter Xu
On Tue, Dec 06, 2016 at 03:06:20PM +0800, Lan Tianyu wrote:
> On 2016年12月06日 14:51, Peter Xu wrote:
> > On Tue, Dec 06, 2016 at 02:30:24PM +0800, Lan Tianyu wrote:
> > 
> > [...]
> > 
>  2. How to restore GPA->HPA mapping when IOVA is disabled by guest.
>  When guest enables IOVA for device, vIOMMU will invalidate all previous
>  GPA->HPA mapping and update IOVA->HPA mapping to pIOMMU via iommu
>  notifier. But if IOVA is disabled, I think we should restore GPA->HPA
>  mapping for the device otherwise the device won't work again in the VM.
> >>>
> >>> If we can have a workable replay mechanism, this problem will be
> >>> solved IMHO.
> >>
> >> Basic idea is to replay related memory regions to restore GPA->HPA
> >> mapping when guest disables IOVA.
> > 
> > Btw, could you help elaborate in what case we will trigger such a
> > condition? Or, can we dynamically enable/disable an IOMMU?
> 
> First case I think of is nest virtualization and we assign physical
> device to l2 guest.

If we assign a physical device to an L2 guest, then it will belong to an
IOMMU domain in both the L1 guest and the host, right? (assuming we are
using vfio-pci to do device assignment)

> User space driver(E.G DPDK) also can enable/disable
> IOVA for device dynamically.

Could you provide more details (or any pointer) on how to do that? I
did try to find it myself; I see a VFIO_IOMMU_ENABLE ioctl, but it looks
like it is for ppc only.

Thanks,

-- peterx



Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications

2016-12-05 Thread Lan Tianyu
On 2016年12月06日 14:51, Peter Xu wrote:
> On Tue, Dec 06, 2016 at 02:30:24PM +0800, Lan Tianyu wrote:
> 
> [...]
> 
 2. How to restore GPA->HPA mapping when IOVA is disabled by guest.
 When guest enables IOVA for device, vIOMMU will invalidate all previous
 GPA->HPA mapping and update IOVA->HPA mapping to pIOMMU via iommu
 notifier. But if IOVA is disabled, I think we should restore GPA->HPA
 mapping for the device otherwise the device won't work again in the VM.
>>>
>>> If we can have a workable replay mechanism, this problem will be
>>> solved IMHO.
>>
>> Basic idea is to replay related memory regions to restore GPA->HPA
>> mapping when guest disables IOVA.
> 
> Btw, could you help elaborate in what case we will trigger such a
> condition? Or, can we dynamically enable/disable an IOMMU?

The first case I can think of is nested virtualization, where we assign a
physical device to an L2 guest. A user space driver (e.g. DPDK) can also
enable/disable IOVA for a device dynamically.

-- 
Best regards
Tianyu Lan



Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications

2016-12-05 Thread Peter Xu
On Tue, Dec 06, 2016 at 02:30:24PM +0800, Lan Tianyu wrote:

[...]

> >> 2. How to restore GPA->HPA mapping when IOVA is disabled by guest.
> >> When guest enables IOVA for device, vIOMMU will invalidate all previous
> >> GPA->HPA mapping and update IOVA->HPA mapping to pIOMMU via iommu
> >> notifier. But if IOVA is disabled, I think we should restore GPA->HPA
> >> mapping for the device otherwise the device won't work again in the VM.
> > 
> > If we can have a workable replay mechanism, this problem will be
> > solved IMHO.
> 
> Basic idea is to replay related memory regions to restore GPA->HPA
> mapping when guest disables IOVA.

Btw, could you help elaborate on the cases in which we would trigger such
a condition? Or, can we dynamically enable/disable an IOMMU?

Thanks,

-- peterx



Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications

2016-12-05 Thread Lan Tianyu
On 2016年12月02日 14:52, Peter Xu wrote:
> On Thu, Dec 01, 2016 at 02:44:14PM +0800, Lan Tianyu wrote:
> 
> [...]
> 
>> Hi:
>> I think there are still other gaps to enable passthough device with
>> vIOMMU's DMA translation support.
>>
>> 1. Since this patchset is to shadow guest IO page table to
>> pIOMMU(physical IOMMU) vfio_dma_map/umap(), there will be some fault
>> events from pIOMMU if guest os does misconfigurations. We should report
>> these fault events to guest. This means we need to pass the fault event
>> from pIOMMU driver to vIOMMU in Qemu. I suppose a channel in VFIO should
>> be added to connect pIOMMU and vIOMMU.
> 
> Thanks for raising this up - IMHO this is a good question.
> 
>>
>> The task should be divided into three parts
>> 1) pIOMMU driver reports fault events for vIOMMU via new VFIO interface
> 
> Here you mean "how host kernel capture the DMAR fault", right?

Yes, the VFIO kernel driver should be informed of the fault event when it's triggered.

> 
> IMHO We can have something like notifier/notifiee as well in DMAR
> fault reporting - people (like vfio driver in kernel) can register to
> fault reports related to specific device. When DMAR receives faults
> for those devices, it triggers the notification list, then vfio can be
> notified.

Yes, something like this.


> 
>> 2) Add new channel in VFIO subsystem to connect pIOMMU driver and
>> vIOMMU in Qemu
>> 3) vIOMMU in Qemu get fault event from VFIO subsystem in Qemu and inject
>> virtual fault event to guest.
>>
>> Such VFIO channel is also required by device's PRS(Page Request
>> Services) support. This is also a part of SVM(Shared virtual memory)
>> support in VM. Here is SVM design doc link.
> http://marc.info/?l=kvm&m=148049586514120&w=2
>>
>> 2. How to restore GPA->HPA mapping when IOVA is disabled by guest.
>> When guest enables IOVA for device, vIOMMU will invalidate all previous
>> GPA->HPA mapping and update IOVA->HPA mapping to pIOMMU via iommu
>> notifier. But if IOVA is disabled, I think we should restore GPA->HPA
>> mapping for the device otherwise the device won't work again in the VM.
> 
> If we can have a workable replay mechanism, this problem will be
> solved IMHO.

The basic idea is to replay the related memory regions to restore the
GPA->HPA mapping when the guest disables IOVA.
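
To illustrate the idea, here is a small self-contained sketch (not the
QEMU implementation, just an assumed shape of the callbacks) of a
page-by-page replay over an IOMMU region:

#include <stdint.h>
#include <stdbool.h>

#define REPLAY_PAGE_SIZE 4096ULL

typedef struct {
    uint64_t iova;         /* input address within the region        */
    uint64_t translated;   /* address it currently maps to           */
    bool     valid;        /* false if there is no mapping for iova  */
} TlbEntry;

/* Stand-ins for the vIOMMU translate op and the MAP notifier
 * (i.e. whatever ends up calling vfio_dma_map()). */
typedef TlbEntry (*translate_fn)(uint64_t iova);
typedef void (*map_notify_fn)(const TlbEntry *entry);

/* Walk an IOVA range page by page and re-issue MAP notifications for
 * every valid translation -- e.g. to rebuild the GPA->HPA shadow
 * mappings after the guest turns DMA translation off again. */
static void replay_region(uint64_t start, uint64_t size,
                          translate_fn translate, map_notify_fn notify)
{
    for (uint64_t iova = start; iova < start + size;
         iova += REPLAY_PAGE_SIZE) {
        TlbEntry entry = translate(iova);
        if (entry.valid) {
            notify(&entry);
        }
    }
}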


-- 
Best regards
Tianyu Lan



Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications

2016-12-05 Thread Peter Xu
On Fri, Dec 02, 2016 at 10:30:34AM -0700, Alex Williamson wrote:

[...]

> On the host though, I'd expect we still have separate IOMMU domains,
> one for each device and we do DMA_{UN}MAP ioctls separately per
> container.  Thanks,

I have an RFC version of the replay implementation that uses that approach
(each device has its own address space, even if more than one device is
in the same IOMMU domain). I'll arrange the patches and post them soon.
Let's see whether that's a valid approach to fix the existing issues for
vfio enablement on vt-d.

Thanks,

-- peterx



Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications

2016-12-05 Thread Lan, Tianyu
On 12/3/2016 1:30 AM, Alex Williamson wrote:
> On Fri, 2 Dec 2016 14:08:59 +0800
> Peter Xu  wrote:
>
>> On Thu, Dec 01, 2016 at 04:27:52PM +0800, Lan Tianyu wrote:
>>> On 2016年11月30日 17:23, Peter Xu wrote:
 On Mon, Nov 28, 2016 at 05:51:50PM +0200, Aviv B.D wrote:
> * intel_iommu's replay op is not implemented yet (May come in different 
> patch
>   set).
>   The replay function is required for hotplug vfio device and to move 
> devices
>   between existing domains.

 I am thinking about this replay thing recently and now I start to
 doubt whether the whole vt-d vIOMMU framework suites this...

 Generally speaking, current work is throwing away the IOMMU "domain"
 layer here. We maintain the mapping only per device, and we don't care
 too much about which domain it belongs. This seems problematic.

 A simplest wrong case for this is (let's assume cache-mode is
 enabled): if we have two assigned devices A and B, both belong to the
 same domain 1. Meanwhile, in domain 1 assume we have one mapping which
 is the first page (iova range 0-0xfff). Then, if guest wants to
 invalidate the page, it'll notify VT-d vIOMMU with an invalidation
 message. If we do this invalidation per-device, we'll need to UNMAP
 the region twice - once for A, once for B (if we have more devices, we
 will unmap more times), and we can never know we have done duplicated
 work since we don't keep domain info, so we don't know they are using
 the same address space. The first unmap will work, and then we'll
 possibly get some errors on the rest of dma unmap failures.
>>>
>>>
>>> Hi Peter:
>>
>> Hi, Tianyu,
>>
>>> According VTD spec 6.2.2.1, "Software must ensure that, if multiple
>>> context-entries (or extended-context-entries) are programmed
>>> with the same Domain-id (DID), such entries must be programmed with same
>>> value for the secondlevel page-table pointer (SLPTPTR) field, and same
>>> value for the PASID Table Pointer (PASIDTPTR) field.".
>>>
>>> So if two assigned device may have different IO page table, they should
>>> be put into different domains.
>>
>>
>> By default, devices will be put into different domains. However it
>> should be legal that we put two assigned devices into the same IOMMU
>> domain (in the guest), right? And we should handle both cases well
>> IMHO.
>>
>> Actually I just wrote a tool to do it based on vfio-pci:
>>
>>   
>> https://github.com/xzpeter/clibs/blob/master/gpl/userspace/vfio-bind-group/vfio-bind-group.c
>>
>> If we run this tool in the guest with parameter like:
>>
>>   ./vfio-bind-groups 00:03.0 00:04.0
>>
>> It'll create one single domain, and put PCI device 00:03.0, 00:04.0
>> into the same IOMMU domain.
>
> On the host though, I'd expect we still have separate IOMMU domains,
> one for each device and we do DMA_{UN}MAP ioctls separately per
> container.  Thanks,

Agree. The guest may use different IO page tables for multiple assigned
devices, and this requires putting the assigned devices into different
VT-d domains on the host. I think we can't put assigned devices into the
same VT-d domain before enabling dynamic containers.



Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications

2016-12-02 Thread Alex Williamson
On Fri, 2 Dec 2016 14:08:59 +0800
Peter Xu  wrote:

> On Thu, Dec 01, 2016 at 04:27:52PM +0800, Lan Tianyu wrote:
> > On 2016年11月30日 17:23, Peter Xu wrote:  
> > > On Mon, Nov 28, 2016 at 05:51:50PM +0200, Aviv B.D wrote:  
> > >> * intel_iommu's replay op is not implemented yet (May come in different 
> > >> patch 
> > >>   set).
> > >>   The replay function is required for hotplug vfio device and to move 
> > >> devices 
> > >>   between existing domains.  
> > > 
> > > I am thinking about this replay thing recently and now I start to
> > > doubt whether the whole vt-d vIOMMU framework suites this...
> > > 
> > > Generally speaking, current work is throwing away the IOMMU "domain"
> > > layer here. We maintain the mapping only per device, and we don't care
> > > too much about which domain it belongs. This seems problematic.
> > > 
> > > A simplest wrong case for this is (let's assume cache-mode is
> > > enabled): if we have two assigned devices A and B, both belong to the
> > > same domain 1. Meanwhile, in domain 1 assume we have one mapping which
> > > is the first page (iova range 0-0xfff). Then, if guest wants to
> > > invalidate the page, it'll notify VT-d vIOMMU with an invalidation
> > > message. If we do this invalidation per-device, we'll need to UNMAP
> > > the region twice - once for A, once for B (if we have more devices, we
> > > will unmap more times), and we can never know we have done duplicated
> > > work since we don't keep domain info, so we don't know they are using
> > > the same address space. The first unmap will work, and then we'll
> > > possibly get some errors on the rest of dma unmap failures.  
> > 
> > 
> > Hi Peter:  
> 
> Hi, Tianyu,
> 
> > According VTD spec 6.2.2.1, "Software must ensure that, if multiple
> > context-entries (or extended-context-entries) are programmed
> > with the same Domain-id (DID), such entries must be programmed with same
> > value for the secondlevel page-table pointer (SLPTPTR) field, and same
> > value for the PASID Table Pointer (PASIDTPTR) field.".
> > 
> > So if two assigned device may have different IO page table, they should
> > be put into different domains.  
> 
> 
> By default, devices will be put into different domains. However it
> should be legal that we put two assigned devices into the same IOMMU
> domain (in the guest), right? And we should handle both cases well
> IMHO.
> 
> Actually I just wrote a tool to do it based on vfio-pci:
> 
>   
> https://github.com/xzpeter/clibs/blob/master/gpl/userspace/vfio-bind-group/vfio-bind-group.c
> 
> If we run this tool in the guest with parameter like:
> 
>   ./vfio-bind-groups 00:03.0 00:04.0
> 
> It'll create one single domain, and put PCI device 00:03.0, 00:04.0
> into the same IOMMU domain.

On the host though, I'd expect we still have separate IOMMU domains,
one for each device and we do DMA_{UN}MAP ioctls separately per
container.  Thanks,

Alex



Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications

2016-12-02 Thread Alex Williamson
On Fri, 2 Dec 2016 13:59:25 +0800
Peter Xu  wrote:

> On Thu, Dec 01, 2016 at 04:21:38AM +, Tian, Kevin wrote:
> > > From: Peter Xu
> > > Sent: Wednesday, November 30, 2016 5:24 PM
> > > 
> > > On Mon, Nov 28, 2016 at 05:51:50PM +0200, Aviv B.D wrote:  
> > > > * intel_iommu's replay op is not implemented yet (May come in different 
> > > > patch
> > > >   set).
> > > >   The replay function is required for hotplug vfio device and to move 
> > > > devices
> > > >   between existing domains.  
> > > 
> > > I am thinking about this replay thing recently and now I start to
> > > doubt whether the whole vt-d vIOMMU framework suites this...
> > > 
> > > Generally speaking, current work is throwing away the IOMMU "domain"
> > > layer here. We maintain the mapping only per device, and we don't care
> > > too much about which domain it belongs. This seems problematic.
> > > 
> > > A simplest wrong case for this is (let's assume cache-mode is
> > > enabled): if we have two assigned devices A and B, both belong to the
> > > same domain 1. Meanwhile, in domain 1 assume we have one mapping which
> > > is the first page (iova range 0-0xfff). Then, if guest wants to
> > > invalidate the page, it'll notify VT-d vIOMMU with an invalidation
> > > message. If we do this invalidation per-device, we'll need to UNMAP
> > > the region twice - once for A, once for B (if we have more devices, we
> > > will unmap more times), and we can never know we have done duplicated
> > > work since we don't keep domain info, so we don't know they are using
> > > the same address space. The first unmap will work, and then we'll
> > > possibly get some errors on the rest of dma unmap failures.  
> > 
> > Tianyu and I discussed there is a bigger problem: today VFIO assumes 
> > only one address space per container, which is fine w/o vIOMMU (all devices 
> > in 
> > same container share same GPA->HPA translation). However it's not the case
> > when vIOMMU is enabled, because guest Linux implements per-device 
> > IOVA space. If a VFIO container includes multiple devices, it means 
> > multiple address spaces required per container...  
> 
> IIUC the vfio container is created in:
> 
>   vfio_realize
>   vfio_get_group
>   vfio_connect_container
> 
> Along the way (for vfio_get_group()), we have:
> 
>   group = vfio_get_group(groupid, pci_device_iommu_address_space(pdev), errp);
> 
> Here the address space is per device. If without vIOMMU, they will be
> pointed to the same system address space. However if with vIOMMU,
> that address space will be per-device, no?

Correct, with VT-d present, there will be a separate AddressSpace per
device, so each device will be placed into a separate container.  This
is currently the only way to provide the flexibility that those
separate devices can be attached to different domains in the guest.  It
also automatically faults when devices share an iommu group on the
host but the guest attempts to use separate AddressSpaces.

Trouble comes when the guest is booted with iommu=pt as each container
will need to map the full guest memory, yet each container is accounted
separately for locked memory.  libvirt doesn't account for
$NUM_HOSTDEVS x $VM_MEM_SIZE for locked memory.
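For example, with four assigned devices and a 16GB guest, the containers
together would need roughly 64GB of locked-memory allowance even though
the guest itself only has 16GB.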

Ideally we could be more flexible with dynamic containers, but it's not
currently an option to move a group from one container to another w/o
first closing all the devices within the group.  Thanks,

Alex



Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications

2016-12-01 Thread Peter Xu
On Fri, Dec 02, 2016 at 06:23:52AM +, Tian, Kevin wrote:

[...]

> > IIUC the vfio container is created in:
> > 
> >   vfio_realize
> >   vfio_get_group
> >   vfio_connect_container
> > 
> > Along the way (for vfio_get_group()), we have:
> > 
> >   group = vfio_get_group(groupid, pci_device_iommu_address_space(pdev), 
> > errp);
> > 
> > Here the address space is per device. If without vIOMMU, they will be
> > pointed to the same system address space. However if with vIOMMU,
> > that address space will be per-device, no?
> > 
> 
> yes, I didn't note that fact. Tianyu also pointed it out in his reply. :-)

Sorry I didn't see that just now. Thanks for the confirmation. :)

Thanks,

-- peterx



Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications

2016-12-01 Thread Peter Xu
On Thu, Dec 01, 2016 at 02:44:14PM +0800, Lan Tianyu wrote:

[...]

> Hi:
> I think there are still other gaps to enable passthough device with
> vIOMMU's DMA translation support.
> 
> 1. Since this patchset is to shadow guest IO page table to
> pIOMMU(physical IOMMU) vfio_dma_map/umap(), there will be some fault
> events from pIOMMU if guest os does misconfigurations. We should report
> these fault events to guest. This means we need to pass the fault event
> from pIOMMU driver to vIOMMU in Qemu. I suppose a channel in VFIO should
> be added to connect pIOMMU and vIOMMU.

Thanks for raising this up - IMHO this is a good question.

> 
> The task should be divided into three parts
> 1) pIOMMU driver reports fault events for vIOMMU via new VFIO interface

Here you mean "how the host kernel captures the DMAR fault", right?

IMHO we can have something like a notifier/notifiee mechanism in DMAR
fault reporting as well: consumers (like the vfio driver in the kernel)
can register for fault reports related to a specific device. When DMAR
receives faults for those devices, it walks the notification list, and
vfio gets notified.
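
A rough sketch of what such an interface could look like on the kernel
side, using the standard Linux notifier chains (this interface does not
exist today; the names and the fault payload below are made up for
illustration):

#include <linux/types.h>
#include <linux/notifier.h>
#include <linux/export.h>

/* Hypothetical payload handed to DMAR fault listeners. */
struct dmar_fault_event {
    u16 source_id;     /* PCI requester id of the faulting device */
    u64 addr;          /* faulting DMA address                    */
    u8  reason;        /* fault reason code from the fault record */
};

static ATOMIC_NOTIFIER_HEAD(dmar_fault_chain);

/* vfio (or any other consumer) registers here for DMAR faults. */
int dmar_fault_register_notifier(struct notifier_block *nb)
{
    return atomic_notifier_chain_register(&dmar_fault_chain, nb);
}
EXPORT_SYMBOL_GPL(dmar_fault_register_notifier);

/* Called from the existing fault-reporting IRQ path per fault record;
 * each listener filters on source_id for the devices it cares about. */
void dmar_fault_notify(u16 source_id, u64 addr, u8 reason)
{
    struct dmar_fault_event ev = {
        .source_id = source_id,
        .addr      = addr,
        .reason    = reason,
    };

    atomic_notifier_call_chain(&dmar_fault_chain, 0, &ev);
}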

> 2) Add new channel in VFIO subsystem to connect pIOMMU driver and
> vIOMMU in Qemu
> 3) vIOMMU in Qemu get fault event from VFIO subsystem in Qemu and inject
> virtual fault event to guest.
> 
> Such VFIO channel is also required by device's PRS(Page Request
> Services) support. This is also a part of SVM(Shared virtual memory)
> support in VM. Here is SVM design doc link.
> http://marc.info/?l=kvm&m=148049586514120&w=2
> 
> 2. How to restore GPA->HPA mapping when IOVA is disabled by guest.
> When guest enables IOVA for device, vIOMMU will invalidate all previous
> GPA->HPA mapping and update IOVA->HPA mapping to pIOMMU via iommu
> notifier. But if IOVA is disabled, I think we should restore GPA->HPA
> mapping for the device otherwise the device won't work again in the VM.

If we can have a workable replay mechanism, this problem will be
solved IMHO. A more general issue is when we move one device into an
existing domain (no matter whether that is the system address space or
another existing IOMMU domain): we need to unmap the pages of the old
domain and map in the pages of the new one. But this is not the best
solution IMHO. Its problem is that we may end up maintaining several
containers that have exactly the same shadow page tables. The best
solution is to have a domain layer, with an address space (container)
for each domain (a rough sketch follows the list below). Then:

- when a new domain is added: we create a new container for this new
  domain, and rebuild the shadow page table in that domain if there is
  any

- when a device moves domain (either into/out of the system address
  space, or into another IOMMU domain): we can just put that device
  (i.e., its corresponding group) into the existing container that the
  new domain belongs to
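
A minimal, self-contained sketch of the bookkeeping this would need (the
names are illustrative only, not QEMU APIs): a registry keyed by guest
domain-id that hands out one container/address space per domain, plus a
helper that moves a device into the container of its new domain.

#include <stdint.h>
#include <stdlib.h>

/* Illustrative stand-in for a VFIO container / per-domain AddressSpace. */
typedef struct Container {
    uint16_t did;          /* guest domain-id this container backs */
    /* shadow mappings for the domain would hang off here */
} Container;

static Container *domain_container[1 << 16];   /* did -> container       */
static Container *device_container[1 << 16];   /* source-id -> container */

/* Get (or lazily create) the container backing a guest IOMMU domain. */
static Container *domain_get_container(uint16_t did)
{
    if (!domain_container[did]) {
        domain_container[did] = calloc(1, sizeof(Container));
        domain_container[did]->did = did;
        /* here: replay/rebuild the shadow page table for this domain */
    }
    return domain_container[did];
}

/* Move a device (by source-id) into the container of its new domain. */
static void device_move_domain(uint16_t sid, uint16_t did)
{
    Container *target = domain_get_container(did);

    if (device_container[sid] == target) {
        return;                        /* already there, nothing to do */
    }
    /* here: detach the device's group from its old container and
     * attach it to 'target' */
    device_container[sid] = target;
}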

Thanks,

-- peterx



Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications

2016-12-01 Thread Tian, Kevin
> From: Peter Xu [mailto:pet...@redhat.com]
> Sent: Friday, December 02, 2016 1:59 PM
> 
> On Thu, Dec 01, 2016 at 04:21:38AM +, Tian, Kevin wrote:
> > > From: Peter Xu
> > > Sent: Wednesday, November 30, 2016 5:24 PM
> > >
> > > On Mon, Nov 28, 2016 at 05:51:50PM +0200, Aviv B.D wrote:
> > > > * intel_iommu's replay op is not implemented yet (May come in different 
> > > > patch
> > > >   set).
> > > >   The replay function is required for hotplug vfio device and to move 
> > > > devices
> > > >   between existing domains.
> > >
> > > I am thinking about this replay thing recently and now I start to
> > > doubt whether the whole vt-d vIOMMU framework suites this...
> > >
> > > Generally speaking, current work is throwing away the IOMMU "domain"
> > > layer here. We maintain the mapping only per device, and we don't care
> > > too much about which domain it belongs. This seems problematic.
> > >
> > > A simplest wrong case for this is (let's assume cache-mode is
> > > enabled): if we have two assigned devices A and B, both belong to the
> > > same domain 1. Meanwhile, in domain 1 assume we have one mapping which
> > > is the first page (iova range 0-0xfff). Then, if guest wants to
> > > invalidate the page, it'll notify VT-d vIOMMU with an invalidation
> > > message. If we do this invalidation per-device, we'll need to UNMAP
> > > the region twice - once for A, once for B (if we have more devices, we
> > > will unmap more times), and we can never know we have done duplicated
> > > work since we don't keep domain info, so we don't know they are using
> > > the same address space. The first unmap will work, and then we'll
> > > possibly get some errors on the rest of dma unmap failures.
> >
> > Tianyu and I discussed there is a bigger problem: today VFIO assumes
> > only one address space per container, which is fine w/o vIOMMU (all devices 
> > in
> > same container share same GPA->HPA translation). However it's not the case
> > when vIOMMU is enabled, because guest Linux implements per-device
> > IOVA space. If a VFIO container includes multiple devices, it means
> > multiple address spaces required per container...
> 
> IIUC the vfio container is created in:
> 
>   vfio_realize
>   vfio_get_group
>   vfio_connect_container
> 
> Along the way (for vfio_get_group()), we have:
> 
>   group = vfio_get_group(groupid, pci_device_iommu_address_space(pdev), errp);
> 
> Here the address space is per device. If without vIOMMU, they will be
> pointed to the same system address space. However if with vIOMMU,
> that address space will be per-device, no?
> 

yes, I didn't note that fact. Tianyu also pointed it out in his reply. :-)

Thanks
Kevin


Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications

2016-12-01 Thread Peter Xu
On Thu, Dec 01, 2016 at 08:42:04AM -0700, Alex Williamson wrote:
> On Wed, 30 Nov 2016 17:23:59 +0800
> Peter Xu  wrote:
> 
> > On Mon, Nov 28, 2016 at 05:51:50PM +0200, Aviv B.D wrote:
> > > * intel_iommu's replay op is not implemented yet (May come in different 
> > > patch 
> > >   set).
> > >   The replay function is required for hotplug vfio device and to move 
> > > devices 
> > >   between existing domains.  
> > 
> > I am thinking about this replay thing recently and now I start to
> > doubt whether the whole vt-d vIOMMU framework suites this...
> > 
> > Generally speaking, current work is throwing away the IOMMU "domain"
> > layer here. We maintain the mapping only per device, and we don't care
> > too much about which domain it belongs. This seems problematic.
> > 
> > A simplest wrong case for this is (let's assume cache-mode is
> > enabled): if we have two assigned devices A and B, both belong to the
> > same domain 1. Meanwhile, in domain 1 assume we have one mapping which
> > is the first page (iova range 0-0xfff). Then, if guest wants to
> > invalidate the page, it'll notify VT-d vIOMMU with an invalidation
> > message. If we do this invalidation per-device, we'll need to UNMAP
> > the region twice - once for A, once for B (if we have more devices, we
> > will unmap more times), and we can never know we have done duplicated
> > work since we don't keep domain info, so we don't know they are using
> > the same address space. The first unmap will work, and then we'll
> > possibly get some errors on the rest of dma unmap failures.
> > 
> > Looks like we just cannot live without knowing this domain layer.
> > Because the address space is binded to the domain. If we want to sync
> > the address space (here to setup a correct shadow page table), we need
> > to do it per-domain.
> > 
> > What I can think of as a solution is that we introduce this "domain"
> > layer - like a memory region per domain. When invalidation happens,
> > it's per-domain, not per-device any more (actually I guess that's what
> > current vt-d iommu driver in kernel is doing, we just ignored it - we
> > fetch the devices that matches the domain ID). We can/need to maintain
> > something different, like sid <-> domain mappings (we can do this as
> > long as we are notified when context entries changed), per-domain
> > mappings (just like per-device mappings that we are trying to build in
> > this series, but what we really need is IMHO per domain one), etc.
> > When device switches domain, we switch the IOMMU memory region
> > accordingly.
> > 
> > Does this make any sense? Comments are greatly welcomed (especially
> > from AlexW and DavidG).
> 
> It's been a bit since I've looked at VT-d emulation, but I certainly
> remember that it's way more convoluted than I expected.  It seems like
> a domain should create an AddressSpace and any devices assigned to that
> domain should make use of that single address space, but IIRC VT-d
> creates an address space per device, ie. per context entry.

Yes, I think this idea (one address space per domain) came from one of
your replies in the past, and I just found it more essential than I
thought before.

I'll see whether I can sort that out before moving on to the replay
implementation, because IIUC the replay will depend on this (introducing
the domain layer in the VT-d IOMMU emulation).

Thanks!

-- peterx



Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications

2016-12-01 Thread Peter Xu
On Thu, Dec 01, 2016 at 04:27:52PM +0800, Lan Tianyu wrote:
> On 2016年11月30日 17:23, Peter Xu wrote:
> > On Mon, Nov 28, 2016 at 05:51:50PM +0200, Aviv B.D wrote:
> >> * intel_iommu's replay op is not implemented yet (May come in different 
> >> patch 
> >>   set).
> >>   The replay function is required for hotplug vfio device and to move 
> >> devices 
> >>   between existing domains.
> > 
> > I am thinking about this replay thing recently and now I start to
> > doubt whether the whole vt-d vIOMMU framework suites this...
> > 
> > Generally speaking, current work is throwing away the IOMMU "domain"
> > layer here. We maintain the mapping only per device, and we don't care
> > too much about which domain it belongs. This seems problematic.
> > 
> > A simplest wrong case for this is (let's assume cache-mode is
> > enabled): if we have two assigned devices A and B, both belong to the
> > same domain 1. Meanwhile, in domain 1 assume we have one mapping which
> > is the first page (iova range 0-0xfff). Then, if guest wants to
> > invalidate the page, it'll notify VT-d vIOMMU with an invalidation
> > message. If we do this invalidation per-device, we'll need to UNMAP
> > the region twice - once for A, once for B (if we have more devices, we
> > will unmap more times), and we can never know we have done duplicated
> > work since we don't keep domain info, so we don't know they are using
> > the same address space. The first unmap will work, and then we'll
> > possibly get some errors on the rest of dma unmap failures.
> 
> 
> Hi Peter:

Hi, Tianyu,

> According VTD spec 6.2.2.1, "Software must ensure that, if multiple
> context-entries (or extended-context-entries) are programmed
> with the same Domain-id (DID), such entries must be programmed with same
> value for the secondlevel page-table pointer (SLPTPTR) field, and same
> value for the PASID Table Pointer (PASIDTPTR) field.".
> 
> So if two assigned device may have different IO page table, they should
> be put into different domains.


By default, devices will be put into different domains. However it
should be legal that we put two assigned devices into the same IOMMU
domain (in the guest), right? And we should handle both cases well
IMHO.

Actually I just wrote a tool to do it based on vfio-pci:

  
https://github.com/xzpeter/clibs/blob/master/gpl/userspace/vfio-bind-group/vfio-bind-group.c

If we run this tool in the guest with parameter like:

  ./vfio-bind-groups 00:03.0 00:04.0

It'll create one single domain, and put PCI device 00:03.0, 00:04.0
into the same IOMMU domain.

Thanks,

-- peterx



Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications

2016-12-01 Thread Peter Xu
On Thu, Dec 01, 2016 at 04:21:38AM +, Tian, Kevin wrote:
> > From: Peter Xu
> > Sent: Wednesday, November 30, 2016 5:24 PM
> > 
> > On Mon, Nov 28, 2016 at 05:51:50PM +0200, Aviv B.D wrote:
> > > * intel_iommu's replay op is not implemented yet (May come in different 
> > > patch
> > >   set).
> > >   The replay function is required for hotplug vfio device and to move 
> > > devices
> > >   between existing domains.
> > 
> > I am thinking about this replay thing recently and now I start to
> > doubt whether the whole vt-d vIOMMU framework suites this...
> > 
> > Generally speaking, current work is throwing away the IOMMU "domain"
> > layer here. We maintain the mapping only per device, and we don't care
> > too much about which domain it belongs. This seems problematic.
> > 
> > A simplest wrong case for this is (let's assume cache-mode is
> > enabled): if we have two assigned devices A and B, both belong to the
> > same domain 1. Meanwhile, in domain 1 assume we have one mapping which
> > is the first page (iova range 0-0xfff). Then, if guest wants to
> > invalidate the page, it'll notify VT-d vIOMMU with an invalidation
> > message. If we do this invalidation per-device, we'll need to UNMAP
> > the region twice - once for A, once for B (if we have more devices, we
> > will unmap more times), and we can never know we have done duplicated
> > work since we don't keep domain info, so we don't know they are using
> > the same address space. The first unmap will work, and then we'll
> > possibly get some errors on the rest of dma unmap failures.
> 
> Tianyu and I discussed there is a bigger problem: today VFIO assumes 
> only one address space per container, which is fine w/o vIOMMU (all devices 
> in 
> same container share same GPA->HPA translation). However it's not the case
> when vIOMMU is enabled, because guest Linux implements per-device 
> IOVA space. If a VFIO container includes multiple devices, it means 
> multiple address spaces required per container...

IIUC the vfio container is created in:

  vfio_realize
  vfio_get_group
  vfio_connect_container

Along the way (for vfio_get_group()), we have:

  group = vfio_get_group(groupid, pci_device_iommu_address_space(pdev), errp);

Here the address space is per device. Without vIOMMU, they all point to
the same system address space. However, with vIOMMU, that address space
will be per-device, no?

-- peterx



Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications

2016-12-01 Thread Alex Williamson
On Wed, 30 Nov 2016 17:23:59 +0800
Peter Xu  wrote:

> On Mon, Nov 28, 2016 at 05:51:50PM +0200, Aviv B.D wrote:
> > * intel_iommu's replay op is not implemented yet (May come in different 
> > patch 
> >   set).
> >   The replay function is required for hotplug vfio device and to move 
> > devices 
> >   between existing domains.  
> 
> I am thinking about this replay thing recently and now I start to
> doubt whether the whole vt-d vIOMMU framework suites this...
> 
> Generally speaking, current work is throwing away the IOMMU "domain"
> layer here. We maintain the mapping only per device, and we don't care
> too much about which domain it belongs. This seems problematic.
> 
> A simplest wrong case for this is (let's assume cache-mode is
> enabled): if we have two assigned devices A and B, both belong to the
> same domain 1. Meanwhile, in domain 1 assume we have one mapping which
> is the first page (iova range 0-0xfff). Then, if guest wants to
> invalidate the page, it'll notify VT-d vIOMMU with an invalidation
> message. If we do this invalidation per-device, we'll need to UNMAP
> the region twice - once for A, once for B (if we have more devices, we
> will unmap more times), and we can never know we have done duplicated
> work since we don't keep domain info, so we don't know they are using
> the same address space. The first unmap will work, and then we'll
> possibly get some errors on the rest of dma unmap failures.
> 
> Looks like we just cannot live without knowing this domain layer.
> Because the address space is binded to the domain. If we want to sync
> the address space (here to setup a correct shadow page table), we need
> to do it per-domain.
> 
> What I can think of as a solution is that we introduce this "domain"
> layer - like a memory region per domain. When invalidation happens,
> it's per-domain, not per-device any more (actually I guess that's what
> current vt-d iommu driver in kernel is doing, we just ignored it - we
> fetch the devices that matches the domain ID). We can/need to maintain
> something different, like sid <-> domain mappings (we can do this as
> long as we are notified when context entries changed), per-domain
> mappings (just like per-device mappings that we are trying to build in
> this series, but what we really need is IMHO per domain one), etc.
> When device switches domain, we switch the IOMMU memory region
> accordingly.
> 
> Does this make any sense? Comments are greatly welcomed (especially
> from AlexW and DavidG).

It's been a bit since I've looked at VT-d emulation, but I certainly
remember that it's way more convoluted than I expected.  It seems like
a domain should create an AddressSpace and any devices assigned to that
domain should make use of that single address space, but IIRC VT-d
creates an address space per device, ie. per context entry.



Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications

2016-12-01 Thread Lan Tianyu
On 2016年11月30日 17:23, Peter Xu wrote:
> On Mon, Nov 28, 2016 at 05:51:50PM +0200, Aviv B.D wrote:
>> * intel_iommu's replay op is not implemented yet (May come in different 
>> patch 
>>   set).
>>   The replay function is required for hotplug vfio device and to move 
>> devices 
>>   between existing domains.
> 
> I am thinking about this replay thing recently and now I start to
> doubt whether the whole vt-d vIOMMU framework suites this...
> 
> Generally speaking, current work is throwing away the IOMMU "domain"
> layer here. We maintain the mapping only per device, and we don't care
> too much about which domain it belongs. This seems problematic.
> 
> A simplest wrong case for this is (let's assume cache-mode is
> enabled): if we have two assigned devices A and B, both belong to the
> same domain 1. Meanwhile, in domain 1 assume we have one mapping which
> is the first page (iova range 0-0xfff). Then, if guest wants to
> invalidate the page, it'll notify VT-d vIOMMU with an invalidation
> message. If we do this invalidation per-device, we'll need to UNMAP
> the region twice - once for A, once for B (if we have more devices, we
> will unmap more times), and we can never know we have done duplicated
> work since we don't keep domain info, so we don't know they are using
> the same address space. The first unmap will work, and then we'll
> possibly get some errors on the rest of dma unmap failures.


Hi Peter:
According to VT-d spec 6.2.2.1, "Software must ensure that, if multiple
context-entries (or extended-context-entries) are programmed
with the same Domain-id (DID), such entries must be programmed with same
value for the secondlevel page-table pointer (SLPTPTR) field, and same
value for the PASID Table Pointer (PASIDTPTR) field.".

So if two assigned devices may have different IO page tables, they should
be put into different domains.


> 
> Looks like we just cannot live without knowing this domain layer.
> Because the address space is binded to the domain. If we want to sync
> the address space (here to setup a correct shadow page table), we need
> to do it per-domain.
> 
> What I can think of as a solution is that we introduce this "domain"
> layer - like a memory region per domain. When invalidation happens,
> it's per-domain, not per-device any more (actually I guess that's what
> current vt-d iommu driver in kernel is doing, we just ignored it - we
> fetch the devices that matches the domain ID). We can/need to maintain
> something different, like sid <-> domain mappings (we can do this as
> long as we are notified when context entries changed), per-domain
> mappings (just like per-device mappings that we are trying to build in
> this series, but what we really need is IMHO per domain one), etc.
> When device switches domain, we switch the IOMMU memory region
> accordingly.
> 
> Does this make any sense? Comments are greatly welcomed (especially
> from AlexW and DavidG).
> 
> Thanks,
> 
> -- peterx
> 


-- 
Best regards
Tianyu Lan



Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications

2016-12-01 Thread Lan Tianyu
On 2016年12月01日 12:21, Tian, Kevin wrote:
>> I am thinking about this replay thing recently and now I start to
>> > doubt whether the whole vt-d vIOMMU framework suites this...
>> > 
>> > Generally speaking, current work is throwing away the IOMMU "domain"
>> > layer here. We maintain the mapping only per device, and we don't care
>> > too much about which domain it belongs. This seems problematic.
>> > 
>> > A simplest wrong case for this is (let's assume cache-mode is
>> > enabled): if we have two assigned devices A and B, both belong to the
>> > same domain 1. Meanwhile, in domain 1 assume we have one mapping which
>> > is the first page (iova range 0-0xfff). Then, if guest wants to
>> > invalidate the page, it'll notify VT-d vIOMMU with an invalidation
>> > message. If we do this invalidation per-device, we'll need to UNMAP
>> > the region twice - once for A, once for B (if we have more devices, we
>> > will unmap more times), and we can never know we have done duplicated
>> > work since we don't keep domain info, so we don't know they are using
>> > the same address space. The first unmap will work, and then we'll
>> > possibly get some errors on the rest of dma unmap failures.
> Tianyu and I discussed there is a bigger problem: today VFIO assumes 
> only one address space per container, which is fine w/o vIOMMU (all devices 
> in 
> same container share same GPA->HPA translation). However it's not the case
> when vIOMMU is enabled, because guest Linux implements per-device 
> IOVA space. If a VFIO container includes multiple devices, it means 
> multiple address spaces required per container...
> 

Hi all:
Some updates on the relationship between assigned devices and containers.

If vIOMMU is disabled, all assigned devices use the global
address_space_memory as their address space (for details please see
pci_device_iommu_address_space()). VFIO creates containers according to
the address space, so all assigned devices are put into one container.

If vIOMMU is enabled, the Intel vIOMMU allocates a separate address space
for each assigned device, and VFIO then creates a different container
for each assigned device. In other words, it will be one assigned device
per container when vIOMMU is enabled, so the original concern won't
arise. This is my understanding; please correct me if something is wrong.
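
A compact sketch of that container-selection logic (simplified stand-ins,
not the actual QEMU code, but the same "one container per address space"
rule):

#include <stddef.h>

typedef struct AddressSpace AddressSpace;        /* opaque stand-in */

typedef struct VFIOContainer {
    AddressSpace *space;                         /* address space it serves */
    struct VFIOContainer *next;
} VFIOContainer;

static VFIOContainer *containers;                /* existing containers */

/* Containers are keyed by AddressSpace: without vIOMMU every assigned
 * device hands in the same system address space and so shares one
 * container; with vIOMMU each device hands in its own AddressSpace and
 * therefore ends up in its own container. */
static VFIOContainer *get_container(AddressSpace *as)
{
    for (VFIOContainer *c = containers; c; c = c->next) {
        if (c->space == as) {
            return c;                            /* reuse matching container */
        }
    }
    return NULL;   /* caller would allocate a new container bound to 'as' */
}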

-- 
Best regards
Tianyu Lan



Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications

2016-11-30 Thread Lan Tianyu
On 2016年11月28日 23:51, Aviv B.D wrote:
> From: "Aviv Ben-David" 
>
> * Advertize Cache Mode capability in iommu cap register.
>   This capability is controlled by "cache-mode" property of intel-iommu 
> device.
>   To enable this option call QEMU with "-device intel-iommu,cache-mode=true".
>
> * On page cache invalidation in intel vIOMMU, check if the domain belong to
>   registered notifier, and notify accordingly.
>
> Currently this patch still doesn't enabling VFIO devices support with vIOMMU
> present. Current problems:
> * vfio_iommu_map_notify is not aware about memory range belong to specific
>   VFIOGuestIOMMU.
> * intel_iommu's replay op is not implemented yet (May come in different patch
>   set).
>   The replay function is required for hotplug vfio device and to move devices
>   between existing domains.
>
> Changes from v1 to v2:
> * remove assumption that the cache do not clears
> * fix lockup on high load.
>
> Changes from v2 to v3:
> * remove debug leftovers
> * split into separate commits
> * change is_write to flags in vtd_do_iommu_translate, add IOMMU_NO_FAIL
>   to suppress error propagating to guest.
>
> Changes from v3 to v4:
> * Add property to intel_iommu device to control the CM capability,
>   default to False.
> * Use s->iommu_ops.notify_flag_changed to register notifiers.
>
> Changes from v4 to v4 RESEND:
> * Fix coding style issues pointed out by the checkpatch.pl script.
>
> Changes from v4 to v5:
> * Reduce the number of changes in patch 2 and make flags real bitfield.
> * Revert deleted debug prints.
> * Fix memory leak in patch 3.
>
> Changes from v5 to v6:
> * fix prototype of iommu_translate function for more IOMMU types.
> * VFIO will be notified only on the difference, without unmapping
>   before changing to the new maps.
>
> Changes from v6 to v7:
> * Add replay op to iommu_ops, with current behavior as default implementation
>   (Patch 4).
> * Add stub to disable VFIO with intel_iommu support (Patch 5).
> * Cosmetic changes to other patches.
>
> Aviv Ben-David (5):
>   IOMMU: add option to enable VTD_CAP_CM to vIOMMU capability exposed to
> guest
>   IOMMU: change iommu_op->translate's is_write to flags, add support to
> NO_FAIL flag mode
>   IOMMU: enable intel_iommu map and unmap notifiers
>   IOMMU: add specific replay function with default implementation
>   IOMMU: add specific null implementation of iommu_replay to intel_iommu
>
>  exec.c |   3 +-
>  hw/alpha/typhoon.c |   2 +-
>  hw/i386/amd_iommu.c|   4 +-
>  hw/i386/intel_iommu.c  | 165 ++---
>  hw/i386/intel_iommu_internal.h |   3 +
>  hw/pci-host/apb.c  |   2 +-
>  hw/ppc/spapr_iommu.c   |   2 +-
>  hw/s390x/s390-pci-bus.c|   2 +-
>  include/exec/memory.h  |  10 ++-
>  include/hw/i386/intel_iommu.h  |  11 +++
>  memory.c   |  11 ++-
>  11 files changed, 177 insertions(+), 38 deletions(-)
>

Hi:
I think there are still other gaps to enable passthrough devices with
vIOMMU's DMA translation support.

1. Since this patchset shadows the guest IO page table into the pIOMMU
(physical IOMMU) via vfio_dma_map/unmap(), there will be fault events
from the pIOMMU if the guest OS misconfigures its page tables. We should
report these fault events to the guest, which means we need to pass them
from the pIOMMU driver to the vIOMMU in QEMU. I suppose a channel should
be added in VFIO to connect the pIOMMU and the vIOMMU.

The task can be divided into three parts (a rough sketch of the last
part follows below):
1) The pIOMMU driver reports fault events to the vIOMMU via a new VFIO
interface.
2) Add a new channel in the VFIO subsystem to connect the pIOMMU driver
and the vIOMMU in QEMU.
3) The vIOMMU in QEMU gets the fault event from the VFIO subsystem and
injects a virtual fault event into the guest.
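
A rough, hypothetical sketch of what part 3) could look like on the QEMU
side; none of these VFIO interfaces exist today, the fault record layout
is invented, and the queue is simulated with a static array so the model
stands alone.

/*
 * Hypothetical sketch only.  The real design would need an eventfd plus
 * a fault queue exported by the VFIO container.
 */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

typedef struct IOMMUFaultRecord {
    uint16_t sid;        /* source-id of the faulting device */
    uint64_t iova;       /* faulting address */
    uint8_t  is_write;
} IOMMUFaultRecord;

/* Simulated stand-in for the (hypothetical) VFIO fault queue. */
static IOMMUFaultRecord fake_queue[] = {
    { 0x0018 /* 00:03.0 */, 0x1000, 1 },
};
static unsigned fake_head;

static int vfio_dequeue_iommu_fault(IOMMUFaultRecord *rec)
{
    if (fake_head >= sizeof(fake_queue) / sizeof(fake_queue[0])) {
        return -1;                       /* queue is empty */
    }
    *rec = fake_queue[fake_head++];
    return 0;
}

/* Part 3): turn the physical fault into a virtual one the guest sees
 * through the vIOMMU's fault recording registers (printed here). */
static void vtd_inject_fault(const IOMMUFaultRecord *rec)
{
    printf("inject fault: sid=0x%04x iova=0x%" PRIx64 " write=%u\n",
           (unsigned)rec->sid, rec->iova, (unsigned)rec->is_write);
}

int main(void)
{
    IOMMUFaultRecord rec;

    /* What an eventfd handler in QEMU's main loop would do. */
    while (vfio_dequeue_iommu_fault(&rec) == 0) {
        vtd_inject_fault(&rec);
    }
    return 0;
}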

Such a VFIO channel is also required for the device's PRS (Page Request
Services) support, which in turn is part of SVM (Shared Virtual Memory)
support in the VM. Here is the SVM design doc link:
http://marc.info/?l=kvm&m=148049586514120&w=2

2. How to restore the GPA->HPA mapping when IOVA is disabled by the guest.
When the guest enables IOVA for a device, the vIOMMU invalidates all
previous GPA->HPA mappings and pushes the IOVA->HPA mappings to the
pIOMMU via the iommu notifier. But if IOVA is later disabled, I think we
should restore the GPA->HPA mapping for the device, otherwise the device
won't work again in the VM.
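
One possible shape for that restore, written as pseudocode against
QEMU's memory API (the per-device layout with separate "iommu" and
"nodmar" subregions is an assumption of this sketch, not something the
current series implements): flipping which subregion is enabled makes
VFIO's listener replay either the GPA->HPA view or the shadowed
IOVA->HPA view.

/*
 * Sketch only: VTDAddressSpace here is an assumed layout, not current
 * QEMU code.  memory_region_set_enabled() is QEMU's real memory API,
 * declared here so the fragment stands alone.
 */
#include <stdbool.h>

typedef struct MemoryRegion MemoryRegion;   /* opaque QEMU type */

typedef struct VTDAddressSpace {
    MemoryRegion *iommu;          /* IOMMU (IOVA) region, used when DMAR is on */
    MemoryRegion *nodmar_alias;   /* alias of guest RAM (GPA), used when off */
} VTDAddressSpace;

void memory_region_set_enabled(MemoryRegion *mr, bool enabled);

static void vtd_switch_address_space(VTDAddressSpace *as, bool dmar_enabled)
{
    memory_region_set_enabled(as->iommu, dmar_enabled);
    memory_region_set_enabled(as->nodmar_alias, !dmar_enabled);
    /*
     * VFIO's memory listener sees the topology change and remaps the
     * container: back to GPA->HPA when translation is disabled, or to
     * the shadowed IOVA->HPA mappings when it is enabled.
     */
}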


Another option to enable passthrough devices with the Intel vIOMMU:
we may prevent the guest from successfully enabling the DMA translation
function via gcmd, i.e. the vIOMMU never sets the "Translation Enable
Status" bit in the gsts register when there is a passthrough device.
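
A minimal sketch of that option (helper names are invented; the real
hw/i386/intel_iommu.c register handling is more involved): on a GCMD
write requesting translation enable, simply refuse to reflect TES in
GSTS while an assigned device is present.

/*
 * Sketch with invented helpers, not the real intel_iommu.c logic.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define VTD_GCMD_TE   (1u << 31)   /* Translation Enable command bit  */
#define VTD_GSTS_TES  (1u << 31)   /* Translation Enable Status bit   */

/* Assumption: some way to know whether a VFIO device sits behind us. */
static bool vtd_has_assigned_device(void)
{
    return true;                    /* stub for the sketch */
}

static uint32_t vtd_handle_gcmd_write(uint32_t gsts, uint32_t gcmd)
{
    if (gcmd & VTD_GCMD_TE) {
        if (vtd_has_assigned_device()) {
            return gsts & ~VTD_GSTS_TES;   /* enable visibly fails */
        }
        return gsts | VTD_GSTS_TES;
    }
    return gsts & ~VTD_GSTS_TES;           /* guest disabled translation */
}

int main(void)
{
    uint32_t gsts = 0;

    gsts = vtd_handle_gcmd_write(gsts, VTD_GCMD_TE);
    printf("TES after enable attempt: %u\n", !!(gsts & VTD_GSTS_TES));
    return 0;
}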

There is a kernel parameter "intel_iommu=on" and a Kconfig option
CONFIG_INTEL_IOMMU_DEFAULT_ON to enable DMA translation for the Intel
IOMMU driver, but they aren't set by default in the RHEL, SLES, Oracle
and Ubuntu distributions.


The IOMMU driver will trigger a kernel panic when it fails to enable DMA
translation via gcmd, with the log "DMAR hardware is malfunctioning".

-- 
Best regards
Tianyu Lan



Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications

2016-11-30 Thread Tian, Kevin
> From: Peter Xu
> Sent: Wednesday, November 30, 2016 5:24 PM
> 
> On Mon, Nov 28, 2016 at 05:51:50PM +0200, Aviv B.D wrote:
> > * intel_iommu's replay op is not implemented yet (May come in different 
> > patch
> >   set).
> >   The replay function is required for hotplug vfio device and to move 
> > devices
> >   between existing domains.
> 
> I am thinking about this replay thing recently and now I start to
> doubt whether the whole vt-d vIOMMU framework suits this...
> 
> Generally speaking, current work is throwing away the IOMMU "domain"
> layer here. We maintain the mapping only per device, and we don't care
> too much about which domain it belongs. This seems problematic.
> 
> The simplest wrong case for this is (let's assume cache-mode is
> enabled): if we have two assigned devices A and B, both belong to the
> same domain 1. Meanwhile, in domain 1 assume we have one mapping which
> is the first page (iova range 0-0xfff). Then, if guest wants to
> invalidate the page, it'll notify VT-d vIOMMU with an invalidation
> message. If we do this invalidation per-device, we'll need to UNMAP
> the region twice - once for A, once for B (if we have more devices, we
> will unmap more times), and we can never know we have done duplicated
> work since we don't keep domain info, so we don't know they are using
> the same address space. The first unmap will work, and then we'll
> possibly get some errors on the rest of dma unmap failures.

Tianyu and I discussed there is a bigger problem: today VFIO assumes
only one address space per container, which is fine w/o vIOMMU (all
devices in the same container share the same GPA->HPA translation).
However, that's not the case when vIOMMU is enabled, because guest
Linux implements per-device IOVA spaces. If a VFIO container includes
multiple devices, it means multiple address spaces are required per
container...

> 
> Looks like we just cannot live without knowing this domain layer.
> Because the address space is bound to the domain. If we want to sync
> the address space (here to setup a correct shadow page table), we need
> to do it per-domain.
> 
> What I can think of as a solution is that we introduce this "domain"
> layer - like a memory region per domain. When invalidation happens,
> it's per-domain, not per-device any more (actually I guess that's what
> current vt-d iommu driver in kernel is doing, we just ignored it - we
> fetch the devices that matches the domain ID). We can/need to maintain
> something different, like sid <-> domain mappings (we can do this as
> long as we are notified when context entries changed), per-domain
> mappings (just like per-device mappings that we are trying to build in
> this series, but what we really need is IMHO per domain one), etc.
> When device switches domain, we switch the IOMMU memory region
> accordingly.
> 
> Does this make any sense? Comments are greatly welcomed (especially
> from AlexW and DavidG).
> 
> Thanks,
> 
> -- peterx



Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications

2016-11-30 Thread Tian, Kevin
> From: Aviv B.D
> Sent: Monday, November 28, 2016 11:52 PM
> 
> From: "Aviv Ben-David" 
> 
> * Advertize Cache Mode capability in iommu cap register.
>   This capability is controlled by "cache-mode" property of intel-iommu 
> device.
>   To enable this option call QEMU with "-device intel-iommu,cache-mode=true".
> 
> * On page cache invalidation in intel vIOMMU, check if the domain belongs to a
>   registered notifier, and notify accordingly.
> 
> Currently this patch still doesn't enable VFIO device support with vIOMMU
> present. Current problems:
> * vfio_iommu_map_notify is not aware of the memory range belonging to a specific
>   VFIOGuestIOMMU.
> * intel_iommu's replay op is not implemented yet (May come in different patch
>   set).
>   The replay function is required for hotplug vfio device and to move devices
>   between existing domains.
> 

Thanks Aviv for your great work! Extending the vIOMMU to support VFIO
devices is also an important feature for Intel. We'd like to contribute
and collaborate with you in this area to make it fully functional. My
colleague (Tianyu Lan) will chime in later with his thoughts. Please
don't hesitate to share your TODO list so we can work together to move
forward.

Another colleague of mine (Yi Liu) is meanwhile working on SVM
virtualization, with a draft design already sent out. There'll be more
API changes built on top of the notification framework being worked on
here, which we can discuss later once your work has stabilized. :-)

Thanks
Kevin



Re: [Qemu-devel] [PATCH v7 0/5] IOMMU: intel_iommu support map and unmap notifications

2016-11-30 Thread Peter Xu
On Mon, Nov 28, 2016 at 05:51:50PM +0200, Aviv B.D wrote:
> * intel_iommu's replay op is not implemented yet (May come in different patch 
>   set).
>   The replay function is required for hotplug vfio device and to move devices 
>   between existing domains.

I am thinking about this replay thing recently and now I start to
doubt whether the whole vt-d vIOMMU framework suits this...

Generally speaking, current work is throwing away the IOMMU "domain"
layer here. We maintain the mapping only per device, and we don't care
too much about which domain it belongs. This seems problematic.

The simplest wrong case for this is (let's assume cache-mode is
enabled): if we have two assigned devices A and B, both belong to the
same domain 1. Meanwhile, in domain 1 assume we have one mapping which
is the first page (iova range 0-0xfff). Then, if guest wants to
invalidate the page, it'll notify VT-d vIOMMU with an invalidation
message. If we do this invalidation per-device, we'll need to UNMAP
the region twice - once for A, once for B (if we have more devices, we
will unmap more times), and we can never know we have done duplicated
work since we don't keep domain info, so we don't know they are using
the same address space. The first unmap will work, and then we'll
possibly get some errors on the rest of dma unmap failures.

Looks like we just cannot live without knowing this domain layer.
Because the address space is bound to the domain. If we want to sync
the address space (here to setup a correct shadow page table), we need
to do it per-domain.

What I can think of as a solution is that we introduce this "domain"
layer - like a memory region per domain. When invalidation happens,
it's per-domain, not per-device any more (actually I guess that's what
current vt-d iommu driver in kernel is doing, we just ignored it - we
fetch the devices that matches the domain ID). We can/need to maintain
something different, like sid <-> domain mappings (we can do this as
long as we are notified when context entries changed), per-domain
mappings (just like per-device mappings that we are trying to build in
this series, but what we really need is IMHO per domain one), etc.
When device switches domain, we switch the IOMMU memory region
accordingly.
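
To make that data flow concrete, here is a toy, self-contained model of
the idea (all names and layouts invented, not QEMU code): context-entry
updates maintain a sid->domain table, mappings hang off the domain, and
a page invalidation carrying a domain-id triggers exactly one unmap per
domain no matter how many devices are attached to it.

/*
 * Toy model only: invalidation is dispatched once per domain instead of
 * once per device.
 */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_SID     65536
#define MAX_DOMAIN  256

typedef struct VTDDomain {
    int ndevices;     /* devices currently attached to this domain */
    /* per-domain shadow mappings / MemoryRegion would hang off here */
} VTDDomain;

static int sid_to_domain[MAX_SID];       /* -1 = not attached */
static VTDDomain domains[MAX_DOMAIN];

/* Called when the guest programs a context entry for device 'sid'. */
static void vtd_context_entry_update(int sid, int domain_id)
{
    sid_to_domain[sid] = domain_id;
    domains[domain_id].ndevices++;
}

/* Page invalidation is handled per *domain*, not per device. */
static void vtd_page_invalidate(int domain_id, uint64_t iova, uint64_t size)
{
    if (domains[domain_id].ndevices == 0) {
        return;                          /* nobody uses this address space */
    }
    /* one shadow-page-table unmap covers devices A and B together */
    printf("unmap domain %d: iova=0x%" PRIx64 " size=0x%" PRIx64 "\n",
           domain_id, iova, size);
}

int main(void)
{
    for (int i = 0; i < MAX_SID; i++) {
        sid_to_domain[i] = -1;
    }
    vtd_context_entry_update(0x0018, 1);   /* device A -> domain 1 */
    vtd_context_entry_update(0x0020, 1);   /* device B -> domain 1 */
    vtd_page_invalidate(1, 0x0, 0x1000);   /* printed exactly once */
    return 0;
}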

Does this make any sense? Comments are greatly welcomed (especially
from AlexW and DavidG).

Thanks,

-- peterx