Re: [Qemu-devel] vhost, iova, and dirty page tracking
On 2019/9/24 10:02 AM, Tian, Kevin wrote:

From: Jason Wang [mailto:jasow...@redhat.com]
Sent: Friday, September 20, 2019 9:19 AM

On 2019/9/20 6:54 AM, Tian, Kevin wrote:

From: Paolo Bonzini [mailto:pbonz...@redhat.com]
Sent: Thursday, September 19, 2019 7:14 PM

On 19/09/19 09:16, Tian, Kevin wrote:

why should GPA1 and GPA2 both be dirty? Even if they have the same HVA due to overlapping virtual address space in two processes, they still correspond to two physical pages. I don't get your meaning :)

The point is not to leave any corner case that is hard to debug or fix in the future.

Let's just start with a single process: the API allows userspace to map one HVA to both GPA1 and GPA2. Since it knows GPA1 and GPA2 are equivalent, it's OK to sync just through GPA1. That means if you only log GPA2, it won't work.

I noted KVM itself doesn't consider such a situation (one HVA mapped to multiple GPAs) when doing its dirty page tracking. If you look at kvm_vcpu_mark_page_dirty, it simply finds the unique memslot which contains the dirty gfn and then sets the dirty bit within that slot. It doesn't attempt to walk all memslots to find any other GPA which may be mapped to the same HVA.

So there must be some disconnect here. Let's hear from Paolo first and understand the rationale behind such a situation.

In general, userspace cannot assume that it's okay to sync just through GPA1. It must sync the host page if *either* GPA1 or GPA2 is marked dirty.

Agree. In this case the kernel only needs to track whether GPA1 or GPA2 is dirtied by guest operations.

Not necessarily guest operations.

The reason why vhost has to set both GPA1 and GPA2 is due to its own design - it maintains IOVA->HVA and GPA->HVA mappings, thus given an IOVA you have to reverse-look-up the GPA->HVA mem table, which gives multiple possible GPAs.

So if userspace needs to track both GPA1 and GPA2, vhost can just stop when it finds one HVA->GPA mapping there.

But in concept, if vhost can maintain an IOVA->GPA mapping, then it is straightforward to set the right GPA every time an IOVA is tracked.

That means the translation is done twice by software, IOVA->GPA and GPA->HVA, for each packet.

Thanks

yes, it's not necessary if we care only about the content of the dirty GPA, as in live migration. In that case, just setting the first GPA in the loop is sufficient, as you pointed out. However there is one corner case which I'm not sure about. What about a usage (e.g. VM introspection) which cares only about the guest access pattern, i.e. which GPA is dirtied, instead of poking its content? Neither setting the first GPA nor setting all the aliasing GPAs can provide accurate info, if no explicit IOVA->GPA mapping is maintained inside vhost. But I cannot tell whether maintaining such accuracy for aliasing GPAs is really necessary. +VM introspection guys in case they have some opinions.

Interesting: for vhost, the vIOMMU can actually pass IOVA->GPA, and vhost can keep it and just do the translation from GPA->HVA in the map command. So it can have both IOVA->GPA and IOVA->HVA mappings.

Thanks

Thanks
Kevin
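The two translation schemes contrasted above can be sketched in a few lines of Python (a toy model with made-up addresses, not vhost code): a reverse lookup through a GPA->HVA mem table may return several aliased GPAs, while an explicit IOVA->GPA table gives exactly one answer at each step.

```python
PAGE = 4096

# GPA->HVA "mem table": two GPA ranges alias the same HVA range.
mem_table = [
    {"gpa": 0x100000, "hva": 0x7f0000000000, "size": 16 * PAGE},  # GPA1
    {"gpa": 0x900000, "hva": 0x7f0000000000, "size": 16 * PAGE},  # GPA2 (alias)
]

def hva_to_gpas(hva):
    """Reverse lookup: every GPA whose range covers this HVA (may be many)."""
    return [e["gpa"] + (hva - e["hva"])
            for e in mem_table
            if e["hva"] <= hva < e["hva"] + e["size"]]

# With an explicit IOVA->GPA mapping (as a vIOMMU could provide), the
# ambiguity disappears: IOVA->GPA, then GPA->HVA, one answer per step.
iova_to_gpa = {0x2000: 0x900000}   # say the guest mapped this IOVA to GPA2

hva = 0x7f0000000000
print([hex(g) for g in hva_to_gpas(hva)])  # ['0x100000', '0x900000']
print(hex(iova_to_gpa[0x2000]))            # 0x900000
```

The cost of the unambiguous scheme, as noted in the message above, is that software translates twice (IOVA->GPA, GPA->HVA) per packet.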
RE: [Qemu-devel] vhost, iova, and dirty page tracking
> From: Jason Wang [mailto:jasow...@redhat.com]
> Sent: Friday, September 20, 2019 9:19 AM
>
> On 2019/9/20 6:54 AM, Tian, Kevin wrote:
> > > From: Paolo Bonzini [mailto:pbonz...@redhat.com]
> > > Sent: Thursday, September 19, 2019 7:14 PM
> > >
> > > On 19/09/19 09:16, Tian, Kevin wrote:
> > > > > > why should GPA1 and GPA2 both be dirty? Even if they have the
> > > > > > same HVA due to overlapping virtual address space in two
> > > > > > processes, they still correspond to two physical pages.
> > > > > > I don't get your meaning :)
> > > > >
> > > > > The point is not to leave any corner case that is hard to debug
> > > > > or fix in the future.
> > > > >
> > > > > Let's just start with a single process: the API allows userspace
> > > > > to map one HVA to both GPA1 and GPA2. Since it knows GPA1 and
> > > > > GPA2 are equivalent, it's OK to sync just through GPA1. That
> > > > > means if you only log GPA2, it won't work.
> > > >
> > > > I noted KVM itself doesn't consider such a situation (one HVA
> > > > mapped to multiple GPAs) when doing its dirty page tracking. If
> > > > you look at kvm_vcpu_mark_page_dirty, it simply finds the unique
> > > > memslot which contains the dirty gfn and then sets the dirty bit
> > > > within that slot. It doesn't attempt to walk all memslots to find
> > > > any other GPA which may be mapped to the same HVA.
> > > >
> > > > So there must be some disconnect here. Let's hear from Paolo first
> > > > and understand the rationale behind such a situation.
> > >
> > > In general, userspace cannot assume that it's okay to sync just
> > > through GPA1. It must sync the host page if *either* GPA1 or GPA2
> > > is marked dirty.
> >
> > Agree. In this case the kernel only needs to track whether GPA1 or
> > GPA2 is dirtied by guest operations.
>
> Not necessarily guest operations.
>
> > The reason why vhost has to set both GPA1 and GPA2 is due to its own
> > design - it maintains IOVA->HVA and GPA->HVA mappings, thus given an
> > IOVA you have to reverse-look-up the GPA->HVA mem table, which gives
> > multiple possible GPAs.
>
> So if userspace needs to track both GPA1 and GPA2, vhost can just stop
> when it finds one HVA->GPA mapping there.
>
> > But in concept, if vhost can maintain an IOVA->GPA mapping, then it
> > is straightforward to set the right GPA every time an IOVA is
> > tracked.
>
> That means the translation is done twice by software, IOVA->GPA and
> GPA->HVA, for each packet.
>
> Thanks

yes, it's not necessary if we care only about the content of the dirty GPA, as in live migration. In that case, just setting the first GPA in the loop is sufficient, as you pointed out. However there is one corner case which I'm not sure about. What about a usage (e.g. VM introspection) which cares only about the guest access pattern, i.e. which GPA is dirtied, instead of poking its content? Neither setting the first GPA nor setting all the aliasing GPAs can provide accurate info, if no explicit IOVA->GPA mapping is maintained inside vhost. But I cannot tell whether maintaining such accuracy for aliasing GPAs is really necessary. +VM introspection guys in case they have some opinions.

Thanks
Kevin
Re: [Qemu-devel] vhost, iova, and dirty page tracking
On Fri, Sep 20, 2019 at 09:15:40AM +0800, Jason Wang wrote:

On 2019/9/19 10:06 PM, Michael S. Tsirkin wrote:

On Thu, Sep 19, 2019 at 05:37:48PM +0800, Jason Wang wrote:

On 2019/9/19 3:16 PM, Tian, Kevin wrote:

+Paolo to help clarify here.

From: Jason Wang [mailto:jasow...@redhat.com]
Sent: Thursday, September 19, 2019 2:32 PM

On 2019/9/19 2:17 PM, Yan Zhao wrote:

On Thu, Sep 19, 2019 at 02:09:53PM +0800, Jason Wang wrote:

On 2019/9/19 1:28 PM, Yan Zhao wrote:

On Thu, Sep 19, 2019 at 09:05:12AM +0800, Jason Wang wrote:

On 2019/9/18 4:37 PM, Tian, Kevin wrote:

From: Jason Wang [mailto:jasow...@redhat.com]
Sent: Wednesday, September 18, 2019 2:10 PM

Note that the HVA to GPA mapping is not a 1:1 mapping. One HVA range could be mapped to several GPA ranges.

This is fine. Currently vfio_dma maintains the IOVA->HVA mapping.

btw under what condition is HVA->GPA not a 1:1 mapping? I didn't realize it.

I don't remember the details, e.g. memory region alias? And neither kvm nor the kvm API forbids this, if my memory is correct.

I checked https://qemu.weilnetz.de/doc/devel/memory.html, which provides an example of an aliased layout. However, its aliasing is all 1:1, instead of N:1. From the guest's p.o.v. every writable GPA implies a unique location. Why would we hit the situation where multiple writable GPAs are mapped to the same HVA (i.e. the same physical memory location)?

I don't know, just want to say the current API does not forbid this. So we probably need to take care of it.

yes, at the KVM API level, it does not forbid two slots to have the same HVA (slot->userspace_addr). But
(1) there's only one kvm instance for each vm for each qemu process.
(2) all ramblock->host (corresponding to HVA and slot->userspace_addr) in one qemu process is non-overlapping, as it's obtained from mmap().
(3) qemu ensures two kvm slots will not point to the same section of one ramblock.
So, as long as the kvm instance is not shared by two processes, and there's no bug in qemu, we can assure that HVA to GPA is 1:1.

Well, you leave this API to userspace, so you can't assume qemu is the only user, or any of its behavior. If you had, you should have limited it at the API level instead of leaving the window open for them.

But even if there are two processes operating on the same kvm instance and manipulating memory slots, adding an extra GPA alongside the current IOVA & HVA in the ioctl VFIO_IOMMU_MAP_DMA can still let the driver know the right IOVA->GPA mapping, right?

It looks fragile. Consider an HVA that was mapped to both GPA1 and GPA2. The guest maps an IOVA to GPA2, so we have IOVA/GPA2/HVA in the new ioctl and then log through GPA2. If userspace is trying to sync through GPA1, it will miss the dirty page. So for safety we need to log both GPA1 and GPA2. (See what has been done in log_write_hva() in vhost.c.) The only way to do that is to maintain an independent HVA to GPA mapping, like what KVM or vhost did.

why should GPA1 and GPA2 both be dirty? Even if they have the same HVA due to overlapping virtual address space in two processes, they still correspond to two physical pages. I don't get your meaning :)

The point is not to leave any corner case that is hard to debug or fix in the future.

Let's just start with a single process: the API allows userspace to map one HVA to both GPA1 and GPA2. Since it knows GPA1 and GPA2 are equivalent, it's OK to sync just through GPA1. That means if you only log GPA2, it won't work.
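The "log every aliasing GPA" rule referenced above (what log_write_hva() in vhost.c implements) can be sketched roughly like this. It is a toy model with illustrative names and addresses, not the kernel code: given a written HVA range, walk the GPA->HVA mem table and mark dirty every GPA page that aliases it.

```python
PAGE = 4096

# GPA->HVA memslots; GPA1 and GPA2 alias the same HVA range.
mem_table = [
    {"gpa": 0x100000, "hva": 0x7f0000000000, "size": 16 * PAGE},
    {"gpa": 0x900000, "hva": 0x7f0000000000, "size": 16 * PAGE},
]
dirty_gfns = set()

def log_write_hva(hva, length):
    """Mark dirty every GPA page aliased to [hva, hva+length)."""
    for e in mem_table:
        start = max(hva, e["hva"])
        end = min(hva + length, e["hva"] + e["size"])
        off = start - e["hva"]
        while start < end:
            dirty_gfns.add((e["gpa"] + off) // PAGE)  # set bit for this alias
            start += PAGE
            off += PAGE

log_write_hva(0x7f0000000000, 1)   # one HVA write dirties both aliased GPA pages
print(sorted(hex(g) for g in dirty_gfns))
```

This is the safe-but-redundant behavior discussed in the thread: whichever GPA userspace chooses to sync through, the dirty bit is there.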
Re: [Qemu-devel] vhost, iova, and dirty page tracking
On 2019/9/19 10:06 PM, Michael S. Tsirkin wrote:

On Thu, Sep 19, 2019 at 05:37:48PM +0800, Jason Wang wrote:

On 2019/9/19 3:16 PM, Tian, Kevin wrote:

[...]

I noted KVM itself doesn't consider such a situation (one HVA mapped to multiple GPAs) when doing its dirty page tracking. If you look at kvm_vcpu_mark_page_dirty, it simply finds the unique memslot which contains the dirty gfn and then sets the dirty bit within that slot. It doesn't attempt to walk all memslots to find any other GPA which may be mapped to the same HVA.

So there must be some disconnect here. Let's hear from Paolo first and understand the rationale behind such a situation.

Neither did vhost when the IOTLB is disabled. And cc Michael, who pointed out this issue at the beginning.

Thanks

Thanks
Kevin

Yes, we fixed it with a kind of workaround; at the time I proposed a new interface to fix it fully. I don't think we ever got around to implementing it - right?

Paolo said userspace just needs to sync through all GPAs, so my understanding is that the workaround is OK, albeit redundant, and so is the API you proposed. Anything I miss?

Thanks
Re: [Qemu-devel] vhost, iova, and dirty page tracking
On 2019/9/20 6:54 AM, Tian, Kevin wrote:

From: Paolo Bonzini [mailto:pbonz...@redhat.com]
Sent: Thursday, September 19, 2019 7:14 PM

[...]

In general, userspace cannot assume that it's okay to sync just through GPA1. It must sync the host page if *either* GPA1 or GPA2 is marked dirty.

Agree. In this case the kernel only needs to track whether GPA1 or GPA2 is dirtied by guest operations.

Not necessarily guest operations.

The reason why vhost has to set both GPA1 and GPA2 is due to its own design - it maintains IOVA->HVA and GPA->HVA mappings, thus given an IOVA you have to reverse-look-up the GPA->HVA mem table, which gives multiple possible GPAs.

So if userspace needs to track both GPA1 and GPA2, vhost can just stop when it finds one HVA->GPA mapping there.

But in concept, if vhost can maintain an IOVA->GPA mapping, then it is straightforward to set the right GPA every time an IOVA is tracked.

That means the translation is done twice by software, IOVA->GPA and GPA->HVA, for each packet.

Thanks

The situation really only arises in special cases. For example, 0xfffe0000..0xffffffff and 0xe0000..0xfffff might be the same memory. From "info mtree" before the guest boots:

    00000000fffe0000-00000000ffffffff (prio -1, i/o): pci
      00000000000e0000-00000000000fffff (prio 1, i/o): alias isa-bios @pc.bios 0000000000020000-000000000003ffff
      00000000fffc0000-00000000ffffffff (prio 0, rom): pc.bios

However, non-x86 machines may have other cases of aliased memory, so it's a case that you should cover.

Above example is read-only, thus won't be touched in the logdirty path. But now I agree that a specific architecture may define two writable GPA ranges with one as the alias of the other, as long as such a case is explicitly documented so the guest OS won't treat them as separate memory pages.

Thanks
Kevin
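Paolo's rule for userspace above - sync the host page if *either* alias is marked dirty - amounts to OR-ing the per-GFN dirty bits of an alias group before deciding whether to re-send the backing page. A minimal sketch with an assumed data layout (the group list and bitmap are illustrative, not QEMU's structures):

```python
# GFNs known (by userspace) to share one backing host page.
alias_groups = [[0x100, 0x900]]

# Per-GFN dirty bits as reported by the kernel's log: only GPA2's bit is set.
dirty = {0x100: False, 0x900: True}

def host_page_needs_sync(group):
    """The backing page is dirty if ANY alias in the group is dirty."""
    return any(dirty[gfn] for gfn in group)

print(host_page_needs_sync(alias_groups[0]))  # True: GPA2's bit alone suffices
```

Under this rule, a kernel that logs only one of the aliases is sufficient for content sync (live migration), which is why logging all aliases is safe but redundant.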
RE: [Qemu-devel] vhost, iova, and dirty page tracking
> From: Paolo Bonzini [mailto:pbonz...@redhat.com]
> Sent: Thursday, September 19, 2019 7:14 PM
>
> On 19/09/19 09:16, Tian, Kevin wrote:
> > > > why should GPA1 and GPA2 both be dirty? Even if they have the same
> > > > HVA due to overlapping virtual address space in two processes,
> > > > they still correspond to two physical pages. I don't get your
> > > > meaning :)
> > >
> > > The point is not to leave any corner case that is hard to debug or
> > > fix in the future.
> > >
> > > Let's just start with a single process: the API allows userspace to
> > > map one HVA to both GPA1 and GPA2. Since it knows GPA1 and GPA2 are
> > > equivalent, it's OK to sync just through GPA1. That means if you
> > > only log GPA2, it won't work.
> >
> > I noted KVM itself doesn't consider such a situation (one HVA mapped
> > to multiple GPAs) when doing its dirty page tracking. If you look at
> > kvm_vcpu_mark_page_dirty, it simply finds the unique memslot which
> > contains the dirty gfn and then sets the dirty bit within that slot.
> > It doesn't attempt to walk all memslots to find any other GPA which
> > may be mapped to the same HVA.
> >
> > So there must be some disconnect here. Let's hear from Paolo first
> > and understand the rationale behind such a situation.
>
> In general, userspace cannot assume that it's okay to sync just through
> GPA1. It must sync the host page if *either* GPA1 or GPA2 is marked
> dirty.

Agree. In this case the kernel only needs to track whether GPA1 or GPA2 is dirtied by guest operations.

The reason why vhost has to set both GPA1 and GPA2 is due to its own design - it maintains IOVA->HVA and GPA->HVA mappings, thus given an IOVA you have to reverse-look-up the GPA->HVA mem table, which gives multiple possible GPAs.

But in concept, if vhost can maintain an IOVA->GPA mapping, then it is straightforward to set the right GPA every time an IOVA is tracked.

> The situation really only arises in special cases. For example,
> 0xfffe0000..0xffffffff and 0xe0000..0xfffff might be the same memory.
> From "info mtree" before the guest boots:
>
>     00000000fffe0000-00000000ffffffff (prio -1, i/o): pci
>       00000000000e0000-00000000000fffff (prio 1, i/o): alias isa-bios @pc.bios 0000000000020000-000000000003ffff
>       00000000fffc0000-00000000ffffffff (prio 0, rom): pc.bios
>
> However, non-x86 machines may have other cases of aliased memory, so
> it's a case that you should cover.

Above example is read-only, thus won't be touched in the logdirty path. But now I agree that a specific architecture may define two writable GPA ranges with one as the alias of the other, as long as such a case is explicitly documented so the guest OS won't treat them as separate memory pages.

Thanks
Kevin
RE: [Qemu-devel] vhost, iova, and dirty page tracking
From: Alex Williamson [mailto:alex.william...@redhat.com]
Sent: Friday, September 20, 2019 1:21 AM

On Wed, 18 Sep 2019 07:21:05 +0000, "Tian, Kevin" wrote:

From: Jason Wang [mailto:jasow...@redhat.com]
Sent: Wednesday, September 18, 2019 2:04 PM

On 2019/9/18 9:31 AM, Tian, Kevin wrote:

From: Alex Williamson [mailto:alex.william...@redhat.com]
Sent: Tuesday, September 17, 2019 10:54 PM

On Tue, 17 Sep 2019 08:48:36 +0000, "Tian, Kevin" wrote:

From: Jason Wang [mailto:jasow...@redhat.com]
Sent: Monday, September 16, 2019 4:33 PM

On 2019/9/16 9:51 AM, Tian, Kevin wrote:

Hi, Jason

We had a discussion about dirty page tracking in VFIO, when vIOMMU is enabled:

https://lists.nongnu.org/archive/html/qemu-devel/2019-09/msg02690.html

It's actually a similar model to vhost - Qemu cannot interpose the fast-path DMAs, thus it relies on the kernel part to track and report dirty page information. Currently Qemu tracks dirty pages at GFN level, thus demanding a translation from IOVA to GPA. Then the open question in our discussion is where this translation should happen. Doing the translation in the kernel implies a device-iotlb flavor, which is what vhost implements today. It requires potentially large tracking structures in the host kernel, but leverages the existing log_sync flow in Qemu. On the other hand, Qemu may perform log_sync for every removal of an IOVA mapping and then do the translation itself, thus avoiding GPA awareness on the kernel side. It needs some change to the current Qemu log-sync flow, and may bring more overhead if IOVA is frequently unmapped.

So we'd like to hear your opinions, especially about how you came down to the current iotlb approach for vhost.

We didn't consider this point too much when introducing vhost. And before the IOTLB, vhost already knew GPAs through its mem table (GPA->HVA). So it's natural and easier to track dirty pages at GPA level, and then it won't need any changes in the existing ABI.

This is the same situation as VFIO.

It is? VFIO doesn't know GPAs, it only knows HVA, HPA, and IOVA. In some cases IOVA is GPA, but not all.

Well, I thought vhost has a similar design, that the index of its mem table is GPA when vIOMMU is off and then becomes IOVA when vIOMMU is on. But I may be wrong here. Jason, can you help clarify? I saw two interfaces which poke the mem table: VHOST_SET_MEM_TABLE (for GPA) and VHOST_IOTLB_UPDATE (for IOVA). Are they used exclusively or together?

Actually, vhost maintains two interval trees: the mem table GPA->HVA, and the device IOTLB IOVA->HVA. The device IOTLB is only used when the vIOMMU is enabled, and in that case the mem table is used only when vhost needs to track dirty pages (doing a reverse lookup of the mem table to get HVA->GPA). So in conclusion, for the datapath they are used exclusively, but they need to cowork for logging dirty pages when the device IOTLB is enabled.

OK. Then it's different from the current VFIO design, which maintains only one tree, indexed by either GPA or IOVA exclusively, depending on whether the vIOMMU is in use.

Nit, the VFIO tree is only ever indexed by IOVA. The MAP_DMA ioctl is only ever performed with an IOVA. Userspace decides how that IOVA maps to GPA; VFIO only needs to know how the IOVA maps to HPA via the HVA. Thanks,

I was only referring to its actual meaning from a usage p.o.v., not the parameter name (which is always called iova) in vfio.

Thanks
Kevin
Re: [Qemu-devel] vhost, iova, and dirty page tracking
On Wed, 18 Sep 2019 07:21:05 +0000, "Tian, Kevin" wrote:

[...]

> OK. Then it's different from the current VFIO design, which maintains
> only one tree, indexed by either GPA or IOVA exclusively, depending on
> whether the vIOMMU is in use.

Nit, the VFIO tree is only ever indexed by IOVA. The MAP_DMA ioctl is only ever performed with an IOVA. Userspace decides how that IOVA maps to GPA; VFIO only needs to know how the IOVA maps to HPA via the HVA. Thanks,

Alex
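Alex's nit - that the VFIO type1 mapping list is keyed only by IOVA, with GPA never appearing - can be sketched as follows. The field names and addresses are illustrative, not the kernel's struct vfio_dma:

```python
# Each entry records only IOVA -> (process virtual address, size);
# the pinned pages behind the vaddr give IOVA -> HPA. No GPA anywhere.
vfio_dmas = [
    {"iova": 0x10000, "vaddr": 0x7f0000000000, "size": 0x8000},
    {"iova": 0x40000, "vaddr": 0x7f0000100000, "size": 0x4000},
]

def vfio_find_dma(iova):
    """Look up the mapping covering `iova`; the tree is indexed by IOVA only."""
    for d in vfio_dmas:
        if d["iova"] <= iova < d["iova"] + d["size"]:
            return d
    return None

d = vfio_find_dma(0x12000)
hva = d["vaddr"] + (0x12000 - d["iova"])  # IOVA -> HVA; HVA -> HPA via pinning
print(hex(hva))                           # 0x7f0000002000
```

Whether `iova` happens to equal a GPA is purely a userspace convention, which is why kernel-side dirty tracking at GFN granularity needs extra information.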
Re: [Qemu-devel] vhost, iova, and dirty page tracking
On Thu, Sep 19, 2019 at 05:37:48PM +0800, Jason Wang wrote: > > On 2019/9/19 下午3:16, Tian, Kevin wrote: > > +Paolo to help clarify here. > > > > > From: Jason Wang [mailto:jasow...@redhat.com] > > > Sent: Thursday, September 19, 2019 2:32 PM > > > > > > > > > On 2019/9/19 下午2:17, Yan Zhao wrote: > > > > On Thu, Sep 19, 2019 at 02:09:53PM +0800, Jason Wang wrote: > > > > > On 2019/9/19 下午1:28, Yan Zhao wrote: > > > > > > On Thu, Sep 19, 2019 at 09:05:12AM +0800, Jason Wang wrote: > > > > > > > On 2019/9/18 下午4:37, Tian, Kevin wrote: > > > > > > > > > From: Jason Wang [mailto:jasow...@redhat.com] > > > > > > > > > Sent: Wednesday, September 18, 2019 2:10 PM > > > > > > > > > > > > > > > > > > > > Note that the HVA to GPA mapping is not an 1:1 mapping. > > > > > > > > > > > One > > > HVA > > > > > > > > > range > > > > > > > > > > > could be mapped to several GPA ranges. > > > > > > > > > > This is fine. Currently vfio_dma maintains IOVA->HVA > > > > > > > > > > mapping. > > > > > > > > > > > > > > > > > > > > btw under what condition HVA->GPA is not 1:1 mapping? I > > > > > > > > > > didn't > > > realize it. > > > > > > > > > I don't remember the details e.g memory region alias? And > > > > > > > > > neither > > > kvm > > > > > > > > > nor kvm API does forbid this if my memory is correct. > > > > > > > > > > > > > > > > > I checked https://qemu.weilnetz.de/doc/devel/memory.html, which > > > > > > > > provides an example of aliased layout. However, its aliasing is > > > > > > > > all > > > > > > > > 1:1, instead of N:1. From guest p.o.v every writable GPA > > > > > > > > implies an > > > > > > > > unique location. Why would we hit the situation where multiple > > > > > > > > write-able GPAs are mapped to the same HVA (i.e. same physical > > > > > > > > memory location)? > > > > > > > I don't know, just want to say current API does not forbid this. > > > > > > > So we > > > > > > > probably need to take care it. 
> > > > > > > > > > > > > yes, in KVM API level, it does not forbid two slots to have the same > > > HVA(slot->userspace_addr). > > > > > > But > > > > > > (1) there's only one kvm instance for each vm for each qemu process. > > > > > > (2) all ramblock->host (corresponds to HVA and slot->userspace_addr) > > > in one qemu > > > > > > process is non-overlapping as it's obtained from mmmap(). > > > > > > (3) qemu ensures two kvm slots will not point to the same section of > > > one ramblock. > > > > > > So, as long as kvm instance is not shared in two processes, and > > > > > > there's no bug in qemu, we can assure that HVA to GPA is 1:1. > > > > > Well, you leave this API for userspace, so you can't assume qemu is > > > > > the > > > > > only user or any its behavior. If you had you should limit it in the > > > > > API > > > > > level instead of open window for them. > > > > > > > > > > > > > > > > But even if there are two processes operating on the same kvm > > > instance > > > > > > and manipulating on memory slots, adding an extra GPA along side > > > current > > > > > > IOVA & HVA to ioctl VFIO_IOMMU_MAP_DMA can still let driver knows > > > the > > > > > > right IOVA->GPA mapping, right? > > > > > It looks fragile. Consider HVA was mapped to both GPA1 and GPA2. > > > Guest > > > > > maps IOVA to GPA2, so we have IOVA GPA2 HVA in the new ioctl and > > > then > > > > > log through GPA2. If userspace is trying to sync through GPA1, it will > > > > > miss the dirty page. So for safety we need log both GPA1 and GPA2. > > > > > (See > > > > > what has been done in log_write_hva() in vhost.c). The only way to do > > > > > that is to maintain an independent HVA to GPA mapping like what KVM > > > or > > > > > vhost did. > > > > > > > > > why GPA1 and GPA2 should be both dirty? > > > > even they have the same HVA due to overlaping virtual address space in > > > > two processes, they still correspond to two physical pages. 
> > > > don't get what's your meaning :) > > > > > > The point is not leave any corner case that is hard to debug or fix in > > > the future. > > > > > > Let's just start by a single process, the API allows userspace to maps > > > HVA to both GPA1 and GPA2. Since it knows GPA1 and GPA2 are equivalent, > > > it's ok to sync just through GPA1. That means if you only log GPA2, it > > > won't work. > > > > > I noted KVM itself doesn't consider such situation (one HVA is mapped > > to multiple GPAs), when doing its dirty page tracking. If you look at > > kvm_vcpu_mark_page_dirty, it simply finds the unique memslot which > > contains the dirty gfn and then set the dirty bit within that slot. It > > doesn't attempt to walk all memslots to find out any other GPA which > > may be mapped to the same HVA. > > > > So there must be some disconnect here. let's hear from Paolo first and > > understand the rationale behind such situation. > > > Neither did vhost when IOTLB is disabled. And cc Michael who points out this > issue at the beginning. > > Thanks > > > > > > Thanks
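The "log both GPA1 and GPA2" strategy Jason describes above (what log_write_hva() in vhost.c does in spirit, when only HVA->GPA tables are available and the mapping may be N:1) can be sketched as follows. This is an illustrative Python model, not the actual vhost code; the slot layout and addresses are made up:

```python
class MemSlot:
    """A (GPA -> HVA) memory slot, like a KVM memslot or vhost memory region."""
    def __init__(self, gpa, hva, size):
        self.gpa, self.hva, self.size = gpa, hva, size

def dirty_gpas_for_hva(slots, hva):
    """Reverse-look-up an HVA and return *every* GPA that aliases it."""
    gpas = []
    for s in slots:
        if s.hva <= hva < s.hva + s.size:
            gpas.append(s.gpa + (hva - s.hva))
    return gpas

# Hypothetical layout: two GPA ranges backed by the same HVA range.
slots = [
    MemSlot(gpa=0x000e0000, hva=0x7f0000020000, size=0x20000),
    MemSlot(gpa=0xfffe0000, hva=0x7f0000020000, size=0x20000),
]

# A write at this HVA must be logged under both aliasing GPAs, so that
# userspace syncing through either one sees the page as dirty.
dirty = dirty_gpas_for_hva(slots, 0x7f0000020000 + 0x1000)
```

Logging only one of the two results would reproduce exactly the bug discussed above: userspace syncing through the other alias would miss the page.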
Re: [Qemu-devel] vhost, iova, and dirty page tracking
On 2019/9/19 下午7:14, Paolo Bonzini wrote: On 19/09/19 09:16, Tian, Kevin wrote: why should GPA1 and GPA2 both be dirty? even if they have the same HVA due to overlapping virtual address space in two processes, they still correspond to two physical pages. don't get what's your meaning :) The point is not to leave any corner case that is hard to debug or fix in the future. Let's just start with a single process: the API allows userspace to map HVA to both GPA1 and GPA2. Since it knows GPA1 and GPA2 are equivalent, it's ok to sync just through GPA1. That means if you only log GPA2, it won't work. I noted KVM itself doesn't consider such a situation (one HVA is mapped to multiple GPAs) when doing its dirty page tracking. If you look at kvm_vcpu_mark_page_dirty, it simply finds the unique memslot which contains the dirty gfn and then sets the dirty bit within that slot. It doesn't attempt to walk all memslots to find out any other GPA which may be mapped to the same HVA. So there must be some disconnect here. let's hear from Paolo first and understand the rationale behind such a situation. In general, userspace cannot assume that it's okay to sync just through GPA1. It must sync the host page if *either* GPA1 or GPA2 is marked dirty. Maybe we need to document this somewhere. The situation really only arises in special cases. For example, 0xfffe0000..0xffffffff and 0xe0000..0xfffff might be the same memory. From "info mtree" before the guest boots:

    0000000000000000-ffffffffffffffff (prio -1, i/o): pci
      00000000000e0000-00000000000fffff (prio 1, i/o): alias isa-bios @pc.bios 0000000000020000-000000000003ffff
      00000000fffc0000-00000000ffffffff (prio 0, rom): pc.bios

However, non-x86 machines may have other cases of aliased memory, so it's a case that you should cover. Paolo Any other issue that still needs to be covered, considering userspace needs to sync both GPAs? Thanks
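Kevin's observation about kvm_vcpu_mark_page_dirty can be seen in a small contrast sketch: KVM needs no alias walk because its input is already a gfn (a GPA page number), so the memslot lookup is unique. This is a simplified Python model, not the real KVM code:

```python
def kvm_mark_page_dirty(memslots, gfn, dirty):
    """Mark gfn dirty in the single memslot containing it (no alias walk)."""
    # memslots: list of (base_gfn, npages); a gfn falls in at most one slot.
    for base, npages in memslots:
        if base <= gfn < base + npages:
            dirty.add(gfn)
            return True
    return False

# Two slots whose backing HVA happens to be the same:
memslots = [(0x000e0, 0x20), (0xfffe0, 0x20)]
dirty = set()
kvm_mark_page_dirty(memslots, 0xfffe1, dirty)
# Only the gfn actually written is marked; the HVA-aliased gfn 0xe1 is not.
```

The aliasing problem in this thread only arises when the logger's input is an HVA (or an IOVA translated to an HVA) rather than a GPA, which is why vhost ends up having to log every aliasing GPA.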
Re: [Qemu-devel] vhost, iova, and dirty page tracking
On 19/09/19 14:39, Jason Wang wrote: >> In general, userspace cannot assume that it's okay to sync just through >> GPA1. It must sync the host page if *either* GPA1 or GPA2 are marked >> dirty. > > Maybe we need document this somewhere. Well, it's implicit but it should be kind of obvious. The dirty page only tells you that the guest wrote to the GPA, HVAs are never mentioned in the documentation. Paolo > Any other issue that still need to be covered consider userspace need to > sync both GPAs? > > Thanks >
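The implicit rule Paolo states, that the dirty bitmap is indexed by GPA and says nothing about HVAs, means userspace must OR together the dirty bits of every GPA aliasing a host page before deciding whether to sync it. A minimal sketch of that userspace-side check (names and page numbers are made up for illustration):

```python
def host_page_dirty(dirty_bits, aliases):
    """True if *any* GPA page aliasing this host page was logged dirty."""
    return any(dirty_bits.get(gpfn, False) for gpfn in aliases)

# GPA pages 0xe1 and 0xfffe1 are assumed to share one host page; the
# kernel logged only 0xfffe1, yet the host page must still be synced.
dirty_bits = {0xfffe1: True}
must_sync = host_page_dirty(dirty_bits, [0x000e1, 0xfffe1])
```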
Re: [Qemu-devel] vhost, iova, and dirty page tracking
On 2019/9/19 下午6:16, Yan Zhao wrote: On Thu, Sep 19, 2019 at 06:06:52PM +0800, Jason Wang wrote: On 2019/9/19 下午2:29, Yan Zhao wrote: On Thu, Sep 19, 2019 at 02:32:03PM +0800, Jason Wang wrote: On 2019/9/19 下午2:17, Yan Zhao wrote: On Thu, Sep 19, 2019 at 02:09:53PM +0800, Jason Wang wrote: On 2019/9/19 下午1:28, Yan Zhao wrote: On Thu, Sep 19, 2019 at 09:05:12AM +0800, Jason Wang wrote: On 2019/9/18 下午4:37, Tian, Kevin wrote: From: Jason Wang [mailto:jasow...@redhat.com] Sent: Wednesday, September 18, 2019 2:10 PM Note that the HVA to GPA mapping is not a 1:1 mapping. One HVA range could be mapped to several GPA ranges. This is fine. Currently vfio_dma maintains IOVA->HVA mapping. btw under what condition is HVA->GPA not a 1:1 mapping? I didn't realize it. I don't remember the details, e.g. memory region alias? And neither kvm nor the kvm API forbids this, if my memory is correct. I checked https://qemu.weilnetz.de/doc/devel/memory.html, which provides an example of aliased layout. However, its aliasing is all 1:1, instead of N:1. From guest p.o.v every writable GPA implies a unique location. Why would we hit the situation where multiple write-able GPAs are mapped to the same HVA (i.e. same physical memory location)? I don't know, just want to say the current API does not forbid this. So we probably need to take care of it. yes, at the KVM API level, it does not forbid two slots to have the same HVA (slot->userspace_addr). But (1) there's only one kvm instance for each vm for each qemu process. (2) all ramblock->host (corresponds to HVA and slot->userspace_addr) in one qemu process is non-overlapping as it's obtained from mmap(). (3) qemu ensures two kvm slots will not point to the same section of one ramblock. So, as long as the kvm instance is not shared between two processes, and there's no bug in qemu, we can assure that HVA to GPA is 1:1. Well, you leave this API for userspace, so you can't assume qemu is the only user or any of its behavior. 
If you had you should limit it in the API level instead of open window for them. But even if there are two processes operating on the same kvm instance and manipulating on memory slots, adding an extra GPA along side current IOVA & HVA to ioctl VFIO_IOMMU_MAP_DMA can still let driver knows the right IOVA->GPA mapping, right? It looks fragile. Consider HVA was mapped to both GPA1 and GPA2. Guest maps IOVA to GPA2, so we have IOVA GPA2 HVA in the new ioctl and then log through GPA2. If userspace is trying to sync through GPA1, it will miss the dirty page. So for safety we need log both GPA1 and GPA2. (See what has been done in log_write_hva() in vhost.c). The only way to do that is to maintain an independent HVA to GPA mapping like what KVM or vhost did. why GPA1 and GPA2 should be both dirty? even they have the same HVA due to overlaping virtual address space in two processes, they still correspond to two physical pages. don't get what's your meaning:) The point is not leave any corner case that is hard to debug or fix in the future. Let's just start by a single process, the API allows userspace to maps HVA to both GPA1 and GPA2. Since it knows GPA1 and GPA2 are equivalent, it's ok to sync just through GPA1. That means if you only log GPA2, it won't work. In that case, cannot log dirty according to HPA. because kvm cannot tell whether it's an valid case (the two GPAs are equivalent) or an invalid case (the two GPAs are not equivalent, but with the same HVA value). Right? There no need any examination on whether it was 'valid' or not. It's as simple as logging both GPA1 and GPA2. Then you won't need to care any corner case. But, if GPA1 and GPA2 point to the same HVA, it means they point to the same page. Then if you only log GPA2, and send GPA2 to target, it should still works, unless in the target side GPA1 and GPA2 do not point to the same HVA? The problem is whether userspace can just sync GPA1 instead of both GPA1 and GPA2. 
If userspace could sync through GPA1 only, the dirty pages would be lost. Paolo has pointed out that userspace cannot make that assumption. Under what condition have you met this in reality? Please kindly point it out :) It's not about reality, it's about possibility. Again, we don't want to leave any corner case that is hard to debug or fix in the future. Thanks Thanks Thanks Yan
Re: [Qemu-devel] vhost, iova, and dirty page tracking
On 19/09/19 09:16, Tian, Kevin wrote: >>> why GPA1 and GPA2 should be both dirty? >>> even they have the same HVA due to overlaping virtual address space in >>> two processes, they still correspond to two physical pages. >>> don't get what's your meaning :) >> >> The point is not leave any corner case that is hard to debug or fix in >> the future. >> >> Let's just start by a single process, the API allows userspace to maps >> HVA to both GPA1 and GPA2. Since it knows GPA1 and GPA2 are equivalent, >> it's ok to sync just through GPA1. That means if you only log GPA2, it >> won't work. > > I noted KVM itself doesn't consider such situation (one HVA is mapped > to multiple GPAs), when doing its dirty page tracking. If you look at > kvm_vcpu_mark_page_dirty, it simply finds the unique memslot which > contains the dirty gfn and then set the dirty bit within that slot. It > doesn't attempt to walk all memslots to find out any other GPA which > may be mapped to the same HVA. > > So there must be some disconnect here. let's hear from Paolo first and > understand the rationale behind such situation. In general, userspace cannot assume that it's okay to sync just through GPA1. It must sync the host page if *either* GPA1 or GPA2 is marked dirty. The situation really only arises in special cases. For example, 0xfffe0000..0xffffffff and 0xe0000..0xfffff might be the same memory. From "info mtree" before the guest boots:

    0000000000000000-ffffffffffffffff (prio -1, i/o): pci
      00000000000e0000-00000000000fffff (prio 1, i/o): alias isa-bios @pc.bios 0000000000020000-000000000003ffff
      00000000fffc0000-00000000ffffffff (prio 0, rom): pc.bios

However, non-x86 machines may have other cases of aliased memory, so it's a case that you should cover. Paolo
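The BIOS aliasing Paolo describes can be checked numerically. The address constants below are assumed from the standard PC layout (pc.bios is 256 KiB at 0xfffc0000, and isa-bios aliases its top 128 KiB at 0xe0000); the helper is purely illustrative:

```python
# Assumed standard PC layout; not taken from any real QEMU API.
PC_BIOS_BASE, PC_BIOS_SIZE = 0xfffc0000, 0x40000
ISA_BIOS_BASE, ALIAS_OFF = 0x000e0000, 0x20000   # alias @pc.bios +0x20000

def bios_offset(gpa):
    """Translate a GPA into an offset within the pc.bios backing RAM."""
    if PC_BIOS_BASE <= gpa < PC_BIOS_BASE + PC_BIOS_SIZE:
        return gpa - PC_BIOS_BASE
    if ISA_BIOS_BASE <= gpa < ISA_BIOS_BASE + (PC_BIOS_SIZE - ALIAS_OFF):
        return ALIAS_OFF + (gpa - ISA_BIOS_BASE)
    return None  # GPA not backed by pc.bios

# 0xfffe0000 and 0xe0000 resolve to the same byte of pc.bios, so a
# dirty bit on either GPA refers to the same host memory.
same = bios_offset(0xfffe0000) == bios_offset(0x000e0000)
```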
Re: [Qemu-devel] vhost, iova, and dirty page tracking
On Thu, Sep 19, 2019 at 06:06:52PM +0800, Jason Wang wrote: > > On 2019/9/19 下午2:29, Yan Zhao wrote: > > On Thu, Sep 19, 2019 at 02:32:03PM +0800, Jason Wang wrote: > >> On 2019/9/19 下午2:17, Yan Zhao wrote: > >>> On Thu, Sep 19, 2019 at 02:09:53PM +0800, Jason Wang wrote: > On 2019/9/19 下午1:28, Yan Zhao wrote: > > On Thu, Sep 19, 2019 at 09:05:12AM +0800, Jason Wang wrote: > >> On 2019/9/18 下午4:37, Tian, Kevin wrote: > From: Jason Wang [mailto:jasow...@redhat.com] > Sent: Wednesday, September 18, 2019 2:10 PM > > >> Note that the HVA to GPA mapping is not an 1:1 mapping. One HVA > range > >> could be mapped to several GPA ranges. > > This is fine. Currently vfio_dma maintains IOVA->HVA mapping. > > > > btw under what condition HVA->GPA is not 1:1 mapping? I didn't > > realize it. > I don't remember the details e.g memory region alias? And neither kvm > nor kvm API does forbid this if my memory is correct. > > >>> I checkedhttps://qemu.weilnetz.de/doc/devel/memory.html, which > >>> provides an example of aliased layout. However, its aliasing is all > >>> 1:1, instead of N:1. From guest p.o.v every writable GPA implies an > >>> unique location. Why would we hit the situation where multiple > >>> write-able GPAs are mapped to the same HVA (i.e. same physical > >>> memory location)? > >> I don't know, just want to say current API does not forbid this. So we > >> probably need to take care it. > >> > > yes, in KVM API level, it does not forbid two slots to have the same > > HVA(slot->userspace_addr). > > But > > (1) there's only one kvm instance for each vm for each qemu process. > > (2) all ramblock->host (corresponds to HVA and slot->userspace_addr) in > > one qemu > > process is non-overlapping as it's obtained from mmmap(). > > (3) qemu ensures two kvm slots will not point to the same section of > > one ramblock. > > > > So, as long as kvm instance is not shared in two processes, and > > there's no bug in qemu, we can assure that HVA to GPA is 1:1. 
> Well, you leave this API for userspace, so you can't assume qemu is the > only user or any its behavior. If you had you should limit it in the API > level instead of open window for them. > > > > But even if there are two processes operating on the same kvm instance > > and manipulating on memory slots, adding an extra GPA along side current > > IOVA & HVA to ioctl VFIO_IOMMU_MAP_DMA can still let driver knows the > > right IOVA->GPA mapping, right? > It looks fragile. Consider HVA was mapped to both GPA1 and GPA2. Guest > maps IOVA to GPA2, so we have IOVA GPA2 HVA in the new ioctl and then > log through GPA2. If userspace is trying to sync through GPA1, it will > miss the dirty page. So for safety we need log both GPA1 and GPA2. (See > what has been done in log_write_hva() in vhost.c). The only way to do > that is to maintain an independent HVA to GPA mapping like what KVM or > vhost did. > > >>> why GPA1 and GPA2 should be both dirty? > >>> even they have the same HVA due to overlaping virtual address space in > >>> two processes, they still correspond to two physical pages. > >>> don't get what's your meaning:) > >> The point is not leave any corner case that is hard to debug or fix in > >> the future. > >> > >> Let's just start by a single process, the API allows userspace to maps > >> HVA to both GPA1 and GPA2. Since it knows GPA1 and GPA2 are equivalent, > >> it's ok to sync just through GPA1. That means if you only log GPA2, it > >> won't work. > >> > > In that case, cannot log dirty according to HPA. > > because kvm cannot tell whether it's an valid case (the two GPAs are > > equivalent) > > or an invalid case (the two GPAs are not equivalent, but with the same > > HVA value). > > > > Right? > > > There no need any examination on whether it was 'valid' or not. It's as > simple as logging both GPA1 and GPA2. Then you won't need to care any > corner case. > But, if GPA1 and GPA2 point to the same HVA, it means they point to the same page. 
Then if you only log GPA2, and send GPA2 to the target, it should still work, unless on the target side GPA1 and GPA2 do not point to the same HVA? Under what condition have you met this in reality? Please kindly point it out :) > Thanks > > > > > Thanks > > Yan > > > >
Re: [Qemu-devel] vhost, iova, and dirty page tracking
On 2019/9/19 下午5:36, Yan Zhao wrote: On Thu, Sep 19, 2019 at 05:35:05PM +0800, Jason Wang wrote: On 2019/9/19 下午2:32, Yan Zhao wrote: On Thu, Sep 19, 2019 at 02:29:54PM +0800, Yan Zhao wrote: On Thu, Sep 19, 2019 at 02:32:03PM +0800, Jason Wang wrote: On 2019/9/19 下午2:17, Yan Zhao wrote: On Thu, Sep 19, 2019 at 02:09:53PM +0800, Jason Wang wrote: On 2019/9/19 下午1:28, Yan Zhao wrote: On Thu, Sep 19, 2019 at 09:05:12AM +0800, Jason Wang wrote: On 2019/9/18 下午4:37, Tian, Kevin wrote: From: Jason Wang [mailto:jasow...@redhat.com] Sent: Wednesday, September 18, 2019 2:10 PM Note that the HVA to GPA mapping is not an 1:1 mapping. One HVA range could be mapped to several GPA ranges. This is fine. Currently vfio_dma maintains IOVA->HVA mapping. btw under what condition HVA->GPA is not 1:1 mapping? I didn't realize it. I don't remember the details e.g memory region alias? And neither kvm nor kvm API does forbid this if my memory is correct. I checked https://qemu.weilnetz.de/doc/devel/memory.html, which provides an example of aliased layout. However, its aliasing is all 1:1, instead of N:1. From guest p.o.v every writable GPA implies an unique location. Why would we hit the situation where multiple write-able GPAs are mapped to the same HVA (i.e. same physical memory location)? I don't know, just want to say current API does not forbid this. So we probably need to take care it. yes, in KVM API level, it does not forbid two slots to have the same HVA(slot->userspace_addr). But (1) there's only one kvm instance for each vm for each qemu process. (2) all ramblock->host (corresponds to HVA and slot->userspace_addr) in one qemu process is non-overlapping as it's obtained from mmmap(). (3) qemu ensures two kvm slots will not point to the same section of one ramblock. So, as long as kvm instance is not shared in two processes, and there's no bug in qemu, we can assure that HVA to GPA is 1:1. 
Well, you leave this API for userspace, so you can't assume qemu is the only user or any its behavior. If you had you should limit it in the API level instead of open window for them. But even if there are two processes operating on the same kvm instance and manipulating on memory slots, adding an extra GPA along side current IOVA & HVA to ioctl VFIO_IOMMU_MAP_DMA can still let driver knows the right IOVA->GPA mapping, right? It looks fragile. Consider HVA was mapped to both GPA1 and GPA2. Guest maps IOVA to GPA2, so we have IOVA GPA2 HVA in the new ioctl and then log through GPA2. If userspace is trying to sync through GPA1, it will miss the dirty page. So for safety we need log both GPA1 and GPA2. (See what has been done in log_write_hva() in vhost.c). The only way to do that is to maintain an independent HVA to GPA mapping like what KVM or vhost did. why GPA1 and GPA2 should be both dirty? even they have the same HVA due to overlaping virtual address space in two processes, they still correspond to two physical pages. don't get what's your meaning :) The point is not leave any corner case that is hard to debug or fix in the future. Let's just start by a single process, the API allows userspace to maps HVA to both GPA1 and GPA2. Since it knows GPA1 and GPA2 are equivalent, it's ok to sync just through GPA1. That means if you only log GPA2, it won't work. In that case, cannot log dirty according to HPA. sorry, it should be "cannot log dirty according to HVA". I think we are discussing the choice between GPA and IOVA, not HVA? Right. so why do we need to care about HVA to GPA mapping? as long as IOVA to GPA is 1:1, then it's fine. The problem is (whether) userspace can try to sync from GPA2 whose HVA is the same as GPA1. Maintainers are copied by Kevin, hope it can help to clarify things. 
Thanks Thanks Yan Thanks because kvm cannot tell whether it's an valid case (the two GPAs are equivalent) or an invalid case (the two GPAs are not equivalent, but with the same HVA value). Right? Thanks Yan Thanks Thanks Yan Thanks Thanks Yan Is Qemu doing its own same-content memory merging in GPA level, similar to KSM? AFAIK, it doesn't. Thanks Thanks Kevin
Re: [Qemu-devel] vhost, iova, and dirty page tracking
On 2019/9/19 下午2:29, Yan Zhao wrote: On Thu, Sep 19, 2019 at 02:32:03PM +0800, Jason Wang wrote: On 2019/9/19 下午2:17, Yan Zhao wrote: On Thu, Sep 19, 2019 at 02:09:53PM +0800, Jason Wang wrote: On 2019/9/19 下午1:28, Yan Zhao wrote: On Thu, Sep 19, 2019 at 09:05:12AM +0800, Jason Wang wrote: On 2019/9/18 下午4:37, Tian, Kevin wrote: From: Jason Wang [mailto:jasow...@redhat.com] Sent: Wednesday, September 18, 2019 2:10 PM Note that the HVA to GPA mapping is not a 1:1 mapping. One HVA range could be mapped to several GPA ranges. This is fine. Currently vfio_dma maintains IOVA->HVA mapping. btw under what condition is HVA->GPA not a 1:1 mapping? I didn't realize it. I don't remember the details, e.g. memory region alias? And neither kvm nor the kvm API forbids this, if my memory is correct. I checked https://qemu.weilnetz.de/doc/devel/memory.html, which provides an example of aliased layout. However, its aliasing is all 1:1, instead of N:1. From guest p.o.v every writable GPA implies a unique location. Why would we hit the situation where multiple write-able GPAs are mapped to the same HVA (i.e. same physical memory location)? I don't know, just want to say the current API does not forbid this. So we probably need to take care of it. yes, at the KVM API level, it does not forbid two slots to have the same HVA (slot->userspace_addr). But (1) there's only one kvm instance for each vm for each qemu process. (2) all ramblock->host (corresponds to HVA and slot->userspace_addr) in one qemu process is non-overlapping as it's obtained from mmap(). (3) qemu ensures two kvm slots will not point to the same section of one ramblock. So, as long as the kvm instance is not shared between two processes, and there's no bug in qemu, we can assure that HVA to GPA is 1:1. Well, you leave this API for userspace, so you can't assume qemu is the only user or any of its behavior. If you had, you should limit it at the API level instead of opening a window for them. 
But even if there are two processes operating on the same kvm instance and manipulating on memory slots, adding an extra GPA along side current IOVA & HVA to ioctl VFIO_IOMMU_MAP_DMA can still let driver knows the right IOVA->GPA mapping, right? It looks fragile. Consider HVA was mapped to both GPA1 and GPA2. Guest maps IOVA to GPA2, so we have IOVA GPA2 HVA in the new ioctl and then log through GPA2. If userspace is trying to sync through GPA1, it will miss the dirty page. So for safety we need log both GPA1 and GPA2. (See what has been done in log_write_hva() in vhost.c). The only way to do that is to maintain an independent HVA to GPA mapping like what KVM or vhost did. why GPA1 and GPA2 should be both dirty? even they have the same HVA due to overlaping virtual address space in two processes, they still correspond to two physical pages. don't get what's your meaning:) The point is not leave any corner case that is hard to debug or fix in the future. Let's just start by a single process, the API allows userspace to maps HVA to both GPA1 and GPA2. Since it knows GPA1 and GPA2 are equivalent, it's ok to sync just through GPA1. That means if you only log GPA2, it won't work. In that case, cannot log dirty according to HPA. because kvm cannot tell whether it's an valid case (the two GPAs are equivalent) or an invalid case (the two GPAs are not equivalent, but with the same HVA value). Right? There no need any examination on whether it was 'valid' or not. It's as simple as logging both GPA1 and GPA2. Then you won't need to care any corner case. Thanks Thanks Yan
Re: [Qemu-devel] vhost, iova, and dirty page tracking
On Thu, Sep 19, 2019 at 05:35:05PM +0800, Jason Wang wrote: > > On 2019/9/19 下午2:32, Yan Zhao wrote: > > On Thu, Sep 19, 2019 at 02:29:54PM +0800, Yan Zhao wrote: > >> On Thu, Sep 19, 2019 at 02:32:03PM +0800, Jason Wang wrote: > >>> On 2019/9/19 下午2:17, Yan Zhao wrote: > On Thu, Sep 19, 2019 at 02:09:53PM +0800, Jason Wang wrote: > > On 2019/9/19 下午1:28, Yan Zhao wrote: > >> On Thu, Sep 19, 2019 at 09:05:12AM +0800, Jason Wang wrote: > >>> On 2019/9/18 下午4:37, Tian, Kevin wrote: > > From: Jason Wang [mailto:jasow...@redhat.com] > > Sent: Wednesday, September 18, 2019 2:10 PM > > > >>> Note that the HVA to GPA mapping is not an 1:1 mapping. One HVA > > range > >>> could be mapped to several GPA ranges. > >> This is fine. Currently vfio_dma maintains IOVA->HVA mapping. > >> > >> btw under what condition HVA->GPA is not 1:1 mapping? I didn't > >> realize it. > > I don't remember the details e.g memory region alias? And neither > > kvm > > nor kvm API does forbid this if my memory is correct. > > > I checked https://qemu.weilnetz.de/doc/devel/memory.html, which > provides an example of aliased layout. However, its aliasing is all > 1:1, instead of N:1. From guest p.o.v every writable GPA implies an > unique location. Why would we hit the situation where multiple > write-able GPAs are mapped to the same HVA (i.e. same physical > memory location)? > >>> I don't know, just want to say current API does not forbid this. So we > >>> probably need to take care it. > >>> > >> yes, in KVM API level, it does not forbid two slots to have the same > >> HVA(slot->userspace_addr). > >> But > >> (1) there's only one kvm instance for each vm for each qemu process. > >> (2) all ramblock->host (corresponds to HVA and slot->userspace_addr) > >> in one qemu > >> process is non-overlapping as it's obtained from mmmap(). > >> (3) qemu ensures two kvm slots will not point to the same section of > >> one ramblock. 
> >> > >> So, as long as kvm instance is not shared in two processes, and > >> there's no bug in qemu, we can assure that HVA to GPA is 1:1. > > Well, you leave this API for userspace, so you can't assume qemu is the > > only user or any its behavior. If you had you should limit it in the API > > level instead of open window for them. > > > > > >> But even if there are two processes operating on the same kvm instance > >> and manipulating on memory slots, adding an extra GPA along side > >> current > >> IOVA & HVA to ioctl VFIO_IOMMU_MAP_DMA can still let driver knows the > >> right IOVA->GPA mapping, right? > > It looks fragile. Consider HVA was mapped to both GPA1 and GPA2. Guest > > maps IOVA to GPA2, so we have IOVA GPA2 HVA in the new ioctl and then > > log through GPA2. If userspace is trying to sync through GPA1, it will > > miss the dirty page. So for safety we need log both GPA1 and GPA2. (See > > what has been done in log_write_hva() in vhost.c). The only way to do > > that is to maintain an independent HVA to GPA mapping like what KVM or > > vhost did. > > > why GPA1 and GPA2 should be both dirty? > even they have the same HVA due to overlaping virtual address space in > two processes, they still correspond to two physical pages. > don't get what's your meaning :) > >>> > >>> The point is not leave any corner case that is hard to debug or fix in > >>> the future. > >>> > >>> Let's just start by a single process, the API allows userspace to maps > >>> HVA to both GPA1 and GPA2. Since it knows GPA1 and GPA2 are equivalent, > >>> it's ok to sync just through GPA1. That means if you only log GPA2, it > >>> won't work. > >>> > >> In that case, cannot log dirty according to HPA. > > sorry, it should be "cannot log dirty according to HVA". > > > I think we are discussing the choice between GPA and IOVA, not HVA? > Right. so why do we need to care about HVA to GPA mapping? as long as IOVA to GPA is 1:1, then it's fine. 
Thanks Yan > Thanks > > > > > >> because kvm cannot tell whether it's an valid case (the two GPAs are > >> equivalent) > >> or an invalid case (the two GPAs are not equivalent, but with the same > >> HVA value). > >> > >> Right? > >> > >> Thanks > >> Yan > >> > >> > >>> Thanks > >>> > >>> > Thanks > Yan > > > > Thanks > > > > > >> Thanks > >> Yan > >> > Is Qemu doing its own same-content memory > merging in GPA level, similar to KSM? > >>> AFAIK, it doesn't. > >>> > >>> Thanks > >>> > >>> > Thanks > Kevin
Re: [Qemu-devel] vhost, iova, and dirty page tracking
On 2019/9/19 下午3:16, Tian, Kevin wrote: +Paolo to help clarify here. From: Jason Wang [mailto:jasow...@redhat.com] Sent: Thursday, September 19, 2019 2:32 PM On 2019/9/19 下午2:17, Yan Zhao wrote: On Thu, Sep 19, 2019 at 02:09:53PM +0800, Jason Wang wrote: On 2019/9/19 下午1:28, Yan Zhao wrote: On Thu, Sep 19, 2019 at 09:05:12AM +0800, Jason Wang wrote: On 2019/9/18 下午4:37, Tian, Kevin wrote: From: Jason Wang [mailto:jasow...@redhat.com] Sent: Wednesday, September 18, 2019 2:10 PM Note that the HVA to GPA mapping is not an 1:1 mapping. One HVA range could be mapped to several GPA ranges. This is fine. Currently vfio_dma maintains IOVA->HVA mapping. btw under what condition HVA->GPA is not 1:1 mapping? I didn't realize it. I don't remember the details e.g memory region alias? And neither kvm nor kvm API does forbid this if my memory is correct. I checked https://qemu.weilnetz.de/doc/devel/memory.html, which provides an example of aliased layout. However, its aliasing is all 1:1, instead of N:1. From guest p.o.v every writable GPA implies an unique location. Why would we hit the situation where multiple write-able GPAs are mapped to the same HVA (i.e. same physical memory location)? I don't know, just want to say current API does not forbid this. So we probably need to take care it. yes, in KVM API level, it does not forbid two slots to have the same HVA(slot->userspace_addr). But (1) there's only one kvm instance for each vm for each qemu process. (2) all ramblock->host (corresponds to HVA and slot->userspace_addr) in one qemu process is non-overlapping as it's obtained from mmmap(). (3) qemu ensures two kvm slots will not point to the same section of one ramblock. So, as long as kvm instance is not shared in two processes, and there's no bug in qemu, we can assure that HVA to GPA is 1:1. Well, you leave this API for userspace, so you can't assume qemu is the only user or any its behavior. If you had you should limit it in the API level instead of open window for them. 
But even if there are two processes operating on the same kvm instance and manipulating on memory slots, adding an extra GPA along side current IOVA & HVA to ioctl VFIO_IOMMU_MAP_DMA can still let driver knows the right IOVA->GPA mapping, right? It looks fragile. Consider HVA was mapped to both GPA1 and GPA2. Guest maps IOVA to GPA2, so we have IOVA GPA2 HVA in the new ioctl and then log through GPA2. If userspace is trying to sync through GPA1, it will miss the dirty page. So for safety we need log both GPA1 and GPA2. (See what has been done in log_write_hva() in vhost.c). The only way to do that is to maintain an independent HVA to GPA mapping like what KVM or vhost did. why GPA1 and GPA2 should be both dirty? even they have the same HVA due to overlaping virtual address space in two processes, they still correspond to two physical pages. don't get what's your meaning :) The point is not leave any corner case that is hard to debug or fix in the future. Let's just start by a single process, the API allows userspace to maps HVA to both GPA1 and GPA2. Since it knows GPA1 and GPA2 are equivalent, it's ok to sync just through GPA1. That means if you only log GPA2, it won't work. I noted KVM itself doesn't consider such situation (one HVA is mapped to multiple GPAs), when doing its dirty page tracking. If you look at kvm_vcpu_mark_page_dirty, it simply finds the unique memslot which contains the dirty gfn and then set the dirty bit within that slot. It doesn't attempt to walk all memslots to find out any other GPA which may be mapped to the same HVA. So there must be some disconnect here. let's hear from Paolo first and understand the rationale behind such situation. Neither did vhost when IOTLB is disabled. And cc Michael who points out this issue at the beginning. Thanks Thanks Kevin
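Yan's proposal above, passing GPA alongside IOVA and HVA so the driver holds an explicit IOVA->GPA mapping, can be sketched as follows. `map_dma` here mimics a hypothetical extension of VFIO_IOMMU_MAP_DMA carrying a GPA argument; it is not a real ABI, and all names are made up:

```python
class VfioDmaMap:
    """Hypothetical per-container DMA table keyed by IOVA."""
    def __init__(self):
        self.by_iova = {}

    def map_dma(self, iova, hva, gpa):
        # Imagined VFIO_IOMMU_MAP_DMA extended with a gpa field.
        self.by_iova[iova] = (hva, gpa)

    def log_write(self, iova, dirty_log):
        # With an explicit IOVA->GPA entry there is exactly one GPA to
        # log: no HVA reverse lookup and no aliasing ambiguity.
        _hva, gpa = self.by_iova[iova]
        dirty_log.add(gpa)

d = VfioDmaMap()
d.map_dma(iova=0x1000, hva=0x7f0000021000, gpa=0xfffe1000)
dirty_log = set()
d.log_write(0x1000, dirty_log)
```

This is the trade-off debated in the thread: maintaining IOVA->GPA makes logging exact, at the cost of translating twice (IOVA->GPA, then GPA->HVA) on the data path.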
Re: [Qemu-devel] vhost, iova, and dirty page tracking
On 2019/9/19 下午2:32, Yan Zhao wrote: On Thu, Sep 19, 2019 at 02:29:54PM +0800, Yan Zhao wrote: On Thu, Sep 19, 2019 at 02:32:03PM +0800, Jason Wang wrote: On 2019/9/19 下午2:17, Yan Zhao wrote: On Thu, Sep 19, 2019 at 02:09:53PM +0800, Jason Wang wrote: On 2019/9/19 下午1:28, Yan Zhao wrote: On Thu, Sep 19, 2019 at 09:05:12AM +0800, Jason Wang wrote: On 2019/9/18 下午4:37, Tian, Kevin wrote: From: Jason Wang [mailto:jasow...@redhat.com] Sent: Wednesday, September 18, 2019 2:10 PM Note that the HVA to GPA mapping is not an 1:1 mapping. One HVA range could be mapped to several GPA ranges. This is fine. Currently vfio_dma maintains IOVA->HVA mapping. btw under what condition HVA->GPA is not 1:1 mapping? I didn't realize it. I don't remember the details e.g memory region alias? And neither kvm nor kvm API does forbid this if my memory is correct. I checked https://qemu.weilnetz.de/doc/devel/memory.html, which provides an example of aliased layout. However, its aliasing is all 1:1, instead of N:1. From guest p.o.v every writable GPA implies an unique location. Why would we hit the situation where multiple write-able GPAs are mapped to the same HVA (i.e. same physical memory location)? I don't know, just want to say current API does not forbid this. So we probably need to take care it. yes, in KVM API level, it does not forbid two slots to have the same HVA(slot->userspace_addr). But (1) there's only one kvm instance for each vm for each qemu process. (2) all ramblock->host (corresponds to HVA and slot->userspace_addr) in one qemu process is non-overlapping as it's obtained from mmmap(). (3) qemu ensures two kvm slots will not point to the same section of one ramblock. So, as long as kvm instance is not shared in two processes, and there's no bug in qemu, we can assure that HVA to GPA is 1:1. Well, you leave this API for userspace, so you can't assume qemu is the only user or any its behavior. If you had you should limit it in the API level instead of open window for them. 
But even if there are two processes operating on the same kvm instance and manipulating on memory slots, adding an extra GPA along side current IOVA & HVA to ioctl VFIO_IOMMU_MAP_DMA can still let driver knows the right IOVA->GPA mapping, right? It looks fragile. Consider HVA was mapped to both GPA1 and GPA2. Guest maps IOVA to GPA2, so we have IOVA GPA2 HVA in the new ioctl and then log through GPA2. If userspace is trying to sync through GPA1, it will miss the dirty page. So for safety we need log both GPA1 and GPA2. (See what has been done in log_write_hva() in vhost.c). The only way to do that is to maintain an independent HVA to GPA mapping like what KVM or vhost did. why GPA1 and GPA2 should be both dirty? even they have the same HVA due to overlaping virtual address space in two processes, they still correspond to two physical pages. don't get what's your meaning :) The point is not leave any corner case that is hard to debug or fix in the future. Let's just start by a single process, the API allows userspace to maps HVA to both GPA1 and GPA2. Since it knows GPA1 and GPA2 are equivalent, it's ok to sync just through GPA1. That means if you only log GPA2, it won't work. In that case, cannot log dirty according to HPA. sorry, it should be "cannot log dirty according to HVA". I think we are discussing the choice between GPA and IOVA, not HVA? Thanks because kvm cannot tell whether it's an valid case (the two GPAs are equivalent) or an invalid case (the two GPAs are not equivalent, but with the same HVA value). Right? Thanks Yan Thanks Thanks Yan Thanks Thanks Yan Is Qemu doing its own same-content memory merging in GPA level, similar to KSM? AFAIK, it doesn't. Thanks Thanks Kevin
Re: [Qemu-devel] vhost, iova, and dirty page tracking
+Paolo to help clarify here. > From: Jason Wang [mailto:jasow...@redhat.com] > Sent: Thursday, September 19, 2019 2:32 PM > > > On 2019/9/19 下午2:17, Yan Zhao wrote: > > On Thu, Sep 19, 2019 at 02:09:53PM +0800, Jason Wang wrote: > >> On 2019/9/19 下午1:28, Yan Zhao wrote: > >>> On Thu, Sep 19, 2019 at 09:05:12AM +0800, Jason Wang wrote: > On 2019/9/18 下午4:37, Tian, Kevin wrote: > >> From: Jason Wang [mailto:jasow...@redhat.com] > >> Sent: Wednesday, September 18, 2019 2:10 PM > >> > Note that the HVA to GPA mapping is not an 1:1 mapping. One > HVA > >> range > could be mapped to several GPA ranges. > >>> This is fine. Currently vfio_dma maintains IOVA->HVA mapping. > >>> > >>> btw under what condition HVA->GPA is not 1:1 mapping? I didn't > realize it. > >> I don't remember the details e.g memory region alias? And neither > kvm > >> nor kvm API does forbid this if my memory is correct. > >> > > I checked https://qemu.weilnetz.de/doc/devel/memory.html, which > > provides an example of aliased layout. However, its aliasing is all > > 1:1, instead of N:1. From guest p.o.v every writable GPA implies an > > unique location. Why would we hit the situation where multiple > > write-able GPAs are mapped to the same HVA (i.e. same physical > > memory location)? > I don't know, just want to say current API does not forbid this. So we > probably need to take care it. > > >>> yes, in KVM API level, it does not forbid two slots to have the same > HVA(slot->userspace_addr). > >>> But > >>> (1) there's only one kvm instance for each vm for each qemu process. > >>> (2) all ramblock->host (corresponds to HVA and slot->userspace_addr) > in one qemu > >>> process is non-overlapping as it's obtained from mmmap(). > >>> (3) qemu ensures two kvm slots will not point to the same section of > one ramblock. > >>> > >>> So, as long as kvm instance is not shared in two processes, and > >>> there's no bug in qemu, we can assure that HVA to GPA is 1:1. 
> >> > >> Well, you leave this API for userspace, so you can't assume qemu is the > >> only user or any its behavior. If you had you should limit it in the API > >> level instead of open window for them. > >> > >> > >>> But even if there are two processes operating on the same kvm > instance > >>> and manipulating on memory slots, adding an extra GPA along side > current > >>> IOVA & HVA to ioctl VFIO_IOMMU_MAP_DMA can still let driver knows > the > >>> right IOVA->GPA mapping, right? > >> > >> It looks fragile. Consider HVA was mapped to both GPA1 and GPA2. > Guest > >> maps IOVA to GPA2, so we have IOVA GPA2 HVA in the new ioctl and > then > >> log through GPA2. If userspace is trying to sync through GPA1, it will > >> miss the dirty page. So for safety we need log both GPA1 and GPA2. (See > >> what has been done in log_write_hva() in vhost.c). The only way to do > >> that is to maintain an independent HVA to GPA mapping like what KVM > or > >> vhost did. > >> > > why GPA1 and GPA2 should be both dirty? > > even they have the same HVA due to overlaping virtual address space in > > two processes, they still correspond to two physical pages. > > don't get what's your meaning :) > > > The point is not leave any corner case that is hard to debug or fix in > the future. > > Let's just start by a single process, the API allows userspace to maps > HVA to both GPA1 and GPA2. Since it knows GPA1 and GPA2 are equivalent, > it's ok to sync just through GPA1. That means if you only log GPA2, it > won't work. > I noted KVM itself doesn't consider such situation (one HVA is mapped to multiple GPAs), when doing its dirty page tracking. If you look at kvm_vcpu_mark_page_dirty, it simply finds the unique memslot which contains the dirty gfn and then set the dirty bit within that slot. It doesn't attempt to walk all memslots to find out any other GPA which may be mapped to the same HVA. So there must be some disconnect here. 
Let's hear from Paolo first and understand the rationale behind such a situation. Thanks Kevin
Re: [Qemu-devel] vhost, iova, and dirty page tracking
On Thu, Sep 19, 2019 at 02:29:54PM +0800, Yan Zhao wrote: > On Thu, Sep 19, 2019 at 02:32:03PM +0800, Jason Wang wrote: > > > > On 2019/9/19 下午2:17, Yan Zhao wrote: > > > On Thu, Sep 19, 2019 at 02:09:53PM +0800, Jason Wang wrote: > > >> On 2019/9/19 下午1:28, Yan Zhao wrote: > > >>> On Thu, Sep 19, 2019 at 09:05:12AM +0800, Jason Wang wrote: > > On 2019/9/18 下午4:37, Tian, Kevin wrote: > > >> From: Jason Wang [mailto:jasow...@redhat.com] > > >> Sent: Wednesday, September 18, 2019 2:10 PM > > >> > > Note that the HVA to GPA mapping is not an 1:1 mapping. One HVA > > >> range > > could be mapped to several GPA ranges. > > >>> This is fine. Currently vfio_dma maintains IOVA->HVA mapping. > > >>> > > >>> btw under what condition HVA->GPA is not 1:1 mapping? I didn't > > >>> realize it. > > >> I don't remember the details e.g memory region alias? And neither kvm > > >> nor kvm API does forbid this if my memory is correct. > > >> > > > I checked https://qemu.weilnetz.de/doc/devel/memory.html, which > > > provides an example of aliased layout. However, its aliasing is all > > > 1:1, instead of N:1. From guest p.o.v every writable GPA implies an > > > unique location. Why would we hit the situation where multiple > > > write-able GPAs are mapped to the same HVA (i.e. same physical > > > memory location)? > > I don't know, just want to say current API does not forbid this. So we > > probably need to take care it. > > > > >>> yes, in KVM API level, it does not forbid two slots to have the same > > >>> HVA(slot->userspace_addr). > > >>> But > > >>> (1) there's only one kvm instance for each vm for each qemu process. > > >>> (2) all ramblock->host (corresponds to HVA and slot->userspace_addr) in > > >>> one qemu > > >>> process is non-overlapping as it's obtained from mmmap(). > > >>> (3) qemu ensures two kvm slots will not point to the same section of > > >>> one ramblock. 
> > >>> > > >>> So, as long as kvm instance is not shared in two processes, and > > >>> there's no bug in qemu, we can assure that HVA to GPA is 1:1. > > >> > > >> Well, you leave this API for userspace, so you can't assume qemu is the > > >> only user or any its behavior. If you had you should limit it in the API > > >> level instead of open window for them. > > >> > > >> > > >>> But even if there are two processes operating on the same kvm instance > > >>> and manipulating on memory slots, adding an extra GPA along side current > > >>> IOVA & HVA to ioctl VFIO_IOMMU_MAP_DMA can still let driver knows the > > >>> right IOVA->GPA mapping, right? > > >> > > >> It looks fragile. Consider HVA was mapped to both GPA1 and GPA2. Guest > > >> maps IOVA to GPA2, so we have IOVA GPA2 HVA in the new ioctl and then > > >> log through GPA2. If userspace is trying to sync through GPA1, it will > > >> miss the dirty page. So for safety we need log both GPA1 and GPA2. (See > > >> what has been done in log_write_hva() in vhost.c). The only way to do > > >> that is to maintain an independent HVA to GPA mapping like what KVM or > > >> vhost did. > > >> > > > why GPA1 and GPA2 should be both dirty? > > > even they have the same HVA due to overlaping virtual address space in > > > two processes, they still correspond to two physical pages. > > > don't get what's your meaning :) > > > > > > The point is not leave any corner case that is hard to debug or fix in > > the future. > > > > Let's just start by a single process, the API allows userspace to maps > > HVA to both GPA1 and GPA2. Since it knows GPA1 and GPA2 are equivalent, > > it's ok to sync just through GPA1. That means if you only log GPA2, it > > won't work. > > > In that case, cannot log dirty according to HPA. sorry, it should be "cannot log dirty according to HVA". 
> because kvm cannot tell whether it's an valid case (the two GPAs are > equivalent) > or an invalid case (the two GPAs are not equivalent, but with the same > HVA value). > > Right? > > Thanks > Yan > > > > Thanks > > > > > > > > > > Thanks > > > Yan > > > > > > > > >> Thanks > > >> > > >> > > >>> Thanks > > >>> Yan > > >>> > > > Is Qemu doing its own same-content memory > > > merging in GPA level, similar to KSM? > > AFAIK, it doesn't. > > > > Thanks > > > > > > > Thanks > > > Kevin > > >
Re: [Qemu-devel] vhost, iova, and dirty page tracking
On 2019/9/19 下午1:28, Yan Zhao wrote: On Thu, Sep 19, 2019 at 09:05:12AM +0800, Jason Wang wrote: On 2019/9/18 下午4:37, Tian, Kevin wrote: From: Jason Wang [mailto:jasow...@redhat.com] Sent: Wednesday, September 18, 2019 2:10 PM Note that the HVA to GPA mapping is not a 1:1 mapping. One HVA range could be mapped to several GPA ranges. This is fine. Currently vfio_dma maintains the IOVA->HVA mapping. BTW, under what condition is HVA->GPA not a 1:1 mapping? I didn't realize it. I don't remember the details, e.g. memory region alias? And neither kvm nor the kvm API forbids this, if my memory is correct. I checked https://qemu.weilnetz.de/doc/devel/memory.html, which provides an example of an aliased layout. However, its aliasing is all 1:1, instead of N:1. From the guest's p.o.v. every writable GPA implies a unique location. Why would we hit the situation where multiple writable GPAs are mapped to the same HVA (i.e. the same physical memory location)? I don't know, just want to say the current API does not forbid this. So we probably need to take care of it. Yes, at the KVM API level, it does not forbid two slots from having the same HVA (slot->userspace_addr). But (1) there's only one kvm instance for each vm for each qemu process. (2) all ramblock->host (corresponding to HVA and slot->userspace_addr) in one qemu process are non-overlapping, as they're obtained from mmap(). (3) qemu ensures two kvm slots will not point to the same section of one ramblock. So, as long as the kvm instance is not shared between two processes, and there's no bug in qemu, we can be assured that HVA to GPA is 1:1. Well, you leave this API for userspace, so you can't assume qemu is the only user or rely on any of its behavior. If you had, you should have limited it at the API level instead of opening a window for them. But even if there are two processes operating on the same kvm instance and manipulating memory slots, adding an extra GPA alongside the current IOVA & HVA to ioctl VFIO_IOMMU_MAP_DMA can still let the driver know the right IOVA->GPA mapping, right?
It looks fragile. Consider an HVA that was mapped to both GPA1 and GPA2. The guest maps IOVA to GPA2, so we have IOVA, GPA2, HVA in the new ioctl and then log through GPA2. If userspace tries to sync through GPA1, it will miss the dirty page. So for safety we need to log both GPA1 and GPA2. (See what has been done in log_write_hva() in vhost.c.) The only way to do that is to maintain an independent HVA to GPA mapping like what KVM or vhost did. Thanks Yan Is Qemu doing its own same-content memory merging at GPA level, similar to KSM? AFAIK, it doesn't. Thanks Kevin
Re: [Qemu-devel] vhost, iova, and dirty page tracking
On Thu, Sep 19, 2019 at 09:05:12AM +0800, Jason Wang wrote: > > On 2019/9/18 下午4:37, Tian, Kevin wrote: > >> From: Jason Wang [mailto:jasow...@redhat.com] > >> Sent: Wednesday, September 18, 2019 2:10 PM > >> > Note that the HVA to GPA mapping is not an 1:1 mapping. One HVA > >> range > could be mapped to several GPA ranges. > >>> This is fine. Currently vfio_dma maintains IOVA->HVA mapping. > >>> > >>> btw under what condition HVA->GPA is not 1:1 mapping? I didn't realize it. > >> > >> I don't remember the details e.g memory region alias? And neither kvm > >> nor kvm API does forbid this if my memory is correct. > >> > > I checked https://qemu.weilnetz.de/doc/devel/memory.html, which > > provides an example of aliased layout. However, its aliasing is all > > 1:1, instead of N:1. From guest p.o.v every writable GPA implies an > > unique location. Why would we hit the situation where multiple > > write-able GPAs are mapped to the same HVA (i.e. same physical > > memory location)? > > > I don't know, just want to say current API does not forbid this. So we > probably need to take care it. > Yes, at the KVM API level, it does not forbid two slots from having the same HVA (slot->userspace_addr). But (1) there's only one kvm instance for each vm for each qemu process. (2) all ramblock->host (corresponding to HVA and slot->userspace_addr) in one qemu process are non-overlapping, as they're obtained from mmap(). (3) qemu ensures two kvm slots will not point to the same section of one ramblock. So, as long as the kvm instance is not shared between two processes, and there's no bug in qemu, we can be assured that HVA to GPA is 1:1. But even if there are two processes operating on the same kvm instance and manipulating memory slots, adding an extra GPA alongside the current IOVA & HVA to ioctl VFIO_IOMMU_MAP_DMA can still let the driver know the right IOVA->GPA mapping, right? Thanks Yan > > > Is Qemu doing its own same-content memory > > merging in GPA level, similar to KSM? > > > AFAIK, it doesn't.
> > Thanks > > > > Thanks > > Kevin > > >
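Yan's proposal above, letting VFIO_IOMMU_MAP_DMA carry a GPA alongside IOVA and HVA so the driver can cache an exact IOVA->GPA mapping in each vfio_dma, can be sketched as a toy model (Python; the struct and field names are hypothetical, not the real vfio uAPI):

```python
# Hypothetical sketch of the proposal: extend the VFIO_IOMMU_MAP_DMA
# payload with a GPA field, so each vfio_dma records IOVA->HVA *and*
# IOVA->GPA. Field names are illustrative only.
from dataclasses import dataclass

@dataclass
class VfioDmaMap:           # models struct vfio_iommu_type1_dma_map + gpa
    iova: int
    vaddr: int              # HVA
    size: int
    gpa: int                # the proposed extra field

dma_list = []               # models the per-container vfio_dma list

def map_dma(req: VfioDmaMap):
    dma_list.append(req)

def iova_to_gpa(iova):
    """Driver-side translation for dirty logging: IOVA -> GPA directly,
    with no ambiguous HVA->GPA reverse lookup."""
    for d in dma_list:
        if d.iova <= iova < d.iova + d.size:
            return d.gpa + (iova - d.iova)
    return None

map_dma(VfioDmaMap(iova=0x1000, vaddr=0x7f00dead0000, size=0x2000, gpa=0x80001000))
```

With the GPA recorded at map time, dirty logging never needs the ambiguous HVA->GPA reverse lookup; the cost is a uAPI change and, as Jason notes, trusting userspace to supply a truthful GPA for every alias.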
Re: [Qemu-devel] vhost, iova, and dirty page tracking
> From: Jason Wang [mailto:jasow...@redhat.com] > Sent: Wednesday, September 18, 2019 2:10 PM > > On 2019/9/18 上午9:44, Tian, Kevin wrote: > >> From: Jason Wang [mailto:jasow...@redhat.com] > >> Sent: Tuesday, September 17, 2019 6:36 PM > >> > >> On 2019/9/17 下午4:48, Tian, Kevin wrote: > From: Jason Wang [mailto:jasow...@redhat.com] > Sent: Monday, September 16, 2019 4:33 PM > > > On 2019/9/16 上午9:51, Tian, Kevin wrote: > > Hi, Jason > > > > We had a discussion about dirty page tracking in VFIO, when > vIOMMU > > is enabled: > > > > https://lists.nongnu.org/archive/html/qemu-devel/2019- > 09/msg02690.html > > It's actually a similar model as vhost - Qemu cannot interpose the > fast- > path > > DMAs thus relies on the kernel part to track and report dirty page > information. > > Currently Qemu tracks dirty pages in GFN level, thus demanding a > translation > > from IOVA to GPA. Then the open in our discussion is where this > translation > > should happen. Doing the translation in kernel implies a device iotlb > flavor, > > which is what vhost implements today. It requires potentially large > tracking > > structures in the host kernel, but leveraging the existing log_sync > flow > >> in > Qemu. > > On the other hand, Qemu may perform log_sync for every removal > of > IOVA > > mapping and then do the translation itself, then avoiding the GPA > awareness > > in the kernel side. It needs some change to current Qemu log-sync > flow, > and > > may bring more overhead if IOVA is frequently unmapped. > > > > So we'd like to hear about your opinions, especially about how you > >> came > > down to the current iotlb approach for vhost. > We don't consider too much in the point when introducing vhost. And > before IOTLB, vhost has already know GPA through its mem table > (GPA->HVA). So it's nature and easier to track dirty pages at GPA level > then it won't any changes in the existing ABI. > >>> This is the same situation as VFIO. 
> >>> > For VFIO case, the only advantages of using GPA is that the log can > then > be shared among all the devices that belongs to the VM. Otherwise > syncing through IOVA is cleaner. > >>> I still worry about the potential performance impact with this approach. > >>> In current mdev live migration series, there are multiple system calls > >>> involved when retrieving the dirty bitmap information for a given > memory > >>> range. > >> > >> I haven't took a deep look at that series. Technically dirty bitmap > >> could be shared between device and driver, then there's no system call > >> in synchronization. > > That series require Qemu to tell the kernel about the information > > about queried region (start, number, and page_size), read > > the information about the dirty bitmap (offset, size) and then read > > the dirty bitmap. > > > Any pointer to that series, I can only find a "mdev live migration > support with vfio-mdev-pci" from Liu Yi without actual codes. https://lists.nongnu.org/archive/html/qemu-devel/2019-08/msg05543.html It's interesting that I cannot google it. Have to manually find it in Qemu archive. > > > > Although the bitmap can be mmaped thus shared, > > earlier reads/writes are conducted by pread/pwrite system calls. > > This design is fine for current log_dirty implementation, where > > dirty bitmap is synced in every pre-copy round. But to do it for > > every IOVA unmap, it's definitely over-killed. > > > >> > >>> IOVA mappings might be changed frequently. Though one may > >>> argue that frequent IOVA change already has bad performance, it's still > >>> not good to introduce further non-negligible overhead in such situation. > >> > >> Yes, it depends on the behavior of vIOMMU driver, e.g the frequency > and > >> granularity of the flushing. > >> > >> > >>> On the other hand, I realized that adding IOVA awareness in VFIO is > >>> actually easy. 
Today VFIO already maintains a full list of IOVA and its > >>> associated HVA in vfio_dma structure, according to VFIO_MAP and > >>> VFIO_UNMAP. As long as we allow the latter two operations to accept > >>> another parameter (GPA), IOVA->GPA mapping can be naturally cached > >>> in existing vfio_dma objects. > >> > >> Note that the HVA to GPA mapping is not an 1:1 mapping. One HVA > range > >> could be mapped to several GPA ranges. > > This is fine. Currently vfio_dma maintains IOVA->HVA mapping. > > > > btw under what condition HVA->GPA is not 1:1 mapping? I didn't realize it. > > > I don't remember the details e.g memory region alias? And neither kvm > nor kvm API does forbid this if my memory is correct. > I did see such a comment in vhost code (log_write_hva): /* More than one GPAs can be mapped into a single HVA. So * iterate all possible umems here to be safe. */ and it looks like it tries to log every GPA that the written HVA maps to, to be safe.
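The multi-step dirty-bitmap retrieval Kevin refers to in the mdev migration series can be modeled schematically (Python; the stub below stands in for the kernel side — in the real series each step is a pread/pwrite on the migration region, i.e. a separate system call, which is why repeating the handshake on every IOVA unmap is expensive):

```python
# Schematic of the per-sync flow: (1) write the queried range and page
# size, (2) read back the bitmap's offset/size, (3) read the bitmap.
import math

class FakeMigrationRegion:
    """Stand-in for the kernel side of the migration region."""
    def __init__(self, dirty_pfns):
        self.dirty_pfns = dirty_pfns

    def query(self, start_pfn, total_pfns, page_size):
        # Steps 1+2: kernel builds the bitmap, returns (offset, size).
        nbytes = math.ceil(total_pfns / 8)
        bitmap = bytearray(nbytes)
        for pfn in self.dirty_pfns:
            if start_pfn <= pfn < start_pfn + total_pfns:
                bit = pfn - start_pfn
                bitmap[bit // 8] |= 1 << (bit % 8)
        self.bitmap = bytes(bitmap)
        return 0, nbytes

    def read_bitmap(self, offset, size):
        # Step 3: one more read for the bitmap itself.
        return self.bitmap[offset:offset + size]

region = FakeMigrationRegion(dirty_pfns={3, 9})
off, size = region.query(start_pfn=0, total_pfns=16, page_size=4096)
bitmap = region.read_bitmap(off, size)
```

Per pre-copy round this handshake is cheap; performed per IOVA unmap, the syscall count scales with the unmap rate, which is the overhead being worried about.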
Re: [Qemu-devel] vhost, iova, and dirty page tracking
> From: Jason Wang [mailto:jasow...@redhat.com] > Sent: Wednesday, September 18, 2019 2:04 PM > > On 2019/9/18 上午9:31, Tian, Kevin wrote: > >> From: Alex Williamson [mailto:alex.william...@redhat.com] > >> Sent: Tuesday, September 17, 2019 10:54 PM > >> > >> On Tue, 17 Sep 2019 08:48:36 + > >> "Tian, Kevin" wrote: > >> > From: Jason Wang [mailto:jasow...@redhat.com] > Sent: Monday, September 16, 2019 4:33 PM > > > On 2019/9/16 上午9:51, Tian, Kevin wrote: > > Hi, Jason > > > > We had a discussion about dirty page tracking in VFIO, when > vIOMMU > > is enabled: > > > > https://lists.nongnu.org/archive/html/qemu-devel/2019- > 09/msg02690.html > > It's actually a similar model as vhost - Qemu cannot interpose the > fast- > path > > DMAs thus relies on the kernel part to track and report dirty page > information. > > Currently Qemu tracks dirty pages in GFN level, thus demanding a > translation > > from IOVA to GPA. Then the open in our discussion is where this > translation > > should happen. Doing the translation in kernel implies a device iotlb > flavor, > > which is what vhost implements today. It requires potentially large > tracking > > structures in the host kernel, but leveraging the existing log_sync > flow > >> in > Qemu. > > On the other hand, Qemu may perform log_sync for every removal > of > IOVA > > mapping and then do the translation itself, then avoiding the GPA > awareness > > in the kernel side. It needs some change to current Qemu log-sync > >> flow, > and > > may bring more overhead if IOVA is frequently unmapped. > > > > So we'd like to hear about your opinions, especially about how you > >> came > > down to the current iotlb approach for vhost. > We don't consider too much in the point when introducing vhost. And > before IOTLB, vhost has already know GPA through its mem table > (GPA->HVA). So it's nature and easier to track dirty pages at GPA level > then it won't any changes in the existing ABI. > >>> This is the same situation as VFIO. 
> >> It is? VFIO doesn't know GPAs, it only knows HVA, HPA, and IOVA. In > >> some cases IOVA is GPA, but not all. > > Well, I thought vhost has a similar design, that the index of its mem table > > is GPA when vIOMMU is off and then becomes IOVA when vIOMMU is on. > > But I may be wrong here. Jason, can you help clarify? I saw two > > interfaces which poke the mem table: VHOST_SET_MEM_TABLE (for GPA) > > and VHOST_IOTLB_UPDATE (for IOVA). Are they used exclusively or > together? > > > > Actually, vhost maintains two interval trees, mem table GPA->HVA, and > device IOTLB IOVA->HVA. Device IOTLB is only used when vIOMMU is > enabled, and in that case mem table is used only when vhost need to > track dirty pages (do reverse lookup of memtable to get HVA->GPA). So in > conclusion, for datapath, they are used exclusively, but they need > cowork for logging dirty pages when device IOTLB is enabled. > OK. Then it's different from the current VFIO design, which maintains only one tree, indexed by either GPA or IOVA exclusively, depending on whether vIOMMU is in use. Thanks Kevin
Re: [Qemu-devel] vhost, iova, and dirty page tracking
On 2019/9/18 上午9:44, Tian, Kevin wrote: From: Jason Wang [mailto:jasow...@redhat.com] Sent: Tuesday, September 17, 2019 6:36 PM On 2019/9/17 下午4:48, Tian, Kevin wrote: From: Jason Wang [mailto:jasow...@redhat.com] Sent: Monday, September 16, 2019 4:33 PM On 2019/9/16 上午9:51, Tian, Kevin wrote: Hi, Jason We had a discussion about dirty page tracking in VFIO, when vIOMMU is enabled: https://lists.nongnu.org/archive/html/qemu-devel/2019- 09/msg02690.html It's actually a similar model as vhost - Qemu cannot interpose the fast- path DMAs thus relies on the kernel part to track and report dirty page information. Currently Qemu tracks dirty pages in GFN level, thus demanding a translation from IOVA to GPA. Then the open in our discussion is where this translation should happen. Doing the translation in kernel implies a device iotlb flavor, which is what vhost implements today. It requires potentially large tracking structures in the host kernel, but leveraging the existing log_sync flow in Qemu. On the other hand, Qemu may perform log_sync for every removal of IOVA mapping and then do the translation itself, then avoiding the GPA awareness in the kernel side. It needs some change to current Qemu log-sync flow, and may bring more overhead if IOVA is frequently unmapped. So we'd like to hear about your opinions, especially about how you came down to the current iotlb approach for vhost. We don't consider too much in the point when introducing vhost. And before IOTLB, vhost has already know GPA through its mem table (GPA->HVA). So it's nature and easier to track dirty pages at GPA level then it won't any changes in the existing ABI. This is the same situation as VFIO. For VFIO case, the only advantages of using GPA is that the log can then be shared among all the devices that belongs to the VM. Otherwise syncing through IOVA is cleaner. I still worry about the potential performance impact with this approach. 
In the current mdev live migration series, there are multiple system calls involved when retrieving the dirty bitmap information for a given memory range. I haven't taken a deep look at that series. Technically the dirty bitmap could be shared between device and driver; then there's no system call in synchronization. That series requires Qemu to tell the kernel about the queried region (start, number, and page_size), read the information about the dirty bitmap (offset, size), and then read the dirty bitmap itself. Any pointer to that series? I can only find "mdev live migration support with vfio-mdev-pci" from Liu Yi, without actual code. Although the bitmap can be mmaped and thus shared, the earlier reads/writes are conducted by pread/pwrite system calls. This design is fine for the current log_dirty implementation, where the dirty bitmap is synced in every pre-copy round. But to do it for every IOVA unmap is definitely overkill. IOVA mappings might be changed frequently. Though one may argue that frequent IOVA changes already have bad performance, it's still not good to introduce further non-negligible overhead in such a situation. Yes, it depends on the behavior of the vIOMMU driver, e.g. the frequency and granularity of the flushing. On the other hand, I realized that adding IOVA awareness in VFIO is actually easy. Today VFIO already maintains a full list of IOVAs and their associated HVAs in the vfio_dma structure, according to VFIO_MAP and VFIO_UNMAP. As long as we allow the latter two operations to accept another parameter (GPA), the IOVA->GPA mapping can be naturally cached in existing vfio_dma objects. Note that the HVA to GPA mapping is not a 1:1 mapping. One HVA range could be mapped to several GPA ranges. This is fine. Currently vfio_dma maintains the IOVA->HVA mapping. btw, under what condition is HVA->GPA not a 1:1 mapping? I didn't realize it. I don't remember the details, e.g. memory region aliases? And neither KVM nor the KVM API forbids this, if my memory is correct.
Those objects are always updated according to MAP and UNMAP ioctls to stay up-to-date. Qemu then uniformly retrieves the VFIO dirty bitmap for the entire GPA range in every pre-copy round, regardless of whether vIOMMU is enabled. There is no need for another IOTLB implementation, with the main ask being a v2 MAP/UNMAP interface. Or provide a GPA->HVA mapping as vhost did. But one question: I believe the device can only do dirty page logging through IOVA, so how do you handle the case when an IOVA is removed? That's why a log_sync is required each time an IOVA is unmapped, in Alex's thought. Thanks Kevin Ok. Thanks
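Kevin's proposal can be sketched as follows (a hypothetical v2 interface modeled in Python, not the existing VFIO ABI; the struct and function names only mirror the discussion): if the MAP operation also carried a GPA, each vfio_dma object would cache IOVA->GPA directly, and no reverse lookup would be needed.

```python
# Sketch of a hypothetical extended MAP/UNMAP interface; not real VFIO code.
from dataclasses import dataclass

@dataclass
class VfioDma:      # models struct vfio_dma, extended with a gpa field
    iova: int
    hva: int
    gpa: int        # the proposed extra parameter of the v2 MAP ioctl
    size: int

dmas = []

def vfio_map(iova, hva, gpa, size):
    """v2 MAP: record GPA alongside IOVA/HVA in the vfio_dma object."""
    dmas.append(VfioDma(iova, hva, gpa, size))

def vfio_unmap(iova):
    """v2 UNMAP: drop the vfio_dma object for this IOVA."""
    dmas[:] = [d for d in dmas if d.iova != iova]

def iova_to_gpa(iova):
    """Direct IOVA->GPA translation from the cached mapping."""
    for d in dmas:
        if d.iova <= iova < d.iova + d.size:
            return d.gpa + (iova - d.iova)
    return None
```

With this cache the kernel could report dirtiness in GPA terms for the whole pre-copy round; the open issue Jason raises remains that the translation disappears as soon as the UNMAP runs, hence the suggestion of a log_sync per unmap.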
Re: [Qemu-devel] vhost, iova, and dirty page tracking
On 2019/9/18 上午9:31, Tian, Kevin wrote: From: Alex Williamson [mailto:alex.william...@redhat.com] Sent: Tuesday, September 17, 2019 10:54 PM On Tue, 17 Sep 2019 08:48:36 + "Tian, Kevin" wrote: From: Jason Wang [mailto:jasow...@redhat.com] Sent: Monday, September 16, 2019 4:33 PM On 2019/9/16 上午9:51, Tian, Kevin wrote: Hi, Jason We had a discussion about dirty page tracking in VFIO, when vIOMMU is enabled: https://lists.nongnu.org/archive/html/qemu-devel/2019- 09/msg02690.html It's actually a similar model as vhost - Qemu cannot interpose the fast- path DMAs thus relies on the kernel part to track and report dirty page information. Currently Qemu tracks dirty pages in GFN level, thus demanding a translation from IOVA to GPA. Then the open in our discussion is where this translation should happen. Doing the translation in kernel implies a device iotlb flavor, which is what vhost implements today. It requires potentially large tracking structures in the host kernel, but leveraging the existing log_sync flow in Qemu. On the other hand, Qemu may perform log_sync for every removal of IOVA mapping and then do the translation itself, then avoiding the GPA awareness in the kernel side. It needs some change to current Qemu log-sync flow, and may bring more overhead if IOVA is frequently unmapped. So we'd like to hear about your opinions, especially about how you came down to the current iotlb approach for vhost. We don't consider too much in the point when introducing vhost. And before IOTLB, vhost has already know GPA through its mem table (GPA->HVA). So it's nature and easier to track dirty pages at GPA level then it won't any changes in the existing ABI. This is the same situation as VFIO. It is? VFIO doesn't know GPAs, it only knows HVA, HPA, and IOVA. In some cases IOVA is GPA, but not all. Well, I thought vhost has a similar design, that the index of its mem table is GPA when vIOMMU is off and then becomes IOVA when vIOMMU is on. But I may be wrong here. 
Jason, can you help clarify? I saw two interfaces which poke the mem table: VHOST_SET_MEM_TABLE (for GPA) and VHOST_IOTLB_UPDATE (for IOVA). Are they used exclusively or together?

Actually, vhost maintains two interval trees: the mem table (GPA->HVA) and the device IOTLB (IOVA->HVA). The device IOTLB is only used when vIOMMU is enabled, and in that case the mem table is used only when vhost needs to track dirty pages (doing a reverse lookup of the mem table to get HVA->GPA). So in conclusion: for the datapath they are used exclusively, but they need to work together for logging dirty pages when device IOTLB is enabled.

Thanks
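Jason's two-tree description can be sketched conceptually (Python standing in for vhost's C interval trees; the addresses and helper names are made up for illustration): the datapath translates IOVA->HVA through the device IOTLB, while dirty logging walks the mem table in reverse, HVA->GPA, which may yield several GPAs when they alias the same HVA.

```python
# Conceptual model of vhost's two lookup structures, not its actual code.
# mem table: (gpa_start, hva_start, size) regions, i.e. GPA -> HVA
mem_table = [
    (0x0000, 0x7f000000, 0x1000),  # GPA1 -> some HVA range
    (0x2000, 0x7f000000, 0x1000),  # GPA2 aliases the same HVA range
]
# device IOTLB: (iova_start, hva_start, size) regions, i.e. IOVA -> HVA
iotlb = [(0x9000, 0x7f000000, 0x1000)]

def iova_to_hva(iova):
    for start, hva, size in iotlb:
        if start <= iova < start + size:
            return hva + (iova - start)
    return None

def hva_to_gpas(hva):
    """Reverse lookup of the mem table: one HVA may map to several GPAs."""
    return [gpa + (hva - h) for gpa, h, size in mem_table
            if h <= hva < h + size]

dirty_gpas = set()

def log_write(iova):
    """Log a guest write observed at an IOVA, at GPA granularity."""
    hva = iova_to_hva(iova)
    if hva is None:
        return
    for gpa in hva_to_gpas(hva):
        dirty_gpas.add(gpa & ~0xFFF)  # page-align the dirty GPA
```

Logging a write at IOVA 0x9010 marks both aliased pages (GPA 0x0000 and GPA 0x2000) dirty, which is the "set both GPA1 and GPA2" behavior discussed elsewhere in the thread.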
Re: [Qemu-devel] vhost, iova, and dirty page tracking
> From: Tian, Kevin > Sent: Wednesday, September 18, 2019 9:32 AM > > > From: Alex Williamson [mailto:alex.william...@redhat.com] > > Sent: Tuesday, September 17, 2019 10:54 PM > > > > On Tue, 17 Sep 2019 08:48:36 + > > "Tian, Kevin" wrote: > > > > > > From: Jason Wang [mailto:jasow...@redhat.com] > > > > Sent: Monday, September 16, 2019 4:33 PM > > > > > > > > > > > > On 2019/9/16 上午9:51, Tian, Kevin wrote: > > > > > Hi, Jason > > > > > > > > > > We had a discussion about dirty page tracking in VFIO, when > vIOMMU > > > > > is enabled: > > > > > > > > > > https://lists.nongnu.org/archive/html/qemu-devel/2019- > > > > 09/msg02690.html > > > > > > > > > > It's actually a similar model as vhost - Qemu cannot interpose the > fast- > > > > path > > > > > DMAs thus relies on the kernel part to track and report dirty page > > > > information. > > > > > Currently Qemu tracks dirty pages in GFN level, thus demanding a > > > > translation > > > > > from IOVA to GPA. Then the open in our discussion is where this > > > > translation > > > > > should happen. Doing the translation in kernel implies a device iotlb > > > > flavor, > > > > > which is what vhost implements today. It requires potentially large > > > > tracking > > > > > structures in the host kernel, but leveraging the existing log_sync > flow > > in > > > > Qemu. > > > > > On the other hand, Qemu may perform log_sync for every removal > of > > > > IOVA > > > > > mapping and then do the translation itself, then avoiding the GPA > > > > awareness > > > > > in the kernel side. It needs some change to current Qemu log-sync > > flow, > > > > and > > > > > may bring more overhead if IOVA is frequently unmapped. > > > > > > > > > > So we'd like to hear about your opinions, especially about how you > > came > > > > > down to the current iotlb approach for vhost. > > > > > > > > > > > > We don't consider too much in the point when introducing vhost. 
And > > > > before IOTLB, vhost has already know GPA through its mem table > > > > (GPA->HVA). So it's nature and easier to track dirty pages at GPA level > > > > then it won't any changes in the existing ABI. > > > > > > This is the same situation as VFIO. > > > > It is? VFIO doesn't know GPAs, it only knows HVA, HPA, and IOVA. In > > some cases IOVA is GPA, but not all. > > Well, I thought vhost has a similar design, that the index of its mem table > is GPA when vIOMMU is off and then becomes IOVA when vIOMMU is on. > But I may be wrong here. Jason, can you help clarify? I saw two > interfaces which poke the mem table: VHOST_SET_MEM_TABLE (for GPA) > and VHOST_IOTLB_UPDATE (for IOVA). Are they used exclusively or > together? > > > > > > > For VFIO case, the only advantages of using GPA is that the log can > then > > > > be shared among all the devices that belongs to the VM. Otherwise > > > > syncing through IOVA is cleaner. > > > > > > I still worry about the potential performance impact with this approach. > > > In current mdev live migration series, there are multiple system calls > > > involved when retrieving the dirty bitmap information for a given > memory > > > range. IOVA mappings might be changed frequently. Though one may > > > argue that frequent IOVA change already has bad performance, it's still > > > not good to introduce further non-negligible overhead in such situation. > > > > > > On the other hand, I realized that adding IOVA awareness in VFIO is > > > actually easy. Today VFIO already maintains a full list of IOVA and its > > > associated HVA in vfio_dma structure, according to VFIO_MAP and > > > VFIO_UNMAP. As long as we allow the latter two operations to accept > > > another parameter (GPA), IOVA->GPA mapping can be naturally cached > > > in existing vfio_dma objects. Those objects are always updated > according > > > to MAP and UNMAP ioctls to be up-to-date. 
Qemu then uniformly > > > retrieves the VFIO dirty bitmap for the entire GPA range in every pre- copy > > > round, regardless of whether vIOMMU is enabled. There is no need of > > > another IOTLB implementation, with the main ask on a v2 MAP/UNMAP > > > interface. > > > > > > Alex, your thoughts? > > > > Same as last time, you're asking VFIO to be aware of an entirely new > > address space and implement tracking structures of that address space > > to make life easier for QEMU. Don't we typically push such complexity > > to userspace rather than into the kernel? I'm not convinced. Thanks, > > > > Is it really complex? No need of a new tracking structure. Just allowing > the MAP interface to carry a new parameter and then record it in the > existing vfio_dma objects. > > Note the frequency of guest DMA map/unmap could be very high. We > saw >100K invocations per second with a 40G NIC. To do the right > translation Qemu requires log_sync for every unmap, before the > mapping for logged dirty IOVA becomes stale. In current Kirti's patch, > each log_sync requires several system_calls through the migration > info, e.g. setting start_pfn/page_size/total_pfns and then reading > data_offset/data_size. That design is fine for doing log_sync in every > pre-copy round, but too costly if doing so for every IOVA unmap. If > small extension in kernel can lead to great overhead reduction, why > not? > > Thanks Kevin
Re: [Qemu-devel] vhost, iova, and dirty page tracking
> From: Jason Wang [mailto:jasow...@redhat.com] > Sent: Tuesday, September 17, 2019 6:36 PM > > On 2019/9/17 下午4:48, Tian, Kevin wrote: > >> From: Jason Wang [mailto:jasow...@redhat.com] > >> Sent: Monday, September 16, 2019 4:33 PM > >> > >> > >> On 2019/9/16 上午9:51, Tian, Kevin wrote: > >>> Hi, Jason > >>> > >>> We had a discussion about dirty page tracking in VFIO, when vIOMMU > >>> is enabled: > >>> > >>> https://lists.nongnu.org/archive/html/qemu-devel/2019- > >> 09/msg02690.html > >>> It's actually a similar model as vhost - Qemu cannot interpose the fast- > >> path > >>> DMAs thus relies on the kernel part to track and report dirty page > >> information. > >>> Currently Qemu tracks dirty pages in GFN level, thus demanding a > >> translation > >>> from IOVA to GPA. Then the open in our discussion is where this > >> translation > >>> should happen. Doing the translation in kernel implies a device iotlb > >> flavor, > >>> which is what vhost implements today. It requires potentially large > >> tracking > >>> structures in the host kernel, but leveraging the existing log_sync flow > in > >> Qemu. > >>> On the other hand, Qemu may perform log_sync for every removal of > >> IOVA > >>> mapping and then do the translation itself, then avoiding the GPA > >> awareness > >>> in the kernel side. It needs some change to current Qemu log-sync flow, > >> and > >>> may bring more overhead if IOVA is frequently unmapped. > >>> > >>> So we'd like to hear about your opinions, especially about how you > came > >>> down to the current iotlb approach for vhost. > >> > >> We don't consider too much in the point when introducing vhost. And > >> before IOTLB, vhost has already know GPA through its mem table > >> (GPA->HVA). So it's nature and easier to track dirty pages at GPA level > >> then it won't any changes in the existing ABI. > > This is the same situation as VFIO. 
> > > >> For VFIO case, the only advantages of using GPA is that the log can then > >> be shared among all the devices that belongs to the VM. Otherwise > >> syncing through IOVA is cleaner. > > I still worry about the potential performance impact with this approach. > > In current mdev live migration series, there are multiple system calls > > involved when retrieving the dirty bitmap information for a given memory > > range. > > > I haven't took a deep look at that series. Technically dirty bitmap > could be shared between device and driver, then there's no system call > in synchronization. That series requires Qemu to tell the kernel about the queried region (start, number, and page_size), read the information about the dirty bitmap (offset, size), and then read the dirty bitmap itself. Although the bitmap can be mmaped and thus shared, the earlier reads/writes are conducted by pread/pwrite system calls. This design is fine for the current log_dirty implementation, where the dirty bitmap is synced in every pre-copy round. But to do it for every IOVA unmap is definitely overkill. > > > > IOVA mappings might be changed frequently. Though one may > > argue that frequent IOVA change already has bad performance, it's still > > not good to introduce further non-negligible overhead in such situation. > > > Yes, it depends on the behavior of vIOMMU driver, e.g the frequency and > granularity of the flushing. > > > > > > On the other hand, I realized that adding IOVA awareness in VFIO is > > actually easy. Today VFIO already maintains a full list of IOVA and its > > associated HVA in vfio_dma structure, according to VFIO_MAP and > > VFIO_UNMAP. As long as we allow the latter two operations to accept > > another parameter (GPA), IOVA->GPA mapping can be naturally cached > > in existing vfio_dma objects. > > > Note that the HVA to GPA mapping is not an 1:1 mapping. One HVA range > could be mapped to several GPA ranges. This is fine. 
Currently vfio_dma maintains the IOVA->HVA mapping. btw, under what condition is HVA->GPA not a 1:1 mapping? I didn't realize it. > > > > Those objects are always updated according > > to MAP and UNMAP ioctls to be up-to-date. Qemu then uniformly > > retrieves the VFIO dirty bitmap for the entire GPA range in every pre-copy > > round, regardless of whether vIOMMU is enabled. There is no need of > > another IOTLB implementation, with the main ask on a v2 MAP/UNMAP > > interface. > > > Or provide GPA to HVA mapping as vhost did. But a question is, I believe > device can only do dirty page logging through IOVA. So how do you handle > the case when IOVA is removed in this case? > That's why a log_sync is required each time an IOVA is unmapped, in Alex's thought. Thanks Kevin
Re: [Qemu-devel] vhost, iova, and dirty page tracking
> From: Alex Williamson [mailto:alex.william...@redhat.com] > Sent: Tuesday, September 17, 2019 10:54 PM > > On Tue, 17 Sep 2019 08:48:36 + > "Tian, Kevin" wrote: > > > > From: Jason Wang [mailto:jasow...@redhat.com] > > > Sent: Monday, September 16, 2019 4:33 PM > > > > > > > > > On 2019/9/16 上午9:51, Tian, Kevin wrote: > > > > Hi, Jason > > > > > > > > We had a discussion about dirty page tracking in VFIO, when vIOMMU > > > > is enabled: > > > > > > > > https://lists.nongnu.org/archive/html/qemu-devel/2019- > > > 09/msg02690.html > > > > > > > > It's actually a similar model as vhost - Qemu cannot interpose the fast- > > > path > > > > DMAs thus relies on the kernel part to track and report dirty page > > > information. > > > > Currently Qemu tracks dirty pages in GFN level, thus demanding a > > > translation > > > > from IOVA to GPA. Then the open in our discussion is where this > > > translation > > > > should happen. Doing the translation in kernel implies a device iotlb > > > flavor, > > > > which is what vhost implements today. It requires potentially large > > > tracking > > > > structures in the host kernel, but leveraging the existing log_sync flow > in > > > Qemu. > > > > On the other hand, Qemu may perform log_sync for every removal of > > > IOVA > > > > mapping and then do the translation itself, then avoiding the GPA > > > awareness > > > > in the kernel side. It needs some change to current Qemu log-sync > flow, > > > and > > > > may bring more overhead if IOVA is frequently unmapped. > > > > > > > > So we'd like to hear about your opinions, especially about how you > came > > > > down to the current iotlb approach for vhost. > > > > > > > > > We don't consider too much in the point when introducing vhost. And > > > before IOTLB, vhost has already know GPA through its mem table > > > (GPA->HVA). So it's nature and easier to track dirty pages at GPA level > > > then it won't any changes in the existing ABI. > > > > This is the same situation as VFIO. 
> > It is? VFIO doesn't know GPAs, it only knows HVA, HPA, and IOVA. In > some cases IOVA is GPA, but not all. Well, I thought vhost has a similar design, that the index of its mem table is GPA when vIOMMU is off and then becomes IOVA when vIOMMU is on. But I may be wrong here. Jason, can you help clarify? I saw two interfaces which poke the mem table: VHOST_SET_MEM_TABLE (for GPA) and VHOST_IOTLB_UPDATE (for IOVA). Are they used exclusively or together? > > > > For VFIO case, the only advantages of using GPA is that the log can then > > > be shared among all the devices that belongs to the VM. Otherwise > > > syncing through IOVA is cleaner. > > > > I still worry about the potential performance impact with this approach. > > In current mdev live migration series, there are multiple system calls > > involved when retrieving the dirty bitmap information for a given memory > > range. IOVA mappings might be changed frequently. Though one may > > argue that frequent IOVA change already has bad performance, it's still > > not good to introduce further non-negligible overhead in such situation. > > > > On the other hand, I realized that adding IOVA awareness in VFIO is > > actually easy. Today VFIO already maintains a full list of IOVA and its > > associated HVA in vfio_dma structure, according to VFIO_MAP and > > VFIO_UNMAP. As long as we allow the latter two operations to accept > > another parameter (GPA), IOVA->GPA mapping can be naturally cached > > in existing vfio_dma objects. Those objects are always updated according > > to MAP and UNMAP ioctls to be up-to-date. Qemu then uniformly > > retrieves the VFIO dirty bitmap for the entire GPA range in every pre-copy > > round, regardless of whether vIOMMU is enabled. There is no need of > > another IOTLB implementation, with the main ask on a v2 MAP/UNMAP > > interface. > > > > Alex, your thoughts? 
> > Same as last time, you're asking VFIO to be aware of an entirely new > address space and implement tracking structures of that address space > to make life easier for QEMU. Don't we typically push such complexity > to userspace rather than into the kernel? I'm not convinced. Thanks, > Is it really complex? There's no need for a new tracking structure. Just allow the MAP interface to carry a new parameter and record it in the existing vfio_dma objects. Note the frequency of guest DMA map/unmap could be very high: we saw >100K invocations per second with a 40G NIC. To do the right translation, Qemu requires a log_sync for every unmap, before the mapping for the logged dirty IOVA becomes stale. In Kirti's current patch, each log_sync requires several system calls through the migration info, e.g. setting start_pfn/page_size/total_pfns and then reading data_offset/data_size. That design is fine for doing log_sync in every pre-copy round, but too costly if doing so for every IOVA unmap. If a small extension in the kernel can lead to a great overhead reduction, why not? Thanks Kevin
Re: [Qemu-devel] vhost, iova, and dirty page tracking
On Tue, 17 Sep 2019 08:48:36 + "Tian, Kevin" wrote: > > From: Jason Wang [mailto:jasow...@redhat.com] > > Sent: Monday, September 16, 2019 4:33 PM > > > > > > On 2019/9/16 上午9:51, Tian, Kevin wrote: > > > Hi, Jason > > > > > > We had a discussion about dirty page tracking in VFIO, when vIOMMU > > > is enabled: > > > > > > https://lists.nongnu.org/archive/html/qemu-devel/2019- > > 09/msg02690.html > > > > > > It's actually a similar model as vhost - Qemu cannot interpose the fast- > > path > > > DMAs thus relies on the kernel part to track and report dirty page > > information. > > > Currently Qemu tracks dirty pages in GFN level, thus demanding a > > translation > > > from IOVA to GPA. Then the open in our discussion is where this > > translation > > > should happen. Doing the translation in kernel implies a device iotlb > > flavor, > > > which is what vhost implements today. It requires potentially large > > tracking > > > structures in the host kernel, but leveraging the existing log_sync flow > > > in > > Qemu. > > > On the other hand, Qemu may perform log_sync for every removal of > > IOVA > > > mapping and then do the translation itself, then avoiding the GPA > > awareness > > > in the kernel side. It needs some change to current Qemu log-sync flow, > > and > > > may bring more overhead if IOVA is frequently unmapped. > > > > > > So we'd like to hear about your opinions, especially about how you came > > > down to the current iotlb approach for vhost. > > > > > > We don't consider too much in the point when introducing vhost. And > > before IOTLB, vhost has already know GPA through its mem table > > (GPA->HVA). So it's nature and easier to track dirty pages at GPA level > > then it won't any changes in the existing ABI. > > This is the same situation as VFIO. It is? VFIO doesn't know GPAs, it only knows HVA, HPA, and IOVA. In some cases IOVA is GPA, but not all. 
> > For VFIO case, the only advantages of using GPA is that the log can then > > be shared among all the devices that belongs to the VM. Otherwise > > syncing through IOVA is cleaner. > > I still worry about the potential performance impact with this approach. > In current mdev live migration series, there are multiple system calls > involved when retrieving the dirty bitmap information for a given memory > range. IOVA mappings might be changed frequently. Though one may > argue that frequent IOVA change already has bad performance, it's still > not good to introduce further non-negligible overhead in such situation. > > On the other hand, I realized that adding IOVA awareness in VFIO is > actually easy. Today VFIO already maintains a full list of IOVA and its > associated HVA in vfio_dma structure, according to VFIO_MAP and > VFIO_UNMAP. As long as we allow the latter two operations to accept > another parameter (GPA), IOVA->GPA mapping can be naturally cached > in existing vfio_dma objects. Those objects are always updated according > to MAP and UNMAP ioctls to be up-to-date. Qemu then uniformly > retrieves the VFIO dirty bitmap for the entire GPA range in every pre-copy > round, regardless of whether vIOMMU is enabled. There is no need of > another IOTLB implementation, with the main ask on a v2 MAP/UNMAP > interface. > > Alex, your thoughts? Same as last time, you're asking VFIO to be aware of an entirely new address space and implement tracking structures of that address space to make life easier for QEMU. Don't we typically push such complexity to userspace rather than into the kernel? I'm not convinced. Thanks, Alex
Re: [Qemu-devel] vhost, iova, and dirty page tracking
On 2019/9/17 下午4:48, Tian, Kevin wrote: From: Jason Wang [mailto:jasow...@redhat.com] Sent: Monday, September 16, 2019 4:33 PM On 2019/9/16 上午9:51, Tian, Kevin wrote: Hi, Jason We had a discussion about dirty page tracking in VFIO, when vIOMMU is enabled: https://lists.nongnu.org/archive/html/qemu-devel/2019- 09/msg02690.html It's actually a similar model as vhost - Qemu cannot interpose the fast- path DMAs thus relies on the kernel part to track and report dirty page information. Currently Qemu tracks dirty pages in GFN level, thus demanding a translation from IOVA to GPA. Then the open in our discussion is where this translation should happen. Doing the translation in kernel implies a device iotlb flavor, which is what vhost implements today. It requires potentially large tracking structures in the host kernel, but leveraging the existing log_sync flow in Qemu. On the other hand, Qemu may perform log_sync for every removal of IOVA mapping and then do the translation itself, then avoiding the GPA awareness in the kernel side. It needs some change to current Qemu log-sync flow, and may bring more overhead if IOVA is frequently unmapped. So we'd like to hear about your opinions, especially about how you came down to the current iotlb approach for vhost. We don't consider too much in the point when introducing vhost. And before IOTLB, vhost has already know GPA through its mem table (GPA->HVA). So it's nature and easier to track dirty pages at GPA level then it won't any changes in the existing ABI. This is the same situation as VFIO. For VFIO case, the only advantages of using GPA is that the log can then be shared among all the devices that belongs to the VM. Otherwise syncing through IOVA is cleaner. I still worry about the potential performance impact with this approach. In current mdev live migration series, there are multiple system calls involved when retrieving the dirty bitmap information for a given memory range. I haven't took a deep look at that series. 
Technically the dirty bitmap could be shared between device and driver; then there's no system call in synchronization. IOVA mappings might be changed frequently. Though one may argue that frequent IOVA changes already have bad performance, it's still not good to introduce further non-negligible overhead in such a situation. Yes, it depends on the behavior of the vIOMMU driver, e.g. the frequency and granularity of the flushing. On the other hand, I realized that adding IOVA awareness in VFIO is actually easy. Today VFIO already maintains a full list of IOVAs and their associated HVAs in the vfio_dma structure, according to VFIO_MAP and VFIO_UNMAP. As long as we allow the latter two operations to accept another parameter (GPA), the IOVA->GPA mapping can be naturally cached in existing vfio_dma objects. Note that the HVA to GPA mapping is not a 1:1 mapping. One HVA range could be mapped to several GPA ranges. Those objects are always updated according to MAP and UNMAP ioctls to stay up-to-date. Qemu then uniformly retrieves the VFIO dirty bitmap for the entire GPA range in every pre-copy round, regardless of whether vIOMMU is enabled. There is no need for another IOTLB implementation, with the main ask being a v2 MAP/UNMAP interface. Or provide a GPA->HVA mapping as vhost did. But one question: I believe the device can only do dirty page logging through IOVA, so how do you handle the case when an IOVA is removed? Thanks Alex, your thoughts? Thanks Kevin
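Jason's aside that the bitmap "could be shared between device and driver" might look like this (purely a sketch under assumed semantics; no such VFIO interface exists today): kernel and userspace share one mmap'd bitmap, so harvesting dirty pages needs no system call at all.

```python
# Conceptual shared dirty bitmap; a bytearray stands in for a page that
# would really be mmap'd between the driver and userspace.
PAGE_SHIFT = 12
bitmap = bytearray(16)  # covers 128 pages

def kernel_mark_dirty(gpa):
    """Driver side: set the bit for a dirtied guest page."""
    pfn = gpa >> PAGE_SHIFT
    bitmap[pfn // 8] |= 1 << (pfn % 8)

def user_collect_dirty(npages):
    """Userspace side: read and clear bits with plain memory accesses,
    no syscall in the synchronization path."""
    dirty = [pfn for pfn in range(npages)
             if bitmap[pfn // 8] & (1 << (pfn % 8))]
    for pfn in dirty:
        bitmap[pfn // 8] &= ~(1 << (pfn % 8))
    return dirty
```

A real implementation would need atomic set and test-and-clear operations so that bits set concurrently by the device are not lost while userspace harvests.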
Re: [Qemu-devel] vhost, iova, and dirty page tracking
> From: Jason Wang [mailto:jasow...@redhat.com] > Sent: Monday, September 16, 2019 4:33 PM > > > On 2019/9/16 上午9:51, Tian, Kevin wrote: > > Hi, Jason > > > > We had a discussion about dirty page tracking in VFIO, when vIOMMU > > is enabled: > > > > https://lists.nongnu.org/archive/html/qemu-devel/2019- > 09/msg02690.html > > > > It's actually a similar model as vhost - Qemu cannot interpose the fast- > path > > DMAs thus relies on the kernel part to track and report dirty page > information. > > Currently Qemu tracks dirty pages in GFN level, thus demanding a > translation > > from IOVA to GPA. Then the open in our discussion is where this > translation > > should happen. Doing the translation in kernel implies a device iotlb > flavor, > > which is what vhost implements today. It requires potentially large > tracking > > structures in the host kernel, but leveraging the existing log_sync flow in > Qemu. > > On the other hand, Qemu may perform log_sync for every removal of > IOVA > > mapping and then do the translation itself, then avoiding the GPA > awareness > > in the kernel side. It needs some change to current Qemu log-sync flow, > and > > may bring more overhead if IOVA is frequently unmapped. > > > > So we'd like to hear about your opinions, especially about how you came > > down to the current iotlb approach for vhost. > > > We don't consider too much in the point when introducing vhost. And > before IOTLB, vhost has already know GPA through its mem table > (GPA->HVA). So it's nature and easier to track dirty pages at GPA level > then it won't any changes in the existing ABI. This is the same situation as VFIO. > > For VFIO case, the only advantages of using GPA is that the log can then > be shared among all the devices that belongs to the VM. Otherwise > syncing through IOVA is cleaner. I still worry about the potential performance impact with this approach. 
In the current mdev live migration series, there are multiple system calls involved when retrieving the dirty bitmap information for a given memory range. IOVA mappings might be changed frequently. Though one may argue that frequent IOVA changes already have bad performance, it's still not good to introduce further non-negligible overhead in such a situation. On the other hand, I realized that adding IOVA awareness in VFIO is actually easy. Today VFIO already maintains a full list of IOVAs and their associated HVAs in the vfio_dma structure, according to VFIO_MAP and VFIO_UNMAP. As long as we allow the latter two operations to accept another parameter (GPA), the IOVA->GPA mapping can be naturally cached in existing vfio_dma objects. Those objects are always updated according to MAP and UNMAP ioctls to stay up-to-date. Qemu then uniformly retrieves the VFIO dirty bitmap for the entire GPA range in every pre-copy round, regardless of whether vIOMMU is enabled. There is no need for another IOTLB implementation; the main ask is a v2 MAP/UNMAP interface. Alex, your thoughts? Thanks Kevin
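The proposed v2 MAP/UNMAP flow can be sketched as a toy Python model (the class and method names are illustrative; no such v2 interface exists today): MAP carries the GPA alongside the IOVA, the kernel translates device-reported IOVA dirty bits through the cached mapping, and QEMU's per-round log_sync is a single query over the GPA range whether or not vIOMMU is in use.

```python
class DirtyTracker:
    """Toy model of the proposed v2 interface: MAP carries a GPA, so the
    kernel can report dirty pages in GPA terms directly."""
    def __init__(self, page=4096):
        self.page = page
        self.iova_to_gpa = {}    # per-page IOVA -> GPA, filled by v2 MAP
        self.dirty_gpas = set()

    def map_v2(self, iova, gpa, npages):
        for i in range(npages):
            self.iova_to_gpa[iova + i * self.page] = gpa + i * self.page

    def device_write(self, iova_page):
        # The device logs through IOVA; the kernel translates immediately
        # using the mapping cached at MAP time.
        self.dirty_gpas.add(self.iova_to_gpa[iova_page])

    def log_sync(self, gpa_start, gpa_end):
        # QEMU issues one GPA-range query per pre-copy round, vIOMMU or not.
        hits = {g for g in self.dirty_gpas if gpa_start <= g < gpa_end}
        self.dirty_gpas -= hits
        return sorted(hits)
```

The point of the sketch is that the GPA awareness lives entirely in the (extended) map call; log_sync itself is unchanged from the no-vIOMMU case.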
Re: [Qemu-devel] vhost, iova, and dirty page tracking
On 2019/9/16 9:51 AM, Tian, Kevin wrote: Hi, Jason We had a discussion about dirty page tracking in VFIO, when vIOMMU is enabled: https://lists.nongnu.org/archive/html/qemu-devel/2019-09/msg02690.html It's actually a similar model to vhost - Qemu cannot interpose the fast-path DMAs and thus relies on the kernel part to track and report dirty page information. Currently Qemu tracks dirty pages at GFN level, thus demanding a translation from IOVA to GPA. The open question in our discussion is where this translation should happen. Doing the translation in the kernel implies a device-iotlb flavor, which is what vhost implements today. It requires potentially large tracking structures in the host kernel, but leverages the existing log_sync flow in Qemu. On the other hand, Qemu may perform log_sync for every removal of an IOVA mapping and then do the translation itself, avoiding GPA awareness on the kernel side. It needs some change to the current Qemu log-sync flow, and may bring more overhead if IOVA is frequently unmapped. So we'd like to hear your opinions, especially about how you came down to the current iotlb approach for vhost. We didn't consider it too much when introducing vhost. And before IOTLB, vhost already knew the GPA through its mem table (GPA->HVA). So it's natural and easier to track dirty pages at the GPA level; then it won't require any changes to the existing ABI. For the VFIO case, the only advantage of using GPA is that the log can then be shared among all the devices that belong to the VM. Otherwise syncing through IOVA is cleaner. Thanks p.s. Alex's comment is also copied here from the original thread. So vhost must then be configuring a listener across system memory rather than only against the device AddressSpace like we do in vfio, such that it gets log_sync() callbacks for the actual GPA space rather than only the IOVA space.
OTOH, QEMU could understand that the device AddressSpace has a translate function and apply the IOVA dirty bits to the system memory AddressSpace. Wouldn't it make more sense for QEMU to perform a log_sync() prior to removing a MemoryRegionSection within an AddressSpace and update the GPA rather than pushing GPA awareness and potentially large tracking structures into the host kernel? Thanks Kevin
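Alex's alternative can be sketched as a toy Python model (class and attribute names are illustrative, not QEMU's actual API): the kernel logs only at IOVA granularity, and QEMU, which already holds the vIOMMU translate state, performs a log_sync() before tearing down a mapping so the IOVA dirty bits can still be applied to the system memory AddressSpace.

```python
PAGE = 4096

class ViommuQemu:
    """Toy model: userspace translates IOVA dirty bits to GPA itself."""
    def __init__(self):
        self.translate = {}           # IOVA page -> GPA page (vIOMMU state)
        self.kernel_iova_log = set()  # what the kernel would hand back
        self.gpa_dirty = set()        # system-memory AddressSpace dirty bits

    def iommu_map(self, iova, gpa):
        self.translate[iova] = gpa

    def iommu_unmap(self, iova):
        # Sync before the mapping (the MemoryRegionSection, in QEMU terms)
        # is removed, while IOVA->GPA translation is still available.
        self.log_sync()
        del self.translate[iova]

    def log_sync(self):
        for iova in list(self.kernel_iova_log):
            self.gpa_dirty.add(self.translate[iova])
        self.kernel_iova_log.clear()
```

The trade visible in the sketch: the kernel stays GPA-unaware and small, at the cost of an extra log_sync() on every IOVA unmap, which is exactly the overhead concern when IOVA mappings churn frequently.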
[Qemu-devel] vhost, iova, and dirty page tracking
Hi, Jason We had a discussion about dirty page tracking in VFIO, when vIOMMU is enabled: https://lists.nongnu.org/archive/html/qemu-devel/2019-09/msg02690.html It's actually a similar model to vhost - Qemu cannot interpose the fast-path DMAs and thus relies on the kernel part to track and report dirty page information. Currently Qemu tracks dirty pages at GFN level, thus demanding a translation from IOVA to GPA. The open question in our discussion is where this translation should happen. Doing the translation in the kernel implies a device-iotlb flavor, which is what vhost implements today. It requires potentially large tracking structures in the host kernel, but leverages the existing log_sync flow in Qemu. On the other hand, Qemu may perform log_sync for every removal of an IOVA mapping and then do the translation itself, avoiding GPA awareness on the kernel side. It needs some change to the current Qemu log-sync flow, and may bring more overhead if IOVA is frequently unmapped. So we'd like to hear your opinions, especially about how you came down to the current iotlb approach for vhost. p.s. Alex's comment is also copied here from the original thread. > So vhost must then be configuring a listener across system memory rather than only against the device AddressSpace like we do in vfio, such that it gets log_sync() callbacks for the actual GPA space rather than only the IOVA space. OTOH, QEMU could understand that the device AddressSpace has a translate function and apply the IOVA dirty bits to the system memory AddressSpace. Wouldn't it make more sense for QEMU to perform a log_sync() prior to removing a MemoryRegionSection within an AddressSpace and update the GPA rather than pushing GPA awareness and potentially large tracking structures into the host kernel? Thanks Kevin