Re: [Qemu-devel] vhost, iova, and dirty page tracking
On 2019/9/24 10:02 AM, Tian, Kevin wrote:

From: Jason Wang [mailto:jasow...@redhat.com]
Sent: Friday, September 20, 2019 9:19 AM

On 2019/9/20 6:54 AM, Tian, Kevin wrote:

From: Paolo Bonzini [mailto:pbonz...@redhat.com]
Sent: Thursday, September 19, 2019 7:14 PM

On 19/09/19 09:16, Tian, Kevin wrote:

why should GPA1 and GPA2 both be dirty? Even if they have the same HVA due to overlapping virtual address space in two processes, they still correspond to two physical pages. I don't get your meaning :)

The point is not to leave any corner case that is hard to debug or fix in the future.

Let's just start with a single process: the API allows userspace to map one HVA to both GPA1 and GPA2. Since it knows GPA1 and GPA2 are equivalent, it's OK to sync just through GPA1. That means if you only log GPA2, it won't work.

I noted KVM itself doesn't consider such a situation (one HVA mapped to multiple GPAs) when doing its dirty page tracking. If you look at kvm_vcpu_mark_page_dirty, it simply finds the unique memslot which contains the dirty gfn and then sets the dirty bit within that slot. It doesn't attempt to walk all memslots to find any other GPA which may be mapped to the same HVA.

So there must be some disconnect here. Let's hear from Paolo first and understand the rationale behind such a situation.

In general, userspace cannot assume that it's okay to sync just through GPA1. It must sync the host page if *either* GPA1 or GPA2 is marked dirty.

Agree. In this case the kernel only needs to track whether GPA1 or GPA2 is dirtied by guest operations.

Not necessarily guest operations.

The reason why vhost has to set both GPA1 and GPA2 is due to its own design - it maintains IOVA->HVA and GPA->HVA mappings, thus given an IOVA you have to reverse-look-up the GPA->HVA mem table, which gives multiple possible GPAs.

So if userspace needs to track both GPA1 and GPA2, vhost can just stop when it finds one HVA->GPA mapping there.

But in concept, if vhost can maintain an IOVA->GPA mapping, then it is straightforward to set the right GPA every time an IOVA is tracked.

That means the translation is done twice by software, IOVA->GPA and GPA->HVA, for each packet.

Thanks

yes, it's not necessary if we care only about the content of the dirty GPA, as in live migration. In that case, just setting the first GPA in the loop is sufficient, as you pointed out. However there is one corner case which I'm not sure about. What about a usage (e.g. VM introspection) which cares only about the guest access pattern, i.e. which GPA is dirtied, instead of poking its content? Neither setting the first GPA nor setting all the aliasing GPAs can provide accurate info, if no explicit IOVA->GPA mapping is maintained inside vhost. But I cannot tell whether maintaining such accuracy for aliasing GPAs is really necessary. +VM introspection guys in case they have some opinions.

Interesting: for vhost, the vIOMMU can actually pass IOVA->GPA, and vhost can keep it and just do the translation from GPA->HVA in the map command. So it can have both IOVA->GPA and IOVA->HVA mappings.

Thanks

Thanks
Kevin
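The two translation schemes contrasted above can be sketched in a few lines of Python (a toy model with made-up addresses, not vhost code): a reverse lookup through a GPA->HVA mem table may return several aliased GPAs, while an explicit IOVA->GPA table gives exactly one answer at each step.

```python
PAGE = 4096

# GPA->HVA "mem table": two GPA ranges alias the same HVA range.
mem_table = [
    {"gpa": 0x100000, "hva": 0x7f0000000000, "size": 16 * PAGE},  # GPA1
    {"gpa": 0x900000, "hva": 0x7f0000000000, "size": 16 * PAGE},  # GPA2 (alias)
]

def hva_to_gpas(hva):
    """Reverse lookup: every GPA whose range covers this HVA (may be many)."""
    return [e["gpa"] + (hva - e["hva"])
            for e in mem_table
            if e["hva"] <= hva < e["hva"] + e["size"]]

# With an explicit IOVA->GPA mapping (as a vIOMMU could provide), the
# ambiguity disappears: IOVA->GPA, then GPA->HVA, one answer per step.
iova_to_gpa = {0x2000: 0x900000}   # say the guest mapped this IOVA to GPA2

hva = 0x7f0000000000
print([hex(g) for g in hva_to_gpas(hva)])  # ['0x100000', '0x900000']
print(hex(iova_to_gpa[0x2000]))            # 0x900000
```

The cost of the unambiguous scheme, as noted in the message above, is that software translates twice (IOVA->GPA, GPA->HVA) per packet.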
RE: [Qemu-devel] vhost, iova, and dirty page tracking
> From: Jason Wang [mailto:jasow...@redhat.com]
> Sent: Friday, September 20, 2019 9:19 AM
>
> On 2019/9/20 6:54 AM, Tian, Kevin wrote:
> > > From: Paolo Bonzini [mailto:pbonz...@redhat.com]
> > > Sent: Thursday, September 19, 2019 7:14 PM
> > >
> > > On 19/09/19 09:16, Tian, Kevin wrote:
> > > > > > why should GPA1 and GPA2 both be dirty? Even if they have the
> > > > > > same HVA due to overlapping virtual address space in two
> > > > > > processes, they still correspond to two physical pages.
> > > > > > I don't get your meaning :)
> > > > >
> > > > > The point is not to leave any corner case that is hard to debug
> > > > > or fix in the future.
> > > > >
> > > > > Let's just start with a single process: the API allows userspace
> > > > > to map one HVA to both GPA1 and GPA2. Since it knows GPA1 and
> > > > > GPA2 are equivalent, it's OK to sync just through GPA1. That
> > > > > means if you only log GPA2, it won't work.
> > > >
> > > > I noted KVM itself doesn't consider such a situation (one HVA
> > > > mapped to multiple GPAs) when doing its dirty page tracking. If
> > > > you look at kvm_vcpu_mark_page_dirty, it simply finds the unique
> > > > memslot which contains the dirty gfn and then sets the dirty bit
> > > > within that slot. It doesn't attempt to walk all memslots to find
> > > > any other GPA which may be mapped to the same HVA.
> > > >
> > > > So there must be some disconnect here. Let's hear from Paolo first
> > > > and understand the rationale behind such a situation.
> > >
> > > In general, userspace cannot assume that it's okay to sync just
> > > through GPA1. It must sync the host page if *either* GPA1 or GPA2
> > > is marked dirty.
> >
> > Agree. In this case the kernel only needs to track whether GPA1 or
> > GPA2 is dirtied by guest operations.
>
> Not necessarily guest operations.
>
> > The reason why vhost has to set both GPA1 and GPA2 is due to its own
> > design - it maintains IOVA->HVA and GPA->HVA mappings, thus given an
> > IOVA you have to reverse-look-up the GPA->HVA mem table, which gives
> > multiple possible GPAs.
>
> So if userspace needs to track both GPA1 and GPA2, vhost can just stop
> when it finds one HVA->GPA mapping there.
>
> > But in concept, if vhost can maintain an IOVA->GPA mapping, then it
> > is straightforward to set the right GPA every time an IOVA is
> > tracked.
>
> That means the translation is done twice by software, IOVA->GPA and
> GPA->HVA, for each packet.
>
> Thanks

yes, it's not necessary if we care only about the content of the dirty GPA, as in live migration. In that case, just setting the first GPA in the loop is sufficient, as you pointed out. However there is one corner case which I'm not sure about. What about a usage (e.g. VM introspection) which cares only about the guest access pattern, i.e. which GPA is dirtied, instead of poking its content? Neither setting the first GPA nor setting all the aliasing GPAs can provide accurate info, if no explicit IOVA->GPA mapping is maintained inside vhost. But I cannot tell whether maintaining such accuracy for aliasing GPAs is really necessary. +VM introspection guys in case they have some opinions.

Thanks
Kevin
Re: [Qemu-devel] vhost, iova, and dirty page tracking
On Fri, Sep 20, 2019 at 09:15:40AM +0800, Jason Wang wrote:

On 2019/9/19 10:06 PM, Michael S. Tsirkin wrote:

On Thu, Sep 19, 2019 at 05:37:48PM +0800, Jason Wang wrote:

On 2019/9/19 3:16 PM, Tian, Kevin wrote:

+Paolo to help clarify here.

From: Jason Wang [mailto:jasow...@redhat.com]
Sent: Thursday, September 19, 2019 2:32 PM

On 2019/9/19 2:17 PM, Yan Zhao wrote:

On Thu, Sep 19, 2019 at 02:09:53PM +0800, Jason Wang wrote:

On 2019/9/19 1:28 PM, Yan Zhao wrote:

On Thu, Sep 19, 2019 at 09:05:12AM +0800, Jason Wang wrote:

On 2019/9/18 4:37 PM, Tian, Kevin wrote:

From: Jason Wang [mailto:jasow...@redhat.com]
Sent: Wednesday, September 18, 2019 2:10 PM

Note that the HVA to GPA mapping is not a 1:1 mapping. One HVA range could be mapped to several GPA ranges.

This is fine. Currently vfio_dma maintains the IOVA->HVA mapping.

btw under what condition is HVA->GPA not a 1:1 mapping? I didn't realize it.

I don't remember the details, e.g. memory region alias? And neither kvm nor the kvm API forbids this, if my memory is correct.

I checked https://qemu.weilnetz.de/doc/devel/memory.html, which provides an example of an aliased layout. However, its aliasing is all 1:1, instead of N:1. From the guest's p.o.v. every writable GPA implies a unique location. Why would we hit the situation where multiple writable GPAs are mapped to the same HVA (i.e. the same physical memory location)?

I don't know, just want to say the current API does not forbid this. So we probably need to take care of it.

yes, at the KVM API level, it does not forbid two slots to have the same HVA (slot->userspace_addr). But
(1) there's only one kvm instance for each vm for each qemu process.
(2) all ramblock->host (corresponding to HVA and slot->userspace_addr) in one qemu process is non-overlapping, as it's obtained from mmap().
(3) qemu ensures two kvm slots will not point to the same section of one ramblock.
So, as long as the kvm instance is not shared by two processes, and there's no bug in qemu, we can assure that HVA to GPA is 1:1.

Well, you leave this API to userspace, so you can't assume qemu is the only user, or any of its behavior. If you had, you should have limited it at the API level instead of leaving the window open for them.

But even if there are two processes operating on the same kvm instance and manipulating memory slots, adding an extra GPA alongside the current IOVA & HVA in the ioctl VFIO_IOMMU_MAP_DMA can still let the driver know the right IOVA->GPA mapping, right?

It looks fragile. Consider an HVA that was mapped to both GPA1 and GPA2. The guest maps an IOVA to GPA2, so we have IOVA/GPA2/HVA in the new ioctl and then log through GPA2. If userspace is trying to sync through GPA1, it will miss the dirty page. So for safety we need to log both GPA1 and GPA2. (See what has been done in log_write_hva() in vhost.c.) The only way to do that is to maintain an independent HVA to GPA mapping, like what KVM or vhost did.

why should GPA1 and GPA2 both be dirty? Even if they have the same HVA due to overlapping virtual address space in two processes, they still correspond to two physical pages. I don't get your meaning :)

The point is not to leave any corner case that is hard to debug or fix in the future.

Let's just start with a single process: the API allows userspace to map one HVA to both GPA1 and GPA2. Since it knows GPA1 and GPA2 are equivalent, it's OK to sync just through GPA1. That means if you only log GPA2, it won't work.
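The "log every aliasing GPA" rule referenced above (what log_write_hva() in vhost.c implements) can be sketched roughly like this. It is a toy model with illustrative names and addresses, not the kernel code: given a written HVA range, walk the GPA->HVA mem table and mark dirty every GPA page that aliases it.

```python
PAGE = 4096

# GPA->HVA memslots; GPA1 and GPA2 alias the same HVA range.
mem_table = [
    {"gpa": 0x100000, "hva": 0x7f0000000000, "size": 16 * PAGE},
    {"gpa": 0x900000, "hva": 0x7f0000000000, "size": 16 * PAGE},
]
dirty_gfns = set()

def log_write_hva(hva, length):
    """Mark dirty every GPA page aliased to [hva, hva+length)."""
    for e in mem_table:
        start = max(hva, e["hva"])
        end = min(hva + length, e["hva"] + e["size"])
        off = start - e["hva"]
        while start < end:
            dirty_gfns.add((e["gpa"] + off) // PAGE)  # set bit for this alias
            start += PAGE
            off += PAGE

log_write_hva(0x7f0000000000, 1)   # one HVA write dirties both aliased GPA pages
print(sorted(hex(g) for g in dirty_gfns))
```

This is the safe-but-redundant behavior discussed in the thread: whichever GPA userspace chooses to sync through, the dirty bit is there.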
Re: [Qemu-devel] vhost, iova, and dirty page tracking
On 2019/9/19 10:06 PM, Michael S. Tsirkin wrote:

On Thu, Sep 19, 2019 at 05:37:48PM +0800, Jason Wang wrote:

On 2019/9/19 3:16 PM, Tian, Kevin wrote:

[...]

I noted KVM itself doesn't consider such a situation (one HVA mapped to multiple GPAs) when doing its dirty page tracking. If you look at kvm_vcpu_mark_page_dirty, it simply finds the unique memslot which contains the dirty gfn and then sets the dirty bit within that slot. It doesn't attempt to walk all memslots to find any other GPA which may be mapped to the same HVA.

So there must be some disconnect here. Let's hear from Paolo first and understand the rationale behind such a situation.

Neither did vhost when the IOTLB is disabled. And cc Michael, who pointed out this issue at the beginning.

Thanks

Thanks
Kevin

Yes, we fixed it with a kind of workaround; at the time I proposed a new interface to fix it fully. I don't think we ever got around to implementing it - right?

Paolo said userspace just needs to sync through all GPAs, so my understanding is that the workaround is OK, albeit redundant, and so is the API you proposed. Anything I miss?

Thanks
Re: [Qemu-devel] vhost, iova, and dirty page tracking
On 2019/9/20 6:54 AM, Tian, Kevin wrote:

From: Paolo Bonzini [mailto:pbonz...@redhat.com]
Sent: Thursday, September 19, 2019 7:14 PM

[...]

In general, userspace cannot assume that it's okay to sync just through GPA1. It must sync the host page if *either* GPA1 or GPA2 is marked dirty.

Agree. In this case the kernel only needs to track whether GPA1 or GPA2 is dirtied by guest operations.

Not necessarily guest operations.

The reason why vhost has to set both GPA1 and GPA2 is due to its own design - it maintains IOVA->HVA and GPA->HVA mappings, thus given an IOVA you have to reverse-look-up the GPA->HVA mem table, which gives multiple possible GPAs.

So if userspace needs to track both GPA1 and GPA2, vhost can just stop when it finds one HVA->GPA mapping there.

But in concept, if vhost can maintain an IOVA->GPA mapping, then it is straightforward to set the right GPA every time an IOVA is tracked.

That means the translation is done twice by software, IOVA->GPA and GPA->HVA, for each packet.

Thanks

The situation really only arises in special cases. For example, 0xfffe0000..0xffffffff and 0xe0000..0xfffff might be the same memory. From "info mtree" before the guest boots:

    00000000fffe0000-00000000ffffffff (prio -1, i/o): pci
      00000000000e0000-00000000000fffff (prio 1, i/o): alias isa-bios @pc.bios 0000000000020000-000000000003ffff
      00000000fffc0000-00000000ffffffff (prio 0, rom): pc.bios

However, non-x86 machines may have other cases of aliased memory, so it's a case that you should cover.

Above example is read-only, thus won't be touched in the logdirty path. But now I agree that a specific architecture may define two writable GPA ranges with one as the alias of the other, as long as such a case is explicitly documented so the guest OS won't treat them as separate memory pages.

Thanks
Kevin
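Paolo's rule for userspace above - sync the host page if *either* alias is marked dirty - amounts to OR-ing the per-GFN dirty bits of an alias group before deciding whether to re-send the backing page. A minimal sketch with an assumed data layout (the group list and bitmap are illustrative, not QEMU's structures):

```python
# GFNs known (by userspace) to share one backing host page.
alias_groups = [[0x100, 0x900]]

# Per-GFN dirty bits as reported by the kernel's log: only GPA2's bit is set.
dirty = {0x100: False, 0x900: True}

def host_page_needs_sync(group):
    """The backing page is dirty if ANY alias in the group is dirty."""
    return any(dirty[gfn] for gfn in group)

print(host_page_needs_sync(alias_groups[0]))  # True: GPA2's bit alone suffices
```

Under this rule, a kernel that logs only one of the aliases is sufficient for content sync (live migration), which is why logging all aliases is safe but redundant.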
RE: [Qemu-devel] vhost, iova, and dirty page tracking
> From: Paolo Bonzini [mailto:pbonz...@redhat.com]
> Sent: Thursday, September 19, 2019 7:14 PM
>
> On 19/09/19 09:16, Tian, Kevin wrote:
> > > > why should GPA1 and GPA2 both be dirty? Even if they have the same
> > > > HVA due to overlapping virtual address space in two processes,
> > > > they still correspond to two physical pages. I don't get your
> > > > meaning :)
> > >
> > > The point is not to leave any corner case that is hard to debug or
> > > fix in the future.
> > >
> > > Let's just start with a single process: the API allows userspace to
> > > map one HVA to both GPA1 and GPA2. Since it knows GPA1 and GPA2 are
> > > equivalent, it's OK to sync just through GPA1. That means if you
> > > only log GPA2, it won't work.
> >
> > I noted KVM itself doesn't consider such a situation (one HVA mapped
> > to multiple GPAs) when doing its dirty page tracking. If you look at
> > kvm_vcpu_mark_page_dirty, it simply finds the unique memslot which
> > contains the dirty gfn and then sets the dirty bit within that slot.
> > It doesn't attempt to walk all memslots to find any other GPA which
> > may be mapped to the same HVA.
> >
> > So there must be some disconnect here. Let's hear from Paolo first
> > and understand the rationale behind such a situation.
>
> In general, userspace cannot assume that it's okay to sync just through
> GPA1. It must sync the host page if *either* GPA1 or GPA2 is marked
> dirty.

Agree. In this case the kernel only needs to track whether GPA1 or GPA2 is dirtied by guest operations.

The reason why vhost has to set both GPA1 and GPA2 is due to its own design - it maintains IOVA->HVA and GPA->HVA mappings, thus given an IOVA you have to reverse-look-up the GPA->HVA mem table, which gives multiple possible GPAs.

But in concept, if vhost can maintain an IOVA->GPA mapping, then it is straightforward to set the right GPA every time an IOVA is tracked.

> The situation really only arises in special cases. For example,
> 0xfffe0000..0xffffffff and 0xe0000..0xfffff might be the same memory.
> From "info mtree" before the guest boots:
>
>     00000000fffe0000-00000000ffffffff (prio -1, i/o): pci
>       00000000000e0000-00000000000fffff (prio 1, i/o): alias isa-bios @pc.bios 0000000000020000-000000000003ffff
>       00000000fffc0000-00000000ffffffff (prio 0, rom): pc.bios
>
> However, non-x86 machines may have other cases of aliased memory, so
> it's a case that you should cover.

Above example is read-only, thus won't be touched in the logdirty path. But now I agree that a specific architecture may define two writable GPA ranges with one as the alias of the other, as long as such a case is explicitly documented so the guest OS won't treat them as separate memory pages.

Thanks
Kevin
RE: [Qemu-devel] vhost, iova, and dirty page tracking
From: Alex Williamson [mailto:alex.william...@redhat.com]
Sent: Friday, September 20, 2019 1:21 AM

On Wed, 18 Sep 2019 07:21:05 +0000, "Tian, Kevin" wrote:

From: Jason Wang [mailto:jasow...@redhat.com]
Sent: Wednesday, September 18, 2019 2:04 PM

On 2019/9/18 9:31 AM, Tian, Kevin wrote:

From: Alex Williamson [mailto:alex.william...@redhat.com]
Sent: Tuesday, September 17, 2019 10:54 PM

On Tue, 17 Sep 2019 08:48:36 +0000, "Tian, Kevin" wrote:

From: Jason Wang [mailto:jasow...@redhat.com]
Sent: Monday, September 16, 2019 4:33 PM

On 2019/9/16 9:51 AM, Tian, Kevin wrote:

Hi, Jason

We had a discussion about dirty page tracking in VFIO, when vIOMMU is enabled:

https://lists.nongnu.org/archive/html/qemu-devel/2019-09/msg02690.html

It's actually a similar model to vhost - Qemu cannot interpose the fast-path DMAs, thus it relies on the kernel part to track and report dirty page information. Currently Qemu tracks dirty pages at GFN level, thus demanding a translation from IOVA to GPA. Then the open question in our discussion is where this translation should happen. Doing the translation in the kernel implies a device-iotlb flavor, which is what vhost implements today. It requires potentially large tracking structures in the host kernel, but leverages the existing log_sync flow in Qemu. On the other hand, Qemu may perform log_sync for every removal of an IOVA mapping and then do the translation itself, thus avoiding GPA awareness on the kernel side. It needs some change to the current Qemu log-sync flow, and may bring more overhead if IOVA is frequently unmapped.

So we'd like to hear your opinions, especially about how you came down to the current iotlb approach for vhost.

We didn't consider this point too much when introducing vhost. And before the IOTLB, vhost already knew GPAs through its mem table (GPA->HVA). So it's natural and easier to track dirty pages at GPA level, and then it won't need any changes in the existing ABI.

This is the same situation as VFIO.

It is? VFIO doesn't know GPAs, it only knows HVA, HPA, and IOVA. In some cases IOVA is GPA, but not all.

Well, I thought vhost has a similar design, that the index of its mem table is GPA when vIOMMU is off and then becomes IOVA when vIOMMU is on. But I may be wrong here. Jason, can you help clarify? I saw two interfaces which poke the mem table: VHOST_SET_MEM_TABLE (for GPA) and VHOST_IOTLB_UPDATE (for IOVA). Are they used exclusively or together?

Actually, vhost maintains two interval trees: the mem table GPA->HVA, and the device IOTLB IOVA->HVA. The device IOTLB is only used when the vIOMMU is enabled, and in that case the mem table is used only when vhost needs to track dirty pages (doing a reverse lookup of the mem table to get HVA->GPA). So in conclusion, for the datapath they are used exclusively, but they need to cowork for logging dirty pages when the device IOTLB is enabled.

OK. Then it's different from the current VFIO design, which maintains only one tree, indexed by either GPA or IOVA exclusively, depending on whether the vIOMMU is in use.

Nit, the VFIO tree is only ever indexed by IOVA. The MAP_DMA ioctl is only ever performed with an IOVA. Userspace decides how that IOVA maps to GPA; VFIO only needs to know how the IOVA maps to HPA via the HVA. Thanks,

I was only referring to its actual meaning from a usage p.o.v., not the parameter name (which is always called iova) in vfio.

Thanks
Kevin
Re: [Qemu-devel] vhost, iova, and dirty page tracking
On Wed, 18 Sep 2019 07:21:05 +0000, "Tian, Kevin" wrote:

[...]

> OK. Then it's different from the current VFIO design, which maintains
> only one tree, indexed by either GPA or IOVA exclusively, depending on
> whether the vIOMMU is in use.

Nit, the VFIO tree is only ever indexed by IOVA. The MAP_DMA ioctl is only ever performed with an IOVA. Userspace decides how that IOVA maps to GPA; VFIO only needs to know how the IOVA maps to HPA via the HVA. Thanks,

Alex
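Alex's nit - that the VFIO type1 mapping list is keyed only by IOVA, with GPA never appearing - can be sketched as follows. The field names and addresses are illustrative, not the kernel's struct vfio_dma:

```python
# Each entry records only IOVA -> (process virtual address, size);
# the pinned pages behind the vaddr give IOVA -> HPA. No GPA anywhere.
vfio_dmas = [
    {"iova": 0x10000, "vaddr": 0x7f0000000000, "size": 0x8000},
    {"iova": 0x40000, "vaddr": 0x7f0000100000, "size": 0x4000},
]

def vfio_find_dma(iova):
    """Look up the mapping covering `iova`; the tree is indexed by IOVA only."""
    for d in vfio_dmas:
        if d["iova"] <= iova < d["iova"] + d["size"]:
            return d
    return None

d = vfio_find_dma(0x12000)
hva = d["vaddr"] + (0x12000 - d["iova"])  # IOVA -> HVA; HVA -> HPA via pinning
print(hex(hva))                           # 0x7f0000002000
```

Whether `iova` happens to equal a GPA is purely a userspace convention, which is why kernel-side dirty tracking at GFN granularity needs extra information.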
Re: [Qemu-devel] vhost, iova, and dirty page tracking
On Thu, Sep 19, 2019 at 05:37:48PM +0800, Jason Wang wrote: > > On 2019/9/19 下午3:16, Tian, Kevin wrote: > > +Paolo to help clarify here. > > > > > From: Jason Wang [mailto:jasow...@redhat.com] > > > Sent: Thursday, September 19, 2019 2:32 PM > > > > > > > > > On 2019/9/19 下午2:17, Yan Zhao wrote: > > > > On Thu, Sep 19, 2019 at 02:09:53PM +0800, Jason Wang wrote: > > > > > On 2019/9/19 下午1:28, Yan Zhao wrote: > > > > > > On Thu, Sep 19, 2019 at 09:05:12AM +0800, Jason Wang wrote: > > > > > > > On 2019/9/18 下午4:37, Tian, Kevin wrote: > > > > > > > > > From: Jason Wang [mailto:jasow...@redhat.com] > > > > > > > > > Sent: Wednesday, September 18, 2019 2:10 PM > > > > > > > > > > > > > > > > > > > > Note that the HVA to GPA mapping is not an 1:1 mapping. > > > > > > > > > > > One > > > HVA > > > > > > > > > range > > > > > > > > > > > could be mapped to several GPA ranges. > > > > > > > > > > This is fine. Currently vfio_dma maintains IOVA->HVA > > > > > > > > > > mapping. > > > > > > > > > > > > > > > > > > > > btw under what condition HVA->GPA is not 1:1 mapping? I > > > > > > > > > > didn't > > > realize it. > > > > > > > > > I don't remember the details e.g memory region alias? And > > > > > > > > > neither > > > kvm > > > > > > > > > nor kvm API does forbid this if my memory is correct. > > > > > > > > > > > > > > > > > I checked https://qemu.weilnetz.de/doc/devel/memory.html, which > > > > > > > > provides an example of aliased layout. However, its aliasing is > > > > > > > > all > > > > > > > > 1:1, instead of N:1. From guest p.o.v every writable GPA > > > > > > > > implies an > > > > > > > > unique location. Why would we hit the situation where multiple > > > > > > > > write-able GPAs are mapped to the same HVA (i.e. same physical > > > > > > > > memory location)? > > > > > > > I don't know, just want to say current API does not forbid this. > > > > > > > So we > > > > > > > probably need to take care it. 
> > > > > > > > > > > > > yes, in KVM API level, it does not forbid two slots to have the same > > > HVA(slot->userspace_addr). > > > > > > But > > > > > > (1) there's only one kvm instance for each vm for each qemu process. > > > > > > (2) all ramblock->host (corresponds to HVA and slot->userspace_addr) > > > in one qemu > > > > > > process is non-overlapping as it's obtained from mmmap(). > > > > > > (3) qemu ensures two kvm slots will not point to the same section of > > > one ramblock. > > > > > > So, as long as kvm instance is not shared in two processes, and > > > > > > there's no bug in qemu, we can assure that HVA to GPA is 1:1. > > > > > Well, you leave this API for userspace, so you can't assume qemu is > > > > > the > > > > > only user or any its behavior. If you had you should limit it in the > > > > > API > > > > > level instead of open window for them. > > > > > > > > > > > > > > > > But even if there are two processes operating on the same kvm > > > instance > > > > > > and manipulating on memory slots, adding an extra GPA along side > > > current > > > > > > IOVA & HVA to ioctl VFIO_IOMMU_MAP_DMA can still let driver knows > > > the > > > > > > right IOVA->GPA mapping, right? > > > > > It looks fragile. Consider HVA was mapped to both GPA1 and GPA2. > > > Guest > > > > > maps IOVA to GPA2, so we have IOVA GPA2 HVA in the new ioctl and > > > then > > > > > log through GPA2. If userspace is trying to sync through GPA1, it will > > > > > miss the dirty page. So for safety we need log both GPA1 and GPA2. > > > > > (See > > > > > what has been done in log_write_hva() in vhost.c). The only way to do > > > > > that is to maintain an independent HVA to GPA mapping like what KVM > > > or > > > > > vhost did. > > > > > > > > > why GPA1 and GPA2 should be both dirty? > > > > even they have the same HVA due to overlaping virtual address space in > > > > two processes, they still correspond to two physical pages. 
> > > > don't get what's your meaning :) > > > > > > The point is not leave any corner case that is hard to debug or fix in > > > the future. > > > > > > Let's just start by a single process, the API allows userspace to maps > > > HVA to both GPA1 and GPA2. Since it knows GPA1 and GPA2 are equivalent, > > > it's ok to sync just through GPA1. That means if you only log GPA2, it > > > won't work. > > > > > I noted KVM itself doesn't consider such situation (one HVA is mapped > > to multiple GPAs), when doing its dirty page tracking. If you look at > > kvm_vcpu_mark_page_dirty, it simply finds the unique memslot which > > contains the dirty gfn and then set the dirty bit within that slot. It > > doesn't attempt to walk all memslots to find out any other GPA which > > may be mapped to the same HVA. > > > > So there must be some disconnect here. let's hear from Paolo first and > > understand the rationale behind such situation. > > > Neither did vhost when IOTLB is disabled. And cc Michael who points out this > issue at the beginning. > > Thanks > > > > > > Thanks
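The "log both GPA1 and GPA2" strategy Jason describes above (what log_write_hva() in vhost.c does in spirit, when only HVA->GPA tables are available and the mapping may be N:1) can be sketched as follows. This is an illustrative Python model, not the actual vhost code; the slot layout and addresses are made up:

```python
class MemSlot:
    """A (GPA -> HVA) memory slot, like a KVM memslot or vhost memory region."""
    def __init__(self, gpa, hva, size):
        self.gpa, self.hva, self.size = gpa, hva, size

def dirty_gpas_for_hva(slots, hva):
    """Reverse-look-up an HVA and return *every* GPA that aliases it."""
    gpas = []
    for s in slots:
        if s.hva <= hva < s.hva + s.size:
            gpas.append(s.gpa + (hva - s.hva))
    return gpas

# Hypothetical layout: two GPA ranges backed by the same HVA range.
slots = [
    MemSlot(gpa=0x000e0000, hva=0x7f0000020000, size=0x20000),
    MemSlot(gpa=0xfffe0000, hva=0x7f0000020000, size=0x20000),
]

# A write at this HVA must be logged under both aliasing GPAs, so that
# userspace syncing through either one sees the page as dirty.
dirty = dirty_gpas_for_hva(slots, 0x7f0000020000 + 0x1000)
```

Logging only one of the two results would reproduce exactly the bug discussed above: userspace syncing through the other alias would miss the page.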
Re: [Qemu-devel] vhost, iova, and dirty page tracking
On 2019/9/19 下午7:14, Paolo Bonzini wrote: On 19/09/19 09:16, Tian, Kevin wrote: why should GPA1 and GPA2 both be dirty? even if they have the same HVA due to overlapping virtual address space in two processes, they still correspond to two physical pages. don't get what's your meaning :) The point is not to leave any corner case that is hard to debug or fix in the future. Let's just start with a single process: the API allows userspace to map HVA to both GPA1 and GPA2. Since it knows GPA1 and GPA2 are equivalent, it's ok to sync just through GPA1. That means if you only log GPA2, it won't work. I noted KVM itself doesn't consider such a situation (one HVA is mapped to multiple GPAs) when doing its dirty page tracking. If you look at kvm_vcpu_mark_page_dirty, it simply finds the unique memslot which contains the dirty gfn and then sets the dirty bit within that slot. It doesn't attempt to walk all memslots to find out any other GPA which may be mapped to the same HVA. So there must be some disconnect here. let's hear from Paolo first and understand the rationale behind such a situation. In general, userspace cannot assume that it's okay to sync just through GPA1. It must sync the host page if *either* GPA1 or GPA2 is marked dirty. Maybe we need to document this somewhere. The situation really only arises in special cases. For example, 0xfffe0000..0xffffffff and 0xe0000..0xfffff might be the same memory. From "info mtree" before the guest boots:

    0000000000000000-ffffffffffffffff (prio -1, i/o): pci
      00000000000e0000-00000000000fffff (prio 1, i/o): alias isa-bios @pc.bios 0000000000020000-000000000003ffff
      00000000fffc0000-00000000ffffffff (prio 0, rom): pc.bios

However, non-x86 machines may have other cases of aliased memory, so it's a case that you should cover. Paolo Any other issue that still needs to be covered, considering userspace needs to sync both GPAs? Thanks
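Kevin's observation about kvm_vcpu_mark_page_dirty can be seen in a small contrast sketch: KVM needs no alias walk because its input is already a gfn (a GPA page number), so the memslot lookup is unique. This is a simplified Python model, not the real KVM code:

```python
def kvm_mark_page_dirty(memslots, gfn, dirty):
    """Mark gfn dirty in the single memslot containing it (no alias walk)."""
    # memslots: list of (base_gfn, npages); a gfn falls in at most one slot.
    for base, npages in memslots:
        if base <= gfn < base + npages:
            dirty.add(gfn)
            return True
    return False

# Two slots whose backing HVA happens to be the same:
memslots = [(0x000e0, 0x20), (0xfffe0, 0x20)]
dirty = set()
kvm_mark_page_dirty(memslots, 0xfffe1, dirty)
# Only the gfn actually written is marked; the HVA-aliased gfn 0xe1 is not.
```

The aliasing problem in this thread only arises when the logger's input is an HVA (or an IOVA translated to an HVA) rather than a GPA, which is why vhost ends up having to log every aliasing GPA.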
Re: [Qemu-devel] vhost, iova, and dirty page tracking
On 19/09/19 14:39, Jason Wang wrote: >> In general, userspace cannot assume that it's okay to sync just through >> GPA1. It must sync the host page if *either* GPA1 or GPA2 are marked >> dirty. > > Maybe we need document this somewhere. Well, it's implicit but it should be kind of obvious. The dirty page only tells you that the guest wrote to the GPA, HVAs are never mentioned in the documentation. Paolo > Any other issue that still need to be covered consider userspace need to > sync both GPAs? > > Thanks >
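The implicit rule Paolo states, that the dirty bitmap is indexed by GPA and says nothing about HVAs, means userspace must OR together the dirty bits of every GPA aliasing a host page before deciding whether to sync it. A minimal sketch of that userspace-side check (names and page numbers are made up for illustration):

```python
def host_page_dirty(dirty_bits, aliases):
    """True if *any* GPA page aliasing this host page was logged dirty."""
    return any(dirty_bits.get(gpfn, False) for gpfn in aliases)

# GPA pages 0xe1 and 0xfffe1 are assumed to share one host page; the
# kernel logged only 0xfffe1, yet the host page must still be synced.
dirty_bits = {0xfffe1: True}
must_sync = host_page_dirty(dirty_bits, [0x000e1, 0xfffe1])
```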
Re: [Qemu-devel] vhost, iova, and dirty page tracking
On 2019/9/19 下午6:16, Yan Zhao wrote: On Thu, Sep 19, 2019 at 06:06:52PM +0800, Jason Wang wrote: On 2019/9/19 下午2:29, Yan Zhao wrote: On Thu, Sep 19, 2019 at 02:32:03PM +0800, Jason Wang wrote: On 2019/9/19 下午2:17, Yan Zhao wrote: On Thu, Sep 19, 2019 at 02:09:53PM +0800, Jason Wang wrote: On 2019/9/19 下午1:28, Yan Zhao wrote: On Thu, Sep 19, 2019 at 09:05:12AM +0800, Jason Wang wrote: On 2019/9/18 下午4:37, Tian, Kevin wrote: From: Jason Wang [mailto:jasow...@redhat.com] Sent: Wednesday, September 18, 2019 2:10 PM Note that the HVA to GPA mapping is not a 1:1 mapping. One HVA range could be mapped to several GPA ranges. This is fine. Currently vfio_dma maintains IOVA->HVA mapping. btw under what condition is HVA->GPA not a 1:1 mapping? I didn't realize it. I don't remember the details, e.g. memory region alias? And neither kvm nor the kvm API forbids this, if my memory is correct. I checked https://qemu.weilnetz.de/doc/devel/memory.html, which provides an example of aliased layout. However, its aliasing is all 1:1, instead of N:1. From guest p.o.v every writable GPA implies a unique location. Why would we hit the situation where multiple write-able GPAs are mapped to the same HVA (i.e. same physical memory location)? I don't know, just want to say the current API does not forbid this. So we probably need to take care of it. yes, at the KVM API level, it does not forbid two slots to have the same HVA (slot->userspace_addr). But (1) there's only one kvm instance for each vm for each qemu process. (2) all ramblock->host (corresponds to HVA and slot->userspace_addr) in one qemu process is non-overlapping as it's obtained from mmap(). (3) qemu ensures two kvm slots will not point to the same section of one ramblock. So, as long as the kvm instance is not shared between two processes, and there's no bug in qemu, we can assure that HVA to GPA is 1:1. Well, you leave this API for userspace, so you can't assume qemu is the only user or any of its behavior. 
If you had you should limit it in the API level instead of open window for them. But even if there are two processes operating on the same kvm instance and manipulating on memory slots, adding an extra GPA along side current IOVA & HVA to ioctl VFIO_IOMMU_MAP_DMA can still let driver knows the right IOVA->GPA mapping, right? It looks fragile. Consider HVA was mapped to both GPA1 and GPA2. Guest maps IOVA to GPA2, so we have IOVA GPA2 HVA in the new ioctl and then log through GPA2. If userspace is trying to sync through GPA1, it will miss the dirty page. So for safety we need log both GPA1 and GPA2. (See what has been done in log_write_hva() in vhost.c). The only way to do that is to maintain an independent HVA to GPA mapping like what KVM or vhost did. why GPA1 and GPA2 should be both dirty? even they have the same HVA due to overlaping virtual address space in two processes, they still correspond to two physical pages. don't get what's your meaning:) The point is not leave any corner case that is hard to debug or fix in the future. Let's just start by a single process, the API allows userspace to maps HVA to both GPA1 and GPA2. Since it knows GPA1 and GPA2 are equivalent, it's ok to sync just through GPA1. That means if you only log GPA2, it won't work. In that case, cannot log dirty according to HPA. because kvm cannot tell whether it's an valid case (the two GPAs are equivalent) or an invalid case (the two GPAs are not equivalent, but with the same HVA value). Right? There no need any examination on whether it was 'valid' or not. It's as simple as logging both GPA1 and GPA2. Then you won't need to care any corner case. But, if GPA1 and GPA2 point to the same HVA, it means they point to the same page. Then if you only log GPA2, and send GPA2 to target, it should still works, unless in the target side GPA1 and GPA2 do not point to the same HVA? The problem is whether userspace can just sync GPA1 instead of both GPA1 and GPA2. 
If userspace could sync through GPA1 only, the dirty pages would be lost. Paolo has pointed out that userspace cannot make that assumption. Under what condition have you met this in reality? Please kindly point it out :) It's not about reality, it's about possibility. Again, we don't want to leave any corner case that is hard to debug or fix in the future. Thanks Thanks Thanks Yan
Re: [Qemu-devel] vhost, iova, and dirty page tracking
On 19/09/19 09:16, Tian, Kevin wrote: >>> why GPA1 and GPA2 should be both dirty? >>> even they have the same HVA due to overlaping virtual address space in >>> two processes, they still correspond to two physical pages. >>> don't get what's your meaning :) >> >> The point is not leave any corner case that is hard to debug or fix in >> the future. >> >> Let's just start by a single process, the API allows userspace to maps >> HVA to both GPA1 and GPA2. Since it knows GPA1 and GPA2 are equivalent, >> it's ok to sync just through GPA1. That means if you only log GPA2, it >> won't work. > > I noted KVM itself doesn't consider such situation (one HVA is mapped > to multiple GPAs), when doing its dirty page tracking. If you look at > kvm_vcpu_mark_page_dirty, it simply finds the unique memslot which > contains the dirty gfn and then set the dirty bit within that slot. It > doesn't attempt to walk all memslots to find out any other GPA which > may be mapped to the same HVA. > > So there must be some disconnect here. let's hear from Paolo first and > understand the rationale behind such situation. In general, userspace cannot assume that it's okay to sync just through GPA1. It must sync the host page if *either* GPA1 or GPA2 is marked dirty. The situation really only arises in special cases. For example, 0xfffe0000..0xffffffff and 0xe0000..0xfffff might be the same memory. From "info mtree" before the guest boots:

    0000000000000000-ffffffffffffffff (prio -1, i/o): pci
      00000000000e0000-00000000000fffff (prio 1, i/o): alias isa-bios @pc.bios 0000000000020000-000000000003ffff
      00000000fffc0000-00000000ffffffff (prio 0, rom): pc.bios

However, non-x86 machines may have other cases of aliased memory, so it's a case that you should cover. Paolo
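The BIOS aliasing Paolo describes can be checked numerically. The address constants below are assumed from the standard PC layout (pc.bios is 256 KiB at 0xfffc0000, and isa-bios aliases its top 128 KiB at 0xe0000); the helper is purely illustrative:

```python
# Assumed standard PC layout; not taken from any real QEMU API.
PC_BIOS_BASE, PC_BIOS_SIZE = 0xfffc0000, 0x40000
ISA_BIOS_BASE, ALIAS_OFF = 0x000e0000, 0x20000   # alias @pc.bios +0x20000

def bios_offset(gpa):
    """Translate a GPA into an offset within the pc.bios backing RAM."""
    if PC_BIOS_BASE <= gpa < PC_BIOS_BASE + PC_BIOS_SIZE:
        return gpa - PC_BIOS_BASE
    if ISA_BIOS_BASE <= gpa < ISA_BIOS_BASE + (PC_BIOS_SIZE - ALIAS_OFF):
        return ALIAS_OFF + (gpa - ISA_BIOS_BASE)
    return None  # GPA not backed by pc.bios

# 0xfffe0000 and 0xe0000 resolve to the same byte of pc.bios, so a
# dirty bit on either GPA refers to the same host memory.
same = bios_offset(0xfffe0000) == bios_offset(0x000e0000)
```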
Re: [Qemu-devel] vhost, iova, and dirty page tracking
On Thu, Sep 19, 2019 at 06:06:52PM +0800, Jason Wang wrote: > > On 2019/9/19 下午2:29, Yan Zhao wrote: > > On Thu, Sep 19, 2019 at 02:32:03PM +0800, Jason Wang wrote: > >> On 2019/9/19 下午2:17, Yan Zhao wrote: > >>> On Thu, Sep 19, 2019 at 02:09:53PM +0800, Jason Wang wrote: > On 2019/9/19 下午1:28, Yan Zhao wrote: > > On Thu, Sep 19, 2019 at 09:05:12AM +0800, Jason Wang wrote: > >> On 2019/9/18 下午4:37, Tian, Kevin wrote: > From: Jason Wang [mailto:jasow...@redhat.com] > Sent: Wednesday, September 18, 2019 2:10 PM > > >> Note that the HVA to GPA mapping is not an 1:1 mapping. One HVA > range > >> could be mapped to several GPA ranges. > > This is fine. Currently vfio_dma maintains IOVA->HVA mapping. > > > > btw under what condition HVA->GPA is not 1:1 mapping? I didn't > > realize it. > I don't remember the details e.g memory region alias? And neither kvm > nor kvm API does forbid this if my memory is correct. > > >>> I checkedhttps://qemu.weilnetz.de/doc/devel/memory.html, which > >>> provides an example of aliased layout. However, its aliasing is all > >>> 1:1, instead of N:1. From guest p.o.v every writable GPA implies an > >>> unique location. Why would we hit the situation where multiple > >>> write-able GPAs are mapped to the same HVA (i.e. same physical > >>> memory location)? > >> I don't know, just want to say current API does not forbid this. So we > >> probably need to take care it. > >> > > yes, in KVM API level, it does not forbid two slots to have the same > > HVA(slot->userspace_addr). > > But > > (1) there's only one kvm instance for each vm for each qemu process. > > (2) all ramblock->host (corresponds to HVA and slot->userspace_addr) in > > one qemu > > process is non-overlapping as it's obtained from mmmap(). > > (3) qemu ensures two kvm slots will not point to the same section of > > one ramblock. > > > > So, as long as kvm instance is not shared in two processes, and > > there's no bug in qemu, we can assure that HVA to GPA is 1:1. 
> Well, you leave this API for userspace, so you can't assume qemu is the > only user or any its behavior. If you had you should limit it in the API > level instead of open window for them. > > > > But even if there are two processes operating on the same kvm instance > > and manipulating on memory slots, adding an extra GPA along side current > > IOVA & HVA to ioctl VFIO_IOMMU_MAP_DMA can still let driver knows the > > right IOVA->GPA mapping, right? > It looks fragile. Consider HVA was mapped to both GPA1 and GPA2. Guest > maps IOVA to GPA2, so we have IOVA GPA2 HVA in the new ioctl and then > log through GPA2. If userspace is trying to sync through GPA1, it will > miss the dirty page. So for safety we need log both GPA1 and GPA2. (See > what has been done in log_write_hva() in vhost.c). The only way to do > that is to maintain an independent HVA to GPA mapping like what KVM or > vhost did. > > >>> why GPA1 and GPA2 should be both dirty? > >>> even they have the same HVA due to overlaping virtual address space in > >>> two processes, they still correspond to two physical pages. > >>> don't get what's your meaning:) > >> The point is not leave any corner case that is hard to debug or fix in > >> the future. > >> > >> Let's just start by a single process, the API allows userspace to maps > >> HVA to both GPA1 and GPA2. Since it knows GPA1 and GPA2 are equivalent, > >> it's ok to sync just through GPA1. That means if you only log GPA2, it > >> won't work. > >> > > In that case, cannot log dirty according to HPA. > > because kvm cannot tell whether it's an valid case (the two GPAs are > > equivalent) > > or an invalid case (the two GPAs are not equivalent, but with the same > > HVA value). > > > > Right? > > > There no need any examination on whether it was 'valid' or not. It's as > simple as logging both GPA1 and GPA2. Then you won't need to care any > corner case. > But, if GPA1 and GPA2 point to the same HVA, it means they point to the same page. 
Then if you only log GPA2, and send GPA2 to the target, it should still work, unless on the target side GPA1 and GPA2 do not point to the same HVA? Under what condition have you met this in reality? Please kindly point it out :) > Thanks > > > > > Thanks > > Yan > > > >
Re: [Qemu-devel] vhost, iova, and dirty page tracking
On 2019/9/19 下午5:36, Yan Zhao wrote: On Thu, Sep 19, 2019 at 05:35:05PM +0800, Jason Wang wrote: On 2019/9/19 下午2:32, Yan Zhao wrote: On Thu, Sep 19, 2019 at 02:29:54PM +0800, Yan Zhao wrote: On Thu, Sep 19, 2019 at 02:32:03PM +0800, Jason Wang wrote: On 2019/9/19 下午2:17, Yan Zhao wrote: On Thu, Sep 19, 2019 at 02:09:53PM +0800, Jason Wang wrote: On 2019/9/19 下午1:28, Yan Zhao wrote: On Thu, Sep 19, 2019 at 09:05:12AM +0800, Jason Wang wrote: On 2019/9/18 下午4:37, Tian, Kevin wrote: From: Jason Wang [mailto:jasow...@redhat.com] Sent: Wednesday, September 18, 2019 2:10 PM Note that the HVA to GPA mapping is not an 1:1 mapping. One HVA range could be mapped to several GPA ranges. This is fine. Currently vfio_dma maintains IOVA->HVA mapping. btw under what condition HVA->GPA is not 1:1 mapping? I didn't realize it. I don't remember the details e.g memory region alias? And neither kvm nor kvm API does forbid this if my memory is correct. I checked https://qemu.weilnetz.de/doc/devel/memory.html, which provides an example of aliased layout. However, its aliasing is all 1:1, instead of N:1. From guest p.o.v every writable GPA implies an unique location. Why would we hit the situation where multiple write-able GPAs are mapped to the same HVA (i.e. same physical memory location)? I don't know, just want to say current API does not forbid this. So we probably need to take care it. yes, in KVM API level, it does not forbid two slots to have the same HVA(slot->userspace_addr). But (1) there's only one kvm instance for each vm for each qemu process. (2) all ramblock->host (corresponds to HVA and slot->userspace_addr) in one qemu process is non-overlapping as it's obtained from mmmap(). (3) qemu ensures two kvm slots will not point to the same section of one ramblock. So, as long as kvm instance is not shared in two processes, and there's no bug in qemu, we can assure that HVA to GPA is 1:1. 
Well, you leave this API for userspace, so you can't assume qemu is the only user or any its behavior. If you had you should limit it in the API level instead of open window for them. But even if there are two processes operating on the same kvm instance and manipulating on memory slots, adding an extra GPA along side current IOVA & HVA to ioctl VFIO_IOMMU_MAP_DMA can still let driver knows the right IOVA->GPA mapping, right? It looks fragile. Consider HVA was mapped to both GPA1 and GPA2. Guest maps IOVA to GPA2, so we have IOVA GPA2 HVA in the new ioctl and then log through GPA2. If userspace is trying to sync through GPA1, it will miss the dirty page. So for safety we need log both GPA1 and GPA2. (See what has been done in log_write_hva() in vhost.c). The only way to do that is to maintain an independent HVA to GPA mapping like what KVM or vhost did. why GPA1 and GPA2 should be both dirty? even they have the same HVA due to overlaping virtual address space in two processes, they still correspond to two physical pages. don't get what's your meaning :) The point is not leave any corner case that is hard to debug or fix in the future. Let's just start by a single process, the API allows userspace to maps HVA to both GPA1 and GPA2. Since it knows GPA1 and GPA2 are equivalent, it's ok to sync just through GPA1. That means if you only log GPA2, it won't work. In that case, cannot log dirty according to HPA. sorry, it should be "cannot log dirty according to HVA". I think we are discussing the choice between GPA and IOVA, not HVA? Right. so why do we need to care about HVA to GPA mapping? as long as IOVA to GPA is 1:1, then it's fine. The problem is (whether) userspace can try to sync from GPA2 whose HVA is the same as GPA1. Maintainers are copied by Kevin, hope it can help to clarify things. 
Thanks Thanks Yan Thanks because kvm cannot tell whether it's an valid case (the two GPAs are equivalent) or an invalid case (the two GPAs are not equivalent, but with the same HVA value). Right? Thanks Yan Thanks Thanks Yan Thanks Thanks Yan Is Qemu doing its own same-content memory merging in GPA level, similar to KSM? AFAIK, it doesn't. Thanks Thanks Kevin
Re: [Qemu-devel] vhost, iova, and dirty page tracking
On 2019/9/19 下午2:29, Yan Zhao wrote: On Thu, Sep 19, 2019 at 02:32:03PM +0800, Jason Wang wrote: On 2019/9/19 下午2:17, Yan Zhao wrote: On Thu, Sep 19, 2019 at 02:09:53PM +0800, Jason Wang wrote: On 2019/9/19 下午1:28, Yan Zhao wrote: On Thu, Sep 19, 2019 at 09:05:12AM +0800, Jason Wang wrote: On 2019/9/18 下午4:37, Tian, Kevin wrote: From: Jason Wang [mailto:jasow...@redhat.com] Sent: Wednesday, September 18, 2019 2:10 PM Note that the HVA to GPA mapping is not a 1:1 mapping. One HVA range could be mapped to several GPA ranges. This is fine. Currently vfio_dma maintains IOVA->HVA mapping. btw under what condition is HVA->GPA not a 1:1 mapping? I didn't realize it. I don't remember the details, e.g. memory region alias? And neither kvm nor the kvm API forbids this, if my memory is correct. I checked https://qemu.weilnetz.de/doc/devel/memory.html, which provides an example of aliased layout. However, its aliasing is all 1:1, instead of N:1. From guest p.o.v every writable GPA implies a unique location. Why would we hit the situation where multiple write-able GPAs are mapped to the same HVA (i.e. same physical memory location)? I don't know, just want to say the current API does not forbid this. So we probably need to take care of it. yes, at the KVM API level, it does not forbid two slots to have the same HVA (slot->userspace_addr). But (1) there's only one kvm instance for each vm for each qemu process. (2) all ramblock->host (corresponds to HVA and slot->userspace_addr) in one qemu process is non-overlapping as it's obtained from mmap(). (3) qemu ensures two kvm slots will not point to the same section of one ramblock. So, as long as the kvm instance is not shared between two processes, and there's no bug in qemu, we can assure that HVA to GPA is 1:1. Well, you leave this API for userspace, so you can't assume qemu is the only user or any of its behavior. If you had, you should limit it at the API level instead of opening a window for them. 
But even if there are two processes operating on the same kvm instance and manipulating on memory slots, adding an extra GPA along side current IOVA & HVA to ioctl VFIO_IOMMU_MAP_DMA can still let driver knows the right IOVA->GPA mapping, right? It looks fragile. Consider HVA was mapped to both GPA1 and GPA2. Guest maps IOVA to GPA2, so we have IOVA GPA2 HVA in the new ioctl and then log through GPA2. If userspace is trying to sync through GPA1, it will miss the dirty page. So for safety we need log both GPA1 and GPA2. (See what has been done in log_write_hva() in vhost.c). The only way to do that is to maintain an independent HVA to GPA mapping like what KVM or vhost did. why GPA1 and GPA2 should be both dirty? even they have the same HVA due to overlaping virtual address space in two processes, they still correspond to two physical pages. don't get what's your meaning:) The point is not leave any corner case that is hard to debug or fix in the future. Let's just start by a single process, the API allows userspace to maps HVA to both GPA1 and GPA2. Since it knows GPA1 and GPA2 are equivalent, it's ok to sync just through GPA1. That means if you only log GPA2, it won't work. In that case, cannot log dirty according to HPA. because kvm cannot tell whether it's an valid case (the two GPAs are equivalent) or an invalid case (the two GPAs are not equivalent, but with the same HVA value). Right? There no need any examination on whether it was 'valid' or not. It's as simple as logging both GPA1 and GPA2. Then you won't need to care any corner case. Thanks Thanks Yan
Re: [Qemu-devel] vhost, iova, and dirty page tracking
On Thu, Sep 19, 2019 at 05:35:05PM +0800, Jason Wang wrote: > > On 2019/9/19 下午2:32, Yan Zhao wrote: > > On Thu, Sep 19, 2019 at 02:29:54PM +0800, Yan Zhao wrote: > >> On Thu, Sep 19, 2019 at 02:32:03PM +0800, Jason Wang wrote: > >>> On 2019/9/19 下午2:17, Yan Zhao wrote: > On Thu, Sep 19, 2019 at 02:09:53PM +0800, Jason Wang wrote: > > On 2019/9/19 下午1:28, Yan Zhao wrote: > >> On Thu, Sep 19, 2019 at 09:05:12AM +0800, Jason Wang wrote: > >>> On 2019/9/18 下午4:37, Tian, Kevin wrote: > > From: Jason Wang [mailto:jasow...@redhat.com] > > Sent: Wednesday, September 18, 2019 2:10 PM > > > >>> Note that the HVA to GPA mapping is not an 1:1 mapping. One HVA > > range > >>> could be mapped to several GPA ranges. > >> This is fine. Currently vfio_dma maintains IOVA->HVA mapping. > >> > >> btw under what condition HVA->GPA is not 1:1 mapping? I didn't > >> realize it. > > I don't remember the details e.g memory region alias? And neither > > kvm > > nor kvm API does forbid this if my memory is correct. > > > I checked https://qemu.weilnetz.de/doc/devel/memory.html, which > provides an example of aliased layout. However, its aliasing is all > 1:1, instead of N:1. From guest p.o.v every writable GPA implies an > unique location. Why would we hit the situation where multiple > write-able GPAs are mapped to the same HVA (i.e. same physical > memory location)? > >>> I don't know, just want to say current API does not forbid this. So we > >>> probably need to take care it. > >>> > >> yes, in KVM API level, it does not forbid two slots to have the same > >> HVA(slot->userspace_addr). > >> But > >> (1) there's only one kvm instance for each vm for each qemu process. > >> (2) all ramblock->host (corresponds to HVA and slot->userspace_addr) > >> in one qemu > >> process is non-overlapping as it's obtained from mmmap(). > >> (3) qemu ensures two kvm slots will not point to the same section of > >> one ramblock. 
> >> > >> So, as long as kvm instance is not shared in two processes, and > >> there's no bug in qemu, we can assure that HVA to GPA is 1:1. > > Well, you leave this API for userspace, so you can't assume qemu is the > > only user or any its behavior. If you had you should limit it in the API > > level instead of open window for them. > > > > > >> But even if there are two processes operating on the same kvm instance > >> and manipulating on memory slots, adding an extra GPA along side > >> current > >> IOVA & HVA to ioctl VFIO_IOMMU_MAP_DMA can still let driver knows the > >> right IOVA->GPA mapping, right? > > It looks fragile. Consider HVA was mapped to both GPA1 and GPA2. Guest > > maps IOVA to GPA2, so we have IOVA GPA2 HVA in the new ioctl and then > > log through GPA2. If userspace is trying to sync through GPA1, it will > > miss the dirty page. So for safety we need log both GPA1 and GPA2. (See > > what has been done in log_write_hva() in vhost.c). The only way to do > > that is to maintain an independent HVA to GPA mapping like what KVM or > > vhost did. > > > why GPA1 and GPA2 should be both dirty? > even they have the same HVA due to overlaping virtual address space in > two processes, they still correspond to two physical pages. > don't get what's your meaning :) > >>> > >>> The point is not leave any corner case that is hard to debug or fix in > >>> the future. > >>> > >>> Let's just start by a single process, the API allows userspace to maps > >>> HVA to both GPA1 and GPA2. Since it knows GPA1 and GPA2 are equivalent, > >>> it's ok to sync just through GPA1. That means if you only log GPA2, it > >>> won't work. > >>> > >> In that case, cannot log dirty according to HPA. > > sorry, it should be "cannot log dirty according to HVA". > > > I think we are discussing the choice between GPA and IOVA, not HVA? > Right. so why do we need to care about HVA to GPA mapping? as long as IOVA to GPA is 1:1, then it's fine. 
Thanks Yan > Thanks > > > > > >> because kvm cannot tell whether it's an valid case (the two GPAs are > >> equivalent) > >> or an invalid case (the two GPAs are not equivalent, but with the same > >> HVA value). > >> > >> Right? > >> > >> Thanks > >> Yan > >> > >> > >>> Thanks > >>> > >>> > Thanks > Yan > > > > Thanks > > > > > >> Thanks > >> Yan > >> > Is Qemu doing its own same-content memory > merging in GPA level, similar to KSM? > >>> AFAIK, it doesn't. > >>> > >>> Thanks > >>> > >>> > Thanks > Kevin
Re: [Qemu-devel] vhost, iova, and dirty page tracking
On 2019/9/19 下午3:16, Tian, Kevin wrote: +Paolo to help clarify here. From: Jason Wang [mailto:jasow...@redhat.com] Sent: Thursday, September 19, 2019 2:32 PM On 2019/9/19 下午2:17, Yan Zhao wrote: On Thu, Sep 19, 2019 at 02:09:53PM +0800, Jason Wang wrote: On 2019/9/19 下午1:28, Yan Zhao wrote: On Thu, Sep 19, 2019 at 09:05:12AM +0800, Jason Wang wrote: On 2019/9/18 下午4:37, Tian, Kevin wrote: From: Jason Wang [mailto:jasow...@redhat.com] Sent: Wednesday, September 18, 2019 2:10 PM Note that the HVA to GPA mapping is not an 1:1 mapping. One HVA range could be mapped to several GPA ranges. This is fine. Currently vfio_dma maintains IOVA->HVA mapping. btw under what condition HVA->GPA is not 1:1 mapping? I didn't realize it. I don't remember the details e.g memory region alias? And neither kvm nor kvm API does forbid this if my memory is correct. I checked https://qemu.weilnetz.de/doc/devel/memory.html, which provides an example of aliased layout. However, its aliasing is all 1:1, instead of N:1. From guest p.o.v every writable GPA implies an unique location. Why would we hit the situation where multiple write-able GPAs are mapped to the same HVA (i.e. same physical memory location)? I don't know, just want to say current API does not forbid this. So we probably need to take care it. yes, in KVM API level, it does not forbid two slots to have the same HVA(slot->userspace_addr). But (1) there's only one kvm instance for each vm for each qemu process. (2) all ramblock->host (corresponds to HVA and slot->userspace_addr) in one qemu process is non-overlapping as it's obtained from mmmap(). (3) qemu ensures two kvm slots will not point to the same section of one ramblock. So, as long as kvm instance is not shared in two processes, and there's no bug in qemu, we can assure that HVA to GPA is 1:1. Well, you leave this API for userspace, so you can't assume qemu is the only user or any its behavior. If you had you should limit it in the API level instead of open window for them. 
But even if there are two processes operating on the same kvm instance and manipulating on memory slots, adding an extra GPA along side current IOVA & HVA to ioctl VFIO_IOMMU_MAP_DMA can still let driver knows the right IOVA->GPA mapping, right? It looks fragile. Consider HVA was mapped to both GPA1 and GPA2. Guest maps IOVA to GPA2, so we have IOVA GPA2 HVA in the new ioctl and then log through GPA2. If userspace is trying to sync through GPA1, it will miss the dirty page. So for safety we need log both GPA1 and GPA2. (See what has been done in log_write_hva() in vhost.c). The only way to do that is to maintain an independent HVA to GPA mapping like what KVM or vhost did. why GPA1 and GPA2 should be both dirty? even they have the same HVA due to overlaping virtual address space in two processes, they still correspond to two physical pages. don't get what's your meaning :) The point is not leave any corner case that is hard to debug or fix in the future. Let's just start by a single process, the API allows userspace to maps HVA to both GPA1 and GPA2. Since it knows GPA1 and GPA2 are equivalent, it's ok to sync just through GPA1. That means if you only log GPA2, it won't work. I noted KVM itself doesn't consider such situation (one HVA is mapped to multiple GPAs), when doing its dirty page tracking. If you look at kvm_vcpu_mark_page_dirty, it simply finds the unique memslot which contains the dirty gfn and then set the dirty bit within that slot. It doesn't attempt to walk all memslots to find out any other GPA which may be mapped to the same HVA. So there must be some disconnect here. let's hear from Paolo first and understand the rationale behind such situation. Neither did vhost when IOTLB is disabled. And cc Michael who points out this issue at the beginning. Thanks Thanks Kevin
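Yan's proposal above, passing GPA alongside IOVA and HVA so the driver holds an explicit IOVA->GPA mapping, can be sketched as follows. `map_dma` here mimics a hypothetical extension of VFIO_IOMMU_MAP_DMA carrying a GPA argument; it is not a real ABI, and all names are made up:

```python
class VfioDmaMap:
    """Hypothetical per-container DMA table keyed by IOVA."""
    def __init__(self):
        self.by_iova = {}

    def map_dma(self, iova, hva, gpa):
        # Imagined VFIO_IOMMU_MAP_DMA extended with a gpa field.
        self.by_iova[iova] = (hva, gpa)

    def log_write(self, iova, dirty_log):
        # With an explicit IOVA->GPA entry there is exactly one GPA to
        # log: no HVA reverse lookup and no aliasing ambiguity.
        _hva, gpa = self.by_iova[iova]
        dirty_log.add(gpa)

d = VfioDmaMap()
d.map_dma(iova=0x1000, hva=0x7f0000021000, gpa=0xfffe1000)
dirty_log = set()
d.log_write(0x1000, dirty_log)
```

This is the trade-off debated in the thread: maintaining IOVA->GPA makes logging exact, at the cost of translating twice (IOVA->GPA, then GPA->HVA) on the data path.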
Re: [Qemu-devel] vhost, iova, and dirty page tracking
On 2019/9/19 下午2:32, Yan Zhao wrote: On Thu, Sep 19, 2019 at 02:29:54PM +0800, Yan Zhao wrote: On Thu, Sep 19, 2019 at 02:32:03PM +0800, Jason Wang wrote: On 2019/9/19 下午2:17, Yan Zhao wrote: On Thu, Sep 19, 2019 at 02:09:53PM +0800, Jason Wang wrote: On 2019/9/19 下午1:28, Yan Zhao wrote: On Thu, Sep 19, 2019 at 09:05:12AM +0800, Jason Wang wrote: On 2019/9/18 下午4:37, Tian, Kevin wrote: From: Jason Wang [mailto:jasow...@redhat.com] Sent: Wednesday, September 18, 2019 2:10 PM Note that the HVA to GPA mapping is not an 1:1 mapping. One HVA range could be mapped to several GPA ranges. This is fine. Currently vfio_dma maintains IOVA->HVA mapping. btw under what condition HVA->GPA is not 1:1 mapping? I didn't realize it. I don't remember the details e.g memory region alias? And neither kvm nor kvm API does forbid this if my memory is correct. I checked https://qemu.weilnetz.de/doc/devel/memory.html, which provides an example of aliased layout. However, its aliasing is all 1:1, instead of N:1. From guest p.o.v every writable GPA implies an unique location. Why would we hit the situation where multiple write-able GPAs are mapped to the same HVA (i.e. same physical memory location)? I don't know, just want to say current API does not forbid this. So we probably need to take care it. yes, in KVM API level, it does not forbid two slots to have the same HVA(slot->userspace_addr). But (1) there's only one kvm instance for each vm for each qemu process. (2) all ramblock->host (corresponds to HVA and slot->userspace_addr) in one qemu process is non-overlapping as it's obtained from mmmap(). (3) qemu ensures two kvm slots will not point to the same section of one ramblock. So, as long as kvm instance is not shared in two processes, and there's no bug in qemu, we can assure that HVA to GPA is 1:1. Well, you leave this API for userspace, so you can't assume qemu is the only user or any its behavior. If you had you should limit it in the API level instead of open window for them. 
But even if there are two processes operating on the same kvm instance and manipulating on memory slots, adding an extra GPA along side current IOVA & HVA to ioctl VFIO_IOMMU_MAP_DMA can still let driver knows the right IOVA->GPA mapping, right? It looks fragile. Consider HVA was mapped to both GPA1 and GPA2. Guest maps IOVA to GPA2, so we have IOVA GPA2 HVA in the new ioctl and then log through GPA2. If userspace is trying to sync through GPA1, it will miss the dirty page. So for safety we need log both GPA1 and GPA2. (See what has been done in log_write_hva() in vhost.c). The only way to do that is to maintain an independent HVA to GPA mapping like what KVM or vhost did. why GPA1 and GPA2 should be both dirty? even they have the same HVA due to overlaping virtual address space in two processes, they still correspond to two physical pages. don't get what's your meaning :) The point is not leave any corner case that is hard to debug or fix in the future. Let's just start by a single process, the API allows userspace to maps HVA to both GPA1 and GPA2. Since it knows GPA1 and GPA2 are equivalent, it's ok to sync just through GPA1. That means if you only log GPA2, it won't work. In that case, cannot log dirty according to HPA. sorry, it should be "cannot log dirty according to HVA". I think we are discussing the choice between GPA and IOVA, not HVA? Thanks because kvm cannot tell whether it's an valid case (the two GPAs are equivalent) or an invalid case (the two GPAs are not equivalent, but with the same HVA value). Right? Thanks Yan Thanks Thanks Yan Thanks Thanks Yan Is Qemu doing its own same-content memory merging in GPA level, similar to KSM? AFAIK, it doesn't. Thanks Thanks Kevin
Re: [Qemu-devel] vhost, iova, and dirty page tracking
+Paolo to help clarify here. > From: Jason Wang [mailto:jasow...@redhat.com] > Sent: Thursday, September 19, 2019 2:32 PM > > > On 2019/9/19 下午2:17, Yan Zhao wrote: > > On Thu, Sep 19, 2019 at 02:09:53PM +0800, Jason Wang wrote: > >> On 2019/9/19 下午1:28, Yan Zhao wrote: > >>> On Thu, Sep 19, 2019 at 09:05:12AM +0800, Jason Wang wrote: > On 2019/9/18 下午4:37, Tian, Kevin wrote: > >> From: Jason Wang [mailto:jasow...@redhat.com] > >> Sent: Wednesday, September 18, 2019 2:10 PM > >> > Note that the HVA to GPA mapping is not an 1:1 mapping. One > HVA > >> range > could be mapped to several GPA ranges. > >>> This is fine. Currently vfio_dma maintains IOVA->HVA mapping. > >>> > >>> btw under what condition HVA->GPA is not 1:1 mapping? I didn't > realize it. > >> I don't remember the details e.g memory region alias? And neither > kvm > >> nor kvm API does forbid this if my memory is correct. > >> > > I checked https://qemu.weilnetz.de/doc/devel/memory.html, which > > provides an example of aliased layout. However, its aliasing is all > > 1:1, instead of N:1. From guest p.o.v every writable GPA implies an > > unique location. Why would we hit the situation where multiple > > write-able GPAs are mapped to the same HVA (i.e. same physical > > memory location)? > I don't know, just want to say current API does not forbid this. So we > probably need to take care it. > > >>> yes, in KVM API level, it does not forbid two slots to have the same > HVA(slot->userspace_addr). > >>> But > >>> (1) there's only one kvm instance for each vm for each qemu process. > >>> (2) all ramblock->host (corresponds to HVA and slot->userspace_addr) > in one qemu > >>> process is non-overlapping as it's obtained from mmmap(). > >>> (3) qemu ensures two kvm slots will not point to the same section of > one ramblock. > >>> > >>> So, as long as kvm instance is not shared in two processes, and > >>> there's no bug in qemu, we can assure that HVA to GPA is 1:1. 
> >> > >> Well, you leave this API for userspace, so you can't assume qemu is the > >> only user or any its behavior. If you had you should limit it in the API > >> level instead of open window for them. > >> > >> > >>> But even if there are two processes operating on the same kvm > instance > >>> and manipulating on memory slots, adding an extra GPA along side > current > >>> IOVA & HVA to ioctl VFIO_IOMMU_MAP_DMA can still let driver knows > the > >>> right IOVA->GPA mapping, right? > >> > >> It looks fragile. Consider HVA was mapped to both GPA1 and GPA2. > Guest > >> maps IOVA to GPA2, so we have IOVA GPA2 HVA in the new ioctl and > then > >> log through GPA2. If userspace is trying to sync through GPA1, it will > >> miss the dirty page. So for safety we need log both GPA1 and GPA2. (See > >> what has been done in log_write_hva() in vhost.c). The only way to do > >> that is to maintain an independent HVA to GPA mapping like what KVM > or > >> vhost did. > >> > > why GPA1 and GPA2 should be both dirty? > > even they have the same HVA due to overlaping virtual address space in > > two processes, they still correspond to two physical pages. > > don't get what's your meaning :) > > > The point is not leave any corner case that is hard to debug or fix in > the future. > > Let's just start by a single process, the API allows userspace to maps > HVA to both GPA1 and GPA2. Since it knows GPA1 and GPA2 are equivalent, > it's ok to sync just through GPA1. That means if you only log GPA2, it > won't work. > I noted KVM itself doesn't consider such situation (one HVA is mapped to multiple GPAs), when doing its dirty page tracking. If you look at kvm_vcpu_mark_page_dirty, it simply finds the unique memslot which contains the dirty gfn and then set the dirty bit within that slot. It doesn't attempt to walk all memslots to find out any other GPA which may be mapped to the same HVA. So there must be some disconnect here. 
Let's hear from Paolo first and understand the rationale behind such a situation. Thanks Kevin
Re: [Qemu-devel] vhost, iova, and dirty page tracking
On Thu, Sep 19, 2019 at 02:29:54PM +0800, Yan Zhao wrote: > On Thu, Sep 19, 2019 at 02:32:03PM +0800, Jason Wang wrote: > > > > On 2019/9/19 下午2:17, Yan Zhao wrote: > > > On Thu, Sep 19, 2019 at 02:09:53PM +0800, Jason Wang wrote: > > >> On 2019/9/19 下午1:28, Yan Zhao wrote: > > >>> On Thu, Sep 19, 2019 at 09:05:12AM +0800, Jason Wang wrote: > > On 2019/9/18 下午4:37, Tian, Kevin wrote: > > >> From: Jason Wang [mailto:jasow...@redhat.com] > > >> Sent: Wednesday, September 18, 2019 2:10 PM > > >> > > Note that the HVA to GPA mapping is not an 1:1 mapping. One HVA > > >> range > > could be mapped to several GPA ranges. > > >>> This is fine. Currently vfio_dma maintains IOVA->HVA mapping. > > >>> > > >>> btw under what condition HVA->GPA is not 1:1 mapping? I didn't > > >>> realize it. > > >> I don't remember the details e.g memory region alias? And neither kvm > > >> nor kvm API does forbid this if my memory is correct. > > >> > > > I checked https://qemu.weilnetz.de/doc/devel/memory.html, which > > > provides an example of aliased layout. However, its aliasing is all > > > 1:1, instead of N:1. From guest p.o.v every writable GPA implies an > > > unique location. Why would we hit the situation where multiple > > > write-able GPAs are mapped to the same HVA (i.e. same physical > > > memory location)? > > I don't know, just want to say current API does not forbid this. So we > > probably need to take care it. > > > > >>> yes, in KVM API level, it does not forbid two slots to have the same > > >>> HVA(slot->userspace_addr). > > >>> But > > >>> (1) there's only one kvm instance for each vm for each qemu process. > > >>> (2) all ramblock->host (corresponds to HVA and slot->userspace_addr) in > > >>> one qemu > > >>> process is non-overlapping as it's obtained from mmmap(). > > >>> (3) qemu ensures two kvm slots will not point to the same section of > > >>> one ramblock. 
> > >>> > > >>> So, as long as kvm instance is not shared in two processes, and > > >>> there's no bug in qemu, we can assure that HVA to GPA is 1:1. > > >> > > >> Well, you leave this API for userspace, so you can't assume qemu is the > > >> only user or any its behavior. If you had you should limit it in the API > > >> level instead of open window for them. > > >> > > >> > > >>> But even if there are two processes operating on the same kvm instance > > >>> and manipulating on memory slots, adding an extra GPA along side current > > >>> IOVA & HVA to ioctl VFIO_IOMMU_MAP_DMA can still let driver knows the > > >>> right IOVA->GPA mapping, right? > > >> > > >> It looks fragile. Consider HVA was mapped to both GPA1 and GPA2. Guest > > >> maps IOVA to GPA2, so we have IOVA GPA2 HVA in the new ioctl and then > > >> log through GPA2. If userspace is trying to sync through GPA1, it will > > >> miss the dirty page. So for safety we need log both GPA1 and GPA2. (See > > >> what has been done in log_write_hva() in vhost.c). The only way to do > > >> that is to maintain an independent HVA to GPA mapping like what KVM or > > >> vhost did. > > >> > > > why GPA1 and GPA2 should be both dirty? > > > even they have the same HVA due to overlaping virtual address space in > > > two processes, they still correspond to two physical pages. > > > don't get what's your meaning :) > > > > > > The point is not leave any corner case that is hard to debug or fix in > > the future. > > > > Let's just start by a single process, the API allows userspace to maps > > HVA to both GPA1 and GPA2. Since it knows GPA1 and GPA2 are equivalent, > > it's ok to sync just through GPA1. That means if you only log GPA2, it > > won't work. > > > In that case, cannot log dirty according to HPA. sorry, it should be "cannot log dirty according to HVA". 
> because kvm cannot tell whether it's an valid case (the two GPAs are > equivalent) > or an invalid case (the two GPAs are not equivalent, but with the same > HVA value). > > Right? > > Thanks > Yan > > > > Thanks > > > > > > > > > > Thanks > > > Yan > > > > > > > > >> Thanks > > >> > > >> > > >>> Thanks > > >>> Yan > > >>> > > > Is Qemu doing its own same-content memory > > > merging in GPA level, similar to KSM? > > AFAIK, it doesn't. > > > > Thanks > > > > > > > Thanks > > > Kevin > > >
Re: [Qemu-devel] vhost, iova, and dirty page tracking
On 2019/9/19 下午1:28, Yan Zhao wrote: On Thu, Sep 19, 2019 at 09:05:12AM +0800, Jason Wang wrote: On 2019/9/18 下午4:37, Tian, Kevin wrote: From: Jason Wang [mailto:jasow...@redhat.com] Sent: Wednesday, September 18, 2019 2:10 PM Note that the HVA to GPA mapping is not a 1:1 mapping. One HVA range could be mapped to several GPA ranges. This is fine. Currently vfio_dma maintains the IOVA->HVA mapping. BTW, under what condition is HVA->GPA not a 1:1 mapping? I didn't realize it. I don't remember the details, e.g. memory region alias? And neither kvm nor the kvm API forbids this, if my memory is correct. I checked https://qemu.weilnetz.de/doc/devel/memory.html, which provides an example of an aliased layout. However, its aliasing is all 1:1, instead of N:1. From the guest's p.o.v. every writable GPA implies a unique location. Why would we hit the situation where multiple writable GPAs are mapped to the same HVA (i.e. the same physical memory location)? I don't know, just want to say the current API does not forbid this. So we probably need to take care of it. Yes, at the KVM API level, it does not forbid two slots from having the same HVA (slot->userspace_addr). But (1) there's only one kvm instance for each vm for each qemu process. (2) all ramblock->host (corresponding to HVA and slot->userspace_addr) in one qemu process are non-overlapping, as they're obtained from mmap(). (3) qemu ensures two kvm slots will not point to the same section of one ramblock. So, as long as the kvm instance is not shared between two processes, and there's no bug in qemu, we can be assured that HVA to GPA is 1:1. Well, you leave this API for userspace, so you can't assume qemu is the only user or rely on any of its behavior. If you had, you should have limited it at the API level instead of opening a window for them. But even if there are two processes operating on the same kvm instance and manipulating memory slots, adding an extra GPA alongside the current IOVA & HVA to ioctl VFIO_IOMMU_MAP_DMA can still let the driver know the right IOVA->GPA mapping, right?
It looks fragile. Consider an HVA that was mapped to both GPA1 and GPA2. The guest maps IOVA to GPA2, so we have IOVA, GPA2, HVA in the new ioctl and then log through GPA2. If userspace tries to sync through GPA1, it will miss the dirty page. So for safety we need to log both GPA1 and GPA2. (See what has been done in log_write_hva() in vhost.c.) The only way to do that is to maintain an independent HVA to GPA mapping like what KVM or vhost did. Thanks Yan Is Qemu doing its own same-content memory merging at GPA level, similar to KSM? AFAIK, it doesn't. Thanks Kevin
Re: [Qemu-devel] vhost, iova, and dirty page tracking
On Thu, Sep 19, 2019 at 09:05:12AM +0800, Jason Wang wrote: > > On 2019/9/18 下午4:37, Tian, Kevin wrote: > >> From: Jason Wang [mailto:jasow...@redhat.com] > >> Sent: Wednesday, September 18, 2019 2:10 PM > >> > Note that the HVA to GPA mapping is not an 1:1 mapping. One HVA > >> range > could be mapped to several GPA ranges. > >>> This is fine. Currently vfio_dma maintains IOVA->HVA mapping. > >>> > >>> btw under what condition HVA->GPA is not 1:1 mapping? I didn't realize it. > >> > >> I don't remember the details e.g memory region alias? And neither kvm > >> nor kvm API does forbid this if my memory is correct. > >> > > I checked https://qemu.weilnetz.de/doc/devel/memory.html, which > > provides an example of aliased layout. However, its aliasing is all > > 1:1, instead of N:1. From guest p.o.v every writable GPA implies an > > unique location. Why would we hit the situation where multiple > > write-able GPAs are mapped to the same HVA (i.e. same physical > > memory location)? > > > I don't know, just want to say current API does not forbid this. So we > probably need to take care it. > Yes, at the KVM API level, it does not forbid two slots from having the same HVA (slot->userspace_addr). But (1) there's only one kvm instance for each vm for each qemu process. (2) all ramblock->host (corresponding to HVA and slot->userspace_addr) in one qemu process are non-overlapping, as they're obtained from mmap(). (3) qemu ensures two kvm slots will not point to the same section of one ramblock. So, as long as the kvm instance is not shared between two processes, and there's no bug in qemu, we can be assured that HVA to GPA is 1:1. But even if there are two processes operating on the same kvm instance and manipulating memory slots, adding an extra GPA alongside the current IOVA & HVA to ioctl VFIO_IOMMU_MAP_DMA can still let the driver know the right IOVA->GPA mapping, right? Thanks Yan > > > Is Qemu doing its own same-content memory > > merging in GPA level, similar to KSM? > > > AFAIK, it doesn't.
> > Thanks > > > > Thanks > > Kevin > > >
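Yan's proposal above, letting VFIO_IOMMU_MAP_DMA carry a GPA alongside IOVA and HVA so the driver can cache an exact IOVA->GPA mapping in each vfio_dma, can be sketched as a toy model (Python; the struct and field names are hypothetical, not the real vfio uAPI):

```python
# Hypothetical sketch of the proposal: extend the VFIO_IOMMU_MAP_DMA
# payload with a GPA field, so each vfio_dma records IOVA->HVA *and*
# IOVA->GPA. Field names are illustrative only.
from dataclasses import dataclass

@dataclass
class VfioDmaMap:           # models struct vfio_iommu_type1_dma_map + gpa
    iova: int
    vaddr: int              # HVA
    size: int
    gpa: int                # the proposed extra field

dma_list = []               # models the per-container vfio_dma list

def map_dma(req: VfioDmaMap):
    dma_list.append(req)

def iova_to_gpa(iova):
    """Driver-side translation for dirty logging: IOVA -> GPA directly,
    with no ambiguous HVA->GPA reverse lookup."""
    for d in dma_list:
        if d.iova <= iova < d.iova + d.size:
            return d.gpa + (iova - d.iova)
    return None

map_dma(VfioDmaMap(iova=0x1000, vaddr=0x7f00dead0000, size=0x2000, gpa=0x80001000))
```

With the GPA recorded at map time, dirty logging never needs the ambiguous HVA->GPA reverse lookup; the cost is a uAPI change and, as Jason notes, trusting userspace to supply a truthful GPA for every alias.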
Re: [Qemu-devel] vhost, iova, and dirty page tracking
> From: Jason Wang [mailto:jasow...@redhat.com] > Sent: Wednesday, September 18, 2019 2:10 PM > > On 2019/9/18 上午9:44, Tian, Kevin wrote: > >> From: Jason Wang [mailto:jasow...@redhat.com] > >> Sent: Tuesday, September 17, 2019 6:36 PM > >> > >> On 2019/9/17 下午4:48, Tian, Kevin wrote: > From: Jason Wang [mailto:jasow...@redhat.com] > Sent: Monday, September 16, 2019 4:33 PM > > > On 2019/9/16 上午9:51, Tian, Kevin wrote: > > Hi, Jason > > > > We had a discussion about dirty page tracking in VFIO, when > vIOMMU > > is enabled: > > > > https://lists.nongnu.org/archive/html/qemu-devel/2019- > 09/msg02690.html > > It's actually a similar model as vhost - Qemu cannot interpose the > fast- > path > > DMAs thus relies on the kernel part to track and report dirty page > information. > > Currently Qemu tracks dirty pages in GFN level, thus demanding a > translation > > from IOVA to GPA. Then the open in our discussion is where this > translation > > should happen. Doing the translation in kernel implies a device iotlb > flavor, > > which is what vhost implements today. It requires potentially large > tracking > > structures in the host kernel, but leveraging the existing log_sync > flow > >> in > Qemu. > > On the other hand, Qemu may perform log_sync for every removal > of > IOVA > > mapping and then do the translation itself, then avoiding the GPA > awareness > > in the kernel side. It needs some change to current Qemu log-sync > flow, > and > > may bring more overhead if IOVA is frequently unmapped. > > > > So we'd like to hear about your opinions, especially about how you > >> came > > down to the current iotlb approach for vhost. > We don't consider too much in the point when introducing vhost. And > before IOTLB, vhost has already know GPA through its mem table > (GPA->HVA). So it's nature and easier to track dirty pages at GPA level > then it won't any changes in the existing ABI. > >>> This is the same situation as VFIO. 
> >>> > For VFIO case, the only advantages of using GPA is that the log can > then > be shared among all the devices that belongs to the VM. Otherwise > syncing through IOVA is cleaner. > >>> I still worry about the potential performance impact with this approach. > >>> In current mdev live migration series, there are multiple system calls > >>> involved when retrieving the dirty bitmap information for a given > memory > >>> range. > >> > >> I haven't took a deep look at that series. Technically dirty bitmap > >> could be shared between device and driver, then there's no system call > >> in synchronization. > > That series require Qemu to tell the kernel about the information > > about queried region (start, number, and page_size), read > > the information about the dirty bitmap (offset, size) and then read > > the dirty bitmap. > > > Any pointer to that series, I can only find a "mdev live migration > support with vfio-mdev-pci" from Liu Yi without actual codes. https://lists.nongnu.org/archive/html/qemu-devel/2019-08/msg05543.html It's interesting that I cannot google it. Have to manually find it in Qemu archive. > > > > Although the bitmap can be mmaped thus shared, > > earlier reads/writes are conducted by pread/pwrite system calls. > > This design is fine for current log_dirty implementation, where > > dirty bitmap is synced in every pre-copy round. But to do it for > > every IOVA unmap, it's definitely over-killed. > > > >> > >>> IOVA mappings might be changed frequently. Though one may > >>> argue that frequent IOVA change already has bad performance, it's still > >>> not good to introduce further non-negligible overhead in such situation. > >> > >> Yes, it depends on the behavior of vIOMMU driver, e.g the frequency > and > >> granularity of the flushing. > >> > >> > >>> On the other hand, I realized that adding IOVA awareness in VFIO is > >>> actually easy. 
Today VFIO already maintains a full list of IOVA and its > >>> associated HVA in vfio_dma structure, according to VFIO_MAP and > >>> VFIO_UNMAP. As long as we allow the latter two operations to accept > >>> another parameter (GPA), IOVA->GPA mapping can be naturally cached > >>> in existing vfio_dma objects. > >> > >> Note that the HVA to GPA mapping is not an 1:1 mapping. One HVA > range > >> could be mapped to several GPA ranges. > > This is fine. Currently vfio_dma maintains IOVA->HVA mapping. > > > > btw under what condition HVA->GPA is not 1:1 mapping? I didn't realize it. > > > I don't remember the details e.g memory region alias? And neither kvm > nor kvm API does forbid this if my memory is correct. > I did see such a comment in vhost code (log_write_hva): /* More than one GPAs can be mapped into a single HVA. So * iterate all possible umems here to be safe. */ and it looks like it tries to log every GPA that the written HVA maps to, to be safe.
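The multi-step dirty-bitmap retrieval Kevin refers to in the mdev migration series can be modeled schematically (Python; the stub below stands in for the kernel side — in the real series each step is a pread/pwrite on the migration region, i.e. a separate system call, which is why repeating the handshake on every IOVA unmap is expensive):

```python
# Schematic of the per-sync flow: (1) write the queried range and page
# size, (2) read back the bitmap's offset/size, (3) read the bitmap.
import math

class FakeMigrationRegion:
    """Stand-in for the kernel side of the migration region."""
    def __init__(self, dirty_pfns):
        self.dirty_pfns = dirty_pfns

    def query(self, start_pfn, total_pfns, page_size):
        # Steps 1+2: kernel builds the bitmap, returns (offset, size).
        nbytes = math.ceil(total_pfns / 8)
        bitmap = bytearray(nbytes)
        for pfn in self.dirty_pfns:
            if start_pfn <= pfn < start_pfn + total_pfns:
                bit = pfn - start_pfn
                bitmap[bit // 8] |= 1 << (bit % 8)
        self.bitmap = bytes(bitmap)
        return 0, nbytes

    def read_bitmap(self, offset, size):
        # Step 3: one more read for the bitmap itself.
        return self.bitmap[offset:offset + size]

region = FakeMigrationRegion(dirty_pfns={3, 9})
off, size = region.query(start_pfn=0, total_pfns=16, page_size=4096)
bitmap = region.read_bitmap(off, size)
```

Per pre-copy round this handshake is cheap; performed per IOVA unmap, the syscall count scales with the unmap rate, which is the overhead being worried about.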
Re: [Qemu-devel] vhost, iova, and dirty page tracking
> From: Jason Wang [mailto:jasow...@redhat.com] > Sent: Wednesday, September 18, 2019 2:04 PM > > On 2019/9/18 上午9:31, Tian, Kevin wrote: > >> From: Alex Williamson [mailto:alex.william...@redhat.com] > >> Sent: Tuesday, September 17, 2019 10:54 PM > >> > >> On Tue, 17 Sep 2019 08:48:36 + > >> "Tian, Kevin" wrote: > >> > From: Jason Wang [mailto:jasow...@redhat.com] > Sent: Monday, September 16, 2019 4:33 PM > > > On 2019/9/16 上午9:51, Tian, Kevin wrote: > > Hi, Jason > > > > We had a discussion about dirty page tracking in VFIO, when > vIOMMU > > is enabled: > > > > https://lists.nongnu.org/archive/html/qemu-devel/2019- > 09/msg02690.html > > It's actually a similar model as vhost - Qemu cannot interpose the > fast- > path > > DMAs thus relies on the kernel part to track and report dirty page > information. > > Currently Qemu tracks dirty pages in GFN level, thus demanding a > translation > > from IOVA to GPA. Then the open in our discussion is where this > translation > > should happen. Doing the translation in kernel implies a device iotlb > flavor, > > which is what vhost implements today. It requires potentially large > tracking > > structures in the host kernel, but leveraging the existing log_sync > flow > >> in > Qemu. > > On the other hand, Qemu may perform log_sync for every removal > of > IOVA > > mapping and then do the translation itself, then avoiding the GPA > awareness > > in the kernel side. It needs some change to current Qemu log-sync > >> flow, > and > > may bring more overhead if IOVA is frequently unmapped. > > > > So we'd like to hear about your opinions, especially about how you > >> came > > down to the current iotlb approach for vhost. > We don't consider too much in the point when introducing vhost. And > before IOTLB, vhost has already know GPA through its mem table > (GPA->HVA). So it's nature and easier to track dirty pages at GPA level > then it won't any changes in the existing ABI. > >>> This is the same situation as VFIO. 
> >> It is? VFIO doesn't know GPAs, it only knows HVA, HPA, and IOVA. In > >> some cases IOVA is GPA, but not all. > > Well, I thought vhost has a similar design, that the index of its mem table > > is GPA when vIOMMU is off and then becomes IOVA when vIOMMU is on. > > But I may be wrong here. Jason, can you help clarify? I saw two > > interfaces which poke the mem table: VHOST_SET_MEM_TABLE (for GPA) > > and VHOST_IOTLB_UPDATE (for IOVA). Are they used exclusively or > together? > > > > Actually, vhost maintains two interval trees, mem table GPA->HVA, and > device IOTLB IOVA->HVA. Device IOTLB is only used when vIOMMU is > enabled, and in that case mem table is used only when vhost need to > track dirty pages (do reverse lookup of memtable to get HVA->GPA). So in > conclusion, for datapath, they are used exclusively, but they need > cowork for logging dirty pages when device IOTLB is enabled. > OK. Then it's different from the current VFIO design, which maintains only one tree, indexed by either GPA or IOVA exclusively, depending on whether vIOMMU is in use. Thanks Kevin
Re: [Qemu-devel] vhost, iova, and dirty page tracking
On 2019/9/18 上午9:44, Tian, Kevin wrote: From: Jason Wang [mailto:jasow...@redhat.com] Sent: Tuesday, September 17, 2019 6:36 PM On 2019/9/17 下午4:48, Tian, Kevin wrote: From: Jason Wang [mailto:jasow...@redhat.com] Sent: Monday, September 16, 2019 4:33 PM On 2019/9/16 上午9:51, Tian, Kevin wrote: Hi, Jason We had a discussion about dirty page tracking in VFIO, when vIOMMU is enabled: https://lists.nongnu.org/archive/html/qemu-devel/2019- 09/msg02690.html It's actually a similar model as vhost - Qemu cannot interpose the fast- path DMAs thus relies on the kernel part to track and report dirty page information. Currently Qemu tracks dirty pages in GFN level, thus demanding a translation from IOVA to GPA. Then the open in our discussion is where this translation should happen. Doing the translation in kernel implies a device iotlb flavor, which is what vhost implements today. It requires potentially large tracking structures in the host kernel, but leveraging the existing log_sync flow in Qemu. On the other hand, Qemu may perform log_sync for every removal of IOVA mapping and then do the translation itself, then avoiding the GPA awareness in the kernel side. It needs some change to current Qemu log-sync flow, and may bring more overhead if IOVA is frequently unmapped. So we'd like to hear about your opinions, especially about how you came down to the current iotlb approach for vhost. We don't consider too much in the point when introducing vhost. And before IOTLB, vhost has already know GPA through its mem table (GPA->HVA). So it's nature and easier to track dirty pages at GPA level then it won't any changes in the existing ABI. This is the same situation as VFIO. For VFIO case, the only advantages of using GPA is that the log can then be shared among all the devices that belongs to the VM. Otherwise syncing through IOVA is cleaner. I still worry about the potential performance impact with this approach. 
In the current mdev live migration series, there are multiple system calls involved when retrieving the dirty bitmap information for a given memory range. I haven't taken a deep look at that series. Technically the dirty bitmap could be shared between device and driver; then there's no system call in synchronization. That series requires Qemu to tell the kernel about the queried region (start, number, and page_size), read the information about the dirty bitmap (offset, size), and then read the dirty bitmap itself. Any pointer to that series? I can only find "mdev live migration support with vfio-mdev-pci" from Liu Yi, without actual code. Although the bitmap can be mmaped and thus shared, the earlier reads/writes are conducted by pread/pwrite system calls. This design is fine for the current log_dirty implementation, where the dirty bitmap is synced in every pre-copy round. But to do it for every IOVA unmap is definitely overkill. IOVA mappings might be changed frequently. Though one may argue that frequent IOVA changes already have bad performance, it's still not good to introduce further non-negligible overhead in such a situation. Yes, it depends on the behavior of the vIOMMU driver, e.g. the frequency and granularity of the flushing. On the other hand, I realized that adding IOVA awareness in VFIO is actually easy. Today VFIO already maintains a full list of IOVAs and their associated HVAs in the vfio_dma structure, according to VFIO_MAP and VFIO_UNMAP. As long as we allow the latter two operations to accept another parameter (GPA), the IOVA->GPA mapping can be naturally cached in existing vfio_dma objects. Note that the HVA to GPA mapping is not a 1:1 mapping. One HVA range could be mapped to several GPA ranges. This is fine. Currently vfio_dma maintains the IOVA->HVA mapping. btw, under what condition is HVA->GPA not a 1:1 mapping? I didn't realize it. I don't remember the details, e.g. memory region aliases? And neither KVM nor the KVM API forbids this, if my memory is correct.
Those objects are always updated according to MAP and UNMAP ioctls to stay up-to-date. Qemu then uniformly retrieves the VFIO dirty bitmap for the entire GPA range in every pre-copy round, regardless of whether vIOMMU is enabled. There is no need for another IOTLB implementation, with the main ask being a v2 MAP/UNMAP interface. Or provide a GPA->HVA mapping as vhost did. But one question: I believe the device can only do dirty page logging through IOVA, so how do you handle the case when an IOVA is removed? That's why a log_sync is required each time an IOVA is unmapped, in Alex's thought. Thanks Kevin Ok. Thanks
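Kevin's proposal can be sketched as follows (a hypothetical v2 interface modeled in Python, not the existing VFIO ABI; the struct and function names only mirror the discussion): if the MAP operation also carried a GPA, each vfio_dma object would cache IOVA->GPA directly, and no reverse lookup would be needed.

```python
# Sketch of a hypothetical extended MAP/UNMAP interface; not real VFIO code.
from dataclasses import dataclass

@dataclass
class VfioDma:      # models struct vfio_dma, extended with a gpa field
    iova: int
    hva: int
    gpa: int        # the proposed extra parameter of the v2 MAP ioctl
    size: int

dmas = []

def vfio_map(iova, hva, gpa, size):
    """v2 MAP: record GPA alongside IOVA/HVA in the vfio_dma object."""
    dmas.append(VfioDma(iova, hva, gpa, size))

def vfio_unmap(iova):
    """v2 UNMAP: drop the vfio_dma object for this IOVA."""
    dmas[:] = [d for d in dmas if d.iova != iova]

def iova_to_gpa(iova):
    """Direct IOVA->GPA translation from the cached mapping."""
    for d in dmas:
        if d.iova <= iova < d.iova + d.size:
            return d.gpa + (iova - d.iova)
    return None
```

With this cache the kernel could report dirtiness in GPA terms for the whole pre-copy round; the open issue Jason raises remains that the translation disappears as soon as the UNMAP runs, hence the suggestion of a log_sync per unmap.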
Re: [Qemu-devel] vhost, iova, and dirty page tracking
On 2019/9/18 上午9:31, Tian, Kevin wrote: From: Alex Williamson [mailto:alex.william...@redhat.com] Sent: Tuesday, September 17, 2019 10:54 PM On Tue, 17 Sep 2019 08:48:36 + "Tian, Kevin" wrote: From: Jason Wang [mailto:jasow...@redhat.com] Sent: Monday, September 16, 2019 4:33 PM On 2019/9/16 上午9:51, Tian, Kevin wrote: Hi, Jason We had a discussion about dirty page tracking in VFIO, when vIOMMU is enabled: https://lists.nongnu.org/archive/html/qemu-devel/2019- 09/msg02690.html It's actually a similar model as vhost - Qemu cannot interpose the fast- path DMAs thus relies on the kernel part to track and report dirty page information. Currently Qemu tracks dirty pages in GFN level, thus demanding a translation from IOVA to GPA. Then the open in our discussion is where this translation should happen. Doing the translation in kernel implies a device iotlb flavor, which is what vhost implements today. It requires potentially large tracking structures in the host kernel, but leveraging the existing log_sync flow in Qemu. On the other hand, Qemu may perform log_sync for every removal of IOVA mapping and then do the translation itself, then avoiding the GPA awareness in the kernel side. It needs some change to current Qemu log-sync flow, and may bring more overhead if IOVA is frequently unmapped. So we'd like to hear about your opinions, especially about how you came down to the current iotlb approach for vhost. We don't consider too much in the point when introducing vhost. And before IOTLB, vhost has already know GPA through its mem table (GPA->HVA). So it's nature and easier to track dirty pages at GPA level then it won't any changes in the existing ABI. This is the same situation as VFIO. It is? VFIO doesn't know GPAs, it only knows HVA, HPA, and IOVA. In some cases IOVA is GPA, but not all. Well, I thought vhost has a similar design, that the index of its mem table is GPA when vIOMMU is off and then becomes IOVA when vIOMMU is on. But I may be wrong here. 
Jason, can you help clarify? I saw two interfaces which poke the mem table: VHOST_SET_MEM_TABLE (for GPA) and VHOST_IOTLB_UPDATE (for IOVA). Are they used exclusively or together?

Actually, vhost maintains two interval trees: the mem table (GPA->HVA) and the device IOTLB (IOVA->HVA). The device IOTLB is only used when vIOMMU is enabled, and in that case the mem table is used only when vhost needs to track dirty pages (doing a reverse lookup of the mem table to get HVA->GPA). So in conclusion: for the datapath they are used exclusively, but they need to work together for logging dirty pages when device IOTLB is enabled.

Thanks
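Jason's two-tree description can be sketched conceptually (Python standing in for vhost's C interval trees; the addresses and helper names are made up for illustration): the datapath translates IOVA->HVA through the device IOTLB, while dirty logging walks the mem table in reverse, HVA->GPA, which may yield several GPAs when they alias the same HVA.

```python
# Conceptual model of vhost's two lookup structures, not its actual code.
# mem table: (gpa_start, hva_start, size) regions, i.e. GPA -> HVA
mem_table = [
    (0x0000, 0x7f000000, 0x1000),  # GPA1 -> some HVA range
    (0x2000, 0x7f000000, 0x1000),  # GPA2 aliases the same HVA range
]
# device IOTLB: (iova_start, hva_start, size) regions, i.e. IOVA -> HVA
iotlb = [(0x9000, 0x7f000000, 0x1000)]

def iova_to_hva(iova):
    for start, hva, size in iotlb:
        if start <= iova < start + size:
            return hva + (iova - start)
    return None

def hva_to_gpas(hva):
    """Reverse lookup of the mem table: one HVA may map to several GPAs."""
    return [gpa + (hva - h) for gpa, h, size in mem_table
            if h <= hva < h + size]

dirty_gpas = set()

def log_write(iova):
    """Log a guest write observed at an IOVA, at GPA granularity."""
    hva = iova_to_hva(iova)
    if hva is None:
        return
    for gpa in hva_to_gpas(hva):
        dirty_gpas.add(gpa & ~0xFFF)  # page-align the dirty GPA
```

Logging a write at IOVA 0x9010 marks both aliased pages (GPA 0x0000 and GPA 0x2000) dirty, which is the "set both GPA1 and GPA2" behavior discussed elsewhere in the thread.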
Re: [Qemu-devel] vhost, iova, and dirty page tracking
> From: Tian, Kevin > Sent: Wednesday, September 18, 2019 9:32 AM > > > From: Alex Williamson [mailto:alex.william...@redhat.com] > > Sent: Tuesday, September 17, 2019 10:54 PM > > > > On Tue, 17 Sep 2019 08:48:36 + > > "Tian, Kevin" wrote: > > > > > > From: Jason Wang [mailto:jasow...@redhat.com] > > > > Sent: Monday, September 16, 2019 4:33 PM > > > > > > > > > > > > On 2019/9/16 上午9:51, Tian, Kevin wrote: > > > > > Hi, Jason > > > > > > > > > > We had a discussion about dirty page tracking in VFIO, when > vIOMMU > > > > > is enabled: > > > > > > > > > > https://lists.nongnu.org/archive/html/qemu-devel/2019- > > > > 09/msg02690.html > > > > > > > > > > It's actually a similar model as vhost - Qemu cannot interpose the > fast- > > > > path > > > > > DMAs thus relies on the kernel part to track and report dirty page > > > > information. > > > > > Currently Qemu tracks dirty pages in GFN level, thus demanding a > > > > translation > > > > > from IOVA to GPA. Then the open in our discussion is where this > > > > translation > > > > > should happen. Doing the translation in kernel implies a device iotlb > > > > flavor, > > > > > which is what vhost implements today. It requires potentially large > > > > tracking > > > > > structures in the host kernel, but leveraging the existing log_sync > flow > > in > > > > Qemu. > > > > > On the other hand, Qemu may perform log_sync for every removal > of > > > > IOVA > > > > > mapping and then do the translation itself, then avoiding the GPA > > > > awareness > > > > > in the kernel side. It needs some change to current Qemu log-sync > > flow, > > > > and > > > > > may bring more overhead if IOVA is frequently unmapped. > > > > > > > > > > So we'd like to hear about your opinions, especially about how you > > came > > > > > down to the current iotlb approach for vhost. > > > > > > > > > > > > We don't consider too much in the point when introducing vhost. 
And > > > > before IOTLB, vhost has already know GPA through its mem table > > > > (GPA->HVA). So it's nature and easier to track dirty pages at GPA level > > > > then it won't any changes in the existing ABI. > > > > > > This is the same situation as VFIO. > > > > It is? VFIO doesn't know GPAs, it only knows HVA, HPA, and IOVA. In > > some cases IOVA is GPA, but not all. > > Well, I thought vhost has a similar design, that the index of its mem table > is GPA when vIOMMU is off and then becomes IOVA when vIOMMU is on. > But I may be wrong here. Jason, can you help clarify? I saw two > interfaces which poke the mem table: VHOST_SET_MEM_TABLE (for GPA) > and VHOST_IOTLB_UPDATE (for IOVA). Are they used exclusively or > together? > > > > > > > For VFIO case, the only advantages of using GPA is that the log can > then > > > > be shared among all the devices that belongs to the VM. Otherwise > > > > syncing through IOVA is cleaner. > > > > > > I still worry about the potential performance impact with this approach. > > > In current mdev live migration series, there are multiple system calls > > > involved when retrieving the dirty bitmap information for a given > memory > > > range. IOVA mappings might be changed frequently. Though one may > > > argue that frequent IOVA change already has bad performance, it's still > > > not good to introduce further non-negligible overhead in such situation. > > > > > > On the other hand, I realized that adding IOVA awareness in VFIO is > > > actually easy. Today VFIO already maintains a full list of IOVA and its > > > associated HVA in vfio_dma structure, according to VFIO_MAP and > > > VFIO_UNMAP. As long as we allow the latter two operations to accept > > > another parameter (GPA), IOVA->GPA mapping can be naturally cached > > > in existing vfio_dma objects. Those objects are always updated > according > > > to MAP and UNMAP ioctls to be up-to-date. 
Qemu then uniformly > > > retrieves the VFIO dirty bitmap for the entire GPA range in every pre- copy > > > round, regardless of whether vIOMMU is enabled. There is no need of > > > another IOTLB implementation, with the main ask on a v2 MAP/UNMAP > > > interface. > > > > > > Alex, your thoughts? > > > > Same as last time, you're asking VFIO to be aware of an entirely new > > address space and implement tracking structures of that address space > > to make life easier for QEMU. Don't we typically push such complexity > > to userspace rather than into the kernel? I'm not convinced. Thanks, > > > > Is it really complex? No need of a new tracking structure. Just allowing > the MAP interface to carry a new parameter and then record it in the > existing vfio_dma objects. > > Note the frequency of guest DMA map/unmap could be very high. We > saw >100K invocations per second with a 40G NIC. To do the right > translation Qemu requires log_sync for every unmap, before the > mapping for logged dirty IOVA becomes stale. In current Kirti's patch, > each log_sync requires several system_calls through the migration > info, e.g. setting start_pfn/page_size/total_pfns and then reading > data_offset/data_size. That design is fine for doing log_sync in every > pre-copy round, but too costly if doing so for every IOVA unmap. If > small extension in kernel can lead to great overhead reduction, why > not? > > Thanks Kevin
Re: [Qemu-devel] vhost, iova, and dirty page tracking
> From: Jason Wang [mailto:jasow...@redhat.com] > Sent: Tuesday, September 17, 2019 6:36 PM > > On 2019/9/17 下午4:48, Tian, Kevin wrote: > >> From: Jason Wang [mailto:jasow...@redhat.com] > >> Sent: Monday, September 16, 2019 4:33 PM > >> > >> > >> On 2019/9/16 上午9:51, Tian, Kevin wrote: > >>> Hi, Jason > >>> > >>> We had a discussion about dirty page tracking in VFIO, when vIOMMU > >>> is enabled: > >>> > >>> https://lists.nongnu.org/archive/html/qemu-devel/2019- > >> 09/msg02690.html > >>> It's actually a similar model as vhost - Qemu cannot interpose the fast- > >> path > >>> DMAs thus relies on the kernel part to track and report dirty page > >> information. > >>> Currently Qemu tracks dirty pages in GFN level, thus demanding a > >> translation > >>> from IOVA to GPA. Then the open in our discussion is where this > >> translation > >>> should happen. Doing the translation in kernel implies a device iotlb > >> flavor, > >>> which is what vhost implements today. It requires potentially large > >> tracking > >>> structures in the host kernel, but leveraging the existing log_sync flow > in > >> Qemu. > >>> On the other hand, Qemu may perform log_sync for every removal of > >> IOVA > >>> mapping and then do the translation itself, then avoiding the GPA > >> awareness > >>> in the kernel side. It needs some change to current Qemu log-sync flow, > >> and > >>> may bring more overhead if IOVA is frequently unmapped. > >>> > >>> So we'd like to hear about your opinions, especially about how you > came > >>> down to the current iotlb approach for vhost. > >> > >> We don't consider too much in the point when introducing vhost. And > >> before IOTLB, vhost has already know GPA through its mem table > >> (GPA->HVA). So it's nature and easier to track dirty pages at GPA level > >> then it won't any changes in the existing ABI. > > This is the same situation as VFIO. 
> > > >> For VFIO case, the only advantages of using GPA is that the log can then > >> be shared among all the devices that belongs to the VM. Otherwise > >> syncing through IOVA is cleaner. > > I still worry about the potential performance impact with this approach. > > In current mdev live migration series, there are multiple system calls > > involved when retrieving the dirty bitmap information for a given memory > > range. > > > I haven't took a deep look at that series. Technically dirty bitmap > could be shared between device and driver, then there's no system call > in synchronization. That series requires Qemu to tell the kernel about the queried region (start, number, and page_size), read the information about the dirty bitmap (offset, size), and then read the dirty bitmap itself. Although the bitmap can be mmaped and thus shared, the earlier reads/writes are conducted by pread/pwrite system calls. This design is fine for the current log_dirty implementation, where the dirty bitmap is synced in every pre-copy round. But to do it for every IOVA unmap is definitely overkill. > > > > IOVA mappings might be changed frequently. Though one may > > argue that frequent IOVA change already has bad performance, it's still > > not good to introduce further non-negligible overhead in such situation. > > > Yes, it depends on the behavior of vIOMMU driver, e.g the frequency and > granularity of the flushing. > > > > > > On the other hand, I realized that adding IOVA awareness in VFIO is > > actually easy. Today VFIO already maintains a full list of IOVA and its > > associated HVA in vfio_dma structure, according to VFIO_MAP and > > VFIO_UNMAP. As long as we allow the latter two operations to accept > > another parameter (GPA), IOVA->GPA mapping can be naturally cached > > in existing vfio_dma objects. > > > Note that the HVA to GPA mapping is not an 1:1 mapping. One HVA range > could be mapped to several GPA ranges. This is fine. 
Currently vfio_dma maintains the IOVA->HVA mapping. btw, under what condition is HVA->GPA not a 1:1 mapping? I didn't realize it. > > > > Those objects are always updated according > > to MAP and UNMAP ioctls to be up-to-date. Qemu then uniformly > > retrieves the VFIO dirty bitmap for the entire GPA range in every pre-copy > > round, regardless of whether vIOMMU is enabled. There is no need of > > another IOTLB implementation, with the main ask on a v2 MAP/UNMAP > > interface. > > > Or provide GPA to HVA mapping as vhost did. But a question is, I believe > device can only do dirty page logging through IOVA. So how do you handle > the case when IOVA is removed in this case? > That's why a log_sync is required each time an IOVA is unmapped, in Alex's thought. Thanks Kevin
Re: [Qemu-devel] vhost, iova, and dirty page tracking
> From: Alex Williamson [mailto:alex.william...@redhat.com] > Sent: Tuesday, September 17, 2019 10:54 PM > > On Tue, 17 Sep 2019 08:48:36 + > "Tian, Kevin" wrote: > > > > From: Jason Wang [mailto:jasow...@redhat.com] > > > Sent: Monday, September 16, 2019 4:33 PM > > > > > > > > > On 2019/9/16 上午9:51, Tian, Kevin wrote: > > > > Hi, Jason > > > > > > > > We had a discussion about dirty page tracking in VFIO, when vIOMMU > > > > is enabled: > > > > > > > > https://lists.nongnu.org/archive/html/qemu-devel/2019- > > > 09/msg02690.html > > > > > > > > It's actually a similar model as vhost - Qemu cannot interpose the fast- > > > path > > > > DMAs thus relies on the kernel part to track and report dirty page > > > information. > > > > Currently Qemu tracks dirty pages in GFN level, thus demanding a > > > translation > > > > from IOVA to GPA. Then the open in our discussion is where this > > > translation > > > > should happen. Doing the translation in kernel implies a device iotlb > > > flavor, > > > > which is what vhost implements today. It requires potentially large > > > tracking > > > > structures in the host kernel, but leveraging the existing log_sync flow > in > > > Qemu. > > > > On the other hand, Qemu may perform log_sync for every removal of > > > IOVA > > > > mapping and then do the translation itself, then avoiding the GPA > > > awareness > > > > in the kernel side. It needs some change to current Qemu log-sync > flow, > > > and > > > > may bring more overhead if IOVA is frequently unmapped. > > > > > > > > So we'd like to hear about your opinions, especially about how you > came > > > > down to the current iotlb approach for vhost. > > > > > > > > > We don't consider too much in the point when introducing vhost. And > > > before IOTLB, vhost has already know GPA through its mem table > > > (GPA->HVA). So it's nature and easier to track dirty pages at GPA level > > > then it won't any changes in the existing ABI. > > > > This is the same situation as VFIO. 
> > It is? VFIO doesn't know GPAs, it only knows HVA, HPA, and IOVA. In > some cases IOVA is GPA, but not all. Well, I thought vhost has a similar design, that the index of its mem table is GPA when vIOMMU is off and then becomes IOVA when vIOMMU is on. But I may be wrong here. Jason, can you help clarify? I saw two interfaces which poke the mem table: VHOST_SET_MEM_TABLE (for GPA) and VHOST_IOTLB_UPDATE (for IOVA). Are they used exclusively or together? > > > > For VFIO case, the only advantages of using GPA is that the log can then > > > be shared among all the devices that belongs to the VM. Otherwise > > > syncing through IOVA is cleaner. > > > > I still worry about the potential performance impact with this approach. > > In current mdev live migration series, there are multiple system calls > > involved when retrieving the dirty bitmap information for a given memory > > range. IOVA mappings might be changed frequently. Though one may > > argue that frequent IOVA change already has bad performance, it's still > > not good to introduce further non-negligible overhead in such situation. > > > > On the other hand, I realized that adding IOVA awareness in VFIO is > > actually easy. Today VFIO already maintains a full list of IOVA and its > > associated HVA in vfio_dma structure, according to VFIO_MAP and > > VFIO_UNMAP. As long as we allow the latter two operations to accept > > another parameter (GPA), IOVA->GPA mapping can be naturally cached > > in existing vfio_dma objects. Those objects are always updated according > > to MAP and UNMAP ioctls to be up-to-date. Qemu then uniformly > > retrieves the VFIO dirty bitmap for the entire GPA range in every pre-copy > > round, regardless of whether vIOMMU is enabled. There is no need of > > another IOTLB implementation, with the main ask on a v2 MAP/UNMAP > > interface. > > > > Alex, your thoughts? 
> > Same as last time, you're asking VFIO to be aware of an entirely new > address space and implement tracking structures of that address space > to make life easier for QEMU. Don't we typically push such complexity > to userspace rather than into the kernel? I'm not convinced. Thanks, > Is it really complex? There's no need for a new tracking structure. Just allow the MAP interface to carry a new parameter and record it in the existing vfio_dma objects. Note the frequency of guest DMA map/unmap could be very high: we saw >100K invocations per second with a 40G NIC. To do the right translation, Qemu requires a log_sync for every unmap, before the mapping for the logged dirty IOVA becomes stale. In Kirti's current patch, each log_sync requires several system calls through the migration info, e.g. setting start_pfn/page_size/total_pfns and then reading data_offset/data_size. That design is fine for doing log_sync in every pre-copy round, but too costly if doing so for every IOVA unmap. If a small extension in the kernel can lead to a great overhead reduction, why not? Thanks Kevin
Re: [Qemu-devel] vhost, iova, and dirty page tracking
On Tue, 17 Sep 2019 08:48:36 + "Tian, Kevin" wrote: > > From: Jason Wang [mailto:jasow...@redhat.com] > > Sent: Monday, September 16, 2019 4:33 PM > > > > > > On 2019/9/16 上午9:51, Tian, Kevin wrote: > > > Hi, Jason > > > > > > We had a discussion about dirty page tracking in VFIO, when vIOMMU > > > is enabled: > > > > > > https://lists.nongnu.org/archive/html/qemu-devel/2019- > > 09/msg02690.html > > > > > > It's actually a similar model as vhost - Qemu cannot interpose the fast- > > path > > > DMAs thus relies on the kernel part to track and report dirty page > > information. > > > Currently Qemu tracks dirty pages in GFN level, thus demanding a > > translation > > > from IOVA to GPA. Then the open in our discussion is where this > > translation > > > should happen. Doing the translation in kernel implies a device iotlb > > flavor, > > > which is what vhost implements today. It requires potentially large > > tracking > > > structures in the host kernel, but leveraging the existing log_sync flow > > > in > > Qemu. > > > On the other hand, Qemu may perform log_sync for every removal of > > IOVA > > > mapping and then do the translation itself, then avoiding the GPA > > awareness > > > in the kernel side. It needs some change to current Qemu log-sync flow, > > and > > > may bring more overhead if IOVA is frequently unmapped. > > > > > > So we'd like to hear about your opinions, especially about how you came > > > down to the current iotlb approach for vhost. > > > > > > We don't consider too much in the point when introducing vhost. And > > before IOTLB, vhost has already know GPA through its mem table > > (GPA->HVA). So it's nature and easier to track dirty pages at GPA level > > then it won't any changes in the existing ABI. > > This is the same situation as VFIO. It is? VFIO doesn't know GPAs, it only knows HVA, HPA, and IOVA. In some cases IOVA is GPA, but not all. 
> > For VFIO case, the only advantages of using GPA is that the log can then > > be shared among all the devices that belongs to the VM. Otherwise > > syncing through IOVA is cleaner. > > I still worry about the potential performance impact with this approach. > In current mdev live migration series, there are multiple system calls > involved when retrieving the dirty bitmap information for a given memory > range. IOVA mappings might be changed frequently. Though one may > argue that frequent IOVA change already has bad performance, it's still > not good to introduce further non-negligible overhead in such situation. > > On the other hand, I realized that adding IOVA awareness in VFIO is > actually easy. Today VFIO already maintains a full list of IOVA and its > associated HVA in vfio_dma structure, according to VFIO_MAP and > VFIO_UNMAP. As long as we allow the latter two operations to accept > another parameter (GPA), IOVA->GPA mapping can be naturally cached > in existing vfio_dma objects. Those objects are always updated according > to MAP and UNMAP ioctls to be up-to-date. Qemu then uniformly > retrieves the VFIO dirty bitmap for the entire GPA range in every pre-copy > round, regardless of whether vIOMMU is enabled. There is no need of > another IOTLB implementation, with the main ask on a v2 MAP/UNMAP > interface. > > Alex, your thoughts? Same as last time, you're asking VFIO to be aware of an entirely new address space and implement tracking structures of that address space to make life easier for QEMU. Don't we typically push such complexity to userspace rather than into the kernel? I'm not convinced. Thanks, Alex
Re: [Qemu-devel] vhost, iova, and dirty page tracking
On 2019/9/17 下午4:48, Tian, Kevin wrote: From: Jason Wang [mailto:jasow...@redhat.com] Sent: Monday, September 16, 2019 4:33 PM On 2019/9/16 上午9:51, Tian, Kevin wrote: Hi, Jason We had a discussion about dirty page tracking in VFIO, when vIOMMU is enabled: https://lists.nongnu.org/archive/html/qemu-devel/2019- 09/msg02690.html It's actually a similar model as vhost - Qemu cannot interpose the fast- path DMAs thus relies on the kernel part to track and report dirty page information. Currently Qemu tracks dirty pages in GFN level, thus demanding a translation from IOVA to GPA. Then the open in our discussion is where this translation should happen. Doing the translation in kernel implies a device iotlb flavor, which is what vhost implements today. It requires potentially large tracking structures in the host kernel, but leveraging the existing log_sync flow in Qemu. On the other hand, Qemu may perform log_sync for every removal of IOVA mapping and then do the translation itself, then avoiding the GPA awareness in the kernel side. It needs some change to current Qemu log-sync flow, and may bring more overhead if IOVA is frequently unmapped. So we'd like to hear about your opinions, especially about how you came down to the current iotlb approach for vhost. We don't consider too much in the point when introducing vhost. And before IOTLB, vhost has already know GPA through its mem table (GPA->HVA). So it's nature and easier to track dirty pages at GPA level then it won't any changes in the existing ABI. This is the same situation as VFIO. For VFIO case, the only advantages of using GPA is that the log can then be shared among all the devices that belongs to the VM. Otherwise syncing through IOVA is cleaner. I still worry about the potential performance impact with this approach. In current mdev live migration series, there are multiple system calls involved when retrieving the dirty bitmap information for a given memory range. I haven't took a deep look at that series. 
Technically the dirty bitmap could be shared between device and driver; then there's no system call in synchronization. IOVA mappings might be changed frequently. Though one may argue that frequent IOVA changes already have bad performance, it's still not good to introduce further non-negligible overhead in such a situation. Yes, it depends on the behavior of the vIOMMU driver, e.g. the frequency and granularity of the flushing. On the other hand, I realized that adding IOVA awareness in VFIO is actually easy. Today VFIO already maintains a full list of IOVAs and their associated HVAs in the vfio_dma structure, according to VFIO_MAP and VFIO_UNMAP. As long as we allow the latter two operations to accept another parameter (GPA), the IOVA->GPA mapping can be naturally cached in existing vfio_dma objects. Note that the HVA to GPA mapping is not a 1:1 mapping. One HVA range could be mapped to several GPA ranges. Those objects are always updated according to MAP and UNMAP ioctls to stay up-to-date. Qemu then uniformly retrieves the VFIO dirty bitmap for the entire GPA range in every pre-copy round, regardless of whether vIOMMU is enabled. There is no need for another IOTLB implementation, with the main ask being a v2 MAP/UNMAP interface. Or provide a GPA->HVA mapping as vhost did. But one question: I believe the device can only do dirty page logging through IOVA, so how do you handle the case when an IOVA is removed? Thanks Alex, your thoughts? Thanks Kevin
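Jason's aside that the bitmap "could be shared between device and driver" might look like this (purely a sketch under assumed semantics; no such VFIO interface exists today): kernel and userspace share one mmap'd bitmap, so harvesting dirty pages needs no system call at all.

```python
# Conceptual shared dirty bitmap; a bytearray stands in for a page that
# would really be mmap'd between the driver and userspace.
PAGE_SHIFT = 12
bitmap = bytearray(16)  # covers 128 pages

def kernel_mark_dirty(gpa):
    """Driver side: set the bit for a dirtied guest page."""
    pfn = gpa >> PAGE_SHIFT
    bitmap[pfn // 8] |= 1 << (pfn % 8)

def user_collect_dirty(npages):
    """Userspace side: read and clear bits with plain memory accesses,
    no syscall in the synchronization path."""
    dirty = [pfn for pfn in range(npages)
             if bitmap[pfn // 8] & (1 << (pfn % 8))]
    for pfn in dirty:
        bitmap[pfn // 8] &= ~(1 << (pfn % 8))
    return dirty
```

A real implementation would need atomic set and test-and-clear operations so that bits set concurrently by the device are not lost while userspace harvests.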
Re: [Qemu-devel] vhost, iova, and dirty page tracking
> From: Jason Wang [mailto:jasow...@redhat.com] > Sent: Monday, September 16, 2019 4:33 PM > > > On 2019/9/16 上午9:51, Tian, Kevin wrote: > > Hi, Jason > > > > We had a discussion about dirty page tracking in VFIO, when vIOMMU > > is enabled: > > > > https://lists.nongnu.org/archive/html/qemu-devel/2019- > 09/msg02690.html > > > > It's actually a similar model as vhost - Qemu cannot interpose the fast- > path > > DMAs thus relies on the kernel part to track and report dirty page > information. > > Currently Qemu tracks dirty pages in GFN level, thus demanding a > translation > > from IOVA to GPA. Then the open in our discussion is where this > translation > > should happen. Doing the translation in kernel implies a device iotlb > flavor, > > which is what vhost implements today. It requires potentially large > tracking > > structures in the host kernel, but leveraging the existing log_sync flow in > Qemu. > > On the other hand, Qemu may perform log_sync for every removal of > IOVA > > mapping and then do the translation itself, then avoiding the GPA > awareness > > in the kernel side. It needs some change to current Qemu log-sync flow, > and > > may bring more overhead if IOVA is frequently unmapped. > > > > So we'd like to hear about your opinions, especially about how you came > > down to the current iotlb approach for vhost. > > > We don't consider too much in the point when introducing vhost. And > before IOTLB, vhost has already know GPA through its mem table > (GPA->HVA). So it's nature and easier to track dirty pages at GPA level > then it won't any changes in the existing ABI. This is the same situation as VFIO. > > For VFIO case, the only advantages of using GPA is that the log can then > be shared among all the devices that belongs to the VM. Otherwise > syncing through IOVA is cleaner. I still worry about the potential performance impact with this approach. 
In the current mdev live migration series, there are multiple system calls involved when retrieving the dirty bitmap information for a given memory range. IOVA mappings might be changed frequently. Though one may argue that frequent IOVA changes already have bad performance, it's still not good to introduce further non-negligible overhead in such a situation. On the other hand, I realized that adding IOVA awareness in VFIO is actually easy. Today VFIO already maintains a full list of IOVAs and their associated HVAs in the vfio_dma structure, according to VFIO_MAP and VFIO_UNMAP. As long as we allow the latter two operations to accept another parameter (GPA), the IOVA->GPA mapping can be naturally cached in existing vfio_dma objects. Those objects are always updated according to MAP and UNMAP ioctls to stay up-to-date. Qemu then uniformly retrieves the VFIO dirty bitmap for the entire GPA range in every pre-copy round, regardless of whether vIOMMU is enabled. There is no need for another IOTLB implementation; the main ask is a v2 MAP/UNMAP interface. Alex, your thoughts? Thanks Kevin
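The proposed v2 MAP/UNMAP flow can be sketched as a toy Python model (the class and method names are illustrative; no such v2 interface exists today): MAP carries the GPA alongside the IOVA, the kernel translates device-reported IOVA dirty bits through the cached mapping, and QEMU's per-round log_sync is a single query over the GPA range whether or not vIOMMU is in use.

```python
class DirtyTracker:
    """Toy model of the proposed v2 interface: MAP carries a GPA, so the
    kernel can report dirty pages in GPA terms directly."""
    def __init__(self, page=4096):
        self.page = page
        self.iova_to_gpa = {}    # per-page IOVA -> GPA, filled by v2 MAP
        self.dirty_gpas = set()

    def map_v2(self, iova, gpa, npages):
        for i in range(npages):
            self.iova_to_gpa[iova + i * self.page] = gpa + i * self.page

    def device_write(self, iova_page):
        # The device logs through IOVA; the kernel translates immediately
        # using the mapping cached at MAP time.
        self.dirty_gpas.add(self.iova_to_gpa[iova_page])

    def log_sync(self, gpa_start, gpa_end):
        # QEMU issues one GPA-range query per pre-copy round, vIOMMU or not.
        hits = {g for g in self.dirty_gpas if gpa_start <= g < gpa_end}
        self.dirty_gpas -= hits
        return sorted(hits)
```

The point of the sketch is that the GPA awareness lives entirely in the (extended) map call; log_sync itself is unchanged from the no-vIOMMU case.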
Re: [Qemu-devel] vhost, iova, and dirty page tracking
On 2019/9/16 9:51 AM, Tian, Kevin wrote: Hi, Jason We had a discussion about dirty page tracking in VFIO, when vIOMMU is enabled: https://lists.nongnu.org/archive/html/qemu-devel/2019-09/msg02690.html It's actually a similar model to vhost - Qemu cannot interpose the fast-path DMAs and thus relies on the kernel part to track and report dirty page information. Currently Qemu tracks dirty pages at GFN level, thus demanding a translation from IOVA to GPA. The open question in our discussion is where this translation should happen. Doing the translation in the kernel implies a device-iotlb flavor, which is what vhost implements today. It requires potentially large tracking structures in the host kernel, but leverages the existing log_sync flow in Qemu. On the other hand, Qemu may perform log_sync for every removal of an IOVA mapping and then do the translation itself, avoiding GPA awareness on the kernel side. It needs some change to the current Qemu log-sync flow, and may bring more overhead if IOVA is frequently unmapped. So we'd like to hear your opinions, especially about how you came down to the current iotlb approach for vhost. We didn't consider it too much when introducing vhost. And before IOTLB, vhost already knew the GPA through its mem table (GPA->HVA). So it's natural and easier to track dirty pages at the GPA level; then it won't require any changes to the existing ABI. For the VFIO case, the only advantage of using GPA is that the log can then be shared among all the devices that belong to the VM. Otherwise syncing through IOVA is cleaner. Thanks p.s. Alex's comment is also copied here from the original thread. So vhost must then be configuring a listener across system memory rather than only against the device AddressSpace like we do in vfio, such that it gets log_sync() callbacks for the actual GPA space rather than only the IOVA space.
OTOH, QEMU could understand that the device AddressSpace has a translate function and apply the IOVA dirty bits to the system memory AddressSpace. Wouldn't it make more sense for QEMU to perform a log_sync() prior to removing a MemoryRegionSection within an AddressSpace and update the GPA rather than pushing GPA awareness and potentially large tracking structures into the host kernel? Thanks Kevin
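Alex's alternative can be sketched as a toy Python model (class and attribute names are illustrative, not QEMU's actual API): the kernel logs only at IOVA granularity, and QEMU, which already holds the vIOMMU translate state, performs a log_sync() before tearing down a mapping so the IOVA dirty bits can still be applied to the system memory AddressSpace.

```python
PAGE = 4096

class ViommuQemu:
    """Toy model: userspace translates IOVA dirty bits to GPA itself."""
    def __init__(self):
        self.translate = {}           # IOVA page -> GPA page (vIOMMU state)
        self.kernel_iova_log = set()  # what the kernel would hand back
        self.gpa_dirty = set()        # system-memory AddressSpace dirty bits

    def iommu_map(self, iova, gpa):
        self.translate[iova] = gpa

    def iommu_unmap(self, iova):
        # Sync before the mapping (the MemoryRegionSection, in QEMU terms)
        # is removed, while IOVA->GPA translation is still available.
        self.log_sync()
        del self.translate[iova]

    def log_sync(self):
        for iova in list(self.kernel_iova_log):
            self.gpa_dirty.add(self.translate[iova])
        self.kernel_iova_log.clear()
```

The trade visible in the sketch: the kernel stays GPA-unaware and small, at the cost of an extra log_sync() on every IOVA unmap, which is exactly the overhead concern when IOVA mappings churn frequently.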
[Qemu-devel] vhost, iova, and dirty page tracking
Hi, Jason We had a discussion about dirty page tracking in VFIO, when vIOMMU is enabled: https://lists.nongnu.org/archive/html/qemu-devel/2019-09/msg02690.html It's actually a similar model to vhost - Qemu cannot interpose the fast-path DMAs and thus relies on the kernel part to track and report dirty page information. Currently Qemu tracks dirty pages at GFN level, thus demanding a translation from IOVA to GPA. The open question in our discussion is where this translation should happen. Doing the translation in the kernel implies a device-iotlb flavor, which is what vhost implements today. It requires potentially large tracking structures in the host kernel, but leverages the existing log_sync flow in Qemu. On the other hand, Qemu may perform log_sync for every removal of an IOVA mapping and then do the translation itself, avoiding GPA awareness on the kernel side. It needs some change to the current Qemu log-sync flow, and may bring more overhead if IOVA is frequently unmapped. So we'd like to hear your opinions, especially about how you came down to the current iotlb approach for vhost. p.s. Alex's comment is also copied here from the original thread. > So vhost must then be configuring a listener across system memory rather than only against the device AddressSpace like we do in vfio, such that it gets log_sync() callbacks for the actual GPA space rather than only the IOVA space. OTOH, QEMU could understand that the device AddressSpace has a translate function and apply the IOVA dirty bits to the system memory AddressSpace. Wouldn't it make more sense for QEMU to perform a log_sync() prior to removing a MemoryRegionSection within an AddressSpace and update the GPA rather than pushing GPA awareness and potentially large tracking structures into the host kernel? Thanks Kevin