Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages

2016-04-25 Thread Michael S. Tsirkin
On Tue, Apr 19, 2016 at 08:12:02PM +0100, Dr. David Alan Gilbert wrote:
> * Michael S. Tsirkin (m...@redhat.com) wrote:
> > On Mon, Apr 18, 2016 at 11:08:31AM +, Li, Liang Z wrote:
> > > Hi Dave,
> > > 
> > > I am now working on how to benefit post-copy by skipping the free pages, 
> > > and I remember you have said we should let the destination know the info
> > > of free pages so as to avoid request the free pages from the source. 
> > > 
> > > We have two solutions:
> > > 
> > > a. send the migration dirty page bitmap to destination before post
> > > copy start, so the destination can decide whether to request the pages or 
> > > place zero pages by checking the migration dirty page bitmap. The 
> > > advantage
> > > is that we can avoid sending the free pages. the disadvantage is that we 
> > > have 
> > > to send extra data to destination.
> > > 
> > > b. Check the page request on the source side, if it's not a dirty page, 
> > > send a zero
> > > page header to the destination.
> > > 
> > > What's your opinion about them?
> > > 
> > > Liang
> > > 
> > 
> > Both are ad-hoc solutions imho.
> > 
> > c. put the bitmap in a ramblock, check it on destination before
> >requesting pages.
> > 
> > This way it's migrated on-demand.
> 
> I can see where you're coming from, but I don't like this idea, because
> sending data controlling the RAM migration process in RAM blocks controlled
> by the same data just sounds too recursive to ever debug.
> 
> Dave

Thinking some more about it, we could send the request for pages from the
destination to the guest.  Once we get the free pages, we send them to the
source with a flag that says "I don't really need these pages".  Could be a
new message with a bitmap, or just multiple existing ones that request pages.

The source marks these pages as sent without actually sending them.
If it sees it has sent all pages, it exits as previously.
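
A minimal sketch of the source-side bookkeeping such an "I don't really need
these pages" message would imply; the names (migration_bitmap, pages_remaining,
mark_free_pages_sent) are invented for illustration and are not QEMU APIs:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define TOTAL_PAGES 64ULL

/* 1 bit per page: 1 = still needs to be sent, 0 = sent (or skipped). */
static uint64_t migration_bitmap[(TOTAL_PAGES + 63) / 64];
static uint64_t pages_remaining;

/* Hypothetical handler for the "I don't really need these pages" message:
 * clear the bits so the pages count as sent, but transmit nothing. */
static void mark_free_pages_sent(const uint64_t *free_bitmap)
{
    for (uint64_t i = 0; i < TOTAL_PAGES; i++) {
        uint64_t word = i / 64, bit = 1ULL << (i % 64);
        if ((free_bitmap[word] & bit) && (migration_bitmap[word] & bit)) {
            migration_bitmap[word] &= ~bit;   /* marked as sent */
            pages_remaining--;                /* source exits when this hits 0 */
        }
    }
}

int main(void)
{
    memset(migration_bitmap, 0xff, sizeof(migration_bitmap));
    pages_remaining = TOTAL_PAGES;

    uint64_t free_bitmap[(TOTAL_PAGES + 63) / 64] = { 0x0f0f0f0f0f0f0f0fULL };
    mark_free_pages_sent(free_bitmap);
    printf("pages still to send: %llu\n", (unsigned long long)pages_remaining);
    return 0;
}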

> > 
> > -- 
> > MST
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages

2016-04-20 Thread Dr. David Alan Gilbert
* Li, Liang Z (liang.z...@intel.com) wrote:
> > Subject: Re: [RFC Design Doc]Speed up live migration by skipping free pages
> > 
> > * Li, Liang Z (liang.z...@intel.com) wrote:
> > > Hi Dave,
> > >
> > > I am now working on how to benefit post-copy by skipping the free
> > > pages, and I remember you have said we should let the destination know
> > > the info of free pages so as to avoid request the free pages from the
> > source.
> > >
> > > We have two solutions:
> > >
> > > a. send the migration dirty page bitmap to destination before post
> > > copy start, so the destination can decide whether to request the pages
> > > or place zero pages by checking the migration dirty page bitmap. The
> > > advantage is that we can avoid sending the free pages. the
> > > disadvantage is that we have to send extra data to destination.
> > >
> > > b. Check the page request on the source side, if it's not a dirty
> > > page, send a zero page header to the destination.
> > >
> > > What's your opinion about them?
> > 
> > (b) is certainly simpler - and requires no changes on the destination side 
> > or
> > the protocol.
> > If you then decided to add stuff to send the dirty page bit map later you
> > could do.
> > 
> > However, there are some other problems to figure out:
> >1) The source side quits when it thinks it's sent all pages; when is your
> >source going to quit?  If it quits while the destination still has
> >unfulfilled pages then the destination will fail.
> 
> The source quits the same as before, but before quitting it tells the destination
> that it has quit.  After that, the destination doesn't need to request pages from
> the source; it just places zero pages.  Does that work?

Yes, maybe. The destination side would somehow have to clean up once it has all
the zero pages, but it currently doesn't keep a count or map of which pages
still need to be received.
Actually, perhaps that's easy - when the destination receives the 'quit, it's zero'
message from the source, maybe it just turns off userfault; any fresh accesses
would get a zero page.  However, I'm not sure what happens to pages that are
already blocked/waiting for a page - that we'd need to check with Andrea/test.
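
If "turning off userfault" ends up being done per RAM range, it could look
roughly like this sketch, which uses the standard userfaultfd UFFDIO_UNREGISTER
ioctl; uffd, host_addr and region_size are assumed to come from the existing
postcopy setup, and the open question above about threads already blocked on a
fault is deliberately not addressed:

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Rough sketch: stop trapping faults on a guest RAM range, so any page that
 * was never placed gets backed by a fresh anonymous (zero) page on the next
 * access.  Error handling is minimal on purpose. */
static int stop_userfault_on_range(int uffd, void *host_addr, size_t region_size)
{
    struct uffdio_range range = {
        .start = (unsigned long)host_addr,
        .len   = region_size,
    };

    if (ioctl(uffd, UFFDIO_UNREGISTER, &range) < 0) {
        fprintf(stderr, "UFFDIO_UNREGISTER failed: %s\n", strerror(errno));
        return -1;
    }
    return 0;
}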

> >2) I sent a 'discard' bitmap of pages for the destination to unmap
> >   just at the change into postcopy; so I'm already sending one bitmap;
> >   this is for pages that have been changed since they were first sent
> >   but not yet resent.
> >   Be careful about how any changes you make interact with the generation
> >   of that bitmap.
> 
> Thanks for the reminder.
> 
> >3) It's potentially very slow if the destination has to keep requesting
> >   blank pages.
> 
> Yes, really.
> 
> > Essentially what you're suggesting for (a) is a way to send a compressed set
> > of 'page is zero' messages based on a bitmap, and you're worried about the
> > time to send it - which I think is where we started the conversation about
> > time to deal with zeros :-).  Two ways to think of that are:
> 
> All my thoughts are in your words. :)
> 
> >4) I already send one bitmap - so you're only doubling it in theory;
> >   I originally used a sparse bitmap but the suggestion was it was
> >   more complex than needed and it turned into more of a run-length
> > encoding.
> >5) You're worried it would increase the downtime as you send the bitmap;
> > however
> >   if you implement (b) as well as (a) then you can send the data for
> >   (a) after the destination is running and not increase the downtime.
> 
> The downtime is the main reason I started to consider (b); for a VM with a
> huge amount of RAM, the downtime will become a big problem.  Obviously, (a)
> is more efficient than (b).

With your idea about sending a 'quit' message to tell the destination the
remaining pages are all zero, I'm not sure that's true - (b) + the quit message
sounds like a good combination.

Dave
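
A toy sketch of what the (b) + quit combination means on the source side:
answer each page request by checking the migration dirty bitmap and sending a
zero-page header when the page was never dirtied.  The helper names and the
"wire format" here are placeholders, not QEMU's actual ones:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12

/* Hypothetical stand-ins for the real migration state and stream format. */
static bool test_dirty(const unsigned long *dirty_bitmap, uint64_t pfn)
{
    return dirty_bitmap[pfn / (8 * sizeof(unsigned long))] &
           (1UL << (pfn % (8 * sizeof(unsigned long))));
}

static void send_page_data(uint64_t pfn)   { printf("send data pfn=%llu\n", (unsigned long long)pfn); }
static void send_zero_header(uint64_t pfn) { printf("send zero pfn=%llu\n", (unsigned long long)pfn); }

/* Option (b): if the requested page is not marked dirty on the source,
 * reply with a zero-page header instead of the page contents. */
static void handle_page_request(const unsigned long *dirty_bitmap, uint64_t gpa)
{
    uint64_t pfn = gpa >> PAGE_SHIFT;

    if (test_dirty(dirty_bitmap, pfn)) {
        send_page_data(pfn);
    } else {
        send_zero_header(pfn);
    }
}

int main(void)
{
    unsigned long dirty[1] = { 0x5 };             /* only pages 0 and 2 are dirty */
    handle_page_request(dirty, 0 << PAGE_SHIFT);  /* dirty -> real data */
    handle_page_request(dirty, 1 << PAGE_SHIFT);  /* clean -> zero header */
    return 0;
}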

> 
> 
> > Dave
> > 
> > --
> > Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages

2016-04-19 Thread Li, Liang Z
> Subject: Re: [RFC Design Doc]Speed up live migration by skipping free pages
> 
> * Li, Liang Z (liang.z...@intel.com) wrote:
> > Hi Dave,
> >
> > I am now working on how to benefit post-copy by skipping the free
> > pages, and I remember you have said we should let the destination know
> > the info of free pages so as to avoid request the free pages from the
> source.
> >
> > We have two solutions:
> >
> > a. send the migration dirty page bitmap to destination before post
> > copy start, so the destination can decide whether to request the pages
> > or place zero pages by checking the migration dirty page bitmap. The
> > advantage is that we can avoid sending the free pages. the
> > disadvantage is that we have to send extra data to destination.
> >
> > b. Check the page request on the source side, if it's not a dirty
> > page, send a zero page header to the destination.
> >
> > What's your opinion about them?
> 
> (b) is certainly simpler - and requires no changes on the destination side or
> the protocol.
> If you then decided to add stuff to send the dirty page bit map later you
> could do.
> 
> However, there are some other problems to figure out:
>1) The source side quits when it thinks it's sent all pages; when is your
>source going to quit?  If it quits while the destination still has
>unfulfilled pages then the destination will fail.

The source quits the same as before, but before quitting it tells the destination
that it has quit.  After that, the destination doesn't need to request pages from
the source; it just places zero pages.  Does that work?

>2) I sent a 'discard' bitmap of pages for the destination to unmap
>   just at the change into postcopy; so I'm already sending one bitmap;
>   this is for pages that have been changed since they were first sent
>   but not yet resent.
>   Be careful about how any changes you make interact with the generation
>   of that bitmap.

Thanks for the reminder.

>3) It's potentially very slow if the destination has to keep requesting
>   blank pages.

Yes, really.

> Essentially what you're suggesting for (a) is a way to send a compressed set
> of 'page is zero' messages based on a bitmap, and you're worried about the
> time to send it - which I think is where we started the conversation about
> time to deal with zeros :-).  Two ways to think of that are:

All my thoughts are in your words. :)

>4) I already send one bitmap - so you're only doubling it in theory;
>   I originally used a sparse bitmap but the suggestion was it was
>   more complex than needed and it turned into more of a run-length
> encoding.
>5) You're worried it would increase the downtime as you send the bitmap;
> however
>   if you implement (b) as well as (a) then you can send the data for
>   (a) after the destination is running and not increase the downtime.

The downtime is the main reason I started to consider (b); for a VM with a
huge amount of RAM, the downtime will become a big problem.  Obviously, (a)
is more efficient than (b).


> Dave
> 
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages

2016-04-19 Thread Dr. David Alan Gilbert
* Michael S. Tsirkin (m...@redhat.com) wrote:
> On Mon, Apr 18, 2016 at 11:08:31AM +, Li, Liang Z wrote:
> > Hi Dave,
> > 
> > I am now working on how to benefit post-copy by skipping the free pages, 
> > and I remember you have said we should let the destination know the info
> > of free pages so as to avoid request the free pages from the source. 
> > 
> > We have two solutions:
> > 
> > a. send the migration dirty page bitmap to destination before post
> > copy start, so the destination can decide whether to request the pages or 
> > place zero pages by checking the migration dirty page bitmap. The advantage
> > is that we can avoid sending the free pages. the disadvantage is that we 
> > have 
> > to send extra data to destination.
> > 
> > b. Check the page request on the source side, if it's not a dirty page, 
> > send a zero
> > page header to the destination.
> > 
> > What's your opinion about them?
> > 
> > Liang
> > 
> 
> Both are ad-hoc solutions imho.
> 
> c. put the bitmap in a ramblock, check it on destination before
>requesting pages.
> 
> This way it's migrated on-demand.

I can see where you're coming from, but I don't like this idea, because
sending data controlling the RAM migration process in RAM blocks controlled
by the same data just sounds too recursive to ever debug.

Dave

> 
> -- 
> MST
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages

2016-04-19 Thread Dr. David Alan Gilbert
* Li, Liang Z (liang.z...@intel.com) wrote:
> Hi Dave,
> 
> I am now working on how to benefit post-copy by skipping the free pages, 
> and I remember you have said we should let the destination know the info
> of free pages so as to avoid request the free pages from the source. 
> 
> We have two solutions:
> 
> a. send the migration dirty page bitmap to destination before post
> copy start, so the destination can decide whether to request the pages or 
> place zero pages by checking the migration dirty page bitmap. The advantage
> is that we can avoid sending the free pages. the disadvantage is that we have 
> to send extra data to destination.
> 
> b. Check the page request on the source side, if it's not a dirty page, send 
> a zero
> page header to the destination.
> 
> What's your opinion about them?

(b) is certainly simpler - and requires no changes on the destination side
or the protocol.
If you then decided to add stuff to send the dirty page bit map later
you could do.

However, there are some other problems to figure out:
   1) The source side quits when it thinks it's sent all pages; when is your
   source going to quit?  If it quits while the destination still has
   unfulfilled pages then the destination will fail.
   2) I sent a 'discard' bitmap of pages for the destination to unmap
  just at the change into postcopy; so I'm already sending one bitmap;
  this is for pages that have been changed since they were first sent
  but not yet resent.
  Be careful about how any changes you make interact with the generation
  of that bitmap.
   3) It's potentially very slow if the destination has to keep requesting
  blank pages.

Essentially what you're suggesting for (a) is a way to send a compressed
set of 'page is zero' messages based on a bitmap, and you're worried about
the time to send it - which I think is where we started the conversation
about time to deal with zeros :-).  Two ways to think of that are:
   4) I already send one bitmap - so you're only doubling it in theory;
  I originally used a sparse bitmap but the suggestion was it was
  more complex than needed and it turned into more of a run-length encoding.
   5) You're worried it would increase the downtime as you send the bitmap;
  however, if you implement (b) as well as (a) then you can send the data for
  (a) after the destination is running and not increase the downtime.
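
For point (4), a minimal run-length encoder over a page bitmap might look like
the sketch below; the emit() callback and the (start, length) run format are
hypothetical, not the actual discard-message format used by QEMU:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static bool bit_set(const uint8_t *bm, uint64_t i)
{
    return bm[i / 8] & (1u << (i % 8));
}

/* Walk the bitmap and emit one (start, length) pair per run of set bits. */
static void rle_encode(const uint8_t *bm, uint64_t nbits,
                       void (*emit)(uint64_t start, uint64_t len))
{
    uint64_t i = 0;
    while (i < nbits) {
        if (!bit_set(bm, i)) { i++; continue; }
        uint64_t start = i;
        while (i < nbits && bit_set(bm, i)) i++;
        emit(start, i - start);
    }
}

static void print_run(uint64_t start, uint64_t len)
{
    printf("discard pages %llu..%llu\n",
           (unsigned long long)start, (unsigned long long)(start + len - 1));
}

int main(void)
{
    uint8_t bitmap[2] = { 0x3c, 0x81 };   /* set bits: 2-5, 8, 15 */
    rle_encode(bitmap, 16, print_run);
    return 0;
}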

Dave

--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages

2016-04-18 Thread Li, Liang Z
> > > > I am now working on how to benefit post-copy by skipping the free
> > > > pages, and I remember you have said we should let the destination
> > > > know the info of free pages so as to avoid request the free pages
> > > > from the
> > > source.
> > > >
> > > > We have two solutions:
> > > >
> > > > a. send the migration dirty page bitmap to destination before post
> > > > copy start, so the destination can decide whether to request the
> > > > pages or place zero pages by checking the migration dirty page
> > > > bitmap. The advantage is that we can avoid sending the free pages.
> > > > the disadvantage is that we have to send extra data to destination.
> > > >
> > > > b. Check the page request on the source side, if it's not a dirty
> > > > page, send a zero page header to the destination.
> > > >
> > > > What's your opinion about them?
> > > >
> > > > Liang
> > > >
> > >
> > > Both are ad-hoc solutions imho.
> > >
> > > c. put the bitmap in a ramblock, check it on destination before
> > >requesting pages.
> > >
> > > This way it's migrated on-demand.
> > >
> > Hi MST,
> >
> > I think you mean  putting the free page bitmap in a ramblock. Right?
> > If some of the free pages become dirty after updating the free page
> > bitmap, and these pages are discarded by destination, how can we
> > distinguish these discarded pages with the free pages?
> >
> > Could you elaborate how it works?
> >
> > Thanks!
> > Liang
> 
> Maybe I'm confused - IIUC it's postcopy so VM is running on destination, if
> page is dirty it was modified there so we don't need to get it from source.
> 

The current post-copy is implemented as pre-copy plus post-copy: there is a
stage where the source tells the destination to discard the newly dirtied pages
before the VM starts running on the destination.  I don't think the free page
bitmap will help; we need a dirty page bitmap.  The dirty page bitmap should not
be put in a ramblock, since it does not belong to the guest.

> But really I agree with David here - at step 1 just ignore post-copy, don't
> special-case it, even if it becomes slower with your patch.
> Think about it later.
> 

Yes, we can simply ignore the free pages if post-copy is enabled.  Deciding that
now will make my work easier. :)

Thanks!

Liang

> >
> >
> > > --
> > > MST
> > > --



Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages

2016-04-18 Thread Michael S. Tsirkin
On Mon, Apr 18, 2016 at 02:36:31PM +, Li, Liang Z wrote:
> > On Mon, Apr 18, 2016 at 11:08:31AM +, Li, Liang Z wrote:
> > > Hi Dave,
> > >
> > > I am now working on how to benefit post-copy by skipping the free
> > > pages, and I remember you have said we should let the destination know
> > > the info of free pages so as to avoid request the free pages from the
> > source.
> > >
> > > We have two solutions:
> > >
> > > a. send the migration dirty page bitmap to destination before post
> > > copy start, so the destination can decide whether to request the pages
> > > or place zero pages by checking the migration dirty page bitmap. The
> > > advantage is that we can avoid sending the free pages. the
> > > disadvantage is that we have to send extra data to destination.
> > >
> > > b. Check the page request on the source side, if it's not a dirty
> > > page, send a zero page header to the destination.
> > >
> > > What's your opinion about them?
> > >
> > > Liang
> > >
> > 
> > Both are ad-hoc solutions imho.
> > 
> > c. put the bitmap in a ramblock, check it on destination before
> >requesting pages.
> > 
> > This way it's migrated on-demand.
> > 
> Hi MST,
> 
> I think you mean  putting the free page bitmap in a ramblock. Right?
> If some of the free pages become dirty after updating the free page bitmap,
> and these pages are discarded by destination, how can we distinguish these
> discarded pages with the free pages?
> 
> Could you elaborate how it works?
> 
> Thanks!
> Liang

Maybe I'm confused - IIUC it's postcopy so VM is running on destination,
if page is dirty it was modified there so we don't need to get it from
source.

But really I agree with David here - at step 1 just ignore postcopy,
don't special-case it, even if it becomes slower with your patch.
Think about it later.

> 
> 
> > --
> > MST
> > --



Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages

2016-04-18 Thread Li, Liang Z
> On Mon, Apr 18, 2016 at 11:08:31AM +, Li, Liang Z wrote:
> > Hi Dave,
> >
> > I am now working on how to benefit post-copy by skipping the free
> > pages, and I remember you have said we should let the destination know
> > the info of free pages so as to avoid request the free pages from the
> source.
> >
> > We have two solutions:
> >
> > a. send the migration dirty page bitmap to destination before post
> > copy start, so the destination can decide whether to request the pages
> > or place zero pages by checking the migration dirty page bitmap. The
> > advantage is that we can avoid sending the free pages. the
> > disadvantage is that we have to send extra data to destination.
> >
> > b. Check the page request on the source side, if it's not a dirty
> > page, send a zero page header to the destination.
> >
> > What's your opinion about them?
> >
> > Liang
> >
> 
> Both are ad-hoc solutions imho.
> 
> c. put the bitmap in a ramblock, check it on destination before
>requesting pages.
> 
> This way it's migrated on-demand.
> 
Hi MST,

I think you mean putting the free page bitmap in a ramblock, right?
If some of the free pages become dirty after the free page bitmap is updated,
and these pages are discarded by the destination, how can we distinguish these
discarded pages from the free pages?

Could you elaborate how it works?

Thanks!
Liang



> --
> MST
> --



Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages

2016-04-18 Thread Michael S. Tsirkin
On Mon, Apr 18, 2016 at 11:08:31AM +, Li, Liang Z wrote:
> Hi Dave,
> 
> I am now working on how to benefit post-copy by skipping the free pages, 
> and I remember you have said we should let the destination know the info
> of free pages so as to avoid request the free pages from the source. 
> 
> We have two solutions:
> 
> a. send the migration dirty page bitmap to destination before post
> copy start, so the destination can decide whether to request the pages or 
> place zero pages by checking the migration dirty page bitmap. The advantage
> is that we can avoid sending the free pages. the disadvantage is that we have 
> to send extra data to destination.
> 
> b. Check the page request on the source side, if it's not a dirty page, send 
> a zero
> page header to the destination.
> 
> What's your opinion about them?
> 
> Liang
> 

Both are ad-hoc solutions imho.

c. put the bitmap in a ramblock, check it on destination before
   requesting pages.

This way it's migrated on-demand.
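
A rough sketch of the destination-side check (c) implies: before asking the
source for a faulting page, consult the (migrated) free-page bitmap and
zero-fill locally if the page was reported free.  The helper names are made up
for illustration; in the proposal the bitmap itself would live in a RAMBlock
and be fetched on demand like any other guest memory:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static void place_zero_page(uint64_t pfn)     { printf("zero-fill pfn=%llu\n", (unsigned long long)pfn); }
static void request_from_source(uint64_t pfn) { printf("request   pfn=%llu\n", (unsigned long long)pfn); }

static bool page_is_free(const uint8_t *free_bitmap, uint64_t pfn)
{
    return free_bitmap[pfn / 8] & (1u << (pfn % 8));
}

/* Destination-side decision for a faulting page under option (c). */
static void handle_fault(const uint8_t *free_bitmap, uint64_t pfn)
{
    if (page_is_free(free_bitmap, pfn)) {
        place_zero_page(pfn);           /* no round trip to the source */
    } else {
        request_from_source(pfn);
    }
}

int main(void)
{
    uint8_t free_bitmap[1] = { 0x02 };  /* only pfn 1 was reported free */
    handle_fault(free_bitmap, 0);
    handle_fault(free_bitmap, 1);
    return 0;
}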

-- 
MST



Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages

2016-04-18 Thread Li, Liang Z
Hi Dave,

I am now working on how to benefit post-copy by skipping the free pages,
and I remember you have said we should let the destination know the info
about free pages so as to avoid requesting the free pages from the source.

We have two solutions:

a. Send the migration dirty page bitmap to the destination before post-copy
starts, so the destination can decide whether to request the pages or place
zero pages by checking the migration dirty page bitmap.  The advantage is that
we can avoid sending the free pages; the disadvantage is that we have to send
extra data to the destination.

b. Check the page request on the source side; if it's not a dirty page, send a
zero-page header to the destination.

What's your opinion about them?

Liang






Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages

2016-04-04 Thread Li, Liang Z
> On (Tue) 22 Mar 2016 [19:05:31], Dr. David Alan Gilbert wrote:
> > * Liang Li (liang.z...@intel.com) wrote:
> 
> > > b. Implement a new virtio device
> > > Implementing a brand new virtio device to exchange information
> > > between host and guest is another choice. It requires modifying the
> > > virtio spec too.
> >
> > If the right solution is to change the spec then we should do it; we
> > shouldn't use a technically worse solution just to avoid the spec
> > change; although we have to be even more careful to get the right
> > solution if we want to change the spec.
> 
> Yes, absolutely.  I didn't mean to suggest virtio-serial because it'll help 
> us save
> with modifying specs.  I suggested it because for me, the most important
> point was that if a guest is sensitive to latencies or spikes, the balloon 
> driver
> is definitely going to be not installed by the guest admin.
> 
> (I'm still not caught up on this thread, but wanted to clarify this right 
> away.)
> 
>   Amit

Amit, thanks for your reply.

If the balloon driver is installed but there are no inflate/deflate operations,
does it still affect the guest's performance?

Finally, we have to make a choice among these solutions:
a new virtio device, virtio-balloon, or virtio-serial.

Which one do you think is better? Maybe we should have more discussion
about this.

Thanks!
Liang






Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages

2016-04-01 Thread Amit Shah
On (Tue) 22 Mar 2016 [19:05:31], Dr. David Alan Gilbert wrote:
> * Liang Li (liang.z...@intel.com) wrote:

> > b. Implement a new virtio device
> > Implementing a brand new virtio device to exchange information
> > between host and guest is another choice. It requires modifying the
> > virtio spec too.
> 
> If the right solution is to change the spec then we should do it;
> we shouldn't use a technically worse solution just to avoid the spec
> change; although we have to be even more careful to get the right
> solution if we want to change the spec.

Yes, absolutely.  I didn't mean to suggest virtio-serial because it'll
help us save with modifying specs.  I suggested it because for me, the
most important point was that if a guest is sensitive to latencies or
spikes, the balloon driver is definitely going to be not installed by
the guest admin.

(I'm still not caught up on this thread, but wanted to clarify this
right away.)

Amit



Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages

2016-03-24 Thread Li, Liang Z
> > > > > > > > > The order I'm trying to understand is something like:
> > > > > > > > >
> > > > > > > > > a) Send the get_free_page_bitmap request
> > > > > > > > > b) Start sending pages
> > > > > > > > > c) Reach the end of memory
> > > > > > > > >   [ is_ready is false - guest hasn't made free map yet ]
> > > > > > > > > d) normal migration_bitmap_sync() at end of first pass
> > > > > > > > > e) Carry on sending dirty pages
> > > > > > > > > f) is_ready is true
> > > > > > > > >   f.1) filter out free pages?
> > > > > > > > >   f.2) migration_bitmap_sync()
> > > > > > > > >
> > > > > > > > > It's f.1 I'm worried about.  If the guest started
> > > > > > > > > generating the free bitmap before (d), then a page
> > > > > > > > > marked as 'free' in f.1 might have become dirty before
> > > > > > > > > (d) and so (f.2) doesn't set the dirty again, and so we can't
> filter out pages in f.1.
> > > > > > > > >
> > > > > > > >
> > > > > > > > As you described, the order is incorrect.
> > > > > > > >
> > > > > > > > Liang
> > > > > > >
> > > > > > >
> > > > > > > So to make it safe, what is required is to make sure no free
> > > > > > > list us outstanding before calling migration_bitmap_sync.
> > > > > > >
> > > > > > > If one is outstanding, filter out pages before calling
> > > > > migration_bitmap_sync.
> > > > > > >
> > > > > > > Of course, if we just do it like we normally do with
> > > > > > > migration, then by the time we call migration_bitmap_sync
> > > > > > > dirty bitmap is completely empty, so there won't be anything to
> filter out.
> > > > > > >
> > > > > > > One way to address this is call migration_bitmap_sync in the
> > > > > > > IO handler, while VCPU is stopped, then make sure to filter
> > > > > > > out pages before the next migration_bitmap_sync.
> > > > > > >
> > > > > > > Another is to start filtering out pages upon IO handler, but
> > > > > > > make sure to flush the queue before calling
> migration_bitmap_sync.
> > > > > > >
> > > > > >
> > > > > > It's really complex, maybe we should switch to a simple start,
> > > > > > just skip the free page in the ram bulk stage and make it
> asynchronous?
> > > > > >
> > > > > > Liang
> > > > >
> > > > > You mean like your patches do? No, blocking bulk migration until
> > > > > guest response is basically a non-starter.
> > > > >
> > > >
> > > > No, don't wait anymore. Like below (copy from previous thread)
> > > > --
> > > > 1. Set all the bits in the migration_bitmap_rcu->bmap to 1 2.
> > > > Clear all the bits in
> > > > ram_list.dirty_memory[DIRTY_MEMORY_MIGRATION]
> > > > 3. Send the get_free_page_bitmap request 4. Start to send  pages
> > > > to destination and check if the free_page_bitmap is ready
> > > >if (is_ready) {
> > > >  filter out the free pages from  migration_bitmap_rcu->bmap;
> > > >  migration_bitmap_sync();
> > > >  }
> > > > continue until live migration complete.
> > > > ---
> > > > Can this work?
> > > >
> > > > Liang
> > >
> > > Not if you get the ready bit asynchronously like you wrote here
> > > since is_ready can get set while you called migration_bitmap_sync.
> > >
> > > As I said previously,
> > > to make this work you need to filter out synchronously while VCPU is
> > > stopped and while free pages from list are not being used.
> > >
> > > Alternatively prevent getting free page list and filtering them out
> > > from guest from racing with migration_bitmap_sync.
> > >
> > > For example, flush the VQ after migration_bitmap_sync.
> > > So:
> > >
> > > lock
> > > migration_bitmap_sync();
> > > while (elem = virtqueue_pop) {
> > > virtqueue_push(elem)
> > > g_free(elem)
> > > }
> > > unlock
> > >
> > >
> > > while in handle_output
> > >
> > > lock
> > > while (elem = virtqueue_pop) {
> > > list = get_free_list(elem)
> > > filter_out_free(list)
> > > virtqueue_push(elem)
> > > free(elem)
> > > }
> > > unlock
> > >
> > >
> > > lock prevents migration_bitmap_sync from racing against
> > > handle_output
> >
> > I think the easier way is just to ignore the guests free list response
> > if it comes back after the first pass.
> >
> > Dave
> 
> That's a subset of course - after the first pass == after
> migration_bitmap_sync.
> 
> But it's really nasty - for example, how do you know it's the response from
> this migration round and not a previous one?

It's easy: adding a request and response ID can solve this issue.

> It is really better to just keep things orthogonal and not introduce arbitrary
> limitations.
> 
> 
> For example, with post-copy there's no first pass, and it can still benefit 
> from
> this optimization.
> 

Leave this to Dave ...

Liang
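
A toy illustration of the request/response ID idea; the fields and helpers
here are invented for the example and are not part of the virtio-balloon
interface:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static uint32_t current_req_id;

/* Each free-page-bitmap request carries a fresh ID. */
static uint32_t send_free_page_request(void)
{
    current_req_id++;
    printf("request id=%u sent to guest\n", current_req_id);
    return current_req_id;
}

/* A response is only applied if it answers the most recent request. */
static bool accept_free_page_response(uint32_t resp_id)
{
    if (resp_id != current_req_id) {
        printf("drop stale response id=%u (current=%u)\n",
               resp_id, current_req_id);
        return false;
    }
    printf("apply response id=%u\n", resp_id);
    return true;
}

int main(void)
{
    uint32_t old = send_free_page_request();
    send_free_page_request();                 /* a new round starts */
    accept_free_page_response(old);           /* late reply from the old round: ignored */
    accept_free_page_response(current_req_id);
    return 0;
}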

> 
> > >
> > >
> > > This way you can actually use ioeventfd for this VQ so VCPU won't be
> > > blocked.
> > >
> > > I do not 

Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages

2016-03-24 Thread Li, Liang Z
> > > On Thu, Mar 24, 2016 at 03:53:25PM +, Li, Liang Z wrote:
> > > > > > > > Not very complex, we can implement like this:
> > > > > > > >
> > > > > > > > 1. Set all the bits in the migration_bitmap_rcu->bmap to 1 2.
> > > > > > > > Clear all the bits in ram_list.
> > > > > > > > dirty_memory[DIRTY_MEMORY_MIGRATION]
> > > > > > > > 3. Send the get_free_page_bitmap request 4. Start to send
> > > > > > > > pages to destination and check if the free_page_bitmap is ready
> > > > > > > > if (is_ready) {
> > > > > > > >   filter out the free pages from  migration_bitmap_rcu-
> >bmap;
> > > > > > > >   migration_bitmap_sync();
> > > > > > > > }
> > > > > > > >  continue until live migration complete.
> > > > > > > >
> > > > > > > >
> > > > > > > > Is that right?
> > > > > > >
> > > > > > > The order I'm trying to understand is something like:
> > > > > > >
> > > > > > > a) Send the get_free_page_bitmap request
> > > > > > > b) Start sending pages
> > > > > > > c) Reach the end of memory
> > > > > > >   [ is_ready is false - guest hasn't made free map yet ]
> > > > > > > d) normal migration_bitmap_sync() at end of first pass
> > > > > > > e) Carry on sending dirty pages
> > > > > > > f) is_ready is true
> > > > > > >   f.1) filter out free pages?
> > > > > > >   f.2) migration_bitmap_sync()
> > > > > > >
> > > > > > > It's f.1 I'm worried about.  If the guest started generating
> > > > > > > the free bitmap before (d), then a page marked as 'free' in
> > > > > > > f.1 might have become dirty before (d) and so (f.2) doesn't
> > > > > > > set the dirty again, and so we can't filter out pages in f.1.
> > > > > > >
> > > > > >
> > > > > > As you described, the order is incorrect.
> > > > > >
> > > > > > Liang
> > > > >
> > > > >
> > > > > So to make it safe, what is required is to make sure no free
> > > > > list us outstanding before calling migration_bitmap_sync.
> > > > >
> > > > > If one is outstanding, filter out pages before calling
> > > migration_bitmap_sync.
> > > > >
> > > > > Of course, if we just do it like we normally do with migration,
> > > > > then by the time we call migration_bitmap_sync dirty bitmap is
> > > > > completely empty, so there won't be anything to filter out.
> > > > >
> > > > > One way to address this is call migration_bitmap_sync in the IO
> > > > > handler, while VCPU is stopped, then make sure to filter out
> > > > > pages before the next migration_bitmap_sync.
> > > > >
> > > > > Another is to start filtering out pages upon IO handler, but
> > > > > make sure to flush the queue before calling migration_bitmap_sync.
> > > > >
> > > >
> > > > It's really complex, maybe we should switch to a simple start,
> > > > just skip the free page in the ram bulk stage and make it asynchronous?
> > > >
> > > > Liang
> > >
> > > You mean like your patches do? No, blocking bulk migration until
> > > guest response is basically a non-starter.
> > >
> >
> > No, don't wait anymore. Like below (copy from previous thread)
> > --
> > 1. Set all the bits in the migration_bitmap_rcu->bmap to 1 2. Clear
> > all the bits in ram_list.dirty_memory[DIRTY_MEMORY_MIGRATION]
> > 3. Send the get_free_page_bitmap request 4. Start to send  pages to
> > destination and check if the free_page_bitmap is ready
> >if (is_ready) {
> >  filter out the free pages from  migration_bitmap_rcu->bmap;
> >  migration_bitmap_sync();
> >  }
> > continue until live migration complete.
> > ---
> > Can this work?
> >
> > Liang
> 
> Not if you get the ready bit asynchronously like you wrote here since
> is_ready can get set while you called migration_bitmap_sync.

If is_ready is not set just before the first call to migration_bitmap_sync(),
then we can ignore the free page bitmap and not filter the free pages.
Another thing we should do is remove migration_bitmap_sync() from
ram_save_setup().

> As I said previously,
> to make this work you need to filter out synchronously while VCPU is
> stopped and while free pages from list are not being used.
> 
> Alternatively prevent getting free page list and filtering them out from guest
> from racing with migration_bitmap_sync.
> 
> For example, flush the VQ after migration_bitmap_sync.
> So:
> 
> lock
> migration_bitmap_sync();
> while (elem = virtqueue_pop) {
> virtqueue_push(elem)
> g_free(elem)
> }
> unlock
> 
> 
> while in handle_output
> 
> lock
> while (elem = virtqueue_pop) {
> list = get_free_list(elem)
> filter_out_free(list)
> virtqueue_push(elem)
> free(elem)
> }
> unlock
> 
> 
> lock prevents migration_bitmap_sync from racing against  handle_output
> 
> 
> This way you can actually use ioeventfd
> for this VQ so VCPU won't be blocked.
> 
> I do not think this is so complex, and
> this way 

Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages

2016-03-24 Thread Li, Liang Z
> > > > On Thu, Mar 24, 2016 at 03:53:25PM +, Li, Liang Z wrote:
> > > > > > > > > Not very complex, we can implement like this:
> > > > > > > > >
> > > > > > > > > 1. Set all the bits in the migration_bitmap_rcu->bmap to 1 2.
> > > > > > > > > Clear all the bits in ram_list.
> > > > > > > > > dirty_memory[DIRTY_MEMORY_MIGRATION]
> > > > > > > > > 3. Send the get_free_page_bitmap request 4. Start to
> > > > > > > > > send pages to destination and check if the free_page_bitmap
> is ready
> > > > > > > > > if (is_ready) {
> > > > > > > > >   filter out the free pages from  
> > > > > > > > > migration_bitmap_rcu-
> >bmap;
> > > > > > > > >   migration_bitmap_sync();
> > > > > > > > > }
> > > > > > > > >  continue until live migration complete.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Is that right?
> > > > > > > >
> > > > > > > > The order I'm trying to understand is something like:
> > > > > > > >
> > > > > > > > a) Send the get_free_page_bitmap request
> > > > > > > > b) Start sending pages
> > > > > > > > c) Reach the end of memory
> > > > > > > >   [ is_ready is false - guest hasn't made free map yet ]
> > > > > > > > d) normal migration_bitmap_sync() at end of first pass
> > > > > > > > e) Carry on sending dirty pages
> > > > > > > > f) is_ready is true
> > > > > > > >   f.1) filter out free pages?
> > > > > > > >   f.2) migration_bitmap_sync()
> > > > > > > >
> > > > > > > > It's f.1 I'm worried about.  If the guest started
> > > > > > > > generating the free bitmap before (d), then a page marked
> > > > > > > > as 'free' in f.1 might have become dirty before (d) and so
> > > > > > > > (f.2) doesn't set the dirty again, and so we can't filter out 
> > > > > > > > pages
> in f.1.
> > > > > > > >
> > > > > > >
> > > > > > > As you described, the order is incorrect.
> > > > > > >
> > > > > > > Liang
> > > > > >
> > > > > >
> > > > > > So to make it safe, what is required is to make sure no free
> > > > > > list us outstanding before calling migration_bitmap_sync.
> > > > > >
> > > > > > If one is outstanding, filter out pages before calling
> > > > migration_bitmap_sync.
> > > > > >
> > > > > > Of course, if we just do it like we normally do with
> > > > > > migration, then by the time we call migration_bitmap_sync
> > > > > > dirty bitmap is completely empty, so there won't be anything to
> filter out.
> > > > > >
> > > > > > One way to address this is call migration_bitmap_sync in the
> > > > > > IO handler, while VCPU is stopped, then make sure to filter
> > > > > > out pages before the next migration_bitmap_sync.
> > > > > >
> > > > > > Another is to start filtering out pages upon IO handler, but
> > > > > > make sure to flush the queue before calling migration_bitmap_sync.
> > > > > >
> > > > >
> > > > > It's really complex, maybe we should switch to a simple start,
> > > > > just skip the free page in the ram bulk stage and make it
> asynchronous?
> > > > >
> > > > > Liang
> > > >
> > > > You mean like your patches do? No, blocking bulk migration until
> > > > guest response is basically a non-starter.
> > > >
> > >
> > > No, don't wait anymore. Like below (copy from previous thread)
> > > --
> > > 1. Set all the bits in the migration_bitmap_rcu->bmap to 1 2. Clear
> > > all the bits in ram_list.dirty_memory[DIRTY_MEMORY_MIGRATION]
> > > 3. Send the get_free_page_bitmap request 4. Start to send  pages to
> > > destination and check if the free_page_bitmap is ready
> > >if (is_ready) {
> > >  filter out the free pages from  migration_bitmap_rcu->bmap;
> > >  migration_bitmap_sync();
> > >  }
> > > continue until live migration complete.
> > > ---
> > > Can this work?
> > >
> > > Liang
> >
> > Not if you get the ready bit asynchronously like you wrote here since
> > is_ready can get set while you called migration_bitmap_sync.
> >
> > As I said previously,
> > to make this work you need to filter out synchronously while VCPU is
> > stopped and while free pages from list are not being used.
> >
> > Alternatively prevent getting free page list and filtering them out
> > from guest from racing with migration_bitmap_sync.
> >
> > For example, flush the VQ after migration_bitmap_sync.
> > So:
> >
> > lock
> > migration_bitmap_sync();
> > while (elem = virtqueue_pop) {
> > virtqueue_push(elem)
> > g_free(elem)
> > }
> > unlock
> >
> >
> > while in handle_output
> >
> > lock
> > while (elem = virtqueue_pop) {
> > list = get_free_list(elem)
> > filter_out_free(list)
> > virtqueue_push(elem)
> > free(elem)
> > }
> > unlock
> >
> >
> > lock prevents migration_bitmap_sync from racing against  handle_output
> 
> I think the easier way is just to ignore the guests free list response if it 
> comes
> back after the first 

Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages

2016-03-24 Thread Michael S. Tsirkin
On Thu, Mar 24, 2016 at 05:49:33PM +, Dr. David Alan Gilbert wrote:
> * Michael S. Tsirkin (m...@redhat.com) wrote:
> > On Thu, Mar 24, 2016 at 04:05:16PM +, Li, Liang Z wrote:
> > > 
> > > 
> > > On %D, %SN wrote:
> > > %Q
> > > 
> > > %C
> > > 
> > > Liang
> > > 
> > > 
> > > > -Original Message-
> > > > From: Michael S. Tsirkin [mailto:m...@redhat.com]
> > > > Sent: Thursday, March 24, 2016 11:57 PM
> > > > To: Li, Liang Z
> > > > Cc: Dr. David Alan Gilbert; Wei Yang; qemu-devel@nongnu.org;
> > > > k...@vger.kernel.org; linux-ker...@vger.kenel.org; pbonz...@redhat.com;
> > > > r...@twiddle.net; ehabk...@redhat.com; amit.s...@redhat.com;
> > > > quint...@redhat.com; mohan_parthasara...@hpe.com;
> > > > jitendra.ko...@hpe.com; sim...@hpe.com; rka...@virtuozzo.com;
> > > > r...@redhat.com
> > > > Subject: Re: [RFC Design Doc]Speed up live migration by skipping free 
> > > > pages
> > > > 
> > > > On Thu, Mar 24, 2016 at 03:53:25PM +, Li, Liang Z wrote:
> > > > > > > > > Not very complex, we can implement like this:
> > > > > > > > >
> > > > > > > > > 1. Set all the bits in the migration_bitmap_rcu->bmap to 1 2.
> > > > > > > > > Clear all the bits in ram_list.
> > > > > > > > > dirty_memory[DIRTY_MEMORY_MIGRATION]
> > > > > > > > > 3. Send the get_free_page_bitmap request 4. Start to send
> > > > > > > > > pages to destination and check if the free_page_bitmap is 
> > > > > > > > > ready
> > > > > > > > > if (is_ready) {
> > > > > > > > >   filter out the free pages from  
> > > > > > > > > migration_bitmap_rcu->bmap;
> > > > > > > > >   migration_bitmap_sync();
> > > > > > > > > }
> > > > > > > > >  continue until live migration complete.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Is that right?
> > > > > > > >
> > > > > > > > The order I'm trying to understand is something like:
> > > > > > > >
> > > > > > > > a) Send the get_free_page_bitmap request
> > > > > > > > b) Start sending pages
> > > > > > > > c) Reach the end of memory
> > > > > > > >   [ is_ready is false - guest hasn't made free map yet ]
> > > > > > > > d) normal migration_bitmap_sync() at end of first pass
> > > > > > > > e) Carry on sending dirty pages
> > > > > > > > f) is_ready is true
> > > > > > > >   f.1) filter out free pages?
> > > > > > > >   f.2) migration_bitmap_sync()
> > > > > > > >
> > > > > > > > It's f.1 I'm worried about.  If the guest started generating the
> > > > > > > > free bitmap before (d), then a page marked as 'free' in f.1
> > > > > > > > might have become dirty before (d) and so (f.2) doesn't set the
> > > > > > > > dirty again, and so we can't filter out pages in f.1.
> > > > > > > >
> > > > > > >
> > > > > > > As you described, the order is incorrect.
> > > > > > >
> > > > > > > Liang
> > > > > >
> > > > > >
> > > > > > So to make it safe, what is required is to make sure no free list us
> > > > > > outstanding before calling migration_bitmap_sync.
> > > > > >
> > > > > > If one is outstanding, filter out pages before calling
> > > > migration_bitmap_sync.
> > > > > >
> > > > > > Of course, if we just do it like we normally do with migration, then
> > > > > > by the time we call migration_bitmap_sync dirty bitmap is completely
> > > > > > empty, so there won't be anything to filter out.
> > > > > >
> > > > > > One way to address this is call migration_bitmap_sync in the IO
> > > > > > handler, while VCPU is stopped, then make sure to filter out pages
> > > > > > before the next migration_bitmap_sync.
> > > > > >
> > > > > > Another is to start filtering out pages upon IO handler, but make
> > > > > > sure to flush the queue before calling migration_bitmap_sync.
> > > > > >
> > > > >
> > > > > It's really complex, maybe we should switch to a simple start,  just
> > > > > skip the free page in the ram bulk stage and make it asynchronous?
> > > > >
> > > > > Liang
> > > > 
> > > > You mean like your patches do? No, blocking bulk migration until guest
> > > > response is basically a non-starter.
> > > > 
> > > 
> > > No, don't wait anymore. Like below (copy from previous thread)
> > > --
> > > 1. Set all the bits in the migration_bitmap_rcu->bmap to 1 
> > > 2. Clear all the bits in ram_list.dirty_memory[DIRTY_MEMORY_MIGRATION]
> > > 3. Send the get_free_page_bitmap request 
> > > 4. Start to send  pages to destination and check if the free_page_bitmap 
> > > is ready
> > >if (is_ready) {
> > >  filter out the free pages from  migration_bitmap_rcu->bmap;
> > >  migration_bitmap_sync();
> > >  }
> > > continue until live migration complete.
> > > ---
> > > Can this work?
> > > 
> > > Liang
> > 
> > Not if you get the ready bit asynchronously like you wrote here
> > since is_ready can get set while you called migration_bitmap_sync.
> > 
> > As I said previously,
> > to make this work you 

Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages

2016-03-24 Thread Dr. David Alan Gilbert
* Michael S. Tsirkin (m...@redhat.com) wrote:
> On Thu, Mar 24, 2016 at 04:05:16PM +, Li, Liang Z wrote:
> > 
> > 
> > On %D, %SN wrote:
> > %Q
> > 
> > %C
> > 
> > Liang
> > 
> > 
> > > -Original Message-
> > > From: Michael S. Tsirkin [mailto:m...@redhat.com]
> > > Sent: Thursday, March 24, 2016 11:57 PM
> > > To: Li, Liang Z
> > > Cc: Dr. David Alan Gilbert; Wei Yang; qemu-devel@nongnu.org;
> > > k...@vger.kernel.org; linux-ker...@vger.kenel.org; pbonz...@redhat.com;
> > > r...@twiddle.net; ehabk...@redhat.com; amit.s...@redhat.com;
> > > quint...@redhat.com; mohan_parthasara...@hpe.com;
> > > jitendra.ko...@hpe.com; sim...@hpe.com; rka...@virtuozzo.com;
> > > r...@redhat.com
> > > Subject: Re: [RFC Design Doc]Speed up live migration by skipping free 
> > > pages
> > > 
> > > On Thu, Mar 24, 2016 at 03:53:25PM +, Li, Liang Z wrote:
> > > > > > > > Not very complex, we can implement like this:
> > > > > > > >
> > > > > > > > 1. Set all the bits in the migration_bitmap_rcu->bmap to 1 2.
> > > > > > > > Clear all the bits in ram_list.
> > > > > > > > dirty_memory[DIRTY_MEMORY_MIGRATION]
> > > > > > > > 3. Send the get_free_page_bitmap request 4. Start to send
> > > > > > > > pages to destination and check if the free_page_bitmap is ready
> > > > > > > > if (is_ready) {
> > > > > > > >   filter out the free pages from  
> > > > > > > > migration_bitmap_rcu->bmap;
> > > > > > > >   migration_bitmap_sync();
> > > > > > > > }
> > > > > > > >  continue until live migration complete.
> > > > > > > >
> > > > > > > >
> > > > > > > > Is that right?
> > > > > > >
> > > > > > > The order I'm trying to understand is something like:
> > > > > > >
> > > > > > > a) Send the get_free_page_bitmap request
> > > > > > > b) Start sending pages
> > > > > > > c) Reach the end of memory
> > > > > > >   [ is_ready is false - guest hasn't made free map yet ]
> > > > > > > d) normal migration_bitmap_sync() at end of first pass
> > > > > > > e) Carry on sending dirty pages
> > > > > > > f) is_ready is true
> > > > > > >   f.1) filter out free pages?
> > > > > > >   f.2) migration_bitmap_sync()
> > > > > > >
> > > > > > > It's f.1 I'm worried about.  If the guest started generating the
> > > > > > > free bitmap before (d), then a page marked as 'free' in f.1
> > > > > > > might have become dirty before (d) and so (f.2) doesn't set the
> > > > > > > dirty again, and so we can't filter out pages in f.1.
> > > > > > >
> > > > > >
> > > > > > As you described, the order is incorrect.
> > > > > >
> > > > > > Liang
> > > > >
> > > > >
> > > > > So to make it safe, what is required is to make sure no free list us
> > > > > outstanding before calling migration_bitmap_sync.
> > > > >
> > > > > If one is outstanding, filter out pages before calling
> > > migration_bitmap_sync.
> > > > >
> > > > > Of course, if we just do it like we normally do with migration, then
> > > > > by the time we call migration_bitmap_sync dirty bitmap is completely
> > > > > empty, so there won't be anything to filter out.
> > > > >
> > > > > One way to address this is call migration_bitmap_sync in the IO
> > > > > handler, while VCPU is stopped, then make sure to filter out pages
> > > > > before the next migration_bitmap_sync.
> > > > >
> > > > > Another is to start filtering out pages upon IO handler, but make
> > > > > sure to flush the queue before calling migration_bitmap_sync.
> > > > >
> > > >
> > > > It's really complex, maybe we should switch to a simple start,  just
> > > > skip the free page in the ram bulk stage and make it asynchronous?
> > > >
> > > > Liang
> > > 
> > > You mean like your patches do? No, blocking bulk migration until guest
> > > response is basically a non-starter.
> > > 
> > 
> > No, don't wait anymore. Like below (copy from previous thread)
> > --
> > 1. Set all the bits in the migration_bitmap_rcu->bmap to 1 
> > 2. Clear all the bits in ram_list.dirty_memory[DIRTY_MEMORY_MIGRATION]
> > 3. Send the get_free_page_bitmap request 
> > 4. Start to send  pages to destination and check if the free_page_bitmap is 
> > ready
> >if (is_ready) {
> >  filter out the free pages from  migration_bitmap_rcu->bmap;
> >  migration_bitmap_sync();
> >  }
> > continue until live migration complete.
> > ---
> > Can this work?
> > 
> > Liang
> 
> Not if you get the ready bit asynchronously like you wrote here
> since is_ready can get set while you called migration_bitmap_sync.
> 
> As I said previously,
> to make this work you need to filter out synchronously while VCPU is
> stopped and while free pages from list are not being used.
> 
> Alternatively prevent getting free page list
> and filtering them out from
> guest from racing with migration_bitmap_sync.
> 
> For example, flush the VQ after migration_bitmap_sync.
> So:
> 
> lock
> 

Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages

2016-03-24 Thread Li, Liang Z


On %D, %SN wrote:
%Q

%C

Liang


> -Original Message-
> From: Michael S. Tsirkin [mailto:m...@redhat.com]
> Sent: Thursday, March 24, 2016 11:57 PM
> To: Li, Liang Z
> Cc: Dr. David Alan Gilbert; Wei Yang; qemu-devel@nongnu.org;
> k...@vger.kernel.org; linux-ker...@vger.kenel.org; pbonz...@redhat.com;
> r...@twiddle.net; ehabk...@redhat.com; amit.s...@redhat.com;
> quint...@redhat.com; mohan_parthasara...@hpe.com;
> jitendra.ko...@hpe.com; sim...@hpe.com; rka...@virtuozzo.com;
> r...@redhat.com
> Subject: Re: [RFC Design Doc]Speed up live migration by skipping free pages
> 
> On Thu, Mar 24, 2016 at 03:53:25PM +, Li, Liang Z wrote:
> > > > > > Not very complex, we can implement like this:
> > > > > >
> > > > > > 1. Set all the bits in the migration_bitmap_rcu->bmap to 1 2.
> > > > > > Clear all the bits in ram_list.
> > > > > > dirty_memory[DIRTY_MEMORY_MIGRATION]
> > > > > > 3. Send the get_free_page_bitmap request 4. Start to send
> > > > > > pages to destination and check if the free_page_bitmap is ready
> > > > > > if (is_ready) {
> > > > > >   filter out the free pages from  
> > > > > > migration_bitmap_rcu->bmap;
> > > > > >   migration_bitmap_sync();
> > > > > > }
> > > > > >  continue until live migration complete.
> > > > > >
> > > > > >
> > > > > > Is that right?
> > > > >
> > > > > The order I'm trying to understand is something like:
> > > > >
> > > > > a) Send the get_free_page_bitmap request
> > > > > b) Start sending pages
> > > > > c) Reach the end of memory
> > > > >   [ is_ready is false - guest hasn't made free map yet ]
> > > > > d) normal migration_bitmap_sync() at end of first pass
> > > > > e) Carry on sending dirty pages
> > > > > f) is_ready is true
> > > > >   f.1) filter out free pages?
> > > > >   f.2) migration_bitmap_sync()
> > > > >
> > > > > It's f.1 I'm worried about.  If the guest started generating the
> > > > > free bitmap before (d), then a page marked as 'free' in f.1
> > > > > might have become dirty before (d) and so (f.2) doesn't set the
> > > > > dirty again, and so we can't filter out pages in f.1.
> > > > >
> > > >
> > > > As you described, the order is incorrect.
> > > >
> > > > Liang
> > >
> > >
> > > So to make it safe, what is required is to make sure no free list us
> > > outstanding before calling migration_bitmap_sync.
> > >
> > > If one is outstanding, filter out pages before calling
> migration_bitmap_sync.
> > >
> > > Of course, if we just do it like we normally do with migration, then
> > > by the time we call migration_bitmap_sync dirty bitmap is completely
> > > empty, so there won't be anything to filter out.
> > >
> > > One way to address this is call migration_bitmap_sync in the IO
> > > handler, while VCPU is stopped, then make sure to filter out pages
> > > before the next migration_bitmap_sync.
> > >
> > > Another is to start filtering out pages upon IO handler, but make
> > > sure to flush the queue before calling migration_bitmap_sync.
> > >
> >
> > It's really complex, maybe we should switch to a simple start,  just
> > skip the free page in the ram bulk stage and make it asynchronous?
> >
> > Liang
> 
> You mean like your patches do? No, blocking bulk migration until guest
> response is basically a non-starter.
> 

No, don't wait anymore. Like below (copy from previous thread)
--
1. Set all the bits in the migration_bitmap_rcu->bmap to 1 
2. Clear all the bits in ram_list.dirty_memory[DIRTY_MEMORY_MIGRATION]
3. Send the get_free_page_bitmap request 
4. Start to send pages to the destination and check if the free_page_bitmap
   is ready:
   if (is_ready) {
       filter out the free pages from migration_bitmap_rcu->bmap;
       migration_bitmap_sync();
   }
Continue until live migration completes.
---
Can this work?

Liang
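
The filtering in step 4 is essentially a bitmap and-not; a toy, self-contained
version, with plain arrays standing in for migration_bitmap_rcu->bmap and the
guest-supplied free page bitmap:

#include <stdint.h>
#include <stdio.h>

#define BITS_PER_LONG (8 * sizeof(unsigned long))

/* Drop every page the guest reported as free from the migration bitmap. */
static void filter_out_free(unsigned long *migration_bmap,
                            const unsigned long *free_bmap,
                            uint64_t nbits)
{
    for (uint64_t i = 0; i < (nbits + BITS_PER_LONG - 1) / BITS_PER_LONG; i++) {
        migration_bmap[i] &= ~free_bmap[i];
    }
}

int main(void)
{
    unsigned long migration_bmap[1] = { 0xffUL };  /* all 8 pages pending */
    unsigned long free_bmap[1]      = { 0x0fUL };  /* guest says pages 0-3 are free */

    filter_out_free(migration_bmap, free_bmap, 8);
    printf("pages left to send: 0x%lx\n", migration_bmap[0]);  /* 0xf0 */
    return 0;
}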

> --
> MST



Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages

2016-03-24 Thread Michael S. Tsirkin
On Thu, Mar 24, 2016 at 04:05:16PM +, Li, Liang Z wrote:
> 
> 
> On %D, %SN wrote:
> %Q
> 
> %C
> 
> Liang
> 
> 
> > -Original Message-
> > From: Michael S. Tsirkin [mailto:m...@redhat.com]
> > Sent: Thursday, March 24, 2016 11:57 PM
> > To: Li, Liang Z
> > Cc: Dr. David Alan Gilbert; Wei Yang; qemu-devel@nongnu.org;
> > k...@vger.kernel.org; linux-ker...@vger.kenel.org; pbonz...@redhat.com;
> > r...@twiddle.net; ehabk...@redhat.com; amit.s...@redhat.com;
> > quint...@redhat.com; mohan_parthasara...@hpe.com;
> > jitendra.ko...@hpe.com; sim...@hpe.com; rka...@virtuozzo.com;
> > r...@redhat.com
> > Subject: Re: [RFC Design Doc]Speed up live migration by skipping free pages
> > 
> > On Thu, Mar 24, 2016 at 03:53:25PM +, Li, Liang Z wrote:
> > > > > > > Not very complex, we can implement like this:
> > > > > > >
> > > > > > > 1. Set all the bits in the migration_bitmap_rcu->bmap to 1 2.
> > > > > > > Clear all the bits in ram_list.
> > > > > > > dirty_memory[DIRTY_MEMORY_MIGRATION]
> > > > > > > 3. Send the get_free_page_bitmap request 4. Start to send
> > > > > > > pages to destination and check if the free_page_bitmap is ready
> > > > > > > if (is_ready) {
> > > > > > >   filter out the free pages from  
> > > > > > > migration_bitmap_rcu->bmap;
> > > > > > >   migration_bitmap_sync();
> > > > > > > }
> > > > > > >  continue until live migration complete.
> > > > > > >
> > > > > > >
> > > > > > > Is that right?
> > > > > >
> > > > > > The order I'm trying to understand is something like:
> > > > > >
> > > > > > a) Send the get_free_page_bitmap request
> > > > > > b) Start sending pages
> > > > > > c) Reach the end of memory
> > > > > >   [ is_ready is false - guest hasn't made free map yet ]
> > > > > > d) normal migration_bitmap_sync() at end of first pass
> > > > > > e) Carry on sending dirty pages
> > > > > > f) is_ready is true
> > > > > >   f.1) filter out free pages?
> > > > > >   f.2) migration_bitmap_sync()
> > > > > >
> > > > > > It's f.1 I'm worried about.  If the guest started generating the
> > > > > > free bitmap before (d), then a page marked as 'free' in f.1
> > > > > > might have become dirty before (d) and so (f.2) doesn't set the
> > > > > > dirty again, and so we can't filter out pages in f.1.
> > > > > >
> > > > >
> > > > > As you described, the order is incorrect.
> > > > >
> > > > > Liang
> > > >
> > > >
> > > > So to make it safe, what is required is to make sure no free list us
> > > > outstanding before calling migration_bitmap_sync.
> > > >
> > > > If one is outstanding, filter out pages before calling
> > migration_bitmap_sync.
> > > >
> > > > Of course, if we just do it like we normally do with migration, then
> > > > by the time we call migration_bitmap_sync dirty bitmap is completely
> > > > empty, so there won't be anything to filter out.
> > > >
> > > > One way to address this is call migration_bitmap_sync in the IO
> > > > handler, while VCPU is stopped, then make sure to filter out pages
> > > > before the next migration_bitmap_sync.
> > > >
> > > > Another is to start filtering out pages upon IO handler, but make
> > > > sure to flush the queue before calling migration_bitmap_sync.
> > > >
> > >
> > > It's really complex, maybe we should switch to a simple start,  just
> > > skip the free page in the ram bulk stage and make it asynchronous?
> > >
> > > Liang
> > 
> > You mean like your patches do? No, blocking bulk migration until guest
> > response is basically a non-starter.
> > 
> 
> No, don't wait anymore. Like below (copy from previous thread)
> --
> 1. Set all the bits in the migration_bitmap_rcu->bmap to 1 
> 2. Clear all the bits in ram_list.dirty_memory[DIRTY_MEMORY_MIGRATION]
> 3. Send the get_free_page_bitmap request 
> 4. Start to send  pages to destination and check if the free_page_bitmap is 
> ready
>if (is_ready) {
>  filter out the free pages from  migration_bitmap_rcu->bmap;
>  migration_bitmap_sync();
>  }
> continue until live migration complete.
> ---
> Can this work?
> 
> Liang

Not if you get the ready bit asynchronously like you wrote here,
since is_ready can get set while you are calling migration_bitmap_sync.

As I said previously, to make this work you need to filter out pages
synchronously, while the VCPU is stopped and while the free pages from
the list are not being used.

Alternatively, prevent getting the free page list from the guest, and
filtering pages out with it, from racing with migration_bitmap_sync.

For example, flush the VQ after migration_bitmap_sync.
So:

lock
migration_bitmap_sync();
while (elem = virtqueue_pop) {
virtqueue_push(elem)
g_free(elem)
}
unlock


while in handle_output

lock
while (elem = virtqueue_pop) {
list = get_free_list(elem)
filter_out_free(list)
virtqueue_push(elem)
g_free(elem)
}
unlock
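
To make the locking concrete, a minimal standalone C sketch of the same idea might look
like the following. The types and helpers are stand-ins invented for the example (a POSIX
mutex instead of QEMU's locking, and placeholder virtio helpers); this is illustrative
only, not QEMU's actual virtio or migration API.

/*
 * Sketch of the lock discipline described above: the migration side syncs
 * the dirty bitmap and then drains the virtqueue under a lock, so a
 * free-page list can never be applied after the sync it races with.
 */
#include <pthread.h>
#include <stdlib.h>

typedef struct Elem Elem;                 /* stand-in for a virtqueue element */
Elem *virtqueue_pop(void);                /* stand-in virtio helpers */
void virtqueue_push(Elem *elem);
unsigned long *get_free_list(Elem *elem); /* guest-supplied free-page bitmap */
void filter_out_free(unsigned long *free_bitmap);
void migration_bitmap_sync(void);

static pthread_mutex_t free_page_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Migration side: sync first, then drop any free-page lists that were
 * queued before the sync; applying them after the sync would be unsafe.
 */
void sync_and_flush_free_page_queue(void)
{
    Elem *elem;

    pthread_mutex_lock(&free_page_lock);
    migration_bitmap_sync();
    while ((elem = virtqueue_pop()) != NULL) {
        virtqueue_push(elem);             /* return the buffer unused */
        free(elem);
    }
    pthread_mutex_unlock(&free_page_lock);
}

/*
 * handle_output side: filter free pages out of the migration bitmap,
 * serialized against the sync above by the same lock.
 */
void handle_free_page_output(void)
{
    Elem *elem;

    pthread_mutex_lock(&free_page_lock);
    while ((elem = virtqueue_pop()) != NULL) {
        filter_out_free(get_free_list(elem));
        virtqueue_push(elem);
        free(elem);
    }
    pthread_mutex_unlock(&free_page_lock);
}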

Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages

2016-03-24 Thread Li, Liang Z
> On 24/03/2016 16:16, Li, Liang Z wrote:
> > > There's no guarantee that there's a single 'hole'
> > > even on the PC, and we want balloon to be portable.
> >
> > As long as we know how many 'hole' and where the holes are.
> 
> The mapping between ram_addr_t and GPA is completely internal to QEMU.
> Passing it to the guest is a layering violation.
> 
> Paolo

Yes. I have already changed this in my design.

Liang
> 
> > we can filter out them. QEMU should have this kind of information.
> > I know my RFC patch passed an arch specific free page bitmap is not a
> > good idea. So in my design, I changed this by passing a loose free
> > page bitmap which contains the holes, and let QEMU to filter out the
> > holes according to some arch specific information. This can make balloon be
> portable.
> >



Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages

2016-03-24 Thread Li, Liang Z
> > > > > > Agree. Current balloon just send 256 PFNs a time, that's too
> > > > > > few and lead to too many times of virtio transmission, that's
> > > > > > the main reason for the
> > > > > bad performance.
> > > > > > Change the VIRTIO_BALLOON_ARRAY_PFNS_MAX to a large value
> can
> > > > > improve
> > > > > > the performance significant. Maybe we should increase it
> > > > > > before doing the further optimization, do you think so ?
> > > > >
> > > > > We could push it up a bit higher: 256 is 1kbyte in size, so we
> > > > > can make it 3x bigger and still fit struct virtio_balloon is a
> > > > > single page. But if we are going to add the bitmap variant
> > > > > anyway, we probably
> > > shouldn't bother.
> > > > >
> > > > > > > > c. address translation and madvise() operation (24%,
> > > > > > > > 1423ms)
> > > > > > >
> > > > > > > How is this split between translation and madvise?  I
> > > > > > > suspect it's mostly madvise since you need translation when
> > > > > > > using bitmap as
> > > well.
> > > > > > > Correct? Could you measure this please?  Also, what if we
> > > > > > > use the new MADV_FREE instead?  By how much would this help?
> > > > > > >
> > > > > > For the current balloon, address translation is needed.
> > > > > > But for live migration, there is no need to do address translation.
> > > > >
> > > > > Well you need ram address in order to clear the dirty bit.
> > > > > How would you get it without translation?
> > > > >
> > > >
> > > > If you means that kind of address translation, yes, it need.
> > > > What I want to say is, filter out the free page can be done by
> > > > bitmap
> > > operation.
> > > >
> > > > Liang
> > >
> > > OK so I see that your patches use block->offset in struct RAMBlock
> > > to look up bits in guest-supplied bitmap.
> > > I don't think that's guaranteed to work.
> >
> > It's part of the bitmap operation, because the latest change of the
> ram_list.dirty_memory.
> > Why do you think so? Could you tell me the reason?
> >
> > Liang
> 
> Sorry, why do I think what? That ram_addr_t is not guaranteed to equal GPA
> of the block?
> 

I mean, why do you think it can't be guaranteed to work?
Yes, ram_addr_t is not guaranteed to equal the GPA of the block, but I didn't use it as
a GPA. The code in filter_out_guest_free_pages() in my patch just follows the style of
the latest change to ram_list.dirty_memory[].

The free page bitmap obtained from the guest in my RFC patch has already had the
'holes' filtered out, so bit N of the free page bitmap and bit N in
ram_list.dirty_memory[DIRTY_MEMORY_MIGRATION]->blocks correspond to
the same guest page.  Right?
If that's true, I think I am doing the right thing?


Liang
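
If that one-to-one bit correspondence really holds (which is exactly the point being
questioned in this thread), the filtering itself does reduce to a plain bitmap
operation. A rough, self-contained sketch, illustrative only and not the patch's actual
code:

#include <stddef.h>

#define BITS_PER_LONG (8 * sizeof(unsigned long))

/*
 * Clear every bit in the migration bitmap that the guest reports as a
 * free page. Assumes bit N names the same page in both bitmaps.
 */
static void filter_out_free_pages(unsigned long *migration_bitmap,
                                  const unsigned long *free_page_bitmap,
                                  size_t nbits)
{
    size_t nwords = (nbits + BITS_PER_LONG - 1) / BITS_PER_LONG;

    for (size_t i = 0; i < nwords; i++) {
        /* still to send = dirty AND NOT free */
        migration_bitmap[i] &= ~free_page_bitmap[i];
    }
}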

> E.g. HACKING says:
>   Use hwaddr for guest physical addresses except pcibus_t
>   for PCI addresses.  In addition, ram_addr_t is a QEMU internal
> address
>   space that maps guest RAM physical addresses into an intermediate
>   address space that can map to host virtual address spaces.
> 
> 
> --
> MST
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in the body of
> a message to majord...@vger.kernel.org More majordomo info at
> http://vger.kernel.org/majordomo-info.html



Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages

2016-03-24 Thread Michael S. Tsirkin
On Thu, Mar 24, 2016 at 03:16:29PM +, Li, Liang Z wrote:
> > > > Sorry, why do I think what? That ram_addr_t is not guaranteed to
> > > > equal GPA of the block?
> > > >
> > >
> > > I mean why do you think that's can't guaranteed to work.
> > > Yes, ram_addr_t is not guaranteed to equal GPA of the block. But I
> > > didn't use them as GPA. The code in the filter_out_guest_free_pages()
> > > in my patch just follow the style of the latest change of
> > ram_list.dirty_memory[].
> > >
> > > The free page bitmap got from the guest in my RFC patch has been
> > > filtered out the 'hole', so the bit N of the free page bitmap and the
> > > bit N in ram_list.dirty_memory[DIRTY_MEMORY_MIGRATION]->blocks are
> > > corresponding to the same guest page.  Right?
> > > If it's true, I think I am doing the right thing?
> > >
> > >
> > > Liang
> > 
> > There's no guarantee that there's a single 'hole'
> > even on the PC, and we want balloon to be portable.
> > 
> 
> As long as we know how many 'hole' and where the holes are.
> we can filter out them. QEMU should have this kind of information.
> I know my RFC patch passed an arch specific free page bitmap is not
> a good idea. So in my design, I changed this by passing a loose free page
> bitmap which contains the holes, and let QEMU to filter out the holes
> according to some arch specific information. This can make balloon be 
> portable.

Only if you write the arch specific thing for all arches.
This concept of holes simply does not match how we manage memory in qemu.

> > So I'm not sure I understand what your patch is doing, do you mean you pass
> > the GPA to ram addr mapping from host to guest?
> > 
> 
> No, my patch passed the 'lowmem', which helps to filter out the hole from 
> host to guest.
> The design has changed this.
> 
> > That can be made to work but it's not a good idea, and I don't see why would
> > it be faster than doing the same translation host side.
> > 
> 
> It's faster because there is no address translation, most of them are bitmap 
> operation.
> 
> Liang

It's just wrong to say that there is no translation. Of course there has to be 
one.

Fundamentally, the guest uses the GPA as an offset into the bitmap. QEMU uses
ram_addr_t for migration, so you either translate GPA to ram_addr_t or
ram_addr_t to GPA.

I think the reason for the speedup that you observe is that you
only need to translate ram_addr_t to GPA once per ramblock,
which is much faster than translating GPA to ram_addr_t for each page.
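
For illustration, the "once per ramblock" shape could look roughly like the sketch
below. The Block type and its gpa_base field are assumptions made purely for the
example; the real ram_addr_t-to-GPA mapping is internal to QEMU, which is the whole
point of the discussion.

#include <stdint.h>

#define PAGE_SHIFT    12
#define BITS_PER_LONG (8 * sizeof(unsigned long))

typedef struct Block {
    uint64_t ram_offset;   /* ram_addr_t-style offset of the block */
    uint64_t gpa_base;     /* guest-physical base, assumed known here */
    uint64_t npages;
} Block;

/*
 * Clear migration-dirty bits for pages the guest reports as free.
 * The ram_addr_t <-> GPA translation is done once per block; the loop
 * body is plain bitmap arithmetic. (A real version would also work
 * word-at-a-time rather than bit-at-a-time.)
 */
static void filter_block(unsigned long *migration_bitmap,        /* indexed by ram page */
                         const unsigned long *guest_free_bitmap, /* indexed by GPA page */
                         const Block *b)
{
    uint64_t gpa_page = b->gpa_base >> PAGE_SHIFT;    /* once per block */
    uint64_t ram_page = b->ram_offset >> PAGE_SHIFT;  /* once per block */

    for (uint64_t i = 0; i < b->npages; i++) {
        uint64_t g = gpa_page + i;

        if (guest_free_bitmap[g / BITS_PER_LONG] & (1UL << (g % BITS_PER_LONG))) {
            uint64_t r = ram_page + i;

            migration_bitmap[r / BITS_PER_LONG] &= ~(1UL << (r % BITS_PER_LONG));
        }
    }
}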



> > 
> > > > E.g. HACKING says:
> > > > Use hwaddr for guest physical addresses except pcibus_t
> > > > for PCI addresses.  In addition, ram_addr_t is a QEMU internal
> > > > address
> > > > space that maps guest RAM physical addresses into an 
> > > > intermediate
> > > > address space that can map to host virtual address spaces.
> > > >
> > > >
> > > > --



Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages

2016-03-24 Thread Michael S. Tsirkin
On Thu, Mar 24, 2016 at 02:33:15PM +, Li, Liang Z wrote:
> > > > > > > Agree. Current balloon just send 256 PFNs a time, that's too
> > > > > > > few and lead to too many times of virtio transmission, that's
> > > > > > > the main reason for the
> > > > > > bad performance.
> > > > > > > Change the VIRTIO_BALLOON_ARRAY_PFNS_MAX to a large value
> > can
> > > > > > improve
> > > > > > > the performance significant. Maybe we should increase it
> > > > > > > before doing the further optimization, do you think so ?
> > > > > >
> > > > > > We could push it up a bit higher: 256 is 1kbyte in size, so we
> > > > > > can make it 3x bigger and still fit struct virtio_balloon is a
> > > > > > single page. But if we are going to add the bitmap variant
> > > > > > anyway, we probably
> > > > shouldn't bother.
> > > > > >
> > > > > > > > > c. address translation and madvise() operation (24%,
> > > > > > > > > 1423ms)
> > > > > > > >
> > > > > > > > How is this split between translation and madvise?  I
> > > > > > > > suspect it's mostly madvise since you need translation when
> > > > > > > > using bitmap as
> > > > well.
> > > > > > > > Correct? Could you measure this please?  Also, what if we
> > > > > > > > use the new MADV_FREE instead?  By how much would this help?
> > > > > > > >
> > > > > > > For the current balloon, address translation is needed.
> > > > > > > But for live migration, there is no need to do address 
> > > > > > > translation.
> > > > > >
> > > > > > Well you need ram address in order to clear the dirty bit.
> > > > > > How would you get it without translation?
> > > > > >
> > > > >
> > > > > If you means that kind of address translation, yes, it need.
> > > > > What I want to say is, filter out the free page can be done by
> > > > > bitmap
> > > > operation.
> > > > >
> > > > > Liang
> > > >
> > > > OK so I see that your patches use block->offset in struct RAMBlock
> > > > to look up bits in guest-supplied bitmap.
> > > > I don't think that's guaranteed to work.
> > >
> > > It's part of the bitmap operation, because the latest change of the
> > ram_list.dirty_memory.
> > > Why do you think so? Could you tell me the reason?
> > >
> > > Liang
> > 
> > Sorry, why do I think what? That ram_addr_t is not guaranteed to equal GPA
> > of the block?
> > 
> 
> I mean why do you think that's can't guaranteed to work.
> Yes, ram_addr_t is not guaranteed to equal GPA of the block. But I didn't use 
> them as
> GPA. The code in the filter_out_guest_free_pages() in my patch just follow 
> the style of
> the latest change of  ram_list.dirty_memory[].
> 
> The free page bitmap got from the guest in my RFC patch has been filtered out 
> the
> 'hole', so the bit N of the free page bitmap and the bit N in 
> ram_list.dirty_memory[DIRTY_MEMORY_MIGRATION]->blocks are corresponding to
> the same guest page.  Right?
> If it's true, I think I am doing the right thing?
> 
> 
> Liang

There's no guarantee that there's a single 'hole'
even on the PC, and we want balloon to be portable.

So I'm not sure I understand what your patch is doing:
do you mean you pass the GPA-to-ram-addr mapping
from host to guest?

That can be made to work, but it's not a good idea,
and I don't see why it would be faster than doing
the same translation on the host side.


> > E.g. HACKING says:
> > Use hwaddr for guest physical addresses except pcibus_t
> > for PCI addresses.  In addition, ram_addr_t is a QEMU internal
> > address
> > space that maps guest RAM physical addresses into an intermediate
> > address space that can map to host virtual address spaces.
> > 
> > 
> > --
> > MST
> > --
> > To unsubscribe from this list: send the line "unsubscribe kvm" in the body 
> > of
> > a message to majord...@vger.kernel.org More majordomo info at
> > http://vger.kernel.org/majordomo-info.html



Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages

2016-03-24 Thread Michael S. Tsirkin
On Thu, Mar 24, 2016 at 03:53:25PM +, Li, Liang Z wrote:
> > > > > Not very complex, we can implement like this:
> > > > >
> > > > > 1. Set all the bits in the migration_bitmap_rcu->bmap to 1 2.
> > > > > Clear all the bits in ram_list.
> > > > > dirty_memory[DIRTY_MEMORY_MIGRATION]
> > > > > 3. Send the get_free_page_bitmap request 4. Start to send pages to
> > > > > destination and check if the free_page_bitmap is ready
> > > > > if (is_ready) {
> > > > >   filter out the free pages from  migration_bitmap_rcu->bmap;
> > > > >   migration_bitmap_sync();
> > > > > }
> > > > >  continue until live migration complete.
> > > > >
> > > > >
> > > > > Is that right?
> > > >
> > > > The order I'm trying to understand is something like:
> > > >
> > > > a) Send the get_free_page_bitmap request
> > > > b) Start sending pages
> > > > c) Reach the end of memory
> > > >   [ is_ready is false - guest hasn't made free map yet ]
> > > > d) normal migration_bitmap_sync() at end of first pass
> > > > e) Carry on sending dirty pages
> > > > f) is_ready is true
> > > >   f.1) filter out free pages?
> > > >   f.2) migration_bitmap_sync()
> > > >
> > > > It's f.1 I'm worried about.  If the guest started generating the
> > > > free bitmap before (d), then a page marked as 'free' in f.1 might
> > > > have become dirty before (d) and so (f.2) doesn't set the dirty
> > > > again, and so we can't filter out pages in f.1.
> > > >
> > >
> > > As you described, the order is incorrect.
> > >
> > > Liang
> > 
> > 
> > So to make it safe, what is required is to make sure no free list us 
> > outstanding
> > before calling migration_bitmap_sync.
> > 
> > If one is outstanding, filter out pages before calling 
> > migration_bitmap_sync.
> > 
> > Of course, if we just do it like we normally do with migration, then by the
> > time we call migration_bitmap_sync dirty bitmap is completely empty, so
> > there won't be anything to filter out.
> > 
> > One way to address this is call migration_bitmap_sync in the IO handler,
> > while VCPU is stopped, then make sure to filter out pages before the next
> > migration_bitmap_sync.
> > 
> > Another is to start filtering out pages upon IO handler, but make sure to 
> > flush
> > the queue before calling migration_bitmap_sync.
> > 
> 
> It's really complex, maybe we should switch to a simple start,  just skip the 
> free page in
> the ram bulk stage and make it asynchronous?
> 
> Liang

You mean like your patches do? No, blocking bulk migration until the guest
responds is basically a non-starter.

-- 
MST



Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages

2016-03-24 Thread Li, Liang Z
> > > > Not very complex, we can implement like this:
> > > >
> > > > 1. Set all the bits in the migration_bitmap_rcu->bmap to 1 2.
> > > > Clear all the bits in ram_list.
> > > > dirty_memory[DIRTY_MEMORY_MIGRATION]
> > > > 3. Send the get_free_page_bitmap request 4. Start to send pages to
> > > > destination and check if the free_page_bitmap is ready
> > > > if (is_ready) {
> > > >   filter out the free pages from  migration_bitmap_rcu->bmap;
> > > >   migration_bitmap_sync();
> > > > }
> > > >  continue until live migration complete.
> > > >
> > > >
> > > > Is that right?
> > >
> > > The order I'm trying to understand is something like:
> > >
> > > a) Send the get_free_page_bitmap request
> > > b) Start sending pages
> > > c) Reach the end of memory
> > >   [ is_ready is false - guest hasn't made free map yet ]
> > > d) normal migration_bitmap_sync() at end of first pass
> > > e) Carry on sending dirty pages
> > > f) is_ready is true
> > >   f.1) filter out free pages?
> > >   f.2) migration_bitmap_sync()
> > >
> > > It's f.1 I'm worried about.  If the guest started generating the
> > > free bitmap before (d), then a page marked as 'free' in f.1 might
> > > have become dirty before (d) and so (f.2) doesn't set the dirty
> > > again, and so we can't filter out pages in f.1.
> > >
> >
> > As you described, the order is incorrect.
> >
> > Liang
> 
> 
> So to make it safe, what is required is to make sure no free list us 
> outstanding
> before calling migration_bitmap_sync.
> 
> If one is outstanding, filter out pages before calling migration_bitmap_sync.
> 
> Of course, if we just do it like we normally do with migration, then by the
> time we call migration_bitmap_sync dirty bitmap is completely empty, so
> there won't be anything to filter out.
> 
> One way to address this is call migration_bitmap_sync in the IO handler,
> while VCPU is stopped, then make sure to filter out pages before the next
> migration_bitmap_sync.
> 
> Another is to start filtering out pages upon IO handler, but make sure to 
> flush
> the queue before calling migration_bitmap_sync.
> 

It's really complex. Maybe we should switch to a simpler start: just skip the free
pages in the RAM bulk stage and make it asynchronous?

Liang



Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages

2016-03-24 Thread Michael S. Tsirkin
On Thu, Mar 24, 2016 at 02:50:56PM +, Li, Liang Z wrote:
> > > > > >> Given the typical speed of networks; it wouldn't do too much
> > > > > >> harm to start sending assuming all pages are dirty and then
> > > > > >> when the guest finally gets around to finishing the bitmap then
> > > > > >> update, so it's asynchronous - and then if the guest never
> > > > > >> responds we don't really
> > > > care.
> > > > > >
> > > > > >Indeed, thanks!
> > > > > >
> > > > >
> > > > > This is interesting. By doing so, the threshold I mentioned in
> > > > > another mail is not necessary, since we can do it in parallel.
> > > >
> > > > Actually I just realised it's a little more complex; we can't sync
> > > > the dirty bitmap again from the guest until after we've received the
> > guests 'free'
> > > > bitmap; that's because we wouldn't know if a 'dirty' page reflected
> > > > that a page declared as 'free' had now been reused - so there is
> > > > still an ordering there.
> > > >
> > > > Dave
> > >
> > > Not very complex, we can implement like this:
> > >
> > > 1. Set all the bits in the migration_bitmap_rcu->bmap to 1 2. Clear
> > > all the bits in ram_list. dirty_memory[DIRTY_MEMORY_MIGRATION]
> > > 3. Send the get_free_page_bitmap request 4. Start to send pages to
> > > destination and check if the free_page_bitmap is ready
> > > if (is_ready) {
> > >   filter out the free pages from  migration_bitmap_rcu->bmap;
> > >   migration_bitmap_sync();
> > > }
> > >  continue until live migration complete.
> > >
> > >
> > > Is that right?
> > 
> > The order I'm trying to understand is something like:
> > 
> > a) Send the get_free_page_bitmap request
> > b) Start sending pages
> > c) Reach the end of memory
> >   [ is_ready is false - guest hasn't made free map yet ]
> > d) normal migration_bitmap_sync() at end of first pass
> > e) Carry on sending dirty pages
> > f) is_ready is true
> >   f.1) filter out free pages?
> >   f.2) migration_bitmap_sync()
> > 
> > It's f.1 I'm worried about.  If the guest started generating the free bitmap
> > before (d), then a page marked as 'free' in f.1 might have become dirty
> > before (d) and so (f.2) doesn't set the dirty again, and so we can't filter 
> > out
> > pages in f.1.
> > 
> 
> As you described, the order is incorrect.
> 
> Liang


So to make it safe, what is required is to make
sure no free list is outstanding before calling
migration_bitmap_sync.

If one is outstanding, filter out pages before
calling migration_bitmap_sync.

Of course, if we just do it like we normally
do with migration, then by the time we call
migration_bitmap_sync the dirty bitmap
is completely empty, so there won't be
anything to filter out.

One way to address this is to call migration_bitmap_sync
in the IO handler, while the VCPU is stopped,
and then make sure to filter out pages before the next
migration_bitmap_sync.

Another is to start filtering out pages in the
IO handler, but make sure to flush the queue
before calling migration_bitmap_sync.


> > Dave
> > 
> > >
> > > Liang
> > > >
> > > > >
> > > > > >Liang
> > --
> > Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages

2016-03-24 Thread Li, Liang Z
> On 24/03/2016 16:39, Li, Liang Z wrote:
> > > Only if you write the arch specific thing for all arches.
> >
> > I plan to keep a function stub for each arch to implement. And I have
> > done that for X86.
> 
> Again: the ram_addr_t matching is internal to QEMU and can vary from
> release to release.  Do not do this.
> 

OK. I got it this time.

> > > I think the reason for the speedup that you observe is that you only
> > > need to translate ram_addr_t to GPA once per ramblock, which is much
> > > faster than translating GPA to ram_addr_t for each page.
> >
> > Yes, exactly!
> 
> You don't need to translate it once per page.  When QEMU copies the bitmap
> from guest memory to its own internal data structures, it can do so one block
> at a time with a function like
> 
> void bitmap_copy_bits(unsigned long *dst, unsigned int dst_start,
>   unsigned long *src, unsigned int src_start,
> unsigned int nbits);
> 
> Paolo

Balloon can get benefit from this.

Thanks!

Liang



Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages

2016-03-24 Thread Paolo Bonzini


On 24/03/2016 16:39, Li, Liang Z wrote:
> > Only if you write the arch specific thing for all arches.
> 
> I plan to keep a function stub for each arch to implement. And I
> have done that for X86.

Again: the ram_addr_t matching is internal to QEMU and can vary from
release to release.  Do not do this.

> > I think the reason for the speedup that you observe is that you only need to
> > translate ram_addr_t to GPA once per ramblock, which is much faster than
> > translating GPA to ram_addr_t for each page.
> 
> Yes, exactly! 

You don't need to translate it once per page.  When QEMU copies the
bitmap from guest memory to its own internal data structures, it can do
so one block at a time with a function like

void bitmap_copy_bits(unsigned long *dst, unsigned int dst_start,
  unsigned long *src, unsigned int src_start,
  unsigned int nbits);

Paolo
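
A bit-at-a-time implementation of such a helper could look like the sketch below.
This is illustrative only, not an existing QEMU function; a production version would
copy word-at-a-time and only handle the unaligned edges specially.

#define BITS_PER_LONG (8 * sizeof(unsigned long))

static int get_bit(const unsigned long *map, unsigned int n)
{
    return (map[n / BITS_PER_LONG] >> (n % BITS_PER_LONG)) & 1UL;
}

static void put_bit(unsigned long *map, unsigned int n, int val)
{
    unsigned long mask = 1UL << (n % BITS_PER_LONG);

    if (val) {
        map[n / BITS_PER_LONG] |= mask;
    } else {
        map[n / BITS_PER_LONG] &= ~mask;
    }
}

/* Copy nbits bits from src (starting at src_start) to dst (at dst_start). */
void bitmap_copy_bits(unsigned long *dst, unsigned int dst_start,
                      const unsigned long *src, unsigned int src_start,
                      unsigned int nbits)
{
    for (unsigned int i = 0; i < nbits; i++) {
        put_bit(dst, dst_start + i, get_bit(src, src_start + i));
    }
}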



Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages

2016-03-24 Thread Li, Liang Z
> > > > I mean why do you think that's can't guaranteed to work.
> > > > Yes, ram_addr_t is not guaranteed to equal GPA of the block. But I
> > > > didn't use them as GPA. The code in the
> > > > filter_out_guest_free_pages() in my patch just follow the style of
> > > > the latest change of
> > > ram_list.dirty_memory[].
> > > >
> > > > The free page bitmap got from the guest in my RFC patch has been
> > > > filtered out the 'hole', so the bit N of the free page bitmap and
> > > > the bit N in ram_list.dirty_memory[DIRTY_MEMORY_MIGRATION]-
> >blocks
> > > > are corresponding to the same guest page.  Right?
> > > > If it's true, I think I am doing the right thing?
> > > >
> > > >
> > > > Liang
> > >
> > > There's no guarantee that there's a single 'hole'
> > > even on the PC, and we want balloon to be portable.
> > >
> >
> > As long as we know how many 'hole' and where the holes are.
> > we can filter out them. QEMU should have this kind of information.
> > I know my RFC patch passed an arch specific free page bitmap is not a
> > good idea. So in my design, I changed this by passing a loose free
> > page bitmap which contains the holes, and let QEMU to filter out the
> > holes according to some arch specific information. This can make balloon be
> portable.
> 
> Only if you write the arch specific thing for all arches.

I plan to keep a function stub for each arch to implement. And I have done that 
for X86.

> This concept of holes simply does not match how we manage memory in
> qemu.

I don't know if it works for other arches, but it works for X86.

> > > So I'm not sure I understand what your patch is doing, do you mean
> > > you pass the GPA to ram addr mapping from host to guest?
> > >
> >
> > No, my patch passed the 'lowmem', which helps to filter out the hole from
> host to guest.
> > The design has changed this.
> >
> > > That can be made to work but it's not a good idea, and I don't see
> > > why would it be faster than doing the same translation host side.
> > >
> >
> > It's faster because there is no address translation, most of them are bitmap
> operation.
> >
> > Liang
> 
> It's just wrong to say that there is no translation. Of course there has to be
> one.
> 
> Fundamentally guest uses GPA as an offset in the bitmap. QEMU uses
> ram_addr_t for migration so you either translate GPA to ram_addr_t or
> ram_addr_t to GPA.
> 
> I think the reason for the speedup that you observe is that you only need to
> translate ram_addr_t to GPA once per ramblock, which is much faster than
> translating GPA to ram_addr_t for each page.
> 

Yes, exactly! 

Liang



Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages

2016-03-24 Thread Li, Liang Z
> > > Sorry, why do I think what? That ram_addr_t is not guaranteed to
> > > equal GPA of the block?
> > >
> >
> > I mean why do you think that's can't guaranteed to work.
> > Yes, ram_addr_t is not guaranteed to equal GPA of the block. But I
> > didn't use them as GPA. The code in the filter_out_guest_free_pages()
> > in my patch just follow the style of the latest change of
> ram_list.dirty_memory[].
> >
> > The free page bitmap got from the guest in my RFC patch has been
> > filtered out the 'hole', so the bit N of the free page bitmap and the
> > bit N in ram_list.dirty_memory[DIRTY_MEMORY_MIGRATION]->blocks are
> > corresponding to the same guest page.  Right?
> > If it's true, I think I am doing the right thing?
> >
> >
> > Liang
> 
> There's no guarantee that there's a single 'hole'
> even on the PC, and we want balloon to be portable.
> 

As long as we know how many 'holes' there are and where they are, we can filter
them out. QEMU should have this kind of information.
I know that having my RFC patch pass an arch-specific free page bitmap is not
a good idea. So in my design I changed this: the guest passes a loose free page
bitmap which still contains the holes, and QEMU filters out the holes
according to some arch-specific information. This can make the balloon portable.
  
> So I'm not sure I understand what your patch is doing, do you mean you pass
> the GPA to ram addr mapping from host to guest?
> 

No, my patch passed 'lowmem' from the host to the guest, which helps to filter out
the hole. The design has changed this.
The design has changed this.

> That can be made to work but it's not a good idea, and I don't see why would
> it be faster than doing the same translation host side.
> 

It's faster because there is no address translation; most of the work is bitmap
operations.

Liang

> 
> > > E.g. HACKING says:
> > >   Use hwaddr for guest physical addresses except pcibus_t
> > >   for PCI addresses.  In addition, ram_addr_t is a QEMU internal
> > > address
> > >   space that maps guest RAM physical addresses into an intermediate
> > >   address space that can map to host virtual address spaces.
> > >
> > >
> > > --



Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages

2016-03-24 Thread Li, Liang Z
> > > > >> Given the typical speed of networks; it wouldn't do too much
> > > > >> harm to start sending assuming all pages are dirty and then
> > > > >> when the guest finally gets around to finishing the bitmap then
> > > > >> update, so it's asynchronous - and then if the guest never
> > > > >> responds we don't really
> > > care.
> > > > >
> > > > >Indeed, thanks!
> > > > >
> > > >
> > > > This is interesting. By doing so, the threshold I mentioned in
> > > > another mail is not necessary, since we can do it in parallel.
> > >
> > > Actually I just realised it's a little more complex; we can't sync
> > > the dirty bitmap again from the guest until after we've received the
> guests 'free'
> > > bitmap; that's because we wouldn't know if a 'dirty' page reflected
> > > that a page declared as 'free' had now been reused - so there is
> > > still an ordering there.
> > >
> > > Dave
> >
> > Not very complex, we can implement like this:
> >
> > 1. Set all the bits in the migration_bitmap_rcu->bmap to 1 2. Clear
> > all the bits in ram_list. dirty_memory[DIRTY_MEMORY_MIGRATION]
> > 3. Send the get_free_page_bitmap request 4. Start to send pages to
> > destination and check if the free_page_bitmap is ready
> > if (is_ready) {
> >   filter out the free pages from  migration_bitmap_rcu->bmap;
> >   migration_bitmap_sync();
> > }
> >  continue until live migration complete.
> >
> >
> > Is that right?
> 
> The order I'm trying to understand is something like:
> 
> a) Send the get_free_page_bitmap request
> b) Start sending pages
> c) Reach the end of memory
>   [ is_ready is false - guest hasn't made free map yet ]
> d) normal migration_bitmap_sync() at end of first pass
> e) Carry on sending dirty pages
> f) is_ready is true
>   f.1) filter out free pages?
>   f.2) migration_bitmap_sync()
> 
> It's f.1 I'm worried about.  If the guest started generating the free bitmap
> before (d), then a page marked as 'free' in f.1 might have become dirty
> before (d) and so (f.2) doesn't set the dirty again, and so we can't filter 
> out
> pages in f.1.
> 

As you described, the order is incorrect.

Liang

> Dave
> 
> >
> > Liang
> > >
> > > >
> > > > >Liang
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages

2016-03-24 Thread Paolo Bonzini


On 24/03/2016 16:16, Li, Liang Z wrote:
> > There's no guarantee that there's a single 'hole'
> > even on the PC, and we want balloon to be portable.
> 
> As long as we know how many 'hole' and where the holes are.

The mapping between ram_addr_t and GPA is completely internal to QEMU.
Passing it to the guest is a layering violation.

Paolo

> we can filter out them. QEMU should have this kind of information.
> I know my RFC patch passed an arch specific free page bitmap is not
> a good idea. So in my design, I changed this by passing a loose free page
> bitmap which contains the holes, and let QEMU to filter out the holes
> according to some arch specific information. This can make balloon be 
> portable.
>   



Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages

2016-03-24 Thread Michael S. Tsirkin
On Thu, Mar 24, 2016 at 10:16:47AM +, Li, Liang Z wrote:
> > On Thu, Mar 24, 2016 at 01:19:40AM +, Li, Liang Z wrote:
> > > > > > > 2. Why not use virtio-balloon
> > > > > > > Actually, the virtio-balloon can do the similar thing by
> > > > > > > inflating the balloon before live migration, but its
> > > > > > > performance is no good, for an 8GB idle guest just boots, it
> > > > > > > takes about 5.7 Sec to inflate the balloon to 7GB, but it only
> > > > > > > takes 25ms to get a valid free page bitmap from the guest.
> > > > > > > There are some of reasons for the bad performance of
> > > > > > > vitio-balloon:
> > > > > > > a. allocating pages (5%, 304ms)
> > > > > >
> > > > > > Interesting. This is definitely worth improving in guest kernel.
> > > > > > Also, will it be faster if we allocate and pass to guest huge pages
> > instead?
> > > > > > Might speed up madvise as well.
> > > > >
> > > > > Maybe.
> > > > >
> > > > > > > b. sending PFNs to host (71%, 4194ms)
> > > > > >
> > > > > > OK, so we probably should teach balloon to pass huge lists in 
> > > > > > bitmaps.
> > > > > > Will be benefitial for regular balloon operation, as well.
> > > > > >
> > > > >
> > > > > Agree. Current balloon just send 256 PFNs a time, that's too few
> > > > > and lead to too many times of virtio transmission, that's the main
> > > > > reason for the
> > > > bad performance.
> > > > > Change the VIRTIO_BALLOON_ARRAY_PFNS_MAX to a large value can
> > > > improve
> > > > > the performance significant. Maybe we should increase it before
> > > > > doing the further optimization, do you think so ?
> > > >
> > > > We could push it up a bit higher: 256 is 1kbyte in size, so we can
> > > > make it 3x bigger and still fit struct virtio_balloon is a single
> > > > page. But if we are going to add the bitmap variant anyway, we probably
> > shouldn't bother.
> > > >
> > > > > > > c. address translation and madvise() operation (24%, 1423ms)
> > > > > >
> > > > > > How is this split between translation and madvise?  I suspect
> > > > > > it's mostly madvise since you need translation when using bitmap as
> > well.
> > > > > > Correct? Could you measure this please?  Also, what if we use
> > > > > > the new MADV_FREE instead?  By how much would this help?
> > > > > >
> > > > > For the current balloon, address translation is needed.
> > > > > But for live migration, there is no need to do address translation.
> > > >
> > > > Well you need ram address in order to clear the dirty bit.
> > > > How would you get it without translation?
> > > >
> > >
> > > If you means that kind of address translation, yes, it need.
> > > What I want to say is, filter out the free page can be done by bitmap
> > operation.
> > >
> > > Liang
> > 
> > OK so I see that your patches use block->offset in struct RAMBlock to look 
> > up
> > bits in guest-supplied bitmap.
> > I don't think that's guaranteed to work.
> 
> It's part of the bitmap operation, because the latest change of the 
> ram_list.dirty_memory.
> Why do you think so? Could you tell me the reason?
> 
> Liang

Sorry, why do I think what? That ram_addr_t is not guaranteed to equal GPA of 
the block?

E.g. HACKING says:
Use hwaddr for guest physical addresses except pcibus_t
for PCI addresses.  In addition, ram_addr_t is a QEMU internal address
space that maps guest RAM physical addresses into an intermediate
address space that can map to host virtual address spaces.


-- 
MST



Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages

2016-03-24 Thread Dr. David Alan Gilbert
* Li, Liang Z (liang.z...@intel.com) wrote:
> > * Wei Yang (richard.weiy...@huawei.com) wrote:
> > > On Wed, Mar 23, 2016 at 06:48:22AM +, Li, Liang Z wrote:
> > > [...]
> > > >> > 8. Pseudo code
> > > >> > Dirty page logging should be enabled before getting the free page
> > > >> > information from guest, this is important because during the
> > > >> > process of getting free pages, some free pages may be used and
> > > >> > written by the guest, dirty page logging can trace these pages.
> > > >> > The pseudo code is like below:
> > > >> >
> > > >> > ---
> > > >> > MigrationState *s = migrate_get_current();
> > > >> > ...
> > > >> >
> > > >> > memory_global_dirty_log_start();
> > > >> >
> > > >> > if (get_guest_mem_info()) {
> > > >> > while (!get_free_page_bmap(free_page_bitmap,
> > > >> > drop_page_cache)
> > > >> &&
> > > >> >s->state != MIGRATION_STATUS_CANCELLING) {
> > > >> > usleep(1000) // sleep for 1 ms
> > > >> > }
> > > >> >
> > > >> > tighten_free_page_bmap =
> > > >> tighten_guest_free_pages(free_page_bitmap);
> > > >> > filter_out_guest_free_pages(tighten_free_page_bmap);
> > > >> > }
> > > >>
> > > >> Given the typical speed of networks; it wouldn't do too much harm
> > > >> to start sending assuming all pages are dirty and then when the
> > > >> guest finally gets around to finishing the bitmap then update, so
> > > >> it's asynchronous - and then if the guest never responds we don't 
> > > >> really
> > care.
> > > >
> > > >Indeed, thanks!
> > > >
> > >
> > > This is interesting. By doing so, the threshold I mentioned in another
> > > mail is not necessary, since we can do it in parallel.
> > 
> > Actually I just realised it's a little more complex; we can't sync the dirty
> > bitmap again from the guest until after we've received the guests 'free'
> > bitmap; that's because we wouldn't know if a 'dirty' page reflected that a
> > page declared as 'free' had now been reused - so there is still an ordering
> > there.
> > 
> > Dave
> 
> Not very complex, we can implement like this:
> 
> 1. Set all the bits in the migration_bitmap_rcu->bmap to 1
> 2. Clear all the bits in ram_list. dirty_memory[DIRTY_MEMORY_MIGRATION]
> 3. Send the get_free_page_bitmap request
> 4. Start to send pages to destination and check if the free_page_bitmap is 
> ready
> if (is_ready) {
>   filter out the free pages from  migration_bitmap_rcu->bmap;
>   migration_bitmap_sync();
> } 
>  continue until live migration complete. 
> 
> 
> Is that right?

The order I'm trying to understand is something like:

a) Send the get_free_page_bitmap request
b) Start sending pages
c) Reach the end of memory
  [ is_ready is false - guest hasn't made free map yet ]
d) normal migration_bitmap_sync() at end of first pass
e) Carry on sending dirty pages
f) is_ready is true
  f.1) filter out free pages?
  f.2) migration_bitmap_sync()

It's f.1 I'm worried about.  If the guest started generating the
free bitmap before (d), then a page marked as 'free' in f.1 
might have become dirty before (d), and so (f.2) doesn't set
the dirty bit again, and so we can't filter out pages in f.1.

Dave
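
One way to encode that ordering constraint is to remember which "sync generation" the
free-bitmap request was issued in, and only allow filtering if no sync has happened
since. A rough sketch, with names invented for the example:

#include <stdbool.h>
#include <stdint.h>

static uint64_t sync_generation;      /* bumped on every bitmap sync */
static uint64_t free_req_generation;  /* snapshot taken when the request is sent */

void request_free_page_bitmap(void)
{
    free_req_generation = sync_generation;
    /* ... actually ask the guest for its free-page bitmap ... */
}

void sync_dirty_bitmap(void)
{
    sync_generation++;
    /* ... the real migration_bitmap_sync() work ... */
}

/*
 * Filtering (f.1) is only safe if no sync (d) has run since the request
 * (a); otherwise a page reported as free may have been reused and dirtied
 * before (d), and the later sync (f.2) would not mark it dirty again.
 */
bool free_bitmap_safe_to_apply(void)
{
    return free_req_generation == sync_generation;
}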

> 
> Liang
> > 
> > >
> > > >Liang
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages

2016-03-24 Thread Li, Liang Z
> On Thu, Mar 24, 2016 at 01:19:40AM +, Li, Liang Z wrote:
> > > > > > 2. Why not use virtio-balloon
> > > > > > Actually, the virtio-balloon can do the similar thing by
> > > > > > inflating the balloon before live migration, but its
> > > > > > performance is no good, for an 8GB idle guest just boots, it
> > > > > > takes about 5.7 Sec to inflate the balloon to 7GB, but it only
> > > > > > takes 25ms to get a valid free page bitmap from the guest.
> > > > > > There are some of reasons for the bad performance of
> > > > > > vitio-balloon:
> > > > > > a. allocating pages (5%, 304ms)
> > > > >
> > > > > Interesting. This is definitely worth improving in guest kernel.
> > > > > Also, will it be faster if we allocate and pass to guest huge pages
> instead?
> > > > > Might speed up madvise as well.
> > > >
> > > > Maybe.
> > > >
> > > > > > b. sending PFNs to host (71%, 4194ms)
> > > > >
> > > > > OK, so we probably should teach balloon to pass huge lists in bitmaps.
> > > > > Will be benefitial for regular balloon operation, as well.
> > > > >
> > > >
> > > > Agree. Current balloon just send 256 PFNs a time, that's too few
> > > > and lead to too many times of virtio transmission, that's the main
> > > > reason for the
> > > bad performance.
> > > > Change the VIRTIO_BALLOON_ARRAY_PFNS_MAX to a large value can
> > > improve
> > > > the performance significant. Maybe we should increase it before
> > > > doing the further optimization, do you think so ?
> > >
> > > We could push it up a bit higher: 256 is 1kbyte in size, so we can
> > > make it 3x bigger and still fit struct virtio_balloon is a single
> > > page. But if we are going to add the bitmap variant anyway, we probably
> shouldn't bother.
> > >
> > > > > > c. address translation and madvise() operation (24%, 1423ms)
> > > > >
> > > > > How is this split between translation and madvise?  I suspect
> > > > > it's mostly madvise since you need translation when using bitmap as
> well.
> > > > > Correct? Could you measure this please?  Also, what if we use
> > > > > the new MADV_FREE instead?  By how much would this help?
> > > > >
> > > > For the current balloon, address translation is needed.
> > > > But for live migration, there is no need to do address translation.
> > >
> > > Well you need ram address in order to clear the dirty bit.
> > > How would you get it without translation?
> > >
> >
> > If you means that kind of address translation, yes, it need.
> > What I want to say is, filter out the free page can be done by bitmap
> operation.
> >
> > Liang
> 
> OK so I see that your patches use block->offset in struct RAMBlock to look up
> bits in guest-supplied bitmap.
> I don't think that's guaranteed to work.

It's part of the bitmap operation, following the latest change to
ram_list.dirty_memory.
Why do you think so? Could you tell me the reason?

Liang




Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages

2016-03-24 Thread Li, Liang Z
> * Wei Yang (richard.weiy...@huawei.com) wrote:
> > On Wed, Mar 23, 2016 at 06:48:22AM +, Li, Liang Z wrote:
> > [...]
> > >> > 8. Pseudo code
> > >> > Dirty page logging should be enabled before getting the free page
> > >> > information from guest, this is important because during the
> > >> > process of getting free pages, some free pages may be used and
> > >> > written by the guest, dirty page logging can trace these pages.
> > >> > The pseudo code is like below:
> > >> >
> > >> > ---
> > >> > MigrationState *s = migrate_get_current();
> > >> > ...
> > >> >
> > >> > memory_global_dirty_log_start();
> > >> >
> > >> > if (get_guest_mem_info()) {
> > >> > while (!get_free_page_bmap(free_page_bitmap,
> > >> > drop_page_cache)
> > >> &&
> > >> >s->state != MIGRATION_STATUS_CANCELLING) {
> > >> > usleep(1000) // sleep for 1 ms
> > >> > }
> > >> >
> > >> > tighten_free_page_bmap =
> > >> tighten_guest_free_pages(free_page_bitmap);
> > >> > filter_out_guest_free_pages(tighten_free_page_bmap);
> > >> > }
> > >>
> > >> Given the typical speed of networks; it wouldn't do too much harm
> > >> to start sending assuming all pages are dirty and then when the
> > >> guest finally gets around to finishing the bitmap then update, so
> > >> it's asynchronous - and then if the guest never responds we don't really
> care.
> > >
> > >Indeed, thanks!
> > >
> >
> > This is interesting. By doing so, the threshold I mentioned in another
> > mail is not necessary, since we can do it in parallel.
> 
> Actually I just realised it's a little more complex; we can't sync the dirty
> bitmap again from the guest until after we've received the guests 'free'
> bitmap; that's because we wouldn't know if a 'dirty' page reflected that a
> page declared as 'free' had now been reused - so there is still an ordering
> there.
> 
> Dave

Not very complex, we can implement it like this:

1. Set all the bits in migration_bitmap_rcu->bmap to 1
2. Clear all the bits in ram_list.dirty_memory[DIRTY_MEMORY_MIGRATION]
3. Send the get_free_page_bitmap request
4. Start to send pages to the destination and check if the free_page_bitmap is ready
   if (is_ready) {
       filter out the free pages from migration_bitmap_rcu->bmap;
       migration_bitmap_sync();
   }
   continue until live migration completes.


Is that right?

Liang
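
For illustration only, the flow above might be sketched in C roughly as follows. The
helper names are invented for the example, and, as the rest of the thread discusses,
the safety of the "filter then sync" step is the contentious part.

#include <stdbool.h>

/* Stand-in helpers; none of these are real QEMU functions. */
void set_all_bits(unsigned long *bmap);                 /* step 1 */
void clear_dirty_memory_migration(void);                /* step 2 */
void request_free_page_bitmap(void);                    /* step 3 */
bool free_page_bitmap_ready(void);
void filter_out_free_pages(unsigned long *bmap);
void migration_bitmap_sync(void);
bool send_some_pages(unsigned long *bmap);              /* returns false when done */

void migrate_ram(unsigned long *migration_bitmap)
{
    bool filtered = false;

    set_all_bits(migration_bitmap);                     /* step 1 */
    clear_dirty_memory_migration();                     /* step 2 */
    request_free_page_bitmap();                         /* step 3 */

    /* step 4: stream pages; once the guest's bitmap shows up, filter and resync */
    while (send_some_pages(migration_bitmap)) {
        if (!filtered && free_page_bitmap_ready()) {
            filter_out_free_pages(migration_bitmap);
            migration_bitmap_sync();
            filtered = true;
        }
    }
}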
> 
> >
> > >Liang


Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages

2016-03-24 Thread Michael S. Tsirkin
On Thu, Mar 24, 2016 at 01:19:40AM +, Li, Liang Z wrote:
> > > > > 2. Why not use virtio-balloon
> > > > > Actually, the virtio-balloon can do the similar thing by inflating
> > > > > the balloon before live migration, but its performance is no good,
> > > > > for an 8GB idle guest just boots, it takes about 5.7 Sec to
> > > > > inflate the balloon to 7GB, but it only takes 25ms to get a valid
> > > > > free page bitmap from the guest.  There are some of reasons for
> > > > > the bad performance of
> > > > > vitio-balloon:
> > > > > a. allocating pages (5%, 304ms)
> > > >
> > > > Interesting. This is definitely worth improving in guest kernel.
> > > > Also, will it be faster if we allocate and pass to guest huge pages 
> > > > instead?
> > > > Might speed up madvise as well.
> > >
> > > Maybe.
> > >
> > > > > b. sending PFNs to host (71%, 4194ms)
> > > >
> > > > OK, so we probably should teach balloon to pass huge lists in bitmaps.
> > > > Will be benefitial for regular balloon operation, as well.
> > > >
> > >
> > > Agree. Current balloon just send 256 PFNs a time, that's too few and
> > > lead to too many times of virtio transmission, that's the main reason for 
> > > the
> > bad performance.
> > > Change the VIRTIO_BALLOON_ARRAY_PFNS_MAX to a large value can
> > improve
> > > the performance significant. Maybe we should increase it before doing
> > > the further optimization, do you think so ?
> > 
> > We could push it up a bit higher: 256 is 1kbyte in size, so we can make it 
> > 3x
> > bigger and still fit struct virtio_balloon is a single page. But if we are 
> > going to
> > add the bitmap variant anyway, we probably shouldn't bother.
> > 
> > > > > c. address translation and madvise() operation (24%, 1423ms)
> > > >
> > > > How is this split between translation and madvise?  I suspect it's
> > > > mostly madvise since you need translation when using bitmap as well.
> > > > Correct? Could you measure this please?  Also, what if we use the
> > > > new MADV_FREE instead?  By how much would this help?
> > > >
> > > For the current balloon, address translation is needed.
> > > But for live migration, there is no need to do address translation.
> > 
> > Well you need ram address in order to clear the dirty bit.
> > How would you get it without translation?
> > 
> 
> If you means that kind of address translation, yes, it need.
> What I want to say is, filter out the free page can be done by bitmap 
> operation.
> 
> Liang

OK so I see that your patches use block->offset in struct RAMBlock
to look up bits in guest-supplied bitmap.
I don't think that's guaranteed to work.

-- 
MST



Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages

2016-03-24 Thread Dr. David Alan Gilbert
* Wei Yang (richard.weiy...@huawei.com) wrote:
> On Wed, Mar 23, 2016 at 06:48:22AM +, Li, Liang Z wrote:
> [...]
> >> > 8. Pseudo code
> >> > Dirty page logging should be enabled before getting the free page
> >> > information from guest, this is important because during the process
> >> > of getting free pages, some free pages may be used and written by the
> >> > guest, dirty page logging can trace these pages. The pseudo code is
> >> > like below:
> >> >
> >> > ---
> >> > MigrationState *s = migrate_get_current();
> >> > ...
> >> >
> >> > memory_global_dirty_log_start();
> >> >
> >> > if (get_guest_mem_info()) {
> >> > while (!get_free_page_bmap(free_page_bitmap,  drop_page_cache)
> >> &&
> >> >s->state != MIGRATION_STATUS_CANCELLING) {
> >> > usleep(1000) // sleep for 1 ms
> >> > }
> >> >
> >> > tighten_free_page_bmap =
> >> tighten_guest_free_pages(free_page_bitmap);
> >> > filter_out_guest_free_pages(tighten_free_page_bmap);
> >> > }
> >> 
> >> Given the typical speed of networks; it wouldn't do too much harm to start
> >> sending assuming all pages are dirty and then when the guest finally gets
> >> around to finishing the bitmap then update, so it's asynchronous - and 
> >> then if
> >> the guest never responds we don't really care.
> >
> >Indeed, thanks!
> >
> 
> This is interesting. By doing so, the threshold I mentioned in another mail is
> not necessary, since we can do it in parallel.

Actually I just realised it's a little more complex; we can't sync the dirty
bitmap again from the guest until after we've received the guest's 'free' bitmap;
that's because we wouldn't know if a 'dirty' page reflected that a page declared
as 'free' had now been reused - so there is still an ordering there.

Dave

> 
> >Liang
> >> 
> >> Dave
> >> 
> >> >
> >> > migration_bitmap_sync();
> >> > ...
> >> >
> >> > ---
> >> >
> >> >
> >> > --
> >> > 1.9.1
> >> >
> >> --
> >> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
> -- 
> Richard Yang\nHelp you, Help me
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages

2016-03-23 Thread Wei Yang
On Thu, Mar 24, 2016 at 01:32:25AM +, Li, Liang Z wrote:
>> >> >> >
>> >> >> >6. Handling page cache in the guest The memory used for page
>> >> >> >cache in the guest will change depends on the workload, if guest
>> >> >> >run some block IO intensive work load, there will
>> >> >>
>> >> >> Would this improvement benefit a lot when guest only has little free
>> page?
>> >> >
>> >> >Yes, the improvement is very obvious.
>> >> >
>> >>
>> >> Good to know this.
>> >>
>> >> >> In your Performance data Case 2, I think it mimic this kind of case.
>> >> >> While the memory consuming task is stopped before migration. If it
>> >> >> continues, would we still perform better than before?
>> >> >
>> >> >Actually, my RFC patch didn't consider the page cache, Roman raised
>> >> >this
>> >> issue.
>> >> >so I add this part in this doc.
>> >> >
>> >> >Case 2 didn't mimic this kind of scenario, the work load is an
>> >> >memory consuming work load, not an block IO intensive work load, so
>> >> >there are not many page cache in this case.
>> >> >
>> >> >If the work load in case 2 continues, as long as it not write all
>> >> >the memory it allocates, we still can get benefits.
>> >> >
>> >>
>> >> Sounds I have little knowledge on page cache, and its relationship
>> >> between free page and I/O intensive work.
>> >>
>> >> Here is some personal understanding, I would appreciate if you could
>> >> correct me.
>> >>
>> >> +-+
>> >> |PageCache|
>> >> +-+
>> >>   +-+-+-+-+
>> >>   |Page |Page |Free Page|Page |
>> >>   +-+-+-+-+
>> >>
>> >> Free Page is a page in the free_list, PageCache is some page cached
>> >> in CPU's cache line?
>> >
>> >No, page cache is quite different with CPU cache line.
>> >" In computing, a page cache, sometimes also called disk cache,[2] is a
>> >transparent cache  for the pages originating from a secondary storage
>> device such as a hard disk drive (HDD).
>> > The operating system keeps a page cache in otherwise unused portions
>> >of the main  memory (RAM), resulting in quicker access to the contents
>> >of cached pages and overall performance improvements "
>> >you can refer to https://en.wikipedia.org/wiki/Page_cache
>> >for more details.
>> >
>> 
>> My poor knowledge~ Should google it before I imagine the meaning of the
>> terminology.
>> 
>> If my understanding is correct, the Page Cache is counted as Free Page, while
>> actually we should migrate them instead of filter them.
>
>No, the Page Cache is not counted as Free Page ...

OK, I misunderstood the concept in the wiki.

The Page Cache is a trade-off between the Free Page percentage and the I/O
performance in the guest.

>
>Liang

-- 
Richard Yang\nHelp you, Help me



Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages

2016-03-23 Thread Li, Liang Z
> >> >> >
> >> >> >6. Handling page cache in the guest The memory used for page
> >> >> >cache in the guest will change depends on the workload, if guest
> >> >> >run some block IO intensive work load, there will
> >> >>
> >> >> Would this improvement benefit a lot when guest only has little free
> page?
> >> >
> >> >Yes, the improvement is very obvious.
> >> >
> >>
> >> Good to know this.
> >>
> >> >> In your Performance data Case 2, I think it mimic this kind of case.
> >> >> While the memory consuming task is stopped before migration. If it
> >> >> continues, would we still perform better than before?
> >> >
> >> >Actually, my RFC patch didn't consider the page cache, Roman raised
> >> >this
> >> issue.
> >> >so I add this part in this doc.
> >> >
> >> >Case 2 didn't mimic this kind of scenario, the work load is an
> >> >memory consuming work load, not an block IO intensive work load, so
> >> >there are not many page cache in this case.
> >> >
> >> >If the work load in case 2 continues, as long as it not write all
> >> >the memory it allocates, we still can get benefits.
> >> >
> >>
> >> Sounds I have little knowledge on page cache, and its relationship
> >> between free page and I/O intensive work.
> >>
> >> Here is some personal understanding, I would appreciate if you could
> >> correct me.
> >>
> >> +-+
> >> |PageCache|
> >> +-+
> >>   +-+-+-+-+
> >>   |Page |Page |Free Page|Page |
> >>   +-+-+-+-+
> >>
> >> Free Page is a page in the free_list, PageCache is some page cached
> >> in CPU's cache line?
> >
> >No, page cache is quite different with CPU cache line.
> >" In computing, a page cache, sometimes also called disk cache,[2] is a
> >transparent cache  for the pages originating from a secondary storage
> device such as a hard disk drive (HDD).
> > The operating system keeps a page cache in otherwise unused portions
> >of the main  memory (RAM), resulting in quicker access to the contents
> >of cached pages and overall performance improvements "
> >you can refer to https://en.wikipedia.org/wiki/Page_cache
> >for more details.
> >
> 
> My poor knowledge~ Should google it before I imagine the meaning of the
> terminology.
> 
> If my understanding is correct, the Page Cache is counted as Free Page, while
> actually we should migrate them instead of filter them.

No, the Page Cache is not counted as Free Page ...

Liang


Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages

2016-03-23 Thread Wei Yang
On Wed, Mar 23, 2016 at 06:48:22AM +, Li, Liang Z wrote:
[...]
>> > 8. Pseudo code
>> > Dirty page logging should be enabled before getting the free page
>> > information from guest, this is important because during the process
>> > of getting free pages, some free pages may be used and written by the
>> > guest, dirty page logging can trace these pages. The pseudo code is
>> > like below:
>> >
>> > ---
>> > MigrationState *s = migrate_get_current();
>> > ...
>> >
>> > memory_global_dirty_log_start();
>> >
>> > if (get_guest_mem_info()) {
>> > while (!get_free_page_bmap(free_page_bitmap,  drop_page_cache)
>> &&
>> >s->state != MIGRATION_STATUS_CANCELLING) {
>> > usleep(1000) // sleep for 1 ms
>> > }
>> >
>> > tighten_free_page_bmap =
>> tighten_guest_free_pages(free_page_bitmap);
>> > filter_out_guest_free_pages(tighten_free_page_bmap);
>> > }
>> 
>> Given the typical speed of networks; it wouldn't do too much harm to start
>> sending assuming all pages are dirty and then when the guest finally gets
>> around to finishing the bitmap then update, so it's asynchronous - and then 
>> if
>> the guest never responds we don't really care.
>
>Indeed, thanks!
>

This is interesting. By doing so, the threshold I mentioned in another mail is
not necessary, since we can do it in parallel.

>Liang
>> 
>> Dave
>> 
>> >
>> > migration_bitmap_sync();
>> > ...
>> >
>> > ---
>> >
>> >
>> > --
>> > 1.9.1
>> >
>> --
>> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
-- 
Richard Yang\nHelp you, Help me



Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages

2016-03-23 Thread Li, Liang Z
> >>> >From guest's point of view, there are some pages currently not used
> >>> >by
> >>
> >> I see in your original RFC patch and your RFC doc, this line starts
> >> with a character '>'. Not sure this one has a special purpose?
> >>
> >
> > No special purpose. Maybe it's caused by the email client. I didn't
> > find the character in the original doc.
> >
> 
> Yes, it's an artifact used by many mailers so that mailboxes don't get
> confused by a bare "From" at the start of a line but in the middle rather than
> the start of a message.
> 
> It's possible to avoid the artifact by using quoted-printable and escaping the
> 'F', and I'm honestly a bit surprised that git doesn't do it automatically.
> 

Thanks for your explanation!

Liang


Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages

2016-03-23 Thread Li, Liang Z
> > > > 2. Why not use virtio-balloon
> > > > Actually, the virtio-balloon can do the similar thing by inflating
> > > > the balloon before live migration, but its performance is no good,
> > > > for an 8GB idle guest just boots, it takes about 5.7 Sec to
> > > > inflate the balloon to 7GB, but it only takes 25ms to get a valid
> > > > free page bitmap from the guest.  There are some of reasons for
> > > > the bad performance of
> > > > vitio-balloon:
> > > > a. allocating pages (5%, 304ms)
> > >
> > > Interesting. This is definitely worth improving in guest kernel.
> > > Also, will it be faster if we allocate and pass to guest huge pages 
> > > instead?
> > > Might speed up madvise as well.
> >
> > Maybe.
> >
> > > > b. sending PFNs to host (71%, 4194ms)
> > >
> > > OK, so we probably should teach balloon to pass huge lists in bitmaps.
> > > Will be benefitial for regular balloon operation, as well.
> > >
> >
> > Agree. Current balloon just send 256 PFNs a time, that's too few and
> > lead to too many times of virtio transmission, that's the main reason for 
> > the
> bad performance.
> > Change the VIRTIO_BALLOON_ARRAY_PFNS_MAX to a large value can
> improve
> > the performance significant. Maybe we should increase it before doing
> > the further optimization, do you think so ?
> 
> We could push it up a bit higher: 256 is 1kbyte in size, so we can make it 3x
> bigger and still fit struct virtio_balloon is a single page. But if we are 
> going to
> add the bitmap variant anyway, we probably shouldn't bother.
> 
> > > > c. address translation and madvise() operation (24%, 1423ms)
> > >
> > > How is this split between translation and madvise?  I suspect it's
> > > mostly madvise since you need translation when using bitmap as well.
> > > Correct? Could you measure this please?  Also, what if we use the
> > > new MADV_FREE instead?  By how much would this help?
> > >
> > For the current balloon, address translation is needed.
> > But for live migration, there is no need to do address translation.
> 
> Well you need ram address in order to clear the dirty bit.
> How would you get it without translation?
> 

If you mean that kind of address translation, yes, it is needed.
What I want to say is that filtering out the free pages can be done with a bitmap operation.
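
As a minimal sketch of that bitmap operation (both bitmaps indexed by guest
page frame number, names illustrative only):

#include <stddef.h>

#define BITS_PER_LONG (8 * sizeof(unsigned long))

/*
 * Both bitmaps are indexed by guest PFN, so filtering out the free pages is a
 * plain word-wise AND-NOT; no GPA to HVA translation is needed on this path.
 */
static void filter_free_pages(unsigned long *migration_dirty,
                              const unsigned long *guest_free,
                              size_t nr_pages)
{
    size_t words = (nr_pages + BITS_PER_LONG - 1) / BITS_PER_LONG;

    for (size_t i = 0; i < words; i++) {
        migration_dirty[i] &= ~guest_free[i];
    }
}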

Liang


Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages

2016-03-23 Thread Wei Yang
On Wed, Mar 23, 2016 at 02:35:42PM +, Li, Liang Z wrote:
>> >No special purpose. Maybe it's caused by the email client. I didn't
>> >find the character in the original doc.
>> >
>> 
>> https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00715.html
>> 
>> You could take a look at this link, there is a '>' before From.
>
>Yes, there is. 
>
>> >> >
>> >> >6. Handling page cache in the guest
>> >> >The memory used for page cache in the guest will change depends on
>> >> >the workload, if guest run some block IO intensive work load, there
>> >> >will
>> >>
>> >> Would this improvement benefit a lot when guest only has little free page?
>> >
>> >Yes, the improvement is very obvious.
>> >
>> 
>> Good to know this.
>> 
>> >> In your Performance data Case 2, I think it mimic this kind of case.
>> >> While the memory consuming task is stopped before migration. If it
>> >> continues, would we still perform better than before?
>> >
>> >Actually, my RFC patch didn't consider the page cache, Roman raised this
>> issue.
>> >so I add this part in this doc.
>> >
>> >Case 2 didn't mimic this kind of scenario, the work load is an memory
>> >consuming work load, not an block IO intensive work load, so there are
>> >not many page cache in this case.
>> >
>> >If the work load in case 2 continues, as long as it not write all the
>> >memory it allocates, we still can get benefits.
>> >
>> 
>> Sounds I have little knowledge on page cache, and its relationship between
>> free page and I/O intensive work.
>> 
>> Here is some personal understanding, I would appreciate if you could correct
>> me.
>> 
>> +---------+
>> |PageCache|
>> +---------+
>>   +-----+-----+---------+-----+
>>   |Page |Page |Free Page|Page |
>>   +-----+-----+---------+-----+
>> 
>> Free Page is a page in the free_list, PageCache is some page cached in CPU's
>> cache line?
>
>No, page cache is quite different with CPU cache line.
>" In computing, a page cache, sometimes also called disk cache,[2] is a 
>transparent cache
> for the pages originating from a secondary storage device such as a hard disk 
> drive (HDD).
> The operating system keeps a page cache in otherwise unused portions of the 
> main
> memory (RAM), resulting in quicker access to the contents of cached pages and 
>overall performance improvements "
>you can refer to https://en.wikipedia.org/wiki/Page_cache
>for more details.
>

My poor knowledge~ I should have googled it before guessing at the meaning of the
terminology.

If my understanding is correct, the page cache is counted as free pages, while
actually we should migrate those pages instead of filtering them out.

>
>> When memory consuming task runs, it leads to little Free Page in the whole
>> system. What's the consequence when I/O intensive work runs? I guess it
>> still leads to little Free Page. And will have some problem in sync on
>> PageCache?
>> 
>> >>
>> >> I am thinking is it possible to have a threshold or configurable
>> >> threshold to utilize free page bitmap optimization?
>> >>
>> >
>> >Could you elaborate your idea? How does it work?
>> >
>> 
>> Let's back to Case 2. We run a memory consuming task which will leads to
>> little Free Page in the whole system. Which means from Qemu perspective,
>> little of the dirty_memory is filtered by Free Page list. My original 
>> question is
>> whether your solution benefits in this scenario. As you mentioned it works
>> fine. So maybe this threshold is not necessary.
>> 
>I didn't quite understand your question before. 
>The benefits we get depends on the  count of free pages we can filter out.
>This is always true.
>
>> My original idea is in Qemu we can calculate the percentage of the Free Page
>> in the whole system. If it finds there is only little percentage of Free 
>> Page,
>> then we don't need to bother to use this method.
>> 
>
>I got you. The threshold can be used for optimization, but the effect is very 
>limited.
>If there are only a few of free pages, the process of constructing the free 
>page
>bitmap is very quick. 
>But we can stop doing the following things, e.g. sending the free page bitmap 
>and doing
>the bitmap operation, theoretically, that may help to save some time, maybe 
>several ms.
>

Ha, you got what I mean.

>I think a VM has no free pages at all is very rare, in the worst case, there 
>are still several
> MB of free pages. The proper threshold should be determined by comparing  the 
> extra
> time spends on processing the free page bitmap and the time spends on sending
>the several MB of free pages though the network. If the formal is longer, we 
>can stop
>using this method. So we should take the network bandwidth into consideration, 
>it's 
>too complicated and not worth to do.
>

Yes, after some thinking, it may not be that easy or worthwhile to do this
optimization.

>Thanks
>
>Liang
>> Have a nice day~
>> 
>> >Liang
>> >
>> >>
>> >> --
>> >> Richard Yang
>> >> Help you, Help me
>> 

Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages

2016-03-23 Thread Wei Yang
On Wed, Mar 23, 2016 at 10:53:42AM -0600, Eric Blake wrote:
>On 03/23/2016 01:18 AM, Li, Liang Z wrote:
>

 >From guest's point of view, there are some pages currently not used by
>>>
>>> I see in your original RFC patch and your RFC doc, this line starts with a
>>> character '>'. Not sure this one has a special purpose?
>>>
>> 
>> No special purpose. Maybe it's caused by the email client. I didn't find the
>> character in the original doc.
>> 
>
>Yes, it's an artifact used by many mailers so that mailboxes don't get
>confused by a bare "From" at the start of a line but in the middle
>rather than the start of a message.
>
>It's possible to avoid the artifact by using quoted-printable and
>escaping the 'F', and I'm honestly a bit surprised that git doesn't do
>it automatically.
>

Oh, this is the first time I've noticed this, interesting~

>-- 
>Eric Blake   eblake redhat com   +1-919-301-3266
>Libvirt virtualization library http://libvirt.org
>



-- 
Wei Yang
Help you, Help me



Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages

2016-03-23 Thread Eric Blake
On 03/23/2016 01:18 AM, Li, Liang Z wrote:

>>>
>>> >From guest's point of view, there are some pages currently not used by
>>
>> I see in your original RFC patch and your RFC doc, this line starts with a
>> character '>'. Not sure this one has a special purpose?
>>
> 
> No special purpose. Maybe it's caused by the email client. I didn't find the
> character in the original doc.
> 

Yes, it's an artifact used by many mailers so that mailboxes don't get
confused by a bare "From" at the start of a line but in the middle
rather than the start of a message.

It's possible to avoid the artifact by using quoted-printable and
escaping the 'F', and I'm honestly a bit surprised that git doesn't do
it automatically.

-- 
Eric Blake   eblake redhat com   +1-919-301-3266
Libvirt virtualization library http://libvirt.org





Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages

2016-03-23 Thread Li, Liang Z
> >No special purpose. Maybe it's caused by the email client. I didn't
> >find the character in the original doc.
> >
> 
> https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00715.html
> 
> You could take a look at this link, there is a '>' before From.

Yes, there is. 

> >> >
> >> >6. Handling page cache in the guest
> >> >The memory used for page cache in the guest will change depends on
> >> >the workload, if guest run some block IO intensive work load, there
> >> >will
> >>
> >> Would this improvement benefit a lot when guest only has little free page?
> >
> >Yes, the improvement is very obvious.
> >
> 
> Good to know this.
> 
> >> In your Performance data Case 2, I think it mimic this kind of case.
> >> While the memory consuming task is stopped before migration. If it
> >> continues, would we still perform better than before?
> >
> >Actually, my RFC patch didn't consider the page cache, Roman raised this
> issue.
> >so I add this part in this doc.
> >
> >Case 2 didn't mimic this kind of scenario, the work load is an memory
> >consuming work load, not an block IO intensive work load, so there are
> >not many page cache in this case.
> >
> >If the work load in case 2 continues, as long as it not write all the
> >memory it allocates, we still can get benefits.
> >
> 
> Sounds I have little knowledge on page cache, and its relationship between
> free page and I/O intensive work.
> 
> Here is some personal understanding, I would appreciate if you could correct
> me.
> 
> +---------+
> |PageCache|
> +---------+
>   +-----+-----+---------+-----+
>   |Page |Page |Free Page|Page |
>   +-----+-----+---------+-----+
> 
> Free Page is a page in the free_list, PageCache is some page cached in CPU's
> cache line?

No, the page cache is quite different from a CPU cache line.
"In computing, a page cache, sometimes also called disk cache,[2] is a transparent
cache for the pages originating from a secondary storage device such as a hard disk
drive (HDD). The operating system keeps a page cache in otherwise unused portions of
the main memory (RAM), resulting in quicker access to the contents of cached pages
and overall performance improvements."
You can refer to https://en.wikipedia.org/wiki/Page_cache for more details.


> When memory consuming task runs, it leads to little Free Page in the whole
> system. What's the consequence when I/O intensive work runs? I guess it
> still leads to little Free Page. And will have some problem in sync on
> PageCache?
> 
> >>
> >> I am thinking is it possible to have a threshold or configurable
> >> threshold to utilize free page bitmap optimization?
> >>
> >
> >Could you elaborate your idea? How does it work?
> >
> 
> Let's back to Case 2. We run a memory consuming task which will leads to
> little Free Page in the whole system. Which means from Qemu perspective,
> little of the dirty_memory is filtered by Free Page list. My original 
> question is
> whether your solution benefits in this scenario. As you mentioned it works
> fine. So maybe this threshold is not necessary.
> 
I didn't quite understand your question before.
The benefit we get depends on the count of free pages we can filter out.
This is always true.

> My original idea is in Qemu we can calculate the percentage of the Free Page
> in the whole system. If it finds there is only little percentage of Free Page,
> then we don't need to bother to use this method.
> 

I got you. The threshold can be used for optimization, but the effect is very limited.
If there are only a few free pages, the process of constructing the free page bitmap
is very quick.
But we can skip the subsequent steps, e.g. sending the free page bitmap and doing the
bitmap operation; theoretically, that may help to save some time, maybe several ms.

I think a VM that has no free pages at all is very rare; in the worst case, there are
still several MB of free pages. The proper threshold should be determined by comparing
the extra time spent on processing the free page bitmap with the time spent on sending
those several MB of free pages through the network. If the former is longer, we can
stop using this method. So we would have to take the network bandwidth into
consideration; it's too complicated and not worth doing.
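
If we ever did want such a threshold, the check itself is trivial; a rough
sketch with made-up parameter names and an illustrative corner case:

#include <stdbool.h>
#include <stdint.h>

/*
 * Skip the free-page mechanism when processing the bitmap is expected to take
 * longer than simply sending the free pages over the wire.
 */
static bool free_page_filter_worthwhile(uint64_t free_bytes,
                                        uint64_t bandwidth_bytes_per_ms,
                                        uint64_t bitmap_process_ms)
{
    if (bandwidth_bytes_per_ms == 0) {
        return true;    /* unknown bandwidth: assume filtering helps */
    }
    return bitmap_process_ms < free_bytes / bandwidth_bytes_per_ms;
}

/*
 * Example: 8 MB free on a ~1 GB/s link (~1 MB/ms) with 20 ms of bitmap work
 * gives 20 ms > 8 ms, so filtering would not pay off in that corner case.
 */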

Thanks

Liang
> Have a nice day~
> 
> >Liang
> >
> >>
> >> --
> >> Richard Yang
> >> Help you, Help me
> 
> --
> Richard Yang
> Help you, Help me


Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages

2016-03-23 Thread Michael S. Tsirkin
On Wed, Mar 23, 2016 at 06:05:27AM +, Li, Liang Z wrote:
> > > To make things easier, I wrote this doc about the possible designs and
> > > my choices. Comments are welcome!
> > 
> > Thanks for putting this together, and especially for taking the trouble to
> > benchmark existing code paths!
> > 
> > I think these numbers do show that there are gains to be had from merging
> > your code with the existing balloon device. It will probably be a bit more 
> > work,
> > but I think it'll be worth it.
> > 
> > More comments below.
> > 
> 
> Thanks for your comments!
> 
> > > 2. Why not use virtio-balloon
> > > Actually, the virtio-balloon can do the similar thing by inflating the
> > > balloon before live migration, but its performance is no good, for an
> > > 8GB idle guest just boots, it takes about 5.7 Sec to inflate the
> > > balloon to 7GB, but it only takes 25ms to get a valid free page bitmap
> > > from the guest.  There are some of reasons for the bad performance of
> > > vitio-balloon:
> > > a. allocating pages (5%, 304ms)
> > 
> > Interesting. This is definitely worth improving in guest kernel.
> > Also, will it be faster if we allocate and pass to guest huge pages instead?
> > Might speed up madvise as well.
> 
> Maybe.
> 
> > > b. sending PFNs to host (71%, 4194ms)
> > 
> > OK, so we probably should teach balloon to pass huge lists in bitmaps.
> > Will be benefitial for regular balloon operation, as well.
> > 
> 
> Agree. Current balloon just send 256 PFNs a time, that's too few and lead to 
> too many times 
> of virtio transmission, that's the main reason for the bad performance.
> Change the VIRTIO_BALLOON_ARRAY_PFNS_MAX to a large value can  improve the
> performance significant. Maybe we should increase it before doing the further 
> optimization,
> do you think so ?

We could push it up a bit higher: 256 entries are 1 kbyte in size,
so we can make it 3x bigger and still fit struct virtio_balloon
in a single page. But if we are going to add the bitmap variant
anyway, we probably shouldn't bother.
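
A back-of-the-envelope check of that size argument; this is a simplified
stand-in, not the real struct virtio_balloon layout:

#include <assert.h>
#include <stdint.h>

#define PFNS_MAX_PROPOSED (3 * 256)      /* 768 entries, 3 KiB of PFNs */

struct balloon_sketch {
    void *vdev;                          /* other members: a few dozen bytes   */
    uint32_t num_pfns;
    uint32_t pfns[PFNS_MAX_PROPOSED];    /* 256 * 4 = 1 KiB today, 3 KiB at 3x */
};

/* Still comfortably inside one 4 KiB page even after the 3x bump. */
static_assert(sizeof(struct balloon_sketch) <= 4096, "fits in one page");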

> > > c. address translation and madvise() operation (24%, 1423ms)
> > 
> > How is this split between translation and madvise?  I suspect it's mostly
> > madvise since you need translation when using bitmap as well.
> > Correct? Could you measure this please?  Also, what if we use the new
> > MADV_FREE instead?  By how much would this help?
> > 
> For the current balloon, address translation is needed. 
> But for live migration, there is no need to do address translation.

Well you need ram address in order to clear the dirty bit.
How would you get it without translation?

> 
> I did a another try and got the following data:
>a. allocating pages (6.4%, 402ms)
>b. sending PFNs to host (68.3%, 4263ms)
>c. address translation (6.2%, 389ms)
>d. madvise (19.0%, 1188ms)
> 
> The address translation is a time consuming operation too.
> I will try MADV_FREE later.


Thanks!

> > Finally, we could teach balloon to skip madvise completely.
> > By how much would this help?
> > 
> > > Debugging shows the time spends on these operations are listed in the
> > > brackets above. By changing the VIRTIO_BALLOON_ARRAY_PFNS_MAX to
> > a
> > > large value, such as 16384, the time spends on sending the PFNs can be
> > > reduced to about 400ms, but it’s still too long.
> > > Obviously, the virtio-balloon mechanism has a bigger performance
> > > impact to the guest than the way we are trying to implement.
> > 
> > Since as we see some of the new interfaces might be benefitial to balloon as
> > well, I am rather of the opinion that extending the balloon (basically 3a)
> > might be the right thing to do.
> > 
> > > 3. Virtio interface
> > > There are three different ways of using the virtio interface to send
> > > the free page information.
> > > a. Extend the current virtio device
> > > The virtio spec has already defined some virtio devices, and we can
> > > extend one of these devices so as to use it to transport the free page
> > > information. It requires modifying the virtio spec.
> > 
> > You don't have to do it all by yourself by the way.
> > Submit the proposal to the oasis virtio tc mailing list, we will take it 
> > from there.
> > 
> That's great.
> 
> >> 4. Construct free page bitmap
> >> To minimize the space for saving free page information, it’s better to 
> >> use a bitmap to describe the free pages. There are two ways to 
> >> construct the free page bitmap.
> >> 
> >> a. Construct free page bitmap when demand (My choice) Guest can 
> >> allocate memory for the free page bitmap only when it receives the 
> >> request from QEMU, and set the free page bitmap by traversing the free 
> >> page list. The advantage of this way is that it’s quite simple and 
> >> easy to implement. The disadvantage is that the traversing operation 
> >> may consume quite a long time when there are a lot of free pages. 
> >> (About 20ms for 7GB free pages)
> >> 
> >> b. Update free page bitmap when allocating/freeing pages 

Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages

2016-03-23 Thread Wei Yang
On Wed, Mar 23, 2016 at 07:18:57AM +, Li, Liang Z wrote:
>> Hi, Liang
>> 
>> This is a very clear documentation of your work, I appreciated it a lot. 
>> Below
>> are some of my personal opinion and question.
>> 
>
>Thanks for your comments!
>
>> On Tue, Mar 22, 2016 at 03:43:49PM +0800, Liang Li wrote:
>> >I have sent the RFC version patch set for live migration optimization
>> >by skipping processing the free pages in the ram bulk stage and
>> >received a lot of comments. The related threads can be found at:
>> >
>> >https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00715.html
>> >https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00714.html
>> >https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00717.html
>> >https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00716.html
>> >https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00718.html
>> >
>> >https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00719.html
>> >https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00720.html
>> >https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00721.html
>> >
>> 
>> Actually there are two threads, Qemu thread and kernel thread. It would be
>> more clear for audience, if you just list two first mail for these two thread
>> respectively.
>> 
>
>Indeed,  my original version has this kind of information, but I removed it.
>
>> >To make things easier, I wrote this doc about the possible designs
>> >and my choices. Comments are welcome!
>> >
>> >Content
>> >===
>> >1. Background
>> >2. Why not use virtio-balloon
>> >3. Virtio interface
>> >4. Constructing free page bitmap
>> >5. Tighten free page bitmap
>> >6. Handling page cache in the guest
>> >7. APIs for live migration
>> >8. Pseudo code
>> >
>> >Details
>> >===
>> >1. Background
>> >As we know, in the ram bulk stage of live migration, current QEMU live
>> >migration implementation mark the all guest's RAM pages as dirtied in
>> >the ram bulk stage, all these pages will be checked for zero page
>> >first, and the page content will be sent to the destination depends on
>> >the checking result, that process consumes quite a lot of CPU cycles
>> >and network bandwidth.
>> >
>> >>From guest's point of view, there are some pages currently not used by
>> 
>> I see in your original RFC patch and your RFC doc, this line starts with a
>> character '>'. Not sure this one has a special purpose?
>> 
>
>No special purpose. Maybe it's caused by the email client. I didn't find the
>character in the original doc.
>

https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00715.html

You could take a look at this link, there is a '>' before From.

>> >the guest, guest doesn't care about the content in these pages. Free
>> >pages are this kind of pages which are not used by guest. We can make
>> >use of this fact and skip processing the free pages in the ram bulk
>> >stage, it can save a lot CPU cycles and reduce the network traffic
>> >while speed up the live migration process obviously.
>> >
>> >Usually, only the guest has the information of free pages. But it’s
>> >possible to let the guest tell QEMU it’s free page information by some
>> >mechanism. E.g. Through the virtio interface. Once QEMU get the free
>> >page information, it can skip processing these free pages in the ram
>> >bulk stage by clearing the corresponding bit of the migration bitmap.
>> >
>> >2. Why not use virtio-balloon
>> >Actually, the virtio-balloon can do the similar thing by inflating the
>> >balloon before live migration, but its performance is no good, for an
>> >8GB idle guest just boots, it takes about 5.7 Sec to inflate the
>> >balloon to 7GB, but it only takes 25ms to get a valid free page bitmap
>> >from the guest.  There are some of reasons for the bad performance of
>> >vitio-balloon:
>> >a. allocating pages (5%, 304ms)
>> >b. sending PFNs to host (71%, 4194ms)
>> >c. address translation and madvise() operation (24%, 1423ms)
>> >Debugging shows the time spends on these operations are listed in the
>> >brackets above. By changing the VIRTIO_BALLOON_ARRAY_PFNS_MAX to a
>> >large value, such as 16384, the time spends on sending the PFNs can be
>> >reduced to about 400ms, but it’s still too long.
>> >
>> >Obviously, the virtio-balloon mechanism has a bigger performance
>> >impact to the guest than the way we are trying to implement.
>> >
>> >3. Virtio interface
>> >There are three different ways of using the virtio interface to
>> >send the free page information.
>> >a. Extend the current virtio device
>> >The virtio spec has already defined some virtio devices, and we can
>> >extend one of these devices so as to use it to transport the free page
>> >information. It requires modifying the virtio spec.
>> >
>> >b. Implement a new virtio device
>> >Implementing a brand new virtio device to exchange information
>> >between host and guest is another choice. It requires modifying the
>> >virtio spec too.
>> >
>> >c. Make use of virtio-serial (Amit’s 

Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages

2016-03-23 Thread Li, Liang Z
> Hi, Liang
> 
> This is a very clear documentation of your work, I appreciated it a lot. Below
> are some of my personal opinion and question.
> 

Thanks for your comments!

> On Tue, Mar 22, 2016 at 03:43:49PM +0800, Liang Li wrote:
> >I have sent the RFC version patch set for live migration optimization
> >by skipping processing the free pages in the ram bulk stage and
> >received a lot of comments. The related threads can be found at:
> >
> >https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00715.html
> >https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00714.html
> >https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00717.html
> >https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00716.html
> >https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00718.html
> >
> >https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00719.html
> >https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00720.html
> >https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00721.html
> >
> 
> Actually there are two threads, Qemu thread and kernel thread. It would be
> more clear for audience, if you just list two first mail for these two thread
> respectively.
> 

Indeed, my original version had this kind of information, but I removed it.

> >To make things easier, I wrote this doc about the possible designs
> >and my choices. Comments are welcome!
> >
> >Content
> >===
> >1. Background
> >2. Why not use virtio-balloon
> >3. Virtio interface
> >4. Constructing free page bitmap
> >5. Tighten free page bitmap
> >6. Handling page cache in the guest
> >7. APIs for live migration
> >8. Pseudo code
> >
> >Details
> >===
> >1. Background
> >As we know, in the ram bulk stage of live migration, current QEMU live
> >migration implementation mark the all guest's RAM pages as dirtied in
> >the ram bulk stage, all these pages will be checked for zero page
> >first, and the page content will be sent to the destination depends on
> >the checking result, that process consumes quite a lot of CPU cycles
> >and network bandwidth.
> >
> >>From guest's point of view, there are some pages currently not used by
> 
> I see in your original RFC patch and your RFC doc, this line starts with a
> character '>'. Not sure this one has a special purpose?
> 

No special purpose. Maybe it's caused by the email client. I didn't find the
character in the original doc.

> >the guest, guest doesn't care about the content in these pages. Free
> >pages are this kind of pages which are not used by guest. We can make
> >use of this fact and skip processing the free pages in the ram bulk
> >stage, it can save a lot CPU cycles and reduce the network traffic
> >while speed up the live migration process obviously.
> >
> >Usually, only the guest has the information of free pages. But it’s
> >possible to let the guest tell QEMU it’s free page information by some
> >mechanism. E.g. Through the virtio interface. Once QEMU get the free
> >page information, it can skip processing these free pages in the ram
> >bulk stage by clearing the corresponding bit of the migration bitmap.
> >
> >2. Why not use virtio-balloon
> >Actually, the virtio-balloon can do the similar thing by inflating the
> >balloon before live migration, but its performance is no good, for an
> >8GB idle guest just boots, it takes about 5.7 Sec to inflate the
> >balloon to 7GB, but it only takes 25ms to get a valid free page bitmap
> >from the guest.  There are some of reasons for the bad performance of
> >vitio-balloon:
> >a. allocating pages (5%, 304ms)
> >b. sending PFNs to host (71%, 4194ms)
> >c. address translation and madvise() operation (24%, 1423ms)
> >Debugging shows the time spends on these operations are listed in the
> >brackets above. By changing the VIRTIO_BALLOON_ARRAY_PFNS_MAX to a
> >large value, such as 16384, the time spends on sending the PFNs can be
> >reduced to about 400ms, but it’s still too long.
> >
> >Obviously, the virtio-balloon mechanism has a bigger performance
> >impact to the guest than the way we are trying to implement.
> >
> >3. Virtio interface
> >There are three different ways of using the virtio interface to
> >send the free page information.
> >a. Extend the current virtio device
> >The virtio spec has already defined some virtio devices, and we can
> >extend one of these devices so as to use it to transport the free page
> >information. It requires modifying the virtio spec.
> >
> >b. Implement a new virtio device
> >Implementing a brand new virtio device to exchange information
> >between host and guest is another choice. It requires modifying the
> >virtio spec too.
> >
> >c. Make use of virtio-serial (Amit’s suggestion, my choice)
> >It’s possible to make use the virtio-serial for communication between
> >host and guest, the benefit of this solution is no need to modify the
> >virtio spec.
> >
> >4. Construct free page bitmap
> >To minimize the space for saving free page information, it’s better to
> >use a bitmap 

Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages

2016-03-23 Thread Li, Liang Z
> > Obviously, the virtio-balloon mechanism has a bigger performance
> > impact to the guest than the way we are trying to implement.
> 
> Yeh, we should separately try and fix that; if it's that slow then people 
> will be
> annoyed about it when they're just using it for balloon.
> 
> > 3. Virtio interface
> > There are three different ways of using the virtio interface to send
> > the free page information.
> > a. Extend the current virtio device
> > The virtio spec has already defined some virtio devices, and we can
> > extend one of these devices so as to use it to transport the free page
> > information. It requires modifying the virtio spec.
> >
> > b. Implement a new virtio device
> > Implementing a brand new virtio device to exchange information between
> > host and guest is another choice. It requires modifying the virtio
> > spec too.
> 
> If the right solution is to change the spec then we should do it; we shouldn't
> use a technically worse solution just to avoid the spec change; although we
> have to be even more careful to get the right solution if we want to change
> the spec.
> 
> > c. Make use of virtio-serial (Amit’s suggestion, my choice) It’s
> > possible to make use the virtio-serial for communication between host
> > and guest, the benefit of this solution is no need to modify the
> > virtio spec.
> >
> > 4. Construct free page bitmap
> > To minimize the space for saving free page information, it’s better to
> > use a bitmap to describe the free pages. There are two ways to
> > construct the free page bitmap.
> >
> > a. Construct free page bitmap when demand (My choice) Guest can
> > allocate memory for the free page bitmap only when it receives the
> > request from QEMU, and set the free page bitmap by traversing the free
> > page list. The advantage of this way is that it’s quite simple and
> > easy to implement. The disadvantage is that the traversing operation
> > may consume quite a long time when there are a lot of free pages.
> > (About 20ms for 7GB free pages)
> 
> I wonder how that scales; 20ms isn't too bad - but I'm more worried about
> what happens when someone does it to the 1TB database VM.

That totally depends on the count of free pages in the VM. If 90% of the memory
in the 1TB VM is free pages, the time is about:
1024 * 0.9 / 7 * 20 ≈ 2633 ms

Is that unbearable? If so, we can use 4b to construct the free page bitmap; I hope
the kernel guys can tolerate it.
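
For reference, choice 4a boils down to a walk over the buddy free lists, setting
one bit per free page. The sketch below uses simplified stand-in structures, not
the real kernel ones, and leaves out locking and arch details:

#include <stddef.h>

#define BITS_PER_LONG (8 * sizeof(unsigned long))

struct free_block {                  /* stand-in for a buddy free-list entry */
    unsigned long pfn;               /* first PFN of the block               */
    unsigned int order;              /* block covers 2^order pages           */
    struct free_block *next;
};

static void set_pfn_bit(unsigned long *bitmap, unsigned long pfn)
{
    bitmap[pfn / BITS_PER_LONG] |= 1UL << (pfn % BITS_PER_LONG);
}

/* Cost grows with the amount of free memory: ~20 ms for 7 GB free per the
 * numbers in the doc; setting whole words for large blocks would cut it further. */
static void build_free_page_bitmap(const struct free_block *free_list,
                                   unsigned long *bitmap)
{
    for (const struct free_block *fb = free_list; fb; fb = fb->next) {
        for (unsigned long i = 0; i < (1UL << fb->order); i++) {
            set_pfn_bit(bitmap, fb->pfn + i);
        }
    }
}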

> > b. Update free page bitmap when allocating/freeing pages Another
> > choice is to allocate the memory for the free page bitmap when guest
> > boots, and then update the free page bitmap when allocating/freeing
> > pages. It needs more modification to the code related to memory
> > management in guest. The advantage of this way is that guest can
> > response QEMU’s request for a free page bitmap very quickly, no matter
> > how many free pages in the guest. Do the kernel guys like this?
> >
> > 5. Tighten the free page bitmap
> > At last, the free page bitmap should be operated with the
> > ramlist.dirty_memory to filter out the free pages. We should make sure
> > the bit N in the free page bitmap and the bit N in the
> > ramlist.dirty_memory are corresponding to the same guest’s page.
> > Some arch, like X86, there are ‘holes’ in the memory’s physical
> > address, which means there are no actual physical RAM pages
> > corresponding to some PFNs. So, some arch specific information is
> > needed to construct a proper free page bitmap.
> >
> > migration dirty page bitmap:
> > ---------------------
> > |a|b|c|d|e|f|g|h|i|j|
> > ---------------------
> > loose free page bitmap:
> > -----------------------------
> > |a|b|c|d|e|f| | | | |g|h|i|j|
> > -----------------------------
> > tight free page bitmap:
> > ---------------------
> > |a|b|c|d|e|f|g|h|i|j|
> > ---------------------
> >
> > There are two places for tightening the free page bitmap:
> > a. In guest
> > Constructing the free page bitmap in guest requires adding the arch
> > related code in guest for building a tight bitmap. The advantage of
> > this way is that less memory is needed to store the free page bitmap.
> > b. In QEMU (My choice)
> > Constructing the free page bitmap in QEMU is more flexible, we can get
> > a loose free page bitmap which contains the holes, and then filter out
> > the holes in QEMU, the advantage of this way is that we can keep the
> > kernel code as simple as we can, the disadvantage is that more memory
> > is needed to save the loose free page bitmap. Because this is a mainly
> > QEMU feature, if possible, do all the related things in QEMU is
> > better.
> 
> Yes, maybe; although we'd have to be careful to validate what the guest fills
> in makes sense.
> 
> > 6. Handling page cache in the guest
> > The memory used for page cache in the guest will change depends on the
> > workload, if guest run some block IO intensive work load, there will
> > be lots of pages used for page cache, only a 

Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages

2016-03-23 Thread Li, Liang Z
> > To make things easier, I wrote this doc about the possible designs and
> > my choices. Comments are welcome!
> 
> Thanks for putting this together, and especially for taking the trouble to
> benchmark existing code paths!
> 
> I think these numbers do show that there are gains to be had from merging
> your code with the existing balloon device. It will probably be a bit more 
> work,
> but I think it'll be worth it.
> 
> More comments below.
> 

Thanks for your comments!

> > 2. Why not use virtio-balloon
> > Actually, the virtio-balloon can do the similar thing by inflating the
> > balloon before live migration, but its performance is no good, for an
> > 8GB idle guest just boots, it takes about 5.7 Sec to inflate the
> > balloon to 7GB, but it only takes 25ms to get a valid free page bitmap
> > from the guest.  There are some of reasons for the bad performance of
> > vitio-balloon:
> > a. allocating pages (5%, 304ms)
> 
> Interesting. This is definitely worth improving in guest kernel.
> Also, will it be faster if we allocate and pass to guest huge pages instead?
> Might speed up madvise as well.

Maybe.

> > b. sending PFNs to host (71%, 4194ms)
> 
> OK, so we probably should teach balloon to pass huge lists in bitmaps.
> Will be benefitial for regular balloon operation, as well.
> 

Agree. The current balloon just sends 256 PFNs at a time; that's too few and leads
to too many virtio transmissions, and that's the main reason for the bad performance.
Changing VIRTIO_BALLOON_ARRAY_PFNS_MAX to a large value can improve the performance
significantly. Maybe we should increase it before doing the further optimization,
what do you think?
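
Just to make the bitmap variant concrete, one possible chunk layout could look
like the sketch below. This is purely hypothetical, not anything defined by the
virtio spec: 256 bytes of bitmap describe 2048 pages, i.e. 8 MB of guest memory
per chunk with 4 KB pages, versus 1 MB for 256 individually listed PFNs.

#include <stdint.h>

struct free_page_bitmap_chunk {      /* hypothetical wire format           */
    uint64_t base_pfn;               /* first guest PFN covered            */
    uint32_t nr_pages;               /* number of valid bits that follow   */
    uint32_t reserved;               /* padding / future flags             */
    uint8_t  bitmap[256];            /* bit N set => base_pfn + N is free  */
};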

> > c. address translation and madvise() operation (24%, 1423ms)
> 
> How is this split between translation and madvise?  I suspect it's mostly
> madvise since you need translation when using bitmap as well.
> Correct? Could you measure this please?  Also, what if we use the new
> MADV_FREE instead?  By how much would this help?
> 
For the current balloon, address translation is needed. 
But for live migration, there is no need to do address translation.

I did another test and got the following data:
   a. allocating pages (6.4%, 402ms)
   b. sending PFNs to host (68.3%, 4263ms)
   c. address translation (6.2%, 389ms)
   d. madvise (19.0%, 1188ms)

The address translation is a time consuming operation too.
I will try MADV_FREE later.
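
For that MADV_FREE experiment, the change on the host side is essentially one
flag; a hedged sketch (MADV_FREE needs Linux 4.5 or newer):

#include <stddef.h>
#include <sys/mman.h>

static int drop_guest_range(void *hva, size_t len)
{
#ifdef MADV_FREE
    /* Lazy: pages are only reclaimed under memory pressure. */
    return madvise(hva, len, MADV_FREE);
#else
    /* Eager: backing pages are dropped immediately. */
    return madvise(hva, len, MADV_DONTNEED);
#endif
}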

> Finally, we could teach balloon to skip madvise completely.
> By how much would this help?
> 
> > Debugging shows the time spends on these operations are listed in the
> > brackets above. By changing the VIRTIO_BALLOON_ARRAY_PFNS_MAX to
> a
> > large value, such as 16384, the time spends on sending the PFNs can be
> > reduced to about 400ms, but it’s still too long.
> > Obviously, the virtio-balloon mechanism has a bigger performance
> > impact to the guest than the way we are trying to implement.
> 
> Since as we see some of the new interfaces might be benefitial to balloon as
> well, I am rather of the opinion that extending the balloon (basically 3a)
> might be the right thing to do.
> 
> > 3. Virtio interface
> > There are three different ways of using the virtio interface to send
> > the free page information.
> > a. Extend the current virtio device
> > The virtio spec has already defined some virtio devices, and we can
> > extend one of these devices so as to use it to transport the free page
> > information. It requires modifying the virtio spec.
> 
> You don't have to do it all by yourself by the way.
> Submit the proposal to the oasis virtio tc mailing list, we will take it from 
> there.
> 
That's great.

>> 4. Construct free page bitmap
>> To minimize the space for saving free page information, it’s better to 
>> use a bitmap to describe the free pages. There are two ways to 
>> construct the free page bitmap.
>> 
>> a. Construct free page bitmap when demand (My choice) Guest can 
>> allocate memory for the free page bitmap only when it receives the 
>> request from QEMU, and set the free page bitmap by traversing the free 
>> page list. The advantage of this way is that it’s quite simple and 
>> easy to implement. The disadvantage is that the traversing operation 
>> may consume quite a long time when there are a lot of free pages. 
>> (About 20ms for 7GB free pages)
>> 
>> b. Update free page bitmap when allocating/freeing pages Another 
>> choice is to allocate the memory for the free page bitmap when guest 
>>boots, and then update the free page bitmap when allocating/freeing 
>> pages. It needs more modification to the code related to memory 
>>management in guest. The advantage of this way is that guest can 
>> response QEMU’s request for a free page bitmap very quickly, no matter 
>> how many free pages in the guest. Do the kernel guys like this?
>>

> > 8. Pseudo code
> > Dirty page logging should be enabled before getting the free page
> > information from guest, this is important because during the process
> > of 

Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages

2016-03-22 Thread Wei Yang
Hi, Liang

This is a very clear documentation of your work, I appreciate it a lot. Below
are some of my personal opinions and questions.

On Tue, Mar 22, 2016 at 03:43:49PM +0800, Liang Li wrote:
>I have sent the RFC version patch set for live migration optimization
>by skipping processing the free pages in the ram bulk stage and
>received a lot of comments. The related threads can be found at:
>
>https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00715.html
>https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00714.html
>https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00717.html
>https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00716.html
>https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00718.html
>
>https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00719.html 
>https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00720.html
>https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00721.html
>

Actually there are two threads, the QEMU thread and the kernel thread. It would be
clearer for the audience if you just listed the first mail of each of these two
threads.

>To make things easier, I wrote this doc about the possible designs
>and my choices. Comments are welcome! 
>
>Content
>===
>1. Background
>2. Why not use virtio-balloon
>3. Virtio interface
>4. Constructing free page bitmap
>5. Tighten free page bitmap
>6. Handling page cache in the guest
>7. APIs for live migration
>8. Pseudo code 
>
>Details
>===
>1. Background
>As we know, in the ram bulk stage of live migration, current QEMU live
>migration implementation mark the all guest's RAM pages as dirtied in
>the ram bulk stage, all these pages will be checked for zero page
>first, and the page content will be sent to the destination depends on
>the checking result, that process consumes quite a lot of CPU cycles
>and network bandwidth.
>
>>From guest's point of view, there are some pages currently not used by

I see in your original RFC patch and your RFC doc, this line starts with a
character '>'. Not sure this one has a special purpose?

>the guest, guest doesn't care about the content in these pages. Free
>pages are this kind of pages which are not used by guest. We can make
>use of this fact and skip processing the free pages in the ram bulk
>stage, it can save a lot CPU cycles and reduce the network traffic
>while speed up the live migration process obviously.
>
>Usually, only the guest has the information of free pages. But it’s
>possible to let the guest tell QEMU it’s free page information by some
>mechanism. E.g. Through the virtio interface. Once QEMU get the free
>page information, it can skip processing these free pages in the ram
>bulk stage by clearing the corresponding bit of the migration bitmap. 
>
>2. Why not use virtio-balloon 
>Actually, the virtio-balloon can do the similar thing by inflating the
>balloon before live migration, but its performance is no good, for an
>8GB idle guest just boots, it takes about 5.7 Sec to inflate the
>balloon to 7GB, but it only takes 25ms to get a valid free page bitmap
>from the guest.  There are some of reasons for the bad performance of
>vitio-balloon:
>a. allocating pages (5%, 304ms)
>b. sending PFNs to host (71%, 4194ms)
>c. address translation and madvise() operation (24%, 1423ms)
>Debugging shows the time spends on these operations are listed in the
>brackets above. By changing the VIRTIO_BALLOON_ARRAY_PFNS_MAX to a
>large value, such as 16384, the time spends on sending the PFNs can be
>reduced to about 400ms, but it’s still too long.
>
>Obviously, the virtio-balloon mechanism has a bigger performance
>impact to the guest than the way we are trying to implement.
>
>3. Virtio interface
>There are three different ways of using the virtio interface to
>send the free page information.
>a. Extend the current virtio device
>The virtio spec has already defined some virtio devices, and we can
>extend one of these devices so as to use it to transport the free page
>information. It requires modifying the virtio spec.
>
>b. Implement a new virtio device
>Implementing a brand new virtio device to exchange information
>between host and guest is another choice. It requires modifying the
>virtio spec too.
>
>c. Make use of virtio-serial (Amit’s suggestion, my choice)
>It’s possible to make use the virtio-serial for communication between
>host and guest, the benefit of this solution is no need to modify the
>virtio spec. 
>
>4. Construct free page bitmap
>To minimize the space for saving free page information, it’s better to
>use a bitmap to describe the free pages. There are two ways to
>construct the free page bitmap.
>
>a. Construct free page bitmap when demand (My choice)
>Guest can allocate memory for the free page bitmap only when it
>receives the request from QEMU, and set the free page bitmap by
>traversing the free page list. The advantage of this way is that it’s
>quite simple and easy to implement. The disadvantage is that the
>traversing 

Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages

2016-03-22 Thread Dr. David Alan Gilbert
* Liang Li (liang.z...@intel.com) wrote:
> I have sent the RFC version patch set for live migration optimization
> by skipping processing the free pages in the ram bulk stage and
> received a lot of comments. The related threads can be found at:

Thanks!


> Obviously, the virtio-balloon mechanism has a bigger performance
> impact to the guest than the way we are trying to implement.

Yeh, we should separately try and fix that; if it's that slow then
people will be annoyed about it when they're just using it for balloon.

> 3. Virtio interface
> There are three different ways of using the virtio interface to
> send the free page information.
> a. Extend the current virtio device
> The virtio spec has already defined some virtio devices, and we can
> extend one of these devices so as to use it to transport the free page
> information. It requires modifying the virtio spec.
> 
> b. Implement a new virtio device
> Implementing a brand new virtio device to exchange information
> between host and guest is another choice. It requires modifying the
> virtio spec too.

If the right solution is to change the spec then we should do it;
we shouldn't use a technically worse solution just to avoid the spec
change; although we have to be even more careful to get the right
solution if we want to change the spec.

> c. Make use of virtio-serial (Amit’s suggestion, my choice)
> It’s possible to make use the virtio-serial for communication between
> host and guest, the benefit of this solution is no need to modify the
> virtio spec. 
> 
> 4. Construct free page bitmap
> To minimize the space for saving free page information, it’s better to
> use a bitmap to describe the free pages. There are two ways to
> construct the free page bitmap.
> 
> a. Construct free page bitmap when demand (My choice)
> Guest can allocate memory for the free page bitmap only when it
> receives the request from QEMU, and set the free page bitmap by
> traversing the free page list. The advantage of this way is that it’s
> quite simple and easy to implement. The disadvantage is that the
> traversing operation may consume quite a long time when there are a
> lot of free pages. (About 20ms for 7GB free pages)

I wonder how that scales; 20ms isn't too bad - but I'm more worried about
what happens when someone does it to the 1TB database VM.

> b. Update free page bitmap when allocating/freeing pages 
> Another choice is to allocate the memory for the free page bitmap
> when guest boots, and then update the free page bitmap when
> allocating/freeing pages. It needs more modification to the code
> related to memory management in guest. The advantage of this way is
> that guest can response QEMU’s request for a free page bitmap very
> quickly, no matter how many free pages in the guest. Do the kernel guys
> like this?
> 
> 5. Tighten the free page bitmap
> At last, the free page bitmap should be operated with the
> ramlist.dirty_memory to filter out the free pages. We should make sure
> the bit N in the free page bitmap and the bit N in the
> ramlist.dirty_memory are corresponding to the same guest’s page. 
> Some arch, like X86, there are ‘holes’ in the memory’s physical
> address, which means there are no actual physical RAM pages
> corresponding to some PFNs. So, some arch specific information is
> needed to construct a proper free page bitmap.
> 
> migration dirty page bitmap:
> ---------------------
> |a|b|c|d|e|f|g|h|i|j|
> ---------------------
> loose free page bitmap:
> -----------------------------
> |a|b|c|d|e|f| | | | |g|h|i|j|
> -----------------------------
> tight free page bitmap:
> ---------------------
> |a|b|c|d|e|f|g|h|i|j|
> ---------------------
> 
> There are two places for tightening the free page bitmap:
> a. In guest 
> Constructing the free page bitmap in guest requires adding the arch
> related code in guest for building a tight bitmap. The advantage of
> this way is that less memory is needed to store the free page bitmap.
> b. In QEMU (My choice)
> Constructing the free page bitmap in QEMU is more flexible, we can get
> a loose free page bitmap which contains the holes, and then filter out
> the holes in QEMU, the advantage of this way is that we can keep the
> kernel code as simple as we can, the disadvantage is that more memory
> is needed to save the loose free page bitmap. Because this is a mainly
> QEMU feature, if possible, do all the related things in QEMU is
> better.

Yes, maybe; although we'd have to be careful to validate what the guest
fills in makes sense.
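
A minimal sketch of that QEMU-side tightening, with illustrative names; as noted
above, a real version would also have to validate the guest-supplied bitmap:

#include <stddef.h>

#define BITS_PER_LONG (8 * sizeof(unsigned long))

struct ram_chunk {                   /* one hole-free range of guest RAM    */
    unsigned long start_pfn;         /* guest-physical PFN where it begins  */
    unsigned long nr_pages;
};

static int test_bit_ul(const unsigned long *bm, unsigned long n)
{
    return (bm[n / BITS_PER_LONG] >> (n % BITS_PER_LONG)) & 1;
}

static void set_bit_ul(unsigned long *bm, unsigned long n)
{
    bm[n / BITS_PER_LONG] |= 1UL << (n % BITS_PER_LONG);
}

/* Repack the loose bitmap so bit N lines up with bit N of ramlist.dirty_memory. */
static void tighten_free_page_bitmap(const struct ram_chunk *chunks, size_t n,
                                     const unsigned long *loose,
                                     unsigned long *tight)
{
    unsigned long out = 0;           /* running index in the tight bitmap   */

    for (size_t c = 0; c < n; c++) {
        for (unsigned long i = 0; i < chunks[c].nr_pages; i++, out++) {
            if (test_bit_ul(loose, chunks[c].start_pfn + i)) {
                set_bit_ul(tight, out);
            }
        }
    }
}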

> 6. Handling page cache in the guest
> The memory used for page cache in the guest will change depends on the
> workload, if guest run some block IO intensive work load, there will
> be lots of pages used for page cache, only a few free pages are left in
> the guest. In order to get more free pages, we can select to ask guest
> to drop some page caches.  Because dropping the page cache may lead to
> performance 

Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages

2016-03-22 Thread Michael S. Tsirkin
On Tue, Mar 22, 2016 at 03:43:49PM +0800, Liang Li wrote:
> I have sent the RFC version patch set for live migration optimization
> by skipping processing the free pages in the ram bulk stage and
> received a lot of comments. The related threads can be found at:
> 
> https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00715.html
> https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00714.html
> https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00717.html
> https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00716.html
> https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00718.html
> 
> https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00719.html 
> https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00720.html
> https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00721.html
> 
> To make things easier, I wrote this doc about the possible designs
> and my choices. Comments are welcome! 

Thanks for putting this together, and especially for taking the trouble
to benchmark existing code paths!

I think these numbers do show that there are gains to be had from merging your 
code
with the existing balloon device. It will probably be a bit more work,
but I think it'll be worth it.

More comments below.


> Content
> ===
> 1. Background
> 2. Why not use virtio-balloon
> 3. Virtio interface
> 4. Constructing free page bitmap
> 5. Tighten free page bitmap
> 6. Handling page cache in the guest
> 7. APIs for live migration
> 8. Pseudo code 
> 
> Details
> ===
> 1. Background
> As we know, in the ram bulk stage of live migration, current QEMU live
> migration implementation mark the all guest's RAM pages as dirtied in
> the ram bulk stage, all these pages will be checked for zero page
> first, and the page content will be sent to the destination depends on
> the checking result, that process consumes quite a lot of CPU cycles
> and network bandwidth.
> 
> >From guest's point of view, there are some pages currently not used by
> the guest, guest doesn't care about the content in these pages. Free
> pages are this kind of pages which are not used by guest. We can make
> use of this fact and skip processing the free pages in the ram bulk
> stage, it can save a lot CPU cycles and reduce the network traffic
> while speed up the live migration process obviously.
> 
> Usually, only the guest has the information of free pages. But it’s
> possible to let the guest tell QEMU it’s free page information by some
> mechanism. E.g. Through the virtio interface. Once QEMU get the free
> page information, it can skip processing these free pages in the ram
> bulk stage by clearing the corresponding bit of the migration bitmap. 
> 
> 2. Why not use virtio-balloon 
> Actually, the virtio-balloon can do the similar thing by inflating the
> balloon before live migration, but its performance is no good, for an
> 8GB idle guest just boots, it takes about 5.7 Sec to inflate the
> balloon to 7GB, but it only takes 25ms to get a valid free page bitmap
> from the guest.  There are some of reasons for the bad performance of
> vitio-balloon:
> a. allocating pages (5%, 304ms)

Interesting. This is definitely worth improving in guest kernel.
Also, will it be faster if we allocate and pass to guest huge pages instead?
Might speed up madvise as well.

> b. sending PFNs to host (71%, 4194ms)

OK, so we probably should teach balloon to pass huge lists in bitmaps.
Will be benefitial for regular balloon operation, as well.

> c. address translation and madvise() operation (24%, 1423ms)

How is this split between translation and madvise?  I suspect it's
mostly madvise since you need translation when using bitmap as well.
Correct? Could you measure this please?  Also, what if we use the new
MADV_FREE instead?  By how much would this help?

Finally, we could teach balloon to skip madvise completely.
By how much would this help?

> Debugging shows the time spends on these operations are listed in the
> brackets above. By changing the VIRTIO_BALLOON_ARRAY_PFNS_MAX to a
> large value, such as 16384, the time spends on sending the PFNs can be
> reduced to about 400ms, but it’s still too long.
> Obviously, the virtio-balloon mechanism has a bigger performance
> impact to the guest than the way we are trying to implement.

Since as we see some of the new interfaces might be
benefitial to balloon as well, I am rather of the opinion that
extending the balloon (basically 3a) might be the right thing to do.

> 3. Virtio interface
> There are three different ways of using the virtio interface to
> send the free page information.
> a. Extend the current virtio device
> The virtio spec has already defined some virtio devices, and we can
> extend one of these devices so as to use it to transport the free page
> information. It requires modifying the virtio spec.

You don't have to do it all by yourself by the way.
Submit the proposal to the oasis virtio tc mailing list,
we will take it from there.


> b. Implement a new