Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-15 Thread Li, Liang Z
> > > > > >   I'm just catching back up on this thread; so without
> > > > > > reference to any particular previous mail in the thread.
> > > > > >
> > > > > >   1) How many of the free pages do we tell the host about?
> > > > > >  Your main change is telling the host about all the
> > > > > >  free pages.
> > > > >
> > > > > Yes, all the guest's free pages.
> > > > >
> > > > > >  If we tell the host about all the free pages, then we might
> > > > > >  end up needing to allocate more pages and update the host
> > > > > >  with pages we now want to use; that would have to wait for the
> > > > > >  host to acknowledge that use of these pages, since if we don't
> > > > > >  wait for it then it might have skipped migrating a page we
> > > > > >  just started using (I don't understand how your series solves 
> > > > > > that).
> > > > > >  So the guest probably needs to keep some free pages - how
> many?
> > > > >
> > > > > Actually, there is no need to care about whether the free pages
> > > > > will be
> > > used by the host.
> > > > > We only care about some of the free pages we get reused by the
> > > > > guest,
> > > right?
> > > > >
> > > > > The dirty page logging can be used to solve this, starting the
> > > > > dirty page logging before getting the free pages informant from guest.
> > > > > Even some of the free pages are modified by the guest during the
> > > > > process of getting the free pages information, these modified
> > > > > pages will
> > > be traced by the dirty page logging mechanism. So in the following
> > > migration_bitmap_sync() function.
> > > > > The pages in the free pages bitmap, but latter was modified,
> > > > > will be reset to dirty. We won't omit any dirtied pages.
> > > > >
> > > > > So, guest doesn't need to keep any free pages.
> > > >
> > > > OK, yes, that works; so we do:
> > > >   * enable dirty logging
> > > >   * ask guest for free pages
> > > >   * initialise the migration bitmap as everything-free
> > > >   * then later we do the normal sync-dirty bitmap stuff and it all just
> works.
> > > >
> > > > That's nice and simple.
> > >
> > > This works once, sure. But there's an issue is that you have to
> > > defer migration until you get the free page list, and this only
> > > works once. So you end up with heuristics about how long to wait.
> > >
> > > Instead I propose:
> > >
> > > - mark all pages dirty as we do now.
> > >
> > > - at start of migration, start tracking dirty
> > >   pages in kvm, and tell guest to start tracking free pages
> > >
> > > we can now introduce any kind of delay, for example wait for ack
> > > from guest, or do whatever else, or even just start migrating pages
> > >
> > > - repeatedly:
> > >   - get list of free pages from guest
> > >   - clear them in migration bitmap
> > >   - get dirty list from kvm
> > >
> > > - at end of migration, stop tracking writes in kvm,
> > >   and tell guest to stop tracking free pages
> >
> > I had thought of filtering out the free pages in each migration bitmap
> synchronization.
> > The advantage is we can skip process as many free pages as possible. Not
> just once.
> > The disadvantage is that we should change the current memory
> > management code to track the free pages, instead of traversing the free
> page list to construct the free pages bitmap, to reduce the overhead to get
> the free pages bitmap.
> > I am not sure the if the Kernel people would like it.
> >
> > If keeping the traversing mechanism, because of the overhead, maybe it's
> not worth to filter out the free pages repeatedly.
> 
> Well, Michael's idea of not waiting for the dirty bitmap to be filled does
> make that idea of constantly using the free-bitmap better.
> 

Not waiting is a good idea.
Actually, we could shorten the waiting time by pre-allocating the free page
bitmap and updating it whenever the guest allocates or frees pages. That
requires modifying the mm-related code. I don't know whether the kernel
people would like this.
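
To make the idea of a pre-allocated bitmap that is updated on every allocation and free more concrete, here is a toy sketch in Python. It is not kernel code: the class and method names are hypothetical, and a real implementation in the guest's mm code would also have to deal with locking and page orders.

```python
class FreePageBitmap:
    """Toy model of a pre-allocated free page bitmap (hypothetical names)."""

    def __init__(self, nr_pages):
        self.nr_pages = nr_pages
        # One bit per page frame, allocated once up front.
        self.bits = bytearray((nr_pages + 7) // 8)
        # In this toy model every page starts out free.
        for pfn in range(nr_pages):
            self.on_free(pfn)

    def on_alloc(self, pfn):
        # Hook called when the guest allocates a page: clear its bit.
        self.bits[pfn // 8] &= ~(1 << (pfn % 8)) & 0xFF

    def on_free(self, pfn):
        # Hook called when the guest frees a page: set its bit.
        self.bits[pfn // 8] |= 1 << (pfn % 8)

    def is_free(self, pfn):
        return bool(self.bits[pfn // 8] & (1 << (pfn % 8)))

    def snapshot(self):
        # What would be handed to the host: no free list traversal needed.
        return bytes(self.bits)


bm = FreePageBitmap(16)
bm.on_alloc(3)
bm.on_alloc(9)
bm.on_free(3)
print(bm.is_free(3), bm.is_free(9))
```

The cost moves from the time of the request (traversing the free list) to every allocation and free, which is why it needs changes in the mm code.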

> In that case, is it easier if something (guest/host?) allocates some memory in
> the guests physical RAM space and just points the host to it, rather than
> having an explicit 'send'.
> 

Good idea too.

Liang
> Dave



Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-15 Thread Dr. David Alan Gilbert
* Li, Liang Z (liang.z...@intel.com) wrote:
> > On Mon, Mar 14, 2016 at 05:03:34PM +, Dr. David Alan Gilbert wrote:
> > > * Li, Liang Z (liang.z...@intel.com) wrote:
> > > > >
> > > > > Hi,
> > > > >   I'm just catching back up on this thread; so without reference
> > > > > to any particular previous mail in the thread.
> > > > >
> > > > >   1) How many of the free pages do we tell the host about?
> > > > >  Your main change is telling the host about all the
> > > > >  free pages.
> > > >
> > > > Yes, all the guest's free pages.
> > > >
> > > > >  If we tell the host about all the free pages, then we might
> > > > >  end up needing to allocate more pages and update the host
> > > > >  with pages we now want to use; that would have to wait for the
> > > > >  host to acknowledge that use of these pages, since if we don't
> > > > >  wait for it then it might have skipped migrating a page we
> > > > >  just started using (I don't understand how your series solves 
> > > > > that).
> > > > >  So the guest probably needs to keep some free pages - how many?
> > > >
> > > > Actually, there is no need to care about whether the free pages will be
> > used by the host.
> > > > We only care about some of the free pages we get reused by the guest,
> > right?
> > > >
> > > > The dirty page logging can be used to solve this, starting the dirty
> > > > page logging before getting the free pages informant from guest.
> > > > Even some of the free pages are modified by the guest during the
> > > > process of getting the free pages information, these modified pages will
> > be traced by the dirty page logging mechanism. So in the following
> > migration_bitmap_sync() function.
> > > > The pages in the free pages bitmap, but latter was modified, will be
> > > > reset to dirty. We won't omit any dirtied pages.
> > > >
> > > > So, guest doesn't need to keep any free pages.
> > >
> > > OK, yes, that works; so we do:
> > >   * enable dirty logging
> > >   * ask guest for free pages
> > >   * initialise the migration bitmap as everything-free
> > >   * then later we do the normal sync-dirty bitmap stuff and it all just 
> > > works.
> > >
> > > That's nice and simple.
> > 
> > This works once, sure. But there's an issue is that you have to defer 
> > migration
> > until you get the free page list, and this only works once. So you end up 
> > with
> > heuristics about how long to wait.
> > 
> > Instead I propose:
> > 
> > - mark all pages dirty as we do now.
> > 
> > - at start of migration, start tracking dirty
> >   pages in kvm, and tell guest to start tracking free pages
> > 
> > we can now introduce any kind of delay, for example wait for ack from guest,
> > or do whatever else, or even just start migrating pages
> > 
> > - repeatedly:
> > - get list of free pages from guest
> > - clear them in migration bitmap
> > - get dirty list from kvm
> > 
> > - at end of migration, stop tracking writes in kvm,
> >   and tell guest to stop tracking free pages
> 
> I had thought of filtering out the free pages in each migration bitmap 
> synchronization. 
> The advantage is we can skip process as many free pages as possible. Not just 
> once.
> The disadvantage is that we should change the current memory management code 
> to track the free pages,
> instead of traversing the free page list to construct the free pages bitmap, 
> to reduce the overhead to get the free pages bitmap.
> I am not sure the if the Kernel people would like it.
> 
> If keeping the traversing mechanism, because of the overhead, maybe it's not 
> worth to filter out the free pages repeatedly.

Well, Michael's idea of not waiting for the dirty
bitmap to be filled does make that idea of constantly
using the free-bitmap better.

In that case, is it easier if something (guest/host?)
allocates some memory in the guest's physical RAM space
and just points the host to it, rather than having an
explicit 'send'?

Dave

> Liang
> 
> 
> 
> 
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-15 Thread Li, Liang Z
> On Mon, Mar 14, 2016 at 05:03:34PM +, Dr. David Alan Gilbert wrote:
> > * Li, Liang Z (liang.z...@intel.com) wrote:
> > > >
> > > > Hi,
> > > >   I'm just catching back up on this thread; so without reference
> > > > to any particular previous mail in the thread.
> > > >
> > > >   1) How many of the free pages do we tell the host about?
> > > >  Your main change is telling the host about all the
> > > >  free pages.
> > >
> > > Yes, all the guest's free pages.
> > >
> > > >  If we tell the host about all the free pages, then we might
> > > >  end up needing to allocate more pages and update the host
> > > >  with pages we now want to use; that would have to wait for the
> > > >  host to acknowledge that use of these pages, since if we don't
> > > >  wait for it then it might have skipped migrating a page we
> > > >  just started using (I don't understand how your series solves 
> > > > that).
> > > >  So the guest probably needs to keep some free pages - how many?
> > >
> > > Actually, there is no need to care about whether the free pages will be
> used by the host.
> > > We only care about some of the free pages we get reused by the guest,
> right?
> > >
> > > The dirty page logging can be used to solve this, starting the dirty
> > > page logging before getting the free pages informant from guest.
> > > Even some of the free pages are modified by the guest during the
> > > process of getting the free pages information, these modified pages will
> be traced by the dirty page logging mechanism. So in the following
> migration_bitmap_sync() function.
> > > The pages in the free pages bitmap, but latter was modified, will be
> > > reset to dirty. We won't omit any dirtied pages.
> > >
> > > So, guest doesn't need to keep any free pages.
> >
> > OK, yes, that works; so we do:
> >   * enable dirty logging
> >   * ask guest for free pages
> >   * initialise the migration bitmap as everything-free
> >   * then later we do the normal sync-dirty bitmap stuff and it all just 
> > works.
> >
> > That's nice and simple.
> 
> This works once, sure. But there's an issue is that you have to defer 
> migration
> until you get the free page list, and this only works once. So you end up with
> heuristics about how long to wait.
> 
> Instead I propose:
> 
> - mark all pages dirty as we do now.
> 
> - at start of migration, start tracking dirty
>   pages in kvm, and tell guest to start tracking free pages
> 
> we can now introduce any kind of delay, for example wait for ack from guest,
> or do whatever else, or even just start migrating pages
> 
> - repeatedly:
>   - get list of free pages from guest
>   - clear them in migration bitmap
>   - get dirty list from kvm
> 
> - at end of migration, stop tracking writes in kvm,
>   and tell guest to stop tracking free pages

I had thought of filtering out the free pages in each migration bitmap
synchronization.
The advantage is that we can skip processing as many free pages as possible,
not just once.
The disadvantage is that we would have to change the current memory
management code to track the free pages, instead of traversing the free page
list to construct the free page bitmap, in order to reduce the overhead of
getting the free page bitmap.
I am not sure if the kernel people would like it.

If we keep the traversing mechanism, then because of the overhead it may not
be worth filtering out the free pages repeatedly.

Liang







Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-15 Thread Michael S. Tsirkin
On Mon, Mar 14, 2016 at 05:03:34PM +, Dr. David Alan Gilbert wrote:
> * Li, Liang Z (liang.z...@intel.com) wrote:
> > > 
> > > Hi,
> > >   I'm just catching back up on this thread; so without reference to any
> > > particular previous mail in the thread.
> > > 
> > >   1) How many of the free pages do we tell the host about?
> > >  Your main change is telling the host about all the
> > >  free pages.
> > 
> > Yes, all the guest's free pages.
> > 
> > >  If we tell the host about all the free pages, then we might
> > >  end up needing to allocate more pages and update the host
> > >  with pages we now want to use; that would have to wait for the
> > >  host to acknowledge that use of these pages, since if we don't
> > >  wait for it then it might have skipped migrating a page we
> > >  just started using (I don't understand how your series solves that).
> > >  So the guest probably needs to keep some free pages - how many?
> > 
> > Actually, there is no need to care about whether the free pages will be 
> > used by the host.
> > We only care about some of the free pages we get reused by the guest, right?
> > 
> > The dirty page logging can be used to solve this, starting the dirty page 
> > logging before getting
> > the free pages informant from guest. Even some of the free pages are 
> > modified by the guest
> > during the process of getting the free pages information, these modified 
> > pages will be traced
> > by the dirty page logging mechanism. So in the following 
> > migration_bitmap_sync() function.
> > The pages in the free pages bitmap, but latter was modified, will be reset 
> > to dirty. We won't
> > omit any dirtied pages.
> > 
> > So, guest doesn't need to keep any free pages.
> 
> OK, yes, that works; so we do:
>   * enable dirty logging
>   * ask guest for free pages
>   * initialise the migration bitmap as everything-free
>   * then later we do the normal sync-dirty bitmap stuff and it all just works.
> 
> That's nice and simple.

This works once, sure. But the issue is that you have
to defer migration until you get the free page list,
and this only works once. So you end up with heuristics
about how long to wait.

Instead I propose:

- mark all pages dirty as we do now.

- at start of migration, start tracking dirty
  pages in kvm, and tell guest to start tracking free pages

we can now introduce any kind of delay, for
example wait for ack from guest, or do whatever else,
or even just start migrating pages

- repeatedly:
  - get list of free pages from guest
  - clear them in migration bitmap
  - get dirty list from kvm

- at end of migration, stop tracking writes in kvm,
  and tell guest to stop tracking free pages
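
The loop above can be sketched as a toy simulation in Python. Everything here is hypothetical: the migration bitmap is modelled as a set of page numbers, and `get_free_pages` / `get_dirty_pages` stand in for the guest's free page report and the kvm dirty log. The sketch assumes the dirty list is fetched after the free list has been applied, so a page that was reported free and then written is put back rather than skipped.

```python
def migrate(nr_pages, get_free_pages, get_dirty_pages, send_page, max_rounds=32):
    """Toy model of the proposed loop; returns the number of rounds run."""
    # Mark all pages dirty, as is done now.
    pending = set(range(nr_pages))
    for round_no in range(1, max_rounds + 1):
        # Get the list of free pages from the guest and clear them.
        pending -= set(get_free_pages())
        # Get the dirty list last: a page reported free but written since
        # the previous sync is marked again and will not be skipped.
        pending |= set(get_dirty_pages())
        if not pending:
            return round_no
        for pfn in sorted(pending):
            send_page(pfn)
        pending.clear()
    return max_rounds


# Pages 4-7 are reported free, but page 5 is reused right away.
free_reports = iter([{4, 5, 6, 7}, set(), set()])
dirty_reports = iter([{5}, {1}, set()])
sent = []
rounds = migrate(8, lambda: next(free_reports), lambda: next(dirty_reports),
                 sent.append)
print(rounds, sent)
```

In this run pages 4, 6 and 7 are never sent, while page 5 is sent because the dirty log caught its reuse.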



> > >   2) Clearing out caches
> > >  Does it make sense to clean caches?  They're apparently useful data
> > >  so if we clean them it's likely to slow the guest down; I guess
> > >  they're also likely to be fairly static data - so at least fairly
> > >  easy to migrate.
> > >  The answer here partially depends on what you want from your 
> > > migration;
> > >  if you're after the fastest possible migration time it might make
> > >  sense to clean the caches and avoid migrating them; but that might
> > >  be at the cost of more disruption to the guest - there's a trade off
> > >  somewhere and it's not clear to me how you set that depending on your
> > >  guest/network/requirements.
> > > 
> > 
> > Yes, clean the caches is an option.  Let the users decide using it or not.
> > 
> > >   3) Why is ballooning slow?
> > >  You've got a figure of 5s to balloon on an 8GB VM - but an
> > >  8GB VM isn't huge; so I worry about how long it would take
> > >  on a big VM.   We need to understand why it's slow
> > >* is it due to the guest shuffling pages around?
> > >* is it due to the virtio-balloon protocol sending one page
> > >  at a time?
> > >  + Do balloon pages normally clump in physical memory
> > > - i.e. would a 'large balloon' message help
> > > - or do we need a bitmap because it tends not to clump?
> > > 
> > 
> > I didn't do a comprehensive test. But I found most of the time spending
> > on allocating the pages and sending the PFNs to guest, I don't know that's
> > the most time consuming operation, allocating the pages or sending the PFNs.
> 
> It might be a good idea to analyse it a bit more to convince people where
> the problem is.
> 
> > >* is it due to the madvise on the host?
> > >  If we were using the normal balloon messages, then we
> > >  could, during migration, just route those to the migration
> > >  code rather than bothering with the madvise.
> > >  If they're clumping together we could just turn that into
> > >  one big madvise; if they're not then would we benefit from
> > >  a call that lets us madvise lots of areas?
> > > 

Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-14 Thread Li, Liang Z
> > > Hi,
> > >   I'm just catching back up on this thread; so without reference to
> > > any particular previous mail in the thread.
> > >
> > >   1) How many of the free pages do we tell the host about?
> > >  Your main change is telling the host about all the
> > >  free pages.
> >
> > Yes, all the guest's free pages.
> >
> > >  If we tell the host about all the free pages, then we might
> > >  end up needing to allocate more pages and update the host
> > >  with pages we now want to use; that would have to wait for the
> > >  host to acknowledge that use of these pages, since if we don't
> > >  wait for it then it might have skipped migrating a page we
> > >  just started using (I don't understand how your series solves that).
> > >  So the guest probably needs to keep some free pages - how many?
> >
> > Actually, there is no need to care about whether the free pages will be
> used by the host.
> > We only care about some of the free pages we get reused by the guest,
> right?
> >
> > The dirty page logging can be used to solve this, starting the dirty
> > page logging before getting the free pages informant from guest. Even
> > some of the free pages are modified by the guest during the process of
> > getting the free pages information, these modified pages will be traced by
> the dirty page logging mechanism. So in the following
> migration_bitmap_sync() function.
> > The pages in the free pages bitmap, but latter was modified, will be
> > reset to dirty. We won't omit any dirtied pages.
> >
> > So, guest doesn't need to keep any free pages.
> 
> OK, yes, that works; so we do:
>   * enable dirty logging
>   * ask guest for free pages
>   * initialise the migration bitmap as everything-free
>   * then later we do the normal sync-dirty bitmap stuff and it all just works.
> 
> That's nice and simple.
> 
> > >   2) Clearing out caches
> > >  Does it make sense to clean caches?  They're apparently useful data
> > >  so if we clean them it's likely to slow the guest down; I guess
> > >  they're also likely to be fairly static data - so at least fairly
> > >  easy to migrate.
> > >  The answer here partially depends on what you want from your
> migration;
> > >  if you're after the fastest possible migration time it might make
> > >  sense to clean the caches and avoid migrating them; but that might
> > >  be at the cost of more disruption to the guest - there's a trade off
> > >  somewhere and it's not clear to me how you set that depending on
> your
> > >  guest/network/requirements.
> > >
> >
> > Yes, clean the caches is an option.  Let the users decide using it or not.
> >
> > >   3) Why is ballooning slow?
> > >  You've got a figure of 5s to balloon on an 8GB VM - but an
> > >  8GB VM isn't huge; so I worry about how long it would take
> > >  on a big VM.   We need to understand why it's slow
> > >* is it due to the guest shuffling pages around?
> > >* is it due to the virtio-balloon protocol sending one page
> > >  at a time?
> > >  + Do balloon pages normally clump in physical memory
> > > - i.e. would a 'large balloon' message help
> > > - or do we need a bitmap because it tends not to clump?
> > >
> >
> > I didn't do a comprehensive test. But I found most of the time
> > spending on allocating the pages and sending the PFNs to guest, I
> > don't know that's the most time consuming operation, allocating the pages
> or sending the PFNs.
> 
> It might be a good idea to analyse it a bit more to convince people where the
> problem is.
> 

Yes, I will try to measure the time spent in the different parts.

> > >* is it due to the madvise on the host?
> > >  If we were using the normal balloon messages, then we
> > >  could, during migration, just route those to the migration
> > >  code rather than bothering with the madvise.
> > >  If they're clumping together we could just turn that into
> > >  one big madvise; if they're not then would we benefit from
> > >  a call that lets us madvise lots of areas?
> > >
> >
> > My test showed madvise() is not the main reason for the long time,
> > only taken 10% of the total  inflating balloon operation time.
> > Big madvise can more or less improve the performance.
> 
> OK; 10% of the total is still pretty big even for your 8GB VM.
> 
> > >   4) Speeding up the migration of those free pages
> > > You're using the bitmap to avoid migrating those free pages; HPe's
> > > patchset is reconstructing a bitmap from the balloon data;  OK, so
> > > this all makes sense to avoid migrating them - I'd also been thinking
> > > of using pagemap to spot zero pages that would help find other zero'd
> > > pages, but perhaps ballooned is enough?
> > >
> > Could you describe your ideal with more details?
> 
> At the moment the migration code spends a fair amount of time 

Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-14 Thread Dr. David Alan Gilbert
* Li, Liang Z (liang.z...@intel.com) wrote:
> > 
> > Hi,
> >   I'm just catching back up on this thread; so without reference to any
> > particular previous mail in the thread.
> > 
> >   1) How many of the free pages do we tell the host about?
> >  Your main change is telling the host about all the
> >  free pages.
> 
> Yes, all the guest's free pages.
> 
> >  If we tell the host about all the free pages, then we might
> >  end up needing to allocate more pages and update the host
> >  with pages we now want to use; that would have to wait for the
> >  host to acknowledge that use of these pages, since if we don't
> >  wait for it then it might have skipped migrating a page we
> >  just started using (I don't understand how your series solves that).
> >  So the guest probably needs to keep some free pages - how many?
> 
> Actually, there is no need to care about whether the free pages will be used 
> by the host.
> We only care about some of the free pages we get reused by the guest, right?
> 
> The dirty page logging can be used to solve this, starting the dirty page 
> logging before getting
> the free pages informant from guest. Even some of the free pages are modified 
> by the guest
> during the process of getting the free pages information, these modified 
> pages will be traced
> by the dirty page logging mechanism. So in the following 
> migration_bitmap_sync() function.
> The pages in the free pages bitmap, but latter was modified, will be reset to 
> dirty. We won't
> omit any dirtied pages.
> 
> So, guest doesn't need to keep any free pages.

OK, yes, that works; so we do:
  * enable dirty logging
  * ask guest for free pages
  * initialise the migration bitmap as everything-free
  * then later we do the normal sync-dirty bitmap stuff and it all just works.

That's nice and simple.
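
As a sanity check of the steps above, here is a minimal sketch in Python that models each bitmap as an integer with one bit per page. The function names are illustrative only and are not QEMU functions; "everything-free" is read here as "every page except the free ones".

```python
def initial_migration_bitmap(nr_pages, free_bitmap):
    # Everything minus the pages the guest reported as free.
    all_pages = (1 << nr_pages) - 1
    return all_pages & ~free_bitmap


def sync_dirty(migration_bitmap, dirty_bitmap):
    # Normal dirty sync: any page written since logging was enabled
    # is set again, including pages that were reported free.
    return migration_bitmap | dirty_bitmap


free = 0b11110000        # guest says pages 4-7 are free
dirty = 0b00100000       # but page 5 was written after logging started
bitmap = initial_migration_bitmap(8, free)
bitmap = sync_dirty(bitmap, dirty)
print(bin(bitmap))
```

Because dirty logging was enabled before the free pages were requested, the reuse of page 5 shows up in the dirty bitmap and the page is not lost.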

> >   2) Clearing out caches
> >  Does it make sense to clean caches?  They're apparently useful data
> >  so if we clean them it's likely to slow the guest down; I guess
> >  they're also likely to be fairly static data - so at least fairly
> >  easy to migrate.
> >  The answer here partially depends on what you want from your migration;
> >  if you're after the fastest possible migration time it might make
> >  sense to clean the caches and avoid migrating them; but that might
> >  be at the cost of more disruption to the guest - there's a trade off
> >  somewhere and it's not clear to me how you set that depending on your
> >  guest/network/requirements.
> > 
> 
> Yes, clean the caches is an option.  Let the users decide using it or not.
> 
> >   3) Why is ballooning slow?
> >  You've got a figure of 5s to balloon on an 8GB VM - but an
> >  8GB VM isn't huge; so I worry about how long it would take
> >  on a big VM.   We need to understand why it's slow
> >* is it due to the guest shuffling pages around?
> >* is it due to the virtio-balloon protocol sending one page
> >  at a time?
> >  + Do balloon pages normally clump in physical memory
> > - i.e. would a 'large balloon' message help
> > - or do we need a bitmap because it tends not to clump?
> > 
> 
> I didn't do a comprehensive test. But I found most of the time spending
> on allocating the pages and sending the PFNs to guest, I don't know that's
> the most time consuming operation, allocating the pages or sending the PFNs.

It might be a good idea to analyse it a bit more to convince people where
the problem is.

> >* is it due to the madvise on the host?
> >  If we were using the normal balloon messages, then we
> >  could, during migration, just route those to the migration
> >  code rather than bothering with the madvise.
> >  If they're clumping together we could just turn that into
> >  one big madvise; if they're not then would we benefit from
> >  a call that lets us madvise lots of areas?
> > 
> 
> My test showed madvise() is not the main reason for the long time, only taken
> 10% of the total  inflating balloon operation time.
> Big madvise can more or less improve the performance.

OK; 10% of the total is still pretty big even for your 8GB VM.

> >   4) Speeding up the migration of those free pages
> > You're using the bitmap to avoid migrating those free pages; HPe's
> > patchset is reconstructing a bitmap from the balloon data;  OK, so
> > this all makes sense to avoid migrating them - I'd also been thinking
> > of using pagemap to spot zero pages that would help find other zero'd
> > pages, but perhaps ballooned is enough?
> > 
> Could you describe your ideal with more details?

At the moment the migration code spends a fair amount of time checking if a page
is zero; I was thinking perhaps QEMU could just open /proc/self/pagemap
and check if the page was mapped; that would seem cheap if we're checking big
ranges; and that 
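
A rough sketch of the pagemap check, in Python rather than QEMU's C. It relies on the documented layout of /proc/self/pagemap on Linux (one 64-bit entry per virtual page, bit 63 = present, bit 62 = swapped). The function names are made up for illustration, and treating an unmapped page as zero is an assumption that only holds for untouched anonymous memory.

```python
import struct

PAGEMAP_ENTRY_SIZE = 8      # one 64-bit entry per virtual page
PAGE_PRESENT = 1 << 63      # page is present in RAM
PAGE_SWAPPED = 1 << 62      # page is in swap


def read_pagemap(vaddr, nr_pages, page_size=4096):
    # Linux only: read the raw entries covering nr_pages starting at vaddr.
    with open("/proc/self/pagemap", "rb") as f:
        f.seek((vaddr // page_size) * PAGEMAP_ENTRY_SIZE)
        return f.read(nr_pages * PAGEMAP_ENTRY_SIZE)


def unmapped_pages(raw):
    """Return indices of pages that are neither present nor swapped."""
    count = len(raw) // PAGEMAP_ENTRY_SIZE
    entries = struct.unpack("=%dQ" % count, raw[:count * PAGEMAP_ENTRY_SIZE])
    return [i for i, e in enumerate(entries)
            if not e & (PAGE_PRESENT | PAGE_SWAPPED)]


# Synthetic entries: page 0 present, page 1 unmapped, page 2 swapped out.
raw = struct.pack("=3Q", PAGE_PRESENT | 0x1234, 0, PAGE_SWAPPED)
print(unmapped_pages(raw))
```

One read covers a whole range of pages, which is what would make this cheaper than scanning each page for zeroes.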

Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-10 Thread Li, Liang Z
> 
> Hi,
>   I'm just catching back up on this thread; so without reference to any
> particular previous mail in the thread.
> 
>   1) How many of the free pages do we tell the host about?
>  Your main change is telling the host about all the
>  free pages.

Yes, all the guest's free pages.

>  If we tell the host about all the free pages, then we might
>  end up needing to allocate more pages and update the host
>  with pages we now want to use; that would have to wait for the
>  host to acknowledge that use of these pages, since if we don't
>  wait for it then it might have skipped migrating a page we
>  just started using (I don't understand how your series solves that).
>  So the guest probably needs to keep some free pages - how many?

Actually, there is no need to care about whether the free pages will be used
by the host. We only care about whether some of the free pages we got are
reused by the guest, right?

Dirty page logging can be used to solve this: start dirty page logging
before getting the free page information from the guest. Even if some of the
free pages are modified by the guest while the free page information is
being collected, these modified pages will be tracked by the dirty page
logging mechanism. So in the following migration_bitmap_sync() call, pages
that are in the free page bitmap but were later modified will be reset to
dirty. We won't omit any dirtied pages.

So the guest doesn't need to keep any free pages.

>   2) Clearing out caches
>  Does it make sense to clean caches?  They're apparently useful data
>  so if we clean them it's likely to slow the guest down; I guess
>  they're also likely to be fairly static data - so at least fairly
>  easy to migrate.
>  The answer here partially depends on what you want from your migration;
>  if you're after the fastest possible migration time it might make
>  sense to clean the caches and avoid migrating them; but that might
>  be at the cost of more disruption to the guest - there's a trade off
>  somewhere and it's not clear to me how you set that depending on your
>  guest/network/requirements.
> 

Yes, cleaning the caches is an option.  Let the users decide whether to use it.

>   3) Why is ballooning slow?
>  You've got a figure of 5s to balloon on an 8GB VM - but an
>  8GB VM isn't huge; so I worry about how long it would take
>  on a big VM.   We need to understand why it's slow
>* is it due to the guest shuffling pages around?
>* is it due to the virtio-balloon protocol sending one page
>  at a time?
>  + Do balloon pages normally clump in physical memory
> - i.e. would a 'large balloon' message help
> - or do we need a bitmap because it tends not to clump?
> 

I didn't do a comprehensive test, but I found that most of the time is spent
allocating the pages and sending the PFNs to guest. I don't know which is
the most time-consuming operation, allocating the pages or sending the PFNs.

>* is it due to the madvise on the host?
>  If we were using the normal balloon messages, then we
>  could, during migration, just route those to the migration
>  code rather than bothering with the madvise.
>  If they're clumping together we could just turn that into
>  one big madvise; if they're not then would we benefit from
>  a call that lets us madvise lots of areas?
> 

My test showed that madvise() is not the main reason for the long time; it
only takes 10% of the total balloon inflation time.
A big madvise can improve the performance somewhat.

>   4) Speeding up the migration of those free pages
> You're using the bitmap to avoid migrating those free pages; HPe's
> patchset is reconstructing a bitmap from the balloon data;  OK, so
> this all makes sense to avoid migrating them - I'd also been thinking
> of using pagemap to spot zero pages that would help find other zero'd
> pages, but perhaps ballooned is enough?
> 
Could you describe your idea in more detail?

>   5) Second-migrate
> Given a VM where you've done all those tricks on, what happens when
> you migrate it a second time?   I guess you're aiming for the guest
> to update it's bitmap;  HPe's solution is to migrate it's balloon
> bitmap along with the migration data.

Nothing is special in the second migration. QEMU will ask the guest for
free page information, and the guest will traverse its current free page
list to construct a new free page bitmap and send it to QEMU, just like in
the first migration.
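
For illustration, the traversal step might look roughly like this in Python. This is a toy model, not the guest kernel code: it assumes the free list is a sequence of (start_pfn, order) blocks in the style of a buddy allocator, and the function name is hypothetical.

```python
def build_free_page_bitmap(free_blocks, nr_pages):
    """Walk a free list of (start_pfn, order) blocks and set one bit per page."""
    bitmap = bytearray((nr_pages + 7) // 8)
    for start_pfn, order in free_blocks:
        # A block of the given order covers 2**order contiguous pages.
        for pfn in range(start_pfn, start_pfn + (1 << order)):
            bitmap[pfn // 8] |= 1 << (pfn % 8)
    return bytes(bitmap)


# Free blocks: pages 0-3, page 8, and pages 12-13.
free_list = [(0, 2), (8, 0), (12, 1)]
print(build_free_page_bitmap(free_list, 16).hex())
```

The same function is simply called again with the then-current free list for a second migration; nothing is carried over from the first one.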

Liang
> 
> Dave
> 
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-10 Thread Michael S. Tsirkin
On Thu, Mar 10, 2016 at 01:41:16AM +, Li, Liang Z wrote:
> > > > > > Yes, we really can teach qemu to skip these pages and it's not hard.
> > > > > > The problem is the poor performance, this PV solution
> > > > >
> > > > > Balloon is always PV. And do not call patches solutions please.
> > > > >
> > > > > > is aimed to make it more
> > > > > > efficient and reduce the performance impact on guest.
> > > > >
> > > > > We need to get a bit beyond this.  You are making multiple
> > > > > changes, it seems to make sense to split it all up, and analyse
> > > > > each change separately.
> > > >
> > > > Couldn't agree more.
> > > >
> > > > There are three stages in this optimization:
> > > >
> > > > 1) choosing which pages to skip
> > > >
> > > > 2) communicating them from guest to host
> > > >
> > > > 3) skip transferring uninteresting pages to the remote side on
> > > > migration
> > > >
> > > > For (3) there seems to be a low-hanging fruit to amend
> > > > migration/ram.c:iz_zero_range() to consult /proc/self/pagemap.  This
> > > > would work for guest RAM that hasn't been touched yet or which has
> > > > been ballooned out.
> > > >
> > > > For (1) I've been trying to make a point that skipping clean pages
> > > > is much more likely to result in noticable benefit than free pages only.
> > > >
> > >
> > > I am considering to drop the pagecache before getting the free pages.
> > >
> > > > As for (2), we do seem to have a problem with the existing balloon:
> > > > according to your measurements it's very slow; besides, I guess it
> > > > plays badly
> > >
> > > I didn't say communicating is slow. Even this is very slow, my
> > > solution use bitmap instead of PFNs, there is fewer data traffic, so it's
> > faster than the existing balloon which use PFNs.
> > 
> > By how much?
> > 
> 
> Haven't measured yet. 
> To identify a page, 1 bit is needed if using bitmap, 4 Bytes(32bit) is needed 
> if using PFN, 
> 
> For a guest with 8GB RAM,  the corresponding free page bitmap size is 256KB.
> And the corresponding total PFNs size is 8192KB. Assuming the inflating size
> is 7GB, the total PFNs size is 7168KB.

Yes, but this is not how the balloon works: instead, it reuses a single
4K page multiple times. We can also trade off more memory for speed
if we want to; it's completely up to the guest.
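
A back-of-the-envelope sketch of that trade-off, assuming 4 KiB pages, 4-byte
PFNs, and a buffer that is completely filled with PFNs on every use (the real
driver's batch size may differ, so treat `buffer_bytes` as a free parameter):

```python
PAGE_SIZE = 4096   # assumed guest page size in bytes
PFN_BYTES = 4      # assumed size of one PFN entry


def buffer_refills(inflate_bytes, buffer_bytes=PAGE_SIZE):
    """How many times the PFN buffer must be filled and handed to the host."""
    pages = inflate_bytes // PAGE_SIZE
    pfns_per_buffer = buffer_bytes // PFN_BYTES
    return -(-pages // pfns_per_buffer)  # ceiling division


# Inflating 7 GiB through one reused 4 KiB buffer.
refills_7g = buffer_refills(7 * 1024**3)

# Same inflation with 16 pages of buffer: more memory, fewer exchanges.
refills_7g_big = buffer_refills(7 * 1024**3, buffer_bytes=16 * PAGE_SIZE)
```

So the memory footprint stays at one buffer while the number of guest/host
exchanges grows with the amount of memory being ballooned.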

> 
> Maybe this is not the point.
> 
> Liang



> > > > with transparent huge pages (as both the guest and the host work
> > > > with one 4k page at a time).  This is a problem for other use cases
> > > > of balloon (e.g. as a facility for resource management); tackling
> > > > that appears a more natural application for optimization efforts.
> > > >
> > > > Thanks,
> > > > Roman.



Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-10 Thread Dr. David Alan Gilbert
Hi,
  I'm just catching back up on this thread; so without reference to any
particular previous mail in the thread.

  1) How many of the free pages do we tell the host about?
 Your main change is telling the host about all the
 free pages.
 If we tell the host about all the free pages, then we might
 end up needing to allocate more pages and update the host
 with pages we now want to use; that would have to wait for the
 host to acknowledge that use of these pages, since if we don't
 wait for it then it might have skipped migrating a page we
 just started using (I don't understand how your series solves that).
 So the guest probably needs to keep some free pages - how many?

  2) Clearing out caches
 Does it make sense to clean caches?  They're apparently useful data
 so if we clean them it's likely to slow the guest down; I guess
 they're also likely to be fairly static data - so at least fairly
 easy to migrate.
 The answer here partially depends on what you want from your migration;
 if you're after the fastest possible migration time it might make
 sense to clean the caches and avoid migrating them; but that might
 be at the cost of more disruption to the guest - there's a trade off
 somewhere and it's not clear to me how you set that depending on your
 guest/network/requirements.

  3) Why is ballooning slow?
 You've got a figure of 5s to balloon on an 8GB VM - but an
 8GB VM isn't huge; so I worry about how long it would take
 on a big VM.   We need to understand why it's slow
   * is it due to the guest shuffling pages around?
   * is it due to the virtio-balloon protocol sending one page
     at a time?
     + Do balloon pages normally clump in physical memory
       - i.e. would a 'large balloon' message help
       - or do we need a bitmap because it tends not to clump?

   * is it due to the madvise on the host?
     If we were using the normal balloon messages, then we
     could, during migration, just route those to the migration
     code rather than bothering with the madvise.
     If they're clumping together we could just turn that into
     one big madvise; if they're not then would we benefit from
     a call that lets us madvise lots of areas?
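
 One way to look at the clumping question is to coalesce the reported PFNs
 into contiguous runs and see how many runs come out. A minimal sketch, with
 made-up PFN lists standing in for real balloon data:

```python
def coalesce_pfns(pfns):
    """Merge page frame numbers into sorted (first_pfn, page_count) runs."""
    runs = []
    for pfn in sorted(set(pfns)):
        if runs and runs[-1][0] + runs[-1][1] == pfn:
            # Extends the previous run by one page.
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)
        else:
            runs.append((pfn, 1))
    return runs


# Clumped pages collapse into a few runs, i.e. a few large madvise calls.
clumped = coalesce_pfns([10, 11, 12, 13, 40, 41])

# Scattered pages stay one run per page, where a bitmap would do better.
scattered = coalesce_pfns([10, 20, 30])
```

 If the run count is close to the page count, a 'large balloon' message or a
 big madvise would gain little and a bitmap looks more attractive.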

  4) Speeding up the migration of those free pages
You're using the bitmap to avoid migrating those free pages; HPe's
patchset is reconstructing a bitmap from the balloon data;  OK, so
this all makes sense to avoid migrating them - I'd also been thinking
of using pagemap to spot zero pages that would help find other zero'd
pages, but perhaps ballooned is enough?

  5) Second-migrate
Given a VM on which you've done all those tricks, what happens when
you migrate it a second time?   I guess you're aiming for the guest
to update its bitmap;  HPe's solution is to migrate its balloon
bitmap along with the migration data.
 
Dave

--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-10 Thread Roman Kagan
On Wed, Mar 09, 2016 at 07:39:18PM +0200, Michael S. Tsirkin wrote:
> On Wed, Mar 09, 2016 at 08:04:39PM +0300, Roman Kagan wrote:
> > On Wed, Mar 09, 2016 at 05:41:39PM +0200, Michael S. Tsirkin wrote:
> > > On Wed, Mar 09, 2016 at 05:28:54PM +0300, Roman Kagan wrote:
> > > > For (1) I've been trying to make a point that skipping clean pages is
> > > > much more likely to result in noticable benefit than free pages only.
> > > 
> > > I guess when you say clean you mean zero?
> > 
> > No I meant clean, i.e. those that could be evicted from RAM without
> > causing I/O.
> 
> They must be migrated unless guest actually evicts them.

If the balloon is inflated the guest will.

> It's not at all clear to me that it's always preferable
> to drop all clean pages from pagecache. It is clearly is
> going to slow the guest down significantly.

That's a matter for optimization.  The current value for
/proc/meminfo:MemAvailable (which is being proposed as a member of
balloon stats, too) is a conservative estimate which will probably cover
a good deal of cases.
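
For reference, a small sketch of reading that value inside the guest. The
parser works on the usual "Name:   value kB" layout of /proc/meminfo; the
sample text and its numbers are invented for illustration only:

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field_name: value_in_kB}."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


# Invented sample; on a real guest, read the text from /proc/meminfo instead.
SAMPLE = """MemTotal:       8174352 kB
MemFree:         524288 kB
MemAvailable:   6291456 kB
"""

info = parse_meminfo(SAMPLE)

# Memory that is not free but is estimated to be reclaimable without swapping.
reclaimable_kb = info["MemAvailable"] - info["MemFree"]
```

The gap between MemAvailable and MemFree is roughly what inflating the balloon
could reclaim on top of the plain free pages.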

> > I must be missing something obvious, but how is that different from
> > inflating and then immediately deflating the balloon?
> 
> It's exactly the same except
> - we do not initiate this from host - it's guest doing
>   things for its own reasons
> - a bit less guest/host interaction this way

I don't quite understand why you need to deflate the balloon until the
VM is on the destination host.  deflate_on_oom will do it if the guest
is really tight on memory; otherwise there appears to be no reason for
it.  But then inflation followed immediately by deflation doubles the
guest/host interactions rather than reduces them, no?

> > it's just the granularity that makes things slow and
> > stands in the way.
> 
> So we could request a specific page size/alignment from guest.
> Send guest request to give us memory in aligned units of 2Mbytes,
> and then host can treat each of these as a single huge page.

I'd guess just coalescing contiguous pages would already speed things
up.  I'll try to find some time to experiment with it.

Roman.



Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-10 Thread Roman Kagan
On Wed, Mar 09, 2016 at 02:38:52PM -0500, Rik van Riel wrote:
> On Wed, 2016-03-09 at 20:04 +0300, Roman Kagan wrote:
> > On Wed, Mar 09, 2016 at 05:41:39PM +0200, Michael S. Tsirkin wrote:
> > > On Wed, Mar 09, 2016 at 05:28:54PM +0300, Roman Kagan wrote:
> > > > For (1) I've been trying to make a point that skipping clean
> > > > pages is
> > > > much more likely to result in noticable benefit than free pages
> > > > only.
> > > 
> > > I guess when you say clean you mean zero?
> > 
> > No I meant clean, i.e. those that could be evicted from RAM without
> > causing I/O.
> > 
> 
> Programs in the guest may have that memory mmapped.
> This could include things like libraries and executables.
> 
> How do you deal with the guest page cache containing
> references to now non-existent memory?
> 
> How do you re-populate the memory on the destination
> host?

I guess the confusion is due to the context I stripped from the previous
messages...  Actually I've been talking about doing full-fledged balloon
inflation before the migration, so, when it's deflated the guest will
fault in that data from the filesystem as usual.

Roman.



Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-10 Thread Li, Liang Z
> >  Could provide more information on how to use virtio-serial to exchange data?
> >  Thread , Wiki or code are all OK.
> >  I have not find some useful information yet.
> 
> See this commit in the Linux sources:
> 
> 108fc82596e3b66b819df9d28c1ebbc9ab5de14c
> 
> that adds a way to send guest trace data over to the host.  I think that's the
> most relevant to your use-case.  However, you'll have to add an in-kernel
> user of virtio-serial (like the virtio-console code
> -- the code that deals with tty and hvc currently).  There's no other non-tty
> user right now, and this is the right kind of use-case to add one for!
> 
> For many other (userspace) use-cases, see the qemu-guest-agent in the
> qemu sources.
> 
> The API is documented in the wiki:
> 
> http://www.linux-kvm.org/page/Virtio-serial_API
> 
> and the feature pages have some information that may help as well:
> 
> https://fedoraproject.org/wiki/Features/VirtioSerial
> 
> There are some links in here too:
> 
> http://log.amitshah.net/2010/09/communication-between-guests-and-hosts/
> 
> Hope this helps.
> 
> 
>   Amit

Thanks a lot!

Liang



Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-09 Thread Amit Shah
On (Thu) 10 Mar 2016 [07:44:19], Li, Liang Z wrote:
> 
> Hi Amit,
> 
>  Could provide more information on how to use virtio-serial to exchange data? 
>  Thread , Wiki or code are all OK. 
>  I have not find some useful information yet.

See this commit in the Linux sources:

108fc82596e3b66b819df9d28c1ebbc9ab5de14c

that adds a way to send guest trace data over to the host.  I think
that's the most relevant to your use-case.  However, you'll have to
add an in-kernel user of virtio-serial (like the virtio-console code
-- the code that deals with tty and hvc currently).  There's no other
non-tty user right now, and this is the right kind of use-case to add
one for!

For many other (userspace) use-cases, see the qemu-guest-agent in the
qemu sources.

The API is documented in the wiki:

http://www.linux-kvm.org/page/Virtio-serial_API

and the feature pages have some information that may help as well:

https://fedoraproject.org/wiki/Features/VirtioSerial

There are some links in here too:

http://log.amitshah.net/2010/09/communication-between-guests-and-hosts/

Hope this helps.


Amit



Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-09 Thread Li, Liang Z
> > This patch set is the QEMU side implementation.
> >
> > The virtio-balloon is extended so that QEMU can get the free pages
> > information from the guest through virtio.
> >
> > After getting the free pages information (a bitmap), QEMU can use it
> > to filter out the guest's free pages in the ram bulk stage. This make
> > the live migration process much more efficient.
> >
> > This RFC version doesn't take the post-copy and RDMA into
> > consideration, maybe both of them can benefit from this PV solution by
> > with some extra modifications.
> 
> I like the idea, just have to prove (review) and test it a lot to ensure we
> don't end up skipping pages that matter.
> 
> However, there are a couple of points:
> 
> In my opinion, the information that's exchanged between the guest and the
> host should be exchanged over a virtio-serial channel rather than
> virtio-balloon.  First, there's nothing related to the balloon here.
> It just happens to be memory info.  Second, I would never enable balloon in
> a guest that I want to be performance-sensitive.  So even if you add this as
> part of balloon, you'll find no one is using this solution.
> 
> Secondly, I suggest virtio-serial, because it's meant exactly to exchange
> free-flowing information between a host and a guest, and you don't need to
> extend any part of the protocol for it (hence no changes necessary to the
> spec).  You can see how spice, vnc, etc., use virtio-serial to exchange data.
> 
> 
>   Amit

Hi Amit,

 Could you provide more information on how to use virtio-serial to exchange data?
 A thread, wiki page, or code would all be fine.
 I have not found any useful information yet.

Thanks
Liang




Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-09 Thread Li, Liang Z
> > > > > Yes, we really can teach qemu to skip these pages and it's not hard.
> > > > > The problem is the poor performance, this PV solution
> > > >
> > > > Balloon is always PV. And do not call patches solutions please.
> > > >
> > > > > is aimed to make it more
> > > > > efficient and reduce the performance impact on guest.
> > > >
> > > > We need to get a bit beyond this.  You are making multiple
> > > > changes, it seems to make sense to split it all up, and analyse
> > > > each change separately.
> > >
> > > Couldn't agree more.
> > >
> > > There are three stages in this optimization:
> > >
> > > 1) choosing which pages to skip
> > >
> > > 2) communicating them from guest to host
> > >
> > > 3) skip transferring uninteresting pages to the remote side on
> > > migration
> > >
> > > For (3) there seems to be a low-hanging fruit to amend
> > > migration/ram.c:iz_zero_range() to consult /proc/self/pagemap.  This
> > > would work for guest RAM that hasn't been touched yet or which has
> > > been ballooned out.
> > >
> > > For (1) I've been trying to make a point that skipping clean pages
> > > is much more likely to result in noticable benefit than free pages only.
> > >
> >
> > I am considering to drop the pagecache before getting the free pages.
> >
> > > As for (2), we do seem to have a problem with the existing balloon:
> > > according to your measurements it's very slow; besides, I guess it
> > > plays badly
> >
> > I didn't say communicating is slow. Even this is very slow, my
> > solution use bitmap instead of PFNs, there is fewer data traffic, so it's
> faster than the existing balloon which use PFNs.
> 
> By how much?
> 

I haven't measured it yet.
To identify a page, 1 bit is needed with a bitmap, while 4 bytes (32 bits) are
needed with a PFN.

For a guest with 8GB RAM (assuming 4KB pages), the corresponding free page
bitmap is 256KB, and the PFNs for all of its pages total 8192KB. Assuming the
inflating size is 7GB, the PFNs total 7168KB.
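
The figures can be re-derived with a few lines of arithmetic. This sketch
assumes 4 KiB pages and 4-byte PFNs, which is what the numbers above imply:

```python
PAGE_SIZE = 4096   # assumed guest page size in bytes
PFN_BYTES = 4      # one 32-bit PFN per page
KIB = 1024
GIB = 1024**3


def bitmap_kib(ram_bytes):
    """Size of a one-bit-per-page bitmap covering ram_bytes, in KiB."""
    pages = ram_bytes // PAGE_SIZE
    return pages // 8 // KIB


def pfn_list_kib(ram_bytes):
    """Size of a list holding one PFN per page of ram_bytes, in KiB."""
    pages = ram_bytes // PAGE_SIZE
    return pages * PFN_BYTES // KIB


bitmap_8g = bitmap_kib(8 * GIB)      # bitmap for the whole 8 GiB guest
pfns_8g = pfn_list_kib(8 * GIB)      # PFNs for every page of the guest
pfns_7g = pfn_list_kib(7 * GIB)      # PFNs for a 7 GiB inflation
```

That is a factor of 32 between the two encodings when every page is listed;
the bitmap size is fixed by guest RAM, while the PFN list shrinks as fewer
pages are reported.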

Maybe this is not the point.

Liang

> > > with transparent huge pages (as both the guest and the host work
> > > with one 4k page at a time).  This is a problem for other use cases
> > > of balloon (e.g. as a facility for resource management); tackling
> > > that appears a more natural application for optimization efforts.
> > >
> > > Thanks,
> > > Roman.



Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-09 Thread Rik van Riel
On Wed, 2016-03-09 at 20:04 +0300, Roman Kagan wrote:
> On Wed, Mar 09, 2016 at 05:41:39PM +0200, Michael S. Tsirkin wrote:
> > On Wed, Mar 09, 2016 at 05:28:54PM +0300, Roman Kagan wrote:
> > > For (1) I've been trying to make a point that skipping clean
> > > pages is
> > > much more likely to result in noticable benefit than free pages
> > > only.
> > 
> > I guess when you say clean you mean zero?
> 
> No I meant clean, i.e. those that could be evicted from RAM without
> causing I/O.
> 

Programs in the guest may have that memory mmapped.
This could include things like libraries and executables.

How do you deal with the guest page cache containing
references to now non-existent memory?

How do you re-populate the memory on the destination
host?

-- 
All rights reversed




Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-09 Thread Michael S. Tsirkin
On Wed, Mar 09, 2016 at 08:04:39PM +0300, Roman Kagan wrote:
> On Wed, Mar 09, 2016 at 05:41:39PM +0200, Michael S. Tsirkin wrote:
> > On Wed, Mar 09, 2016 at 05:28:54PM +0300, Roman Kagan wrote:
> > > For (1) I've been trying to make a point that skipping clean pages is
> > > much more likely to result in noticable benefit than free pages only.
> > 
> > I guess when you say clean you mean zero?
> 
> No I meant clean, i.e. those that could be evicted from RAM without
> causing I/O.

They must be migrated unless the guest actually evicts them.
It's not at all clear to me that it's always preferable
to drop all clean pages from the pagecache. It clearly is
going to slow the guest down significantly.


> > Yea. In fact, one can zero out any number of pages
> > quickly by putting them in balloon and immediately
> > taking them out.
> > 
> > Access will fault a zero page in, then COW kicks in.
> 
> I must be missing something obvious, but how is that different from
> inflating and then immediately deflating the balloon?

It's exactly the same except
- we do not initiate this from host - it's guest doing
  things for its own reasons
- a bit less guest/host interaction this way


> > We could have a new zero VQ (or some other option)
> > to pass these pages guest to host, but this only
> > works well if page size matches the host page size.
> 
> I'm afraid I don't yet understand what kind of pages that would be and
> how they are different from ballooned pages.
> 
> I still tend to think that ballooning is a sensible solution to the
> problem at hand;

I think it is, too. This does not mean we can't improve things though.
This patchset is reported to improve things; it should be
split up so we improve them for everyone and not just
one specific workload.


> it's just the granularity that makes things slow and
> stands in the way.

So we could request a specific page size/alignment from the guest.
Send the guest a request to give us memory in aligned units of 2 MB,
and then the host can treat each of these as a single huge page.
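
A sketch of what "aligned units of 2 MB" means in terms of 4 KiB page frames:
a unit is usable only if all 512 frames of a 2 MiB-aligned block are
available. The helper name and the example PFNs are hypothetical:

```python
PAGES_PER_HUGE = 512   # 2 MiB / 4 KiB, assuming 4 KiB base pages


def aligned_huge_chunks(free_pfns):
    """Return the base PFN of every 2 MiB-aligned block that is entirely free."""
    free = set(free_pfns)
    # Round each PFN down to its 2 MiB boundary to get candidate blocks.
    candidates = {pfn - pfn % PAGES_PER_HUGE for pfn in free}
    return sorted(
        base
        for base in candidates
        if all(base + i in free for i in range(PAGES_PER_HUGE))
    )


# Frames 512..1499 are free: the block at 512 is complete, the block at 1024
# is only partly free and so cannot be handed over as one huge page.
chunks = aligned_huge_chunks(range(512, 1500))
```

Only the complete blocks could be treated by the host as single huge pages;
the leftover frames would still have to go through the 4k path.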


> Roman.
-- 
MST



Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-09 Thread Roman Kagan
On Wed, Mar 09, 2016 at 05:41:39PM +0200, Michael S. Tsirkin wrote:
> On Wed, Mar 09, 2016 at 05:28:54PM +0300, Roman Kagan wrote:
> > For (1) I've been trying to make a point that skipping clean pages is
> > much more likely to result in noticable benefit than free pages only.
> 
> I guess when you say clean you mean zero?

No I meant clean, i.e. those that could be evicted from RAM without
causing I/O.

> Yea. In fact, one can zero out any number of pages
> quickly by putting them in balloon and immediately
> taking them out.
> 
> Access will fault a zero page in, then COW kicks in.

I must be missing something obvious, but how is that different from
inflating and then immediately deflating the balloon?

> We could have a new zero VQ (or some other option)
> to pass these pages guest to host, but this only
> works well if page size matches the host page size.

I'm afraid I don't yet understand what kind of pages that would be and
how they are different from ballooned pages.

I still tend to think that ballooning is a sensible solution to the
problem at hand; it's just the granularity that makes things slow and
stands in the way.

Roman.



Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-09 Thread Michael S. Tsirkin
On Wed, Mar 09, 2016 at 05:28:54PM +0300, Roman Kagan wrote:
> On Mon, Mar 07, 2016 at 01:40:06PM +0200, Michael S. Tsirkin wrote:
> > On Mon, Mar 07, 2016 at 06:49:19AM +, Li, Liang Z wrote:
> > > > > No. And it's exactly what I mean. The ballooned memory is still
> > > > > processed during live migration without skipping. The live migration
> > > > > code is in migration/ram.c.
> > > > 
> > > > So if guest acknowledged VIRTIO_BALLOON_F_MUST_TELL_HOST, we can
> > > > teach qemu to skip these pages.
> > > > Want to write a patch to do this?
> > > > 
> > > 
> > > Yes, we really can teach qemu to skip these pages and it's not hard.  
> > > The problem is the poor performance, this PV solution
> > 
> > Balloon is always PV. And do not call patches solutions please.
> > 
> > > is aimed to make it more
> > > efficient and reduce the performance impact on guest.
> > 
> > We need to get a bit beyond this.  You are making multiple
> > changes, it seems to make sense to split it all up, and analyse each
> > change separately.
> 
> Couldn't agree more.
> 
> There are three stages in this optimization:
> 
> 1) choosing which pages to skip
> 
> 2) communicating them from guest to host
> 
> 3) skip transferring uninteresting pages to the remote side on migration
> 
> For (3) there seems to be a low-hanging fruit to amend
> migration/ram.c:iz_zero_range() to consult /proc/self/pagemap.  This
> would work for guest RAM that hasn't been touched yet or which has been
> ballooned out.
> 
> For (1) I've been trying to make a point that skipping clean pages is
> much more likely to result in noticable benefit than free pages only.

I guess when you say clean you mean zero?

Yea. In fact, one can zero out any number of pages
quickly by putting them in balloon and immediately
taking them out.

Access will fault a zero page in, then COW kicks in.

We could have a new zero VQ (or some other option)
to pass these pages guest to host, but this only
works well if page size matches the host page size.
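
The zeroing effect itself can be seen from user space. The sketch below
assumes Linux and Python 3.8 or later (for `mmap.madvise`), and uses
MADV_DONTNEED on private anonymous memory as a stand-in for what the host does
to a ballooned page; it is an illustration, not the QEMU code path:

```python
import mmap
import sys


def discard_reads_back_zero(npages=4):
    """Dirty some anonymous pages, discard them, and check they read as zero.

    Returns None when the platform does not offer the needed calls.
    """
    if not sys.platform.startswith("linux"):
        return None
    if not hasattr(mmap, "MADV_DONTNEED") or not hasattr(mmap.mmap, "madvise"):
        return None
    size = npages * mmap.PAGESIZE
    buf = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    try:
        buf[:] = b"\xaa" * size              # touch every page with non-zero data
        buf.madvise(mmap.MADV_DONTNEED)      # drop the backing pages
        return buf[:] == bytes(size)         # next access faults in zero pages
    finally:
        buf.close()


result = discard_reads_back_zero()
```

After the discard, the old contents are gone and the first access to each page
gets zero-filled memory, which is why a page that went through the balloon can
be treated as a zero page.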




> As for (2), we do seem to have a problem with the existing balloon:
> according to your measurements it's very slow; besides, I guess it plays
> badly with transparent huge pages (as both the guest and the host work
> with one 4k page at a time).  This is a problem for other use cases of
> balloon (e.g. as a facility for resource management); tackling that
> appears a more natural application for optimization efforts.
> 
> Thanks,
> Roman.



Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-09 Thread Michael S. Tsirkin
On Wed, Mar 09, 2016 at 03:27:54PM +, Li, Liang Z wrote:
> > On Mon, Mar 07, 2016 at 01:40:06PM +0200, Michael S. Tsirkin wrote:
> > > On Mon, Mar 07, 2016 at 06:49:19AM +, Li, Liang Z wrote:
> > > > > > No. And it's exactly what I mean. The ballooned memory is still
> > > > > > processed during live migration without skipping. The live
> > > > > > migration code is in migration/ram.c.
> > > > >
> > > > > So if guest acknowledged VIRTIO_BALLOON_F_MUST_TELL_HOST, we can
> > > > > teach qemu to skip these pages.
> > > > > Want to write a patch to do this?
> > > > >
> > > >
> > > > Yes, we really can teach qemu to skip these pages and it's not hard.
> > > > The problem is the poor performance, this PV solution
> > >
> > > Balloon is always PV. And do not call patches solutions please.
> > >
> > > > is aimed to make it more
> > > > efficient and reduce the performance impact on guest.
> > >
> > > We need to get a bit beyond this.  You are making multiple changes, it
> > > seems to make sense to split it all up, and analyse each change
> > > separately.
> > 
> > Couldn't agree more.
> > 
> > There are three stages in this optimization:
> > 
> > 1) choosing which pages to skip
> > 
> > 2) communicating them from guest to host
> > 
> > 3) skip transferring uninteresting pages to the remote side on migration
> > 
> > For (3) there seems to be a low-hanging fruit to amend
> > migration/ram.c:iz_zero_range() to consult /proc/self/pagemap.  This would
> > work for guest RAM that hasn't been touched yet or which has been
> > ballooned out.
> > 
> > For (1) I've been trying to make a point that skipping clean pages is much
> > more likely to result in noticable benefit than free pages only.
> > 
> 
> I am considering to drop the pagecache before getting the free pages. 
> 
> > As for (2), we do seem to have a problem with the existing balloon:
> > according to your measurements it's very slow; besides, I guess it plays badly
> 
> I didn't say communicating is slow. Even this is very slow, my solution use
> bitmap instead of PFNs, there is fewer data traffic, so it's faster than the
> existing balloon which use PFNs.

By how much?

> > with transparent huge pages (as both the guest and the host work with one
> > 4k page at a time).  This is a problem for other use cases of balloon (e.g.
> > as a facility for resource management); tackling that appears a more natural
> > application for optimization efforts.
> > 
> > Thanks,
> > Roman.



Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-09 Thread Li, Liang Z
> On Mon, Mar 07, 2016 at 01:40:06PM +0200, Michael S. Tsirkin wrote:
> > On Mon, Mar 07, 2016 at 06:49:19AM +, Li, Liang Z wrote:
> > > > > No. And it's exactly what I mean. The ballooned memory is still
> > > > > processed during live migration without skipping. The live
> > > > > migration code is in migration/ram.c.
> > > >
> > > > So if guest acknowledged VIRTIO_BALLOON_F_MUST_TELL_HOST, we can
> > > > teach qemu to skip these pages.
> > > > Want to write a patch to do this?
> > > >
> > >
> > > Yes, we really can teach qemu to skip these pages and it's not hard.
> > > The problem is the poor performance, this PV solution
> >
> > Balloon is always PV. And do not call patches solutions please.
> >
> > > is aimed to make it more
> > > efficient and reduce the performance impact on guest.
> >
> > We need to get a bit beyond this.  You are making multiple changes, it
> > seems to make sense to split it all up, and analyse each change
> > separately.
> 
> Couldn't agree more.
> 
> There are three stages in this optimization:
> 
> 1) choosing which pages to skip
> 
> 2) communicating them from guest to host
> 
> 3) skip transferring uninteresting pages to the remote side on migration
> 
> For (3) there seems to be a low-hanging fruit to amend
> migration/ram.c:iz_zero_range() to consult /proc/self/pagemap.  This would
> work for guest RAM that hasn't been touched yet or which has been
> ballooned out.
> 
> For (1) I've been trying to make a point that skipping clean pages is much
> more likely to result in noticable benefit than free pages only.
> 

I am considering dropping the page cache before getting the free pages.

> As for (2), we do seem to have a problem with the existing balloon:
> according to your measurements it's very slow; besides, I guess it plays badly

I didn't say the communication is slow. Even if it is very slow, my solution
uses a bitmap instead of PFNs, so there is less data traffic and it is faster
than the existing balloon, which uses PFNs.

> with transparent huge pages (as both the guest and the host work with one
> 4k page at a time).  This is a problem for other use cases of balloon (e.g.
> as a facility for resource management); tackling that appears a more natural
> application for optimization efforts.
> 
> Thanks,
> Roman.



Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-09 Thread Roman Kagan
On Mon, Mar 07, 2016 at 01:40:06PM +0200, Michael S. Tsirkin wrote:
> On Mon, Mar 07, 2016 at 06:49:19AM +, Li, Liang Z wrote:
> > > > No. And it's exactly what I mean. The ballooned memory is still
> > > > processed during live migration without skipping. The live migration
> > > > code is in migration/ram.c.
> > > 
> > > So if guest acknowledged VIRTIO_BALLOON_F_MUST_TELL_HOST, we can
> > > teach qemu to skip these pages.
> > > Want to write a patch to do this?
> > > 
> > 
> > Yes, we really can teach qemu to skip these pages and it's not hard.  
> > The problem is the poor performance, this PV solution
> 
> Balloon is always PV. And do not call patches solutions please.
> 
> > is aimed to make it more
> > efficient and reduce the performance impact on guest.
> 
> We need to get a bit beyond this.  You are making multiple
> changes, it seems to make sense to split it all up, and analyse each
> change separately.

Couldn't agree more.

There are three stages in this optimization:

1) choosing which pages to skip

2) communicating them from guest to host

3) skip transferring uninteresting pages to the remote side on migration

For (3) there seems to be a low-hanging fruit to amend
migration/ram.c:is_zero_range() to consult /proc/self/pagemap.  This
would work for guest RAM that hasn't been touched yet or which has been
ballooned out.
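
A rough sketch of such a pagemap check. Each pagemap entry is a 64-bit word
per virtual page, with bit 63 meaning "present in RAM" and bit 62 meaning
"swapped"; a page with neither bit set has no backing at all. The helper names
are made up, and treating "no backing" as "reads as zero" is an assumption
that holds for private anonymous memory such as guest RAM:

```python
import struct

PM_PRESENT = 1 << 63   # page is present in RAM
PM_SWAPPED = 1 << 62   # page is in swap


def page_has_no_backing(entry):
    """True when a pagemap entry is neither present nor swapped."""
    return not (entry & (PM_PRESENT | PM_SWAPPED))


def read_pagemap_entries(vaddr, npages, page_size=4096):
    """Read raw pagemap entries for npages starting at vaddr (Linux only)."""
    with open("/proc/self/pagemap", "rb") as f:
        f.seek((vaddr // page_size) * 8)
        data = f.read(npages * 8)
    return list(struct.unpack(f"={len(data) // 8}Q", data))


# Decoding a few hand-made entries; no /proc access is needed for this part.
untouched = page_has_no_backing(0)
resident = page_has_no_backing(PM_PRESENT | 0x1234)
swapped = page_has_no_backing(PM_SWAPPED)
```

A present page that happens to contain zeros is not caught by this test, so
the byte-wise zero check would still be needed as a fallback.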

For (1) I've been trying to make a point that skipping clean pages is
much more likely to result in noticeable benefit than free pages only.

As for (2), we do seem to have a problem with the existing balloon:
according to your measurements it's very slow; besides, I guess it plays
badly with transparent huge pages (as both the guest and the host work
with one 4k page at a time).  This is a problem for other use cases of
balloon (e.g. as a facility for resource management); tackling that
appears a more natural application for optimization efforts.

Thanks,
Roman.



Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-09 Thread Li, Liang Z
> On Fri, Mar 04, 2016 at 06:51:21PM +, Dr. David Alan Gilbert wrote:
> > * Paolo Bonzini (pbonz...@redhat.com) wrote:
> > >
> > >
> > > On 04/03/2016 15:26, Li, Liang Z wrote:
> > > >> >
> > > >> > The memory usage will keep increasing due to ever growing
> > > >> > caches, etc, so you'll be left with very little free memory fairly soon.
> > > >> >
> > > > I don't think so.
> > > >
> > >
> > > Roman is right.  For example, here I am looking at a 64 GB
> > > (physical) machine which was booted about 30 minutes ago, and which
> > > is running disk-heavy workloads (installing VMs).
> > >
> > > Since I have started writing this email (2 minutes?), the amount of
> > > free memory has already gone down from 37 GB to 33 GB.  I expect
> > > that by the time I have finished running the workload, in two hours,
> > > it will not have any free memory.
> >
> > But what about a VM sitting idle, or that just has more RAM assigned
> > to it than is currently using.
> >  I've got a host here that's been up for 46 days and has been doing
> > some heavy VM debugging a few days ago, but today:
> >
> > # free -m
> >               total        used        free      shared  buff/cache   available
> > Mem:          96536        1146       44834         184       50555       94735
> >
> > I very rarely use all it's RAM, so it's got a big chunk of free RAM,
> > and yes it's got a big chunk of cache as well.
> 
> One of the promises of virtualization is better resource utilization.
> People tend to avoid purchasing VMs so much oversized that they never
> touch a significant amount of their RAM.  (Well, at least this is how things
> stand in hosting market; I guess enterprize market is similar in this regard).
> 
> That said, I'm not at all opposed to optimizing the migration of free memory;
> what I'm trying to say is that creating brand new infrastructure specifically
> for that case doesn't look justified when the existing one can cover it in
> addition to much more common scenarios.
> 
> Roman.

Even though the existing one can cover more common scenarios, it has a
performance issue; that's why I created a new one.

Liang



Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-09 Thread Roman Kagan
On Fri, Mar 04, 2016 at 06:51:21PM +, Dr. David Alan Gilbert wrote:
> * Paolo Bonzini (pbonz...@redhat.com) wrote:
> > 
> > 
> > On 04/03/2016 15:26, Li, Liang Z wrote:
> > >> > 
> > >> > The memory usage will keep increasing due to ever growing caches, etc,
> > >> > so you'll be left with very little free memory fairly soon.
> > >> > 
> > > I don't think so.
> > > 
> > 
> > Roman is right.  For example, here I am looking at a 64 GB (physical)
> > machine which was booted about 30 minutes ago, and which is running
> > disk-heavy workloads (installing VMs).
> > 
> > Since I have started writing this email (2 minutes?), the amount of free
> > memory has already gone down from 37 GB to 33 GB.  I expect that by the
> > time I have finished running the workload, in two hours, it will not
> > have any free memory.
> 
> But what about a VM sitting idle, or that just has more RAM assigned to it
> than is currently using.
>  I've got a host here that's been up for 46 days and has been doing some
> heavy VM debugging a few days ago, but today:
> 
> # free -m
>               total        used        free      shared  buff/cache   available
> Mem:          96536        1146       44834         184       50555       94735
> 
> I very rarely use all it's RAM, so it's got a big chunk of free RAM, and yes
> it's got a big chunk of cache as well.

One of the promises of virtualization is better resource utilization.
People tend to avoid purchasing VMs so much oversized that they never
touch a significant amount of their RAM.  (Well, at least this is how
things stand in the hosting market; I guess the enterprise market is similar
in this regard).

That said, I'm not at all opposed to optimizing the migration of free
memory; what I'm trying to say is that creating brand new infrastructure
specifically for that case doesn't look justified when the existing one
can cover it in addition to much more common scenarios.

Roman.



Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-08 Thread Li, Liang Z
> On 04/03/2016 15:26, Li, Liang Z wrote:
> >> >
> >> > The memory usage will keep increasing due to ever growing caches,
> >> > etc, so you'll be left with very little free memory fairly soon.
> >> >
> > I don't think so.
> >
> 
> Roman is right.  For example, here I am looking at a 64 GB (physical) machine
> which was booted about 30 minutes ago, and which is running disk-heavy
> workloads (installing VMs).
> 
> Since I have started writing this email (2 minutes?), the amount of free
> memory has already gone down from 37 GB to 33 GB.  I expect that by the
> time I have finished running the workload, in two hours, it will not have any
> free memory.
> 
> Paolo

I have a VM with 2GB of RAM. When the guest booted, there were about 1.4GB
of free pages.
Then I downloaded a large file from the internet with the browser; after the
download finished, only 72MB of free pages were left and, as Roman pointed
out, there was quite a lot of Cached memory.
Then I compiled QEMU; after the compilation finished, there were about
1.3GB of free pages.

So even if the cache grows to a large amount, it will be freed when some
other workload needs the memory.
Cache memory is a big issue that should be taken into consideration.
How about reclaiming some cache before getting the free pages information?

Liang 



Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-08 Thread Li, Liang Z
> On Fri, Mar 04, 2016 at 03:13:03PM +, Li, Liang Z wrote:
> > > > Maybe I am not clear enough.
> > > >
> > > > I mean if we inflate balloon before live migration, for a 8GB
> > > > guest, it takes
> > > about 5 Seconds for the inflating operation to finish.
> > >
> > > And these 5 seconds are spent where?
> > >
> >
> > The time is spent on allocating the pages and send the allocated pages
> > pfns to QEMU through virtio.
> 
> What if we skip allocating pages but use the existing interface to send pfns 
> to
> QEMU?
>

I think it will be much faster; allocating pages is the main reason the
operation takes so long.
An experiment is needed to get the exact time spent on sending the pfns.

Liang



Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-08 Thread Michael S. Tsirkin
On Fri, Mar 04, 2016 at 03:13:03PM +, Li, Liang Z wrote:
> > > Maybe I am not clear enough.
> > >
> > > I mean if we inflate balloon before live migration, for a 8GB guest, it 
> > > takes
> > about 5 Seconds for the inflating operation to finish.
> > 
> > And these 5 seconds are spent where?
> > 
> 
> The time is spent on allocating the pages and send the allocated pages pfns 
> to QEMU
> through virtio.

What if we skip allocating pages but use the existing interface to send pfns
to QEMU?

> > > For the PV solution, there is no need to inflate balloon before live
> > > migration, the only cost is to traversing the free_list to  construct
> > > the free pages bitmap, and it takes about 20ms for a 8GB idle guest( less 
> > > if
> > there is less free pages),  passing the free pages info to host will take 
> > about
> > extra 3ms.
> > >
> > >
> > > Liang
> > 
> > So now let's please stop talking about solutions at a high level and 
> > discuss the
> > interface changes you make in detail.
> > What makes it faster? Better host/guest interface? No need to go through
> > buddy allocator within guest? Less interrupts? Something else?
> > 
> 
> I assume you are familiar with the current virtio-balloon and how it works.
> The new interface is very simple: send a request to the virtio-balloon driver,
> the driver will traverse '&zone->free_area[order].free_list[t]' to
> construct a 'free_page_bitmap', and then the driver will send the content
> of 'free_page_bitmap' back to QEMU. That is all the new interface does, and
> there are no 'alloc_page' related affairs, so it's faster.
> 
> 
> Some code snippet:
> --
> +static void mark_free_pages_bitmap(struct zone *zone,
> +                unsigned long *free_page_bitmap, unsigned long pfn_gap)
> +{
> +        unsigned long pfn, flags, i;
> +        unsigned int order, t;
> +        struct list_head *curr;
> +
> +        if (zone_is_empty(zone))
> +                return;
> +
> +        spin_lock_irqsave(&zone->lock, flags);
> +
> +        for_each_migratetype_order(order, t) {
> +                list_for_each(curr, &zone->free_area[order].free_list[t]) {
> +
> +                        pfn = page_to_pfn(list_entry(curr, struct page, lru));
> +                        for (i = 0; i < (1UL << order); i++) {
> +                                if ((pfn + i) >= PFN_4G)
> +                                        set_bit_le(pfn + i - pfn_gap,
> +                                                   free_page_bitmap);
> +                                else
> +                                        set_bit_le(pfn + i, free_page_bitmap);
> +                        }
> +                }
> +        }
> +
> +        spin_unlock_irqrestore(&zone->lock, flags);
> +}
> 
> Sorry for my poor English and expression, if you still can't understand,
> you could glance at the patch, total about 400 lines.
> > 
> > > > --
> > > > MST
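
For readers following the quoted snippet, here is a minimal user-space sketch
of the same bitmap construction. It is not the kernel patch: struct free_block,
mark_free_blocks() and the two bit helpers are hypothetical stand-ins for the
buddy free lists and set_bit_le(). The bits use native word order rather than
little-endian order, and the pfn_4g/pfn_gap parameters only mirror the PFN_4G
test in the snippet. The point it illustrates is that the walk only sets bits
and never allocates a page.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical stand-in for one entry on a buddy free list. */
struct free_block {
    unsigned long pfn;   /* first page frame number of the block */
    unsigned int order;  /* block covers 2^order contiguous pages */
};

/* Set bit nr in map (native word order, unlike the kernel's set_bit_le). */
static void set_bit_ul(unsigned long nr, unsigned long *map)
{
    map[nr / BITS_PER_LONG] |= 1UL << (nr % BITS_PER_LONG);
}

/* Return 1 if bit nr is set in map, 0 otherwise. */
static int test_bit_ul(unsigned long nr, const unsigned long *map)
{
    return (int)((map[nr / BITS_PER_LONG] >> (nr % BITS_PER_LONG)) & 1UL);
}

/*
 * Mark every page of every free block in the bitmap. PFNs at or above
 * pfn_4g are shifted down by pfn_gap, mirroring the PFN_4G branch of the
 * quoted patch. No page is allocated; only bits are set.
 */
static void mark_free_blocks(const struct free_block *blocks, size_t nblocks,
                             unsigned long pfn_4g, unsigned long pfn_gap,
                             unsigned long *bitmap, unsigned long nbits)
{
    for (size_t b = 0; b < nblocks; b++) {
        for (unsigned long i = 0; i < (1UL << blocks[b].order); i++) {
            unsigned long pfn = blocks[b].pfn + i;
            unsigned long bit = pfn >= pfn_4g ? pfn - pfn_gap : pfn;

            assert(bit < nbits);  /* caller must size the bitmap */
            set_bit_ul(bit, bitmap);
        }
    }
}
```

The cost is proportional to the number of free pages, which fits the remark
in this thread that the traversal takes less time when fewer pages are free.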



Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-08 Thread Li, Liang Z
> Subject: Re: [RFC qemu 0/4] A PV solution for live migration optimization
> 
> On (Thu) 03 Mar 2016 [18:44:24], Liang Li wrote:
> > The current QEMU live migration implementation mark the all the
> > guest's RAM pages as dirtied in the ram bulk stage, all these pages
> > will be processed and that takes quit a lot of CPU cycles.
> >
> > From guest's point of view, it doesn't care about the content in free
> > pages. We can make use of this fact and skip processing the free pages
> > in the ram bulk stage, it can save a lot CPU cycles and reduce the
> > network traffic significantly while speed up the live migration
> > process obviously.
> >
> > This patch set is the QEMU side implementation.
> >
> > The virtio-balloon is extended so that QEMU can get the free pages
> > information from the guest through virtio.
> >
> > After getting the free pages information (a bitmap), QEMU can use it
> > to filter out the guest's free pages in the ram bulk stage. This make
> > the live migration process much more efficient.
> >
> > This RFC version doesn't take the post-copy and RDMA into
> > consideration, maybe both of them can benefit from this PV solution by
> > with some extra modifications.
> 
> I like the idea, just have to prove (review) and test it a lot to ensure we 
> don't
> end up skipping pages that matter.
> 
> However, there are a couple of points:
> 
> In my opinion, the information that's exchanged between the guest and the
> host should be exchanged over a virtio-serial channel rather than virtio-
> balloon.  First, there's nothing related to the balloon here.
> It just happens to be memory info.  Second, I would never enable balloon in
> a guest that I want to be performance-sensitive.  So even if you add this as
> part of balloon, you'll find no one is using this solution.
> 
> Secondly, I suggest virtio-serial, because it's meant exactly to exchange 
> free-
> flowing information between a host and a guest, and you don't need to
> extend any part of the protocol for it (hence no changes necessary to the
> spec).  You can see how spice, vnc, etc., use virtio-serial to exchange data.
> 
> 
>   Amit

I don't like using virtio-balloon either, and it's confusing.
It would be great if virtio-serial can be used; I will take a look at it.

Thanks for your suggestion!

Liang



Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-08 Thread Amit Shah
On (Thu) 03 Mar 2016 [18:44:24], Liang Li wrote:
> The current QEMU live migration implementation mark the all the
> guest's RAM pages as dirtied in the ram bulk stage, all these pages
> will be processed and that takes quit a lot of CPU cycles.
> 
> From guest's point of view, it doesn't care about the content in free
> pages. We can make use of this fact and skip processing the free
> pages in the ram bulk stage, it can save a lot CPU cycles and reduce
> the network traffic significantly while speed up the live migration
> process obviously.
> 
> This patch set is the QEMU side implementation.
> 
> The virtio-balloon is extended so that QEMU can get the free pages
> information from the guest through virtio.
> 
> After getting the free pages information (a bitmap), QEMU can use it
> to filter out the guest's free pages in the ram bulk stage. This make
> the live migration process much more efficient.
> 
> This RFC version doesn't take the post-copy and RDMA into
> consideration, maybe both of them can benefit from this PV solution
> by with some extra modifications.

I like the idea, just have to prove (review) and test it a lot to
ensure we don't end up skipping pages that matter.

However, there are a couple of points:

In my opinion, the information that's exchanged between the guest and
the host should be exchanged over a virtio-serial channel rather than
virtio-balloon.  First, there's nothing related to the balloon here.
It just happens to be memory info.  Second, I would never enable
balloon in a guest that I want to be performance-sensitive.  So even
if you add this as part of balloon, you'll find no one is using this
solution.

Secondly, I suggest virtio-serial, because it's meant exactly to
exchange free-flowing information between a host and a guest, and you
don't need to extend any part of the protocol for it (hence no changes
necessary to the spec).  You can see how spice, vnc, etc., use
virtio-serial to exchange data.


Amit
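
On the concern about skipping pages that matter: the approach described
earlier in this thread is to enable dirty logging before asking the guest for
its free pages, so that any page reused after being reported free is caught
by the dirty log. A minimal sketch of that filtering step follows. It is an
illustration only, not the code in the patch set: filter_free_pages() and
count_pages() are hypothetical names, and the three bitmaps are plain arrays
rather than QEMU's migration bitmap.

```c
#include <assert.h>
#include <stddef.h>

/*
 * to_send:  pages still to be transmitted (all ones at the start of the
 *           bulk stage)
 * free_map: pages the guest reported as free
 * dirty:    pages written since dirty logging was enabled
 *
 * Free pages are dropped first, then dirty pages are added back, so a page
 * that was free when reported but written afterwards is still sent.
 */
static void filter_free_pages(unsigned long *to_send,
                              const unsigned long *free_map,
                              const unsigned long *dirty,
                              size_t nlongs)
{
    assert(to_send && free_map && dirty);

    for (size_t i = 0; i < nlongs; i++) {
        to_send[i] &= ~free_map[i];
        to_send[i] |= dirty[i];
    }
}

/* Count the set bits, i.e. the number of pages left to send. */
static size_t count_pages(const unsigned long *map, size_t nlongs)
{
    size_t n = 0;

    for (size_t i = 0; i < nlongs; i++)
        for (unsigned long w = map[i]; w; w &= w - 1)
            n++;
    return n;
}
```

The order of the two operations is the design point: clearing the free pages
after merging the dirty log would lose exactly the pages the review needs to
prove are never skipped.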



Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-08 Thread Amit Shah
On (Fri) 04 Mar 2016 [15:02:47], Jitendra Kolhe wrote:
> > >
> > > * Liang Li (liang.z...@intel.com) wrote:
> > > > The current QEMU live migration implementation mark the all the
> > > > guest's RAM pages as dirtied in the ram bulk stage, all these pages
> > > > will be processed and that takes quit a lot of CPU cycles.
> > > >
> > > > From guest's point of view, it doesn't care about the content in free
> > > > pages. We can make use of this fact and skip processing the free pages
> > > > in the ram bulk stage, it can save a lot CPU cycles and reduce the
> > > > network traffic significantly while speed up the live migration
> > > > process obviously.
> > > >
> > > > This patch set is the QEMU side implementation.
> > > >
> > > > The virtio-balloon is extended so that QEMU can get the free pages
> > > > information from the guest through virtio.
> > > >
> > > > After getting the free pages information (a bitmap), QEMU can use it
> > > > to filter out the guest's free pages in the ram bulk stage. This make
> > > > the live migration process much more efficient.
> > >
> > > Hi,
> > >   An interesting solution; I know a few different people have been 
> > > looking at
> > > how to speed up ballooned VM migration.
> > >
> >
> > Ooh, different solutions for the same purpose, and both based on the 
> > balloon.
> 
> We were also trying to address a similar problem, without actually needing to
> modify the guest driver. Please find patch details under the mail with subject:
> migration: skip sending ram pages released by virtio-balloon driver

The scope of this patch series seems to be wider: don't send free
pages to a dest at all, vs. don't send pages that are ballooned out.

Amit



Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-07 Thread Li, Liang Z
> Cc: Roman Kagan; Dr. David Alan Gilbert; ehabk...@redhat.com;
> k...@vger.kernel.org; quint...@redhat.com; linux-ker...@vger.kernel.org;
> qemu-devel@nongnu.org; linux...@kvack.org; amit.s...@redhat.com;
> pbonz...@redhat.com; a...@linux-foundation.org;
> virtualizat...@lists.linux-foundation.org; r...@twiddle.net; r...@redhat.com
> Subject: Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration
> optimization
> 
> On Mon, Mar 07, 2016 at 06:49:19AM +, Li, Liang Z wrote:
> > > > No. And it's exactly what I mean. The ballooned memory is still
> > > > processed during live migration without skipping. The live
> > > > migration code is
> > > in migration/ram.c.
> > >
> > > So if guest acknowledged VIRTIO_BALLOON_F_MUST_TELL_HOST, we
> can
> > > teach qemu to skip these pages.
> > > Want to write a patch to do this?
> > >
> >
> > Yes, we really can teach qemu to skip these pages and it's not hard.
> > The problem is the poor performance, this PV solution
> 
> Balloon is always PV. And do not call patches solutions please.
> 

OK.
  
> > is aimed to make it more
> > efficient and reduce the performance impact on guest.
> 
> We need to get a bit beyond this.  You are making multiple changes, it seems
> to make sense to split it all up, and analyse each change separately.  If you
> don't this patchset will be stuck: as you have seen people aren't convinced it
> actually helps with real workloads.
> 
Really, changing the virtio spec must have good reasons.

> > > > >
> > > > > > > > The only advantage of ' inflating the balloon before live
> > > > > > > > migration' is simple,
> > > > > > > nothing more.
> > > > > > >
> > > > > > > That's a big advantage.  Another one is that it does
> > > > > > > something useful in real- world scenarios.
> > > > > > >
> > > > > >
> > > > > > I don't think the heave performance impaction is something
> > > > > > useful in real
> > > > > world scenarios.
> > > > > >
> > > > > > Liang
> > > > > > > Roman.
> > > > >
> > > > > So fix the performance then. You will have to try harder if you
> > > > > want to convince people that the performance is due to bad
> > > > > host/guest interface, and so we have to change *that*.
> > > > >
> > > >
> > > > Actually, the PV solution is irrelevant with the balloon
> > > > mechanism, I just use it to transfer information between host and
> guest.
> > > > I am not sure if I should implement a new virtio device, and I
> > > > want to get the answer from the community.
> > > > In this RFC patch, to make things simple, I choose to extend the
> > > > virtio-balloon and use the extended interface to transfer the
> > > > request and
> > > free_page_bimap content.
> > > >
> > > > I am not intend to change the current virtio-balloon implementation.
> > > >
> > > > Liang
> > >
> > > And the answer would depend on the answer to my question above.
> > > Does balloon need an interface passing page bitmaps around?
> >
> > Yes, I need a new interface.
> 
> Possibly, but you will need to justify this at some level if you care about
> upstreaming your patches.
> 
> > > Does this speed up any operations?
> >
> > No, a new interface will not speed up anything, but it is the easiest way to
> solve the compatibility issue.
> 
> A bunch of new code is often easier to write than to figure out the old one,
> but if we keep piling it up we'll end up with an unmaintainable mess. So we
> are rather careful about adding new interfaces, and we try to make them
> generic sometimes even at cost of slight inefficiencies.
> 
> > > OTOH what if you use the regular balloon interface with your patches?
> > >
> >
> > The regular balloon interfaces have their specific function and I can't use
> them in my patches.
> > If using these regular interface, I have to do a lot of changes to keep the
> compatibility.
> 
> Why can't you?
> 
> What exactly do we need to change?
> 
> If we put things in terms of the balloon, that supports adding and removing
> pages.
> 
> Using these terms, let's enumerate:
> - a new method (e.g. new virtqueue) that adds and immediately removes
> page in a balloon
>   clearly,

Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-07 Thread Michael S. Tsirkin
On Mon, Mar 07, 2016 at 06:49:19AM +, Li, Liang Z wrote:
> > > No. And it's exactly what I mean. The ballooned memory is still
> > > processed during live migration without skipping. The live migration code 
> > > is
> > in migration/ram.c.
> > 
> > So if guest acknowledged VIRTIO_BALLOON_F_MUST_TELL_HOST, we can
> > teach qemu to skip these pages.
> > Want to write a patch to do this?
> > 
> 
> Yes, we really can teach qemu to skip these pages and it's not hard.  
> The problem is the poor performance, this PV solution

Balloon is always PV. And do not call patches solutions please.

> is aimed to make it more
> efficient and reduce the performance impact on guest.

We need to get a bit beyond this.  You are making multiple
changes, it seems to make sense to split it all up, and analyse each
change separately.  If you don't this patchset will be stuck: as you
have seen people aren't convinced it actually helps with real workloads.

> > > >
> > > > > > > The only advantage of ' inflating the balloon before live
> > > > > > > migration' is simple,
> > > > > > nothing more.
> > > > > >
> > > > > > That's a big advantage.  Another one is that it does something
> > > > > > useful in real- world scenarios.
> > > > > >
> > > > >
> > > > > I don't think the heave performance impaction is something useful
> > > > > in real
> > > > world scenarios.
> > > > >
> > > > > Liang
> > > > > > Roman.
> > > >
> > > > So fix the performance then. You will have to try harder if you want
> > > > to convince people that the performance is due to bad host/guest
> > > > interface, and so we have to change *that*.
> > > >
> > >
> > > Actually, the PV solution is irrelevant with the balloon mechanism, I
> > > just use it to transfer information between host and guest.
> > > I am not sure if I should implement a new virtio device, and I want to
> > > get the answer from the community.
> > > In this RFC patch, to make things simple, I choose to extend the
> > > virtio-balloon and use the extended interface to transfer the request and
> > free_page_bimap content.
> > >
> > > I am not intend to change the current virtio-balloon implementation.
> > >
> > > Liang
> > 
> > And the answer would depend on the answer to my question above.
> > Does balloon need an interface passing page bitmaps around?
> 
> Yes, I need a new interface.

Possibly, but you will need to justify this at some level if you care
about upstreaming your patches.

> > Does this speed up any operations?
> 
> No, a new interface will not speed up anything, but it is the easiest way to 
> solve the compatibility issue.

A bunch of new code is often easier to write than to figure
out the old one, but if we keep piling it up we'll end up
with an unmaintainable mess. So we are rather careful
about adding new interfaces, and we try to make them generic
sometimes even at cost of slight inefficiencies.

> > OTOH what if you use the regular balloon interface with your patches?
> >
> 
> The regular balloon interfaces have their specific function and I can't use 
> them in my patches.
> If using these regular interface, I have to do a lot of changes to keep the 
> compatibility. 

Why can't you?

What exactly do we need to change?

If we put things in terms of the balloon, that supports
adding and removing pages.

Using these terms, let's enumerate:
- a new method (e.g. new virtqueue) that adds and immediately removes page in a
  balloon
    clearly, you can add then remove using the existing interfaces
    is a single command significantly faster than using existing two vqs?
- a new kind of request that says "add (and immediately remove?) as many pages
  as you can"
    sounds rather benign
- a new kind of message that adds multiple pages using a bitmap
  (instead of an address list)
    again, is this significantly faster?

Does not look like compatibility is an issue, to me.


At some level, your patches look like page hints.
If we have more patches in mind that use page hints,
then a new hint device might make sense.

However, people experimented with page hints in the past, so far this
always went nowhere.  E.g. I CC Rick who saw some problems when page
hints interact with huge pages. Rick, could you elaborate please?


-- 
MST
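
On the bitmap versus address list question above, a rough size comparison may
help frame the discussion. The sketch below only counts bytes on the wire; it
says nothing about the time spent in the guest or in virtio. It assumes 4 KiB
guest pages and 4-byte PFN entries, as read by the QEMU balloon code quoted
elsewhere in this thread. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define GUEST_PAGE_SHIFT 12  /* assumption: 4 KiB guest pages */

/* Bytes needed to report free_pages pages as a list of 32-bit PFNs. */
static uint64_t pfn_list_bytes(uint64_t free_pages)
{
    return free_pages * sizeof(uint32_t);
}

/* Bytes needed for a bitmap covering all guest RAM, one bit per page. */
static uint64_t bitmap_bytes(uint64_t ram_bytes)
{
    /* RAM size is expected to be a whole number of pages. */
    assert((ram_bytes & ((1ULL << GUEST_PAGE_SHIFT) - 1)) == 0);

    uint64_t pages = ram_bytes >> GUEST_PAGE_SHIFT;
    return (pages + 7) / 8;
}

/* Number of free pages at which the PFN list is as large as the bitmap. */
static uint64_t break_even_pages(uint64_t ram_bytes)
{
    return bitmap_bytes(ram_bytes) / sizeof(uint32_t);
}
```

Under these assumptions, an 8 GiB guest needs a 256 KiB bitmap regardless of
how much memory is free, while a PFN list grows to 8 MiB if every page is
free. The two are equal at 65536 free pages (256 MiB), so the list is only
smaller when the guest has little free memory.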



Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-06 Thread Li, Liang Z
> > No. And it's exactly what I mean. The ballooned memory is still
> > processed during live migration without skipping. The live migration code is
> in migration/ram.c.
> 
> So if guest acknowledged VIRTIO_BALLOON_F_MUST_TELL_HOST, we can
> teach qemu to skip these pages.
> Want to write a patch to do this?
> 

Yes, we really can teach qemu to skip these pages and it's not hard.  
The problem is the poor performance, this PV solution is aimed to make it more
efficient and reduce the performance impact on guest.

> > >
> > > > > > The only advantage of ' inflating the balloon before live
> > > > > > migration' is simple,
> > > > > nothing more.
> > > > >
> > > > > That's a big advantage.  Another one is that it does something
> > > > > useful in real- world scenarios.
> > > > >
> > > >
> > > > I don't think the heave performance impaction is something useful
> > > > in real
> > > world scenarios.
> > > >
> > > > Liang
> > > > > Roman.
> > >
> > > So fix the performance then. You will have to try harder if you want
> > > to convince people that the performance is due to bad host/guest
> > > interface, and so we have to change *that*.
> > >
> >
> > Actually, the PV solution is irrelevant with the balloon mechanism, I
> > just use it to transfer information between host and guest.
> > I am not sure if I should implement a new virtio device, and I want to
> > get the answer from the community.
> > In this RFC patch, to make things simple, I choose to extend the
> > virtio-balloon and use the extended interface to transfer the request and
> free_page_bimap content.
> >
> > I am not intend to change the current virtio-balloon implementation.
> >
> > Liang
> 
> And the answer would depend on the answer to my question above.
> Does balloon need an interface passing page bitmaps around?

Yes, I need a new interface.

> Does this speed up any operations?

No, a new interface will not speed up anything, but it is the easiest way to 
solve the compatibility issue.

> OTOH what if you use the regular balloon interface with your patches?
>

The regular balloon interfaces have their specific function and I can't use
them in my patches.
If I used these regular interfaces, I would have to make a lot of changes to
keep compatibility.

> 
> > > --
> > > MST



Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-06 Thread Li, Liang Z
> > On 04/03/2016 15:26, Li, Liang Z wrote:
> > >> >
> > >> > The memory usage will keep increasing due to ever growing caches,
> > >> > etc, so you'll be left with very little free memory fairly soon.
> > >> >
> > > I don't think so.
> > >
> >
> > Roman is right.  For example, here I am looking at a 64 GB (physical)
> > machine which was booted about 30 minutes ago, and which is running
> > disk-heavy workloads (installing VMs).
> >
> > Since I have started writing this email (2 minutes?), the amount of
> > free memory has already gone down from 37 GB to 33 GB.  I expect that
> > by the time I have finished running the workload, in two hours, it
> > will not have any free memory.
> 
> But what about a VM sitting idle, or that just has more RAM assigned to it
> than is currently using.
>  I've got a host here that's been up for 46 days and has been doing some
> heavy VM debugging a few days ago, but today:
> 
> # free -m
>               total        used        free      shared  buff/cache   available
> Mem:          96536        1146       44834         184       50555       94735
> 
> I very rarely use all it's RAM, so it's got a big chunk of free RAM, and yes 
> it's
> got a big chunk of cache as well.
> 
> Dave
> 
> >
> > Paolo

I am beginning to understand Roman's point. The PV solution can't handle
cache memory, while inflating the balloon can.
Inflating the balloon in order to skip the cache memory is not good for the
guest's performance.

How much free memory the guest has depends on the workload in the VM and on
how long the VM has been running before live migration. Even if memory usage
keeps increasing due to ever-growing caches, we don't know when the live
migration will happen, so assuming there are no or very few free pages in
the guest is not quite right.

The advantage of the PV solution is its smaller performance impact compared
with inflating the balloon.

Liang






Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-05 Thread Michael S. Tsirkin
On Fri, Mar 04, 2016 at 03:49:37PM +, Li, Liang Z wrote:
> > > > > > > Only detect the unmapped/zero mapped pages is not enough.
> > > > Consider
> > > > > > the
> > > > > > > situation like case 2, it can't achieve the same result.
> > > > > >
> > > > > > Your case 2 doesn't exist in the real world.  If people could
> > > > > > stop their main memory consumer in the guest prior to migration
> > > > > > they wouldn't need live migration at all.
> > > > >
> > > > > The case 2 is just a simplified scenario, not a real case.
> > > > > As long as the guest's memory usage does not keep increasing, or
> > > > > not always run out, it can be covered by the case 2.
> > > >
> > > > The memory usage will keep increasing due to ever growing caches,
> > > > etc, so you'll be left with very little free memory fairly soon.
> > > >
> > >
> > > I don't think so.
> > 
> > Here's my laptop:
> > KiB Mem : 16048560 total,  8574956 free,  3360532 used,  4113072 buff/cache
> > 
> > But here's a server:
> > KiB Mem:  32892768 total, 20092812 used, 12799956 free,   368704 buffers
> > 
> > What is the difference? A ton of tiny daemons not doing anything, staying
> > resident in memory.
> > 
> > > > > > I tend to think you can safely assume there's no free memory in
> > > > > > the guest, so there's little point optimizing for it.
> > > > >
> > > > > If this is true, we should not inflate the balloon either.
> > > >
> > > > We certainly should if there's "available" memory, i.e. not free but
> > > > cheap to reclaim.
> > > >
> > >
> > > What's your mean by "available" memory? if they are not free, I don't 
> > > think
> > it's cheap.
> > 
> > clean pages are cheap to drop as they don't have to be written.
> > whether they will be ever be used is another matter.
> > 
> > > > > > OTOH it makes perfect sense optimizing for the unmapped memory
> > > > > > that's made up, in particular, by the ballon, and consider
> > > > > > inflating the balloon right before migration unless you already
> > > > > > maintain it at the optimal size for other reasons (like e.g. a
> > > > > > global resource manager
> > > > optimizing the VM density).
> > > > > >
> > > > >
> > > > > Yes, I believe the current balloon works and it's simple. Do you
> > > > > take the
> > > > performance impact for consideration?
> > > > > For and 8G guest, it takes about 5s to  inflating the balloon. But
> > > > > it only takes 20ms to  traverse the free_list and construct the
> > > > > free pages
> > > > bitmap.
> > > >
> > > > I don't have any feeling of how important the difference is.  And if
> > > > the limiting factor for balloon inflation speed is the granularity
> > > > of communication it may be worth optimizing that, because quick
> > > > balloon reaction may be important in certain resource management
> > scenarios.
> > > >
> > > > > By inflating the balloon, all the guest's pages are still be
> > > > > processed (zero
> > > > page checking).
> > > >
> > > > Not sure what you mean.  If you describe the current state of
> > > > affairs that's exactly the suggested optimization point: skip unmapped
> > pages.
> > > >
> > >
> > > You'd better check the live migration code.
> > 
> > What's there to check in migration code?
> > Here's the extent of what balloon does on output:
> > 
> > 
> > while (iov_to_buf(elem->out_sg, elem->out_num, offset, &pfn, 4) == 4) {
> >     ram_addr_t pa;
> >     ram_addr_t addr;
> >     int p = virtio_ldl_p(vdev, &pfn);
> > 
> >     pa = (ram_addr_t) p << VIRTIO_BALLOON_PFN_SHIFT;
> >     offset += 4;
> > 
> >     /* FIXME: remove get_system_memory(), but how? */
> >     section = memory_region_find(get_system_memory(), pa, 1);
> >     if (!int128_nz(section.size) || !memory_region_is_ram(section.mr))
> >         continue;
> > 
> >     trace_virtio_balloon_handle_output(memory_region_name(section.mr),
> >                                        pa);
> >     /* Using memory_region_get_ram_ptr is bending the rules a bit, but
> >        should be OK because we only want a single page.  */
> >     addr = section.offset_within_region;
> >     balloon_page(memory_region_get_ram_ptr(section.mr) + addr,
> >                  !!(vq == s->dvq));
> >     memory_region_unref(section.mr);
> > }
> > 
> > so all that happens when we get a page is balloon_page.
> > and
> > 
> > static void balloon_page(void *addr, int deflate)
> > {
> > #if defined(__linux__)
> >     if (!qemu_balloon_is_inhibited() && (!kvm_enabled() ||
> >                                          kvm_has_sync_mmu())) {
> >         qemu_madvise(addr, TARGET_PAGE_SIZE,
> >                 deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED);
> >     }
> > #endif
> > }
> > 
> > 
> > Do you see anything that tracks pages to help migration skip the ballooned
> > memory? I don't.
> > 
> 
> No. And it's exactly what I mean. The ballooned memory 

Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-04 Thread Dr. David Alan Gilbert
* Paolo Bonzini (pbonz...@redhat.com) wrote:
> 
> 
> On 04/03/2016 15:26, Li, Liang Z wrote:
> >> > 
> >> > The memory usage will keep increasing due to ever growing caches, etc, so
> >> > you'll be left with very little free memory fairly soon.
> >> > 
> > I don't think so.
> > 
> 
> Roman is right.  For example, here I am looking at a 64 GB (physical)
> machine which was booted about 30 minutes ago, and which is running
> disk-heavy workloads (installing VMs).
> 
> Since I have started writing this email (2 minutes?), the amount of free
> memory has already gone down from 37 GB to 33 GB.  I expect that by the
> time I have finished running the workload, in two hours, it will not
> have any free memory.

But what about a VM sitting idle, or that just has more RAM assigned to it
than is currently using.
 I've got a host here that's been up for 46 days and has been doing some
heavy VM debugging a few days ago, but today:

# free -m
              total        used        free      shared  buff/cache   available
Mem:          96536        1146       44834         184       50555       94735

I very rarely use all its RAM, so it's got a big chunk of free RAM, and yes
it's got a big chunk of cache as well.

Dave

> 
> Paolo
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-04 Thread Paolo Bonzini


On 04/03/2016 15:26, Li, Liang Z wrote:
>> > 
>> > The memory usage will keep increasing due to ever growing caches, etc, so
>> > you'll be left with very little free memory fairly soon.
>> > 
> I don't think so.
> 

Roman is right.  For example, here I am looking at a 64 GB (physical)
machine which was booted about 30 minutes ago, and which is running
disk-heavy workloads (installing VMs).

Since I have started writing this email (2 minutes?), the amount of free
memory has already gone down from 37 GB to 33 GB.  I expect that by the
time I have finished running the workload, in two hours, it will not
have any free memory.

Paolo



Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-04 Thread Li, Liang Z
> > > > > > Only detect the unmapped/zero mapped pages is not enough.
> > > Consider
> > > > > the
> > > > > > situation like case 2, it can't achieve the same result.
> > > > >
> > > > > Your case 2 doesn't exist in the real world.  If people could
> > > > > stop their main memory consumer in the guest prior to migration
> > > > > they wouldn't need live migration at all.
> > > >
> > > > The case 2 is just a simplified scenario, not a real case.
> > > > As long as the guest's memory usage does not keep increasing, or
> > > > not always run out, it can be covered by the case 2.
> > >
> > > The memory usage will keep increasing due to ever growing caches,
> > > etc, so you'll be left with very little free memory fairly soon.
> > >
> >
> > I don't think so.
> 
> Here's my laptop:
> KiB Mem : 16048560 total,  8574956 free,  3360532 used,  4113072 buff/cache
> 
> But here's a server:
> KiB Mem:  32892768 total, 20092812 used, 12799956 free,   368704 buffers
> 
> What is the difference? A ton of tiny daemons not doing anything, staying
> resident in memory.
> 
> > > > > I tend to think you can safely assume there's no free memory in
> > > > > the guest, so there's little point optimizing for it.
> > > >
> > > > If this is true, we should not inflate the balloon either.
> > >
> > > We certainly should if there's "available" memory, i.e. not free but
> > > cheap to reclaim.
> > >
> >
> > What's your mean by "available" memory? if they are not free, I don't think
> it's cheap.
> 
> clean pages are cheap to drop as they don't have to be written.
> whether they will be ever be used is another matter.
> 
> > > > > OTOH it makes perfect sense optimizing for the unmapped memory
> > > > > that's made up, in particular, by the ballon, and consider
> > > > > inflating the balloon right before migration unless you already
> > > > > maintain it at the optimal size for other reasons (like e.g. a
> > > > > global resource manager
> > > optimizing the VM density).
> > > > >
> > > >
> > > > Yes, I believe the current balloon works and it's simple. Do you
> > > > take the
> > > performance impact for consideration?
> > > > For and 8G guest, it takes about 5s to  inflating the balloon. But
> > > > it only takes 20ms to  traverse the free_list and construct the
> > > > free pages
> > > bitmap.
> > >
> > > I don't have any feeling of how important the difference is.  And if
> > > the limiting factor for balloon inflation speed is the granularity
> > > of communication it may be worth optimizing that, because quick
> > > balloon reaction may be important in certain resource management
> scenarios.
> > >
> > > > By inflating the balloon, all the guest's pages are still be
> > > > processed (zero
> > > page checking).
> > >
> > > Not sure what you mean.  If you describe the current state of
> > > affairs that's exactly the suggested optimization point: skip unmapped
> pages.
> > >
> >
> > You'd better check the live migration code.
> 
> What's there to check in migration code?
> Here's the extent of what balloon does on output:
> 
> 
> while (iov_to_buf(elem->out_sg, elem->out_num, offset, &pfn, 4) == 4) {
>     ram_addr_t pa;
>     ram_addr_t addr;
>     int p = virtio_ldl_p(vdev, &pfn);
> 
>     pa = (ram_addr_t) p << VIRTIO_BALLOON_PFN_SHIFT;
>     offset += 4;
> 
>     /* FIXME: remove get_system_memory(), but how? */
>     section = memory_region_find(get_system_memory(), pa, 1);
>     if (!int128_nz(section.size) || !memory_region_is_ram(section.mr))
>         continue;
> 
>     trace_virtio_balloon_handle_output(memory_region_name(section.mr),
>                                        pa);
>     /* Using memory_region_get_ram_ptr is bending the rules a bit, but
>        should be OK because we only want a single page.  */
>     addr = section.offset_within_region;
>     balloon_page(memory_region_get_ram_ptr(section.mr) + addr,
>                  !!(vq == s->dvq));
>     memory_region_unref(section.mr);
> }
> 
> so all that happens when we get a page is balloon_page.
> and
> 
> static void balloon_page(void *addr, int deflate)
> {
> #if defined(__linux__)
>     if (!qemu_balloon_is_inhibited() && (!kvm_enabled() ||
>                                          kvm_has_sync_mmu())) {
>         qemu_madvise(addr, TARGET_PAGE_SIZE,
>                 deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED);
>     }
> #endif
> }
> 
> 
> Do you see anything that tracks pages to help migration skip the ballooned
> memory? I don't.
> 

No. And it's exactly what I mean. The ballooned memory is still processed during
live migration without skipping. The live migration code is in migration/ram.c.
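
To make the point concrete, here is a minimal standalone sketch (Linux only, not QEMU code) of what the madvise(MADV_DONTNEED) call in balloon_page() does to a page: the backing is dropped and the page reads back as zeros, but nothing records that fact for the migration code, which would still have to scan the page to discover it is zero. The function name dontneed_page_reads_zero() is made up for this illustration.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if a page that was written and then
 * released with MADV_DONTNEED reads back as all zeros, 0 if it does not,
 * and -1 if the setup (sysconf/mmap/madvise) fails.
 */
int dontneed_page_reads_zero(void)
{
    long page_size = sysconf(_SC_PAGESIZE);
    unsigned char *page;
    int all_zero = 1;
    long i;

    if (page_size <= 0)
        return -1;

    page = mmap(NULL, (size_t)page_size, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED)
        return -1;

    /* Simulate the guest having used the page. */
    memset(page, 0xAA, (size_t)page_size);
    assert(page[0] == 0xAA);

    /* Same advice the balloon gives on inflate: drop the backing. */
    if (madvise(page, (size_t)page_size, MADV_DONTNEED) != 0) {
        munmap(page, (size_t)page_size);
        return -1;
    }

    /* A later read faults in a fresh zero-filled page. */
    for (i = 0; i < page_size; i++) {
        if (page[i] != 0) {
            all_zero = 0;
            break;
        }
    }

    munmap(page, (size_t)page_size);
    return all_zero;
}
```

The loop at the end is the kind of per-page zero check that still has to run for every ballooned page when nothing tells the migration code to skip it.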

> 
> > > > The only advantage of ' inflating the balloon before live
> > > > migration' is simple,
> > > nothing more.
> > >
> > > That's a big advantage.  Another one is that it does something
> > > useful in real- world 

Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-04 Thread Li, Liang Z
> > Maybe I am not clear enough.
> >
> > I mean if we inflate balloon before live migration, for a 8GB guest, it 
> > takes
> about 5 Seconds for the inflating operation to finish.
> 
> And these 5 seconds are spent where?
> 

The time is spent allocating the pages and sending the allocated pages' PFNs
to QEMU through virtio.

> > For the PV solution, there is no need to inflate balloon before live
> > migration, the only cost is to traversing the free_list to  construct
> > the free pages bitmap, and it takes about 20ms for a 8GB idle guest( less if
> there is less free pages),  passing the free pages info to host will take 
> about
> extra 3ms.
> >
> >
> > Liang
> 
> So now let's please stop talking about solutions at a high level and discuss 
> the
> interface changes you make in detail.
> What makes it faster? Better host/guest interface? No need to go through
> buddy allocator within guest? Less interrupts? Something else?
> 

I assume you are familiar with the current virtio-balloon and how it works.
The new interface is very simple: a request is sent to the virtio-balloon
driver, the driver traverses '&zone->free_area[order].free_list[t]' to
construct a 'free_page_bitmap', and then the driver sends the content of
'free_page_bitmap' back to QEMU. That is all the new interface does; there
is no 'alloc_page'-related work, so it's faster.


Some code snippet:
--
+static void mark_free_pages_bitmap(struct zone *zone,
+       unsigned long *free_page_bitmap, unsigned long pfn_gap)
+{
+   unsigned long pfn, flags, i;
+   unsigned int order, t;
+   struct list_head *curr;
+
+   if (zone_is_empty(zone))
+       return;
+
+   spin_lock_irqsave(&zone->lock, flags);
+
+   for_each_migratetype_order(order, t) {
+       list_for_each(curr, &zone->free_area[order].free_list[t]) {
+
+           pfn = page_to_pfn(list_entry(curr, struct page, lru));
+           for (i = 0; i < (1UL << order); i++) {
+               if ((pfn + i) >= PFN_4G)
+                   set_bit_le(pfn + i - pfn_gap,
+                              free_page_bitmap);
+               else
+                   set_bit_le(pfn + i, free_page_bitmap);
+           }
+       }
+   }
+
+   spin_unlock_irqrestore(&zone->lock, flags);
+}

Sorry for my poor English and expression. If it is still unclear, you could
glance at the patch; it is about 400 lines in total.
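
The idea can also be tried outside the kernel. Below is a minimal userspace sketch, not the patch itself: free memory is modelled as a list of buddy-style blocks (2^order pages starting at a PFN), one bit is set per free page, and the host side then clears those bits from its migration bitmap. The names struct free_block, mark_free_blocks(), bitmap_test() and filter_free_pages() are invented for this illustration, the bitmap uses native word layout rather than set_bit_le(), and the PFN_4G/pfn_gap adjustment is left out. As discussed earlier in the thread, the filtering is only safe if dirty logging is enabled before the bitmap is requested.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical model of one free-list entry: 2^order pages from pfn. */
struct free_block {
    unsigned long pfn;
    unsigned int order;
};

static void bitmap_set(unsigned long *bitmap, unsigned long bit)
{
    bitmap[bit / BITS_PER_WORD] |= 1UL << (bit % BITS_PER_WORD);
}

int bitmap_test(const unsigned long *bitmap, unsigned long bit)
{
    return (int)((bitmap[bit / BITS_PER_WORD] >> (bit % BITS_PER_WORD)) & 1UL);
}

/*
 * Guest side: set one bit per free page. Pages at or beyond nr_pages are
 * ignored. Returns the number of pages newly marked free.
 */
unsigned long mark_free_blocks(const struct free_block *blocks, size_t nblocks,
                               unsigned long *bitmap, unsigned long nr_pages)
{
    unsigned long marked = 0;
    size_t b;

    assert(bitmap != NULL);

    for (b = 0; b < nblocks; b++) {
        unsigned long count = 1UL << blocks[b].order;
        unsigned long i;

        for (i = 0; i < count; i++) {
            unsigned long pfn = blocks[b].pfn + i;

            if (pfn >= nr_pages || bitmap_test(bitmap, pfn))
                continue;
            bitmap_set(bitmap, pfn);
            marked++;
        }
    }
    return marked;
}

/* Host side: drop free pages from the set of pages still to be sent. */
void filter_free_pages(unsigned long *migration_bitmap,
                       const unsigned long *free_bitmap, size_t nwords)
{
    size_t w;

    for (w = 0; w < nwords; w++)
        migration_bitmap[w] &= ~free_bitmap[w];
}
```

No page is allocated or touched while building the bitmap; the cost is one pass over the free lists plus one pass over the bitmap words.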
> 
> > > --
> > > MST



Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-04 Thread Michael S. Tsirkin
On Fri, Mar 04, 2016 at 02:26:49PM +, Li, Liang Z wrote:
> > Subject: Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration
> > optimization
> > 
> > On Fri, Mar 04, 2016 at 09:08:44AM +, Li, Liang Z wrote:
> > > > On Fri, Mar 04, 2016 at 01:52:53AM +, Li, Liang Z wrote:
> > > > > >   I wonder if it would be possible to avoid the kernel changes
> > > > > > by parsing /proc/self/pagemap - if that can be used to detect
> > > > > > unmapped/zero mapped pages in the guest ram, would it achieve
> > > > > > the
> > > > same result?
> > > > >
> > > > > Only detect the unmapped/zero mapped pages is not enough.
> > Consider
> > > > the
> > > > > situation like case 2, it can't achieve the same result.
> > > >
> > > > Your case 2 doesn't exist in the real world.  If people could stop
> > > > their main memory consumer in the guest prior to migration they
> > > > wouldn't need live migration at all.
> > >
> > > The case 2 is just a simplified scenario, not a real case.
> > > As long as the guest's memory usage does not keep increasing, or not
> > > always run out, it can be covered by the case 2.
> > 
> > The memory usage will keep increasing due to ever growing caches, etc, so
> > you'll be left with very little free memory fairly soon.
> > 
> 
> I don't think so.

Here's my laptop:
KiB Mem : 16048560 total,  8574956 free,  3360532 used,  4113072 buff/cache

But here's a server:
KiB Mem:  32892768 total, 20092812 used, 12799956 free,   368704 buffers

What is the difference? A ton of tiny daemons not doing anything,
staying resident in memory.

> > > > I tend to think you can safely assume there's no free memory in the
> > > > guest, so there's little point optimizing for it.
> > >
> > > If this is true, we should not inflate the balloon either.
> > 
> > We certainly should if there's "available" memory, i.e. not free but cheap 
> > to
> > reclaim.
> > 
> 
> What's your mean by "available" memory? if they are not free, I don't think 
> it's cheap.

clean pages are cheap to drop as they don't have to be written.
whether they will ever be used is another matter.

> > > > OTOH it makes perfect sense optimizing for the unmapped memory
> > > > that's made up, in particular, by the ballon, and consider inflating
> > > > the balloon right before migration unless you already maintain it at
> > > > the optimal size for other reasons (like e.g. a global resource manager
> > optimizing the VM density).
> > > >
> > >
> > > Yes, I believe the current balloon works and it's simple. Do you take the
> > performance impact for consideration?
> > > For and 8G guest, it takes about 5s to  inflating the balloon. But it
> > > only takes 20ms to  traverse the free_list and construct the free pages
> > bitmap.
> > 
> > I don't have any feeling of how important the difference is.  And if the
> > limiting factor for balloon inflation speed is the granularity of 
> > communication
> > it may be worth optimizing that, because quick balloon reaction may be
> > important in certain resource management scenarios.
> > 
> > > By inflating the balloon, all the guest's pages are still be processed 
> > > (zero
> > page checking).
> > 
> > Not sure what you mean.  If you describe the current state of affairs that's
> > exactly the suggested optimization point: skip unmapped pages.
> > 
> 
> You'd better check the live migration code.

What's there to check in migration code?
Here's the extent of what balloon does on output:


while (iov_to_buf(elem->out_sg, elem->out_num, offset, &pfn, 4) == 4) {
    ram_addr_t pa;
    ram_addr_t addr;
    int p = virtio_ldl_p(vdev, &pfn);

    pa = (ram_addr_t) p << VIRTIO_BALLOON_PFN_SHIFT;
    offset += 4;

    /* FIXME: remove get_system_memory(), but how? */
    section = memory_region_find(get_system_memory(), pa, 1);
    if (!int128_nz(section.size) || !memory_region_is_ram(section.mr))
        continue;

    trace_virtio_balloon_handle_output(memory_region_name(section.mr),
                                       pa);
    /* Using memory_region_get_ram_ptr is bending the rules a bit, but
       should be OK because we only want a single page.  */
    addr = section.offset_within_region;
    balloon_page(memory_region_get_ram_ptr(section.mr) + addr,
   

Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-04 Thread Li, Liang Z
> Subject: Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration
> optimization
> 
> On Fri, Mar 04, 2016 at 09:08:44AM +, Li, Liang Z wrote:
> > > On Fri, Mar 04, 2016 at 01:52:53AM +, Li, Liang Z wrote:
> > > > >   I wonder if it would be possible to avoid the kernel changes
> > > > > by parsing /proc/self/pagemap - if that can be used to detect
> > > > > unmapped/zero mapped pages in the guest ram, would it achieve
> > > > > the
> > > same result?
> > > >
> > > > Only detect the unmapped/zero mapped pages is not enough.
> Consider
> > > the
> > > > situation like case 2, it can't achieve the same result.
> > >
> > > Your case 2 doesn't exist in the real world.  If people could stop
> > > their main memory consumer in the guest prior to migration they
> > > wouldn't need live migration at all.
> >
> > The case 2 is just a simplified scenario, not a real case.
> > As long as the guest's memory usage does not keep increasing, or not
> > always run out, it can be covered by the case 2.
> 
> The memory usage will keep increasing due to ever growing caches, etc, so
> you'll be left with very little free memory fairly soon.
> 

I don't think so.

> > > I tend to think you can safely assume there's no free memory in the
> > > guest, so there's little point optimizing for it.
> >
> > If this is true, we should not inflate the balloon either.
> 
> We certainly should if there's "available" memory, i.e. not free but cheap to
> reclaim.
> 

What do you mean by "available" memory? If the pages are not free, I don't
think reclaiming them is cheap.

> > > OTOH it makes perfect sense optimizing for the unmapped memory
> > > that's made up, in particular, by the ballon, and consider inflating
> > > the balloon right before migration unless you already maintain it at
> > > the optimal size for other reasons (like e.g. a global resource manager
> optimizing the VM density).
> > >
> >
> > Yes, I believe the current balloon works and it's simple. Do you take the
> performance impact for consideration?
> > For and 8G guest, it takes about 5s to  inflating the balloon. But it
> > only takes 20ms to  traverse the free_list and construct the free pages
> bitmap.
> 
> I don't have any feeling of how important the difference is.  And if the
> limiting factor for balloon inflation speed is the granularity of 
> communication
> it may be worth optimizing that, because quick balloon reaction may be
> important in certain resource management scenarios.
> 
> > By inflating the balloon, all the guest's pages are still be processed (zero
> page checking).
> 
> Not sure what you mean.  If you describe the current state of affairs that's
> exactly the suggested optimization point: skip unmapped pages.
> 

You'd better check the live migration code.

> > The only advantage of ' inflating the balloon before live migration' is 
> > simple,
> nothing more.
> 
> That's a big advantage.  Another one is that it does something useful in real-
> world scenarios.
> 

I don't think a heavy performance impact is something useful in real-world
scenarios.

Liang
> Roman.



Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-04 Thread Michael S. Tsirkin
On Fri, Mar 04, 2016 at 10:11:00AM +, Li, Liang Z wrote:
> > On Fri, Mar 04, 2016 at 09:12:12AM +, Li, Liang Z wrote:
> > > > Although I wonder which is cheaper; that would be fairly expensive
> > > > for the guest wouldn't it? And you'd somehow have to kick the guest
> > > > before migration to do the ballooning - and how long would you wait for
> > it to finish?
> > >
> > > About 5 seconds for an 8G guest, balloon to 1G. Get the free pages
> > > bitmap take about 20ms for an 8G idle guest.
> > >
> > > Liang
> > 
> > Where is the time spent though? allocating within guest?
> > Or passing the info to host?
> > If the former, we can use existing inflate/deflate vqs:
> > Have guest put each free page on inflate vq, then on deflate vq.
> > 
> 
> Maybe I am not clear enough.
> 
> I mean if we inflate balloon before live migration, for a 8GB guest, it takes 
> about 5 Seconds for the inflating operation to finish.

And these 5 seconds are spent where?

> For the PV solution, there is no need to inflate balloon before live 
> migration, the only cost is to traversing the free_list to
>  construct the free pages bitmap, and it takes about 20ms for a 8GB idle 
> guest( less if there is less free pages),
>  passing the free pages info to host will take about extra 3ms.
> 
> 
> Liang

So now let's please stop talking about solutions at a high level and
discuss the interface changes you make in detail.
What makes it faster? Better host/guest interface? No need to go through
buddy allocator within guest? Fewer interrupts? Something else?


> > --
> > MST



Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-04 Thread Roman Kagan
On Fri, Mar 04, 2016 at 09:08:44AM +, Li, Liang Z wrote:
> > On Fri, Mar 04, 2016 at 01:52:53AM +, Li, Liang Z wrote:
> > > >   I wonder if it would be possible to avoid the kernel changes by
> > > > parsing /proc/self/pagemap - if that can be used to detect
> > > > unmapped/zero mapped pages in the guest ram, would it achieve the
> > same result?
> > >
> > > Only detect the unmapped/zero mapped pages is not enough. Consider
> > the
> > > situation like case 2, it can't achieve the same result.
> > 
> > Your case 2 doesn't exist in the real world.  If people could stop their 
> > main
> > memory consumer in the guest prior to migration they wouldn't need live
> > migration at all.
> 
> The case 2 is just a simplified scenario, not a real case.
> As long as the guest's memory usage does not keep increasing, or not always 
> run out,
> it can be covered by the case 2.

The memory usage will keep increasing due to ever growing caches, etc,
so you'll be left with very little free memory fairly soon.

> > I tend to think you can safely assume there's no free memory in the guest, 
> > so
> > there's little point optimizing for it.
> 
> If this is true, we should not inflate the balloon either.

We certainly should if there's "available" memory, i.e. not free but
cheap to reclaim.

> > OTOH it makes perfect sense optimizing for the unmapped memory that's
> > made up, in particular, by the ballon, and consider inflating the balloon 
> > right
> > before migration unless you already maintain it at the optimal size for 
> > other
> > reasons (like e.g. a global resource manager optimizing the VM density).
> > 
> 
> Yes, I believe the current balloon works and it's simple. Do you take the 
> performance impact for consideration?
> For and 8G guest, it takes about 5s to  inflating the balloon. But it only 
> takes 20ms to  traverse the free_list and
> construct the free pages bitmap.

I don't have any feeling of how important the difference is.  And if the
limiting factor for balloon inflation speed is the granularity of
communication it may be worth optimizing that, because quick balloon
reaction may be important in certain resource management scenarios.

> By inflating the balloon, all the guest's pages are still be processed (zero 
> page checking).

Not sure what you mean.  If you describe the current state of affairs
that's exactly the suggested optimization point: skip unmapped pages.

> The only advantage of ' inflating the balloon before live migration' is 
> simple, nothing more.

That's a big advantage.  Another one is that it does something useful in
real-world scenarios.

Roman.



Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-04 Thread Li, Liang Z
> On Fri, Mar 04, 2016 at 09:12:12AM +, Li, Liang Z wrote:
> > > Although I wonder which is cheaper; that would be fairly expensive
> > > for the guest wouldn't it? And you'd somehow have to kick the guest
> > > before migration to do the ballooning - and how long would you wait for
> it to finish?
> >
> > About 5 seconds for an 8G guest, balloon to 1G. Get the free pages
> > bitmap take about 20ms for an 8G idle guest.
> >
> > Liang
> 
> Where is the time spent though? allocating within guest?
> Or passing the info to host?
> If the former, we can use existing inflate/deflate vqs:
> Have guest put each free page on inflate vq, then on deflate vq.
> 

Maybe I am not clear enough.

I mean that if we inflate the balloon before live migration, for an 8GB guest
it takes about 5 seconds for the inflating operation to finish.

For the PV solution, there is no need to inflate the balloon before live
migration. The only cost is traversing the free_list to construct the free
pages bitmap, which takes about 20ms for an 8GB idle guest (less if there are
fewer free pages); passing the free pages info to the host takes about 3ms
extra.


Liang
> --
> MST



Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-04 Thread Michael S. Tsirkin
On Fri, Mar 04, 2016 at 09:12:12AM +, Li, Liang Z wrote:
> > Although I wonder which is cheaper; that would be fairly expensive for the
> > guest wouldn't it? And you'd somehow have to kick the guest before
> > migration to do the ballooning - and how long would you wait for it to 
> > finish?
> 
> About 5 seconds for an 8G guest, balloon to 1G. Get the free pages bitmap 
> take about 20ms
> for an 8G idle guest.
> 
> Liang

Where is the time spent though? allocating within guest?
Or passing the info to host?
If the former, we can use existing inflate/deflate vqs:
Have guest put each free page on inflate vq, then on deflate vq.

-- 
MST



Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-04 Thread Li, Liang Z
> > > * Liang Li (liang.z...@intel.com) wrote:
> > > > The current QEMU live migration implementation marks all the
> > > > guest's RAM pages as dirty in the RAM bulk stage; all these
> > > > pages will be processed, which takes quite a lot of CPU cycles.
> > > >
> > > > From the guest's point of view, the content of free pages does
> > > > not matter. We can make use of this fact and skip processing the
> > > > free pages in the RAM bulk stage. This saves a lot of CPU cycles,
> > > > reduces the network traffic significantly, and speeds up the
> > > > live migration process.
> > > >
> > > > This patch set is the QEMU side implementation.
> > > >
> > > > The virtio-balloon is extended so that QEMU can get the free pages
> > > > information from the guest through virtio.
> > > >
> > > > After getting the free pages information (a bitmap), QEMU can use
> > > > it to filter out the guest's free pages in the RAM bulk stage.
> > > > This makes the live migration process much more efficient.
> > >
> > > Hi,
> > >   An interesting solution; I know a few different people have been
> > > looking at how to speed up ballooned VM migration.
> > >
> >
> > Ooh, different solutions for the same purpose, and both based on the
> balloon.
> 
> We were also tying to address similar problem, without actually needing to
> modify the guest driver. Please find patch details under mail with subject.
> migration: skip sending ram pages released by virtio-balloon driver
> 
> Thanks,
> - Jitendra
> 

Great! Thanks for your information.

Liang
> >
> > >   I wonder if it would be possible to avoid the kernel changes by
> > > parsing /proc/self/pagemap - if that can be used to detect
> > > unmapped/zero mapped pages in the guest ram, would it achieve the
> same result?
> > >
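
For reference, the /proc/self/pagemap check suggested above can be sketched as follows. This is a standalone Linux example, not QEMU code: it reads the 64-bit pagemap entry for one virtual page and reports whether the page is backed (bit 63, present, or bit 62, swapped). The function name page_is_backed() is made up for this illustration. Note that the check only separates never-touched or unmapped pages from backed ones. A page the guest has freed but that is still backed looks the same as a page in use, and recognising pages mapped to the shared zero page would need the PFN field, which is assumed here to be hidden from unprivileged readers.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if the page containing addr is present in
 * RAM or swapped out, 0 if it has no backing, and -1 if the pagemap entry
 * cannot be read (for example when /proc is not available).
 */
int page_is_backed(const void *addr)
{
    long page_size = sysconf(_SC_PAGESIZE);
    uint64_t entry = 0;
    off_t offset;
    ssize_t n;
    int fd;

    assert(addr != NULL);

    if (page_size <= 0)
        return -1;

    fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0)
        return -1;

    /* One 8-byte entry per virtual page, indexed by virtual page number. */
    offset = (off_t)((uintptr_t)addr / (uintptr_t)page_size) *
             (off_t)sizeof(entry);
    n = pread(fd, &entry, sizeof(entry), offset);
    close(fd);

    if (n != (ssize_t)sizeof(entry))
        return -1;

    /* Bit 63: page present in RAM. Bit 62: page swapped out. */
    return (int)(((entry >> 63) & 1) | ((entry >> 62) & 1));
}
```

A freshly mmap()ed anonymous page reports 0 until it is written, and 1 afterwards.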



Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-04 Thread Roman Kagan
On Fri, Mar 04, 2016 at 09:08:20AM +, Dr. David Alan Gilbert wrote:
> * Roman Kagan (rka...@virtuozzo.com) wrote:
> > On Fri, Mar 04, 2016 at 08:23:09AM +, Li, Liang Z wrote:
> > > The unmapped/zero mapped pages can be detected by parsing 
> > > /proc/self/pagemap,
> > > but the free pages can't be detected by this. Imaging an application 
> > > allocates a large amount
> > > of memory , after using, it frees the memory, then live migration 
> > > happens. All these free pages
> > > will be process and sent to the destination, it's not optimal.
> > 
> > First, the likelihood of such a situation is marginal, there's no point
> > optimizing for it specifically.
> > 
> > And second, even if that happens, you inflate the balloon right before
> > the migration and the free memory will get umapped very quickly, so this
> > case is covered nicely by the same technique that works for more
> > realistic cases, too.
> 
> Although I wonder which is cheaper; that would be fairly expensive for
> the guest wouldn't it?

For the guest -- generally it wouldn't if you have a good estimate of
available memory (i.e. the amount you can balloon out without forcing
the guest to swap).

And yes you need certain cost estimates for choosing the best migration
strategy: e.g. if your network bandwidth is unlimited you may be better
off transferring the zeros to the destination rather than optimizing
them away.
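
As a rough illustration of that kind of estimate, here is a deliberately simplistic sketch. It only compares the time to push the skippable bytes over the link with the time spent preparing to skip them; it ignores CPU cost, zero-page compaction on the wire and pages re-dirtied meanwhile. The function names and the 7 GiB figure used in the note below are made up; the 5 s and roughly 23 ms preparation times are the numbers quoted elsewhere in this thread.

```c
#include <assert.h>

/* Seconds needed to push `bytes` over a link of `bits_per_sec`. */
double transfer_seconds(double bytes, double bits_per_sec)
{
    assert(bits_per_sec > 0.0);
    return bytes * 8.0 / bits_per_sec;
}

/*
 * Hypothetical decision helper: returns 1 when spending prep_seconds
 * (ballooning, building a bitmap, ...) to avoid sending skippable_bytes
 * is expected to shorten the migration, 0 otherwise.
 */
int skipping_pays_off(double skippable_bytes, double bits_per_sec,
                      double prep_seconds)
{
    return transfer_seconds(skippable_bytes, bits_per_sec) > prep_seconds;
}
```

With 7 GiB of skippable memory, a 10 Gbit/s link needs about 6 s to send it, so even a 5 s preparation step pays off under this model; on a 100 Gbit/s link the same data takes about 0.6 s, so only a preparation step in the tens of milliseconds is worth it.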

> And you'd somehow have to kick the guest
> before migration to do the ballooning - and how long would you wait
> for it to finish?

It's a matter for fine-tuning with all the inputs at hand, like network
bandwidth, costs of delaying the migration, etc.  And you don't need to
wait for it to finish, i.e. reach the balloon size target: you can start
the migration as soon as it's good enough (for whatever definition of
"enough" is found appropriate by that fine-tuning).

Roman.



[Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-04 Thread Jitendra Kolhe
> >
> > * Liang Li (liang.z...@intel.com) wrote:
> > > The current QEMU live migration implementation marks all the
> > > guest's RAM pages as dirty in the RAM bulk stage; all these pages
> > > will be processed, which takes quite a lot of CPU cycles.
> > >
> > > From the guest's point of view, the content of free pages does not
> > > matter. We can make use of this fact and skip processing the free
> > > pages in the RAM bulk stage. This saves a lot of CPU cycles, reduces
> > > the network traffic significantly, and speeds up the live migration
> > > process.
> > >
> > > This patch set is the QEMU side implementation.
> > >
> > > The virtio-balloon is extended so that QEMU can get the free pages
> > > information from the guest through virtio.
> > >
> > > After getting the free pages information (a bitmap), QEMU can use it
> > > to filter out the guest's free pages in the RAM bulk stage. This makes
> > > the live migration process much more efficient.
> >
> > Hi,
> >   An interesting solution; I know a few different people have been looking 
> > at
> > how to speed up ballooned VM migration.
> >
>
> Ooh, different solutions for the same purpose, and both based on the balloon.

We were also trying to address a similar problem, without actually needing to
modify the guest driver. Please find the patch details in the mail with subject:
migration: skip sending ram pages released by virtio-balloon driver

Thanks,
- Jitendra

>
> >   I wonder if it would be possible to avoid the kernel changes by parsing
> > /proc/self/pagemap - if that can be used to detect unmapped/zero mapped
> > pages in the guest ram, would it achieve the same result?
> >
>
> Detecting only the unmapped/zero mapped pages is not enough. Consider a
> situation like case 2: it can't achieve the same result.
>
> > > This RFC version doesn't take post-copy and RDMA into
> > > consideration; maybe both of them can benefit from this PV solution
> > > with some extra modifications.
> >
> > For postcopy to be safe, you would still need to send a message to the
> > destination telling it that there were zero pages, otherwise the destination
> > can't tell if it's supposed to request the page from the source or treat the
> > page as zero.
> >
> > Dave
>
> I will consider this later, thanks, Dave.
>
> Liang
>
> >
> > >
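> > > > Performance data
> > > > ================
> > > >
> > > > Test environment:
> > > >
> > > > CPU: Intel (R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz   Host RAM: 64GB
> > > > Host Linux Kernel:  4.2.0                         Host OS: CentOS 7.1
> > > > Guest Linux Kernel:  4.5.rc6                      Guest OS: CentOS 6.6
> > > > Network:  X540-AT2 with 10 Gigabit connection     Guest RAM: 8GB
> > > >
> > > > Case 1: Idle guest just boots:
> > > > --------------------------------------------
> > > >                     | original  |    pv
> > > > --------------------------------------------
> > > > total time(ms)      |    1894   |     421
> > > > --------------------------------------------
> > > > transferred ram(KB) |   398017  |  353242
> > > > --------------------------------------------
> > > >
> > > >
> > > > Case 2: The guest has run a memory-consuming workload, which is
> > > > terminated just before live migration.
> > > > --------------------------------------------
> > > >                     | original  |    pv
> > > > --------------------------------------------
> > > > total time(ms)      |    7436   |     552
> > > > --------------------------------------------
> > > > transferred ram(KB) |  8146291  |  361375
> > > > --------------------------------------------
> > > >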
>
>



Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-04 Thread Li, Liang Z
> * Roman Kagan (rka...@virtuozzo.com) wrote:
> > On Fri, Mar 04, 2016 at 08:23:09AM +, Li, Liang Z wrote:
> > > > On Thu, Mar 03, 2016 at 05:46:15PM +, Dr. David Alan Gilbert wrote:
> > > > > * Liang Li (liang.z...@intel.com) wrote:
> > > > > > The current QEMU live migration implementation marks all
> > > > > > the guest's RAM pages as dirty in the RAM bulk stage; all
> > > > > > these pages will be processed, which takes quite a lot of
> > > > > > CPU cycles.
> > > > > >
> > > > > > From the guest's point of view, the content of free pages
> > > > > > does not matter. We can make use of this fact and skip
> > > > > > processing the free pages in the RAM bulk stage. This saves
> > > > > > a lot of CPU cycles, reduces the network traffic
> > > > > > significantly, and speeds up the live migration process.
> > > > > >
> > > > > > This patch set is the QEMU side implementation.
> > > > > >
> > > > > > The virtio-balloon is extended so that QEMU can get the free
> > > > > > pages information from the guest through virtio.
> > > > > >
> > > > > > After getting the free pages information (a bitmap), QEMU can
> > > > > > use it to filter out the guest's free pages in the RAM bulk
> > > > > > stage. This makes the live migration process much more efficient.
> > > > >
> > > > > Hi,
> > > > >   An interesting solution; I know a few different people have
> > > > > been looking at how to speed up ballooned VM migration.
> > > > >
> > > > >   I wonder if it would be possible to avoid the kernel changes
> > > > > by parsing /proc/self/pagemap - if that can be used to detect
> > > > > unmapped/zero mapped pages in the guest ram, would it achieve
> > > > > the
> > > > same result?
> > > >
> > > > Yes I was about to suggest the same thing: it's simple and makes
> > > > use of the existing infrastructure.  And you wouldn't need to care
> > > > if the pages were unmapped by ballooning or anything else
> > > > (alternative balloon implementations, not yet touched by the
> > > > guest, etc.).  Besides, you wouldn't need to synchronize with the guest.
> > > >
> > > > Roman.
> > >
> > > The unmapped/zero mapped pages can be detected by parsing
> > > /proc/self/pagemap, but the free pages can't be detected by this.
> > > Imaging an application allocates a large amount of memory , after
> > > using, it frees the memory, then live migration happens. All these free
> pages will be process and sent to the destination, it's not optimal.
> >
> > First, the likelihood of such a situation is marginal, there's no
> > point optimizing for it specifically.
> >
> > And second, even if that happens, you inflate the balloon right before
> > the migration and the free memory will get umapped very quickly, so
> > this case is covered nicely by the same technique that works for more
> > realistic cases, too.
> 
> Although I wonder which is cheaper; that would be fairly expensive for the
> guest wouldn't it? And you'd somehow have to kick the guest before
> migration to do the ballooning - and how long would you wait for it to finish?

About 5 seconds for an 8G guest ballooned down to 1G. Getting the free pages
bitmap takes about 20ms for an 8G idle guest.

Liang

> 
> Dave
> 
> >
> > Roman.
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-04 Thread Li, Liang Z
> On Fri, Mar 04, 2016 at 01:52:53AM +, Li, Liang Z wrote:
> > >   I wonder if it would be possible to avoid the kernel changes by
> > > parsing /proc/self/pagemap - if that can be used to detect
> > > unmapped/zero mapped pages in the guest ram, would it achieve the
> same result?
> >
> > Only detect the unmapped/zero mapped pages is not enough. Consider
> the
> > situation like case 2, it can't achieve the same result.
> 
> Your case 2 doesn't exist in the real world.  If people could stop their main
> memory consumer in the guest prior to migration they wouldn't need live
> migration at all.

Case 2 is just a simplified scenario, not a real case.
As long as the guest's memory usage does not keep increasing, or the guest
does not always run out of memory, the situation is covered by case 2.

> I tend to think you can safely assume there's no free memory in the guest, so
> there's little point optimizing for it.

If this is true, we should not inflate the balloon either.

> OTOH it makes perfect sense optimizing for the unmapped memory that's
> made up, in particular, by the ballon, and consider inflating the balloon 
> right
> before migration unless you already maintain it at the optimal size for other
> reasons (like e.g. a global resource manager optimizing the VM density).
> 

Yes, I believe the current balloon works and it's simple. Have you taken the
performance impact into consideration?
For an 8G guest, it takes about 5s to inflate the balloon, but only about
20ms to traverse the free_list and construct the free pages bitmap. In this
period, the guest is very busy.

Even with the balloon inflated, all of the guest's pages are still processed
(zero page checking).

The only advantage of 'inflating the balloon before live migration' is that
it is simple, nothing more.

Liang

> Roman.



Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-04 Thread Dr. David Alan Gilbert
* Roman Kagan (rka...@virtuozzo.com) wrote:
> On Fri, Mar 04, 2016 at 08:23:09AM +, Li, Liang Z wrote:
> > > On Thu, Mar 03, 2016 at 05:46:15PM +, Dr. David Alan Gilbert wrote:
> > > > * Liang Li (liang.z...@intel.com) wrote:
> > > > > The current QEMU live migration implementation mark the all the
> > > > > guest's RAM pages as dirtied in the ram bulk stage, all these pages
> > > > > will be processed and that takes quit a lot of CPU cycles.
> > > > >
> > > > > From guest's point of view, it doesn't care about the content in
> > > > > free pages. We can make use of this fact and skip processing the
> > > > > free pages in the ram bulk stage, it can save a lot CPU cycles and
> > > > > reduce the network traffic significantly while speed up the live
> > > > > migration process obviously.
> > > > >
> > > > > This patch set is the QEMU side implementation.
> > > > >
> > > > > The virtio-balloon is extended so that QEMU can get the free pages
> > > > > information from the guest through virtio.
> > > > >
> > > > > After getting the free pages information (a bitmap), QEMU can use it
> > > > > to filter out the guest's free pages in the ram bulk stage. This
> > > > > make the live migration process much more efficient.
> > > >
> > > > Hi,
> > > >   An interesting solution; I know a few different people have been
> > > > looking at how to speed up ballooned VM migration.
> > > >
> > > >   I wonder if it would be possible to avoid the kernel changes by
> > > > parsing /proc/self/pagemap - if that can be used to detect
> > > > unmapped/zero mapped pages in the guest ram, would it achieve the
> > > same result?
> > > 
> > > Yes I was about to suggest the same thing: it's simple and makes use of 
> > > the
> > > existing infrastructure.  And you wouldn't need to care if the pages were
> > > unmapped by ballooning or anything else (alternative balloon
> > > implementations, not yet touched by the guest, etc.).  Besides, you 
> > > wouldn't
> > > need to synchronize with the guest.
> > > 
> > > Roman.
> > 
> > The unmapped/zero mapped pages can be detected by parsing 
> > /proc/self/pagemap,
> > but the free pages can't be detected by this. Imaging an application 
> > allocates a large amount
> > of memory , after using, it frees the memory, then live migration happens. 
> > All these free pages
> > will be process and sent to the destination, it's not optimal.
> 
> First, the likelihood of such a situation is marginal, there's no point
> optimizing for it specifically.
> 
> And second, even if that happens, you inflate the balloon right before
> the migration and the free memory will get umapped very quickly, so this
> case is covered nicely by the same technique that works for more
> realistic cases, too.

Although I wonder which is cheaper; that would be fairly expensive for
the guest wouldn't it? And you'd somehow have to kick the guest
before migration to do the ballooning - and how long would you wait
for it to finish?

Dave

> 
> Roman.
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-04 Thread Roman Kagan
On Fri, Mar 04, 2016 at 08:23:09AM +, Li, Liang Z wrote:
> > On Thu, Mar 03, 2016 at 05:46:15PM +, Dr. David Alan Gilbert wrote:
> > > * Liang Li (liang.z...@intel.com) wrote:
> > > > The current QEMU live migration implementation mark the all the
> > > > guest's RAM pages as dirtied in the ram bulk stage, all these pages
> > > > will be processed and that takes quit a lot of CPU cycles.
> > > >
> > > > From guest's point of view, it doesn't care about the content in
> > > > free pages. We can make use of this fact and skip processing the
> > > > free pages in the ram bulk stage, it can save a lot CPU cycles and
> > > > reduce the network traffic significantly while speed up the live
> > > > migration process obviously.
> > > >
> > > > This patch set is the QEMU side implementation.
> > > >
> > > > The virtio-balloon is extended so that QEMU can get the free pages
> > > > information from the guest through virtio.
> > > >
> > > > After getting the free pages information (a bitmap), QEMU can use it
> > > > to filter out the guest's free pages in the ram bulk stage. This
> > > > make the live migration process much more efficient.
> > >
> > > Hi,
> > >   An interesting solution; I know a few different people have been
> > > looking at how to speed up ballooned VM migration.
> > >
> > >   I wonder if it would be possible to avoid the kernel changes by
> > > parsing /proc/self/pagemap - if that can be used to detect
> > > unmapped/zero mapped pages in the guest ram, would it achieve the
> > same result?
> > 
> > Yes I was about to suggest the same thing: it's simple and makes use of the
> > existing infrastructure.  And you wouldn't need to care if the pages were
> > unmapped by ballooning or anything else (alternative balloon
> > implementations, not yet touched by the guest, etc.).  Besides, you wouldn't
> > need to synchronize with the guest.
> > 
> > Roman.
> 
> The unmapped/zero mapped pages can be detected by parsing /proc/self/pagemap,
> but the free pages can't be detected by this. Imaging an application 
> allocates a large amount
> of memory , after using, it frees the memory, then live migration happens. 
> All these free pages
> will be process and sent to the destination, it's not optimal.

First, the likelihood of such a situation is marginal, there's no point
optimizing for it specifically.

And second, even if that happens, you inflate the balloon right before
the migration and the free memory will get unmapped very quickly, so this
case is covered nicely by the same technique that works for more
realistic cases, too.

Roman.



Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-04 Thread Li, Liang Z
> On Thu, Mar 03, 2016 at 05:46:15PM +, Dr. David Alan Gilbert wrote:
> > * Liang Li (liang.z...@intel.com) wrote:
> > > The current QEMU live migration implementation mark the all the
> > > guest's RAM pages as dirtied in the ram bulk stage, all these pages
> > > will be processed and that takes quit a lot of CPU cycles.
> > >
> > > From guest's point of view, it doesn't care about the content in
> > > free pages. We can make use of this fact and skip processing the
> > > free pages in the ram bulk stage, it can save a lot CPU cycles and
> > > reduce the network traffic significantly while speed up the live
> > > migration process obviously.
> > >
> > > This patch set is the QEMU side implementation.
> > >
> > > The virtio-balloon is extended so that QEMU can get the free pages
> > > information from the guest through virtio.
> > >
> > > After getting the free pages information (a bitmap), QEMU can use it
> > > to filter out the guest's free pages in the ram bulk stage. This
> > > make the live migration process much more efficient.
> >
> > Hi,
> >   An interesting solution; I know a few different people have been
> > looking at how to speed up ballooned VM migration.
> >
> >   I wonder if it would be possible to avoid the kernel changes by
> > parsing /proc/self/pagemap - if that can be used to detect
> > unmapped/zero mapped pages in the guest ram, would it achieve the
> same result?
> 
> Yes I was about to suggest the same thing: it's simple and makes use of the
> existing infrastructure.  And you wouldn't need to care if the pages were
> unmapped by ballooning or anything else (alternative balloon
> implementations, not yet touched by the guest, etc.).  Besides, you wouldn't
> need to synchronize with the guest.
> 
> Roman.

The unmapped/zero mapped pages can be detected by parsing /proc/self/pagemap,
but the free pages can't be detected this way. Imagine an application that
allocates a large amount of memory, frees it after use, and then live
migration happens. All these free pages will be processed and sent to the
destination, which is not optimal.

Liang




Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-04 Thread Roman Kagan
On Fri, Mar 04, 2016 at 01:52:53AM +, Li, Liang Z wrote:
> >   I wonder if it would be possible to avoid the kernel changes by parsing
> > /proc/self/pagemap - if that can be used to detect unmapped/zero mapped
> > pages in the guest ram, would it achieve the same result?
> 
> Only detect the unmapped/zero mapped pages is not enough. Consider the 
> situation like case 2, it can't achieve the same result.

Your case 2 doesn't exist in the real world.  If people could stop their
main memory consumer in the guest prior to migration they wouldn't need
live migration at all.

I tend to think you can safely assume there's no free memory in the
guest, so there's little point optimizing for it.

OTOH it makes perfect sense optimizing for the unmapped memory that's
made up, in particular, by the balloon, and consider inflating the
balloon right before migration unless you already maintain it at the
optimal size for other reasons (like e.g. a global resource manager
optimizing the VM density).

Roman.



Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-03 Thread Roman Kagan
On Thu, Mar 03, 2016 at 05:46:15PM +, Dr. David Alan Gilbert wrote:
> * Liang Li (liang.z...@intel.com) wrote:
> > The current QEMU live migration implementation mark the all the
> > guest's RAM pages as dirtied in the ram bulk stage, all these pages
> > will be processed and that takes quit a lot of CPU cycles.
> > 
> > From guest's point of view, it doesn't care about the content in free
> > pages. We can make use of this fact and skip processing the free
> > pages in the ram bulk stage, it can save a lot CPU cycles and reduce
> > the network traffic significantly while speed up the live migration
> > process obviously.
> > 
> > This patch set is the QEMU side implementation.
> > 
> > The virtio-balloon is extended so that QEMU can get the free pages
> > information from the guest through virtio.
> > 
> > After getting the free pages information (a bitmap), QEMU can use it
> > to filter out the guest's free pages in the ram bulk stage. This make
> > the live migration process much more efficient.
> 
> Hi,
>   An interesting solution; I know a few different people have been looking
> at how to speed up ballooned VM migration.
> 
>   I wonder if it would be possible to avoid the kernel changes by
> parsing /proc/self/pagemap - if that can be used to detect unmapped/zero
> mapped pages in the guest ram, would it achieve the same result?

Yes I was about to suggest the same thing: it's simple and makes use of
the existing infrastructure.  And you wouldn't need to care if the pages
were unmapped by ballooning or anything else (alternative balloon
implementations, not yet touched by the guest, etc.).  Besides, you
wouldn't need to synchronize with the guest.

Roman.



Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-03 Thread Li, Liang Z
> Subject: Re: [RFC qemu 0/4] A PV solution for live migration optimization
> 
> * Liang Li (liang.z...@intel.com) wrote:
> > The current QEMU live migration implementation mark the all the
> > guest's RAM pages as dirtied in the ram bulk stage, all these pages
> > will be processed and that takes quit a lot of CPU cycles.
> >
> > From guest's point of view, it doesn't care about the content in free
> > pages. We can make use of this fact and skip processing the free pages
> > in the ram bulk stage, it can save a lot CPU cycles and reduce the
> > network traffic significantly while speed up the live migration
> > process obviously.
> >
> > This patch set is the QEMU side implementation.
> >
> > The virtio-balloon is extended so that QEMU can get the free pages
> > information from the guest through virtio.
> >
> > After getting the free pages information (a bitmap), QEMU can use it
> > to filter out the guest's free pages in the ram bulk stage. This make
> > the live migration process much more efficient.
> 
> Hi,
>   An interesting solution; I know a few different people have been looking at
> how to speed up ballooned VM migration.
> 

Ooh, different solutions for the same purpose, and both based on the balloon.

>   I wonder if it would be possible to avoid the kernel changes by parsing
> /proc/self/pagemap - if that can be used to detect unmapped/zero mapped
> pages in the guest ram, would it achieve the same result?
> 

Detecting only the unmapped/zero mapped pages is not enough. Consider a
situation like case 2: it can't achieve the same result.

> > This RFC version doesn't take the post-copy and RDMA into
> > consideration, maybe both of them can benefit from this PV solution by
> > with some extra modifications.
> 
> For postcopy to be safe, you would still need to send a message to the
> destination telling it that there were zero pages, otherwise the destination
> can't tell if it's supposed to request the page from the source or treat the
> page as zero.
> 
> Dave

I will consider this later, thanks, Dave.

Liang

> 
> >
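> > Performance data
> > ================
> >
> > Test environment:
> >
> > CPU: Intel (R) Xeon(R) CPU ES-2699 v3 @ 2.30GHz
> > Host RAM: 64GB
> > Host Linux Kernel:  4.2.0           Host OS: CentOS 7.1
> > Guest Linux Kernel:  4.5.rc6        Guest OS: CentOS 6.6
> > Network:  X540-AT2 with 10 Gigabit connection
> > Guest RAM: 8GB
> >
> > Case 1: Idle guest just boots:
> > ============================================
> >                     |  original  |    pv
> > --------------------------------------------
> > total time(ms)      |    1894    |    421
> > --------------------------------------------
> > transferred ram(KB) |   398017   |  353242
> > ============================================
> >
> > Case 2: The guest has ever run some memory consuming workload, the
> > workload is terminated just before live migration.
> > ============================================
> >                     |  original  |    pv
> > --------------------------------------------
> > total time(ms)      |    7436    |    552
> > --------------------------------------------
> > transferred ram(KB) |  8146291   |  361375
> > ============================================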
> >



Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-03 Thread Li, Liang Z
> On Thu, Mar 03, 2016 at 06:44:24PM +0800, Liang Li wrote:
> > The current QEMU live migration implementation mark the all the
> > guest's RAM pages as dirtied in the ram bulk stage, all these pages
> > will be processed and that takes quit a lot of CPU cycles.
> >
> > From guest's point of view, it doesn't care about the content in free
> > pages. We can make use of this fact and skip processing the free pages
> > in the ram bulk stage, it can save a lot CPU cycles and reduce the
> > network traffic significantly while speed up the live migration
> > process obviously.
> >
> > This patch set is the QEMU side implementation.
> >
> > The virtio-balloon is extended so that QEMU can get the free pages
> > information from the guest through virtio.
> >
> > After getting the free pages information (a bitmap), QEMU can use it
> > to filter out the guest's free pages in the ram bulk stage. This make
> > the live migration process much more efficient.
> >
> > This RFC version doesn't take the post-copy and RDMA into
> > consideration, maybe both of them can benefit from this PV solution by
> > with some extra modifications.
> >
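> > Performance data
> > ================
> >
> > Test environment:
> >
> > CPU: Intel (R) Xeon(R) CPU ES-2699 v3 @ 2.30GHz
> > Host RAM: 64GB
> > Host Linux Kernel:  4.2.0           Host OS: CentOS 7.1
> > Guest Linux Kernel:  4.5.rc6        Guest OS: CentOS 6.6
> > Network:  X540-AT2 with 10 Gigabit connection
> > Guest RAM: 8GB
> >
> > Case 1: Idle guest just boots:
> > ============================================
> >                     |  original  |    pv
> > --------------------------------------------
> > total time(ms)      |    1894    |    421
> > --------------------------------------------
> > transferred ram(KB) |   398017   |  353242
> > ============================================
> >
> > Case 2: The guest has ever run some memory consuming workload, the
> > workload is terminated just before live migration.
> > ============================================
> >                     |  original  |    pv
> > --------------------------------------------
> > total time(ms)      |    7436    |    552
> > --------------------------------------------
> > transferred ram(KB) |  8146291   |  361375
> > ============================================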
> 
> Both cases look very artificial to me.  Normally you migrate VMs which have
> started long ago and which can't have their services terminated before the
> migration, so I wouldn't expect any useful amount of free pages obtained
> this way.
> 

Yes, it's somewhat artificial, just to emphasize the effect. And I think these
two cases are very easy to reproduce. Using a real workload and doing the test
in a production environment would be more convincing.

We can predict that as long as the guest doesn't use up all of its memory,
this solution may still take effect and shorten the total live migration
time. (Of course, we should consider the time cost of the virtio
communication.)

> OTOH I don't see why you can't just inflate the balloon before the migration,
> and really optimize the amount of transferred data this way?
> With the recently proposed VIRTIO_BALLOON_S_AVAIL you can have a fairly
> good estimate of the optimal balloon size, and with the recently merged
> balloon deflation on OOM it's a safe thing to do without exposing the guest
> workloads to OOM risks.
> 
> Roman.

Thanks for the information. The size of the free page bitmap is not very
large: for a guest with 8GB RAM, only 256KB of extra memory is required.
Compared to this solution, inflating the balloon is more expensive. If the
balloon size is not optimal and the guest requests more memory during live
migration, the guest's performance will be impacted.
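
The 256KB figure follows from one bit per guest page. A minimal worked
sketch, assuming 4KB pages (the page size is not stated in this thread) and
a hypothetical helper name:

```python
GIB = 1 << 30


def free_page_bitmap_bytes(guest_ram_bytes, page_size=4096):
    """Bytes needed for a bitmap holding one bit per guest page.

    page_size=4096 is an assumption; the thread does not state it.
    """
    pages = guest_ram_bytes // page_size
    return (pages + 7) // 8  # round up to whole bytes


size = free_page_bitmap_bytes(8 * GIB)
print(size // 1024, "KB")
```

With larger pages the bitmap shrinks proportionally, so the cost stays small
relative to guest RAM.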

Liang




Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-03 Thread Dr. David Alan Gilbert
* Liang Li (liang.z...@intel.com) wrote:
> The current QEMU live migration implementation mark the all the
> guest's RAM pages as dirtied in the ram bulk stage, all these pages
> will be processed and that takes quit a lot of CPU cycles.
> 
> From guest's point of view, it doesn't care about the content in free
> pages. We can make use of this fact and skip processing the free
> pages in the ram bulk stage, it can save a lot CPU cycles and reduce
> the network traffic significantly while speed up the live migration
> process obviously.
> 
> This patch set is the QEMU side implementation.
> 
> The virtio-balloon is extended so that QEMU can get the free pages
> information from the guest through virtio.
> 
> After getting the free pages information (a bitmap), QEMU can use it
> to filter out the guest's free pages in the ram bulk stage. This make
> the live migration process much more efficient.

Hi,
  An interesting solution; I know a few different people have been looking
at how to speed up ballooned VM migration.

  I wonder if it would be possible to avoid the kernel changes by
parsing /proc/self/pagemap - if that can be used to detect unmapped/zero
mapped pages in the guest ram, would it achieve the same result?
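
As an illustration of the pagemap idea, here is a minimal sketch (not QEMU
code; all function names are hypothetical). It relies only on the documented
layout of /proc/self/pagemap: one 64-bit entry per virtual page, bit 63 set
when the page is present in RAM and bit 62 set when it is swapped. It finds
pages with no backing at all; telling zero-mapped pages apart would need more
than these two bits.

```python
import struct

PAGEMAP_ENTRY_BYTES = 8
PRESENT_BIT = 1 << 63  # page is present in RAM
SWAPPED_BIT = 1 << 62  # page is swapped out


def is_backed(entry):
    """True if the entry describes a page that is present or swapped."""
    return bool(entry & (PRESENT_BIT | SWAPPED_BIT))


def decode_entries(raw):
    """Split raw pagemap bytes into native-endian 64-bit entries."""
    count = len(raw) // PAGEMAP_ENTRY_BYTES
    return list(struct.unpack("={}Q".format(count),
                              raw[:count * PAGEMAP_ENTRY_BYTES]))


def unbacked_pages(raw):
    """Indexes (relative to the first entry) of pages with no backing."""
    return [i for i, e in enumerate(decode_entries(raw)) if not is_backed(e)]


def read_pagemap(start_vaddr, npages, page_size=4096):
    """Read npages entries starting at start_vaddr (Linux only)."""
    with open("/proc/self/pagemap", "rb", buffering=0) as f:
        f.seek((start_vaddr // page_size) * PAGEMAP_ENTRY_BYTES)
        return f.read(npages * PAGEMAP_ENTRY_BYTES)
```

Note that this view is from the host side only: a page the guest has freed
but previously touched still shows up as backed.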

> This RFC version doesn't take the post-copy and RDMA into
> consideration, maybe both of them can benefit from this PV solution
> by with some extra modifications.

For postcopy to be safe, you would still need to send a message to the
destination telling it that there were zero pages, otherwise the destination
can't tell if it's supposed to request the page from the source or
treat the page as zero.
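
A toy model of that decision on the destination side, purely as a sketch: the
class, method, and enum names are hypothetical and this is not QEMU's
postcopy protocol.

```python
from enum import Enum


class Action(Enum):
    USE_LOCAL = "use local copy"
    FILL_ZERO = "treat as zero page"
    REQUEST_FROM_SOURCE = "request from source"


class Destination:
    """Tracks what the destination knows about each page index."""

    def __init__(self):
        self.received = set()    # pages whose data has arrived
        self.known_zero = set()  # pages the source declared zero/skipped

    def on_page_data(self, page):
        self.received.add(page)

    def on_zero_notice(self, pages):
        self.known_zero.update(pages)

    def on_fault(self, page):
        """Decide what to do when the guest touches a page."""
        if page in self.received:
            return Action.USE_LOCAL
        if page in self.known_zero:
            return Action.FILL_ZERO
        # Without a zero notice, the only safe choice is to ask the source.
        return Action.REQUEST_FROM_SOURCE
```

If the source silently skips free pages, they fall into the last branch and
the destination requests them anyway, which is why the explicit message is
needed.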

Dave

> 
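> Performance data
> ================
> 
> Test environment:
> 
> CPU: Intel (R) Xeon(R) CPU ES-2699 v3 @ 2.30GHz
> Host RAM: 64GB
> Host Linux Kernel:  4.2.0           Host OS: CentOS 7.1
> Guest Linux Kernel:  4.5.rc6        Guest OS: CentOS 6.6
> Network:  X540-AT2 with 10 Gigabit connection
> Guest RAM: 8GB
> 
> Case 1: Idle guest just boots:
> ============================================
>                     |  original  |    pv
> --------------------------------------------
> total time(ms)      |    1894    |    421
> --------------------------------------------
> transferred ram(KB) |   398017   |  353242
> ============================================
> 
> Case 2: The guest has ever run some memory consuming workload, the
> workload is terminated just before live migration.
> ============================================
>                     |  original  |    pv
> --------------------------------------------
> total time(ms)      |    7436    |    552
> --------------------------------------------
> transferred ram(KB) |  8146291   |  361375
> ============================================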
> 
> Liang Li (4):
>   pc: Add code to get the lowmem form PCMachineState
>   virtio-balloon: Add a new feature to balloon device
>   migration: not set migration bitmap in setup stage
>   migration: filter out guest's free pages in ram bulk stage
> 
>  balloon.c                                       | 30 -
>  hw/i386/pc.c                                    |  5 ++
>  hw/i386/pc_piix.c                               |  1 +
>  hw/i386/pc_q35.c                                |  1 +
>  hw/virtio/virtio-balloon.c                      | 81 -
>  include/hw/i386/pc.h                            |  3 +-
>  include/hw/virtio/virtio-balloon.h              | 17 +-
>  include/standard-headers/linux/virtio_balloon.h |  1 +
>  include/sysemu/balloon.h                        | 10 ++-
>  migration/ram.c                                 | 64 +++
>  10 files changed, 195 insertions(+), 18 deletions(-)
> 
> -- 
> 1.8.3.1
> 
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-03 Thread Roman Kagan
On Thu, Mar 03, 2016 at 06:44:24PM +0800, Liang Li wrote:
> The current QEMU live migration implementation mark the all the
> guest's RAM pages as dirtied in the ram bulk stage, all these pages
> will be processed and that takes quit a lot of CPU cycles.
> 
> From guest's point of view, it doesn't care about the content in free
> pages. We can make use of this fact and skip processing the free
> pages in the ram bulk stage, it can save a lot CPU cycles and reduce
> the network traffic significantly while speed up the live migration
> process obviously.
> 
> This patch set is the QEMU side implementation.
> 
> The virtio-balloon is extended so that QEMU can get the free pages
> information from the guest through virtio.
> 
> After getting the free pages information (a bitmap), QEMU can use it
> to filter out the guest's free pages in the ram bulk stage. This make
> the live migration process much more efficient.
> 
> This RFC version doesn't take the post-copy and RDMA into
> consideration, maybe both of them can benefit from this PV solution
> by with some extra modifications.
> 
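> Performance data
> ================
> 
> Test environment:
> 
> CPU: Intel (R) Xeon(R) CPU ES-2699 v3 @ 2.30GHz
> Host RAM: 64GB
> Host Linux Kernel:  4.2.0           Host OS: CentOS 7.1
> Guest Linux Kernel:  4.5.rc6        Guest OS: CentOS 6.6
> Network:  X540-AT2 with 10 Gigabit connection
> Guest RAM: 8GB
> 
> Case 1: Idle guest just boots:
> ============================================
>                     |  original  |    pv
> --------------------------------------------
> total time(ms)      |    1894    |    421
> --------------------------------------------
> transferred ram(KB) |   398017   |  353242
> ============================================
> 
> Case 2: The guest has ever run some memory consuming workload, the
> workload is terminated just before live migration.
> ============================================
>                     |  original  |    pv
> --------------------------------------------
> total time(ms)      |    7436    |    552
> --------------------------------------------
> transferred ram(KB) |  8146291   |  361375
> ============================================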

Both cases look very artificial to me.  Normally you migrate VMs which
have started long ago and which can't have their services terminated
before the migration, so I wouldn't expect any useful amount of free
pages obtained this way.

OTOH I don't see why you can't just inflate the balloon before the
migration, and really optimize the amount of transferred data this way?
With the recently proposed VIRTIO_BALLOON_S_AVAIL you can have a fairly
good estimate of the optimal balloon size, and with the recently merged
balloon deflation on OOM it's a safe thing to do without exposing the
guest workloads to OOM risks.

Roman.



[Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-03 Thread Liang Li
The current QEMU live migration implementation marks all of the
guest's RAM pages as dirty in the RAM bulk stage. All of these pages
are then processed, which takes quite a lot of CPU cycles.

From the guest's point of view, the content of free pages does not
matter. We can make use of this fact and skip processing the free
pages in the RAM bulk stage. This saves a lot of CPU cycles, reduces
network traffic significantly, and noticeably speeds up the live
migration process.

This patch set is the QEMU side implementation.

The virtio-balloon device is extended so that QEMU can get the free
page information from the guest through virtio.

After getting the free page information (a bitmap), QEMU can use it
to filter out the guest's free pages in the RAM bulk stage. This makes
the live migration process much more efficient.
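
To illustrate the filtering step, here is a minimal sketch that is not the
patch's code. It models bitmaps as Python integers (bit i stands for page i);
the function names are hypothetical, and the dirtied_bitmap argument is an
assumption standing in for pages written after the free page snapshot was
taken, which must still be sent.

```python
def bulk_stage_bitmap(num_pages, free_bitmap, dirtied_bitmap=0):
    """Pages to send in the bulk stage.

    Start from "all pages dirty", clear the pages the guest reported as
    free, then set again any page dirtied since the snapshot.
    """
    all_pages = (1 << num_pages) - 1
    return (all_pages & ~free_bitmap) | (dirtied_bitmap & all_pages)


def pages_to_send(bitmap):
    """List the page indexes whose bit is set."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# 8 pages, pages 2..5 reported free, page 2 written again afterwards.
bitmap = bulk_stage_bitmap(8, free_bitmap=0b00111100,
                           dirtied_bitmap=0b00000100)
print(pages_to_send(bitmap))
```

Page 2 stays in the send set even though it was reported free, because it was
modified after the snapshot.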

This RFC version doesn't take post-copy and RDMA into consideration;
maybe both of them can benefit from this PV solution with some extra
modifications.

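Performance data
================

Test environment:

CPU: Intel (R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
Host RAM: 64GB
Host Linux Kernel:  4.2.0           Host OS: CentOS 7.1
Guest Linux Kernel:  4.5.rc6        Guest OS: CentOS 6.6
Network:  X540-AT2 with 10 Gigabit connection
Guest RAM: 8GB

Case 1: Idle guest just boots:
============================================
                    |  original  |    pv
--------------------------------------------
total time(ms)      |    1894    |    421
--------------------------------------------
transferred ram(KB) |   398017   |  353242
============================================

Case 2: The guest has previously run a memory-consuming workload, which
is terminated just before live migration.
============================================
                    |  original  |    pv
--------------------------------------------
total time(ms)      |    7436    |    552
--------------------------------------------
transferred ram(KB) |  8146291   |  361375
============================================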
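
For a quick read of these numbers, the ratios between the original and PV
runs can be computed directly from the tables above. This is only arithmetic
on the reported figures; the dictionary layout and function name are
illustrative.

```python
# (original, pv) pairs copied from the two tables above.
RESULTS = {
    "case 1": {"time_ms": (1894, 421), "ram_kb": (398017, 353242)},
    "case 2": {"time_ms": (7436, 552), "ram_kb": (8146291, 361375)},
}


def improvement(original, pv):
    """How many times smaller the PV figure is than the original."""
    return original / pv


for case, metrics in RESULTS.items():
    for name, (original, pv) in metrics.items():
        print(case, name, round(improvement(original, pv), 1))
```

In case 1 the gain is mostly in total time, while in case 2 both time and
transferred RAM drop by more than an order of magnitude.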


Liang Li (4):
  pc: Add code to get the lowmem form PCMachineState
  virtio-balloon: Add a new feature to balloon device
  migration: not set migration bitmap in setup stage
  migration: filter out guest's free pages in ram bulk stage

 balloon.c                                       | 30 -
 hw/i386/pc.c                                    |  5 ++
 hw/i386/pc_piix.c                               |  1 +
 hw/i386/pc_q35.c                                |  1 +
 hw/virtio/virtio-balloon.c                      | 81 -
 include/hw/i386/pc.h                            |  3 +-
 include/hw/virtio/virtio-balloon.h              | 17 +-
 include/standard-headers/linux/virtio_balloon.h |  1 +
 include/sysemu/balloon.h                        | 10 ++-
 migration/ram.c                                 | 64 +++
 10 files changed, 195 insertions(+), 18 deletions(-)

-- 
1.8.3.1