Re: [Qemu-devel] [PATCH v6 kernel 3/5] virtio-balloon: speed up inflate/deflate process

2017-02-03 Thread Li, Liang Z
> 
> 
> > +static void free_extended_page_bitmap(struct virtio_balloon *vb) {
> > +   int i, bmap_count = vb->nr_page_bmap;
> > +
> > +   for (i = 1; i < bmap_count; i++) {
> > +   kfree(vb->page_bitmap[i]);
> > +   vb->page_bitmap[i] = NULL;
> > +   vb->nr_page_bmap--;
> > +   }
> > +}
> > +
> > +static void kfree_page_bitmap(struct virtio_balloon *vb) {
> > +   int i;
> > +
> > +   for (i = 0; i < vb->nr_page_bmap; i++)
> > +   kfree(vb->page_bitmap[i]);
> > +}
> 
> It might be worth commenting that pair of functions to make it clear why
> they are so different; I guess the kfree_page_bitmap is used just before you
> free the structure above it so you don't need to keep the count/pointers
> updated?
> 

Yes. I will add some comments for that. Thanks!
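Roughly something like this, i.e. the same code as the hunk above with the comments added (the note about the remove path reflects Dave's guess and still needs to be confirmed against the actual call sites):

/*
 * Free only the extended bitmaps and keep page_bitmap[0], so the device
 * can keep working with the default-sized bitmap; nr_page_bmap and the
 * pointers are kept consistent because the structure stays in use.
 */
static void free_extended_page_bitmap(struct virtio_balloon *vb)
{
	int i, bmap_count = vb->nr_page_bmap;

	for (i = 1; i < bmap_count; i++) {
		kfree(vb->page_bitmap[i]);
		vb->page_bitmap[i] = NULL;
		vb->nr_page_bmap--;
	}
}

/*
 * Called just before the struct virtio_balloon itself is freed, so there
 * is no need to keep the count/pointers updated.
 */
static void kfree_page_bitmap(struct virtio_balloon *vb)
{
	int i;

	for (i = 0; i < vb->nr_page_bmap; i++)
		kfree(vb->page_bitmap[i]);
}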

Liang
 
> Dave
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [PATCH v6 kernel 3/5] virtio-balloon: speed up inflate/deflate process

2017-01-18 Thread Li, Liang Z
> On Wed, Jan 18, 2017 at 04:56:58AM +0000, Li, Liang Z wrote:
> > > > -   virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
> > > > -   virtqueue_kick(vq);
> > > > +static void do_set_resp_bitmap(struct virtio_balloon *vb,
> > > > +   unsigned long base_pfn, int pages)
> > > >
> > > > -   /* When host has read buffer, this completes via balloon_ack */
> > > > -   wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> > > > +{
> > > > +   __le64 *range = vb->resp_data + vb->resp_pos;
> > > >
> > > > +   if (pages > (1 << VIRTIO_BALLOON_NR_PFN_BITS)) {
> > > > +   /* when the length field can't contain pages, set it to 
> > > > 0 to
> > >
> > > /*
> > >  * Multi-line
> > >  * comments
> > >  * should look like this.
> > >  */
> > >
> > > Also, pls start sentences with an upper-case letter.
> > >
> >
> > Sorry for that.
> >
> > > > +* indicate the actual length is in the next __le64;
> > > > +*/
> > >
> > > This is part of the interface so should be documented as such.
> > >
> > > > +   *range = cpu_to_le64((base_pfn <<
> > > > +   VIRTIO_BALLOON_NR_PFN_BITS) | 0);
> > > > +   *(range + 1) = cpu_to_le64(pages);
> > > > +   vb->resp_pos += 2;
> > >
> > > Pls use structs for this kind of stuff.
> >
> > I am not sure if you mean to use
> >
> > struct  range {
> > __le64 pfn: 52;
> > __le64 nr_page: 12
> > }
> > Instead of the shift operation?
> 
> Not just that. You want to add a pages field as well.
> 

pages field? Could you give more hints?

> Generally describe the format in the header in some way so host and guest
> can easily stay in sync.

'VIRTIO_BALLOON_NR_PFN_BITS' serves this purpose, and it will be passed to
the related function in page_alloc.c as a parameter.
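For example, the header could describe the record format along these lines (just a sketch; the struct name and wording are illustrative, the bit layout is the one used in patch 3):

/*
 * Page range record used with VIRTIO_BALLOON_F_PAGE_RANGE (sketch).
 *
 *   bits 63..12  base PFN of the range
 *   bits 11..0   number of pages in the range
 *
 * A length of 0 means the actual length did not fit into
 * VIRTIO_BALLOON_NR_PFN_BITS bits and is stored as a plain 64-bit
 * value in the next slot.
 */
struct virtio_balloon_page_range {
	__le64 pfn_len;	/* (pfn << VIRTIO_BALLOON_NR_PFN_BITS) | len */
	/* optionally followed by a __le64 extended length when len == 0 */
};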

Thanks!
Liang
> All the pointer math and void * means we get zero type safety and I'm not
> happy about it.
> 
> It's not good that virtio format seeps out to page_alloc anyway.
> If unavoidable it is not a good idea to try to hide this fact, people will 
> assume
> they can change the format at will.
> 
> --
> MST



Re: [Qemu-devel] [virtio-dev] Re: [PATCH v6 kernel 2/5] virtio-balloon: define new feature bit and head struct

2017-01-18 Thread Li, Liang Z
> > > > > > Signed-off-by: Liang Li 
> > > > > > Cc: Michael S. Tsirkin 
> > > > > > Cc: Paolo Bonzini 
> > > > > > Cc: Cornelia Huck 
> > > > > > Cc: Amit Shah 
> > > > > > Cc: Dave Hansen 
> > > > > > Cc: Andrea Arcangeli 
> > > > > > Cc: David Hildenbrand 
> > > > > > ---
> > > > > >  include/uapi/linux/virtio_balloon.h | 12 
> > > > > >  1 file changed, 12 insertions(+)
> > > > > >
> > > > > > diff --git a/include/uapi/linux/virtio_balloon.h
> > > > > > b/include/uapi/linux/virtio_balloon.h
> > > > > > index 343d7dd..2f850bf 100644
> > > > > > --- a/include/uapi/linux/virtio_balloon.h
> > > > > > +++ b/include/uapi/linux/virtio_balloon.h
> > > > > > @@ -34,10 +34,14 @@
> > > > > >  #define VIRTIO_BALLOON_F_MUST_TELL_HOST0 /* Tell
> before
> > > > > reclaiming pages */
> > > > > >  #define VIRTIO_BALLOON_F_STATS_VQ  1 /* Memory Stats
> virtqueue
> > > > > */
> > > > > >  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM2 /* Deflate
> > > balloon
> > > > > on OOM */
> > > > > > +#define VIRTIO_BALLOON_F_PAGE_RANGE3 /* Send page info
> > > > > with ranges */
> > > > > >
> > > > > >  /* Size of a PFN in the balloon interface. */  #define
> > > > > > VIRTIO_BALLOON_PFN_SHIFT 12
> > > > > >
> > > > > > +/* Bits width for the length of the pfn range */
> > > > >
> > > > > What does this mean? Couldn't figure it out.
> > > > >
> > > > > > +#define VIRTIO_BALLOON_NR_PFN_BITS 12
> > > > > > +
> > > > > >  struct virtio_balloon_config {
> > > > > > /* Number of pages host wants Guest to give up. */
> > > > > > __u32 num_pages;
> > > > > > @@ -82,4 +86,12 @@ struct virtio_balloon_stat {
> > > > > > __virtio64 val;
> > > > > >  } __attribute__((packed));
> > > > > >
> > > > > > +/* Response header structure */ struct
> > > > > > +virtio_balloon_resp_hdr {
> > > > > > +   __le64 cmd : 8; /* Distinguish different requests type */
> > > > > > +   __le64 flag: 8; /* Mark status for a specific request type */
> > > > > > +   __le64 id : 16; /* Distinguish requests of a specific type */
> > > > > > +   __le64 data_len: 32; /* Length of the following data, in
> > > > > > +bytes */
> > > > >
> > > > > This use of __le64 makes no sense.  Just use u8/le16/le32 pls.
> > > > >
> > > >
> > > > Got it, will change in the next version.
> > > >
> > > > And could help take a look at other parts? as well as the QEMU part.
> > > >
> > > > Thanks!
> > > > Liang
> > >
> > > Yes but first I would like to understand how come no fields in this
> > > new structure come up if I search for them in the following patch. I
> > > don't see why
> >
> > It's not true, all of the field will be referenced in the following
> > patches except the 'reserved' filed.
> 
> But none of these are used in the following patch 3.

Yes. Only 'data_len' is used in patch 3, but for extensibility at least 'cmd'
is needed too; I should set it to some default value in patch 3 even though it
is not yet useful there. 'flag' and 'id' are for patch 4. I just want to reuse
'struct virtio_balloon_resp_hdr' and keep the code simpler.
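For reference, the reworked header I have in mind after dropping the __le64 bit-fields would look something like this (a sketch following MST's u8/le16/le32 comment; the exact field order is an assumption):

/* Response header structure (sketch of the next version). */
struct virtio_balloon_resp_hdr {
	__u8	cmd;		/* Distinguish different request types */
	__u8	flag;		/* Mark status for a specific request type */
	__le16	id;		/* Distinguish requests of a specific type */
	__le32	data_len;	/* Length of the following data, in bytes */
} __attribute__((packed));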

Thanks!
Liang




Re: [Qemu-devel] [PATCH v6 kernel 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2017-01-18 Thread Li, Liang Z
> Am 21.12.2016 um 07:52 schrieb Liang Li:
> > This patch set contains two parts of changes to the virtio-balloon.
> >
> > One is the change for speeding up the inflating & deflating process,
> > the main idea of this optimization is to use {pfn|length} to present
> > the page information instead of the PFNs, to reduce the overhead of
> > virtio data transmission, address translation and madvise(). This can
> > help to improve the performance by about 85%.
> >
> > Another change is for speeding up live migration. By skipping process
> > guest's unused pages in the first round of data copy, to reduce
> > needless data processing, this can help to save quite a lot of CPU
> > cycles and network bandwidth. We put guest's unused page information
> > in a {pfn|length} array and send it to host with the virt queue of
> > virtio-balloon. For an idle guest with 8GB RAM, this can help to
> > shorten the total live migration time from 2Sec to about 500ms in
> > 10Gbps network environment. For an guest with quite a lot of page
> > cache and with little unused pages, it's possible to let the guest
> > drop it's page cache before live migration, this case can benefit from this
> new feature too.
> 
> I agree that both changes make sense (although the second change just
> smells very racy, as you also pointed out in the patch description), however I
> am not sure if virtio-balloon is really the right place for the latter change.
> 
> virtio-balloon is all about ballooning, nothing else. What you're doing is 
> using
> it as a way to communicate balloon-unrelated data from/to the hypervisor.
> Yes, it is also about guest memory, but completely unrelated to the purpose
> of the balloon device.
> 
> Maybe using virtio-balloon for this purpose is okay - I have mixed feelings
> (especially as I can't tell where else this could go). I would like to get a 
> second
> opinion on this.
> 

We discussed the implementation for quite a long time; making use of the
current virtio-balloon seems better than the other solutions and was
recommended by Michael.

Thanks!
Liang
> --
> 
> David



Re: [Qemu-devel] [PATCH v6 kernel 3/5] virtio-balloon: speed up inflate/deflate process

2017-01-17 Thread Li, Liang Z
> > -   virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
> > -   virtqueue_kick(vq);
> > +static void do_set_resp_bitmap(struct virtio_balloon *vb,
> > +   unsigned long base_pfn, int pages)
> >
> > -   /* When host has read buffer, this completes via balloon_ack */
> > -   wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> > +{
> > +   __le64 *range = vb->resp_data + vb->resp_pos;
> >
> > +   if (pages > (1 << VIRTIO_BALLOON_NR_PFN_BITS)) {
> > +   /* when the length field can't contain pages, set it to 0 to
> 
> /*
>  * Multi-line
>  * comments
>  * should look like this.
>  */
> 
> Also, pls start sentences with an upper-case letter.
> 

Sorry for that.

> > +* indicate the actual length is in the next __le64;
> > +*/
> 
> This is part of the interface so should be documented as such.
> 
> > +   *range = cpu_to_le64((base_pfn <<
> > +   VIRTIO_BALLOON_NR_PFN_BITS) | 0);
> > +   *(range + 1) = cpu_to_le64(pages);
> > +   vb->resp_pos += 2;
> 
> Pls use structs for this kind of stuff.

I am not sure if you mean to use 

struct  range {
__le64 pfn: 52;
__le64 nr_page: 12
}
Instead of the shift operation?

I didn't do it this way because I don't want to include 'virtio_balloon.h' in
page_alloc.c, or copy the definition of this struct into page_alloc.c.
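What I did instead is pass the bit width down, so the mm side stays free of virtio types. A rough sketch of that shape (illustrative only, not the actual patch; endianness conversion is omitted here):

/* mm/page_alloc.c side (sketch): record one free range, no
 * virtio_balloon.h needed. 'len_bits' is passed in by the balloon
 * driver (VIRTIO_BALLOON_NR_PFN_BITS). */
static void record_range(u64 *buf, unsigned long *pos,
			 unsigned long base_pfn, unsigned long nr_pages,
			 unsigned int len_bits)
{
	if (nr_pages >= (1UL << len_bits)) {
		/* A zero length field means "actual length in the next slot". */
		buf[(*pos)++] = (u64)base_pfn << len_bits;
		buf[(*pos)++] = nr_pages;
	} else {
		buf[(*pos)++] = ((u64)base_pfn << len_bits) | nr_pages;
	}
}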

Thanks!
Liang



Re: [Qemu-devel] [virtio-dev] Re: [PATCH v6 kernel 2/5] virtio-balloon: define new feature bit and head struct

2017-01-17 Thread Li, Liang Z
> Sent: Wednesday, January 18, 2017 3:11 AM
> To: Li, Liang Z
> Cc: k...@vger.kernel.org; virtio-...@lists.oasis-open.org; qemu-
> de...@nongnu.org; linux...@kvack.org; linux-ker...@vger.kernel.org;
> virtualizat...@lists.linux-foundation.org; amit.s...@redhat.com; Hansen,
> Dave; cornelia.h...@de.ibm.com; pbonz...@redhat.com;
> da...@redhat.com; aarca...@redhat.com; dgilb...@redhat.com;
> quint...@redhat.com
> Subject: Re: [virtio-dev] Re: [PATCH v6 kernel 2/5] virtio-balloon: define new
> feature bit and head struct
> 
> On Fri, Jan 13, 2017 at 09:24:22AM +, Li, Liang Z wrote:
> > > On Wed, Dec 21, 2016 at 02:52:25PM +0800, Liang Li wrote:
> > > > Add a new feature which supports sending the page information with
> > > > range array. The current implementation uses PFNs array, which is
> > > > not very efficient. Using ranges can improve the performance of
> > > > inflating/deflating significantly.
> > > >
> > > > Signed-off-by: Liang Li <liang.z...@intel.com>
> > > > Cc: Michael S. Tsirkin <m...@redhat.com>
> > > > Cc: Paolo Bonzini <pbonz...@redhat.com>
> > > > Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
> > > > Cc: Amit Shah <amit.s...@redhat.com>
> > > > Cc: Dave Hansen <dave.han...@intel.com>
> > > > Cc: Andrea Arcangeli <aarca...@redhat.com>
> > > > Cc: David Hildenbrand <da...@redhat.com>
> > > > ---
> > > >  include/uapi/linux/virtio_balloon.h | 12 
> > > >  1 file changed, 12 insertions(+)
> > > >
> > > > diff --git a/include/uapi/linux/virtio_balloon.h
> > > > b/include/uapi/linux/virtio_balloon.h
> > > > index 343d7dd..2f850bf 100644
> > > > --- a/include/uapi/linux/virtio_balloon.h
> > > > +++ b/include/uapi/linux/virtio_balloon.h
> > > > @@ -34,10 +34,14 @@
> > > >  #define VIRTIO_BALLOON_F_MUST_TELL_HOST0 /* Tell before
> > > reclaiming pages */
> > > >  #define VIRTIO_BALLOON_F_STATS_VQ  1 /* Memory Stats virtqueue
> > > */
> > > >  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM2 /* Deflate
> balloon
> > > on OOM */
> > > > +#define VIRTIO_BALLOON_F_PAGE_RANGE3 /* Send page info
> > > with ranges */
> > > >
> > > >  /* Size of a PFN in the balloon interface. */  #define
> > > > VIRTIO_BALLOON_PFN_SHIFT 12
> > > >
> > > > +/* Bits width for the length of the pfn range */
> > >
> > > What does this mean? Couldn't figure it out.
> > >
> > > > +#define VIRTIO_BALLOON_NR_PFN_BITS 12
> > > > +
> > > >  struct virtio_balloon_config {
> > > > /* Number of pages host wants Guest to give up. */
> > > > __u32 num_pages;
> > > > @@ -82,4 +86,12 @@ struct virtio_balloon_stat {
> > > > __virtio64 val;
> > > >  } __attribute__((packed));
> > > >
> > > > +/* Response header structure */
> > > > +struct virtio_balloon_resp_hdr {
> > > > +   __le64 cmd : 8; /* Distinguish different requests type */
> > > > +   __le64 flag: 8; /* Mark status for a specific request type */
> > > > +   __le64 id : 16; /* Distinguish requests of a specific type */
> > > > +   __le64 data_len: 32; /* Length of the following data, in bytes
> > > > +*/
> > >
> > > This use of __le64 makes no sense.  Just use u8/le16/le32 pls.
> > >
> >
> > Got it, will change in the next version.
> >
> > And could help take a look at other parts? as well as the QEMU part.
> >
> > Thanks!
> > Liang
> 
> Yes but first I would like to understand how come no fields in this new
> structure come up if I search for them in the following patch. I don't see why

That's not true; all of the fields will be referenced in the following
patches except the 'reserved' field.

> should I waste time on reviewing the implementation if the interface isn't
> reasonable. You don't have to waste it too - just send RFC patches with the
> header until we can agree on it.

OK. I will post the header part separately.

Thanks!
Liang
> 
> --
> MST
> 
> -
> To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org




Re: [Qemu-devel] about post copy recovery

2017-01-15 Thread Li, Liang Z
> * Li, Liang Z (liang.z...@intel.com) wrote:
> >
> > Hi David,
> >
> > I remembered some guys wanted to solve the issue of post copy recovery
> when network broken down, do you know latest status?
> 
> Hi Liang,
>   Yes, Haris looked at it as part of GSoC, the latest version is what was 
> posted:
> 
> https://lists.nongnu.org/archive/html/qemu-devel/2016-08/msg03468.html
> 
> I've not done any work on it since then;  there are a couple of hard problems
> to be solved.  The simpler is making sure that we always correctly detect a
> migration error due to networking (rather than some other non-recoverable
> error); there's lots of migration code that doesn't check for a file error
> straight away and only hits the error code later on when it's too late to
> recover.
> 
> The harder problem is that we often end up with the case where the main
> thread is blocked trying to access postcopied-RAM, e.g. an emulated
> network driver tries to write an incoming packet to guest RAM but finds the
> guest RAM hasn't arrived yet.
> With the main thread blocked it's very difficult to recover - we can't issue 
> any
> commands to trigger the recovery and even if we could we'll have to be very
> careful about what things those commands need the main thread to do.
> 
> Dave

Thanks for the information!
I will take a look first and maybe get back to you later.

Liang



Re: [Qemu-devel] [virtio-dev] Re: [PATCH v6 kernel 2/5] virtio-balloon: define new feature bit and head struct

2017-01-13 Thread Li, Liang Z
> On Wed, Dec 21, 2016 at 02:52:25PM +0800, Liang Li wrote:
> > Add a new feature which supports sending the page information with
> > range array. The current implementation uses PFNs array, which is not
> > very efficient. Using ranges can improve the performance of
> > inflating/deflating significantly.
> >
> > Signed-off-by: Liang Li 
> > Cc: Michael S. Tsirkin 
> > Cc: Paolo Bonzini 
> > Cc: Cornelia Huck 
> > Cc: Amit Shah 
> > Cc: Dave Hansen 
> > Cc: Andrea Arcangeli 
> > Cc: David Hildenbrand 
> > ---
> >  include/uapi/linux/virtio_balloon.h | 12 
> >  1 file changed, 12 insertions(+)
> >
> > diff --git a/include/uapi/linux/virtio_balloon.h
> > b/include/uapi/linux/virtio_balloon.h
> > index 343d7dd..2f850bf 100644
> > --- a/include/uapi/linux/virtio_balloon.h
> > +++ b/include/uapi/linux/virtio_balloon.h
> > @@ -34,10 +34,14 @@
> >  #define VIRTIO_BALLOON_F_MUST_TELL_HOST0 /* Tell before
> reclaiming pages */
> >  #define VIRTIO_BALLOON_F_STATS_VQ  1 /* Memory Stats virtqueue
> */
> >  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM2 /* Deflate balloon
> on OOM */
> > +#define VIRTIO_BALLOON_F_PAGE_RANGE3 /* Send page info
> with ranges */
> >
> >  /* Size of a PFN in the balloon interface. */  #define
> > VIRTIO_BALLOON_PFN_SHIFT 12
> >
> > +/* Bits width for the length of the pfn range */
> 
> What does this mean? Couldn't figure it out.
> 
> > +#define VIRTIO_BALLOON_NR_PFN_BITS 12
> > +
> >  struct virtio_balloon_config {
> > /* Number of pages host wants Guest to give up. */
> > __u32 num_pages;
> > @@ -82,4 +86,12 @@ struct virtio_balloon_stat {
> > __virtio64 val;
> >  } __attribute__((packed));
> >
> > +/* Response header structure */
> > +struct virtio_balloon_resp_hdr {
> > +   __le64 cmd : 8; /* Distinguish different requests type */
> > +   __le64 flag: 8; /* Mark status for a specific request type */
> > +   __le64 id : 16; /* Distinguish requests of a specific type */
> > +   __le64 data_len: 32; /* Length of the following data, in bytes */
> 
> This use of __le64 makes no sense.  Just use u8/le16/le32 pls.
> 

Got it, will change in the next version. 

And could you help take a look at the other parts, as well as the QEMU part?

Thanks!
Liang



[Qemu-devel] about post copy recovery

2017-01-10 Thread Li, Liang Z

Hi David,

I remembered some guys wanted to solve the issue of post copy recovery when 
network broken down, do you know latest status?

Thanks!
Liang



Re: [Qemu-devel] [PATCH v6 kernel 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2017-01-09 Thread Li, Liang Z
Hi guys,

Could you help to review this patch set?

Thanks!
Liang

> -Original Message-
> From: Li, Liang Z
> Sent: Wednesday, December 21, 2016 2:52 PM
> To: k...@vger.kernel.org
> Cc: virtio-...@lists.oasis-open.org; qemu-devel@nongnu.org; linux-
> m...@kvack.org; linux-ker...@vger.kernel.org; virtualization@lists.linux-
> foundation.org; amit.s...@redhat.com; Hansen, Dave;
> cornelia.h...@de.ibm.com; pbonz...@redhat.com; m...@redhat.com;
> da...@redhat.com; aarca...@redhat.com; dgilb...@redhat.com;
> quint...@redhat.com; Li, Liang Z
> Subject: [PATCH v6 kernel 0/5] Extend virtio-balloon for fast (de)inflating &
> fast live migration
> 
> This patch set contains two parts of changes to the virtio-balloon.
> 
> One is the change for speeding up the inflating & deflating process, the main
> idea of this optimization is to use {pfn|length} to present the page
> information instead of the PFNs, to reduce the overhead of virtio data
> transmission, address translation and madvise(). This can help to improve the
> performance by about 85%.
> 
> Another change is for speeding up live migration. By skipping process guest's
> unused pages in the first round of data copy, to reduce needless data
> processing, this can help to save quite a lot of CPU cycles and network
> bandwidth. We put guest's unused page information in a {pfn|length} array
> and send it to host with the virt queue of virtio-balloon. For an idle guest 
> with
> 8GB RAM, this can help to shorten the total live migration time from 2Sec to
> about 500ms in 10Gbps network environment. For an guest with quite a lot
> of page cache and with little unused pages, it's possible to let the guest 
> drop
> it's page cache before live migration, this case can benefit from this new
> feature too.
> 
> Changes from v5 to v6:
> * Drop the bitmap from the virtio ABI, use {pfn|length} only.
> * Enhance the API to get the unused page information from mm.
> 
> Changes from v4 to v5:
> * Drop the code to get the max_pfn, use another way instead.
> * Simplify the API to get the unused page information from mm.
> 
> Changes from v3 to v4:
> * Use the new scheme suggested by Dave Hansen to encode the bitmap.
> * Add code which is missed in v3 to handle migrate page.
> * Free the memory for bitmap intime once the operation is done.
> * Address some of the comments in v3.
> 
> Changes from v2 to v3:
> * Change the name of 'free page' to 'unused page'.
> * Use the scatter & gather bitmap instead of a 1MB page bitmap.
> * Fix overwriting the page bitmap after kicking.
> * Some of MST's comments for v2.
> 
> Changes from v1 to v2:
> * Abandon the patch for dropping page cache.
> * Put some structures to uapi head file.
> * Use a new way to determine the page bitmap size.
> * Use a unified way to send the free page information with the bitmap
> * Address the issues referred in MST's comments
> 
> Liang Li (5):
>   virtio-balloon: rework deflate to add page to a list
>   virtio-balloon: define new feature bit and head struct
>   virtio-balloon: speed up inflate/deflate process
>   virtio-balloon: define flags and head for host request vq
>   virtio-balloon: tell host vm's unused page info
> 
>  drivers/virtio/virtio_balloon.c | 510
> 
>  include/linux/mm.h  |   3 +
>  include/uapi/linux/virtio_balloon.h |  34 +++
>  mm/page_alloc.c | 120 +
>  4 files changed, 621 insertions(+), 46 deletions(-)
> 
> --
> 1.9.1




Re: [Qemu-devel] [PATCH v3 0/4] migration: skip scanning and migrating ram pages released by virtio-balloon driver.

2017-01-05 Thread Li, Liang Z
> >> If we want to actually host enforce (disallow) access to inflated
> >> pages, having
> >
> > Is that a new feature?
> 
> Still in the design phase. Andrea Arcangeli briefly mentioned in in his reply 
> to
> v5 of your patch series
> 
> "We also plan to use userfaultfd to make the balloon driver host enforced
> (will work fine on hugetlbfs 2M and tmpfs too) but that's going to be 
> invisible
> to the ABI so it's not strictly relevant for this discussion."
> 

Looking forward to seeing more details. Is there any thread about the design?

Thanks!

Liang

> >
> >> such a bitmap in QEMU is required. Getting them from the guest
> >> doesn't make sense in that context anymore.
> >
> > Even a bitmap is required, we should avoid to send it in the stop & copy
> stage.
> 
> yes, I agree.
> 
> Thanks!



Re: [Qemu-devel] [PATCH v3 0/4] migration: skip scanning and migrating ram pages released by virtio-balloon driver.

2017-01-04 Thread Li, Liang Z
> Am 23.12.2016 um 03:50 schrieb Li, Liang Z:
> >> While measuring live migration performance for qemu/kvm guest, it was
> >> observed that the qemu doesn’t maintain any intelligence for the
> >> guest ram pages released by the guest balloon driver and treat such
> >> pages as any other normal guest ram pages. This has direct impact on
> >> overall migration time for the guest which has released (ballooned out)
> memory to the host.
> >>
> >> In case of large systems, where we can configure large guests with
> >> 1TB and with considerable amount of memory released by balloon driver
> >> to the host, the migration time gets worse.
> >>
> >> The solution proposed below is local to qemu (and does not require
> >> any modification to Linux kernel or any guest driver). We have
> >> verified the fix for large guests =1TB on HPE Superdome X (which can
> >> support up to 240 cores and 12TB of memory).
> >>
> >> During live migration, as part of first iteration in
> >> ram_save_iterate() -> ram_find_and_save_block () will try to migrate
> >> ram pages even if they are released by vitrio-balloon driver (balloon
> >> inflate). Although these pages which are returned to the host by
> >> virtio-balloon driver are zero pages, the migration algorithm will
> >> still end up scanning the entire page
> >> ram_find_and_save_block() ->
> >> ram_save_page()/ram_save_compressed_page() ->
> >> save_zero_page() -> is_zero_range(). We also end-up sending header
> >> information over network for these pages during migration. This adds
> >> to the total migration time.
> >>
> >> The solution creates a balloon bitmap ramblock as a part of
> >> virtio-balloon device initialization. The bits in the balloon bitmap
> >> represent a guest ram page of size 1UL << VIRTIO_BALLOON_PFN_SHIFT
> or
> >> 4K. If TARGET_PAGE_BITS <= VIRTIO_BALLOON_PFN_SHIFT, ram_addr
> offset
> >> for the dirty page which is used by dirty page bitmap during
> >> migration is checked against the balloon bitmap as is, if the bit is
> >> set in the balloon bitmap, the corresponding ram page will be
> >> excluded from scanning and sending header information during
> >> migration. In case TARGET_PAGE_BITS > VIRTIO_BALLOON_PFN_SHIFT
> for a
> >> given dirty page ram_addr, all sub-pages of 1UL <<
> >> VIRTIO_BALLOON_PFN_SHIFT size should be ballooned out to avoid zero
> page scan and sending header information.
> >>
> >> The bitmap represents entire guest ram memory till max configured
> memory.
> >> Guest ram pages claimed by the virtio-balloon driver will be
> >> represented by 1 in the bitmap. Since the bitmap is maintained as a
> >> ramblock, it’s migrated to target as part migration’s ram iterative
> >> and ram complete phase. So that substituent migrations from the target
> can continue to use optimization.
> >>
> >> A new migration capability called skip-balloon is introduced. The
> >> user can disable the capability in cases where user does not expect
> >> much benefit or in case the migration is from an older version.
> >>
> >> During live migration setup the optimization can be set to disabled state 
> >> if .
> >> no virtio-balloon device is initialized.
> >> . skip-balloon migration capability is disabled.
> >> . If the guest virtio-balloon driver has not set
> >> VIRTIO_BALLOON_F_MUST_TELL_HOST
> >>   flag. Which means the guest may start using a ram pages freed by
> >> guest balloon
> >>   driver, even before the host/qemu is aware of it. In such case, the
> >>   optimization is disabled so that the ram pages that are being used by the
> >>   guest will continue to be scanned and migrated.
> >>
> >> Balloon bitmap ramblock size is set to zero if the optimization is
> >> disabled, to avoid overhead of migrating the bitmap. If the bitmap is
> >> not migrated to the target, the destination starts with a fresh
> >> bitmap and tracks the ballooning operation thereafter.
> >>
> >
> > I have a better way to get rid of the bitmap.
> > We should not maintain the inflating pages in the bitmap, instead, we
> > can get them from the guest if it's needed, just like what we did for
> > the guest's unused pages. Then we can combine the inflating page info
> > with the unused page info together, and skip them during live migration.
> >
> 
> If we want to actually host enforce (disallow) access to inflated pages, 
> having

Is that a new feature?

> such a bitmap in QEMU is required. Getting them from the guest doesn't
> make sense in that context anymore.

Even if a bitmap is required, we should avoid sending it in the stop & copy stage.

Thanks!
Liang
> 
> Thanks,
> David
> 


Re: [Qemu-devel] [PATCH v3 0/4] migration: skip scanning and migrating ram pages released by virtio-balloon driver.

2016-12-22 Thread Li, Liang Z
> While measuring live migration performance for qemu/kvm guest, it was
> observed that the qemu doesn’t maintain any intelligence for the guest ram
> pages released by the guest balloon driver and treat such pages as any other
> normal guest ram pages. This has direct impact on overall migration time for
> the guest which has released (ballooned out) memory to the host.
> 
> In case of large systems, where we can configure large guests with 1TB and
> with considerable amount of memory released by balloon driver to the host,
> the migration time gets worse.
> 
> The solution proposed below is local to qemu (and does not require any
> modification to Linux kernel or any guest driver). We have verified the fix 
> for
> large guests =1TB on HPE Superdome X (which can support up to 240 cores
> and 12TB of memory).
> 
> During live migration, as part of first iteration in ram_save_iterate() ->
> ram_find_and_save_block () will try to migrate ram pages even if they are
> released by vitrio-balloon driver (balloon inflate). Although these pages
> which are returned to the host by virtio-balloon driver are zero pages, the
> migration algorithm will still end up scanning the entire page
> ram_find_and_save_block() ->
> ram_save_page()/ram_save_compressed_page() ->
> save_zero_page() -> is_zero_range(). We also end-up sending header
> information over network for these pages during migration. This adds to the
> total migration time.
> 
> The solution creates a balloon bitmap ramblock as a part of virtio-balloon
> device initialization. The bits in the balloon bitmap represent a guest ram
> page of size 1UL << VIRTIO_BALLOON_PFN_SHIFT or 4K. If
> TARGET_PAGE_BITS <= VIRTIO_BALLOON_PFN_SHIFT, ram_addr offset for
> the dirty page which is used by dirty page bitmap during migration is checked
> against the balloon bitmap as is, if the bit is set in the balloon bitmap, the
> corresponding ram page will be excluded from scanning and sending header
> information during migration. In case TARGET_PAGE_BITS >
> VIRTIO_BALLOON_PFN_SHIFT for a given dirty page ram_addr, all sub-pages
> of 1UL << VIRTIO_BALLOON_PFN_SHIFT size should be ballooned out to
> avoid zero page scan and sending header information.
> 
> The bitmap represents entire guest ram memory till max configured memory.
> Guest ram pages claimed by the virtio-balloon driver will be represented by 1
> in the bitmap. Since the bitmap is maintained as a ramblock, it’s migrated to
> target as part migration’s ram iterative and ram complete phase. So that
> substituent migrations from the target can continue to use optimization.
> 
> A new migration capability called skip-balloon is introduced. The user can
> disable the capability in cases where user does not expect much benefit or in
> case the migration is from an older version.
> 
> During live migration setup the optimization can be set to disabled state if .
> no virtio-balloon device is initialized.
> . skip-balloon migration capability is disabled.
> . If the guest virtio-balloon driver has not set
> VIRTIO_BALLOON_F_MUST_TELL_HOST
>   flag. Which means the guest may start using a ram pages freed by guest
> balloon
>   driver, even before the host/qemu is aware of it. In such case, the
>   optimization is disabled so that the ram pages that are being used by the
>   guest will continue to be scanned and migrated.
> 
> Balloon bitmap ramblock size is set to zero if the optimization is disabled, 
> to
> avoid overhead of migrating the bitmap. If the bitmap is not migrated to the
> target, the destination starts with a fresh bitmap and tracks the ballooning
> operation thereafter.
> 

I have a better way to get rid of the bitmap.
We should not maintain the inflated pages in a bitmap; instead, we can get
them from the guest when needed, just like what we do for the guest's unused
pages. Then we can combine the inflated page info with the unused page info
and skip both during live migration.
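To make that concrete, the QEMU side could do something as simple as the sketch below once it has the {pfn, len} ranges from the guest (plain C, all names invented for illustration; the real code would clear bits in the migration dirty bitmap):

#include <stdint.h>
#include <stddef.h>

typedef struct PageRange {
    uint64_t pfn;   /* first guest page frame number */
    uint64_t len;   /* number of pages in the range */
} PageRange;

/* Clear one bit per reported page in a dirty bitmap where bit N
 * corresponds to guest page frame N, so those pages are neither
 * scanned nor sent in the first pass. */
static void skip_reported_ranges(unsigned long *dirty_bitmap,
                                 const PageRange *ranges, size_t count)
{
    const unsigned int bits = 8 * sizeof(unsigned long);

    for (size_t i = 0; i < count; i++) {
        for (uint64_t p = ranges[i].pfn;
             p < ranges[i].pfn + ranges[i].len; p++)
            dirty_bitmap[p / bits] &= ~(1UL << (p % bits));
    }
}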

Thanks!
Liang




Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-17 Thread Li, Liang Z
> Subject: Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for
> fast (de)inflating & fast live migration
> 
> On Thu, Dec 15, 2016 at 05:40:45PM -0800, Dave Hansen wrote:
> > On 12/15/2016 05:38 PM, Li, Liang Z wrote:
> > >
> > > Use 52 bits for 'pfn', 12 bits for 'length', when the 12 bits is not long
> enough for the 'length'
> > > Set the 'length' to a special value to indicate the "actual length in 
> > > next 8
> bytes".
> > >
> > > That will be much more simple. Right?
> >
> > Sounds fine to me.
> >
> 
> Sounds fine to me too indeed.
> 
> I'm only wondering what is the major point for compressing gpfn+len in
> 8 bytes in the common case, you already use sg_init_table to send down two
> pages, we could send three as well and avoid all math and bit shifts and ors,
> or not?
> 

Yes, we can use more pages for that.
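For example, adding a third buffer is just one more scatterlist entry; something like this sketch (hdr/data/more and their lengths are placeholders, not the actual field names in the patch):

	struct scatterlist sg[3];

	sg_init_table(sg, 3);
	sg_set_buf(&sg[0], hdr,  hdr_len);	/* response header */
	sg_set_buf(&sg[1], data, data_len);	/* pfn|length records */
	sg_set_buf(&sg[2], more, more_len);	/* a third buffer, if needed */

	/* One request, three buffers; the host reads them back to back. */
	virtqueue_add_outbuf(vq, sg, 3, vb, GFP_KERNEL);
	virtqueue_kick(vq);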

> I agree with the above because from a performance prospective I tend to
> think the above proposal will run at least theoretically faster because the
> other way is to waste double amount of CPU cache, and bit mangling in the
> encoding and the later decoding on qemu side should be faster than
> accessing an array of double size, but then I'm not sure if it's measurable
> optimization. So I'd be curious to know the exact motivation and if it is to
> reduce the CPU cache usage or if there's some other fundamental reason to
> compress it.
> The header already tells qemu how big is the array payload, couldn't we just
> add more pages if one isn't enough?
> 

The original intention of compressing the PFN and length into 8 bytes is to
reduce the memory required. Even though the code has changed a lot since the
previous versions, I think this is still true.

Currently we allocate a buffer of a specified size to save the 'PFN|length'
records; when that buffer is not big enough to hold all the page info for a
specified order, a buffer of double the size is allocated. This is what we
want to avoid, because the allocation may fail and also takes some time. For
fast live migration, time is a critical factor: the more time it takes, the
more unnecessary pages are sent, because live migration starts before the
request for unused pages gets a response.

Thanks

Liang

> Thanks,
> Andrea



Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-17 Thread Li, Liang Z
> On Fri, Dec 16, 2016 at 01:12:21AM +0000, Li, Liang Z wrote:
> > There still exist the case if the MAX_ORDER is configured to a large
> > value, e.g. 36 for a system with huge amount of memory, then there is only
> 28 bits left for the pfn, which is not enough.
> 
> Not related to the balloon but how would it help to set MAX_ORDER to 36?
> 

My point here is that MAX_ORDER may be configured to a big value.

> What the MAX_ORDER affects is that you won't be able to ask the kernel
> page allocator for contiguous memory bigger than 1<<(MAX_ORDER-1), but
> that's a driver issue not relevant to the amount of RAM. Drivers won't
> suddenly start to ask the kernel allocator to allocate compound pages at
> orders >= 11 just because more RAM was added.
> 
> The higher the MAX_ORDER the slower the kernel runs simply so the smaller
> the MAX_ORDER the better.
> 
> > Should  we limit the MAX_ORDER? I don't think so.
> 
> We shouldn't strictly depend on MAX_ORDER value but it's mostly limited
> already even if configurable at build time.
> 

I didn't know that and will take a look, thanks for your information.


Liang
> We definitely need it to reach at least the hugepage size, then it's mostly
> driver issue, but drivers requiring large contiguous allocations should rely 
> on
> CMA only or vmalloc if they only require it virtually contiguous, and not rely
> on larger MAX_ORDER that would slowdown all kernel allocations/freeing.



Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-15 Thread Li, Liang Z
> On 12/15/2016 05:38 PM, Li, Liang Z wrote:
> >
> > Use 52 bits for 'pfn', 12 bits for 'length', when the 12 bits is not long 
> > enough
> for the 'length'
> > Set the 'length' to a special value to indicate the "actual length in next 8
> bytes".
> >
> > That will be much more simple. Right?
> 
> Sounds fine to me.

Thanks for your inspiration!

Liang



Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-15 Thread Li, Liang Z
> On 12/15/2016 04:48 PM, Li, Liang Z wrote:
> >>> It seems we leave too many bit  for the pfn, and the bits leave for
> >>> length is not enough, How about keep 45 bits for the pfn and 19 bits
> >>> for length, 45 bits for pfn can cover 57 bits physical address, that
> >>> should be
> >> enough in the near feature.
> >>> What's your opinion?
> >> I still think 'order' makes a lot of sense.  But, as you say, 57 bits
> >> is enough for
> >> x86 for a while.  Other architectures who knows?
> 
> Thinking about this some more...  There are really only two cases that
> matter: 4k pages and "much bigger" ones.
> 
> Squeezing each 4k page into 8 bytes of metadata helps guarantee that this
> scheme won't regress over the old scheme in any cases.  For bigger ranges, 8
> vs 16 bytes means *nothing*.  And 16 bytes will be as good or better than
> the old scheme for everything which is >4k.
> 
> How about this:
>  * 52 bits of 'pfn', 5 bits of 'order', 7 bits of 'length'
>  * One special 'length' value to mean "actual length in next 8 bytes"
> 
> That should be pretty simple to produce and decode.  We have two record
> sizes, but I think it is manageable.

It works. But now that we intend to use another 8 bytes for the length anyway,
why not:

Use 52 bits for 'pfn' and 12 bits for 'length'; when the 12 bits are not long
enough for the 'length', set the 'length' to a special value to indicate that
the actual length is in the next 8 bytes.

That would be much simpler. Right?
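On the receiving side the decode stays trivial; for illustration (a self-contained plain-C sketch, names invented, byte-swapping omitted, and assuming 0 as the special value, which is what the v6 patch quoted earlier in this archive uses):

#include <stdint.h>
#include <stddef.h>

#define LEN_BITS 12
#define LEN_MASK ((1ULL << LEN_BITS) - 1)

/* Decode one record starting at buf[*idx]; advances *idx by 1 or 2 slots. */
static void decode_range(const uint64_t *buf, size_t *idx,
                         uint64_t *pfn, uint64_t *len)
{
    uint64_t rec = buf[(*idx)++];

    *pfn = rec >> LEN_BITS;
    *len = rec & LEN_MASK;
    if (*len == 0)          /* special value: real length is in the next slot */
        *len = buf[(*idx)++];
}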

Liang



Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-15 Thread Li, Liang Z
> On Thu, Dec 15, 2016 at 07:34:33AM -0800, Dave Hansen wrote:
> > On 12/14/2016 12:59 AM, Li, Liang Z wrote:
> > >> Subject: Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend
> > >> virtio-balloon for fast (de)inflating & fast live migration
> > >>
> > >> On 12/08/2016 08:45 PM, Li, Liang Z wrote:
> > >>> What's the conclusion of your discussion? It seems you want some
> > >>> statistic before deciding whether to  ripping the bitmap from the
> > >>> ABI, am I right?
> > >>
> > >> I think Andrea and David feel pretty strongly that we should remove
> > >> the bitmap, unless we have some data to support keeping it.  I
> > >> don't feel as strongly about it, but I think their critique of it
> > >> is pretty valid.  I think the consensus is that the bitmap needs to go.
> > >>
> > >> The only real question IMNHO is whether we should do a power-of-2
> > >> or a length.  But, if we have 12 bits, then the argument for doing
> > >> length is pretty strong.  We don't need anywhere near 12 bits if doing
> power-of-2.
> > >
> > > Just found the MAX_ORDER should be limited to 12 if use length
> > > instead of order, If the MAX_ORDER is configured to a value bigger
> > > than 12, it will make things more complex to handle this case.
> > >
> > > If use order, we need to break a large memory range whose length is
> > > not the power of 2 into several small ranges, it also make the code
> complex.
> >
> > I can't imagine it makes the code that much more complex.  It adds a
> > for loop.  Right?
> >
> > > It seems we leave too many bit  for the pfn, and the bits leave for
> > > length is not enough, How about keep 45 bits for the pfn and 19 bits
> > > for length, 45 bits for pfn can cover 57 bits physical address, that 
> > > should
> be enough in the near feature.
> > >
> > > What's your opinion?
> >
> > I still think 'order' makes a lot of sense.  But, as you say, 57 bits
> > is enough for x86 for a while.  Other architectures who knows?
> 
> I think you can probably assume page size >= 4K. But I would not want to
> make any other assumptions. E.g. there are systems that absolutely require
> you to set high bits for DMA.
> 
> I think we really want both length and order.
> 
> I understand how you are trying to pack them as tightly as possible.
> 
> However, I thought of a trick, we don't need to encode all possible orders.
> For example, with 2 bits of order, we can make them mean:
> 00 - 4K pages
> 01 - 2M pages
> 02 - 1G pages
> 
> guest can program the sizes for each order through config space.
> 
> We will have 10 bits left for legth.
> 

Please don't; we just got rid of the bitmap for simplification. :)

> It might make sense to also allow guest to program the number of bits used
> for order, this will make it easy to extend without host changes.
> 

There is still the case where MAX_ORDER is configured to a large value, e.g.
36 for a system with a huge amount of memory; then only 28 bits are left for
the pfn, which is not enough.
Should we limit MAX_ORDER? I don't think so.

It seems using the order is better.

Thanks!
Liang
> --
> MST



Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-15 Thread Li, Liang Z
> Subject: Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for
> fast (de)inflating & fast live migration
> 
> On 12/14/2016 12:59 AM, Li, Liang Z wrote:
> >> Subject: Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon
> >> for fast (de)inflating & fast live migration
> >>
> >> On 12/08/2016 08:45 PM, Li, Liang Z wrote:
> >>> What's the conclusion of your discussion? It seems you want some
> >>> statistic before deciding whether to  ripping the bitmap from the
> >>> ABI, am I right?
> >>
> >> I think Andrea and David feel pretty strongly that we should remove
> >> the bitmap, unless we have some data to support keeping it.  I don't
> >> feel as strongly about it, but I think their critique of it is pretty
> >> valid.  I think the consensus is that the bitmap needs to go.
> >>
> >> The only real question IMNHO is whether we should do a power-of-2 or
> >> a length.  But, if we have 12 bits, then the argument for doing
> >> length is pretty strong.  We don't need anywhere near 12 bits if doing
> power-of-2.
> >
> > Just found the MAX_ORDER should be limited to 12 if use length instead
> > of order, If the MAX_ORDER is configured to a value bigger than 12, it
> > will make things more complex to handle this case.
> >
> > If use order, we need to break a large memory range whose length is
> > not the power of 2 into several small ranges, it also make the code complex.
> 
> I can't imagine it makes the code that much more complex.  It adds a for loop.
> Right?
> 

Yes, just a little. :)

> > It seems we leave too many bit  for the pfn, and the bits leave for
> > length is not enough, How about keep 45 bits for the pfn and 19 bits
> > for length, 45 bits for pfn can cover 57 bits physical address, that should 
> > be
> enough in the near feature.
> >
> > What's your opinion?
> 
> I still think 'order' makes a lot of sense.  But, as you say, 57 bits is 
> enough for
> x86 for a while.  Other architectures who knows?

Yes. 



Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-14 Thread Li, Liang Z
> Subject: Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for
> fast (de)inflating & fast live migration
> 
> On 12/08/2016 08:45 PM, Li, Liang Z wrote:
> > What's the conclusion of your discussion? It seems you want some
> > statistic before deciding whether to  ripping the bitmap from the ABI,
> > am I right?
> 
> I think Andrea and David feel pretty strongly that we should remove the
> bitmap, unless we have some data to support keeping it.  I don't feel as
> strongly about it, but I think their critique of it is pretty valid.  I think 
> the
> consensus is that the bitmap needs to go.
> 
> The only real question IMNHO is whether we should do a power-of-2 or a
> length.  But, if we have 12 bits, then the argument for doing length is pretty
> strong.  We don't need anywhere near 12 bits if doing power-of-2.

I just found that MAX_ORDER would have to be limited to 12 if we use a length
instead of an order; if MAX_ORDER is configured to a value bigger than 12,
handling that case makes things more complex.

If we use an order, we need to break a large memory range whose length is not
a power of 2 into several small ranges (see the sketch below); that also makes
the code more complex.

It seems we leave too many bits for the pfn, and the bits left for the length
are not enough. How about keeping 45 bits for the pfn and 19 bits for the
length? 45 bits for the pfn can cover a 57-bit physical address (45 bits plus
the 12-bit page offset), which should be enough in the near future.

What's your opinion?
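For what it's worth, breaking a non-power-of-2 range into pfn/order chunks is only a short loop, something like this (self-contained illustration, not patch code):

#include <stdint.h>
#include <stdio.h>

/* Split [pfn, pfn + nr_pages) into naturally aligned power-of-two chunks,
 * as a pfn/order ABI would require. */
static void emit_order_chunks(uint64_t pfn, uint64_t nr_pages)
{
    while (nr_pages) {
        /* Largest order allowed by both the pfn alignment and the
         * number of pages left. */
        unsigned int align = pfn ? __builtin_ctzll(pfn) : 63;
        unsigned int fit = 63 - __builtin_clzll(nr_pages);
        unsigned int order = align < fit ? align : fit;

        printf("pfn %llu order %u\n", (unsigned long long)pfn, order);
        pfn += 1ULL << order;
        nr_pages -= 1ULL << order;
    }
}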


thanks!
Liang

 



Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-14 Thread Li, Liang Z
> fast (de)inflating & fast live migration
> 
> Hello,
> 
> On Fri, Dec 09, 2016 at 05:35:45AM +, Li, Liang Z wrote:
> > > On 12/08/2016 08:45 PM, Li, Liang Z wrote:
> > > > What's the conclusion of your discussion? It seems you want some
> > > > statistic before deciding whether to  ripping the bitmap from the
> > > > ABI, am I right?
> > >
> > > I think Andrea and David feel pretty strongly that we should remove
> > > the bitmap, unless we have some data to support keeping it.  I don't
> > > feel as strongly about it, but I think their critique of it is
> > > pretty valid.  I think the consensus is that the bitmap needs to go.
> > >
> >
> > Thanks for you clarification.
> >
> > > The only real question IMNHO is whether we should do a power-of-2 or
> > > a length.  But, if we have 12 bits, then the argument for doing
> > > length is pretty strong.  We don't need anywhere near 12 bits if doing
> power-of-2.
> > >
> > So each item can max represent 16MB Bytes, seems not big enough, but
> > enough for most case.
> > Things became much more simple without the bitmap, and I like simple
> > solution too. :)
> >
> > I will prepare the v6 and remove all the bitmap related stuffs. Thank you 
> > all!
> 
> Sounds great!
> 
> I suggested to check the statistics, because collecting those stats looked
> simpler and quicker than removing all bitmap related stuff from the patchset.
> However if you prefer to prepare a v6 without the bitmap another perhaps
> more interesting way to evaluate the usefulness of the bitmap is to just run
> the same benchmark and verify that there is no regression compared to the
> bitmap enabled code.
> 
> The other issue with the bitmap is, the best case for the bitmap is ever less
> likely to materialize the more RAM is added to the guest. It won't regress
> linearly because after all there can be some locality bias in the buddy 
> splits,
> but if sync compaction is used in the large order allocations tried before
> reaching order 0, the bitmap payoff will regress close to linearly with the
> increase of RAM.
> 
> So it'd be good to check the stats or the benchmark on large guests, at least
> one hundred gigabytes or so.
> 
> Changing topic but still about the ABI features needed, so it may be relevant
> for this discussion:
> 
> 1) vNUMA locality: i.e. allowing host to specify which vNODEs to take
>memory from, using alloc_pages_node in guest. So you can ask to
>take X pages from vnode A, Y pages from vnode B, in one vmenter.
> 
> 2) allowing qemu to tell the guest to stop inflating the balloon and
>report a fragmentation limit being hit, when sync compaction
>powered allocations fails at certain power-of-two order granularity
>passed by qemu to the guest. This order constraint will be passed
>by default for hugetlbfs guests with 2MB hpage size, while it can
>be used optionally on THP backed guests. This option with THP
>guests would allow a highlevel management software to provide a
>"don't reduce guest performance" while shrinking the memory size of
>the guest from the GUI. If you deselect the option, you can shrink
>down to the last freeable 4k guest page, but doing so may have to
>split THP in the host (you don't know for sure if they were really
>THP but they could have been), and it may regress
>performance. Inflating the balloon while passing a minimum
>granularity "order" of the pages being zapped, will guarantee
>inflating the balloon cannot decrease guest performance
>instead. Plus it's needed for hugetlbfs anyway as far as I can
>tell. hugetlbfs would not be host enforceable even if the idea is
>not to free memory but only reduce the available memory of the
>guest (not without major changes that maps a hugetlb page with 4k
>ptes at least). While for a more cooperative usage of hugetlbfs
>guests, it's simply not useful to inflate the balloon at anything
>less than the "HPAGE_SIZE" hugetlbfs granularity.
> 
> We also plan to use userfaultfd to make the balloon driver host enforced (will
> work fine on hugetlbfs 2M and tmpfs too) but that's going to be invisible to
> the ABI so it's not strictly relevant for this discussion.
> 
> On a side note, registering userfaultfd on the ballooned range, will keep
> khugepaged at bay so it won't risk to re-inflating the MADV_DONTNEED
> zapped sub-THP fragments no matter the sysfs tunings.
> 

Thanks for your elaboration!

> Thanks!
> Andrea



Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-08 Thread Li, Liang Z
> On 12/08/2016 08:45 PM, Li, Liang Z wrote:
> > What's the conclusion of your discussion? It seems you want some
> > statistic before deciding whether to  ripping the bitmap from the ABI,
> > am I right?
> 
> I think Andrea and David feel pretty strongly that we should remove the
> bitmap, unless we have some data to support keeping it.  I don't feel as
> strongly about it, but I think their critique of it is pretty valid.  I think 
> the
> consensus is that the bitmap needs to go.
> 

Thanks for you clarification.

> The only real question IMNHO is whether we should do a power-of-2 or a
> length.  But, if we have 12 bits, then the argument for doing length is pretty
> strong.  We don't need anywhere near 12 bits if doing power-of-2.
> 
So each item can represent at most 16MB (2^12 pages of 4KB), which seems not
very big, but is enough for most cases.
Things become much simpler without the bitmap, and I like simple solutions too. :)

I will prepare the v6 and remove all the bitmap-related stuff. Thank you all!

Liang





Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-08 Thread Li, Liang Z
> > 1. Current patches do a hypercall for each order in the allocator.
> >This is inefficient, but independent from the underlying data
> >structure in the ABI, unless bitmaps are in play, which they aren't.
> > 2. Should we have bitmaps in the ABI, even if they are not in use by the
> >guest implementation today?  Andrea says they have zero benefits
> >over a pfn/len scheme.  Dave doesn't think they have zero benefits
> >but isn't that attached to them.  QEMU's handling gets more
> >complicated when using a bitmap.
> > 3. Should the ABI contain records each with a pfn/len pair or a
> >pfn/order pair?
> >3a. 'len' is more flexible, but will always be a power-of-two anyway
> > for high-order pages (the common case)
> 
> Len wouldn't be a power of two practically only if we detect adjacent pages
> of smaller order that may merge into larger orders we already allocated (or
> the other way around).
> 
> [addr=2M, len=2M] allocated at order 9 pass [addr=4M, len=1M] allocated at
> order 8 pass -> merge as [addr=2M, len=3M]
> 
> Not sure if it would be worth it, but that unless we do this, page-order or 
> len
> won't make much difference.
> 
> >3b. if we decide not to have a bitmap, then we basically have plenty
> > of space for 'len' and should just do it
> >3c. It's easiest for the hypervisor to turn pfn/len into the
> >madvise() calls that it needs.
> >
> > Did I miss anything?
> 
> I think you summarized fine all my arguments in your summary.
> 
> > FWIW, I don't feel that strongly about the bitmap.  Li had one
> > originally, but I think the code thus far has demonstrated a huge
> > benefit without even having a bitmap.
> >
> > I've got no objections to ripping the bitmap out of the ABI.
> 
> I think we need to see a statistic showing the number of bits set in each
> bitmap in average, after some uptime and lru churn, like running stresstest
> app for a while with I/O and then inflate the balloon and
> count:
> 
> 1) how many bits were set vs total number of bits used in bitmaps
> 
> 2) how many times bitmaps were used vs bitmap_len = 0 case of single
>page
> 
> My guess would be like very low percentage for both points.
> 

> So there is a connection with the MAX_ORDER..0 allocation loop and the ABI
> change, but I agree any of the ABI proposed would still allow for it this 
> logic to
> be used. Bitmap or not bitmap, the loop would still work.

Hi guys,

What's the conclusion of your discussion?
It seems you want some statistics before deciding whether to rip the bitmap
out of the ABI, am I right?

Thanks!
Liang 



Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-08 Thread Li, Liang Z
> Subject: Re: [PATCH kernel v5 0/5] Extend virtio-balloon for fast 
> (de)inflating
> & fast live migration
> 
> On 12/07/2016 05:35 AM, Li, Liang Z wrote:
> >> Am 30.11.2016 um 09:43 schrieb Liang Li:
> >> IOW in real examples, do we have really large consecutive areas or
> >> are all pages just completely distributed over our memory?
> >
> > The buddy system of Linux kernel memory management shows there
> should
> > be quite a lot of consecutive pages as long as there are a portion of
> > free memory in the guest.
> ...
> > If all pages just completely distributed over our memory, it means the
> > memory fragmentation is very serious, the kernel has the mechanism to
> > avoid this happened.
> 
> While it is correct that the kernel has anti-fragmentation mechanisms, I don't
> think it invalidates the question as to whether a bitmap would be too sparse
> to be effective.
> 
> > In the other hand, the inflating should not happen at this time
> > because the guest is almost 'out of memory'.
> 
> I don't think this is correct.  Most systems try to run with relatively 
> little free
> memory all the time, using the bulk of it as page cache.  We have no reason
> to expect that ballooning will only occur when there is lots of actual free
> memory and that it will not occur when that same memory is in use as page
> cache.
> 

Yes.
> In these patches, you're effectively still sending pfns.  You're just sending
> one pfn per high-order page which is giving a really nice speedup.  IMNHO,
> you're avoiding doing a real bitmap because creating a bitmap means either
> have a really big bitmap, or you would have to do some sorting (or multiple
> passes) of the free lists before populating a smaller bitmap.
> 
> Like David, I would still like to see some data on whether the choice between
> bitmaps and pfn lists is ever clearly in favor of bitmaps.  You haven't
> convinced me, at least, that the data isn't even worth collecting.

I will try to get some data with a real workload and share it with you guys.

Thanks!
Liang



Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-07 Thread Li, Liang Z
> Am 30.11.2016 um 09:43 schrieb Liang Li:
> > This patch set contains two parts of changes to the virtio-balloon.
> >
> > One is the change for speeding up the inflating & deflating process,
> > the main idea of this optimization is to use bitmap to send the page
> > information to host instead of the PFNs, to reduce the overhead of
> > virtio data transmission, address translation and madvise(). This can
> > help to improve the performance by about 85%.
> 
> Do you have some statistics/some rough feeling how many consecutive bits are
> usually set in the bitmaps? Is it really just purely random or is there some
> granularity that is usually consecutive?
> 

I did something similar: filled the balloon with 15GB for a 16GB idle guest.
Using the bitmap, the madvise() count was reduced to 605; using the PFNs, the
madvise() count was 3932160. It means there are quite a lot of consecutive
bits in the bitmap.
I didn't test with a guest under a heavy memory workload.

> IOW in real examples, do we have really large consecutive areas or are all
> pages just completely distributed over our memory?
> 

The buddy system of the Linux kernel's memory management shows there should
be quite a lot of consecutive pages as long as a portion of free memory
remains in the guest.
If all pages were completely distributed over memory, it would mean the memory
fragmentation is very serious, and the kernel has mechanisms to avoid that
happening.
On the other hand, inflating should not happen at that point, because the
guest would then be almost 'out of memory'.

Liang

> Thanks!
> 
> --
> 
> David



Re: [Qemu-devel] [PATCH kernel v5 5/5] virtio-balloon: tell host vm's unused page info

2016-12-05 Thread Li, Liang Z
> >>> + mutex_lock(&vb->balloon_lock);
> >>> +
> >>> + for (order = MAX_ORDER - 1; order >= 0; order--) {
> >>
> >> I scratched my head for a bit on this one.  Why are you walking over
> >> orders,
> >> *then* zones.  I *think* you're doing it because you can efficiently
> >> fill the bitmaps at a given order for all zones, then move to a new
> >> bitmap.  But, it would be interesting to document this.
> >
> > Yes, use the order is somewhat strange, but it's helpful to keep the API 
> > simple.
> > Do you think it's acceptable?
> 
> Yeah, it's fine.  Just comment it, please.
> 
Good!

> >>> + if (ret == -ENOSPC) {
> >>> + void *new_resp_data;
> >>> +
> >>> + new_resp_data = kmalloc(2 * vb->resp_buf_size,
> >>> + GFP_KERNEL);
> >>> + if (new_resp_data) {
> >>> + kfree(vb->resp_data);
> >>> + vb->resp_data = new_resp_data;
> >>> + vb->resp_buf_size *= 2;
> >>
> >> What happens to the data in ->resp_data at this point?  Doesn't this
> >> just throw it away?
> >
> > Yes, so we should make sure the data in resp_data is not inuse.
> 
> But doesn't it have valid data that we just collected and haven't told the
> hypervisor about yet?  Aren't we throwing away good data that cost us
> something to collect?

Indeed. Some filled data may already exist for the previous zone. Should we
change the API to
'int get_unused_pages(unsigned long *unused_pages, unsigned long size,
int order, unsigned long *pos, struct zone *zone)'?

Then we can use 'zone' to record which zone to retry, so the already
filled data is not discarded.
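
Something like the following is what I have in mind (a simplified userspace
sketch only; fake_zone and fill_zone are made up for illustration and are not
the kernel code): on -ENOSPC the caller grows the buffer and retries the same
zone at the same position, so the records already stored in [0, pos) are kept.

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

struct fake_zone {			/* stand-in for struct zone */
	const unsigned long *pfns;
	unsigned long nr;
};

/* Refuses to partially fill, like the patch: returns -ENOSPC if the whole
 * zone does not fit, so the caller can grow the buffer and retry. */
static int fill_zone(const struct fake_zone *z, unsigned long *buf,
		     unsigned long size, unsigned long *pos)
{
	if (*pos + z->nr > size)
		return -ENOSPC;
	for (unsigned long i = 0; i < z->nr; i++)
		buf[(*pos)++] = z->pfns[i];
	return 0;
}

int main(void)
{
	const unsigned long z0[] = { 100, 101, 102 }, z1[] = { 500, 501 };
	const struct fake_zone zones[] = { { z0, 3 }, { z1, 2 } };
	unsigned long size = 2, pos = 0;
	unsigned long *buf = malloc(size * sizeof(*buf));

	if (!buf)
		return 1;
	for (int i = 0; i < 2; i++) {
		while (fill_zone(&zones[i], buf, size, &pos) == -ENOSPC) {
			unsigned long *bigger =
				realloc(buf, 2 * size * sizeof(*buf));
			if (!bigger) {		/* keep what we have so far */
				free(buf);
				return 1;
			}
			buf = bigger;		/* data in [0, pos) preserved */
			size *= 2;
		}
	}
	for (unsigned long i = 0; i < pos; i++)
		printf("pfn %lu\n", buf[i]);
	free(buf);
	return 0;
}

The extra 'struct zone *' parameter in the real API would just let
get_unused_pages() report where it stopped, so the caller can resume there
instead of rescanning from the first zone.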

> >> ...
> >>> +struct page_info_item {
> >>> + __le64 start_pfn : 52; /* start pfn for the bitmap */
> >>> + __le64 page_shift : 6; /* page shift width, in bytes */
> 
> What does a page_shift "in bytes" mean? :)

Obviously, you know. :o
I will try to make it clear.

> 
> >>> + __le64 bmap_len : 6;  /* bitmap length, in bytes */ };
> >>
> >> Is 'bmap_len' too short?  a 64-byte buffer is a bit tiny.  Right?
> >
> > Currently, we just use the 8 bytes and 0 bytes bitmap, should we support
> more than 64 bytes?
> 
> It just means that with this format, you end up wasting at least ~1/8th of the
> space with metadata.  That's a bit unfortunate, but I guess it's not fatal.
> 
> I'd definitely call it out in the patch description and make sure other folks 
> take
> a look at it.

OK.

> 
> There's a somewhat easy fix, but that would make the qemu implementation
> more complicated: You could just have bmap_len==0x3f imply that there's
> another field that contains an extended bitmap length for when you need long
> bitmaps.
> 
> But, as you note, there's no need for it, so it's a matter of trading the 
> extra
> complexity versus the desire to not habing to change the ABI again for longer
> (hopefully).
> 

Your suggestion still works without changing the current code: just reserve
'bmap_len == 0x3f' for future extension; it's simply not used by the current code.
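
To make sure we are talking about the same thing, roughly like this (a sketch
only; the field names follow the patch, and the 0x3f escape is unused by the
current code):

#include <stdint.h>

/* If bmap_len is 0x3f, the actual bitmap length would be carried in the
 * next 64-bit word.  Today nothing emits 0x3f, so the format is unchanged. */
#define BMAP_LEN_EXTENDED 0x3f

struct page_info_item {
	uint64_t start_pfn  : 52;	/* start pfn for the bitmap */
	uint64_t page_shift : 6;	/* page shift width */
	uint64_t bmap_len   : 6;	/* bitmap length in bytes, or 0x3f */
};

_Static_assert(sizeof(struct page_info_item) == 8, "one 64-bit word");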

> >>> +static int  mark_unused_pages(struct zone *zone,
> >>> + unsigned long *unused_pages, unsigned long size,
> >>> + int order, unsigned long *pos)
> >>> +{
> >>> + unsigned long pfn, flags;
> >>> + unsigned int t;
> >>> + struct list_head *curr;
> >>> + struct page_info_item *info;
> >>> +
> >>> + if (zone_is_empty(zone))
> >>> + return 0;
> >>> +
> >>> + spin_lock_irqsave(>lock, flags);
> >>> +
> >>> + if (*pos + zone->free_area[order].nr_free > size)
> >>> + return -ENOSPC;
> >>
> >> Urg, so this won't partially fill?  So, what the nr_free pages limit
> >> where we no longer fit in the kmalloc()'d buffer where this simply won't
> work?
> >
> > Yes.  My initial implementation is partially fill, it's better for the 
> > worst case.
> > I thought the above code is more efficient for most case ...
> > Do you think partially fill the bitmap is better?
> 
> Could you please answer the question I asked?
> 

For your question:
---
> So, what the nr_free pages limit where we no longer fit in the kmalloc()'d
> buffer where this simply won't work?
--
There is no hard limit: if the buffer is not big enough to hold 'nr_free' pages,
get_unused_pages() will return -ENOSPC, and the calling code will allocate a
buffer twice the size and retry, until a buffer of the proper size is allocated.
The current order will not be skipped unless the buffer allocation fails.

> Because if you don't get this right, it could mean that there are system that
> simply *fail* here.




Re: [Qemu-devel] [PATCH kernel v5 5/5] virtio-balloon: tell host vm's unused page info

2016-12-04 Thread Li, Liang Z
> On 11/30/2016 12:43 AM, Liang Li wrote:
> > +static void send_unused_pages_info(struct virtio_balloon *vb,
> > +   unsigned long req_id)
> > +{
> > +   struct scatterlist sg_in;
> > +   unsigned long pos = 0;
> > +   struct virtqueue *vq = vb->req_vq;
> > +   struct virtio_balloon_resp_hdr *hdr = vb->resp_hdr;
> > +   int ret, order;
> > +
> > +   mutex_lock(>balloon_lock);
> > +
> > +   for (order = MAX_ORDER - 1; order >= 0; order--) {
> 
> I scratched my head for a bit on this one.  Why are you walking over orders,
> *then* zones.  I *think* you're doing it because you can efficiently fill the
> bitmaps at a given order for all zones, then move to a new bitmap.  But, it
> would be interesting to document this.
> 

Yes, iterating by order is somewhat strange, but it helps keep the API simple.
Do you think it's acceptable?

> > +   pos = 0;
> > +   ret = get_unused_pages(vb->resp_data,
> > +vb->resp_buf_size / sizeof(unsigned long),
> > +order, );
> 
> FWIW, get_unsued_pages() is a pretty bad name.  "get" usually implies
> bumping reference counts or consuming something.  You're just "recording"
> or "marking" them.
> 

Will change to mark_unused_pages().

> > +   if (ret == -ENOSPC) {
> > +   void *new_resp_data;
> > +
> > +   new_resp_data = kmalloc(2 * vb->resp_buf_size,
> > +   GFP_KERNEL);
> > +   if (new_resp_data) {
> > +   kfree(vb->resp_data);
> > +   vb->resp_data = new_resp_data;
> > +   vb->resp_buf_size *= 2;
> 
> What happens to the data in ->resp_data at this point?  Doesn't this just
> throw it away?
> 

Yes, so we should make sure the data in resp_data is not in use.

> ...
> > +struct page_info_item {
> > +   __le64 start_pfn : 52; /* start pfn for the bitmap */
> > +   __le64 page_shift : 6; /* page shift width, in bytes */
> > +   __le64 bmap_len : 6;  /* bitmap length, in bytes */ };
> 
> Is 'bmap_len' too short?  a 64-byte buffer is a bit tiny.  Right?
> 

Currently, we only use 8-byte and 0-byte bitmaps; should we support more than
64 bytes?

> > +static int  mark_unused_pages(struct zone *zone,
> > +   unsigned long *unused_pages, unsigned long size,
> > +   int order, unsigned long *pos)
> > +{
> > +   unsigned long pfn, flags;
> > +   unsigned int t;
> > +   struct list_head *curr;
> > +   struct page_info_item *info;
> > +
> > +   if (zone_is_empty(zone))
> > +   return 0;
> > +
> > +   spin_lock_irqsave(>lock, flags);
> > +
> > +   if (*pos + zone->free_area[order].nr_free > size)
> > +   return -ENOSPC;
> 
> Urg, so this won't partially fill?  So, what the nr_free pages limit where we 
> no
> longer fit in the kmalloc()'d buffer where this simply won't work?
> 

Yes. My initial implementation did partial fill, which is better for the worst
case. I thought the above code was more efficient for the common case ...
Do you think partially filling the bitmap is better?
 
> > +   for (t = 0; t < MIGRATE_TYPES; t++) {
> > +   list_for_each(curr, >free_area[order].free_list[t]) {
> > +   pfn = page_to_pfn(list_entry(curr, struct page, lru));
> > +   info = (struct page_info_item *)(unused_pages +
> *pos);
> > +   info->start_pfn = pfn;
> > +   info->page_shift = order + PAGE_SHIFT;
> > +   *pos += 1;
> > +   }
> > +   }
> 
> Do we need to fill in ->bmap_len here?

For completeness, bmap_len should be filled in; I will add it.
This step was omitted only because QEMU assumes ->bmap_len is 0 and ignores the
field.

Thanks for your comment!

Liang



Re: [Qemu-devel] [PATCH kernel v4 7/7] virtio-balloon: tell host vm's unused page info

2016-11-07 Thread Li, Liang Z
> On 11/06/2016 07:37 PM, Li, Liang Z wrote:
> >> Let's say we do a 32k bitmap that can hold ~1M pages.  That's 4GB of RAM.
> >> On a 1TB system, that's 256 passes through the top-level loop.
> >> The bottom-level lists have tens of thousands of pages in them, even
> >> on my laptop.  Only 1/256 of these pages will get consumed in a given pass.
> >>
> > Your description is not exactly.
> > A 32k bitmap is used only when there is few free memory left in the
> > system and when the extend_page_bitmap() failed to allocate more
> > memory for the bitmap. Or dozens of 32k split bitmap will be used,
> > this version limit the bitmap count to 32, it means we can use at most
> > 32*32 kB for the bitmap, which can cover 128GB for RAM. We can increase
> the bitmap count limit to a larger value if 32 is not big enough.
> 
> OK, so it tries to allocate a large bitmap.  But, if it fails, it will try to 
> work with a
> smaller bitmap.  Correct?
> 
Yes.

> So, what's the _worst_ case?  It sounds like it is even worse than I was
> positing.
> 

The worst case is when only a 32KB bitmap can be allocated and there is a huge
amount of low-order (< 3) free pages.

> >> That's an awfully inefficient way of doing it.  This patch
> >> essentially changed the data structure without changing the algorithm to
> populate it.
> >>
> >> Please change the *algorithm* to use the new data structure efficiently.
> >>  Such a change would only do a single pass through each freelist, and
> >> would choose whether to use the extent-based (pfn -> range) or
> >> bitmap-based approach based on the contents of the free lists.
> >
> > Save the free page info to a raw bitmap first and then process the raw
> > bitmap to get the proper ' extent-based ' and  'bitmap-based' is the
> > most efficient way I can come up with to save the virtio data transmission.
> Do you have some better idea?
> 
> That's kinda my point.  This patch *does* processing to try to pack the
> bitmaps full of pages from the various pfn ranges.  It's a form of processing
> that gets *REALLY*, *REALLY* bad in some (admittedly obscure) cases.
> 
> Let's not pretend that making an essentially unlimited number of passes over
> the free lists is not processing.
> 
> 1. Allocate as large of a bitmap as you can. (what you already do) 2. Iterate
> from the largest freelist order.  Store those pages in the
>bitmap.
> 3. If you can no longer fit pages in the bitmap, return the list that
>you have.
> 4. Make an approximation about where the bitmap does not make any more,
>and fall back to listing individual PFNs.  This would make sens, for
>instance in a large zone with very few free order-0 pages left.
> 
Sounds good. Should we ignore some of the order-0 pages in step 4 if the bitmap
is full? Or should we retry to get a complete list of order-0 pages?
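
About step 4, my understanding of the fallback decision is something like the
sketch below (the cost model and names are made up, not from the patch):
compare the cost of a bitmap covering the pfn span against the cost of listing
each free page as a 64-bit record, and switch to the pfn list once the
remaining pages are too sparse.

#include <stdbool.h>
#include <stdio.h>

#define RECORD_BITS 64			/* one pfn record costs 64 bits */

/* For a given zone/order: nr_free pages spread over a span of span_pfns
 * page frames.  A bitmap costs ~span_pfns bits, a pfn list costs
 * nr_free * RECORD_BITS bits, so prefer the list when pages are sparse. */
static bool prefer_pfn_list(unsigned long nr_free, unsigned long span_pfns)
{
	return nr_free * RECORD_BITS < span_pfns;
}

int main(void)
{
	/* 100 scattered order-0 pages left in a 4GB (1M-pfn) zone: list wins */
	printf("%s\n", prefer_pfn_list(100, 1UL << 20) ? "pfn list" : "bitmap");
	return 0;
}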

> 
> > It seems the benefit we get for this feature is not as big as that in fast
> balloon inflating/deflating.
> >>
> >> You should not be using get_max_pfn().  Any patch set that continues
> >> to use it is not likely to be using a proper algorithm.
> >
> > Do you have any suggestion about how to avoid it?
> 
> Yes: get the pfns from the page free lists alone.  Don't derive them from the
> pfn limits of the system or zones.

get_max_pfn() can be avoided in this patch, but I don't think we can avoid it
completely.
We need it as a hint for allocating a properly sized bitmap. No?
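
By "hint" I mean only something like the sizing below (a userspace-style
sketch, not the patch code): max_pfn bounds how many bits the whole bitmap
could ever need, which tells us how many chunks to allocate up front.

#include <stdio.h>

#define BITS_PER_LONG (8 * sizeof(unsigned long))

static unsigned long bitmap_bytes_for(unsigned long max_pfn)
{
	unsigned long longs = (max_pfn + BITS_PER_LONG - 1) / BITS_PER_LONG;
	return longs * sizeof(unsigned long);
}

int main(void)
{
	/* 16GB of RAM with 4KB pages -> 4M pfns -> a 512KB bitmap at most */
	printf("%lu bytes\n", bitmap_bytes_for(4UL << 20));
	return 0;
}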

Thanks!
Liang



Re: [Qemu-devel] Migration dirty bitmap: should only mark pages as dirty after they have been sent

2016-11-07 Thread Li, Liang Z
> > I compare SHA1 with XBZRLE. I use XBZRLE in two ways:
> > 1. Begins to buffer pages from iteration 1; 2. As current
> > implementation, begins to buffer pages from iteration 2.
> >
> > I post the results of three workloads: cpu2006.zeusmp, cpu2006.mcf,
> > memcached.
> > I set the cache size as 256MB for zeusmp & mcf (they run in VM with
> > 1GB ram), and set the cache size as 1GB for memcached (it run in VM
> > with 6GB ram, and memcached takes 4GB as cache).
> >
> > As you can read from the data below, beginning to buffer pages from
> > iteration 1 is better than the current implementation(from iteration
> > 2), because the total migration time is shorter.
> >
> > SHA1 is better than the XBZRLE with the cache size I choose, because
> > it leads to shorter migration time, and consumes far less memory
> > overhead (<1/200 of the total VM memory).
> >
> 
> Hi Chunguang,
> 
> Have you tried to use a large XBZRLE cache size which equals to the guest's
> RAM size?
> Is SHA1 faster in that case?
> 
> Thanks!
> Liang

Intel's future chipsets will contain hardware engines that support SHA-x and
MD5. We can make use of these engines to offload the SHA/MD5 calculation
overhead from the CPU.

Liang


Re: [Qemu-devel] Migration dirty bitmap: should only mark pages as dirty after they have been sent

2016-11-07 Thread Li, Liang Z
> > > > > > > > > I think this is "very" wasteful. Assume the workload
> > > > > > > > > writes the pages
> > > > > dirty randomly within the guest address space, and the transfer
> > > > > speed is constant. Intuitively, I think nearly half of the dirty
> > > > > pages produced in Iteration 1 is not really dirty. This means
> > > > > the time of Iteration 2 is double of that to send only really dirty 
> > > > > pages.
> > > > > > > >
> > > > > > > > It makes sense, can you get some perf numbers to show what
> > > > > > > > kinds of workloads get impacted the most?  That would also
> > > > > > > > help us to figure out what kinds of speed improvements we
> > > > > > > > can
> > > expect.
> > > > > > > >
> > > > > > > >
> > > > > > > > Amit
> > > > > > >
> > > > > > > I have picked up 6 workloads and got the following
> > > > > > > statistics numbers of every iteration (except the last
> > > > > > > stop-copy one) during
> > > precopy.
> > > > > > > These numbers are obtained with the basic precopy migration,
> > > > > > > without the capabilities like xbzrle or compression, etc.
> > > > > > > The network for the migration is exclusive, with a separate
> > > > > > > network for
> > > the workloads.
> > > > > > > They are both gigabit ethernet. I use qemu-2.5.1.
> > > > > > >
> > > > > > > Three (booting, idle, web server) of them converged to the
> > > > > > > stop-copy
> > > > > phase,
> > > > > > > with the given bandwidth and default downtime (300ms), while
> > > > > > > the other three (kernel compilation, zeusmp, memcached) did not.
> > > > > > >
> > > > > > > One page is "not-really-dirty", if it is written first and
> > > > > > > is sent later (and not written again after that) during one
> > > > > > > iteration. I guess this would not happen so often during the
> > > > > > > other iterations as during the 1st iteration. Because all
> > > > > > > the pages of the VM are sent to the dest node
> > > > > during
> > > > > > > the 1st iteration, while during the others, only part of the
> > > > > > > pages are
> > > sent.
> > > > > > > So I think the "not-really-dirty" pages should be produced
> > > > > > > mainly during the 1st iteration , and maybe very little
> > > > > > > during the other
> > > iterations.
> > > > > > >
> > > > > > > If we could avoid resending the "not-really-dirty" pages,
> > > > > > > intuitively, I think the time spent on Iteration 2 would be
> > > > > > > halved. This is a chain
> > > > > reaction,
> > > > > > > because the dirty pages produced during Iteration 2 is
> > > > > > > halved, which
> > > > > incurs
> > > > > > > that the time spent on Iteration 3 is halved, then Iteration 4, 
> > > > > > > 5...
> > > > > >
> > > > > > Yes; these numbers don't show how many of them are false dirty
> > > though.
> > > > > >
> > > > > > One problem is thinking about pages that have been redirtied,
> > > > > > if the page is
> > > > > dirtied
> > > > > > after the sync but before the network write then it's the
> > > > > > false-dirty that you're describing.
> > > > > >
> > > > > > However, if the page is being written a few times, and so it
> > > > > > would have
> > > > > been written
> > > > > > after the network write then it isn't a false-dirty.
> > > > > >
> > > > > > You might be able to figure that out with some kernel tracing
> > > > > > of when the
> > > > > dirtying
> > > > > > happens, but it might be easier to write the fix!
> > > > > >
> > > > > > Dave
> > > > >
> > > > > Hi, I have made some new progress now.
> > > > >
> > > > > To tell how many false dirty pages there are exactly in each
> > > > > iteration, I malloc a buffer in memory as big as the size of the
> > > > > whole VM memory. When a page is transferred to the dest node, it
> > > > > is copied to the buffer; During the next iteration, if one page
> > > > > is transferred, it is compared to the old one in the buffer, and
> > > > > the old one will be replaced for next comparison if it is really 
> > > > > dirty.
> > > > > Thus, we are now able to get the exact number of false dirty pages.
> > > > >
> > > > > This time, I use 15 workloads to get the statistic number. They are:
> > > > >
> > > > >   1. 11 benchmarks picked up from cpu2006 benchmark suit. They
> > > > > are all scientific
> > > > >  computing workloads like Quantum Chromodynamics, Fluid
> > > > > Dynamics,
> > > etc.
> > > > > I pick
> > > > >  up these 11 benchmarks because compared to others, they
> > > > > have bigger memory
> > > > >  occupation and higher memory dirty rate. Thus most of them
> > > > > could not converge
> > > > >  to stop-and-copy using the default migration speed (32MB/s).
> > > > >   2. kernel compilation
> > > > >   3. idle VM
> > > > >   4. Apache web server which serves static content
> > > > >
> > > > >   (the above workloads are all running in VM with 1 vcpu and 1GB
> > > > > memory, and the
> > > > >migration speed is the default 32MB/s)
> > > > >
> > > > >   5. Memcached. The VM has 

Re: [Qemu-devel] [PATCH kernel v4 7/7] virtio-balloon: tell host vm's unused page info

2016-11-06 Thread Li, Liang Z
> Please squish this and patch 5 together.  It makes no sense to separate them.
> 

OK.

> > +static void send_unused_pages_info(struct virtio_balloon *vb,
> > +   unsigned long req_id)
> > +{
> > +   struct scatterlist sg_in;
> > +   unsigned long pfn = 0, bmap_len, pfn_limit, last_pfn, nr_pfn;
> > +   struct virtqueue *vq = vb->req_vq;
> > +   struct virtio_balloon_resp_hdr *hdr = vb->resp_hdr;
> > +   int ret = 1, used_nr_bmap = 0, i;
> > +
> > +   if (virtio_has_feature(vb->vdev,
> VIRTIO_BALLOON_F_PAGE_BITMAP) &&
> > +   vb->nr_page_bmap == 1)
> > +   extend_page_bitmap(vb);
> > +
> > +   pfn_limit = PFNS_PER_BMAP * vb->nr_page_bmap;
> > +   mutex_lock(>balloon_lock);
> > +   last_pfn = get_max_pfn();
> > +
> > +   while (ret) {
> > +   clear_page_bitmap(vb);
> > +   ret = get_unused_pages(pfn, pfn + pfn_limit, vb-
> >page_bitmap,
> > +PFNS_PER_BMAP, vb->nr_page_bmap);
> 
> This changed the underlying data structure without changing the way that
> the structure is populated.
> 
> This algorithm picks a "PFNS_PER_BMAP * vb->nr_page_bmap"-sized set of
> pfns, allocates a bitmap for them, the loops through all zones looking for
> pages in any free list that are in that range.
> 
> Unpacking all the indirection, it looks like this:
> 
> for (pfn = 0; pfn < get_max_pfn(); pfn += BITMAP_SIZE_IN_PFNS)
>   for_each_populated_zone(zone)
>   for_each_migratetype_order(order, t)
>   list_for_each(..., >free_area[order])...
> 
> Let's say we do a 32k bitmap that can hold ~1M pages.  That's 4GB of RAM.
> On a 1TB system, that's 256 passes through the top-level loop.
> The bottom-level lists have tens of thousands of pages in them, even on my
> laptop.  Only 1/256 of these pages will get consumed in a given pass.
> 
Your description is not exact.
A single 32kB bitmap is used only when there is little free memory left in the
system and extend_page_bitmap() failed to allocate more memory for the bitmap.
Otherwise, dozens of 32kB split bitmaps will be used; this version limits the
bitmap count to 32, which means we can use at most 32*32kB for the bitmap, which
can cover 128GB of RAM. We can increase the bitmap count limit to a larger value
if 32 is not big enough.

> That's an awfully inefficient way of doing it.  This patch essentially changed
> the data structure without changing the algorithm to populate it.
> 
> Please change the *algorithm* to use the new data structure efficiently.
>  Such a change would only do a single pass through each freelist, and would
> choose whether to use the extent-based (pfn -> range) or bitmap-based
> approach based on the contents of the free lists.

Saving the free page info to a raw bitmap first and then processing the raw
bitmap to produce the proper 'extent-based' and 'bitmap-based' forms is the most
efficient way I can come up with to reduce the virtio data transmission. Do you
have a better idea?


In QEMU, no matter how we encode the bitmap, the raw-format bitmap will be
used in the end. But what I did in this version is:
   kernel: get the raw bitmap  --> encode the bitmap
   QEMU: decode the bitmap --> get the raw bitmap

Is it worth doing this kind of work here? We can save virtio data transmission,
but at the same time we do extra work.

It seems the benefit we get for this feature is not as big as it is for fast
balloon inflating/deflating.
> 
> You should not be using get_max_pfn().  Any patch set that continues to use
> it is not likely to be using a proper algorithm.

Do you have any suggestions on how to avoid it?

Thanks!
Liang




Re: [Qemu-devel] Migration dirty bitmap: should only mark pages as dirty after they have been sent

2016-11-03 Thread Li, Liang Z
> > > > > > > I think this is "very" wasteful. Assume the workload writes
> > > > > > > the pages
> > > dirty randomly within the guest address space, and the transfer
> > > speed is constant. Intuitively, I think nearly half of the dirty
> > > pages produced in Iteration 1 is not really dirty. This means the
> > > time of Iteration 2 is double of that to send only really dirty pages.
> > > > > >
> > > > > > It makes sense, can you get some perf numbers to show what
> > > > > > kinds of workloads get impacted the most?  That would also
> > > > > > help us to figure out what kinds of speed improvements we can
> expect.
> > > > > >
> > > > > >
> > > > > > Amit
> > > > >
> > > > > I have picked up 6 workloads and got the following statistics
> > > > > numbers of every iteration (except the last stop-copy one) during
> precopy.
> > > > > These numbers are obtained with the basic precopy migration,
> > > > > without the capabilities like xbzrle or compression, etc. The
> > > > > network for the migration is exclusive, with a separate network for
> the workloads.
> > > > > They are both gigabit ethernet. I use qemu-2.5.1.
> > > > >
> > > > > Three (booting, idle, web server) of them converged to the
> > > > > stop-copy
> > > phase,
> > > > > with the given bandwidth and default downtime (300ms), while the
> > > > > other three (kernel compilation, zeusmp, memcached) did not.
> > > > >
> > > > > One page is "not-really-dirty", if it is written first and is
> > > > > sent later (and not written again after that) during one
> > > > > iteration. I guess this would not happen so often during the
> > > > > other iterations as during the 1st iteration. Because all the
> > > > > pages of the VM are sent to the dest node
> > > during
> > > > > the 1st iteration, while during the others, only part of the pages are
> sent.
> > > > > So I think the "not-really-dirty" pages should be produced
> > > > > mainly during the 1st iteration , and maybe very little during the 
> > > > > other
> iterations.
> > > > >
> > > > > If we could avoid resending the "not-really-dirty" pages,
> > > > > intuitively, I think the time spent on Iteration 2 would be
> > > > > halved. This is a chain
> > > reaction,
> > > > > because the dirty pages produced during Iteration 2 is halved,
> > > > > which
> > > incurs
> > > > > that the time spent on Iteration 3 is halved, then Iteration 4, 5...
> > > >
> > > > Yes; these numbers don't show how many of them are false dirty
> though.
> > > >
> > > > One problem is thinking about pages that have been redirtied, if
> > > > the page is
> > > dirtied
> > > > after the sync but before the network write then it's the
> > > > false-dirty that you're describing.
> > > >
> > > > However, if the page is being written a few times, and so it would
> > > > have
> > > been written
> > > > after the network write then it isn't a false-dirty.
> > > >
> > > > You might be able to figure that out with some kernel tracing of
> > > > when the
> > > dirtying
> > > > happens, but it might be easier to write the fix!
> > > >
> > > > Dave
> > >
> > > Hi, I have made some new progress now.
> > >
> > > To tell how many false dirty pages there are exactly in each
> > > iteration, I malloc a buffer in memory as big as the size of the
> > > whole VM memory. When a page is transferred to the dest node, it is
> > > copied to the buffer; During the next iteration, if one page is
> > > transferred, it is compared to the old one in the buffer, and the
> > > old one will be replaced for next comparison if it is really dirty.
> > > Thus, we are now able to get the exact number of false dirty pages.
> > >
> > > This time, I use 15 workloads to get the statistic number. They are:
> > >
> > >   1. 11 benchmarks picked up from cpu2006 benchmark suit. They are
> > > all scientific
> > >  computing workloads like Quantum Chromodynamics, Fluid Dynamics,
> etc.
> > > I pick
> > >  up these 11 benchmarks because compared to others, they have
> > > bigger memory
> > >  occupation and higher memory dirty rate. Thus most of them
> > > could not converge
> > >  to stop-and-copy using the default migration speed (32MB/s).
> > >   2. kernel compilation
> > >   3. idle VM
> > >   4. Apache web server which serves static content
> > >
> > >   (the above workloads are all running in VM with 1 vcpu and 1GB
> > > memory, and the
> > >migration speed is the default 32MB/s)
> > >
> > >   5. Memcached. The VM has 6 cpu cores and 6GB memory, and 4GB are
> > > used as the cache.
> > >  After filling up the 4GB cache, a client writes the cache at a 
> > > constant
> speed
> > >  during migration. This time, migration speed has no limit, and is up 
> > > to
> the
> > >  capability of 1Gbps Ethernet.
> > >
> > > Summarize the results first: (and you can read the precise number
> > > below)
> > >
> > >   1. 4 of these 15 workloads have a big proportion (>60%, even >80%
> > > during some iterations)
> > >  of false 

Re: [Qemu-devel] Migration dirty bitmap: should only mark pages as dirty after they have been sent

2016-11-03 Thread Li, Liang Z
> > > > > I think this is "very" wasteful. Assume the workload writes the pages
> dirty randomly within the guest address space, and the transfer speed is
> constant. Intuitively, I think nearly half of the dirty pages produced in
> Iteration 1 is not really dirty. This means the time of Iteration 2 is double 
> of
> that to send only really dirty pages.
> > > >
> > > > It makes sense, can you get some perf numbers to show what kinds of
> > > > workloads get impacted the most?  That would also help us to figure
> > > > out what kinds of speed improvements we can expect.
> > > >
> > > >
> > > > Amit
> > >
> > > I have picked up 6 workloads and got the following statistics numbers
> > > of every iteration (except the last stop-copy one) during precopy.
> > > These numbers are obtained with the basic precopy migration, without
> > > the capabilities like xbzrle or compression, etc. The network for the
> > > migration is exclusive, with a separate network for the workloads.
> > > They are both gigabit ethernet. I use qemu-2.5.1.
> > >
> > > Three (booting, idle, web server) of them converged to the stop-copy
> phase,
> > > with the given bandwidth and default downtime (300ms), while the other
> > > three (kernel compilation, zeusmp, memcached) did not.
> > >
> > > One page is "not-really-dirty", if it is written first and is sent later
> > > (and not written again after that) during one iteration. I guess this
> > > would not happen so often during the other iterations as during the 1st
> > > iteration. Because all the pages of the VM are sent to the dest node
> during
> > > the 1st iteration, while during the others, only part of the pages are 
> > > sent.
> > > So I think the "not-really-dirty" pages should be produced mainly during
> > > the 1st iteration , and maybe very little during the other iterations.
> > >
> > > If we could avoid resending the "not-really-dirty" pages, intuitively, I
> > > think the time spent on Iteration 2 would be halved. This is a chain
> reaction,
> > > because the dirty pages produced during Iteration 2 is halved, which
> incurs
> > > that the time spent on Iteration 3 is halved, then Iteration 4, 5...
> >
> > Yes; these numbers don't show how many of them are false dirty though.
> >
> > One problem is thinking about pages that have been redirtied, if the page is
> dirtied
> > after the sync but before the network write then it's the false-dirty that
> > you're describing.
> >
> > However, if the page is being written a few times, and so it would have
> been written
> > after the network write then it isn't a false-dirty.
> >
> > You might be able to figure that out with some kernel tracing of when the
> dirtying
> > happens, but it might be easier to write the fix!
> >
> > Dave
> 
> Hi, I have made some new progress now.
> 
> To tell how many false dirty pages there are exactly in each iteration, I 
> malloc
> a
> buffer in memory as big as the size of the whole VM memory. When a page
> is
> transferred to the dest node, it is copied to the buffer; During the next
> iteration,
> if one page is transferred, it is compared to the old one in the buffer, and 
> the
> old one will be replaced for next comparison if it is really dirty. Thus, we 
> are
> now
> able to get the exact number of false dirty pages.
> 
> This time, I use 15 workloads to get the statistic number. They are:
> 
>   1. 11 benchmarks picked up from cpu2006 benchmark suit. They are all
> scientific
>  computing workloads like Quantum Chromodynamics, Fluid Dynamics, etc.
> I pick
>  up these 11 benchmarks because compared to others, they have bigger
> memory
>  occupation and higher memory dirty rate. Thus most of them could not
> converge
>  to stop-and-copy using the default migration speed (32MB/s).
>   2. kernel compilation
>   3. idle VM
>   4. Apache web server which serves static content
> 
>   (the above workloads are all running in VM with 1 vcpu and 1GB memory,
> and the
>migration speed is the default 32MB/s)
> 
>   5. Memcached. The VM has 6 cpu cores and 6GB memory, and 4GB are used
> as the cache.
>  After filling up the 4GB cache, a client writes the cache at a constant 
> speed
>  during migration. This time, migration speed has no limit, and is up to 
> the
>  capability of 1Gbps Ethernet.
> 
> Summarize the results first: (and you can read the precise number below)
> 
>   1. 4 of these 15 workloads have a big proportion (>60%, even >80% during
> some iterations)
>  of false dirty pages out of all the dirty pages since iteration 2 (and 
> the big
>  proportion lasts during the following iterations). They are 
> cpu2006.zeusmp,
>  cpu2006.bzip2, cpu2006.mcf, and memcached.
>   2. 2 workloads (idle, webserver) spend most of the migration time on
> iteration 1, even
>  though the proportion of false dirty pages is big since iteration 2, the 
> space
> to
>  optimize is small.
>   3. 1 workload (kernel compilation) only have a big 

Re: [Qemu-devel] Migration dirty bitmap: should only mark pages as dirty after they have been sent

2016-11-03 Thread Li, Liang Z
> pages will be sent. Before that during the migration setup, the
> ioctl(KVM_GET_DIRTY_LOG) is called once, so the kernel begins to produce
> the dirty bitmap from this moment. When the pages "that haven't been
> sent" are written, the kernel space marks them as dirty. However I don't
> think this is correct, because these pages will be sent during this and the 
> next
> iterations with the same content (if they are not written again after they are
> sent). It only makes sense to mark the pages which have already been sent
> during one iteration as dirty when they are written.
> > > > > > >
> > > > > > >
> > > > > > > Am I right about this consideration? If I am right, is there some
> advice to improve this?
> > > > > >
> > > > > > I think you're right that this can happen; to clarify I think the
> > > > > > case you're talking about is:
> > > > > >
> > > > > >   Iteration 1
> > > > > > sync bitmap
> > > > > > start sending pages
> > > > > > page 'n' is modified - but hasn't been sent yet
> > > > > > page 'n' gets sent
> > > > > >   Iteration 2
> > > > > > sync bitmap
> > > > > >'page n is shown as modified'
> > > > > > send page 'n' again
> > > > > >
> > > > >
> > > > > Yes,this is right the case I am talking about.
> > > > >
> > > > > > So you're right that is wasteful; I guess it's more wasteful
> > > > > > on big VMs with slow networks where the length of each iteration
> > > > > > is large.
> > > > >
> > > > > I think this is "very" wasteful. Assume the workload writes the pages
> dirty randomly within the guest address space, and the transfer speed is
> constant. Intuitively, I think nearly half of the dirty pages produced in
> Iteration 1 is not really dirty. This means the time of Iteration 2 is double 
> of
> that to send only really dirty pages.
> > > >
> > > > It makes sense, can you get some perf numbers to show what kinds of
> > > > workloads get impacted the most?  That would also help us to figure
> > > > out what kinds of speed improvements we can expect.
> > > >
> > > >
> > > > Amit
> > >
> > > I have picked up 6 workloads and got the following statistics numbers
> > > of every iteration (except the last stop-copy one) during precopy.
> > > These numbers are obtained with the basic precopy migration, without
> > > the capabilities like xbzrle or compression, etc. The network for the
> > > migration is exclusive, with a separate network for the workloads.
> > > They are both gigabit ethernet. I use qemu-2.5.1.
> > >
> > > Three (booting, idle, web server) of them converged to the stop-copy
> phase,
> > > with the given bandwidth and default downtime (300ms), while the other
> > > three (kernel compilation, zeusmp, memcached) did not.
> > >
> > > One page is "not-really-dirty", if it is written first and is sent later
> > > (and not written again after that) during one iteration. I guess this
> > > would not happen so often during the other iterations as during the 1st
> > > iteration. Because all the pages of the VM are sent to the dest node
> during
> > > the 1st iteration, while during the others, only part of the pages are 
> > > sent.
> > > So I think the "not-really-dirty" pages should be produced mainly during
> > > the 1st iteration , and maybe very little during the other iterations.
> > >
> > > If we could avoid resending the "not-really-dirty" pages, intuitively, I
> > > think the time spent on Iteration 2 would be halved. This is a chain
> reaction,
> > > because the dirty pages produced during Iteration 2 is halved, which
> incurs
> > > that the time spent on Iteration 3 is halved, then Iteration 4, 5...
> >
> > Yes; these numbers don't show how many of them are false dirty though.
> >
> > One problem is thinking about pages that have been redirtied, if the page is
> dirtied
> > after the sync but before the network write then it's the false-dirty that
> > you're describing.
> >
> > However, if the page is being written a few times, and so it would have
> been written
> > after the network write then it isn't a false-dirty.
> >
> > You might be able to figure that out with some kernel tracing of when the
> dirtying
> > happens, but it might be easier to write the fix!
> >
> > Dave
> 
> Hi, I have made some new progress now.
> 
> To tell how many false dirty pages there are exactly in each iteration, I 
> malloc
> a
> buffer in memory as big as the size of the whole VM memory. When a page
> is
> transferred to the dest node, it is copied to the buffer; During the next
> iteration,
> if one page is transferred, it is compared to the old one in the buffer, and 
> the
> old one will be replaced for next comparison if it is really dirty. Thus, we 
> are
> now
> able to get the exact number of false dirty pages.
> 
> This time, I use 15 workloads to get the statistic number. They are:
> 
>   1. 11 benchmarks picked up from cpu2006 benchmark suit. They are all
> scientific
>  computing workloads like Quantum Chromodynamics, Fluid 

Re: [Qemu-devel] [RESEND PATCH v3 kernel 0/7] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-10-26 Thread Li, Liang Z
> On 10/26/2016 03:06 AM, Li, Liang Z wrote:
> > I am working on Dave's new bitmap schema, I have finished the part of
> > getting the 'hybrid scheme bitmap' and found the complexity was more
> > than I expected. The main issue is more memory is required to save the
> > 'hybrid scheme bitmap' beside that used to save the raw page bitmap,
> > for the worst case, the memory required is 3 times than that in the
> > previous implementation.
> 
> Really?  Could you please describe the scenario where this occurs?
> > I am wondering if I should continue, as an alternative solution, how
> > about using PFNs array when inflating/deflating only a few pages?
> > Things will be much more simple.
> 
> Yes, using pfn lists is more efficient than using bitmaps for sparse bitmaps.
> Yes, there will be cases where it is preferable to just use pfn lists vs. any 
> kind
> of bitmap.
> 
> But, what does it matter?  At least with your current scheme where we go
> out and collect get_unused_pages(), we do the allocation up front.  The
> space efficiency doesn't matter at all for small sizes since we do the 
> constant-
> size allocation *anyway*.
> 
> I'm also pretty sure you can pack the pfn and page order into a single 64-bit
> word and have no bitmap for a given record.  That would make it pack just as
> well as the old pfns alone.  Right?

Yes, thanks for the reminder. I am using 128 bits now; I will change it to 64
bits. Let me finish the v4 first.
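
Something along these lines (just a sketch; the exact bit split may still
change in v4):

#include <stdint.h>
#include <stdio.h>

/* Sketch: pack a pfn and its page order into one 64-bit record.  The low
 * 6 bits carry the order (enough for MAX_ORDER), the rest carry the pfn. */
#define ORDER_BITS 6
#define ORDER_MASK ((1ULL << ORDER_BITS) - 1)

static inline uint64_t pack_pfn_order(uint64_t pfn, unsigned int order)
{
	return (pfn << ORDER_BITS) | (order & ORDER_MASK);
}

static inline void unpack_pfn_order(uint64_t rec, uint64_t *pfn,
				    unsigned int *order)
{
	*order = (unsigned int)(rec & ORDER_MASK);
	*pfn = rec >> ORDER_BITS;
}

int main(void)
{
	uint64_t pfn;
	unsigned int order;

	unpack_pfn_order(pack_pfn_order(0x12345, 9), &pfn, &order);
	printf("pfn=%llx order=%u\n", (unsigned long long)pfn, order);
	return 0;
}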

Thanks!
Liang

 



Re: [Qemu-devel] [RESEND PATCH v3 kernel 0/7] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-10-26 Thread Li, Liang Z
> Cc: linux-ker...@vger.kernel.org; virtualizat...@lists.linux-foundation.org;
> linux...@kvack.org; virtio-...@lists.oasis-open.org; k...@vger.kernel.org;
> qemu-devel@nongnu.org; quint...@redhat.com; dgilb...@redhat.com;
> pbonz...@redhat.com; cornelia.h...@de.ibm.com; amit.s...@redhat.com
> Subject: Re: [RESEND PATCH v3 kernel 0/7] Extend virtio-balloon for fast
> (de)inflating & fast live migration
> 
> On 10/26/2016 03:13 AM, Li, Liang Z wrote:
> > 3 times memory required is not accurate, please ignore this. sorry ...
> > The complexity is the point.
> 
> What is making it so complex?  Can you describe the problems?

I plan to complete it first and send out the patch set, then we can discuss
whether it is worth it. I need some time.

Thanks!
Liang



Re: [Qemu-devel] [RESEND PATCH v3 kernel 0/7] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-10-26 Thread Li, Liang Z
> > On 10/20/2016 11:24 PM, Liang Li wrote:
> > > Dave Hansen suggested a new scheme to encode the data structure,
> > > because of additional complexity, it's not implemented in v3.
> >
> > So, what do you want done with this patch set?  Do you want it applied
> > as-is so that we can introduce a new host/guest ABI that we must
> > support until the end of time?  Then, we go back in a year or two and
> > add the newer format that addresses the deficiencies that this ABI has with
> a third version?
> >
> 
> Hi Dave & Michael,
> 
> I am working on Dave's new bitmap schema, I have finished the part of
> getting the 'hybrid scheme bitmap'
> and found the complexity was more than I expected.  The main issue is more
> memory is required to  save the 'hybrid scheme bitmap' beside that used to
> save the raw page bitmap, for the worst case, the memory required is 3
> times than that in the previous implementation.
> 

The '3 times the memory required' figure is not accurate, please ignore it.
Sorry ... The complexity is the point.

> I am wondering if I should continue, as an alternative solution, how about
> using PFNs array when inflating/deflating only a few pages? Things will be
> much more simple.
> 
> 
> Thanks!
> Liang
> 
> 
> 
> 
> -
> To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org




Re: [Qemu-devel] [RESEND PATCH v3 kernel 0/7] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-10-26 Thread Li, Liang Z
> On 10/20/2016 11:24 PM, Liang Li wrote:
> > Dave Hansen suggested a new scheme to encode the data structure,
> > because of additional complexity, it's not implemented in v3.
> 
> So, what do you want done with this patch set?  Do you want it applied as-is
> so that we can introduce a new host/guest ABI that we must support until
> the end of time?  Then, we go back in a year or two and add the newer
> format that addresses the deficiencies that this ABI has with a third version?
> 

Hi Dave & Michael,

I am working on Dave's new bitmap scheme. I have finished the part that builds
the 'hybrid scheme bitmap' and found the complexity was more than I expected.
The main issue is that more memory is required to save the 'hybrid scheme
bitmap' in addition to what is used to save the raw page bitmap; in the worst
case, the memory required is 3 times that of the previous implementation.

I am wondering if I should continue. As an alternative solution, how about
using a PFN array when inflating/deflating only a few pages? Things would be
much simpler.


Thanks!
Liang 






Re: [Qemu-devel] [RESEND PATCH v3 kernel 4/7] virtio-balloon: speed up inflate/deflate process

2016-10-25 Thread Li, Liang Z
> > +static inline void init_pfn_range(struct virtio_balloon *vb) {
> > +   vb->min_pfn = ULONG_MAX;
> > +   vb->max_pfn = 0;
> > +}
> > +
> > +static inline void update_pfn_range(struct virtio_balloon *vb,
> > +struct page *page)
> > +{
> > +   unsigned long balloon_pfn = page_to_balloon_pfn(page);
> > +
> > +   if (balloon_pfn < vb->min_pfn)
> > +   vb->min_pfn = balloon_pfn;
> > +   if (balloon_pfn > vb->max_pfn)
> > +   vb->max_pfn = balloon_pfn;
> > +}
> > +
> 
> rename to hint these are all bitmap related.

Will change in v4.

> 
> 
> >  static void tell_host(struct virtio_balloon *vb, struct virtqueue
> > *vq)  {
> > -   struct scatterlist sg;
> > -   unsigned int len;
> > +   struct scatterlist sg, sg2[BALLOON_BMAP_COUNT + 1];
> > +   unsigned int len, i;
> > +
> > +   if (virtio_has_feature(vb->vdev,
> VIRTIO_BALLOON_F_PAGE_BITMAP)) {
> > +   struct balloon_bmap_hdr *hdr = vb->bmap_hdr;
> > +   unsigned long bmap_len;
> > +   int nr_pfn, nr_used_bmap, nr_buf;
> > +
> > +   nr_pfn = vb->end_pfn - vb->start_pfn + 1;
> > +   nr_pfn = roundup(nr_pfn, BITS_PER_LONG);
> > +   nr_used_bmap = nr_pfn / PFNS_PER_BMAP;
> > +   bmap_len = nr_pfn / BITS_PER_BYTE;
> > +   nr_buf = nr_used_bmap + 1;
> > +
> > +   /* cmd, reserved and req_id are init to 0, unused here */
> > +   hdr->page_shift = cpu_to_virtio16(vb->vdev, PAGE_SHIFT);
> > +   hdr->start_pfn = cpu_to_virtio64(vb->vdev, vb->start_pfn);
> > +   hdr->bmap_len = cpu_to_virtio64(vb->vdev, bmap_len);
> > +   sg_init_table(sg2, nr_buf);
> > +   sg_set_buf([0], hdr, sizeof(struct balloon_bmap_hdr));
> > +   for (i = 0; i < nr_used_bmap; i++) {
> > +   unsigned int  buf_len = BALLOON_BMAP_SIZE;
> > +
> > +   if (i + 1 == nr_used_bmap)
> > +   buf_len = bmap_len - BALLOON_BMAP_SIZE
> * i;
> > +   sg_set_buf([i + 1], vb->page_bitmap[i],
> buf_len);
> > +   }
> >
> > -   sg_init_one(, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
> > +   while (vq->num_free < nr_buf)
> > +   msleep(2);
> 
> 
> What's going on here? Who is expected to update num_free?
> 

I just want to wait until the vq has enough space to write the bitmap. I
thought the QEMU side would update vq->num_free; is that wrong?

> 
> 
> > +   if (virtqueue_add_outbuf(vq, sg2, nr_buf, vb, GFP_KERNEL)
> == 0)
> > +   virtqueue_kick(vq);
> >
> > -   /* We should always be able to add one buffer to an empty queue.
> */
> > -   virtqueue_add_outbuf(vq, , 1, vb, GFP_KERNEL);
> > -   virtqueue_kick(vq);
> > +   } else {
> > +   sg_init_one(, vb->pfns, sizeof(vb->pfns[0]) * vb-
> >num_pfns);
> > +
> > +   /* We should always be able to add one buffer to an empty
> > +* queue. */
> 
> Pls use a multiple comment style consistent with kernel coding style.

Will change in next version.

> 
> > +   virtqueue_add_outbuf(vq, , 1, vb, GFP_KERNEL);
> > +   virtqueue_kick(vq);
> > +   }
> >
> > /* When host has read buffer, this completes via balloon_ack */
> > wait_event(vb->acked, virtqueue_get_buf(vq, )); @@ -138,13
> > +199,93 @@ static void set_page_pfns(struct virtio_balloon *vb,
> >   page_to_balloon_pfn(page) + i);  }
> >
> > -static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
> > +static void extend_page_bitmap(struct virtio_balloon *vb) {
> > +   int i;
> > +   unsigned long bmap_len, bmap_count;
> > +
> > +   bmap_len = ALIGN(get_max_pfn(), BITS_PER_LONG) /
> BITS_PER_BYTE;
> > +   bmap_count = bmap_len / BALLOON_BMAP_SIZE;
> > +   if (bmap_len % BALLOON_BMAP_SIZE)
> > +   bmap_count++;
> > +   if (bmap_count > BALLOON_BMAP_COUNT)
> > +   bmap_count = BALLOON_BMAP_COUNT;
> > +
> 
> This is doing simple things in tricky ways.
> Please use macros such as ALIGN and max instead of if.
> 

Will change.

> 
> > +   for (i = 1; i < bmap_count; i++) {
> 
> why 1?

One bitmap has already been allocated in the probe stage.

> 
> > +   vb->page_bitmap[i] = kmalloc(BALLOON_BMAP_SIZE,
> GFP_ATOMIC);
> 
> why GFP_ATOMIC?

Yes, GFP_ATOMIC is not necessary.

> and what will free the previous buffer?

The previous buffer will not be freed.

> 
> 
> > +   if (vb->page_bitmap[i])
> > +   vb->nr_page_bmap++;
> > +   else
> > +   break;
> 
> and what will happen then?

I plan to use the previously allocated buffer to save the bitmap. Is more code
needed to handle kmalloc() failure?

> > -static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
> > +static unsigned int leak_balloon(struct virtio_balloon *vb, size_t num,
> > +   bool use_bmap)
> 
> this is just a feature bit - why not get it internally?

Indeed.

> > @@ -218,8 +374,14 @@ static unsigned 

Re: [Qemu-devel] [RESEND PATCH v3 kernel 3/7] mm: add a function to get the max pfn

2016-10-24 Thread Li, Liang Z
> On 10/20/2016 11:24 PM, Liang Li wrote:
> > Expose the function to get the max pfn, so it can be used in the
> > virtio-balloon device driver. Simply include the 'linux/bootmem.h'
> > is not enough, if the device driver is built to a module, directly
> > refer the max_pfn lead to build failed.
> 
> I'm not sure the rest of the set is worth reviewing.  I think a lot of it will
> change pretty fundamentally once you have those improved data structures
> in place.

That's true. I will send out the v4 as soon as possible.

Liang



Re: [Qemu-devel] [RESEND PATCH v3 kernel 2/7] virtio-balloon: define new feature bit and page bitmap head

2016-10-24 Thread Li, Liang Z
> On 10/20/2016 11:24 PM, Liang Li wrote:
> > Add a new feature which supports sending the page information with a
> > bitmap. The current implementation uses PFNs array, which is not very
> > efficient. Using bitmap can improve the performance of
> > inflating/deflating significantly
> 
> Why is it not efficient?  How is using a bitmap more efficient?  What kinds of
> cases is the bitmap inefficient?
> 
> > The page bitmap header will used to tell the host some information
> > about the page bitmap. e.g. the page size, page bitmap length and
> > start pfn.
> 
> Why did you choose to add these features to the structure?  What benefits
> do they add?
> 
> Could you describe your solution a bit here, and describe its strengths and
> weaknesses?
> 

I will elaborate on the solution in V4.

> >  /* Size of a PFN in the balloon interface. */  #define
> > VIRTIO_BALLOON_PFN_SHIFT 12 @@ -82,4 +83,22 @@ struct
> > virtio_balloon_stat {
> > __virtio64 val;
> >  } __attribute__((packed));
> >
> > +/* Page bitmap header structure */
> > +struct balloon_bmap_hdr {
> > +   /* Used to distinguish different request */
> > +   __virtio16 cmd;
> > +   /* Shift width of page in the bitmap */
> > +   __virtio16 page_shift;
> > +   /* flag used to identify different status */
> > +   __virtio16 flag;
> > +   /* Reserved */
> > +   __virtio16 reserved;
> > +   /* ID of the request */
> > +   __virtio64 req_id;
> > +   /* The pfn of 0 bit in the bitmap */
> > +   __virtio64 start_pfn;
> > +   /* The length of the bitmap, in bytes */
> > +   __virtio64 bmap_len;
> > +};
> 
> FWIW this is totally unreadable.  Please do something like this:
> 
> > +struct balloon_bmap_hdr {
> > +   __virtio16 cmd; /* Used to distinguish different ...
> > +   __virtio16 page_shift;  /* Shift width of page in the bitmap */
> > +   __virtio16 flag;/* flag used to identify different...
> > +   __virtio16 reserved;/* Reserved */
> > +   __virtio64 req_id;  /* ID of the request */
> > +   __virtio64 start_pfn;   /* The pfn of 0 bit in the bitmap */
> > +   __virtio64 bmap_len;/* The length of the bitmap, in bytes */
> > +};
> 
> and please make an effort to add useful comments.  "/* Reserved */"
> seems like a waste of bytes to me.

OK. Maybe 'padding' is better than 'reserved'.

Thanks for your comments!

Liang



Re: [Qemu-devel] [RESEND PATCH v3 kernel 1/7] virtio-balloon: rework deflate to add page to a list

2016-10-24 Thread Li, Liang Z
> On 10/20/2016 11:24 PM, Liang Li wrote:
> > Will allow faster notifications using a bitmap down the road.
> > balloon_pfn_to_page() can be removed because it's useless.
> 
> This is a pretty terse description of what's going on here.  Could you try to
> elaborate a bit?  What *is* the current approach?  Why does it not work
> going forward?  What do you propose instead?  Why is it better?

Sure. The description will be clearer if it's written as you suggest.
Thanks!

Liang 



Re: [Qemu-devel] [RESEND PATCH v3 kernel 0/7] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-10-23 Thread Li, Liang Z
> On Fri, Oct 21, 2016 at 10:25:21AM -0700, Dave Hansen wrote:
> > On 10/20/2016 11:24 PM, Liang Li wrote:
> > > Dave Hansen suggested a new scheme to encode the data structure,
> > > because of additional complexity, it's not implemented in v3.
> >
> > So, what do you want done with this patch set?  Do you want it applied
> > as-is so that we can introduce a new host/guest ABI that we must
> > support until the end of time?  Then, we go back in a year or two and
> > add the newer format that addresses the deficiencies that this ABI has
> > with a third version?
> >
> 
> Exactly my questions.

Hi Dave & Michael,

In V2, both of you thought that the memory I allocated for the bitmap was too
large and gave some suggestions for a solution, so I changed the implementation
and used scattered pages for the bitmap instead of a large physically contiguous
region. I didn't get any comments on those changes, so I am not sure whether
they are OK or not; that's why I resent the V3. I just want your opinions about
that part.

I will implement the new scheme as Dave suggested in V4. Before that, could you
take a look at this version and give some comments?

Thanks!
Liang




Re: [Qemu-devel] [PATCH v3 kernel 0/7] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-08-31 Thread Li, Liang Z
> Subject: Re: [PATCH v3 kernel 0/7] Extend virtio-balloon for fast 
> (de)inflating
> & fast live migration
> 
> 2016-08-08 14:35 GMT+08:00 Liang Li :
> > This patch set contains two parts of changes to the virtio-balloon.
> >
> > One is the change for speeding up the inflating & deflating process,
> > the main idea of this optimization is to use bitmap to send the page
> > information to host instead of the PFNs, to reduce the overhead of
> > virtio data transmission, address translation and madvise(). This can
> > help to improve the performance by about 85%.
> >
> > Another change is for speeding up live migration. By skipping process
> > guest's free pages in the first round of data copy, to reduce needless
> > data processing, this can help to save quite a lot of CPU cycles and
> > network bandwidth. We put guest's free page information in bitmap and
> > send it to host with the virt queue of virtio-balloon. For an idle 8GB
> > guest, this can help to shorten the total live migration time from
> > 2Sec to about 500ms in the 10Gbps network environment.
> 
> I just read the slides of this feature for recent kvm forum, the cloud
> providers more care about live migration downtime to avoid customers'
> perception than total time, however, this feature will increase downtime
> when acquire the benefit of reducing total time, maybe it will be more
> acceptable if there is no downside for downtime.
> 
> Regards,
> Wanpeng Li

In theory, there is no factor that will increase the downtime. There is no
additional operation and no extra data copy during the stop-and-copy stage. But
in the test the downtime does increase, and this can be reproduced. I think the
busy network link may be the reason: with this optimization, a huge amount of
data is written to the socket in a shorter time, so some of the write operations
may need to wait. Without this optimization, zero-page checking takes more time
and the network is not as busy.

If the guest is not an idle one, I think the gap in downtime will not be so
obvious. Anyway, the downtime is still less than the max_downtime set by the
user.

Thanks!
Liang


Re: [Qemu-devel] [PATCH v3 kernel 0/7] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-08-31 Thread Li, Liang Z
Hi Michael,

I know you are very busy. If you have time, could you help to take a look at 
this patch set?

Thanks!
Liang

> -Original Message-
> From: Li, Liang Z
> Sent: Thursday, August 18, 2016 9:06 AM
> To: Michael S. Tsirkin
> Cc: virtualizat...@lists.linux-foundation.org; linux...@kvack.org; virtio-
> d...@lists.oasis-open.org; k...@vger.kernel.org; qemu-devel@nongnu.org;
> quint...@redhat.com; dgilb...@redhat.com; Hansen, Dave; linux-
> ker...@vger.kernel.org
> Subject: RE: [PATCH v3 kernel 0/7] Extend virtio-balloon for fast 
> (de)inflating
> & fast live migration
> 
> Hi Michael,
> 
> Could you help to review this version when you have time?
> 
> Thanks!
> Liang
> 
> > -Original Message-
> > From: Li, Liang Z
> > Sent: Monday, August 08, 2016 2:35 PM
> > To: linux-ker...@vger.kernel.org
> > Cc: virtualizat...@lists.linux-foundation.org; linux...@kvack.org;
> > virtio- d...@lists.oasis-open.org; k...@vger.kernel.org;
> > qemu-devel@nongnu.org; quint...@redhat.com; dgilb...@redhat.com;
> > Hansen, Dave; Li, Liang Z
> > Subject: [PATCH v3 kernel 0/7] Extend virtio-balloon for fast
> > (de)inflating & fast live migration
> >
> > This patch set contains two parts of changes to the virtio-balloon.
> >
> > One is the change for speeding up the inflating & deflating process,
> > the main idea of this optimization is to use bitmap to send the page
> > information to host instead of the PFNs, to reduce the overhead of
> > virtio data transmission, address translation and madvise(). This can
> > help to improve the performance by about 85%.
> >
> > Another change is for speeding up live migration. By skipping process
> > guest's free pages in the first round of data copy, to reduce needless
> > data processing, this can help to save quite a lot of CPU cycles and
> > network bandwidth. We put guest's free page information in bitmap and
> > send it to host with the virt queue of virtio-balloon. For an idle 8GB
> > guest, this can help to shorten the total live migration time from
> > 2Sec to about 500ms in the 10Gbps network environment.
> >
> > Dave Hansen suggested a new scheme to encode the data structure,
> > because of additional complexity, it's not implemented in v3.
> >
> > Changes from v2 to v3:
> > * Change the name of 'free page' to 'unused page'.
> > * Use the scatter & gather bitmap instead of a 1MB page bitmap.
> > * Fix overwriting the page bitmap after kicking.
> > * Some of MST's comments for v2.
> >
> > Changes from v1 to v2:
> > * Abandon the patch for dropping page cache.
> > * Put some structures to uapi head file.
> > * Use a new way to determine the page bitmap size.
> > * Use a unified way to send the free page information with the bitmap
> > * Address the issues referred in MST's comments
> >
> >
> > Liang Li (7):
> >   virtio-balloon: rework deflate to add page to a list
> >   virtio-balloon: define new feature bit and page bitmap head
> >   mm: add a function to get the max pfn
> >   virtio-balloon: speed up inflate/deflate process
> >   mm: add the related functions to get unused page
> >   virtio-balloon: define feature bit and head for misc virt queue
> >   virtio-balloon: tell host vm's unused page info
> >
> >  drivers/virtio/virtio_balloon.c | 390
> > 
> >  include/linux/mm.h  |   3 +
> >  include/uapi/linux/virtio_balloon.h |  41 
> >  mm/page_alloc.c |  94 +
> >  4 files changed, 485 insertions(+), 43 deletions(-)
> >
> > --
> > 1.8.3.1




Re: [Qemu-devel] [PATCH v3 kernel 0/7] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-08-17 Thread Li, Liang Z
Hi Michael,

Could you help to review this version when you have time? 

Thanks!
Liang

> -Original Message-
> From: Li, Liang Z
> Sent: Monday, August 08, 2016 2:35 PM
> To: linux-ker...@vger.kernel.org
> Cc: virtualizat...@lists.linux-foundation.org; linux...@kvack.org; virtio-
> d...@lists.oasis-open.org; k...@vger.kernel.org; qemu-devel@nongnu.org;
> quint...@redhat.com; dgilb...@redhat.com; Hansen, Dave; Li, Liang Z
> Subject: [PATCH v3 kernel 0/7] Extend virtio-balloon for fast (de)inflating &
> fast live migration
> 
> This patch set contains two parts of changes to the virtio-balloon.
> 
> One is the change for speeding up the inflating & deflating process, the main
> idea of this optimization is to use bitmap to send the page information to
> host instead of the PFNs, to reduce the overhead of virtio data transmission,
> address translation and madvise(). This can help to improve the performance
> by about 85%.
> 
> Another change is for speeding up live migration. By skipping process guest's
> free pages in the first round of data copy, to reduce needless data 
> processing,
> this can help to save quite a lot of CPU cycles and network bandwidth. We
> put guest's free page information in bitmap and send it to host with the virt
> queue of virtio-balloon. For an idle 8GB guest, this can help to shorten the
> total live migration time from 2Sec to about 500ms in the 10Gbps network
> environment.
> 
> Dave Hansen suggested a new scheme to encode the data structure,
> because of additional complexity, it's not implemented in v3.
> 
> Changes from v2 to v3:
> * Change the name of 'free page' to 'unused page'.
> * Use the scatter & gather bitmap instead of a 1MB page bitmap.
> * Fix overwriting the page bitmap after kicking.
> * Some of MST's comments for v2.
> 
> Changes from v1 to v2:
> * Abandon the patch for dropping page cache.
> * Put some structures to uapi head file.
> * Use a new way to determine the page bitmap size.
> * Use a unified way to send the free page information with the bitmap
> * Address the issues referred in MST's comments
> 
> 
> Liang Li (7):
>   virtio-balloon: rework deflate to add page to a list
>   virtio-balloon: define new feature bit and page bitmap head
>   mm: add a function to get the max pfn
>   virtio-balloon: speed up inflate/deflate process
>   mm: add the related functions to get unused page
>   virtio-balloon: define feature bit and head for misc virt queue
>   virtio-balloon: tell host vm's unused page info
> 
>  drivers/virtio/virtio_balloon.c | 390
> 
>  include/linux/mm.h  |   3 +
>  include/uapi/linux/virtio_balloon.h |  41 
>  mm/page_alloc.c |  94 +
>  4 files changed, 485 insertions(+), 43 deletions(-)
> 
> --
> 1.8.3.1




Re: [Qemu-devel] [PATCH V2] add migration capability to bypass the shared memory

2016-08-12 Thread Li, Liang Z
> >> > BTW. Is it possible to bypass the shared block in the
> >> 'ram_find_and_save_block'?
> >> > I mean no to check if a page is dirty for a shared block, it may
> >> > make things
> >> faster.
> >>
> >> Nice spotted.  That would make things faster.  But once there we
> >> could split the bitmap by ramblock, and the migration_dirty_pages
> >> also per block, that would make all the searchs faster, and
> >> adding/removing blocks of ram much easier.  If we enter optimizing
> >> that function, we could really do much better that the "again" parameter
> that we have right now.
> >>
> >> Later, Juan.
> 
> 
> This patch focuses on the core thing: adding the ability to bypass migrating
> the memory.
> I don't want to add further optimization for it now.
> I hope this patch merged solo and earlier.
> optimization patches can be sent after this patch accepted.
> 

Will you send a v3?

Liang



Re: [Qemu-devel] [PATCH V2] add migration capability to bypass the shared memory

2016-08-11 Thread Li, Liang Z
> >>
> >> I might have missed something, could you tell me more?
> >>
> >> void bitmap_set(unsigned long *map, long start, long nr); I think the
> >> @start and @nr are both the number of the bits.
> >>
> >> thanks,
> >> Lai
> >
> > You are right,  I have make a mistake by checking the code. Sorry for the
> noise.
> >
> > BTW. Is it possible to bypass the shared block in the
> 'ram_find_and_save_block'?
> > I mean no to check if a page is dirty for a shared block, it may make things
> faster.
> 
> Nice spotted.  That would make things faster.  But once there we could split
> the bitmap by ramblock, and the migration_dirty_pages also per block, that
> would make all the searchs faster, and adding/removing blocks of ram much
> easier.  If we enter optimizing that function, we could really do much better
> that the "again" parameter that we have right now.
> 
> Later, Juan.

I did some experiments with an 8GB VM:
Total time: 93ms
Setup: 70ms
Downtime: 6ms

So most of the time is spent in the Setup stage, and most of the Setup stage is
spent starting dirty page logging.
How about not starting dirty page logging for shared blocks? Then it should be
possible to shorten the total time to about 30ms.

Liang








Re: [Qemu-devel] [PATCH repost] virtio-balloon: Remove needless precompiled directive

2016-08-10 Thread Li, Liang Z
> On Tue, Aug 09, 2016 at 08:30:42AM +0800, Liang Li wrote:
> > Since there in wrapper around madvise(), the virtio-balloon code is
> > able to work without the precompiled directive, the directive can be
> > removed.
> >
> > Signed-off-by: Liang Li 
> > Suggested-by: Thomas Huth 
> > Reviewd-by: Dr. David Alan Gilbert 
> 
> Sorry about missing this earlier.
> I'm inclined to merge this past 2.7 now.
> Sorry about the delay.
> Can you pls repost after 2.7 is out?
> 
> Thanks!

Of course, I will repost then.

Liang



Re: [Qemu-devel] [PATCH] migration: fix live migration failure with compression

2016-08-10 Thread Li, Liang Z
> Subject: Re: [PATCH] migration: fix live migration failure with compression
> 
> * Liang Li (liang.z...@intel.com) wrote:
> > Because of commit 11808bb0c422, which remove some condition checks of
> > 'f->ops->writev_buffer', 'qemu_put_qemu_file' should be enhanced to
> > clear the 'f_src->iovcnt', or 'f_src->iovcnt' may exceed the
> > MAX_IOV_SIZE which will break live migration. This should be fixed.
> >
> > Signed-off-by: Liang Li 
> > Reported-by: Jinshi Zhang 
> > ---
> >  migration/qemu-file.c | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/migration/qemu-file.c b/migration/qemu-file.c index
> > bbc565e..e9fae31 100644
> > --- a/migration/qemu-file.c
> > +++ b/migration/qemu-file.c
> > @@ -668,6 +668,7 @@ int qemu_put_qemu_file(QEMUFile *f_des,
> QEMUFile *f_src)
> >  len = f_src->buf_index;
> >  qemu_put_buffer(f_des, f_src->buf, f_src->buf_index);
> >  f_src->buf_index = 0;
> > +f_src->iovcnt = 0;
> >  }
> 
> If you're just using the buf[] in the src, how does it end up incrementing the
> iovcnt?
> 
> Dave
> 
> >  return len;
> >  }

'qemu_put_be32' is used to put some data into an allocated 'f_src'. Before
commit 11808bb0c422, this operation did not increase 'f_src->iovcnt', so there
was no issue. Commit 11808bb0c422 removed the check of 'f->ops->writev_buffer',
so now 'qemu_put_be32' increases 'f_src->iovcnt' and fills in 'f_src->iov[]'.
Once 'f_src->iovcnt' reaches MAX_IOV_SIZE, 'qemu_fflush' is triggered. Because
'f_src' is not writable, 'qemu_fflush' returns without resetting
'f_src->iovcnt', so the following 'qemu_put_be32' on 'f_src' increases
'f_src->iovcnt' beyond MAX_IOV_SIZE, and the write to 'f_src->iov[]' then
corrupts memory.

Liang




Re: [Qemu-devel] [PATCH V2] add migration capability to bypass the shared memory

2016-08-09 Thread Li, Liang Z
> On Wed, Aug 10, 2016 at 10:22 AM, Li, Liang Z <liang.z...@intel.com> wrote:
> > Hi Jiangshan,
> >
> > Glad to see your patch. It's a simple implementation which could provide
> very useful functions.
> >
> >> +static void migration_bitmap_init(unsigned long *bitmap) {
> >> +RAMBlock *block;
> >> +
> >> +bitmap_clear(bitmap, 0, last_ram_offset() >> TARGET_PAGE_BITS);
> >> +rcu_read_lock();
> >> +QLIST_FOREACH_RCU(block, _list.blocks, next) {
> >> +if (!migrate_bypass_shared_memory()
> >> || !qemu_ram_is_shared(block)) {
> >> +bitmap_set(bitmap, block->offset >> TARGET_PAGE_BITS,
> >
> > You should use (block->offset >> TARGET_PAGE_BITS )/ BITS_PER_LONG
> here.
> 
> Hello, Li
> 
> I might have missed something, could you tell me more?
> 
> void bitmap_set(unsigned long *map, long start, long nr); I think the @start
> and @nr are both the number of the bits.
> 
> thanks,
> Lai

You are right, I made a mistake when checking the code. Sorry for the noise.

BTW, is it possible to bypass the shared blocks in 'ram_find_and_save_block'?
I mean not checking whether a page is dirty for a shared block; it may make
things faster.
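
A minimal sketch of what I have in mind, inside the RAMBlock loop of
ram_find_and_save_block() (hedged: just an illustration, the real patch would
also have to keep the "again"/wrap-around bookkeeping consistent):

    if (migrate_bypass_shared_memory() && qemu_ram_is_shared(block)) {
        /* nothing in a shared block is migrated, so skip its dirty bitmap */
        block = QLIST_NEXT_RCU(block, next);
        if (!block) {
            block = QLIST_FIRST_RCU(&ram_list.blocks);
        }
        continue;
    }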

Liang



Re: [Qemu-devel] [PATCH V2] add migration capability to bypass the shared memory

2016-08-09 Thread Li, Liang Z
Hi Jiangshan,

Glad to see your patch. It's a simple implementation which could provide very 
useful functions.

> +static void migration_bitmap_init(unsigned long *bitmap) {
> +RAMBlock *block;
> +
> +bitmap_clear(bitmap, 0, last_ram_offset() >> TARGET_PAGE_BITS);
> +rcu_read_lock();
> +QLIST_FOREACH_RCU(block, _list.blocks, next) {
> +if (!migrate_bypass_shared_memory()
> || !qemu_ram_is_shared(block)) {
> +bitmap_set(bitmap, block->offset >> TARGET_PAGE_BITS,

You should use (block->offset >> TARGET_PAGE_BITS )/ BITS_PER_LONG here.

> +   block->used_length >> TARGET_PAGE_BITS);
> +
> +/*
> + * Count the total number of pages used by ram blocks not 
> including
> + * any gaps due to alignment or unplugs.
> + */
> +migration_dirty_pages += block->used_length >>
> TARGET_PAGE_BITS;
> +}
> +}
> +rcu_read_unlock();
>  }

Liang


Re: [Qemu-devel] [PATCH v3 kernel 0/7] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-08-08 Thread Li, Liang Z
> Subject: Re: [PATCH v3 kernel 0/7] Extend virtio-balloon for fast 
> (de)inflating
> & fast live migration
> 
> On 08/07/2016 11:35 PM, Liang Li wrote:
> > Dave Hansen suggested a new scheme to encode the data structure,
> > because of additional complexity, it's not implemented in v3.
> 
> FWIW, I don't think it takes any additional complexity here, at least in the
> guest implementation side.  The thing I suggested would just mean explicitly
> calling out that there was a single bitmap instead of implying it in the ABI.
> 
> Do you think the scheme I suggested is the way to go?

Yes, I think so, and I will do that in a later version. In this v3, I just want
to solve the issue caused by the large page bitmap in v2.

Liang



Re: [Qemu-devel] [PATCH v2] balloon: Fix failure of updating guest memory status

2016-08-08 Thread Li, Liang Z
> >
> > Hi Liang,
> >
> > I should be able to look into it this week if you help me with testing.
> >
> > Thanks,
> > Ladi
> 
> Please try the attached patch. I have tested it with a simple 'migrate' to 
> save
> the state and then '-incoming' to load it back.
> 

Thanks for your work. I have already seen your v3 patch, so I will test that one.

> One question for you: is it expected that stats_poll_interval is not preserved
> by save/load? I had to explicitly set guest-stats-polling-interval on the

Yes, I found that too, and I am not quite sure about it. IMO,
'stats_poll_interval' should be treated as part of the device state, just like
'stats_vq_elem', but neither was included from the very beginning.
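
Just to make the idea concrete, a hedged sketch of what treating it as device
state could look like (illustrative only, not how the balloon device is
migrated today, and ignoring the compatibility question for the moment):

static const VMStateDescription vmstate_virtio_balloon_device = {
    .name = "virtio-balloon-device",
    .version_id = 1,
    .minimum_version_id = 1,
    .fields = (VMStateField[]) {
        VMSTATE_UINT32(num_pages, VirtIOBalloon),
        VMSTATE_UINT32(actual, VirtIOBalloon),
        /* assumed addition: keep the stats polling interval across migration */
        VMSTATE_INT64(stats_poll_interval, VirtIOBalloon),
        VMSTATE_END_OF_LIST()
    },
};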

> receiving VM to start getting stats again. It's also the reason why the new
> virtio_balloon_receive_stats call is not under if
> (balloon_stats_enabled(s)) because this condition always evaluates to false
> for me.

Thanks!
Liang


Re: [Qemu-devel] [virtio-dev] Re: [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate process

2016-08-01 Thread Li, Liang Z
> > It's only small because it makes you rescan the free list.
> > So maybe you should do something else.
> > I looked at it a bit. Instead of scanning the free list, how about
> > scanning actual page structures? If page is unused, pass it to host.
> > Solves the problem of rescanning multiple times, does it not?
> 
> FWIW, I think the new data structure needs some work.
> 
> Before, we had a potentially very long list of 4k areas.  Now, we've just got 
> a
> very large bitmap.  The bitmap might not even be very dense if we are
> ballooning relatively few things.
> 
> Can I suggest an alternate scheme?  I think you actually need a hybrid
> scheme that has bitmaps but also allows more flexibility in the pfn ranges.
> The payload could be a number of records each containing 3 things:
> 
>   pfn, page order, length of bitmap (maybe in powers of 2)
> 
> Each record is followed by the bitmap.  Or, if the bitmap length is 0,
> immediately followed by another record.  A bitmap length of 0 implies a
> bitmap with the least significant bit set.  Page order specifies how many
> pages each bit represents.
> 
> This scheme could easily encode the new data structure you are proposing
> by just setting pfn=0, order=0, and a very long bitmap length.  But, it could
> handle sparse bitmaps much better *and* represent large pages much more
> efficiently.
> 
> There's plenty of space to fit a whole record in 64 bits.

I like your idea: it's more flexible, and it's very useful if we want to
optimize the page allocation stage further. I believe memory fragmentation will
not be very serious, so the performance won't be too bad in the worst case.
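
To make sure I understand the layout, here is a quick sketch of how such a
record could be packed; the field widths are my own assumptions, the only fixed
points being that the whole record fits in 64 bits and that a bitmap of
'bmap_len' words (possibly none) follows it:

#define FP_ORDER_BITS     6   /* page order each bitmap bit stands for      */
#define FP_BMAP_LEN_BITS 12   /* length of the trailing bitmap, 0 = one bit */

static inline u64 encode_free_page_record(u64 pfn, unsigned int order,
                                          unsigned int bmap_len)
{
    return (pfn << (FP_ORDER_BITS + FP_BMAP_LEN_BITS)) |
           ((u64)order << FP_BMAP_LEN_BITS) | bmap_len;
}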

Thanks!
Liang




Re: [Qemu-devel] [PATCH v2] balloon: Fix failure of updating guest memory status

2016-08-01 Thread Li, Liang Z
> On Wed, Jul 06, 2016 at 12:49:06PM +0000, Li, Liang Z wrote:
> > > > > > After live migration, 'guest-stats' can't get the expected
> > > > > > memory status in the guest. This issue is caused by commit
> 4eae2a657d.
> > > > > > The value of 's->stats_vq_elem' will be NULL after live
> > > > > > migration, and the check in the function
> > > > > > 'balloon_stats_poll_cb()' will prevent the 'virtio_notify()'
> > > > > > from executing. So guest will not update the memory status.
> > > > > >
> > > > > > Commit 4eae2a657d is doing the right thing, but 's->stats_vq_elem'
> > > > > > should be treated as part of balloon device state and migrated
> > > > > > to destination if it's not NULL to make everything works well.
> > > > > >
> > > > > > Signed-off-by: Liang Li <liang.z...@intel.com>
> > > > > > Suggested-by: Paolo Bonzini <pbonz...@redhat.com>
> > > > > > Cc: Michael S. Tsirkin <m...@redhat.com>
> > > > > > Cc: Ladi Prosek <lpro...@redhat.com>
> > > > > > Cc: Paolo Bonzini <pbonz...@redhat.com>
> > > > >
> > > > > I agree there's an issue but we don't change versions anymore.
> > > > > Breaking migrations for everyone is also not nice.
> > > > >
> > > > > How about queueing virtio_balloon_receive_stats so it will get
> > > > > invoked when vm starts?
> > > > >
> > > >
> > > > Could you give more explanation about how it works?  I can't catch you.
> > > >
> > > > Thanks!
> > > > Liang
> > >
> > > virtqueue_discard before migration
> > >
> > > virtio_balloon_receive_stats after migration
> > >
> >
> > Sorry, I still can't catch you. Maybe it's easier for you to submit a
> > patch than writing a lot a words to make me understand your idea.
> 
> I'm rather busy now.  I might look into it towards end of the month.
> 
> > I just don't understand why not to use the version to make things
> > easier, is that not the original intent of version id?
> 
> This was the original idea but we stopped using version ids since they have
> many shortcomings.
> 
> > If we want to extend the device and more states are needed, the idea
> > you suggest can be used as a common solution?
> >
> > Thanks!
> > Liang
> 
> The idea is to try to avoid adding more state. that's not always possible but 
> in
> this case element was seen but not consumed yet, so it should be possible
> for destination to simply get it from the VQ again.
> 
> > > --
> > > MST

Hi Michael,

Have you had time to look into this issue recently?

Thanks!
Liang



Re: [Qemu-devel] [QEMU v2 1/9] virtio-balloon: Remove needless precompiled directive

2016-08-01 Thread Li, Liang Z
> * Liang Li (liang.z...@intel.com) wrote:
> > Since there in wrapper around madvise(), the virtio-balloon code is
> > able to work without the precompiled directive, the directive can be
> > removed.
> >
> > Signed-off-by: Liang Li 
> > Suggested-by: Thomas Huth 
> 
> This one could be posted separately.
> 
> Reviewed-by: Dr. David Alan Gilbert 
> 

I will send a separate one. Thanks, David!

Liang




Re: [Qemu-devel] [virtio-dev] Re: [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate process

2016-07-28 Thread Li, Liang Z
> > > On Wed, Jul 27, 2016 at 09:03:21AM -0700, Dave Hansen wrote:
> > > > On 07/26/2016 06:23 PM, Liang Li wrote:
> > > > > + vb->pfn_limit = VIRTIO_BALLOON_PFNS_LIMIT;
> > > > > + vb->pfn_limit = min(vb->pfn_limit, get_max_pfn());
> > > > > + vb->bmap_len = ALIGN(vb->pfn_limit, BITS_PER_LONG) /
> > > > > +  BITS_PER_BYTE + 2 * sizeof(unsigned long);
> > > > > + hdr_len = sizeof(struct balloon_bmap_hdr);
> > > > > + vb->bmap_hdr = kzalloc(hdr_len + vb->bmap_len,
> GFP_KERNEL);
> > > >
> > > > This ends up doing a 1MB kmalloc() right?  That seems a _bit_ big.
> > > > How big was the pfn buffer before?
> > >
> > >
> > > Yes I would limit this to 1G memory in a go, will result in a 32KByte 
> > > bitmap.
> > >
> > > --
> > > MST
> >
> > Limit to 1G is bad for the performance, I sent you the test result several
> weeks ago.
> >
> > Paste it bellow:
> > --
> > --
> > About the size of page bitmap, I have test the performance of filling
> > the balloon to 15GB with a  16GB RAM VM.
> >
> > ===
> > 32K Byte (cover 1GB of RAM)
> >
> > Time spends on inflating: 2031ms
> > -
> > 64K Byte (cover 2GB of RAM)
> >
> > Time spends on inflating: 1507ms
> > 
> > 512K Byte (cover 16GB of RAM)
> >
> > Time spends on inflating: 1237ms
> > 
> >
> > If possible, a big bitmap is better for performance.
> >
> > Liang
> 
> Earlier you said:
> a. allocating pages (6.5%)
> b. sending PFNs to host (68.3%)
> c. address translation (6.1%)
> d. madvise (19%)
> 
> Here sending PFNs to host with 512K Byte map should be almost free.
> 
> So is something else taking up the time?
> 
I just want to show you the benefits of using a big bitmap. :)
I did not measure the time spent on each stage after the optimization (I will
do it later), but I have tried allocating pages in big chunks and found it can
make things faster. Without big-chunk allocation the performance improvement is
about 85%, and with big-chunk allocation the improvement is about 94%.

Liang

> 
> --
> MST



Re: [Qemu-devel] [virtio-dev] Re: [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate process

2016-07-28 Thread Li, Liang Z
> On Thu, Jul 28, 2016 at 06:36:18AM +0000, Li, Liang Z wrote:
> > > > > This ends up doing a 1MB kmalloc() right?  That seems a _bit_ big.
> > > > > How big was the pfn buffer before?
> > > >
> > > > Yes, it is if the max pfn is more than 32GB.
> > > > The size of the pfn buffer use before is 256*4 = 1024 Bytes, it's
> > > > too small, and it's the main reason for bad performance.
> > > > Use the max 1MB kmalloc is a balance between performance and
> > > > flexibility, a large page bitmap covers the range of all the
> > > > memory is no good for a system with huge amount of memory. If the
> > > > bitmap is too small, it means we have to traverse a long list for
> > > > many times, and it's bad
> > > for performance.
> > > >
> > > > Thanks!
> > > > Liang
> > >
> > > There are all your implementation decisions though.
> > >
> > > If guest memory is so fragmented that you only have order 0 4k
> > > pages, then allocating a huge 1M contigious chunk is very problematic in
> and of itself.
> > >
> >
> > The memory is allocated in the probe stage. This will not happen if
> > the driver is  loaded when booting the guest.
> >
> > > Most people rarely migrate and do not care how fast that happens.
> > > Wasting a large chunk of memory (and it's zeroed for no good reason,
> > > so you actually request host memory for it) for everyone to speed it
> > > up when it does happen is not really an option.
> > >
> > If people don't plan to do inflating/deflating, they should not enable
> > the virtio-balloon at the beginning, once they decide to use it, the
> > driver should provide better performance as much as possible.
> 
> The reason people inflate/deflate is so they can overcommit memory.
> Do they need to overcommit very quickly? I don't see why.
> So let's get what we can for free but I don't really believe people would want
> to pay for it.
> 
> > 1MB is a very small portion for a VM with more than 32GB memory and
> > it's the *worst case*, for VM with less than 32GB memory, the amount
> > of RAM depends on VM's memory size and will be less than 1MB.
> 
> It's guest memmory so might all be in swap and never touched, your memset
> at probe time will fault it in and make hypervisor actually pay for it.
> 
> > If 1MB is too big, how about 512K, or 256K?  32K seems too small.
> >
> > Liang
> 
> It's only small because it makes you rescan the free list.
> So maybe you should do something else.
> I looked at it a bit. Instead of scanning the free list, how about scanning 
> actual
> page structures? If page is unused, pass it to host.
> Solves the problem of rescanning multiple times, does it not?
> 

Yes, agree.
> 
> Another idea: allocate a small bitmap at probe time (e.g. for deflate), 
> allocate
> a bunch more on each request. Use something like GFP_ATOMIC and a
> scatter/gather, if that fails use the smaller bitmap.
> 

So, the aim of v3 is to use a smaller bitmap without too heavy a performance
penalty.
Thanks a lot!

Liang





Re: [Qemu-devel] [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate process

2016-07-28 Thread Li, Liang Z
> On Thu, Jul 28, 2016 at 03:06:37AM +0000, Li, Liang Z wrote:
> > > > + * VIRTIO_BALLOON_PFNS_LIMIT is used to limit the size of page
> > > > +bitmap
> > > > + * to prevent a very large page bitmap, there are two reasons for this:
> > > > + * 1) to save memory.
> > > > + * 2) allocate a large bitmap may fail.
> > > > + *
> > > > + * The actual limit of pfn is determined by:
> > > > + * pfn_limit = min(max_pfn, VIRTIO_BALLOON_PFNS_LIMIT);
> > > > + *
> > > > + * If system has more pages than VIRTIO_BALLOON_PFNS_LIMIT, we
> > > > +will scan
> > > > + * the page list and send the PFNs with several times. To reduce
> > > > +the
> > > > + * overhead of scanning the page list. VIRTIO_BALLOON_PFNS_LIMIT
> > > > +should
> > > > + * be set with a value which can cover most cases.
> > >
> > > So what if it covers 1/32 of the memory? We'll do 32 exits and not
> > > 1, still not a big deal for a big guest.
> > >
> >
> > The issue here is the overhead is too high for scanning the page list for 32
> times.
> > Limit the page bitmap size to a fixed value is better for a big guest?
> >
> 
> I'd say avoid scanning free lists completely. Scan pages themselves and check
> the refcount to see whether they are free.
> This way each page needs to be tested once.
> 
> And skip the whole optimization if less than e.g. 10% is free.

That's better than rescanning the free list. I will change it in the next version.
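
A rough sketch of that direction (hedged: a real implementation also needs to
cope with memory holes, races against the allocator, and the "skip if less than
~10% is free" early exit you mention):

    unsigned long pfn;

    for (pfn = min_pfn; pfn < max_pfn; pfn++) {
        struct page *page;

        if (!pfn_valid(pfn))
            continue;
        page = pfn_to_page(pfn);
        /* refcount 0 means the page sits in the buddy allocator right now */
        if (!page_count(page))
            set_bit(pfn - min_pfn, bitmap);
    }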

Thanks!

Liang





Re: [Qemu-devel] [PATCH v2 repost 7/7] virtio-balloon: tell host vm's free page info

2016-07-28 Thread Li, Liang Z
> >  }
> >
> > +static void update_free_pages_stats(struct virtio_balloon *vb,
> 
> why _stats?

Will change.

> > +   max_pfn = get_max_pfn();
> > +   mutex_lock(>balloon_lock);
> > +   while (pfn < max_pfn) {
> > +   memset(vb->page_bitmap, 0, vb->bmap_len);
> > +   ret = get_free_pages(pfn, pfn + vb->pfn_limit,
> > +   vb->page_bitmap, vb->bmap_len * BITS_PER_BYTE);
> > +   hdr->cmd = cpu_to_virtio16(vb->vdev,
> BALLOON_GET_FREE_PAGES);
> > +   hdr->page_shift = cpu_to_virtio16(vb->vdev, PAGE_SHIFT);
> > +   hdr->req_id = cpu_to_virtio64(vb->vdev, req_id);
> > +   hdr->start_pfn = cpu_to_virtio64(vb->vdev, pfn);
> > +   bmap_len = vb->pfn_limit / BITS_PER_BYTE;
> > +   if (!ret) {
> > +   hdr->flag = cpu_to_virtio16(vb->vdev,
> > +
>   BALLOON_FLAG_DONE);
> > +   if (pfn + vb->pfn_limit > max_pfn)
> > +   bmap_len = (max_pfn - pfn) /
> BITS_PER_BYTE;
> > +   } else
> > +   hdr->flag = cpu_to_virtio16(vb->vdev,
> > +
>   BALLOON_FLAG_CONT);
> > +   hdr->bmap_len = cpu_to_virtio64(vb->vdev, bmap_len);
> > +   sg_init_one(_out, hdr,
> > +sizeof(struct balloon_bmap_hdr) + bmap_len);
> 
> Wait a second. This adds the same buffer multiple times in a loop.
> We will overwrite the buffer without waiting for hypervisor to process it.
> What did I miss?

I am not quite sure about this part. I thought virtqueue_kick(vq) would prevent
the buffer from being overwritten, but I realize now that's wrong.

> > +
> > +   virtqueue_add_outbuf(vq, _out, 1, vb, GFP_KERNEL);
> 
> this can fail. you want to maybe make sure vq has enough space before you
> use it or check error and wait.
> 
> > +   virtqueue_kick(vq);
> 
> why kick here within loop? wait until done. in fact kick outside lock is 
> better
> for smp.

I will change this part in v3.
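
For v3, a hedged sketch of what the per-chunk submission could look like,
reusing the wait-for-ack pattern the driver already has so the bitmap is never
refilled while the host still owns it (the error handling is only indicative):

    int err;
    unsigned int len;

    err = virtqueue_add_outbuf(vq, &sg_out, 1, vb, GFP_KERNEL);
    if (!err) {
        virtqueue_kick(vq);
        /* wait until the host returns the buffer before reusing the bitmap */
        wait_event(vb->acked, virtqueue_get_buf(vq, &len));
    }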

> 
> > +   pfn += vb->pfn_limit;
> > +   static const char * const names[] = { "inflate", "deflate", "stats",
> > +"misc" };
> > int err, nvqs;
> >
> > /*
> >  * We expect two virtqueues: inflate and deflate, and
> >  * optionally stat.
> >  */
> > -   nvqs = virtio_has_feature(vb->vdev,
> VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
> > +   if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_MISC_VQ))
> > +   nvqs = 4;
> 
> Does misc vq depend on stats vq feature then? if yes please validate that.

Yes. What do you mean by 'validate' that?

> 
> 
> > +   else
> > +   nvqs = virtio_has_feature(vb->vdev,
> > + VIRTIO_BALLOON_F_STATS_VQ) ? 3 :
> 2;
> 
> Replace that ? with else too pls.

Will change.

Thanks!
Liang



Re: [Qemu-devel] [virtio-dev] Re: [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate process

2016-07-28 Thread Li, Liang Z
> > > This ends up doing a 1MB kmalloc() right?  That seems a _bit_ big.
> > > How big was the pfn buffer before?
> >
> > Yes, it is if the max pfn is more than 32GB.
> > The size of the pfn buffer use before is 256*4 = 1024 Bytes, it's too
> > small, and it's the main reason for bad performance.
> > Use the max 1MB kmalloc is a balance between performance and
> > flexibility, a large page bitmap covers the range of all the memory is
> > no good for a system with huge amount of memory. If the bitmap is too
> > small, it means we have to traverse a long list for many times, and it's bad
> for performance.
> >
> > Thanks!
> > Liang
> 
> There are all your implementation decisions though.
> 
> If guest memory is so fragmented that you only have order 0 4k pages, then
> allocating a huge 1M contigious chunk is very problematic in and of itself.
> 

The memory is allocated in the probe stage, so this will not happen if the
driver is loaded when the guest boots.

> Most people rarely migrate and do not care how fast that happens.
> Wasting a large chunk of memory (and it's zeroed for no good reason, so you
> actually request host memory for it) for everyone to speed it up when it
> does happen is not really an option.
> 
If people don't plan to do any inflating/deflating, they should not enable
virtio-balloon in the first place; once they decide to use it, the driver
should provide the best performance it can.

1MB is a very small portion of a VM with more than 32GB of memory, and that is
the *worst case*; for a VM with less than 32GB of memory, the allocation
depends on the VM's memory size and will be less than 1MB.

If 1MB is too big, how about 512K, or 256K?  32K seems too small.

Liang

> --
> MST
> 
> -
> To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org




Re: [Qemu-devel] [virtio-dev] Re: [PATCH v2 repost 6/7] mm: add the related functions to get free page info

2016-07-27 Thread Li, Liang Z
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 7da61ad..3ad8b10
> > 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4523,6 +4523,52 @@ unsigned long get_max_pfn(void)  }
> > EXPORT_SYMBOL(get_max_pfn);
> >
> > +static void mark_free_pages_bitmap(struct zone *zone, unsigned long
> start_pfn,
> > +   unsigned long end_pfn, unsigned long *bitmap, unsigned long len) {
> > +   unsigned long pfn, flags, page_num;
> > +   unsigned int order, t;
> > +   struct list_head *curr;
> > +
> > +   if (zone_is_empty(zone))
> > +   return;
> > +   end_pfn = min(start_pfn + len, end_pfn);
> > +   spin_lock_irqsave(>lock, flags);
> > +
> > +   for_each_migratetype_order(order, t) {
> 
> Why not do each order separately? This way you can use a single bit to pass a
> huge page to host.
> 

I thought about that before, and did not do it because of the complexity and
the small benefit. Using separate page bitmaps for each order won't help to
reduce the data traffic, except by ignoring pages with a small order.

> Not a requirement but hey.
> 
> Alternatively (and maybe that is a better idea0 if you wanted to, you could
> just skip lone 4K pages.
> It's not clear that they are worth bothering with.
> Add a flag to start with some reasonably large order and go from there.
> 
One of the main reasons for this patch is to reduce the network traffic as much
as possible, so it looks strange to skip the lone 4K pages. Can skipping these
pages make live migration faster? I think it depends on how many lone 4K pages
there are.

On the other hand, it's faster to send one bit through virtio than to send 4K
bytes even through a 10Gbps network, isn't it?
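
For a rough back-of-envelope (my own numbers, assuming 4KB pages and a 10Gbps
link fully dedicated to migration): one page is 4096 * 8 = 32768 bits, roughly
3.3us of wire time at 10Gbps, while the same page costs a single bit in the
bitmap plus its share of the virtio transfer. So reporting lone 4K free pages
should still be much cheaper than copying them, unless such pages are rare.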

Liang





Re: [Qemu-devel] [PATCH v2 repost 6/7] mm: add the related functions to get free page info

2016-07-27 Thread Li, Liang Z
> On 07/27/2016 03:05 PM, Michael S. Tsirkin wrote:
> > On Wed, Jul 27, 2016 at 09:40:56AM -0700, Dave Hansen wrote:
> >> On 07/26/2016 06:23 PM, Liang Li wrote:
> >>> + for_each_migratetype_order(order, t) {
> >>> + list_for_each(curr, >free_area[order].free_list[t]) {
> >>> + pfn = page_to_pfn(list_entry(curr, struct page, lru));
> >>> + if (pfn >= start_pfn && pfn <= end_pfn) {
> >>> + page_num = 1UL << order;
> >>> + if (pfn + page_num > end_pfn)
> >>> + page_num = end_pfn - pfn;
> >>> + bitmap_set(bitmap, pfn - start_pfn,
> page_num);
> >>> + }
> >>> + }
> >>> + }
> >>
> >> Nit:  The 'page_num' nomenclature really confused me here.  It is the
> >> number of bits being set in the bitmap.  Seems like calling it
> >> nr_pages or num_pages would be more appropriate.
> >>
> >> Isn't this bitmap out of date by the time it's send up to the
> >> hypervisor?  Is there something that makes the inaccuracy OK here?
> >
> > Yes. Calling these free pages is unfortunate. It's likely to confuse
> > people thinking they can just discard these pages.
> >
> > Hypervisor sends a request. We respond with this list of pages, and
> > the guarantee hypervisor needs is that these were free sometime
> > between request and response, so they are safe to free if they are
> > unmodified since the request. hypervisor can detect modifications so
> > it can detect modifications itself and does not need guest help.
> 
> Ahh, that makes sense.
> 
> So the hypervisor is trying to figure out: "Which pages do I move?".  It wants
> to know which pages the guest thinks have good data and need to move.
> But, the list of free pages is (likely) smaller than the list of pages with 
> good
> data, so it asks for that instead.
> 
> A write to a page means that it has valuable data, regardless of whether it
> was in the free list or not.
> 
> The hypervisor only skips moving pages that were free *and* were never
> written to.  So we never lose data, even if this "get free page info"
> stuff is totally out of date.
> 
> The patch description and code comments are, um, a _bit_ light for this level
> of subtlety. :)

I will add more description about this in v3.

Thanks!
Liang



Re: [Qemu-devel] [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate process

2016-07-27 Thread Li, Liang Z
> > +/*
> > + * VIRTIO_BALLOON_PFNS_LIMIT is used to limit the size of page bitmap
> > + * to prevent a very large page bitmap, there are two reasons for this:
> > + * 1) to save memory.
> > + * 2) allocate a large bitmap may fail.
> > + *
> > + * The actual limit of pfn is determined by:
> > + * pfn_limit = min(max_pfn, VIRTIO_BALLOON_PFNS_LIMIT);
> > + *
> > + * If system has more pages than VIRTIO_BALLOON_PFNS_LIMIT, we will
> > +scan
> > + * the page list and send the PFNs with several times. To reduce the
> > + * overhead of scanning the page list. VIRTIO_BALLOON_PFNS_LIMIT
> > +should
> > + * be set with a value which can cover most cases.
> > + */
> > +#define VIRTIO_BALLOON_PFNS_LIMIT ((32 * (1ULL << 30)) >>
> PAGE_SHIFT)
> > +/* 32GB */
> > +
> >  static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
> > module_param(oom_pages, int, S_IRUSR | S_IWUSR);
> > MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
> >
> > +extern unsigned long get_max_pfn(void);
> > +
> 
> Please just include the correct header. No need for this hackery.
> 

Will change. Thanks!

Liang



Re: [Qemu-devel] [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate process

2016-07-27 Thread Li, Liang Z
> Subject: Re: [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate
> process
> 
> On Wed, Jul 27, 2016 at 09:03:21AM -0700, Dave Hansen wrote:
> > On 07/26/2016 06:23 PM, Liang Li wrote:
> > > + vb->pfn_limit = VIRTIO_BALLOON_PFNS_LIMIT;
> > > + vb->pfn_limit = min(vb->pfn_limit, get_max_pfn());
> > > + vb->bmap_len = ALIGN(vb->pfn_limit, BITS_PER_LONG) /
> > > +  BITS_PER_BYTE + 2 * sizeof(unsigned long);
> > > + hdr_len = sizeof(struct balloon_bmap_hdr);
> > > + vb->bmap_hdr = kzalloc(hdr_len + vb->bmap_len, GFP_KERNEL);
> >
> > This ends up doing a 1MB kmalloc() right?  That seems a _bit_ big.
> > How big was the pfn buffer before?
> 
> 
> Yes I would limit this to 1G memory in a go, will result in a 32KByte bitmap.
> 
> --
> MST

Limiting it to 1G is bad for performance; I sent you the test results several
weeks ago.

Pasted below:

About the size of the page bitmap, I have tested the performance of filling
the balloon to 15GB with a 16GB RAM VM.

32K Byte (covers 1GB of RAM):   time spent on inflating: 2031ms
64K Byte (covers 2GB of RAM):   time spent on inflating: 1507ms
512K Byte (covers 16GB of RAM): time spent on inflating: 1237ms
If possible, a big bitmap is better for performance.

Liang




Re: [Qemu-devel] [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate process

2016-07-27 Thread Li, Liang Z
> > + * VIRTIO_BALLOON_PFNS_LIMIT is used to limit the size of page bitmap
> > + * to prevent a very large page bitmap, there are two reasons for this:
> > + * 1) to save memory.
> > + * 2) allocate a large bitmap may fail.
> > + *
> > + * The actual limit of pfn is determined by:
> > + * pfn_limit = min(max_pfn, VIRTIO_BALLOON_PFNS_LIMIT);
> > + *
> > + * If system has more pages than VIRTIO_BALLOON_PFNS_LIMIT, we will
> > +scan
> > + * the page list and send the PFNs with several times. To reduce the
> > + * overhead of scanning the page list. VIRTIO_BALLOON_PFNS_LIMIT
> > +should
> > + * be set with a value which can cover most cases.
> 
> So what if it covers 1/32 of the memory? We'll do 32 exits and not 1, still 
> not a
> big deal for a big guest.
> 

The issue here is that the overhead of scanning the page list 32 times is too
high. Isn't limiting the page bitmap size to a fixed value better for a big guest?

> > + */
> > +#define VIRTIO_BALLOON_PFNS_LIMIT ((32 * (1ULL << 30)) >>
> PAGE_SHIFT)
> > +/* 32GB */
> 
> I already said this with a smaller limit.
> 
>   2<< 30  is 2G but that is not a useful comment.
>   pls explain what is the reason for this selection.
> 
> Still applies here.
> 

I will add a comment explaining this.

> > -   sg_init_one(, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
> > +   if (virtio_has_feature(vb->vdev,
> VIRTIO_BALLOON_F_PAGE_BITMAP)) {
> > +   struct balloon_bmap_hdr *hdr = vb->bmap_hdr;
> > +   unsigned long bmap_len;
> > +
> > +   /* cmd and req_id are not used here, set them to 0 */
> > +   hdr->cmd = cpu_to_virtio16(vb->vdev, 0);
> > +   hdr->page_shift = cpu_to_virtio16(vb->vdev, PAGE_SHIFT);
> > +   hdr->reserved = cpu_to_virtio16(vb->vdev, 0);
> > +   hdr->req_id = cpu_to_virtio64(vb->vdev, 0);
> 
> no need to swap 0, just fill it in. in fact you allocated all 0s so no need 
> to touch
> these fields at all.
> 

Will change in v3.

> > @@ -489,7 +612,7 @@ static int virtballoon_migratepage(struct
> > balloon_dev_info *vb_dev_info,  static int virtballoon_probe(struct
> > virtio_device *vdev)  {
> > struct virtio_balloon *vb;
> > -   int err;
> > +   int err, hdr_len;
> >
> > if (!vdev->config->get) {
> > dev_err(>dev, "%s failure: config access disabled\n",
> @@
> > -508,6 +631,18 @@ static int virtballoon_probe(struct virtio_device *vdev)
> > spin_lock_init(>stop_update_lock);
> > vb->stop_update = false;
> > vb->num_pages = 0;
> > +   vb->pfn_limit = VIRTIO_BALLOON_PFNS_LIMIT;
> > +   vb->pfn_limit = min(vb->pfn_limit, get_max_pfn());
> > +   vb->bmap_len = ALIGN(vb->pfn_limit, BITS_PER_LONG) /
> > +BITS_PER_BYTE + 2 * sizeof(unsigned long);
> 
> What are these 2 longs in aid of?
> 
The rounddown(vb->start_pfn, BITS_PER_LONG) and roundup(vb->end_pfn,
BITS_PER_LONG) may cause (vb->end_pfn - vb->start_pfn) > vb->pfn_limit, so we
need extra space in the bitmap for this case. Two longs are enough.
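
For example, assuming BITS_PER_LONG is 64: rounding start_pfn down can prepend
at most 63 extra bits and rounding end_pfn up can append at most 63 extra bits,
so the bitmap may need up to 126 bits beyond pfn_limit, which the extra
2 * sizeof(unsigned long) (128 bits) covers.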

> > +   hdr_len = sizeof(struct balloon_bmap_hdr);
> > +   vb->bmap_hdr = kzalloc(hdr_len + vb->bmap_len, GFP_KERNEL);
> 
> So it can go up to 1MByte but adding header size etc you need a higher order
> allocation. This is a waste, there is no need to have a power of two 
> allocation.
> Start from the other side. Say "I want to allocate 32KBytes for the bitmap".
> Subtract the header and you get bitmap size.
> Calculate the pfn limit from there.
> 

Indeed, will change. Thanks a lot!

Liang




Re: [Qemu-devel] [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate process

2016-07-27 Thread Li, Liang Z
> Subject: Re: [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate
> process
> 
> On 07/26/2016 06:23 PM, Liang Li wrote:
> > +   vb->pfn_limit = VIRTIO_BALLOON_PFNS_LIMIT;
> > +   vb->pfn_limit = min(vb->pfn_limit, get_max_pfn());
> > +   vb->bmap_len = ALIGN(vb->pfn_limit, BITS_PER_LONG) /
> > +BITS_PER_BYTE + 2 * sizeof(unsigned long);
> > +   hdr_len = sizeof(struct balloon_bmap_hdr);
> > +   vb->bmap_hdr = kzalloc(hdr_len + vb->bmap_len, GFP_KERNEL);
> 
> This ends up doing a 1MB kmalloc() right?  That seems a _bit_ big.  How big
> was the pfn buffer before?

Yes, it is if the max pfn covers more than 32GB.
The size of the pfn buffer used before was 256*4 = 1024 bytes; it's too small,
and that is the main reason for the bad performance.
Using at most a 1MB kmalloc() is a balance between performance and flexibility:
a page bitmap large enough to cover all of memory is no good for a system with
a huge amount of memory, while if the bitmap is too small we have to traverse a
long free list many times, which is bad for performance.
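
As a back-of-envelope check (assuming 4KB pages): 32GB is 8M pages, so a flat
bitmap over that range needs 8M bits = 1MB, while the old 256-entry PFN array
of 1024 bytes only described 256 * 4KB = 1MB of memory per pass.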

Thanks!
Liang   




Re: [Qemu-devel] [PATCH v2 repost 6/7] mm: add the related functions to get free page info

2016-07-27 Thread Li, Liang Z
> Subject: Re: [PATCH v2 repost 6/7] mm: add the related functions to get free
> page info
> 
> On 07/26/2016 06:23 PM, Liang Li wrote:
> > +   for_each_migratetype_order(order, t) {
> > +   list_for_each(curr, >free_area[order].free_list[t]) {
> > +   pfn = page_to_pfn(list_entry(curr, struct page, lru));
> > +   if (pfn >= start_pfn && pfn <= end_pfn) {
> > +   page_num = 1UL << order;
> > +   if (pfn + page_num > end_pfn)
> > +   page_num = end_pfn - pfn;
> > +   bitmap_set(bitmap, pfn - start_pfn,
> page_num);
> > +   }
> > +   }
> > +   }
> 
> Nit:  The 'page_num' nomenclature really confused me here.  It is the
> number of bits being set in the bitmap.  Seems like calling it nr_pages or
> num_pages would be more appropriate.
> 

You are right,  will change.

> Isn't this bitmap out of date by the time it's send up to the hypervisor?  Is
> there something that makes the inaccuracy OK here?

Yes. Dirty page logging will be used to correct the inaccuracy.
Dirty page logging should be started before getting the free page bitmap; then,
if some of the free pages are written to and become non-free, those pages will
be tracked by the dirty page logging mechanism.
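
A hedged sketch of the host-side consequence (hypothetical helper name, not the
actual code in this series): because dirty logging is already running when the
guest samples its free pages, QEMU can simply clear those pages from the
migration bitmap, and any page written afterwards reappears at the next bitmap
sync.

static void filter_out_reported_free_pages(unsigned long *migration_bmap,
                                           const unsigned long *free_page_bmap,
                                           long nbits)
{
    long i;

    for (i = 0; i < BITS_TO_LONGS(nbits); i++) {
        /* keep a page only if the guest did not report it as free */
        migration_bmap[i] &= ~free_page_bmap[i];
    }
}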

Thanks!
Liang



Re: [Qemu-devel] [virtio-dev] Re: [PATCH v2 kernel 0/7] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-07-26 Thread Li, Liang Z
> So I'm fine with this patchset, but I noticed it was not yet reviewed by MM
> people. And that is not surprising since you did not copy memory
> management mailing list on it.
> 
> I added linux...@kvack.org Cc on this mail but this might not be enough.
> 
> Please repost (e.g. [PATCH v2 repost]) copying the relevant mailing list so we
> can get some reviews.
> 

I will repost. Thanks!

Liang



Re: [Qemu-devel] [PATCH v2 kernel 0/7] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-07-21 Thread Li, Liang Z
Hi Michael,

If you have time, could you help to review this patch set?

Thanks!
Liang


> -Original Message-
> From: Li, Liang Z
> Sent: Wednesday, June 29, 2016 6:32 PM
> To: m...@redhat.com
> Cc: linux-ker...@vger.kernel.org; virtualizat...@lists.linux-foundation.org;
> k...@vger.kernel.org; qemu-devel@nongnu.org; virtio-dev@lists.oasis-
> open.org; dgilb...@redhat.com; quint...@redhat.com; Li, Liang Z
> Subject: [PATCH v2 kernel 0/7] Extend virtio-balloon for fast (de)inflating &
> fast live migration
> 
> This patch set contains two parts of changes to the virtio-balloon.
> 
> One is the change for speeding up the inflating & deflating process, the main
> idea of this optimization is to use bitmap to send the page information to
> host instead of the PFNs, to reduce the overhead of virtio data transmission,
> address translation and madvise(). This can help to improve the performance
> by about 85%.
> 
> Another change is for speeding up live migration. By skipping process guest's
> free pages in the first round of data copy, to reduce needless data 
> processing,
> this can help to save quite a lot of CPU cycles and network bandwidth. We
> put guest's free page information in bitmap and send it to host with the virt
> queue of virtio-balloon. For an idle 8GB guest, this can help to shorten the
> total live migration time from 2Sec to about 500ms in the 10Gbps network
> environment.
> 
> 
> Changes from v1 to v2:
> * Abandon the patch for dropping page cache.
> * Put some structures to uapi head file.
> * Use a new way to determine the page bitmap size.
> * Use a unified way to send the free page information with the bitmap
> * Address the issues referred in MST's comments
> 
> Liang Li (7):
>   virtio-balloon: rework deflate to add page to a list
>   virtio-balloon: define new feature bit and page bitmap head
>   mm: add a function to get the max pfn
>   virtio-balloon: speed up inflate/deflate process
>   virtio-balloon: define feature bit and head for misc virt queue
>   mm: add the related functions to get free page info
>   virtio-balloon: tell host vm's free page info
> 
>  drivers/virtio/virtio_balloon.c | 306
> +++-
>  include/uapi/linux/virtio_balloon.h |  41 +
>  mm/page_alloc.c |  52 ++
>  3 files changed, 359 insertions(+), 40 deletions(-)
> 
> --
> 1.8.3.1




Re: [Qemu-devel] [QEMU v2 0/9] Fast (de)inflating & fast live migration

2016-07-21 Thread Li, Liang Z
Ping ...

Liang

> -Original Message-
> From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org]
> On Behalf Of Liang Li
> Sent: Friday, July 15, 2016 10:47 AM
> To: qemu-devel@nongnu.org
> Cc: m...@redhat.com; pbonz...@redhat.com; quint...@redhat.com;
> amit.s...@redhat.com; k...@vger.kernel.org; dgilb...@redhat.com;
> th...@redhat.com; Li, Liang Z
> Subject: [QEMU v2 0/9] Fast (de)inflating & fast live migration
> 
> This patch set intends to do two optimizations, one is to speed up the
> (de)inflating process of virtio balloon, and another one which is to speed up
> the live migration process. We put them together because both of them are
> required to change the virtio balloon spec.
> 
> The main idea of speeding up the (de)inflating process is to use bitmap to
> send the page information to host instead of the PFNs, to reduce the
> overhead of virtio data transmission, address translation and madvise(). This
> can help to improve the performance by about 85%.
> 
> The idea of speeding up live migration is to skip process guest's free pages 
> in
> the first round of data copy, to reduce needless data processing, this can
> help to save quite a lot of CPU cycles and network bandwidth. We get guest's
> free page information through the virt queue of virtio-balloon, and filter out
> these free pages during live migration. For an idle 8GB guest, this can help 
> to
> shorten the total live migration time from 2Sec to about 500ms in the 10Gbps
> network environment.
> 
> Changes from v1 to v2:
> * Abandon the patch for dropping page cache.
> * Get a struct from vq instead of separate variables.
> * Use two separate APIs to request free pages and query the status.
> * Changed the virtio balloon interface.
> * Addressed some of the comments of v1.
> 
> Liang Li (9):
>   virtio-balloon: Remove needless precompiled directive
>   virtio-balloon: update linux head file
>   virtio-balloon: speed up inflating & deflating process
>   virtio-balloon: update linux head file for new feature
>   balloon: get free page info from guest
>   balloon: migrate vq elem to destination
>   bitmap: Add a new bitmap_move function
>   kvm: Add two new arch specific functions
>   migration: skip free pages during live migration
> 
>  balloon.c   |  47 +++-
>  hw/virtio/virtio-balloon.c  | 304 
> ++--
>  include/hw/virtio/virtio-balloon.h  |  18 +-
>  include/qemu/bitmap.h   |  13 +
>  include/standard-headers/linux/virtio_balloon.h |  41 
>  include/sysemu/balloon.h|  18 +-
>  include/sysemu/kvm.h|  18 ++
>  migration/ram.c |  86 +++
>  target-arm/kvm.c|  14 ++
>  target-i386/kvm.c   |  37 +++
>  target-mips/kvm.c   |  14 ++
>  target-ppc/kvm.c|  14 ++
>  target-s390x/kvm.c  |  14 ++
>  13 files changed, 608 insertions(+), 30 deletions(-)
> 
> --
> 1.9.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in the body of
> a message to majord...@vger.kernel.org More majordomo info at
> http://vger.kernel.org/majordomo-info.html



Re: [Qemu-devel] [PATCH v2] balloon: Fix failure of updating guest memory status

2016-07-06 Thread Li, Liang Z
> > > > After live migration, 'guest-stats' can't get the expected memory
> > > > status in the guest. This issue is caused by commit 4eae2a657d.
> > > > The value of 's->stats_vq_elem' will be NULL after live migration,
> > > > and the check in the function 'balloon_stats_poll_cb()' will
> > > > prevent the 'virtio_notify()' from executing. So guest will not
> > > > update the memory status.
> > > >
> > > > Commit 4eae2a657d is doing the right thing, but 's->stats_vq_elem'
> > > > should be treated as part of balloon device state and migrated to
> > > > destination if it's not NULL to make everything works well.
> > > >
> > > > Signed-off-by: Liang Li 
> > > > Suggested-by: Paolo Bonzini 
> > > > Cc: Michael S. Tsirkin 
> > > > Cc: Ladi Prosek 
> > > > Cc: Paolo Bonzini 
> > >
> > > I agree there's an issue but we don't change versions anymore.
> > > Breaking migrations for everyone is also not nice.
> > >
> > > How about queueing virtio_balloon_receive_stats so it will get
> > > invoked when vm starts?
> > >
> >
> > Could you give more explanation about how it works?  I can't catch you.
> >
> > Thanks!
> > Liang
> 
> virtqueue_discard before migration
> 
> virtio_balloon_receive_stats after migration
> 

Sorry, I still can't follow you. Maybe it's easier for you to submit a patch
than to write a lot of words to make me understand your idea.

I just don't understand why we should not use the version id to make things
easier; isn't that the original intent of the version id?
If we want to extend the device and more state is needed, can the idea you
suggest be used as a common solution?

Thanks!
Liang

> --
> MST



Re: [Qemu-devel] [PATCH v2] balloon: Fix failure of updating guest memory status

2016-07-06 Thread Li, Liang Z
> On Wed, Jul 06, 2016 at 10:36:33AM +0800, Liang Li wrote:
> > After live migration, 'guest-stats' can't get the expected memory
> > status in the guest. This issue is caused by commit 4eae2a657d.
> > The value of 's->stats_vq_elem' will be NULL after live migration, and
> > the check in the function 'balloon_stats_poll_cb()' will prevent the
> > 'virtio_notify()' from executing. So guest will not update the memory
> > status.
> >
> > Commit 4eae2a657d is doing the right thing, but 's->stats_vq_elem'
> > should be treated as part of balloon device state and migrated to
> > destination if it's not NULL to make everything works well.
> >
> > Signed-off-by: Liang Li 
> > Suggested-by: Paolo Bonzini 
> > Cc: Michael S. Tsirkin 
> > Cc: Ladi Prosek 
> > Cc: Paolo Bonzini 
> 
> I agree there's an issue but we don't change versions anymore.
> Breaking migrations for everyone is also not nice.
> 
> How about queueing virtio_balloon_receive_stats so it will get invoked when
> vm starts?
> 

Could you give more explanation of how it works? I can't follow you.

Thanks!
Liang




Re: [Qemu-devel] [PATCH v2 kernel 0/7] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-07-05 Thread Li, Liang Z
Ping ...

Liang

> -Original Message-
> From: Li, Liang Z
> Sent: Wednesday, June 29, 2016 6:32 PM
> To: m...@redhat.com
> Cc: linux-ker...@vger.kernel.org; virtualizat...@lists.linux-foundation.org;
> k...@vger.kernel.org; qemu-devel@nongnu.org; virtio-dev@lists.oasis-
> open.org; dgilb...@redhat.com; quint...@redhat.com; Li, Liang Z
> Subject: [PATCH v2 kernel 0/7] Extend virtio-balloon for fast (de)inflating &
> fast live migration
> 
> This patch set contains two parts of changes to the virtio-balloon.
> 
> One is the change for speeding up the inflating & deflating process, the main
> idea of this optimization is to use bitmap to send the page information to
> host instead of the PFNs, to reduce the overhead of virtio data transmission,
> address translation and madvise(). This can help to improve the performance
> by about 85%.
> 
> Another change is for speeding up live migration. By skipping process guest's
> free pages in the first round of data copy, to reduce needless data 
> processing,
> this can help to save quite a lot of CPU cycles and network bandwidth. We
> put guest's free page information in bitmap and send it to host with the virt
> queue of virtio-balloon. For an idle 8GB guest, this can help to shorten the
> total live migration time from 2Sec to about 500ms in the 10Gbps network
> environment.
> 
> 
> Changes from v1 to v2:
> * Abandon the patch for dropping page cache.
> * Put some structures to uapi head file.
> * Use a new way to determine the page bitmap size.
> * Use a unified way to send the free page information with the bitmap
> * Address the issues referred in MST's comments
> 
> Liang Li (7):
>   virtio-balloon: rework deflate to add page to a list
>   virtio-balloon: define new feature bit and page bitmap head
>   mm: add a function to get the max pfn
>   virtio-balloon: speed up inflate/deflate process
>   virtio-balloon: define feature bit and head for misc virt queue
>   mm: add the related functions to get free page info
>   virtio-balloon: tell host vm's free page info
> 
>  drivers/virtio/virtio_balloon.c | 306
> +++-
>  include/uapi/linux/virtio_balloon.h |  41 +
>  mm/page_alloc.c |  52 ++
>  3 files changed, 359 insertions(+), 40 deletions(-)
> 
> --
> 1.8.3.1




Re: [Qemu-devel] [PATCH] balloon: Fix failure of updating guest memory status

2016-07-01 Thread Li, Liang Z
> > +if (s->stats_vq_elem == NULL) {
> > +virtqueue_push(s->svq, , 0);
> > +virtio_notify(vdev, s->svq);
> > +return;
> > +}
> >  virtqueue_push(s->svq, s->stats_vq_elem, s->stats_vq_offset);
> >  virtio_notify(vdev, s->svq);
> >  g_free(s->stats_vq_elem);
> >
> 
> Hi, the right fix is to migrate s->stats_vq_elem if it is not NULL.  See how 
> it's
> done in hw/char/virtio-serial.c's virtio_serial_save_device
> (save) and fetch_active_ports_list (load).
> 
> Paolo

Hi Paolo,

   If we try to migrate 's->stats_vq_elem', how do we make it work with the old
   version? With a new feature bit? Or with some version control?
   
Thanks!
Liang
  


Re: [Qemu-devel] [PATCH] balloon: Fix failure of updating guest memory status

2016-06-30 Thread Li, Liang Z
> > +if (s->stats_vq_elem == NULL) {
> > +virtqueue_push(s->svq, , 0);
> > +virtio_notify(vdev, s->svq);
> > +return;
> > +}
> >  virtqueue_push(s->svq, s->stats_vq_elem, s->stats_vq_offset);
> >  virtio_notify(vdev, s->svq);
> >  g_free(s->stats_vq_elem);
> >
> 
> Hi, the right fix is to migrate s->stats_vq_elem if it is not NULL.  See how 
> it's
> done in hw/char/virtio-serial.c's virtio_serial_save_device
> (save) and fetch_active_ports_list (load).
> 
> Paolo

Thanks for the information; it's very helpful. I will send a v2.

Liang



Re: [Qemu-devel] [PATCH] balloon: Fix failure of updating guest memory status

2016-06-30 Thread Li, Liang Z
> >> What if we are not in the just-after-live-migration situation though?
> >> If the guest simply didn't add a buffer to the queue for some reason,
> >> wouldn't this newly added push/notify break the balloon protocol?
> >>
> > Could you elaborate how it happens?
> > The added code only works for the situation just after live migration. 
> > right?
> 
> I don't believe so. As eloquently stated in the spec (5.5.6.3 Memory
> Statistics), the stats queue is "atypical". If everything goes well, the
> stats_vq_elem field is indeed expected to be non-NULL here in
> balloon_stats_poll_cb. But it may as well be NULL, for example if step 3. "The
> driver collects memory statistics and writes them into a new buffer." takes a
> long time and the driver doesn't make it by the time the callback fires.
> 
> Basically, the driver and device play ping-pong and the value of
> stats_vq_elem is NULL or non-NULL depending on which side the ball is.
> It looks like stats_vq_elem should be part of the QEMU state that is
> preserved across migrations, although I'll admit that I am unfamiliar with how

Migrating stats_vq_elem to the destination is better, and it can solve a
similar problem I encountered in another patch. I will look into how to do this.

Thanks for your explanation!

Liang

> migrations work and may be missing something important.
> 
> Thanks!
> Ladi
> 


Re: [Qemu-devel] [PATCH] balloon: Fix failure of updating guest memory status

2016-06-30 Thread Li, Liang Z
> On Thu, Jun 30, 2016 at 9:31 AM, Liang Li  wrote:
> > After live migration, 'guest-stats' can't get the expected memory
> > status in the guest. This issue is caused by commit 4eae2a657d.
> > The value of 's->stats_vq_elem' will be NULL after live migration, and
> > the check in the function 'balloon_stats_poll_cb()' will prevent the
> > 'virtio_notify()' from executing. So guest will not update the memory
> > status.
> >
> > Signed-off-by: Liang Li 
> > Cc: Michael S. Tsirkin 
> > Cc: Ladi Prosek 
> > Cc: Paolo Bonzini 
> > ---
> >  hw/virtio/virtio-balloon.c | 8 +++-
> >  1 file changed, 7 insertions(+), 1 deletion(-)
> >
> > diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> > index 557d3f9..cc6947f 100644
> > --- a/hw/virtio/virtio-balloon.c
> > +++ b/hw/virtio/virtio-balloon.c
> > @@ -98,13 +98,19 @@ static void balloon_stats_poll_cb(void *opaque)  {
> >  VirtIOBalloon *s = opaque;
> >  VirtIODevice *vdev = VIRTIO_DEVICE(s);
> > +VirtQueueElement elem = {0};
> >
> > -if (s->stats_vq_elem == NULL || !balloon_stats_supported(s)) {
> > +if (!balloon_stats_supported(s)) {
> >  /* re-schedule */
> >  balloon_stats_change_timer(s, s->stats_poll_interval);
> >  return;
> >  }
> >
> > +if (s->stats_vq_elem == NULL) {
> > +virtqueue_push(s->svq, &elem, 0);
> > +virtio_notify(vdev, s->svq);
> > +return;
> > +}
> 
> What if we are not in the just-after-live-migration situation though?
> If the guest simply didn't add a buffer to the queue for some reason,
> wouldn't this newly added push/notify break the balloon protocol?
> 
Could you elaborate on how that happens?
The added code only takes effect in the situation just after live migration, right?

Thanks!
Liang


Re: [Qemu-devel] [QEMU 6/7] kvm: Add two new arch specific functions

2016-06-19 Thread Li, Liang Z
> > --- a/target-arm/kvm.c
> > +++ b/target-arm/kvm.c
> > @@ -627,3 +627,17 @@ int kvm_arch_msi_data_to_gsi(uint32_t data)  {
> >  return (data - 32) & 0x;
> >  }
> > +
> > +unsigned long get_guest_max_pfn(void) {
> > +/* To be done */
> > +
> > +return 0;
> > +}
> > +
> > +unsigned long *tighten_guest_free_page_bmap(unsigned long *bmap) {
> > +/* To be done */
> > +
> > +return bmap;
> > +}
> > diff --git a/target-i386/kvm.c b/target-i386/kvm.c index
> > abf50e6..0b394cb 100644
> > --- a/target-i386/kvm.c
> > +++ b/target-i386/kvm.c
> > @@ -3327,3 +3327,38 @@ int kvm_arch_msi_data_to_gsi(uint32_t data)  {
> >  abort();
> >  }
> > +
> > +unsigned long get_guest_max_pfn(void) {
> > +PCMachineState *pcms = PC_MACHINE(current_machine);
> > +ram_addr_t above_4g_mem = pcms->above_4g_mem_size;
> > +unsigned long max_pfn;
> > +
> > +if (above_4g_mem) {
> > +max_pfn = ((1ULL << 32) + above_4g_mem) >> TARGET_PAGE_BITS;
> > +} else {
> > +max_pfn = pcms->below_4g_mem_size >> TARGET_PAGE_BITS;
> > +}
> > +
> > +return max_pfn;
> > +}
> 
> Why is this in kvm?

I can't find a better place. Do you have any suggestion? 

> > +pos = (above_4g_mem + below_4g_mem) >> TARGET_PAGE_BITS;
> > +len = ((1ULL << 32) - below_4g_mem) >> TARGET_PAGE_BITS;
> > +bitmap_clear(bmap, pos, len);
> > +}
> > +
> > +return bmap;
> > +}
> 
> what does this do? External APIs should have documentation.

I will add the documentation. Thanks!

Liang
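
As a quick worked example of get_guest_max_pfn() above (the numbers are made
up for illustration): on an x86 guest configured with 3GB below 4GB and 3GB
above it, above_4g_mem is non-zero, so max_pfn = ((1ULL << 32) + 3GB) >>
TARGET_PAGE_BITS = 7GB / 4KB = 0x1C0000 pages. Note that this range also
covers the PCI hole between the end of low RAM and 4GB, where there is no RAM,
which is presumably what the bitmap_clear() in tighten_guest_free_page_bmap()
has to compensate for.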



Re: [Qemu-devel] [QEMU 7/7] migration: skip free pages during live migration

2016-06-19 Thread Li, Liang Z
> On Mon, Jun 13, 2016 at 06:16:49PM +0800, Liang Li wrote:
> > After sending out the request for free pages, live migration process
> > will start without waiting for the free page bitmap is ready. If the
> > free page bitmap is not ready when doing the 1st
> > migration_bitmap_sync() after ram_save_setup(), the free page bitmap
> > will be ignored, this means the free pages will not be filtered out in
> > this case.
> > The current implementation can not work with post copy, if post copy
> > is enabled, we simply ignore the free pages. Will make it work later.
> >
> > Signed-off-by: Liang Li 
> 
> Tying migration to balloon in this way seems rather ugly.
> So with request ID, the logic would basically be
> 
>   - add memory listener with high priority
>   - before sync bitmap, increment request id
>   - when we get response, if it has latest request id,
> clear qemu migration bitmap
> otherwise, ignore

Using the request ID is good.
Could you elaborate on the meaning of 'add memory listener with high priority'?

Thanks!
Liang
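
A minimal sketch of the request-ID handshake Michael outlines above; the
helper names (balloon_request_free_pages, migration_bitmap_clear_free) are
hypothetical and only illustrate the control flow, they are not existing QEMU
functions:

#include <stdint.h>

void balloon_request_free_pages(uint32_t id);                 /* hypothetical */
void migration_bitmap_clear_free(const unsigned long *bmap);  /* hypothetical */

static uint32_t free_page_req_id;   /* ID of the last request sent to the guest */

/* Called right before each migration_bitmap_sync(). */
static void request_free_pages(void)
{
    free_page_req_id++;
    balloon_request_free_pages(free_page_req_id);
}

/* Called when the guest returns a free-page bitmap tagged with 'id'. */
static void free_pages_received(uint32_t id, const unsigned long *bmap)
{
    if (id != free_page_req_id) {
        /* Stale answer: a newer sync has started in the meantime, so some of
         * these pages may already have been re-dirtied and re-marked. */
        return;
    }
    /* The pages were free at some point between request and response, so it
     * is safe to clear them from the migration dirty bitmap. */
    migration_bitmap_clear_free(bmap);
}

The 'memory listener with high priority' part presumably refers to registering
a MemoryListener whose priority orders its callbacks relative to the other
listeners; that registration is not shown here.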



Re: [Qemu-devel] [QEMU 4/7] balloon: get free page info from guest

2016-06-19 Thread Li, Liang Z
> On Mon, Jun 13, 2016 at 06:16:46PM +0800, Liang Li wrote:
> > Add a new feature to get the free page information from guest, the
> > free page information is saved in a bitmap. Please note that 'free
> > page' only means these pages are free before the request, some of the
> > pages will become no free during the process of sending the free page
> > bitmap to QEMU.
> >
> > Signed-off-by: Liang Li 
> 
> 
> I don't much like this request interface.
> The meaning of free page is rather fuzzy too - so at what point are they free?
> 
> 
> My suggestion would be:
>   report free page request ID to guest
>   include request ID when guest sends free page list
> 
> the definition is then:
>   page was free sometime after host set this value of request
>   ID and before it received response with the same ID

That's better. I will change it in the next version.
There is also another issue, similar to the one we solved to speed up the
inflating/deflating process: should we use a large page bitmap or a small one?
I used a big one in this patch.

If we choose a small page bitmap, we have to traverse the free page list many
times, and the meaning of 'free page' becomes even more fuzzy.

But if we use a big bitmap, people may ask why there is a small one here and a
big one there.

Thanks!

Liang 






Re: [Qemu-devel] [QEMU 2/7] virtio-balloon: add drop cache support

2016-06-19 Thread Li, Liang Z
> On Mon, Jun 13, 2016 at 06:16:44PM +0800, Liang Li wrote:
> > virtio-balloon can make use of the amount of free memory to determine
> > the amount of memory to be filled in the balloon, but the amount of
> > free memory will be affected by the page cache, which can be reclaimed.
> > Dropping the cache before getting the amount of free memory will be very
> > helpful to reflect the exact amount of memory that can be reclaimed.
> 
> Can't we just extend stats to report "reclaimable" memory?
> 

Yes, I noticed that VIRTIO_BALLOON_S_AVAIL is for this purpose.

I summarized the possible solutions from others:
a. Drop the cache in the guest agent instead of via an explicit qmp command. (Paolo)
b. Use a parameter as a hint to tell the guest that live migration is going to
happen, and let the guest do what it can to make the host's life easier. (David)

What's your opinion on these two solutions?

Thanks!
Liang




Re: [Qemu-devel] [QEMU 1/7] balloon: speed up inflating & deflating process

2016-06-19 Thread Li, Liang Z
> >
> >  virtqueue_push(vq, elem, offset); @@ -374,6 +489,7 @@ static
> > uint64_t virtio_balloon_get_features(VirtIODevice *vdev, uint64_t f,
> >  VirtIOBalloon *dev = VIRTIO_BALLOON(vdev);
> >  f |= dev->host_features;
> >  virtio_add_feature(&f, VIRTIO_BALLOON_F_STATS_VQ);
> > +virtio_add_feature(&f, VIRTIO_BALLOON_F_PAGE_BITMAP);
> >  return f;
> >  }
> >
> 
> Pls add features to virtio_balloon_properties.
> You also need to handle compatibility by disabling for old machine types.
> 

I forgot that, will add in next version.

> > --- a/include/standard-headers/linux/virtio_balloon.h
> > +++ b/include/standard-headers/linux/virtio_balloon.h
> > @@ -34,6 +34,7 @@
> >  #define VIRTIO_BALLOON_F_MUST_TELL_HOST0 /* Tell before
> reclaiming pages */
> >  #define VIRTIO_BALLOON_F_STATS_VQ  1 /* Memory Stats virtqueue
> */
> >  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM2 /* Deflate balloon
> on OOM */
> > +#define VIRTIO_BALLOON_F_PAGE_BITMAP  3 /* Use page bitmap to
> send
> > +page info */
> >
> >  /* Size of a PFN in the balloon interface. */  #define
> > VIRTIO_BALLOON_PFN_SHIFT 12
> 
> We want to keep this in sync with Linux.
> Let's get a minimal patch to extend this header merged in linux, then update
> this one.

OK. Can this be independent of the virtio-balloon SPEC? As I understand it,
 it will not get merged before the SPEC is set?

Thanks!
Liang
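
For what it's worth, a sketch of the property/compat handling Michael asks
for. The property name "page-bitmap" and the compat entry are made up for
illustration; the existing deflate-on-oom line is shown only for context:

static Property virtio_balloon_properties[] = {
    DEFINE_PROP_BIT("deflate-on-oom", VirtIOBalloon, host_features,
                    VIRTIO_BALLOON_F_DEFLATE_ON_OOM, false),
    /* New guest-visible feature: exposed as a property so that old machine
     * types can force it off and keep cross-version migration working. */
    DEFINE_PROP_BIT("page-bitmap", VirtIOBalloon, host_features,
                    VIRTIO_BALLOON_F_PAGE_BITMAP, true),
    DEFINE_PROP_END_OF_LIST(),
};

plus an entry along the lines of { .driver = "virtio-balloon-pci",
.property = "page-bitmap", .value = "off" } in the previous release's
HW_COMPAT_* list, so the feature stays disabled for older machine types.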



Re: [Qemu-devel] [QEMU 3/7] Add the hmp and qmp interface for dropping cache

2016-06-16 Thread Li, Liang Z
> > > > +{ 'command': 'balloon_drop_cache', 'data': {'value':
> > > > +'DropCacheType'} }
> > >
> > > Also, as noted in the man page quote above, it is recommended to
> > > call
> > > sync() to minimise dirty pages. Should we have a way to request a
> > > sync as part of this monitor command.
> > >
> > > More generally, it feels like this is taking as down a path towards
> > > actively managing the guest kernel VM from the host. Is this really
> > > a path we want to be going down, given that its going to take us
> > > into increasing non-portable concepts which are potentially
> > > different for each guest OS kernel.  Is this drop caches feature at
> > > all applicable to Windows, OS-X, *BSD guest OS impls of the balloon
> > > driver ? If it is applicable, are the 3 fixed constants you've
> >
> > No.
> >
> > > defined at all useful to those other OS ?
> > >
> >
> > Maybe they are not.
> > I agree that there are too Linux specific.  And I did more than needed.
> > Actually, I just want to drop the clean cache, do more than that is
> > too heavy and no good for performance.
> >
> > > I'm warying of us taking a design path which is so Linux specific it
> > > isn't useful elsewhere. IOW, just because we can do this, doesn't
> > > mean we should do this...
> > >
> >
> > Agree.
> 
> I can see an argument for giving the guest a hint about what's going on and
> letting the guest decide what it's going to do - so telling the guest that a
> migration is happening and you'd like it to make the hosts life easy seems
> reasonable and it doesn't make any guest OS assumptions.
> 

That's much better. Thanks!

Liang




Re: [Qemu-devel] [QEMU 1/7] balloon: speed up inflating & deflating process

2016-06-16 Thread Li, Liang Z
>  +chunk = TARGET_PAGE_SIZE;
>  +}
>  +}
>  +}
>  +
>  +static void balloon_bulk_pages(ram_addr_t base_pfn, unsigned long
> >>> *bitmap,
>  +   unsigned long len, int page_shift,
>  +bool deflate) { #if defined(__linux__)
> >>>
> >>> Why do you need this #if here?
> >>>
> >> Ooh,  it is wrong to add the '#if' here, will remove.
> >>
> > No, it is needed, just follow the code in balloon_page().
> > only Linux support the madvise().
> 
> I think it is not needed anymore today and the #if in balloon_page could be
> removed, too: As far as I can see, the #if there is from the early days, when
> there was no wrapper around madvise() yet. But nowadays, we've got the
> qemu_madvise() wrapper which takes care of either using madvise(),
> posix_madvise() or doing nothing, so the virtio-balloon code should be able
> to work without the #if now.
> 
>  Thomas

You are right! I will remove both of them.

Thanks!
Liang
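
Roughly what the qemu_madvise() wrapper mentioned above does (a simplified
sketch, not the exact QEMU source), which is why the balloon code no longer
needs its own #if defined(__linux__) guard:

int qemu_madvise(void *addr, size_t len, int advice)
{
#if defined(CONFIG_MADVISE)
    return madvise(addr, len, advice);
#elif defined(CONFIG_POSIX_MADVISE)
    return posix_madvise(addr, len, advice);
#else
    /* No madvise() at all on this platform: report failure, caller copes. */
    errno = EINVAL;
    return -1;
#endif
}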





Re: [Qemu-devel] [QEMU 1/7] balloon: speed up inflating & deflating process

2016-06-14 Thread Li, Liang Z
> > On 13.06.2016 12:16, Liang Li wrote:
> > > The implementation of the current virtio-balloon is not very
> > > efficient, Bellow is test result of time spends on inflating the
> > > balloon to 3GB of a 4GB idle guest:
> > >
> > > a. allocating pages (6.5%, 103ms)
> > > b. sending PFNs to host (68.3%, 787ms) c. address translation (6.1%,
> > > 96ms) d. madvise (19%, 300ms)
> > >
> > > It takes about 1577ms for the whole inflating process to complete.
> > > The test shows that the bottle neck is the stage b and stage d.
> > >
> > > If using a bitmap to send the page info instead of the PFNs, we can
> > > reduce the overhead spends on stage b quite a lot. Furthermore, it's
> > > possible to do the address translation and do the madvise with a
> > > bulk of pages, instead of the current page per page way, so the
> > > overhead of stage c and stage d can also be reduced a lot.
> > >
> > > This patch is the QEMU side implementation which is intended to
> > > speed up the inflating & deflating process by adding a new feature
> > > to the virtio-balloon device. And now, inflating the balloon to 3GB
> > > of a 4GB idle guest only takes 210ms, it's about 8 times as fast as 
> > > before.
> > >
> > > TODO: optimize stage a by allocating/freeing a chunk of pages
> > > instead of a single page at a time.
> > >
> > > Signed-off-by: Liang Li 
> > > ---
> > >  hw/virtio/virtio-balloon.c  | 159 
> > > 
> > >  include/standard-headers/linux/virtio_balloon.h |   1 +
> > >  2 files changed, 139 insertions(+), 21 deletions(-)
> > >
> > > diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> > > index 8c15e09..8cf74c2 100644
> > > --- a/hw/virtio/virtio-balloon.c
> > > +++ b/hw/virtio/virtio-balloon.c
> > > @@ -47,6 +47,76 @@ static void balloon_page(void *addr, int deflate)
> > > #endif  }
> > >
> > > +static void do_balloon_bulk_pages(ram_addr_t base_pfn, int
> page_shift,
> > > +  unsigned long len, bool deflate) {
> > > +ram_addr_t size, processed, chunk, base;
> > > +void *addr;
> > > +MemoryRegionSection section = {.mr = NULL};
> > > +
> > > +size = (len << page_shift);
> > > +base = (base_pfn << page_shift);
> > > +
> > > +for (processed = 0; processed < size; processed += chunk) {
> > > +chunk = size - processed;
> > > +while (chunk >= TARGET_PAGE_SIZE) {
> > > +section = memory_region_find(get_system_memory(),
> > > + base + processed, chunk);
> > > +if (!section.mr) {
> > > +chunk = QEMU_ALIGN_DOWN(chunk / 2, TARGET_PAGE_SIZE);
> > > +} else {
> > > +break;
> > > +}
> > > +}
> > > +
> > > +if (section.mr &&
> > > +(int128_nz(section.size) && 
> > > memory_region_is_ram(section.mr)))
> {
> > > +addr = section.offset_within_region +
> > > +   memory_region_get_ram_ptr(section.mr);
> > > +qemu_madvise(addr, chunk,
> > > + deflate ? QEMU_MADV_WILLNEED :
> > QEMU_MADV_DONTNEED);
> > > +} else {
> > > +fprintf(stderr, "can't find the chunk, skip\n");
> >
> > Please try to avoid new fprintf(stderr, ...) in the QEMU sources.
> > Use error_report(...) or in this case maybe rather
> > qemu_log_mask(LOG_GUEST_ERROR, ...) instead, and try to use a more
> > reasonable error message (e.g. that it is clear that the error
> > happened in the balloon code).
> >
> 
> Indeed, the error message is no good, will change in next version.
> 
> > > +chunk = TARGET_PAGE_SIZE;
> > > +}
> > > +}
> > > +}
> > > +
> > > +static void balloon_bulk_pages(ram_addr_t base_pfn, unsigned long
> > *bitmap,
> > > +   unsigned long len, int page_shift,
> > > +bool deflate) { #if defined(__linux__)
> >
> > Why do you need this #if here?
> >
> 
> Ooh,  it is wrong to add the '#if' here, will remove.
No, it is needed; just follow the code in balloon_page().
Only Linux supports madvise().

Liang



Re: [Qemu-devel] [QEMU 1/7] balloon: speed up inflating & deflating process

2016-06-14 Thread Li, Liang Z
> Subject: Re: [QEMU 1/7] balloon: speed up inflating & deflating process
> 
> On 13.06.2016 12:16, Liang Li wrote:
> > The implementation of the current virtio-balloon is not very
> > efficient, Bellow is test result of time spends on inflating the
> > balloon to 3GB of a 4GB idle guest:
> >
> > a. allocating pages (6.5%, 103ms)
> > b. sending PFNs to host (68.3%, 787ms) c. address translation (6.1%,
> > 96ms) d. madvise (19%, 300ms)
> >
> > It takes about 1577ms for the whole inflating process to complete. The
> > test shows that the bottle neck is the stage b and stage d.
> >
> > If using a bitmap to send the page info instead of the PFNs, we can
> > reduce the overhead spends on stage b quite a lot. Furthermore, it's
> > possible to do the address translation and do the madvise with a bulk
> > of pages, instead of the current page per page way, so the overhead of
> > stage c and stage d can also be reduced a lot.
> >
> > This patch is the QEMU side implementation which is intended to speed
> > up the inflating & deflating process by adding a new feature to the
> > virtio-balloon device. And now, inflating the balloon to 3GB of a 4GB
> > idle guest only takes 210ms, it's about 8 times as fast as before.
> >
> > TODO: optimize stage a by allocating/freeing a chunk of pages instead
> > of a single page at a time.
> >
> > Signed-off-by: Liang Li 
> > ---
> >  hw/virtio/virtio-balloon.c  | 159 
> > 
> >  include/standard-headers/linux/virtio_balloon.h |   1 +
> >  2 files changed, 139 insertions(+), 21 deletions(-)
> >
> > diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> > index 8c15e09..8cf74c2 100644
> > --- a/hw/virtio/virtio-balloon.c
> > +++ b/hw/virtio/virtio-balloon.c
> > @@ -47,6 +47,76 @@ static void balloon_page(void *addr, int deflate)
> > #endif  }
> >
> > +static void do_balloon_bulk_pages(ram_addr_t base_pfn, int page_shift,
> > +  unsigned long len, bool deflate) {
> > +ram_addr_t size, processed, chunk, base;
> > +void *addr;
> > +MemoryRegionSection section = {.mr = NULL};
> > +
> > +size = (len << page_shift);
> > +base = (base_pfn << page_shift);
> > +
> > +for (processed = 0; processed < size; processed += chunk) {
> > +chunk = size - processed;
> > +while (chunk >= TARGET_PAGE_SIZE) {
> > +section = memory_region_find(get_system_memory(),
> > + base + processed, chunk);
> > +if (!section.mr) {
> > +chunk = QEMU_ALIGN_DOWN(chunk / 2, TARGET_PAGE_SIZE);
> > +} else {
> > +break;
> > +}
> > +}
> > +
> > +if (section.mr &&
> > +(int128_nz(section.size) && memory_region_is_ram(section.mr))) 
> > {
> > +addr = section.offset_within_region +
> > +   memory_region_get_ram_ptr(section.mr);
> > +qemu_madvise(addr, chunk,
> > + deflate ? QEMU_MADV_WILLNEED :
> QEMU_MADV_DONTNEED);
> > +} else {
> > +fprintf(stderr, "can't find the chunk, skip\n");
> 
> Please try to avoid new fprintf(stderr, ...) in the QEMU sources.
> Use error_report(...) or in this case maybe rather
> qemu_log_mask(LOG_GUEST_ERROR, ...) instead, and try to use a more
> reasonable error message (e.g. that it is clear that the error happened in the
> balloon code).
> 

Indeed, the error message is no good, will change in next version.

> > +chunk = TARGET_PAGE_SIZE;
> > +}
> > +}
> > +}
> > +
> > +static void balloon_bulk_pages(ram_addr_t base_pfn, unsigned long
> *bitmap,
> > +   unsigned long len, int page_shift,
> > +bool deflate) { #if defined(__linux__)
> 
> Why do you need this #if here?
> 

Ooh,  it is wrong to add the '#if' here, will remove.

Thanks a lot!

Liang


Re: [Qemu-devel] [QEMU 3/7] Add the hmp and qmp interface for dropping cache

2016-06-13 Thread Li, Liang Z
> * Li, Liang Z (liang.z...@intel.com) wrote:
> > > Because writing to this file is a nondestructive operation and dirty
> > > objects are not freeable, the user should run sync(1) first.
> > > [/quote]
> > >
> > > IOW, by 'slab' you mean dentries and inodes ?
> > >
> > Yes.
> >
> > > > +##
> > > > +{ 'command': 'balloon_drop_cache', 'data': {'value':
> > > > +'DropCacheType'} }
> > >
> > > Also, as noted in the man page quote above, it is recommended to
> > > call
> > > sync() to minimise dirty pages. Should we have a way to request a
> > > sync as part of this monitor command.
> > >
> > > More generally, it feels like this is taking as down a path towards
> > > actively managing the guest kernel VM from the host. Is this really
> > > a path we want to be going down, given that its going to take us
> > > into increasing non-portable concepts which are potentially
> > > different for each guest OS kernel.  Is this drop caches feature at
> > > all applicable to Windows, OS-X, *BSD guest OS impls of the balloon
> > > driver ? If it is applicable, are the 3 fixed constants you've
> >
> > No.
> >
> > > defined at all useful to those other OS ?
> > >
> >
> > Maybe they are not.
> > I agree that there are too Linux specific.  And I did more than needed.
> > Actually, I just want to drop the clean cache, do more than that is
> > too heavy and no good for performance.
> >
> > > I'm warying of us taking a design path which is so Linux specific it
> > > isn't useful elsewhere. IOW, just because we can do this, doesn't
> > > mean we should do this...
> > >
> >
> > Agree.
> 
> I can see an argument for giving the guest a hint about what's going on and
> letting the guest decide what it's going to do - so telling the guest that a
> migration is happening and you'd like it to make the hosts life easy seems
> reasonable and it doesn't make any guest OS assumptions.
> 
> Dave
> 

It seems the way I used in the previous patches is more acceptable.

Thanks!
Liang



Re: [Qemu-devel] [QEMU 3/7] Add the hmp and qmp interface for dropping cache

2016-06-13 Thread Li, Liang Z
> 
> On 13/06/2016 12:50, Daniel P. Berrange wrote:
> > More generally, it feels like this is taking as down a path towards
> > actively managing the guest kernel VM from the host. Is this really a
> > path we want to be going down, given that its going to take us into
> > increasing non-portable concepts which are potentially different for
> > each guest OS kernel.  Is this drop caches feature at all applicable
> > to Windows, OS-X, *BSD guest OS impls of the balloon driver ? If it is
> > applicable, are the 3 fixed constants you've defined at all useful to
> > those other OS ?
> >
> > I'm warying of us taking a design path which is so Linux specific it
> > isn't useful elsewhere. IOW, just because we can do this, doesn't mean
> > we should do this...
> 
> I agree.  And if anything, this should be handled through the guest agent.
> 
> Paolo

Guest agent is a good choice. Thanks!

Liang




Re: [Qemu-devel] [QEMU 3/7] Add the hmp and qmp interface for dropping cache

2016-06-13 Thread Li, Liang Z
> On Mon, Jun 13, 2016 at 11:50:08AM +0100, Daniel P. Berrange wrote:
> > On Mon, Jun 13, 2016 at 06:16:45PM +0800, Liang Li wrote:
> > > Add the hmp and qmp interface to drop vm's page cache, users can
> > > control the type of cache they want vm to drop.
> > >
> > > Signed-off-by: Liang Li 
> > > ---
> > >  balloon.c| 19 +++
> > >  hmp-commands.hx  | 15 +++
> > >  hmp.c| 22 ++
> > >  hmp.h|  3 +++
> > >  monitor.c| 18 ++
> > >  qapi-schema.json | 35 +++
> > >  qmp-commands.hx  | 23 +++
> > >  7 files changed, 135 insertions(+)
> >
> > > diff --git a/qapi-schema.json b/qapi-schema.json index
> > > 8483bdf..117f70a 100644
> > > --- a/qapi-schema.json
> > > +++ b/qapi-schema.json
> > > @@ -1655,6 +1655,41 @@
> > >  { 'command': 'balloon', 'data': {'value': 'int'} }
> > >
> > >  ##
> > > +# @DropCacheType
> > > +#
> > > +# Cache types enumeration
> > > +#
> > > +# @clean: Drop the clean page cache.
> > > +#
> > > +# @slab: Drop the slab cache.
> > > +#
> > > +# @all: Drop both the clean and the slab cache.
> > > +#
> > > +# Since: 2.7
> > > +##
> > > +{ 'enum': 'DropCacheType', 'data': ['clean', 'slab', 'all'] }
> >
> > Presumably these constants are corresponding to the 3 options for
> > vm.drop_caches sysctl knob
> >
> > [quote]
> > To free pagecache, use:
> >
> >   echo 1 > /proc/sys/vm/drop_caches
> >
> > To free dentries and inodes, use:
> >
> >   echo 2 > /proc/sys/vm/drop_caches
> >
> > To free pagecache, dentries and inodes, use:
> >
> >   echo 3 > /proc/sys/vm/drop_caches
> >
> > Because writing to this file is a nondestructive operation and dirty
> > objects are not freeable, the user should run sync(1) first.
> > [/quote]
> >
> > IOW, by 'slab' you mean dentries and inodes ?
> >
> > > +
> > > +##
> > > +# @balloon_drop_cache:
> > > +#
> > > +# Request the vm to drop its cache.
> > > +#
> > > +# @value: the type of cache want vm to drop # # Returns: Nothing on
> > > +success
> > > +#  If the balloon driver is enabled but not functional because 
> > > the
> KVM
> > > +#kernel module cannot support it, KvmMissingCap
> > > +#  If no balloon device is present, DeviceNotActive
> > > +#
> > > +# Notes: This command just issues a request to the guest.  When it
> returns,
> > > +#the drop cache operation may not have completed.  A guest can
> drop its
> > > +#cache independent of this command.
> > > +#
> > > +# Since: 2.7.0
> > > +##
> > > +{ 'command': 'balloon_drop_cache', 'data': {'value':
> > > +'DropCacheType'} }
> >
> > Also, as noted in the man page quote above, it is recommended to call
> > sync() to minimise dirty pages. Should we have a way to request a sync
> > as part of this monitor command.
> >
> > More generally, it feels like this is taking as down a path towards
> > actively managing the guest kernel VM from the host. Is this really a
> > path we want to be going down, given that its going to take us into
> > increasing non-portable concepts which are potentially different for
> > each guest OS kernel.  Is this drop caches feature at all applicable
> > to Windows, OS-X, *BSD guest OS impls of the balloon driver ? If it is
> > applicable, are the 3 fixed constants you've defined at all useful to
> > those other OS ?
> >
> > I'm warying of us taking a design path which is so Linux specific it
> > isn't useful elsewhere. IOW, just because we can do this, doesn't mean
> > we should do this...
> 
> Also, I'm wondering about the overall performance benefit of dropping guest
> cache(s). Increasing the amount of free memory pages may have a benefit in
> terms of reducing data that needs to be migrated, but it comes with a
> penalty that if the guest OS needs that data, it will have to repopulate the
> caches.
> 
> If the guest is merely reading those cached pages, it isn't going to cause any
> problem with chances of convergance of migration, as clean pages will be
> copied only once during migration. IOW, dropping clean pages will reduce the
> total memory that needs to be copied, but won't have notable affect on
> convergance of live migration. Cache pages that are dirty will potentially
> affect live migration convergance, if the guest OS re-dirties the pages before
> they're flushed to storage. Dropping caches won't help in this respect though,
> since you can't drop dirty pages. At the same time it will have a potentially
> significant negative penalty on guest OS performance by forcing the guest to
> re-populate the cache from slow underlying storage.  I don't think there's
> enough info exposed by KVM about the guest OS to be able to figure out
> what kind of situation we're in wrt the guest OS cache usage.
> 
> Based on this I think it is hard to see how a host mgmt app can make a well
> informed decision about whether telling the guest OS to drop caches is a
> positive 

Re: [Qemu-devel] [QEMU 3/7] Add the hmp and qmp interface for dropping cache

2016-06-13 Thread Li, Liang Z
> Because writing to this file is a nondestructive operation and dirty objects 
> are
> not freeable, the user should run sync(1) first.
> [/quote]
> 
> IOW, by 'slab' you mean dentries and inodes ?
> 
Yes.

> > +##
> > +{ 'command': 'balloon_drop_cache', 'data': {'value': 'DropCacheType'}
> > +}
> 
> Also, as noted in the man page quote above, it is recommended to call
> sync() to minimise dirty pages. Should we have a way to request a sync as
> part of this monitor command.
> 
> More generally, it feels like this is taking as down a path towards actively
> managing the guest kernel VM from the host. Is this really a path we want to
> be going down, given that its going to take us into increasing non-portable
> concepts which are potentially different for each guest OS kernel.  Is this
> drop caches feature at all applicable to Windows, OS-X, *BSD guest OS impls
> of the balloon driver ? If it is applicable, are the 3 fixed constants you've

No. 

> defined at all useful to those other OS ?
> 

Maybe they are not.
I agree that they are too Linux specific, and I did more than needed.
Actually, I just want to drop the clean cache; doing more than that is too
heavy and not good for performance.

> I'm warying of us taking a design path which is so Linux specific it isn't 
> useful
> elsewhere. IOW, just because we can do this, doesn't mean we should do
> this...
> 

Agree.

Thanks!

Liang
> Regards,
> Daniel
> --
> |: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
> |: http://libvirt.org  -o- http://virt-manager.org :|
> |: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
> |: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|


Re: [Qemu-devel] [PATCH v2 2/9] migration: Fix a potential issue

2016-06-12 Thread Li, Liang Z
> > > > +static void wait_for_decompress_done(void) {
> > > > +int idx, thread_count;
> > > > +
> > > > +if (!migrate_use_compression()) {
> > > > +return;
> > > > +}
> > > > +
> > > > +thread_count = migrate_decompress_threads();
> > > > +qemu_mutex_lock(&decomp_done_lock);
> > > > +for (idx = 0; idx < thread_count; idx++) {
> > > > +while (!decomp_param[idx].done) {
> > > > +qemu_cond_wait(&decomp_done_cond, &decomp_done_lock);
> > > > +}
> > > > +}
> > > > +qemu_mutex_unlock(&decomp_done_lock);
> > >
> > > Not sure how this works: in the previous patch, done is set to false
> > > under the decomp_done_lock.  Here, we take the lock, and wait for done
> to turn false.
> > > That can't happen because this thread holds the lock.
> > > My reading is this is going to lead to a deadlock.  What am I missing?
> > >
> >
> > This is the typical usage of the QemuCond, actually, in
> > qemu_cond_wait() , decomp_done_lock will be unlocked at first and then
> > locked again before
> > qemu_cond_wait() return.  So deadlock won't happen.
> 
> In qemu-thread-posix.c, I don't see such unlock/lock.
> 
> 
>   Amit

I mean in pthread_cond_wait(), which is called by qemu_cond_wait().

Liang



Re: [Qemu-devel] [PATCH v2 2/9] migration: Fix a potential issue

2016-06-10 Thread Li, Liang Z
> Subject: Re: [PATCH v2 2/9] migration: Fix a potential issue
> 
> On (Thu) 05 May 2016 [15:32:52], Liang Li wrote:
> > At the end of live migration and before vm_start() on the destination
> > side, we should make sure all the decompression tasks are finished, if
> > this can not be guaranteed, the VM may get the incorrect memory data,
> > or the updated memory may be overwritten by the decompression thread.
> > Add the code to fix this potential issue.
> >
> > Suggested-by: David Alan Gilbert 
> > Suggested-by: Juan Quintela 
> > Signed-off-by: Liang Li 
> > ---
> >  migration/ram.c | 19 +++
> >  1 file changed, 19 insertions(+)
> >
> > diff --git a/migration/ram.c b/migration/ram.c index 7ab6ab5..8a59a08
> > 100644
> > --- a/migration/ram.c
> > +++ b/migration/ram.c
> > @@ -2220,6 +2220,24 @@ static void *do_data_decompress(void *opaque)
> >  return NULL;
> >  }
> >
> > +static void wait_for_decompress_done(void) {
> > +int idx, thread_count;
> > +
> > +if (!migrate_use_compression()) {
> > +return;
> > +}
> > +
> > +thread_count = migrate_decompress_threads();
> > +qemu_mutex_lock(&decomp_done_lock);
> > +for (idx = 0; idx < thread_count; idx++) {
> > +while (!decomp_param[idx].done) {
> > +qemu_cond_wait(&decomp_done_cond, &decomp_done_lock);
> > +}
> > +}
> > +qemu_mutex_unlock(&decomp_done_lock);
> 
> Not sure how this works: in the previous patch, done is set to false under the
> decomp_done_lock.  Here, we take the lock, and wait for done to turn false.
> That can't happen because this thread holds the lock.
> My reading is this is going to lead to a deadlock.  What am I missing?
> 
> 
>   Amit

This is the typical usage of a QemuCond: in qemu_cond_wait(), decomp_done_lock
is unlocked first and then locked again before qemu_cond_wait() returns, so a
deadlock won't happen.

Liang
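
A minimal sketch of the wait pattern being discussed (not the actual migration
code), showing why there is no deadlock: qemu_cond_wait() atomically drops the
mutex while the caller sleeps and re-acquires it before returning:

#include "qemu/osdep.h"
#include "qemu/thread.h"

static QemuMutex done_lock;     /* qemu_mutex_init(&done_lock) elsewhere */
static QemuCond  done_cond;     /* qemu_cond_init(&done_cond) elsewhere */
static bool      done;

static void wait_for_done(void)
{
    qemu_mutex_lock(&done_lock);
    while (!done) {
        /* Unlocks done_lock, sleeps, re-locks it before returning. */
        qemu_cond_wait(&done_cond, &done_lock);
    }
    qemu_mutex_unlock(&done_lock);
}

static void mark_done(void)
{
    qemu_mutex_lock(&done_lock);
    done = true;
    qemu_cond_signal(&done_cond);
    qemu_mutex_unlock(&done_lock);
}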



Re: [Qemu-devel] [PATCH RFC v2 kernel] balloon: speed up inflating/deflating process

2016-06-07 Thread Li, Liang Z
Ping...

Liang


> -Original Message-
> From: Li, Liang Z
> Sent: Wednesday, June 01, 2016 10:41 AM
> To: linux-ker...@vger.kernel.org
> Cc: k...@vger.kernel.org; qemu-devel@nongnu.org; Michael S. Tsirkin;
> Paolo Bonzini; Cornelia Huck; Amit Shah
> Subject: RE: [PATCH RFC v2 kernel] balloon: speed up inflating/deflating
> process
> 
> Hi MST,
> 
> About the size of page bitmap, I have test the performance of filling the
> balloon to 15GB with a 16GB RAM VM.
> 
> ===
> 32K Byte (cover 1GB of RAM)
> 
> Time spends on inflating: 2031ms
> -
> 64K Byte (cover 2GB of RAM)
> 
> Time spends on inflating: 1507ms
> 
> 512K Byte (cover 16GB of RAM)
> 
> Time spends on inflating: 1237ms
> 
> 
> It shows the performance is better if using a larger page bitmap, should we
> use a 64K/ 128KB page bitmap for a better  balance between the
> performance and the memory consumption?
> 
> BTW, about the VIRTIO_BALLOON_F_DEFLATE_ON_OOM feature, I found
> the configurable  'oom_pages'
> will be limited by the ARRAY_SIZE(vb->pfns), which means only 1MB of RAM
> will be recycled each time even the users want more, is that too conservative?
> 
> Liang
> 
> 
> > -Original Message-
> > From: Li, Liang Z
> > Sent: Friday, May 27, 2016 6:34 PM
> > To: linux-ker...@vger.kernel.org
> > Cc: k...@vger.kernel.org; qemu-devel@nongnu.org; Li, Liang Z; Michael S.
> > Tsirkin; Paolo Bonzini; Cornelia Huck; Amit Shah
> > Subject: [PATCH RFC v2 kernel] balloon: speed up inflating/deflating
> > process
> >
> > The implementation of the current virtio-balloon is not very
> > efficient, Bellow is test result of time spends on inflating the
> > balloon to 3GB of a 4GB idle
> > guest:
> >
> > a. allocating pages (6.5%, 103ms)
> > b. sending PFNs to host (68.3%, 787ms) c. address translation (6.1%,
> > 96ms) d. madvise (19%, 300ms)
> >
> > It takes about 1577ms for the whole inflating process to complete. The
> > test shows that the bottle neck is the stage b and stage d.
> >
> > If using a bitmap to send the page info instead of the PFNs, we can
> > reduce the overhead spends on stage b quite a lot. Furthermore, it's
> > possible to do the address translation and do the madvise with a bulk
> > of pages, instead of the current page per page way, so the overhead of
> > stage c and stage d can also be reduced a lot.
> >
> > This patch is the kernel side implementation which is intended to
> > speed up the inflating & deflating process by adding a new feature to
> > the virtio-balloon device. And now, inflating the balloon to 3GB of a
> > 4GB idle guest only takes 200ms, it's about 8 times as fast as before.
> >
> > TODO: optimize stage a by allocating/freeing a chunk of pages instead
> > of a single page at a time.
> >
> > v2 change:
> > 1. Use a small page bitmap instead of a large one.
> > 2. Address some of comments of v1.
> >
> > Signed-off-by: Liang Li <liang.z...@intel.com>
> > Suggested-by: Michael S. Tsirkin <m...@redhat.com>
> > Cc: Michael S. Tsirkin <m...@redhat.com>
> > Cc: Paolo Bonzini <pbonz...@redhat.com>
> > Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
> > Cc: Amit Shah <amit.s...@redhat.com>
> > ---
> >  drivers/virtio/virtio_balloon.c | 207
> > ++--
> >  include/uapi/linux/virtio_balloon.h |   1 +
> >  2 files changed, 200 insertions(+), 8 deletions(-)
> >
> > diff --git a/drivers/virtio/virtio_balloon.c
> > b/drivers/virtio/virtio_balloon.c index 476c0e3..823b4e4 100644
> > --- a/drivers/virtio/virtio_balloon.c
> > +++ b/drivers/virtio/virtio_balloon.c
> > @@ -40,11 +40,19 @@
> >  #define VIRTIO_BALLOON_ARRAY_PFNS_MAX 256  #define
> > OOM_VBALLOON_DEFAULT_PAGES 256  #define
> > VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
> > +#define VIRTIO_BALLOON_PFNS_LIMIT ((2 * (1ULL << 30)) >>
> PAGE_SHIFT)
> > /*
> > +2GB */
> >
> >  static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
> > module_param(oom_pages, int, S_IRUSR | S_IWUSR);
> > MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
> >
> > +struct balloon_bmap_hdr {
> > +   __virtio32 type;
> > +   __virtio32 page_shift;
> > +   __virtio64 start_pfn;
> > +   __virtio64 bmap_len;
> > +};
> > +
> >  struct virtio_balloon {
> > struct virtio_dev

Re: [Qemu-devel] [PATCH v2 0/2] AVX2 configure fixes

2016-06-02 Thread Li, Liang Z
> Cc: amit.s...@redhat.com
> Subject: [PATCH v2 0/2] AVX2 configure fixes
> 
> From: "Dr. David Alan Gilbert" 
> 
> Hi,
>   This pair of patches fixes a couple of issues that we found during test.
> The first is that the configure test is pessimistic when compiled with -O2, 
> the
> second is that the explicit 4.9 gcc test is a bit coarse; I've removed that 
> test
> but beefed up the ./configure test to actually use the avx2 intrinsics and 
> that
> fails in the same way as the main code, so it works in the same way as the
> explicit check but allows older gcc's to work when -save-temps isn't used.
> 
> Dave
> 
> v2
>   Remove the explicit version check
>   Split the patches
> 
> Dr. David Alan Gilbert (2):
>   Make avx2 configure test work with -O2
>   avx2 configure: Use primitives in test
> 
>  configure | 15 +++
>  util/cutils.c |  8 +---
>  2 files changed, 12 insertions(+), 11 deletions(-)
> 
> --
> 2.7.4

Looks good. Thanks!

Reviewed-by: Liang Li 

Liang



Re: [Qemu-devel] [PATCH RFC v2 kernel] balloon: speed up inflating/deflating process

2016-05-31 Thread Li, Liang Z
Hi MST,

About the size of the page bitmap, I have tested the performance of filling
the balloon to 15GB on a VM with 16GB of RAM.

===
32K bytes (covers 1GB of RAM)

Time spent on inflating: 2031ms
-
64K bytes (covers 2GB of RAM)

Time spent on inflating: 1507ms
-
512K bytes (covers 16GB of RAM)

Time spent on inflating: 1237ms


It shows that the performance is better with a larger page bitmap. Should we
use a 64KB/128KB page bitmap for a better balance between performance and
memory consumption?

BTW, about the VIRTIO_BALLOON_F_DEFLATE_ON_OOM feature: I found that the
configurable 'oom_pages' is limited by ARRAY_SIZE(vb->pfns), which means only
1MB of RAM will be recycled each time even if the users want more. Is that too
conservative?
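
(As a quick check of the 1MB figure, assuming the stock driver constants
quoted below: leak_balloon() clamps each call to ARRAY_SIZE(vb->pfns) =
VIRTIO_BALLOON_ARRAY_PFNS_MAX = 256 PFNs, and 256 pages * 4KB per page = 1MB
released per OOM notification, however large 'oom_pages' is set.)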

Liang


> -Original Message-
> From: Li, Liang Z
> Sent: Friday, May 27, 2016 6:34 PM
> To: linux-ker...@vger.kernel.org
> Cc: k...@vger.kernel.org; qemu-devel@nongnu.org; Li, Liang Z; Michael S.
> Tsirkin; Paolo Bonzini; Cornelia Huck; Amit Shah
> Subject: [PATCH RFC v2 kernel] balloon: speed up inflating/deflating process
> 
> The implementation of the current virtio-balloon is not very efficient, Bellow
> is test result of time spends on inflating the balloon to 3GB of a 4GB idle
> guest:
> 
> a. allocating pages (6.5%, 103ms)
> b. sending PFNs to host (68.3%, 787ms)
> c. address translation (6.1%, 96ms)
> d. madvise (19%, 300ms)
> 
> It takes about 1577ms for the whole inflating process to complete. The test
> shows that the bottle neck is the stage b and stage d.
> 
> If using a bitmap to send the page info instead of the PFNs, we can reduce
> the overhead spends on stage b quite a lot. Furthermore, it's possible to do
> the address translation and do the madvise with a bulk of pages, instead of
> the current page per page way, so the overhead of stage c and stage d can
> also be reduced a lot.
> 
> This patch is the kernel side implementation which is intended to speed up
> the inflating & deflating process by adding a new feature to the 
> virtio-balloon
> device. And now, inflating the balloon to 3GB of a 4GB idle guest only takes
> 200ms, it's about 8 times as fast as before.
> 
> TODO: optimize stage a by allocating/freeing a chunk of pages instead of a
> single page at a time.
> 
> v2 change:
> 1. Use a small page bitmap instead of a large one.
> 2. Address some of comments of v1.
> 
> Signed-off-by: Liang Li <liang.z...@intel.com>
> Suggested-by: Michael S. Tsirkin <m...@redhat.com>
> Cc: Michael S. Tsirkin <m...@redhat.com>
> Cc: Paolo Bonzini <pbonz...@redhat.com>
> Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
> Cc: Amit Shah <amit.s...@redhat.com>
> ---
>  drivers/virtio/virtio_balloon.c | 207
> ++--
>  include/uapi/linux/virtio_balloon.h |   1 +
>  2 files changed, 200 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 476c0e3..823b4e4 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -40,11 +40,19 @@
>  #define VIRTIO_BALLOON_ARRAY_PFNS_MAX 256  #define
> OOM_VBALLOON_DEFAULT_PAGES 256  #define
> VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
> +#define VIRTIO_BALLOON_PFNS_LIMIT ((2 * (1ULL << 30)) >> PAGE_SHIFT)
> /*
> +2GB */
> 
>  static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
> module_param(oom_pages, int, S_IRUSR | S_IWUSR);
> MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
> 
> +struct balloon_bmap_hdr {
> + __virtio32 type;
> + __virtio32 page_shift;
> + __virtio64 start_pfn;
> + __virtio64 bmap_len;
> +};
> +
>  struct virtio_balloon {
>   struct virtio_device *vdev;
>   struct virtqueue *inflate_vq, *deflate_vq, *stats_vq; @@ -62,6
> +70,13 @@ struct virtio_balloon {
> 
>   /* Number of balloon pages we've told the Host we're not using. */
>   unsigned int num_pages;
> + /* Bitmap and length used to tell the host the pages */
> + unsigned long *page_bitmap;
> + unsigned long bmap_len;
> + /* Used to record the processed pfn range */
> + unsigned long min_pfn, max_pfn, start_pfn, end_pfn;
> + /* Used for sending page bitmap and header */
> + struct scatterlist sg[2];
>   /*
>* The pages we've told the Host we're not using are enqueued
>* at vb_dev_info->pages list.
> @@ -111,15 +126,39 @@ static void balloon_ack(struct virtqueue *vq)
>   wake_up(&vb->acked);
>  }
> 
> +static inline void

Re: [Qemu-devel] [PATCH RFC kernel] balloon: speed up inflating/deflating process

2016-05-25 Thread Li, Liang Z
> > > Interesting. How about instead of tell host, we do multiple scans,
> > > each time ignoring pages out of range?
> > >
> > >   for (pfn = min pfn; pfn < max pfn; pfn += 1G) {
> > >   foreach page
> > >   if page pfn < pfn || page pfn >= pfn + 1G
> > >   continue
> > >   set bit
> > >   tell host
> > >   }
> > >
> >
> > That means we have to allocate/free all the requested pages first, and then
> tell the host.
> > It works fine for inflating, but for deflating, because the page has
> > been deleted from the vb-> vb_dev_info->pages, so, we have to use a
> > struct to save the dequeued pages before calling
> > release_pages_balloon(),
> 
> struct list_head?  I think you can just replace set_page_pfns with
> list_add(>lru, _list);
> 
> 

That's fine, I will retry and get back to you.

Thanks!

Liang
> 
> >  I think a page bitmap is the best struct to save these pages, because it
> consumes less memory.
> > And that bitmap should be large enough to save pfn 0 to max_pfn.
> >
> > If the above is true, then we are back to the square one. we really need a
> large page bitmap. Right?
> >
> > Liang
> 
> These look like implementation issues to me.
> 
> I think the below might be helpful (completely untested), your work can go
> on top.
> 
> --->
> 
> virtio-balloon: rework deflate to add page to a tmp list
> 
> Will allow faster notifications using a bitmap down the road.
> 
> Signed-off-by: Michael S. Tsirkin 
> 
> ---
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 476c0e3..44050a3 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -195,8 +195,9 @@ static void release_pages_balloon(struct
> virtio_balloon *vb)  static unsigned leak_balloon(struct virtio_balloon *vb,
> size_t num)  {
>   unsigned num_freed_pages;
> - struct page *page;
> + struct page *page, *next;
>   struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
> + LIST_HEAD(pages);   /* Pages dequeued for handing to
> Host */
> 
>   /* We can only do one array worth at a time. */
>   num = min(num, ARRAY_SIZE(vb->pfns));
> @@ -207,10 +208,13 @@ static unsigned leak_balloon(struct virtio_balloon
> *vb, size_t num)
>   page = balloon_page_dequeue(vb_dev_info);
>   if (!page)
>   break;
> - set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> + list_add(&page->lru, &pages);
>   vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
>   }
> 
> + list_for_each_entry_safe(page, next, &pages, lru)
> + set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> +
>   num_freed_pages = vb->num_pfns;
>   /*
>* Note that if
> --
> MST



Re: [Qemu-devel] [PATCH RFC kernel] balloon: speed up inflating/deflating process

2016-05-25 Thread Li, Liang Z
> > > >
> > > > Hi MST,
> > > >
> > > > I have measured the performance when using a 32K page bitmap,
> > >
> > > Just to make sure. Do you mean a 32Kbyte bitmap?
> > > Covering 1Gbyte of memory?
> > Yes.
> >
> > >
> > > > and inflate the balloon to 3GB
> > > > of an idle guest with 4GB RAM.
> > >
> > > Should take 3 requests then, right?
> > >
> >
> > No,  we can't assign the PFN when allocating page in balloon driver,
> > So the PFNs of pages allocated may be across a large range,  we will
> > tell the host once the pfn_max -pfn_min >= 0x4(1GB range), so the
> > requests count is most likely to be more than 3.
> >
> > > > Now:
> > > > total inflating time: 338ms
> > > > the count of virtio data transmission:  373
> > >
> > > Why was this so high? I would expect 3 transmissions.
> >
> > I follow your suggestion:
> > --
> > -- Suggestion to address all above comments:
> > 1. allocate a bunch of pages and link them up,
> >calculating the min and the max pfn.
> >if max-min exceeds the allocated bitmap size,
> >tell host.
> > 2. limit allocated bitmap size to something reasonable.
> >How about 32Kbytes? This is 256kilo bit in the map, which comes
> >out to 1Giga bytes of memory in the balloon.
> > --
> > --- Because the PFNs of the allocated pages are not linear
> > increased, so 3 transmissions are  impossible.
> >
> >
> > Liang
> 
> Interesting. How about instead of tell host, we do multiple scans, each time
> ignoring pages out of range?
> 
>   for (pfn = min pfn; pfn < max pfn; pfn += 1G) {
>   foreach page
>   if page pfn < pfn || page pfn >= pfn + 1G
>   continue
>   set bit
>   tell host
>   }
> 

That means we have to allocate/free all the requested pages first, and then
tell the host.
It works fine for inflating, but for deflating, because the pages have already
been deleted from vb->vb_dev_info->pages, we have to use some structure to
save the dequeued pages before calling release_pages_balloon().
I think a page bitmap is the best structure to save these pages in, because it
consumes less memory. And that bitmap should be large enough to cover pfn 0 to
max_pfn.

If the above is true, then we are back to square one: we really need a large
page bitmap. Right?

Liang
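
A sketch, in kernel style, of the multi-scan loop quoted above; tell_host() is
a hypothetical helper, and the bitmap field follows the RFC patch
(vb->page_bitmap sized for one 1GB window at a time):

#include <linux/bitmap.h>
#include <linux/list.h>
#include <linux/mm.h>

#define WINDOW_PFNS (1UL << 18)            /* 1GB worth of 4KB pages */

static void tell_host_by_window(struct virtio_balloon *vb,
                                struct list_head *pages,
                                unsigned long min_pfn, unsigned long max_pfn)
{
    unsigned long base;
    struct page *page;

    for (base = min_pfn; base <= max_pfn; base += WINDOW_PFNS) {
        bitmap_zero(vb->page_bitmap, WINDOW_PFNS);
        list_for_each_entry(page, pages, lru) {
            unsigned long pfn = page_to_pfn(page);

            if (pfn < base || pfn >= base + WINDOW_PFNS)
                continue;                  /* handled in another pass */
            set_bit(pfn - base, vb->page_bitmap);
        }
        tell_host(vb, base, WINDOW_PFNS);  /* hypothetical */
    }
}

The cost is that the page list is walked once per 1GB window, in exchange for
keeping the bitmap small.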





Re: [Qemu-devel] [PATCH RFC kernel] balloon: speed up inflating/deflating process

2016-05-25 Thread Li, Liang Z
> On Wed, May 25, 2016 at 08:48:17AM +0000, Li, Liang Z wrote:
> > > > > Suggestion to address all above comments:
> > > > >   1. allocate a bunch of pages and link them up,
> > > > >  calculating the min and the max pfn.
> > > > >  if max-min exceeds the allocated bitmap size,
> > > > >  tell host.
> > > >
> > > > I am not sure if it works well in some cases, e.g. The allocated
> > > > pages are across a wide range and the max-min > limit is very
> > > > frequently to be
> > > true.
> > > > Then, there will be many times of virtio transmission and it's bad
> > > > for performance improvement. Right?
> > >
> > > It's a tradeoff for sure. Measure it, see what the overhead is.
> > >
> >
> > Hi MST,
> >
> > I have measured the performance when using a 32K page bitmap,
> 
> Just to make sure. Do you mean a 32Kbyte bitmap?
> Covering 1Gbyte of memory?
Yes.

> 
> > and inflate the balloon to 3GB
> > of an idle guest with 4GB RAM.
> 
> Should take 3 requests then, right?
> 

No, we can't control which PFNs we get when allocating pages in the balloon
driver, so the PFNs of the allocated pages may be spread across a large range.
We will tell the host once pfn_max - pfn_min >= 0x4(1GB range),
so the request count is most likely to be more than 3.


> > Now:
> > total inflating time: 338ms
> > the count of virtio data transmission:  373
> 
> Why was this so high? I would expect 3 transmissions.

I followed your suggestion:

Suggestion to address all above comments:
1. allocate a bunch of pages and link them up,
   calculating the min and the max pfn.
   if max-min exceeds the allocated bitmap size,
   tell host.
2. limit allocated bitmap size to something reasonable.
   How about 32Kbytes? This is 256kilo bit in the map, which comes
   out to 1Giga bytes of memory in the balloon.
-
Because the PFNs of the allocated pages do not increase linearly, 3
transmissions are impossible.


Liang

> 
> > the call count of madvise: 865
> >
> > before:
> > total inflating time: 175ms
> > the count of virtio data transmission: 1 the call count of madvise: 42
> >
> > Maybe the result will be worse if the guest is not idle, or the guest has
> more RAM.
> > Do you want more data?
> >
> > Is it worth to do that?
> >
> > Liang
> 
> Either my math is wrong or there's an implementation bug.
> 
> > > >
> > > > >   2. limit allocated bitmap size to something reasonable.
> > > > >  How about 32Kbytes? This is 256kilo bit in the map, which 
> > > > > comes
> > > > >  out to 1Giga bytes of memory in the balloon.
> > > >
> > > > So, even the VM has 1TB of RAM, the page bitmap will take 32MB of
> > > memory.
> > > > Maybe it's better to use a big page bitmap the save the pages
> > > > allocated by balloon, and split the big page bitmap to 32K bytes
> > > > unit, then
> > > transfer one unit at a time.
> > >
> > > How is this different from what I said?
> > >
> > > >
> > > > Should we use a page bitmap to replace 'vb->pages' ?
> > > >
> > > > How about rolling back to use PFNs if the count of requested pages
> > > > is a
> > > small number?
> > > >
> > > > Liang
> > >
> > > That's why we have start pfn. you can use that to pass even a single
> > > page without a lot of overhead.
> > >
> > > > > > --
> > > > > > 1.9.1
> > > > > --
> > > > > To unsubscribe from this list: send the line "unsubscribe kvm"
> > > > > in the body of a message to majord...@vger.kernel.org More
> > > > > majordomo info at http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in the body of
> a message to majord...@vger.kernel.org More majordomo info at
> http://vger.kernel.org/majordomo-info.html


