Re: [PATCH v18 07/10] virtio-balloon: VIRTIO_BALLOON_F_SG

2017-12-03 Thread Wei Wang

On 12/01/2017 11:38 PM, Michael S. Tsirkin wrote:

On Wed, Nov 29, 2017 at 09:55:23PM +0800, Wei Wang wrote:

+static void send_one_desc(struct virtio_balloon *vb,
+ struct virtqueue *vq,
+ uint64_t addr,
+ uint32_t len,
+ bool inbuf,
+ bool batch)
+{
+   int err;
+   unsigned int size;
+
+   /* Detach all the used buffers from the vq */
+   while (virtqueue_get_buf(vq, &size))
+   ;
+
+   err = virtqueue_add_one_desc(vq, addr, len, inbuf, vq);
+   /*
+* This is expected to never fail: there is always at least 1 entry
+* available on the vq, because when the vq is full the worker thread
+* that adds the desc will be put into sleep until at least 1 entry is
+* available to use.
+*/
+   BUG_ON(err);
+
+   /* If batching is requested, we batch till the vq is full */
+   if (!batch || !vq->num_free)
+   kick_and_wait(vq, vb->acked);
+}
+

This internal kick complicates callers. I suggest that instead,
you move this to callers, just return a "kick required" boolean.
This way callers do not need to play with num_free at all.


Then in what situation would the function return true for "kick required"?

I think this wouldn't make a difference fundamentally. For example, suppose we
have 257 sgs (batching size=256) to send to the host:


while (i < 257) {
        kick_required = send_sgs();
        if (kick_required)
                kick(); /* after the 256 sgs have been added, the caller performs a kick() */
}

Do we still need a kick here for the 257th sg, as before? Only the caller
knows whether the last added sgs need a kick (when send_sgs() receives one
sg, it doesn't know if there are more to come).
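For illustration, here is a userspace sketch of that caller-driven kick pattern. The names (vq_add, vq_kick, send_all, VQ_SIZE) are hypothetical stand-ins, not the virtio_ring API; the point is that with 257 descriptors and a 256-entry ring, the caller still has to issue one final kick for the trailing descriptor:

```c
#include <stdbool.h>

#define VQ_SIZE 256

struct vq { int num_free; int kicks; };

/* Model of the suggested interface: adding a desc consumes one free
 * entry and returns true when the caller should kick (ring full). */
static bool vq_add(struct vq *vq)
{
        vq->num_free--;
        return vq->num_free == 0;
}

static void vq_kick(struct vq *vq)
{
        vq->kicks++;
        vq->num_free = VQ_SIZE; /* pretend the host consumed everything */
}

/* Caller sends n descriptors, kicking when told to, and once more at
 * the end if anything was added since the last kick. */
static int send_all(struct vq *vq, int n)
{
        int added_since_kick = 0;
        int i;

        for (i = 0; i < n; i++) {
                added_since_kick++;
                if (vq_add(vq)) {
                        vq_kick(vq);
                        added_since_kick = 0;
                }
        }
        if (added_since_kick)
                vq_kick(vq); /* the trailing-descriptor kick discussed above */
        return vq->kicks;
}
```

With n=257 this yields two kicks: one when the ring fills at 256, and one that the caller must add for the 257th descriptor.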


There is another approach to checking whether the last added sgs have been
synced to the host: expose "vring_virtqueue->num_added" to the caller
via a virtio_ring API:


unsigned int virtqueue_num_added(struct virtqueue *_vq)
{
        struct vring_virtqueue *vq = to_vvq(_vq);

        return vq->num_added;
}




+/*
+ * Send balloon pages in sgs to host. The balloon pages are recorded in the
+ * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
+ * The page xbitmap is searched for continuous "1" bits, which correspond
+ * to continuous pages, to chunk into sgs.
+ *
+ * @page_xb_start and @page_xb_end form the range of bits in the xbitmap that
+ * need to be searched.
+ */
+static void tell_host_sgs(struct virtio_balloon *vb,
+ struct virtqueue *vq,
+ unsigned long page_xb_start,
+ unsigned long page_xb_end)
+{
+   unsigned long pfn_start, pfn_end;
+   uint64_t addr;
+   uint32_t len, max_len = round_down(UINT_MAX, PAGE_SIZE);
+
+   pfn_start = page_xb_start;
+   while (pfn_start < page_xb_end) {
+   pfn_start = xb_find_next_set_bit(&vb->page_xb, pfn_start,
+page_xb_end);
+   if (pfn_start == page_xb_end + 1)
+   break;
+   pfn_end = xb_find_next_zero_bit(&vb->page_xb,
+   pfn_start + 1,
+   page_xb_end);
+   addr = pfn_start << PAGE_SHIFT;
+   len = (pfn_end - pfn_start) << PAGE_SHIFT;

This assignment can overflow. The next line compares with UINT_MAX, but by
that time it is too late. I think you should do all the math in 64 bit to
avoid surprises, then truncate to max_len, and then it's safe to assign
to the sg.


Sounds reasonable, thanks.
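A minimal userspace sketch of that fix (sg_len and the constants here are illustrative, not the driver code): do the subtraction and shift in 64 bits, cap at max_len, and only then truncate to 32 bits:

```c
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* Compute an sg length capped at max_len, doing the math in 64 bits so
 * that (pfn_end - pfn_start) << PAGE_SHIFT cannot silently wrap in a
 * 32-bit variable before the comparison happens. */
static uint32_t sg_len(uint64_t pfn_start, uint64_t pfn_end)
{
        uint64_t len = (pfn_end - pfn_start) << PAGE_SHIFT; /* 64-bit math */
        uint64_t max_len = UINT32_MAX & ~(PAGE_SIZE - 1);   /* round_down(UINT_MAX, PAGE_SIZE) */

        if (len > max_len)
                len = max_len;
        return (uint32_t)len; /* safe: len <= max_len < 2^32 */
}
```

A range of 0x100000 pfns (4GB) would wrap to 0 in 32-bit math; here it is correctly capped at 0xFFFFF000.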



+
+   xb_clear_bit_range(&vb->page_xb, page_xb_start, page_xb_end);
+}
+
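The scan in tell_host_sgs() can be modeled in plain userspace C to show how runs of set bits become sg chunks. next_set(), next_zero() and count_chunks() below are toy stand-ins for the xb_* helpers (including xb_find_next_set_bit()'s end+1 miss value), not the real xbitmap API:

```c
#include <stdint.h>

#define NBITS 64

/* Toy xb_find_next_set_bit(): first set bit in [from, end), else end + 1. */
static unsigned long next_set(uint64_t map, unsigned long from, unsigned long end)
{
        unsigned long i;

        for (i = from; i < end; i++)
                if (map & (1ULL << i))
                        return i;
        return end + 1;
}

/* Toy xb_find_next_zero_bit(): first clear bit in [from, end), else end. */
static unsigned long next_zero(uint64_t map, unsigned long from, unsigned long end)
{
        unsigned long i;

        for (i = from; i < end; i++)
                if (!(map & (1ULL << i)))
                        return i;
        return end;
}

/* Count how many sg chunks (maximal runs of set bits) the bitmap yields. */
static int count_chunks(uint64_t map, unsigned long start, unsigned long end)
{
        int chunks = 0;
        unsigned long pfn = start;

        while (pfn < end) {
                pfn = next_set(map, pfn, end);
                if (pfn == end + 1)
                        break;
                pfn = next_zero(map, pfn + 1, end); /* one sg covers [set, zero) */
                chunks++;
        }
        return chunks;
}
```

For example, a bitmap with pfns 0-2 and 5-6 set (0x67) produces two sgs, matching the "continuous '1' bits become one chunk" description above.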
+static inline int xb_set_page(struct virtio_balloon *vb,
+  struct page *page,
+  unsigned long *pfn_min,
+  unsigned long *pfn_max)
+{
+   unsigned long pfn = page_to_pfn(page);
+   int ret;
+
+   *pfn_min = min(pfn, *pfn_min);
+   *pfn_max = max(pfn, *pfn_max);
+
+   do {
+   ret = xb_preload_and_set_bit(&vb->page_xb, pfn,
+GFP_NOWAIT | __GFP_NOWARN);
+   } while (unlikely(ret == -EAGAIN));

what exactly does this loop do? Does this wait
forever until there is some free memory? why GFP_NOWAIT?


Basically, "-EAGAIN" is returned from xb_set_bit() in the case when the 
pre-allocated per-cpu ida_bitmap is NULL. In that case, the caller 
re-invokes xb_preload_and_set_bit(), which re-invokes xb_preload to 
allocate ida_bitmap. So "-EAGAIN" actually does not indicate a status 
about memory allocation. "-ENOMEM" is the one to indicate the failure of 
memory allocation, but the loop doesn't re-try on "-ENOMEM".


GFP_NOWAIT is used to avoid memory reclaim, which could cause the
deadlock issue we discussed before.
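A toy userspace model of that return contract (fake_preload_and_set_bit() and the global below are illustrative, not the xbitmap code): -EAGAIN means the per-cpu ida_bitmap was missing and a retry is expected to succeed after the preload fills it, while -ENOMEM is a real failure that the loop must not spin on:

```c
#include <errno.h>

/* Simulated state: "the pre-allocated per-cpu ida_bitmap is NULL". */
static int preload_missing = 1;

static int fake_preload_and_set_bit(int fail_alloc)
{
        if (fail_alloc)
                return -ENOMEM;      /* real allocation failure: propagated */
        if (preload_missing) {
                preload_missing = 0; /* the retry's preload fills the bitmap */
                return -EAGAIN;      /* transient: caller should retry */
        }
        return 0;
}

/* The driver's loop: retry only on -EAGAIN, never on -ENOMEM. */
static int set_bit_retrying(int fail_alloc)
{
        int ret;

        do {
                ret = fake_preload_and_set_bit(fail_alloc);
        } while (ret == -EAGAIN);
        return ret;
}
```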







Re: [PATCH v18 07/10] virtio-balloon: VIRTIO_BALLOON_F_SG

2017-12-01 Thread Michael S. Tsirkin
On Wed, Nov 29, 2017 at 09:55:23PM +0800, Wei Wang wrote:
> Add a new feature, VIRTIO_BALLOON_F_SG, which enables the transfer of
> balloon (i.e. inflated/deflated) pages using scatter-gather lists to the
> host. A scatter-gather list is described by a vring desc.
> 
> The implementation of the previous virtio-balloon is not very efficient,
> because the balloon pages are transferred to the host by one array each
> time. Here is the breakdown of the time in percentage spent on each step
> of the balloon inflating process (inflating 7GB of an 8GB idle guest).
> 
> 1) allocating pages (6.5%)
> 2) sending PFNs to host (68.3%)
> 3) address translation (6.1%)
> 4) madvise (19%)
> 
> It takes about 4126ms for the inflating process to complete. The above
> profiling shows that the bottlenecks are stage 2) and stage 4).
> 
> This patch optimizes step 2) by transferring pages to host in sgs. An sg
> describes a chunk of guest physically continuous pages. With this
> mechanism, step 4) can also be optimized by doing address translation and
> madvise() in chunks rather than page by page.
> 
> With this new feature, the above ballooning process takes ~440ms resulting
> in an improvement of ~89%.
> 
> TODO: optimize stage 1) by allocating/freeing a chunk of pages instead of
> a single page each time.
> 
> Signed-off-by: Wei Wang 
> Signed-off-by: Liang Li 
> Suggested-by: Michael S. Tsirkin 
> Cc: Tetsuo Handa 
> ---
>  drivers/virtio/virtio_balloon.c | 230 +---
>  include/uapi/linux/virtio_balloon.h |   1 +
>  2 files changed, 212 insertions(+), 19 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 7960746..2c21c5a 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -32,6 +32,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>  
>  /*
>   * Balloon device works in 4K page units.  So each page is pointed to by
> @@ -79,6 +81,9 @@ struct virtio_balloon {
>   /* Synchronize access/update to this struct virtio_balloon elements */
>   struct mutex balloon_lock;
>  
> + /* The xbitmap used to record balloon pages */
> + struct xb page_xb;
> +
>   /* The array of pfns we tell the Host about. */
>   unsigned int num_pfns;
>   __virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
> @@ -141,15 +146,121 @@ static void set_page_pfns(struct virtio_balloon *vb,
> page_to_balloon_pfn(page) + i);
>  }
>  
> +static void kick_and_wait(struct virtqueue *vq, wait_queue_head_t wq_head)
> +{
> + unsigned int len;
> +
> + virtqueue_kick(vq);
> + wait_event(wq_head, virtqueue_get_buf(vq, &len));
> +}
> +
> +static void send_one_desc(struct virtio_balloon *vb,
> +   struct virtqueue *vq,
> +   uint64_t addr,
> +   uint32_t len,
> +   bool inbuf,
> +   bool batch)
> +{
> + int err;
> + unsigned int size;
> +
> + /* Detach all the used buffers from the vq */
> + while (virtqueue_get_buf(vq, &size))
> + ;
> +
> + err = virtqueue_add_one_desc(vq, addr, len, inbuf, vq);
> + /*
> +  * This is expected to never fail: there is always at least 1 entry
> +  * available on the vq, because when the vq is full the worker thread
> +  * that adds the desc will be put into sleep until at least 1 entry is
> +  * available to use.
> +  */
> + BUG_ON(err);
> +
> + /* If batching is requested, we batch till the vq is full */
> + if (!batch || !vq->num_free)
> + kick_and_wait(vq, vb->acked);
> +}
> +

This internal kick complicates callers. I suggest that instead,
you move this to callers, just return a "kick required" boolean.
This way callers do not need to play with num_free at all.

> +/*
> + * Send balloon pages in sgs to host. The balloon pages are recorded in the
> + * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
> + * The page xbitmap is searched for continuous "1" bits, which correspond
> + * to continuous pages, to chunk into sgs.
> + *
> + * @page_xb_start and @page_xb_end form the range of bits in the xbitmap that
> + * need to be searched.
> + */
> +static void tell_host_sgs(struct virtio_balloon *vb,
> +   struct virtqueue *vq,
> +   unsigned long page_xb_start,
> +   unsigned long page_xb_end)
> +{
> + unsigned long pfn_start, pfn_end;
> + uint64_t addr;
> + uint32_t len, max_len = round_down(UINT_MAX, PAGE_SIZE);
> +
> + pfn_start = page_xb_start;
> + while (pfn_start < page_xb_end) {
> + pfn_start = xb_find_next_set_bit(&vb->page_xb, pfn_start,
> +  page_xb_end);
> + if (pfn_start == 

RE: [PATCH v18 07/10] virtio-balloon: VIRTIO_BALLOON_F_SG

2017-11-30 Thread Wang, Wei W
On Thursday, November 30, 2017 6:36 PM, Tetsuo Handa wrote:
> Wei Wang wrote:
> > +static inline int xb_set_page(struct virtio_balloon *vb,
> > +  struct page *page,
> > +  unsigned long *pfn_min,
> > +  unsigned long *pfn_max)
> > +{
> > +   unsigned long pfn = page_to_pfn(page);
> > +   int ret;
> > +
> > +   *pfn_min = min(pfn, *pfn_min);
> > +   *pfn_max = max(pfn, *pfn_max);
> > +
> > +   do {
> > +   ret = xb_preload_and_set_bit(&vb->page_xb, pfn,
> > +GFP_NOWAIT | __GFP_NOWARN);
> 
> It is a bit of pity that __GFP_NOWARN here is applied to only xb_preload().
> Memory allocation by xb_set_bit() will after all emit warnings. Maybe
> 
>   xb_init(&vb->page_xb);
>   vb->page_xb.gfp_mask |= __GFP_NOWARN;
> 
> is tolerable? Or, unconditionally apply __GFP_NOWARN at xb_init()?
> 



Please have a look at this one: radix_tree_node_alloc()

In our case, I think the code path goes to 

if (!gfpflags_allow_blocking(gfp_mask) && !in_interrupt()) {
        ...
        ret = kmem_cache_alloc(radix_tree_node_cachep,
                               gfp_mask | __GFP_NOWARN);
        ...
        goto out;
}

So I think the __GFP_NOWARN is already there.



>   static inline void xb_init(struct xb *xb)
>   {
>   INIT_RADIX_TREE(&xb->xbrt, IDR_RT_MARKER | GFP_NOWAIT);
>   }
> 
> > +   } while (unlikely(ret == -EAGAIN));
> > +
> > +   return ret;
> > +}
> > +
> 
> 
> 
> > @@ -172,11 +283,18 @@ static unsigned fill_balloon(struct virtio_balloon
> *vb, size_t num)
> > vb->num_pfns = 0;
> >
> > while ((page = balloon_page_pop())) {
> > +   if (use_sg) {
> > +   if (xb_set_page(vb, page, &pfn_min, &pfn_max) < 0) {
> > +   __free_page(page);
> > +   break;
> 
> You cannot "break;" without consuming all pages in "pages".


Right, I think it should be "continue" here. Thanks. 
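A small userspace model of why "continue" is the right fix (process_pages() and set_ok are illustrative stand-ins for the fill_balloon() loop, not the driver code): on an xb_set_page() failure the page is freed and the loop moves on, so every popped page is still consumed, whereas a "break" would leave the remaining pages unhandled:

```c
/* Model of the fill_balloon() fix: a failed xb_set_page() must not
 * abort the loop ("break") and orphan the remaining popped pages; it
 * should drop the failing page and continue with the rest. */
static int process_pages(const int *set_ok, int n, int *enqueued, int *freed)
{
        int i;

        *enqueued = *freed = 0;
        for (i = 0; i < n; i++) {
                if (!set_ok[i]) {   /* xb_set_page() failed for this page */
                        (*freed)++; /* models __free_page(page) */
                        continue;   /* was "break" in v18 */
                }
                (*enqueued)++;      /* models balloon_page_enqueue(...) */
        }
        return *enqueued + *freed;  /* every popped page was consumed */
}
```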

> 
> > +   }
> > +   } else {
> > +   set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> > +   }
> > +
> > balloon_page_enqueue(&vb->vb_dev_info, page);
> >
> > vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE;
> > -
> > -   set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> > vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
> > if (!virtio_has_feature(vb->vdev,
> >
>   VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
> 
> 
> 
> > @@ -212,9 +334,12 @@ static unsigned leak_balloon(struct virtio_balloon
> *vb, size_t num)
> > struct page *page;
> > struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
> > LIST_HEAD(pages);
> > +   bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
> 
> You can pass use_sg as an argument to leak_balloon(). Then, you won't need
> to define leak_balloon_sg_oom(). Since xbitmap allocation does not use
> __GFP_DIRECT_RECLAIM, it is safe to reuse leak_balloon() for OOM path.
> Just be sure to pass use_sg == false because memory allocation for use_sg ==
> true likely fails when called from OOM path. (But trying use_sg == true for
> OOM path and then fallback to use_sg == false is not bad?)
> 


But once the SG mechanism is in use, we cannot use the non-SG mechanism,
because the interface between the guest and host is not the same for SG and
non-SG. Making the host support both mechanisms at the same time would
complicate the interface and the implementation.

I also think the old non-SG mechanism should be deprecated, since its
implementation isn't perfect in some respects. For example, it balloons pages
using an outbuf, which implies that no changes are allowed to the balloon
pages, and this isn't true for ballooning. The new SG mechanism has changed
this to use an inbuf.

So I think using leak_balloon_sg_oom() would be better. Is there any reason
we couldn't use it?

Best,
Wei

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH v18 07/10] virtio-balloon: VIRTIO_BALLOON_F_SG

2017-11-29 Thread Wei Wang
Add a new feature, VIRTIO_BALLOON_F_SG, which enables the transfer of
balloon (i.e. inflated/deflated) pages using scatter-gather lists to the
host. A scatter-gather list is described by a vring desc.

The implementation of the previous virtio-balloon is not very efficient,
because the balloon pages are transferred to the host by one array each
time. Here is the breakdown of the time in percentage spent on each step
of the balloon inflating process (inflating 7GB of an 8GB idle guest).

1) allocating pages (6.5%)
2) sending PFNs to host (68.3%)
3) address translation (6.1%)
4) madvise (19%)

It takes about 4126ms for the inflating process to complete. The above
profiling shows that the bottlenecks are stage 2) and stage 4).

This patch optimizes step 2) by transferring pages to host in sgs. An sg
describes a chunk of guest physically continuous pages. With this
mechanism, step 4) can also be optimized by doing address translation and
madvise() in chunks rather than page by page.

With this new feature, the above ballooning process takes ~440ms resulting
in an improvement of ~89%.

TODO: optimize stage 1) by allocating/freeing a chunk of pages instead of
a single page each time.

Signed-off-by: Wei Wang 
Signed-off-by: Liang Li 
Suggested-by: Michael S. Tsirkin 
Cc: Tetsuo Handa 
---
 drivers/virtio/virtio_balloon.c | 230 +---
 include/uapi/linux/virtio_balloon.h |   1 +
 2 files changed, 212 insertions(+), 19 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 7960746..2c21c5a 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -32,6 +32,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 /*
  * Balloon device works in 4K page units.  So each page is pointed to by
@@ -79,6 +81,9 @@ struct virtio_balloon {
/* Synchronize access/update to this struct virtio_balloon elements */
struct mutex balloon_lock;
 
+   /* The xbitmap used to record balloon pages */
+   struct xb page_xb;
+
/* The array of pfns we tell the Host about. */
unsigned int num_pfns;
__virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
@@ -141,15 +146,121 @@ static void set_page_pfns(struct virtio_balloon *vb,
  page_to_balloon_pfn(page) + i);
 }
 
+static void kick_and_wait(struct virtqueue *vq, wait_queue_head_t wq_head)
+{
+   unsigned int len;
+
+   virtqueue_kick(vq);
+   wait_event(wq_head, virtqueue_get_buf(vq, &len));
+}
+
+static void send_one_desc(struct virtio_balloon *vb,
+ struct virtqueue *vq,
+ uint64_t addr,
+ uint32_t len,
+ bool inbuf,
+ bool batch)
+{
+   int err;
+   unsigned int size;
+
+   /* Detach all the used buffers from the vq */
+   while (virtqueue_get_buf(vq, &size))
+   ;
+
+   err = virtqueue_add_one_desc(vq, addr, len, inbuf, vq);
+   /*
+* This is expected to never fail: there is always at least 1 entry
+* available on the vq, because when the vq is full the worker thread
+* that adds the desc will be put into sleep until at least 1 entry is
+* available to use.
+*/
+   BUG_ON(err);
+
+   /* If batching is requested, we batch till the vq is full */
+   if (!batch || !vq->num_free)
+   kick_and_wait(vq, vb->acked);
+}
+
+/*
+ * Send balloon pages in sgs to host. The balloon pages are recorded in the
+ * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
+ * The page xbitmap is searched for continuous "1" bits, which correspond
+ * to continuous pages, to chunk into sgs.
+ *
+ * @page_xb_start and @page_xb_end form the range of bits in the xbitmap that
+ * need to be searched.
+ */
+static void tell_host_sgs(struct virtio_balloon *vb,
+ struct virtqueue *vq,
+ unsigned long page_xb_start,
+ unsigned long page_xb_end)
+{
+   unsigned long pfn_start, pfn_end;
+   uint64_t addr;
+   uint32_t len, max_len = round_down(UINT_MAX, PAGE_SIZE);
+
+   pfn_start = page_xb_start;
+   while (pfn_start < page_xb_end) {
+   pfn_start = xb_find_next_set_bit(&vb->page_xb, pfn_start,
+page_xb_end);
+   if (pfn_start == page_xb_end + 1)
+   break;
+   pfn_end = xb_find_next_zero_bit(&vb->page_xb,
+   pfn_start + 1,
+   page_xb_end);
+   addr = pfn_start << PAGE_SHIFT;
+   len = (pfn_end - pfn_start) << PAGE_SHIFT;
+   while (len > max_len) {
+