[PATCH v2] virtio/vsock: fix the transport to work with VMADDR_CID_ANY

2021-11-25 Thread Wei Wang
The VMADDR_CID_ANY flag used by a socket means that the socket isn't bound
to any specific CID. For example, a host vsock server may want to be bound
with VMADDR_CID_ANY, so that a guest vsock client can connect to the host
server with CID=VMADDR_CID_HOST (i.e. 2), and meanwhile, a host vsock
client can connect to the same local server with CID=VMADDR_CID_LOCAL
(i.e. 1).

The current implementation sets the destination socket's svm_cid to a
fixed CID value after the first client's connection, which isn't an
expected operation. For example, if the guest client first connects to the
host server, the server's svm_cid gets set to VMADDR_CID_HOST, then other
host clients won't be able to connect to the server anymore.

Reproduce steps:
1. Run the host server:
   socat VSOCK-LISTEN:1234,fork -
2. Run a guest client to connect to the host server:
   socat - VSOCK-CONNECT:2:1234
3. Run a host client to connect to the host server:
   socat - VSOCK-CONNECT:1:1234

Without this patch, step 3. above fails to connect, and socat complains
"socat[1720] E connect(5, AF=40 cid:1 port:1234, 16): Connection
reset by peer".
With this patch, the above works well.
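
For readers unfamiliar with AF_VSOCK, the listener in the reproducer above can
also be written directly in C. The sketch below is illustrative only (not part
of the patch); it assumes the standard <linux/vm_sockets.h> definitions, and
port 1234 matches the socat commands:

#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/vm_sockets.h>

int main(void)
{
	/* Listener bound to VMADDR_CID_ANY, like "socat VSOCK-LISTEN:1234,fork -" */
	struct sockaddr_vm addr = {
		.svm_family = AF_VSOCK,
		.svm_cid = VMADDR_CID_ANY,	/* not tied to any specific CID */
		.svm_port = 1234,
	};
	int fd = socket(AF_VSOCK, SOCK_STREAM, 0);

	if (fd < 0 || bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
	    listen(fd, 1) < 0) {
		perror("vsock listen");
		return 1;
	}
	for (;;) {
		/* With the fix, both a guest client (CID 2) and a host-local
		 * client (CID 1) can connect to this single listener. */
		int conn = accept(fd, NULL, NULL);

		if (conn < 0)
			break;
		close(conn);
	}
	return 0;
}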

Fixes: c0cfa2d8a788 ("vsock: add multi-transports support")
Signed-off-by: Wei Wang 
---
 net/vmw_vsock/virtio_transport_common.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index 59ee1be5a6dd..ec2c2afbf0d0 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -1299,7 +1299,8 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
space_available = virtio_transport_space_update(sk, pkt);
 
/* Update CID in case it has changed after a transport reset event */
-   vsk->local_addr.svm_cid = dst.svm_cid;
+   if (vsk->local_addr.svm_cid != VMADDR_CID_ANY)
+   vsk->local_addr.svm_cid = dst.svm_cid;
 
if (space_available)
sk->sk_write_space(sk);

base-commit: 5f53fa508db098c9d372423a6dac31c8a5679cdf
-- 
2.25.1



[PATCH] virtio/vsock: fix the transport to work with VMADDR_CID_ANY

2021-11-25 Thread Wei Wang
The VMADDR_CID_ANY flag used by a socket means that the socket isn't bound
to any specific CID. For example, a host vsock server may want to be bound
with VMADDR_CID_ANY, so that a guest vsock client can connect to the host
server with CID=VMADDR_CID_HOST (i.e. 2), and meanwhile, a host vsock
client can connect to the same local server with CID=VMADDR_CID_LOCAL
(i.e. 1).

The current implementation sets the destination socket's svm_cid to a
fixed CID value after the first client's connection, which isn't an
expected operation. For example, if the guest client first connects to the
host server, the server's svm_cid gets set to VMADDR_CID_HOST, then other
host clients won't be able to connect to the server anymore.

Reproduce steps:
1. Run the host server:
   socat VSOCK-LISTEN:1234,fork -
2. Run a guest client to connect to the host server:
   socat - VSOCK-CONNECT:2:1234
3. Run a host client to connect to the host server:
   socat - VSOCK-CONNECT:1:1234

Without this patch, step 3. above fails to connect, and socat complains
"socat[1720] E connect(5, AF=40 cid:1 port:1234, 16): Connection
reset by peer".
With this patch, the above works well.

Signed-off-by: Wei Wang 
---
 net/vmw_vsock/virtio_transport_common.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index 59ee1be5a6dd..5c60fae10569 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -1298,9 +1298,6 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
 
space_available = virtio_transport_space_update(sk, pkt);
 
-   /* Update CID in case it has changed after a transport reset event */
-   vsk->local_addr.svm_cid = dst.svm_cid;
-
if (space_available)
sk->sk_write_space(sk);
 
-- 
2.25.1



Re: [PATCH] virtio_balloon: fix up endian-ness for free cmd id

2020-07-28 Thread Wei Wang

On 07/28/2020 12:03 AM, Michael S. Tsirkin wrote:

free cmd id is read using virtio endian, spec says all fields
in balloon are LE. Fix it up.

Fixes: 86a559787e6f ("virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT")
Cc: sta...@vger.kernel.org
Signed-off-by: Michael S. Tsirkin 
---
  drivers/virtio/virtio_balloon.c | 6 +-
  1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 774deb65a9bb..798ec304fe3e 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -578,10 +578,14 @@ static int init_vqs(struct virtio_balloon *vb)
  static u32 virtio_balloon_cmd_id_received(struct virtio_balloon *vb)
  {
if (test_and_clear_bit(VIRTIO_BALLOON_CONFIG_READ_CMD_ID,
-  &vb->config_read_bitmap))
+  &vb->config_read_bitmap)) {
virtio_cread(vb->vdev, struct virtio_balloon_config,
 free_page_hint_cmd_id,
 &vb->cmd_id_received_cache);
+   /* Legacy balloon config space is LE, unlike all other devices. */
+   if (!virtio_has_feature(vb->vdev, VIRTIO_F_VERSION_1))
+   vb->cmd_id_received_cache = le32_to_cpu((__force __le32)vb->cmd_id_received_cache);

It seems this exceeds the 80-character line limit.

Reviewed-by: Wei Wang 

Best,
Wei


[PATCH RFC] virtio_balloon: conservative balloon page shrinking

2020-02-06 Thread Wei Wang
There are cases where users want to shrink balloon pages only after the
pagecache has been depleted. The conservative_shrinker option makes the
shrinker reclaim balloon pages only when all the pagecache has been reclaimed.

Signed-off-by: Wei Wang 
---
 drivers/virtio/virtio_balloon.c | 14 +-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 93f995f6cf36..b4c5bb13a867 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -42,6 +42,10 @@
 static struct vfsmount *balloon_mnt;
 #endif
 
+static bool conservative_shrinker = true;
+module_param(conservative_shrinker, bool, 0644);
+MODULE_PARM_DESC(conservative_shrinker, "conservatively shrink balloon pages");
+
 enum virtio_balloon_vq {
VIRTIO_BALLOON_VQ_INFLATE,
VIRTIO_BALLOON_VQ_DEFLATE,
@@ -796,6 +800,10 @@ static unsigned long shrink_balloon_pages(struct virtio_balloon *vb,
 {
unsigned long pages_freed = 0;
 
+   /* Balloon pages only get shrunk when the pagecache is depleted */
+   if (conservative_shrinker && global_node_page_state(NR_FILE_PAGES))
+   return 0;
+
/*
 * One invocation of leak_balloon can deflate at most
 * VIRTIO_BALLOON_ARRAY_PFNS_MAX balloon pages, so we call it
@@ -837,7 +845,11 @@ static unsigned long virtio_balloon_shrinker_count(struct shrinker *shrinker,
struct virtio_balloon, shrinker);
unsigned long count;
 
-   count = vb->num_pages / VIRTIO_BALLOON_PAGES_PER_PAGE;
+   if (conservative_shrinker && global_node_page_state(NR_FILE_PAGES))
+   count = 0;
+   else
+   count = vb->num_pages / VIRTIO_BALLOON_PAGES_PER_PAGE;
+
count += vb->num_free_page_blocks * VIRTIO_BALLOON_HINT_BLOCK_PAGES;
 
return count;
-- 
2.17.1



Re: [PATCH 1/2] virtio-balloon: initialize all vq callbacks

2019-12-17 Thread Wei Wang

On 12/18/2019 01:19 PM, Michael S. Tsirkin wrote:

On Wed, Dec 18, 2019 at 11:18:45AM +0800, Wei Wang wrote:

On 12/18/2019 03:06 AM, Daniel Verkamp wrote:

Ensure that elements of the array that correspond to unavailable
features are set to NULL; previously, they would be left uninitialized.

Since the corresponding names array elements were explicitly set to
NULL, the uninitialized callback pointers would not actually be
dereferenced; however, the uninitialized callbacks elements would still
be read in vp_find_vqs_msix() and used to calculate the number of MSI-X
vectors required.

With your 2nd patch:
if (names[i] && callbacks[i])
 ++nvectors;

It seems this patch isn't necessary as names[i] is already NULL, isn't it?

Best,
Wei

Right but passing uninitialized values isn't nice even if
the function called happens to ignore them.


Hmm.. then please make one more change:

 if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
 names[VIRTIO_BALLOON_VQ_FREE_PAGE] = "free_page_vq";
-callbacks[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
 }

And I think the commit description isn't accurate then; it seems to be a
coding style improvement rather than something that affects how "the number
of MSI-X vectors required" is calculated.

Best,
Wei




Re: [PATCH 1/2] virtio-balloon: initialize all vq callbacks

2019-12-17 Thread Wei Wang

On 12/18/2019 03:06 AM, Daniel Verkamp wrote:

Ensure that elements of the array that correspond to unavailable
features are set to NULL; previously, they would be left uninitialized.

Since the corresponding names array elements were explicitly set to
NULL, the uninitialized callback pointers would not actually be
dereferenced; however, the uninitialized callbacks elements would still
be read in vp_find_vqs_msix() and used to calculate the number of MSI-X
vectors required.


With your 2nd patch:
if (names[i] && callbacks[i])
++nvectors;

It seems this patch isn't necessary as names[i] is already NULL, isn't it?

Best,
Wei


Re: [PATCH] virtio-balloon: request nvqs based on features

2019-12-16 Thread Wei Wang

On 12/17/2019 07:14 AM, Daniel Verkamp wrote:

After 86a559787e6f ("virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT"),
the virtio-balloon device unconditionally specifies 4 virtqueues as the
argument to find_vqs(), which means that 5 MSI-X vectors are required in
order to assign one vector per VQ plus one for configuration changes.

However, in cases where the virtio device only provides exactly as many
MSI-X vectors as required for the features it implements (e.g. 3 for the
basic configuration of inflate + deflate + config), this results in the
selection of the fallback configuration where one interrupt vector is
used for all VQs instead of having one VQ per vector.


I'm not sure if I get the above. Virtio won't do any vq allocation for
the one that has "callbacks[i] = NULL", so no msi-x vector will be
assigned for it.

Btw, did you trigger any bug with the existing code?

Best,
Wei


Re: VIRTIO_BALLOON_F_FREE_PAGE_HINT

2019-09-15 Thread Wei Wang

On 09/14/2019 02:36 AM, Tyler Sanderson wrote:
Hello, I'm curious about the intent of VIRTIO_BALLOON_F_FREE_PAGE_HINT.



My understanding is that this mechanism works similarly to the 
existing inflate/deflate queues. Pages are allocated by the guest and 
then reported on VQ_FREE_PAGE.


Question: Is there a limit to how many pages will be allocated? What 
controls the amount of memory pressure applied?


There is no control for the limit currently; the implementation reports all
the guest free pages to the host. The main usage for this feature so far is
to have the guest skip sending those free pages (the more, the better) during
live migration.




In my experience with virtio balloon there are problems with the 
mechanisms that are supposed to deflate the balloon in response to 
memory pressure (e.g. OOM notifier).


What problem did you see? We've also changed the balloon to use a memory
shrinker; did you see the problem with the shrinker as well?



It seems an ideal balloon interface would allow the guest to round 
robin through free guest physical pages, allowing the host to unback 
them, but never having more than a few pages allocated to the balloon 
at any one time. For example:
1. Guest allocates 1 page and notifies the balloon device of this page's address.
2. Host unbacks the received page.
3. Guest frees the page.
4. Repeat at #1, but ensure that different pages are allocated each time.


You would probably need a mechanism to "ensure" that different pages are
allocated each time. The current implementation (having the balloon hold the
allocated pages) can be thought of as one such mechanism (it is simple).



This way the "balloon size" is never more than a few pages and does 
not create memory pressure. However the difficulty is in ensuring each 
set of sent pages is disjoint from previously sent pages. Is there a 
mechanism to round-robin allocations through all of guest physical 
memory? Does VIRTIO_BALLOON_F_FREE_PAGE_HINT enable this?




AFAIK, there is no such round-robin allocation so far. This would need the
page allocator to keep state recording the allocation history.

Best,
Wei
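
A rough sketch of the round-robin idea discussed above, just to make the open
problem concrete. Everything here is hypothetical: send_hint_to_host() does
not exist in the driver, and only alloc_page()/__free_page()/page_to_pfn()
are real kernel calls:

#include <linux/gfp.h>
#include <linux/mm.h>

/* Hypothetical helper: tell the host this pfn can be unbacked. */
extern void send_hint_to_host(unsigned long pfn);

static void hint_one_page(void)
{
	struct page *page = alloc_page(GFP_KERNEL);

	if (!page)
		return;
	send_hint_to_host(page_to_pfn(page));	/* host may unback it here */
	__free_page(page);	/* the "balloon" never holds more than one page */
	/*
	 * Open problem (see the reply above): nothing guarantees that the
	 * next alloc_page() returns a page that has not been hinted before;
	 * the current implementation sidesteps this by keeping the pages
	 * allocated in the balloon instead.
	 */
}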


Re: [PATCH v3 2/3] virtio-balloon: improve update_balloon_size_func

2019-01-16 Thread Wei Wang

On 01/15/2019 09:00 AM, Michael S. Tsirkin wrote:

On Mon, Jan 07, 2019 at 03:01:05PM +0800, Wei Wang wrote:

There is no need to update the balloon actual register when there is no
ballooning request. This patch avoids update_balloon_size when diff is 0.

Signed-off-by: Wei Wang 
Reviewed-by: Cornelia Huck 
Reviewed-by: Halil Pasic 

It sounds reasonable but it does have a side effect of
not writing into actual anymore.

I am not sure actual never can get out of sync as a result.

So better deferred, pls send this in the next merge window.



No problem, it's not urgent to me.

Best,
Wei


Re: [virtio-dev] [PATCH 2/2] virtio: document virtio_config_ops restrictions

2019-01-16 Thread Wei Wang

On 01/04/2019 12:08 AM, Cornelia Huck wrote:

Some transports (e.g. virtio-ccw) implement virtio operations that
seem to be a simple read/write as something more involved that
cannot be done from an atomic context.

Give at least a hint about that.

Signed-off-by: Cornelia Huck 
---
  include/linux/virtio_config.h | 5 +
  1 file changed, 5 insertions(+)

diff --git a/include/linux/virtio_config.h b/include/linux/virtio_config.h
index 7087ef946ba7..987b6491b946 100644
--- a/include/linux/virtio_config.h
+++ b/include/linux/virtio_config.h
@@ -12,6 +12,11 @@ struct irq_affinity;
  
  /**

   * virtio_config_ops - operations for configuring a virtio device
+ * Note: Do not assume that a transport implements all of the operations
+ *   getting/setting a value as a simple read/write! Generally speaking,
+ *   any of @get/@set, @get_status/@set_status, or @get_features/
+ *   @finalize_features are NOT safe to be called from an atomic
+ *   context.
   * @get: read the value of a configuration field
   *vdev: the virtio_device
   *offset: the offset of the configuration field



Reviewed-by: Wei Wang 

Best,
Wei


Re: [PATCH 1/2] virtio: fix virtio_config_ops description

2019-01-16 Thread Wei Wang

On 01/14/2019 10:09 PM, Cornelia Huck wrote:

On Thu,  3 Jan 2019 17:08:03 +0100
Cornelia Huck  wrote:


- get_features has returned 64 bits since commit d025477368792
   ("virtio: add support for 64 bit features.")
- properly mark all optional callbacks

Signed-off-by: Cornelia Huck 
---
  include/linux/virtio_config.h | 8 
  1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/virtio_config.h b/include/linux/virtio_config.h
index 32baf8e26735..7087ef946ba7 100644
--- a/include/linux/virtio_config.h
+++ b/include/linux/virtio_config.h
@@ -22,7 +22,7 @@ struct irq_affinity;
   *offset: the offset of the configuration field
   *buf: the buffer to read the field value from.
   *len: the length of the buffer
- * @generation: config generation counter
+ * @generation: config generation counter (optional)
   *vdev: the virtio_device
   *Returns the config generation counter
   * @get_status: read the status byte
@@ -48,17 +48,17 @@ struct irq_affinity;
   * @del_vqs: free virtqueues found by find_vqs().
   * @get_features: get the array of feature bits for this device.
   *vdev: the virtio_device
- * Returns the first 32 feature bits (all we currently need).
+ * Returns the first 64 feature bits (all we currently need).
   * @finalize_features: confirm what device features we'll be using.
   *vdev: the virtio_device
   *This gives the final feature bits for the device: it can change
   *the dev->feature bits if it wants.
   *Returns 0 on success or error status
- * @bus_name: return the bus name associated with the device
+ * @bus_name: return the bus name associated with the device (optional)
   *vdev: the virtio_device
   *  This returns a pointer to the bus name a la pci_name from which
   *  the caller can then copy.
- * @set_vq_affinity: set the affinity for a virtqueue.
+ * @set_vq_affinity: set the affinity for a virtqueue (optional).
   * @get_vq_affinity: get the affinity for a virtqueue (optional).
   */
  typedef void vq_callback_t(struct virtqueue *);

Ping. Any feedback on that patch?


Reviewed-by: Wei Wang 

Best,
Wei



Re: [PATCH v3 1/3] virtio-balloon: tweak config_changed implementation

2019-01-09 Thread Wei Wang

On 01/08/2019 04:46 PM, Christian Borntraeger wrote:


On 08.01.2019 06:35, Wei Wang wrote:

On 01/07/2019 09:49 PM, Christian Borntraeger wrote:

On 07.01.2019 08:01, Wei Wang wrote:

virtio-ccw has deadlock issues with reading the config space inside the
interrupt context, so we tweak the virtballoon_changed implementation
by moving the config read operations into the related workqueue contexts.
The config_read_bitmap is used as a flag to the workqueue callbacks
about the related config fields that need to be read.

The cmd_id_received is also renamed to cmd_id_received_cache, and
the value should be obtained via virtio_balloon_cmd_id_received.

Reported-by: Christian Borntraeger 
Signed-off-by: Wei Wang 
Reviewed-by: Cornelia Huck 
Reviewed-by: Halil Pasic 

Together with
virtio_pci: use queue idx instead of array idx to set up the vq
virtio: don't allocate vqs when names[i] = NULL

Tested-by: Christian Borntraeger 

OK. I don't plan to send a new version of the above patches as no changes are
needed so far.

Michael, if the above two patches look good to you, please help add the
related Tested-by and Reviewed-by tags. Thanks.

Can we also make sure that

virtio_pci: use queue idx instead of array idx to set up the vq
virtio: don't allocate vqs when names[i] = NULL

also land in stable?



You could also send the request to stable after it gets merged to Linus' 
tree.

The stable review committee will decide whether to take it.

Please see Option 2:

https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html


Best,
Wei


Re: [PATCH v3 1/3] virtio-balloon: tweak config_changed implementation

2019-01-07 Thread Wei Wang

On 01/07/2019 09:49 PM, Christian Borntraeger wrote:


On 07.01.2019 08:01, Wei Wang wrote:

virtio-ccw has deadlock issues with reading the config space inside the
interrupt context, so we tweak the virtballoon_changed implementation
by moving the config read operations into the related workqueue contexts.
The config_read_bitmap is used as a flag to the workqueue callbacks
about the related config fields that need to be read.

The cmd_id_received is also renamed to cmd_id_received_cache, and
the value should be obtained via virtio_balloon_cmd_id_received.

Reported-by: Christian Borntraeger 
Signed-off-by: Wei Wang 
Reviewed-by: Cornelia Huck 
Reviewed-by: Halil Pasic 

Together with
   virtio_pci: use queue idx instead of array idx to set up the vq
   virtio: don't allocate vqs when names[i] = NULL

Tested-by: Christian Borntraeger 


OK. I don't plan to send a new version of the above patches as no changes are
needed so far.

Michael, if the above two patches look good to you, please help add the
related Tested-by and Reviewed-by tags.

Best,
Wei


[PATCH v4 3/3] virtio_balloon: remove the unnecessary 0-initialization

2019-01-07 Thread Wei Wang
We've changed to kzalloc the vb struct, so no need to 0-initialize
this field one more time.

Signed-off-by: Wei Wang 
Reviewed-by: Cornelia Huck 
---
 drivers/virtio/virtio_balloon.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index d48c12c..048959a 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -925,7 +925,6 @@ static int virtballoon_probe(struct virtio_device *vdev)
  VIRTIO_BALLOON_CMD_ID_STOP);
vb->cmd_id_stop = cpu_to_virtio32(vb->vdev,
  VIRTIO_BALLOON_CMD_ID_STOP);
-   vb->num_free_page_blocks = 0;
spin_lock_init(&vb->free_page_list_lock);
INIT_LIST_HEAD(&vb->free_page_list);
if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_PAGE_POISON)) {
-- 
2.7.4



[PATCH v4 0/3] virtio-balloon: tweak config_changed

2019-01-07 Thread Wei Wang
Since virtio-ccw doesn't support accessing the config space inside an
interrupt context, this patch series avoids that issue by moving the config
register accesses into the related workqueue contexts.

v3->v4 ChangeLog:
- change virtio32_to_cpu to cpu_to_virtio_32 in send_cmd_id_start;
v2->v3 ChangeLog:
- rename cmd_id_received to cmd_id_received_cache, and have call sites
  read the latest value via virtio_balloon_cmd_id_received. (Still
  kept Cornelia and Halil's reviewed-by as it's a minor change)
- remove zeroing vb->num_free_page_blocks in probe since vb is
  allocated via kzalloc.
v1->v2 ChangeLog:
- add config_read_bitmap to indicate to the workqueue callbacks about
  the necessity of reading the related config fields.

Wei Wang (3):
  virtio-balloon: tweak config_changed implementation
  virtio-balloon: improve update_balloon_size_func
  virtio_balloon: remove the unnecessary 0-initialization

 drivers/virtio/virtio_balloon.c | 104 ++--
 1 file changed, 69 insertions(+), 35 deletions(-)

-- 
2.7.4



[PATCH v4 2/3] virtio-balloon: improve update_balloon_size_func

2019-01-07 Thread Wei Wang
There is no need to update the balloon actual register when there is no
ballooning request. This patch avoids update_balloon_size when diff is 0.

Signed-off-by: Wei Wang 
Reviewed-by: Cornelia Huck 
Reviewed-by: Halil Pasic 
Tested-by: Christian Borntraeger 
---
 drivers/virtio/virtio_balloon.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 45d32f5..d48c12c 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -457,9 +457,12 @@ static void update_balloon_size_func(struct work_struct *work)
  update_balloon_size_work);
diff = towards_target(vb);
 
+   if (!diff)
+   return;
+
if (diff > 0)
diff -= fill_balloon(vb, diff);
-   else if (diff < 0)
+   else
diff += leak_balloon(vb, -diff);
update_balloon_size(vb);
 
-- 
2.7.4



[PATCH v4 1/3] virtio-balloon: tweak config_changed implementation

2019-01-07 Thread Wei Wang
virtio-ccw has deadlock issues with reading the config space inside the
interrupt context, so we tweak the virtballoon_changed implementation
by moving the config read operations into the related workqueue contexts.
The config_read_bitmap is used as a flag to the workqueue callbacks
about the related config fields that need to be read.

The cmd_id_received is also renamed to cmd_id_received_cache, and
the value should be obtained via virtio_balloon_cmd_id_received.

Fixes: 86a559787e6f ("virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT")
Reported-by: Christian Borntraeger 
Signed-off-by: Wei Wang 
Reviewed-by: Cornelia Huck 
Reviewed-by: Halil Pasic 
Tested-by: Christian Borntraeger 
---
 drivers/virtio/virtio_balloon.c | 98 +++--
 1 file changed, 65 insertions(+), 33 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 728ecd1..45d32f5 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -61,6 +61,10 @@ enum virtio_balloon_vq {
VIRTIO_BALLOON_VQ_MAX
 };
 
+enum virtio_balloon_config_read {
+   VIRTIO_BALLOON_CONFIG_READ_CMD_ID = 0,
+};
+
 struct virtio_balloon {
struct virtio_device *vdev;
struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
@@ -77,14 +81,20 @@ struct virtio_balloon {
/* Prevent updating balloon when it is being canceled. */
spinlock_t stop_update_lock;
bool stop_update;
+   /* Bitmap to indicate if reading the related config fields are needed */
+   unsigned long config_read_bitmap;
 
/* The list of allocated free pages, waiting to be given back to mm */
struct list_head free_page_list;
spinlock_t free_page_list_lock;
/* The number of free page blocks on the above list */
unsigned long num_free_page_blocks;
-   /* The cmd id received from host */
-   u32 cmd_id_received;
+   /*
+* The cmd id received from host.
+* Read it via virtio_balloon_cmd_id_received to get the latest value
+* sent from host.
+*/
+   u32 cmd_id_received_cache;
/* The cmd id that is actively in use */
__virtio32 cmd_id_active;
/* Buffer to store the stop sign */
@@ -390,37 +400,31 @@ static unsigned long return_free_pages_to_mm(struct virtio_balloon *vb,
return num_returned;
 }
 
+static void virtio_balloon_queue_free_page_work(struct virtio_balloon *vb)
+{
+   if (!virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT))
+   return;
+
+   /* No need to queue the work if the bit was already set. */
+   if (test_and_set_bit(VIRTIO_BALLOON_CONFIG_READ_CMD_ID,
+    &vb->config_read_bitmap))
+   return;
+
+   queue_work(vb->balloon_wq, &vb->report_free_page_work);
+}
+
 static void virtballoon_changed(struct virtio_device *vdev)
 {
struct virtio_balloon *vb = vdev->priv;
unsigned long flags;
-   s64 diff = towards_target(vb);
-
-   if (diff) {
-   spin_lock_irqsave(&vb->stop_update_lock, flags);
-   if (!vb->stop_update)
-   queue_work(system_freezable_wq,
-  &vb->update_balloon_size_work);
-   spin_unlock_irqrestore(&vb->stop_update_lock, flags);
-   }
 
-   if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
-   virtio_cread(vdev, struct virtio_balloon_config,
-free_page_report_cmd_id, &vb->cmd_id_received);
-   if (vb->cmd_id_received == VIRTIO_BALLOON_CMD_ID_DONE) {
-   /* Pass ULONG_MAX to give back all the free pages */
-   return_free_pages_to_mm(vb, ULONG_MAX);
-   } else if (vb->cmd_id_received != VIRTIO_BALLOON_CMD_ID_STOP &&
-  vb->cmd_id_received !=
-  virtio32_to_cpu(vdev, vb->cmd_id_active)) {
-   spin_lock_irqsave(&vb->stop_update_lock, flags);
-   if (!vb->stop_update) {
-   queue_work(vb->balloon_wq,
-  &vb->report_free_page_work);
-   }
-   spin_unlock_irqrestore(&vb->stop_update_lock, flags);
-   }
+   spin_lock_irqsave(&vb->stop_update_lock, flags);
+   if (!vb->stop_update) {
+   queue_work(system_freezable_wq,
+  &vb->update_balloon_size_work);
+   virtio_balloon_queue_free_page_work(vb);
}
+   spin_unlock_irqrestore(&vb->stop_update_lock, flags);
 }
 
 static void update_balloon_size(struct virtio_balloon *vb)
@@ -527,6 +531,17 @@ static int init_vqs(struct virtio_balloon *vb)
return 0;
 }
 
+static u32 virtio_balloon_cmd_id_received(struct 

[PATCH v3 3/3] virtio_balloon: remove the unnecessary 0-initialization

2019-01-07 Thread Wei Wang
We've changed to kzalloc the vb struct, so no need to 0-initialize
this field one more time.

Signed-off-by: Wei Wang 
---
 drivers/virtio/virtio_balloon.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index e33dc8e..f19061b 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -925,7 +925,6 @@ static int virtballoon_probe(struct virtio_device *vdev)
  VIRTIO_BALLOON_CMD_ID_STOP);
vb->cmd_id_stop = cpu_to_virtio32(vb->vdev,
  VIRTIO_BALLOON_CMD_ID_STOP);
-   vb->num_free_page_blocks = 0;
spin_lock_init(&vb->free_page_list_lock);
INIT_LIST_HEAD(&vb->free_page_list);
if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_PAGE_POISON)) {
-- 
2.7.4



[PATCH v3 2/3] virtio-balloon: improve update_balloon_size_func

2019-01-07 Thread Wei Wang
There is no need to update the balloon actual register when there is no
ballooning request. This patch avoids update_balloon_size when diff is 0.

Signed-off-by: Wei Wang 
Reviewed-by: Cornelia Huck 
Reviewed-by: Halil Pasic 
---
 drivers/virtio/virtio_balloon.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index fb12fe2..e33dc8e 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -457,9 +457,12 @@ static void update_balloon_size_func(struct work_struct *work)
  update_balloon_size_work);
diff = towards_target(vb);
 
+   if (!diff)
+   return;
+
if (diff > 0)
diff -= fill_balloon(vb, diff);
-   else if (diff < 0)
+   else
diff += leak_balloon(vb, -diff);
update_balloon_size(vb);
 
-- 
2.7.4



[PATCH v3 0/3] virtio-balloon: tweak config_changed

2019-01-07 Thread Wei Wang
Since virtio-ccw doesn't support accessing the config space inside an
interrupt context, this patch series avoids that issue by moving the config
register accesses into the related workqueue contexts.

v2->v3 ChangeLog:
- rename cmd_id_received to cmd_id_received_cache, and have call sites
  read the latest value via virtio_balloon_cmd_id_received. (Still
  kept Cornelia and Halil's reviewed-by as it's a minor change)
- remove zeroing vb->num_free_page_blocks in probe since vb is
  allocated via kzalloc.
v1->v2 ChangeLog:
- add config_read_bitmap to indicate to the workqueue callbacks about
  the necessity of reading the related config fields.

Wei Wang (3):
  virtio-balloon: tweak config_changed implementation
  virtio-balloon: improve update_balloon_size_func
  virtio_balloon: remove the unnecessary 0-initialization

 drivers/virtio/virtio_balloon.c | 104 ++--
 1 file changed, 69 insertions(+), 35 deletions(-)

-- 
2.7.4



[PATCH v3 1/3] virtio-balloon: tweak config_changed implementation

2019-01-06 Thread Wei Wang
virtio-ccw has deadlock issues with reading the config space inside the
interrupt context, so we tweak the virtballoon_changed implementation
by moving the config read operations into the related workqueue contexts.
The config_read_bitmap is used as a flag to the workqueue callbacks
about the related config fields that need to be read.

The cmd_id_received is also renamed to cmd_id_received_cache, and
the value should be obtained via virtio_balloon_cmd_id_received.

Reported-by: Christian Borntraeger 
Signed-off-by: Wei Wang 
Reviewed-by: Cornelia Huck 
Reviewed-by: Halil Pasic 
---
 drivers/virtio/virtio_balloon.c | 98 +++--
 1 file changed, 65 insertions(+), 33 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 728ecd1..fb12fe2 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -61,6 +61,10 @@ enum virtio_balloon_vq {
VIRTIO_BALLOON_VQ_MAX
 };
 
+enum virtio_balloon_config_read {
+   VIRTIO_BALLOON_CONFIG_READ_CMD_ID = 0,
+};
+
 struct virtio_balloon {
struct virtio_device *vdev;
struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
@@ -77,14 +81,20 @@ struct virtio_balloon {
/* Prevent updating balloon when it is being canceled. */
spinlock_t stop_update_lock;
bool stop_update;
+   /* Bitmap to indicate if reading the related config fields are needed */
+   unsigned long config_read_bitmap;
 
/* The list of allocated free pages, waiting to be given back to mm */
struct list_head free_page_list;
spinlock_t free_page_list_lock;
/* The number of free page blocks on the above list */
unsigned long num_free_page_blocks;
-   /* The cmd id received from host */
-   u32 cmd_id_received;
+   /*
+* The cmd id received from host.
+* Read it via virtio_balloon_cmd_id_received to get the latest value
+* sent from host.
+*/
+   u32 cmd_id_received_cache;
/* The cmd id that is actively in use */
__virtio32 cmd_id_active;
/* Buffer to store the stop sign */
@@ -390,37 +400,31 @@ static unsigned long return_free_pages_to_mm(struct virtio_balloon *vb,
return num_returned;
 }
 
+static void virtio_balloon_queue_free_page_work(struct virtio_balloon *vb)
+{
+   if (!virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT))
+   return;
+
+   /* No need to queue the work if the bit was already set. */
+   if (test_and_set_bit(VIRTIO_BALLOON_CONFIG_READ_CMD_ID,
+    &vb->config_read_bitmap))
+   return;
+
+   queue_work(vb->balloon_wq, &vb->report_free_page_work);
+}
+
 static void virtballoon_changed(struct virtio_device *vdev)
 {
struct virtio_balloon *vb = vdev->priv;
unsigned long flags;
-   s64 diff = towards_target(vb);
-
-   if (diff) {
-   spin_lock_irqsave(&vb->stop_update_lock, flags);
-   if (!vb->stop_update)
-   queue_work(system_freezable_wq,
-  &vb->update_balloon_size_work);
-   spin_unlock_irqrestore(&vb->stop_update_lock, flags);
-   }
 
-   if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
-   virtio_cread(vdev, struct virtio_balloon_config,
-free_page_report_cmd_id, &vb->cmd_id_received);
-   if (vb->cmd_id_received == VIRTIO_BALLOON_CMD_ID_DONE) {
-   /* Pass ULONG_MAX to give back all the free pages */
-   return_free_pages_to_mm(vb, ULONG_MAX);
-   } else if (vb->cmd_id_received != VIRTIO_BALLOON_CMD_ID_STOP &&
-  vb->cmd_id_received !=
-  virtio32_to_cpu(vdev, vb->cmd_id_active)) {
-   spin_lock_irqsave(&vb->stop_update_lock, flags);
-   if (!vb->stop_update) {
-   queue_work(vb->balloon_wq,
-  &vb->report_free_page_work);
-   }
-   spin_unlock_irqrestore(&vb->stop_update_lock, flags);
-   }
+   spin_lock_irqsave(&vb->stop_update_lock, flags);
+   if (!vb->stop_update) {
+   queue_work(system_freezable_wq,
+  &vb->update_balloon_size_work);
+   virtio_balloon_queue_free_page_work(vb);
}
+   spin_unlock_irqrestore(&vb->stop_update_lock, flags);
 }
 
 static void update_balloon_size(struct virtio_balloon *vb)
@@ -527,6 +531,17 @@ static int init_vqs(struct virtio_balloon *vb)
return 0;
 }
 
+static u32 virtio_balloon_cmd_id_received(struct virtio_balloon *vb)
+{
+   if (test_and_clear_bit(VIRTIO_BALLOON_CONFIG_READ_CMD_ID,
+  >config_r

[PATCH v2 2/2] virtio-balloon: improve update_balloon_size_func

2019-01-04 Thread Wei Wang
There is no need to update the balloon actual register when there is no
ballooning request. This patch avoids update_balloon_size when diff is 0.

Signed-off-by: Wei Wang 
Reviewed-by: Cornelia Huck 
Reviewed-by: Halil Pasic 
---
 drivers/virtio/virtio_balloon.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 35ee762..af1e6d9 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -453,9 +453,12 @@ static void update_balloon_size_func(struct work_struct *work)
  update_balloon_size_work);
diff = towards_target(vb);
 
+   if (!diff)
+   return;
+
if (diff > 0)
diff -= fill_balloon(vb, diff);
-   else if (diff < 0)
+   else
diff += leak_balloon(vb, -diff);
update_balloon_size(vb);
 
-- 
2.7.4



[PATCH v2 1/2] virtio-balloon: tweak config_changed implementation

2019-01-04 Thread Wei Wang
virtio-ccw has deadlock issues with reading the config space inside the
interrupt context, so we tweak the virtballoon_changed implementation
by moving the config read operations into the related workqueue contexts.
The config_read_bitmap is used as a flag to the workqueue callbacks
about the related config fields that need to be read.

Reported-by: Christian Borntraeger 
Signed-off-by: Wei Wang 
---
 drivers/virtio/virtio_balloon.c | 81 +++--
 1 file changed, 53 insertions(+), 28 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 728ecd1..35ee762 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -61,6 +61,10 @@ enum virtio_balloon_vq {
VIRTIO_BALLOON_VQ_MAX
 };
 
+enum virtio_balloon_config_read {
+   VIRTIO_BALLOON_CONFIG_READ_CMD_ID = 0,
+};
+
 struct virtio_balloon {
struct virtio_device *vdev;
struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
@@ -77,6 +81,8 @@ struct virtio_balloon {
/* Prevent updating balloon when it is being canceled. */
spinlock_t stop_update_lock;
bool stop_update;
+   /* Bitmap to indicate if reading the related config fields are needed */
+   unsigned long config_read_bitmap;
 
/* The list of allocated free pages, waiting to be given back to mm */
struct list_head free_page_list;
@@ -390,37 +396,31 @@ static unsigned long return_free_pages_to_mm(struct virtio_balloon *vb,
return num_returned;
 }
 
+static void virtio_balloon_queue_free_page_work(struct virtio_balloon *vb)
+{
+   if (!virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT))
+   return;
+
+   /* No need to queue the work if the bit was already set. */
+   if (test_and_set_bit(VIRTIO_BALLOON_CONFIG_READ_CMD_ID,
+    &vb->config_read_bitmap))
+   return;
+
+   queue_work(vb->balloon_wq, &vb->report_free_page_work);
+}
+
 static void virtballoon_changed(struct virtio_device *vdev)
 {
struct virtio_balloon *vb = vdev->priv;
unsigned long flags;
-   s64 diff = towards_target(vb);
-
-   if (diff) {
-   spin_lock_irqsave(&vb->stop_update_lock, flags);
-   if (!vb->stop_update)
-   queue_work(system_freezable_wq,
-  &vb->update_balloon_size_work);
-   spin_unlock_irqrestore(&vb->stop_update_lock, flags);
-   }
 
-   if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
-   virtio_cread(vdev, struct virtio_balloon_config,
-free_page_report_cmd_id, &vb->cmd_id_received);
-   if (vb->cmd_id_received == VIRTIO_BALLOON_CMD_ID_DONE) {
-   /* Pass ULONG_MAX to give back all the free pages */
-   return_free_pages_to_mm(vb, ULONG_MAX);
-   } else if (vb->cmd_id_received != VIRTIO_BALLOON_CMD_ID_STOP &&
-  vb->cmd_id_received !=
-  virtio32_to_cpu(vdev, vb->cmd_id_active)) {
-   spin_lock_irqsave(&vb->stop_update_lock, flags);
-   if (!vb->stop_update) {
-   queue_work(vb->balloon_wq,
-  &vb->report_free_page_work);
-   }
-   spin_unlock_irqrestore(&vb->stop_update_lock, flags);
-   }
+   spin_lock_irqsave(&vb->stop_update_lock, flags);
+   if (!vb->stop_update) {
+   queue_work(system_freezable_wq,
+  &vb->update_balloon_size_work);
+   virtio_balloon_queue_free_page_work(vb);
}
+   spin_unlock_irqrestore(&vb->stop_update_lock, flags);
 }
 
 static void update_balloon_size(struct virtio_balloon *vb)
@@ -609,6 +609,16 @@ static int get_free_page_and_send(struct virtio_balloon *vb)
return 0;
 }
 
+static void virtio_balloon_read_cmd_id_received(struct virtio_balloon *vb)
+{
+   if (!test_and_clear_bit(VIRTIO_BALLOON_CONFIG_READ_CMD_ID,
+   &vb->config_read_bitmap))
+   return;
+
+   virtio_cread(vb->vdev, struct virtio_balloon_config,
+free_page_report_cmd_id, &vb->cmd_id_received);
+}
+
 static int send_free_pages(struct virtio_balloon *vb)
 {
int err;
@@ -620,6 +630,7 @@ static int send_free_pages(struct virtio_balloon *vb)
 * stop the reporting.
 */
cmd_id_active = virtio32_to_cpu(vb->vdev, vb->cmd_id_active);
+   virtio_balloon_read_cmd_id_received(vb);
if (cmd_id_active != vb->cmd_id_received)
break;
 
@@ -637,11 +648,9 @@ static int send_free_pages(struct virtio_balloon *vb)
return 0;
 }
 
-static void repo

[PATCH v2 0/2] virtio-balloon: tweak config_changed

2019-01-03 Thread Wei Wang
Since virtio-ccw doesn't support accessing the config space inside an
interrupt context, this patch series avoids that issue by moving the config
register accesses into the related workqueue contexts.

v1->v2 ChangeLog:
- add config_read_bitmap to indicate to the workqueue callbacks about
  the necessity of reading the related config fields.

Wei Wang (2):
  virtio-balloon: tweak config_changed implementation
  virtio-balloon: improve update_balloon_size_func

 drivers/virtio/virtio_balloon.c | 86 +++--
 1 file changed, 57 insertions(+), 29 deletions(-)

-- 
2.7.4



Re: [virtio-dev] Re: [PATCH v1 1/2] virtio-balloon: tweak config_changed implementation

2019-01-03 Thread Wei Wang

On 01/04/2019 12:42 AM, Michael S. Tsirkin wrote:

On Thu, Jan 03, 2019 at 01:31:01PM +0800, Wei Wang wrote:

virtio-ccw has deadlock issues with reading config registers inside the
interrupt context, so we tweak the virtballoon_changed implementation
by moving the config read operations into the related workqueue contexts.

Signed-off-by: Wei Wang 
---
  drivers/virtio/virtio_balloon.c | 54 -
  1 file changed, 26 insertions(+), 28 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 728ecd1..9a82a11 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -394,33 +394,15 @@ static void virtballoon_changed(struct virtio_device *vdev)
  {
struct virtio_balloon *vb = vdev->priv;
unsigned long flags;
-   s64 diff = towards_target(vb);
-
-   if (diff) {
-   spin_lock_irqsave(&vb->stop_update_lock, flags);
-   if (!vb->stop_update)
-   queue_work(system_freezable_wq,
-  &vb->update_balloon_size_work);
-   spin_unlock_irqrestore(&vb->stop_update_lock, flags);
-   }
  
-	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {

-   virtio_cread(vdev, struct virtio_balloon_config,
-free_page_report_cmd_id, &vb->cmd_id_received);

There's one problem with this approach:

previously updating the cmd_id_received here would immediately
stop the report in send_free_pages.

With this approach we are waiting for the wq to schedule,
which might be blocked waiting for report to complete.
So host can no longer quickly stop the report in progress.

A simple work-around would be to set some kind of flag whenever there
is a change interrupt, then have send_free_pages test it
and re-read cmd_id_received.

Needs to be an atomic I guess ...



Yes, sounds better..will update the patch.

Best,
Wei
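
The flag-based workaround Michael describes above is essentially what v2 of
the patch later implements with config_read_bitmap. A minimal sketch of the
pattern using the kernel's atomic bitops (simplified: the names and the
file-scope bitmap below are illustrative, the real driver keeps the bitmap in
struct virtio_balloon):

#include <linux/bitops.h>

#define BALLOON_CONFIG_READ_CMD_ID	0	/* bit index, name illustrative */

static unsigned long config_read_bitmap;

/* Interrupt (atomic) context: only mark that the field needs re-reading. */
static void config_changed_irq(void)
{
	set_bit(BALLOON_CONFIG_READ_CMD_ID, &config_read_bitmap);
}

/* Worker context: safe to touch the config space, so re-read if marked. */
static bool need_reread_cmd_id(void)
{
	return test_and_clear_bit(BALLOON_CONFIG_READ_CMD_ID,
				  &config_read_bitmap);
}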


Re: [PATCH v1 1/2] virtio-balloon: tweak config_changed implementation

2019-01-03 Thread Wei Wang

On 01/03/2019 05:40 PM, Cornelia Huck wrote:

On Thu,  3 Jan 2019 13:31:01 +0800
Wei Wang  wrote:


virtio-ccw has deadlock issues with reading config registers inside the

s/config registers/the config space/ ?


Sounds good.




interrupt context, so we tweak the virtballoon_changed implementation
by moving the config read operations into the related workqueue contexts.

Also credit Christian with a Reported-by:?


Yes, definitely. Sorry for missing that.

Best,
Wei


Re: [PATCH v1 0/2] Virtio: fix some vq allocation issues

2019-01-02 Thread Wei Wang

On 12/28/2018 03:57 PM, Christian Borntraeger wrote:


On 28.12.2018 03:26, Wei Wang wrote:

Some vqs don't need to be allocated when the related feature bits are
disabled. Callers notify the vq allocation layer by setting the related
names[i] to NULL.

This patch series fixes the find_vqs implementations to handle this case.

So the random crashes during boot are gone.
What still does not work is actually using the balloon.

So in the QEMU monitor, using let's say "balloon 1000" will hang the guest.
It seems to be a deadlock in the virtio-ccw code. We seem to call the
config code in the interrupt handler.


Please try applying both this series and the "virtio-balloon: tweak
config_changed" series that I just sent. This series fixes the ccw booting
issue, and that series tries to fix the ccw deadlock issue.


Best,
Wei


[PATCH v1 2/2] virtio-balloon: improve update_balloon_size_func

2019-01-02 Thread Wei Wang
There is no need to update the balloon actual register when there is no
ballooning request. This patch avoids update_balloon_size when diff is 0.

Signed-off-by: Wei Wang 
---
 drivers/virtio/virtio_balloon.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 9a82a11..4884b85 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -435,9 +435,12 @@ static void update_balloon_size_func(struct work_struct *work)
  update_balloon_size_work);
diff = towards_target(vb);
 
+   if (!diff)
+   return;
+
if (diff > 0)
diff -= fill_balloon(vb, diff);
-   else if (diff < 0)
+   else
diff += leak_balloon(vb, -diff);
update_balloon_size(vb);
 
-- 
2.7.4



[PATCH v1 1/2] virtio-balloon: tweak config_changed implementation

2019-01-02 Thread Wei Wang
virtio-ccw has deadlock issues with reading config registers inside the
interrupt context, so we tweak the virtballoon_changed implementation
by moving the config read operations into the related workqueue contexts.

Signed-off-by: Wei Wang 
---
 drivers/virtio/virtio_balloon.c | 54 -
 1 file changed, 26 insertions(+), 28 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 728ecd1..9a82a11 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -394,33 +394,15 @@ static void virtballoon_changed(struct virtio_device *vdev)
 {
struct virtio_balloon *vb = vdev->priv;
unsigned long flags;
-   s64 diff = towards_target(vb);
-
-   if (diff) {
-   spin_lock_irqsave(&vb->stop_update_lock, flags);
-   if (!vb->stop_update)
-   queue_work(system_freezable_wq,
-  &vb->update_balloon_size_work);
-   spin_unlock_irqrestore(&vb->stop_update_lock, flags);
-   }
 
-   if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
-   virtio_cread(vdev, struct virtio_balloon_config,
-free_page_report_cmd_id, &vb->cmd_id_received);
-   if (vb->cmd_id_received == VIRTIO_BALLOON_CMD_ID_DONE) {
-   /* Pass ULONG_MAX to give back all the free pages */
-   return_free_pages_to_mm(vb, ULONG_MAX);
-   } else if (vb->cmd_id_received != VIRTIO_BALLOON_CMD_ID_STOP &&
-  vb->cmd_id_received !=
-  virtio32_to_cpu(vdev, vb->cmd_id_active)) {
-   spin_lock_irqsave(&vb->stop_update_lock, flags);
-   if (!vb->stop_update) {
-   queue_work(vb->balloon_wq,
-  &vb->report_free_page_work);
-   }
-   spin_unlock_irqrestore(&vb->stop_update_lock, flags);
-   }
+   spin_lock_irqsave(&vb->stop_update_lock, flags);
+   if (!vb->stop_update) {
+   queue_work(system_freezable_wq,
+  &vb->update_balloon_size_work);
+   if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT))
+   queue_work(vb->balloon_wq, &vb->report_free_page_work);
}
+   spin_unlock_irqrestore(&vb->stop_update_lock, flags);
 }
 
 static void update_balloon_size(struct virtio_balloon *vb)
@@ -637,11 +619,9 @@ static int send_free_pages(struct virtio_balloon *vb)
return 0;
 }
 
-static void report_free_page_func(struct work_struct *work)
+static void virtio_balloon_report_free_page(struct virtio_balloon *vb)
 {
int err;
-   struct virtio_balloon *vb = container_of(work, struct virtio_balloon,
-report_free_page_work);
struct device *dev = &vb->vdev->dev;
 
/* Start by sending the received cmd id to host with an outbuf. */
@@ -659,6 +639,24 @@ static void report_free_page_func(struct work_struct *work)
dev_err(dev, "Failed to send a stop id, err = %d\n", err);
 }
 
+static void report_free_page_func(struct work_struct *work)
+{
+   struct virtio_balloon *vb = container_of(work, struct virtio_balloon,
+report_free_page_work);
+
+   virtio_cread(vb->vdev, struct virtio_balloon_config,
+free_page_report_cmd_id, &vb->cmd_id_received);
+
+   if (vb->cmd_id_received == VIRTIO_BALLOON_CMD_ID_DONE) {
+   /* Pass ULONG_MAX to give back all the free pages */
+   return_free_pages_to_mm(vb, ULONG_MAX);
+   } else if (vb->cmd_id_received != VIRTIO_BALLOON_CMD_ID_STOP &&
+  vb->cmd_id_received !=
+  virtio32_to_cpu(vb->vdev, vb->cmd_id_active)) {
+   virtio_balloon_report_free_page(vb);
+   }
+}
+
 #ifdef CONFIG_BALLOON_COMPACTION
 /*
  * virtballoon_migratepage - perform the balloon page migration on behalf of
-- 
2.7.4



[PATCH v1 0/2] virtio-balloon: tweak config_changed

2019-01-02 Thread Wei Wang
Since virtio-ccw doesn't support accessing the config registers inside an
interrupt context, this patch series avoids that issue by moving the config
register accesses into the related workqueue contexts.

Wei Wang (2):
  virtio-balloon: tweak config_changed implementation
  virtio-balloon: improve update_balloon_size_func

 drivers/virtio/virtio_balloon.c | 59 +
 1 file changed, 30 insertions(+), 29 deletions(-)

-- 
2.7.4



Re: [PATCH v37 0/3] Virtio-balloon: support free page reporting

2018-12-27 Thread Wei Wang

On 12/27/2018 08:17 PM, Christian Borntraeger wrote:


On 27.12.2018 12:59, Christian Borntraeger wrote:

On 27.12.2018 12:31, Christian Borntraeger wrote:

This patch triggers random crashes in the guest kernel on s390 early during 
boot.
No migration and no setting of the balloon is involved.


Adding Conny and Halil,

As the QEMU provides no PAGE_HINT feature yet, this quick hack makes the
guest boot fine again:


diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 728ecd1eea305..aa2e1864c5736 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -492,7 +492,7 @@ static int init_vqs(struct virtio_balloon *vb)
 callbacks[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
 }
  
-   err = vb->vdev->config->find_vqs(vb->vdev, VIRTIO_BALLOON_VQ_MAX,

+   err = vb->vdev->config->find_vqs(vb->vdev, 3, //VIRTIO_BALLOON_VQ_MAX,
  vqs, callbacks, names, NULL, NULL);
 if (err)
 return err;


To me it looks like virtio_ccw_find_vqs will abort if any of the virtqueues
that it has been asked for does not exist (including the earlier ones).


This "hack" makes the random crashes go away, but the balloon interface itself
does not work. (setting the value to anything will hang the guest).
As patch 1 also modifies the main path, there seem to be additional issues, 
maybe
endianess

Looking at things like

+   vb->cmd_id_received = VIRTIO_BALLOON_CMD_ID_STOP;
+   vb->cmd_id_active = cpu_to_virtio32(vb->vdev,
+ VIRTIO_BALLOON_CMD_ID_STOP);
+   vb->cmd_id_stop = cpu_to_virtio32(vb->vdev,
+ VIRTIO_BALLOON_CMD_ID_STOP);


Why is cmd_id_received not using cpu_to_virtio32?



That conversion is only needed when we need to send the value to the device.
cmd_id_received doesn't need to be sent to the device.

Best,
Wei




Re: [PATCH v37 1/3] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

2018-12-27 Thread Wei Wang

On 12/27/2018 08:03 PM, Christian Borntraeger wrote:

On 27.08.2018 03:32, Wei Wang wrote:

  static int init_vqs(struct virtio_balloon *vb)
  {
-   struct virtqueue *vqs[3];
-   vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request };
-   static const char * const names[] = { "inflate", "deflate", "stats" };
-   int err, nvqs;
+   struct virtqueue *vqs[VIRTIO_BALLOON_VQ_MAX];
+   vq_callback_t *callbacks[VIRTIO_BALLOON_VQ_MAX];
+   const char *names[VIRTIO_BALLOON_VQ_MAX];
+   int err;

/*
-* We expect two virtqueues: inflate and deflate, and
-* optionally stat.
+* Inflateq and deflateq are used unconditionally. The names[]
+* will be NULL if the related feature is not enabled, which will
+* cause no allocation for the corresponding virtqueue in find_vqs.
 */

This might be true for virtio-pci, but it is not for virtio-ccw.


Hi Christian,


Please try the fix patches: https://lkml.org/lkml/2018/12/27/336

Best,
Wei


[PATCH v1 2/2] virtio: don't allocate vqs when names[i] = NULL

2018-12-27 Thread Wei Wang
Some vqs may not need to be allocated when their related feature bits are
disabled, so callers may pass in such vqs with names[i] = NULL. Then we skip
the allocation of those vqs.
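
For context, this is the shape of the caller this targets. The fragment below
is paraphrased from the balloon driver's init_vqs() (it is not part of this
patch) and only illustrates how names[i]/callbacks[i] end up NULL when a
feature is not negotiated:

	struct virtqueue *vqs[VIRTIO_BALLOON_VQ_MAX];
	vq_callback_t *callbacks[VIRTIO_BALLOON_VQ_MAX];
	const char *names[VIRTIO_BALLOON_VQ_MAX];

	/* inflate/deflate always exist */
	names[VIRTIO_BALLOON_VQ_INFLATE] = "inflate";
	callbacks[VIRTIO_BALLOON_VQ_INFLATE] = balloon_ack;
	names[VIRTIO_BALLOON_VQ_DEFLATE] = "deflate";
	callbacks[VIRTIO_BALLOON_VQ_DEFLATE] = balloon_ack;

	/* optional vqs stay NULL unless the feature is negotiated */
	names[VIRTIO_BALLOON_VQ_STATS] = NULL;
	callbacks[VIRTIO_BALLOON_VQ_STATS] = NULL;
	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
		names[VIRTIO_BALLOON_VQ_STATS] = "stats";
		callbacks[VIRTIO_BALLOON_VQ_STATS] = stats_request;
	}

	err = vb->vdev->config->find_vqs(vb->vdev, VIRTIO_BALLOON_VQ_MAX,
					 vqs, callbacks, names, NULL, NULL);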

Signed-off-by: Wei Wang 
---
 drivers/misc/mic/vop/vop_main.c|  9 +++--
 drivers/remoteproc/remoteproc_virtio.c |  9 +++--
 drivers/s390/virtio/virtio_ccw.c   | 12 +---
 drivers/virtio/virtio_mmio.c   |  9 +++--
 4 files changed, 30 insertions(+), 9 deletions(-)

diff --git a/drivers/misc/mic/vop/vop_main.c b/drivers/misc/mic/vop/vop_main.c
index 6b212c8..2bfa3a9 100644
--- a/drivers/misc/mic/vop/vop_main.c
+++ b/drivers/misc/mic/vop/vop_main.c
@@ -394,16 +394,21 @@ static int vop_find_vqs(struct virtio_device *dev, unsigned nvqs,
struct _vop_vdev *vdev = to_vopvdev(dev);
struct vop_device *vpdev = vdev->vpdev;
struct mic_device_ctrl __iomem *dc = vdev->dc;
-   int i, err, retry;
+   int i, err, retry, queue_idx = 0;
 
/* We must have this many virtqueues. */
if (nvqs > ioread8(&vdev->desc->num_vq))
return -ENOENT;
 
for (i = 0; i < nvqs; ++i) {
+   if (!names[i]) {
+   vqs[i] = NULL;
+   continue;
+   }
+
dev_dbg(_vop_dev(vdev), "%s: %d: %s\n",
__func__, i, names[i]);
-   vqs[i] = vop_find_vq(dev, i, callbacks[i], names[i],
+   vqs[i] = vop_find_vq(dev, queue_idx++, callbacks[i], names[i],
 ctx ? ctx[i] : false);
if (IS_ERR(vqs[i])) {
err = PTR_ERR(vqs[i]);
diff --git a/drivers/remoteproc/remoteproc_virtio.c b/drivers/remoteproc/remoteproc_virtio.c
index 183fc42..2d7cd344 100644
--- a/drivers/remoteproc/remoteproc_virtio.c
+++ b/drivers/remoteproc/remoteproc_virtio.c
@@ -153,10 +153,15 @@ static int rproc_virtio_find_vqs(struct virtio_device *vdev, unsigned int nvqs,
 const bool * ctx,
 struct irq_affinity *desc)
 {
-   int i, ret;
+   int i, ret, queue_idx = 0;
 
for (i = 0; i < nvqs; ++i) {
-   vqs[i] = rp_find_vq(vdev, i, callbacks[i], names[i],
+   if (!names[i]) {
+   vqs[i] = NULL;
+   continue;
+   }
+
+   vqs[i] = rp_find_vq(vdev, queue_idx++, callbacks[i], names[i],
ctx ? ctx[i] : false);
if (IS_ERR(vqs[i])) {
ret = PTR_ERR(vqs[i]);
diff --git a/drivers/s390/virtio/virtio_ccw.c b/drivers/s390/virtio/virtio_ccw.c
index fc9dbad..ae1d56d 100644
--- a/drivers/s390/virtio/virtio_ccw.c
+++ b/drivers/s390/virtio/virtio_ccw.c
@@ -635,7 +635,7 @@ static int virtio_ccw_find_vqs(struct virtio_device *vdev, unsigned nvqs,
 {
struct virtio_ccw_device *vcdev = to_vc_device(vdev);
unsigned long *indicatorp = NULL;
-   int ret, i;
+   int ret, i, queue_idx = 0;
struct ccw1 *ccw;
 
ccw = kzalloc(sizeof(*ccw), GFP_DMA | GFP_KERNEL);
@@ -643,8 +643,14 @@ static int virtio_ccw_find_vqs(struct virtio_device *vdev, unsigned nvqs,
return -ENOMEM;
 
for (i = 0; i < nvqs; ++i) {
-   vqs[i] = virtio_ccw_setup_vq(vdev, i, callbacks[i], names[i],
-ctx ? ctx[i] : false, ccw);
+   if (!names[i]) {
+   vqs[i] = NULL;
+   continue;
+   }
+
+   vqs[i] = virtio_ccw_setup_vq(vdev, queue_idx++, callbacks[i],
+names[i], ctx ? ctx[i] : false,
+ccw);
if (IS_ERR(vqs[i])) {
ret = PTR_ERR(vqs[i]);
vqs[i] = NULL;
diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
index 4cd9ea5..d9dd0f78 100644
--- a/drivers/virtio/virtio_mmio.c
+++ b/drivers/virtio/virtio_mmio.c
@@ -468,7 +468,7 @@ static int vm_find_vqs(struct virtio_device *vdev, unsigned nvqs,
 {
struct virtio_mmio_device *vm_dev = to_virtio_mmio_device(vdev);
unsigned int irq = platform_get_irq(vm_dev->pdev, 0);
-   int i, err;
+   int i, err, queue_idx = 0;
 
err = request_irq(irq, vm_interrupt, IRQF_SHARED,
dev_name(>dev), vm_dev);
@@ -476,7 +476,12 @@ static int vm_find_vqs(struct virtio_device *vdev, unsigned nvqs,
return err;
 
for (i = 0; i < nvqs; ++i) {
-   vqs[i] = vm_setup_vq(vdev, i, callbacks[i], names[i],
+   if (!names[i]) {
+   vqs[i] = NULL;
+   continue;
+   }
+
+   vqs[i] = vm_setup_vq(vdev, queue_idx++, callbacks[i], names[i],
 ct

[PATCH v1 1/2] virtio_pci: use queue idx instead of array idx to set up the vq

2018-12-27 Thread Wei Wang
When find_vqs is called, there will be no vqs[i] allocation if its
corresponding names[i] is NULL. For example, the caller may pass in
names[0..4] with names[2] being NULL because the related feature bit is
turned off; technically only 3 queues precede it on the device, so names[4]
should be set up as the device's queue index 3 rather than 4.

So we use queue_idx as the queue index, which is increased only when the
queue exists.
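To make the mapping concrete, a small illustrative example (not part of this
patch):

/*
 * Illustrative mapping with one disabled queue:
 *
 *   names[] = { "rx", "tx", NULL, "ctrl" }
 *
 *   array index i = 0  ->  device queue_idx 0  ("rx")
 *   array index i = 1  ->  device queue_idx 1  ("tx")
 *   array index i = 2  ->  skipped, vqs[2] = NULL
 *   array index i = 3  ->  device queue_idx 2  ("ctrl")
 *
 * Hence the setup calls below pass queue_idx++ (incremented only for
 * non-NULL names) instead of the array index i.
 */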

Signed-off-by: Wei Wang 
---
 drivers/virtio/virtio_pci_common.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/virtio/virtio_pci_common.c b/drivers/virtio/virtio_pci_common.c
index 465a6f5..d0584c0 100644
--- a/drivers/virtio/virtio_pci_common.c
+++ b/drivers/virtio/virtio_pci_common.c
@@ -285,7 +285,7 @@ static int vp_find_vqs_msix(struct virtio_device *vdev, unsigned nvqs,
 {
struct virtio_pci_device *vp_dev = to_vp_device(vdev);
u16 msix_vec;
-   int i, err, nvectors, allocated_vectors;
+   int i, err, nvectors, allocated_vectors, queue_idx = 0;
 
vp_dev->vqs = kcalloc(nvqs, sizeof(*vp_dev->vqs), GFP_KERNEL);
if (!vp_dev->vqs)
@@ -321,7 +321,7 @@ static int vp_find_vqs_msix(struct virtio_device *vdev, unsigned nvqs,
msix_vec = allocated_vectors++;
else
msix_vec = VP_MSIX_VQ_VECTOR;
-   vqs[i] = vp_setup_vq(vdev, i, callbacks[i], names[i],
+   vqs[i] = vp_setup_vq(vdev, queue_idx++, callbacks[i], names[i],
 ctx ? ctx[i] : false,
 msix_vec);
if (IS_ERR(vqs[i])) {
@@ -356,7 +356,7 @@ static int vp_find_vqs_intx(struct virtio_device *vdev, unsigned nvqs,
const char * const names[], const bool *ctx)
 {
struct virtio_pci_device *vp_dev = to_vp_device(vdev);
-   int i, err;
+   int i, err, queue_idx = 0;
 
vp_dev->vqs = kcalloc(nvqs, sizeof(*vp_dev->vqs), GFP_KERNEL);
if (!vp_dev->vqs)
@@ -374,7 +374,7 @@ static int vp_find_vqs_intx(struct virtio_device *vdev, unsigned nvqs,
vqs[i] = NULL;
continue;
}
-   vqs[i] = vp_setup_vq(vdev, i, callbacks[i], names[i],
+   vqs[i] = vp_setup_vq(vdev, queue_idx++, callbacks[i], names[i],
 ctx ? ctx[i] : false,
 VIRTIO_MSI_NO_VECTOR);
if (IS_ERR(vqs[i])) {
-- 
2.7.4

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH v1 0/2] Virtio: fix some vq allocation issues

2018-12-27 Thread Wei Wang
Some vqs don't need to be allocated when the related feature bits are
disabled. Callers notice the vq allocation layer by setting the related
names[i] to be NULL.

This patch series fixes the find_vqs implementations to handle this case.

Wei Wang (2):
  virtio_pci: use queue idx instead of array idx to set up the vq
  virtio: don't allocate vqs when names[i] = NULL

 drivers/misc/mic/vop/vop_main.c|  9 +++--
 drivers/remoteproc/remoteproc_virtio.c |  9 +++--
 drivers/s390/virtio/virtio_ccw.c   | 12 +---
 drivers/virtio/virtio_mmio.c   |  9 +++--
 drivers/virtio/virtio_pci_common.c |  8 
 5 files changed, 34 insertions(+), 13 deletions(-)

-- 
2.7.4

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v37 0/3] Virtio-balloon: support free page reporting

2018-10-25 Thread Wei Wang

On 10/25/2018 08:58 AM, Michael S. Tsirkin wrote:

On Mon, Aug 27, 2018 at 09:32:16AM +0800, Wei Wang wrote:

The new feature, VIRTIO_BALLOON_F_FREE_PAGE_HINT, implemented by this
series enables the virtio-balloon driver to report hints of guest free
pages to host. It can be used to accelerate virtual machine (VM) live
migration. Here is an introduction of this usage:

Live migration needs to transfer the VM's memory from the source machine
to the destination round by round. For the 1st round, all the VM's memory
is transferred. From the 2nd round, only the pieces of memory that were
written by the guest (after the 1st round) are transferred. One method
that is popularly used by the hypervisor to track which part of memory is
written is to have the hypervisor write-protect all the guest memory.

This feature enables the optimization by skipping the transfer of guest
free pages during VM live migration. It is not concerned that the memory
pages are used after they are given to the hypervisor as a hint of the
free pages, because they will be tracked by the hypervisor and transferred
in the subsequent round if they are used and written.

OK so it will be in linux-next.  Now can I trouble you for a virtio spec
patch with the description please?


No problem, I'll start to patch the spec. Thanks!

Best,
Wei
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v36 0/5] Virtio-balloon: support free page reporting

2018-09-06 Thread Wei Wang

On 07/23/2018 10:36 PM, Dr. David Alan Gilbert wrote:

* Michael S. Tsirkin (m...@redhat.com) wrote:

On Fri, Jul 20, 2018 at 04:33:00PM +0800, Wei Wang wrote:

This patch series is separated from the previous "Virtio-balloon
Enhancement" series. The new feature, VIRTIO_BALLOON_F_FREE_PAGE_HINT,
implemented by this series enables the virtio-balloon driver to report
hints of guest free pages to the host. It can be used to accelerate live
migration of VMs. Here is an introduction of this usage:

Live migration needs to transfer the VM's memory from the source machine
to the destination round by round. For the 1st round, all the VM's memory
is transferred. From the 2nd round, only the pieces of memory that were
written by the guest (after the 1st round) are transferred. One method
that is popularly used by the hypervisor to track which part of memory is
written is to write-protect all the guest memory.

This feature enables the optimization by skipping the transfer of guest
free pages during VM live migration. It is not concerned that the memory
pages are used after they are given to the hypervisor as a hint of the
free pages, because they will be tracked by the hypervisor and transferred
in the subsequent round if they are used and written.

* Tests
- Test Environment
 Host: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
 Guest: 8G RAM, 4 vCPU
 Migration setup: migrate_set_speed 100G, migrate_set_downtime 2 second

- Test Results
 - Idle Guest Live Migration Time (results are averaged over 10 runs):
 - Optimization v.s. Legacy = 409ms vs 1757ms --> ~77% reduction
(setting page poisoning zero and enabling ksm don't affect the
  comparison result)
 - Guest with Linux Compilation Workload (make bzImage -j4):
 - Live Migration Time (average)
   Optimization v.s. Legacy = 1407ms v.s. 2528ms --> ~44% reduction
 - Linux Compilation Time
   Optimization v.s. Legacy = 5min4s v.s. 5min12s
   --> no obvious difference

I'd like to see dgilbert's take on whether this kind of gain
justifies adding a PV interfaces, and what kind of guest workload
is appropriate.

Cc'd.

Well, 44% is great ... although the measurement is a bit weird.

a) A 2 second downtime is very large; 300-500ms is more normal
b) I'm not sure what the 'average' is  - is that just between a bunch of
repeated migrations?
c) What load was running in the guest during the live migration?

An interesting measurement to add would be to do the same test but
with a VM with a lot more RAM but the same load;  you'd hope the gain
would be even better.
It would be interesting, especially because the users who are interested
are people creating VMs allocated with lots of extra memory (for the
worst case) but most of the time migrating when it's fairly idle.

Dave



Hi Dave,

The results of the added experiments have been shown in the v37 cover 
letter.

Could you have a look at https://lkml.org/lkml/2018/8/27/29 . Thanks.

Best,
Wei

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH v37 1/3] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

2018-08-26 Thread Wei Wang
Negotiation of the VIRTIO_BALLOON_F_FREE_PAGE_HINT feature indicates the
support of reporting hints of guest free pages to host via virtio-balloon.
Currently, only free page blocks of MAX_ORDER - 1 are reported. They are
obtained one by one from the mm free list via the regular allocation
function.

Host requests the guest to report free page hints by sending a new cmd id
to the guest via the free_page_report_cmd_id configuration register. When
the guest starts to report, it first sends a start cmd to host via the
free page vq, which acks to host the cmd id received. When the guest
finishes reporting free pages, a stop cmd is sent to host via the vq.
Host may also send a stop cmd id to the guest to stop the reporting.

VIRTIO_BALLOON_CMD_ID_STOP: Host sends this cmd to stop the guest
reporting.
VIRTIO_BALLOON_CMD_ID_DONE: Host sends this cmd to tell the guest that
the reported pages are ready to be freed.

Why does the guest free the reported pages only when host tells it that
they are ready to be freed?
This is because freeing pages immediately appears to be expensive for live
migration: free_pages() dirties memory very quickly and makes the live
migration not converge in some cases. So it is better to delay the free_page
operation until the migration is done, and host sends a command to guest
about that.

Why do we need the new VIRTIO_BALLOON_CMD_ID_DONE, instead of reusing
VIRTIO_BALLOON_CMD_ID_STOP?
This is because live migration is usually done in several rounds. At the
end of each round, host needs to send a VIRTIO_BALLOON_CMD_ID_STOP cmd to
the guest to stop (or say pause) the reporting. The guest resumes the
reporting when it receives a new command id at the beginning of the next
round. So we need a new cmd id to distinguish between "stop reporting" and
"ready to free the reported pages".

TODO:
- Add a batch page allocation API to amortize the allocation overhead.

Signed-off-by: Wei Wang 
Signed-off-by: Liang Li 
Cc: Michael S. Tsirkin 
Cc: Michal Hocko 
Cc: Andrew Morton 
Cc: Linus Torvalds 
---
 drivers/virtio/virtio_balloon.c | 364 
 include/uapi/linux/virtio_balloon.h |   5 +
 2 files changed, 336 insertions(+), 33 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index d1c1f62..a185678 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -41,13 +41,34 @@
 #define VIRTIO_BALLOON_ARRAY_PFNS_MAX 256
 #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
 
+#define VIRTIO_BALLOON_FREE_PAGE_ALLOC_FLAG (__GFP_NORETRY | __GFP_NOWARN | \
+__GFP_NOMEMALLOC)
+/* The order of free page blocks to report to host */
+#define VIRTIO_BALLOON_FREE_PAGE_ORDER (MAX_ORDER - 1)
+/* The size of a free page block in bytes */
+#define VIRTIO_BALLOON_FREE_PAGE_SIZE \
+   (1 << (VIRTIO_BALLOON_FREE_PAGE_ORDER + PAGE_SHIFT))
+
 #ifdef CONFIG_BALLOON_COMPACTION
 static struct vfsmount *balloon_mnt;
 #endif
 
+enum virtio_balloon_vq {
+   VIRTIO_BALLOON_VQ_INFLATE,
+   VIRTIO_BALLOON_VQ_DEFLATE,
+   VIRTIO_BALLOON_VQ_STATS,
+   VIRTIO_BALLOON_VQ_FREE_PAGE,
+   VIRTIO_BALLOON_VQ_MAX
+};
+
 struct virtio_balloon {
struct virtio_device *vdev;
-   struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
+   struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
+
+   /* Balloon's own wq for cpu-intensive work items */
+   struct workqueue_struct *balloon_wq;
+   /* The free page reporting work item submitted to the balloon wq */
+   struct work_struct report_free_page_work;
 
/* The balloon servicing is delegated to a freezable workqueue. */
struct work_struct update_balloon_stats_work;
@@ -57,6 +78,18 @@ struct virtio_balloon {
spinlock_t stop_update_lock;
bool stop_update;
 
+   /* The list of allocated free pages, waiting to be given back to mm */
+   struct list_head free_page_list;
+   spinlock_t free_page_list_lock;
+   /* The number of free page blocks on the above list */
+   unsigned long num_free_page_blocks;
+   /* The cmd id received from host */
+   u32 cmd_id_received;
+   /* The cmd id that is actively in use */
+   __virtio32 cmd_id_active;
+   /* Buffer to store the stop sign */
+   __virtio32 cmd_id_stop;
+
/* Waiting for host to ack the pages we released. */
wait_queue_head_t acked;
 
@@ -320,17 +353,6 @@ static void stats_handle_request(struct virtio_balloon *vb)
virtqueue_kick(vq);
 }
 
-static void virtballoon_changed(struct virtio_device *vdev)
-{
-   struct virtio_balloon *vb = vdev->priv;
-   unsigned long flags;
-
-   spin_lock_irqsave(>stop_update_lock, flags);
-   if (!vb->stop_update)
-   queue_work(system_freezable_wq, >update_balloon_size_work);
-   spin_unlock_irqrestore(>stop_update_lock, flags);
-}
-
 static inline s64 towards_target(struct 

[PATCH v37 2/3] mm/page_poison: expose page_poisoning_enabled to kernel modules

2018-08-26 Thread Wei Wang
In some usages, e.g. virtio-balloon, a kernel module needs to know if
page poisoning is in use. This patch exposes the page_poisoning_enabled
function to kernel modules.
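A minimal usage sketch, mirroring what patch 3/3 of this series does
(MY_F_PAGE_POISON is a hypothetical placeholder feature bit):

#include <linux/mm.h>
#include <linux/virtio_config.h>

#define MY_F_PAGE_POISON 4      /* hypothetical feature bit */

static int example_validate(struct virtio_device *vdev)
{
        /* Only advertise the poison-related feature if poisoning is on */
        if (!page_poisoning_enabled())
                __virtio_clear_bit(vdev, MY_F_PAGE_POISON);
        return 0;
}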

Signed-off-by: Wei Wang 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Michael S. Tsirkin 
Acked-by: Andrew Morton 
---
 mm/page_poison.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/mm/page_poison.c b/mm/page_poison.c
index aa2b3d3..830f604 100644
--- a/mm/page_poison.c
+++ b/mm/page_poison.c
@@ -17,6 +17,11 @@ static int __init early_page_poison_param(char *buf)
 }
 early_param("page_poison", early_page_poison_param);
 
+/**
+ * page_poisoning_enabled - check if page poisoning is enabled
+ *
+ * Return true if page poisoning is enabled, or false if not.
+ */
 bool page_poisoning_enabled(void)
 {
/*
@@ -29,6 +34,7 @@ bool page_poisoning_enabled(void)
(!IS_ENABLED(CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC) &&
debug_pagealloc_enabled()));
 }
+EXPORT_SYMBOL_GPL(page_poisoning_enabled);
 
 static void poison_page(struct page *page)
 {
-- 
2.7.4

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH v37 3/3] virtio-balloon: VIRTIO_BALLOON_F_PAGE_POISON

2018-08-26 Thread Wei Wang
The VIRTIO_BALLOON_F_PAGE_POISON feature bit is used to indicate if the
guest is using page poisoning. Guest writes to the poison_val config
field to tell host about the page poisoning value that is in use.

Suggested-by: Michael S. Tsirkin 
Signed-off-by: Wei Wang 
Cc: Michael S. Tsirkin 
Cc: Michal Hocko 
Cc: Andrew Morton 
---
 drivers/virtio/virtio_balloon.c | 10 ++
 include/uapi/linux/virtio_balloon.h |  3 +++
 2 files changed, 13 insertions(+)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index a185678..728ecd1 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -825,6 +825,7 @@ static int virtio_balloon_register_shrinker(struct virtio_balloon *vb)
 static int virtballoon_probe(struct virtio_device *vdev)
 {
struct virtio_balloon *vb;
+   __u32 poison_val;
int err;
 
if (!vdev->config->get) {
@@ -892,6 +893,11 @@ static int virtballoon_probe(struct virtio_device *vdev)
vb->num_free_page_blocks = 0;
spin_lock_init(>free_page_list_lock);
INIT_LIST_HEAD(>free_page_list);
+   if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_PAGE_POISON)) {
+   memset(_val, PAGE_POISON, sizeof(poison_val));
+   virtio_cwrite(vb->vdev, struct virtio_balloon_config,
+ poison_val, _val);
+   }
}
/*
 * We continue to use VIRTIO_BALLOON_F_DEFLATE_ON_OOM to decide if a
@@ -992,6 +998,9 @@ static int virtballoon_restore(struct virtio_device *vdev)
 
 static int virtballoon_validate(struct virtio_device *vdev)
 {
+   if (!page_poisoning_enabled())
+   __virtio_clear_bit(vdev, VIRTIO_BALLOON_F_PAGE_POISON);
+
__virtio_clear_bit(vdev, VIRTIO_F_IOMMU_PLATFORM);
return 0;
 }
@@ -1001,6 +1010,7 @@ static unsigned int features[] = {
VIRTIO_BALLOON_F_STATS_VQ,
VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
VIRTIO_BALLOON_F_FREE_PAGE_HINT,
+   VIRTIO_BALLOON_F_PAGE_POISON,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index 47c9eb4..a1966cd7 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -35,6 +35,7 @@
 #define VIRTIO_BALLOON_F_STATS_VQ  1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM2 /* Deflate balloon on OOM */
 #define VIRTIO_BALLOON_F_FREE_PAGE_HINT3 /* VQ to report free pages */
+#define VIRTIO_BALLOON_F_PAGE_POISON   4 /* Guest is using page poisoning */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
@@ -48,6 +49,8 @@ struct virtio_balloon_config {
__u32 actual;
/* Free page report command id, readonly by guest */
__u32 free_page_report_cmd_id;
+   /* Stores PAGE_POISON if page poisoning is in use */
+   __u32 poison_val;
 };
 
 #define VIRTIO_BALLOON_S_SWAP_IN  0   /* Amount of memory swapped in */
-- 
2.7.4

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH v37 0/3] Virtio-balloon: support free page reporting

2018-08-26 Thread Wei Wang
The new feature, VIRTIO_BALLOON_F_FREE_PAGE_HINT, implemented by this
series enables the virtio-balloon driver to report hints of guest free
pages to host. It can be used to accelerate virtual machine (VM) live
migration. Here is an introduction of this usage:

Live migration needs to transfer the VM's memory from the source machine
to the destination round by round. For the 1st round, all the VM's memory
is transferred. From the 2nd round, only the pieces of memory that were
written by the guest (after the 1st round) are transferred. One method
that is popularly used by the hypervisor to track which part of memory is
written is to have the hypervisor write-protect all the guest memory.

This feature enables the optimization by skipping the transfer of guest
free pages during VM live migration. It is not concerned that the memory
pages are used after they are given to the hypervisor as a hint of the
free pages, because they will be tracked by the hypervisor and transferred
in the subsequent round if they are used and written.
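Conceptually, the hypervisor is expected to use the hints roughly as below
(illustrative pseudo-code only; the types and helpers are placeholders, not
actual QEMU functions):

static void send_memory_round(struct guest_ram *ram)
{
        unsigned long pfn;

        for (pfn = 0; pfn < ram->nr_pages; pfn++) {
                /*
                 * A page hinted as free is skipped unless the dirty log
                 * shows the guest wrote to it after the hint was taken;
                 * such pages are picked up again in a later round.
                 */
                if (hinted_free(ram, pfn) && !dirtied_since_hint(ram, pfn))
                        continue;
                send_page(ram, pfn);
        }
}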

* Tests
1 Test Environment
Host: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
Migration setup: migrate_set_speed 100G, migrate_set_downtime 400ms

2 Test Results (results are averaged over several repeated runs)
2.1 Guest setup: 8G RAM, 4 vCPU
2.1.1 Idle guest live migration time
Optimization v.s. Legacy = 620ms vs 2970ms
--> ~79% reduction
2.1.2 Guest live migration with Linux compilation workload
  (i.e. make bzImage -j4) running
  1) Live Migration Time:
 Optimization v.s. Legacy = 2273ms v.s. 4502ms
 --> ~50% reduction
  2) Linux Compilation Time:
 Optimization v.s. Legacy = 8min42s v.s. 8min43s
 --> no obvious difference

2.2 Guest setup: 128G RAM, 4 vCPU
2.2.1 Idle guest live migration time
Optimization v.s. Legacy = 5294ms vs 41651ms
--> ~87% reduction
2.2.2 Guest live migration with Linux compilation workload
  1) Live Migration Time:
Optimization v.s. Legacy = 8816ms v.s. 54201ms
--> 84% reduction
  2) Linux Compilation Time:
 Optimization v.s. Legacy = 8min30s v.s. 8min36s
 --> no obvious difference

ChangeLog:
v36->v37:
- free the reported pages to mm when the driver receives a DONE cmd from
  host. Please see patch 1's commit log for the detailed reasons.

For ChangeLogs from v22 to v36, please reference
https://lkml.org/lkml/2018/7/20/199

For ChangeLogs before v21, please reference
https://lwn.net/Articles/743660/

Wei Wang (3):
  virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
  mm/page_poison: expose page_poisoning_enabled to kernel modules
  virtio-balloon: VIRTIO_BALLOON_F_PAGE_POISON

 drivers/virtio/virtio_balloon.c | 374 
 include/uapi/linux/virtio_balloon.h |   8 +
 mm/page_poison.c|   6 +
 3 files changed, 355 insertions(+), 33 deletions(-)

-- 
2.7.4

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH v4 3/3] virtio_balloon: replace oom notifier with shrinker

2018-08-16 Thread Wei Wang
The OOM notifier is being deprecated for the following reasons:
- As a callout from the oom context, it is too subtle and easily
  generates bugs and corner cases which are hard to track;
- It is called too late (after the reclaiming has been performed).
  Drivers with a large amount of reclaimable memory are expected to
  release it at an early stage of memory pressure;
- The notifier callback isn't aware of the oom constraints;
Link: https://lkml.org/lkml/2018/7/12/314

This patch replaces the virtio-balloon oom notifier with a shrinker
to release balloon pages on memory pressure. The balloon pages are
given back to mm adaptively by returning the number of pages that the
reclaimer is asking for (i.e. sc->nr_to_scan).

Currently the max possible value of sc->nr_to_scan passed to the balloon
shrinker is SHRINK_BATCH, which is 128. This is smaller than the
limitation that only VIRTIO_BALLOON_ARRAY_PFNS_MAX (256) pages can be
returned via one invocation of leak_balloon. But this patch still
considers the case that SHRINK_BATCH or shrinker->batch could be changed
to a value larger than VIRTIO_BALLOON_ARRAY_PFNS_MAX, which will need to
do multiple invocations of leak_balloon.

Historically, the feature VIRTIO_BALLOON_F_DEFLATE_ON_OOM has been used
to release balloon pages on OOM. We continue to use this feature bit for
the shrinker, so the shrinker is only registered when this feature bit
has been negotiated with host.
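A condensed sketch of the scan path described above (only an illustration of
the loop structure; the actual hunk follows in the diff below):

static unsigned long virtio_balloon_shrinker_scan_sketch(struct shrinker *s,
                                                 struct shrink_control *sc)
{
        struct virtio_balloon *vb = container_of(s, struct virtio_balloon,
                                                 shrinker);
        unsigned long pages_to_free, pages_freed = 0;

        /*
         * sc->nr_to_scan counts guest pages; leak_balloon() works in
         * balloon (VIRTIO_BALLOON_PFN_SHIFT) page units.
         */
        pages_to_free = sc->nr_to_scan * VIRTIO_BALLOON_PAGES_PER_PAGE;

        /*
         * leak_balloon() deflates at most VIRTIO_BALLOON_ARRAY_PFNS_MAX
         * pages per call, so loop until the requested amount is reached.
         */
        while (vb->num_pages && pages_freed < pages_to_free)
                pages_freed += leak_balloon(vb, pages_to_free - pages_freed);

        update_balloon_size(vb);

        /* report back in the same units the shrinker asked in */
        return pages_freed / VIRTIO_BALLOON_PAGES_PER_PAGE;
}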

Signed-off-by: Wei Wang 
Cc: Michael S. Tsirkin 
Cc: Michal Hocko 
Cc: Andrew Morton 
Cc: Tetsuo Handa 
---
 drivers/virtio/virtio_balloon.c | 110 +---
 1 file changed, 59 insertions(+), 51 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index d97d73c..d1c1f62 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -27,7 +27,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -40,13 +39,8 @@
  */
 #define VIRTIO_BALLOON_PAGES_PER_PAGE (unsigned)(PAGE_SIZE >> VIRTIO_BALLOON_PFN_SHIFT)
 #define VIRTIO_BALLOON_ARRAY_PFNS_MAX 256
-#define OOM_VBALLOON_DEFAULT_PAGES 256
 #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
 
-static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
-module_param(oom_pages, int, S_IRUSR | S_IWUSR);
-MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
-
 #ifdef CONFIG_BALLOON_COMPACTION
 static struct vfsmount *balloon_mnt;
 #endif
@@ -86,8 +80,8 @@ struct virtio_balloon {
/* Memory statistics */
struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
 
-   /* To register callback in oom notifier call chain */
-   struct notifier_block nb;
+   /* To register a shrinker to shrink memory upon memory pressure */
+   struct shrinker shrinker;
 };
 
 static struct virtio_device_id id_table[] = {
@@ -365,38 +359,6 @@ static void update_balloon_size(struct virtio_balloon *vb)
  );
 }
 
-/*
- * virtballoon_oom_notify - release pages when system is under severe
- * memory pressure (called from out_of_memory())
- * @self : notifier block struct
- * @dummy: not used
- * @parm : returned - number of freed pages
- *
- * The balancing of memory by use of the virtio balloon should not cause
- * the termination of processes while there are pages in the balloon.
- * If virtio balloon manages to release some memory, it will make the
- * system return and retry the allocation that forced the OOM killer
- * to run.
- */
-static int virtballoon_oom_notify(struct notifier_block *self,
- unsigned long dummy, void *parm)
-{
-   struct virtio_balloon *vb;
-   unsigned long *freed;
-   unsigned num_freed_pages;
-
-   vb = container_of(self, struct virtio_balloon, nb);
-   if (!virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
-   return NOTIFY_OK;
-
-   freed = parm;
-   num_freed_pages = leak_balloon(vb, oom_pages);
-   update_balloon_size(vb);
-   *freed += num_freed_pages;
-
-   return NOTIFY_OK;
-}
-
 static void update_balloon_stats_func(struct work_struct *work)
 {
struct virtio_balloon *vb;
@@ -550,6 +512,52 @@ static struct file_system_type balloon_fs = {
 
 #endif /* CONFIG_BALLOON_COMPACTION */
 
+static unsigned long virtio_balloon_shrinker_scan(struct shrinker *shrinker,
+ struct shrink_control *sc)
+{
+   unsigned long pages_to_free, pages_freed = 0;
+   struct virtio_balloon *vb = container_of(shrinker,
+   struct virtio_balloon, shrinker);
+
+   pages_to_free = sc->nr_to_scan * VIRTIO_BALLOON_PAGES_PER_PAGE;
+
+   /*
+* One invocation of leak_balloon can deflate at most
+* VIRTIO_BALLOON_ARRAY_PFNS_MAX balloon pages, so we call it
+* multiple times to deflate pages till reaching pages_to_free.
+*/
+   while 

[PATCH v4 2/3] virtio-balloon: kzalloc the vb struct

2018-08-16 Thread Wei Wang
Zero all the vb fields at allocation, so that we don't need to
zero-initialize each field one by one later.

Signed-off-by: Wei Wang 
Cc: Michael S. Tsirkin 
Cc: Tetsuo Handa 
---
 drivers/virtio/virtio_balloon.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 8100e77..d97d73c 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -561,7 +561,7 @@ static int virtballoon_probe(struct virtio_device *vdev)
return -EINVAL;
}
 
-   vdev->priv = vb = kmalloc(sizeof(*vb), GFP_KERNEL);
+   vdev->priv = vb = kzalloc(sizeof(*vb), GFP_KERNEL);
if (!vb) {
err = -ENOMEM;
goto out;
@@ -570,8 +570,6 @@ static int virtballoon_probe(struct virtio_device *vdev)
INIT_WORK(>update_balloon_stats_work, update_balloon_stats_func);
INIT_WORK(>update_balloon_size_work, update_balloon_size_func);
spin_lock_init(>stop_update_lock);
-   vb->stop_update = false;
-   vb->num_pages = 0;
mutex_init(>balloon_lock);
init_waitqueue_head(>acked);
vb->vdev = vdev;
@@ -602,7 +600,6 @@ static int virtballoon_probe(struct virtio_device *vdev)
err = PTR_ERR(vb->vb_dev_info.inode);
kern_unmount(balloon_mnt);
unregister_oom_notifier(>nb);
-   vb->vb_dev_info.inode = NULL;
goto out_del_vqs;
}
vb->vb_dev_info.inode->i_mapping->a_ops = _aops;
-- 
2.7.4

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH v4 1/3] virtio-balloon: remove BUG() in init_vqs

2018-08-16 Thread Wei Wang
It's a bit overkill to use BUG when failing to add an entry to the
stats_vq in init_vqs. So remove it and just return the error to the
caller to bail out nicely.

Signed-off-by: Wei Wang 
Cc: Michael S. Tsirkin 
---
 drivers/virtio/virtio_balloon.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 3988c09..8100e77 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -455,9 +455,13 @@ static int init_vqs(struct virtio_balloon *vb)
num_stats = update_balloon_stats(vb);
 
sg_init_one(, vb->stats, sizeof(vb->stats[0]) * num_stats);
-   if (virtqueue_add_outbuf(vb->stats_vq, , 1, vb, GFP_KERNEL)
-   < 0)
-   BUG();
+   err = virtqueue_add_outbuf(vb->stats_vq, , 1, vb,
+  GFP_KERNEL);
+   if (err) {
+   dev_warn(>vdev->dev, "%s: add stat_vq failed\n",
+__func__);
+   return err;
+   }
virtqueue_kick(vb->stats_vq);
}
return 0;
-- 
2.7.4

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH v4 0/3] virtio-balloon: some improvements

2018-08-16 Thread Wei Wang
This series is split from the "Virtio-balloon: support free page
reporting" series to make some improvements.

ChangeLog:
v3->v4:
- use kzalloc to allocate the vb struct so that we don't need to zero
  initialize each field one by one later;
- also remove vb->shrinker.batch = 0, which is not needed now.
v2->v3:
- shrink the balloon pages according to the amount requested by the
  reclaimer, instead of using a user-specified number;
v1->v2:
- register the shrinker when VIRTIO_BALLOON_F_DEFLATE_ON_OOM is
  negotiated.

Wei Wang (3):
  virtio-balloon: remove BUG() in init_vqs
  virtio-balloon: kzalloc the vb struct
  virtio_balloon: replace oom notifier with shrinker

 drivers/virtio/virtio_balloon.c | 125 +---
 1 file changed, 67 insertions(+), 58 deletions(-)

-- 
2.7.4

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v3 2/2] virtio_balloon: replace oom notifier with shrinker

2018-08-06 Thread Wei Wang

On 08/03/2018 08:11 PM, Tetsuo Handa wrote:

On 2018/08/03 17:32, Wei Wang wrote:

+static int virtio_balloon_register_shrinker(struct virtio_balloon *vb)
+{
+   vb->shrinker.scan_objects = virtio_balloon_shrinker_scan;
+   vb->shrinker.count_objects = virtio_balloon_shrinker_count;
+   vb->shrinker.batch = 0;
+   vb->shrinker.seeks = DEFAULT_SEEKS;

Why isn't the flags field set? If vb is allocated by kmalloc(GFP_KERNEL)
and is nowhere zero-cleared, KASAN would complain about it.


Could you point to where in the code that complaint would come from?
I only see two shrinker flags (NUMA_AWARE and MEMCG_AWARE), and they
don't seem related to that.



Best,
Wei
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v3 2/2] virtio_balloon: replace oom notifier with shrinker

2018-08-05 Thread Wei Wang

On 08/04/2018 03:15 AM, Michael S. Tsirkin wrote:

On Fri, Aug 03, 2018 at 04:32:26PM +0800, Wei Wang wrote:

The OOM notifier is being deprecated for the following reasons:
- As a callout from the oom context, it is too subtle and easily
   generates bugs and corner cases which are hard to track;
- It is called too late (after the reclaiming has been performed).
   Drivers with a large amount of reclaimable memory are expected to
   release it at an early stage of memory pressure;
- The notifier callback isn't aware of the oom constraints;
Link: https://lkml.org/lkml/2018/7/12/314

This patch replaces the virtio-balloon oom notifier with a shrinker
to release balloon pages on memory pressure. The balloon pages are
given back to mm adaptively by returning the number of pages that the
reclaimer is asking for (i.e. sc->nr_to_scan).

Currently the max possible value of sc->nr_to_scan passed to the balloon
shrinker is SHRINK_BATCH, which is 128. This is smaller than the
limitation that only VIRTIO_BALLOON_ARRAY_PFNS_MAX (256) pages can be
returned via one invocation of leak_balloon. But this patch still
considers the case that SHRINK_BATCH or shrinker->batch could be changed
to a value larger than VIRTIO_BALLOON_ARRAY_PFNS_MAX, which will need to
do multiple invocations of leak_balloon.

Historically, the feature VIRTIO_BALLOON_F_DEFLATE_ON_OOM has been used
to release balloon pages on OOM. We continue to use this feature bit for
the shrinker, so the shrinker is only registered when this feature bit
has been negotiated with host.

Signed-off-by: Wei Wang 
Cc: Michael S. Tsirkin 
Cc: Michal Hocko 
Cc: Andrew Morton 


Could you add data at how was this tested and how did guest
behaviour change. Which configurations see an improvement?



Yes. Please see the differences between the "*1" and "*2" cases below.

Taking this chance, I use "*2" and "*3" to show Michal and others the
differences between applying and not applying the shrinker fix patch here:
https://lkml.org/lkml/2018/8/3/384



*1. V3 patches
1)After inflating some amount of memory, actual=101536 Bytes
free -m
              total        used        free      shared  buff/cache   available
Mem:           7975        7289         514          10         171         447
Swap:         10236           0       10236

2) dd if=478MB_file of=/dev/null, actual=1058721792 Bytes
free -m
              total        used        free      shared  buff/cache   available
Mem:           7975        7233         102          10         639         475
Swap:         10236           0       10236

The advantage is that the inflated pages are given back to mm based on
the number of pages the reclaimer is asking for, i.e. ~56MB (the difference
of "actual" above). This is more adaptive.




*2. V2 patches, balloon_pages_to_shrink=100 pages (around 4GB), with
the shrinker fix patches applied.

1)After inflating some amount of memory, actual=101536 Bytes
free -m
              total        used        free      shared  buff/cache   available
Mem:           7975        7288         530          10         157         455
Swap:         10236           0       10236

2)dd if=478MB_file of=/dev/null, actual=5096001536 Bytes
free -m
              total        used        free      shared  buff/cache   available
Mem:           7975        3381        3953          10         640        4327
Swap:         10236           0       10236

In the above example, we set 4GB to shrink to make the difference
obvious. Though the reclaimer only needs to reclaim ~56MB of memory, 4GB of
inflated pages are given back to mm, which is unnecessary. From the user's
perspective, there is no way to know how many pages should be given back at
the time of setting the module parameter (balloon_pages_to_shrink). So I
think the above "*1" is better.




*3. V2 patches, balloon_pages_to_shrink=100 pages (around 4GB),
without the shrinker fix patches applied.

1) After inflating some amount of memory, actual=101536 Bytes
free -m
              total        used        free      shared  buff/cache   available
Mem:           7975        7292         524          10         158         450
Swap:         10236           0       10236

2) dd if=478MB_file of=/dev/null, actual=8589934592 Bytes
free -m
              total        used        free      shared  buff/cache   available
Mem:           7975          53        7281          10         640        7656
Swap:         10236           0       10236

Compared to *2, all the balloon pages are shrunk, while users expect only
4GB to be shrunk. The reason is that do_shrink_slab has a mistake in
calculating shrinkctl->nr_scanned, which should be the actual number of
pages that the shrinker has freed, but do_shrink_slab still treats that
value as 128 (while 4GB has actually been freed).



Best,
Wei
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH v3 2/2] virtio_balloon: replace oom notifier with shrinker

2018-08-03 Thread Wei Wang
The OOM notifier is being deprecated for the following reasons:
- As a callout from the oom context, it is too subtle and easily
  generates bugs and corner cases which are hard to track;
- It is called too late (after the reclaiming has been performed).
  Drivers with a large amount of reclaimable memory are expected to
  release it at an early stage of memory pressure;
- The notifier callback isn't aware of the oom constraints;
Link: https://lkml.org/lkml/2018/7/12/314

This patch replaces the virtio-balloon oom notifier with a shrinker
to release balloon pages on memory pressure. The balloon pages are
given back to mm adaptively by returning the number of pages that the
reclaimer is asking for (i.e. sc->nr_to_scan).

Currently the max possible value of sc->nr_to_scan passed to the balloon
shrinker is SHRINK_BATCH, which is 128. This is smaller than the
limitation that only VIRTIO_BALLOON_ARRAY_PFNS_MAX (256) pages can be
returned via one invocation of leak_balloon. But this patch still
considers the case that SHRINK_BATCH or shrinker->batch could be changed
to a value larger than VIRTIO_BALLOON_ARRAY_PFNS_MAX, which will need to
do multiple invocations of leak_balloon.

Historically, the feature VIRTIO_BALLOON_F_DEFLATE_ON_OOM has been used
to release balloon pages on OOM. We continue to use this feature bit for
the shrinker, so the shrinker is only registered when this feature bit
has been negotiated with host.

Signed-off-by: Wei Wang 
Cc: Michael S. Tsirkin 
Cc: Michal Hocko 
Cc: Andrew Morton 
---
 drivers/virtio/virtio_balloon.c | 111 ++--
 1 file changed, 60 insertions(+), 51 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 8100e77..612a359 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -27,7 +27,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -40,13 +39,8 @@
  */
 #define VIRTIO_BALLOON_PAGES_PER_PAGE (unsigned)(PAGE_SIZE >> VIRTIO_BALLOON_PFN_SHIFT)
 #define VIRTIO_BALLOON_ARRAY_PFNS_MAX 256
-#define OOM_VBALLOON_DEFAULT_PAGES 256
 #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
 
-static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
-module_param(oom_pages, int, S_IRUSR | S_IWUSR);
-MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
-
 #ifdef CONFIG_BALLOON_COMPACTION
 static struct vfsmount *balloon_mnt;
 #endif
@@ -86,8 +80,8 @@ struct virtio_balloon {
/* Memory statistics */
struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
 
-   /* To register callback in oom notifier call chain */
-   struct notifier_block nb;
+   /* To register a shrinker to shrink memory upon memory pressure */
+   struct shrinker shrinker;
 };
 
 static struct virtio_device_id id_table[] = {
@@ -365,38 +359,6 @@ static void update_balloon_size(struct virtio_balloon *vb)
  );
 }
 
-/*
- * virtballoon_oom_notify - release pages when system is under severe
- * memory pressure (called from out_of_memory())
- * @self : notifier block struct
- * @dummy: not used
- * @parm : returned - number of freed pages
- *
- * The balancing of memory by use of the virtio balloon should not cause
- * the termination of processes while there are pages in the balloon.
- * If virtio balloon manages to release some memory, it will make the
- * system return and retry the allocation that forced the OOM killer
- * to run.
- */
-static int virtballoon_oom_notify(struct notifier_block *self,
- unsigned long dummy, void *parm)
-{
-   struct virtio_balloon *vb;
-   unsigned long *freed;
-   unsigned num_freed_pages;
-
-   vb = container_of(self, struct virtio_balloon, nb);
-   if (!virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
-   return NOTIFY_OK;
-
-   freed = parm;
-   num_freed_pages = leak_balloon(vb, oom_pages);
-   update_balloon_size(vb);
-   *freed += num_freed_pages;
-
-   return NOTIFY_OK;
-}
-
 static void update_balloon_stats_func(struct work_struct *work)
 {
struct virtio_balloon *vb;
@@ -550,6 +512,53 @@ static struct file_system_type balloon_fs = {
 
 #endif /* CONFIG_BALLOON_COMPACTION */
 
+static unsigned long virtio_balloon_shrinker_scan(struct shrinker *shrinker,
+ struct shrink_control *sc)
+{
+   unsigned long pages_to_free, pages_freed = 0;
+   struct virtio_balloon *vb = container_of(shrinker,
+   struct virtio_balloon, shrinker);
+
+   pages_to_free = sc->nr_to_scan * VIRTIO_BALLOON_PAGES_PER_PAGE;
+
+   /*
+* One invocation of leak_balloon can deflate at most
+* VIRTIO_BALLOON_ARRAY_PFNS_MAX balloon pages, so we call it
+* multiple times to deflate pages till reaching pages_to_free.
+*/
+   while 

[PATCH v3 0/2] virtio-balloon: some improvements

2018-08-03 Thread Wei Wang
This series is split from the "Virtio-balloon: support free page
reporting" series to make some improvements.

ChangeLog:
v2->v3:
- shrink the balloon pages according to the amount requested by the
  reclaimer, instead of using a user-specified number;
v1->v2:
- register the shrinker when VIRTIO_BALLOON_F_DEFLATE_ON_OOM is
  negotiated.

Wei Wang (2):
  virtio-balloon: remove BUG() in init_vqs
  virtio_balloon: replace oom notifier with shrinker

 drivers/virtio/virtio_balloon.c | 121 ++--
 1 file changed, 67 insertions(+), 54 deletions(-)

-- 
2.7.4

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v2 2/2] virtio_balloon: replace oom notifier with shrinker

2018-08-02 Thread Wei Wang

On 08/02/2018 07:47 PM, Michal Hocko wrote:

On Thu 02-08-18 18:32:44, Wei Wang wrote:

On 08/01/2018 07:34 PM, Michal Hocko wrote:

On Wed 01-08-18 19:12:25, Wei Wang wrote:

On 07/30/2018 05:00 PM, Michal Hocko wrote:

On Fri 27-07-18 17:24:55, Wei Wang wrote:

The OOM notifier is getting deprecated to use for the reasons mentioned
here by Michal Hocko: https://lkml.org/lkml/2018/7/12/314

This patch replaces the virtio-balloon oom notifier with a shrinker
to release balloon pages on memory pressure.

It would be great to document the replacement. This is not a small
change...

OK. I plan to document the following to the commit log:

The OOM notifier is being deprecated for the following reasons:
  - As a callout from the oom context, it is too subtle and easily
generates bugs and corner cases which are hard to track;
  - It is called too late (after the reclaiming has been performed).
Drivers with a large amount of reclaimable memory are expected to
release it at an early stage of memory pressure;
  - The notifier callback isn't aware of the oom constraints;
  Link: https://lkml.org/lkml/2018/7/12/314

  This patch replaces the virtio-balloon oom notifier with a shrinker
  to release balloon pages on memory pressure. Users can set the amount of
  memory pages to release each time a shrinker_scan is called via the
  module parameter balloon_pages_to_shrink, and the default amount is 256
  pages. Historically, the feature VIRTIO_BALLOON_F_DEFLATE_ON_OOM has
  been used to release balloon pages on OOM. We continue to use this
  feature bit for the shrinker, so the shrinker is only registered when
  this feature bit has been negotiated with host.

Do you have any numbers for how does this work in practice?

It works in this way: for example, we can set the parameter,
balloon_pages_to_shrink, to shrink 1GB memory once shrink scan is called.
Now, we have a 8GB guest, and we balloon out 7GB. When shrink scan is
called, the balloon driver will get back 1GB memory and give them back to
mm, then the ballooned memory becomes 6GB.

When the shrinker scan is called the second time, another 1GB will be given
back to mm. So the ballooned pages are given back to mm gradually.


Let's say
you have a medium page cache workload which triggers kswapd to do a
light reclaim? Hardcoded shrinking sounds quite dubious to me but I have
no idea how people expect this to work. Shouldn't this be more
adaptive? How precious are those pages anyway?

Those pages are given to host to use usually because the guest has enough
free memory, and host doesn't want to waste those pieces of memory as they
are not used by this guest. When the guest needs them, it is reasonable that
the guest has higher priority to take them back.
But I'm not sure if there would be a more adaptive approach than "gradually
giving back as the guest wants more".

I am not sure I follow. Let me be more specific. Say you have a trivial
stream IO triggering reclaim to recycle clean page cache. This will
invoke slab shrinkers as well. Do you really want to drop your batch of
pages on each invocation? Doesn't that remove them very quickly? Just
try to dd if=large_file of=/dev/null and see how your pages are
disappearing. Shrinkers usually scale the number of objects they are
going to reclaim based on the memory pressure (aka targer to be
reclaimed).


Thanks, I think it looks better to make it more adaptive. I'll send out 
a new version for review.


Best,
Wei


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v2 2/2] virtio_balloon: replace oom notifier with shrinker

2018-08-02 Thread Wei Wang

On 08/02/2018 07:00 PM, Tetsuo Handa wrote:

On 2018/08/02 19:32, Wei Wang wrote:

On 08/01/2018 07:34 PM, Michal Hocko wrote:

Do you have any numbers for how does this work in practice?

It works in this way: for example, we can set the parameter, 
balloon_pages_to_shrink,
to shrink 1GB memory once shrink scan is called. Now, we have a 8GB guest, and 
we balloon
out 7GB. When shrink scan is called, the balloon driver will get back 1GB 
memory and give
them back to mm, then the ballooned memory becomes 6GB.

Since shrinker might be called concurrently (am I correct?),


Not sure about it being called concurrently, but I think it would be called
repeatedly as long as should_continue_reclaim() returns true.



Best,
Wei
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v2 2/2] virtio_balloon: replace oom notifier with shrinker

2018-08-02 Thread Wei Wang

On 08/01/2018 07:34 PM, Michal Hocko wrote:

On Wed 01-08-18 19:12:25, Wei Wang wrote:

On 07/30/2018 05:00 PM, Michal Hocko wrote:

On Fri 27-07-18 17:24:55, Wei Wang wrote:

The OOM notifier is getting deprecated to use for the reasons mentioned
here by Michal Hocko: https://lkml.org/lkml/2018/7/12/314

This patch replaces the virtio-balloon oom notifier with a shrinker
to release balloon pages on memory pressure.

It would be great to document the replacement. This is not a small
change...

OK. I plan to document the following to the commit log:

   The OOM notifier is being deprecated for the following reasons:
 - As a callout from the oom context, it is too subtle and easily
   generates bugs and corner cases which are hard to track;
 - It is called too late (after the reclaiming has been performed).
   Drivers with a large amount of reclaimable memory are expected to
   release it at an early stage of memory pressure;
 - The notifier callback isn't aware of the oom constraints;
 Link: https://lkml.org/lkml/2018/7/12/314

 This patch replaces the virtio-balloon oom notifier with a shrinker
 to release balloon pages on memory pressure. Users can set the amount of
 memory pages to release each time a shrinker_scan is called via the
 module parameter balloon_pages_to_shrink, and the default amount is 256
 pages. Historically, the feature VIRTIO_BALLOON_F_DEFLATE_ON_OOM has
 been used to release balloon pages on OOM. We continue to use this
 feature bit for the shrinker, so the shrinker is only registered when
 this feature bit has been negotiated with host.

Do you have any numbers for how does this work in practice?


It works in this way: for example, we can set the parameter, 
balloon_pages_to_shrink, to shrink 1GB memory once shrink scan is 
called. Now, we have a 8GB guest, and we balloon out 7GB. When shrink 
scan is called, the balloon driver will get back 1GB memory and give 
them back to mm, then the ballooned memory becomes 6GB.


When the shrinker scan is called the second time, another 1GB will be 
given back to mm. So the ballooned pages are given back to mm gradually.



Let's say
you have a medium page cache workload which triggers kswapd to do a
light reclaim? Hardcoded shrinking sounds quite dubious to me but I have
no idea how people expect this to work. Shouldn't this be more
adaptive? How precious are those pages anyway?


Those pages are given to host to use usually because the guest has 
enough free memory, and host doesn't want to waste those pieces of 
memory as they are not used by this guest. When the guest needs them, it 
is reasonable that the guest has higher priority to take them back.
But I'm not sure if there would be a more adaptive approach than 
"gradually giving back as the guest wants more".


Best,
Wei
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v2 2/2] virtio_balloon: replace oom notifier with shrinker

2018-08-01 Thread Wei Wang

On 07/30/2018 05:00 PM, Michal Hocko wrote:

On Fri 27-07-18 17:24:55, Wei Wang wrote:

The OOM notifier is getting deprecated to use for the reasons mentioned
here by Michal Hocko: https://lkml.org/lkml/2018/7/12/314

This patch replaces the virtio-balloon oom notifier with a shrinker
to release balloon pages on memory pressure.

It would be great to document the replacement. This is not a small
change...


OK. I plan to document the following to the commit log:

  The OOM notifier is being deprecated for the following reasons:
- As a callout from the oom context, it is too subtle and easily
  generates bugs and corner cases which are hard to track;
- It is called too late (after the reclaiming has been performed).
  Drivers with a large amount of reclaimable memory are expected to
  release it at an early stage of memory pressure;
- The notifier callback isn't aware of the oom constraints;
Link: https://lkml.org/lkml/2018/7/12/314

This patch replaces the virtio-balloon oom notifier with a shrinker
to release balloon pages on memory pressure. Users can set the 
amount of

memory pages to release each time a shrinker_scan is called via the
module parameter balloon_pages_to_shrink, and the default amount is 256
pages. Historically, the feature VIRTIO_BALLOON_F_DEFLATE_ON_OOM has
been used to release balloon pages on OOM. We continue to use this
feature bit for the shrinker, so the shrinker is only registered when
this feature bit has been negotiated with host.

In addition, the bug in the replaced virtballoon_oom_notify that only
VIRTIO_BALLOON_ARRAY_PFNS_MAX (i.e 256) balloon pages can be freed
though the user has specified more than that number is fixed in the
shrinker_scan function.


Best,
Wei
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH v2 2/2] virtio_balloon: replace oom notifier with shrinker

2018-07-27 Thread Wei Wang
The OOM notifier is being deprecated for the reasons mentioned
here by Michal Hocko: https://lkml.org/lkml/2018/7/12/314

This patch replaces the virtio-balloon oom notifier with a shrinker
to release balloon pages on memory pressure.

In addition, the shrinker_scan function fixes a bug in the replaced
virtballoon_oom_notify: only VIRTIO_BALLOON_ARRAY_PFNS_MAX (i.e. 256)
balloon pages could be freed even though the user had specified more
than that number.

Signed-off-by: Wei Wang 
Cc: Michael S. Tsirkin 
Cc: Michal Hocko 
Cc: Andrew Morton 
---
 drivers/virtio/virtio_balloon.c | 115 +++-
 1 file changed, 65 insertions(+), 50 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 9356a1a..6b2229b 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -27,7 +27,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -40,12 +39,12 @@
  */
 #define VIRTIO_BALLOON_PAGES_PER_PAGE (unsigned)(PAGE_SIZE >> VIRTIO_BALLOON_PFN_SHIFT)
 #define VIRTIO_BALLOON_ARRAY_PFNS_MAX 256
-#define OOM_VBALLOON_DEFAULT_PAGES 256
+#define DEFAULT_BALLOON_PAGES_TO_SHRINK 256
 #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
 
-static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
-module_param(oom_pages, int, S_IRUSR | S_IWUSR);
-MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
+static unsigned long balloon_pages_to_shrink = DEFAULT_BALLOON_PAGES_TO_SHRINK;
+module_param(balloon_pages_to_shrink, ulong, 0600);
+MODULE_PARM_DESC(balloon_pages_to_shrink, "pages to free on memory presure");
 
 #ifdef CONFIG_BALLOON_COMPACTION
 static struct vfsmount *balloon_mnt;
@@ -86,8 +85,8 @@ struct virtio_balloon {
/* Memory statistics */
struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
 
-   /* To register callback in oom notifier call chain */
-   struct notifier_block nb;
+   /* To register a shrinker to shrink memory upon memory pressure */
+   struct shrinker shrinker;
 };
 
 static struct virtio_device_id id_table[] = {
@@ -365,38 +364,6 @@ static void update_balloon_size(struct virtio_balloon *vb)
  );
 }
 
-/*
- * virtballoon_oom_notify - release pages when system is under severe
- * memory pressure (called from out_of_memory())
- * @self : notifier block struct
- * @dummy: not used
- * @parm : returned - number of freed pages
- *
- * The balancing of memory by use of the virtio balloon should not cause
- * the termination of processes while there are pages in the balloon.
- * If virtio balloon manages to release some memory, it will make the
- * system return and retry the allocation that forced the OOM killer
- * to run.
- */
-static int virtballoon_oom_notify(struct notifier_block *self,
- unsigned long dummy, void *parm)
-{
-   struct virtio_balloon *vb;
-   unsigned long *freed;
-   unsigned num_freed_pages;
-
-   vb = container_of(self, struct virtio_balloon, nb);
-   if (!virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
-   return NOTIFY_OK;
-
-   freed = parm;
-   num_freed_pages = leak_balloon(vb, oom_pages);
-   update_balloon_size(vb);
-   *freed += num_freed_pages;
-
-   return NOTIFY_OK;
-}
-
 static void update_balloon_stats_func(struct work_struct *work)
 {
struct virtio_balloon *vb;
@@ -548,6 +515,54 @@ static struct file_system_type balloon_fs = {
 
 #endif /* CONFIG_BALLOON_COMPACTION */
 
+static unsigned long virtio_balloon_shrinker_scan(struct shrinker *shrinker,
+ struct shrink_control *sc)
+{
+   unsigned long pages_to_free = balloon_pages_to_shrink,
+ pages_freed = 0;
+   struct virtio_balloon *vb = container_of(shrinker,
+   struct virtio_balloon, shrinker);
+
+   /*
+* One invocation of leak_balloon can deflate at most
+* VIRTIO_BALLOON_ARRAY_PFNS_MAX balloon pages, so we call it
+* multiple times to deflate pages till reaching
+* balloon_pages_to_shrink pages.
+*/
+   while (vb->num_pages && pages_to_free) {
+   pages_to_free = balloon_pages_to_shrink - pages_freed;
+   pages_freed += leak_balloon(vb, pages_to_free);
+   }
+   update_balloon_size(vb);
+
+   return pages_freed / VIRTIO_BALLOON_PAGES_PER_PAGE;
+}
+
+static unsigned long virtio_balloon_shrinker_count(struct shrinker *shrinker,
+  struct shrink_control *sc)
+{
+   struct virtio_balloon *vb = container_of(shrinker,
+   struct virtio_balloon, shrinker);
+
+   return min_t(unsigned long, vb->num_pages, balloon_pages_to_shrink) /
+  VIRTIO_BALLOON_PAGES_PER_PAGE;
+}
+
+static voi

[PATCH v2 1/2] virtio-balloon: remove BUG() in init_vqs

2018-07-27 Thread Wei Wang
It's a bit overkill to use BUG when failing to add an entry to the
stats_vq in init_vqs. So remove it and just return the error to the
caller to bail out nicely.

Signed-off-by: Wei Wang 
Cc: Michael S. Tsirkin 
---
 drivers/virtio/virtio_balloon.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 6b237e3..9356a1a 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -455,9 +455,13 @@ static int init_vqs(struct virtio_balloon *vb)
num_stats = update_balloon_stats(vb);
 
sg_init_one(, vb->stats, sizeof(vb->stats[0]) * num_stats);
-   if (virtqueue_add_outbuf(vb->stats_vq, , 1, vb, GFP_KERNEL)
-   < 0)
-   BUG();
+   err = virtqueue_add_outbuf(vb->stats_vq, , 1, vb,
+  GFP_KERNEL);
+   if (err) {
+   dev_warn(>vdev->dev, "%s: add stat_vq failed\n",
+__func__);
+   return err;
+   }
virtqueue_kick(vb->stats_vq);
}
return 0;
-- 
2.7.4

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH v2 0/2] virtio-balloon: some improvements

2018-07-27 Thread Wei Wang
This series is split from the "Virtio-balloon: support free page
reporting" series to make some improvements.

v1->v2 ChangeLog:
- register the shrinker when VIRTIO_BALLOON_F_DEFLATE_ON_OOM is negotiated.

Wei Wang (2):
  virtio-balloon: remove BUG() in init_vqs
  virtio_balloon: replace oom notifier with shrinker

 drivers/virtio/virtio_balloon.c | 125 +++-
 1 file changed, 72 insertions(+), 53 deletions(-)

-- 
2.7.4

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v36 0/5] Virtio-balloon: support free page reporting

2018-07-24 Thread Wei Wang

On 07/23/2018 10:36 PM, Dr. David Alan Gilbert wrote:

* Michael S. Tsirkin (m...@redhat.com) wrote:

On Fri, Jul 20, 2018 at 04:33:00PM +0800, Wei Wang wrote:

This patch series is separated from the previous "Virtio-balloon
Enhancement" series. The new feature, VIRTIO_BALLOON_F_FREE_PAGE_HINT,
implemented by this series enables the virtio-balloon driver to report
hints of guest free pages to the host. It can be used to accelerate live
migration of VMs. Here is an introduction of this usage:

Live migration needs to transfer the VM's memory from the source machine
to the destination round by round. For the 1st round, all the VM's memory
is transferred. From the 2nd round, only the pieces of memory that were
written by the guest (after the 1st round) are transferred. One method
that is popularly used by the hypervisor to track which part of memory is
written is to write-protect all the guest memory.

This feature enables the optimization by skipping the transfer of guest
free pages during VM live migration. It is not concerned that the memory
pages are used after they are given to the hypervisor as a hint of the
free pages, because they will be tracked by the hypervisor and transferred
in the subsequent round if they are used and written.

* Tests
- Test Environment
 Host: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
 Guest: 8G RAM, 4 vCPU
 Migration setup: migrate_set_speed 100G, migrate_set_downtime 2 second

- Test Results
 - Idle Guest Live Migration Time (results are averaged over 10 runs):
 - Optimization v.s. Legacy = 409ms vs 1757ms --> ~77% reduction
(setting the page poisoning value to zero and enabling KSM don't affect
  the comparison result)
 - Guest with Linux Compilation Workload (make bzImage -j4):
 - Live Migration Time (average)
   Optimization v.s. Legacy = 1407ms v.s. 2528ms --> ~44% reduction
 - Linux Compilation Time
   Optimization v.s. Legacy = 5min4s v.s. 5min12s
   --> no obvious difference

I'd like to see dgilbert's take on whether this kind of gain
justifies adding a PV interface, and what kind of guest workload
is appropriate.

Cc'd.

Well, 44% is great ... although the measurement is a bit weird.

a) A 2 second downtime is very large; 300-500ms is more normal


No problem, I will set downtime to 400ms for the tests.


b) I'm not sure what the 'average' is  - is that just between a bunch of
repeated migrations?


Yes, we just repeat the tests ("source<->destination" migration)
and average the results.




c) What load was running in the guest during the live migration?


The first one above just uses a guest without running any specific 
workload (referred to as an idle guest).

The second one uses a guest with the Linux compilation workload running.



An interesting measurement to add would be to do the same test but
with a VM with a lot more RAM but the same load;  you'd hope the gain
would be even better.
It would be interesting, especially because the users who are interested
are people creating VMs allocated with lots of extra memory (for the
worst case) but most of the time migrating when it's fairly idle.


OK. I will add tests of a guest with larger memory.

Best,
Wei
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v36 2/5] virtio_balloon: replace oom notifier with shrinker

2018-07-23 Thread Wei Wang

On 07/23/2018 10:13 PM, Michael S. Tsirkin wrote:

vb->vb_dev_info.inode->i_mapping->a_ops = &balloon_aops;
   #endif
+   err = virtio_balloon_register_shrinker(vb);
+   if (err)
+   goto out_del_vqs;
So we can get scans before device is ready. Leak will fail
then. Why not register later after device is ready?

Probably not.

- it would be better not to set device ready when register_shrinker failed.

That's very rare so I won't be too worried.


Just a little confused with the point here: "very rare" means it could 
still happen (even if it's a corner case), and if that happens, we get 
something wrong functionally. So it would be a bug if we changed it like 
that, right?


I still couldn't understand the reason for registering the shrinker after 
device_ready (the original oom notifier was registered before setting 
device ready, too).
(I think the driver won't get shrinker_scan called if the device isn't 
ready, for the reasons below.)



- When the device isn't ready, ballooning won't happen, that is,
vb->num_pages will be 0, which results in shrinker_count=0 and shrinker_scan
won't be called.


Best,
Wei
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v36 2/5] virtio_balloon: replace oom notifier with shrinker

2018-07-23 Thread Wei Wang

On 07/22/2018 10:48 PM, Michael S. Tsirkin wrote:

On Fri, Jul 20, 2018 at 04:33:02PM +0800, Wei Wang wrote:
  
+static unsigned long virtio_balloon_shrinker_scan(struct shrinker *shrinker,

+ struct shrink_control *sc)
+{
+   unsigned long pages_to_free = balloon_pages_to_shrink,
+ pages_freed = 0;
+   struct virtio_balloon *vb = container_of(shrinker,
+   struct virtio_balloon, shrinker);
+
+   /*
+* One invocation of leak_balloon can deflate at most
+* VIRTIO_BALLOON_ARRAY_PFNS_MAX balloon pages, so we call it
+* multiple times to deflate pages till reaching
+* balloon_pages_to_shrink pages.
+*/
+   while (vb->num_pages && pages_to_free) {
+   pages_to_free = balloon_pages_to_shrink - pages_freed;
+   pages_freed += leak_balloon(vb, pages_to_free);
+   }
+   update_balloon_size(vb);

Are you sure that this is never called if count returned 0?


Yes. Please see do_shrink_slab, it just returns if count is 0.
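For reference, the relevant check looks roughly like this (an abridged sketch 
of mm/vmscan.c:do_shrink_slab() around that kernel version, not a verbatim 
copy):

/*
 * Abridged sketch of do_shrink_slab(): when ->count_objects() returns 0,
 * the function bails out before ->scan_objects() is ever called.
 */
static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
				    struct shrinker *shrinker, int priority)
{
	unsigned long freeable = shrinker->count_objects(shrinker, shrinkctl);

	if (freeable == 0)
		return 0;	/* no scan for this shrinker */

	/* ... otherwise the real function computes nr_to_scan and
	 * repeatedly calls shrinker->scan_objects() ... */
	return 0;
}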




+
+   return pages_freed / VIRTIO_BALLOON_PAGES_PER_PAGE;
+}
+
+static unsigned long virtio_balloon_shrinker_count(struct shrinker *shrinker,
+  struct shrink_control *sc)
+{
+   struct virtio_balloon *vb = container_of(shrinker,
+   struct virtio_balloon, shrinker);
+
+   /*
+* We continue to use VIRTIO_BALLOON_F_DEFLATE_ON_OOM to handle the
+* case when shrinker needs to be invoked to relieve memory pressure.
+*/
+   if (!virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
+   return 0;

So why not skip notifier registration when deflate on oom
is clear?


Sounds good, thanks.
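For concreteness, the change agreed on here boils down to something like the 
following in the probe path (a sketch only; the helper name matches the one 
used elsewhere in this thread, the surrounding error label is assumed):

	/*
	 * Sketch: only register the shrinker when the device negotiated
	 * VIRTIO_BALLOON_F_DEFLATE_ON_OOM, since that is the only case in
	 * which the balloon should be deflated under memory pressure.
	 */
	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM)) {
		err = virtio_balloon_register_shrinker(vb);
		if (err)
			goto out_del_vqs;
	}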



vb->vb_dev_info.inode->i_mapping->a_ops = &balloon_aops;
  #endif
+   err = virtio_balloon_register_shrinker(vb);
+   if (err)
+   goto out_del_vqs;
  
So we can get scans before device is ready. Leak will fail

then. Why not register later after device is ready?


Probably not.

- it would be better not to set device ready when register_shrinker failed.
- When the device isn't ready, ballooning won't happen, that is, 
vb->num_pages will be 0, which results in shrinker_count=0 and 
shrinker_scan won't be called.


So I think it would be better to have shrinker registered before 
device_ready.


Best,
Wei
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH v36 3/5] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

2018-07-20 Thread Wei Wang
Negotiation of the VIRTIO_BALLOON_F_FREE_PAGE_HINT feature indicates the
support of reporting hints of guest free pages to host via virtio-balloon.
Currently, only free page blocks of MAX_ORDER - 1 are reported. They are
obtained one by one from the mm free list via the regular allocation
function. The allocated pages are given back to mm after they are put onto
the vq.

Host requests the guest to report free page hints by sending a new cmd id
to the guest via the free_page_report_cmd_id configuration register. When
the guest starts to report, it first sends a start cmd to host via the
free page vq, which acks to host the cmd id received. When the guest
finishes reporting free pages, a stop cmd is sent to host via the vq.

TODO:
- Add a batch page allocation API to amortize the allocation overhead.

Signed-off-by: Wei Wang 
Signed-off-by: Liang Li 
Cc: Michael S. Tsirkin 
Cc: Michal Hocko 
Cc: Andrew Morton 
Cc: Linus Torvalds 
---
 drivers/virtio/virtio_balloon.c | 331 +---
 include/uapi/linux/virtio_balloon.h |   4 +
 2 files changed, 307 insertions(+), 28 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index c6fd406..82cd497 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -42,6 +42,14 @@
 #define DEFAULT_BALLOON_PAGES_TO_SHRINK 256
 #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
 
+#define VIRTIO_BALLOON_FREE_PAGE_ALLOC_FLAG (__GFP_NORETRY | __GFP_NOWARN | \
+__GFP_NOMEMALLOC)
+/* The order of free page blocks to report to host */
+#define VIRTIO_BALLOON_FREE_PAGE_ORDER (MAX_ORDER - 1)
+/* The size of a free page block in bytes */
+#define VIRTIO_BALLOON_FREE_PAGE_SIZE \
+   (1 << (VIRTIO_BALLOON_FREE_PAGE_ORDER + PAGE_SHIFT))
+
 static unsigned long balloon_pages_to_shrink = DEFAULT_BALLOON_PAGES_TO_SHRINK;
 module_param(balloon_pages_to_shrink, ulong, 0600);
MODULE_PARM_DESC(balloon_pages_to_shrink, "pages to free on memory pressure");
@@ -50,9 +58,22 @@ MODULE_PARM_DESC(balloon_pages_to_shrink, "pages to free on 
memory pressure");
 static struct vfsmount *balloon_mnt;
 #endif
 
+enum virtio_balloon_vq {
+   VIRTIO_BALLOON_VQ_INFLATE,
+   VIRTIO_BALLOON_VQ_DEFLATE,
+   VIRTIO_BALLOON_VQ_STATS,
+   VIRTIO_BALLOON_VQ_FREE_PAGE,
+   VIRTIO_BALLOON_VQ_MAX
+};
+
 struct virtio_balloon {
struct virtio_device *vdev;
-   struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
+   struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
+
+   /* Balloon's own wq for cpu-intensive work items */
+   struct workqueue_struct *balloon_wq;
+   /* The free page reporting work item submitted to the balloon wq */
+   struct work_struct report_free_page_work;
 
/* The balloon servicing is delegated to a freezable workqueue. */
struct work_struct update_balloon_stats_work;
@@ -62,6 +83,16 @@ struct virtio_balloon {
spinlock_t stop_update_lock;
bool stop_update;
 
+   /* The list of allocated free pages, waiting to be given back to mm */
+   struct list_head free_page_list;
+   spinlock_t free_page_list_lock;
+   /* The cmd id received from host */
+   u32 cmd_id_received;
+   /* The cmd id that is actively in use */
+   __virtio32 cmd_id_active;
+   /* Buffer to store the stop sign */
+   __virtio32 cmd_id_stop;
+
/* Waiting for host to ack the pages we released. */
wait_queue_head_t acked;
 
@@ -325,17 +356,6 @@ static void stats_handle_request(struct virtio_balloon *vb)
virtqueue_kick(vq);
 }
 
-static void virtballoon_changed(struct virtio_device *vdev)
-{
-   struct virtio_balloon *vb = vdev->priv;
-   unsigned long flags;
-
-   spin_lock_irqsave(&vb->stop_update_lock, flags);
-   if (!vb->stop_update)
-   queue_work(system_freezable_wq, &vb->update_balloon_size_work);
-   spin_unlock_irqrestore(&vb->stop_update_lock, flags);
-}
-
 static inline s64 towards_target(struct virtio_balloon *vb)
 {
s64 target;
@@ -352,6 +372,52 @@ static inline s64 towards_target(struct virtio_balloon *vb)
return target - vb->num_pages;
 }
 
+static void virtballoon_changed(struct virtio_device *vdev)
+{
+   struct virtio_balloon *vb = vdev->priv;
+   unsigned long flags;
+   s64 diff = towards_target(vb);
+
+   if (diff) {
+   spin_lock_irqsave(&vb->stop_update_lock, flags);
+   if (!vb->stop_update)
+   queue_work(system_freezable_wq,
+  &vb->update_balloon_size_work);
+   spin_unlock_irqrestore(&vb->stop_update_lock, flags);
+   }
+
+   if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
+   virtio_cread(vdev, struct virtio_balloon_config,
+free_page_report_cmd_id,

[PATCH v36 5/5] virtio-balloon: VIRTIO_BALLOON_F_PAGE_POISON

2018-07-20 Thread Wei Wang
The VIRTIO_BALLOON_F_PAGE_POISON feature bit is used to indicate if the
guest is using page poisoning. Guest writes to the poison_val config
field to tell host about the page poisoning value that is in use.

Suggested-by: Michael S. Tsirkin 
Signed-off-by: Wei Wang 
Cc: Michael S. Tsirkin 
Cc: Michal Hocko 
Cc: Andrew Morton 
---
 drivers/virtio/virtio_balloon.c | 10 ++
 include/uapi/linux/virtio_balloon.h |  3 +++
 2 files changed, 13 insertions(+)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 82cd497..6340cc1 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -814,6 +814,7 @@ static int virtio_balloon_register_shrinker(struct 
virtio_balloon *vb)
 static int virtballoon_probe(struct virtio_device *vdev)
 {
struct virtio_balloon *vb;
+   __u32 poison_val;
int err;
 
if (!vdev->config->get) {
@@ -883,6 +884,11 @@ static int virtballoon_probe(struct virtio_device *vdev)
  VIRTIO_BALLOON_CMD_ID_STOP);
spin_lock_init(>free_page_list_lock);
INIT_LIST_HEAD(>free_page_list);
+   if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_PAGE_POISON)) {
+   memset(&poison_val, PAGE_POISON, sizeof(poison_val));
+   virtio_cwrite(vb->vdev, struct virtio_balloon_config,
+ poison_val, &poison_val);
+   }
}
 
err = virtio_balloon_register_shrinker(vb);
@@ -979,6 +985,9 @@ static int virtballoon_restore(struct virtio_device *vdev)
 
 static int virtballoon_validate(struct virtio_device *vdev)
 {
+   if (!page_poisoning_enabled())
+   __virtio_clear_bit(vdev, VIRTIO_BALLOON_F_PAGE_POISON);
+
__virtio_clear_bit(vdev, VIRTIO_F_IOMMU_PLATFORM);
return 0;
 }
@@ -988,6 +997,7 @@ static unsigned int features[] = {
VIRTIO_BALLOON_F_STATS_VQ,
VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
VIRTIO_BALLOON_F_FREE_PAGE_HINT,
+   VIRTIO_BALLOON_F_PAGE_POISON,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h 
b/include/uapi/linux/virtio_balloon.h
index 18ee430..80a7b7e 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -35,6 +35,7 @@
 #define VIRTIO_BALLOON_F_STATS_VQ  1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM2 /* Deflate balloon on OOM */
 #define VIRTIO_BALLOON_F_FREE_PAGE_HINT3 /* VQ to report free pages */
+#define VIRTIO_BALLOON_F_PAGE_POISON   4 /* Guest is using page poisoning */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
@@ -47,6 +48,8 @@ struct virtio_balloon_config {
__u32 actual;
/* Free page report command id, readonly by guest */
__u32 free_page_report_cmd_id;
+   /* Stores PAGE_POISON if page poisoning is in use */
+   __u32 poison_val;
 };
 
 #define VIRTIO_BALLOON_S_SWAP_IN  0   /* Amount of memory swapped in */
-- 
2.7.4

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH v36 4/5] mm/page_poison: expose page_poisoning_enabled to kernel modules

2018-07-20 Thread Wei Wang
In some usages, e.g. virtio-balloon, a kernel module needs to know if
page poisoning is in use. This patch exposes the page_poisoning_enabled
function to kernel modules.

Signed-off-by: Wei Wang 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Michael S. Tsirkin 
Acked-by: Andrew Morton 
---
 mm/page_poison.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/mm/page_poison.c b/mm/page_poison.c
index aa2b3d3..830f604 100644
--- a/mm/page_poison.c
+++ b/mm/page_poison.c
@@ -17,6 +17,11 @@ static int __init early_page_poison_param(char *buf)
 }
 early_param("page_poison", early_page_poison_param);
 
+/**
+ * page_poisoning_enabled - check if page poisoning is enabled
+ *
+ * Return true if page poisoning is enabled, or false if not.
+ */
 bool page_poisoning_enabled(void)
 {
/*
@@ -29,6 +34,7 @@ bool page_poisoning_enabled(void)
(!IS_ENABLED(CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC) &&
debug_pagealloc_enabled()));
 }
+EXPORT_SYMBOL_GPL(page_poisoning_enabled);
 
 static void poison_page(struct page *page)
 {
-- 
2.7.4

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH v36 1/5] virtio-balloon: remove BUG() in init_vqs

2018-07-20 Thread Wei Wang
It's a bit overkill to use BUG when failing to add an entry to the
stats_vq in init_vqs. So remove it and just return the error to the
caller to bail out nicely.

Signed-off-by: Wei Wang 
Cc: Michael S. Tsirkin 
---
 drivers/virtio/virtio_balloon.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 6b237e3..9356a1a 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -455,9 +455,13 @@ static int init_vqs(struct virtio_balloon *vb)
num_stats = update_balloon_stats(vb);
 
 sg_init_one(&sg, vb->stats, sizeof(vb->stats[0]) * num_stats);
-   if (virtqueue_add_outbuf(vb->stats_vq, &sg, 1, vb, GFP_KERNEL)
-   < 0)
-   BUG();
+   err = virtqueue_add_outbuf(vb->stats_vq, &sg, 1, vb,
+  GFP_KERNEL);
+   if (err) {
+   dev_warn(&vb->vdev->dev, "%s: add stat_vq failed\n",
+__func__);
+   return err;
+   }
virtqueue_kick(vb->stats_vq);
}
return 0;
-- 
2.7.4

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH v36 2/5] virtio_balloon: replace oom notifier with shrinker

2018-07-20 Thread Wei Wang
The OOM notifier is being deprecated for the reasons mentioned
here by Michal Hocko: https://lkml.org/lkml/2018/7/12/314

This patch replaces the virtio-balloon oom notifier with a shrinker
to release balloon pages on memory pressure.

In addition, the shrinker_scan function fixes a bug in the replaced
virtballoon_oom_notify: only VIRTIO_BALLOON_ARRAY_PFNS_MAX (i.e. 256)
balloon pages could be freed, even though the user had specified more
than that number.
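The diff below is truncated by the archive before it reaches the registration
helper; as a rough sketch of how the shrinker gets wired up (field names from
the shrinker API, helper names taken from the review discussion, not copied
from the patch):

static int virtio_balloon_register_shrinker(struct virtio_balloon *vb)
{
	vb->shrinker.scan_objects = virtio_balloon_shrinker_scan;
	vb->shrinker.count_objects = virtio_balloon_shrinker_count;
	vb->shrinker.seeks = DEFAULT_SEEKS;

	return register_shrinker(&vb->shrinker);
}

static void virtio_balloon_unregister_shrinker(struct virtio_balloon *vb)
{
	unregister_shrinker(&vb->shrinker);
}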

Signed-off-by: Wei Wang 
Cc: Michael S. Tsirkin 
Cc: Michal Hocko 
Cc: Andrew Morton 
Cc: Linus Torvalds 
---
 drivers/virtio/virtio_balloon.c | 113 +++-
 1 file changed, 65 insertions(+), 48 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 9356a1a..c6fd406 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -27,7 +27,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -40,12 +39,12 @@
  */
 #define VIRTIO_BALLOON_PAGES_PER_PAGE (unsigned)(PAGE_SIZE >> 
VIRTIO_BALLOON_PFN_SHIFT)
 #define VIRTIO_BALLOON_ARRAY_PFNS_MAX 256
-#define OOM_VBALLOON_DEFAULT_PAGES 256
+#define DEFAULT_BALLOON_PAGES_TO_SHRINK 256
 #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
 
-static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
-module_param(oom_pages, int, S_IRUSR | S_IWUSR);
-MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
+static unsigned long balloon_pages_to_shrink = DEFAULT_BALLOON_PAGES_TO_SHRINK;
+module_param(balloon_pages_to_shrink, ulong, 0600);
+MODULE_PARM_DESC(balloon_pages_to_shrink, "pages to free on memory pressure");
 
 #ifdef CONFIG_BALLOON_COMPACTION
 static struct vfsmount *balloon_mnt;
@@ -86,8 +85,8 @@ struct virtio_balloon {
/* Memory statistics */
struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
 
-   /* To register callback in oom notifier call chain */
-   struct notifier_block nb;
+   /* To register a shrinker to shrink memory upon memory pressure */
+   struct shrinker shrinker;
 };
 
 static struct virtio_device_id id_table[] = {
@@ -365,38 +364,6 @@ static void update_balloon_size(struct virtio_balloon *vb)
  &actual);
 }
 
-/*
- * virtballoon_oom_notify - release pages when system is under severe
- * memory pressure (called from out_of_memory())
- * @self : notifier block struct
- * @dummy: not used
- * @parm : returned - number of freed pages
- *
- * The balancing of memory by use of the virtio balloon should not cause
- * the termination of processes while there are pages in the balloon.
- * If virtio balloon manages to release some memory, it will make the
- * system return and retry the allocation that forced the OOM killer
- * to run.
- */
-static int virtballoon_oom_notify(struct notifier_block *self,
- unsigned long dummy, void *parm)
-{
-   struct virtio_balloon *vb;
-   unsigned long *freed;
-   unsigned num_freed_pages;
-
-   vb = container_of(self, struct virtio_balloon, nb);
-   if (!virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
-   return NOTIFY_OK;
-
-   freed = parm;
-   num_freed_pages = leak_balloon(vb, oom_pages);
-   update_balloon_size(vb);
-   *freed += num_freed_pages;
-
-   return NOTIFY_OK;
-}
-
 static void update_balloon_stats_func(struct work_struct *work)
 {
struct virtio_balloon *vb;
@@ -548,6 +515,61 @@ static struct file_system_type balloon_fs = {
 
 #endif /* CONFIG_BALLOON_COMPACTION */
 
+static unsigned long virtio_balloon_shrinker_scan(struct shrinker *shrinker,
+ struct shrink_control *sc)
+{
+   unsigned long pages_to_free = balloon_pages_to_shrink,
+ pages_freed = 0;
+   struct virtio_balloon *vb = container_of(shrinker,
+   struct virtio_balloon, shrinker);
+
+   /*
+* One invocation of leak_balloon can deflate at most
+* VIRTIO_BALLOON_ARRAY_PFNS_MAX balloon pages, so we call it
+* multiple times to deflate pages till reaching
+* balloon_pages_to_shrink pages.
+*/
+   while (vb->num_pages && pages_to_free) {
+   pages_to_free = balloon_pages_to_shrink - pages_freed;
+   pages_freed += leak_balloon(vb, pages_to_free);
+   }
+   update_balloon_size(vb);
+
+   return pages_freed / VIRTIO_BALLOON_PAGES_PER_PAGE;
+}
+
+static unsigned long virtio_balloon_shrinker_count(struct shrinker *shrinker,
+  struct shrink_control *sc)
+{
+   struct virtio_balloon *vb = container_of(shrinker,
+   struct virtio_balloon, shrinker);
+
+   /*
+* We continue to use VIRTIO_BALLOON_F_DEFLATE_ON_OOM to handle the
+* case when shrinker needs 

[PATCH v36 0/5] Virtio-balloon: support free page reporting

2018-07-20 Thread Wei Wang
This patch series is separated from the previous "Virtio-balloon
Enhancement" series. The new feature, VIRTIO_BALLOON_F_FREE_PAGE_HINT,  
implemented by this series enables the virtio-balloon driver to report
hints of guest free pages to the host. It can be used to accelerate live
migration of VMs. Here is an introduction of this usage:

Live migration needs to transfer the VM's memory from the source machine
to the destination round by round. For the 1st round, all the VM's memory
is transferred. From the 2nd round, only the pieces of memory that were
written by the guest (after the 1st round) are transferred. One method
that is popularly used by the hypervisor to track which part of memory is
written is to write-protect all the guest memory.

This feature enables the optimization by skipping the transfer of guest
free pages during VM live migration. There is no concern about the memory
pages being used after they are given to the hypervisor as free page
hints, because the hypervisor will keep tracking them and transfer them
in a subsequent round if they are used and written.

* Tests
- Test Environment
Host: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
Guest: 8G RAM, 4 vCPU
Migration setup: migrate_set_speed 100G, migrate_set_downtime 2 second

- Test Results
- Idle Guest Live Migration Time (results are averaged over 10 runs):
- Optimization v.s. Legacy = 409ms vs 1757ms --> ~77% reduction
(setting the page poisoning value to zero and enabling KSM don't affect
 the comparison result)
- Guest with Linux Compilation Workload (make bzImage -j4):
- Live Migration Time (average)
  Optimization v.s. Legacy = 1407ms v.s. 2528ms --> ~44% reduction
- Linux Compilation Time
  Optimization v.s. Legacy = 5min4s v.s. 5min12s
  --> no obvious difference

ChangeLog:
v35->v36:
- remove the mm patch, as Linus has a suggestion to get free page
  addresses via allocation, instead of reading from the free page
  list.
- virtio-balloon:
- replace oom notifier with shrinker;
- the guest to host communication interface remains the same as
  v32.
- allocate free page blocks and send to host one by one, and free
  them after sending all the pages.

For ChangeLogs from v22 to v35, please reference
https://lwn.net/Articles/759413/

For ChangeLogs before v21, please reference
https://lwn.net/Articles/743660/

Wei Wang (5):
  virtio-balloon: remove BUG() in init_vqs
  virtio_balloon: replace oom notifier with shrinker
  virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
  mm/page_poison: expose page_poisoning_enabled to kernel modules
  virtio-balloon: VIRTIO_BALLOON_F_PAGE_POISON

 drivers/virtio/virtio_balloon.c | 456 ++--
 include/uapi/linux/virtio_balloon.h |   7 +
 mm/page_poison.c|   6 +
 3 files changed, 394 insertions(+), 75 deletions(-)

-- 
2.7.4

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v35 1/5] mm: support to get hints of free page blocks

2018-07-12 Thread Wei Wang

On 07/12/2018 07:49 PM, Michal Hocko wrote:

On Thu 12-07-18 19:34:16, Wei Wang wrote:

On 07/12/2018 04:13 PM, Michal Hocko wrote:

On Thu 12-07-18 10:52:08, Wei Wang wrote:

On 07/12/2018 10:30 AM, Linus Torvalds wrote:

On Wed, Jul 11, 2018 at 7:17 PM Wei Wang  wrote:

Would it be better to remove __GFP_THISNODE? We actually want to get all
the guest free pages (from all the nodes).

Maybe. Or maybe it would be better to have the memory balloon logic be
per-node? Maybe you don't want to remove too much memory from one
node? I think it's one of those "play with it" things.

I don't think that's the big issue, actually. I think the real issue
is how to react quickly and gracefully to "oops, I'm trying to give
memory away, but now the guest wants it back" while you're in the
middle of trying to create that 2TB list of pages.

OK. virtio-balloon has already registered an oom notifier
(virtballoon_oom_notify). I plan to add some control there. If oom happens,
- stop the page allocation;
- immediately give back the allocated pages to mm.

Please don't. Oom notifier is an absolutely hideous interface which
should go away sooner or later (I would much rather like the former) so
do not build a new logic on top of it. I would appreciate if you
actually remove the notifier much more.

You can give memory back from the standard shrinker interface. If we are
reaching low reclaim priorities then we are struggling to reclaim memory
and then you can start returning pages back.

OK. Just curious why the oom notifier is thought to be hideous, and whether
that is the consensus?

Because it is a completely non-transparent callout from the OOM context
which is really subtle on its own. It is just too easy to end up in
weird corner cases. We really have to be careful and be as swift as
possible. Any potential sleep would make the OOM situation much worse
because nobody would be able to make a forward progress or (in)direct
dependency on MM subsystem can easily deadlock. Those are really hard
to track down and defining the notifier as blockable by design which
just asks for bad implementations because most people simply do not
realize how subtle the oom context is.

Another thing is that it happens way too late when we have basically
reclaimed the world and didn't get out of the memory pressure so you can
expect any workload is suffering already. Anybody sitting on a large
amount of reclaimable memory should have released that memory by that
time. Proportionally to the reclaim pressure ideally.

The notifier API is completely unaware of oom constrains. Just imagine
you are OOM in a subset of numa nodes. Callback doesn't have any idea
about that.

Moreover we do have proper reclaim mechanism that has a feedback
loop and that should be always preferable to an abrupt reclaim.


Sounds very reasonable, thanks for the elaboration. I'll try with shrinker.

Best,
Wei



___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v35 1/5] mm: support to get hints of free page blocks

2018-07-11 Thread Wei Wang

On 07/12/2018 10:30 AM, Linus Torvalds wrote:

On Wed, Jul 11, 2018 at 7:17 PM Wei Wang  wrote:

Would it be better to remove __GFP_THISNODE? We actually want to get all
the guest free pages (from all the nodes).

Maybe. Or maybe it would be better to have the memory balloon logic be
per-node? Maybe you don't want to remove too much memory from one
node? I think it's one of those "play with it" things.

I don't think that's the big issue, actually. I think the real issue
is how to react quickly and gracefully to "oops, I'm trying to give
memory away, but now the guest wants it back" while you're in the
middle of trying to create that 2TB list of pages.


OK. virtio-balloon has already registered an oom notifier 
(virtballoon_oom_notify). I plan to add some control there. If oom happens,

- stop the page allocation;
- immediately give back the allocated pages to mm.

Best,
Wei
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v35 1/5] mm: support to get hints of free page blocks

2018-07-11 Thread Wei Wang

On 07/12/2018 12:23 AM, Linus Torvalds wrote:

On Wed, Jul 11, 2018 at 2:21 AM Michal Hocko  wrote:

We already have an interface for that. alloc_pages(GFP_NOWAIT, MAX_ORDER -1).
So why do we need any array based interface?

That was actually my original argument in the original thread - that
the only new interface people might want is one that just tells how
many of those MAX_ORDER-1 pages there are.

See the thread in v33 with the subject

   "[PATCH v33 1/4] mm: add a function to get free page blocks"

and look for me suggesting just using

 #define GFP_MINFLAGS (__GFP_NORETRY | __GFP_NOWARN |
__GFP_THISNODE | __GFP_NOMEMALLOC)


Would it be better to remove __GFP_THISNODE? We actually want to get all 
the guest free pages (from all the nodes).


Best,
Wei
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v35 1/5] mm: support to get hints of free page blocks

2018-07-11 Thread Wei Wang

On 07/11/2018 05:21 PM, Michal Hocko wrote:

On Tue 10-07-18 18:44:34, Linus Torvalds wrote:
[...]

That was what I tried to encourage with actually removing the pages
form the page list. That would be an _incremental_ interface. You can
remove MAX_ORDER-1 pages one by one (or a hundred at a time), and mark
them free for ballooning that way. And if you still feel you have tons
of free memory, just continue removing more pages from the free list.

We already have an interface for that. alloc_pages(GFP_NOWAIT, MAX_ORDER -1).
So why do we need any array based interface?


Yes, I'm trying to get free pages directly via alloc_pages, so there 
will be no new mm APIs.
I plan to let free page allocation stop when the remaining system free 
memory becomes close to min_free_kbytes (to prevent swapping).
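As an illustration only (not from any posted patch; the helper name and the
2x margin are assumptions), such a stop condition could look like:

/*
 * Illustrative sketch: stop pulling free page blocks once the global
 * free page count gets close to the min_free_kbytes watermark.
 */
static bool balloon_should_stop_hinting(void)
{
	unsigned long free_kbytes =
		global_zone_page_state(NR_FREE_PAGES) << (PAGE_SHIFT - 10);

	/* keep a 2x safety margin above the min watermark */
	return free_kbytes < 2UL * min_free_kbytes;
}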



Best,
Wei


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH v35 4/5] mm/page_poison: expose page_poisoning_enabled to kernel modules

2018-07-10 Thread Wei Wang
In some usages, e.g. virtio-balloon, a kernel module needs to know if
page poisoning is in use. This patch exposes the page_poisoning_enabled
function to kernel modules.

Signed-off-by: Wei Wang 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Michael S. Tsirkin 
Acked-by: Andrew Morton 
---
 mm/page_poison.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/mm/page_poison.c b/mm/page_poison.c
index aa2b3d3..830f604 100644
--- a/mm/page_poison.c
+++ b/mm/page_poison.c
@@ -17,6 +17,11 @@ static int __init early_page_poison_param(char *buf)
 }
 early_param("page_poison", early_page_poison_param);
 
+/**
+ * page_poisoning_enabled - check if page poisoning is enabled
+ *
+ * Return true if page poisoning is enabled, or false if not.
+ */
 bool page_poisoning_enabled(void)
 {
/*
@@ -29,6 +34,7 @@ bool page_poisoning_enabled(void)
(!IS_ENABLED(CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC) &&
debug_pagealloc_enabled()));
 }
+EXPORT_SYMBOL_GPL(page_poisoning_enabled);
 
 static void poison_page(struct page *page)
 {
-- 
2.7.4

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH v35 5/5] virtio-balloon: VIRTIO_BALLOON_F_PAGE_POISON

2018-07-10 Thread Wei Wang
The VIRTIO_BALLOON_F_PAGE_POISON feature bit is used to indicate if the
guest is using page poisoning. Guest writes to the poison_val config
field to tell host about the page poisoning value that is in use.

Suggested-by: Michael S. Tsirkin 
Signed-off-by: Wei Wang 
Cc: Michael S. Tsirkin 
Cc: Michal Hocko 
Cc: Andrew Morton 
---
 drivers/virtio/virtio_balloon.c | 10 ++
 include/uapi/linux/virtio_balloon.h |  3 +++
 2 files changed, 13 insertions(+)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 8754154..dd61660 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -869,6 +869,7 @@ static struct file_system_type balloon_fs = {
 static int virtballoon_probe(struct virtio_device *vdev)
 {
struct virtio_balloon *vb;
+   __u32 poison_val;
int err;
 
if (!vdev->config->get) {
@@ -916,6 +917,11 @@ static int virtballoon_probe(struct virtio_device *vdev)
INIT_WORK(&vb->report_free_page_work, report_free_page_func);
vb->cmd_id_received = VIRTIO_BALLOON_FREE_PAGE_REPORT_STOP_ID;
vb->cmd_id_active = VIRTIO_BALLOON_FREE_PAGE_REPORT_STOP_ID;
+   if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_PAGE_POISON)) {
+   memset(&poison_val, PAGE_POISON, sizeof(poison_val));
+   virtio_cwrite(vb->vdev, struct virtio_balloon_config,
+ poison_val, &poison_val);
+   }
}
 
vb->nb.notifier_call = virtballoon_oom_notify;
@@ -1034,6 +1040,9 @@ static int virtballoon_restore(struct virtio_device *vdev)
 
 static int virtballoon_validate(struct virtio_device *vdev)
 {
+   if (!page_poisoning_enabled())
+   __virtio_clear_bit(vdev, VIRTIO_BALLOON_F_PAGE_POISON);
+
__virtio_clear_bit(vdev, VIRTIO_F_IOMMU_PLATFORM);
return 0;
 }
@@ -1043,6 +1052,7 @@ static unsigned int features[] = {
VIRTIO_BALLOON_F_STATS_VQ,
VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
VIRTIO_BALLOON_F_FREE_PAGE_HINT,
+   VIRTIO_BALLOON_F_PAGE_POISON,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h 
b/include/uapi/linux/virtio_balloon.h
index b77919b..97415ba 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -35,6 +35,7 @@
 #define VIRTIO_BALLOON_F_STATS_VQ  1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM2 /* Deflate balloon on OOM */
 #define VIRTIO_BALLOON_F_FREE_PAGE_HINT3 /* VQ to report free pages */
+#define VIRTIO_BALLOON_F_PAGE_POISON   4 /* Guest is using page poisoning */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
@@ -47,6 +48,8 @@ struct virtio_balloon_config {
__u32 actual;
/* Free page report command id, readonly by guest */
__u32 free_page_report_cmd_id;
+   /* Stores PAGE_POISON if page poisoning is in use */
+   __u32 poison_val;
 };
 
 struct virtio_balloon_free_page_hints_cmd {
-- 
2.7.4

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH v35 3/5] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

2018-07-10 Thread Wei Wang
Negotiation of the VIRTIO_BALLOON_F_FREE_PAGE_HINT feature indicates the
support of reporting hints of guest free pages to host via virtio-balloon.

Host requests the guest to report free page hints by sending a new cmd id
to the guest via the free_page_report_cmd_id configuration register.

As the first step here, virtio-balloon only reports free page hints from
the max order (i.e. 10) free page list to host. This has generated similar
good results as reporting all free page hints during our tests.

When the guest starts to report, it first sends a start cmd to host via
the free page vq, which acks to host the cmd id received, and tells it the
hint size (e.g. 4MB each on x86). When the guest finishes the reporting,
a stop cmd is sent to host via the vq.

TODO:
- support reporting free page hints from smaller order free page lists
  when there is a need/request from users.

Signed-off-by: Wei Wang 
Signed-off-by: Liang Li 
Cc: Michael S. Tsirkin 
Cc: Michal Hocko 
Cc: Andrew Morton 
---
 drivers/virtio/virtio_balloon.c | 399 +---
 include/uapi/linux/virtio_balloon.h |  11 +
 2 files changed, 384 insertions(+), 26 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 9356a1a..8754154 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -43,6 +43,14 @@
 #define OOM_VBALLOON_DEFAULT_PAGES 256
 #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
 
+/* The order used to allocate a buffer to load free page hints */
+#define VIRTIO_BALLOON_HINT_BUF_ORDER (MAX_ORDER - 1)
+/* The number of pages a hint buffer has */
+#define VIRTIO_BALLOON_HINT_BUF_PAGES (1 << VIRTIO_BALLOON_HINT_BUF_ORDER)
+/* The size of a hint buffer in bytes */
+#define VIRTIO_BALLOON_HINT_BUF_SIZE (VIRTIO_BALLOON_HINT_BUF_PAGES << \
+ PAGE_SHIFT)
+
 static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
 module_param(oom_pages, int, S_IRUSR | S_IWUSR);
 MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
@@ -51,9 +59,22 @@ MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
 static struct vfsmount *balloon_mnt;
 #endif
 
+enum virtio_balloon_vq {
+   VIRTIO_BALLOON_VQ_INFLATE,
+   VIRTIO_BALLOON_VQ_DEFLATE,
+   VIRTIO_BALLOON_VQ_STATS,
+   VIRTIO_BALLOON_VQ_FREE_PAGE,
+   VIRTIO_BALLOON_VQ_MAX
+};
+
 struct virtio_balloon {
struct virtio_device *vdev;
-   struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
+   struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
+
+   /* Balloon's own wq for cpu-intensive work items */
+   struct workqueue_struct *balloon_wq;
+   /* The free page reporting work item submitted to the balloon wq */
+   struct work_struct report_free_page_work;
 
/* The balloon servicing is delegated to a freezable workqueue. */
struct work_struct update_balloon_stats_work;
@@ -63,6 +84,15 @@ struct virtio_balloon {
spinlock_t stop_update_lock;
bool stop_update;
 
+   /* Command buffers to start and stop the reporting of hints to host */
+   struct virtio_balloon_free_page_hints_cmd cmd_start;
+   struct virtio_balloon_free_page_hints_cmd cmd_stop;
+
+   /* The cmd id received from host */
+   u32 cmd_id_received;
+   /* The cmd id that is actively in use */
+   u32 cmd_id_active;
+
/* Waiting for host to ack the pages we released. */
wait_queue_head_t acked;
 
@@ -326,17 +356,6 @@ static void stats_handle_request(struct virtio_balloon *vb)
virtqueue_kick(vq);
 }
 
-static void virtballoon_changed(struct virtio_device *vdev)
-{
-   struct virtio_balloon *vb = vdev->priv;
-   unsigned long flags;
-
-   spin_lock_irqsave(&vb->stop_update_lock, flags);
-   if (!vb->stop_update)
-   queue_work(system_freezable_wq, &vb->update_balloon_size_work);
-   spin_unlock_irqrestore(&vb->stop_update_lock, flags);
-}
-
 static inline s64 towards_target(struct virtio_balloon *vb)
 {
s64 target;
@@ -353,6 +372,35 @@ static inline s64 towards_target(struct virtio_balloon *vb)
return target - vb->num_pages;
 }
 
+static void virtballoon_changed(struct virtio_device *vdev)
+{
+   struct virtio_balloon *vb = vdev->priv;
+   unsigned long flags;
+   s64 diff = towards_target(vb);
+
+   if (diff) {
+   spin_lock_irqsave(&vb->stop_update_lock, flags);
+   if (!vb->stop_update)
+   queue_work(system_freezable_wq,
+  &vb->update_balloon_size_work);
+   spin_unlock_irqrestore(&vb->stop_update_lock, flags);
+   }
+
+   if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
+   virtio_cread(vdev, struct virtio_balloon_config,
+free_page_report_cmd_id, &vb->cmd_id_received);
+   if (vb->cmd_id_

[PATCH v35 2/5] virtio-balloon: remove BUG() in init_vqs

2018-07-10 Thread Wei Wang
It's a bit overkill to use BUG when failing to add an entry to the
stats_vq in init_vqs. So remove it and just return the error to the
caller to bail out nicely.

Signed-off-by: Wei Wang 
Cc: Michael S. Tsirkin 
---
 drivers/virtio/virtio_balloon.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 6b237e3..9356a1a 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -455,9 +455,13 @@ static int init_vqs(struct virtio_balloon *vb)
num_stats = update_balloon_stats(vb);
 
 sg_init_one(&sg, vb->stats, sizeof(vb->stats[0]) * num_stats);
-   if (virtqueue_add_outbuf(vb->stats_vq, &sg, 1, vb, GFP_KERNEL)
-   < 0)
-   BUG();
+   err = virtqueue_add_outbuf(vb->stats_vq, &sg, 1, vb,
+  GFP_KERNEL);
+   if (err) {
+   dev_warn(&vb->vdev->dev, "%s: add stat_vq failed\n",
+__func__);
+   return err;
+   }
virtqueue_kick(vb->stats_vq);
}
return 0;
-- 
2.7.4

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH v35 1/5] mm: support to get hints of free page blocks

2018-07-10 Thread Wei Wang
This patch adds support to get free page blocks from a free page list.
The physical addresses of the blocks are stored to a list of buffers
passed from the caller. The obtained free page blocks are hints about
free pages, because there is no guarantee that they are still on the free
page list after the function returns.

One use example of this patch is to accelerate live migration by skipping
the transfer of free pages reported from the guest. A popular method used
by the hypervisor to track which part of memory is written during live
migration is to write-protect all the guest memory. So, those pages that
are hinted as free pages but are written after this function returns will
be captured by the hypervisor, and they will be added to the next round of
memory transfer.
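For context, a minimal caller sketch (illustration only, error handling
trimmed; the function name is made up): hand one MAX_ORDER - 1 block to
get_from_free_page_list() as the buffer and read the hints back out of it.

static int report_hints_once(void)
{
	LIST_HEAD(bufs);
	struct page *buf;
	unsigned long loaded = 0;
	int ret;

	buf = alloc_pages(GFP_KERNEL, MAX_ORDER - 1);
	if (!buf)
		return -ENOMEM;
	list_add(&buf->lru, &bufs);

	ret = get_from_free_page_list(MAX_ORDER - 1, &bufs,
				      PAGE_SIZE << (MAX_ORDER - 1), &loaded);
	/* 'loaded' little-endian addresses now sit in page_address(buf) */

	__free_pages(buf, MAX_ORDER - 1);
	return ret;
}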

Suggested-by: Linus Torvalds 
Signed-off-by: Wei Wang 
Signed-off-by: Liang Li 
Cc: Michal Hocko 
Cc: Andrew Morton 
Cc: Michael S. Tsirkin 
Cc: Linus Torvalds 
---
 include/linux/mm.h |  3 ++
 mm/page_alloc.c| 98 ++
 2 files changed, 101 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a0fbb9f..5ce654f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2007,6 +2007,9 @@ extern void free_area_init(unsigned long * zones_size);
 extern void free_area_init_node(int nid, unsigned long * zones_size,
unsigned long zone_start_pfn, unsigned long *zholes_size);
 extern void free_initmem(void);
+unsigned long max_free_page_blocks(int order);
+int get_from_free_page_list(int order, struct list_head *pages,
+   unsigned int size, unsigned long *loaded_num);
 
 /*
  * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1521100..b67839b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5043,6 +5043,104 @@ void show_free_areas(unsigned int filter, nodemask_t 
*nodemask)
show_swap_cache_info();
 }
 
+/**
+ * max_free_page_blocks - estimate the max number of free page blocks
+ * @order: the order of the free page blocks to estimate
+ *
+ * This function gives a rough estimation of the possible maximum number of
+ * free page blocks a free list may have. The estimation works on an assumption
+ * that all the system pages are on that list.
+ *
+ * Context: Any context.
+ *
+ * Return: The largest number of free page blocks that the free list can have.
+ */
+unsigned long max_free_page_blocks(int order)
+{
+   return totalram_pages / (1 << order);
+}
+EXPORT_SYMBOL_GPL(max_free_page_blocks);
+
+/**
+ * get_from_free_page_list - get hints of free pages from a free page list
+ * @order: the order of the free page list to check
+ * @pages: the list of page blocks used as buffers to load the addresses
+ * @size: the size of each buffer in bytes
+ * @loaded_num: the number of addresses loaded to the buffers
+ *
+ * This function offers hints about free pages. The addresses of free page
+ * blocks are stored to the list of buffers passed from the caller. There is
+ * no guarantee that the obtained free pages are still on the free page list
+ * after the function returns. pfn_to_page on the obtained free pages is
+ * strongly discouraged and if there is an absolute need for that, make sure
+ * to contact MM people to discuss potential problems.
+ *
+ * The addresses are currently stored to a buffer in little endian. This
+ * avoids the overhead of converting endianness by the caller who needs data
+ * in the little endian format. Big endian support can be added on demand in
+ * the future.
+ *
+ * Context: Process context.
+ *
+ * Return: 0 if all the free page block addresses are stored to the buffers;
+ * -ENOSPC if the buffers are not sufficient to store all the
+ * addresses; or -EINVAL if an unexpected argument is received (e.g.
+ * incorrect @order, empty buffer list).
+ */
+int get_from_free_page_list(int order, struct list_head *pages,
+   unsigned int size, unsigned long *loaded_num)
+{
+   struct zone *zone;
+   enum migratetype mt;
+   struct list_head *free_list;
+   struct page *free_page, *buf_page;
+   unsigned long addr;
+   __le64 *buf;
+   unsigned int used_buf_num = 0, entry_index = 0,
+entries = size / sizeof(__le64);
+   *loaded_num = 0;
+
+   /* Validity check */
+   if (order < 0 || order >= MAX_ORDER)
+   return -EINVAL;
+
+   buf_page = list_first_entry_or_null(pages, struct page, lru);
+   if (!buf_page)
+   return -EINVAL;
+   buf = (__le64 *)page_address(buf_page);
+
+   for_each_populated_zone(zone) {
+   spin_lock_irq(&zone->lock);
+   for (mt = 0; mt < MIGRATE_TYPES; mt++) {
+   free_list = &zone->free_area[order].free_list[mt];
+   list_for_each_entry(free_page, free_list, lru) {
+  

[PATCH v35 0/5] Virtio-balloon: support free page reporting

2018-07-10 Thread Wei Wang
: use a separate buffer for the stop cmd, instead of
  having the start and stop cmd use the same buffer. This avoids the
  corner case that the start cmd is overridden by the stop cmd when
  the host has a delay in reading the start cmd.
v27->v28:
- mm/page_poison: Move PAGE_POISON to page_poison.c and add a function
  to expose page poison val to kernel modules.
v26->v27:
- add a new patch to expose page_poisoning_enabled to kernel modules
- virtio-balloon: set poison_val to 0x, instead of 0xaa
v25->v26: virtio-balloon changes only
- remove kicking free page vq since the host now polls the vq after
  initiating the reporting
- report_free_page_func: detach all the used buffers after sending
  the stop cmd id. This avoids leaving the detaching burden (i.e.
  overhead) to the next cmd id. Detaching here isn't considered
  overhead since the stop cmd id has been sent, and host has already
moved forward.
v24->v25:
- mm: change walk_free_mem_block to return 0 (instead of true) on
  completing the report, and return a non-zero value from the
callback, which stops the reporting.
- virtio-balloon:
- use enum instead of define for VIRTIO_BALLOON_VQ_INFLATE etc.
- avoid __virtio_clear_bit when bailing out;
- a new method to avoid reporting the some cmd id to host twice
- destroy_workqueue can cancel free page work when the feature is
  negotiated;
- fail probe when the free page vq size is less than 2.
v23->v24:
- change feature name VIRTIO_BALLOON_F_FREE_PAGE_VQ to
  VIRTIO_BALLOON_F_FREE_PAGE_HINT
- kick when vq->num_free < half full, instead of "= half full"
- replace BUG_ON with bailing out
- check vb->balloon_wq in probe(), if null, bail out
- add a new feature bit for page poisoning
- solve the corner case that one cmd id being sent to host twice
v22->v23:
- change to kick the device when the vq is half-way full;
- open-code batch_free_page_sg into add_one_sg;
- change cmd_id from "uint32_t" to "__virtio32";
- reserve one entry in the vq for the driver to send cmd_id, instead
  of busywaiting for an available entry;
- add "stop_update" check before queue_work for prudence purpose for
  now, will have a separate patch to discuss this flag check later;
- init_vqs: change to put some variables on stack to have simpler
  implementation;
- add destroy_workqueue(vb->balloon_wq);
v21->v22:
- add_one_sg: some code and comment re-arrangement
    - send_cmd_id: handle a corner case

For previous ChangeLog, please reference
https://lwn.net/Articles/743660/

Wei Wang (5):
  mm: support to get hints of free page blocks
  virtio-balloon: remove BUG() in init_vqs
  virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
  mm/page_poison: expose page_poisoning_enabled to kernel modules
  virtio-balloon: VIRTIO_BALLOON_F_PAGE_POISON

 drivers/virtio/virtio_balloon.c | 419 +---
 include/linux/mm.h  |   3 +
 include/uapi/linux/virtio_balloon.h |  14 ++
 mm/page_alloc.c |  98 +
 mm/page_poison.c|   6 +
 5 files changed, 511 insertions(+), 29 deletions(-)

-- 
2.7.4

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v34 0/4] Virtio-balloon: support free page reporting

2018-06-29 Thread Wei Wang

On 06/25/2018 08:05 PM, Wei Wang wrote:

This patch series is separated from the previous "Virtio-balloon
Enhancement" series. The new feature, VIRTIO_BALLOON_F_FREE_PAGE_HINT,
implemented by this series enables the virtio-balloon driver to report
hints of guest free pages to the host. It can be used to accelerate live
migration of VMs. Here is an introduction of this usage:

Live migration needs to transfer the VM's memory from the source machine
to the destination round by round. For the 1st round, all the VM's memory
is transferred. From the 2nd round, only the pieces of memory that were
written by the guest (after the 1st round) are transferred. One method
that is popularly used by the hypervisor to track which part of memory is
written is to write-protect all the guest memory.

This feature enables the optimization by skipping the transfer of guest
free pages during VM live migration. It is not concerned that the memory
pages are used after they are given to the hypervisor as a hint of the
free pages, because they will be tracked by the hypervisor and transferred
in the subsequent round if they are used and written.

* Tests
- Test Environment
 Host: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
 Guest: 8G RAM, 4 vCPU
 Migration setup: migrate_set_speed 100G, migrate_set_downtime 2 second

- Test Results
 - Idle Guest Live Migration Time (results are averaged over 10 runs):
 - Optimization v.s. Legacy = 284ms vs 1757ms --> ~84% reduction


According to Michael's comments, add one more set of data here:

Enabling page poisoning with value=0, and enabling KSM.

The legacy live migration time is 1806ms (averaged across 10 runs); 
compared to the case with this optimization feature in use (i.e. 284ms), 
there is still around an ~84% reduction.



Best,
Wei
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v34 0/4] Virtio-balloon: support free page reporting

2018-06-29 Thread Wei Wang

On 06/29/2018 03:46 PM, David Hildenbrand wrote:


I'm afraid it can't. For example, when we have a guest booted, without
too many memory activities. Assume the guest has 8GB free memory. The
arch_free_page there won't be able to capture the 8GB free pages since
there is no free() called. This results in no free pages reported to host.


So, it takes some time from when the guest boots up until the balloon
device was initialized and therefore page hinting can start. For that
period, you won't get any arch_free_page()/page hinting callbacks, correct.

However in the hypervisor, you can theoretically track which pages the
guest actually touched ("dirty"), so you already know "which pages were
never touched while booting up until virtio-balloon was brought to
life". These, you can directly exclude from migration. No interface
required.

The remaining problem is pages that were touched ("allocated") by the
guest during bootup but freed again, before virtio-balloon came up. One
would have to measure how many pages these usually are, I would say it
would not be that many (because recently freed pages are likely to be
used again next for allocation). However, there are some pages not being
reported.

During the lifetime of the guest, this should not be a problem,
eventually one of these pages would get allocated/freed again, so the
problem "solves itself over time". You are looking into the special case
of migrating the VM just after it has been started. But we have the
exact same problem also for ordinary free page hinting, so we should
rather solve that problem. It is not migration specific.

If we are looking for an alternative to "problem solves itself",
something like "if virtio-balloon comes up, it will report all free
pages step by step using free page hinting, just like we would have from
"arch_free_pages()"". This would be the same interface we are using for
free page hinting - and it could even be made configurable in the guest.

The current approach we are discussing internally for details about
Nitesh's work ("how the magic inside arch_free_pages() will work
efficiently") would allow this as far as I can see just fine.

There would be a tiny little window between virtio-balloon comes up and
it has reported all free pages step by step, but that can be considered
a very special corner case that I would argue is not worth it to be
optimized.

If I am missing something important here, sorry in advance :)



Probably I didn't explain that well. Please see my re-try:

That work is to monitor page allocation and free activities via 
arch_alloc_pages and arch_free_pages. It has per-CPU lists to record the 
pages that are freed to the mm free list, and the per-CPU lists dump the 
recorded pages to a global list when any of them is full.
So its per-CPU lists will only be able to get free pages when an mm 
free() call is made. Suppose we have 8GB of free memory on the mm free 
list, but no application uses it and thus no mm free() calls are made. 
In that case, arch_free_pages isn't called and no free pages are added 
to the per-CPU lists, yet we have 8GB of free memory sitting right on 
the mm free list.
How would you guarantee the per-CPU lists have got all the free pages 
that the mm free lists have?


- I'm also worried about the overhead of maintaining so many per-CPU 
lists and the global list. For example, applications may frequently 
allocate and free 4KB pages, and each per-CPU list needs to implement 
the buddy algorithm to sort and merge neighboring pages. Today a server 
can have more than 100 CPUs, so there will be more than 100 per-CPU 
lists that need to sync to a global list under a lock; I'm not sure 
this would scale well.


- This seems to be a burden imposed on the core mm memory 
allocation/free path. The whole overhead needs to be carried for the 
entire system life cycle. What we actually expect is to make just one 
call to get the free page hints, only when live migration happens.


Best,
Wei










___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v34 0/4] Virtio-balloon: support free page reporting

2018-06-28 Thread Wei Wang

On 06/27/2018 07:06 PM, David Hildenbrand wrote:

On 25.06.2018 14:05, Wei Wang wrote:

This patch series is separated from the previous "Virtio-balloon
Enhancement" series. The new feature, VIRTIO_BALLOON_F_FREE_PAGE_HINT,
implemented by this series enables the virtio-balloon driver to report
hints of guest free pages to the host. It can be used to accelerate live
migration of VMs. Here is an introduction of this usage:

Live migration needs to transfer the VM's memory from the source machine
to the destination round by round. For the 1st round, all the VM's memory
is transferred. From the 2nd round, only the pieces of memory that were
written by the guest (after the 1st round) are transferred. One method
that is popularly used by the hypervisor to track which part of memory is
written is to write-protect all the guest memory.

This feature enables the optimization by skipping the transfer of guest
free pages during VM live migration. It is not concerned that the memory
pages are used after they are given to the hypervisor as a hint of the
free pages, because they will be tracked by the hypervisor and transferred
in the subsequent round if they are used and written.

* Tests
- Test Environment
 Host: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
 Guest: 8G RAM, 4 vCPU
 Migration setup: migrate_set_speed 100G, migrate_set_downtime 2 second

- Test Results
 - Idle Guest Live Migration Time (results are averaged over 10 runs):
 - Optimization v.s. Legacy = 284ms vs 1757ms --> ~84% reduction
 - Guest with Linux Compilation Workload (make bzImage -j4):
 - Live Migration Time (average)
   Optimization v.s. Legacy = 1402ms v.s. 2528ms --> ~44% reduction
 - Linux Compilation Time
   Optimization v.s. Legacy = 5min6s v.s. 5min12s
   --> no obvious difference


Being in version 34 already, this whole thing still looks and feels like
a big hack to me. It might just be me, but especially if I read about
assumptions like "QEMU will not hotplug memory during migration". This
does not feel like a clean solution.

I am still not sure if we really need this interface, especially as real
free page hinting might be on its way.

a) we perform free page hinting by setting all free pages
(arch_free_page()) to zero. Migration will detect zero pages and
minimize #pages to migrate. I don't think this is a good idea but Michel
suggested to do a performance evaluation and Nitesh is looking into that
right now.


The hypervisor doesn't get the zero pages for free. It costs a lot of CPU 
utilization and memory bandwidth (there are people complaining about this 
on the QEMU mailing list).
In the results above, the legacy case already has the zero-page feature 
in use.




b) we perform free page hinting using something that Nitesh proposed. We
get in QEMU blocks of free pages that we can MADV_FREE. In addition we
could e.g. clear the dirty bit of these pages in the dirty bitmap, to
hinder them from getting migrated. Right now the hinting mechanism is
synchronous (called from arch_free_page()) but we might be able to
convert it into something asynchronous.

So we might be able to completely get rid of this interface. And looking
at all the discussions and problems that already happened during the
development of this series, I think we should rather look into how clean
free page hinting might solve the same problem.

If it can't be solved using free page hinting, fair enough.



I'm afraid it can't. For example, consider a guest that has just booted, 
without many memory activities, and assume it has 8GB of free memory. 
arch_free_page won't be able to capture those 8GB of free pages, since no 
free() is called on them. This results in no free pages being reported to 
the host.
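
A minimal sketch of why a hook-based approach misses memory that is
already free (arch_free_page is the existing hook name;
record_free_page_hint() is a hypothetical recorder, not code from either
series):

void arch_free_page(struct page *page, int order)
{
	/*
	 * Only pages passing through free_pages() from now on are seen
	 * here; the 8GB already sitting on the buddy free lists when the
	 * hook was installed never triggers this call.
	 */
	record_free_page_hint(page_to_pfn(page), order);
}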


Best,
Wei
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v33 1/4] mm: add a function to get free page blocks

2018-06-28 Thread Wei Wang

On 06/28/2018 03:07 AM, Michael S. Tsirkin wrote:

On Wed, Jun 27, 2018 at 09:05:39AM -0700, Linus Torvalds wrote:

[ Sorry for slow reply, my travels have made a mess of my inbox ]

On Mon, Jun 25, 2018 at 6:55 PM Michael S. Tsirkin  wrote:

Linus, do you think it would be ok to have get_from_free_page_list
actually pop entries from the free list and use them as the buffer
to store PAs?

Honestly, what I think the best option would be is to get rid of this
interface *entirely*, and just have the balloon code do

 #define GFP_MINFLAGS (__GFP_NORETRY | __GFP_NOWARN |
__GFP_THISNODE | __GFP_NOMEMALLOC)

 struct page *page =  alloc_pages(GFP_MINFLAGS, MAX_ORDER-1);

  which is not a new interface, and simply removes the max-order page
from the list if at all possible.

The above has the advantage of "just working", and not having any races.

Now, because you don't want to necessarily *entirely* deplete the max
order, I'd suggest that the *one* new interface you add is just a "how
many max-order pages are there" interface. So then you can query
(either before or after getting the max-order page) just how many of
them there were and whether you want to give that page back.

Notice? No need for any page lists or physical addresses. No races. No
complex new functions.

The physical address you can just get from the "struct page" you got.

And if you run out of memory because of getting a page, you get all
the usual "hey, we ran out of memory" responses..

Wouldn't the above be sufficient?

 Linus



Thanks for the elaboration.


I think so, thanks!

Wei, to put it in balloon terms, I think there's one thing we missed: if
you do manage to allocate a page, and you don't have a use for it, then
hey, you can just give it to the host because you know it's free - you
are going to return it to the free list.



I'm not sure if this would be better than Linus' previous suggestion, 
because live migration is expected to be performed without disturbing 
the guest. If we do allocation to grab all the free pages we possibly 
can, then guest applications would be seriously affected. For example, 
the network would become very slow, as the allocation of sk_buffs would 
often trigger OOM during live migration. If live migration happens from 
time to time, and users run memory-related tools like "free -h" on the 
guest, the reported statistics (e.g. the free memory abruptly becoming 
very low due to the balloon allocation) would confuse them.


With the previous suggestion, we only get hints of the free pages (i.e. 
we just report the addresses of free pages to the host without taking 
them off the list).
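
For contrast, a rough sketch of the allocation-based approach quoted
earlier in the thread (Linus' GFP_MINFLAGS suggestion), which the reply
above argues would visibly disturb the guest; report_free_block_to_host()
is a hypothetical helper:

#include <linux/gfp.h>
#include <linux/list.h>
#include <linux/mm.h>

#define GFP_MINFLAGS (__GFP_NORETRY | __GFP_NOWARN | \
		      __GFP_THISNODE | __GFP_NOMEMALLOC)

static void report_by_allocation(struct virtio_balloon *vb)
{
	LIST_HEAD(taken);
	struct page *page, *next;

	/* Actually pull max-order blocks off the free list, so the guest's
	 * free memory visibly shrinks while the hints are being reported.
	 */
	while ((page = alloc_pages(GFP_MINFLAGS, MAX_ORDER - 1))) {
		report_free_block_to_host(vb, page_to_pfn(page), MAX_ORDER - 1);
		list_add(&page->lru, &taken);
	}

	/* Give everything back once the host has the hints. */
	list_for_each_entry_safe(page, next, &taken, lru) {
		list_del(&page->lru);
		__free_pages(page, MAX_ORDER - 1);
	}
}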


Best,
Wei
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v34 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

2018-06-26 Thread Wei Wang

On 06/27/2018 11:58 AM, Michael S. Tsirkin wrote:

On Wed, Jun 27, 2018 at 11:00:05AM +0800, Wei Wang wrote:

On 06/27/2018 10:41 AM, Michael S. Tsirkin wrote:

On Wed, Jun 27, 2018 at 09:24:18AM +0800, Wei Wang wrote:

On 06/26/2018 09:34 PM, Michael S. Tsirkin wrote:

On Tue, Jun 26, 2018 at 08:27:44PM +0800, Wei Wang wrote:

On 06/26/2018 11:56 AM, Michael S. Tsirkin wrote:

On Tue, Jun 26, 2018 at 11:46:35AM +0800, Wei Wang wrote:


+   if (!arrays)
+   return NULL;
+
+   for (i = 0; i < max_array_num; i++) {

So we are getting a ton of memory here just to free it up a bit later.
Why doesn't get_from_free_page_list get the pages from free list for us?
We could also avoid the 1st allocation then - just build a list
of these.

That wouldn't be a good choice for us. If we check how the regular
allocation works, there are many many things we need to consider when pages
are allocated to users.
For example, we need to take care of the nr_free
counter, we need to check the watermark and perform the related actions.
Also the folks working on arch_alloc_page to monitor page allocation
activities would get a surprise if page allocation is allowed to work in
this way.


mm/ code is well positioned to handle all this correctly.

I'm afraid that would be a re-implementation of the alloc functions,

A re-factoring - you can share code. The main difference is locking.


and
that would be much more complex than what we have. I think your idea of
passing a list of pages is better.

Best,
Wei

How much memory is this allocating anyway?


For every 2TB memory that the guest has, we allocate 4MB.

Hmm I guess I'm missing something, I don't see it:


+   max_entries = max_free_page_blocks(ARRAY_ALLOC_ORDER);
+   entries_per_page = PAGE_SIZE / sizeof(__le64);
+   entries_per_array = entries_per_page * (1 << ARRAY_ALLOC_ORDER);
+   max_array_num = max_entries / entries_per_array +
+   !!(max_entries % entries_per_array);

Looks like you always allocate the max number?

Yes. We allocate the max number and then free what's not used.
For example, for a 16TB guest, we allocate four 4MB buffers and pass the 4
buffers to get_from_free_page_list. If it uses 3, then the remaining 4MB
buffer will end up being freed.

For today's guests, max_array_num is usually 1.

Best,
Wei

I see, it's based on total ram pages. It's reasonable but might
get out of sync if memory is onlined quickly. So you want to
detect that there's more free memory than can fit and
retry the reporting.




- AFAIK, memory hotplug isn't expected to happen during live migration 
today. Hypervisors (e.g. QEMU) explicitly forbid this.


- Allocating buffers based on total ram pages already gives some 
headroom for newly plugged memory, in case that ever happens. 
Also, we can think about why people plug in more memory - usually 
because the existing memory isn't enough, which implies that the free 
page list is very likely to be close to empty.


- This method could easily be scaled if people really need more headroom 
for hot-plugged memory. For example, the calculation could be based on 
"X * total_ram_pages", where X is a number passed from the hypervisor.


- This is an optimization feature, and reporting less free memory in 
that rare case doesn't hurt anything.


So I think it is good to start from a fundamental implementation, which 
doesn't confuse people, and complexities can be added when there is a 
real need in the future.


Best,
Wei


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v34 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

2018-06-26 Thread Wei Wang

On 06/27/2018 10:41 AM, Michael S. Tsirkin wrote:

On Wed, Jun 27, 2018 at 09:24:18AM +0800, Wei Wang wrote:

On 06/26/2018 09:34 PM, Michael S. Tsirkin wrote:

On Tue, Jun 26, 2018 at 08:27:44PM +0800, Wei Wang wrote:

On 06/26/2018 11:56 AM, Michael S. Tsirkin wrote:

On Tue, Jun 26, 2018 at 11:46:35AM +0800, Wei Wang wrote:


+   if (!arrays)
+   return NULL;
+
+   for (i = 0; i < max_array_num; i++) {

So we are getting a ton of memory here just to free it up a bit later.
Why doesn't get_from_free_page_list get the pages from free list for us?
We could also avoid the 1st allocation then - just build a list
of these.

That wouldn't be a good choice for us. If we check how the regular
allocation works, there are many many things we need to consider when pages
are allocated to users.
For example, we need to take care of the nr_free
counter, we need to check the watermark and perform the related actions.
Also the folks working on arch_alloc_page to monitor page allocation
activities would get a surprise if page allocation is allowed to work in
this way.


mm/ code is well positioned to handle all this correctly.

I'm afraid that would be a re-implementation of the alloc functions,

A re-factoring - you can share code. The main difference is locking.


and
that would be much more complex than what we have. I think your idea of
passing a list of pages is better.

Best,
Wei

How much memory is this allocating anyway?


For every 2TB memory that the guest has, we allocate 4MB.

Hmm I guess I'm missing something, I don't see it:


+   max_entries = max_free_page_blocks(ARRAY_ALLOC_ORDER);
+   entries_per_page = PAGE_SIZE / sizeof(__le64);
+   entries_per_array = entries_per_page * (1 << ARRAY_ALLOC_ORDER);
+   max_array_num = max_entries / entries_per_array +
+   !!(max_entries % entries_per_array);

Looks like you always allocate the max number?


Yes. We allocate the max number and then free what's not used.
For example, for a 16TB guest, we allocate four 4MB buffers and pass the 4 
buffers to get_from_free_page_list. If it uses 3, then the remaining 
4MB buffer will end up being freed.


For today's guests, max_array_num is usually 1.
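
To make the sizing concrete, a small standalone check of the arithmetic
(assuming 4KB pages and ARRAY_ALLOC_ORDER = MAX_ORDER - 1 = 10, i.e. 4MB
arrays, as in the patch):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	const uint64_t page_size = 4096, order = 10;
	const uint64_t guest_ram = 2ULL << 40;				/* 2TB guest */
	uint64_t totalram_pages = guest_ram / page_size;
	uint64_t max_entries = totalram_pages / (1ULL << order);	/* 524288 */
	uint64_t entries_per_page = page_size / sizeof(uint64_t);	/* 512 */
	uint64_t entries_per_array = entries_per_page * (1ULL << order);
	uint64_t max_array_num = max_entries / entries_per_array +
				 !!(max_entries % entries_per_array);

	/* Prints "arrays needed: 1 (each 4 MB)" for a 2TB guest. */
	printf("arrays needed: %llu (each %llu MB)\n",
	       (unsigned long long)max_array_num,
	       (unsigned long long)(((1ULL << order) * page_size) >> 20));
	return 0;
}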

Best,
Wei





___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v34 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

2018-06-26 Thread Wei Wang

On 06/26/2018 09:34 PM, Michael S. Tsirkin wrote:

On Tue, Jun 26, 2018 at 08:27:44PM +0800, Wei Wang wrote:

On 06/26/2018 11:56 AM, Michael S. Tsirkin wrote:

On Tue, Jun 26, 2018 at 11:46:35AM +0800, Wei Wang wrote:


+   if (!arrays)
+   return NULL;
+
+   for (i = 0; i < max_array_num; i++) {

So we are getting a ton of memory here just to free it up a bit later.
Why doesn't get_from_free_page_list get the pages from free list for us?
We could also avoid the 1st allocation then - just build a list
of these.

That wouldn't be a good choice for us. If we check how the regular
allocation works, there are many many things we need to consider when pages
are allocated to users.
For example, we need to take care of the nr_free
counter, we need to check the watermark and perform the related actions.
Also the folks working on arch_alloc_page to monitor page allocation
activities would get a surprise if page allocation is allowed to work in
this way.


mm/ code is well positioned to handle all this correctly.

I'm afraid that would be a re-implementation of the alloc functions,

A re-factoring - you can share code. The main difference is locking.


and
that would be much more complex than what we have. I think your idea of
passing a list of pages is better.

Best,
Wei

How much memory is this allocating anyway?



For every 2TB of memory that the guest has, we allocate 4MB. This is the 
same in both cases.
For today's guests, usually only one 4MB buffer will be allocated and 
passed to get_from_free_page_list.


Best,
Wei



___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v34 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

2018-06-26 Thread Wei Wang

On 06/26/2018 11:56 AM, Michael S. Tsirkin wrote:

On Tue, Jun 26, 2018 at 11:46:35AM +0800, Wei Wang wrote:








+   if (!arrays)
+   return NULL;
+
+   for (i = 0; i < max_array_num; i++) {

So we are getting a ton of memory here just to free it up a bit later.
Why doesn't get_from_free_page_list get the pages from free list for us?
We could also avoid the 1st allocation then - just build a list
of these.

That wouldn't be a good choice for us. If we check how the regular
allocation works, there are many many things we need to consider when pages
are allocated to users.
For example, we need to take care of the nr_free
counter, we need to check the watermark and perform the related actions.
Also the folks working on arch_alloc_page to monitor page allocation
activities would get a surprise if page allocation is allowed to work in
this way.


mm/ code is well positioned to handle all this correctly.


I'm afraid that would be a re-implementation of the alloc functions, and 
that would be much more complex than what we have. I think your idea of 
passing a list of pages is better.


Best,
Wei






___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v34 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

2018-06-25 Thread Wei Wang

On 06/26/2018 09:37 AM, Michael S. Tsirkin wrote:

On Mon, Jun 25, 2018 at 08:05:10PM +0800, Wei Wang wrote:


@@ -326,17 +353,6 @@ static void stats_handle_request(struct virtio_balloon *vb)
virtqueue_kick(vq);
  }
  
-static void virtballoon_changed(struct virtio_device *vdev)

-{
-   struct virtio_balloon *vb = vdev->priv;
-   unsigned long flags;
-
-   spin_lock_irqsave(&vb->stop_update_lock, flags);
-   if (!vb->stop_update)
-   queue_work(system_freezable_wq, &vb->update_balloon_size_work);
-   spin_unlock_irqrestore(&vb->stop_update_lock, flags);
-}
-
  static inline s64 towards_target(struct virtio_balloon *vb)
  {
s64 target;
@@ -353,6 +369,35 @@ static inline s64 towards_target(struct virtio_balloon *vb)
return target - vb->num_pages;
  }
  
+static void virtballoon_changed(struct virtio_device *vdev)

+{
+   struct virtio_balloon *vb = vdev->priv;
+   unsigned long flags;
+   s64 diff = towards_target(vb);
+
+   if (diff) {
+   spin_lock_irqsave(&vb->stop_update_lock, flags);
+   if (!vb->stop_update)
+   queue_work(system_freezable_wq,
+  &vb->update_balloon_size_work);
+   spin_unlock_irqrestore(&vb->stop_update_lock, flags);
+   }
+
+   if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
+   virtio_cread(vdev, struct virtio_balloon_config,
+free_page_report_cmd_id, &vb->cmd_id_received);
+   if (vb->cmd_id_received !=
+   VIRTIO_BALLOON_FREE_PAGE_REPORT_STOP_ID &&
+   vb->cmd_id_received != vb->cmd_id_active) {
+   spin_lock_irqsave(&vb->stop_update_lock, flags);
+   if (!vb->stop_update)
+   queue_work(vb->balloon_wq,
+  &vb->report_free_page_work);
+   spin_unlock_irqrestore(&vb->stop_update_lock, flags);
+   }
+   }
+}
+
  static void update_balloon_size(struct virtio_balloon *vb)
  {
u32 actual = vb->num_pages;
@@ -425,44 +470,253 @@ static void update_balloon_size_func(struct work_struct 
*work)
queue_work(system_freezable_wq, work);
  }
  
+static void free_page_vq_cb(struct virtqueue *vq)

+{
+   unsigned int len;
+   void *buf;
+   struct virtio_balloon *vb = vq->vdev->priv;
+
+   while (1) {
+   buf = virtqueue_get_buf(vq, &len);
+
+   if (!buf || buf == &vb->cmd_start || buf == &vb->cmd_stop)
+   break;

If there's any buffer after this one we might never get another
callback.


I think every used buffer can get the callback, because the host takes 
from the arrays one by one, and puts each back with a vq notify.





+   free_pages((unsigned long)buf, ARRAY_ALLOC_ORDER);
+   }
+}
+
  static int init_vqs(struct virtio_balloon *vb)
  {
-   struct virtqueue *vqs[3];
-   vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request 
};
-   static const char * const names[] = { "inflate", "deflate", "stats" };
-   int err, nvqs;
+   struct virtqueue *vqs[VIRTIO_BALLOON_VQ_MAX];
+   vq_callback_t *callbacks[VIRTIO_BALLOON_VQ_MAX];
+   const char *names[VIRTIO_BALLOON_VQ_MAX];
+   struct scatterlist sg;
+   int ret;
  
  	/*

-* We expect two virtqueues: inflate and deflate, and
-* optionally stat.
+* Inflateq and deflateq are used unconditionally. The names[]
+* will be NULL if the related feature is not enabled, which will
+* cause no allocation for the corresponding virtqueue in find_vqs.
 */
-   nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
-   err = virtio_find_vqs(vb->vdev, nvqs, vqs, callbacks, names, NULL);
-   if (err)
-   return err;
+   callbacks[VIRTIO_BALLOON_VQ_INFLATE] = balloon_ack;
+   names[VIRTIO_BALLOON_VQ_INFLATE] = "inflate";
+   callbacks[VIRTIO_BALLOON_VQ_DEFLATE] = balloon_ack;
+   names[VIRTIO_BALLOON_VQ_DEFLATE] = "deflate";
+   names[VIRTIO_BALLOON_VQ_STATS] = NULL;
+   names[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
  
-	vb->inflate_vq = vqs[0];

-   vb->deflate_vq = vqs[1];
if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
-   struct scatterlist sg;
-   unsigned int num_stats;
-   vb->stats_vq = vqs[2];
+   names[VIRTIO_BALLOON_VQ_STATS] = "stats";
+   callbacks[VIRTIO_BALLOON_VQ_STATS] = stats_request;
+   }
  
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {

+   names[VIRTIO_BALLOON_VQ_FREE_PAGE] = "free_page_vq";
+   callbacks[VIRTIO_BALLOON_

[PATCH v34 4/4] virtio-balloon: VIRTIO_BALLOON_F_PAGE_POISON

2018-06-25 Thread Wei Wang
The VIRTIO_BALLOON_F_PAGE_POISON feature bit is used to indicate if the
guest is using page poisoning. Guest writes to the poison_val config
field to tell host about the page poisoning value that is in use.

Suggested-by: Michael S. Tsirkin 
Signed-off-by: Wei Wang 
Cc: Michael S. Tsirkin 
Cc: Michal Hocko 
Cc: Andrew Morton 
---
 drivers/virtio/virtio_balloon.c | 10 ++
 include/uapi/linux/virtio_balloon.h |  3 +++
 2 files changed, 13 insertions(+)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index d05f0ba..c834ef1 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -801,6 +801,7 @@ static struct file_system_type balloon_fs = {
 static int virtballoon_probe(struct virtio_device *vdev)
 {
struct virtio_balloon *vb;
+   __u32 poison_val;
int err;
 
if (!vdev->config->get) {
@@ -840,6 +841,11 @@ static int virtballoon_probe(struct virtio_device *vdev)
INIT_WORK(>report_free_page_work, report_free_page_func);
vb->cmd_id_received = VIRTIO_BALLOON_FREE_PAGE_REPORT_STOP_ID;
vb->cmd_id_active = VIRTIO_BALLOON_FREE_PAGE_REPORT_STOP_ID;
+   if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_PAGE_POISON)) {
+   memset(&poison_val, PAGE_POISON, sizeof(poison_val));
+   virtio_cwrite(vb->vdev, struct virtio_balloon_config,
+ poison_val, &poison_val);
+   }
}
 
vb->nb.notifier_call = virtballoon_oom_notify;
@@ -958,6 +964,9 @@ static int virtballoon_restore(struct virtio_device *vdev)
 
 static int virtballoon_validate(struct virtio_device *vdev)
 {
+   if (!page_poisoning_enabled())
+   __virtio_clear_bit(vdev, VIRTIO_BALLOON_F_PAGE_POISON);
+
__virtio_clear_bit(vdev, VIRTIO_F_IOMMU_PLATFORM);
return 0;
 }
@@ -967,6 +976,7 @@ static unsigned int features[] = {
VIRTIO_BALLOON_F_STATS_VQ,
VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
VIRTIO_BALLOON_F_FREE_PAGE_HINT,
+   VIRTIO_BALLOON_F_PAGE_POISON,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h 
b/include/uapi/linux/virtio_balloon.h
index 860456f..12b5a4f 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -35,6 +35,7 @@
 #define VIRTIO_BALLOON_F_STATS_VQ  1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM2 /* Deflate balloon on OOM */
 #define VIRTIO_BALLOON_F_FREE_PAGE_HINT3 /* VQ to report free pages */
+#define VIRTIO_BALLOON_F_PAGE_POISON   4 /* Guest is using page poisoning */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
@@ -47,6 +48,8 @@ struct virtio_balloon_config {
__u32 actual;
/* Free page report command id, readonly by guest */
__u32 free_page_report_cmd_id;
+   /* Stores PAGE_POISON if page poisoning is in use */
+   __u32 poison_val;
 };
 
 struct virtio_balloon_free_page_hints_cmd {
-- 
2.7.4

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH v34 3/4] mm/page_poison: expose page_poisoning_enabled to kernel modules

2018-06-25 Thread Wei Wang
In some usages, e.g. virtio-balloon, a kernel module needs to know if
page poisoning is in use. This patch exposes the page_poisoning_enabled
function to kernel modules.

Signed-off-by: Wei Wang 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Michael S. Tsirkin 
Acked-by: Andrew Morton 
---
 mm/page_poison.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/mm/page_poison.c b/mm/page_poison.c
index aa2b3d3..830f604 100644
--- a/mm/page_poison.c
+++ b/mm/page_poison.c
@@ -17,6 +17,11 @@ static int __init early_page_poison_param(char *buf)
 }
 early_param("page_poison", early_page_poison_param);
 
+/**
+ * page_poisoning_enabled - check if page poisoning is enabled
+ *
+ * Return true if page poisoning is enabled, or false if not.
+ */
 bool page_poisoning_enabled(void)
 {
/*
@@ -29,6 +34,7 @@ bool page_poisoning_enabled(void)
(!IS_ENABLED(CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC) &&
debug_pagealloc_enabled()));
 }
+EXPORT_SYMBOL_GPL(page_poisoning_enabled);
 
 static void poison_page(struct page *page)
 {
-- 
2.7.4

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH v34 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

2018-06-25 Thread Wei Wang
Negotiation of the VIRTIO_BALLOON_F_FREE_PAGE_HINT feature indicates the
support of reporting hints of guest free pages to host via virtio-balloon.

Host requests the guest to report free page hints by sending a new cmd id
to the guest via the free_page_report_cmd_id configuration register.

As the first step here, virtio-balloon only reports free page hints from
the max order (i.e. 10) free page list to host. This has generated
similarly good results to reporting all free page hints during our tests.

When the guest starts to report, it first sends a start cmd to host via
the free page vq, which acks to host the cmd id received, and tells it the
hint size (e.g. 4MB each on x86). When the guest finishes the reporting,
a stop cmd is sent to host via the vq.

TODO:
- support reporting free page hints from smaller order free page lists
  when there is a need/request from users.

Signed-off-by: Wei Wang 
Signed-off-by: Liang Li 
Cc: Michael S. Tsirkin 
Cc: Michal Hocko 
Cc: Andrew Morton 
---
 drivers/virtio/virtio_balloon.c | 347 
 include/uapi/linux/virtio_balloon.h |  11 ++
 2 files changed, 322 insertions(+), 36 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 6b237e3..d05f0ba 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -43,6 +43,11 @@
 #define OOM_VBALLOON_DEFAULT_PAGES 256
 #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
 
+/* The order used to allocate an array to load free page hints */
+#define ARRAY_ALLOC_ORDER (MAX_ORDER - 1)
+/* The size of an array in bytes */
+#define ARRAY_ALLOC_SIZE ((1 << ARRAY_ALLOC_ORDER) << PAGE_SHIFT)
+
 static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
 module_param(oom_pages, int, S_IRUSR | S_IWUSR);
 MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
@@ -51,9 +56,22 @@ MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
 static struct vfsmount *balloon_mnt;
 #endif
 
+enum virtio_balloon_vq {
+   VIRTIO_BALLOON_VQ_INFLATE,
+   VIRTIO_BALLOON_VQ_DEFLATE,
+   VIRTIO_BALLOON_VQ_STATS,
+   VIRTIO_BALLOON_VQ_FREE_PAGE,
+   VIRTIO_BALLOON_VQ_MAX
+};
+
 struct virtio_balloon {
struct virtio_device *vdev;
-   struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
+   struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
+
+   /* Balloon's own wq for cpu-intensive work items */
+   struct workqueue_struct *balloon_wq;
+   /* The free page reporting work item submitted to the balloon wq */
+   struct work_struct report_free_page_work;
 
/* The balloon servicing is delegated to a freezable workqueue. */
struct work_struct update_balloon_stats_work;
@@ -63,6 +81,15 @@ struct virtio_balloon {
spinlock_t stop_update_lock;
bool stop_update;
 
+   /* Command buffers to start and stop the reporting of hints to host */
+   struct virtio_balloon_free_page_hints_cmd cmd_start;
+   struct virtio_balloon_free_page_hints_cmd cmd_stop;
+
+   /* The cmd id received from host */
+   uint32_t cmd_id_received;
+   /* The cmd id that is actively in use */
+   uint32_t cmd_id_active;
+
/* Waiting for host to ack the pages we released. */
wait_queue_head_t acked;
 
@@ -326,17 +353,6 @@ static void stats_handle_request(struct virtio_balloon *vb)
virtqueue_kick(vq);
 }
 
-static void virtballoon_changed(struct virtio_device *vdev)
-{
-   struct virtio_balloon *vb = vdev->priv;
-   unsigned long flags;
-
-   spin_lock_irqsave(&vb->stop_update_lock, flags);
-   if (!vb->stop_update)
-   queue_work(system_freezable_wq, &vb->update_balloon_size_work);
-   spin_unlock_irqrestore(&vb->stop_update_lock, flags);
-}
-
 static inline s64 towards_target(struct virtio_balloon *vb)
 {
s64 target;
@@ -353,6 +369,35 @@ static inline s64 towards_target(struct virtio_balloon *vb)
return target - vb->num_pages;
 }
 
+static void virtballoon_changed(struct virtio_device *vdev)
+{
+   struct virtio_balloon *vb = vdev->priv;
+   unsigned long flags;
+   s64 diff = towards_target(vb);
+
+   if (diff) {
+   spin_lock_irqsave(&vb->stop_update_lock, flags);
+   if (!vb->stop_update)
+   queue_work(system_freezable_wq,
+  &vb->update_balloon_size_work);
+   spin_unlock_irqrestore(&vb->stop_update_lock, flags);
+   }
+
+   if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
+   virtio_cread(vdev, struct virtio_balloon_config,
+free_page_report_cmd_id, &vb->cmd_id_received);
+   if (vb->cmd_id_received !=
+   VIRTIO_BALLOON_FREE_PAGE_REPORT_STOP_ID &&
+   vb->cmd_id_received != vb->cmd_id_active) {
+   spin

[PATCH v34 1/4] mm: support to get hints of free page blocks

2018-06-25 Thread Wei Wang
This patch adds support to get free page blocks from a free page list.
The physical addresses of the blocks are stored to the arrays passed
from the caller. The obtained free page blocks are hints about free pages,
because there is no guarantee that they are still on the free page list
after the function returns.

One use example of this patch is to accelerate live migration by skipping
the transfer of free pages reported from the guest. A popular method used
by the hypervisor to track which part of memory is written during live
migration is to write-protect all the guest memory. So, those pages that
are hinted as free pages but are written after this function returns will
be captured by the hypervisor, and they will be added to the next round of
memory transfer.

Suggested-by: Linus Torvalds 
Signed-off-by: Wei Wang 
Signed-off-by: Liang Li 
Cc: Michal Hocko 
Cc: Andrew Morton 
Cc: Michael S. Tsirkin 
Cc: Linus Torvalds 
---
 include/linux/mm.h |  3 ++
 mm/page_alloc.c| 82 ++
 2 files changed, 85 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a0fbb9f..1b51d43 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2007,6 +2007,9 @@ extern void free_area_init(unsigned long * zones_size);
 extern void free_area_init_node(int nid, unsigned long * zones_size,
unsigned long zone_start_pfn, unsigned long *zholes_size);
 extern void free_initmem(void);
+uint32_t max_free_page_blocks(int order);
+uint32_t get_from_free_page_list(int order, uint32_t num, __le64 *buf[],
+uint32_t size);
 
 /*
  * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1521100..2e462ab 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5042,6 +5042,88 @@ void show_free_areas(unsigned int filter, nodemask_t 
*nodemask)
 
show_swap_cache_info();
 }
+/**
+ * max_free_page_blocks - estimate the max number of free page blocks
+ * @order: the order of the free page blocks to estimate
+ *
+ * This function gives a rough estimation of the possible maximum number of
+ * free page blocks a free list may have. The estimation works on an assumption
+ * that all the system pages are on that list.
+ *
+ * Context: Any context.
+ *
+ * Return: The largest number of free page blocks that the free list can have.
+ */
+uint32_t max_free_page_blocks(int order)
+{
+   return totalram_pages / (1 << order);
+}
+EXPORT_SYMBOL_GPL(max_free_page_blocks);
+
+/**
+ * get_from_free_page_list - get hints of free pages from a free page list
+ * @order: the order of the free page list to check
+ * @num: the number of arrays
+ * @bufs: the arrays to store the physical addresses of the free page blocks
+ * @size: the number of entries each array has
+ *
+ * This function offers hints about free pages. The addresses of free page
+ * blocks are stored to the arrays passed from the caller. There is no
+ * guarantee that the obtained free pages are still on the free page list
+ * after the function returns. pfn_to_page on the obtained free pages is
+ * strongly discouraged and if there is an absolute need for that, make sure
+ * to contact MM people to discuss potential problems.
+ *
+ * The addresses are currently stored to an array in little endian. This
+ * avoids the overhead of converting endianness by the caller who needs data
+ * in the little endian format. Big endian support can be added on demand in
+ * the future. The maximum number of free page blocks that can be obtained is
+ * limited to the size of arrays.
+ *
+ * Context: Process context.
+ *
+ * Return: The number of free page blocks obtained from the free page list.
+ */
+uint32_t get_from_free_page_list(int order, uint32_t num, __le64 *bufs[],
+uint32_t size)
+{
+   struct zone *zone;
+   enum migratetype mt;
+   struct page *page;
+   struct list_head *list;
+   unsigned long addr;
+   uint32_t array_index = 0, entry_index = 0;
+   __le64 *array = bufs[array_index];
+
+   /* Validity check */
+   if (order < 0 || order >= MAX_ORDER)
+   return 0;
+
+   for_each_populated_zone(zone) {
+   spin_lock_irq(&zone->lock);
+   for (mt = 0; mt < MIGRATE_TYPES; mt++) {
+   list = &zone->free_area[order].free_list[mt];
+   list_for_each_entry(page, list, lru) {
+   addr = page_to_pfn(page) << PAGE_SHIFT;
+   /* This array is full, so use the next one */
+   if (entry_index == size) {
+   /* All the arrays are consumed */
+   if (++array_index == num) {
+   spin_unlock_irq(&zone->lock);
+  

[PATCH v34 0/4] Virtio-balloon: support free page reporting

2018-06-25 Thread Wei Wang
ting the reporting
- report_free_page_func: detach all the used buffers after sending
  the stop cmd id. This avoids leaving the detaching burden (i.e.
  overhead) to the next cmd id. Detaching here isn't considered
  overhead since the stop cmd id has been sent, and host has already
  moved forward.
v24->v25:
- mm: change walk_free_mem_block to return 0 (instead of true) on
  completing the report, and return a non-zero value from the
  callback, which stops the reporting.
- virtio-balloon:
- use enum instead of define for VIRTIO_BALLOON_VQ_INFLATE etc.
- avoid __virtio_clear_bit when bailing out;
- a new method to avoid reporting the some cmd id to host twice
- destroy_workqueue can cancel free page work when the feature is
  negotiated;
- fail probe when the free page vq size is less than 2.
v23->v24:
- change feature name VIRTIO_BALLOON_F_FREE_PAGE_VQ to
  VIRTIO_BALLOON_F_FREE_PAGE_HINT
- kick when vq->num_free < half full, instead of "= half full"
- replace BUG_ON with bailing out
- check vb->balloon_wq in probe(), if null, bail out
- add a new feature bit for page poisoning
- solve the corner case that one cmd id being sent to host twice
v22->v23:
- change to kick the device when the vq is half-way full;
- open-code batch_free_page_sg into add_one_sg;
- change cmd_id from "uint32_t" to "__virtio32";
- reserve one entry in the vq for the driver to send cmd_id, instead
  of busywaiting for an available entry;
- add "stop_update" check before queue_work for prudence purpose for
  now, will have a separate patch to discuss this flag check later;
- init_vqs: change to put some variables on stack to have simpler
  implementation;
- add destroy_workqueue(vb->balloon_wq);
v21->v22:
- add_one_sg: some code and comment re-arrangement
- send_cmd_id: handle a cornercase

For previous ChangeLog, please reference
https://lwn.net/Articles/743660/



Wei Wang (4):
  mm: support to get hints of free page blocks
  virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
  mm/page_poison: expose page_poisoning_enabled to kernel modules
  virtio-balloon: VIRTIO_BALLOON_F_PAGE_POISON

 drivers/virtio/virtio_balloon.c | 357 
 include/linux/mm.h  |   3 +
 include/uapi/linux/virtio_balloon.h |  14 ++
 mm/page_alloc.c |  82 +
 mm/page_poison.c|   6 +
 5 files changed, 426 insertions(+), 36 deletions(-)

-- 
2.7.4

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [virtio-dev] Re: [PATCH v33 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

2018-06-19 Thread Wei Wang

On 06/19/2018 11:05 AM, Michael S. Tsirkin wrote:

On Tue, Jun 19, 2018 at 01:06:48AM +, Wang, Wei W wrote:

On Monday, June 18, 2018 10:29 AM, Michael S. Tsirkin wrote:

On Sat, Jun 16, 2018 at 01:09:44AM +, Wang, Wei W wrote:

Not necessarily, I think. We have min(4m_page_blocks / 512, 1024) above,

so the maximum memory that can be reported is 2TB. For larger guests, e.g.
4TB, the optimization can still offer 2TB free memory (better than no
optimization).

Maybe it's better, maybe it isn't. It certainly muddies the waters even more.
I'd rather we had a better plan. From that POV I like what Matthew Wilcox
suggested for this which is to steal the necessary # of entries off the list.

Actually what Matthew suggested doesn't make a difference here. That method 
always steals the first free page blocks, and sure, it can be changed to take 
more. But all of this can be achieved via kmalloc

I'd do get_user_pages really. You don't want pages split, etc.



by the caller which is more prudent and makes the code more straightforward. I 
think we don't need to take that risk unless the MM folks strongly endorse that 
approach.

The max size of the kmalloc-ed memory is 4MB, which gives us the limitation 
that the max free memory to report is 2TB. Back to the motivation of this work, 
the cloud guys want to use this optimization to accelerate their guest live 
migration. 2TB guests are not common in today's clouds. When huge guests become 
common in the future, we can easily tweak this API to fill hints into scattered 
buffer (e.g. several 4MB arrays passed to this API) instead of one as in this 
version.

This limitation doesn't cause any issue from a functionality perspective. For 
an extreme case like a 100TB guest live migration, which is theoretically 
possible today, this optimization helps skip 2TB of its free memory. The 
result is that it may reduce live migration time by only 2%, but that is still 
better than not skipping the 2TB (if the feature isn't used).

Not clearly better, no, since you are slowing the guest.


Not really. Live migration slows down the guest itself. It seems that 
the guest spends a little extra time reporting free pages, but in return 
the live migration time gets reduced a lot, which makes the guest suffer 
less from live migration. (There was no drop in workload performance 
when using the optimization in the tests.)








So, for the first release of this feature, I think it is better to have the 
simpler and more straightforward solution as we have now, and clearly document 
why it can report up to 2TB free memory.

No one has the time to read documentation about how an internal flag
within a device works. Come on, getting two pages isn't much harder
than a single one.


  

If that doesn't fly, we can allocate out of the loop and just retry with more
pages.


On the other hand, large guests are large mostly because they need
to use a lot of memory. In that case, they usually won't have that much free
memory to report.

And following this logic small guests don't have a lot of memory to report at
all.
Could you remind me why are we considering this optimization then?

If there is a 3TB guest, it is 3TB not 2TB mostly because it would need to use 
e.g. 2.5TB memory from time to time. In the worst case, it only has 0.5TB free 
memory to report, but reporting 0.5TB with this optimization is better than no 
optimization. (and the current 2TB limitation isn't a limitation for the 3TB 
guest in this case)

I'd rather not spend time writing up random limitations.


This is not a random limitation; it will be clearer once you see the code. 
Also, I'm not sure how get_user_pages could be used in our case, or what 
you meant by "getting two pages". I'll post a new version, and we can 
discuss based on the code.



Best,
Wei
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH v33 4/4] virtio-balloon: VIRTIO_BALLOON_F_PAGE_POISON

2018-06-14 Thread Wei Wang
The VIRTIO_BALLOON_F_PAGE_POISON feature bit is used to indicate if the
guest is using page poisoning. Guest writes to the poison_val config
field to tell host about the page poisoning value that is in use.

Suggested-by: Michael S. Tsirkin 
Signed-off-by: Wei Wang 
Cc: Michael S. Tsirkin 
Cc: Michal Hocko 
Cc: Andrew Morton 
---
 drivers/virtio/virtio_balloon.c | 10 ++
 include/uapi/linux/virtio_balloon.h |  3 +++
 2 files changed, 13 insertions(+)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 582a03b..c59bb380 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -634,6 +634,7 @@ static struct file_system_type balloon_fs = {
 static int virtballoon_probe(struct virtio_device *vdev)
 {
struct virtio_balloon *vb;
+   __u32 poison_val;
int err;
 
if (!vdev->config->get) {
@@ -671,6 +672,11 @@ static int virtballoon_probe(struct virtio_device *vdev)
goto out_del_vqs;
}
INIT_WORK(>report_free_page_work, report_free_page_func);
+   if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_PAGE_POISON)) {
+   memset(&poison_val, PAGE_POISON, sizeof(poison_val));
+   virtio_cwrite(vb->vdev, struct virtio_balloon_config,
+ poison_val, &poison_val);
+   }
vb->hints = kmalloc(FREE_PAGE_HINT_MEM_SIZE, GFP_KERNEL);
if (!vb->hints) {
err = -ENOMEM;
@@ -796,6 +802,9 @@ static int virtballoon_restore(struct virtio_device *vdev)
 
 static int virtballoon_validate(struct virtio_device *vdev)
 {
+   if (!page_poisoning_enabled())
+   __virtio_clear_bit(vdev, VIRTIO_BALLOON_F_PAGE_POISON);
+
__virtio_clear_bit(vdev, VIRTIO_F_IOMMU_PLATFORM);
return 0;
 }
@@ -805,6 +814,7 @@ static unsigned int features[] = {
VIRTIO_BALLOON_F_STATS_VQ,
VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
VIRTIO_BALLOON_F_FREE_PAGE_HINT,
+   VIRTIO_BALLOON_F_PAGE_POISON,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h 
b/include/uapi/linux/virtio_balloon.h
index 99b8416..f3b6191 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -35,6 +35,7 @@
 #define VIRTIO_BALLOON_F_STATS_VQ  1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM2 /* Deflate balloon on OOM */
 #define VIRTIO_BALLOON_F_FREE_PAGE_HINT3 /* VQ to report free pages */
+#define VIRTIO_BALLOON_F_PAGE_POISON   4 /* Guest is using page poisoning */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
@@ -47,6 +48,8 @@ struct virtio_balloon_config {
__u32 actual;
/* Command sent from host */
__u32 host_cmd;
+   /* Stores PAGE_POISON if page poisoning is in use */
+   __u32 poison_val;
 };
 
 struct virtio_balloon_free_page_hints {
-- 
2.7.4

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH v33 3/4] mm/page_poison: expose page_poisoning_enabled to kernel modules

2018-06-14 Thread Wei Wang
In some usages, e.g. virtio-balloon, a kernel module needs to know if
page poisoning is in use. This patch exposes the page_poisoning_enabled
function to kernel modules.

Signed-off-by: Wei Wang 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Michael S. Tsirkin 
Acked-by: Andrew Morton 
---
 mm/page_poison.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/mm/page_poison.c b/mm/page_poison.c
index aa2b3d3..830f604 100644
--- a/mm/page_poison.c
+++ b/mm/page_poison.c
@@ -17,6 +17,11 @@ static int __init early_page_poison_param(char *buf)
 }
 early_param("page_poison", early_page_poison_param);
 
+/**
+ * page_poisoning_enabled - check if page poisoning is enabled
+ *
+ * Return true if page poisoning is enabled, or false if not.
+ */
 bool page_poisoning_enabled(void)
 {
/*
@@ -29,6 +34,7 @@ bool page_poisoning_enabled(void)
(!IS_ENABLED(CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC) &&
debug_pagealloc_enabled()));
 }
+EXPORT_SYMBOL_GPL(page_poisoning_enabled);
 
 static void poison_page(struct page *page)
 {
-- 
2.7.4

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH v33 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

2018-06-14 Thread Wei Wang
Negotiation of the VIRTIO_BALLOON_F_FREE_PAGE_HINT feature indicates the
support of reporting hints of guest free pages to host via virtio-balloon.

Host requests the guest to report free page hints by sending a command
to the guest via setting the VIRTIO_BALLOON_HOST_CMD_FREE_PAGE_HINT bit
of the host_cmd config register.

As the first step here, virtio-balloon only reports free page hints from
the max order (10) free page list to host. This has generated similarly
good results to reporting all free page hints during our tests.

TODO:
- support reporting free page hints from smaller order free page lists
  when there is a need/request from users.

Signed-off-by: Wei Wang 
Signed-off-by: Liang Li 
Cc: Michael S. Tsirkin 
Cc: Michal Hocko 
Cc: Andrew Morton 
---
 drivers/virtio/virtio_balloon.c | 187 +---
 include/uapi/linux/virtio_balloon.h |  13 +++
 2 files changed, 163 insertions(+), 37 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 6b237e3..582a03b 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -43,6 +43,9 @@
 #define OOM_VBALLOON_DEFAULT_PAGES 256
 #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
 
+/* The size of memory in bytes allocated for reporting free page hints */
+#define FREE_PAGE_HINT_MEM_SIZE (PAGE_SIZE * 16)
+
 static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
 module_param(oom_pages, int, S_IRUSR | S_IWUSR);
 MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
@@ -51,9 +54,22 @@ MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
 static struct vfsmount *balloon_mnt;
 #endif
 
+enum virtio_balloon_vq {
+   VIRTIO_BALLOON_VQ_INFLATE,
+   VIRTIO_BALLOON_VQ_DEFLATE,
+   VIRTIO_BALLOON_VQ_STATS,
+   VIRTIO_BALLOON_VQ_FREE_PAGE,
+   VIRTIO_BALLOON_VQ_MAX
+};
+
 struct virtio_balloon {
struct virtio_device *vdev;
-   struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
+   struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
+
+   /* Balloon's own wq for cpu-intensive work items */
+   struct workqueue_struct *balloon_wq;
+   /* The free page reporting work item submitted to the balloon wq */
+   struct work_struct report_free_page_work;
 
/* The balloon servicing is delegated to a freezable workqueue. */
struct work_struct update_balloon_stats_work;
@@ -63,6 +79,8 @@ struct virtio_balloon {
spinlock_t stop_update_lock;
bool stop_update;
 
+   struct virtio_balloon_free_page_hints *hints;
+
/* Waiting for host to ack the pages we released. */
wait_queue_head_t acked;
 
@@ -326,17 +344,6 @@ static void stats_handle_request(struct virtio_balloon *vb)
virtqueue_kick(vq);
 }
 
-static void virtballoon_changed(struct virtio_device *vdev)
-{
-   struct virtio_balloon *vb = vdev->priv;
-   unsigned long flags;
-
-   spin_lock_irqsave(&vb->stop_update_lock, flags);
-   if (!vb->stop_update)
-   queue_work(system_freezable_wq, &vb->update_balloon_size_work);
-   spin_unlock_irqrestore(&vb->stop_update_lock, flags);
-}
-
 static inline s64 towards_target(struct virtio_balloon *vb)
 {
s64 target;
@@ -353,6 +360,32 @@ static inline s64 towards_target(struct virtio_balloon *vb)
return target - vb->num_pages;
 }
 
+static void virtballoon_changed(struct virtio_device *vdev)
+{
+   struct virtio_balloon *vb = vdev->priv;
+   unsigned long flags;
+   uint32_t host_cmd;
+   s64 diff = towards_target(vb);
+
+   if (diff) {
+   spin_lock_irqsave(&vb->stop_update_lock, flags);
+   if (!vb->stop_update)
+   queue_work(system_freezable_wq,
+  &vb->update_balloon_size_work);
+   spin_unlock_irqrestore(&vb->stop_update_lock, flags);
+   }
+
+   virtio_cread(vdev, struct virtio_balloon_config, host_cmd, &host_cmd);
+   if ((host_cmd & VIRTIO_BALLOON_HOST_CMD_FREE_PAGE_HINT) &&
+virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
+   spin_lock_irqsave(&vb->stop_update_lock, flags);
+   if (!vb->stop_update)
+   queue_work(vb->balloon_wq,
+  &vb->report_free_page_work);
+   spin_unlock_irqrestore(&vb->stop_update_lock, flags);
+   }
+}
+
 static void update_balloon_size(struct virtio_balloon *vb)
 {
u32 actual = vb->num_pages;
@@ -425,44 +458,98 @@ static void update_balloon_size_func(struct work_struct 
*work)
queue_work(system_freezable_wq, work);
 }
 
+static void free_page_vq_cb(struct virtqueue *vq)
+{
+   unsigned int unused;
+
+   while (virtqueue_get_buf(vq, &unused))
+   ;
+}
+
 static int init_vqs(struct virtio_balloon *vb)
 {
-   struct virtqueue *vqs[3];
-   vq_callback_t *callbacks[] = { 

[PATCH v33 1/4] mm: add a function to get free page blocks

2018-06-14 Thread Wei Wang
This patch adds a function to get free pages blocks from a free page
list. The obtained free page blocks are hints about free pages, because
there is no guarantee that they are still on the free page list after
the function returns.

One use example of this patch is to accelerate live migration by skipping
the transfer of free pages reported from the guest. A popular method used
by the hypervisor to track which part of memory is written during live
migration is to write-protect all the guest memory. So, those pages that
are hinted as free pages but are written after this function returns will
be captured by the hypervisor, and they will be added to the next round of
memory transfer.

Suggested-by: Linus Torvalds 
Signed-off-by: Wei Wang 
Signed-off-by: Liang Li 
Cc: Michal Hocko 
Cc: Andrew Morton 
Cc: Michael S. Tsirkin 
Cc: Linus Torvalds 
---
 include/linux/mm.h |  1 +
 mm/page_alloc.c| 52 
 2 files changed, 53 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0e49388..c58b4e5 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2002,6 +2002,7 @@ extern void free_area_init(unsigned long * zones_size);
 extern void free_area_init_node(int nid, unsigned long * zones_size,
unsigned long zone_start_pfn, unsigned long *zholes_size);
 extern void free_initmem(void);
+uint32_t get_from_free_page_list(int order, __le64 buf[], uint32_t size);
 
 /*
  * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 07b3c23..7c816d9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5043,6 +5043,58 @@ void show_free_areas(unsigned int filter, nodemask_t 
*nodemask)
show_swap_cache_info();
 }
 
+/**
+ * get_from_free_page_list - get free page blocks from a free page list
+ * @order: the order of the free page list to check
+ * @buf: the array to store the physical addresses of the free page blocks
+ * @size: the array size
+ *
+ * This function offers hints about free pages. There is no guarantee that
+ * the obtained free pages are still on the free page list after the function
+ * returns. pfn_to_page on the obtained free pages is strongly discouraged
+ * and if there is an absolute need for that, make sure to contact MM people
+ * to discuss potential problems.
+ *
+ * The addresses are currently stored to the array in little endian. This
+ * avoids the overhead of converting endianness by the caller who needs data
+ * in the little endian format. Big endian support can be added on demand in
+ * the future.
+ *
+ * Return the number of free page blocks obtained from the free page list.
+ * The maximum number of free page blocks that can be obtained is limited to
+ * the caller's array size.
+ */
+uint32_t get_from_free_page_list(int order, __le64 buf[], uint32_t size)
+{
+   struct zone *zone;
+   enum migratetype mt;
+   struct page *page;
+   struct list_head *list;
+   unsigned long addr, flags;
+   uint32_t index = 0;
+
+   for_each_populated_zone(zone) {
+   spin_lock_irqsave(&zone->lock, flags);
+   for (mt = 0; mt < MIGRATE_TYPES; mt++) {
+   list = &zone->free_area[order].free_list[mt];
+   list_for_each_entry(page, list, lru) {
+   addr = page_to_pfn(page) << PAGE_SHIFT;
+   if (likely(index < size)) {
+   buf[index++] = cpu_to_le64(addr);
+   } else {
+   spin_unlock_irqrestore(&zone->lock,
+  flags);
+   return index;
+   }
+   }
+   }
+   spin_unlock_irqrestore(&zone->lock, flags);
+   }
+
+   return index;
+}
+EXPORT_SYMBOL_GPL(get_from_free_page_list);
+
 static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
 {
zoneref->zone = zone;
-- 
2.7.4

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH v33 0/4] Virtio-balloon: support free page reporting

2018-06-14 Thread Wei Wang
change feature name VIRTIO_BALLOON_F_FREE_PAGE_VQ to
  VIRTIO_BALLOON_F_FREE_PAGE_HINT
- kick when vq->num_free < half full, instead of "= half full"
- replace BUG_ON with bailing out
- check vb->balloon_wq in probe(), if null, bail out
- add a new feature bit for page poisoning
- solve the corner case that one cmd id being sent to host twice
v22->v23:
- change to kick the device when the vq is half-way full;
- open-code batch_free_page_sg into add_one_sg;
- change cmd_id from "uint32_t" to "__virtio32";
- reserve one entry in the vq for the driver to send cmd_id, instead
  of busywaiting for an available entry;
- add "stop_update" check before queue_work for prudence purpose for
  now, will have a separate patch to discuss this flag check later;
- init_vqs: change to put some variables on stack to have simpler
  implementation;
- add destroy_workqueue(vb->balloon_wq);
v21->v22:
- add_one_sg: some code and comment re-arrangement
- send_cmd_id: handle a cornercase

For previous ChangeLog, please reference
https://lwn.net/Articles/743660/

Wei Wang (4):
  mm: add a function to get free page blocks
  virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
  mm/page_poison: expose page_poisoning_enabled to kernel modules
  virtio-balloon: VIRTIO_BALLOON_F_PAGE_POISON

 drivers/virtio/virtio_balloon.c | 197 +---
 include/linux/mm.h  |   1 +
 include/uapi/linux/virtio_balloon.h |  16 +++
 mm/page_alloc.c |  52 ++
 mm/page_poison.c|   6 ++
 5 files changed, 235 insertions(+), 37 deletions(-)

-- 
2.7.4

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PULL] vhost: cleanups and fixes

2018-06-14 Thread Wei Wang

On 06/14/2018 11:01 PM, Nitesh Narayan Lal wrote:

Hi Wei,


On 06/12/2018 07:05 AM, Wei Wang wrote:

On 06/12/2018 09:59 AM, Linus Torvalds wrote:

On Mon, Jun 11, 2018 at 6:36 PM Michael S. Tsirkin 
wrote:

Maybe it will help to have GFP_NONE which will make any allocation
fail if attempted. Linus, would this address your comment?

It would definitely have helped me initially overlook that call chain.

But then when I started looking at the whole dma_map_page() thing, it
just raised my hackles again.

I would seriously suggest having a much simpler version for the "no
allocation, no dma mapping" case, so that it's *obvious* that that
never happens.

So instead of having virtio_balloon_send_free_pages() call a really
generic complex chain of functions that in _some_ cases can do memory
allocation, why isn't there a short-circuited "vitruque_add_datum()"
that is guaranteed to never do anything like that?

Honestly, I look at "add_one_sg()" and it really doesn't make me
happy. It looks hacky as hell. If I read the code right, you're really
trying to just queue up a simple tuple of , except you encode
it as a page pointer in order to play games with the SG logic, and
then you map that to the ring, except in this case it's all a fake
ring that just adds the cpu-physical address instead.

And to figure that out, it's like five layers of indirection through
different helper functions that *can* do more generic things but in
this case don't.

And you do all of this from a core VM callback function with some
_really_ core VM locks held.

That makes no sense to me.

How about this:

   - get rid of all that code

   - make the core VM callback save the "these are the free memory
regions" in a fixed and limited array. One that DOES JUST THAT. No
crazy "SG IO dma-mapping function crap". Just a plain array of a fixed
size, pre-allocated for that virtio instance.

   - make it obvious that what you do in that sequence is ten
instructions and no allocations ("Look ma, I wrote a value to an array
and incremented the array index, and I'M DONE")

   - then in that workqueue entry that you start *anyway*, you empty the
array and do all the crazy virtio stuff.

In fact, while at it, just simplify the VM interface too. Instead of
traversing a random number of buddy lists, just traverse *one* - the
top-level one. Are you seriously ever going to shrink or mark
read-only anything *but* something big enough to be in the maximum
order?

MAX_ORDER is what, 11? So we're talking 8MB blocks. Do you *really*
want the balloon code to work on smaller things, particularly since
the whole interface is fundamentally racy and opportunistic to begin
with?

OK, I will implement a new version based on the suggestions. Thanks.

I have been working on a similar series [1] that is more generic, which
solves the problem of giving unused memory back to the host and could be
used to solve the migration problem as well. Can you take a look and see
if you can use my series in some way?


Hi Nitesh,

I actually checked the last version, which dates back to last year. It 
seems the new version does not have fundamental differences.


Actually there are obvious differences between the two series. This 
series provides a simple, lightweight method (I will continue to post a 
new version with simpler interfaces based on the above suggestions) to 
offer free page hints, and the hints are quite helpful for usages like 
accelerating live migration and guest snapshots. If I read that 
correctly, that series seems to provide true (guaranteed) free pages 
with much more heavyweight logic, but true free pages are not necessary 
for the live migration optimization, which is the goal and motivation of 
this work. And from my point of view, that series seems more like an 
alternative to ballooning, since it takes out free pages (or rather, 
guest unused pages) via allocation.


I will join the discussion in that thread. Probably we would need to 
think about other new usages for that series.


Best,
Wei
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PULL] vhost: cleanups and fixes

2018-06-12 Thread Wei Wang

On 06/12/2018 09:59 AM, Linus Torvalds wrote:

On Mon, Jun 11, 2018 at 6:36 PM Michael S. Tsirkin  wrote:

Maybe it will help to have GFP_NONE which will make any allocation
fail if attempted. Linus, would this address your comment?

It would definitely have helped me initially overlook that call chain.

But then when I started looking at the whole dma_map_page() thing, it
just raised my hackles again.

I would seriously suggest having a much simpler version for the "no
allocation, no dma mapping" case, so that it's *obvious* that that
never happens.

So instead of having virtio_balloon_send_free_pages() call a really
generic complex chain of functions that in _some_ cases can do memory
allocation, why isn't there a short-circuited "virtqueue_add_datum()"
that is guaranteed to never do anything like that?

Honestly, I look at "add_one_sg()" and it really doesn't make me
happy. It looks hacky as hell. If I read the code right, you're really
trying to just queue up a simple tuple of (pfn, len), except you encode
it as a page pointer in order to play games with the SG logic, and
then you map that to the ring, except in this case it's all a fake
ring that just adds the cpu-physical address instead.

And to figure that out, it's like five layers of indirection through
different helper functions that *can* do more generic things but in
this case don't.

And you do all of this from a core VM callback function with some
_really_ core VM locks held.

That makes no sense to me.

How about this:

  - get rid of all that code

  - make the core VM callback save the "these are the free memory
regions" in a fixed and limited array. One that DOES JUST THAT. No
crazy "SG IO dma-mapping function crap". Just a plain array of a fixed
size, pre-allocated for that virtio instance.

  - make it obvious that what you do in that sequence is ten
instructions and no allocations ("Look ma, I wrote a value to an array
and incremented the array index, and I'M DONE")

  - then in that workqueue entry that you start *anyway*, you empty the
array and do all the crazy virtio stuff.

In fact, while at it, just simplify the VM interface too. Instead of
traversing a random number of buddy lists, just traverse *one* - the
top-level one. Are you seriously ever going to shrink or mark
read-only anything *but* something big enough to be in the maximum
order?

MAX_ORDER is what, 11? So we're talking 8MB blocks. Do you *really*
want the balloon code to work on smaller things, particularly since
the whole interface is fundamentally racy and opportunistic to begin
with?


OK, I will implement a new version based on the suggestions. Thanks.

Best,
Wei



Re: [PATCH v29 1/4] mm: support reporting free page blocks

2018-04-10 Thread Wei Wang

On 04/11/2018 07:25 AM, Michael S. Tsirkin wrote:

On Tue, Apr 10, 2018 at 01:54:29PM -0700, Andrew Morton wrote:

On Tue, 10 Apr 2018 21:19:31 +0300 "Michael S. Tsirkin"  wrote:


Andrew, were your questions answered? If yes could I bother you for an ack on 
this?


Still not very happy that readers are told that "this function may
sleep" when it clearly doesn't do so.  If we wish to be able to change
it to sleep in the future then that should be mentioned.  And even put a
might_sleep() in there, to catch people who didn't read the comments...

Otherwise it looks OK.

Oh, might_sleep with a comment explaining it's for the future sounds
good to me. I queued this - Wei, could you post a patch on top pls?



I'm just wondering whether it is necessary to add another might_sleep(),
because we already have a cond_resched() there, which wraps a __might_sleep().
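
As a rough sketch of where the explicit annotation would go (the walker
body below is a reconstruction for illustration only, not the exact
patch, and the per-zone placement of cond_resched() is assumed):

#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/mmzone.h>
#include <linux/sched.h>

int walk_free_mem_block(void *opaque,
			int min_order,
			int (*report_pfn_range)(void *opaque,
						unsigned long pfn,
						unsigned long num))
{
	struct zone *zone;
	int order, mt, ret;

	/*
	 * Document (and enforce) the "may sleep" contract up front,
	 * even if today the only sleep comes via cond_resched().
	 */
	might_sleep();

	for_each_populated_zone(zone) {
		for (order = MAX_ORDER - 1; order >= min_order; order--) {
			for (mt = 0; mt < MIGRATE_TYPES; mt++) {
				ret = walk_free_page_list(opaque, zone, order,
							  mt, report_pfn_range);
				if (ret)
					return ret;
			}
		}
		cond_resched();
	}

	return 0;
}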


Best,
Wei


[PATCH v32 1/4] mm: support reporting free page blocks

2018-04-09 Thread Wei Wang
This patch adds support to walk through the free page blocks in the
system and report them via a callback function. Some page blocks may
leave the free list after zone->lock is released, so it is the caller's
responsibility to either detect or prevent the use of such pages.

One use example of this patch is to accelerate live migration by skipping
the transfer of free pages reported from the guest. A popular method used
by the hypervisor to track which part of memory is written during live
migration is to write-protect all the guest memory. So, those pages that
are reported as free pages but are written after the report function
returns will be captured by the hypervisor, and they will be added to the
next round of memory transfer.

Signed-off-by: Wei Wang <wei.w.w...@intel.com>
Signed-off-by: Liang Li <liang.z...@intel.com>
Cc: Michal Hocko <mho...@kernel.org>
Cc: Andrew Morton <a...@linux-foundation.org>
Cc: Michael S. Tsirkin <m...@redhat.com>
Acked-by: Michal Hocko <mho...@kernel.org>
---
 include/linux/mm.h |  6 ++++++
 mm/page_alloc.c    | 97 ++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 103 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f945dff..4698df1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1951,6 +1951,12 @@ extern void free_area_init_node(int nid, unsigned long * zones_size,
 			unsigned long zone_start_pfn, unsigned long *zholes_size);
 extern void free_initmem(void);
 
+extern int walk_free_mem_block(void *opaque,
+  int min_order,
+  int (*report_pfn_range)(void *opaque,
+  unsigned long pfn,
+  unsigned long num));
+
 /*
  * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
  * into the buddy system. The freed pages will be poisoned with pattern
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4ea0182..83e50fd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4912,6 +4912,103 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
show_swap_cache_info();
 }
 
+/*
+ * Walk through a free page list and report the found pfn range via the
+ * callback.
+ *
+ * Return 0 if it completes the reporting. Otherwise, return the non-zero
+ * value returned from the callback.
+ */
+static int walk_free_page_list(void *opaque,
+  struct zone *zone,
+  int order,
+  enum migratetype mt,
+  int (*report_pfn_range)(void *,
+  unsigned long,
+  unsigned long))
+{
+   struct page *page;
+   struct list_head *list;
+   unsigned long pfn, flags;
+   int ret = 0;
+
+   spin_lock_irqsave(&zone->lock, flags);
+   list = &zone->free_area[order].free_list[mt];
+   list_for_each_entry(page, list, lru) {
+   pfn = page_to_pfn(page);
+   ret = report_pfn_range(opaque, pfn, 1 << order);
+   if (ret)
+   break;
+   }
+   spin_unlock_irqrestore(&zone->lock, flags);
+
+   return ret;
+}
+
+/**
+ * walk_free_mem_block - Walk through the free page blocks in the system
+ * @opaque: the context passed from the caller
+ * @min_order: the minimum order of free lists to check
+ * @report_pfn_range: the callback to report the pfn range of the free pages
+ *
+ * If the callback returns a non-zero value, stop iterating the list of free
+ * page blocks. Otherwise, continue to report.
+ *
+ * Please note that there are no locking guarantees for the callback and
+ * that the reported pfn range might be freed or disappear after the
+ * callback returns so the caller has to be very careful how it is used.
+ *
+ * The callback itself must not sleep or perform any operations which would
+ * require any memory allocations directly (not even GFP_NOWAIT/GFP_ATOMIC)
+ * or via any lock dependency. It is generally advisable to implement
+ * the callback as simple as possible and defer any heavy lifting to a
+ * different context.
+ *
+ * There is no guarantee that each free range will be reported only once
+ * during one walk_free_mem_block invocation.
+ *
+ * pfn_to_page on the given range is strongly discouraged and if there is
+ * an absolute need for that make sure to contact MM people to discuss
+ * potential problems.
+ *
+ * The function itself might sleep so it cannot be called from atomic
+ * contexts.
+ *
+ * In general low orders tend to be very volatile and so it makes more
+ * sense to query larger ones first for various optimizations which like
+ * ballooning etc... This will reduce the overhead as well.
+ *
+ * Return 0 if it completes the reporting. Otherwise, return the non-zero
+ * value returned from the callback.
+ */
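
A hypothetical caller sketch (hint_ctx, record_pfn_range, and
collect_free_page_hints are invented names, not part of the patch)
showing how the contract above can be met: the callback only records
ranges into a pre-allocated array, and anything heavier is left to the
context that started the walk.

#include <linux/mm.h>
#include <linux/mmzone.h>	/* MAX_ORDER */

#define MAX_HINT_ENTRIES 256

struct hint_ctx {
	struct {
		unsigned long pfn;
		unsigned long num;
	} entries[MAX_HINT_ENTRIES];
	unsigned int nr;
};

/*
 * Runs under zone->lock: just record the range and return.
 * No sleeping, no allocation, no pfn_to_page().
 */
static int record_pfn_range(void *opaque, unsigned long pfn,
			    unsigned long num)
{
	struct hint_ctx *ctx = opaque;

	if (ctx->nr >= MAX_HINT_ENTRIES)
		return 1;	/* non-zero stops the walk */

	ctx->entries[ctx->nr].pfn = pfn;
	ctx->entries[ctx->nr].num = num;
	ctx->nr++;
	return 0;		/* keep reporting */
}

/*
 * Called from a context that may sleep; the recorded ranges are only
 * hints and may be reused by the guest at any time afterwards.
 */
static void collect_free_page_hints(struct hint_ctx *ctx)
{
	ctx->nr = 0;
	/* a non-zero return only means the array filled up */
	walk_free_mem_block(ctx, MAX_ORDER - 1, record_pfn_range);
	/* hand ctx->entries[0..ctx->nr) off to the transport here */
}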

[PATCH v32 0/4] Virtio-balloon: support free page reporting

2018-04-09 Thread Wei Wang
rate patch to discuss this flag check later;
- init_vqs: change to put some variables on stack to have simpler
  implementation;
- add destroy_workqueue(vb->balloon_wq);
v21->v22:
- add_one_sg: some code and comment re-arrangement
- send_cmd_id: handle a corner case

For previous ChangeLog, please reference
https://lwn.net/Articles/743660/

Wei Wang (4):
  mm: support reporting free page blocks
  virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
  mm/page_poison: expose page_poisoning_enabled to kernel modules
  virtio-balloon: VIRTIO_BALLOON_F_PAGE_POISON

 drivers/virtio/virtio_balloon.c     | 298 ++++++++++++++++++++++++++++----
 include/linux/mm.h                  |   6 +
 include/uapi/linux/virtio_balloon.h |   7 +
 mm/page_alloc.c                     |  97 ++++++++++++++
 mm/page_poison.c                    |   6 +
 5 files changed, 378 insertions(+), 36 deletions(-)

-- 
2.7.4


