Re: [PATCH v2] virtio: add VIRTQUEUE_ERROR QAPI event

2023-09-13 Thread Denis Plotnikov

Reviewed-by: Denis Plotnikov 

On 9/12/23 20:57, Vladimir Sementsov-Ogievskiy wrote:

For now we only log the vhost device error when the virtqueue is actually
stopped. Let's add a QAPI event, which makes it possible to:

  - collect statistics of such errors
  - take immediate actions: grab core dumps or do some other debugging
  - inform the user through a management API or UI, so that (s)he can
    react somehow, e.g. reset the device driver in the guest or even
    build up some automation to do so

Note that basically every inconsistency discovered during virtqueue
processing results in a silent virtqueue stop.  The guest then just
sees the requests getting stuck somewhere in the device for no visible
reason.  This event provides a means to inform the management layer of
this situation in a timely fashion.

The event could be reused for other virtqueue problems (not only for
vhost devices) in the future. For this reason it gets a generic name and
structure.
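
For illustration only (this sketch is not part of the patch): any virtio
device model that detects a fatal ring inconsistency could, in principle,
emit the same event through the generated helper. The enum currently has
only 'vhost-vring-err', so a real reuse would first add a new VirtqueueError
value; the existing value is used below just to keep the sketch compilable,
and the helper name report_virtqueue_error() is made up:

    #include "qemu/osdep.h"
    #include "qapi/qapi-events-qdev.h"
    #include "hw/virtio/virtio.h"

    /* Hypothetical non-vhost caller reusing the generic event. */
    static void report_virtqueue_error(VirtIODevice *vdev, int vq_index,
                                       const char *desc)
    {
        DeviceState *ds = DEVICE(vdev);

        qapi_event_send_virtqueue_error(ds->id, ds->canonical_path, vq_index,
                                        VIRTQUEUE_ERROR_VHOST_VRING_ERR, desc);
    }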

We keep the original VHOST_OPS_DEBUG() call, to keep the original debug
output as is; it's not the only call to VHOST_OPS_DEBUG in the file.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---

v2: - improve commit message (just stole the wording from Roman, hope he
   doesn't mind :)
 - add event throttling

  hw/virtio/vhost.c | 12 +++++++++---
  monitor/monitor.c | 10 ++++++++++
  qapi/qdev.json    | 25 +++++++++++++++++++++++++
  3 files changed, 44 insertions(+), 3 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index e2f6ffb446..162899feee 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -15,6 +15,7 @@
  
  #include "qemu/osdep.h"

  #include "qapi/error.h"
+#include "qapi/qapi-events-qdev.h"
  #include "hw/virtio/vhost.h"
  #include "qemu/atomic.h"
  #include "qemu/range.h"
@@ -1332,11 +1333,16 @@ static void 
vhost_virtqueue_error_notifier(EventNotifier *n)
  struct vhost_virtqueue *vq = container_of(n, struct vhost_virtqueue,
error_notifier);
  struct vhost_dev *dev = vq->dev;
-int index = vq - dev->vqs;
  
  if (event_notifier_test_and_clear(n) && dev->vdev) {

-VHOST_OPS_DEBUG(-EINVAL,  "vhost vring error in virtqueue %d",
-dev->vq_index + index);
+int ind = vq - dev->vqs + dev->vq_index;
+DeviceState *ds = &dev->vdev->parent_obj;
+
+VHOST_OPS_DEBUG(-EINVAL,  "vhost vring error in virtqueue %d", ind);
+qapi_event_send_virtqueue_error(ds->id, ds->canonical_path, ind,
+VIRTQUEUE_ERROR_VHOST_VRING_ERR,
+"vhost reported failure through vring "
+"error fd");
  }
  }
  
diff --git a/monitor/monitor.c b/monitor/monitor.c

index 941f87815a..cb1ee31156 100644
--- a/monitor/monitor.c
+++ b/monitor/monitor.c
@@ -313,6 +313,7 @@ static MonitorQAPIEventConf 
monitor_qapi_event_conf[QAPI_EVENT__MAX] = {
  [QAPI_EVENT_BALLOON_CHANGE]= { 1000 * SCALE_MS },
  [QAPI_EVENT_QUORUM_REPORT_BAD] = { 1000 * SCALE_MS },
  [QAPI_EVENT_QUORUM_FAILURE]= { 1000 * SCALE_MS },
+[QAPI_EVENT_VIRTQUEUE_ERROR]   = { 1000 * SCALE_MS },
  [QAPI_EVENT_VSERPORT_CHANGE]   = { 1000 * SCALE_MS },
  [QAPI_EVENT_MEMORY_DEVICE_SIZE_CHANGE] = { 1000 * SCALE_MS },
  };
@@ -497,6 +498,10 @@ static unsigned int qapi_event_throttle_hash(const void 
*key)
  hash += g_str_hash(qdict_get_str(evstate->data, "qom-path"));
  }
  
+if (evstate->event == QAPI_EVENT_VIRTQUEUE_ERROR) {

+hash += g_str_hash(qdict_get_str(evstate->data, "device"));
+}
+
  return hash;
  }
  
@@ -524,6 +529,11 @@ static gboolean qapi_event_throttle_equal(const void *a, const void *b)

 qdict_get_str(evb->data, "qom-path"));
  }
  
+if (eva->event == QAPI_EVENT_VIRTQUEUE_ERROR) {

+return !strcmp(qdict_get_str(eva->data, "device"),
+   qdict_get_str(evb->data, "device"));
+}
+
  return TRUE;
  }
  
diff --git a/qapi/qdev.json b/qapi/qdev.json

index 6bc5a733b8..199e21cae7 100644
--- a/qapi/qdev.json
+++ b/qapi/qdev.json
@@ -161,3 +161,28 @@
  ##
  { 'event': 'DEVICE_UNPLUG_GUEST_ERROR',
'data': { '*device': 'str', 'path': 'str' } }
+
+##
+# @VirtqueueError:
+#
+# Since: 8.2
+##
+{ 'enum': 'VirtqueueError',
+  'data': [ 'vhost-vring-err' ] }
+
+##
+# @VIRTQUEUE_ERROR:
+#
+# Emitted when a device virtqueue fails at runtime.
+#
+# @device: the device's ID if it has one
+# @path: the device's QOM path
+# @virtqueue: virtqueue index
+# @error: error identifier
+# @description: human readable description
+#
+# Since: 8.2
+##
+{ 'event': 'VIRTQUEUE_ERROR',
+  'data': { '*device': 'str', 'path': 'str', 'virtqueue': 'int',
+            'error': 'VirtqueueError', 'description': 'str'} }




[PING][PATCH v5] qapi/qmp: Add timestamps to qmp command responses

2023-05-10 Thread Denis Plotnikov

Hi all!

It seems that this series has gone through a number of reviews and got 
some "Reviewed-by" tags.


Are there any remaining flaws to fix that prevent merging this series?

Thanks, Denis

On 26.04.2023 17:08, Denis Plotnikov wrote:

Add "start" & "end" time values to QMP command responses.

These time values are added to let the qemu management layer get the exact
command execution time without any other time variance which might be
introduced by other parts of the management layer or qemu internals.
This helps to look for problems proactively from the management layer side.
The management layer would be able to detect problem cases by calculating
the QMP command execution time:
1. execution_time_from_mgmt_perspective -
   execution_time_of_qmp_command > some_threshold
   This detects problems with the management layer or with internal qemu
   QMP command dispatching
2. current_qmp_command_execution_time > avg_qmp_command_execution_time
   This detects that a certain QMP command starts taking longer to execute
   than usual
In both cases a more thorough investigation of the root causes should be
done using qemu tracepoints, depending on the particular QMP command under
investigation, or by other means. The timestamps help to avoid excessive
log output when qemu tracepoints are used to address similar cases.
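
As a rough illustration of check 1 above (a minimal sketch, not part of this
patch; it assumes the response has already been parsed into a QDict with
QEMU's own helpers, while a real management layer would use its own JSON
types, and qmp_exec_time_us() is a made-up name):

    #include "qemu/osdep.h"
    #include "qapi/qmp/qdict.h"

    /* Server-side execution time in microseconds, taken from the
     * "start"/"end" members of a parsed QMP response. */
    static int64_t qmp_exec_time_us(QDict *rsp)
    {
        QDict *start = qdict_get_qdict(rsp, "start");
        QDict *end = qdict_get_qdict(rsp, "end");

        return (qdict_get_int(end, "seconds") -
                qdict_get_int(start, "seconds")) * G_USEC_PER_SEC +
               qdict_get_int(end, "microseconds") -
               qdict_get_int(start, "microseconds");
    }

Comparing this value with the total time observed by the management layer
yields the dispatching/queueing component mentioned in check 1.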

Example of result:

 ./qemu/scripts/qmp/qmp-shell /tmp/qmp.socket

 (QEMU) query-status
 {"end": {"seconds": 1650367305, "microseconds": 831032},
  "start": {"seconds": 1650367305, "microseconds": 831012},
  "return": {"status": "running", "singlestep": false, "running": true}}

The response of the QMP command contains the start & end time of
the QMP command processing.

Also, "start" & "end" timestaps are added to qemu guest agent responses as
qemu-ga shares the same code for request dispatching.

Suggested-by: Andrey Ryabinin 
Signed-off-by: Denis Plotnikov 
Reviewed-by: Daniel P. Berrangé 
---
v4->v5:
  - use json-number instead of json-value for time values [Vladimir]
  - use a new util function for timestamp printing [Vladimir]

v3->v4:
  - rewrite commit message [Markus]
  - use new fields description in doc [Markus]
  - change type to int64_t [Markus]
  - simplify tests [Markus]

v2->v3:
  - fix typo "timestaps -> timestamps" [Marc-André]

v1->v2:
  - rephrase doc descriptions [Daniel]
  - add tests for qmp timestamps to qmp test and qga test [Daniel]
  - adjust asserts in test-qmp-cmds according to the new number of returning 
keys

v0->v1:
  - remove interface to control "start" and "end" time values: return 
timestamps unconditionally
  - add description to qmp specification
  - leave the same timestamp format in "seconds", "microseconds" to be 
consistent with events
timestamp
  - fix patch description
---
  docs/interop/qmp-spec.txt  | 28 ++--
  include/qapi/util.h|  2 ++
  qapi/qapi-util.c   | 11 +++
  qapi/qmp-dispatch.c| 11 +++
  qapi/qmp-event.c   |  6 +-
  tests/qtest/qmp-test.c | 32 
  tests/unit/test-qga.c  | 29 +
  tests/unit/test-qmp-cmds.c |  4 ++--
  8 files changed, 114 insertions(+), 9 deletions(-)

diff --git a/docs/interop/qmp-spec.txt b/docs/interop/qmp-spec.txt
index b0e8351d5b261..ed204b53373e5 100644
--- a/docs/interop/qmp-spec.txt
+++ b/docs/interop/qmp-spec.txt
@@ -158,7 +158,9 @@ responses that have an unknown "id" field.
  
  The format of a success response is:
  
-{ "return": json-value, "id": json-value }

+{ "return": json-value, "id": json-value,
+  "start": {"seconds": json-number, "microseconds": json-number},
+  "end": {"seconds": json-number, "microseconds": json-number} }
  
   Where,
  
@@ -169,13 +171,25 @@ The format of a success response is:

command does not return data
  - The "id" member contains the transaction identification associated
with the command execution if issued by the Client
+- The "start" member contains the exact time of when the server
+  started executing the command. This excludes any time the
+  command request spent queued, after reading it off the wire.
+  It is a json-object with the number of seconds and microseconds
+  since the Unix epoch
+- The "end" member contains the exact time of when the server
+  finished executing the command. This excludes any time the
+  command response spent queued, waiting to be sent on the wire.
+  It is a json-object with the number of seconds and microseconds
+  since the Unix epoch
  
  2.4.2 error

  ---
  
  The format of an error response is:
  

[PATCH v5] qapi/qmp: Add timestamps to qmp command responses

2023-04-26 Thread Denis Plotnikov
Add "start" & "end" time values to QMP command responses.

These time values are added to let the qemu management layer get the exact
command execution time without any other time variance which might be
introduced by other parts of the management layer or qemu internals.
This helps to look for problems proactively from the management layer side.
The management layer would be able to detect problem cases by calculating
the QMP command execution time:
1. execution_time_from_mgmt_perspective -
   execution_time_of_qmp_command > some_threshold
   This detects problems with the management layer or with internal qemu
   QMP command dispatching
2. current_qmp_command_execution_time > avg_qmp_command_execution_time
   This detects that a certain QMP command starts taking longer to execute
   than usual
In both cases a more thorough investigation of the root causes should be
done using qemu tracepoints, depending on the particular QMP command under
investigation, or by other means. The timestamps help to avoid excessive
log output when qemu tracepoints are used to address similar cases.

Example of result:

./qemu/scripts/qmp/qmp-shell /tmp/qmp.socket

(QEMU) query-status
{"end": {"seconds": 1650367305, "microseconds": 831032},
 "start": {"seconds": 1650367305, "microseconds": 831012},
 "return": {"status": "running", "singlestep": false, "running": true}}

The response of the QMP command contains the start & end time of
the QMP command processing.

Also, "start" & "end" timestaps are added to qemu guest agent responses as
qemu-ga shares the same code for request dispatching.

Suggested-by: Andrey Ryabinin 
Signed-off-by: Denis Plotnikov 
Reviewed-by: Daniel P. Berrangé 
---
v4->v5:
 - use json-number instead of json-value for time values [Vladimir]
 - use a new util function for timestamp printing [Vladimir]

v3->v4:
 - rewrite commit message [Markus]
 - use new fields description in doc [Markus]
 - change type to int64_t [Markus]
 - simplify tests [Markus]

v2->v3:
 - fix typo "timestaps -> timestamps" [Marc-André]

v1->v2:
 - rephrase doc descriptions [Daniel]
 - add tests for qmp timestamps to qmp test and qga test [Daniel]
 - adjust asserts in test-qmp-cmds according to the new number of returning keys

v0->v1:
 - remove interface to control "start" and "end" time values: return timestamps 
unconditionally
 - add description to qmp specification
 - leave the same timestamp format in "seconds", "microseconds" to be 
consistent with events
   timestamp
 - fix patch description
---
 docs/interop/qmp-spec.txt  | 28 ++--
 include/qapi/util.h|  2 ++
 qapi/qapi-util.c   | 11 +++
 qapi/qmp-dispatch.c| 11 +++
 qapi/qmp-event.c   |  6 +-
 tests/qtest/qmp-test.c | 32 
 tests/unit/test-qga.c  | 29 +
 tests/unit/test-qmp-cmds.c |  4 ++--
 8 files changed, 114 insertions(+), 9 deletions(-)

diff --git a/docs/interop/qmp-spec.txt b/docs/interop/qmp-spec.txt
index b0e8351d5b261..ed204b53373e5 100644
--- a/docs/interop/qmp-spec.txt
+++ b/docs/interop/qmp-spec.txt
@@ -158,7 +158,9 @@ responses that have an unknown "id" field.
 
 The format of a success response is:
 
-{ "return": json-value, "id": json-value }
+{ "return": json-value, "id": json-value,
+  "start": {"seconds": json-number, "microseconds": json-number},
+  "end": {"seconds": json-number, "microseconds": json-number} }
 
  Where,
 
@@ -169,13 +171,25 @@ The format of a success response is:
   command does not return data
 - The "id" member contains the transaction identification associated
   with the command execution if issued by the Client
+- The "start" member contains the exact time of when the server
+  started executing the command. This excludes any time the
+  command request spent queued, after reading it off the wire.
+  It is a json-object with the number of seconds and microseconds
+  since the Unix epoch
+- The "end" member contains the exact time of when the server
+  finished executing the command. This excludes any time the
+  command response spent queued, waiting to be sent on the wire.
+  It is a json-object with the number of seconds and microseconds
+  since the Unix epoch
 
 2.4.2 error
 ---
 
 The format of an error response is:
 
-{ "error": { "class": json-string, "desc": json-string }, "id": json-value }
+{ "error": { "class": json-string, "desc": json-string }, "id": json-value
+  "start": {"seconds": json-number, &q

[PING] [PATCH v4] qapi/qmp: Add timestamps to qmp command responses

2023-01-16 Thread Denis Plotnikov


On 10.01.2023 13:32, Denis Plotnikov wrote:


[ping]

On 01.11.2022 18:37, Denis Plotnikov wrote:

Add "start" & "end" time values to QMP command responses.

These time values are added to let the qemu management layer get the exact
command execution time without any other time variance which might be
introduced by other parts of the management layer or qemu internals.
This helps to look for problems proactively from the management layer side.
The management layer would be able to detect problem cases by calculating
the QMP command execution time:
1. execution_time_from_mgmt_perspective -
   execution_time_of_qmp_command > some_threshold
   This detects problems with the management layer or with internal qemu
   QMP command dispatching
2. current_qmp_command_execution_time > avg_qmp_command_execution_time
   This detects that a certain QMP command starts taking longer to execute
   than usual
In both cases a more thorough investigation of the root causes should be
done using qemu tracepoints, depending on the particular QMP command under
investigation, or by other means. The timestamps help to avoid excessive
log output when qemu tracepoints are used to address similar cases.

Example of result:

 ./qemu/scripts/qmp/qmp-shell /tmp/qmp.socket

 (QEMU) query-status
 {"end": {"seconds": 1650367305, "microseconds": 831032},
  "start": {"seconds": 1650367305, "microseconds": 831012},
  "return": {"status": "running", "singlestep": false, "running": true}}

The response of the QMP command contains the start & end time of
the QMP command processing.

Also, "start" & "end" timestaps are added to qemu guest agent responses as
qemu-ga shares the same code for request dispatching.

Suggested-by: Andrey Ryabinin
Signed-off-by: Denis Plotnikov
Reviewed-by: Daniel P. Berrangé
---
v3->v4
  - rewrite commit message [Markus]
  - use new fields description in doc [Markus]
  - change type to int64_t [Markus]
  - simplify tests [Markus]

v2->v3:
  - fix typo "timestaps -> timestamps" [Marc-André]

v1->v2:
  - rephrase doc descriptions [Daniel]
  - add tests for qmp timestamps to qmp test and qga test [Daniel]
  - adjust asserts in test-qmp-cmds according to the new number of returning 
keys

v0->v1:
  - remove interface to control "start" and "end" time values: return 
timestamps unconditionally
  - add description to qmp specification
  - leave the same timestamp format in "seconds", "microseconds" to be 
consistent with events
timestamp
  - fix patch description

  docs/interop/qmp-spec.txt  | 28 ++--
  qapi/qmp-dispatch.c| 18 ++
  tests/qtest/qmp-test.c | 32 
  tests/unit/test-qga.c  | 29 +
  tests/unit/test-qmp-cmds.c |  4 ++--
  5 files changed, 107 insertions(+), 4 deletions(-)

diff --git a/docs/interop/qmp-spec.txt b/docs/interop/qmp-spec.txt
index b0e8351d5b261..0dd8e716c02f0 100644
--- a/docs/interop/qmp-spec.txt
+++ b/docs/interop/qmp-spec.txt
@@ -158,7 +158,9 @@ responses that have an unknown "id" field.
  
  The format of a success response is:
  
-{ "return": json-value, "id": json-value }

+{ "return": json-value, "id": json-value,
+  "start": {"seconds": json-value, "microseconds": json-value},
+  "end": {"seconds": json-value, "microseconds": json-value} }
  
   Where,
  
@@ -169,13 +171,25 @@ The format of a success response is:

command does not return data
  - The "id" member contains the transaction identification associated
with the command execution if issued by the Client
+- The "start" member contains the exact time of when the server
+  started executing the command. This excludes any time the
+  command request spent queued, after reading it off the wire.
+  It is a json-object with the number of seconds and microseconds
+  since the Unix epoch
+- The "end" member contains the exact time of when the server
+  finished executing the command. This excludes any time the
+  command response spent queued, waiting to be sent on the wire.
+  It is a json-object with the number of seconds and microseconds
+  since the Unix epoch
  
  2.4.2 error

  ---
  
  The format of an error response is:
  
-{ "error": { "class": json-string, "desc": json-string }, "id": json-value }

+{ "error": { "class": json-string, "desc": json-string }, "id": json-value
+  "start": {"seconds": json-value, "microseconds": json-value},
+  "end": {"seconds": json-value, "microseconds&quo

Re: [PATCH v4] qapi/qmp: Add timestamps to qmp command responses

2023-01-10 Thread Denis Plotnikov

[ping]

On 01.11.2022 18:37, Denis Plotnikov wrote:

Add "start" & "end" time values to QMP command responses.

These time values are added to let the qemu management layer get the exact
command execution time without any other time variance which might be
introduced by other parts of the management layer or qemu internals.
This helps to look for problems proactively from the management layer side.
The management layer would be able to detect problem cases by calculating
the QMP command execution time:
1. execution_time_from_mgmt_perspective -
   execution_time_of_qmp_command > some_threshold
   This detects problems with the management layer or with internal qemu
   QMP command dispatching
2. current_qmp_command_execution_time > avg_qmp_command_execution_time
   This detects that a certain QMP command starts taking longer to execute
   than usual
In both cases a more thorough investigation of the root causes should be
done using qemu tracepoints, depending on the particular QMP command under
investigation, or by other means. The timestamps help to avoid excessive
log output when qemu tracepoints are used to address similar cases.

Example of result:

 ./qemu/scripts/qmp/qmp-shell /tmp/qmp.socket

 (QEMU) query-status
 {"end": {"seconds": 1650367305, "microseconds": 831032},
  "start": {"seconds": 1650367305, "microseconds": 831012},
  "return": {"status": "running", "singlestep": false, "running": true}}

The response of the QMP command contains the start & end time of
the QMP command processing.

Also, "start" & "end" timestaps are added to qemu guest agent responses as
qemu-ga shares the same code for request dispatching.

Suggested-by: Andrey Ryabinin
Signed-off-by: Denis Plotnikov
Reviewed-by: Daniel P. Berrangé
---
v3->v4
  - rewrite commit message [Markus]
  - use new fields description in doc [Markus]
  - change type to int64_t [Markus]
  - simplify tests [Markus]

v2->v3:
  - fix typo "timestaps -> timestamps" [Marc-André]

v1->v2:
  - rephrase doc descriptions [Daniel]
  - add tests for qmp timestamps to qmp test and qga test [Daniel]
  - adjust asserts in test-qmp-cmds according to the new number of returning 
keys

v0->v1:
  - remove interface to control "start" and "end" time values: return 
timestamps unconditionally
  - add description to qmp specification
  - leave the same timestamp format in "seconds", "microseconds" to be 
consistent with events
timestamp
  - fix patch description

  docs/interop/qmp-spec.txt  | 28 ++--
  qapi/qmp-dispatch.c| 18 ++
  tests/qtest/qmp-test.c | 32 
  tests/unit/test-qga.c  | 29 +
  tests/unit/test-qmp-cmds.c |  4 ++--
  5 files changed, 107 insertions(+), 4 deletions(-)

diff --git a/docs/interop/qmp-spec.txt b/docs/interop/qmp-spec.txt
index b0e8351d5b261..0dd8e716c02f0 100644
--- a/docs/interop/qmp-spec.txt
+++ b/docs/interop/qmp-spec.txt
@@ -158,7 +158,9 @@ responses that have an unknown "id" field.
  
  The format of a success response is:
  
-{ "return": json-value, "id": json-value }

+{ "return": json-value, "id": json-value,
+  "start": {"seconds": json-value, "microseconds": json-value},
+  "end": {"seconds": json-value, "microseconds": json-value} }
  
   Where,
  
@@ -169,13 +171,25 @@ The format of a success response is:

command does not return data
  - The "id" member contains the transaction identification associated
with the command execution if issued by the Client
+- The "start" member contains the exact time of when the server
+  started executing the command. This excludes any time the
+  command request spent queued, after reading it off the wire.
+  It is a json-object with the number of seconds and microseconds
+  since the Unix epoch
+- The "end" member contains the exact time of when the server
+  finished executing the command. This excludes any time the
+  command response spent queued, waiting to be sent on the wire.
+  It is a json-object with the number of seconds and microseconds
+  since the Unix epoch
  
  2.4.2 error

  ---
  
  The format of an error response is:
  
-{ "error": { "class": json-string, "desc": json-string }, "id": json-value }

+{ "error": { "class": json-string, "desc": json-string }, "id": json-value
+  "start": {"seconds": json-value, "microseconds": json-value},
+  "end": {"seconds": json-value, "microseconds": json-value} }
  
   Where,
  
@@ -184,6 

[PATCH v4] qapi/qmp: Add timestamps to qmp command responses

2022-11-01 Thread Denis Plotnikov
Add "start" & "end" time values to QMP command responses.

These time values are added to let the qemu management layer get the exact
command execution time without any other time variance which might be
introduced by other parts of the management layer or qemu internals.
This helps to look for problems proactively from the management layer side.
The management layer would be able to detect problem cases by calculating
the QMP command execution time:
1. execution_time_from_mgmt_perspective -
   execution_time_of_qmp_command > some_threshold
   This detects problems with the management layer or with internal qemu
   QMP command dispatching
2. current_qmp_command_execution_time > avg_qmp_command_execution_time
   This detects that a certain QMP command starts taking longer to execute
   than usual
In both cases a more thorough investigation of the root causes should be
done using qemu tracepoints, depending on the particular QMP command under
investigation, or by other means. The timestamps help to avoid excessive
log output when qemu tracepoints are used to address similar cases.

Example of result:

./qemu/scripts/qmp/qmp-shell /tmp/qmp.socket

(QEMU) query-status
{"end": {"seconds": 1650367305, "microseconds": 831032},
 "start": {"seconds": 1650367305, "microseconds": 831012},
 "return": {"status": "running", "singlestep": false, "running": true}}

The response of the QMP command contains the start & end time of
the QMP command processing.

Also, "start" & "end" timestaps are added to qemu guest agent responses as
qemu-ga shares the same code for request dispatching.

Suggested-by: Andrey Ryabinin 
Signed-off-by: Denis Plotnikov 
Reviewed-by: Daniel P. Berrangé 
---
v3->v4
 - rewrite commit message [Markus]
 - use new fields description in doc [Markus]
 - change type to int64_t [Markus]
 - simplify tests [Markus]

v2->v3:
 - fix typo "timestaps -> timestamps" [Marc-André]

v1->v2:
 - rephrase doc descriptions [Daniel]
 - add tests for qmp timestamps to qmp test and qga test [Daniel]
 - adjust asserts in test-qmp-cmds according to the new number of returning keys

v0->v1:
 - remove interface to control "start" and "end" time values: return timestamps 
unconditionally
 - add description to qmp specification
 - leave the same timestamp format in "seconds", "microseconds" to be 
consistent with events
   timestamp
 - fix patch description

 docs/interop/qmp-spec.txt  | 28 ++--
 qapi/qmp-dispatch.c| 18 ++
 tests/qtest/qmp-test.c | 32 
 tests/unit/test-qga.c  | 29 +
 tests/unit/test-qmp-cmds.c |  4 ++--
 5 files changed, 107 insertions(+), 4 deletions(-)

diff --git a/docs/interop/qmp-spec.txt b/docs/interop/qmp-spec.txt
index b0e8351d5b261..0dd8e716c02f0 100644
--- a/docs/interop/qmp-spec.txt
+++ b/docs/interop/qmp-spec.txt
@@ -158,7 +158,9 @@ responses that have an unknown "id" field.
 
 The format of a success response is:
 
-{ "return": json-value, "id": json-value }
+{ "return": json-value, "id": json-value,
+  "start": {"seconds": json-value, "microseconds": json-value},
+  "end": {"seconds": json-value, "microseconds": json-value} }
 
  Where,
 
@@ -169,13 +171,25 @@ The format of a success response is:
   command does not return data
 - The "id" member contains the transaction identification associated
   with the command execution if issued by the Client
+- The "start" member contains the exact time of when the server
+  started executing the command. This excludes any time the
+  command request spent queued, after reading it off the wire.
+  It is a json-object with the number of seconds and microseconds
+  since the Unix epoch
+- The "end" member contains the exact time of when the server
+  finished executing the command. This excludes any time the
+  command response spent queued, waiting to be sent on the wire.
+  It is a json-object with the number of seconds and microseconds
+  since the Unix epoch
 
 2.4.2 error
 ---
 
 The format of an error response is:
 
-{ "error": { "class": json-string, "desc": json-string }, "id": json-value }
+{ "error": { "class": json-string, "desc": json-string }, "id": json-value
+  "start": {"seconds": json-value, "microseconds": json-value},
+  "end": {"seconds": json-value, "microseconds": json-value} }
 
  Where,
 
@@ -184,6 +198,16 @@ The format of an error response is:
   not attempt to parse this message.
 - The "

Re: [PATCH v3] qapi/qmp: Add timestamps to qmp command responses

2022-10-16 Thread Denis Plotnikov



On 14.10.2022 16:19, Daniel P. Berrangé wrote:

On Fri, Oct 14, 2022 at 02:57:06PM +0200, Markus Armbruster wrote:

Daniel P. Berrangé  writes:


On Fri, Oct 14, 2022 at 11:31:13AM +0200, Markus Armbruster wrote:

Daniel P. Berrangé  writes:


On Thu, Oct 13, 2022 at 05:00:26PM +0200, Markus Armbruster wrote:

Denis Plotnikov  writes:


Add "start" & "end" time values to qmp command responses.

Please spell it QMP.  More of the same below.


These time values are added to let the qemu management layer get the exact
command execution time without any other time variance which might be
introduced by other parts of the management layer or qemu internals. This is
particularly useful for management layer logging, for later problem resolving.

I'm still having difficulties seeing the value add over existing
tracepoints and logging.

Can you tell me about a problem you cracked (or could have cracked) with
the help of this?

Consider your QMP client is logging all commands and replies in its
own logfile (libvirt can do this). Having this start/end timestamps
included means the QMP client log is self contained.

A QMP client can include client-side timestamps in its log.  What value
is being added by server-side timestamps?  According to the commit
message, it's for getting "the exact command execution time without any
other time variance which might be brought by other parts of management
layer or qemu internals."  Why is that useful?  In particular, why is
excluding network and QEMU queueing delays (inbound and outbound)
useful?

Let's say some command normally runs in ~100ms, but occasionally
runs in 2 secs, and you want to understand why.

A first step is understanding whether a given command itself is
slow at executing, or whether its execution has merely been
delayed because some other aspect of QEMU has delayed its execution.
If the server timestamps show it was very fast, then that indicates
delayed processing. Thus instead of debugging the slow command, I
can think about what scenarios would be responsible for the delay.
Perhaps a previous QMP command was very slow, or maybe there is
simply a large volume of QMP commands backlogged, or some part of
QEMU got blocked.

Another case would be a command that is normally fast, and sometimes
is slower, but still relatively fast. The network and queueing side
might be a significant enough proportion of the total time to obscure
the slowdown. If you can eliminate the non-execution time, you can
see the performance trends over time to spot the subtle slowdowns
and detect abnormal behaviour before it becomes too terrible.

This is troubleshooting.  Asking for better troubleshooting tools is
fair.

However, the proposed timestamps provide much more limited insight than
existing tracepoints.  For instance, enabling

tracepoints are absolutely great and let you get a hell of a lot
more information, *provided* you are in a position to actually
use tracepoints. This is, unfortunately, frequently not the case
when supporting real world production deployments.

Exactly!!! Thanks for pointing that out!


Bug reports from customers typically include little more than a
log file they got from the mgmt client at the time the problem happened.
The problem experienced may no longer exist, so asking them to run
a tracepoint script is not possible. They may also be reluctant to
actually run tracepoint scripts on a production system, or simply
lack the ability to do so at all, due to constraints of the deployment
environment. Logs from libvirt are something that are collected by
default for many mgmt apps, or can be turned on by the user with
minimal risk of disruption.

Overall, there's a compelling desire to be proactive in collecting
information ahead of time, that might be useful in diagnosing
future bug reports.


This is the main reason. When you encounter a problem, one of the first
questions is "Was there something similar in the past?" Another question
is how often it happens.


With the timestamps, answering these questions becomes easier.

Another thing is that with the QMP command timestamps you can build a
monitoring system which reports the cases when
execution_time_from_mgmt_perspective - execution_time_qmp_command >
some_threshold, which in turn proactively tells you about potential
problems. And then you'll start using the QMP tracepoints (and other
means) to figure out the real reason for the execution time variance.


Thanks, Denis



So it isn't an 'either / or' decision of QMP reply logs vs use of
tracepoints, both are beneficial, with their own pros/cons.

With regards,
Daniel




Re: [PATCH v3] qapi/qmp: Add timestamps to qmp command responses

2022-10-14 Thread Denis Plotnikov



On 13.10.2022 18:00, Markus Armbruster wrote:

Denis Plotnikov  writes:


Add "start" & "end" time values to qmp command responses.

Please spell it QMP.  More of the same below.

ok

Can you tell me about a problem you cracked (or could have cracked) with
the help of this?


We have a management layer which interacts with qemu via QMP. When it
issues a QMP command we measure the time it takes to perform that
command. Some of those commands seem to execute longer than
expected. In that case the question is which part of the command
execution takes the majority of the time. Is it a flaw in the management
layer, in qemu's QMP command scheduling, or in the QMP command execution
itself? The timestamps being added help to exclude the QMP command
execution time from the question. Also, the timestamps help to learn the
exact time when the command started and ended, and to put that
information into system logs properly.



  "return": {"status": "running", "singlestep": false, "running": true}}

The responce of the qmp command contains the start & end time of

response

ok



the qmp command processing.

Suggested-by: Andrey Ryabinin 
Signed-off-by: Denis Plotnikov 
Reviewed-by: Daniel P. Berrangé 

Please spell out that this affects both QMP and qemu-ga.

ok

command does not return data
  - The "id" member contains the transaction identification associated
with the command execution if issued by the Client
+- The "start" member contains the exact time of when the server
+  started executing the command. This excludes any time the
+  command request spent queued, after reading it off the wire.
+  It is a fixed json-object with time in seconds and microseconds
+  relative to the Unix Epoch (1 Jan 1970)

What's a "fixed json-object"?

Hmm, I guess you're copying from the description of event member
"timestamp".

That's right

Let's go with "a json-object with the number of seconds and microseconds
since the Unix epoch" everywhere.

ok


Make this int64_t, because that's what g_get_real_time() returns.

Same for add_timestamps() parameters.

ok, will fix the type everywhere


+qobject_unref(resp);
I'd be tempted to fold this into existing tests.


Do you want me to put the timestamp checking into an existing testcase?


Thanks,

Denis




+
  qtest_quit(qts);
  }
  
diff --git a/tests/unit/test-qga.c b/tests/unit/test-qga.c

index b4e0a145737d1..18ec9bac3650e 100644
--- a/tests/unit/test-qga.c
+++ b/tests/unit/test-qga.c
@@ -217,6 +217,36 @@ static void test_qga_ping(gconstpointer fix)
  qmp_assert_no_error(ret);
  }
  
+static void test_qga_timestamps(gconstpointer fix)

+{
+QDict *start, *end;
+uint64_t start_s, start_us, end_s, end_us, start_ts, end_ts;
+const TestFixture *fixture = fix;
+g_autoptr(QDict) ret = NULL;
+
+ret = qmp_fd(fixture->fd, "{'execute': 'guest-ping'}");
+g_assert_nonnull(ret);
+qmp_assert_no_error(ret);
+
+start = qdict_get_qdict(ret, "start");
+g_assert(start);
+end = qdict_get_qdict(ret, "end");
+g_assert(end);
+
+start_s = qdict_get_try_int(start, "seconds", 0);
+g_assert(start_s);
+start_us = qdict_get_try_int(start, "microseconds", 0);
+
+end_s = qdict_get_try_int(end, "seconds", 0);
+g_assert(end_s);
+end_us = qdict_get_try_int(end, "microseconds", 0);
+
+start_ts = (start_s * G_USEC_PER_SEC) + start_us;
+end_ts = (end_s * G_USEC_PER_SEC) + end_us;
+
+g_assert(end_ts > start_ts);
+}
+
  static void test_qga_id(gconstpointer fix)
  {
  const TestFixture *fixture = fix;
@@ -948,6 +978,7 @@ int main(int argc, char **argv)
  g_test_add_data_func("/qga/sync-delimited", , 
test_qga_sync_delimited);
  g_test_add_data_func("/qga/sync", , test_qga_sync);
  g_test_add_data_func("/qga/ping", , test_qga_ping);
+g_test_add_data_func("/qga/timestamps", , test_qga_timestamps);
  g_test_add_data_func("/qga/info", , test_qga_info);
  g_test_add_data_func("/qga/network-get-interfaces", ,
   test_qga_network_get_interfaces);
diff --git a/tests/unit/test-qmp-cmds.c b/tests/unit/test-qmp-cmds.c
index 6085c099950b5..54d63bb8e346f 100644
--- a/tests/unit/test-qmp-cmds.c
+++ b/tests/unit/test-qmp-cmds.c
@@ -154,7 +154,7 @@ static QObject *do_qmp_dispatch(bool allow_oob, const char 
*template, ...)
  g_assert(resp);
  ret = qdict_get(resp, "return");
  g_assert(ret);
-g_assert(qdict_size(resp) == 1);
+g_assert(qdict_size(resp) == 3);
  
  qobject_ref(ret);

  qobject_unref(resp);
@@ -181,7 +181,7 @@ static void do_qmp_dispatch_error(bool allow_oob, 
ErrorClass cls,
  ==, QapiErrorClass_str(cls));
  g_assert(qdict_get_try_str(error, "desc"));
  g_assert(qdict_size(error) == 2);
-g_assert(qdict_size(resp) == 1);
+g_assert(qdict_size(resp) == 3);
  
  qobject_unref(resp);

  qobject_unref(req);




[PATCH v3] qapi/qmp: Add timestamps to qmp command responses

2022-10-11 Thread Denis Plotnikov
Add "start" & "end" time values to qmp command responses.

These time values are added to let the qemu management layer get the exact
command execution time without any other time variance which might be
introduced by other parts of the management layer or qemu internals. This is
particularly useful for management layer logging, for later problem resolving.

Example of result:

./qemu/scripts/qmp/qmp-shell /tmp/qmp.socket

(QEMU) query-status
{"end": {"seconds": 1650367305, "microseconds": 831032},
 "start": {"seconds": 1650367305, "microseconds": 831012},
 "return": {"status": "running", "singlestep": false, "running": true}}

The response of the qmp command contains the start & end time of
the qmp command processing.

Suggested-by: Andrey Ryabinin 
Signed-off-by: Denis Plotnikov 
Reviewed-by: Daniel P. Berrangé 
---

v0->v1:
 - remove interface to control "start" and "end" time values: return timestamps 
unconditionally
 - add description to qmp specification
 - leave the same timestamp format in "seconds", "microseconds" to be 
consistent with events
   timestamp
 - fix patch description

v1->v2:
 - rephrase doc descriptions [Daniel]
 - add tests for qmp timestamps to qmp test and qga test [Daniel]
 - adjust asserts in test-qmp-cmds according to the new number of returning keys

v2->v3:
 - fix typo "timestaps -> timestamps" [Marc-André]

 docs/interop/qmp-spec.txt  | 28 ++--
 qapi/qmp-dispatch.c| 18 ++
 tests/qtest/qmp-test.c | 34 ++
 tests/unit/test-qga.c  | 31 +++
 tests/unit/test-qmp-cmds.c |  4 ++--
 5 files changed, 111 insertions(+), 4 deletions(-)

diff --git a/docs/interop/qmp-spec.txt b/docs/interop/qmp-spec.txt
index b0e8351d5b261..2e0b7de0c4dc7 100644
--- a/docs/interop/qmp-spec.txt
+++ b/docs/interop/qmp-spec.txt
@@ -158,7 +158,9 @@ responses that have an unknown "id" field.
 
 The format of a success response is:
 
-{ "return": json-value, "id": json-value }
+{ "return": json-value, "id": json-value,
+  "start": {"seconds": json-value, "microseconds": json-value},
+  "end": {"seconds": json-value, "microseconds": json-value} }
 
  Where,
 
@@ -169,13 +171,25 @@ The format of a success response is:
   command does not return data
 - The "id" member contains the transaction identification associated
   with the command execution if issued by the Client
+- The "start" member contains the exact time of when the server
+  started executing the command. This excludes any time the
+  command request spent queued, after reading it off the wire.
+  It is a fixed json-object with time in seconds and microseconds
+  relative to the Unix Epoch (1 Jan 1970)
+- The "end" member contains the exact time of when the server
+  finished executing the command. This excludes any time the
+  command response spent queued, waiting to be sent on the wire.
+  It is a fixed json-object with time in seconds and microseconds
+  relative to the Unix Epoch (1 Jan 1970)
 
 2.4.2 error
 ---
 
 The format of an error response is:
 
-{ "error": { "class": json-string, "desc": json-string }, "id": json-value }
+{ "error": { "class": json-string, "desc": json-string }, "id": json-value
+  "start": {"seconds": json-value, "microseconds": json-value},
+  "end": {"seconds": json-value, "microseconds": json-value} }
 
  Where,
 
@@ -184,6 +198,16 @@ The format of an error response is:
   not attempt to parse this message.
 - The "id" member contains the transaction identification associated with
   the command execution if issued by the Client
+- The "start" member contains the exact time of when the server
+  started executing the command. This excludes any time the
+  command request spent queued, after reading it off the wire.
+  It is a fixed json-object with time in seconds and microseconds
+  relative to the Unix Epoch (1 Jan 1970)
+- The "end" member contains the exact time of when the server
+  finished executing the command. This excludes any time the
+  command response spent queued, waiting to be sent on the wire.
+  It is a fixed json-object with time in seconds and microseconds
+  relative to the Unix Epoch (1 Jan 1970)
 
 NOTE: Some errors can occur before the Server is able to read the "id" member,
 in these cases the "id" member will not be part of the error response, even
diff --git a/qapi/qmp-dispatch.c b/qapi/qmp-dispatch.c
index 0990873ec8ec1..fce8

[PATCH v2] qapi/qmp: Add timestamps to qmp command responses

2022-10-11 Thread Denis Plotnikov
Add "start" & "end" time values to qmp command responses.

These time values are added to let the qemu management layer get the exact
command execution time without any other time variance which might be
introduced by other parts of the management layer or qemu internals. This is
particularly useful for management layer logging, for later problem resolving.

Example of result:

./qemu/scripts/qmp/qmp-shell /tmp/qmp.socket

(QEMU) query-status
{"end": {"seconds": 1650367305, "microseconds": 831032},
 "start": {"seconds": 1650367305, "microseconds": 831012},
 "return": {"status": "running", "singlestep": false, "running": true}}

The response of the qmp command contains the start & end time of
the qmp command processing.

Suggested-by: Andrey Ryabinin 
Signed-off-by: Denis Plotnikov 
---
v0->v1:
 - remove interface to control "start" and "end" time values: return timestamps 
unconditionally
 - add description to qmp specification
 - leave the same timestamp format in "seconds", "microseconds" to be 
consistent with events
   timestamp
 - fix patch description

v1->v2:
 - rephrase doc descriptions [Daniel]
 - add tests for qmp timestamps to qmp test and qga test [Daniel]
 - adjust asserts in test-qmp-cmds according to the new number of returning keys

 docs/interop/qmp-spec.txt  | 28 ++--
 qapi/qmp-dispatch.c| 18 ++
 tests/qtest/qmp-test.c | 34 ++
 tests/unit/test-qga.c  | 31 +++
 tests/unit/test-qmp-cmds.c |  4 ++--
 5 files changed, 111 insertions(+), 4 deletions(-)

diff --git a/docs/interop/qmp-spec.txt b/docs/interop/qmp-spec.txt
index b0e8351d5b261..2e0b7de0c4dc7 100644
--- a/docs/interop/qmp-spec.txt
+++ b/docs/interop/qmp-spec.txt
@@ -158,7 +158,9 @@ responses that have an unknown "id" field.
 
 The format of a success response is:
 
-{ "return": json-value, "id": json-value }
+{ "return": json-value, "id": json-value,
+  "start": {"seconds": json-value, "microseconds": json-value},
+  "end": {"seconds": json-value, "microseconds": json-value} }
 
  Where,
 
@@ -169,13 +171,25 @@ The format of a success response is:
   command does not return data
 - The "id" member contains the transaction identification associated
   with the command execution if issued by the Client
+- The "start" member contains the exact time of when the server
+  started executing the command. This excludes any time the
+  command request spent queued, after reading it off the wire.
+  It is a fixed json-object with time in seconds and microseconds
+  relative to the Unix Epoch (1 Jan 1970)
+- The "end" member contains the exact time of when the server
+  finished executing the command. This excludes any time the
+  command response spent queued, waiting to be sent on the wire.
+  It is a fixed json-object with time in seconds and microseconds
+  relative to the Unix Epoch (1 Jan 1970)
 
 2.4.2 error
 ---
 
 The format of an error response is:
 
-{ "error": { "class": json-string, "desc": json-string }, "id": json-value }
+{ "error": { "class": json-string, "desc": json-string }, "id": json-value
+  "start": {"seconds": json-value, "microseconds": json-value},
+  "end": {"seconds": json-value, "microseconds": json-value} }
 
  Where,
 
@@ -184,6 +198,16 @@ The format of an error response is:
   not attempt to parse this message.
 - The "id" member contains the transaction identification associated with
   the command execution if issued by the Client
+- The "start" member contains the exact time of when the server
+  started executing the command. This excludes any time the
+  command request spent queued, after reading it off the wire.
+  It is a fixed json-object with time in seconds and microseconds
+  relative to the Unix Epoch (1 Jan 1970)
+- The "end" member contains the exact time of when the server
+  finished executing the command. This excludes any time the
+  command response spent queued, waiting to be sent on the wire.
+  It is a fixed json-object with time in seconds and microseconds
+  relative to the Unix Epoch (1 Jan 1970)
 
 NOTE: Some errors can occur before the Server is able to read the "id" member,
 in these cases the "id" member will not be part of the error response, even
diff --git a/qapi/qmp-dispatch.c b/qapi/qmp-dispatch.c
index 0990873ec8ec1..fce87416f2128 100644
--- a/qapi/qmp-dispatch.c
+++ b/qapi/qmp-dispatch.c
@@ -130,6 +130,22 @@ static void do_qmp_disp

[PATCH v1] qapi/qmp: Add timestamps to qmp command responses

2022-10-07 Thread Denis Plotnikov
Add "start" & "end" time values to qmp command responses.

These time values are added to let the qemu management layer get the exact
command execution time without any other time variance which might be
introduced by other parts of the management layer or qemu internals. This is
particularly useful for management layer logging, for later problem resolving.

Example of result:

./qemu/scripts/qmp/qmp-shell /tmp/qmp.socket

(QEMU) query-status
{"end": {"seconds": 1650367305, "microseconds": 831032},
 "start": {"seconds": 1650367305, "microseconds": 831012},
 "return": {"status": "running", "singlestep": false, "running": true}}

The response of the qmp command contains the start & end time of
the qmp command processing.

Suggested-by: Andrey Ryabinin 
Signed-off-by: Denis Plotnikov 
---
v0->v1:
 - remove interface to control "start" and "end" time values: return timestamps 
unconditionally
 - add description to qmp specification
 - leave the same timestamp format in "seconds", "microseconds" to be 
consistent with events
   timestamp
 - fix patch description

 docs/interop/qmp-spec.txt | 20 ++--
 qapi/qmp-dispatch.c   | 18 ++
 2 files changed, 36 insertions(+), 2 deletions(-)

diff --git a/docs/interop/qmp-spec.txt b/docs/interop/qmp-spec.txt
index b0e8351d5b261..d1cca8bc447ce 100644
--- a/docs/interop/qmp-spec.txt
+++ b/docs/interop/qmp-spec.txt
@@ -158,7 +158,9 @@ responses that have an unknown "id" field.
 
 The format of a success response is:
 
-{ "return": json-value, "id": json-value }
+{ "return": json-value, "id": json-value,
+  "start": {"seconds": json-value, "microseconds": json-value},
+  "end": {"seconds": json-value, "microseconds": json-value} }
 
  Where,
 
@@ -169,13 +171,21 @@ The format of a success response is:
   command does not return data
 - The "id" member contains the transaction identification associated
   with the command execution if issued by the Client
+- The "start" member contains the exact time of when the command has been
+  started to be processed. It is a fixed json-object with time in
+  seconds and microseconds relative to the Unix Epoch (1 Jan 1970)
+- The "end" member contains the exact time of when the command has been
+  finished to be processed. It is a fixed json-object with time in
+  seconds and microseconds relative to the Unix Epoch (1 Jan 1970)
 
 2.4.2 error
 ---
 
 The format of an error response is:
 
-{ "error": { "class": json-string, "desc": json-string }, "id": json-value }
+{ "error": { "class": json-string, "desc": json-string }, "id": json-value
+  "start": {"seconds": json-value, "microseconds": json-value},
+  "end": {"seconds": json-value, "microseconds": json-value} }
 
  Where,
 
@@ -184,6 +194,12 @@ The format of an error response is:
   not attempt to parse this message.
 - The "id" member contains the transaction identification associated with
   the command execution if issued by the Client
+- The "start" member contains the exact time of when the command has been
+  started to be processed. It is a fixed json-object with time in
+  seconds and microseconds relative to the Unix Epoch (1 Jan 1970)
+- The "end" member contains the exact time of when the command has been
+  finished to be processed. It is a fixed json-object with time in
+  seconds and microseconds relative to the Unix Epoch (1 Jan 1970)
 
 NOTE: Some errors can occur before the Server is able to read the "id" member,
 in these cases the "id" member will not be part of the error response, even
diff --git a/qapi/qmp-dispatch.c b/qapi/qmp-dispatch.c
index 0990873ec8ec1..fce87416f2128 100644
--- a/qapi/qmp-dispatch.c
+++ b/qapi/qmp-dispatch.c
@@ -130,6 +130,22 @@ static void do_qmp_dispatch_bh(void *opaque)
 aio_co_wake(data->co);
 }
 
+static void add_timestamps(QDict *qdict, uint64_t start_ms, uint64_t end_ms)
+{
+QDict *start_dict, *end_dict;
+
+start_dict = qdict_new();
+qdict_put_int(start_dict, "seconds", start_ms / G_USEC_PER_SEC);
+qdict_put_int(start_dict, "microseconds", start_ms % G_USEC_PER_SEC);
+
+end_dict = qdict_new();
+qdict_put_int(end_dict, "seconds", end_ms / G_USEC_PER_SEC);
+qdict_put_int(end_dict, "microseconds", end_ms % G_USEC_PER_SEC);
+
+qdict_put_obj(qdict, "start", QOBJECT(start_dict));
+qdict_put_obj(qdict, "end", QOBJECT(end_dict));
+}
+
 /*
  * Runs outside of coroutine context for OOB commands, but in coroutine
  * context for everything else.
@@ -146,6 +162,7 @@ QDict *qmp_dispatch(const QmpCommandList *cmds, QObject 
*request,
 QObject *id;
 QObject *ret = NULL;
 QDict *rsp = NULL;
+uint64_t ts_start = g_get_real_time();
 
 dict = qobject_to(QDict, request);
 if (!dict) {
@@ -270,5 +287,6 @@ out:
 qdict_put_obj(rsp, "id", qobject_ref(id));
 }
 
+add_timestamps(rsp, ts_start, g_get_real_time());
 return rsp;
 }
-- 
2.25.1




Re: [patch v0] qapi/qmp: Add timestamps to qmp command responses.

2022-09-27 Thread Denis Plotnikov



On 27.09.2022 09:04, Markus Armbruster wrote:

Daniel P. Berrangé  writes:


On Mon, Sep 26, 2022 at 12:59:40PM +0300, Denis Plotnikov wrote:

Add "start" & "end" timestamps to qmp command responses.
It's disabled by default, but can be enabled with 'timestamp=on'
monitor's parameter, e.g.:
 -chardev  socket,id=mon1,path=/tmp/qmp.socket,server=on,wait=off
 -mon chardev=mon1,mode=control,timestamp=on

I'm not convinced a cmdline flag is the right approach here.

I think it ought be something defined by the QMP spec.

The QMP spec is docs/interop/qmp-spec.txt.  The feature needs to be
defined there regardless of how we control it.

ok, thanks for pointing out



The "QMP" greeting should report "timestamp" capabilities.

The 'qmp_capabilities' command can be used to turn on this
capability for all commands henceforth.

Yes, this is how optional QMP protocol features should be controlled.

Bonus: control is per connection, not just globally.


As an option extra, the 'execute' command could gain a
parameter to allow this to be requested for only an
individual command.

Needs a use case.


Alternatively we could say the overhead of adding the timestamps
is small enough that we just add this unconditionally for
everything henceforth, with no opt-in/opt-out.

Yes, because the extension is backwards compatible.


Maybe it's worth always sending the timestamps in the response, if that
doesn't contradict anything and doesn't bring any unnecessary data overhead.


On the other hand, turning it on via QMP capabilities seems to be a more
flexible solution.




Aside: qmp-spec.txt could be clearer on what that means.


Example of result:

 ./qemu/scripts/qmp/qmp-shell /tmp/qmp.socket

 (QEMU) query-status
 {"end": {"seconds": 1650367305, "microseconds": 831032},
  "start": {"seconds": 1650367305, "microseconds": 831012},
  "return": {"status": "running", "singlestep": false, "running": true}}

The response of the qmp command contains the start & end time of
the qmp command processing.

Seconds and microseconds since when?  The update to qmp-spec.txt should
tell.

Why split the time into seconds and microseconds?  If you use
microseconds since the Unix epoch (1970-01-01 UTC), 64 bit unsigned will
result in a year 586524 problem:

 $ date --date "@`echo '2^64/100' | bc`"
 Wed Jan 19 09:01:49 CET 586524

Even a mere 53 bits will last until 2255.
This is just for convenience; maybe it's too much and a timestamp in msec
is enough.



These times may be helpful for the management layer in understanding
the actual timeline of qmp command processing.

Can you explain the problem scenario in more detail.

Yes, please, because:


The mgmt app already knows when it sends the QMP command and knows
when it gets the QMP reply.  This covers the time the QMP command was
queued before processing (might be large if QMP is blocked on
another slow command), the processing time, and the time any
reply was queued before sending (ought to be small).

So IIUC, the value these fields add is that they let the mgmt
app extract only the command processing time, eliminating
any variance due to queueing before/after.
So the scenario is the following: we need a means to understand, from the
management layer perspective, what the timeline of the command execution
is. This is needed for problem resolving when a QMP command executes for
too long from the management layer point of view.
Specifically, the management layer sees the execution time as
"management_layer_internal_routine_time" + "qemu_dispatching_time" +
"qemu_qmp_command_execution_time". The suggested QMP command timestamps
give "qemu_command_execution_time". The management layer calculates
"management_layer_internal_routine_time" internally. Using those two
things we can calculate "qemu_dispatching_time" and decide where the
potential delays come from. This gives us a direction for further
problem investigation.
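
To make the decomposition above concrete, a minimal sketch (all identifiers
are illustrative and do not come from any real API; times are assumed to be
in microseconds):

    #include <stdint.h>

    /* total_seen_by_mgmt: request sent -> reply received, measured by the
     * management layer.
     * mgmt_internal_time: time spent inside the management layer itself.
     * qemu_exec_time: "end" - "start" from the QMP response. */
    static int64_t qemu_dispatching_time(int64_t total_seen_by_mgmt,
                                         int64_t mgmt_internal_time,
                                         int64_t qemu_exec_time)
    {
        return total_seen_by_mgmt - mgmt_internal_time - qemu_exec_time;
    }

An unusually large result points at queueing or dispatching delays rather
than at the command itself.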



Suggested-by: Andrey Ryabinin 
Signed-off-by: Denis Plotnikov 




[patch v0] qapi/qmp: Add timestamps to qmp command responses.

2022-09-26 Thread Denis Plotnikov
Add "start" & "end" timestamps to qmp command responses.
It's disabled by default, but can be enabled with the 'timestamp=on'
monitor parameter, e.g.:
-chardev  socket,id=mon1,path=/tmp/qmp.socket,server=on,wait=off
-mon chardev=mon1,mode=control,timestamp=on

Example of result:

./qemu/scripts/qmp/qmp-shell /tmp/qmp.socket

(QEMU) query-status
{"end": {"seconds": 1650367305, "microseconds": 831032},
 "start": {"seconds": 1650367305, "microseconds": 831012},
 "return": {"status": "running", "singlestep": false, "running": true}}

The response of the qmp command contains the start & end time of
the qmp command processing.

These times may be helpful for the management layer in understanding
the actual timeline of qmp command processing.

Suggested-by: Andrey Ryabinin 
Signed-off-by: Denis Plotnikov 
---
 include/monitor/monitor.h   |  2 +-
 include/qapi/qmp/dispatch.h |  2 +-
 monitor/monitor-internal.h  |  1 +
 monitor/monitor.c   |  9 -
 monitor/qmp.c   |  5 +++--
 qapi/control.json   |  3 +++
 qapi/qmp-dispatch.c | 28 +++-
 qga/main.c  |  2 +-
 stubs/monitor-core.c|  2 +-
 tests/unit/test-qmp-cmds.c  |  6 +++---
 10 files changed, 49 insertions(+), 11 deletions(-)

diff --git a/include/monitor/monitor.h b/include/monitor/monitor.h
index a4b40e8391db4..2a18e9ee34bc2 100644
--- a/include/monitor/monitor.h
+++ b/include/monitor/monitor.h
@@ -19,7 +19,7 @@ bool monitor_cur_is_qmp(void);
 
 void monitor_init_globals(void);
 void monitor_init_globals_core(void);
-void monitor_init_qmp(Chardev *chr, bool pretty, Error **errp);
+void monitor_init_qmp(Chardev *chr, bool pretty, bool timestamp, Error **errp);
 void monitor_init_hmp(Chardev *chr, bool use_readline, Error **errp);
 int monitor_init(MonitorOptions *opts, bool allow_hmp, Error **errp);
 int monitor_init_opts(QemuOpts *opts, Error **errp);
diff --git a/include/qapi/qmp/dispatch.h b/include/qapi/qmp/dispatch.h
index 1e4240fd0dbc0..d07f5764271be 100644
--- a/include/qapi/qmp/dispatch.h
+++ b/include/qapi/qmp/dispatch.h
@@ -56,7 +56,7 @@ const char *qmp_command_name(const QmpCommand *cmd);
 bool qmp_has_success_response(const QmpCommand *cmd);
 QDict *qmp_error_response(Error *err);
 QDict *qmp_dispatch(const QmpCommandList *cmds, QObject *request,
-bool allow_oob, Monitor *cur_mon);
+bool allow_oob, bool timestamp, Monitor *cur_mon);
 bool qmp_is_oob(const QDict *dict);
 
 typedef void (*qmp_cmd_callback_fn)(const QmpCommand *cmd, void *opaque);
diff --git a/monitor/monitor-internal.h b/monitor/monitor-internal.h
index caa2e90ef22a4..69425a7bc8152 100644
--- a/monitor/monitor-internal.h
+++ b/monitor/monitor-internal.h
@@ -136,6 +136,7 @@ typedef struct {
 Monitor common;
 JSONMessageParser parser;
 bool pretty;
+bool timestamp;
 /*
  * When a client connects, we're in capabilities negotiation mode.
  * @commands is _cap_negotiation_commands then.  When command
diff --git a/monitor/monitor.c b/monitor/monitor.c
index 86949024f643a..85a0b6498dbc1 100644
--- a/monitor/monitor.c
+++ b/monitor/monitor.c
@@ -726,7 +726,7 @@ int monitor_init(MonitorOptions *opts, bool allow_hmp, 
Error **errp)
 
 switch (opts->mode) {
 case MONITOR_MODE_CONTROL:
-monitor_init_qmp(chr, opts->pretty, &local_err);
+monitor_init_qmp(chr, opts->pretty, opts->timestamp, &local_err);
 break;
 case MONITOR_MODE_READLINE:
 if (!allow_hmp) {
@@ -737,6 +737,10 @@ int monitor_init(MonitorOptions *opts, bool allow_hmp, 
Error **errp)
 error_setg(errp, "'pretty' is not compatible with HMP monitors");
 return -1;
 }
+if (opts->timestamp) {
+error_setg(errp, "'timestamp' is not compatible with HMP 
monitors");
+return -1;
+}
monitor_init_hmp(chr, true, &local_err);
 break;
 default:
@@ -782,6 +786,9 @@ QemuOptsList qemu_mon_opts = {
 },{
 .name = "pretty",
 .type = QEMU_OPT_BOOL,
+},{
+.name = "timestamp",
+.type = QEMU_OPT_BOOL,
 },
 { /* end of list */ }
 },
diff --git a/monitor/qmp.c b/monitor/qmp.c
index 092c527b6fc9c..fd487fee9f850 100644
--- a/monitor/qmp.c
+++ b/monitor/qmp.c
@@ -142,7 +142,7 @@ static void monitor_qmp_dispatch(MonitorQMP *mon, QObject 
*req)
 QDict *error;
 
 rsp = qmp_dispatch(mon->commands, req, qmp_oob_enabled(mon),
-   &mon->common);
+   mon->timestamp, &mon->common);
 
 if (mon->commands == &qmp_cap_negotiation_commands) {
 error = qdict_get_qdict(rsp, "error");
@@ -495,7 +495,7 @@ static void monitor_

[PING][Ping] [PATCH v1 0/2] vl: flush all task from rcu queue before exiting

2021-11-24 Thread Denis Plotnikov

ping ping

On 19.11.2021 12:42, Denis Plotnikov wrote:


Ping!

On 15.11.2021 12:41, Denis Plotnikov wrote:

v1 -> v0:
  * move monitor cleanup to the very end of qemu cleanup [Paolo]

The goal is to notify the management layer about device destruction on qemu 
shutdown.
Without this series the DEVICE_DELETED event may not be sent because of stuck
tasks in the rcu thread. The rcu tasks may get stuck on qemu shutdown because
the rcu thread does not always have enough time to run them.


Denis Plotnikov (2):
   monitor: move monitor destruction to the very end of qemu cleanup
   vl: flush all task from rcu queue before exiting

  include/qemu/rcu.h |  1 +
  monitor/monitor.c  |  6 ++
  softmmu/runstate.c |  4 +++-
  util/rcu.c | 12 
  4 files changed, 22 insertions(+), 1 deletion(-)



[Ping] [PATCH v1 0/2] vl: flush all task from rcu queue before exiting

2021-11-19 Thread Denis Plotnikov

Ping!

On 15.11.2021 12:41, Denis Plotnikov wrote:

v1 -> v0:
  * move monitor cleanup to the very end of qemu cleanup [Paolo]

The goal is to notify the management layer about device destruction on qemu 
shutdown.
Without this series the DEVICE_DELETED event may not be sent because of stuck
tasks in the rcu thread. The rcu tasks may get stuck on qemu shutdown because
the rcu thread does not always have enough time to run them.


Denis Plotnikov (2):
   monitor: move monitor destruction to the very end of qemu cleanup
   vl: flush all task from rcu queue before exiting

  include/qemu/rcu.h |  1 +
  monitor/monitor.c  |  6 ++
  softmmu/runstate.c |  4 +++-
  util/rcu.c | 12 
  4 files changed, 22 insertions(+), 1 deletion(-)



[PATCH v1 2/2] vl: flush all task from rcu queue before exiting

2021-11-15 Thread Denis Plotnikov
The device destruction may overlap with qemu shutdown.
In this case some management layer, which requested a device unplug and
is waiting for the DEVICE_DELETED event, may never get this event.

This happens because device_finalize() may never be called on qemu shutdown
for some devices using address_space_destroy(). The latter is called from
the rcu thread.
On qemu shutdown, not all rcu callbacks may be called because the rcu thread
may not have enough time to converge before the qemu main thread exits.

To resolve this issue, this patch makes the rcu thread finish all its callbacks
explicitly by calling a new rcu interface function right before
the qemu main thread exits.
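
For illustration only (a toy device type, not code from this patch): work
deferred through call_rcu1() stays queued until the rcu thread runs it, so
anything that was supposed to happen in the callback, such as the finalization
that emits DEVICE_DELETED, is lost if the main thread exits first.

    #include "qemu/osdep.h"
    #include "qemu/rcu.h"

    typedef struct MyDev {
        struct rcu_head rcu;
        /* ... */
    } MyDev;

    static void mydev_reclaim(struct rcu_head *head)
    {
        MyDev *d = container_of(head, MyDev, rcu);
        /* In real code this is where finalization (and e.g. the
         * DEVICE_DELETED event) would eventually happen. */
        g_free(d);
    }

    static void mydev_unplug(MyDev *d)
    {
        /* Deferred: runs in the rcu thread at some later point.  Without
         * flush_rcu() at shutdown it may never run at all. */
        call_rcu1(&d->rcu, mydev_reclaim);
    }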

Signed-off-by: Denis Plotnikov 
---
 include/qemu/rcu.h |  1 +
 softmmu/runstate.c |  2 ++
 util/rcu.c | 12 
 3 files changed, 15 insertions(+)

diff --git a/include/qemu/rcu.h b/include/qemu/rcu.h
index 515d327cf11c..f7fbdc3781e5 100644
--- a/include/qemu/rcu.h
+++ b/include/qemu/rcu.h
@@ -134,6 +134,7 @@ struct rcu_head {
 
 extern void call_rcu1(struct rcu_head *head, RCUCBFunc *func);
 extern void drain_call_rcu(void);
+extern void flush_rcu(void);
 
 /* The operands of the minus operator must have the same type,
  * which must be the one that we specify in the cast.
diff --git a/softmmu/runstate.c b/softmmu/runstate.c
index 8d29dd2c00e2..3f833678f6eb 100644
--- a/softmmu/runstate.c
+++ b/softmmu/runstate.c
@@ -821,6 +821,8 @@ void qemu_cleanup(void)
 audio_cleanup();
 qemu_chr_cleanup();
 user_creatable_cleanup();
+/* finish all the tasks from rcu queue before exiting */
+flush_rcu();
 monitor_cleanup();
 /* TODO: unref root container, check all devices are ok */
 }
diff --git a/util/rcu.c b/util/rcu.c
index 13ac0f75cb2a..f047f8ee8d16 100644
--- a/util/rcu.c
+++ b/util/rcu.c
@@ -348,6 +348,18 @@ void drain_call_rcu(void)
 
 }
 
+/*
+ * This function drains the rcu queue until there are no tasks left to do,
+ * and is aimed at the cases when one needs to ensure that no work hangs
+ * in the rcu thread before proceeding, e.g. on qemu shutdown.
+ */
+void flush_rcu(void)
+{
+while (qatomic_read(&rcu_call_count) > 0) {
+drain_call_rcu();
+}
+}
+
 void rcu_register_thread(void)
 {
 assert(rcu_reader.ctr == 0);
-- 
2.25.1




[PATCH v1 1/2] monitor: move monitor destruction to the very end of qemu cleanup

2021-11-15 Thread Denis Plotnikov
This is needed to keep sending DEVICE_DELETED events on qemu cleanup.
The event may be emitted from the rcu thread, and we're going to flush the rcu
queue explicitly before qemu exits in the next patch. So move the monitor
destruction to the very end of qemu cleanup to be able to send all the events.

Signed-off-by: Denis Plotnikov 
---
 monitor/monitor.c  | 6 ++
 softmmu/runstate.c | 2 +-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/monitor/monitor.c b/monitor/monitor.c
index 21c7a68758f5..b04ae4850db2 100644
--- a/monitor/monitor.c
+++ b/monitor/monitor.c
@@ -605,11 +605,17 @@ void monitor_data_init(Monitor *mon, bool is_qmp, bool 
skip_flush,
 mon->outbuf = g_string_new(NULL);
 mon->skip_flush = skip_flush;
 mon->use_io_thread = use_io_thread;
+/*
+ * take an extra ref to prevent the monitor's chardev
+ * from being destroyed in qemu_chr_cleanup()
+ */
+object_ref(OBJECT(mon->chr.chr));
 }
 
 void monitor_data_destroy(Monitor *mon)
 {
 g_free(mon->mon_cpu_path);
+object_unref(OBJECT(mon->chr.chr));
 qemu_chr_fe_deinit(&mon->chr, false);
 if (monitor_is_qmp(mon)) {
 monitor_data_destroy_qmp(container_of(mon, MonitorQMP, common));
diff --git a/softmmu/runstate.c b/softmmu/runstate.c
index 10d9b7365aa7..8d29dd2c00e2 100644
--- a/softmmu/runstate.c
+++ b/softmmu/runstate.c
@@ -819,8 +819,8 @@ void qemu_cleanup(void)
 tpm_cleanup();
 net_cleanup();
 audio_cleanup();
-monitor_cleanup();
 qemu_chr_cleanup();
 user_creatable_cleanup();
+monitor_cleanup();
 /* TODO: unref root container, check all devices are ok */
 }
-- 
2.25.1




[PATCH v1 0/2] vl: flush all task from rcu queue before exiting

2021-11-15 Thread Denis Plotnikov
v1 -> v0:
 * move monitor cleanup to the very end of qemu cleanup [Paolo]

The goal is to notify the management layer about device destruction on qemu 
shutdown.
Without this series the DEVICE_DELETED event may not be sent because of stuck
tasks in the rcu thread. The rcu tasks may get stuck on qemu shutdown because
the rcu thread does not always have enough time to run them.


Denis Plotnikov (2):
  monitor: move monitor destruction to the very end of qemu cleanup
  vl: flush all task from rcu queue before exiting

 include/qemu/rcu.h |  1 +
 monitor/monitor.c  |  6 ++
 softmmu/runstate.c |  4 +++-
 util/rcu.c | 12 
 4 files changed, 22 insertions(+), 1 deletion(-)

-- 
2.25.1




Re: [Ping][PATCH v0] vl: flush all task from rcu queue before exiting

2021-11-10 Thread Denis Plotnikov



On 09.11.2021 20:46, Paolo Bonzini wrote:

On 11/9/21 08:23, Denis Plotnikov wrote:

Ping ping!


Looks good, but can you explain why it's okay to call it before 
qemu_chr_cleanup() and user_creatable_cleanup()?


I think a better solution to the ordering problem would be:

  qemu_chr_cleanup();
  user_creatable_cleanup();
  flush_rcu();
  monitor_cleanup();

I agree, this looks better


with something like this:

diff --git a/chardev/char-fe.c b/chardev/char-fe.c
index 7789f7be9c..f0c3ea5447 100644
--- a/chardev/char-fe.c
+++ b/chardev/char-fe.c
@@ -195,6 +195,7 @@ bool qemu_chr_fe_init(CharBackend *b,
 int tag = 0;

 if (s) {
+    object_ref(OBJECT(s));
 if (CHARDEV_IS_MUX(s)) {
 MuxChardev *d = MUX_CHARDEV(s);

@@ -241,6 +242,7 @@ void qemu_chr_fe_deinit(CharBackend *b, bool del)
 } else {
 object_unref(obj);
 }
+    object_unref(obj);
 }
 b->chr = NULL;
 }

to keep the chardev live between qemu_chr_cleanup() and 
monitor_cleanup().


but frankly speaking I don't understand why we have to do ref/unref in the
char-fe interface functions, instead of just ref/unref-ing the monitor's char
device directly like this:


diff --git a/monitor/monitor.c b/monitor/monitor.c
index 21c7a68758f5..3692a8e15268 100644
--- a/monitor/monitor.c
+++ b/monitor/monitor.c
@@ -611,6 +611,7 @@ void monitor_data_destroy(Monitor *mon)
 {
 g_free(mon->mon_cpu_path);
 qemu_chr_fe_deinit(&mon->chr, false);
+    object_unref(OBJECT(&mon->chr));
 if (monitor_is_qmp(mon)) {
 monitor_data_destroy_qmp(container_of(mon, MonitorQMP, common));
 } else {
@@ -737,6 +738,7 @@ int monitor_init(MonitorOptions *opts, bool 
allow_hmp, Error **errp)

 error_propagate(errp, local_err);
 return -1;
 }
+    object_ref(OBJECT(chr));
 return 0;
 }

May be this shows the intentions better?

Denis



Paolo





[Ping][PATCH v0] vl: flush all task from rcu queue before exiting

2021-11-08 Thread Denis Plotnikov

Ping ping!

On 02.11.2021 16:39, Denis Plotnikov wrote:

The device destruction may overlap with qemu shutdown.
In this case some management layer, which requested a device unplug and
is waiting for the DEVICE_DELETED event, may never get this event.

This happens because device_finalize() may never be called on qemu shutdown
for some devices using address_space_destroy(). The latter is called from
the rcu thread.
On qemu shutdown, not all rcu callbacks may be called because the rcu thread
may not have enough time to converge before the qemu main thread exits.

To resolve this issue, this patch makes the rcu thread finish all its callbacks
explicitly by calling a new rcu interface function right before
the qemu main thread exits.

Signed-off-by: Denis Plotnikov 
---
  include/qemu/rcu.h |  1 +
  softmmu/runstate.c |  3 +++
  util/rcu.c | 12 
  3 files changed, 16 insertions(+)

diff --git a/include/qemu/rcu.h b/include/qemu/rcu.h
index 515d327cf11c..f7fbdc3781e5 100644
--- a/include/qemu/rcu.h
+++ b/include/qemu/rcu.h
@@ -134,6 +134,7 @@ struct rcu_head {
  
  extern void call_rcu1(struct rcu_head *head, RCUCBFunc *func);

  extern void drain_call_rcu(void);
+extern void flush_rcu(void);
  
  /* The operands of the minus operator must have the same type,

   * which must be the one that we specify in the cast.
diff --git a/softmmu/runstate.c b/softmmu/runstate.c
index 10d9b7365aa7..28f319a97a2b 100644
--- a/softmmu/runstate.c
+++ b/softmmu/runstate.c
@@ -822,5 +822,8 @@ void qemu_cleanup(void)
  monitor_cleanup();
  qemu_chr_cleanup();
  user_creatable_cleanup();
+
+/* finish all the tasks from rcu queue before exiting */
+flush_rcu();
  /* TODO: unref root container, check all devices are ok */
  }
diff --git a/util/rcu.c b/util/rcu.c
index 13ac0f75cb2a..f047f8ee8d16 100644
--- a/util/rcu.c
+++ b/util/rcu.c
@@ -348,6 +348,18 @@ void drain_call_rcu(void)
  
  }
  
+/*

+ * This function drains rcu queue until there are no tasks to do left
+ * and aims to the cases when one needs to ensure that no work hang
+ * in rcu thread before proceeding, e.g. on qemu shutdown.
+ */
+void flush_rcu(void)
+{
+while (qatomic_read(&rcu_call_count) > 0) {
+drain_call_rcu();
+}
+}
+
  void rcu_register_thread(void)
  {
  assert(rcu_reader.ctr == 0);




Re: [PATCH v0] vl: flush all task from rcu queue before exiting

2021-11-02 Thread Denis Plotnikov



On 02.11.2021 16:39, Denis Plotnikov wrote:

The device destruction may overlap with qemu shutdown.
In this case some management layer, which requested a device unplug and
is waiting for the DEVICE_DELETED event, may never get this event.

This happens because device_finalize() may never be called on qemu shutdown
for some devices using address_space_destroy(). The latter is called from
the rcu thread.
On qemu shutdown, not all rcu callbacks may be called because the rcu thread
may not have enough time to converge before the qemu main thread exits.

To resolve this issue, this patch makes the rcu thread finish all its callbacks
explicitly by calling a new rcu interface function right before
the qemu main thread exits.

Signed-off-by: Denis Plotnikov 
---
  include/qemu/rcu.h |  1 +
  softmmu/runstate.c |  3 +++
  util/rcu.c | 12 
  3 files changed, 16 insertions(+)

diff --git a/include/qemu/rcu.h b/include/qemu/rcu.h
index 515d327cf11c..f7fbdc3781e5 100644
--- a/include/qemu/rcu.h
+++ b/include/qemu/rcu.h
@@ -134,6 +134,7 @@ struct rcu_head {
  
  extern void call_rcu1(struct rcu_head *head, RCUCBFunc *func);

  extern void drain_call_rcu(void);
+extern void flush_rcu(void);
  
  /* The operands of the minus operator must have the same type,

   * which must be the one that we specify in the cast.
diff --git a/softmmu/runstate.c b/softmmu/runstate.c
index 10d9b7365aa7..28f319a97a2b 100644
--- a/softmmu/runstate.c
+++ b/softmmu/runstate.c
@@ -822,5 +822,8 @@ void qemu_cleanup(void)
actually, flush_rcu() should be here before monitor_cleanup to send 
DEVICE_DELETED

  monitor_cleanup();
  qemu_chr_cleanup();
  user_creatable_cleanup();
+
+/* finish all the tasks from rcu queue before exiting */
+flush_rcu();
  /* TODO: unref root container, check all devices are ok */
  }
diff --git a/util/rcu.c b/util/rcu.c
index 13ac0f75cb2a..f047f8ee8d16 100644
--- a/util/rcu.c
+++ b/util/rcu.c
@@ -348,6 +348,18 @@ void drain_call_rcu(void)
  
  }
  
+/*

+ * This function drains rcu queue until there are no tasks to do left
+ * and aims to the cases when one needs to ensure that no work hang
+ * in rcu thread before proceeding, e.g. on qemu shutdown.
+ */
+void flush_rcu(void)
+{
+while (qatomic_read(&rcu_call_count) > 0) {
+drain_call_rcu();
+}
+}
+
  void rcu_register_thread(void)
  {
  assert(rcu_reader.ctr == 0);




[PATCH v0] vl: flush all task from rcu queue before exiting

2021-11-02 Thread Denis Plotnikov
The device destruction may overlap with qemu shutdown.
In this case some management layer, which requested a device unplug and
is waiting for the DEVICE_DELETED event, may never get this event.

This happens because device_finalize() may never be called on qemu shutdown
for some devices using address_space_destroy(). The latter is called from
the rcu thread.
On qemu shutdown, not all rcu callbacks may be called because the rcu thread
may not have enough time to converge before the qemu main thread exits.

To resolve this issue, this patch makes the rcu thread finish all its callbacks
explicitly by calling a new rcu interface function right before
the qemu main thread exits.

Signed-off-by: Denis Plotnikov 
---
 include/qemu/rcu.h |  1 +
 softmmu/runstate.c |  3 +++
 util/rcu.c | 12 
 3 files changed, 16 insertions(+)

diff --git a/include/qemu/rcu.h b/include/qemu/rcu.h
index 515d327cf11c..f7fbdc3781e5 100644
--- a/include/qemu/rcu.h
+++ b/include/qemu/rcu.h
@@ -134,6 +134,7 @@ struct rcu_head {
 
 extern void call_rcu1(struct rcu_head *head, RCUCBFunc *func);
 extern void drain_call_rcu(void);
+extern void flush_rcu(void);
 
 /* The operands of the minus operator must have the same type,
  * which must be the one that we specify in the cast.
diff --git a/softmmu/runstate.c b/softmmu/runstate.c
index 10d9b7365aa7..28f319a97a2b 100644
--- a/softmmu/runstate.c
+++ b/softmmu/runstate.c
@@ -822,5 +822,8 @@ void qemu_cleanup(void)
 monitor_cleanup();
 qemu_chr_cleanup();
 user_creatable_cleanup();
+
+/* finish all the tasks from rcu queue before exiting */
+flush_rcu();
 /* TODO: unref root container, check all devices are ok */
 }
diff --git a/util/rcu.c b/util/rcu.c
index 13ac0f75cb2a..f047f8ee8d16 100644
--- a/util/rcu.c
+++ b/util/rcu.c
@@ -348,6 +348,18 @@ void drain_call_rcu(void)
 
 }
 
+/*
+ * This function drains the rcu queue until there are no tasks left to do,
+ * and is aimed at the cases when one needs to ensure that no work hangs
+ * in the rcu thread before proceeding, e.g. on qemu shutdown.
+ */
+void flush_rcu(void)
+{
+while (qatomic_read(&rcu_call_count) > 0) {
+drain_call_rcu();
+}
+}
+
 void rcu_register_thread(void)
 {
 assert(rcu_reader.ctr == 0);
-- 
2.25.1




[PATCH v0 1/2] vhost-user-blk: add a new vhost-user-virtio-blk type

2021-10-04 Thread Denis Plotnikov
The main reason for adding a new type is to make cross-device live migration
between "virtio-blk" and "vhost-user-blk" devices possible in both directions.

It might be useful for the cases when a slow block layer should be replaced
with a more performant one on a running VM without stopping it, i.e. with very
low downtime comparable to that of migration.

It's possible to achieve that for two reasons:

1.The VMStates of "virtio-blk" and "vhost-user-blk" are almost the same.
  They consist of the identical VMSTATE_VIRTIO_DEVICE and differ from
  each other in the values of migration service fields only.
2.The device driver used in the guest is the same: virtio-blk

The new type uses virtio-blk VMState instead of vhost-user-blk specific
VMstate, also it implements migration save/load callbacks to be compatible
with migration stream produced by "virtio-blk" device.

Adding the new vhost-user-blk type instead of modifying the existing one
is convenient. It makes it easy to distinguish the new virtio-blk-compatible
vhost-user-blk device from the existing non-compatible one using the qemu
machinery without any other modifications. That gives all the variety of qemu
device related constraints out of the box.

Signed-off-by: Denis Plotnikov 
---
 hw/block/vhost-user-blk.c  | 63 ++
 include/hw/virtio/vhost-user-blk.h |  2 +
 2 files changed, 65 insertions(+)

diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
index ba13cb87e520..877fe54e891f 100644
--- a/hw/block/vhost-user-blk.c
+++ b/hw/block/vhost-user-blk.c
@@ -30,6 +30,7 @@
 #include "hw/virtio/virtio-access.h"
 #include "sysemu/sysemu.h"
 #include "sysemu/runstate.h"
+#include "migration/qemu-file-types.h"
 
 #define REALIZE_CONNECTION_RETRIES 3
 
@@ -612,9 +613,71 @@ static const TypeInfo vhost_user_blk_info = {
 .class_init = vhost_user_blk_class_init,
 };
 
+/*
+ * this is the same as vmstate_virtio_blk
+ * we use it to allow virtio-blk <-> vhost-user-virtio-blk migration
+ */
+static const VMStateDescription vmstate_vhost_user_virtio_blk = {
+.name = "virtio-blk",
+.minimum_version_id = 2,
+.version_id = 2,
+.fields = (VMStateField[]) {
+VMSTATE_VIRTIO_DEVICE,
+VMSTATE_END_OF_LIST()
+},
+};
+
+static void vhost_user_virtio_blk_save(VirtIODevice *vdev, QEMUFile *f)
+{
+/*
+ * put a zero byte in the stream to be compatible with virtio-blk
+ */
+qemu_put_sbyte(f, 0);
+}
+
+static int vhost_user_virtio_blk_load(VirtIODevice *vdev, QEMUFile *f,
+  int version_id)
+{
+if (qemu_get_sbyte(f)) {
+/*
+ * on virtio-blk -> vhost-user-virtio-blk migration we don't expect
+ * to get any in-flight requests in the migration stream because
+ * we can't load them yet.
+ * TODO: consider putting those inflight requests to inflight region
+ */
+error_report("%s: can't load in-flight requests",
+ TYPE_VHOST_USER_VIRTIO_BLK);
+return -EINVAL;
+}
+
+return 0;
+}
+
+static void vhost_user_virtio_blk_class_init(ObjectClass *klass, void *data)
+{
+DeviceClass *dc = DEVICE_CLASS(klass);
+VirtioDeviceClass *vdc = VIRTIO_DEVICE_CLASS(klass);
+
+/* override vmstate of vhost_user_blk */
+dc->vmsd = &vmstate_vhost_user_virtio_blk;
+
+/* adding callbacks to be compatible with virtio-blk migration stream */
+vdc->save = vhost_user_virtio_blk_save;
+vdc->load = vhost_user_virtio_blk_load;
+}
+
+static const TypeInfo vhost_user_virtio_blk_info = {
+.name = TYPE_VHOST_USER_VIRTIO_BLK,
+.parent = TYPE_VHOST_USER_BLK,
+.instance_size = sizeof(VHostUserBlk),
+/* instance_init is the same as in parent type */
+.class_init = vhost_user_virtio_blk_class_init,
+};
+
 static void virtio_register_types(void)
 {
 type_register_static(&vhost_user_blk_info);
+type_register_static(&vhost_user_virtio_blk_info);
 }
 
 type_init(virtio_register_types)
diff --git a/include/hw/virtio/vhost-user-blk.h b/include/hw/virtio/vhost-user-blk.h
index 7c91f15040eb..d81f18d22596 100644
--- a/include/hw/virtio/vhost-user-blk.h
+++ b/include/hw/virtio/vhost-user-blk.h
@@ -23,6 +23,8 @@
 #include "qom/object.h"
 
 #define TYPE_VHOST_USER_BLK "vhost-user-blk"
+#define TYPE_VHOST_USER_VIRTIO_BLK "vhost-user-virtio-blk"
+
 OBJECT_DECLARE_SIMPLE_TYPE(VHostUserBlk, VHOST_USER_BLK)
 
 #define VHOST_USER_BLK_AUTO_NUM_QUEUES UINT16_MAX
-- 
2.25.1




[PATCH v0 2/2] vhost-user-blk-pci: add new pci device type to support vhost-user-virtio-blk

2021-10-04 Thread Denis Plotnikov
To allow the recently added vhost-user-virtio-blk to work via virtio-pci.

This patch refactors the vhost-user-blk-pci object model to reuse
the existing code.

Signed-off-by: Denis Plotnikov 
---
 hw/virtio/vhost-user-blk-pci.c | 43 +++---
 1 file changed, 40 insertions(+), 3 deletions(-)

diff --git a/hw/virtio/vhost-user-blk-pci.c b/hw/virtio/vhost-user-blk-pci.c
index 33b404d8a225..2f68296af22f 100644
--- a/hw/virtio/vhost-user-blk-pci.c
+++ b/hw/virtio/vhost-user-blk-pci.c
@@ -34,10 +34,18 @@ typedef struct VHostUserBlkPCI VHostUserBlkPCI;
 /*
  * vhost-user-blk-pci: This extends VirtioPCIProxy.
  */
+#define TYPE_VHOST_USER_BLK_PCI_ABSTRACT "vhost-user-blk-pci-abstract-base"
+#define VHOST_USER_BLK_PCI_ABSTRACT(obj) \
+OBJECT_CHECK(VHostUserBlkPCI, (obj), TYPE_VHOST_USER_BLK_PCI_ABSTRACT)
+
 #define TYPE_VHOST_USER_BLK_PCI "vhost-user-blk-pci-base"
 DECLARE_INSTANCE_CHECKER(VHostUserBlkPCI, VHOST_USER_BLK_PCI,
  TYPE_VHOST_USER_BLK_PCI)
 
+#define TYPE_VHOST_USER_VIRTIO_BLK_PCI "vhost-user-virtio-blk-pci-base"
+#define VHOST_USER_VIRTIO_BLK_PCI(obj) \
+OBJECT_CHECK(VHostUserBlkPCI, (obj), TYPE_VHOST_USER_VIRTIO_BLK_PCI)
+
 struct VHostUserBlkPCI {
 VirtIOPCIProxy parent_obj;
 VHostUserBlk vdev;
@@ -52,7 +60,7 @@ static Property vhost_user_blk_pci_properties[] = {
 
 static void vhost_user_blk_pci_realize(VirtIOPCIProxy *vpci_dev, Error **errp)
 {
-VHostUserBlkPCI *dev = VHOST_USER_BLK_PCI(vpci_dev);
+VHostUserBlkPCI *dev = VHOST_USER_BLK_PCI_ABSTRACT(vpci_dev);
 DeviceState *vdev = DEVICE(&dev->vdev);
 
 if (dev->vdev.num_queues == VHOST_USER_BLK_AUTO_NUM_QUEUES) {
@@ -66,7 +74,8 @@ static void vhost_user_blk_pci_realize(VirtIOPCIProxy 
*vpci_dev, Error **errp)
 qdev_realize(vdev, BUS(&vpci_dev->bus), errp);
 }
 
-static void vhost_user_blk_pci_class_init(ObjectClass *klass, void *data)
+static void vhost_user_blk_pci_abstract_class_init(ObjectClass *klass,
+   void *data)
 {
 DeviceClass *dc = DEVICE_CLASS(klass);
 VirtioPCIClass *k = VIRTIO_PCI_CLASS(klass);
@@ -81,6 +90,12 @@ static void vhost_user_blk_pci_class_init(ObjectClass 
*klass, void *data)
 pcidev_k->class_id = PCI_CLASS_STORAGE_SCSI;
 }
 
+static const VirtioPCIDeviceTypeInfo vhost_user_blk_pci_abstract_info = {
+.base_name  = TYPE_VHOST_USER_BLK_PCI_ABSTRACT,
+.instance_size  = sizeof(VHostUserBlkPCI),
+.class_init = vhost_user_blk_pci_abstract_class_init,
+};
+
 static void vhost_user_blk_pci_instance_init(Object *obj)
 {
 VHostUserBlkPCI *dev = VHOST_USER_BLK_PCI(obj);
@@ -92,18 +107,40 @@ static void vhost_user_blk_pci_instance_init(Object *obj)
 }
 
 static const VirtioPCIDeviceTypeInfo vhost_user_blk_pci_info = {
+.parent  = TYPE_VHOST_USER_BLK_PCI_ABSTRACT,
 .base_name   = TYPE_VHOST_USER_BLK_PCI,
 .generic_name= "vhost-user-blk-pci",
 .transitional_name   = "vhost-user-blk-pci-transitional",
 .non_transitional_name   = "vhost-user-blk-pci-non-transitional",
 .instance_size  = sizeof(VHostUserBlkPCI),
 .instance_init  = vhost_user_blk_pci_instance_init,
-.class_init = vhost_user_blk_pci_class_init,
+};
+
+static void vhost_user_virtio_blk_pci_instance_init(Object *obj)
+{
+VHostUserBlkPCI *dev = VHOST_USER_VIRTIO_BLK_PCI(obj);
+
+virtio_instance_init_common(obj, &dev->vdev, sizeof(dev->vdev),
+TYPE_VHOST_USER_VIRTIO_BLK);
+object_property_add_alias(obj, "bootindex", OBJECT(&dev->vdev),
+  "bootindex");
+}
+
+static const VirtioPCIDeviceTypeInfo vhost_user_virtio_blk_pci_info = {
+.parent  = TYPE_VHOST_USER_BLK_PCI_ABSTRACT,
+.base_name   = TYPE_VHOST_USER_VIRTIO_BLK_PCI,
+.generic_name= "vhost-user-virtio-blk-pci",
+.transitional_name   = "vhost-user-virtio-blk-pci-transitional",
+.non_transitional_name   = "vhost-user-virtio-blk-pci-non-transitional",
+.instance_size  = sizeof(VHostUserBlkPCI),
+.instance_init  = vhost_user_virtio_blk_pci_instance_init,
 };
 
 static void vhost_user_blk_pci_register(void)
 {
+virtio_pci_types_register(&vhost_user_blk_pci_abstract_info);
 virtio_pci_types_register(&vhost_user_blk_pci_info);
+virtio_pci_types_register(&vhost_user_virtio_blk_pci_info);
 }
 
 type_init(vhost_user_blk_pci_register)
-- 
2.25.1




[PATCH v0 0/2] virtio-blk and vhost-user-blk cross-device migration

2021-10-04 Thread Denis Plotnikov
It might be useful for the cases when a slow block layer should be replaced
with a more performant one on a running VM without stopping it, i.e. with very
low downtime comparable to that of migration.

It's possible to achieve that for two reasons:

1.The VMStates of "virtio-blk" and "vhost-user-blk" are almost the same.
  They consist of the identical VMSTATE_VIRTIO_DEVICE and differ from
  each other in the values of migration service fields only.
2.The device driver used in the guest is the same: virtio-blk

In the series cross-migration is achieved by adding a new type.
The new type uses virtio-blk VMState instead of vhost-user-blk specific
VMstate, also it implements migration save/load callbacks to be compatible
with migration stream produced by "virtio-blk" device.

Adding the new type instead of modifying the existing one is convenient.
It makes it easy to distinguish the new virtio-blk-compatible vhost-user-blk
device from the existing non-compatible one using the qemu machinery without
any other modifications. That gives all the variety of qemu device related
constraints out of the box.

0001: adds new type "vhost-user-virtio-blk"
0002: add new type "vhost-user-virtio-blk-pci"

Denis Plotnikov (2):
  vhost-user-blk: add a new vhost-user-virtio-blk type
  vhost-user-blk-pci: add new pci device type to support
vhost-user-virtio-blk

 hw/block/vhost-user-blk.c  | 63 ++
 hw/virtio/vhost-user-blk-pci.c | 43 ++--
 include/hw/virtio/vhost-user-blk.h |  2 +
 3 files changed, 105 insertions(+), 3 deletions(-)

-- 
2.25.1




[PATCH v0] machine: remove non existent device tuning

2021-09-09 Thread Denis Plotnikov
Device "vhost-blk-device" doesn't exist in qemu yet.
So, its compatibility tuning is meaningless.

The line with the non-existent device was introduced with the
"1bf8a989a566b virtio: make seg_max virtqueue size dependent"
patch by mistake. The original patch was meant to set the
"seg-max-adjust" property for "vhost-scsi" device instead of
"vhost-blk-device".

So now, we have "seg-max-adjust" property enabled for all machine
types using binaries with implemented "seg-max-adjust".

Replacing "vhost-blk-device" with "vhost-scsi" instead of removing
the line seems to be a bad idea. Now, because of the mistake, "seg-max"
for "vhost-scsi" device is queue-size dependent even for machine types
using "hw_compat_4_2" and older on new binaries.
Thus, now we have two behaviors:
* on old binaries (before the original patch) - seg_max is always 126
* on new binaries - seg_max is queue-size dependent
Replacing the line will split the behavior of new binaries into two.
This would make an investigation in case of related problems harder.

To not make things worse, this patch just removes the line
to keep behavior like it is now.

Signed-off-by: Denis Plotnikov 
---
 hw/core/machine.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/hw/core/machine.c b/hw/core/machine.c
index 067f42b528fd..d4f70cc01af0 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -87,7 +87,6 @@ GlobalProperty hw_compat_4_2[] = {
 { "virtio-blk-device", "x-enable-wce-if-config-wce", "off" },
 { "virtio-blk-device", "seg-max-adjust", "off"},
 { "virtio-scsi-device", "seg_max_adjust", "off"},
-{ "vhost-blk-device", "seg_max_adjust", "off"},
 { "usb-host", "suppress-remote-wake", "off" },
 { "usb-redir", "suppress-remote-wake", "off" },
 { "qxl", "revision", "4" },
-- 
2.25.1




Re: [PATCH v1 2/4] virtio: increase virtuqueue size for virtio-scsi and virtio-blk

2021-09-09 Thread Denis Plotnikov



On 09.09.2021 11:28, Stefano Garzarella wrote:

On Wed, Sep 08, 2021 at 06:20:49PM +0300, Denis Plotnikov wrote:


On 08.09.2021 16:22, Stefano Garzarella wrote:

Message bounced, I use new Denis's email address.

On Wed, Sep 08, 2021 at 03:17:16PM +0200, Stefano Garzarella wrote:

Hi Denis,
I just found this discussion since we still have the following line 
in hw/core/machine.c:

   { "vhost-blk-device", "seg_max_adjust", "off"}

IIUC it was a typo, and I think we should fix it since in the 
future we can have `vhost-blk-device`.


So, I think we have 2 options:
1. remove that line since for now is useless
2. replace with "vhost-scsi"

I'm not sure which is the best, what do you suggest?

Thanks,
Stefano


Hi Stefano

I prefer to just remove the line without replacing it. This will keep
things exactly as they are now.


Replacing with "vhost-scsi" will affect seg_max (limit to 126) on 
newly created VMs with machine types using "hw_compat_4_2" and older.


Now, because of the typo (error), their seg-max is queue-size 
dependent. I'm not sure, if replacing the line may cause any 
problems, for example on migration: source (queue-size 256, seg max 
254) -> destination (queue-size 256, seg max 126). But it will 
definitely introduce two different behaviors for VMs with 
"hw_compat_4_2" and older. So, I'd choose the lesser of two evils and 
keep things as they are now.




Yep, make sense. It was the same concern I had.

Do you want to send a patch to remove it with this explanation?


Yes, sure.

I'll do it today.

Denis



Thanks for the clarification,
Stefano





Re: [PATCH v1 2/4] virtio: increase virtuqueue size for virtio-scsi and virtio-blk

2021-09-08 Thread Denis Plotnikov



On 08.09.2021 16:22, Stefano Garzarella wrote:

Message bounced, I use new Denis's email address.

On Wed, Sep 08, 2021 at 03:17:16PM +0200, Stefano Garzarella wrote:

Hi Denis,
I just found this discussion since we still have the following line 
in hw/core/machine.c:

   { "vhost-blk-device", "seg_max_adjust", "off"}

IIUC it was a typo, and I think we should fix it since in the future 
we can have `vhost-blk-device`.


So, I think we have 2 options:
1. remove that line since for now is useless
2. replace with "vhost-scsi"

I'm not sure which is the best, what do you suggest?

Thanks,
Stefano


Hi Stefano

I prefer to just remove the line without replacing it. This will keep
things exactly as they are now.


Replacing with "vhost-scsi" will affect seg_max (limit to 126) on newly 
created VMs with machine types using "hw_compat_4_2" and older.


Now, because of the typo (error), their seg-max is queue-size dependent. 
I'm not sure, if replacing the line may cause any problems, for example 
on migration: source (queue-size 256, seg max 254) -> destination 
(queue-size 256, seg max 126). But it will definitely introduce two 
different behaviors for VMs with "hw_compat_4_2" and older. So, I'd 
choose the lesser of two evils and keep things as they are now.


Thanks!

Denis



On Fri, Feb 07, 2020 at 11:48:05AM +0300, Denis Plotnikov wrote:

On 05.02.2020 14:19, Stefan Hajnoczi wrote:

On Tue, Feb 04, 2020 at 12:59:04PM +0300, Denis Plotnikov wrote:

On 30.01.2020 17:58, Stefan Hajnoczi wrote:

On Wed, Jan 29, 2020 at 05:07:00PM +0300, Denis Plotnikov wrote:

The goal is to reduce the amount of requests issued by a guest on
1M reads/writes. This raises the performance up to 4% on that kind of
disk access pattern.

The maximum chunk size to be used for guest disk access is
limited with the seg_max parameter, which represents the max number of
pieces in the scatter-gather list in one guest disk request.

Since seg_max is virtqueue_size dependent, increasing the virtqueue
size increases seg_max, which, in turn, increases the maximum size
of data to be read/written from the guest disk.

More details in the original problem statement:
https://lists.gnu.org/archive/html/qemu-devel/2017-12/msg03721.html
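
As an illustration of the dependency described above (a simplified sketch
with a made-up helper name, not the exact QEMU code):

    #include <stdbool.h>
    #include <stdint.h>

    /*
     * With seg-max-adjust enabled, seg_max follows the virtqueue size minus
     * the two descriptors a request needs besides data; with it disabled
     * the old fixed default of 126 is kept.
     */
    static uint32_t seg_max_for(uint16_t queue_size, bool seg_max_adjust)
    {
        return seg_max_adjust ? (uint32_t)(queue_size - 2) : 128 - 2;
    }

    /*
     * queue-size 128, seg-max-adjust on  -> seg_max 126
     * queue-size 256, seg-max-adjust on  -> seg_max 254
     * any queue-size, seg-max-adjust off -> seg_max 126
     */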

Suggested-by: Denis V. Lunev 
Signed-off-by: Denis Plotnikov 
---
 hw/core/machine.c  | 3 +++
 include/hw/virtio/virtio.h | 2 +-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/hw/core/machine.c b/hw/core/machine.c
index 3e288bfceb..8bc401d8b7 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -28,6 +28,9 @@
 #include "hw/mem/nvdimm.h"
 GlobalProperty hw_compat_4_2[] = {
+    { "virtio-blk-device", "queue-size", "128"},
+    { "virtio-scsi-device", "virtqueue_size", "128"},
+    { "vhost-blk-device", "virtqueue_size", "128"},
vhost-blk-device?!  Who has this?  It's not in qemu.git so please 
omit

this line. ;-)

So in this case the line:

{ "vhost-blk-device", "seg_max_adjust", "off"},

introduced by my patch:

commit 1bf8a989a566b2ba41c197004ec2a02562a766a4
Author: Denis Plotnikov 
Date:   Fri Dec 20 17:09:04 2019 +0300

    virtio: make seg_max virtqueue size dependent

is also wrong. It should be:

{ "vhost-scsi-device", "seg_max_adjust", "off"},

Am I right?

It's just called "vhost-scsi":

include/hw/virtio/vhost-scsi.h:#define TYPE_VHOST_SCSI "vhost-scsi"


On the other hand, do you want to do this for the vhost-user-blk,
vhost-user-scsi, and vhost-scsi devices that exist in qemu.git?  
Those

devices would benefit from better performance too.
After thinking about that for a while, I think we shouldn't extend
queue sizes for vhost-user-blk, vhost-user-scsi and vhost-scsi.
This is because increasing the queue sizes seems to be just useless
for them: the whole thing is about increasing the queue sizes to
increase seg_max (it limits the max block query size from the
guest). For virtio-blk-device and virtio-scsi-device it makes sense,
since they have the seg-max-adjust property which, if true, sets seg_max
to virtqueue_size-2. vhost-scsi also has this property but it seems
the property just doesn't affect anything (remove it?).
Also vhost-user-blk, vhost-user-scsi and vhost-scsi don't do any
seg_max settings. If I understand correctly, their backends are meant
to be responsible for doing that.
So, what about changing the queue sizes just for virtio-blk-device
and virtio-scsi-device?


Denis


It seems to be so. We also have the test checking those settings:
tests/acceptance/virtio_seg_max_adjust.py
For now it checks virtio-scsi-pci and virtio-blk-pci.
I'm going to extend it for the virtqueue size checking.
If I change vhost-user-blk, vhost-user-scsi and vhost-scsi it's worth
to check those devices too. But I don't know how 

[PING][PING] [PATCH v4] vhost: make SET_VRING_ADDR, SET_FEATURES send replies

2021-08-16 Thread Denis Plotnikov



On 12.08.2021 11:04, Denis Plotnikov wrote:


On 09.08.2021 13:48, Denis Plotnikov wrote:

On vhost-user-blk migration, qemu normally sends a number of commands
to enable logging if VHOST_USER_PROTOCOL_F_LOG_SHMFD is negotiated.
Qemu sends VHOST_USER_SET_FEATURES to enable buffers logging and
VHOST_USER_SET_VRING_ADDR per each started ring to enable "used ring"
data logging.
The issue is that qemu doesn't wait for reply from the vhost daemon
for these commands which may result in races between qemu expectation
of logging starting and actual login starting in vhost daemon.

The race can appear as follows: on migration setup, qemu enables 
dirty page
logging by sending VHOST_USER_SET_FEATURES. The command doesn't 
arrive to a
vhost-user-blk daemon immediately and the daemon needs some time to 
turn the

logging on internally. If qemu doesn't wait for reply, after sending the
command, qemu may start migrateing memory pages to a destination. At 
this time,
the logging may not be actually turned on in the daemon but some 
guest pages,

which the daemon is about to write to, may have already been transferred
without logging to the destination. Since the logging wasn't turned on,
those pages won't be transferred again as dirty. So we may end up with
corrupted data on the destination.
The same scenario is applicable for "used ring" data logging, which is
turned on with VHOST_USER_SET_VRING_ADDR command.

To resolve this issue, this patch makes qemu wait for the command result
explicitly if VHOST_USER_PROTOCOL_F_REPLY_ACK is negotiated and 
logging enabled.


Signed-off-by: Denis Plotnikov 

---
v3 -> v4:
   * join acked reply and get_features in enforce_reply [mst]
   * typos, rewording, cosmetic changes [mst]

v2 -> v3:
   * send VHOST_USER_GET_FEATURES to flush out outstanding messages 
[mst]


v1 -> v2:
   * send reply only when logging is enabled [mst]

v0 -> v1:
   * send reply for SET_VRING_ADDR, SET_FEATURES only [mst]
---
  hw/virtio/vhost-user.c | 139 +
  1 file changed, 98 insertions(+), 41 deletions(-)

diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index ee57abe04526..5bb9254acd21 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -1095,23 +1095,6 @@ static int vhost_user_set_mem_table(struct 
vhost_dev *dev,

  return 0;
  }
  -static int vhost_user_set_vring_addr(struct vhost_dev *dev,
- struct vhost_vring_addr *addr)
-{
-    VhostUserMsg msg = {
-    .hdr.request = VHOST_USER_SET_VRING_ADDR,
-    .hdr.flags = VHOST_USER_VERSION,
-    .payload.addr = *addr,
-    .hdr.size = sizeof(msg.payload.addr),
-    };
-
-    if (vhost_user_write(dev, , NULL, 0) < 0) {
-    return -1;
-    }
-
-    return 0;
-}
-
  static int vhost_user_set_vring_endian(struct vhost_dev *dev,
 struct vhost_vring_state *ring)
  {
@@ -1288,72 +1271,146 @@ static int vhost_user_set_vring_call(struct 
vhost_dev *dev,

  return vhost_set_vring_file(dev, VHOST_USER_SET_VRING_CALL, file);
  }
  -static int vhost_user_set_u64(struct vhost_dev *dev, int request, 
uint64_t u64)

+
+static int vhost_user_get_u64(struct vhost_dev *dev, int request, 
uint64_t *u64)

  {
  VhostUserMsg msg = {
  .hdr.request = request,
  .hdr.flags = VHOST_USER_VERSION,
-    .payload.u64 = u64,
-    .hdr.size = sizeof(msg.payload.u64),
  };
  +    if (vhost_user_one_time_request(request) && dev->vq_index != 0) {
+    return 0;
+    }
+
  if (vhost_user_write(dev, , NULL, 0) < 0) {
  return -1;
  }
  +    if (vhost_user_read(dev, ) < 0) {
+    return -1;
+    }
+
+    if (msg.hdr.request != request) {
+    error_report("Received unexpected msg type. Expected %d 
received %d",

+ request, msg.hdr.request);
+    return -1;
+    }
+
+    if (msg.hdr.size != sizeof(msg.payload.u64)) {
+    error_report("Received bad msg size.");
+    return -1;
+    }
+
+    *u64 = msg.payload.u64;
+
  return 0;
  }
  -static int vhost_user_set_features(struct vhost_dev *dev,
-   uint64_t features)
+static int vhost_user_get_features(struct vhost_dev *dev, uint64_t 
*features)

  {
-    return vhost_user_set_u64(dev, VHOST_USER_SET_FEATURES, features);
+    return vhost_user_get_u64(dev, VHOST_USER_GET_FEATURES, features);
  }
  -static int vhost_user_set_protocol_features(struct vhost_dev *dev,
-    uint64_t features)
+static int enforce_reply(struct vhost_dev *dev,
+ const VhostUserMsg *msg)
  {
-    return vhost_user_set_u64(dev, VHOST_USER_SET_PROTOCOL_FEATURES, 
features);

+    uint64_t dummy;
+
+    if (msg->hdr.flags & VHOST_USER_NEED_REPLY_MASK) {
+    return process_message_reply(dev, msg);
+    }
+
+   /*
+    * We need to w

[PING] [PATCH v4] vhost: make SET_VRING_ADDR, SET_FEATURES send replies

2021-08-12 Thread Denis Plotnikov



On 09.08.2021 13:48, Denis Plotnikov wrote:

On vhost-user-blk migration, qemu normally sends a number of commands
to enable logging if VHOST_USER_PROTOCOL_F_LOG_SHMFD is negotiated.
Qemu sends VHOST_USER_SET_FEATURES to enable buffers logging and
VHOST_USER_SET_VRING_ADDR per each started ring to enable "used ring"
data logging.
The issue is that qemu doesn't wait for reply from the vhost daemon
for these commands which may result in races between qemu expectation
of logging starting and actual login starting in vhost daemon.

The race can appear as follows: on migration setup, qemu enables dirty page
logging by sending VHOST_USER_SET_FEATURES. The command doesn't arrive to a
vhost-user-blk daemon immediately and the daemon needs some time to turn the
logging on internally. If qemu doesn't wait for reply, after sending the
command, qemu may start migrateing memory pages to a destination. At this time,
the logging may not be actually turned on in the daemon but some guest pages,
which the daemon is about to write to, may have already been transferred
without logging to the destination. Since the logging wasn't turned on,
those pages won't be transferred again as dirty. So we may end up with
corrupted data on the destination.
The same scenario is applicable for "used ring" data logging, which is
turned on with VHOST_USER_SET_VRING_ADDR command.

To resolve this issue, this patch makes qemu wait for the command result
explicitly if VHOST_USER_PROTOCOL_F_REPLY_ACK is negotiated and logging enabled.

Signed-off-by: Denis Plotnikov 

---
v3 -> v4:
   * join acked reply and get_features in enforce_reply [mst]
   * typos, rewording, cosmetic changes [mst]

v2 -> v3:
   * send VHOST_USER_GET_FEATURES to flush out outstanding messages [mst]

v1 -> v2:
   * send reply only when logging is enabled [mst]

v0 -> v1:
   * send reply for SET_VRING_ADDR, SET_FEATURES only [mst]
---
  hw/virtio/vhost-user.c | 139 +
  1 file changed, 98 insertions(+), 41 deletions(-)

diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index ee57abe04526..5bb9254acd21 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -1095,23 +1095,6 @@ static int vhost_user_set_mem_table(struct vhost_dev 
*dev,
  return 0;
  }
  
-static int vhost_user_set_vring_addr(struct vhost_dev *dev,

- struct vhost_vring_addr *addr)
-{
-VhostUserMsg msg = {
-.hdr.request = VHOST_USER_SET_VRING_ADDR,
-.hdr.flags = VHOST_USER_VERSION,
-.payload.addr = *addr,
-.hdr.size = sizeof(msg.payload.addr),
-};
-
-if (vhost_user_write(dev, , NULL, 0) < 0) {
-return -1;
-}
-
-return 0;
-}
-
  static int vhost_user_set_vring_endian(struct vhost_dev *dev,
 struct vhost_vring_state *ring)
  {
@@ -1288,72 +1271,146 @@ static int vhost_user_set_vring_call(struct vhost_dev 
*dev,
  return vhost_set_vring_file(dev, VHOST_USER_SET_VRING_CALL, file);
  }
  
-static int vhost_user_set_u64(struct vhost_dev *dev, int request, uint64_t u64)

+
+static int vhost_user_get_u64(struct vhost_dev *dev, int request, uint64_t 
*u64)
  {
  VhostUserMsg msg = {
  .hdr.request = request,
  .hdr.flags = VHOST_USER_VERSION,
-.payload.u64 = u64,
-.hdr.size = sizeof(msg.payload.u64),
  };
  
+if (vhost_user_one_time_request(request) && dev->vq_index != 0) {

+return 0;
+}
+
  if (vhost_user_write(dev, , NULL, 0) < 0) {
  return -1;
  }
  
+if (vhost_user_read(dev, ) < 0) {

+return -1;
+}
+
+if (msg.hdr.request != request) {
+error_report("Received unexpected msg type. Expected %d received %d",
+ request, msg.hdr.request);
+return -1;
+}
+
+if (msg.hdr.size != sizeof(msg.payload.u64)) {
+error_report("Received bad msg size.");
+return -1;
+}
+
+*u64 = msg.payload.u64;
+
  return 0;
  }
  
-static int vhost_user_set_features(struct vhost_dev *dev,

-   uint64_t features)
+static int vhost_user_get_features(struct vhost_dev *dev, uint64_t *features)
  {
-return vhost_user_set_u64(dev, VHOST_USER_SET_FEATURES, features);
+return vhost_user_get_u64(dev, VHOST_USER_GET_FEATURES, features);
  }
  
-static int vhost_user_set_protocol_features(struct vhost_dev *dev,

-uint64_t features)
+static int enforce_reply(struct vhost_dev *dev,
+ const VhostUserMsg *msg)
  {
-return vhost_user_set_u64(dev, VHOST_USER_SET_PROTOCOL_FEATURES, features);
+uint64_t dummy;
+
+if (msg->hdr.flags & VHOST_USER_NEED_REPLY_MASK) {
+return process_message_reply(dev, msg);
+}
+
+   /*
+* We need to wait for a reply but the backend does not
+* support 

Re: [PATCH v3] vhost: make SET_VRING_ADDR, SET_FEATURES send replies

2021-08-09 Thread Denis Plotnikov



On 09.08.2021 12:34, Michael S. Tsirkin wrote:

Looks good. Some cosmetics:

all comments addressed in already sent v4


On Mon, Aug 09, 2021 at 12:03:30PM +0300, Denis Plotnikov wrote:

On vhost-user-blk migration, qemu normally sends a number of commands
to enable logging if VHOST_USER_PROTOCOL_F_LOG_SHMFD is negotiated.
Qemu sends VHOST_USER_SET_FEATURES to enable buffers logging and
VHOST_USER_SET_VRING_ADDR per each started ring to enable "used ring"
data logging.
The issue is that qemu doesn't wait for reply from the vhost daemon
for these commands which may result in races between qemu expectation
of logging starting and actual login starting in vhost daemon.

The race can appear as follows: on migration setup, qemu enables dirty page
logging by sending VHOST_USER_SET_FEATURES. The command doesn't arrive to a
vhost-user-blk daemon immediately and the daemon needs some time to turn the
logging on internally. If qemu doesn't wait for reply, after sending the
command, qemu may start migrate memory pages to a destination. At this time,

start migrating


the logging may not be actually turned on in the daemon but some guest pages,
which the daemon is about to write to, may have already been transferred
without logging to the destination. Since the logging wasn't turned on,
those pages won't be transferred again as dirty. So we may end up with
corrupted data on the destination.
The same scenario is applicable for "used ring" data logging, which is
turned on with VHOST_USER_SET_VRING_ADDR command.

To resolve this issue, this patch makes qemu wait for the commands result

command result


explicilty if VHOST_USER_PROTOCOL_F_REPLY_ACK is negotiated and logging enabled.

typo


Signed-off-by: Denis Plotnikov 

---
v2 -> v3:
   * send VHOST_USER_GET_FEATURES to flush out outstanding messages [mst]

v1 -> v2:
   * send reply only when logging is enabled [mst]

v0 -> v1:
   * send reply for SET_VRING_ADDR, SET_FEATURES only [mst]
---
  hw/virtio/vhost-user.c | 130 -
  1 file changed, 89 insertions(+), 41 deletions(-)

diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index ee57abe04526..18f685df549f 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -1095,23 +1095,6 @@ static int vhost_user_set_mem_table(struct vhost_dev 
*dev,
  return 0;
  }
  
-static int vhost_user_set_vring_addr(struct vhost_dev *dev,

- struct vhost_vring_addr *addr)
-{
-VhostUserMsg msg = {
-.hdr.request = VHOST_USER_SET_VRING_ADDR,
-.hdr.flags = VHOST_USER_VERSION,
-.payload.addr = *addr,
-.hdr.size = sizeof(msg.payload.addr),
-};
-
-if (vhost_user_write(dev, , NULL, 0) < 0) {
-return -1;
-}
-
-return 0;
-}
-
  static int vhost_user_set_vring_endian(struct vhost_dev *dev,
 struct vhost_vring_state *ring)
  {
@@ -1288,72 +1271,137 @@ static int vhost_user_set_vring_call(struct vhost_dev 
*dev,
  return vhost_set_vring_file(dev, VHOST_USER_SET_VRING_CALL, file);
  }
  
-static int vhost_user_set_u64(struct vhost_dev *dev, int request, uint64_t u64)

+
+static int vhost_user_get_u64(struct vhost_dev *dev, int request, uint64_t 
*u64)
  {
  VhostUserMsg msg = {
  .hdr.request = request,
  .hdr.flags = VHOST_USER_VERSION,
-.payload.u64 = u64,
-.hdr.size = sizeof(msg.payload.u64),
  };
  
+if (vhost_user_one_time_request(request) && dev->vq_index != 0) {

+return 0;
+}
+
  if (vhost_user_write(dev, , NULL, 0) < 0) {
  return -1;
  }
  
+if (vhost_user_read(dev, ) < 0) {

+return -1;
+}
+
+if (msg.hdr.request != request) {
+error_report("Received unexpected msg type. Expected %d received %d",
+ request, msg.hdr.request);
+return -1;
+}
+
+if (msg.hdr.size != sizeof(msg.payload.u64)) {
+error_report("Received bad msg size.");
+return -1;
+}
+
+*u64 = msg.payload.u64;
+
  return 0;
  }
  
-static int vhost_user_set_features(struct vhost_dev *dev,

-   uint64_t features)
+static int vhost_user_get_features(struct vhost_dev *dev, uint64_t *features)
  {
-return vhost_user_set_u64(dev, VHOST_USER_SET_FEATURES, features);
+return vhost_user_get_u64(dev, VHOST_USER_GET_FEATURES, features);
  }
  
-static int vhost_user_set_protocol_features(struct vhost_dev *dev,

-uint64_t features)
+static int enforce_reply(struct vhost_dev *dev)
  {
-return vhost_user_set_u64(dev, VHOST_USER_SET_PROTOCOL_FEATURES, features);
+   /*
+* we need a reply but can't get it from some command directly,
+* so send the command which must send a reply
to make sure
+* the command we sent before is actually completed.


better:

[PATCH v4] vhost: make SET_VRING_ADDR, SET_FEATURES send replies

2021-08-09 Thread Denis Plotnikov
On vhost-user-blk migration, qemu normally sends a number of commands
to enable logging if VHOST_USER_PROTOCOL_F_LOG_SHMFD is negotiated.
Qemu sends VHOST_USER_SET_FEATURES to enable buffers logging and
VHOST_USER_SET_VRING_ADDR per each started ring to enable "used ring"
data logging.
The issue is that qemu doesn't wait for a reply from the vhost daemon
for these commands, which may result in races between qemu's expectation
of logging starting and the actual logging starting in the vhost daemon.

The race can appear as follows: on migration setup, qemu enables dirty page
logging by sending VHOST_USER_SET_FEATURES. The command doesn't arrive to a
vhost-user-blk daemon immediately and the daemon needs some time to turn the
logging on internally. If qemu doesn't wait for reply, after sending the
command, qemu may start migrating memory pages to a destination. At this time,
the logging may not be actually turned on in the daemon but some guest pages,
which the daemon is about to write to, may have already been transferred
without logging to the destination. Since the logging wasn't turned on,
those pages won't be transferred again as dirty. So we may end up with
corrupted data on the destination.
The same scenario is applicable for "used ring" data logging, which is
turned on with VHOST_USER_SET_VRING_ADDR command.

To resolve this issue, this patch makes qemu wait for the command result
explicitly if VHOST_USER_PROTOCOL_F_REPLY_ACK is negotiated and logging enabled.
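
The essence of the fix, as a simplified sketch (a hypothetical wrapper that
assumes the usual hw/virtio/vhost-user.c helpers such as vhost_user_write()
and process_message_reply(); the real patch below handles more cases): when
REPLY_ACK is negotiated and logging is being enabled, the message is flagged
as needing a reply and qemu blocks on the backend's acknowledgement before
continuing.

    /* Simplified sketch only; see the diff below for the actual change. */
    static int vhost_user_set_u64_wait(struct vhost_dev *dev, int request,
                                       uint64_t u64, bool wait_for_reply)
    {
        VhostUserMsg msg = {
            .hdr.request = request,
            .hdr.flags = VHOST_USER_VERSION,
            .payload.u64 = u64,
            .hdr.size = sizeof(msg.payload.u64),
        };

        if (wait_for_reply &&
            virtio_has_feature(dev->protocol_features,
                               VHOST_USER_PROTOCOL_F_REPLY_ACK)) {
            msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK;
        }

        if (vhost_user_write(dev, &msg, NULL, 0) < 0) {
            return -1;
        }

        if (msg.hdr.flags & VHOST_USER_NEED_REPLY_MASK) {
            /* Block until the backend acks; logging is then known to be on. */
            return process_message_reply(dev, &msg);
        }

        return 0;
    }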

Signed-off-by: Denis Plotnikov 

---
v3 -> v4:
  * join acked reply and get_features in enforce_reply [mst]
  * typos, rewording, cosmetic changes [mst]

v2 -> v3:
  * send VHOST_USER_GET_FEATURES to flush out outstanding messages [mst]

v1 -> v2:
  * send reply only when logging is enabled [mst]

v0 -> v1:
  * send reply for SET_VRING_ADDR, SET_FEATURES only [mst]
---
 hw/virtio/vhost-user.c | 139 +
 1 file changed, 98 insertions(+), 41 deletions(-)

diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index ee57abe04526..5bb9254acd21 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -1095,23 +1095,6 @@ static int vhost_user_set_mem_table(struct vhost_dev 
*dev,
 return 0;
 }
 
-static int vhost_user_set_vring_addr(struct vhost_dev *dev,
- struct vhost_vring_addr *addr)
-{
-VhostUserMsg msg = {
-.hdr.request = VHOST_USER_SET_VRING_ADDR,
-.hdr.flags = VHOST_USER_VERSION,
-.payload.addr = *addr,
-.hdr.size = sizeof(msg.payload.addr),
-};
-
-if (vhost_user_write(dev, &msg, NULL, 0) < 0) {
-return -1;
-}
-
-return 0;
-}
-
 static int vhost_user_set_vring_endian(struct vhost_dev *dev,
struct vhost_vring_state *ring)
 {
@@ -1288,72 +1271,146 @@ static int vhost_user_set_vring_call(struct vhost_dev 
*dev,
 return vhost_set_vring_file(dev, VHOST_USER_SET_VRING_CALL, file);
 }
 
-static int vhost_user_set_u64(struct vhost_dev *dev, int request, uint64_t u64)
+
+static int vhost_user_get_u64(struct vhost_dev *dev, int request, uint64_t *u64)
 {
 VhostUserMsg msg = {
 .hdr.request = request,
 .hdr.flags = VHOST_USER_VERSION,
-.payload.u64 = u64,
-.hdr.size = sizeof(msg.payload.u64),
 };
 
+if (vhost_user_one_time_request(request) && dev->vq_index != 0) {
+return 0;
+}
+
 if (vhost_user_write(dev, &msg, NULL, 0) < 0) {
 return -1;
 }
 
+if (vhost_user_read(dev, &msg) < 0) {
+return -1;
+}
+
+if (msg.hdr.request != request) {
+error_report("Received unexpected msg type. Expected %d received %d",
+ request, msg.hdr.request);
+return -1;
+}
+
+if (msg.hdr.size != sizeof(msg.payload.u64)) {
+error_report("Received bad msg size.");
+return -1;
+}
+
+*u64 = msg.payload.u64;
+
 return 0;
 }
 
-static int vhost_user_set_features(struct vhost_dev *dev,
-   uint64_t features)
+static int vhost_user_get_features(struct vhost_dev *dev, uint64_t *features)
 {
-return vhost_user_set_u64(dev, VHOST_USER_SET_FEATURES, features);
+return vhost_user_get_u64(dev, VHOST_USER_GET_FEATURES, features);
 }
 
-static int vhost_user_set_protocol_features(struct vhost_dev *dev,
-uint64_t features)
+static int enforce_reply(struct vhost_dev *dev,
+ const VhostUserMsg *msg)
 {
-return vhost_user_set_u64(dev, VHOST_USER_SET_PROTOCOL_FEATURES, features);
+uint64_t dummy;
+
+if (msg->hdr.flags & VHOST_USER_NEED_REPLY_MASK) {
+return process_message_reply(dev, msg);
+}
+
+   /*
+* We need to wait for a reply but the backend does not
+* support replies for the command we just sent.
+* Send VHOST_USER_GET_FEATURES which make

[PATCH v3] vhost: make SET_VRING_ADDR, SET_FEATURES send replies

2021-08-09 Thread Denis Plotnikov
On vhost-user-blk migration, qemu normally sends a number of commands
to enable logging if VHOST_USER_PROTOCOL_F_LOG_SHMFD is negotiated.
Qemu sends VHOST_USER_SET_FEATURES to enable buffers logging and
VHOST_USER_SET_VRING_ADDR per each started ring to enable "used ring"
data logging.
The issue is that qemu doesn't wait for reply from the vhost daemon
for these commands which may result in races between qemu expectation
of logging starting and actual login starting in vhost daemon.

The race can appear as follows: on migration setup, qemu enables dirty page
logging by sending VHOST_USER_SET_FEATURES. The command doesn't arrive to a
vhost-user-blk daemon immediately and the daemon needs some time to turn the
logging on internally. If qemu doesn't wait for reply, after sending the
command, qemu may start migrate memory pages to a destination. At this time,
the logging may not be actually turned on in the daemon but some guest pages,
which the daemon is about to write to, may have already been transferred
without logging to the destination. Since the logging wasn't turned on,
those pages won't be transferred again as dirty. So we may end up with
corrupted data on the destination.
The same scenario is applicable for "used ring" data logging, which is
turned on with VHOST_USER_SET_VRING_ADDR command.

To resolve this issue, this patch makes qemu wait for the commands result
explicilty if VHOST_USER_PROTOCOL_F_REPLY_ACK is negotiated and logging enabled.

Signed-off-by: Denis Plotnikov 

---
v2 -> v3:
  * send VHOST_USER_GET_FEATURES to flush out outstanding messages [mst]

v1 -> v2:
  * send reply only when logging is enabled [mst]

v0 -> v1:
  * send reply for SET_VRING_ADDR, SET_FEATURES only [mst]
---
 hw/virtio/vhost-user.c | 130 -
 1 file changed, 89 insertions(+), 41 deletions(-)

diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index ee57abe04526..18f685df549f 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -1095,23 +1095,6 @@ static int vhost_user_set_mem_table(struct vhost_dev 
*dev,
 return 0;
 }
 
-static int vhost_user_set_vring_addr(struct vhost_dev *dev,
- struct vhost_vring_addr *addr)
-{
-VhostUserMsg msg = {
-.hdr.request = VHOST_USER_SET_VRING_ADDR,
-.hdr.flags = VHOST_USER_VERSION,
-.payload.addr = *addr,
-.hdr.size = sizeof(msg.payload.addr),
-};
-
-if (vhost_user_write(dev, &msg, NULL, 0) < 0) {
-return -1;
-}
-
-return 0;
-}
-
 static int vhost_user_set_vring_endian(struct vhost_dev *dev,
struct vhost_vring_state *ring)
 {
@@ -1288,72 +1271,137 @@ static int vhost_user_set_vring_call(struct vhost_dev 
*dev,
 return vhost_set_vring_file(dev, VHOST_USER_SET_VRING_CALL, file);
 }
 
-static int vhost_user_set_u64(struct vhost_dev *dev, int request, uint64_t u64)
+
+static int vhost_user_get_u64(struct vhost_dev *dev, int request, uint64_t *u64)
 {
 VhostUserMsg msg = {
 .hdr.request = request,
 .hdr.flags = VHOST_USER_VERSION,
-.payload.u64 = u64,
-.hdr.size = sizeof(msg.payload.u64),
 };
 
+if (vhost_user_one_time_request(request) && dev->vq_index != 0) {
+return 0;
+}
+
 if (vhost_user_write(dev, &msg, NULL, 0) < 0) {
 return -1;
 }
 
+if (vhost_user_read(dev, &msg) < 0) {
+return -1;
+}
+
+if (msg.hdr.request != request) {
+error_report("Received unexpected msg type. Expected %d received %d",
+ request, msg.hdr.request);
+return -1;
+}
+
+if (msg.hdr.size != sizeof(msg.payload.u64)) {
+error_report("Received bad msg size.");
+return -1;
+}
+
+*u64 = msg.payload.u64;
+
 return 0;
 }
 
-static int vhost_user_set_features(struct vhost_dev *dev,
-   uint64_t features)
+static int vhost_user_get_features(struct vhost_dev *dev, uint64_t *features)
 {
-return vhost_user_set_u64(dev, VHOST_USER_SET_FEATURES, features);
+return vhost_user_get_u64(dev, VHOST_USER_GET_FEATURES, features);
 }
 
-static int vhost_user_set_protocol_features(struct vhost_dev *dev,
-uint64_t features)
+static int enforce_reply(struct vhost_dev *dev)
 {
-return vhost_user_set_u64(dev, VHOST_USER_SET_PROTOCOL_FEATURES, features);
+   /*
+* we need a reply but can't get it from some command directly,
+* so send the command which must send a reply to make sure
+* the command we sent before is actually completed.
+*/
+uint64_t dummy;
+return vhost_user_get_features(dev, &dummy);
 }
 
-static int vhost_user_get_u64(struct vhost_dev *dev, int request, uint64_t 
*u64)
+static int vhost_user_set_vring_addr(struct vhost_dev *dev,
+ struct vhost_vring_add

Re: [PATCH v2] vhost: make SET_VRING_ADDR, SET_FEATURES send replies

2021-08-09 Thread Denis Plotnikov



On 03.08.2021 18:05, Michael S. Tsirkin wrote:

On Mon, Jul 19, 2021 at 05:21:38PM +0300, Denis Plotnikov wrote:

On vhost-user-blk migration, qemu normally sends a number of commands
to enable logging if VHOST_USER_PROTOCOL_F_LOG_SHMFD is negotiated.
Qemu sends VHOST_USER_SET_FEATURES to enable buffers logging and
VHOST_USER_SET_VRING_ADDR per each started ring to enable "used ring"
data logging.
The issue is that qemu doesn't wait for a reply from the vhost daemon
for these commands, which may result in races between qemu's expectation
of logging starting and logging actually starting in the vhost daemon.

The race can appear as follows: on migration setup, qemu enables dirty page
logging by sending VHOST_USER_SET_FEATURES. The command doesn't arrive at the
vhost-user-blk daemon immediately, and the daemon needs some time to turn the
logging on internally. If qemu doesn't wait for a reply after sending the
command, qemu may start migrating memory pages to a destination. At this time,
the logging may not be actually turned on in the daemon but some guest pages,
which the daemon is about to write to, may have already been transferred
without logging to the destination. Since the logging wasn't turned on,
those pages won't be transferred again as dirty. So we may end up with
corrupted data on the destination.
The same scenario is applicable for "used ring" data logging, which is
turned on with VHOST_USER_SET_VRING_ADDR command.

To resolve this issue, this patch makes qemu wait for the result of these
commands explicitly if VHOST_USER_PROTOCOL_F_REPLY_ACK is negotiated and
logging is enabled.

Signed-off-by: Denis Plotnikov 
---
v1 -> v2:
   * send reply only when logging is enabled [mst]

v0 -> v1:
   * send reply for SET_VRING_ADDR, SET_FEATURES only [mst]
   
  hw/virtio/vhost-user.c | 37 ++---

  1 file changed, 34 insertions(+), 3 deletions(-)

diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index ee57abe04526..133588b3961e 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -1095,6 +1095,11 @@ static int vhost_user_set_mem_table(struct vhost_dev 
*dev,
  return 0;
  }
  
+static bool log_enabled(uint64_t features)

+{
+return !!(features & (0x1ULL << VHOST_F_LOG_ALL));
+}
+
  static int vhost_user_set_vring_addr(struct vhost_dev *dev,
   struct vhost_vring_addr *addr)
  {
@@ -1105,10 +1110,21 @@ static int vhost_user_set_vring_addr(struct vhost_dev 
*dev,
  .hdr.size = sizeof(msg.payload.addr),
  };
  
+bool reply_supported = virtio_has_feature(dev->protocol_features,

+  VHOST_USER_PROTOCOL_F_REPLY_ACK);
+
+if (reply_supported && log_enabled(msg.hdr.flags)) {
+msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK;
+}
+
  if (vhost_user_write(dev, &msg, NULL, 0) < 0) {
  return -1;
  }
  
+if (msg.hdr.flags & VHOST_USER_NEED_REPLY_MASK) {

+return process_message_reply(dev, &msg);
+}
+
  return 0;
  }


OK this is good, but the problem is that we then still have a race
if VHOST_USER_PROTOCOL_F_REPLY_ACK is not set. Bummer.

Let's send VHOST_USER_GET_FEATURES in this case to flush out outstanding
messages?

Ok, I've already sent v3 with related changes.
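
For readers following the thread, a minimal stand-alone sketch of that flush;
send_msg()/recv_reply() are stubs rather than QEMU's real helpers, and in v3
this idea became enforce_reply() built on vhost_user_get_u64().

#include <stdint.h>

struct dev { int fd; };                 /* placeholder connection state */

enum { GET_FEATURES };                  /* illustrative request id only */

/* Stubs standing in for vhost_user_write()/vhost_user_read(). */
static int send_msg(struct dev *d, int request) { (void)d; (void)request; return 0; }
static int recv_reply(struct dev *d, uint64_t *payload) { (void)d; *payload = 0; return 0; }

/*
 * The backend handles requests in order, so once the GET_FEATURES reply
 * arrives, every request sent before it (SET_FEATURES, SET_VRING_ADDR)
 * must have been processed as well -- the round trip acts as a barrier.
 */
static int flush_with_get_features(struct dev *d)
{
    uint64_t dummy;

    if (send_msg(d, GET_FEATURES) < 0) {
        return -1;
    }
    return recv_reply(d, &dummy);
}

int main(void)
{
    struct dev d = { .fd = -1 };
    return flush_with_get_features(&d);
}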
   

@@ -1288,7 +1304,8 @@ static int vhost_user_set_vring_call(struct vhost_dev 
*dev,
  return vhost_set_vring_file(dev, VHOST_USER_SET_VRING_CALL, file);
  }
  
-static int vhost_user_set_u64(struct vhost_dev *dev, int request, uint64_t u64)

+static int vhost_user_set_u64(struct vhost_dev *dev, int request, uint64_t u64,
+  bool need_reply)
  {
  VhostUserMsg msg = {
  .hdr.request = request,
@@ -1297,23 +1314,37 @@ static int vhost_user_set_u64(struct vhost_dev *dev, 
int request, uint64_t u64)
  .hdr.size = sizeof(msg.payload.u64),
  };
  
+if (need_reply) {

+bool reply_supported = virtio_has_feature(dev->protocol_features,
+  VHOST_USER_PROTOCOL_F_REPLY_ACK);
+if (reply_supported) {
+msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK;
+}
+}
+
  if (vhost_user_write(dev, &msg, NULL, 0) < 0) {
  return -1;
  }
  
+if (msg.hdr.flags & VHOST_USER_NEED_REPLY_MASK) {

+return process_message_reply(dev, &msg);
+}
+
  return 0;
  }
  
  static int vhost_user_set_features(struct vhost_dev *dev,

 uint64_t features)
  {
-return vhost_user_set_u64(dev, VHOST_USER_SET_FEATURES, features);
+return vhost_user_set_u64(dev, VHOST_USER_SET_FEATURES, features,
+  log_enabled(features));
  }
  
  static int vhost_user_set_protocol_features(struct vhost_dev *dev,

  uint64_t features)
  {
-return vhost_user_set_u64(dev, VHOST_USER_SET_

[PING][PING][PATCH v2] vhost: make SET_VRING_ADDR, SET_FEATURES send replies

2021-07-29 Thread Denis Plotnikov


On 23.07.2021 12:59, Denis Plotnikov wrote:


ping!

On 19.07.2021 17:21, Denis Plotnikov wrote:

On vhost-user-blk migration, qemu normally sends a number of commands
to enable logging if VHOST_USER_PROTOCOL_F_LOG_SHMFD is negotiated.
Qemu sends VHOST_USER_SET_FEATURES to enable buffers logging and
VHOST_USER_SET_VRING_ADDR per each started ring to enable "used ring"
data logging.
The issue is that qemu doesn't wait for a reply from the vhost daemon
for these commands, which may result in races between qemu's expectation
of logging starting and logging actually starting in the vhost daemon.

The race can appear as follows: on migration setup, qemu enables dirty page
logging by sending VHOST_USER_SET_FEATURES. The command doesn't arrive at the
vhost-user-blk daemon immediately, and the daemon needs some time to turn the
logging on internally. If qemu doesn't wait for a reply after sending the
command, qemu may start migrating memory pages to a destination. At this time,
the logging may not be actually turned on in the daemon but some guest pages,
which the daemon is about to write to, may have already been transferred
without logging to the destination. Since the logging wasn't turned on,
those pages won't be transferred again as dirty. So we may end up with
corrupted data on the destination.
The same scenario is applicable for "used ring" data logging, which is
turned on with VHOST_USER_SET_VRING_ADDR command.

To resolve this issue, this patch makes qemu wait for the result of these
commands explicitly if VHOST_USER_PROTOCOL_F_REPLY_ACK is negotiated and
logging is enabled.

Signed-off-by: Denis Plotnikov
---
v1 -> v2:
   * send reply only when logging is enabled [mst]

v0 -> v1:
   * send reply for SET_VRING_ADDR, SET_FEATURES only [mst]
   
  hw/virtio/vhost-user.c | 37 ++---

  1 file changed, 34 insertions(+), 3 deletions(-)

diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index ee57abe04526..133588b3961e 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -1095,6 +1095,11 @@ static int vhost_user_set_mem_table(struct vhost_dev 
*dev,
  return 0;
  }
  
+static bool log_enabled(uint64_t features)

+{
+return !!(features & (0x1ULL << VHOST_F_LOG_ALL));
+}
+
  static int vhost_user_set_vring_addr(struct vhost_dev *dev,
   struct vhost_vring_addr *addr)
  {
@@ -1105,10 +1110,21 @@ static int vhost_user_set_vring_addr(struct vhost_dev 
*dev,
  .hdr.size = sizeof(msg.payload.addr),
  };
  
+bool reply_supported = virtio_has_feature(dev->protocol_features,

+  VHOST_USER_PROTOCOL_F_REPLY_ACK);
+
+if (reply_supported && log_enabled(msg.hdr.flags)) {
+msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK;
+}
+
  if (vhost_user_write(dev, &msg, NULL, 0) < 0) {
  return -1;
  }
  
+if (msg.hdr.flags & VHOST_USER_NEED_REPLY_MASK) {

+return process_message_reply(dev, &msg);
+}
+
  return 0;
  }
  
@@ -1288,7 +1304,8 @@ static int vhost_user_set_vring_call(struct vhost_dev *dev,

  return vhost_set_vring_file(dev, VHOST_USER_SET_VRING_CALL, file);
  }
  
-static int vhost_user_set_u64(struct vhost_dev *dev, int request, uint64_t u64)

+static int vhost_user_set_u64(struct vhost_dev *dev, int request, uint64_t u64,
+  bool need_reply)
  {
  VhostUserMsg msg = {
  .hdr.request = request,
@@ -1297,23 +1314,37 @@ static int vhost_user_set_u64(struct vhost_dev *dev, 
int request, uint64_t u64)
  .hdr.size = sizeof(msg.payload.u64),
  };
  
+if (need_reply) {

+bool reply_supported = virtio_has_feature(dev->protocol_features,
+  VHOST_USER_PROTOCOL_F_REPLY_ACK);
+if (reply_supported) {
+msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK;
+}
+}
+
  if (vhost_user_write(dev, &msg, NULL, 0) < 0) {
  return -1;
  }
  
+if (msg.hdr.flags & VHOST_USER_NEED_REPLY_MASK) {

+return process_message_reply(dev, &msg);
+}
+
  return 0;
  }
  
  static int vhost_user_set_features(struct vhost_dev *dev,

 uint64_t features)
  {
-return vhost_user_set_u64(dev, VHOST_USER_SET_FEATURES, features);
+return vhost_user_set_u64(dev, VHOST_USER_SET_FEATURES, features,
+  log_enabled(features));
  }
  
  static int vhost_user_set_protocol_features(struct vhost_dev *dev,

  uint64_t features)
  {
-return vhost_user_set_u64(dev, VHOST_USER_SET_PROTOCOL_FEATURES, features);
+return vhost_user_set_u64(dev, VHOST_USER_SET_PROTOCOL_FEATURES, features,
+  false);
  }
  
  static int vhost_user_get_u64(struct vhost_dev *dev, int request, uint64_t *u64)


[PING][PATCH v2] vhost: make SET_VRING_ADDR, SET_FEATURES send replies

2021-07-23 Thread Denis Plotnikov

ping!

On 19.07.2021 17:21, Denis Plotnikov wrote:

On vhost-user-blk migration, qemu normally sends a number of commands
to enable logging if VHOST_USER_PROTOCOL_F_LOG_SHMFD is negotiated.
Qemu sends VHOST_USER_SET_FEATURES to enable buffers logging and
VHOST_USER_SET_VRING_ADDR per each started ring to enable "used ring"
data logging.
The issue is that qemu doesn't wait for a reply from the vhost daemon
for these commands, which may result in races between qemu's expectation
of logging starting and logging actually starting in the vhost daemon.

The race can appear as follows: on migration setup, qemu enables dirty page
logging by sending VHOST_USER_SET_FEATURES. The command doesn't arrive at the
vhost-user-blk daemon immediately, and the daemon needs some time to turn the
logging on internally. If qemu doesn't wait for a reply after sending the
command, qemu may start migrating memory pages to a destination. At this time,
the logging may not be actually turned on in the daemon but some guest pages,
which the daemon is about to write to, may have already been transferred
without logging to the destination. Since the logging wasn't turned on,
those pages won't be transferred again as dirty. So we may end up with
corrupted data on the destination.
The same scenario is applicable for "used ring" data logging, which is
turned on with VHOST_USER_SET_VRING_ADDR command.

To resolve this issue, this patch makes qemu wait for the result of these
commands explicitly if VHOST_USER_PROTOCOL_F_REPLY_ACK is negotiated and
logging is enabled.

Signed-off-by: Denis Plotnikov 
---
v1 -> v2:
   * send reply only when logging is enabled [mst]

v0 -> v1:
   * send reply for SET_VRING_ADDR, SET_FEATURES only [mst]
   
  hw/virtio/vhost-user.c | 37 ++---

  1 file changed, 34 insertions(+), 3 deletions(-)

diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index ee57abe04526..133588b3961e 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -1095,6 +1095,11 @@ static int vhost_user_set_mem_table(struct vhost_dev 
*dev,
  return 0;
  }
  
+static bool log_enabled(uint64_t features)

+{
+return !!(features & (0x1ULL << VHOST_F_LOG_ALL));
+}
+
  static int vhost_user_set_vring_addr(struct vhost_dev *dev,
   struct vhost_vring_addr *addr)
  {
@@ -1105,10 +1110,21 @@ static int vhost_user_set_vring_addr(struct vhost_dev 
*dev,
  .hdr.size = sizeof(msg.payload.addr),
  };
  
+bool reply_supported = virtio_has_feature(dev->protocol_features,

+  VHOST_USER_PROTOCOL_F_REPLY_ACK);
+
+if (reply_supported && log_enabled(msg.hdr.flags)) {
+msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK;
+}
+
  if (vhost_user_write(dev, &msg, NULL, 0) < 0) {
  return -1;
  }
  
+if (msg.hdr.flags & VHOST_USER_NEED_REPLY_MASK) {

+return process_message_reply(dev, &msg);
+}
+
  return 0;
  }
  
@@ -1288,7 +1304,8 @@ static int vhost_user_set_vring_call(struct vhost_dev *dev,

  return vhost_set_vring_file(dev, VHOST_USER_SET_VRING_CALL, file);
  }
  
-static int vhost_user_set_u64(struct vhost_dev *dev, int request, uint64_t u64)

+static int vhost_user_set_u64(struct vhost_dev *dev, int request, uint64_t u64,
+  bool need_reply)
  {
  VhostUserMsg msg = {
  .hdr.request = request,
@@ -1297,23 +1314,37 @@ static int vhost_user_set_u64(struct vhost_dev *dev, 
int request, uint64_t u64)
  .hdr.size = sizeof(msg.payload.u64),
  };
  
+if (need_reply) {

+bool reply_supported = virtio_has_feature(dev->protocol_features,
+  VHOST_USER_PROTOCOL_F_REPLY_ACK);
+if (reply_supported) {
+msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK;
+}
+}
+
  if (vhost_user_write(dev, &msg, NULL, 0) < 0) {
  return -1;
  }
  
+if (msg.hdr.flags & VHOST_USER_NEED_REPLY_MASK) {

+return process_message_reply(dev, &msg);
+}
+
  return 0;
  }
  
  static int vhost_user_set_features(struct vhost_dev *dev,

 uint64_t features)
  {
-return vhost_user_set_u64(dev, VHOST_USER_SET_FEATURES, features);
+return vhost_user_set_u64(dev, VHOST_USER_SET_FEATURES, features,
+  log_enabled(features));
  }
  
  static int vhost_user_set_protocol_features(struct vhost_dev *dev,

  uint64_t features)
  {
-return vhost_user_set_u64(dev, VHOST_USER_SET_PROTOCOL_FEATURES, features);
+return vhost_user_set_u64(dev, VHOST_USER_SET_PROTOCOL_FEATURES, features,
+  false);
  }
  
  static int vhost_user_get_u64(struct vhost_dev *dev, int request, uint64_t *u64)


[PATCH v2] vhost: make SET_VRING_ADDR, SET_FEATURES send replies

2021-07-19 Thread Denis Plotnikov
On vhost-user-blk migration, qemu normally sends a number of commands
to enable logging if VHOST_USER_PROTOCOL_F_LOG_SHMFD is negotiated.
Qemu sends VHOST_USER_SET_FEATURES to enable buffers logging and
VHOST_USER_SET_VRING_ADDR per each started ring to enable "used ring"
data logging.
The issue is that qemu doesn't wait for a reply from the vhost daemon
for these commands, which may result in races between qemu's expectation
of logging starting and logging actually starting in the vhost daemon.

The race can appear as follows: on migration setup, qemu enables dirty page
logging by sending VHOST_USER_SET_FEATURES. The command doesn't arrive at the
vhost-user-blk daemon immediately, and the daemon needs some time to turn the
logging on internally. If qemu doesn't wait for a reply after sending the
command, qemu may start migrating memory pages to a destination. At this time,
the logging may not be actually turned on in the daemon but some guest pages,
which the daemon is about to write to, may have already been transferred
without logging to the destination. Since the logging wasn't turned on,
those pages won't be transferred again as dirty. So we may end up with
corrupted data on the destination.
The same scenario is applicable for "used ring" data logging, which is
turned on with VHOST_USER_SET_VRING_ADDR command.

To resolve this issue, this patch makes qemu wait for the result of these
commands explicitly if VHOST_USER_PROTOCOL_F_REPLY_ACK is negotiated and
logging is enabled.

Signed-off-by: Denis Plotnikov 
---
v1 -> v2:
  * send reply only when logging is enabled [mst]

v0 -> v1:
  * send reply for SET_VRING_ADDR, SET_FEATURES only [mst]
  
 hw/virtio/vhost-user.c | 37 ++---
 1 file changed, 34 insertions(+), 3 deletions(-)

diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index ee57abe04526..133588b3961e 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -1095,6 +1095,11 @@ static int vhost_user_set_mem_table(struct vhost_dev 
*dev,
 return 0;
 }
 
+static bool log_enabled(uint64_t features)
+{
+return !!(features & (0x1ULL << VHOST_F_LOG_ALL));
+}
+
 static int vhost_user_set_vring_addr(struct vhost_dev *dev,
  struct vhost_vring_addr *addr)
 {
@@ -1105,10 +1110,21 @@ static int vhost_user_set_vring_addr(struct vhost_dev 
*dev,
 .hdr.size = sizeof(msg.payload.addr),
 };
 
+bool reply_supported = virtio_has_feature(dev->protocol_features,
+  VHOST_USER_PROTOCOL_F_REPLY_ACK);
+
+if (reply_supported && log_enabled(msg.hdr.flags)) {
+msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK;
+}
+
 if (vhost_user_write(dev, &msg, NULL, 0) < 0) {
 return -1;
 }
 
+if (msg.hdr.flags & VHOST_USER_NEED_REPLY_MASK) {
+return process_message_reply(dev, &msg);
+}
+
 return 0;
 }
 
@@ -1288,7 +1304,8 @@ static int vhost_user_set_vring_call(struct vhost_dev 
*dev,
 return vhost_set_vring_file(dev, VHOST_USER_SET_VRING_CALL, file);
 }
 
-static int vhost_user_set_u64(struct vhost_dev *dev, int request, uint64_t u64)
+static int vhost_user_set_u64(struct vhost_dev *dev, int request, uint64_t u64,
+  bool need_reply)
 {
 VhostUserMsg msg = {
 .hdr.request = request,
@@ -1297,23 +1314,37 @@ static int vhost_user_set_u64(struct vhost_dev *dev, 
int request, uint64_t u64)
 .hdr.size = sizeof(msg.payload.u64),
 };
 
+if (need_reply) {
+bool reply_supported = virtio_has_feature(dev->protocol_features,
+  VHOST_USER_PROTOCOL_F_REPLY_ACK);
+if (reply_supported) {
+msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK;
+}
+}
+
 if (vhost_user_write(dev, &msg, NULL, 0) < 0) {
 return -1;
 }
 
+if (msg.hdr.flags & VHOST_USER_NEED_REPLY_MASK) {
+return process_message_reply(dev, &msg);
+}
+
 return 0;
 }
 
 static int vhost_user_set_features(struct vhost_dev *dev,
uint64_t features)
 {
-return vhost_user_set_u64(dev, VHOST_USER_SET_FEATURES, features);
+return vhost_user_set_u64(dev, VHOST_USER_SET_FEATURES, features,
+  log_enabled(features));
 }
 
 static int vhost_user_set_protocol_features(struct vhost_dev *dev,
 uint64_t features)
 {
-return vhost_user_set_u64(dev, VHOST_USER_SET_PROTOCOL_FEATURES, features);
+return vhost_user_set_u64(dev, VHOST_USER_SET_PROTOCOL_FEATURES, features,
+  false);
 }
 
 static int vhost_user_get_u64(struct vhost_dev *dev, int request, uint64_t 
*u64)
-- 
2.25.1




Re: [PATCH v1] vhost: make SET_VRING_ADDR, SET_FEATURES send replies

2021-07-14 Thread Denis Plotnikov



On 08.07.2021 16:02, Denis Plotnikov wrote:


On 08.07.2021 15:04, Michael S. Tsirkin wrote:

On Thu, Jul 08, 2021 at 11:28:40AM +0300, Denis Plotnikov wrote:

On vhost-user-blk migration, qemu normally sends a number of commands
to enable logging if VHOST_USER_PROTOCOL_F_LOG_SHMFD is negotiated.
Qemu sends VHOST_USER_SET_FEATURES to enable buffers logging and
VHOST_USER_SET_VRING_ADDR per each started ring to enable "used ring"
data logging.
The issue is that qemu doesn't wait for a reply from the vhost daemon
for these commands, which may result in races between qemu's expectation
of logging starting and logging actually starting in the vhost daemon.

The race can appear as follows: on migration setup, qemu enables dirty page
logging by sending VHOST_USER_SET_FEATURES. The command doesn't arrive at the
vhost-user-blk daemon immediately, and the daemon needs some time to turn the
logging on internally. If qemu doesn't wait for a reply after sending the
command, qemu may start migrating memory pages to a destination. At this time,
the logging may not be actually turned on in the daemon, but some guest pages,
which the daemon is about to write to, may have already been transferred
without logging to the destination. Since the logging wasn't turned on,
those pages won't be transferred again as dirty. So we may end up with
corrupted data on the destination.
The same scenario is applicable for "used ring" data logging, which is
turned on with VHOST_USER_SET_VRING_ADDR command.

To resolve this issue, this patch makes qemu wait for the result of these
commands explicitly if VHOST_USER_PROTOCOL_F_REPLY_ACK is negotiated.

Signed-off-by: Denis Plotnikov 
---
  hw/virtio/vhost-user.c | 31 ---
  1 file changed, 28 insertions(+), 3 deletions(-)

diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index ee57abe04526..15b5fac67cf3 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -1105,10 +1105,20 @@ static int vhost_user_set_vring_addr(struct 
vhost_dev *dev,

  .hdr.size = sizeof(msg.payload.addr),
  };
  +    bool reply_supported = 
virtio_has_feature(dev->protocol_features,

+ VHOST_USER_PROTOCOL_F_REPLY_ACK);
+    if (reply_supported) {
+    msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK;
+    }
+
   if (vhost_user_write(dev, &msg, NULL, 0) < 0) {
  return -1;
  }
  +    if (reply_supported) {
+    return process_message_reply(dev, &msg);
+    }
+
  return 0;
  }

Same - can we limit this to when logging is being enabled?


I think it's possible but do we really need some additional complexity?

Are you concerned about delays on device initialization? Would waiting
for the reply introduce a significant device initialization time
delay? In my understanding, this is done rarely, on vhost-user device
initialization. So maybe we can afford it to be a little bit longer?


As for the migration case, in my understanding, most of the time the
migration of vhost-user devices should be done with logging enabled. Otherwise
it's hard to tell how to make sure that the memory migrates with
consistent data. So here we shouldn't care too much about setup speed
and should care more about data consistency. What do you think?


Thanks!

Denis


Please let me know if my points above seem unreasonable, and I'll
send another version that sends the reply only when logging is enabled.


Thanks!

Denis





@@ -1288,7 +1298,8 @@ static int vhost_user_set_vring_call(struct 
vhost_dev *dev,
  return vhost_set_vring_file(dev, VHOST_USER_SET_VRING_CALL, 
file);

  }
  -static int vhost_user_set_u64(struct vhost_dev *dev, int request, 
uint64_t u64)
+static int vhost_user_set_u64(struct vhost_dev *dev, int request, 
uint64_t u64,

+  bool need_reply)
  {
  VhostUserMsg msg = {
  .hdr.request = request,
@@ -1297,23 +1308,37 @@ static int vhost_user_set_u64(struct 
vhost_dev *dev, int request, uint64_t u64)

  .hdr.size = sizeof(msg.payload.u64),
  };
  +    if (need_reply) {
+    bool reply_supported = 
virtio_has_feature(dev->protocol_features,

+ VHOST_USER_PROTOCOL_F_REPLY_ACK);
+    if (reply_supported) {
+    msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK;
+    }
+    }
+
   if (vhost_user_write(dev, &msg, NULL, 0) < 0) {
  return -1;
  }
  +    if (msg.hdr.flags & VHOST_USER_NEED_REPLY_MASK) {
+    return process_message_reply(dev, &msg);
+    }
+
  return 0;
  }
    static int vhost_user_set_features(struct vhost_dev *dev,
 uint64_t features)
  {
-    return vhost_user_set_u64(dev, VHOST_USER_SET_FEATURES, features);
+    return vhost_user_set_u64(dev, VHOST_USER_SET_FEATURES, features,
+  true);
  }

Same here. In fact,


  static int vhost_user_set_protocol_features(struct vhost_dev *dev,
  uint64_t features)
  

Re: [PATCH v1] vhost: make SET_VRING_ADDR, SET_FEATURES send replies

2021-07-08 Thread Denis Plotnikov



On 08.07.2021 15:04, Michael S. Tsirkin wrote:

On Thu, Jul 08, 2021 at 11:28:40AM +0300, Denis Plotnikov wrote:

On vhost-user-blk migration, qemu normally sends a number of commands
to enable logging if VHOST_USER_PROTOCOL_F_LOG_SHMFD is negotiated.
Qemu sends VHOST_USER_SET_FEATURES to enable buffers logging and
VHOST_USER_SET_VRING_ADDR per each started ring to enable "used ring"
data logging.
The issue is that qemu doesn't wait for a reply from the vhost daemon
for these commands, which may result in races between qemu's expectation
of logging starting and logging actually starting in the vhost daemon.

The race can appear as follows: on migration setup, qemu enables dirty page
logging by sending VHOST_USER_SET_FEATURES. The command doesn't arrive at the
vhost-user-blk daemon immediately, and the daemon needs some time to turn the
logging on internally. If qemu doesn't wait for a reply after sending the
command, qemu may start migrating memory pages to a destination. At this time,
the logging may not be actually turned on in the daemon but some guest pages,
which the daemon is about to write to, may have already been transferred
without logging to the destination. Since the logging wasn't turned on,
those pages won't be transferred again as dirty. So we may end up with
corrupted data on the destination.
The same scenario is applicable for "used ring" data logging, which is
turned on with VHOST_USER_SET_VRING_ADDR command.

To resolve this issue, this patch makes qemu wait for the result of these
commands explicitly if VHOST_USER_PROTOCOL_F_REPLY_ACK is negotiated.

Signed-off-by: Denis Plotnikov 
---
  hw/virtio/vhost-user.c | 31 ---
  1 file changed, 28 insertions(+), 3 deletions(-)

diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index ee57abe04526..15b5fac67cf3 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -1105,10 +1105,20 @@ static int vhost_user_set_vring_addr(struct vhost_dev 
*dev,
  .hdr.size = sizeof(msg.payload.addr),
  };
  
+bool reply_supported = virtio_has_feature(dev->protocol_features,

+  VHOST_USER_PROTOCOL_F_REPLY_ACK);
+if (reply_supported) {
+msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK;
+}
+
   if (vhost_user_write(dev, &msg, NULL, 0) < 0) {
  return -1;
  }
  
+if (reply_supported) {

+return process_message_reply(dev, &msg);
+}
+
  return 0;
  }
  

Same - can we limit this to when logging is being enabled?


I think it's possible but do we really need some additional complexity?

Are you concerned about delays on device initialization? Would waiting
for the reply introduce a significant device initialization time delay?
In my understanding, this is done rarely, on vhost-user device
initialization. So maybe we can afford it to be a little bit longer?


As for the migration case, in my understanding, most of the time the
migration of vhost-user devices should be done with logging enabled. Otherwise
it's hard to tell how to make sure that the memory migrates with
consistent data. So here we shouldn't care too much about setup speed
and should care more about data consistency. What do you think?


Thanks!

Denis




@@ -1288,7 +1298,8 @@ static int vhost_user_set_vring_call(struct vhost_dev 
*dev,
  return vhost_set_vring_file(dev, VHOST_USER_SET_VRING_CALL, file);
  }
  
-static int vhost_user_set_u64(struct vhost_dev *dev, int request, uint64_t u64)

+static int vhost_user_set_u64(struct vhost_dev *dev, int request, uint64_t u64,
+  bool need_reply)
  {
  VhostUserMsg msg = {
  .hdr.request = request,
@@ -1297,23 +1308,37 @@ static int vhost_user_set_u64(struct vhost_dev *dev, 
int request, uint64_t u64)
  .hdr.size = sizeof(msg.payload.u64),
  };
  
+if (need_reply) {

+bool reply_supported = virtio_has_feature(dev->protocol_features,
+  VHOST_USER_PROTOCOL_F_REPLY_ACK);
+if (reply_supported) {
+msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK;
+}
+}
+
   if (vhost_user_write(dev, &msg, NULL, 0) < 0) {
  return -1;
  }
  
+if (msg.hdr.flags & VHOST_USER_NEED_REPLY_MASK) {

+return process_message_reply(dev, &msg);
+}
+
  return 0;
  }
  
  static int vhost_user_set_features(struct vhost_dev *dev,

 uint64_t features)
  {
-return vhost_user_set_u64(dev, VHOST_USER_SET_FEATURES, features);
+return vhost_user_set_u64(dev, VHOST_USER_SET_FEATURES, features,
+  true);
  }
  

Same here. In fact,


  static int vhost_user_set_protocol_features(struct vhost_dev *dev,
  uint64_t features)
  {
-return vhost_user_set_u64(dev, VHOST_USER_SET_PROTOCOL_FEATURES, features);
+return vhost_user_set_u64(dev, VHOST_

[PATCH v1] vhost: make SET_VRING_ADDR, SET_FEATURES send replies

2021-07-08 Thread Denis Plotnikov
On vhost-user-blk migration, qemu normally sends a number of commands
to enable logging if VHOST_USER_PROTOCOL_F_LOG_SHMFD is negotiated.
Qemu sends VHOST_USER_SET_FEATURES to enable buffers logging and
VHOST_USER_SET_VRING_ADDR per each started ring to enable "used ring"
data logging.
The issue is that qemu doesn't wait for a reply from the vhost daemon
for these commands, which may result in races between qemu's expectation
of logging starting and logging actually starting in the vhost daemon.

The race can appear as follows: on migration setup, qemu enables dirty page
logging by sending VHOST_USER_SET_FEATURES. The command doesn't arrive at the
vhost-user-blk daemon immediately, and the daemon needs some time to turn the
logging on internally. If qemu doesn't wait for a reply after sending the
command, qemu may start migrating memory pages to a destination. At this time,
the logging may not be actually turned on in the daemon but some guest pages,
which the daemon is about to write to, may have already been transferred
without logging to the destination. Since the logging wasn't turned on,
those pages won't be transferred again as dirty. So we may end up with
corrupted data on the destination.
The same scenario is applicable for "used ring" data logging, which is
turned on with VHOST_USER_SET_VRING_ADDR command.

To resolve this issue, this patch makes qemu wait for the result of these
commands explicitly if VHOST_USER_PROTOCOL_F_REPLY_ACK is negotiated.

Signed-off-by: Denis Plotnikov 
---
 hw/virtio/vhost-user.c | 31 ---
 1 file changed, 28 insertions(+), 3 deletions(-)

diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index ee57abe04526..15b5fac67cf3 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -1105,10 +1105,20 @@ static int vhost_user_set_vring_addr(struct vhost_dev 
*dev,
 .hdr.size = sizeof(msg.payload.addr),
 };
 
+bool reply_supported = virtio_has_feature(dev->protocol_features,
+  VHOST_USER_PROTOCOL_F_REPLY_ACK);
+if (reply_supported) {
+msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK;
+}
+
  if (vhost_user_write(dev, &msg, NULL, 0) < 0) {
 return -1;
 }
 
+if (reply_supported) {
+return process_message_reply(dev, &msg);
+}
+
 return 0;
 }
 
@@ -1288,7 +1298,8 @@ static int vhost_user_set_vring_call(struct vhost_dev 
*dev,
 return vhost_set_vring_file(dev, VHOST_USER_SET_VRING_CALL, file);
 }
 
-static int vhost_user_set_u64(struct vhost_dev *dev, int request, uint64_t u64)
+static int vhost_user_set_u64(struct vhost_dev *dev, int request, uint64_t u64,
+  bool need_reply)
 {
 VhostUserMsg msg = {
 .hdr.request = request,
@@ -1297,23 +1308,37 @@ static int vhost_user_set_u64(struct vhost_dev *dev, 
int request, uint64_t u64)
 .hdr.size = sizeof(msg.payload.u64),
 };
 
+if (need_reply) {
+bool reply_supported = virtio_has_feature(dev->protocol_features,
+  VHOST_USER_PROTOCOL_F_REPLY_ACK);
+if (reply_supported) {
+msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK;
+}
+}
+
  if (vhost_user_write(dev, &msg, NULL, 0) < 0) {
 return -1;
 }
 
+if (msg.hdr.flags & VHOST_USER_NEED_REPLY_MASK) {
+return process_message_reply(dev, &msg);
+}
+
 return 0;
 }
 
 static int vhost_user_set_features(struct vhost_dev *dev,
uint64_t features)
 {
-return vhost_user_set_u64(dev, VHOST_USER_SET_FEATURES, features);
+return vhost_user_set_u64(dev, VHOST_USER_SET_FEATURES, features,
+  true);
 }
 
 static int vhost_user_set_protocol_features(struct vhost_dev *dev,
 uint64_t features)
 {
-return vhost_user_set_u64(dev, VHOST_USER_SET_PROTOCOL_FEATURES, features);
+return vhost_user_set_u64(dev, VHOST_USER_SET_PROTOCOL_FEATURES, features,
+  false);
 }
 
 static int vhost_user_get_u64(struct vhost_dev *dev, int request, uint64_t 
*u64)
-- 
2.25.1




Re: [PATCH v0] vhost: make SET_VRING_ADDR, SET_[PROTOCOL_]FEATURES send replies

2021-07-08 Thread Denis Plotnikov



On 07.07.2021 21:44, Michael S. Tsirkin wrote:

On Wed, Jul 07, 2021 at 05:58:50PM +0300, Denis Plotnikov wrote:

On 07.07.2021 17:39, Michael S. Tsirkin wrote:

On Wed, Jul 07, 2021 at 03:19:20PM +0300, Denis Plotnikov wrote:

On 07.07.2021 13:10, Michael S. Tsirkin wrote:

On Fri, Jun 25, 2021 at 11:52:10AM +0300, Denis Plotnikov wrote:

On vhost-user-blk migration, qemu normally sends a number of commands
to enable logging if VHOST_USER_PROTOCOL_F_LOG_SHMFD is negotiated.
Qemu sends VHOST_USER_SET_FEATURES to enable buffers logging and
VHOST_USER_SET_FEATURES per each started ring to enable "used ring"
data logging.
The issue is that qemu doesn't wait for a reply from the vhost daemon
for these commands, which may result in races between qemu's expectation
of logging starting and logging actually starting in the vhost daemon.

Could you be more explicit please? What kind of race have you
observed? Getting a reply slows down the setup considerably and
should not be done lightly.

I'm talking about the vhost-user-blk case. On migration setup, we enable
logging by sending VHOST_USER_SET_FEATURES. The command doesn't arrive at the
vhost-user-blk daemon immediately, and the daemon needs some time to turn the
logging on internally. If qemu doesn't wait for a reply after sending the
command, qemu may start migrating memory pages. At this time the logging may
not be actually turned on in the daemon, but some guest pages, which the
daemon is about to write to, may already have been transferred without logging to the
destination. Since the logging wasn't turned on, those pages won't be
transferred again as dirty. So we may end up with corrupted data on the
destination.

Have I managed to explain the case clearly?

Thanks!

Denis

OK so this is just about enabling logging. It would be cleaner to
defer migrating memory until response ... if that is too hard,
at least document why we are doing this please.
And, let's wait for an ack just in that case then - why not?

And what about VHOST_USER_SET_PROTOCOL_FEATURES?

The code uses the same path for both VHOST_USER_SET_PROTOCOL_FEATURES and
VHOST_USER_SET_FEATURES via vhost_user_set_u64(). So, I decided to suggest
adding a reply to both of them, so that both feature-setting commands work
similarly, as it doesn't contradict the vhost-user spec.

I'm not sure it's worth doing that, so if you think it's not, I'll just
remove them.


Denis


I'm inclined to say let's not add to the latency of setting up the
device unnecessarily.


ok

I'll remove reply for VHOST_USER_SET_FEATURES and amend the commit 
message in v2


Thanks!

Denis




Thanks!


To resolve this issue, this patch makes qemu wait for the result of these
commands explicitly if VHOST_USER_PROTOCOL_F_REPLY_ACK is negotiated.
Also, this patch adds reply waiting for the VHOST_USER_SET_PROTOCOL_FEATURES
command, to make the feature-setting functions work similarly.

Signed-off-by: Denis Plotnikov 
---
hw/virtio/vhost-user.c | 20 
1 file changed, 20 insertions(+)

diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index ee57abe04526..e47b82adab00 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -1105,10 +1105,20 @@ static int vhost_user_set_vring_addr(struct vhost_dev 
*dev,
.hdr.size = sizeof(msg.payload.addr),
};
+bool reply_supported = virtio_has_feature(dev->protocol_features,
+  VHOST_USER_PROTOCOL_F_REPLY_ACK);
+if (reply_supported) {
+msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK;
+}
+
 if (vhost_user_write(dev, &msg, NULL, 0) < 0) {
return -1;
}
+if (reply_supported) {
+return process_message_reply(dev, &msg);
+}
+
return 0;
}
@@ -1297,10 +1307,20 @@ static int vhost_user_set_u64(struct vhost_dev *dev, 
int request, uint64_t u64)
.hdr.size = sizeof(msg.payload.u64),
};
+bool reply_supported = virtio_has_feature(dev->protocol_features,
+  VHOST_USER_PROTOCOL_F_REPLY_ACK);
+if (reply_supported) {
+msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK;
+}
+
 if (vhost_user_write(dev, &msg, NULL, 0) < 0) {
return -1;
}
+if (reply_supported) {
+return process_message_reply(dev, &msg);
+}
+
return 0;
}
--
2.25.1




Re: [PATCH v0] vhost: make SET_VRING_ADDR, SET_[PROTOCOL_]FEATURES send replies

2021-07-07 Thread Denis Plotnikov



On 07.07.2021 17:39, Michael S. Tsirkin wrote:

On Wed, Jul 07, 2021 at 03:19:20PM +0300, Denis Plotnikov wrote:

On 07.07.2021 13:10, Michael S. Tsirkin wrote:

On Fri, Jun 25, 2021 at 11:52:10AM +0300, Denis Plotnikov wrote:

On vhost-user-blk migration, qemu normally sends a number of commands
to enable logging if VHOST_USER_PROTOCOL_F_LOG_SHMFD is negotiated.
Qemu sends VHOST_USER_SET_FEATURES to enable buffers logging and
VHOST_USER_SET_FEATURES per each started ring to enable "used ring"
data logging.
The issue is that qemu doesn't wait for a reply from the vhost daemon
for these commands, which may result in races between qemu's expectation
of logging starting and logging actually starting in the vhost daemon.

Could you be more explicit please? What kind of race have you
observed? Getting a reply slows down the setup considerably and
should not be done lightly.

I'm talking about the vhost-user-blk case. On migration setup, we enable
logging by sending VHOST_USER_SET_FEATURES. The command doesn't arrive at the
vhost-user-blk daemon immediately, and the daemon needs some time to turn the
logging on internally. If qemu doesn't wait for a reply after sending the
command, qemu may start migrating memory pages. At this time the logging may
not be actually turned on in the daemon, but some guest pages, which the
daemon is about to write to, may already have been transferred without logging to the
destination. Since the logging wasn't turned on, those pages won't be
transferred again as dirty. So we may end up with corrupted data on the
destination.

Have I managed to explain the case clearly?

Thanks!

Denis

OK so this is just about enabling logging. It would be cleaner to
defer migrating memory until response ... if that is too hard,
at least document why we are doing this please.
And, let's wait for an ack just in that case then - why not?

And what about VHOST_USER_SET_PROTOCOL_FEATURES?


The code uses the same path for both VHOST_USER_SET_PROTOCOL_FEATURES 
and VHOST_USER_SET_FEATURES via vhost_user_set_u64(). So, I decided to 
suggest adding a reply to both of them, so that both feature-setting commands
work similarly, as it doesn't contradict the vhost-user spec.


I'm not sure it's worth doing that, so if you think it's not, I'll
just remove them.



Denis





Thanks!


To resolve this issue, this patch makes qemu wait for the result of these
commands explicitly if VHOST_USER_PROTOCOL_F_REPLY_ACK is negotiated.
Also, this patch adds reply waiting for the VHOST_USER_SET_PROTOCOL_FEATURES
command, to make the feature-setting functions work similarly.

Signed-off-by: Denis Plotnikov 
---
   hw/virtio/vhost-user.c | 20 
   1 file changed, 20 insertions(+)

diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index ee57abe04526..e47b82adab00 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -1105,10 +1105,20 @@ static int vhost_user_set_vring_addr(struct vhost_dev 
*dev,
   .hdr.size = sizeof(msg.payload.addr),
   };
+bool reply_supported = virtio_has_feature(dev->protocol_features,
+  VHOST_USER_PROTOCOL_F_REPLY_ACK);
+if (reply_supported) {
+msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK;
+}
+
   if (vhost_user_write(dev, &msg, NULL, 0) < 0) {
   return -1;
   }
+if (reply_supported) {
+return process_message_reply(dev, &msg);
+}
+
   return 0;
   }
@@ -1297,10 +1307,20 @@ static int vhost_user_set_u64(struct vhost_dev *dev, 
int request, uint64_t u64)
   .hdr.size = sizeof(msg.payload.u64),
   };
+bool reply_supported = virtio_has_feature(dev->protocol_features,
+  VHOST_USER_PROTOCOL_F_REPLY_ACK);
+if (reply_supported) {
+msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK;
+}
+
   if (vhost_user_write(dev, &msg, NULL, 0) < 0) {
   return -1;
   }
+if (reply_supported) {
+return process_message_reply(dev, &msg);
+}
+
   return 0;
   }
--
2.25.1




Re: [PATCH v0] vhost: make SET_VRING_ADDR, SET_[PROTOCOL_]FEATURES send replies

2021-07-07 Thread Denis Plotnikov



On 07.07.2021 13:10, Michael S. Tsirkin wrote:

On Fri, Jun 25, 2021 at 11:52:10AM +0300, Denis Plotnikov wrote:

On vhost-user-blk migration, qemu normally sends a number of commands
to enable logging if VHOST_USER_PROTOCOL_F_LOG_SHMFD is negotiated.
Qemu sends VHOST_USER_SET_FEATURES to enable buffers logging and
VHOST_USER_SET_FEATURES per each started ring to enable "used ring"
data logging.
The issue is that qemu doesn't wait for a reply from the vhost daemon
for these commands, which may result in races between qemu's expectation
of logging starting and logging actually starting in the vhost daemon.

Could you be more explicit please? What kind of race have you
observed? Getting a reply slows down the setup considerably and
should not be done lightly.


I'm talking about the vhost-user-blk case. On migration setup, we enable 
logging by sending VHOST_USER_SET_FEATURES. The command doesn't arrive
at the vhost-user-blk daemon immediately, and the daemon needs some time
to turn the logging on internally. If qemu doesn't wait for a reply, after
sending the command qemu may start migrating memory pages. At this time
the logging may not be actually turned on in the daemon, but some guest
pages, which the daemon is about to write to, may already have been transferred
without logging to the destination. Since the logging wasn't turned on,
those pages won't be transferred again as dirty. So we may end up with 
corrupted data on the destination.


Have I managed to explain the case clearly?

Thanks!

Denis



Thanks!


To resolve this issue, this patch makes qemu wait for the result of these
commands explicitly if VHOST_USER_PROTOCOL_F_REPLY_ACK is negotiated.
Also, this patch adds reply waiting for the VHOST_USER_SET_PROTOCOL_FEATURES
command, to make the feature-setting functions work similarly.

Signed-off-by: Denis Plotnikov 
---
  hw/virtio/vhost-user.c | 20 
  1 file changed, 20 insertions(+)

diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index ee57abe04526..e47b82adab00 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -1105,10 +1105,20 @@ static int vhost_user_set_vring_addr(struct vhost_dev 
*dev,
  .hdr.size = sizeof(msg.payload.addr),
  };
  
+bool reply_supported = virtio_has_feature(dev->protocol_features,

+  VHOST_USER_PROTOCOL_F_REPLY_ACK);
+if (reply_supported) {
+msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK;
+}
+
  if (vhost_user_write(dev, &msg, NULL, 0) < 0) {
  return -1;
  }
  
+if (reply_supported) {

+return process_message_reply(dev, &msg);
+}
+
  return 0;
  }
  
@@ -1297,10 +1307,20 @@ static int vhost_user_set_u64(struct vhost_dev *dev, int request, uint64_t u64)

  .hdr.size = sizeof(msg.payload.u64),
  };
  
+bool reply_supported = virtio_has_feature(dev->protocol_features,

+  VHOST_USER_PROTOCOL_F_REPLY_ACK);
+if (reply_supported) {
+msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK;
+}
+
  if (vhost_user_write(dev, &msg, NULL, 0) < 0) {
  return -1;
  }
  
+if (reply_supported) {

+return process_message_reply(dev, &msg);
+}
+
  return 0;
  }
  
--

2.25.1




[PING] [PATCH v0] vhost: make SET_VRING_ADDR, SET_[PROTOCOL_]FEATURES send replies

2021-07-07 Thread Denis Plotnikov


On 02.07.2021 12:41, Denis Plotnikov wrote:


ping ping!

On 25.06.2021 11:52, Denis Plotnikov wrote:

On vhost-user-blk migration, qemu normally sends a number of commands
to enable logging if VHOST_USER_PROTOCOL_F_LOG_SHMFD is negotiated.
Qemu sends VHOST_USER_SET_FEATURES to enable buffers logging and
VHOST_USER_SET_FEATURES per each started ring to enable "used ring"
data logging.
The issue is that qemu doesn't wait for a reply from the vhost daemon
for these commands, which may result in races between qemu's expectation
of logging starting and logging actually starting in the vhost daemon.
To resolve this issue, this patch makes qemu wait for the result of these
commands explicitly if VHOST_USER_PROTOCOL_F_REPLY_ACK is negotiated.
Also, this patch adds reply waiting for the VHOST_USER_SET_PROTOCOL_FEATURES
command, to make the feature-setting functions work similarly.

Signed-off-by: Denis Plotnikov
---
  hw/virtio/vhost-user.c | 20 
  1 file changed, 20 insertions(+)

diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index ee57abe04526..e47b82adab00 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -1105,10 +1105,20 @@ static int vhost_user_set_vring_addr(struct vhost_dev 
*dev,
  .hdr.size = sizeof(msg.payload.addr),
  };
  
+bool reply_supported = virtio_has_feature(dev->protocol_features,

+  VHOST_USER_PROTOCOL_F_REPLY_ACK);
+if (reply_supported) {
+msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK;
+}
+
  if (vhost_user_write(dev, &msg, NULL, 0) < 0) {
  return -1;
  }
  
+if (reply_supported) {

+return process_message_reply(dev, &msg);
+}
+
  return 0;
  }
  
@@ -1297,10 +1307,20 @@ static int vhost_user_set_u64(struct vhost_dev *dev, int request, uint64_t u64)

  .hdr.size = sizeof(msg.payload.u64),
  };
  
+bool reply_supported = virtio_has_feature(dev->protocol_features,

+  VHOST_USER_PROTOCOL_F_REPLY_ACK);
+if (reply_supported) {
+msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK;
+}
+
  if (vhost_user_write(dev, &msg, NULL, 0) < 0) {
  return -1;
  }
  
+if (reply_supported) {

+return process_message_reply(dev, &msg);
+}
+
  return 0;
  }
  


Re: [PATCH v0] vhost: make SET_VRING_ADDR, SET_[PROTOCOL_]FEATURES send replies

2021-07-02 Thread Denis Plotnikov

ping ping!

On 25.06.2021 11:52, Denis Plotnikov wrote:

On vhost-user-blk migration, qemu normally sends a number of commands
to enable logging if VHOST_USER_PROTOCOL_F_LOG_SHMFD is negotiated.
Qemu sends VHOST_USER_SET_FEATURES to enable buffers logging and
VHOST_USER_SET_FEATURES per each started ring to enable "used ring"
data logging.
The issue is that qemu doesn't wait for a reply from the vhost daemon
for these commands, which may result in races between qemu's expectation
of logging starting and logging actually starting in the vhost daemon.
To resolve this issue, this patch makes qemu wait for the result of these
commands explicitly if VHOST_USER_PROTOCOL_F_REPLY_ACK is negotiated.
Also, this patch adds reply waiting for the VHOST_USER_SET_PROTOCOL_FEATURES
command, to make the feature-setting functions work similarly.

Signed-off-by: Denis Plotnikov 
---
  hw/virtio/vhost-user.c | 20 
  1 file changed, 20 insertions(+)

diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index ee57abe04526..e47b82adab00 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -1105,10 +1105,20 @@ static int vhost_user_set_vring_addr(struct vhost_dev 
*dev,
  .hdr.size = sizeof(msg.payload.addr),
  };
  
+bool reply_supported = virtio_has_feature(dev->protocol_features,

+  VHOST_USER_PROTOCOL_F_REPLY_ACK);
+if (reply_supported) {
+msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK;
+}
+
  if (vhost_user_write(dev, &msg, NULL, 0) < 0) {
  return -1;
  }
  
+if (reply_supported) {

+return process_message_reply(dev, &msg);
+}
+
  return 0;
  }
  
@@ -1297,10 +1307,20 @@ static int vhost_user_set_u64(struct vhost_dev *dev, int request, uint64_t u64)

  .hdr.size = sizeof(msg.payload.u64),
  };
  
+bool reply_supported = virtio_has_feature(dev->protocol_features,

+  VHOST_USER_PROTOCOL_F_REPLY_ACK);
+if (reply_supported) {
+msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK;
+}
+
  if (vhost_user_write(dev, &msg, NULL, 0) < 0) {
  return -1;
  }
  
+if (reply_supported) {

+return process_message_reply(dev, &msg);
+}
+
  return 0;
  }
  


[PATCH v0] vhost: make SET_VRING_ADDR, SET_[PROTOCOL_]FEATURES send replies

2021-06-25 Thread Denis Plotnikov
On vhost-user-blk migration, qemu normally sends a number of commands
to enable logging if VHOST_USER_PROTOCOL_F_LOG_SHMFD is negotiated.
Qemu sends VHOST_USER_SET_FEATURES to enable buffers logging and
VHOST_USER_SET_FEATURES per each started ring to enable "used ring"
data logging.
The issue is that qemu doesn't wait for a reply from the vhost daemon
for these commands, which may result in races between qemu's expectation
of logging starting and logging actually starting in the vhost daemon.
To resolve this issue, this patch makes qemu wait for the result of these
commands explicitly if VHOST_USER_PROTOCOL_F_REPLY_ACK is negotiated.
Also, this patch adds reply waiting for the VHOST_USER_SET_PROTOCOL_FEATURES
command, to make the feature-setting functions work similarly.

Signed-off-by: Denis Plotnikov 
---
 hw/virtio/vhost-user.c | 20 
 1 file changed, 20 insertions(+)

diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index ee57abe04526..e47b82adab00 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -1105,10 +1105,20 @@ static int vhost_user_set_vring_addr(struct vhost_dev 
*dev,
 .hdr.size = sizeof(msg.payload.addr),
 };
 
+bool reply_supported = virtio_has_feature(dev->protocol_features,
+  VHOST_USER_PROTOCOL_F_REPLY_ACK);
+if (reply_supported) {
+msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK;
+}
+
 if (vhost_user_write(dev, &msg, NULL, 0) < 0) {
 return -1;
 }
 
+if (reply_supported) {
+return process_message_reply(dev, &msg);
+}
+
 return 0;
 }
 
@@ -1297,10 +1307,20 @@ static int vhost_user_set_u64(struct vhost_dev *dev, 
int request, uint64_t u64)
 .hdr.size = sizeof(msg.payload.u64),
 };
 
+bool reply_supported = virtio_has_feature(dev->protocol_features,
+  VHOST_USER_PROTOCOL_F_REPLY_ACK);
+if (reply_supported) {
+msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK;
+}
+
 if (vhost_user_write(dev, &msg, NULL, 0) < 0) {
 return -1;
 }
 
+if (reply_supported) {
+return process_message_reply(dev, &msg);
+}
+
 return 0;
 }
 
-- 
2.25.1




Re: [PATCH 1/5] vhost-user-blk: Don't reconnect during initialisation

2021-04-23 Thread Denis Plotnikov

Reviewed-by: Denis Plotnikov 

On 22.04.2021 20:02, Kevin Wolf wrote:

This is a partial revert of commits 77542d43149 and bc79c87bcde.

Usually, an error during initialisation means that the configuration was
wrong. Reconnecting won't make the error go away, but just turn the
error condition into an endless loop. Avoid this and return errors
again.

Additionally, calling vhost_user_blk_disconnect() from the chardev event
handler could result in use-after-free because none of the
initialisation code expects that the device could just go away in the
middle. So removing the call fixes crashes in several places.
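
As a stand-alone illustration of the first point: if the error comes from the
configuration, a reconnect-on-error loop never terminates. The names below are
placeholders, not the QEMU code.

#include <stdbool.h>
#include <stdio.h>

/* Placeholder: a configuration error is permanent, so init always fails. */
static bool init_device(void)
{
    return false;
}

int main(void)
{
    /* Old behaviour modelled: retry on every failure -> endless loop.
     * Returning the error instead (the new behaviour) ends realize()
     * with a clear message after the first attempt. */
    for (int attempt = 1; attempt <= 3; attempt++) {   /* capped for the demo */
        if (init_device()) {
            return 0;
        }
        fprintf(stderr, "init failed, reconnecting (attempt %d)\n", attempt);
    }
    fprintf(stderr, "would keep looping forever without a cap\n");
    return 1;
}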

For example, using a num-queues setting that is incompatible with the
backend would result in a crash like this (dereferencing dev->opaque,
which is already NULL):

  #0  0x55d0a4bd in vhost_user_read_cb (source=0x568f4690, 
condition=(G_IO_IN | G_IO_HUP), opaque=0x7fffcbf0) at 
../hw/virtio/vhost-user.c:313
  #1  0x55d950d3 in qio_channel_fd_source_dispatch (source=0x57c3f750, 
callback=0x55d0a478 , user_data=0x7fffcbf0) at 
../io/channel-watch.c:84
  #2  0x77b32a9f in g_main_context_dispatch () at 
/lib64/libglib-2.0.so.0
  #3  0x77b84a98 in g_main_context_iterate.constprop () at 
/lib64/libglib-2.0.so.0
  #4  0x77b32163 in g_main_loop_run () at /lib64/libglib-2.0.so.0
  #5  0x55d0a724 in vhost_user_read (dev=0x57bc62f8, 
msg=0x7fffcc50) at ../hw/virtio/vhost-user.c:402
  #6  0x55d0ee6b in vhost_user_get_config (dev=0x57bc62f8, 
config=0x57bc62ac "", config_len=60) at ../hw/virtio/vhost-user.c:2133
  #7  0x55d56d46 in vhost_dev_get_config (hdev=0x57bc62f8, 
config=0x57bc62ac "", config_len=60) at ../hw/virtio/vhost.c:1566
  #8  0x55cdd150 in vhost_user_blk_device_realize (dev=0x57bc60b0, 
errp=0x7fffcf90) at ../hw/block/vhost-user-blk.c:510
  #9  0x55d08f6d in virtio_device_realize (dev=0x57bc60b0, 
errp=0x7fffcff0) at ../hw/virtio/virtio.c:3660

Signed-off-by: Kevin Wolf 
---
  hw/block/vhost-user-blk.c | 54 ++-
  1 file changed, 13 insertions(+), 41 deletions(-)

diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
index f5e9682703..e824b0a759 100644
--- a/hw/block/vhost-user-blk.c
+++ b/hw/block/vhost-user-blk.c
@@ -50,6 +50,8 @@ static const int user_feature_bits[] = {
  VHOST_INVALID_FEATURE_BIT
  };
  
+static void vhost_user_blk_event(void *opaque, QEMUChrEvent event);

+
  static void vhost_user_blk_update_config(VirtIODevice *vdev, uint8_t *config)
  {
  VHostUserBlk *s = VHOST_USER_BLK(vdev);
@@ -362,19 +364,6 @@ static void vhost_user_blk_disconnect(DeviceState *dev)
  vhost_dev_cleanup(&s->dev);
  }
  
-static void vhost_user_blk_event(void *opaque, QEMUChrEvent event,

- bool realized);
-
-static void vhost_user_blk_event_realize(void *opaque, QEMUChrEvent event)
-{
-vhost_user_blk_event(opaque, event, false);
-}
-
-static void vhost_user_blk_event_oper(void *opaque, QEMUChrEvent event)
-{
-vhost_user_blk_event(opaque, event, true);
-}
-
  static void vhost_user_blk_chr_closed_bh(void *opaque)
  {
  DeviceState *dev = opaque;
@@ -382,12 +371,11 @@ static void vhost_user_blk_chr_closed_bh(void *opaque)
  VHostUserBlk *s = VHOST_USER_BLK(vdev);
  
  vhost_user_blk_disconnect(dev);

-qemu_chr_fe_set_handlers(&s->chardev, NULL, NULL,
-vhost_user_blk_event_oper, NULL, opaque, NULL, true);
+qemu_chr_fe_set_handlers(&s->chardev, NULL, NULL, vhost_user_blk_event,
+ NULL, opaque, NULL, true);
  }
  
-static void vhost_user_blk_event(void *opaque, QEMUChrEvent event,

- bool realized)
+static void vhost_user_blk_event(void *opaque, QEMUChrEvent event)
  {
  DeviceState *dev = opaque;
  VirtIODevice *vdev = VIRTIO_DEVICE(dev);
@@ -401,17 +389,7 @@ static void vhost_user_blk_event(void *opaque, 
QEMUChrEvent event,
  }
  break;
  case CHR_EVENT_CLOSED:
-/*
- * Closing the connection should happen differently on device
- * initialization and operation stages.
- * On initalization, we want to re-start vhost_dev initialization
- * from the very beginning right away when the connection is closed,
- * so we clean up vhost_dev on each connection closing.
- * On operation, we want to postpone vhost_dev cleanup to let the
- * other code perform its own cleanup sequence using vhost_dev data
- * (e.g. vhost_dev_set_log).
- */
-if (realized && !runstate_check(RUN_STATE_SHUTDOWN)) {
+if (!runstate_check(RUN_STATE_SHUTDOWN)) {
  /*
   * A close event may happen during a read/write, but vhost
   * code assumes the vhost_dev remains setup, so delay the
@@ -431,8 +409,6 @@ static void vhost_u

Re: [PATCH v3 2/3] vhost-user-blk: perform immediate cleanup if disconnect on initialization

2021-04-22 Thread Denis Plotnikov



On 21.04.2021 22:59, Michael S. Tsirkin wrote:

On Wed, Apr 21, 2021 at 07:13:24PM +0300, Denis Plotnikov wrote:

On 21.04.2021 18:24, Kevin Wolf wrote:

Am 25.03.2021 um 16:12 hat Denis Plotnikov geschrieben:

Commit 4bcad76f4c39 ("vhost-user-blk: delay vhost_user_blk_disconnect")
introduced postponing the vhost_dev cleanup, aiming to eliminate qemu aborts
caused by connection problems with the vhost-blk daemon.

However, it introduces a new problem. Now, any communication error
during execution of vhost_dev_init(), called by vhost_user_blk_device_realize(),
leads to a qemu abort on an assert in vhost_dev_get_config().

This happens because vhost_user_blk_disconnect() is postponed but
it should have dropped the s->connected flag by the time
vhost_user_blk_device_realize() performs a new connection opening.
On the connection opening, vhost_dev initialization in
vhost_user_blk_connect() relies on the s->connected flag, and
if it's not dropped, it skips vhost_dev initialization and returns
with success. Then, vhost_user_blk_device_realize()'s execution flow
goes to vhost_dev_get_config() where it's aborted on the assert.

To fix the problem, this patch adds immediate cleanup on device
initialization (in vhost_user_blk_device_realize()), using different
event handlers for initialization and operation introduced in the
previous patch.
On initialization (in vhost_user_blk_device_realize()) we fully
control the initialization process. At that point, nobody can use the
device since it isn't initialized and we don't need to postpone any
cleanups, so we can do cleanup right away when there is a communication
problem with the vhost-blk daemon.
On operation we leave it as is, since the disconnect may happen when
the device is in use, so the device users may want to use vhost_dev's data
to do rollback before vhost_dev is re-initialized (e.g. in vhost_dev_set_log()).

Signed-off-by: Denis Plotnikov 
Reviewed-by: Raphael Norwitz 

I think there is something wrong with this patch.

I'm debugging an error case, specifically num-queues being larger in
QEMU than in the vhost-user-blk export. Before this patch, it has just
an unfriendly error message:

qemu-system-x86_64: -device 
vhost-user-blk-pci,chardev=vhost1,id=blk1,iommu_platform=off,disable-legacy=on,num-queues=4:
 Unexpected end-of-file before all data were read
qemu-system-x86_64: -device 
vhost-user-blk-pci,chardev=vhost1,id=blk1,iommu_platform=off,disable-legacy=on,num-queues=4:
 Failed to read msg header. Read 0 instead of 12. Original request 24.
qemu-system-x86_64: -device 
vhost-user-blk-pci,chardev=vhost1,id=blk1,iommu_platform=off,disable-legacy=on,num-queues=4:
 vhost-user-blk: get block config failed
qemu-system-x86_64: Failed to set msg fds.
qemu-system-x86_64: vhost VQ 0 ring restore failed: -1: Resource temporarily 
unavailable (11)

After the patch, it crashes:

#0  0x55d0a4bd in vhost_user_read_cb (source=0x568f4690, 
condition=(G_IO_IN | G_IO_HUP), opaque=0x7fffcbf0) at 
../hw/virtio/vhost-user.c:313
#1  0x55d950d3 in qio_channel_fd_source_dispatch (source=0x57c3f750, 
callback=0x55d0a478 , user_data=0x7fffcbf0) at 
../io/channel-watch.c:84
#2  0x77b32a9f in g_main_context_dispatch () at /lib64/libglib-2.0.so.0
#3  0x77b84a98 in g_main_context_iterate.constprop () at 
/lib64/libglib-2.0.so.0
#4  0x77b32163 in g_main_loop_run () at /lib64/libglib-2.0.so.0
#5  0x55d0a724 in vhost_user_read (dev=0x57bc62f8, 
msg=0x7fffcc50) at ../hw/virtio/vhost-user.c:402
#6  0x55d0ee6b in vhost_user_get_config (dev=0x57bc62f8, 
config=0x57bc62ac "", config_len=60) at ../hw/virtio/vhost-user.c:2133
#7  0x55d56d46 in vhost_dev_get_config (hdev=0x57bc62f8, 
config=0x57bc62ac "", config_len=60) at ../hw/virtio/vhost.c:1566
#8  0x55cdd150 in vhost_user_blk_device_realize (dev=0x57bc60b0, 
errp=0x7fffcf90) at ../hw/block/vhost-user-blk.c:510
#9  0x55d08f6d in virtio_device_realize (dev=0x57bc60b0, 
errp=0x7fffcff0) at ../hw/virtio/virtio.c:3660

The problem is that vhost_user_read_cb() still accesses dev->opaque even
though the device has been cleaned up meanwhile when the connection was
closed (the vhost_user_blk_disconnect() added by this patch), so it's
NULL now. This problem was actually mentioned in the comment that is
removed by this patch.

I tried to fix this by making vhost_user_read() cope with the fact that
the device might have been cleaned up meanwhile, but then I'm running
into the next set of problems.

The first is that retrying is pointless, the error condition is in the
configuration, it will never change.

The other is that after many repetitions of the same error message, I
got a crash where the device is cleaned up a second time in
vhost_dev_init() and the virtqueues are already NULL.

So it seems to me that erroring out during the initialisation phase
makes a lot more sense than 

Re: [PATCH v3 2/3] vhost-user-blk: perform immediate cleanup if disconnect on initialization

2021-04-21 Thread Denis Plotnikov



On 21.04.2021 18:24, Kevin Wolf wrote:

Am 25.03.2021 um 16:12 hat Denis Plotnikov geschrieben:

Commit 4bcad76f4c39 ("vhost-user-blk: delay vhost_user_blk_disconnect")
introduced postponing vhost_dev cleanup aiming to eliminate qemu aborts
because of connection problems with vhost-blk daemon.

However, it introduces a new problem. Now, any communication errors
during execution of vhost_dev_init() called by vhost_user_blk_device_realize()
lead to qemu abort on assert in vhost_dev_get_config().

This happens because vhost_user_blk_disconnect() is postponed but
it should have dropped s->connected flag by the time
vhost_user_blk_device_realize() performs a new connection opening.
On the connection opening, vhost_dev initialization in
vhost_user_blk_connect() relies on the s->connected flag and
if it's not dropped, it skips vhost_dev initialization and returns
with success. Then, vhost_user_blk_device_realize()'s execution flow
goes to vhost_dev_get_config() where it's aborted on the assert.

To fix the problem this patch adds immediate cleanup on device
initialization (in vhost_user_blk_device_realize()) using different
event handlers for initialization and operation introduced in the
previous patch.
On initialization (in vhost_user_blk_device_realize()) we fully
control the initialization process. At that point, nobody can use the
device since it isn't initialized and we don't need to postpone any
cleanups, so we can do cleanup right away when there is a communication
problem with the vhost-blk daemon.
On operation we leave it as is, since the disconnect may happen when
the device is in use, so the device users may want to use vhost_dev's data
to do rollback before vhost_dev is re-initialized (e.g. in vhost_dev_set_log()).

Signed-off-by: Denis Plotnikov 
Reviewed-by: Raphael Norwitz 

I think there is something wrong with this patch.

I'm debugging an error case, specifically num-queues being larger in
QEMU than in the vhost-user-blk export. Before this patch, it just printed
an unfriendly error message:

qemu-system-x86_64: -device 
vhost-user-blk-pci,chardev=vhost1,id=blk1,iommu_platform=off,disable-legacy=on,num-queues=4:
 Unexpected end-of-file before all data were read
qemu-system-x86_64: -device 
vhost-user-blk-pci,chardev=vhost1,id=blk1,iommu_platform=off,disable-legacy=on,num-queues=4:
 Failed to read msg header. Read 0 instead of 12. Original request 24.
qemu-system-x86_64: -device 
vhost-user-blk-pci,chardev=vhost1,id=blk1,iommu_platform=off,disable-legacy=on,num-queues=4:
 vhost-user-blk: get block config failed
qemu-system-x86_64: Failed to set msg fds.
qemu-system-x86_64: vhost VQ 0 ring restore failed: -1: Resource temporarily 
unavailable (11)

After the patch, it crashes:

#0  0x55d0a4bd in vhost_user_read_cb (source=0x568f4690, 
condition=(G_IO_IN | G_IO_HUP), opaque=0x7fffcbf0) at 
../hw/virtio/vhost-user.c:313
#1  0x55d950d3 in qio_channel_fd_source_dispatch (source=0x57c3f750, 
callback=0x55d0a478 , user_data=0x7fffcbf0) at 
../io/channel-watch.c:84
#2  0x77b32a9f in g_main_context_dispatch () at /lib64/libglib-2.0.so.0
#3  0x77b84a98 in g_main_context_iterate.constprop () at 
/lib64/libglib-2.0.so.0
#4  0x77b32163 in g_main_loop_run () at /lib64/libglib-2.0.so.0
#5  0x55d0a724 in vhost_user_read (dev=0x57bc62f8, 
msg=0x7fffcc50) at ../hw/virtio/vhost-user.c:402
#6  0x55d0ee6b in vhost_user_get_config (dev=0x57bc62f8, 
config=0x57bc62ac "", config_len=60) at ../hw/virtio/vhost-user.c:2133
#7  0x55d56d46 in vhost_dev_get_config (hdev=0x57bc62f8, 
config=0x57bc62ac "", config_len=60) at ../hw/virtio/vhost.c:1566
#8  0x55cdd150 in vhost_user_blk_device_realize (dev=0x57bc60b0, 
errp=0x7fffcf90) at ../hw/block/vhost-user-blk.c:510
#9  0x55d08f6d in virtio_device_realize (dev=0x57bc60b0, 
errp=0x7fffcff0) at ../hw/virtio/virtio.c:3660

The problem is that vhost_user_read_cb() still accesses dev->opaque even
though the device has been cleaned up meanwhile when the connection was
closed (the vhost_user_blk_disconnect() added by this patch), so it's
NULL now. This problem was actually mentioned in the comment that is
removed by this patch.

I tried to fix this by making vhost_user_read() cope with the fact that
the device might have been cleaned up meanwhile, but then I'm running
into the next set of problems.

The first is that retrying is pointless, the error condition is in the
configuration, it will never change.

The other is that after many repetitions of the same error message, I
got a crash where the device is cleaned up a second time in
vhost_dev_init() and the virtqueues are already NULL.

So it seems to me that erroring out during the initialisation phase
makes a lot more sense than retrying.
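For illustration, failing realize instead of looping back to reconnect might
look roughly like this (a sketch against the realize path shown later in this
thread; error propagation details are assumptions):

    if (qemu_chr_fe_wait_connected(&s->chardev, &err) < 0) {
        error_report_err(err);
        goto virtio_err;    /* hypothetical: fail realize instead of retrying */
    }

    if (vhost_user_blk_connect(dev) < 0) {
        error_setg(errp, "vhost-user-blk: connection to backend failed");
        goto virtio_err;    /* a config mismatch will never resolve itself */
    }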

Kevin


But without the patch there will be another problem which the patch 
actually addresses.


It seems to me

[BUG FIX][PATCH v3 0/3] vhost-user-blk: fix bug on device disconnection during initialization

2021-04-01 Thread Denis Plotnikov

This is a series fixing a bug in vhost-user-blk.
Is there any chance for it to be considered for the next rc?

Thanks!

Denis

On 29.03.2021 16:44, Denis Plotnikov wrote:


ping!

On 25.03.2021 18:12, Denis Plotnikov wrote:

v3:
   * 0003: a new patch added fixing the problem on vm shutdown
     I stumbled upon this bug after sending v2.
   * 0001: grammar fixes (Raphael)
   * 0002: commit message fixing (Raphael)

v2:
   * split the initial patch into two (Raphael)
   * rename init to realized (Raphael)
   * remove unrelated comment (Raphael)

When the vhost-user-blk device loses the connection to the daemon during
the initialization phase, it kills QEMU because of an assert in the code.
The series fixes the bug.

0001 is preparation for the fix
0002 fixes the bug; its description has the full motivation for the series
0003 (added in v3) fixes a bug on VM shutdown

Denis Plotnikov (3):
   vhost-user-blk: use different event handlers on initialization
   vhost-user-blk: perform immediate cleanup if disconnect on
 initialization
   vhost-user-blk: add immediate cleanup on shutdown

  hw/block/vhost-user-blk.c | 79 ---
  1 file changed, 48 insertions(+), 31 deletions(-)



Re: [PATCH v3 0/3] vhost-user-blk: fix bug on device disconnection during initialization

2021-03-29 Thread Denis Plotnikov

ping!

On 25.03.2021 18:12, Denis Plotnikov wrote:

v3:
   * 0003: a new patch added fixing the problem on vm shutdown
     I stumbled upon this bug after sending v2.
   * 0001: grammar fixes (Raphael)
   * 0002: commit message fixing (Raphael)

v2:
   * split the initial patch into two (Raphael)
   * rename init to realized (Raphael)
   * remove unrelated comment (Raphael)

When the vhost-user-blk device loses the connection to the daemon during
the initialization phase, it kills QEMU because of an assert in the code.
The series fixes the bug.

0001 is preparation for the fix
0002 fixes the bug; its description has the full motivation for the series
0003 (added in v3) fixes a bug on VM shutdown

Denis Plotnikov (3):
   vhost-user-blk: use different event handlers on initialization
   vhost-user-blk: perform immediate cleanup if disconnect on
 initialization
   vhost-user-blk: add immediate cleanup on shutdown

  hw/block/vhost-user-blk.c | 79 ---
  1 file changed, 48 insertions(+), 31 deletions(-)



[PATCH v3 3/3] vhost-user-blk: add immediate cleanup on shutdown

2021-03-25 Thread Denis Plotnikov
QEMU crashes on shutdown if the chardev used by vhost-user-blk has been
finalized before the vhost-user-blk device.

This happens with a char-socket chardev operating in listening (server) mode.
The char-socket chardev emits a "close" event at the end of finalization,
when its internal data is destroyed. This calls the vhost-user-blk event
handler, which in turn tries to manipulate the destroyed chardev by setting
an empty event handler in order to postpone the vhost-user-blk cleanup.

This patch separates the shutdown case from the cleanup postponing,
removing the need to set an event handler.

Signed-off-by: Denis Plotnikov 
---
 hw/block/vhost-user-blk.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
index 4e215f71f152..0b5b9d44cdb0 100644
--- a/hw/block/vhost-user-blk.c
+++ b/hw/block/vhost-user-blk.c
@@ -411,7 +411,7 @@ static void vhost_user_blk_event(void *opaque, QEMUChrEvent 
event,
  * other code perform its own cleanup sequence using vhost_dev data
  * (e.g. vhost_dev_set_log).
  */
-if (realized) {
+if (realized && !runstate_check(RUN_STATE_SHUTDOWN)) {
 /*
  * A close event may happen during a read/write, but vhost
  * code assumes the vhost_dev remains setup, so delay the
-- 
2.25.1




[PATCH v3 0/3] vhost-user-blk: fix bug on device disconnection during initialization

2021-03-25 Thread Denis Plotnikov
v3:
  * 0003: a new patch added fixing the problem on vm shutdown
I stumbled upon this bug after sending v2.
  * 0001: grammar fixes (Raphael)
  * 0002: commit message fixing (Raphael)

v2:
  * split the initial patch into two (Raphael)
  * rename init to realized (Raphael)
  * remove unrelated comment (Raphael)

When the vhost-user-blk device loses the connection to the daemon during
the initialization phase, it kills QEMU because of an assert in the code.
The series fixes the bug.

0001 is preparation for the fix
0002 fixes the bug; its description has the full motivation for the series
0003 (added in v3) fixes a bug on VM shutdown

Denis Plotnikov (3):
  vhost-user-blk: use different event handlers on initialization
  vhost-user-blk: perform immediate cleanup if disconnect on
initialization
  vhost-user-blk: add immediate cleanup on shutdown

 hw/block/vhost-user-blk.c | 79 ---
 1 file changed, 48 insertions(+), 31 deletions(-)

-- 
2.25.1




[PATCH v3 1/3] vhost-user-blk: use different event handlers on initialization

2021-03-25 Thread Denis Plotnikov
It is useful to use different connect/disconnect event handlers
on device initialization and operation, as seen in the following
commit, which fixes a bug in device initialization.

This patch refactors the code to make use of them: we no longer rely
on the VM state for choosing how to clean up the device; instead we
explicitly use the proper event handler depending on whether
the device has been initialized.

Signed-off-by: Denis Plotnikov 
Reviewed-by: Raphael Norwitz 
---
 hw/block/vhost-user-blk.c | 31 ---
 1 file changed, 24 insertions(+), 7 deletions(-)

diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
index b870a50e6b20..1af95ec6aae7 100644
--- a/hw/block/vhost-user-blk.c
+++ b/hw/block/vhost-user-blk.c
@@ -362,7 +362,18 @@ static void vhost_user_blk_disconnect(DeviceState *dev)
 vhost_dev_cleanup(&s->dev);
 }
 
-static void vhost_user_blk_event(void *opaque, QEMUChrEvent event);
+static void vhost_user_blk_event(void *opaque, QEMUChrEvent event,
+ bool realized);
+
+static void vhost_user_blk_event_realize(void *opaque, QEMUChrEvent event)
+{
+vhost_user_blk_event(opaque, event, false);
+}
+
+static void vhost_user_blk_event_oper(void *opaque, QEMUChrEvent event)
+{
+vhost_user_blk_event(opaque, event, true);
+}
 
 static void vhost_user_blk_chr_closed_bh(void *opaque)
 {
@@ -371,11 +382,12 @@ static void vhost_user_blk_chr_closed_bh(void *opaque)
 VHostUserBlk *s = VHOST_USER_BLK(vdev);
 
 vhost_user_blk_disconnect(dev);
-qemu_chr_fe_set_handlers(&s->chardev, NULL, NULL, vhost_user_blk_event,
-NULL, opaque, NULL, true);
+qemu_chr_fe_set_handlers(&s->chardev, NULL, NULL,
+vhost_user_blk_event_oper, NULL, opaque, NULL, true);
 }
 
-static void vhost_user_blk_event(void *opaque, QEMUChrEvent event)
+static void vhost_user_blk_event(void *opaque, QEMUChrEvent event,
+ bool realized)
 {
 DeviceState *dev = opaque;
 VirtIODevice *vdev = VIRTIO_DEVICE(dev);
@@ -406,7 +418,7 @@ static void vhost_user_blk_event(void *opaque, QEMUChrEvent 
event)
  * TODO: maybe it is a good idea to make the same fix
  * for other vhost-user devices.
  */
-if (runstate_is_running()) {
+if (realized) {
 AioContext *ctx = qemu_get_current_aio_context();
 
 qemu_chr_fe_set_handlers(&s->chardev, NULL, NULL, NULL, NULL,
@@ -473,8 +485,9 @@ static void vhost_user_blk_device_realize(DeviceState *dev, 
Error **errp)
 s->vhost_vqs = g_new0(struct vhost_virtqueue, s->num_queues);
 s->connected = false;
 
-qemu_chr_fe_set_handlers(&s->chardev,  NULL, NULL, vhost_user_blk_event,
- NULL, (void *)dev, NULL, true);
+qemu_chr_fe_set_handlers(&s->chardev,  NULL, NULL,
+ vhost_user_blk_event_realize, NULL, (void *)dev,
+ NULL, true);
 
 reconnect:
 if (qemu_chr_fe_wait_connected(&s->chardev, &err) < 0) {
@@ -494,6 +507,10 @@ reconnect:
 goto reconnect;
 }
 
+/* we're fully initialized, now we can operate, so change the handler */
+qemu_chr_fe_set_handlers(&s->chardev,  NULL, NULL,
+ vhost_user_blk_event_oper, NULL, (void *)dev,
+ NULL, true);
 return;
 
 virtio_err:
-- 
2.25.1




[PATCH v3 2/3] vhost-user-blk: perform immediate cleanup if disconnect on initialization

2021-03-25 Thread Denis Plotnikov
Commit 4bcad76f4c39 ("vhost-user-blk: delay vhost_user_blk_disconnect")
introduced postponing vhost_dev cleanup aiming to eliminate qemu aborts
because of connection problems with vhost-blk daemon.

However, it introduces a new problem. Now, any communication errors
during execution of vhost_dev_init() called by vhost_user_blk_device_realize()
lead to qemu abort on assert in vhost_dev_get_config().

This happens because vhost_user_blk_disconnect() is postponed but
it should have dropped s->connected flag by the time
vhost_user_blk_device_realize() performs a new connection opening.
On the connection opening, vhost_dev initialization in
vhost_user_blk_connect() relies on the s->connected flag and
if it's not dropped, it skips vhost_dev initialization and returns
with success. Then, vhost_user_blk_device_realize()'s execution flow
goes to vhost_dev_get_config() where it's aborted on the assert.

To fix the problem this patch adds immediate cleanup on device
initialization (in vhost_user_blk_device_realize()) using different
event handlers for initialization and operation introduced in the
previous patch.
On initialization (in vhost_user_blk_device_realize()) we fully
control the initialization process. At that point, nobody can use the
device since it isn't initialized and we don't need to postpone any
cleanups, so we can do cleanup right away when there is a communication
problem with the vhost-blk daemon.
On operation we leave it as is, since the disconnect may happen when
the device is in use, so the device users may want to use vhost_dev's data
to do rollback before vhost_dev is re-initialized (e.g. in vhost_dev_set_log()).

Signed-off-by: Denis Plotnikov 
Reviewed-by: Raphael Norwitz 
---
 hw/block/vhost-user-blk.c | 48 +++
 1 file changed, 24 insertions(+), 24 deletions(-)

diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
index 1af95ec6aae7..4e215f71f152 100644
--- a/hw/block/vhost-user-blk.c
+++ b/hw/block/vhost-user-blk.c
@@ -402,38 +402,38 @@ static void vhost_user_blk_event(void *opaque, 
QEMUChrEvent event,
 break;
 case CHR_EVENT_CLOSED:
 /*
- * A close event may happen during a read/write, but vhost
- * code assumes the vhost_dev remains setup, so delay the
- * stop & clear. There are two possible paths to hit this
- * disconnect event:
- * 1. When VM is in the RUN_STATE_PRELAUNCH state. The
- * vhost_user_blk_device_realize() is a caller.
- * 2. In tha main loop phase after VM start.
- *
- * For p2 the disconnect event will be delayed. We can't
- * do the same for p1, because we are not running the loop
- * at this moment. So just skip this step and perform
- * disconnect in the caller function.
- *
- * TODO: maybe it is a good idea to make the same fix
- * for other vhost-user devices.
+ * Closing the connection should happen differently on device
+ * initialization and operation stages.
+ * On initalization, we want to re-start vhost_dev initialization
+ * from the very beginning right away when the connection is closed,
+ * so we clean up vhost_dev on each connection closing.
+ * On operation, we want to postpone vhost_dev cleanup to let the
+ * other code perform its own cleanup sequence using vhost_dev data
+ * (e.g. vhost_dev_set_log).
  */
 if (realized) {
+/*
+ * A close event may happen during a read/write, but vhost
+ * code assumes the vhost_dev remains setup, so delay the
+ * stop & clear.
+ */
 AioContext *ctx = qemu_get_current_aio_context();
 
 qemu_chr_fe_set_handlers(&s->chardev, NULL, NULL, NULL, NULL,
 NULL, NULL, false);
 aio_bh_schedule_oneshot(ctx, vhost_user_blk_chr_closed_bh, opaque);
-}
 
-/*
- * Move vhost device to the stopped state. The vhost-user device
- * will be clean up and disconnected in BH. This can be useful in
- * the vhost migration code. If disconnect was caught there is an
- * option for the general vhost code to get the dev state without
- * knowing its type (in this case vhost-user).
- */
-s->dev.started = false;
+/*
+ * Move vhost device to the stopped state. The vhost-user device
+ * will be clean up and disconnected in BH. This can be useful in
+ * the vhost migration code. If disconnect was caught there is an
+ * option for the general vhost code to get the dev state without
+ * knowing its type (in this case vhost-user).
+ */
+s->dev.started = false;
+} else {
+vhost_user_blk_disconnect(dev);
+}
 break;
 case CHR_EVENT_BREAK:
 case CHR_EVENT_MUX_IN:
-- 
2.25.1




[PATCH v2 1/2] vhost-user-blk: use different event handlers on initialization

2021-03-24 Thread Denis Plotnikov
It is useful to use different connect/disconnect event handlers
on device initialization and operation, as seen in the following
commit, which fixes a bug in device initialization.

The patch refactors the code to make use of them: we no longer rely
on the VM state for choosing how to clean up the device; instead we
explicitly use the proper event handler depending on whether
the device has been initialized.

Signed-off-by: Denis Plotnikov 
---
 hw/block/vhost-user-blk.c | 31 ---
 1 file changed, 24 insertions(+), 7 deletions(-)

diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
index b870a50e6b20..1af95ec6aae7 100644
--- a/hw/block/vhost-user-blk.c
+++ b/hw/block/vhost-user-blk.c
@@ -362,7 +362,18 @@ static void vhost_user_blk_disconnect(DeviceState *dev)
 vhost_dev_cleanup(&s->dev);
 }
 
-static void vhost_user_blk_event(void *opaque, QEMUChrEvent event);
+static void vhost_user_blk_event(void *opaque, QEMUChrEvent event,
+ bool realized);
+
+static void vhost_user_blk_event_realize(void *opaque, QEMUChrEvent event)
+{
+vhost_user_blk_event(opaque, event, false);
+}
+
+static void vhost_user_blk_event_oper(void *opaque, QEMUChrEvent event)
+{
+vhost_user_blk_event(opaque, event, true);
+}
 
 static void vhost_user_blk_chr_closed_bh(void *opaque)
 {
@@ -371,11 +382,12 @@ static void vhost_user_blk_chr_closed_bh(void *opaque)
 VHostUserBlk *s = VHOST_USER_BLK(vdev);
 
 vhost_user_blk_disconnect(dev);
-qemu_chr_fe_set_handlers(&s->chardev, NULL, NULL, vhost_user_blk_event,
-NULL, opaque, NULL, true);
+qemu_chr_fe_set_handlers(&s->chardev, NULL, NULL,
+vhost_user_blk_event_oper, NULL, opaque, NULL, true);
 }
 
-static void vhost_user_blk_event(void *opaque, QEMUChrEvent event)
+static void vhost_user_blk_event(void *opaque, QEMUChrEvent event,
+ bool realized)
 {
 DeviceState *dev = opaque;
 VirtIODevice *vdev = VIRTIO_DEVICE(dev);
@@ -406,7 +418,7 @@ static void vhost_user_blk_event(void *opaque, QEMUChrEvent 
event)
  * TODO: maybe it is a good idea to make the same fix
  * for other vhost-user devices.
  */
-if (runstate_is_running()) {
+if (realized) {
 AioContext *ctx = qemu_get_current_aio_context();
 
 qemu_chr_fe_set_handlers(&s->chardev, NULL, NULL, NULL, NULL,
@@ -473,8 +485,9 @@ static void vhost_user_blk_device_realize(DeviceState *dev, 
Error **errp)
 s->vhost_vqs = g_new0(struct vhost_virtqueue, s->num_queues);
 s->connected = false;
 
-qemu_chr_fe_set_handlers(&s->chardev,  NULL, NULL, vhost_user_blk_event,
- NULL, (void *)dev, NULL, true);
+qemu_chr_fe_set_handlers(&s->chardev,  NULL, NULL,
+ vhost_user_blk_event_realize, NULL, (void *)dev,
+ NULL, true);
 
 reconnect:
 if (qemu_chr_fe_wait_connected(&s->chardev, &err) < 0) {
@@ -494,6 +507,10 @@ reconnect:
 goto reconnect;
 }
 
+/* we're fully initialized, now we can operate, so change the handler */
+qemu_chr_fe_set_handlers(&s->chardev,  NULL, NULL,
+ vhost_user_blk_event_oper, NULL, (void *)dev,
+ NULL, true);
 return;
 
 virtio_err:
-- 
2.25.1




[PATCH v2 2/2] vhost-user-blk: perform immediate cleanup if disconnect on initialization

2021-03-24 Thread Denis Plotnikov
Commit 4bcad76f4c39 ("vhost-user-blk: delay vhost_user_blk_disconnect")
introduced postponing vhost_dev cleanup aiming to eliminate qemu aborts
because of connection problems with vhost-blk daemon.

However, it introduces a new problem. Now, any communication errors
during execution of vhost_dev_init() called by vhost_user_blk_device_realize()
lead to qemu abort on assert in vhost_dev_get_config().

This happens because vhost_user_blk_disconnect() is postponed but
it should have dropped s->connected flag by the time
vhost_user_blk_device_realize() performs a new connection opening.
On the connection opening, vhost_dev initialization in
vhost_user_blk_connect() relies on the s->connected flag and
if it's not dropped, it skips vhost_dev initialization and returns
with success. Then, vhost_user_blk_device_realize()'s execution flow
goes to vhost_dev_get_config() where it's aborted on the assert.

The connection/disconnection processing should happen
differently on initialization and operation of vhost-user-blk.
On initialization (in vhost_user_blk_device_realize()) we fully
control the initialization process. At that point, nobody can use the
device since it isn't initialized and we don't need to postpone any
cleanups, so we can do cleanup right away when there are communication
problems with the vhost-blk daemon.
On operation the disconnect may happen when the device is in use, so
the device users may want to use vhost_dev's data to do rollback before
vhost_dev is re-initialized (e.g. in vhost_dev_set_log()), so we
postpone the cleanup.

The patch splits those two cases, and performs the cleanup immediately on
initialization, and postpones cleanup when the device is initialized and
in use.

Signed-off-by: Denis Plotnikov 
---
 hw/block/vhost-user-blk.c | 48 +++
 1 file changed, 24 insertions(+), 24 deletions(-)

diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
index 1af95ec6aae7..4e215f71f152 100644
--- a/hw/block/vhost-user-blk.c
+++ b/hw/block/vhost-user-blk.c
@@ -402,38 +402,38 @@ static void vhost_user_blk_event(void *opaque, 
QEMUChrEvent event,
 break;
 case CHR_EVENT_CLOSED:
 /*
- * A close event may happen during a read/write, but vhost
- * code assumes the vhost_dev remains setup, so delay the
- * stop & clear. There are two possible paths to hit this
- * disconnect event:
- * 1. When VM is in the RUN_STATE_PRELAUNCH state. The
- * vhost_user_blk_device_realize() is a caller.
- * 2. In tha main loop phase after VM start.
- *
- * For p2 the disconnect event will be delayed. We can't
- * do the same for p1, because we are not running the loop
- * at this moment. So just skip this step and perform
- * disconnect in the caller function.
- *
- * TODO: maybe it is a good idea to make the same fix
- * for other vhost-user devices.
+ * Closing the connection should happen differently on device
+ * initialization and operation stages.
+ * On initalization, we want to re-start vhost_dev initialization
+ * from the very beginning right away when the connection is closed,
+ * so we clean up vhost_dev on each connection closing.
+ * On operation, we want to postpone vhost_dev cleanup to let the
+ * other code perform its own cleanup sequence using vhost_dev data
+ * (e.g. vhost_dev_set_log).
  */
 if (realized) {
+/*
+ * A close event may happen during a read/write, but vhost
+ * code assumes the vhost_dev remains setup, so delay the
+ * stop & clear.
+ */
 AioContext *ctx = qemu_get_current_aio_context();
 
 qemu_chr_fe_set_handlers(&s->chardev, NULL, NULL, NULL, NULL,
 NULL, NULL, false);
 aio_bh_schedule_oneshot(ctx, vhost_user_blk_chr_closed_bh, opaque);
-}
 
-/*
- * Move vhost device to the stopped state. The vhost-user device
- * will be clean up and disconnected in BH. This can be useful in
- * the vhost migration code. If disconnect was caught there is an
- * option for the general vhost code to get the dev state without
- * knowing its type (in this case vhost-user).
- */
-s->dev.started = false;
+/*
+ * Move vhost device to the stopped state. The vhost-user device
+ * will be clean up and disconnected in BH. This can be useful in
+ * the vhost migration code. If disconnect was caught there is an
+ * option for the general vhost code to get the dev state without
+ * knowing its type (in this case vhost-user).
+ */
+s->dev.started = false;
+} else {
+vhost_user_blk_disconnect(dev);
+}
 break;
   

[PATCH v2 0/2] vhost-user-blk: fix bug on device disconnection during initialization

2021-03-24 Thread Denis Plotnikov
v2:
  * split the initial patch into two (Raphael)
  * rename init to realized (Raphael)
  * remove unrelated comment (Raphael)

When the vhost-user-blk device loses the connection to the daemon during
the initialization phase, it kills QEMU because of an assert in the code.
The series fixes the bug.

0001 is preparation for the fix
0002 fixes the bug; its description has the full motivation for the series

Denis Plotnikov (2):
  vhost-user-blk: use different event handlers on initialization
  vhost-user-blk: perform immediate cleanup if disconnect on
initialization

 hw/block/vhost-user-blk.c | 79 ---
 1 file changed, 48 insertions(+), 31 deletions(-)

-- 
2.25.1




Re: [PATCH v1] vhost-user-blk: use different event handlers on init and operation

2021-03-22 Thread Denis Plotnikov

ping!


On 11.03.2021 11:10, Denis Plotnikov wrote:

Commit a1a20d06b73e "vhost-user-blk: delay vhost_user_blk_disconnect"
introduced postponing vhost_dev cleanup aiming to eliminate qemu aborts
because of connection problems with vhost-blk daemon.

However, it introduces a new problem. Now, any communication errors
during execution of vhost_dev_init() called by vhost_user_blk_device_realize()
lead to qemu abort on assert in vhost_dev_get_config().

This happens because vhost_user_blk_disconnect() is postponed but
it should have dropped s->connected flag by the time
vhost_user_blk_device_realize() performs a new connection opening.
On the connection opening, vhost_dev initialization in
vhost_user_blk_connect() relies on the s->connected flag and
if it's not dropped, it skips vhost_dev initialization and returns
with success. Then, vhost_user_blk_device_realize()'s execution flow
goes to vhost_dev_get_config() where it's aborted on the assert.

It seems connection/disconnection processing should happen
differently on initialization and operation of vhost-user-blk.
On initialization (in vhost_user_blk_device_realize()) we fully
control the initialization process. At that point, nobody can use the
device since it isn't initialized and we don't need to postpone any
cleanups, so we can do cleanup right away when there are communication
problems with the vhost-blk daemon.
On operation the disconnect may happen when the device is in use, so
the device users may want to use vhost_dev's data to do rollback before
vhost_dev is re-initialized (e.g. in vhost_dev_set_log()), so we
postpone the cleanup.

The patch splits those two cases, and performs the cleanup immediately on
initialization, and postpones cleanup when the device is initialized and
in use.

Signed-off-by: Denis Plotnikov 
---
  hw/block/vhost-user-blk.c | 88 ---
  1 file changed, 54 insertions(+), 34 deletions(-)

diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
index b870a50e6b20..84940122b8ca 100644
--- a/hw/block/vhost-user-blk.c
+++ b/hw/block/vhost-user-blk.c
@@ -362,7 +362,17 @@ static void vhost_user_blk_disconnect(DeviceState *dev)
  vhost_dev_cleanup(&s->dev);
  }
  
-static void vhost_user_blk_event(void *opaque, QEMUChrEvent event);

+static void vhost_user_blk_event(void *opaque, QEMUChrEvent event, bool init);
+
+static void vhost_user_blk_event_init(void *opaque, QEMUChrEvent event)
+{
+vhost_user_blk_event(opaque, event, true);
+}
+
+static void vhost_user_blk_event_oper(void *opaque, QEMUChrEvent event)
+{
+vhost_user_blk_event(opaque, event, false);
+}
  
  static void vhost_user_blk_chr_closed_bh(void *opaque)

  {
@@ -371,11 +381,11 @@ static void vhost_user_blk_chr_closed_bh(void *opaque)
  VHostUserBlk *s = VHOST_USER_BLK(vdev);
  
  vhost_user_blk_disconnect(dev);

-qemu_chr_fe_set_handlers(&s->chardev, NULL, NULL, vhost_user_blk_event,
-NULL, opaque, NULL, true);
+qemu_chr_fe_set_handlers(&s->chardev, NULL, NULL,
+vhost_user_blk_event_oper, NULL, opaque, NULL, true);
  }
  
-static void vhost_user_blk_event(void *opaque, QEMUChrEvent event)

+static void vhost_user_blk_event(void *opaque, QEMUChrEvent event, bool init)
  {
  DeviceState *dev = opaque;
  VirtIODevice *vdev = VIRTIO_DEVICE(dev);
@@ -390,38 +400,42 @@ static void vhost_user_blk_event(void *opaque, 
QEMUChrEvent event)
  break;
  case CHR_EVENT_CLOSED:
  /*
- * A close event may happen during a read/write, but vhost
- * code assumes the vhost_dev remains setup, so delay the
- * stop & clear. There are two possible paths to hit this
- * disconnect event:
- * 1. When VM is in the RUN_STATE_PRELAUNCH state. The
- * vhost_user_blk_device_realize() is a caller.
- * 2. In tha main loop phase after VM start.
- *
- * For p2 the disconnect event will be delayed. We can't
- * do the same for p1, because we are not running the loop
- * at this moment. So just skip this step and perform
- * disconnect in the caller function.
- *
- * TODO: maybe it is a good idea to make the same fix
- * for other vhost-user devices.
+ * Closing the connection should happen differently on device
+ * initialization and operation stages.
+ * On initalization, we want to re-start vhost_dev initialization
+ * from the very beginning right away when the connection is closed,
+ * so we clean up vhost_dev on each connection closing.
+ * On operation, we want to postpone vhost_dev cleanup to let the
+ * other code perform its own cleanup sequence using vhost_dev data
+ * (e.g. vhost_dev_set_log).
   */
-if (runstate_is_running()) {
+if (init) {
+vhost_user_blk_disconnect(dev);
+} else {
+/*
+ * A close event m

[PATCH v1] configure: add option to implicitly enable/disable libgio

2021-03-12 Thread Denis Plotnikov
Currently, compilation of util/dbus is implicit and depends
on the presence of libgio on the build host.
This patch adds options to manage the libgio dependency explicitly.

Signed-off-by: Denis Plotnikov 
---
 configure | 60 ---
 1 file changed, 39 insertions(+), 21 deletions(-)

diff --git a/configure b/configure
index 34fccaa2bae6..23eed988be81 100755
--- a/configure
+++ b/configure
@@ -465,6 +465,7 @@ fuse_lseek="auto"
 multiprocess="auto"
 
 malloc_trim="auto"
+gio="$default_feature"
 
 # parse CC options second
 for opt do
@@ -1560,6 +1561,10 @@ for opt do
   ;;
   --disable-multiprocess) multiprocess="disabled"
   ;;
+  --enable-gio) gio=yes
+  ;;
+  --disable-gio) gio=no
+  ;;
   *)
   echo "ERROR: unknown option $opt"
   echo "Try '$0 --help' for more information"
@@ -1913,6 +1918,7 @@ disabled with --disable-FEATURE, default is enabled if 
available
  fuse            FUSE block device export
  fuse-lseek      SEEK_HOLE/SEEK_DATA support for FUSE exports
  multiprocess    Out of process device emulation support
+  gio             libgio support
 
 NOTE: The object files are built at the place where configure is launched
 EOF
@@ -3319,17 +3325,19 @@ if test "$static" = yes && test "$mingw32" = yes; then
 glib_cflags="-DGLIB_STATIC_COMPILATION $glib_cflags"
 fi
 
-if $pkg_config --atleast-version=$glib_req_ver gio-2.0; then
-gio_cflags=$($pkg_config --cflags gio-2.0)
-gio_libs=$($pkg_config --libs gio-2.0)
-gdbus_codegen=$($pkg_config --variable=gdbus_codegen gio-2.0)
-if [ ! -x "$gdbus_codegen" ]; then
-gdbus_codegen=
-fi
-# Check that the libraries actually work -- Ubuntu 18.04 ships
-# with pkg-config --static --libs data for gio-2.0 that is missing
-# -lblkid and will give a link error.
-cat > $TMPC < $TMPC <
 int main(void)
 {
@@ -3337,18 +3345,28 @@ int main(void)
 return 0;
 }
 EOF
-if compile_prog "$gio_cflags" "$gio_libs" ; then
-gio=yes
-else
-gio=no
+if compile_prog "$gio_cflags" "$gio_libs" ; then
+pass=yes
+else
+pass=no
+fi
+
+if test "$pass" = "yes" &&
+$pkg_config --atleast-version=$glib_req_ver gio-unix-2.0; then
+gio_cflags="$gio_cflags $($pkg_config --cflags gio-unix-2.0)"
+gio_libs="$gio_libs $($pkg_config --libs gio-unix-2.0)"
+fi
 fi
-else
-gio=no
-fi
 
-if $pkg_config --atleast-version=$glib_req_ver gio-unix-2.0; then
-gio_cflags="$gio_cflags $($pkg_config --cflags gio-unix-2.0)"
-gio_libs="$gio_libs $($pkg_config --libs gio-unix-2.0)"
+if test "$pass" = "no"; then
+if test "$gio" = "yes"; then
+feature_not_found "gio" "Install libgio >= 2.0"
+else
+gio=no
+fi
+else
+gio=yes
+fi
 fi
 
 # Sanity check that the current size_t matches the
-- 
2.25.1




[PATCH v1] softmmu/vl: make default prealloc-threads work w/o -mem-prealloc

2021-03-11 Thread Denis Plotnikov
Preallocation in memory backends can be specified with either global
QEMU option "-mem-prealloc", or with per-backend property
"prealloc=true".  In the latter case, however, the default for the
number of preallocation threads is not set to the number of vcpus, but
remains at 1 instead.

Fix it by setting the "prealloc-threads" sugar property of
"memory-backend" to the number of vcpus unconditionally.

Fixes: ffac16fab3 ("hostmem: introduce "prealloc-threads" property")

Signed-off-by: Denis Plotnikov 
---
 softmmu/vl.c | 17 ++---
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/softmmu/vl.c b/softmmu/vl.c
index ff488ea3e7db..e392e226a2d3 100644
--- a/softmmu/vl.c
+++ b/softmmu/vl.c
@@ -2300,14 +2300,17 @@ static void qemu_validate_options(void)
 
 static void qemu_process_sugar_options(void)
 {
-if (mem_prealloc) {
-char *val;
+char *val;
 
-val = g_strdup_printf("%d",
- (uint32_t) 
qemu_opt_get_number(qemu_find_opts_singleton("smp-opts"), "cpus", 1));
-object_register_sugar_prop("memory-backend", "prealloc-threads", val,
-   false);
-g_free(val);
+val = g_strdup_printf("%d",
+  (uint32_t) qemu_opt_get_number(
+ qemu_find_opts_singleton("smp-opts"), "cpus", 1));
+
+object_register_sugar_prop("memory-backend", "prealloc-threads", val,
+false);
+g_free(val);
+
+if (mem_prealloc) {
 object_register_sugar_prop("memory-backend", "prealloc", "on", false);
 }
 
-- 
2.25.1




[PATCH v1] vhost-user-blk: use different event handlers on init and operation

2021-03-11 Thread Denis Plotnikov
Commit a1a20d06b73e "vhost-user-blk: delay vhost_user_blk_disconnect"
introduced postponing vhost_dev cleanup aiming to eliminate qemu aborts
because of connection problems with vhost-blk daemon.

However, it introduces a new problem. Now, any communication errors
during execution of vhost_dev_init() called by vhost_user_blk_device_realize()
lead to qemu abort on assert in vhost_dev_get_config().

This happens because vhost_user_blk_disconnect() is postponed but
it should have dropped s->connected flag by the time
vhost_user_blk_device_realize() performs a new connection opening.
On the connection opening, vhost_dev initialization in
vhost_user_blk_connect() relies on the s->connected flag and
if it's not dropped, it skips vhost_dev initialization and returns
with success. Then, vhost_user_blk_device_realize()'s execution flow
goes to vhost_dev_get_config() where it's aborted on the assert.

It seems connection/disconnection processing should happen
differently on initialization and operation of vhost-user-blk.
On initialization (in vhost_user_blk_device_realize()) we fully
control the initialization process. At that point, nobody can use the
device since it isn't initialized and we don't need to postpone any
cleanups, so we can do cleanup right away when there are communication
problems with the vhost-blk daemon.
On operation the disconnect may happen when the device is in use, so
the device users may want to use vhost_dev's data to do rollback before
vhost_dev is re-initialized (e.g. in vhost_dev_set_log()), so we
postpone the cleanup.

The patch splits those two cases, and performs the cleanup immediately on
initialization, and postpones cleanup when the device is initialized and
in use.

Signed-off-by: Denis Plotnikov 
---
 hw/block/vhost-user-blk.c | 88 ---
 1 file changed, 54 insertions(+), 34 deletions(-)

diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
index b870a50e6b20..84940122b8ca 100644
--- a/hw/block/vhost-user-blk.c
+++ b/hw/block/vhost-user-blk.c
@@ -362,7 +362,17 @@ static void vhost_user_blk_disconnect(DeviceState *dev)
  vhost_dev_cleanup(&s->dev);
 }
 
-static void vhost_user_blk_event(void *opaque, QEMUChrEvent event);
+static void vhost_user_blk_event(void *opaque, QEMUChrEvent event, bool init);
+
+static void vhost_user_blk_event_init(void *opaque, QEMUChrEvent event)
+{
+vhost_user_blk_event(opaque, event, true);
+}
+
+static void vhost_user_blk_event_oper(void *opaque, QEMUChrEvent event)
+{
+vhost_user_blk_event(opaque, event, false);
+}
 
 static void vhost_user_blk_chr_closed_bh(void *opaque)
 {
@@ -371,11 +381,11 @@ static void vhost_user_blk_chr_closed_bh(void *opaque)
 VHostUserBlk *s = VHOST_USER_BLK(vdev);
 
 vhost_user_blk_disconnect(dev);
-qemu_chr_fe_set_handlers(&s->chardev, NULL, NULL, vhost_user_blk_event,
-NULL, opaque, NULL, true);
+qemu_chr_fe_set_handlers(&s->chardev, NULL, NULL,
+vhost_user_blk_event_oper, NULL, opaque, NULL, true);
 }
 
-static void vhost_user_blk_event(void *opaque, QEMUChrEvent event)
+static void vhost_user_blk_event(void *opaque, QEMUChrEvent event, bool init)
 {
 DeviceState *dev = opaque;
 VirtIODevice *vdev = VIRTIO_DEVICE(dev);
@@ -390,38 +400,42 @@ static void vhost_user_blk_event(void *opaque, 
QEMUChrEvent event)
 break;
 case CHR_EVENT_CLOSED:
 /*
- * A close event may happen during a read/write, but vhost
- * code assumes the vhost_dev remains setup, so delay the
- * stop & clear. There are two possible paths to hit this
- * disconnect event:
- * 1. When VM is in the RUN_STATE_PRELAUNCH state. The
- * vhost_user_blk_device_realize() is a caller.
- * 2. In tha main loop phase after VM start.
- *
- * For p2 the disconnect event will be delayed. We can't
- * do the same for p1, because we are not running the loop
- * at this moment. So just skip this step and perform
- * disconnect in the caller function.
- *
- * TODO: maybe it is a good idea to make the same fix
- * for other vhost-user devices.
+ * Closing the connection should happen differently on device
+ * initialization and operation stages.
+ * On initalization, we want to re-start vhost_dev initialization
+ * from the very beginning right away when the connection is closed,
+ * so we clean up vhost_dev on each connection closing.
+ * On operation, we want to postpone vhost_dev cleanup to let the
+ * other code perform its own cleanup sequence using vhost_dev data
+ * (e.g. vhost_dev_set_log).
  */
-if (runstate_is_running()) {
+if (init) {
+vhost_user_blk_disconnect(dev);
+} else {
+/*
+ * A close event may happen during a read/write, but vhost
+ * code assumes the vhos

Re: [PATCH v0 3/4] migration: add background snapshot

2020-07-29 Thread Denis Plotnikov




On 29.07.2020 16:27, Dr. David Alan Gilbert wrote:

...

   /**
* ram_find_and_save_block: finds a dirty page and sends it to f
*
@@ -1782,6 +2274,7 @@ static int ram_find_and_save_block(RAMState *rs, bool 
last_stage)
   pss.block = rs->last_seen_block;
   pss.page = rs->last_page;
   pss.complete_round = false;
+pss.page_copy = NULL;
   if (!pss.block) {
    pss.block = QLIST_FIRST_RCU(&ram_list.blocks);
@@ -1794,11 +2287,30 @@ static int ram_find_and_save_block(RAMState *rs, bool 
last_stage)
   if (!found) {
   /* priority queue empty, so just search for something dirty */
    found = find_dirty_block(rs, &pss, &again);
+
+if (found && migrate_background_snapshot()) {
+/*
+ * make a copy of the page and
+ * pass it to the page search status
+ */
+int ret;
+ret = ram_copy_page(pss.block, pss.page, &pss.page_copy);

I'm a bit confused about why we hit this; the way I'd thought about your
code was we turn on the write faulting, do one big save and then fixup
the faults as the save is happening (doing the copies) as the writes
hit; so when does this case hit?

To make it more clear, let me draw the whole picture:

When we do a background snapshot, the VM is paused until all of the vmstate
EXCEPT RAM is saved.
RAM isn't written at all at that point; that part of the vmstate is saved in
a temporary buffer.

Then all the RAM is marked read-only and the VM is unpaused. Note that
at this moment all of the VM's vCPUs are
running and can touch any part of memory.
After that, the migration thread starts writing the RAM content. Once a
memory chunk is written, the write protection is removed for that chunk.
If a vCPU wants to write to a memory page which is still write protected
(hasn't been written yet), the write is intercepted, the memory page is copied
and queued for writing, and then write access to the page is restored. The
intention behind that is to allow the vCPU to work with the memory page as
soon as possible.

So I think I'm confusing this description with the code I'm seeing
above.  The code above, being in ram_find_and_save_block makes me think
it's calling ram_copy_page for every page at the point just before it
writes it - I'm not seeing how that corresponds to what you're saying
about it being queued when the CPU tries to write it.


You are right. The code should be different there.
It seems that I confused myself as well by sending a wrong version of 
the patch set.
I think this series should be re-sent so that my description above
corresponds to the code.


Thanks,

Denis



Once all the RAM has been written, the rest of the vmstate is written from
the buffer. This needs to be so because some of the emulated devices, saved
in that buffered vmstate part, expect the RAM content to be available first
when they are loaded.

Right, same type of problem as postcopy.

Dave


I hope this description makes things clearer.
If not, please let me know, so I can add more details.

Denis


--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK


--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK






Re: [PATCH v0 3/4] migration: add background snapshot

2020-07-29 Thread Denis Plotnikov




On 24.07.2020 03:08, Peter Xu wrote:

On Wed, Jul 22, 2020 at 11:11:32AM +0300, Denis Plotnikov wrote:

+/**
+ * ram_copy_page: make a page copy
+ *
+ * Used in the background snapshot to make a copy of a memeory page.
+ * Ensures that the memeory page is copied only once.
+ * When a page copy is done, restores read/write access to the memory
+ * page.
+ * If a page is being copied by another thread, wait until the copying
+ * ends and exit.
+ *
+ * Returns:
+ *   -1 - on error
+ *0 - the page wasn't copied by the function call
+ *1 - the page has been copied
+ *
+ * @block: RAM block to use
+ * @page_nr:   the page number to copy
+ * @page_copy: the pointer to return a page copy
+ *
+ */
+int ram_copy_page(RAMBlock *block, unsigned long page_nr,
+  void **page_copy)
+{
+void *host_page;
+int res = 0;
+
+atomic_inc(&ram_state->page_copier_cnt);
+
+if (test_and_set_bit_atomic(page_nr, block->touched_map)) {
+while (!test_bit_atomic(page_nr, block->copied_map)) {
+/* the page is being copied -- wait for the end of the copying */
+qemu_event_wait(&ram_state->page_copying_done);
+}
+goto out;
+}
+
+*page_copy = ram_page_buffer_get();
+if (!*page_copy) {
+res = -1;
+goto out;
+}
+
+host_page = block->host + (page_nr << TARGET_PAGE_BITS);
+memcpy(*page_copy, host_page, TARGET_PAGE_SIZE);
+
+if (ram_set_rw(host_page, TARGET_PAGE_SIZE)) {
+ram_page_buffer_free(*page_copy);
+*page_copy = NULL;
+res = -1;
+goto out;
+}
+
+set_bit_atomic(page_nr, block->copied_map);
+qemu_event_set(&ram_state->page_copying_done);
+qemu_event_reset(&ram_state->page_copying_done);
+
+res = 1;
+out:
+atomic_dec(&ram_state->page_copier_cnt);
+return res;
+}

Is ram_set_rw() called on the page only when a page fault triggered it?
Shouldn't we also do that in the background thread when we proactively
copy the pages?

Yes, we should. It's a bug.
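A sketch of the kind of change implied here (hypothetical; the exact call site
depends on the respin), mirroring the fault path after the background thread
has saved a page proactively:

    /* hypothetical: drop write protection on the proactive path too */
    void *host_page = block->host + (page_nr << TARGET_PAGE_BITS);

    if (ram_set_rw(host_page, TARGET_PAGE_SIZE)) {
        return -1;              /* failed to restore write access */
    }
    set_bit_atomic(page_nr, block->copied_map);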


Besides the current solution, do you think we can make it simpler by only
delivering the fault request to the background thread?  We can let the
background thread do all the rest, and IIUC we can drop all the complicated
sync bitmaps and so on by doing so.  The process can look like:

   - background thread runs the general precopy migration, and,

 - it only does the ram_bulk_stage, which is the first loop, because for
   a snapshot there is no reason to send a page twice.

 - After copy one page, do ram_set_rw() always, so accessing of this page
   will never trigger write-protect page fault again,

 - take requests from the unqueue_page() just like what's done in this
   series, but instead of copying the page, the page request should look
   exactly like the postcopy one.  We don't need copy_page because the page
   won't be changed before we unprotect it, so it should be safe.  These
   pages will still have high priority because being queued means a vcpu
   wrote to this protected page and faulted in userfaultfd.  We need to
   migrate these pages first to unblock them.

   - the fault handler thread only needs to do:

 - when we get a uffd-wp message, translate it into a postcopy-like request
   (calculate the ramblock and offset), then queue it.  That's all.

I believe we can avoid the copy_page parameter that was passed around, and we
can also save the two extra bitmaps and the complicated synchronizations.

Do you think this would work?
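For illustration, the fault-handler side of that scheme might look roughly
like this (a sketch only; it assumes a write-protect userfaultfd, uffd_fd, is
already registered, and reuses the existing postcopy queueing helper
ram_save_queue_pages()):

    struct uffd_msg msg;

    while (read(uffd_fd, &msg, sizeof(msg)) == sizeof(msg)) {
        if (msg.event != UFFD_EVENT_PAGEFAULT ||
            !(msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP)) {
            continue;
        }

        void *host = (void *)(uintptr_t)msg.arg.pagefault.address;
        ram_addr_t offset;
        RAMBlock *block = qemu_ram_block_from_host(host, false, &offset);

        if (block) {
            /* queue it like a postcopy page request; no page copy needed */
            ram_save_queue_pages(block->idstr, offset & TARGET_PAGE_MASK,
                                 TARGET_PAGE_SIZE);
        }
    }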


Yes, it would. This scheme is much simpler. I like it, in general.

I used such a complicated approach to reduce vCPU delays as much as possible:
if the storage where the snapshot is being saved is quite slow, it could lead
to a vCPU freezing until the page is fully written to the storage.
So, with the current approach, if we don't take the page-copy limit into
account, the worst case is that all of the VM's RAM is copied and then written
to the storage. In other words, the current scheme provides minimal vCPU
delays, and thus minimal VM CPU performance slowdown, at the cost of host
memory consumption.
The new scheme is simple and doesn't consume extra host memory, but it can
freeze vCPUs for a longer time because:
* copying a memory page is usually faster than writing it to storage
  (think of an HDD)
* writing a page to disk depends on disk performance and the current disk load

So it seems that we have two different strategies:
1. lower CPU delays
2. lower memory usage

To be honest, I would start with your scheme, as it is much simpler, and
add the other one if needed in the future.


What do you think?

Denis


Besides, have we disabled dirty tracking of memslots?  IIUC that's not needed
for background snapshot either, so neither do we need dirty tracking nor do we
need to sync the dirty bitmap from outside us (kvm/vhost/...).






Re: [PATCH v0 3/4] migration: add background snapshot

2020-07-29 Thread Denis Plotnikov




On 24.07.2020 01:15, Peter Xu wrote:

On Wed, Jul 22, 2020 at 11:11:32AM +0300, Denis Plotnikov wrote:

+static void *background_snapshot_thread(void *opaque)
+{
+MigrationState *m = opaque;
+QIOChannelBuffer *bioc;
+QEMUFile *fb;
+int res = 0;
+
+rcu_register_thread();
+
+qemu_file_set_rate_limit(m->to_dst_file, INT64_MAX);
+
+qemu_mutex_lock_iothread();
+vm_stop(RUN_STATE_PAUSED);
+
+qemu_savevm_state_header(m->to_dst_file);
+qemu_mutex_unlock_iothread();
+qemu_savevm_state_setup(m->to_dst_file);

Is it intended to skip the BQL for the setup phase?  IIUC the main thread could
start the VM before we take the lock again below if we released it...


Good point!
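A sketch of one way to address it (hypothetical; simply keep the BQL held
across the setup phase):

    qemu_mutex_lock_iothread();
    vm_stop(RUN_STATE_PAUSED);

    qemu_savevm_state_header(m->to_dst_file);
    qemu_savevm_state_setup(m->to_dst_file);    /* now under the BQL */

    migrate_set_state(&m->state, MIGRATION_STATUS_SETUP,
                      MIGRATION_STATUS_ACTIVE);
    /* ... the rest continues under the BQL as before ... */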



+qemu_mutex_lock_iothread();
+
+migrate_set_state(&m->state, MIGRATION_STATUS_SETUP,
+  MIGRATION_STATUS_ACTIVE);
+
+/*
+ * We want to save the vm state for the moment when the snapshot saving was
+ * called but also we want to write RAM content with vm running. The RAM
+ * content should appear first in the vmstate.
+ * So, we first, save non-ram part of the vmstate to the temporary, buffer,
+ * then write ram part of the vmstate to the migration stream with vCPUs
+ * running and, finally, write the non-ram part of the vmstate from the
+ * buffer to the migration stream.
+ */
+bioc = qio_channel_buffer_new(4096);
+qio_channel_set_name(QIO_CHANNEL(bioc), "vmstate-buffer");
+fb = qemu_fopen_channel_output(QIO_CHANNEL(bioc));
+object_unref(OBJECT(bioc));
+
+if (ram_write_tracking_start()) {
+goto failed_resume;
+}
+
+if (global_state_store()) {
+goto failed_resume;
+}

Is this needed?  We should always be in the stopped state here, right?

Yes, it seems it isn't needed.



+
+cpu_synchronize_all_states();
+
+if (qemu_savevm_state_complete_precopy_non_iterable(fb, false, false)) {
+goto failed_resume;
+}
+
+vm_start();
+qemu_mutex_unlock_iothread();
+
+while (!res) {
+res = qemu_savevm_state_iterate(m->to_dst_file, false);
+
+if (res < 0 || qemu_file_get_error(m->to_dst_file)) {
+goto failed;
+}
+}
+
+/*
+ * By this moment we have RAM content saved into the migration stream.
+ * The next step is to flush the non-ram content (vm devices state)
+ * right after the ram content. The device state was stored in
+ * the temporary buffer prior to the ram saving.
+ */
+qemu_put_buffer(m->to_dst_file, bioc->data, bioc->usage);
+qemu_fflush(m->to_dst_file);
+
+if (qemu_file_get_error(m->to_dst_file)) {
+goto failed;
+}
+
+migrate_set_state(&m->state, MIGRATION_STATUS_ACTIVE,
+ MIGRATION_STATUS_COMPLETED);
+goto exit;
+
+failed_resume:
+vm_start();
+qemu_mutex_unlock_iothread();
+failed:
+migrate_set_state(&m->state, MIGRATION_STATUS_ACTIVE,
+  MIGRATION_STATUS_FAILED);
+exit:
+ram_write_tracking_stop();
+qemu_fclose(fb);
+qemu_mutex_lock_iothread();
+qemu_savevm_state_cleanup();
+qemu_mutex_unlock_iothread();
+rcu_unregister_thread();
+return NULL;
+}
+
  void migrate_fd_connect(MigrationState *s, Error *error_in)
  {
  Error *local_err = NULL;
@@ -3599,8 +3694,14 @@ void migrate_fd_connect(MigrationState *s, Error 
*error_in)
  migrate_fd_cleanup(s);
  return;
  }
-qemu_thread_create(&s->thread, "live_migration", migration_thread, s,
-   QEMU_THREAD_JOINABLE);
+if (migrate_background_snapshot()) {
+qemu_thread_create(>thread, "bg_snapshot",

Maybe the name "live_snapshot" suites more (since the other one is
"live_migration")?


Looks like it. Another good name is async_snapshot, and all the related
functions and properties should be renamed accordingly.



+   background_snapshot_thread, s,
+   QEMU_THREAD_JOINABLE);
+} else {
+qemu_thread_create(>thread, "live_migration", migration_thread, s,
+   QEMU_THREAD_JOINABLE);
+}
  s->migration_thread_running = true;
  }
  

[...]


@@ -1151,9 +1188,11 @@ static int save_normal_page(RAMState *rs, RAMBlock 
*block, ram_addr_t offset,
  ram_counters.transferred += save_page_header(rs, rs->f, block,
   offset | RAM_SAVE_FLAG_PAGE);
  if (async) {
-qemu_put_buffer_async(rs->f, buf, TARGET_PAGE_SIZE,
-  migrate_release_ram() &
-  migration_in_postcopy());
+bool may_free = migrate_background_snapshot() ||
+(migrate_release_ram() &&
+ migration_in_postcopy());

Does background snapshot need to free the memory?  /me c

Re: [PATCH v0 3/4] migration: add background snapshot

2020-07-28 Thread Denis Plotnikov




On 27.07.2020 19:48, Dr. David Alan Gilbert wrote:

* Denis Plotnikov (dplotni...@virtuozzo.com) wrote:

...

+static void page_fault_thread_stop(void)
+{
+if (page_fault_fd) {
+close(page_fault_fd);
+page_fault_fd = 0;
+}

I think you need to do that after you've done the quit and join,
otherwise the fault thread might still be reading this.


Seems to be so



+if (thread_quit_fd) {
+uint64_t val = 1;
+int ret;
+
+ret = write(thread_quit_fd, &val, sizeof(val));
+assert(ret == sizeof(val));
+
+qemu_thread_join(&page_fault_thread);
+close(thread_quit_fd);
+thread_quit_fd = 0;
+}
+}
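A sketch of the reordered teardown suggested above (hypothetical; signal the
fault thread and join it before closing the fd it reads from):

    static void page_fault_thread_stop(void)
    {
        /* first ask the fault thread to quit and wait for it ... */
        if (thread_quit_fd) {
            uint64_t val = 1;
            int ret;

            ret = write(thread_quit_fd, &val, sizeof(val));
            assert(ret == sizeof(val));

            qemu_thread_join(&page_fault_thread);
            close(thread_quit_fd);
            thread_quit_fd = 0;
        }

        /* ... and only then close the userfaultfd it was reading from */
        if (page_fault_fd) {
            close(page_fault_fd);
            page_fault_fd = 0;
        }
    }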

...

  /**
   * ram_find_and_save_block: finds a dirty page and sends it to f
   *
@@ -1782,6 +2274,7 @@ static int ram_find_and_save_block(RAMState *rs, bool 
last_stage)
  pss.block = rs->last_seen_block;
  pss.page = rs->last_page;
  pss.complete_round = false;
+pss.page_copy = NULL;
  
  if (!pss.block) {

   pss.block = QLIST_FIRST_RCU(&ram_list.blocks);
@@ -1794,11 +2287,30 @@ static int ram_find_and_save_block(RAMState *rs, bool 
last_stage)
  if (!found) {
  /* priority queue empty, so just search for something dirty */
   found = find_dirty_block(rs, &pss, &again);
+
+if (found && migrate_background_snapshot()) {
+/*
+ * make a copy of the page and
+ * pass it to the page search status
+ */
+int ret;
+ret = ram_copy_page(pss.block, pss.page, &pss.page_copy);

I'm a bit confused about why we hit this; the way I'd thought about your
code was we turn on the write faulting, do one big save and then fixup
the faults as the save is happening (doing the copies) as the writes
hit; so when does this case hit?


To make it more clear, let me draw the whole picture:

When we do a background snapshot, the VM is paused until all of the vmstate
EXCEPT RAM is saved.
RAM isn't written at all at that point; that part of the vmstate is saved in
a temporary buffer.


Then all the RAM is marked read-only and the VM is unpaused. Note
that at this moment all of the VM's vCPUs are
running and can touch any part of memory.
After that, the migration thread starts writing the RAM content. Once a
memory chunk is written, the write protection is removed for that chunk.
If a vCPU wants to write to a memory page which is still write protected
(hasn't been written yet), the write is intercepted, the memory page is
copied and queued for writing, and then write access to the page is
restored. The intention behind that is to allow the vCPU to work with the
memory page as soon as possible.


Once all the RAM has been written, the rest of the vmstate is written
from the buffer. This needs to be so because some of the emulated
devices, saved in that buffered vmstate part, expect the RAM content to
be available first when they are loaded.


I hope this description makes things clearer.
If not, please let me know, so I can add more details.

Denis


--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK






Re: [PATCH v0 0/4] background snapshot

2020-07-24 Thread Denis Plotnikov




On 23.07.2020 20:39, Peter Xu wrote:

On Thu, Jul 23, 2020 at 11:03:55AM +0300, Denis Plotnikov wrote:


On 22.07.2020 19:30, Peter Xu wrote:

On Wed, Jul 22, 2020 at 06:47:44PM +0300, Denis Plotnikov wrote:

On 22.07.2020 18:42, Denis Plotnikov wrote:

On 22.07.2020 17:50, Peter Xu wrote:

Hi, Denis,

Hi, Peter

...

How to use:
1. enable background snapshot capability
      virsh qemu-monitor-command vm --hmp migrate_set_capability
background-snapshot on

2. stop the vm
      virsh qemu-monitor-command vm --hmp stop

3. Start the external migration to a file
      virsh qemu-monitor-command cent78-bs --hmp migrate exec:'cat > ./vm_state'

4. Wait for the migration finish and check that the migration
has completed state.

Thanks for continued working on this project! I have two high level
questions
before dig into the patches.

Firstly, is step 2 required?  Can we use a single QMP command to
take snapshots
(which can still be a "migrate" command)?

With this series it is required, but steps 2 and 3 should be merged into
a single one.

I'm not sure whether you're talking about the disk snapshot operations, anyway
yeah it'll be definitely good if we merge them into one in the next version.

After thinking for a while, I remembered why I split these two steps.
The vm snapshot consists of two parts: disk(s) snapshot(s) and vmstate.
With migrate command we save the vmstate only. So, the steps to save
the whole vm snapshot are the following:

2. stop the vm
     virsh qemu-monitor-command vm --hmp stop

2.1. Make a disk snapshot, something like
 virsh qemu-monitor-command vm --hmp snapshot_blkdev drive-scsi0-0-0-0 
./new_data
3. Start the external migration to a file
     virsh qemu-monitor-command vm --hmp migrate exec:'cat ./vm_state'

In this example, the vm snapshot consists of two files: vm_state and the disk file.
new_data will contain all new disk data written since [2.1] was executed.

But that's slightly different to the current interface of savevm and loadvm
which only requires a snapshot name, am I right?


Yes

Now we need both a snapshot
name (of the vmstate) and the name of the new snapshot image.


Yes


I'm not familiar with qemu image snapshots... my understanding is that the
current snapshot (save_snapshot) uses internal image snapshots, while in this
proposal you want the live snapshot to use external snapshots.
Correct, I want to add the ability to make an external live snapshot (live =
asynchronous RAM writing).

   Are there any criteria for
making this decision/change?
Internal snapshots are supported by qcow2 and sheepdog (I have never heard of
anyone using the latter).
Because of the qcow2 internal snapshot design, it's quite complex to
implement a "background" snapshot there.
More details here:
https://www.mail-archive.com/qemu-devel@nongnu.org/msg705116.html
So I decided to start with external snapshots, to get the memory access
intercepting part implemented and accepted first.
Once it's done for external snapshots, we can start to approach internal
snapshots.


Thanks,
Denis



Re: [PATCH v0 0/4] background snapshot

2020-07-23 Thread Denis Plotnikov




On 22.07.2020 19:30, Peter Xu wrote:

On Wed, Jul 22, 2020 at 06:47:44PM +0300, Denis Plotnikov wrote:


On 22.07.2020 18:42, Denis Plotnikov wrote:


On 22.07.2020 17:50, Peter Xu wrote:

Hi, Denis,

Hi, Peter

...

How to use:
1. enable background snapshot capability
     virsh qemu-monitor-command vm --hmp migrate_set_capability
background-snapshot on

2. stop the vm
     virsh qemu-monitor-command vm --hmp stop

3. Start the external migration to a file
     virsh qemu-monitor-command cent78-bs --hmp migrate exec:'cat > ./vm_state'

4. Wait for the migration finish and check that the migration
has completed state.

Thanks for continued working on this project! I have two high level
questions
before dig into the patches.

Firstly, is step 2 required?  Can we use a single QMP command to
take snapshots
(which can still be a "migrate" command)?

With this series it is required, but steps 2 and 3 should be merged into
a single one.

I'm not sure whether you're talking about the disk snapshot operations, anyway
yeah it'll be definitely good if we merge them into one in the next version.


After thinking for a while, I remembered why I split these two steps.
The vm snapshot consists of two parts: disk(s) snapshot(s) and vmstate.
With migrate command we save the vmstate only. So, the steps to save
the whole vm snapshot are the following:

2. stop the vm
    virsh qemu-monitor-command vm --hmp stop

2.1. Make a disk snapshot, something like
virsh qemu-monitor-command vm --hmp snapshot_blkdev drive-scsi0-0-0-0 
./new_data
   
3. Start the external migration to a file

    virsh qemu-monitor-command vm --hmp migrate exec:'cat ./vm_state'

In this example, the vm snapshot consists of two files: vm_state and the disk file.
new_data will contain all new disk data written since [2.1] was executed.




Meanwhile, we might also want to check around the type of backend
RAM.  E.g.,
shmem and hugetlbfs are still not supported for uffd-wp (which I'm still
working on).  I didn't check explicitly whether we'll simply fail
the migration
for those cases since the uffd ioctls will fail for those kinds of
RAM.  It
would be okay if we handle all the ioctl failures gracefully,

The ioctl's result is processed but the patch has a flaw - it ignores
the result of ioctl check. Need to fix it.

It happens here:

+int ram_write_tracking_start(void)
+{
+if (page_fault_thread_start()) {
+return -1;
+}
+
+ram_block_list_create();
+ram_block_list_set_readonly(); << this returns -1 in case of failure but I 
just ignore it here
+
+return 0;
+}
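
A corrected sketch (using the helpers already declared in migration/ram.h;
the exact cleanup steps on failure are an assumption) would simply
propagate the error instead of ignoring it:

int ram_write_tracking_start(void)
{
    if (page_fault_thread_start()) {
        return -1;
    }

    ram_block_list_create();

    /* propagate the set-readonly/ioctl failure instead of ignoring it */
    if (ram_block_list_set_readonly()) {
        page_fault_thread_stop();
        ram_block_list_destroy();
        return -1;
    }

    return 0;
}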


or it would be
even better if we directly fail when we want to enable live snapshot
capability
for a guest who contains other types of ram besides private
anonymous memories.

I agree, but to know whether shmem or hugetlbfs are supported by the
running kernel we would have to
execute the ioctl for all memory regions when the capability is enabled.

Yes, that seems to be a better solution, so we don't care about the type of ram
backend anymore but check directly with the uffd ioctls.  With these checks,
it'll be even fine to ignore the above retcode, or just assert, because we've
already checked that before that point.
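
A per-region probe along those lines (a sketch only; it registers the
region in write-protect mode on a throwaway userfaultfd and immediately
unregisters it, which is an assumed approach rather than this series'
code) might look like:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Sketch: check that one RAM region can be uffd-wp registered.
 * 'uffd' is a userfaultfd already initialized with UFFDIO_API. */
static bool region_supports_uffd_wp(int uffd, void *host, uint64_t size)
{
    struct uffdio_register reg;
    struct uffdio_range range;

    memset(&reg, 0, sizeof(reg));
    reg.range.start = (uintptr_t)host;
    reg.range.len = size;
    reg.mode = UFFDIO_REGISTER_MODE_WP;

    if (ioctl(uffd, UFFDIO_REGISTER, &reg) < 0) {
        return false;   /* e.g. shmem/hugetlbfs on kernels without support */
    }

    /* undo the test registration right away */
    range.start = (uintptr_t)host;
    range.len = size;
    ioctl(uffd, UFFDIO_UNREGISTER, &range);

    return reg.ioctls & (1ULL << _UFFDIO_WRITEPROTECT);
}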

Thanks,






Re: [PATCH v0 0/4] background snapshot

2020-07-22 Thread Denis Plotnikov




On 22.07.2020 18:42, Denis Plotnikov wrote:



On 22.07.2020 17:50, Peter Xu wrote:

Hi, Denis,

Hi, Peter

...

How to use:
1. enable background snapshot capability
    virsh qemu-monitor-command vm --hmp migrate_set_capability 
background-snapshot on


2. stop the vm
    virsh qemu-monitor-command vm --hmp stop

3. Start the external migration to a file
    virsh qemu-monitor-command cent78-bs --hmp migrate exec:'cat > 
./vm_state'


4. Wait for the migration finish and check that the migration has 
completed state.
Thanks for continued working on this project! I have two high level 
questions

before dig into the patches.

Firstly, is step 2 required?  Can we use a single QMP command to take 
snapshots

(which can still be a "migrate" command)?


With this series it is required, but steps 2 and 3 should be merged 
into a single one.


Meanwhile, we might also want to check around the type of backend 
RAM.  E.g.,

shmem and hugetlbfs are still not supported for uffd-wp (which I'm still
working on).  I didn't check explicitly whether we'll simply fail the 
migration
for those cases since the uffd ioctls will fail for those kinds of 
RAM.  It

would be okay if we handle all the ioctl failures gracefully,


The ioctl's result is processed but the patch has a flaw - it ignores 
the result of ioctl check. Need to fix it.


It happens here:

+int ram_write_tracking_start(void)
+{
+if (page_fault_thread_start()) {
+return -1;
+}
+
+ram_block_list_create();
+ram_block_list_set_readonly(); << this returns -1 in case of failure but I 
just ignore it here
+
+return 0;
+}


or it would be
even better if we directly fail when we want to enable live snapshot 
capability
for a guest who contains other types of ram besides private anonymous 
memories.


I agree, but to know whether shmem or hugetlbfs are supported by the 
current kernel we should

execute the ioctl for all memory regions on the capability enabling.

Thanks,
Denis





Re: [PATCH v0 0/4] background snapshot

2020-07-22 Thread Denis Plotnikov




On 22.07.2020 17:50, Peter Xu wrote:

Hi, Denis,

Hi, Peter

...

How to use:
1. enable background snapshot capability
virsh qemu-monitor-command vm --hmp migrate_set_capability 
background-snapshot on

2. stop the vm
virsh qemu-monitor-command vm --hmp stop

3. Start the external migration to a file
virsh qemu-monitor-command cent78-bs --hmp migrate exec:'cat > ./vm_state'

4. Wait for the migration finish and check that the migration has completed 
state.

Thanks for continued working on this project! I have two high level questions
before dig into the patches.

Firstly, is step 2 required?  Can we use a single QMP command to take snapshots
(which can still be a "migrate" command)?


With this series it is required, but steps 2 and 3 should be merged into 
a single one.


Meanwhile, we might also want to check around the type of backend RAM.  E.g.,
shmem and hugetlbfs are still not supported for uffd-wp (which I'm still
working on).  I didn't check explicitly whether we'll simply fail the migration
for those cases since the uffd ioctls will fail for those kinds of RAM.  It
would be okay if we handle all the ioctl failures gracefully,


The ioctl's result is processed but the patch has a flaw - it ignores 
the result of ioctl check. Need to fix it.

or it would be
even better if we directly fail when we want to enable live snapshot capability
for a guest who contains other types of ram besides private anonymous memories.


I agree, but to know whether shmem or hugetlbfs are supported by the 
current kernel we should

execute the ioctl for all memory regions on the capability enabling.

Thanks,
Denis



[PATCH v0 3/4] migration: add background snapshot

2020-07-22 Thread Denis Plotnikov
At the moment, making a VM snapshot may cause significant VM downtime,
depending on the VM RAM size and the performance of the disk storing
the snapshot. This happens because the VM has to be paused until all
vmstate, including RAM, is written.

To reduce the downtime, the background snapshot capability is used.
With the capability enabled, the VM is paused only for the short time while
the smallest vmstate part (everything except RAM) is written. RAM, the biggest
part of the vmstate, is written while the VM is running.

Signed-off-by: Denis Plotnikov 
---
 include/exec/ramblock.h |   8 +
 include/exec/ramlist.h  |   2 +
 migration/ram.h |  19 +-
 migration/savevm.h  |   3 +
 migration/migration.c   | 107 +++-
 migration/ram.c | 578 ++--
 migration/savevm.c  |   1 -
 7 files changed, 698 insertions(+), 20 deletions(-)

diff --git a/include/exec/ramblock.h b/include/exec/ramblock.h
index 07d50864d8..421e128ef6 100644
--- a/include/exec/ramblock.h
+++ b/include/exec/ramblock.h
@@ -59,6 +59,14 @@ struct RAMBlock {
  */
 unsigned long *clear_bmap;
 uint8_t clear_bmap_shift;
+
+/* The following 3 elements are for background snapshot */
+/* List of blocks used for background snapshot */
+QLIST_ENTRY(RAMBlock) bgs_next;
+/* Pages currently being copied */
+unsigned long *touched_map;
+/* Pages has been copied already */
+unsigned long *copied_map;
 };
 #endif
 #endif
diff --git a/include/exec/ramlist.h b/include/exec/ramlist.h
index bc4faa1b00..74e2a1162c 100644
--- a/include/exec/ramlist.h
+++ b/include/exec/ramlist.h
@@ -44,6 +44,8 @@ typedef struct {
 unsigned long *blocks[];
 } DirtyMemoryBlocks;
 
+typedef QLIST_HEAD(, RAMBlock) RamBlockList;
+
 typedef struct RAMList {
 QemuMutex mutex;
 RAMBlock *mru_block;
diff --git a/migration/ram.h b/migration/ram.h
index 2eeaacfa13..769b8087ae 100644
--- a/migration/ram.h
+++ b/migration/ram.h
@@ -42,7 +42,8 @@ uint64_t ram_bytes_remaining(void);
 uint64_t ram_bytes_total(void);
 
 uint64_t ram_pagesize_summary(void);
-int ram_save_queue_pages(const char *rbname, ram_addr_t start, ram_addr_t len);
+int ram_save_queue_pages(RAMBlock *block, const char *rbname,
+ ram_addr_t start, ram_addr_t len, void *page_copy);
 void acct_update_position(QEMUFile *f, size_t size, bool zero);
 void ram_debug_dump_bitmap(unsigned long *todump, bool expected,
unsigned long pages);
@@ -69,4 +70,20 @@ void colo_flush_ram_cache(void);
 void colo_release_ram_cache(void);
 void colo_incoming_start_dirty_log(void);
 
+/* for background snapshot */
+void ram_block_list_create(void);
+void ram_block_list_destroy(void);
+RAMBlock *ram_bgs_block_find(uint64_t address, ram_addr_t *page_offset);
+
+void *ram_page_buffer_get(void);
+int ram_page_buffer_free(void *buffer);
+
+int ram_block_list_set_readonly(void);
+int ram_block_list_set_writable(void);
+
+int ram_copy_page(RAMBlock *block, unsigned long page_nr, void **page_copy);
+int ram_process_page_fault(uint64_t address);
+
+int ram_write_tracking_start(void);
+void ram_write_tracking_stop(void);
 #endif
diff --git a/migration/savevm.h b/migration/savevm.h
index ba64a7e271..4f4edffa85 100644
--- a/migration/savevm.h
+++ b/migration/savevm.h
@@ -64,5 +64,8 @@ int qemu_loadvm_state(QEMUFile *f);
 void qemu_loadvm_state_cleanup(void);
 int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis);
 int qemu_load_device_state(QEMUFile *f);
+int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
+bool in_postcopy,
+bool inactivate_disks);
 
 #endif
diff --git a/migration/migration.c b/migration/migration.c
index 2ec0451abe..dc56e4974f 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -55,6 +55,7 @@
 #include "net/announce.h"
 #include "qemu/queue.h"
 #include "multifd.h"
+#include "sysemu/cpus.h"
 
 #define MAX_THROTTLE  (32 << 20)  /* Migration transfer speed throttling */
 
@@ -2473,7 +2474,7 @@ static void migrate_handle_rp_req_pages(MigrationState 
*ms, const char* rbname,
 return;
 }
 
-if (ram_save_queue_pages(rbname, start, len)) {
+if (ram_save_queue_pages(NULL, rbname, start, len, NULL)) {
 mark_source_rp_bad(ms);
 }
 }
@@ -3536,6 +3537,100 @@ static void *migration_thread(void *opaque)
 return NULL;
 }
 
+static void *background_snapshot_thread(void *opaque)
+{
+MigrationState *m = opaque;
+QIOChannelBuffer *bioc;
+QEMUFile *fb;
+int res = 0;
+
+rcu_register_thread();
+
+qemu_file_set_rate_limit(m->to_dst_file, INT64_MAX);
+
+qemu_mutex_lock_iothread();
+vm_stop(RUN_STATE_PAUSED);
+
+qemu_savevm_state_header(m->to_dst_file);
+qemu_mutex_unlock_iothread();
+qemu_savevm_state_setup(m->to_dst_file);
+qemu

[PATCH v0 2/4] migration: add background snapshot capability

2020-07-22 Thread Denis Plotnikov
The capability is used for background snapshot enabling.
The background snapshot logic is going to be added in the following
patch.

Signed-off-by: Denis Plotnikov 
---
 qapi/migration.json   |  7 ++-
 migration/migration.h |  1 +
 migration/migration.c | 35 +++
 3 files changed, 42 insertions(+), 1 deletion(-)

diff --git a/qapi/migration.json b/qapi/migration.json
index d5000558c6..46681a5c3c 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -424,6 +424,11 @@
 # @validate-uuid: Send the UUID of the source to allow the destination
 # to ensure it is the same. (since 4.2)
 #
+# @background-snapshot: If enabled, the migration stream will be a snapshot
+#   of the VM exactly at the point when the migration
+#   procedure starts. The VM RAM is saved with running VM.
+#   (since 5.2)
+#
 # Since: 1.2
 ##
 { 'enum': 'MigrationCapability',
@@ -431,7 +436,7 @@
'compress', 'events', 'postcopy-ram', 'x-colo', 'release-ram',
'block', 'return-path', 'pause-before-switchover', 'multifd',
'dirty-bitmaps', 'postcopy-blocktime', 'late-block-activate',
-   'x-ignore-shared', 'validate-uuid' ] }
+   'x-ignore-shared', 'validate-uuid', 'background-snapshot' ] }
 
 ##
 # @MigrationCapabilityStatus:
diff --git a/migration/migration.h b/migration/migration.h
index f617960522..63f2fde9a3 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -322,6 +322,7 @@ int migrate_compress_wait_thread(void);
 int migrate_decompress_threads(void);
 bool migrate_use_events(void);
 bool migrate_postcopy_blocktime(void);
+bool migrate_background_snapshot(void);
 
 /* Sending on the return path - generic and then for each message type */
 void migrate_send_rp_shut(MigrationIncomingState *mis,
diff --git a/migration/migration.c b/migration/migration.c
index 2ed9923227..2ec0451abe 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1086,6 +1086,32 @@ static bool migrate_caps_check(bool *cap_list,
 error_setg(errp, "Postcopy is not compatible with ignore-shared");
 return false;
 }
+
+if (cap_list[MIGRATION_CAPABILITY_BACKGROUND_SNAPSHOT]) {
+error_setg(errp, "Postcopy is not compatible "
+"with background snapshot");
+return false;
+}
+}
+
+if (cap_list[MIGRATION_CAPABILITY_BACKGROUND_SNAPSHOT]) {
+if (cap_list[MIGRATION_CAPABILITY_RELEASE_RAM]) {
+error_setg(errp, "Background snapshot is not compatible "
+"with release ram capability");
+return false;
+}
+
+if (cap_list[MIGRATION_CAPABILITY_COMPRESS]) {
+error_setg(errp, "Background snapshot is not "
+"currently compatible with compression");
+return false;
+}
+
+if (cap_list[MIGRATION_CAPABILITY_XBZRLE]) {
+error_setg(errp, "Background snapshot is not "
+"currently compatible with XBZLRE");
+return false;
+}
 }
 
 return true;
@@ -2390,6 +2416,15 @@ bool migrate_use_block_incremental(void)
 return s->parameters.block_incremental;
 }
 
+bool migrate_background_snapshot(void)
+{
+MigrationState *s;
+
+s = migrate_get_current();
+
+return s->enabled_capabilities[MIGRATION_CAPABILITY_BACKGROUND_SNAPSHOT];
+}
+
 /* migration thread support */
 /*
  * Something bad happened to the RP stream, mark an error
-- 
2.17.0




[PATCH v0 1/4] bitops: add some atomic versions of bitmap operations

2020-07-22 Thread Denis Plotnikov
1. test bit
2. test and set bit

Signed-off-by: Denis Plotnikov 
Reviewed-by: Peter Xu 
---
 include/qemu/bitops.h | 25 +
 1 file changed, 25 insertions(+)

diff --git a/include/qemu/bitops.h b/include/qemu/bitops.h
index f55ce8b320..63218afa5a 100644
--- a/include/qemu/bitops.h
+++ b/include/qemu/bitops.h
@@ -95,6 +95,21 @@ static inline int test_and_set_bit(long nr, unsigned long 
*addr)
 return (old & mask) != 0;
 }
 
+/**
+ * test_and_set_bit_atomic - Set a bit atomically and return its old value
+ * @nr: Bit to set
+ * @addr: Address to count from
+ */
+static inline int test_and_set_bit_atomic(long nr, unsigned long *addr)
+{
+unsigned long mask = BIT_MASK(nr);
+unsigned long *p = addr + BIT_WORD(nr);
+unsigned long old;
+
+old = atomic_fetch_or(p, mask);
+return (old & mask) != 0;
+}
+
 /**
  * test_and_clear_bit - Clear a bit and return its old value
  * @nr: Bit to clear
@@ -135,6 +150,16 @@ static inline int test_bit(long nr, const unsigned long 
*addr)
 return 1UL & (addr[BIT_WORD(nr)] >> (nr & (BITS_PER_LONG-1)));
 }
 
+/**
+ * test_bit_atomic - Determine whether a bit is set atomicallly
+ * @nr: bit number to test
+ * @addr: Address to start counting from
+ */
+static inline int test_bit_atomic(long nr, const unsigned long *addr)
+{
+long valid_nr = nr & (BITS_PER_LONG - 1);
+return 1UL & (atomic_read(&addr[BIT_WORD(nr)]) >> valid_nr);
+}
 /**
  * find_last_bit - find the last set bit in a memory region
  * @addr: The address to start the search at
-- 
2.17.0
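
In the background-snapshot copy path these helpers allow the migration
thread and the page-fault thread to claim a page exactly once. A hedged
usage sketch against the copied_map bitmap added to RAMBlock in patch 3/4
(the claiming protocol itself is an assumption, not taken from the series):

/* Sketch only: whoever atomically sets the bit first "wins" and is the
 * one that copies/saves the page; everyone else skips it. */
static bool claim_page_for_copy(RAMBlock *block, unsigned long page_nr)
{
    return !test_and_set_bit_atomic(page_nr, block->copied_map);
}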




[PATCH v0 4/4] background snapshot: add trace events for page fault processing

2020-07-22 Thread Denis Plotnikov
Signed-off-by: Denis Plotnikov 
---
 migration/ram.c| 4 
 migration/trace-events | 2 ++
 2 files changed, 6 insertions(+)

diff --git a/migration/ram.c b/migration/ram.c
index f187b5b494..29712a11c2 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -2172,12 +2172,16 @@ again:
 break;
 }
 
+trace_page_fault_processing_start(msg.arg.pagefault.address);
+
 if (ram_process_page_fault(msg.arg.pagefault.address) < 0) {
 error_report("page fault: error on write protected page "
  "processing [0x%llx]",
  msg.arg.pagefault.address);
 break;
 }
+
+trace_page_fault_processing_finish(msg.arg.pagefault.address);
 }
 
 return NULL;
diff --git a/migration/trace-events b/migration/trace-events
index 4ab0a503d2..f46b3b9a72 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -128,6 +128,8 @@ save_xbzrle_page_skipping(void) ""
 save_xbzrle_page_overflow(void) ""
 ram_save_iterate_big_wait(uint64_t milliconds, int iterations) "big wait: %" 
PRIu64 " milliseconds, %d iterations"
 ram_load_complete(int ret, uint64_t seq_iter) "exit_code %d seq iteration %" 
PRIu64
+page_fault_processing_start(unsigned long address) "HVA: 0x%lx"
+page_fault_processing_finish(unsigned long address) "HVA: 0x%lx"
 
 # migration.c
 await_return_path_close_on_source_close(void) ""
-- 
2.17.0




[PATCH v0 0/4] background snapshot

2020-07-22 Thread Denis Plotnikov
Currently there is no way to make a VM snapshot without pausing the VM
for the whole time until the snapshot is done. So the problem is
the VM downtime on snapshotting. The downtime depends on the vmstate
size, the major part of which is RAM, and on the performance of the disk
used for saving the snapshot.

This series proposes a way to reduce the VM snapshot downtime. This is done
by saving RAM, the major part of the vmstate, in the background while the VM
is running.

The background snapshot uses the Linux UFFD write-protected mode to intercept
memory page accesses. UFFD write-protected mode was added in Linux v5.7.
If UFFD write-protected mode isn't available, the background snapshot refuses
to run.
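
A minimal availability probe (a sketch, not this series' code; it assumes
Linux v5.7 uapi headers) could look like:

#include <fcntl.h>
#include <stdbool.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/* Sketch: probe whether the running kernel offers uffd write-protect. */
static bool uffd_wp_available(void)
{
    int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
    struct uffdio_api api;
    bool ok;

    if (uffd < 0) {
        return false;
    }
    memset(&api, 0, sizeof(api));
    api.api = UFFD_API;
    api.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP;
    ok = ioctl(uffd, UFFDIO_API, &api) == 0 &&
         (api.features & UFFD_FEATURE_PAGEFAULT_FLAG_WP);
    close(uffd);
    return ok;
}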

How to use:
1. enable background snapshot capability
   virsh qemu-monitor-command vm --hmp migrate_set_capability 
background-snapshot on

2. stop the vm
   virsh qemu-monitor-command vm --hmp stop

3. Start the external migration to a file
   virsh qemu-monitor-command cent78-bs --hmp migrate exec:'cat > ./vm_state'

4. Wait for the migration finish and check that the migration has completed 
state.

Denis Plotnikov (4):
  bitops: add some atomic versions of bitmap operations
  migration: add background snapshot capability
  migration: add background snapshot
  background snapshot: add trace events for page fault processing

 qapi/migration.json |   7 +-
 include/exec/ramblock.h |   8 +
 include/exec/ramlist.h  |   2 +
 include/qemu/bitops.h   |  25 ++
 migration/migration.h   |   1 +
 migration/ram.h |  19 +-
 migration/savevm.h  |   3 +
 migration/migration.c   | 142 +-
 migration/ram.c | 582 ++--
 migration/savevm.c  |   1 -
 migration/trace-events  |   2 +
 11 files changed, 771 insertions(+), 21 deletions(-)

-- 
2.17.0




Re: [RFC PATCH 0/3] block: Synchronous bdrv_*() from coroutine in different AioContext

2020-05-20 Thread Denis Plotnikov




I'm not quite sure I understand the point.
Let me lay out the whole picture of the async snapshot: our goal is to minimize
the VM downtime during the snapshot.
When we do an async snapshot we save the vmstate except RAM while the VM is
stopped, using the current L1 table (hereafter the initial L1 table). Then we
want the VM to start running
and write the RAM content. At this time all RAM is write-protected.
We unprotect each RAM page once it has been written.

Oh, I see, you're basically doing something like postcopy migration. I
was assuming it was more like regular live migration, except that you
would overwrite updated RAM blocks instead of appending them.

I can see your requirement then.


All those RAM pages should go to the snapshot using the initial L1 table.
Since the VM is running, it may want to write new disk blocks,
so we need to use a NEW L1 table to provide this ability. (Am I correct so
far?)
Thus, if I understand correctly, we need to use two L1 tables: the initial
one to store RAM pages
into the vmstate and the new one to allow disk writes.

Maybe I just can't see a better way to achieve that. Please correct me if I'm
wrong.

I guess I could imagine a different, though probably not better way: We
could internally have a separate low-level operation that moves the VM
state from the active layer to an already existing disk snapshot. Then
you would snapshot the disk and start writing the VM to the active
layer, and when the VM state write has completed you move it into the
snapshot.

The other option is doing what you suggested. There is nothing in the
qcow2 on-disk format that would prevent this, but we would have to
extend the qcow2 driver to allow I/O to inactive L1 tables. This sounds
like a non-trivial amount of code changes, though it would potentially
enable more use cases we never implemented ((read-only) access to
internal snapshots as block nodes, so you could e.g. use block jobs to
export a snapshot).

Kevin


Ok, thanks for validating the possibilities and suggesting more
implementation ideas.
I think I should start by posting my background snapshot
version, which stores the vmstate to an external file,

because write-protected userfaultfd is now available in Linux.
And if it's accepted, I'll try to come up with an internal version for
qcow2 (it seems this is the only format supporting internal snapshots).


Denis



Re: [RFC PATCH 0/3] block: Synchronous bdrv_*() from coroutine in different AioContext

2020-05-19 Thread Denis Plotnikov




On 19.05.2020 17:18, Kevin Wolf wrote:

Am 19.05.2020 um 15:54 hat Denis Plotnikov geschrieben:


On 19.05.2020 15:32, Vladimir Sementsov-Ogievskiy wrote:

14.05.2020 17:26, Kevin Wolf wrote:

Am 14.05.2020 um 15:21 hat Thomas Lamprecht geschrieben:

On 5/12/20 4:43 PM, Kevin Wolf wrote:

Stefan (Reiter), after looking a bit closer at this, I think
there is no
bug in QEMU, but the bug is in your coroutine code that calls block
layer functions without moving into the right AioContext first. I've
written this series anyway as it potentially makes the life of callers
easier and would probably make your buggy code correct.
However, it doesn't feel right to commit something like
patch 2 without
having a user for it. Is there a reason why you can't upstream your
async snapshot code?

I mean I understand what you mean, but it would IMO make the interface so
much easier to use; if one wants to explicitly schedule it beforehand, they
still can. But that would open the way for two styles of doing things; not
sure if this would be seen as bad. The assert from patch 3/3 alone would
already help a lot, though.

I think patches 1 and 3 are good to be committed either way if people
think they are useful. They make sense without the async snapshot code.

My concern with the interface in patch 2 is both that it could give
people a false sense of security and that it would be tempting to write
inefficient code.

Usually, you won't have just a single call into the block layer for a
given block node, but you'll perform multiple operations. Switching to
the target context once rather than switching back and forth in every
operation is obviously more efficient.

But chances are that even if one of these function is bdrv_flush(),
which now works correctly from a different thread, you might need
another function that doesn't implement the same magic. So you always
need to be aware which functions support cross-context calls which
ones don't.

I feel we'd see a few bugs related to this.


Regarding upstreaming, there was a historical attempt to upstream it
from Dietmar, but in the time frame of ~8 to 10 years ago or so.
I'm not quite sure why it didn't go through then; I'll see if I can find
some time to search the mailing list archive.

We'd naturally be open to and glad about upstreaming it; what it effectively
allows us to do is to not block the VM too much while snapshotting it
live.

Yes, there is no doubt that this is useful functionality. There has been
talk about this every now and then, but I don't think we ever got to a
point where it actually could be implemented.

Vladimir, I seem to remember you (or someone else from your team?) were
interested in async snapshots as well a while ago?

Den is working on this (add him to CC)

Yes, I was working on that.

What I've done can be found here:
https://github.com/denis-plotnikov/qemu/commits/bgs_uffd

The idea was to save a snapshot (state+ram) asynchronously to a separate
(raw) file using the existing infrastructure.
The goal of that was to reduce the VM downtime on snapshot.

We decided to postpone this work until "userfaultfd write-protected mode"
landed in the Linux mainline.
Now userfaultfd-wp is merged into Linux, so we plan to continue this
work.

Regarding saving the "internal" snapshot to qcow2 I still have a
question. Maybe this is the right place and time to ask.

If I remember correctly, in qcow2 the snapshot is stored at the end of
the address space of the current block-placement-table.

Yes. Basically the way snapshots with VM state work is that you write
the VM state to some offset after the end of the virtual disk, when the
VM state is completely written you snapshot the current state (including
both content of the virtual disk and VM state) and finally discard the
VM state again in the active L1 table.


We switch to the new block-placement-table after the snapshot storing
is complete. In case of a sync snapshot, we should switch the table
before the snapshot is written; in other words, we should be able to
preallocate the space for the snapshot and keep a link to that
space until the snapshot writing is completed.

I don't see a fundamental difference between sync and async in this
respect. Why can't you write the VM state to the current L1 table first
like we usually do?


I'm not quite sure I understand the point.
Let me lay out the whole picture of the async snapshot: our goal is to
minimize the VM downtime during the snapshot.

When we do an async snapshot we save the vmstate except RAM while the VM is
stopped, using the current L1 table (hereafter the initial L1 table). Then
we want the VM to start running

and write the RAM content. At this time all RAM is write-protected.
We unprotect each RAM page once it has been written.
All those RAM pages should go to the snapshot using the initial L1 table.
Since the VM is running, it may want to write new disk blocks,
so we need to use a NEW L1 table to provide this ability. (Am I correct
so far?)
Thus, if I 

Re: [RFC PATCH 0/3] block: Synchronous bdrv_*() from coroutine in different AioContext

2020-05-19 Thread Denis Plotnikov




On 19.05.2020 15:32, Vladimir Sementsov-Ogievskiy wrote:

14.05.2020 17:26, Kevin Wolf wrote:

Am 14.05.2020 um 15:21 hat Thomas Lamprecht geschrieben:

On 5/12/20 4:43 PM, Kevin Wolf wrote:
Stefan (Reiter), after looking a bit closer at this, I think there 
is no

bug in QEMU, but the bug is in your coroutine code that calls block
layer functions without moving into the right AioContext first. I've
written this series anyway as it potentially makes the life of callers
easier and would probably make your buggy code correct.


However, it doesn't feel right to commit something like patch 2 
without

having a user for it. Is there a reason why you can't upstream your
async snapshot code?


I mean I understand what you mean, but it would IMO make the interface so
much easier to use; if one wants to explicitly schedule it beforehand, they
still can. But that would open the way for two styles of doing things; not
sure if this would be seen as bad. The assert from patch 3/3 alone would
already help a lot, though.


I think patches 1 and 3 are good to be committed either way if people
think they are useful. They make sense without the async snapshot code.

My concern with the interface in patch 2 is both that it could give
people a false sense of security and that it would be tempting to write
inefficient code.

Usually, you won't have just a single call into the block layer for a
given block node, but you'll perform multiple operations. Switching to
the target context once rather than switching back and forth in every
operation is obviously more efficient.

But chances are that even if one of these function is bdrv_flush(),
which now works correctly from a different thread, you might need
another function that doesn't implement the same magic. So you always
need to be aware which functions support cross-context calls which
ones don't.

I feel we'd see a few bugs related to this.


Regarding upstreaming, there was a historical attempt to upstream it
from Dietmar, but in the time frame of ~8 to 10 years ago or so.
I'm not quite sure why it didn't go through then; I'll see if I can find
some time to search the mailing list archive.

We'd naturally be open to and glad about upstreaming it; what it effectively
allows us to do is to not block the VM too much while snapshotting it
live.


Yes, there is no doubt that this is useful functionality. There has been
talk about this every now and then, but I don't think we ever got to a
point where it actually could be implemented.

Vladimir, I seem to remember you (or someone else from your team?) were
interested in async snapshots as well a while ago?


Den is working on this (add him to CC)

Yes, I was working on that.

What I've done can be found here: 
https://github.com/denis-plotnikov/qemu/commits/bgs_uffd


The idea was to save a snapshot (state+ram) asynchronously to a separate 
(raw) file using the existing infrastructure.

The goal of that was to reduce the VM downtime on snapshot.

We decided to postpone this work until "userfaultfd write-protected
mode" landed in the Linux mainline.
Now userfaultfd-wp is merged into Linux, so we plan to continue
this work.


Regarding saving the "internal" snapshot to qcow2 I still have a
question. Maybe this is the right place and time to ask.


If I remember correctly, in qcow2 the snapshot is stored at the end of 
the address space of the current block-placement-table.
We switch to the new block-placement-table after the snapshot storing is
complete. In case of a sync snapshot, we should switch the
table before the snapshot is written; in other words, we should be able
to preallocate the space for the snapshot and keep a link

to that space until the snapshot writing is completed.
The question is whether this could be done without modifying qcow2 and,
if not, could you please give some ideas of how to implement it?


Denis




I pushed a tree[0] with mostly just that specific code squashed 
together (hope

I did not break anything), most of the actual code is in commit [1].
It'd be cleaned up a bit and checked for coding style issues, but 
works good

here.

Anyway, thanks for your help and pointers!

[0]: https://github.com/ThomasLamprecht/qemu/tree/savevm-async
[1]: 
https://github.com/ThomasLamprecht/qemu/commit/ffb9531f370ef0073e4b6f6021f4c47ccd702121


It doesn't even look that bad in terms of patch size. I had imagined it
a bit larger.

But it seems this is not really just an async 'savevm' (which would save
the VM state in a qcow2 file), but you store the state in a separate
raw file. What is the difference between this and regular migration into
a file?

I remember people talking about how snapshotting can store things in a
way that a normal migration stream can't do, like overwriting outdated
RAM state instead of just appending the new state, but you don't seem to
implement something like this.

Kevin









[PATCH v24 3/4] qcow2: add zstd cluster compression

2020-05-07 Thread Denis Plotnikov
zstd significantly reduces cluster compression time.
It provides better compression performance while maintaining
the same compression ratio as
zlib, which, at the moment, is the only compression
method available.

The performance test results:
The test compresses and decompresses a qemu qcow2 image with a freshly
installed rhel-7.6 guest.
Image cluster size: 64K. Image on-disk size: 2.2G

The test was conducted with a brd disk to reduce the influence
of the disk subsystem on the test results.
The results are given in seconds.

compress cmd:
  time ./qemu-img convert -O qcow2 -c -o compression_type=[zlib|zstd]
  src.img [zlib|zstd]_compressed.img
decompress cmd
  time ./qemu-img convert -O qcow2
  [zlib|zstd]_compressed.img uncompressed.img

             compression                  decompression
          zlib        zstd             zlib       zstd

real      65.5        16.3 (-75 %)     1.9        1.6 (-16 %)
user      65.0        15.8             5.3        2.5
sys        3.3         0.2             2.0        2.0

Both ZLIB and ZSTD gave the same compression ratio: 1.57
compressed image size in both cases: 1.4G

Signed-off-by: Denis Plotnikov 
QAPI part:
Acked-by: Markus Armbruster 
---
 docs/interop/qcow2.txt |   1 +
 configure  |   2 +-
 qapi/block-core.json   |   3 +-
 block/qcow2-threads.c  | 169 +
 block/qcow2.c  |   7 ++
 5 files changed, 180 insertions(+), 2 deletions(-)

diff --git a/docs/interop/qcow2.txt b/docs/interop/qcow2.txt
index 298a031310..cb723463f2 100644
--- a/docs/interop/qcow2.txt
+++ b/docs/interop/qcow2.txt
@@ -212,6 +212,7 @@ version 2.
 
 Available compression type values:
 0: zlib <https://www.zlib.net/>
+1: zstd <http://github.com/facebook/zstd>
 
 
 === Header padding ===
diff --git a/configure b/configure
index 23b5e93752..4e3a1690ea 100755
--- a/configure
+++ b/configure
@@ -1861,7 +1861,7 @@ disabled with --disable-FEATURE, default is enabled if 
available:
   lzfse   support of lzfse compression library
   (for reading lzfse-compressed dmg images)
   zstdsupport for zstd compression library
-  (for migration compression)
+  (for migration compression and qcow2 cluster compression)
   seccomp seccomp support
   coroutine-pool  coroutine freelist (better performance)
   glusterfs   GlusterFS backend
diff --git a/qapi/block-core.json b/qapi/block-core.json
index 1522e2983f..6fbacddab2 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -4293,11 +4293,12 @@
 # Compression type used in qcow2 image file
 #
 # @zlib: zlib compression, see <http://zlib.net/>
+# @zstd: zstd compression, see <http://github.com/facebook/zstd>
 #
 # Since: 5.1
 ##
 { 'enum': 'Qcow2CompressionType',
-  'data': [ 'zlib' ] }
+  'data': [ 'zlib', { 'name': 'zstd', 'if': 'defined(CONFIG_ZSTD)' } ] }
 
 ##
 # @BlockdevCreateOptionsQcow2:
diff --git a/block/qcow2-threads.c b/block/qcow2-threads.c
index 7dbaf53489..1914baf456 100644
--- a/block/qcow2-threads.c
+++ b/block/qcow2-threads.c
@@ -28,6 +28,11 @@
 #define ZLIB_CONST
 #include 
 
+#ifdef CONFIG_ZSTD
+#include 
+#include 
+#endif
+
 #include "qcow2.h"
 #include "block/thread-pool.h"
 #include "crypto.h"
@@ -166,6 +171,160 @@ static ssize_t qcow2_zlib_decompress(void *dest, size_t 
dest_size,
 return ret;
 }
 
+#ifdef CONFIG_ZSTD
+
+/*
+ * qcow2_zstd_compress()
+ *
+ * Compress @src_size bytes of data using zstd compression method
+ *
+ * @dest - destination buffer, @dest_size bytes
+ * @src - source buffer, @src_size bytes
+ *
+ * Returns: compressed size on success
+ *  -ENOMEM destination buffer is not enough to store compressed data
+ *  -EIOon any other error
+ */
+static ssize_t qcow2_zstd_compress(void *dest, size_t dest_size,
+   const void *src, size_t src_size)
+{
+ssize_t ret;
+size_t zstd_ret;
+ZSTD_outBuffer output = {
+.dst = dest,
+.size = dest_size,
+.pos = 0
+};
+ZSTD_inBuffer input = {
+.src = src,
+.size = src_size,
+.pos = 0
+};
+ZSTD_CCtx *cctx = ZSTD_createCCtx();
+
+if (!cctx) {
+return -EIO;
+}
+/*
+ * Use the zstd streamed interface for symmetry with decompression,
+ * where streaming is essential since we don't record the exact
+ * compressed size.
+ *
+ * ZSTD_compressStream2() tries to compress everything it could
+ * with a single call. Although, ZSTD docs says that:
+ * "You must continue calling ZSTD_compressStream2() with ZSTD_e_end
+ * until it returns 0, at which point you are free to start a new frame",
+ * in our tests we saw the only case when it returned with

[PATCH v24 1/4] qcow2: introduce compression type feature

2020-05-07 Thread Denis Plotnikov
The patch adds the preparation parts of the incompatible compression type
feature to qcow2, allowing the use of different compression methods for
(de)compressing image clusters.

It is implied that the compression type is set at image creation and
can only be changed later by image conversion; thus the compression type
defines the single compression algorithm used for the image and, therefore,
for all image clusters.

The goal of the feature is to add support for other compression methods
to qcow2, for example ZSTD, which is more effective at compression than ZLIB.

The default compression is ZLIB. Images created with the ZLIB compression type
are backward compatible with older qemu versions.

Adding the compression type breaks a number of tests because the
compression type is now reported on image creation and there are some changes
in the qcow2 header size and offsets.

The tests are fixed in the following ways:
* filter out compression_type for many tests
* fix header size, feature table size and backing file offset
  affected tests: 031, 036, 061, 080
  header_size +=8: 1 byte compression type
   7 bytes padding
  feature_table += 48: incompatible feature compression type
  backing_file_offset += 56 (8 + 48 -> header_change + feature_table_change)
* add "compression type" for test output matching when it isn't filtered
  affected tests: 049, 060, 061, 065, 082, 085, 144, 182, 185, 198, 206,
  242, 255, 274, 280

Signed-off-by: Denis Plotnikov 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Eric Blake 
Reviewed-by: Max Reitz 
QAPI part:
Acked-by: Markus Armbruster 
---
 qapi/block-core.json |  22 +-
 block/qcow2.h|  20 +-
 include/block/block_int.h|   1 +
 block/qcow2.c| 113 +++
 tests/qemu-iotests/031.out   |  14 ++--
 tests/qemu-iotests/036.out   |   4 +-
 tests/qemu-iotests/049.out   | 102 ++--
 tests/qemu-iotests/060.out   |   1 +
 tests/qemu-iotests/061.out   |  34 ++
 tests/qemu-iotests/065   |  28 +---
 tests/qemu-iotests/080   |   2 +-
 tests/qemu-iotests/082.out   |  48 +++--
 tests/qemu-iotests/085.out   |  38 +--
 tests/qemu-iotests/144.out   |   4 +-
 tests/qemu-iotests/182.out   |   2 +-
 tests/qemu-iotests/185.out   |   8 +--
 tests/qemu-iotests/198.out   |   2 +
 tests/qemu-iotests/206.out   |   5 ++
 tests/qemu-iotests/242.out   |   5 ++
 tests/qemu-iotests/255.out   |   8 +--
 tests/qemu-iotests/274.out   |  49 +++---
 tests/qemu-iotests/280.out   |   2 +-
 tests/qemu-iotests/common.filter |   3 +-
 23 files changed, 365 insertions(+), 150 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 943df1926a..1522e2983f 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -78,6 +78,8 @@
 #
 # @bitmaps: A list of qcow2 bitmap details (since 4.0)
 #
+# @compression-type: the image cluster compression method (since 5.1)
+#
 # Since: 1.7
 ##
 { 'struct': 'ImageInfoSpecificQCow2',
@@ -89,7 +91,8 @@
   '*corrupt': 'bool',
   'refcount-bits': 'int',
   '*encrypt': 'ImageInfoSpecificQCow2Encryption',
-  '*bitmaps': ['Qcow2BitmapInfo']
+  '*bitmaps': ['Qcow2BitmapInfo'],
+  'compression-type': 'Qcow2CompressionType'
   } }
 
 ##
@@ -4284,6 +4287,18 @@
   'data': [ 'v2', 'v3' ] }
 
 
+##
+# @Qcow2CompressionType:
+#
+# Compression type used in qcow2 image file
+#
+# @zlib: zlib compression, see <http://zlib.net/>
+#
+# Since: 5.1
+##
+{ 'enum': 'Qcow2CompressionType',
+  'data': [ 'zlib' ] }
+
 ##
 # @BlockdevCreateOptionsQcow2:
 #
@@ -4307,6 +4322,8 @@
 # allowed values: off, falloc, full, metadata)
 # @lazy-refcounts: True if refcounts may be updated lazily (default: off)
 # @refcount-bits: Width of reference counts in bits (default: 16)
+# @compression-type: The image cluster compression method
+#(default: zlib, since 5.1)
 #
 # Since: 2.12
 ##
@@ -4322,7 +4339,8 @@
 '*cluster-size':'size',
 '*preallocation':   'PreallocMode',
 '*lazy-refcounts':  'bool',
-'*refcount-bits':   'int' } }
+'*refcount-bits':   'int',
+'*compression-type':'Qcow2CompressionType' } }
 
 ##
 # @BlockdevCreateOptionsQed:
diff --git a/block/qcow2.h b/block/qcow2.h
index f4de0a27d5..6a8b82e6cc 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -146,8 +146,16 @@ typedef struct QCowHeader {
 
 uint32_t refcount_order;
 uint32_t header_length;
+
+/* Additional fields */
+uint8_t compression_type;
+
+/* header must be a multiple of 8 */
+uint8_t padding[7];
 } QEMU_PACKED QCowHeader;
 
+QEMU_BUILD_BUG_ON(!QEMU_IS_ALIGNED(sizeof(QCowHeader), 8));
+
 typedef struct QEMU_PACKED QCowSnapshotHeader {
 /* heade

[PATCH v24 0/4] implement zstd cluster compression method

2020-05-07 Thread Denis Plotnikov
v24:
   01: add "compression_type" to the tests output [Max]
   hopefully, I've found them all.

v23:
   Undecided: whether to add zstd(zlib) compression
  details to the qcow2 spec
   03: tighten assertion on zstd decompression [Eric]
   04: use _rm_test_img appropriately [Max]

v22:
   03: remove assignment in if condition

v21:
   03:
   * remove the loop on compression [Max]
   * use designated initializers [Max]
   04:
   * don't erase user's options [Max]
   * use _rm_test_img [Max]
   * add unsupported qcow2 options [Max]

v20:
   04: fix a number of flaws [Vladimir]
   * don't use $RAND_FILE passing to qemu-io,
 so check $TEST_DIR is redundant
   * re-arrange $RAND_FILE writing
   * fix a typo

v19:
   04: fix a number of flaws [Eric]
   * remove redundant test case descriptions
   * fix stdout redirect
   * don't use (())
   * use peek_file_be instead of od
   * check $TEST_DIR for spaces and other before using
   * use $RAND_FILE safer

v18:
   * 04: add quotes to all file name variables [Vladimir] 
   * 04: add Vladimir's comment according to "qemu-io write -s"
 option issue.

v17:
   * 03: remove incorrect comment in zstd decompress [Vladimir]
   * 03: remove "paraniod" and rewrite the comment on decompress [Vladimir]
   * 03: fix dead assignment [Vladimir]
   * 04: add and remove quotes [Vladimir]
   * 04: replace long offset form with the short one [Vladimir]

v16:
   * 03: ssize_t for ret, size_t for zstd_ret [Vladimir]
   * 04: small fixes according to the comments [Vladimir] 

v15:
   * 01: aiming qemu 5.1 [Eric]
   * 03: change zstd_res definition place [Vladimir]
   * 04: add two new test cases [Eric]
 1. test adjacent cluster compression with zstd
 2. test incompressible cluster processing
   * 03, 04: many rewordings and grammar fixes [Eric]

v14:
   * fix bug on compression - looping until compress == 0 [Me]
   * apply reworked Vladimir's suggestions:
  1. not mixing ssize_t with size_t
  2. safe check for ENOMEM in compression part - avoid overflow
  3. tolerate sanity check allow zstd to make progress only
 on one of the buffers
v13:
   * 03: add progress sanity check to decompression loop [Vladimir]
 03: add successful decompression check [Me]

v12:
   * 03: again, rework compression and decompression loops
 to make them more correct [Vladimir]
 03: move assert in compression to more appropriate place
 [Vladimir]
v11:
   * 03: the loops don't need "do{}while" form anymore and
 they were buggy (missed "do" at the beginning);
 replace them with usual "while(){}" loops [Vladimir]
v10:
   * 03: fix zstd (de)compressed loops for multi-frame
 cases [Vladimir]
v9:
   * 01: fix error checking and reporting in qcow2_amend compression type part 
[Vladimir]
   * 03: replace asserts with -EIO in qcow2_zstd_decompression [Vladimir, 
Alberto]
   * 03: reword/amend/add comments, fix typos [Vladimir]

v8:
   * 03: switch zstd API from simple to stream [Eric]
 No need to state a special cluster layout for zstd
 compressed clusters.
v7:
   * use qapi_enum_parse instead of the open-coding [Eric]
   * fix wording, typos and spelling [Eric]

v6:
   * "block/qcow2-threads: fix qcow2_decompress" is removed from the series
  since it has been accepted by Max already
   * add compile time checking for Qcow2Header to be a multiple of 8 [Max, 
Alberto]
   * report error on qcow2 amending when the compression type is actually 
changed [Max]
   * remove the extra space and the extra new line [Max]
   * re-arrange acks and signed-off-s [Vladimir]

v5:
   * replace -ENOTSUP with abort in qcow2_co_decompress [Vladimir]
   * set cluster size for all test cases in the beginning of the 287 test

v4:
   * the series is rebased on top of 01 "block/qcow2-threads: fix 
qcow2_decompress"
   * 01 is just a no-change resend to avoid extra dependencies. Still, it may 
be merged in separate

v3:
   * remove redundant max compression type value check [Vladimir, Eric]
 (the switch below checks everything)
   * prevent compression type changing on "qemu-img amend" [Vladimir]
   * remove zstd config setting, since it has been added already by
 "migration" patches [Vladimir]
   * change the compression type error message [Vladimir] 
   * fix alignment and 80-chars exceeding [Vladimir]

v2:
   * rework compression type setting [Vladimir]
   * squash iotest changes to the compression type introduction patch 
[Vladimir, Eric]
   * fix zstd availability checking in zstd iotest [Vladimir]
   * remove unnecessary casting [Eric]
   * remove redundant checks [Eric]
   * fix compressed cluster layout in qcow2 spec [Vladimir]
   * fix wording [Eric, Vladimir]
   * fix compression type filtering in iotests [Eric]

v1:
   the initial s

[PATCH v24 2/4] qcow2: rework the cluster compression routine

2020-05-07 Thread Denis Plotnikov
The patch enables processing of the compression type defined
for the image and chooses the appropriate method for
(de)compressing image clusters.

Signed-off-by: Denis Plotnikov 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Alberto Garcia 
Reviewed-by: Max Reitz 
---
 block/qcow2-threads.c | 71 ---
 1 file changed, 60 insertions(+), 11 deletions(-)

diff --git a/block/qcow2-threads.c b/block/qcow2-threads.c
index a68126f291..7dbaf53489 100644
--- a/block/qcow2-threads.c
+++ b/block/qcow2-threads.c
@@ -74,7 +74,9 @@ typedef struct Qcow2CompressData {
 } Qcow2CompressData;
 
 /*
- * qcow2_compress()
+ * qcow2_zlib_compress()
+ *
+ * Compress @src_size bytes of data using zlib compression method
  *
  * @dest - destination buffer, @dest_size bytes
  * @src - source buffer, @src_size bytes
@@ -83,8 +85,8 @@ typedef struct Qcow2CompressData {
  *  -ENOMEM destination buffer is not enough to store compressed data
  *  -EIOon any other error
  */
-static ssize_t qcow2_compress(void *dest, size_t dest_size,
-  const void *src, size_t src_size)
+static ssize_t qcow2_zlib_compress(void *dest, size_t dest_size,
+   const void *src, size_t src_size)
 {
 ssize_t ret;
 z_stream strm;
@@ -119,10 +121,10 @@ static ssize_t qcow2_compress(void *dest, size_t 
dest_size,
 }
 
 /*
- * qcow2_decompress()
+ * qcow2_zlib_decompress()
  *
  * Decompress some data (not more than @src_size bytes) to produce exactly
- * @dest_size bytes.
+ * @dest_size bytes using zlib compression method
  *
  * @dest - destination buffer, @dest_size bytes
  * @src - source buffer, @src_size bytes
@@ -130,8 +132,8 @@ static ssize_t qcow2_compress(void *dest, size_t dest_size,
  * Returns: 0 on success
  *  -EIO on fail
  */
-static ssize_t qcow2_decompress(void *dest, size_t dest_size,
-const void *src, size_t src_size)
+static ssize_t qcow2_zlib_decompress(void *dest, size_t dest_size,
+ const void *src, size_t src_size)
 {
 int ret;
 z_stream strm;
@@ -191,20 +193,67 @@ qcow2_co_do_compress(BlockDriverState *bs, void *dest, 
size_t dest_size,
 return arg.ret;
 }
 
+/*
+ * qcow2_co_compress()
+ *
+ * Compress @src_size bytes of data using the compression
+ * method defined by the image compression type
+ *
+ * @dest - destination buffer, @dest_size bytes
+ * @src - source buffer, @src_size bytes
+ *
+ * Returns: compressed size on success
+ *  a negative error code on failure
+ */
 ssize_t coroutine_fn
 qcow2_co_compress(BlockDriverState *bs, void *dest, size_t dest_size,
   const void *src, size_t src_size)
 {
-return qcow2_co_do_compress(bs, dest, dest_size, src, src_size,
-qcow2_compress);
+BDRVQcow2State *s = bs->opaque;
+Qcow2CompressFunc fn;
+
+switch (s->compression_type) {
+case QCOW2_COMPRESSION_TYPE_ZLIB:
+fn = qcow2_zlib_compress;
+break;
+
+default:
+abort();
+}
+
+return qcow2_co_do_compress(bs, dest, dest_size, src, src_size, fn);
 }
 
+/*
+ * qcow2_co_decompress()
+ *
+ * Decompress some data (not more than @src_size bytes) to produce exactly
+ * @dest_size bytes using the compression method defined by the image
+ * compression type
+ *
+ * @dest - destination buffer, @dest_size bytes
+ * @src - source buffer, @src_size bytes
+ *
+ * Returns: 0 on success
+ *  a negative error code on failure
+ */
 ssize_t coroutine_fn
 qcow2_co_decompress(BlockDriverState *bs, void *dest, size_t dest_size,
 const void *src, size_t src_size)
 {
-return qcow2_co_do_compress(bs, dest, dest_size, src, src_size,
-qcow2_decompress);
+BDRVQcow2State *s = bs->opaque;
+Qcow2CompressFunc fn;
+
+switch (s->compression_type) {
+case QCOW2_COMPRESSION_TYPE_ZLIB:
+fn = qcow2_zlib_decompress;
+break;
+
+default:
+abort();
+}
+
+return qcow2_co_do_compress(bs, dest, dest_size, src, src_size, fn);
 }
 
 
-- 
2.17.0




[PATCH v24 4/4] iotests: 287: add qcow2 compression type test

2020-05-07 Thread Denis Plotnikov
The test checks that the qcow2 requirements for the compression
type feature are fulfilled and that the zstd compression type works.

Signed-off-by: Denis Plotnikov 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Tested-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Eric Blake 
---
 tests/qemu-iotests/287 | 152 +
 tests/qemu-iotests/287.out |  67 
 tests/qemu-iotests/group   |   1 +
 3 files changed, 220 insertions(+)
 create mode 100755 tests/qemu-iotests/287
 create mode 100644 tests/qemu-iotests/287.out

diff --git a/tests/qemu-iotests/287 b/tests/qemu-iotests/287
new file mode 100755
index 00..f98a4cadc1
--- /dev/null
+++ b/tests/qemu-iotests/287
@@ -0,0 +1,152 @@
+#!/usr/bin/env bash
+#
+# Test case for an image using zstd compression
+#
+# Copyright (c) 2020 Virtuozzo International GmbH
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+
+# creator
+owner=dplotni...@virtuozzo.com
+
+seq="$(basename $0)"
+echo "QA output created by $seq"
+
+status=1   # failure is the default!
+
+# standard environment
+. ./common.rc
+. ./common.filter
+
+# This tests qcow2-specific low-level functionality
+_supported_fmt qcow2
+_supported_proto file
+_supported_os Linux
+_unsupported_imgopts 'compat=0.10' data_file
+
+COMPR_IMG="$TEST_IMG.compressed"
+RAND_FILE="$TEST_DIR/rand_data"
+
+_cleanup()
+{
+_cleanup_test_img
+_rm_test_img "$COMPR_IMG"
+rm -f "$RAND_FILE"
+}
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+# for all the cases
+CLUSTER_SIZE=65536
+
+# Check if we can run this test.
+if IMGOPTS='compression_type=zstd' _make_test_img 64M |
+grep "Invalid parameter 'zstd'"; then
+_notrun "ZSTD is disabled"
+fi
+
+echo
+echo "=== Testing compression type incompatible bit setting for zlib ==="
+echo
+_make_test_img -o compression_type=zlib 64M
+$PYTHON qcow2.py "$TEST_IMG" dump-header | grep incompatible_features
+
+echo
+echo "=== Testing compression type incompatible bit setting for zstd ==="
+echo
+_make_test_img -o compression_type=zstd 64M
+$PYTHON qcow2.py "$TEST_IMG" dump-header | grep incompatible_features
+
+echo
+echo "=== Testing zlib with incompatible bit set ==="
+echo
+_make_test_img -o compression_type=zlib 64M
+$PYTHON qcow2.py "$TEST_IMG" set-feature-bit incompatible 3
+# to make sure the bit was actually set
+$PYTHON qcow2.py "$TEST_IMG" dump-header | grep incompatible_features
+
+if $QEMU_IMG info "$TEST_IMG" >/dev/null 2>&1 ; then
+echo "Error: The image opened successfully. The image must not be opened."
+fi
+
+echo
+echo "=== Testing zstd with incompatible bit unset ==="
+echo
+_make_test_img -o compression_type=zstd 64M
+$PYTHON qcow2.py "$TEST_IMG" set-header incompatible_features 0
+# to make sure the bit was actually unset
+$PYTHON qcow2.py "$TEST_IMG" dump-header | grep incompatible_features
+
+if $QEMU_IMG info "$TEST_IMG" >/dev/null 2>&1 ; then
+echo "Error: The image opened successfully. The image must not be opened."
+fi
+
+echo
+echo "=== Testing compression type values ==="
+echo
+# zlib=0
+_make_test_img -o compression_type=zlib 64M
+peek_file_be "$TEST_IMG" 104 1
+echo
+
+# zstd=1
+_make_test_img -o compression_type=zstd 64M
+peek_file_be "$TEST_IMG" 104 1
+echo
+
+echo
+echo "=== Testing simple reading and writing with zstd ==="
+echo
+_make_test_img -o compression_type=zstd 64M
+$QEMU_IO -c "write -c -P 0xAC 64K 64K " "$TEST_IMG" | _filter_qemu_io
+$QEMU_IO -c "read -P 0xAC 64K 64K " "$TEST_IMG" | _filter_qemu_io
+# read on the cluster boundaries
+$QEMU_IO -c "read -v 131070 8 " "$TEST_IMG" | _filter_qemu_io
+$QEMU_IO -c "read -v 65534 8" "$TEST_IMG" | _filter_qemu_io
+
+echo
+echo "=== Testing adjacent clusters reading and writing with zstd ==="
+echo
+_make_test_img -o compression_type=zstd 64M
+$QEMU_IO -c "write -c -P 0xAB 0 64K " "$TEST_IMG" | _filter_qemu_io
+$QEMU_IO -c "write -c -P 0xAC 64K 64K " "$TEST_IMG" | _filter_qemu_io
+$QEMU_IO -c "write -c -P 0xAD 128K 64K " "$T

Re: [PATCH v23 0/4] implement zstd cluster compression method

2020-05-06 Thread Denis Plotnikov




On 05.05.2020 15:03, Max Reitz wrote:

On 05.05.20 12:26, Max Reitz wrote:

On 30.04.20 12:19, Denis Plotnikov wrote:

v23:
Undecided: whether to add zstd(zlib) compression
   details to the qcow2 spec
03: tighten assertion on zstd decompression [Eric]
04: use _rm_test_img appropriately [Max]

Thanks, applied to my block branch:

I’m afraid I have to unqueue this series again, because it makes many
iotests fail due to an additional “compression_type=zlib” output when
images are created, an additional “compression type” line in
qemu-img info output where format-specific information is not
suppressed, and an additional line in qemu-img create -f qcow2 -o help.

Max



Hmm, this is strange. I made some modifications for the tests
in 0001 of the series (qcow2: introduce compression type feature).

Among the other test-related changes, the patch contains this hunk:

+++ b/tests/qemu-iotests/common.filter
@@ -152,7 +152,8 @@ _filter_img_create()
 -e "s# refcount_bits=[0-9]\\+##g" \
 -e "s# key-secret=[a-zA-Z0-9]\\+##g" \
 -e "s# iter-time=[0-9]\\+##g" \
-    -e "s# force_size=\\(on\\|off\\)##g"
+    -e "s# force_size=\\(on\\|off\\)##g" \
+    -e "s# compression_type=[a-zA-Z0-9]\\+##g"
 }

which has to filter out "compression_type" on image creation.

But you say that you can see "compression_type" on image creation.
Maybe the patch wasn't fully applied? Or were the test-related modifications
omitted?


I've just re-based the series on top of:

681b07f4e76dbb700c16918d (vanilla/master, mainstream)
Merge: a2261b2754 714eb0dbc5
Author: Peter Maydell 
Date:   Tue May 5 15:47:44 2020 +0100

and run the tests with 'make check-block'

and got the following:

Not run: 071 099 184 220 259 267
Some cases not run in: 030 040 041
Passed all 113 iotests

Maybe I'm doing something wrong?

Denis



Re: [PATCH v22 3/4] qcow2: add zstd cluster compression

2020-04-30 Thread Denis Plotnikov




On 30.04.2020 14:47, Max Reitz wrote:

On 30.04.20 11:48, Denis Plotnikov wrote:


On 30.04.2020 11:26, Max Reitz wrote:

On 29.04.20 15:02, Vladimir Sementsov-Ogievskiy wrote:

29.04.2020 15:17, Max Reitz wrote:

On 29.04.20 12:37, Vladimir Sementsov-Ogievskiy wrote:

29.04.2020 13:24, Max Reitz wrote:

On 28.04.20 22:00, Denis Plotnikov wrote:

zstd significantly reduces cluster compression time.
It provides better compression performance maintaining
the same level of the compression ratio in comparison with
zlib, which, at the moment, is the only compression
method available.

The performance test results:
The test compresses and decompresses a qemu qcow2 image with a freshly
installed rhel-7.6 guest.
Image cluster size: 64K. Image on disk size: 2.2G

The test was conducted with a brd disk to reduce the influence
of the disk subsystem on the test results.
The results are given in seconds.

compress cmd:
  time ./qemu-img convert -O qcow2 -c -o
compression_type=[zlib|zstd]
  src.img [zlib|zstd]_compressed.img
decompress cmd
  time ./qemu-img convert -O qcow2
  [zlib|zstd]_compressed.img uncompressed.img

   compression   decompression
     zlib   zstd   zlib zstd

real 65.5   16.3 (-75 %)    1.9  1.6 (-16 %)
user 65.0   15.8    5.3  2.5
sys   3.3    0.2    2.0  2.0

Both ZLIB and ZSTD gave the same compression ratio: 1.57
compressed image size in both cases: 1.4G

Signed-off-by: Denis Plotnikov 
QAPI part:
Acked-by: Markus Armbruster 
---
     docs/interop/qcow2.txt |   1 +
     configure  |   2 +-
     qapi/block-core.json   |   3 +-
     block/qcow2-threads.c  | 169
+
     block/qcow2.c  |   7 ++
     slirp  |   2 +-
     6 files changed, 181 insertions(+), 3 deletions(-)

[...]


diff --git a/block/qcow2-threads.c b/block/qcow2-threads.c
index 7dbaf53489..a0b12e1b15 100644
--- a/block/qcow2-threads.c
+++ b/block/qcow2-threads.c

[...]


+static ssize_t qcow2_zstd_decompress(void *dest, size_t dest_size,
+ const void *src, size_t
src_size)
+{

[...]


+    /*
+ * The compressed stream from the input buffer may consist of
more
+ * than one zstd frame.

Can it?

If not, we must require it in the specification.

Actually, now that you mention it, it would make sense anyway to add
some note to the specification on what exactly compressed with zstd
means.


Hmm. If at some point we want multi-threaded compression of one big (2M)
cluster... Could this be implemented with the zstd lib? If multiple frames
are allowed, will allowing multiple frames help? I don't know, actually, but
I think it's better not to forbid it. On the other hand, I don't see any benefit in large
compressed clusters. At least, in our scenarios (for compressed
backups)
we use 64k compressed clusters, for good granularity of incremental
backups (when for running vm we use 1M clusters).

Is it really that important?  Naïvely, it sounds rather complicated to
introduce multithreading into block drivers.

It is already here: compression and encryption are already multithreaded.
But of course, one cluster is handled in one thread.

Ah, good.  I forgot.


(Also, as for compression, it can only be used in backup scenarios
anyway, where you write many clusters at once.  So parallelism on the
cluster level should be sufficient to get high usage, and it would benefit
all compression types and cluster sizes.)


Yes it works in this way already :)

Well, OK then.


So, we don't know whether we want a one-frame restriction or not. Do you have a
preference?

*shrug*

Seems like it would be preferable to allow multiple frames still.  A
note in the spec would be nice (i.e., streaming format, multiple frames
per cluster possible).
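A minimal sketch of that streaming behaviour (illustrative only, not the
code from this series; it assumes the classic ZSTD_DStream interface):
ZSTD_decompressStream() returns 0 each time a frame ends, and calling it
again with the remaining input simply starts the next frame, so the loop
only has to run until the expected output size has been produced.

/* Hedged sketch: decompress src_size bytes that may contain one or more
 * zstd frames into exactly dest_size bytes. */
#include <zstd.h>
#include <errno.h>
#include <sys/types.h>

static ssize_t example_zstd_decompress(void *dest, size_t dest_size,
                                       const void *src, size_t src_size)
{
    ZSTD_DStream *dstream = ZSTD_createDStream();
    ZSTD_outBuffer output = { .dst = dest, .size = dest_size, .pos = 0 };
    ZSTD_inBuffer input = { .src = src, .size = src_size, .pos = 0 };
    ssize_t ret = 0;

    if (!dstream) {
        return -EIO;
    }
    ZSTD_initDStream(dstream);

    while (output.pos < output.size) {
        size_t last_in = input.pos, last_out = output.pos;
        size_t zret = ZSTD_decompressStream(dstream, &output, &input);

        if (ZSTD_isError(zret)) {
            ret = -EIO;
            break;
        }
        /*
         * zret == 0 means a frame ended; the next iteration continues
         * with the following frame, if any.  Bail out when no progress
         * is made, so a truncated input cannot loop forever.
         */
        if (input.pos == last_in && output.pos == last_out) {
            ret = -EIO;
            break;
        }
    }

    ZSTD_freeDStream(dstream);
    return ret;
}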

We don't mention anything about zlib compressing details in the spec.

Yep.  Which is bad enough.


If we mention anything about ZSTD compressing details we'll have to do
it for
zlib as well.

I personally don’t like “If you can’t make it perfect, you shouldn’t do
it at all”.  Mentioning it for zstd would still be an improvement.


Good approach. I like it. But I'm trying to be cautious.


Also, I believe that “zlib compression” is pretty much unambiguous,
considering all the code does is call deflate()/inflate().

I’m not saying we need to amend the spec in this series, but I don’t see
a good argument against doing so at some point.
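For comparison, "calling deflate()/inflate()" boils down to roughly the
following (a hedged sketch only; the raw-deflate parameters passed to
deflateInit2() here are assumptions for illustration, not a statement of
what qcow2 actually uses):

/* Hedged sketch: one-shot compression of a buffer with zlib's deflate(). */
#define ZLIB_CONST
#include <zlib.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>

static ssize_t example_deflate(void *dest, size_t dest_size,
                               const void *src, size_t src_size)
{
    z_stream strm;
    ssize_t ret;

    memset(&strm, 0, sizeof(strm));
    /* Raw deflate stream (negative windowBits); parameters assumed. */
    if (deflateInit2(&strm, Z_DEFAULT_COMPRESSION, Z_DEFLATED,
                     -12, 9, Z_DEFAULT_STRATEGY) != Z_OK) {
        return -EIO;
    }

    strm.next_in = src;
    strm.avail_in = src_size;
    strm.next_out = dest;
    strm.avail_out = dest_size;

    /*
     * Z_FINISH compresses everything in one call; any result other than
     * Z_STREAM_END means the output buffer was too small.
     */
    ret = (deflate(&strm, Z_FINISH) == Z_STREAM_END)
          ? (ssize_t)(dest_size - strm.avail_out) : -ENOMEM;

    deflateEnd(&strm);
    return ret;
}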


So, I think we have two possibilities for the spec:
1. mention for both
2. don't mention at all

I think the 2nd is better. It gives more degrees of freedom for
future improvements.

No, it absolutely doesn’t.  There is a de-facto format, namely what qemu
accepts.  Changing that would be an incompatible change.  Just because
we don’t write what’s allowed into the spec doesn’t change

[PATCH v23 1/4] qcow2: introduce compression type feature

2020-04-30 Thread Denis Plotnikov
The patch adds some preparation parts for the incompatible compression type
feature to qcow2, allowing the use of different compression methods for
(de)compressing image clusters.

It is implied that the compression type is set on the image creation and
can be changed only later by image conversion, thus compression type
defines the only compression algorithm used for the image, and thus,
for all image clusters.

The goal of the feature is to add support for other compression methods
to qcow2, for example ZSTD, which is more efficient at compression than ZLIB.

The default compression is ZLIB. Images created with ZLIB compression type
are backward compatible with older qemu versions.
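As a rough illustration of where the new field ends up on disk, here is a
hedged sketch, not code from this patch; the byte offset (104) and the
incompatible-feature bit (bit 3) match the definitions added later in this
patch and checked by the accompanying iotest.

/* Hedged sketch: read the compression type from a raw qcow2 v3 header. */
#include <stdint.h>
#include <stddef.h>

enum { EXAMPLE_COMPRESSION_ZLIB = 0, EXAMPLE_COMPRESSION_ZSTD = 1 };

static int example_header_compression_type(const uint8_t *hdr, size_t len)
{
    uint64_t incompatible = 0;
    size_t i;

    if (len < 105) {
        /* Short (pre-extension) header: zlib is the implied default. */
        return EXAMPLE_COMPRESSION_ZLIB;
    }
    /* incompatible_features is a big-endian u64 at byte offset 72. */
    for (i = 0; i < 8; i++) {
        incompatible = (incompatible << 8) | hdr[72 + i];
    }
    if (!(incompatible & (1ULL << 3))) {
        /* Compression-type bit not set: a plain zlib image. */
        return EXAMPLE_COMPRESSION_ZLIB;
    }
    return hdr[104]; /* 0 = zlib, 1 = zstd */
}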

Adding the compression type breaks a number of tests because now the
compression type is reported on image creation and there are some changes
in the qcow2 header in size and offsets.

The tests are fixed in the following ways:
* filter out compression_type for many tests
* fix header size, feature table size and backing file offset
  affected tests: 031, 036, 061, 080
  header_size +=8: 1 byte compression type
   7 bytes padding
  feature_table += 48: incompatible feature compression type
  backing_file_offset += 56 (8 + 48 -> header_change + feature_table_change)
* add "compression type" for test output matching when it isn't filtered
  affected tests: 049, 060, 061, 065, 144, 182, 242, 255

Signed-off-by: Denis Plotnikov 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Eric Blake 
Reviewed-by: Max Reitz 
QAPI part:
Acked-by: Markus Armbruster 
---
 qapi/block-core.json |  22 +-
 block/qcow2.h|  20 +-
 include/block/block_int.h|   1 +
 block/qcow2.c| 113 +++
 tests/qemu-iotests/031.out   |  14 ++--
 tests/qemu-iotests/036.out   |   4 +-
 tests/qemu-iotests/049.out   | 102 ++--
 tests/qemu-iotests/060.out   |   1 +
 tests/qemu-iotests/061.out   |  34 ++
 tests/qemu-iotests/065   |  28 +---
 tests/qemu-iotests/080   |   2 +-
 tests/qemu-iotests/144.out   |   4 +-
 tests/qemu-iotests/182.out   |   2 +-
 tests/qemu-iotests/242.out   |   5 ++
 tests/qemu-iotests/255.out   |   8 +--
 tests/qemu-iotests/common.filter |   3 +-
 16 files changed, 267 insertions(+), 96 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 943df1926a..1522e2983f 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -78,6 +78,8 @@
 #
 # @bitmaps: A list of qcow2 bitmap details (since 4.0)
 #
+# @compression-type: the image cluster compression method (since 5.1)
+#
 # Since: 1.7
 ##
 { 'struct': 'ImageInfoSpecificQCow2',
@@ -89,7 +91,8 @@
   '*corrupt': 'bool',
   'refcount-bits': 'int',
   '*encrypt': 'ImageInfoSpecificQCow2Encryption',
-  '*bitmaps': ['Qcow2BitmapInfo']
+  '*bitmaps': ['Qcow2BitmapInfo'],
+  'compression-type': 'Qcow2CompressionType'
   } }
 
 ##
@@ -4284,6 +4287,18 @@
   'data': [ 'v2', 'v3' ] }
 
 
+##
+# @Qcow2CompressionType:
+#
+# Compression type used in qcow2 image file
+#
+# @zlib: zlib compression, see <http://zlib.net/>
+#
+# Since: 5.1
+##
+{ 'enum': 'Qcow2CompressionType',
+  'data': [ 'zlib' ] }
+
 ##
 # @BlockdevCreateOptionsQcow2:
 #
@@ -4307,6 +4322,8 @@
 # allowed values: off, falloc, full, metadata)
 # @lazy-refcounts: True if refcounts may be updated lazily (default: off)
 # @refcount-bits: Width of reference counts in bits (default: 16)
+# @compression-type: The image cluster compression method
+#(default: zlib, since 5.1)
 #
 # Since: 2.12
 ##
@@ -4322,7 +4339,8 @@
 '*cluster-size':'size',
 '*preallocation':   'PreallocMode',
 '*lazy-refcounts':  'bool',
-'*refcount-bits':   'int' } }
+'*refcount-bits':   'int',
+'*compression-type':'Qcow2CompressionType' } }
 
 ##
 # @BlockdevCreateOptionsQed:
diff --git a/block/qcow2.h b/block/qcow2.h
index f4de0a27d5..6a8b82e6cc 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -146,8 +146,16 @@ typedef struct QCowHeader {
 
 uint32_t refcount_order;
 uint32_t header_length;
+
+/* Additional fields */
+uint8_t compression_type;
+
+/* header must be a multiple of 8 */
+uint8_t padding[7];
 } QEMU_PACKED QCowHeader;
 
+QEMU_BUILD_BUG_ON(!QEMU_IS_ALIGNED(sizeof(QCowHeader), 8));
+
 typedef struct QEMU_PACKED QCowSnapshotHeader {
 /* header is 8 byte aligned */
 uint64_t l1_table_offset;
@@ -216,13 +224,16 @@ enum {
 QCOW2_INCOMPAT_DIRTY_BITNR  = 0,
 QCOW2_INCOMPAT_CORRUPT_BITNR= 1,
 QCOW2_INCOMPAT_DATA_FILE_BITNR  = 2,
+QCOW2_INCOMPAT_COMPRESSION_BITNR = 3,
 QCOW2_INCOMPAT_DIRTY= 1 << QCOW2_INCOMPAT_DIRTY_BITNR,
QCOW2_INCOMPAT_CORRUPT  = 1 << QCOW2_INCOMPAT_CORRUPT_BITNR,

[PATCH v23 2/4] qcow2: rework the cluster compression routine

2020-04-30 Thread Denis Plotnikov
The patch enables processing of the compression type defined
for the image and chooses an appropriate method for (de)compressing
image clusters.

Signed-off-by: Denis Plotnikov 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Alberto Garcia 
Reviewed-by: Max Reitz 
---
 block/qcow2-threads.c | 71 ---
 1 file changed, 60 insertions(+), 11 deletions(-)

diff --git a/block/qcow2-threads.c b/block/qcow2-threads.c
index a68126f291..7dbaf53489 100644
--- a/block/qcow2-threads.c
+++ b/block/qcow2-threads.c
@@ -74,7 +74,9 @@ typedef struct Qcow2CompressData {
 } Qcow2CompressData;
 
 /*
- * qcow2_compress()
+ * qcow2_zlib_compress()
+ *
+ * Compress @src_size bytes of data using zlib compression method
  *
  * @dest - destination buffer, @dest_size bytes
  * @src - source buffer, @src_size bytes
@@ -83,8 +85,8 @@ typedef struct Qcow2CompressData {
  *  -ENOMEM destination buffer is not enough to store compressed data
  *  -EIOon any other error
  */
-static ssize_t qcow2_compress(void *dest, size_t dest_size,
-  const void *src, size_t src_size)
+static ssize_t qcow2_zlib_compress(void *dest, size_t dest_size,
+   const void *src, size_t src_size)
 {
 ssize_t ret;
 z_stream strm;
@@ -119,10 +121,10 @@ static ssize_t qcow2_compress(void *dest, size_t 
dest_size,
 }
 
 /*
- * qcow2_decompress()
+ * qcow2_zlib_decompress()
  *
  * Decompress some data (not more than @src_size bytes) to produce exactly
- * @dest_size bytes.
+ * @dest_size bytes using zlib compression method
  *
  * @dest - destination buffer, @dest_size bytes
  * @src - source buffer, @src_size bytes
@@ -130,8 +132,8 @@ static ssize_t qcow2_compress(void *dest, size_t dest_size,
  * Returns: 0 on success
  *  -EIO on fail
  */
-static ssize_t qcow2_decompress(void *dest, size_t dest_size,
-const void *src, size_t src_size)
+static ssize_t qcow2_zlib_decompress(void *dest, size_t dest_size,
+ const void *src, size_t src_size)
 {
 int ret;
 z_stream strm;
@@ -191,20 +193,67 @@ qcow2_co_do_compress(BlockDriverState *bs, void *dest, 
size_t dest_size,
 return arg.ret;
 }
 
+/*
+ * qcow2_co_compress()
+ *
+ * Compress @src_size bytes of data using the compression
+ * method defined by the image compression type
+ *
+ * @dest - destination buffer, @dest_size bytes
+ * @src - source buffer, @src_size bytes
+ *
+ * Returns: compressed size on success
+ *  a negative error code on failure
+ */
 ssize_t coroutine_fn
 qcow2_co_compress(BlockDriverState *bs, void *dest, size_t dest_size,
   const void *src, size_t src_size)
 {
-return qcow2_co_do_compress(bs, dest, dest_size, src, src_size,
-qcow2_compress);
+BDRVQcow2State *s = bs->opaque;
+Qcow2CompressFunc fn;
+
+switch (s->compression_type) {
+case QCOW2_COMPRESSION_TYPE_ZLIB:
+fn = qcow2_zlib_compress;
+break;
+
+default:
+abort();
+}
+
+return qcow2_co_do_compress(bs, dest, dest_size, src, src_size, fn);
 }
 
+/*
+ * qcow2_co_decompress()
+ *
+ * Decompress some data (not more than @src_size bytes) to produce exactly
+ * @dest_size bytes using the compression method defined by the image
+ * compression type
+ *
+ * @dest - destination buffer, @dest_size bytes
+ * @src - source buffer, @src_size bytes
+ *
+ * Returns: 0 on success
+ *  a negative error code on failure
+ */
 ssize_t coroutine_fn
 qcow2_co_decompress(BlockDriverState *bs, void *dest, size_t dest_size,
 const void *src, size_t src_size)
 {
-return qcow2_co_do_compress(bs, dest, dest_size, src, src_size,
-qcow2_decompress);
+BDRVQcow2State *s = bs->opaque;
+Qcow2CompressFunc fn;
+
+switch (s->compression_type) {
+case QCOW2_COMPRESSION_TYPE_ZLIB:
+fn = qcow2_zlib_decompress;
+break;
+
+default:
+abort();
+}
+
+return qcow2_co_do_compress(bs, dest, dest_size, src, src_size, fn);
 }
 
 
-- 
2.17.0




[PATCH v23 0/4] implement zstd cluster compression method

2020-04-30 Thread Denis Plotnikov
v23:
   Undecided: whether to add zstd(zlib) compression
  details to the qcow2 spec
   03: tighten assertion on zstd decompression [Eric]
   04: use _rm_test_img appropriately [Max]

v22:
   03: remove assignment in if condition

v21:
   03:
   * remove the loop on compression [Max]
   * use designated initializers [Max]
   04:
   * don't erase user's options [Max]
   * use _rm_test_img [Max]
   * add unsupported qcow2 options [Max]

v20:
   04: fix a number of flaws [Vladimir]
   * don't use $RAND_FILE passing to qemu-io,
 so check $TEST_DIR is redundant
   * re-arrange $RAND_FILE writing
   * fix a typo

v19:
   04: fix a number of flaws [Eric]
   * remove redundant test case descriptions
   * fix stdout redirect
   * don't use (())
   * use peek_file_be instead of od
   * check $TEST_DIR for spaces and other before using
   * use $RAND_FILE safer

v18:
   * 04: add quotes to all file name variables [Vladimir] 
   * 04: add Vladimir's comment according to "qemu-io write -s"
 option issue.

v17:
   * 03: remove incorrect comment in zstd decompress [Vladimir]
   * 03: remove "paraniod" and rewrite the comment on decompress [Vladimir]
   * 03: fix dead assignment [Vladimir]
   * 04: add and remove quotes [Vladimir]
   * 04: replace long offset form with the short one [Vladimir]

v16:
   * 03: ssize_t for ret, size_t for zstd_ret [Vladimir]
   * 04: small fixes according to the comments [Vladimir] 

v15:
   * 01: aiming qemu 5.1 [Eric]
   * 03: change zstd_res definition place [Vladimir]
   * 04: add two new test cases [Eric]
 1. test adjacent cluster compression with zstd
 2. test incompressible cluster processing
   * 03, 04: much rewording and grammar fixing [Eric]

v14:
   * fix bug on compression - looping until compress == 0 [Me]
   * apply reworked Vladimir's suggestions:
  1. not mixing ssize_t with size_t
  2. safe check for ENOMEM in compression part - avoid overflow
  3. tolerate sanity check allow zstd to make progress only
 on one of the buffers
v13:
   * 03: add progress sanity check to decompression loop [Vladimir]
 03: add successful decompression check [Me]

v12:
   * 03: again, rework compression and decompression loops
 to make them more correct [Vladimir]
 03: move assert in compression to more appropriate place
 [Vladimir]
v11:
   * 03: the loops don't need "do{}while" form anymore and
 they were buggy (missed "do" in the beginning)
 replace them with usual "while(){}" loops [Vladimir]
v10:
   * 03: fix zstd (de)compressed loops for multi-frame
 cases [Vladimir]
v9:
   * 01: fix error checking and reporting in qcow2_amend compression type part 
[Vladimir]
   * 03: replace asserts with -EIO in qcow2_zstd_decompression [Vladimir, 
Alberto]
   * 03: reword/amend/add comments, fix typos [Vladimir]

v8:
   * 03: switch zstd API from simple to stream [Eric]
 No need to state a special cluster layout for zstd
 compressed clusters.
v7:
   * use qapi_enum_parse instead of the open-coding [Eric]
   * fix wording, typos and spelling [Eric]

v6:
   * "block/qcow2-threads: fix qcow2_decompress" is removed from the series
  since it has been accepted by Max already
   * add compile time checking for Qcow2Header to be a multiple of 8 [Max, 
Alberto]
   * report error on qcow2 amending when the compression type is actually 
changed [Max]
   * remove the extra space and the extra new line [Max]
   * re-arrange acks and signed-off-s [Vladimir]

v5:
   * replace -ENOTSUP with abort in qcow2_co_decompress [Vladimir]
   * set cluster size for all test cases in the beginning of the 287 test

v4:
   * the series is rebased on top of 01 "block/qcow2-threads: fix 
qcow2_decompress"
   * 01 is just a no-change resend to avoid extra dependencies. Still, it may 
be merged in separate

v3:
   * remove redundant max compression type value check [Vladimir, Eric]
 (the switch below checks everything)
   * prevent compression type changing on "qemu-img amend" [Vladimir]
   * remove zstd config setting, since it has been added already by
 "migration" patches [Vladimir]
   * change the compression type error message [Vladimir] 
   * fix alignment and 80-chars exceeding [Vladimir]

v2:
   * rework compression type setting [Vladimir]
   * squash iotest changes to the compression type introduction patch 
[Vladimir, Eric]
   * fix zstd availability checking in zstd iotest [Vladimir]
   * remove unnecessary casting [Eric]
   * remove redundant checks [Eric]
   * fix compressed cluster layout in qcow2 spec [Vladimir]
   * fix wording [Eric, Vladimir]
   * fix compression type filtering in iotests [Eric]

v1:
   the initial series

Denis Plotnikov (4):
  qcow2: introduce compression type feature
  qcow2: rework the cluster compression routine

[PATCH v23 3/4] qcow2: add zstd cluster compression

2020-04-30 Thread Denis Plotnikov
zstd significantly reduces cluster compression time.
It provides better compression performance maintaining
the same level of the compression ratio in comparison with
zlib, which, at the moment, is the only compression
method available.

The performance test results:
The test compresses and decompresses a qemu qcow2 image with a freshly
installed rhel-7.6 guest.
Image cluster size: 64K. Image on disk size: 2.2G

The test was conducted with a brd disk to reduce the influence
of the disk subsystem on the test results.
The results are given in seconds.

compress cmd:
  time ./qemu-img convert -O qcow2 -c -o compression_type=[zlib|zstd]
  src.img [zlib|zstd]_compressed.img
decompress cmd
  time ./qemu-img convert -O qcow2
  [zlib|zstd]_compressed.img uncompressed.img

           compression        decompression
         zlib    zstd        zlib    zstd

real     65.5    16.3 (-75 %)    1.9    1.6 (-16 %)
user     65.0    15.8            5.3    2.5
sys       3.3     0.2            2.0    2.0

Both ZLIB and ZSTD gave the same compression ratio: 1.57
compressed image size in both cases: 1.4G

Signed-off-by: Denis Plotnikov 
QAPI part:
Acked-by: Markus Armbruster 
---
 docs/interop/qcow2.txt |   1 +
 configure  |   2 +-
 qapi/block-core.json   |   3 +-
 block/qcow2-threads.c  | 169 +
 block/qcow2.c  |   7 ++
 5 files changed, 180 insertions(+), 2 deletions(-)

diff --git a/docs/interop/qcow2.txt b/docs/interop/qcow2.txt
index 640e0eca40..18a77f737e 100644
--- a/docs/interop/qcow2.txt
+++ b/docs/interop/qcow2.txt
@@ -209,6 +209,7 @@ version 2.
 
 Available compression type values:
 0: zlib <https://www.zlib.net/>
+1: zstd <http://github.com/facebook/zstd>
 
 
 === Header padding ===
diff --git a/configure b/configure
index 23b5e93752..4e3a1690ea 100755
--- a/configure
+++ b/configure
@@ -1861,7 +1861,7 @@ disabled with --disable-FEATURE, default is enabled if 
available:
   lzfse   support of lzfse compression library
   (for reading lzfse-compressed dmg images)
   zstdsupport for zstd compression library
-  (for migration compression)
+  (for migration compression and qcow2 cluster compression)
   seccomp seccomp support
   coroutine-pool  coroutine freelist (better performance)
   glusterfs   GlusterFS backend
diff --git a/qapi/block-core.json b/qapi/block-core.json
index 1522e2983f..6fbacddab2 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -4293,11 +4293,12 @@
 # Compression type used in qcow2 image file
 #
 # @zlib: zlib compression, see <http://zlib.net/>
+# @zstd: zstd compression, see <http://github.com/facebook/zstd>
 #
 # Since: 5.1
 ##
 { 'enum': 'Qcow2CompressionType',
-  'data': [ 'zlib' ] }
+  'data': [ 'zlib', { 'name': 'zstd', 'if': 'defined(CONFIG_ZSTD)' } ] }
 
 ##
 # @BlockdevCreateOptionsQcow2:
diff --git a/block/qcow2-threads.c b/block/qcow2-threads.c
index 7dbaf53489..1914baf456 100644
--- a/block/qcow2-threads.c
+++ b/block/qcow2-threads.c
@@ -28,6 +28,11 @@
 #define ZLIB_CONST
 #include 
 
+#ifdef CONFIG_ZSTD
+#include 
+#include 
+#endif
+
 #include "qcow2.h"
 #include "block/thread-pool.h"
 #include "crypto.h"
@@ -166,6 +171,160 @@ static ssize_t qcow2_zlib_decompress(void *dest, size_t 
dest_size,
 return ret;
 }
 
+#ifdef CONFIG_ZSTD
+
+/*
+ * qcow2_zstd_compress()
+ *
+ * Compress @src_size bytes of data using zstd compression method
+ *
+ * @dest - destination buffer, @dest_size bytes
+ * @src - source buffer, @src_size bytes
+ *
+ * Returns: compressed size on success
+ *  -ENOMEM destination buffer is not enough to store compressed data
+ *  -EIOon any other error
+ */
+static ssize_t qcow2_zstd_compress(void *dest, size_t dest_size,
+   const void *src, size_t src_size)
+{
+ssize_t ret;
+size_t zstd_ret;
+ZSTD_outBuffer output = {
+.dst = dest,
+.size = dest_size,
+.pos = 0
+};
+ZSTD_inBuffer input = {
+.src = src,
+.size = src_size,
+.pos = 0
+};
+ZSTD_CCtx *cctx = ZSTD_createCCtx();
+
+if (!cctx) {
+return -EIO;
+}
+/*
+ * Use the zstd streamed interface for symmetry with decompression,
+ * where streaming is essential since we don't record the exact
+ * compressed size.
+ *
+ * ZSTD_compressStream2() tries to compress everything it could
+ * with a single call. Although, ZSTD docs says that:
+ * "You must continue calling ZSTD_compressStream2() with ZSTD_e_end
+ * until it returns 0, at which point you are free to start a new frame",
+ * in our tests we saw the only case when it returned with

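For orientation, a single ZSTD_e_end call on the streaming compressor
typically looks like the following hedged sketch (not the patch's exact
code; it assumes the destination buffer either fits the whole frame or the
caller treats a non-zero return as "not enough space"):

/* Hedged sketch: compress a buffer with one ZSTD_e_end call. */
#include <zstd.h>
#include <errno.h>
#include <sys/types.h>

static ssize_t example_zstd_compress(void *dest, size_t dest_size,
                                     const void *src, size_t src_size)
{
    ZSTD_CCtx *cctx = ZSTD_createCCtx();
    ZSTD_outBuffer output = { .dst = dest, .size = dest_size, .pos = 0 };
    ZSTD_inBuffer input = { .src = src, .size = src_size, .pos = 0 };
    size_t zret;
    ssize_t ret;

    if (!cctx) {
        return -EIO;
    }

    zret = ZSTD_compressStream2(cctx, &output, &input, ZSTD_e_end);

    if (ZSTD_isError(zret)) {
        ret = -EIO;          /* compression error */
    } else if (zret > 0) {
        ret = -ENOMEM;       /* frame not finished: dest too small */
    } else {
        ret = output.pos;    /* frame complete and flushed */
    }

    ZSTD_freeCCtx(cctx);
    return ret;
}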
[PATCH v23 4/4] iotests: 287: add qcow2 compression type test

2020-04-30 Thread Denis Plotnikov
The test checks that the qcow2 requirements for the compression type
feature are fulfilled and that the zstd compression type works.

Signed-off-by: Denis Plotnikov 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Tested-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Eric Blake 
---
 tests/qemu-iotests/287 | 152 +
 tests/qemu-iotests/287.out |  67 
 tests/qemu-iotests/group   |   1 +
 3 files changed, 220 insertions(+)
 create mode 100755 tests/qemu-iotests/287
 create mode 100644 tests/qemu-iotests/287.out

diff --git a/tests/qemu-iotests/287 b/tests/qemu-iotests/287
new file mode 100755
index 00..f98a4cadc1
--- /dev/null
+++ b/tests/qemu-iotests/287
@@ -0,0 +1,152 @@
+#!/usr/bin/env bash
+#
+# Test case for an image using zstd compression
+#
+# Copyright (c) 2020 Virtuozzo International GmbH
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+
+# creator
+owner=dplotni...@virtuozzo.com
+
+seq="$(basename $0)"
+echo "QA output created by $seq"
+
+status=1   # failure is the default!
+
+# standard environment
+. ./common.rc
+. ./common.filter
+
+# This tests qcow2-specific low-level functionality
+_supported_fmt qcow2
+_supported_proto file
+_supported_os Linux
+_unsupported_imgopts 'compat=0.10' data_file
+
+COMPR_IMG="$TEST_IMG.compressed"
+RAND_FILE="$TEST_DIR/rand_data"
+
+_cleanup()
+{
+_cleanup_test_img
+_rm_test_img "$COMPR_IMG"
+rm -f "$RAND_FILE"
+}
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+# for all the cases
+CLUSTER_SIZE=65536
+
+# Check if we can run this test.
+if IMGOPTS='compression_type=zstd' _make_test_img 64M |
+grep "Invalid parameter 'zstd'"; then
+_notrun "ZSTD is disabled"
+fi
+
+echo
+echo "=== Testing compression type incompatible bit setting for zlib ==="
+echo
+_make_test_img -o compression_type=zlib 64M
+$PYTHON qcow2.py "$TEST_IMG" dump-header | grep incompatible_features
+
+echo
+echo "=== Testing compression type incompatible bit setting for zstd ==="
+echo
+_make_test_img -o compression_type=zstd 64M
+$PYTHON qcow2.py "$TEST_IMG" dump-header | grep incompatible_features
+
+echo
+echo "=== Testing zlib with incompatible bit set ==="
+echo
+_make_test_img -o compression_type=zlib 64M
+$PYTHON qcow2.py "$TEST_IMG" set-feature-bit incompatible 3
+# to make sure the bit was actually set
+$PYTHON qcow2.py "$TEST_IMG" dump-header | grep incompatible_features
+
+if $QEMU_IMG info "$TEST_IMG" >/dev/null 2>&1 ; then
+echo "Error: The image opened successfully. The image must not be opened."
+fi
+
+echo
+echo "=== Testing zstd with incompatible bit unset ==="
+echo
+_make_test_img -o compression_type=zstd 64M
+$PYTHON qcow2.py "$TEST_IMG" set-header incompatible_features 0
+# to make sure the bit was actually unset
+$PYTHON qcow2.py "$TEST_IMG" dump-header | grep incompatible_features
+
+if $QEMU_IMG info "$TEST_IMG" >/dev/null 2>&1 ; then
+echo "Error: The image opened successfully. The image must not be opened."
+fi
+
+echo
+echo "=== Testing compression type values ==="
+echo
+# zlib=0
+_make_test_img -o compression_type=zlib 64M
+peek_file_be "$TEST_IMG" 104 1
+echo
+
+# zstd=1
+_make_test_img -o compression_type=zstd 64M
+peek_file_be "$TEST_IMG" 104 1
+echo
+
+echo
+echo "=== Testing simple reading and writing with zstd ==="
+echo
+_make_test_img -o compression_type=zstd 64M
+$QEMU_IO -c "write -c -P 0xAC 64K 64K " "$TEST_IMG" | _filter_qemu_io
+$QEMU_IO -c "read -P 0xAC 64K 64K " "$TEST_IMG" | _filter_qemu_io
+# read on the cluster boundaries
+$QEMU_IO -c "read -v 131070 8 " "$TEST_IMG" | _filter_qemu_io
+$QEMU_IO -c "read -v 65534 8" "$TEST_IMG" | _filter_qemu_io
+
+echo
+echo "=== Testing adjacent clusters reading and writing with zstd ==="
+echo
+_make_test_img -o compression_type=zstd 64M
+$QEMU_IO -c "write -c -P 0xAB 0 64K " "$TEST_IMG" | _filter_qemu_io
+$QEMU_IO -c "write -c -P 0xAC 64K 64K " "$TEST_IMG" | _filter_qemu_io
+$QEMU_IO -c "write -c -P 0xAD 128K 64K " "$T

Re: [PATCH v22 3/4] qcow2: add zstd cluster compression

2020-04-30 Thread Denis Plotnikov




On 30.04.2020 11:26, Max Reitz wrote:

On 29.04.20 15:02, Vladimir Sementsov-Ogievskiy wrote:

29.04.2020 15:17, Max Reitz wrote:

On 29.04.20 12:37, Vladimir Sementsov-Ogievskiy wrote:

29.04.2020 13:24, Max Reitz wrote:

On 28.04.20 22:00, Denis Plotnikov wrote:

zstd significantly reduces cluster compression time.
It provides better compression performance maintaining
the same level of the compression ratio in comparison with
zlib, which, at the moment, is the only compression
method available.

The performance test results:
The test compresses and decompresses a qemu qcow2 image with a freshly
installed rhel-7.6 guest.
Image cluster size: 64K. Image on disk size: 2.2G

The test was conducted with a brd disk to reduce the influence
of the disk subsystem on the test results.
The results are given in seconds.

compress cmd:
     time ./qemu-img convert -O qcow2 -c -o
compression_type=[zlib|zstd]
     src.img [zlib|zstd]_compressed.img
decompress cmd
     time ./qemu-img convert -O qcow2
     [zlib|zstd]_compressed.img uncompressed.img

  compression   decompression
    zlib   zstd   zlib zstd

real 65.5   16.3 (-75 %)    1.9  1.6 (-16 %)
user 65.0   15.8    5.3  2.5
sys   3.3    0.2    2.0  2.0

Both ZLIB and ZSTD gave the same compression ratio: 1.57
compressed image size in both cases: 1.4G

Signed-off-by: Denis Plotnikov 
QAPI part:
Acked-by: Markus Armbruster 
---
    docs/interop/qcow2.txt |   1 +
    configure  |   2 +-
    qapi/block-core.json   |   3 +-
    block/qcow2-threads.c  | 169
+
    block/qcow2.c  |   7 ++
    slirp  |   2 +-
    6 files changed, 181 insertions(+), 3 deletions(-)

[...]


diff --git a/block/qcow2-threads.c b/block/qcow2-threads.c
index 7dbaf53489..a0b12e1b15 100644
--- a/block/qcow2-threads.c
+++ b/block/qcow2-threads.c

[...]


+static ssize_t qcow2_zstd_decompress(void *dest, size_t dest_size,
+ const void *src, size_t
src_size)
+{

[...]


+    /*
+ * The compressed stream from the input buffer may consist of
more
+ * than one zstd frame.

Can it?

If not, we must require it in the specification.

Actually, now that you mention it, it would make sense anyway to add
some note to the specification on what exactly compressed with zstd
means.


Hmm. If at some point we want multi-threaded compression of one big (2M)
cluster... Could this be implemented with the zstd lib? If multiple frames
are allowed, will allowing multiple frames help? I don't know, actually, but
I think it's better not to forbid it. On the other hand, I don't see any benefit in large
compressed clusters. At least, in our scenarios (for compressed backups)
we use 64k compressed clusters, for good granularity of incremental
backups (when for running vm we use 1M clusters).

Is it really that important?  Naïvely, it sounds rather complicated to
introduce multithreading into block drivers.

It is already here: compression and encryption are already multithreaded.
But of course, one cluster is handled in one thread.

Ah, good.  I forgot.


(Also, as for compression, it can only be used in backup scenarios
anyway, where you write many clusters at once.  So parallelism on the
cluster level should be sufficient to get high usage, and it would benefit
all compression types and cluster sizes.)


Yes it works in this way already :)

Well, OK then.


So, we don't know whether we want a one-frame restriction or not. Do you have a
preference?

*shrug*

Seems like it would be preferable to allow multiple frames still.  A
note in the spec would be nice (i.e., streaming format, multiple frames
per cluster possible).


We don't mention anything about zlib compressing details in the spec.

If we mention anything about ZSTD compressing details we'll have to do 
it for

zlib as well. So, I think we have two possibilities for the spec:
1. mention for both
2. don't mention at all

I think the 2nd is better. It gives more degrees of freedom for
future improvements.

and we've already covered both cases (one frame, many frames) in the code.
I'm not sure I'm right. Any other opinions?

Denis


Max






Re: [PATCH v22 4/4] iotests: 287: add qcow2 compression type test

2020-04-29 Thread Denis Plotnikov




On 29.04.2020 13:26, Max Reitz wrote:

On 28.04.20 22:00, Denis Plotnikov wrote:

The test checks that the qcow2 requirements for the compression type
feature are fulfilled and that the zstd compression type works.

Signed-off-by: Denis Plotnikov 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Tested-by: Vladimir Sementsov-Ogievskiy 
---
  tests/qemu-iotests/287 | 152 +
  tests/qemu-iotests/287.out |  67 
  tests/qemu-iotests/group   |   1 +
  3 files changed, 220 insertions(+)
  create mode 100755 tests/qemu-iotests/287
  create mode 100644 tests/qemu-iotests/287.out

diff --git a/tests/qemu-iotests/287 b/tests/qemu-iotests/287
new file mode 100755
index 00..21fe1f19f5
--- /dev/null
+++ b/tests/qemu-iotests/287
@@ -0,0 +1,152 @@
+#!/usr/bin/env bash
+#
+# Test case for an image using zstd compression
+#
+# Copyright (c) 2020 Virtuozzo International GmbH
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+
+# creator
+owner=dplotni...@virtuozzo.com
+
+seq="$(basename $0)"
+echo "QA output created by $seq"
+
+status=1   # failure is the default!
+
+# standard environment
+. ./common.rc
+. ./common.filter
+
+# This tests qcow2-specific low-level functionality
+_supported_fmt qcow2
+_supported_proto file
+_supported_os Linux
+_unsupported_imgopts 'compat=0.10' data_file
+
+COMPR_IMG="$TEST_IMG.compressed"
+RAND_FILE="$TEST_DIR/rand_data"
+
+_cleanup()
+{
+   _rm_test_img

_rm_test_img needs an argument (it basically replaces “rm”).  What I
thus meant was to keep the _cleanup_test_img here (that was completely
correct), but...


+   rm -f "$COMPR_IMG"

...to use “_rm_test_img "$COMPR_IMG"” here instead of rm.

Max

ok, I got it.

Denis



Re: [PATCH v22 3/4] qcow2: add zstd cluster compression

2020-04-29 Thread Denis Plotnikov




On 29.04.2020 13:24, Max Reitz wrote:

On 28.04.20 22:00, Denis Plotnikov wrote:

zstd significantly reduces cluster compression time.
It provides better compression performance maintaining
the same level of the compression ratio in comparison with
zlib, which, at the moment, is the only compression
method available.

The performance test results:
The test compresses and decompresses a qemu qcow2 image with a freshly
installed rhel-7.6 guest.
Image cluster size: 64K. Image on disk size: 2.2G

The test was conducted with a brd disk to reduce the influence
of the disk subsystem on the test results.
The results are given in seconds.

compress cmd:
   time ./qemu-img convert -O qcow2 -c -o compression_type=[zlib|zstd]
   src.img [zlib|zstd]_compressed.img
decompress cmd
   time ./qemu-img convert -O qcow2
   [zlib|zstd]_compressed.img uncompressed.img

           compression        decompression
         zlib    zstd        zlib    zstd

real     65.5    16.3 (-75 %)    1.9    1.6 (-16 %)
user     65.0    15.8            5.3    2.5
sys       3.3     0.2            2.0    2.0

Both ZLIB and ZSTD gave the same compression ratio: 1.57
compressed image size in both cases: 1.4G

Signed-off-by: Denis Plotnikov 
QAPI part:
Acked-by: Markus Armbruster 
---
  docs/interop/qcow2.txt |   1 +
  configure  |   2 +-
  qapi/block-core.json   |   3 +-
  block/qcow2-threads.c  | 169 +
  block/qcow2.c  |   7 ++
  slirp  |   2 +-
  6 files changed, 181 insertions(+), 3 deletions(-)

[...]


diff --git a/block/qcow2-threads.c b/block/qcow2-threads.c
index 7dbaf53489..a0b12e1b15 100644
--- a/block/qcow2-threads.c
+++ b/block/qcow2-threads.c

[...]


+static ssize_t qcow2_zstd_decompress(void *dest, size_t dest_size,
+ const void *src, size_t src_size)
+{

[...]


+/*
+ * The compressed stream from the input buffer may consist of more
+ * than one zstd frame.

Can it?


Potentially, it can, if another implementation of qcow2 saves a couple of 
frames for some reason.


Denis


Max






Re: [RFC patch v1 2/3] qemu-file: add buffered mode

2020-04-28 Thread Denis Plotnikov




On 28.04.2020 20:54, Dr. David Alan Gilbert wrote:

* Denis Plotnikov (dplotni...@virtuozzo.com) wrote:


On 27.04.2020 15:14, Dr. David Alan Gilbert wrote:

* Denis Plotnikov (dplotni...@virtuozzo.com) wrote:

The patch adds the ability for qemu-file to write data
asynchronously to improve write performance.
Before, only synchronous writing was supported.

Enabling the asynchronous mode is managed by the new
"enabled_buffered" callback.

It's a bit invasive isn't it - changes a lot of functions in a lot of
places!

If you mean changing the qemu-file code - yes, it is.

Yeh that's what I worry about; qemu-file is pretty complex as it is.
Especially when it then passes it to the channel code etc


If you mean changing the qemu-file usage in the code - no.
The only place to change is the snapshot code when the buffered mode is
enabled with a callback.
The change is in 03 patch of the series.

That's fine - that's easy.


The multifd code separated the control headers from the data on separate
fd's - but that doesn't help your case.

yes, that doesn't help

Is there any chance you could do this by using the existing 'save_page'
hook (that RDMA uses).

I don't think so. My goal is to improve writing performance of
the internal snapshot to a qcow2 image. The snapshot is saved in qcow2 as a
continuous stream placed at the end of the address space.
To achieve the best writing speed I need a size- and base-aligned buffer
containing the vm state (with ram), which looks like this (for the ram part):

... | ram page header | ram page | ram page header | ram page | ... and so
on

to store the buffer in qcow2 with a single operation.

'save_page' would allow me not to store 'ram page' in the qemu-file internal
structures,
and write my own ram page storing logic. I think that wouldn't help me a lot
because:
1. I need a page with the ram page header
2. I want to reduce the number of io operations
3. I want to save other parts of vm state as fast as possible

Maybe I can't see a better way of using the 'save_page' callback.
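
A minimal sketch of the buffering idea described above (illustrative only;
the names, the 1 MiB size and the pwrite()-based flush are assumptions, not
the RFC's actual API): gather the "ram page header | ram page" stream into
one aligned buffer and hand it to storage in a single write.

/* Hedged sketch: accumulate small writes and flush them in one pwrite(). */
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define EXAMPLE_BUF_SIZE (1024 * 1024)

typedef struct ExampleWriteBuffer {
    int fd;
    off_t offset;     /* file offset where the buffered data will land */
    size_t used;
    uint8_t *data;    /* EXAMPLE_BUF_SIZE bytes, suitably aligned */
} ExampleWriteBuffer;

static int example_buffer_flush(ExampleWriteBuffer *b)
{
    if (b->used &&
        pwrite(b->fd, b->data, b->used, b->offset) != (ssize_t)b->used) {
        return -1;
    }
    b->offset += b->used;
    b->used = 0;
    return 0;
}

/* Append one blob (a ram page header or a ram page); flush when full. */
static int example_buffer_put(ExampleWriteBuffer *b, const void *p, size_t len)
{
    while (len) {
        size_t chunk = EXAMPLE_BUF_SIZE - b->used;

        if (chunk > len) {
            chunk = len;
        }
        memcpy(b->data + b->used, p, chunk);
        b->used += chunk;
        p = (const uint8_t *)p + chunk;
        len -= chunk;
        if (b->used == EXAMPLE_BUF_SIZE && example_buffer_flush(b) < 0) {
            return -1;
        }
    }
    return 0;
}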
Could you suggest anything?

I guess it depends if we care about keeping the format of the snapshot
the same here;  if we were open to changing it, then we could use
the save_page hook to delay the writes, so we'd have a pile of headers
followed by a pile of pages.


I think we have to care about keeping the format. Many users already have
internal snapshots saved in qcow2 images; if we change the format, we either
can't load snapshots from those images (and make new snapshots unreadable for
older qemu versions) or we have to support two versions of the format, which
I think is too complicated.




Denis

In the cover letter you mention direct qemu_fflush calls - have we got a
few too many in some palces that you think we can clean out?

I'm not sure that some of them are excessive. To the best of my knowledge,
qemu-file is used for the source-destination communication on migration,
and removing some qemu_fflush-es may break the communication logic.

I can't see any obvious places where it's called during the ram
migration; can you try and give me a hint to where you're seeing it ?


I think those qemu_fflush-es aren't in the ram migration but rather in
other vm state parts.
Although those parts are quite small in comparison to ram, I saw quite
a lot of qemu_fflush-es while debugging.
Still, we could benefit from saving them with a smaller number of io
operations if we are going to use the buffered mode.


Denis




The snapshot is just a special case (if not the only one) where we know
that we can do buffered (cached) writes. Do you know any other cases where
the buffered (cached) mode could be useful?

The RDMA code does it because it's really not good at small transfers,
but maybe generally it would be a good idea to do larger writes if
possible - something that multifd manages.

Dave


Dave


Signed-off-by: Denis Plotnikov 
---
   include/qemu/typedefs.h |   1 +
   migration/qemu-file.c   | 351 
+---
   migration/qemu-file.h   |   9 ++
   3 files changed, 339 insertions(+), 22 deletions(-)

diff --git a/include/qemu/typedefs.h b/include/qemu/typedefs.h
index 88dce54..9b388c8 100644
--- a/include/qemu/typedefs.h
+++ b/include/qemu/typedefs.h
@@ -98,6 +98,7 @@ typedef struct QEMUBH QEMUBH;
   typedef struct QemuConsole QemuConsole;
   typedef struct QEMUFile QEMUFile;
   typedef struct QEMUFileBuffer QEMUFileBuffer;
+typedef struct QEMUFileAioTask QEMUFileAioTask;
   typedef struct QemuLockable QemuLockable;
   typedef struct QemuMutex QemuMutex;
   typedef struct QemuOpt QemuOpt;
diff --git a/migration/qemu-file.c b/migration/qemu-file.c
index 285c6ef..f42f949 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -29,19 +29,25 @@
   #include "qemu-file.h"
   #include "trace.h"
   #include "qapi/error.h"
+#include "block/aio_task.h"
-#define IO_BUF_SIZE 32768
+#define IO_BUF_SIZE (1024 * 1024)
   #define MAX_IOV_SI

[PATCH v22 4/4] iotests: 287: add qcow2 compression type test

2020-04-28 Thread Denis Plotnikov
The test checks that the qcow2 requirements for the compression type
feature are fulfilled and that the zstd compression type works.

Signed-off-by: Denis Plotnikov 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Tested-by: Vladimir Sementsov-Ogievskiy 
---
 tests/qemu-iotests/287 | 152 +
 tests/qemu-iotests/287.out |  67 
 tests/qemu-iotests/group   |   1 +
 3 files changed, 220 insertions(+)
 create mode 100755 tests/qemu-iotests/287
 create mode 100644 tests/qemu-iotests/287.out

diff --git a/tests/qemu-iotests/287 b/tests/qemu-iotests/287
new file mode 100755
index 00..21fe1f19f5
--- /dev/null
+++ b/tests/qemu-iotests/287
@@ -0,0 +1,152 @@
+#!/usr/bin/env bash
+#
+# Test case for an image using zstd compression
+#
+# Copyright (c) 2020 Virtuozzo International GmbH
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+
+# creator
+owner=dplotni...@virtuozzo.com
+
+seq="$(basename $0)"
+echo "QA output created by $seq"
+
+status=1   # failure is the default!
+
+# standard environment
+. ./common.rc
+. ./common.filter
+
+# This tests qcow2-specific low-level functionality
+_supported_fmt qcow2
+_supported_proto file
+_supported_os Linux
+_unsupported_imgopts 'compat=0.10' data_file
+
+COMPR_IMG="$TEST_IMG.compressed"
+RAND_FILE="$TEST_DIR/rand_data"
+
+_cleanup()
+{
+   _rm_test_img
+   rm -f "$COMPR_IMG"
+   rm -f "$RAND_FILE"
+}
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+# for all the cases
+CLUSTER_SIZE=65536
+
+# Check if we can run this test.
+if IMGOPTS='compression_type=zstd' _make_test_img 64M |
+grep "Invalid parameter 'zstd'"; then
+_notrun "ZSTD is disabled"
+fi
+
+echo
+echo "=== Testing compression type incompatible bit setting for zlib ==="
+echo
+_make_test_img -o compression_type=zlib 64M
+$PYTHON qcow2.py "$TEST_IMG" dump-header | grep incompatible_features
+
+echo
+echo "=== Testing compression type incompatible bit setting for zstd ==="
+echo
+_make_test_img -o compression_type=zstd 64M
+$PYTHON qcow2.py "$TEST_IMG" dump-header | grep incompatible_features
+
+echo
+echo "=== Testing zlib with incompatible bit set ==="
+echo
+_make_test_img -o compression_type=zlib 64M
+$PYTHON qcow2.py "$TEST_IMG" set-feature-bit incompatible 3
+# to make sure the bit was actually set
+$PYTHON qcow2.py "$TEST_IMG" dump-header | grep incompatible_features
+
+if $QEMU_IMG info "$TEST_IMG" >/dev/null 2>&1 ; then
+echo "Error: The image opened successfully. The image must not be opened."
+fi
+
+echo
+echo "=== Testing zstd with incompatible bit unset ==="
+echo
+_make_test_img -o compression_type=zstd 64M
+$PYTHON qcow2.py "$TEST_IMG" set-header incompatible_features 0
+# to make sure the bit was actually unset
+$PYTHON qcow2.py "$TEST_IMG" dump-header | grep incompatible_features
+
+if $QEMU_IMG info "$TEST_IMG" >/dev/null 2>&1 ; then
+echo "Error: The image opened successfully. The image must not be opened."
+fi
+
+echo
+echo "=== Testing compression type values ==="
+echo
+# zlib=0
+_make_test_img -o compression_type=zlib 64M
+peek_file_be "$TEST_IMG" 104 1
+echo
+
+# zstd=1
+_make_test_img -o compression_type=zstd 64M
+peek_file_be "$TEST_IMG" 104 1
+echo
+
+echo
+echo "=== Testing simple reading and writing with zstd ==="
+echo
+_make_test_img -o compression_type=zstd 64M
+$QEMU_IO -c "write -c -P 0xAC 64K 64K " "$TEST_IMG" | _filter_qemu_io
+$QEMU_IO -c "read -P 0xAC 64K 64K " "$TEST_IMG" | _filter_qemu_io
+# read on the cluster boundaries
+$QEMU_IO -c "read -v 131070 8 " "$TEST_IMG" | _filter_qemu_io
+$QEMU_IO -c "read -v 65534 8" "$TEST_IMG" | _filter_qemu_io
+
+echo
+echo "=== Testing adjacent clusters reading and writing with zstd ==="
+echo
+_make_test_img -o compression_type=zstd 64M
+$QEMU_IO -c "write -c -P 0xAB 0 64K " "$TEST_IMG" | _filter_qemu_io
+$QEMU_IO -c "write -c -P 0xAC 64K 64K " "$TEST_IMG" | _filter_qemu_io
+$QEMU_IO -c "write -c -P 0xAD 128K 64K " "$TEST_IMG" | _filte

[PATCH v22 0/4] implement zstd cluster compression method

2020-04-28 Thread Denis Plotnikov
v22:
   03: remove assignment in if condition

v21:
   03:
   * remove the loop on compression [Max]
   * use designated initializers [Max]
   04:
   * don't erase user's options [Max]
   * use _rm_test_img [Max]
   * add unsupported qcow2 options [Max]

v20:
   04: fix a number of flaws [Vladimir]
   * don't use $RAND_FILE passing to qemu-io,
 so check $TEST_DIR is redundant
   * re-arrange $RAND_FILE writing
   * fix a typo

v19:
   04: fix a number of flaws [Eric]
   * remove redundant test case descriptions
   * fix stdout redirect
   * don't use (())
   * use peek_file_be instead of od
   * check $TEST_DIR for spaces and other before using
   * use $RAND_FILE safer

v18:
   * 04: add quotes to all file name variables [Vladimir] 
   * 04: add Vladimir's comment according to "qemu-io write -s"
 option issue.

v17:
   * 03: remove incorrect comment in zstd decompress [Vladimir]
   * 03: remove "paraniod" and rewrite the comment on decompress [Vladimir]
   * 03: fix dead assignment [Vladimir]
   * 04: add and remove quotes [Vladimir]
   * 04: replace long offset form with the short one [Vladimir]

v16:
   * 03: ssize_t for ret, size_t for zstd_ret [Vladimir]
   * 04: small fixes according to the comments [Vladimir] 

v15:
   * 01: aiming qemu 5.1 [Eric]
   * 03: change zstd_res definition place [Vladimir]
   * 04: add two new test cases [Eric]
 1. test adjacent cluster compression with zstd
 2. test incompressible cluster processing
   * 03, 04: much rewording and grammar fixing [Eric]

v14:
   * fix bug on compression - looping until compress == 0 [Me]
   * apply reworked Vladimir's suggestions:
  1. not mixing ssize_t with size_t
  2. safe check for ENOMEM in compression part - avoid overflow
  3. tolerate sanity check allow zstd to make progress only
 on one of the buffers
v13:
   * 03: add progress sanity check to decompression loop [Vladimir]
 03: add successful decompression check [Me]

v12:
   * 03: again, rework compression and decompression loops
 to make them more correct [Vladimir]
 03: move assert in compression to more appropriate place
 [Vladimir]
v11:
   * 03: the loops don't need "do{}while" form anymore and
 they were buggy (missed "do" in the beginning)
 replace them with usual "while(){}" loops [Vladimir]
v10:
   * 03: fix zstd (de)compressed loops for multi-frame
 cases [Vladimir]
v9:
   * 01: fix error checking and reporting in qcow2_amend compression type part 
[Vladimir]
   * 03: replace asserts with -EIO in qcow2_zstd_decompression [Vladimir, 
Alberto]
   * 03: reword/amend/add comments, fix typos [Vladimir]

v8:
   * 03: switch zstd API from simple to stream [Eric]
 No need to state a special cluster layout for zstd
 compressed clusters.
v7:
   * use qapi_enum_parse instead of the open-coding [Eric]
   * fix wording, typos and spelling [Eric]

v6:
   * "block/qcow2-threads: fix qcow2_decompress" is removed from the series
  since it has been accepted by Max already
   * add compile time checking for Qcow2Header to be a multiple of 8 [Max, 
Alberto]
   * report error on qcow2 amending when the compression type is actually 
changed [Max]
   * remove the extra space and the extra new line [Max]
   * re-arrange acks and signed-off-s [Vladimir]

v5:
   * replace -ENOTSUP with abort in qcow2_co_decompress [Vladimir]
   * set cluster size for all test cases in the beginning of the 287 test

v4:
   * the series is rebased on top of 01 "block/qcow2-threads: fix 
qcow2_decompress"
   * 01 is just a no-change resend to avoid extra dependencies. Still, it may 
be merged in separate

v3:
   * remove redundant max compression type value check [Vladimir, Eric]
 (the switch below checks everything)
   * prevent compression type changing on "qemu-img amend" [Vladimir]
   * remove zstd config setting, since it has been added already by
 "migration" patches [Vladimir]
   * change the compression type error message [Vladimir] 
   * fix alignment and 80-chars exceeding [Vladimir]

v2:
   * rework compression type setting [Vladimir]
   * squash iotest changes to the compression type introduction patch 
[Vladimir, Eric]
   * fix zstd availability checking in zstd iotest [Vladimir]
   * remove unnecessary casting [Eric]
   * remove redundant checks [Eric]
   * fix compressed cluster layout in qcow2 spec [Vladimir]
   * fix wording [Eric, Vladimir]
   * fix compression type filtering in iotests [Eric]

v1:
   the initial series

Denis Plotnikov (4):
  qcow2: introduce compression type feature
  qcow2: rework the cluster compression routine
  qcow2: add zstd cluster compression
  iotests: 287: add qcow2 compression type test

 docs/interop/qcow2.txt   |   1 +
 configure|   2 +-
 qapi/block-

[PATCH v22 1/4] qcow2: introduce compression type feature

2020-04-28 Thread Denis Plotnikov
The patch adds some preparation parts for the incompatible compression type
feature to qcow2, allowing the use of different compression methods for
(de)compressing image clusters.

It is implied that the compression type is set on the image creation and
can be changed only later by image conversion, thus compression type
defines the only compression algorithm used for the image, and thus,
for all image clusters.

The goal of the feature is to add support for other compression methods
to qcow2, for example ZSTD, which is more efficient at compression than ZLIB.

The default compression is ZLIB. Images created with ZLIB compression type
are backward compatible with older qemu versions.

Adding the compression type breaks a number of tests because now the
compression type is reported on image creation and there are some changes
in the qcow2 header in size and offsets.

The tests are fixed in the following ways:
* filter out compression_type for many tests
* fix header size, feature table size and backing file offset
  affected tests: 031, 036, 061, 080
  header_size +=8: 1 byte compression type
   7 bytes padding
  feature_table += 48: incompatible feature compression type
  backing_file_offset += 56 (8 + 48 -> header_change + feature_table_change)
* add "compression type" for test output matching when it isn't filtered
  affected tests: 049, 060, 061, 065, 144, 182, 242, 255

Signed-off-by: Denis Plotnikov 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Eric Blake 
QAPI part:
Acked-by: Markus Armbruster 
---
 qapi/block-core.json |  22 +-
 block/qcow2.h|  20 +-
 include/block/block_int.h|   1 +
 block/qcow2.c| 113 +++
 tests/qemu-iotests/031.out   |  14 ++--
 tests/qemu-iotests/036.out   |   4 +-
 tests/qemu-iotests/049.out   | 102 ++--
 tests/qemu-iotests/060.out   |   1 +
 tests/qemu-iotests/061.out   |  34 ++
 tests/qemu-iotests/065   |  28 +---
 tests/qemu-iotests/080   |   2 +-
 tests/qemu-iotests/144.out   |   4 +-
 tests/qemu-iotests/182.out   |   2 +-
 tests/qemu-iotests/242.out   |   5 ++
 tests/qemu-iotests/255.out   |   8 +--
 tests/qemu-iotests/common.filter |   3 +-
 16 files changed, 267 insertions(+), 96 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 943df1926a..1522e2983f 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -78,6 +78,8 @@
 #
 # @bitmaps: A list of qcow2 bitmap details (since 4.0)
 #
+# @compression-type: the image cluster compression method (since 5.1)
+#
 # Since: 1.7
 ##
 { 'struct': 'ImageInfoSpecificQCow2',
@@ -89,7 +91,8 @@
   '*corrupt': 'bool',
   'refcount-bits': 'int',
   '*encrypt': 'ImageInfoSpecificQCow2Encryption',
-  '*bitmaps': ['Qcow2BitmapInfo']
+  '*bitmaps': ['Qcow2BitmapInfo'],
+  'compression-type': 'Qcow2CompressionType'
   } }
 
 ##
@@ -4284,6 +4287,18 @@
   'data': [ 'v2', 'v3' ] }
 
 
+##
+# @Qcow2CompressionType:
+#
+# Compression type used in qcow2 image file
+#
+# @zlib: zlib compression, see <http://zlib.net/>
+#
+# Since: 5.1
+##
+{ 'enum': 'Qcow2CompressionType',
+  'data': [ 'zlib' ] }
+
 ##
 # @BlockdevCreateOptionsQcow2:
 #
@@ -4307,6 +4322,8 @@
 # allowed values: off, falloc, full, metadata)
 # @lazy-refcounts: True if refcounts may be updated lazily (default: off)
 # @refcount-bits: Width of reference counts in bits (default: 16)
+# @compression-type: The image cluster compression method
+#(default: zlib, since 5.1)
 #
 # Since: 2.12
 ##
@@ -4322,7 +4339,8 @@
 '*cluster-size':'size',
 '*preallocation':   'PreallocMode',
 '*lazy-refcounts':  'bool',
-'*refcount-bits':   'int' } }
+'*refcount-bits':   'int',
+'*compression-type':'Qcow2CompressionType' } }
 
 ##
 # @BlockdevCreateOptionsQed:
diff --git a/block/qcow2.h b/block/qcow2.h
index f4de0a27d5..6a8b82e6cc 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -146,8 +146,16 @@ typedef struct QCowHeader {
 
 uint32_t refcount_order;
 uint32_t header_length;
+
+/* Additional fields */
+uint8_t compression_type;
+
+/* header must be a multiple of 8 */
+uint8_t padding[7];
 } QEMU_PACKED QCowHeader;
 
+QEMU_BUILD_BUG_ON(!QEMU_IS_ALIGNED(sizeof(QCowHeader), 8));
+
 typedef struct QEMU_PACKED QCowSnapshotHeader {
 /* header is 8 byte aligned */
 uint64_t l1_table_offset;
@@ -216,13 +224,16 @@ enum {
 QCOW2_INCOMPAT_DIRTY_BITNR  = 0,
 QCOW2_INCOMPAT_CORRUPT_BITNR= 1,
 QCOW2_INCOMPAT_DATA_FILE_BITNR  = 2,
+QCOW2_INCOMPAT_COMPRESSION_BITNR = 3,
 QCOW2_INCOMPAT_DIRTY= 1 << QCOW2_INCOMPAT_DIRTY_BITNR,
 QCOW2_INCOMPAT_CORRUPT  = 1 << QCOW2_INCOMPAT_CORRUPT_BITNR,
 QCOW

[PATCH v22 2/4] qcow2: rework the cluster compression routine

2020-04-28 Thread Denis Plotnikov
The patch enables processing of the compression type defined
for the image and chooses the appropriate method for cluster
(de)compression.

Signed-off-by: Denis Plotnikov 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Alberto Garcia 
---
 block/qcow2-threads.c | 71 ---
 1 file changed, 60 insertions(+), 11 deletions(-)

diff --git a/block/qcow2-threads.c b/block/qcow2-threads.c
index a68126f291..7dbaf53489 100644
--- a/block/qcow2-threads.c
+++ b/block/qcow2-threads.c
@@ -74,7 +74,9 @@ typedef struct Qcow2CompressData {
 } Qcow2CompressData;
 
 /*
- * qcow2_compress()
+ * qcow2_zlib_compress()
+ *
+ * Compress @src_size bytes of data using zlib compression method
  *
  * @dest - destination buffer, @dest_size bytes
  * @src - source buffer, @src_size bytes
@@ -83,8 +85,8 @@ typedef struct Qcow2CompressData {
  *  -ENOMEM destination buffer is not enough to store compressed data
  *  -EIOon any other error
  */
-static ssize_t qcow2_compress(void *dest, size_t dest_size,
-  const void *src, size_t src_size)
+static ssize_t qcow2_zlib_compress(void *dest, size_t dest_size,
+   const void *src, size_t src_size)
 {
 ssize_t ret;
 z_stream strm;
@@ -119,10 +121,10 @@ static ssize_t qcow2_compress(void *dest, size_t dest_size,
 }
 
 /*
- * qcow2_decompress()
+ * qcow2_zlib_decompress()
  *
  * Decompress some data (not more than @src_size bytes) to produce exactly
- * @dest_size bytes.
+ * @dest_size bytes using zlib compression method
  *
  * @dest - destination buffer, @dest_size bytes
  * @src - source buffer, @src_size bytes
@@ -130,8 +132,8 @@ static ssize_t qcow2_compress(void *dest, size_t dest_size,
  * Returns: 0 on success
  *  -EIO on fail
  */
-static ssize_t qcow2_decompress(void *dest, size_t dest_size,
-const void *src, size_t src_size)
+static ssize_t qcow2_zlib_decompress(void *dest, size_t dest_size,
+ const void *src, size_t src_size)
 {
 int ret;
 z_stream strm;
@@ -191,20 +193,67 @@ qcow2_co_do_compress(BlockDriverState *bs, void *dest, size_t dest_size,
 return arg.ret;
 }
 
+/*
+ * qcow2_co_compress()
+ *
+ * Compress @src_size bytes of data using the compression
+ * method defined by the image compression type
+ *
+ * @dest - destination buffer, @dest_size bytes
+ * @src - source buffer, @src_size bytes
+ *
+ * Returns: compressed size on success
+ *  a negative error code on failure
+ */
 ssize_t coroutine_fn
 qcow2_co_compress(BlockDriverState *bs, void *dest, size_t dest_size,
   const void *src, size_t src_size)
 {
-return qcow2_co_do_compress(bs, dest, dest_size, src, src_size,
-qcow2_compress);
+BDRVQcow2State *s = bs->opaque;
+Qcow2CompressFunc fn;
+
+switch (s->compression_type) {
+case QCOW2_COMPRESSION_TYPE_ZLIB:
+fn = qcow2_zlib_compress;
+break;
+
+default:
+abort();
+}
+
+return qcow2_co_do_compress(bs, dest, dest_size, src, src_size, fn);
 }
 
+/*
+ * qcow2_co_decompress()
+ *
+ * Decompress some data (not more than @src_size bytes) to produce exactly
+ * @dest_size bytes using the compression method defined by the image
+ * compression type
+ *
+ * @dest - destination buffer, @dest_size bytes
+ * @src - source buffer, @src_size bytes
+ *
+ * Returns: 0 on success
+ *  a negative error code on failure
+ */
 ssize_t coroutine_fn
 qcow2_co_decompress(BlockDriverState *bs, void *dest, size_t dest_size,
 const void *src, size_t src_size)
 {
-return qcow2_co_do_compress(bs, dest, dest_size, src, src_size,
-qcow2_decompress);
+BDRVQcow2State *s = bs->opaque;
+Qcow2CompressFunc fn;
+
+switch (s->compression_type) {
+case QCOW2_COMPRESSION_TYPE_ZLIB:
+fn = qcow2_zlib_decompress;
+break;
+
+default:
+abort();
+}
+
+return qcow2_co_do_compress(bs, dest, dest_size, src, src_size, fn);
 }
 
 
-- 
2.17.0
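
The dispatch added above is easy to reproduce outside the QEMU tree. Below is
a minimal, self-contained sketch of the same pattern (not QEMU code): the
compression type selects a handler through a function pointer, so wiring in a
new method means one enum value plus one case per switch. All demo_* names are
invented for illustration, and the "zlib" handler is a stub that merely copies
its input.

  #include <stdio.h>
  #include <string.h>
  #include <sys/types.h>

  typedef enum {
      DEMO_COMPRESSION_TYPE_ZLIB,
      DEMO_COMPRESSION_TYPE_ZSTD,
  } DemoCompressionType;

  typedef ssize_t (*DemoCompressFunc)(void *dest, size_t dest_size,
                                      const void *src, size_t src_size);

  /* Stand-in for a real zlib-based compressor: it only copies the data. */
  static ssize_t demo_zlib_compress(void *dest, size_t dest_size,
                                    const void *src, size_t src_size)
  {
      if (src_size > dest_size) {
          return -1;
      }
      memcpy(dest, src, src_size);
      return (ssize_t)src_size;
  }

  /* Select the handler from the image's compression type. */
  static ssize_t demo_compress(DemoCompressionType type,
                               void *dest, size_t dest_size,
                               const void *src, size_t src_size)
  {
      DemoCompressFunc fn;

      switch (type) {
      case DEMO_COMPRESSION_TYPE_ZLIB:
          fn = demo_zlib_compress;
          break;
      default:
          return -1;   /* unknown type; the real code aborts instead */
      }

      return fn(dest, dest_size, src, src_size);
  }

  int main(void)
  {
      char out[16];

      printf("%zd\n", demo_compress(DEMO_COMPRESSION_TYPE_ZLIB,
                                    out, sizeof(out), "hello", 5));
      return 0;
  }

Keeping the selection in one place, rather than checking the type at every
call site, is what lets a new compression method be added with purely local
changes, as the zstd patch below demonstrates.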




[PATCH v22 3/4] qcow2: add zstd cluster compression

2020-04-28 Thread Denis Plotnikov
zstd significantly reduces cluster compression time.
It compresses faster while maintaining the same compression
ratio as zlib, which is currently the only available
compression method.

Performance test results:
The test compresses and decompresses a qemu qcow2 image containing
a freshly installed rhel-7.6 guest.
Image cluster size: 64K. Image size on disk: 2.2G

The test was run on a brd (RAM block device) disk to reduce the
influence of the disk subsystem on the results.
The results are given in seconds.

compress cmd:
  time ./qemu-img convert -O qcow2 -c -o compression_type=[zlib|zstd]
  src.img [zlib|zstd]_compressed.img
decompress cmd:
  time ./qemu-img convert -O qcow2
  [zlib|zstd]_compressed.img uncompressed.img

             compression              decompression
           zlib       zstd            zlib      zstd

  real     65.5       16.3 (-75 %)     1.9       1.6 (-16 %)
  user     65.0       15.8             5.3       2.5
  sys       3.3        0.2             2.0       2.0

Both ZLIB and ZSTD gave the same compression ratio: 1.57
compressed image size in both cases: 1.4G

Signed-off-by: Denis Plotnikov 
QAPI part:
Acked-by: Markus Armbruster 
---
 docs/interop/qcow2.txt |   1 +
 configure              |   2 +-
 qapi/block-core.json   |   3 +-
 block/qcow2-threads.c  | 169 +
 block/qcow2.c          |   7 ++
 slirp                  |   2 +-
 6 files changed, 181 insertions(+), 3 deletions(-)

diff --git a/docs/interop/qcow2.txt b/docs/interop/qcow2.txt
index 640e0eca40..18a77f737e 100644
--- a/docs/interop/qcow2.txt
+++ b/docs/interop/qcow2.txt
@@ -209,6 +209,7 @@ version 2.
 
 Available compression type values:
 0: zlib <https://www.zlib.net/>
+1: zstd <http://github.com/facebook/zstd>
 
 
 === Header padding ===
diff --git a/configure b/configure
index 23b5e93752..4e3a1690ea 100755
--- a/configure
+++ b/configure
@@ -1861,7 +1861,7 @@ disabled with --disable-FEATURE, default is enabled if available:
   lzfse   support of lzfse compression library
   (for reading lzfse-compressed dmg images)
   zstdsupport for zstd compression library
-  (for migration compression)
+  (for migration compression and qcow2 cluster compression)
   seccomp seccomp support
   coroutine-pool  coroutine freelist (better performance)
   glusterfs   GlusterFS backend
diff --git a/qapi/block-core.json b/qapi/block-core.json
index 1522e2983f..6fbacddab2 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -4293,11 +4293,12 @@
 # Compression type used in qcow2 image file
 #
 # @zlib: zlib compression, see <http://zlib.net/>
+# @zstd: zstd compression, see <http://github.com/facebook/zstd>
 #
 # Since: 5.1
 ##
 { 'enum': 'Qcow2CompressionType',
-  'data': [ 'zlib' ] }
+  'data': [ 'zlib', { 'name': 'zstd', 'if': 'defined(CONFIG_ZSTD)' } ] }
 
 ##
 # @BlockdevCreateOptionsQcow2:
diff --git a/block/qcow2-threads.c b/block/qcow2-threads.c
index 7dbaf53489..a0b12e1b15 100644
--- a/block/qcow2-threads.c
+++ b/block/qcow2-threads.c
@@ -28,6 +28,11 @@
 #define ZLIB_CONST
 #include 
 
+#ifdef CONFIG_ZSTD
+#include 
+#include 
+#endif
+
 #include "qcow2.h"
 #include "block/thread-pool.h"
 #include "crypto.h"
@@ -166,6 +171,160 @@ static ssize_t qcow2_zlib_decompress(void *dest, size_t dest_size,
 return ret;
 }
 
+#ifdef CONFIG_ZSTD
+
+/*
+ * qcow2_zstd_compress()
+ *
+ * Compress @src_size bytes of data using zstd compression method
+ *
+ * @dest - destination buffer, @dest_size bytes
+ * @src - source buffer, @src_size bytes
+ *
+ * Returns: compressed size on success
+ *  -ENOMEM destination buffer is not enough to store compressed data
+ *  -EIOon any other error
+ */
+static ssize_t qcow2_zstd_compress(void *dest, size_t dest_size,
+   const void *src, size_t src_size)
+{
+ssize_t ret;
+size_t zstd_ret;
+ZSTD_outBuffer output = {
+.dst = dest,
+.size = dest_size,
+.pos = 0
+};
+ZSTD_inBuffer input = {
+.src = src,
+.size = src_size,
+.pos = 0
+};
+ZSTD_CCtx *cctx = ZSTD_createCCtx();
+
+if (!cctx) {
+return -EIO;
+}
+/*
+ * Use the zstd streamed interface for symmetry with decompression,
+ * where streaming is essential since we don't record the exact
+ * compressed size.
+ *
+ * ZSTD_compressStream2() tries to compress everything it could
+ * with a single call. Although the ZSTD docs say that:
+ * "You must continue calling ZSTD_compressStream2() with ZSTD_e_end
+ * until it returns 0, at which point you are free to start a new frame",
+ * in our tests we saw 
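
For reference, the streaming calls used above are part of the public libzstd
API (zstd >= 1.4.0). The following is a minimal, self-contained sketch of
compressing one buffer into a single frame with ZSTD_compressStream2(); it
illustrates the library interface only and is not the patch's code.
demo_zstd_compress() is an invented name, and the program is assumed to be
built with -lzstd.

  #include <errno.h>
  #include <stdio.h>
  #include <sys/types.h>
  #include <zstd.h>

  static ssize_t demo_zstd_compress(void *dest, size_t dest_size,
                                    const void *src, size_t src_size)
  {
      ZSTD_outBuffer output = { .dst = dest, .size = dest_size, .pos = 0 };
      ZSTD_inBuffer input = { .src = src, .size = src_size, .pos = 0 };
      ZSTD_CCtx *cctx = ZSTD_createCCtx();
      size_t zstd_ret;
      ssize_t ret;

      if (!cctx) {
          return -EIO;
      }

      /*
       * ZSTD_e_end asks the library to consume all input and finish the
       * frame.  With enough output space a single call returns 0; a
       * non-zero, non-error return means data is still waiting to be
       * flushed, i.e. the output buffer is too small for the whole frame.
       */
      zstd_ret = ZSTD_compressStream2(cctx, &output, &input, ZSTD_e_end);
      if (ZSTD_isError(zstd_ret)) {
          ret = -EIO;
      } else if (zstd_ret != 0) {
          ret = -ENOMEM;
      } else {
          ret = (ssize_t)output.pos;   /* size of the compressed frame */
      }

      ZSTD_freeCCtx(cctx);
      return ret;
  }

  int main(void)
  {
      const char msg[] = "hello hello hello hello hello hello hello";
      char out[256];

      printf("compressed to %zd bytes\n",
             demo_zstd_compress(out, sizeof(out), msg, sizeof(msg)));
      return 0;
  }

Mapping "output buffer too small" to -ENOMEM and other failures to -EIO matches
the error contract documented for qcow2_zstd_compress() earlier in this patch.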

Re: [PATCH v20 4/4] iotests: 287: add qcow2 compression type test

2020-04-28 Thread Denis Plotnikov




On 28.04.2020 16:01, Eric Blake wrote:

On 4/28/20 7:55 AM, Max Reitz wrote:


+# This tests qcow2-specific low-level functionality
+_supported_fmt qcow2
+_supported_proto file
+_supported_os Linux

This test doesn’t work with compat=0.10 (because we can’t store a
non-default compression type there) or data_file (because those don’t
support compression), so those options should be marked as 
unsupported.


(It does seem to work with any refcount_bits, though.)


Could I ask how to achieve that?
I can't find anything _supported_*-related.



It’s _unsupported_imgopts.


Test 036 is an example of this.

Max, Eric

Thanks!

Denis







