Re: [PATCH v2] virtio: add VIRTQUEUE_ERROR QAPI event
Reviewed-by: Denis Plotnikov

On 9/12/23 20:57, Vladimir Sementsov-Ogievskiy wrote:

For now we only log the vhost device error, when the virtqueue is actually stopped. Let's add a QAPI event, which makes it possible to:

- collect statistics of such errors;
- take immediate action: capture core dumps or do other debugging;
- inform the user through a management API or UI, so that they can react somehow, e.g. reset the device driver in the guest, or even build up some automation to do so.

Note that basically every inconsistency discovered during virtqueue processing results in a silent virtqueue stop. The guest then just sees the requests getting stuck somewhere in the device for no visible reason. This event provides a means to inform the management layer of this situation in a timely fashion.

The event could be reused for some other virtqueue problems (not only for vhost devices) in the future. For this it gets a generic name and structure.

We keep the original VHOST_OPS_DEBUG(), so the existing debug output stays as is; it is not the only call to VHOST_OPS_DEBUG in the file.
Signed-off-by: Vladimir Sementsov-Ogievskiy
---
v2:
- improve commit message (borrowed Roman's wording, hope he doesn't mind :)
- add event throttling

 hw/virtio/vhost.c | 12 +++++++++---
 monitor/monitor.c | 10 ++++++++++
 qapi/qdev.json    | 25 +++++++++++++++++++++++++
 3 files changed, 44 insertions(+), 3 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index e2f6ffb446..162899feee 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -15,6 +15,7 @@
 #include "qemu/osdep.h"
 #include "qapi/error.h"
+#include "qapi/qapi-events-qdev.h"
 #include "hw/virtio/vhost.h"
 #include "qemu/atomic.h"
 #include "qemu/range.h"
@@ -1332,11 +1333,16 @@ static void vhost_virtqueue_error_notifier(EventNotifier *n)
     struct vhost_virtqueue *vq = container_of(n, struct vhost_virtqueue,
                                               error_notifier);
     struct vhost_dev *dev = vq->dev;
-    int index = vq - dev->vqs;
 
     if (event_notifier_test_and_clear(n) && dev->vdev) {
-        VHOST_OPS_DEBUG(-EINVAL, "vhost vring error in virtqueue %d",
-                        dev->vq_index + index);
+        int ind = vq - dev->vqs + dev->vq_index;
+        DeviceState *ds = &dev->vdev->parent_obj;
+
+        VHOST_OPS_DEBUG(-EINVAL, "vhost vring error in virtqueue %d", ind);
+        qapi_event_send_virtqueue_error(ds->id, ds->canonical_path, ind,
+                                        VIRTQUEUE_ERROR_VHOST_VRING_ERR,
+                                        "vhost reported failure through vring "
+                                        "error fd");
     }
 }

diff --git a/monitor/monitor.c b/monitor/monitor.c
index 941f87815a..cb1ee31156 100644
--- a/monitor/monitor.c
+++ b/monitor/monitor.c
@@ -313,6 +313,7 @@ static MonitorQAPIEventConf monitor_qapi_event_conf[QAPI_EVENT__MAX] = {
     [QAPI_EVENT_BALLOON_CHANGE]            = { 1000 * SCALE_MS },
     [QAPI_EVENT_QUORUM_REPORT_BAD]         = { 1000 * SCALE_MS },
     [QAPI_EVENT_QUORUM_FAILURE]            = { 1000 * SCALE_MS },
+    [QAPI_EVENT_VIRTQUEUE_ERROR]           = { 1000 * SCALE_MS },
     [QAPI_EVENT_VSERPORT_CHANGE]           = { 1000 * SCALE_MS },
     [QAPI_EVENT_MEMORY_DEVICE_SIZE_CHANGE] = { 1000 * SCALE_MS },
 };
@@ -497,6 +498,10 @@ static unsigned int qapi_event_throttle_hash(const void *key)
         hash += g_str_hash(qdict_get_str(evstate->data, "qom-path"));
     }
 
+    if (evstate->event == QAPI_EVENT_VIRTQUEUE_ERROR) {
+        hash += g_str_hash(qdict_get_str(evstate->data, "device"));
+    }
+
     return hash;
 }
 
@@ -524,6 +529,11 @@ static gboolean qapi_event_throttle_equal(const void *a, const void *b)
                        qdict_get_str(evb->data, "qom-path"));
     }
 
+    if (eva->event == QAPI_EVENT_VIRTQUEUE_ERROR) {
+        return !strcmp(qdict_get_str(eva->data, "device"),
+                       qdict_get_str(evb->data, "device"));
+    }
+
     return TRUE;
 }

diff --git a/qapi/qdev.json b/qapi/qdev.json
index 6bc5a733b8..199e21cae7 100644
--- a/qapi/qdev.json
+++ b/qapi/qdev.json
@@ -161,3 +161,28 @@
 ##
 { 'event': 'DEVICE_UNPLUG_GUEST_ERROR',
   'data': { '*device': 'str', 'path': 'str' } }
+
+##
+# @VirtqueueError:
+#
+# Since: 8.2
+##
+{ 'enum': 'VirtqueueError',
+  'data': [ 'vhost-vring-err' ] }
+
+##
+# @VIRTQUEUE_ERROR:
+#
+# Emitted when a device virtqueue fails at runtime.
+#
+# @device: the device's ID if it has one
+# @path: the device's QOM path
+# @virtqueue: virtqueue index
+# @error: error identifier
+# @description: human readable description
+#
+# Since: 8.2
+##
+{ 'event': 'VIRTQUEUE_ERROR',
+  'data': { '*device': 'str', 'path': 'str', 'virtqueue': 'int',
+            'error': 'VirtqueueError', 'description': 'str' } }
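For illustration only (not part of the patch): a management client subscribed to QMP events would receive VIRTQUEUE_ERROR shaped as in the schema above. The sketch below shows one way such a client might deduplicate alerts per device, loosely mirroring the "device"-keyed throttling the patch adds in monitor/monitor.c (the time window is omitted for brevity). All payload values and names here are hypothetical.

```python
# Hypothetical VIRTQUEUE_ERROR event payload, shaped like the QAPI
# schema in the patch ("device" is optional, "path" is the QOM path).
event = {
    "event": "VIRTQUEUE_ERROR",
    "timestamp": {"seconds": 1694539020, "microseconds": 512345},
    "data": {
        "device": "vhost-net0",
        "path": "/machine/peripheral/vhost-net0",
        "virtqueue": 2,
        "error": "vhost-vring-err",
        "description": "vhost reported failure through vring error fd",
    },
}

seen = {}

def should_alert(ev):
    # Key on (event name, device), the same members QEMU's throttle
    # hash/equal functions use for this event; alert once per key.
    key = (ev["event"], ev["data"].get("device"))
    if key in seen:
        return False
    seen[key] = ev
    return True

print(should_alert(event))   # first occurrence -> True
print(should_alert(event))   # duplicate -> False
```

A real client would expire `seen` entries after the throttle interval, since QEMU itself re-emits the event at most once per second per device.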
[PING][PATCH v5] qapi/qmp: Add timestamps to qmp command responses
Hi all!

It seems that this series has come through a number of reviews and got some "Reviewed-by" tags. Are there any flaws left to fix that prevent merging this series?

Thanks, Denis

On 26.04.2023 17:08, Denis Plotnikov wrote:

Add "start" & "end" time values to QMP command responses. These time values are added to let the QEMU management layer get the exact command execution time without any other time variance which might be brought in by other parts of the management layer or QEMU internals. This helps to look for problems proactively from the management layer side.

The management layer would be able to detect problem cases by calculating QMP command execution time:

1. execution_time_from_mgmt_perspective - execution_time_of_qmp_command > some_threshold
   This detects problems with the management layer or with internal QEMU QMP command dispatching.

2. current_qmp_command_execution_time > avg_qmp_command_execution_time
   This detects that a certain QMP command starts to execute longer than usual.

In both these cases a more thorough investigation of the root cause should be done using QEMU tracepoints, depending on the particular QMP command under investigation, or by other means. The timestamps help to avoid excessive log output when QEMU tracepoints are used to address similar cases.

Example of result:

./qemu/scripts/qmp/qmp-shell /tmp/qmp.socket
(QEMU) query-status
{"end": {"seconds": 1650367305, "microseconds": 831032},
 "start": {"seconds": 1650367305, "microseconds": 831012},
 "return": {"status": "running", "singlestep": false, "running": true}}

The response of the QMP command contains the start & end time of the QMP command processing.

Also, "start" & "end" timestamps are added to QEMU guest agent responses, as qemu-ga shares the same code for request dispatching.

Suggested-by: Andrey Ryabinin
Signed-off-by: Denis Plotnikov
Reviewed-by: Daniel P. Berrangé
---
v4->v5:
- use json-number instead of json-value for time values [Vladimir]
- use a new util function for timestamp printing [Vladimir]
v3->v4:
- rewrite commit message [Markus]
- use new fields description in doc [Markus]
- change type to int64_t [Markus]
- simplify tests [Markus]
v2->v3:
- fix typo "timestaps -> timestamps" [Marc-André]
v1->v2:
- rephrase doc descriptions [Daniel]
- add tests for qmp timestamps to qmp test and qga test [Daniel]
- adjust asserts in test-qmp-cmds according to the new number of returned keys
v0->v1:
- remove interface to control "start" and "end" time values: return timestamps unconditionally
- add description to qmp specification
- leave the same timestamp format in "seconds", "microseconds" to be consistent with events timestamp
- fix patch description
---
 docs/interop/qmp-spec.txt  | 28 ++++++++++++++++++++++----
 include/qapi/util.h        |  2 ++
 qapi/qapi-util.c           | 11 +++++++++++
 qapi/qmp-dispatch.c        | 11 +++++++++++
 qapi/qmp-event.c           |  6 +++++-
 tests/qtest/qmp-test.c     | 32 ++++++++++++++++++++++++++++++++
 tests/unit/test-qga.c      | 29 +++++++++++++++++++++++++++++
 tests/unit/test-qmp-cmds.c |  4 ++--
 8 files changed, 114 insertions(+), 9 deletions(-)

diff --git a/docs/interop/qmp-spec.txt b/docs/interop/qmp-spec.txt
index b0e8351d5b261..ed204b53373e5 100644
--- a/docs/interop/qmp-spec.txt
+++ b/docs/interop/qmp-spec.txt
@@ -158,7 +158,9 @@ responses that have an unknown "id" field.
 
 The format of a success response is:
 
-{ "return": json-value, "id": json-value }
+{ "return": json-value, "id": json-value,
+  "start": {"seconds": json-number, "microseconds": json-number},
+  "end": {"seconds": json-number, "microseconds": json-number} }
 
 Where,
 
@@ -169,13 +171,25 @@ The format of a success response is:
   command does not return data
 - The "id" member contains the transaction identification associated
   with the command execution if issued by the Client
+- The "start" member contains the exact time of when the server
+  started executing the command.  This excludes any time the
+  command request spent queued, after reading it off the wire.
+  It is a json-object with the number of seconds and microseconds
+  since the Unix epoch
+- The "end" member contains the exact time of when the server
+  finished executing the command.  This excludes any time the
+  command response spent queued, waiting to be sent on the wire.
+  It is a json-object with the number of seconds and microseconds
+  since the Unix epoch
 
 2.4.2 error
 -----------
 
 The format of an error response is:
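As an illustration of the spec text above (not part of the patch), a client can derive the server-side execution time directly from the "start" and "end" members. The sketch below uses the example response from the commit message.

```python
# Example query-status response from the commit message, with the
# "start"/"end" members the patch adds.
resp = {
    "start": {"seconds": 1650367305, "microseconds": 831012},
    "end": {"seconds": 1650367305, "microseconds": 831032},
    "return": {"status": "running", "singlestep": False, "running": True},
}

def to_usec(ts):
    # Flatten a {"seconds", "microseconds"} json-object into a single
    # microsecond count since the Unix epoch.
    return ts["seconds"] * 1_000_000 + ts["microseconds"]

exec_us = to_usec(resp["end"]) - to_usec(resp["start"])
print(exec_us)  # 20: microseconds the server spent executing the command
```

Because both timestamps are taken server-side, this figure excludes network latency and queueing in the client, which is exactly the property the spec wording emphasizes.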
[PING] [PATCH v4] qapi/qmp: Add timestamps to qmp command responses
On 10.01.2023 13:32, Denis Plotnikov wrote:

[ping]

On 01.11.2022 18:37, Denis Plotnikov wrote:

Add "start" & "end" time values to QMP command responses. These time values are added to let the QEMU management layer get the exact command execution time without any other time variance which might be brought in by other parts of the management layer or QEMU internals. This helps to look for problems proactively from the management layer side.

The management layer would be able to detect problem cases by calculating QMP command execution time:

1. execution_time_from_mgmt_perspective - execution_time_of_qmp_command > some_threshold
   This detects problems with the management layer or with internal QEMU QMP command dispatching.

2. current_qmp_command_execution_time > avg_qmp_command_execution_time
   This detects that a certain QMP command starts to execute longer than usual.

In both these cases a more thorough investigation of the root cause should be done using QEMU tracepoints, depending on the particular QMP command under investigation, or by other means. The timestamps help to avoid excessive log output when QEMU tracepoints are used to address similar cases.

Example of result:

./qemu/scripts/qmp/qmp-shell /tmp/qmp.socket
(QEMU) query-status
{"end": {"seconds": 1650367305, "microseconds": 831032},
 "start": {"seconds": 1650367305, "microseconds": 831012},
 "return": {"status": "running", "singlestep": false, "running": true}}

The response of the QMP command contains the start & end time of the QMP command processing.

Also, "start" & "end" timestamps are added to QEMU guest agent responses, as qemu-ga shares the same code for request dispatching.

Suggested-by: Andrey Ryabinin
Signed-off-by: Denis Plotnikov
Reviewed-by: Daniel P. Berrangé
---
v3->v4:
- rewrite commit message [Markus]
- use new fields description in doc [Markus]
- change type to int64_t [Markus]
- simplify tests [Markus]
v2->v3:
- fix typo "timestaps -> timestamps" [Marc-André]
v1->v2:
- rephrase doc descriptions [Daniel]
- add tests for qmp timestamps to qmp test and qga test [Daniel]
- adjust asserts in test-qmp-cmds according to the new number of returned keys
v0->v1:
- remove interface to control "start" and "end" time values: return timestamps unconditionally
- add description to qmp specification
- leave the same timestamp format in "seconds", "microseconds" to be consistent with events timestamp
- fix patch description

 docs/interop/qmp-spec.txt  | 28 ++++++++++++++++++++++----
 qapi/qmp-dispatch.c        | 18 ++++++++++++++++++
 tests/qtest/qmp-test.c     | 32 ++++++++++++++++++++++++++++++++
 tests/unit/test-qga.c      | 29 +++++++++++++++++++++++++++++
 tests/unit/test-qmp-cmds.c |  4 ++--
 5 files changed, 107 insertions(+), 4 deletions(-)

diff --git a/docs/interop/qmp-spec.txt b/docs/interop/qmp-spec.txt
index b0e8351d5b261..0dd8e716c02f0 100644
--- a/docs/interop/qmp-spec.txt
+++ b/docs/interop/qmp-spec.txt
@@ -158,7 +158,9 @@ responses that have an unknown "id" field.
 
 The format of a success response is:
 
-{ "return": json-value, "id": json-value }
+{ "return": json-value, "id": json-value,
+  "start": {"seconds": json-value, "microseconds": json-value},
+  "end": {"seconds": json-value, "microseconds": json-value} }
 
 Where,
 
@@ -169,13 +171,25 @@ The format of a success response is:
   command does not return data
 - The "id" member contains the transaction identification associated
   with the command execution if issued by the Client
+- The "start" member contains the exact time of when the server
+  started executing the command.  This excludes any time the
+  command request spent queued, after reading it off the wire.
+  It is a json-object with the number of seconds and microseconds
+  since the Unix epoch
+- The "end" member contains the exact time of when the server
+  finished executing the command.  This excludes any time the
+  command response spent queued, waiting to be sent on the wire.
+  It is a json-object with the number of seconds and microseconds
+  since the Unix epoch
 
 2.4.2 error
 -----------
 
 The format of an error response is:
 
-{ "error": { "class": json-string, "desc": json-string }, "id": json-value }
+{ "error": { "class": json-string, "desc": json-string }, "id": json-value
+  "start": {"seconds": json-value, "microseconds": json-value},
+  "end": {"seconds": json-value, "microseconds": json-value} }
Re: [PATCH v3] qapi/qmp: Add timestamps to qmp command responses
On 14.10.2022 16:19, Daniel P. Berrangé wrote:
On Fri, Oct 14, 2022 at 02:57:06PM +0200, Markus Armbruster wrote:
Daniel P. Berrangé writes:
On Fri, Oct 14, 2022 at 11:31:13AM +0200, Markus Armbruster wrote:
Daniel P. Berrangé writes:
On Thu, Oct 13, 2022 at 05:00:26PM +0200, Markus Armbruster wrote:
Denis Plotnikov writes:

Add "start" & "end" time values to qmp command responses.

Please spell it QMP. More of the same below.

These time values are added to let the qemu management layer get the exact command execution time without any other time variance which might be brought in by other parts of the management layer or qemu internals. This is particularly useful for management layer logging and later problem resolution.

I'm still having difficulties seeing the value add over existing tracepoints and logging. Can you tell me about a problem you cracked (or could have cracked) with the help of this?

Consider your QMP client is logging all commands and replies in its own logfile (libvirt can do this). Having these start/end timestamps included means the QMP client log is self-contained.

A QMP client can include client-side timestamps in its log. What value is being added by server-side timestamps?

According to the commit message, it's for getting "the exact command execution time without any other time variance which might be brought by other parts of management layer or qemu internals." Why is that useful? In particular, why is excluding network and QEMU queueing delays (inbound and outbound) useful?

Let's say some command normally runs in ~100ms, but occasionally runs in 2 seconds, and you want to understand why. A first step is understanding whether the command itself is slow at executing, or whether its execution has merely been delayed because some other aspect of QEMU delayed it. If the server timestamps show it was very fast, then that indicates delayed processing. Thus instead of debugging the slow command, I can think about what scenarios would be responsible for the delay. Perhaps a previous QMP command was very slow, or maybe there is simply a large volume of QMP commands backlogged, or some part of QEMU got blocked.

Another case would be a command that is normally fast, and sometimes is slower, but still relatively fast. The network and queueing time might be a significant enough proportion of the total time to obscure the slowdown. If you can eliminate the non-execution time, you can see the performance trends over time, spot the subtle slowdowns, and detect abnormal behaviour before it becomes too terrible.

This is troubleshooting. Asking for better troubleshooting tools is fair. However, the proposed timestamps provide much more limited insight than existing tracepoints.

For instance, tracepoints are absolutely great and let you get a hell of a lot more information, *provided* you are in a position to actually use tracepoints. This is, unfortunately, frequently not the case when supporting real-world production deployments.

Exactly!!! Thanks for pointing that out!

Bug reports from customers typically include little more than a log file they got from the mgmt client at the time the problem happened. The problem experienced may no longer exist, so asking them to run a tracepoint script is not possible. They may also be reluctant to actually run tracepoint scripts on a production system, or simply lack the ability to do so at all, due to constraints of the deployment environment. Logs from libvirt are something that are collected by default for many mgmt apps, or can be turned on by the user with minimal risk of disruption. Overall, there's a compelling desire to be proactive in collecting information ahead of time that might be useful in diagnosing future bug reports.

This is the main reason. When you encounter a problem, one of the first questions is "Was there something similar in the past?" Another question is how often it happens. With the timestamps, answering these questions becomes easier.

Another thing is that with the QMP command timestamps you can build a monitoring system which will report the cases when

  execution_time_from_mgmt_perspective - execution_time_of_qmp_command > some_threshold

which in turn proactively tells you about potential problems. And then you'll start using the QMP tracepoints (and other means) to figure out the real reason for the execution time variance.

Thanks, Denis

So it isn't an 'either / or' decision of QMP reply logs vs use of tracepoints; both are beneficial, with their own pros/cons.

With regards, Daniel
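The two detection rules discussed in this thread can be sketched as a small monitoring check. This is an illustrative sketch only; the function names, thresholds, and history handling are assumptions, not anything from the patch.

```python
def dispatch_delay_suspect(mgmt_elapsed_us, qmp_exec_us, threshold_us=500_000):
    # Rule 1: time the client observed minus the server-side execution
    # time ("end" - "start") is queueing/dispatch overhead; flag it when
    # it exceeds a threshold (hypothetical default: 0.5 s).
    return (mgmt_elapsed_us - qmp_exec_us) > threshold_us

def slower_than_usual(exec_us, history_us, factor=2.0):
    # Rule 2: flag a command whose server-side execution time exceeds a
    # multiple of its historical average (hypothetical factor of 2).
    avg = sum(history_us) / len(history_us)
    return exec_us > factor * avg

# Command took ~2 s end-to-end but only 100 us inside QEMU: suspect
# dispatching or the management layer, not the command itself.
print(dispatch_delay_suspect(2_000_000, 100))        # True

# Command executed in 300 us against a ~100 us historical average.
print(slower_than_usual(300, [100, 110, 90]))        # True
```

When either check fires, the next step would be the tracepoint-based investigation described above; the timestamps only tell you where to look.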
Re: [PATCH v3] qapi/qmp: Add timestamps to qmp command responses
On 13.10.2022 18:00, Markus Armbruster wrote: Denis Plotnikov writes: Add "start" & "end" time values to qmp command responses. Please spell it QMP. More of the same below. ok Can you tell me about a problem you cracked (or could have cracked) with the help of this? We have a management layer which interacts with qemu via qmp. When it issues a qmp command we measure execution time which takes to perform a certain qmp command. Some of that commands seems to execute longer that expected. In that case there is a question what part of command execution takes the majority of time. Is it the flaw in the management layer or in qemu qmp command scheduling or the qmp command execution itself? The timestaps being added help to exclude the qmp command execution time from the question. Also timestamps helps to get know the exact time when the command is started and ended and put that information to a system logs properly according to timestamps. "return": {"status": "running", "singlestep": false, "running": true}} The responce of the qmp command contains the start & end time of response ok the qmp command processing. Suggested-by: Andrey Ryabinin Signed-off-by: Denis Plotnikov Reviewed-by: Daniel P. Berrangé Please spell out that this affects both QMP and qemu-ga. ok command does not return data - The "id" member contains the transaction identification associated with the command execution if issued by the Client +- The "start" member contains the exact time of when the server + started executing the command. This excludes any time the + command request spent queued, after reading it off the wire. + It is a fixed json-object with time in seconds and microseconds + relative to the Unix Epoch (1 Jan 1970) What's a "fixed json-object"? Hmm, I guess you're copying from the description of event member "timestamp". That's right Let's go with "a json-object with the number of seconds and microseconds since the Unix epoch" everywhere. 
ok Make this int64_t, because that's what g_get_real_time() returns. Same for add_timestamps() parameters. ok, will fix the type everywhere +qobject_unref(resp); I'd be tempted to fold this into existing tests. Do you want me to put timestamp checking into an existing testcase? Thanks, Denis + qtest_quit(qts); } diff --git a/tests/unit/test-qga.c b/tests/unit/test-qga.c index b4e0a145737d1..18ec9bac3650e 100644 --- a/tests/unit/test-qga.c +++ b/tests/unit/test-qga.c @@ -217,6 +217,36 @@ static void test_qga_ping(gconstpointer fix) qmp_assert_no_error(ret); } +static void test_qga_timestamps(gconstpointer fix) +{ +QDict *start, *end; +uint64_t start_s, start_us, end_s, end_us, start_ts, end_ts; +const TestFixture *fixture = fix; +g_autoptr(QDict) ret = NULL; + +ret = qmp_fd(fixture->fd, "{'execute': 'guest-ping'}"); +g_assert_nonnull(ret); +qmp_assert_no_error(ret); + +start = qdict_get_qdict(ret, "start"); +g_assert(start); +end = qdict_get_qdict(ret, "end"); +g_assert(end); + +start_s = qdict_get_try_int(start, "seconds", 0); +g_assert(start_s); +start_us = qdict_get_try_int(start, "microseconds", 0); + +end_s = qdict_get_try_int(end, "seconds", 0); +g_assert(end_s); +end_us = qdict_get_try_int(end, "microseconds", 0); + +start_ts = (start_s * G_USEC_PER_SEC) + start_us; +end_ts = (end_s * G_USEC_PER_SEC) + end_us; + +g_assert(end_ts > start_ts); +} + static void test_qga_id(gconstpointer fix) { const TestFixture *fixture = fix; @@ -948,6 +978,7 @@ int main(int argc, char **argv) g_test_add_data_func("/qga/sync-delimited", , test_qga_sync_delimited); g_test_add_data_func("/qga/sync", , test_qga_sync); g_test_add_data_func("/qga/ping", , test_qga_ping); +g_test_add_data_func("/qga/timestamps", , test_qga_timestamps); g_test_add_data_func("/qga/info", , test_qga_info); g_test_add_data_func("/qga/network-get-interfaces", , test_qga_network_get_interfaces); diff --git a/tests/unit/test-qmp-cmds.c b/tests/unit/test-qmp-cmds.c index 6085c099950b5..54d63bb8e346f 100644 --- 
a/tests/unit/test-qmp-cmds.c +++ b/tests/unit/test-qmp-cmds.c @@ -154,7 +154,7 @@ static QObject *do_qmp_dispatch(bool allow_oob, const char *template, ...) g_assert(resp); ret = qdict_get(resp, "return"); g_assert(ret); -g_assert(qdict_size(resp) == 1); +g_assert(qdict_size(resp) == 3); qobject_ref(ret); qobject_unref(resp); @@ -181,7 +181,7 @@ static void do_qmp_dispatch_error(bool allow_oob, ErrorClass cls, ==, QapiErrorClass_str(cls)); g_assert(qdict_get_try_str(error, "desc")); g_assert(qdict_size(error) == 2); -g_assert(qdict_size(resp) == 1); +g_assert(qdict_size(resp) == 3); qobject_unref(resp); qobject_unref(req);
[PATCH v3] qapi/qmp: Add timestamps to qmp command responses
Add "start" & "end" time values to qmp command responses. These time values are added to let the qemu management layer get the exact command execution time without any other time variance which might be brought by other parts of the management layer or qemu internals. This is particularly useful for management layer logging and for resolving problems later. Example of result: ./qemu/scripts/qmp/qmp-shell /tmp/qmp.socket (QEMU) query-status {"end": {"seconds": 1650367305, "microseconds": 831032}, "start": {"seconds": 1650367305, "microseconds": 831012}, "return": {"status": "running", "singlestep": false, "running": true}} The response of the qmp command contains the start & end time of the qmp command processing. Suggested-by: Andrey Ryabinin Signed-off-by: Denis Plotnikov Reviewed-by: Daniel P. Berrangé --- v0->v1: - remove interface to control "start" and "end" time values: return timestamps unconditionally - add description to qmp specification - leave the same timestamp format in "seconds", "microseconds" to be consistent with events timestamp - fix patch description v1->v2: - rephrase doc descriptions [Daniel] - add tests for qmp timestamps to qmp test and qga test [Daniel] - adjust asserts in test-qmp-cmds according to the new number of returning keys v2->v3: - fix typo "timestaps -> timestamps" [Marc-André] docs/interop/qmp-spec.txt | 28 ++-- qapi/qmp-dispatch.c| 18 ++ tests/qtest/qmp-test.c | 34 ++ tests/unit/test-qga.c | 31 +++ tests/unit/test-qmp-cmds.c | 4 ++-- 5 files changed, 111 insertions(+), 4 deletions(-) diff --git a/docs/interop/qmp-spec.txt b/docs/interop/qmp-spec.txt index b0e8351d5b261..2e0b7de0c4dc7 100644 --- a/docs/interop/qmp-spec.txt +++ b/docs/interop/qmp-spec.txt @@ -158,7 +158,9 @@ responses that have an unknown "id" field. 
The format of a success response is: -{ "return": json-value, "id": json-value } +{ "return": json-value, "id": json-value, + "start": {"seconds": json-value, "microseconds": json-value}, + "end": {"seconds": json-value, "microseconds": json-value} } Where, @@ -169,13 +171,25 @@ The format of a success response is: command does not return data - The "id" member contains the transaction identification associated with the command execution if issued by the Client +- The "start" member contains the exact time of when the server + started executing the command. This excludes any time the + command request spent queued, after reading it off the wire. + It is a fixed json-object with time in seconds and microseconds + relative to the Unix Epoch (1 Jan 1970) +- The "end" member contains the exact time of when the server + finished executing the command. This excludes any time the + command response spent queued, waiting to be sent on the wire. + It is a fixed json-object with time in seconds and microseconds + relative to the Unix Epoch (1 Jan 1970) 2.4.2 error --- The format of an error response is: -{ "error": { "class": json-string, "desc": json-string }, "id": json-value } +{ "error": { "class": json-string, "desc": json-string }, "id": json-value + "start": {"seconds": json-value, "microseconds": json-value}, + "end": {"seconds": json-value, "microseconds": json-value} } Where, @@ -184,6 +198,16 @@ The format of an error response is: not attempt to parse this message. - The "id" member contains the transaction identification associated with the command execution if issued by the Client +- The "start" member contains the exact time of when the server + started executing the command. This excludes any time the + command request spent queued, after reading it off the wire. 
+ It is a fixed json-object with time in seconds and microseconds + relative to the Unix Epoch (1 Jan 1970) +- The "end" member contains the exact time of when the server + finished executing the command. This excludes any time the + command response spent queued, waiting to be sent on the wire. + It is a fixed json-object with time in seconds and microseconds + relative to the Unix Epoch (1 Jan 1970) NOTE: Some errors can occur before the Server is able to read the "id" member, in these cases the "id" member will not be part of the error response, even diff --git a/qapi/qmp-dispatch.c b/qapi/qmp-dispatch.c index 0990873ec8ec1..fce8
[PATCH v2] qapi/qmp: Add timestamps to qmp command responses
Add "start" & "end" time values to qmp command responses. These time values are added to let the qemu management layer get the exact command execution time without any other time variance which might be brought by other parts of the management layer or qemu internals. This is particularly useful for management layer logging and for resolving problems later. Example of result: ./qemu/scripts/qmp/qmp-shell /tmp/qmp.socket (QEMU) query-status {"end": {"seconds": 1650367305, "microseconds": 831032}, "start": {"seconds": 1650367305, "microseconds": 831012}, "return": {"status": "running", "singlestep": false, "running": true}} The response of the qmp command contains the start & end time of the qmp command processing. Suggested-by: Andrey Ryabinin Signed-off-by: Denis Plotnikov --- v0->v1: - remove interface to control "start" and "end" time values: return timestamps unconditionally - add description to qmp specification - leave the same timestamp format in "seconds", "microseconds" to be consistent with events timestamp - fix patch description v1->v2: - rephrase doc descriptions [Daniel] - add tests for qmp timestamps to qmp test and qga test [Daniel] - adjust asserts in test-qmp-cmds according to the new number of returning keys docs/interop/qmp-spec.txt | 28 ++-- qapi/qmp-dispatch.c| 18 ++ tests/qtest/qmp-test.c | 34 ++ tests/unit/test-qga.c | 31 +++ tests/unit/test-qmp-cmds.c | 4 ++-- 5 files changed, 111 insertions(+), 4 deletions(-) diff --git a/docs/interop/qmp-spec.txt b/docs/interop/qmp-spec.txt index b0e8351d5b261..2e0b7de0c4dc7 100644 --- a/docs/interop/qmp-spec.txt +++ b/docs/interop/qmp-spec.txt @@ -158,7 +158,9 @@ responses that have an unknown "id" field. 
The format of a success response is: -{ "return": json-value, "id": json-value } +{ "return": json-value, "id": json-value, + "start": {"seconds": json-value, "microseconds": json-value}, + "end": {"seconds": json-value, "microseconds": json-value} } Where, @@ -169,13 +171,25 @@ The format of a success response is: command does not return data - The "id" member contains the transaction identification associated with the command execution if issued by the Client +- The "start" member contains the exact time of when the server + started executing the command. This excludes any time the + command request spent queued, after reading it off the wire. + It is a fixed json-object with time in seconds and microseconds + relative to the Unix Epoch (1 Jan 1970) +- The "end" member contains the exact time of when the server + finished executing the command. This excludes any time the + command response spent queued, waiting to be sent on the wire. + It is a fixed json-object with time in seconds and microseconds + relative to the Unix Epoch (1 Jan 1970) 2.4.2 error --- The format of an error response is: -{ "error": { "class": json-string, "desc": json-string }, "id": json-value } +{ "error": { "class": json-string, "desc": json-string }, "id": json-value + "start": {"seconds": json-value, "microseconds": json-value}, + "end": {"seconds": json-value, "microseconds": json-value} } Where, @@ -184,6 +198,16 @@ The format of an error response is: not attempt to parse this message. - The "id" member contains the transaction identification associated with the command execution if issued by the Client +- The "start" member contains the exact time of when the server + started executing the command. This excludes any time the + command request spent queued, after reading it off the wire. 
+ It is a fixed json-object with time in seconds and microseconds + relative to the Unix Epoch (1 Jan 1970) +- The "end" member contains the exact time of when the server + finished executing the command. This excludes any time the + command response spent queued, waiting to be sent on the wire. + It is a fixed json-object with time in seconds and microseconds + relative to the Unix Epoch (1 Jan 1970) NOTE: Some errors can occur before the Server is able to read the "id" member, in these cases the "id" member will not be part of the error response, even diff --git a/qapi/qmp-dispatch.c b/qapi/qmp-dispatch.c index 0990873ec8ec1..fce87416f2128 100644 --- a/qapi/qmp-dispatch.c +++ b/qapi/qmp-dispatch.c @@ -130,6 +130,22 @@ static void do_qmp_disp
[PATCH v1] qapi/qmp: Add timestamps to qmp command responses
Add "start" & "end" time values to qmp command responses. These time values are added to let the qemu management layer get the exact command execution time without any other time variance which might be brought by other parts of the management layer or qemu internals. This is particularly useful for management layer logging and for resolving problems later. Example of result: ./qemu/scripts/qmp/qmp-shell /tmp/qmp.socket (QEMU) query-status {"end": {"seconds": 1650367305, "microseconds": 831032}, "start": {"seconds": 1650367305, "microseconds": 831012}, "return": {"status": "running", "singlestep": false, "running": true}} The response of the qmp command contains the start & end time of the qmp command processing. Suggested-by: Andrey Ryabinin Signed-off-by: Denis Plotnikov --- v0->v1: - remove interface to control "start" and "end" time values: return timestamps unconditionally - add description to qmp specification - leave the same timestamp format in "seconds", "microseconds" to be consistent with events timestamp - fix patch description docs/interop/qmp-spec.txt | 20 ++-- qapi/qmp-dispatch.c | 18 ++ 2 files changed, 36 insertions(+), 2 deletions(-) diff --git a/docs/interop/qmp-spec.txt b/docs/interop/qmp-spec.txt index b0e8351d5b261..d1cca8bc447ce 100644 --- a/docs/interop/qmp-spec.txt +++ b/docs/interop/qmp-spec.txt @@ -158,7 +158,9 @@ responses that have an unknown "id" field. The format of a success response is: -{ "return": json-value, "id": json-value } +{ "return": json-value, "id": json-value, + "start": {"seconds": json-value, "microseconds": json-value}, + "end": {"seconds": json-value, "microseconds": json-value} } Where, @@ -169,13 +171,21 @@ The format of a success response is: command does not return data - The "id" member contains the transaction identification associated with the command execution if issued by the Client +- The "start" member contains the exact time of when the command has been + started to be processed. 
It is a fixed json-object with time in + seconds and microseconds relative to the Unix Epoch (1 Jan 1970) +- The "end" member contains the exact time of when the command has been + finished to be processed. It is a fixed json-object with time in + seconds and microseconds relative to the Unix Epoch (1 Jan 1970) 2.4.2 error --- The format of an error response is: -{ "error": { "class": json-string, "desc": json-string }, "id": json-value } +{ "error": { "class": json-string, "desc": json-string }, "id": json-value + "start": {"seconds": json-value, "microseconds": json-value}, + "end": {"seconds": json-value, "microseconds": json-value} } Where, @@ -184,6 +194,12 @@ The format of an error response is: not attempt to parse this message. - The "id" member contains the transaction identification associated with the command execution if issued by the Client +- The "start" member contains the exact time of when the command has been + started to be processed. It is a fixed json-object with time in + seconds and microseconds relative to the Unix Epoch (1 Jan 1970) +- The "end" member contains the exact time of when the command has been + finished to be processed. 
It is a fixed json-object with time in + seconds and microseconds relative to the Unix Epoch (1 Jan 1970) NOTE: Some errors can occur before the Server is able to read the "id" member, in these cases the "id" member will not be part of the error response, even diff --git a/qapi/qmp-dispatch.c b/qapi/qmp-dispatch.c index 0990873ec8ec1..fce87416f2128 100644 --- a/qapi/qmp-dispatch.c +++ b/qapi/qmp-dispatch.c @@ -130,6 +130,22 @@ static void do_qmp_dispatch_bh(void *opaque) aio_co_wake(data->co); } +static void add_timestamps(QDict *qdict, uint64_t start_ms, uint64_t end_ms) +{ +QDict *start_dict, *end_dict; + +start_dict = qdict_new(); +qdict_put_int(start_dict, "seconds", start_ms / G_USEC_PER_SEC); +qdict_put_int(start_dict, "microseconds", start_ms % G_USEC_PER_SEC); + +end_dict = qdict_new(); +qdict_put_int(end_dict, "seconds", end_ms / G_USEC_PER_SEC); +qdict_put_int(end_dict, "microseconds", end_ms % G_USEC_PER_SEC); + +qdict_put_obj(qdict, "start", QOBJECT(start_dict)); +qdict_put_obj(qdict, "end", QOBJECT(end_dict)); +} + /* * Runs outside of coroutine context for OOB commands, but in coroutine * context for everything else. @@ -146,6 +162,7 @@ QDict *qmp_dispatch(const QmpCommandList *cmds, QObject *request, QObject *id; QObject *ret = NULL; QDict *rsp = NULL; +uint64_t ts_start = g_get_real_time(); dict = qobject_to(QDict, request); if (!dict) { @@ -270,5 +287,6 @@ out: qdict_put_obj(rsp, "id", qobject_ref(id)); } +add_timestamps(rsp, ts_start, g_get_real_time()); return rsp; } -- 2.25.1
Re: [patch v0] qapi/qmp: Add timestamps to qmp command responses.
On 27.09.2022 09:04, Markus Armbruster wrote: Daniel P. Berrangé writes: On Mon, Sep 26, 2022 at 12:59:40PM +0300, Denis Plotnikov wrote: Add "start" & "end" timestamps to qmp command responses. It's disabled by default, but can be enabled with 'timestamp=on' monitor's parameter, e.g.: -chardev socket,id=mon1,path=/tmp/qmp.socket,server=on,wait=off -mon chardev=mon1,mode=control,timestamp=on I'm not convinced a cmdline flag is the right approach here. I think it ought to be something defined by the QMP spec. The QMP spec is docs/interop/qmp-spec.txt. The feature needs to be defined there regardless of how we control it. ok, thanks for pointing that out The "QMP" greeting should report "timestamp" capabilities. The 'qmp_capabilities' command can be used to turn on this capability for all commands henceforth. Yes, this is how optional QMP protocol features should be controlled. Bonus: control is per connection, not just globally. As an optional extra, the 'execute' command could gain a parameter to allow this to be requested for only an individual command. Needs a use case. Alternatively we could say the overhead of adding the timestamps is small enough that we just add this unconditionally for everything hence, with no opt-in/opt-out. Yes, because the extension is backwards compatible. Maybe it is worth always sending the timestamps in the response, if that doesn't contradict anything and doesn't bring any unnecessary data overhead. On the other hand, turning it on via qmp capabilities seems to be a more flexible solution. Aside: qmp-spec.txt could be clearer on what that means. Example of result: ./qemu/scripts/qmp/qmp-shell /tmp/qmp.socket (QEMU) query-status {"end": {"seconds": 1650367305, "microseconds": 831032}, "start": {"seconds": 1650367305, "microseconds": 831012}, "return": {"status": "running", "singlestep": false, "running": true}} The response of the qmp command contains the start & end time of the qmp command processing. Seconds and microseconds since when? 
The update to qmp-spec.txt should tell. Why split the time into seconds and microseconds? If you use microseconds since the Unix epoch (1970-01-01 UTC), 64 bit unsigned will result in a year 586524 problem: $ date --date "@`echo '2^64/100' | bc`" Wed Jan 19 09:01:49 CET 586524 Even a mere 53 bits will last until 2255. This is just for convenience; maybe it's too much and a timestamp in msec is enough These times may be helpful for the management layer in understanding the actual timeline of qmp command processing. Can you explain the problem scenario in more detail. Yes, please, because: The mgmt app already knows when it sends the QMP command and knows when it gets the QMP reply. This covers the time the QMP was queued before processing (might be large if QMP is blocked on another slow command), the processing time, and the time any reply was queued before sending (ought to be small). So IIUC, the value these fields add is that they let the mgmt app extract only the command processing time, eliminating any variance due to queueing before/after. So the scenario is the following: we need a means to understand, from the management layer perspective, what the timeline of the command execution is. This is needed for problem resolving when a qmp command executes for too long from the management layer point of view. Specifically, the management layer sees the execution time as "management_layer_internal_routine_time" + "qemu_dispatching_time" + "qemu_qmp_command_execution_time". The suggested qmp command timestamps give "qemu_command_execution_time". The management layer calculates "management_layer_internal_routine_time" internally. Using those two things we can calculate "qemu_dispatching_time" and decide where the potential delays come from. This gives us a direction for further problem investigation. Suggested-by: Andrey Ryabinin Signed-off-by: Denis Plotnikov
[patch v0] qapi/qmp: Add timestamps to qmp command responses.
Add "start" & "end" timestamps to qmp command responses. It's disabled by default, but can be enabled with 'timestamp=on' monitor's parameter, e.g.: -chardev socket,id=mon1,path=/tmp/qmp.socket,server=on,wait=off -mon chardev=mon1,mode=control,timestamp=on Example of result: ./qemu/scripts/qmp/qmp-shell /tmp/qmp.socket (QEMU) query-status {"end": {"seconds": 1650367305, "microseconds": 831032}, "start": {"seconds": 1650367305, "microseconds": 831012}, "return": {"status": "running", "singlestep": false, "running": true}} The response of the qmp command contains the start & end time of the qmp command processing. These times may be helpful for the management layer in understanding the actual timeline of qmp command processing. Suggested-by: Andrey Ryabinin Signed-off-by: Denis Plotnikov --- include/monitor/monitor.h | 2 +- include/qapi/qmp/dispatch.h | 2 +- monitor/monitor-internal.h | 1 + monitor/monitor.c | 9 - monitor/qmp.c | 5 +++-- qapi/control.json | 3 +++ qapi/qmp-dispatch.c | 28 +++- qga/main.c | 2 +- stubs/monitor-core.c| 2 +- tests/unit/test-qmp-cmds.c | 6 +++--- 10 files changed, 49 insertions(+), 11 deletions(-) diff --git a/include/monitor/monitor.h b/include/monitor/monitor.h index a4b40e8391db4..2a18e9ee34bc2 100644 --- a/include/monitor/monitor.h +++ b/include/monitor/monitor.h @@ -19,7 +19,7 @@ bool monitor_cur_is_qmp(void); void monitor_init_globals(void); void monitor_init_globals_core(void); -void monitor_init_qmp(Chardev *chr, bool pretty, Error **errp); +void monitor_init_qmp(Chardev *chr, bool pretty, bool timestamp, Error **errp); void monitor_init_hmp(Chardev *chr, bool use_readline, Error **errp); int monitor_init(MonitorOptions *opts, bool allow_hmp, Error **errp); int monitor_init_opts(QemuOpts *opts, Error **errp); diff --git a/include/qapi/qmp/dispatch.h b/include/qapi/qmp/dispatch.h index 1e4240fd0dbc0..d07f5764271be 100644 --- a/include/qapi/qmp/dispatch.h +++ b/include/qapi/qmp/dispatch.h @@ -56,7 +56,7 @@ const char 
*qmp_command_name(const QmpCommand *cmd); bool qmp_has_success_response(const QmpCommand *cmd); QDict *qmp_error_response(Error *err); QDict *qmp_dispatch(const QmpCommandList *cmds, QObject *request, -bool allow_oob, Monitor *cur_mon); +bool allow_oob, bool timestamp, Monitor *cur_mon); bool qmp_is_oob(const QDict *dict); typedef void (*qmp_cmd_callback_fn)(const QmpCommand *cmd, void *opaque); diff --git a/monitor/monitor-internal.h b/monitor/monitor-internal.h index caa2e90ef22a4..69425a7bc8152 100644 --- a/monitor/monitor-internal.h +++ b/monitor/monitor-internal.h @@ -136,6 +136,7 @@ typedef struct { Monitor common; JSONMessageParser parser; bool pretty; +bool timestamp; /* * When a client connects, we're in capabilities negotiation mode. * @commands is _cap_negotiation_commands then. When command diff --git a/monitor/monitor.c b/monitor/monitor.c index 86949024f643a..85a0b6498dbc1 100644 --- a/monitor/monitor.c +++ b/monitor/monitor.c @@ -726,7 +726,7 @@ int monitor_init(MonitorOptions *opts, bool allow_hmp, Error **errp) switch (opts->mode) { case MONITOR_MODE_CONTROL: -monitor_init_qmp(chr, opts->pretty, _err); +monitor_init_qmp(chr, opts->pretty, opts->timestamp, _err); break; case MONITOR_MODE_READLINE: if (!allow_hmp) { @@ -737,6 +737,10 @@ int monitor_init(MonitorOptions *opts, bool allow_hmp, Error **errp) error_setg(errp, "'pretty' is not compatible with HMP monitors"); return -1; } +if (opts->timestamp) { +error_setg(errp, "'timestamp' is not compatible with HMP monitors"); +return -1; +} monitor_init_hmp(chr, true, _err); break; default: @@ -782,6 +786,9 @@ QemuOptsList qemu_mon_opts = { },{ .name = "pretty", .type = QEMU_OPT_BOOL, +},{ +.name = "timestamp", +.type = QEMU_OPT_BOOL, }, { /* end of list */ } }, diff --git a/monitor/qmp.c b/monitor/qmp.c index 092c527b6fc9c..fd487fee9f850 100644 --- a/monitor/qmp.c +++ b/monitor/qmp.c @@ -142,7 +142,7 @@ static void monitor_qmp_dispatch(MonitorQMP *mon, QObject *req) QDict *error; rsp = 
qmp_dispatch(mon->commands, req, qmp_oob_enabled(mon), - >common); + mon->timestamp, >common); if (mon->commands == _cap_negotiation_commands) { error = qdict_get_qdict(rsp, "error"); @@ -495,7 +495,7 @@ static void monitor_
[PING][Ping] [PATCH v1 0/2] vl: flush all task from rcu queue before exiting
ping ping On 19.11.2021 12:42, Denis Plotnikov wrote: Ping! On 15.11.2021 12:41, Denis Plotnikov wrote: v1 -> v0: * move monitor cleanup to the very end of qemu cleanup [Paolo] The goal is to notify the management layer about device destruction on qemu shutdown. Without this series the DEVICE_DELETED event may not be sent because of stuck tasks in the rcu thread. The rcu tasks may get stuck on qemu shutdown because the rcu thread does not always have enough time to run them. Denis Plotnikov (2): monitor: move monitor destruction to the very end of qemu cleanup vl: flush all task from rcu queue before exiting include/qemu/rcu.h | 1 + monitor/monitor.c | 6 ++ softmmu/runstate.c | 4 +++- util/rcu.c | 12 4 files changed, 22 insertions(+), 1 deletion(-)
[Ping] [PATCH v1 0/2] vl: flush all task from rcu queue before exiting
Ping! On 15.11.2021 12:41, Denis Plotnikov wrote: v1 -> v0: * move monitor cleanup to the very end of qemu cleanup [Paolo] The goal is to notify the management layer about device destruction on qemu shutdown. Without this series the DEVICE_DELETED event may not be sent because of stuck tasks in the rcu thread. The rcu tasks may get stuck on qemu shutdown because the rcu thread does not always have enough time to run them. Denis Plotnikov (2): monitor: move monitor destruction to the very end of qemu cleanup vl: flush all task from rcu queue before exiting include/qemu/rcu.h | 1 + monitor/monitor.c | 6 ++ softmmu/runstate.c | 4 +++- util/rcu.c | 12 4 files changed, 22 insertions(+), 1 deletion(-)
[PATCH v1 2/2] vl: flush all task from rcu queue before exiting
Device destruction may overlap with qemu shutdown. In this case a management layer that requested a device unplug and is waiting for the DEVICE_DELETED event may never get this event. This happens because device_finalize() may never be called on qemu shutdown for some devices using address_space_destroy(). The latter is called from the rcu thread. On qemu shutdown, not all rcu callbacks may be called because the rcu thread may not have enough time to converge before the qemu main thread exits. To resolve this issue this patch makes the rcu thread finish all its callbacks explicitly by calling a new rcu interface function right before the qemu main thread exits. Signed-off-by: Denis Plotnikov --- include/qemu/rcu.h | 1 + softmmu/runstate.c | 2 ++ util/rcu.c | 12 3 files changed, 15 insertions(+) diff --git a/include/qemu/rcu.h b/include/qemu/rcu.h index 515d327cf11c..f7fbdc3781e5 100644 --- a/include/qemu/rcu.h +++ b/include/qemu/rcu.h @@ -134,6 +134,7 @@ struct rcu_head { extern void call_rcu1(struct rcu_head *head, RCUCBFunc *func); extern void drain_call_rcu(void); +extern void flush_rcu(void); /* The operands of the minus operator must have the same type, * which must be the one that we specify in the cast. diff --git a/softmmu/runstate.c b/softmmu/runstate.c index 8d29dd2c00e2..3f833678f6eb 100644 --- a/softmmu/runstate.c +++ b/softmmu/runstate.c @@ -821,6 +821,8 @@ void qemu_cleanup(void) audio_cleanup(); qemu_chr_cleanup(); user_creatable_cleanup(); +/* finish all the tasks from rcu queue before exiting */ +flush_rcu(); monitor_cleanup(); /* TODO: unref root container, check all devices are ok */ } diff --git a/util/rcu.c b/util/rcu.c index 13ac0f75cb2a..f047f8ee8d16 100644 --- a/util/rcu.c +++ b/util/rcu.c @@ -348,6 +348,18 @@ void drain_call_rcu(void) } +/* + * This function drains the rcu queue until there are no tasks left + * and is aimed at cases when one needs to ensure that no work hangs + * in the rcu thread before proceeding, e.g. on qemu shutdown. 
+ */ +void flush_rcu(void) +{ +while (qatomic_read(_call_count) > 0) { +drain_call_rcu(); +} +} + void rcu_register_thread(void) { assert(rcu_reader.ctr == 0); -- 2.25.1
[PATCH v1 1/2] monitor: move monitor destruction to the very end of qemu cleanup
This is needed to keep sending DEVICE_DELETED events on qemu cleanup. The event may happen in the rcu thread, and we're going to flush the rcu queue explicitly before qemu exits in the next patch. So move the monitor destruction to the very end of qemu cleanup to be able to send all the events. Signed-off-by: Denis Plotnikov --- monitor/monitor.c | 6 ++ softmmu/runstate.c | 2 +- 2 files changed, 7 insertions(+), 1 deletion(-) diff --git a/monitor/monitor.c b/monitor/monitor.c index 21c7a68758f5..b04ae4850db2 100644 --- a/monitor/monitor.c +++ b/monitor/monitor.c @@ -605,11 +605,17 @@ void monitor_data_init(Monitor *mon, bool is_qmp, bool skip_flush, mon->outbuf = g_string_new(NULL); mon->skip_flush = skip_flush; mon->use_io_thread = use_io_thread; +/* + * take an extra ref to prevent monitor's chardev + * from being destroyed in qemu_chr_cleanup() + */ +object_ref(OBJECT(mon->chr.chr)); } void monitor_data_destroy(Monitor *mon) { g_free(mon->mon_cpu_path); +object_unref(OBJECT(mon->chr.chr)); qemu_chr_fe_deinit(>chr, false); if (monitor_is_qmp(mon)) { monitor_data_destroy_qmp(container_of(mon, MonitorQMP, common)); diff --git a/softmmu/runstate.c b/softmmu/runstate.c index 10d9b7365aa7..8d29dd2c00e2 100644 --- a/softmmu/runstate.c +++ b/softmmu/runstate.c @@ -819,8 +819,8 @@ void qemu_cleanup(void) tpm_cleanup(); net_cleanup(); audio_cleanup(); -monitor_cleanup(); qemu_chr_cleanup(); user_creatable_cleanup(); +monitor_cleanup(); /* TODO: unref root container, check all devices are ok */ } -- 2.25.1
[PATCH v1 0/2] vl: flush all task from rcu queue before exiting
v1 -> v0: * move monitor cleanup to the very end of qemu cleanup [Paolo] The goal is to notify the management layer about device destruction on qemu shutdown. Without this series the DEVICE_DELETED event may not be sent because of stuck tasks in the rcu thread. The rcu tasks may get stuck on qemu shutdown because the rcu thread does not always have enough time to run them. Denis Plotnikov (2): monitor: move monitor destruction to the very end of qemu cleanup vl: flush all task from rcu queue before exiting include/qemu/rcu.h | 1 + monitor/monitor.c | 6 ++ softmmu/runstate.c | 4 +++- util/rcu.c | 12 4 files changed, 22 insertions(+), 1 deletion(-) -- 2.25.1
Re: [Ping][PATCH v0] vl: flush all task from rcu queue before exiting
On 09.11.2021 20:46, Paolo Bonzini wrote: On 11/9/21 08:23, Denis Plotnikov wrote: Ping ping! Looks good, but can you explain why it's okay to call it before qemu_chr_cleanup() and user_creatable_cleanup()? I think a better solution to the ordering problem would be: qemu_chr_cleanup(); user_creatable_cleanup(); flush_rcu(); monitor_cleanup(); I agree, this looks better with something like this: diff --git a/chardev/char-fe.c b/chardev/char-fe.c index 7789f7be9c..f0c3ea5447 100644 --- a/chardev/char-fe.c +++ b/chardev/char-fe.c @@ -195,6 +195,7 @@ bool qemu_chr_fe_init(CharBackend *b, int tag = 0; if (s) { + object_ref(OBJECT(s)); if (CHARDEV_IS_MUX(s)) { MuxChardev *d = MUX_CHARDEV(s); @@ -241,6 +242,7 @@ void qemu_chr_fe_deinit(CharBackend *b, bool del) } else { object_unref(obj); } + object_unref(obj); } b->chr = NULL; } to keep the chardev alive between qemu_chr_cleanup() and monitor_cleanup(). But frankly speaking, I don't understand why we have to do ref/unref in the char-fe interface functions, instead of just ref/unref-ing the monitor's char device directly like this: diff --git a/monitor/monitor.c b/monitor/monitor.c index 21c7a68758f5..3692a8e15268 100644 --- a/monitor/monitor.c +++ b/monitor/monitor.c @@ -611,6 +611,7 @@ void monitor_data_destroy(Monitor *mon) { g_free(mon->mon_cpu_path); qemu_chr_fe_deinit(>chr, false); + object_unref(OBJECT(>chr)); if (monitor_is_qmp(mon)) { monitor_data_destroy_qmp(container_of(mon, MonitorQMP, common)); } else { @@ -737,6 +738,7 @@ int monitor_init(MonitorOptions *opts, bool allow_hmp, Error **errp) error_propagate(errp, local_err); return -1; } + object_ref(OBJECT(chr)); return 0; } Maybe this shows the intention better? Denis Paolo
[Ping][PATCH v0] vl: flush all task from rcu queue before exiting
Ping ping! On 02.11.2021 16:39, Denis Plotnikov wrote: Device destruction may overlap with qemu shutdown. In this case a management layer that requested a device unplug and is waiting for the DEVICE_DELETED event may never get this event. This happens because device_finalize() may never be called on qemu shutdown for some devices using address_space_destroy(). The latter is called from the rcu thread. On qemu shutdown, not all rcu callbacks may be called because the rcu thread may not have enough time to converge before the qemu main thread exits. To resolve this issue this patch makes the rcu thread finish all its callbacks explicitly by calling a new rcu interface function right before the qemu main thread exits. Signed-off-by: Denis Plotnikov --- include/qemu/rcu.h | 1 + softmmu/runstate.c | 3 +++ util/rcu.c | 12 3 files changed, 16 insertions(+) diff --git a/include/qemu/rcu.h b/include/qemu/rcu.h index 515d327cf11c..f7fbdc3781e5 100644 --- a/include/qemu/rcu.h +++ b/include/qemu/rcu.h @@ -134,6 +134,7 @@ struct rcu_head { extern void call_rcu1(struct rcu_head *head, RCUCBFunc *func); extern void drain_call_rcu(void); +extern void flush_rcu(void); /* The operands of the minus operator must have the same type, * which must be the one that we specify in the cast.
diff --git a/softmmu/runstate.c b/softmmu/runstate.c index 10d9b7365aa7..28f319a97a2b 100644 --- a/softmmu/runstate.c +++ b/softmmu/runstate.c @@ -822,5 +822,8 @@ void qemu_cleanup(void) monitor_cleanup(); qemu_chr_cleanup(); user_creatable_cleanup(); + +/* finish all the tasks from rcu queue before exiting */ +flush_rcu(); /* TODO: unref root container, check all devices are ok */ } diff --git a/util/rcu.c b/util/rcu.c index 13ac0f75cb2a..f047f8ee8d16 100644 --- a/util/rcu.c +++ b/util/rcu.c @@ -348,6 +348,18 @@ void drain_call_rcu(void) } +/* + * This function drains the rcu queue until there are no tasks left + * and is aimed at cases when one needs to ensure that no work hangs + * in the rcu thread before proceeding, e.g. on qemu shutdown. + */ +void flush_rcu(void) +{ +while (qatomic_read(&rcu_call_count) > 0) { +drain_call_rcu(); +} +} + void rcu_register_thread(void) { assert(rcu_reader.ctr == 0);
Re: [PATCH v0] vl: flush all task from rcu queue before exiting
On 02.11.2021 16:39, Denis Plotnikov wrote: Device destruction may overlap with qemu shutdown. In this case a management layer that requested a device unplug and is waiting for the DEVICE_DELETED event may never get this event. This happens because device_finalize() may never be called on qemu shutdown for some devices using address_space_destroy(). The latter is called from the rcu thread. On qemu shutdown, not all rcu callbacks may be called because the rcu thread may not have enough time to converge before the qemu main thread exits. To resolve this issue this patch makes the rcu thread finish all its callbacks explicitly by calling a new rcu interface function right before the qemu main thread exits. Signed-off-by: Denis Plotnikov --- include/qemu/rcu.h | 1 + softmmu/runstate.c | 3 +++ util/rcu.c | 12 3 files changed, 16 insertions(+) diff --git a/include/qemu/rcu.h b/include/qemu/rcu.h index 515d327cf11c..f7fbdc3781e5 100644 --- a/include/qemu/rcu.h +++ b/include/qemu/rcu.h @@ -134,6 +134,7 @@ struct rcu_head { extern void call_rcu1(struct rcu_head *head, RCUCBFunc *func); extern void drain_call_rcu(void); +extern void flush_rcu(void); /* The operands of the minus operator must have the same type, * which must be the one that we specify in the cast.
diff --git a/softmmu/runstate.c b/softmmu/runstate.c index 10d9b7365aa7..28f319a97a2b 100644 --- a/softmmu/runstate.c +++ b/softmmu/runstate.c @@ -822,5 +822,8 @@ void qemu_cleanup(void) actually, flush_rcu() should be here, before monitor_cleanup(), to send DEVICE_DELETED monitor_cleanup(); qemu_chr_cleanup(); user_creatable_cleanup(); + +/* finish all the tasks from rcu queue before exiting */ +flush_rcu(); /* TODO: unref root container, check all devices are ok */ } diff --git a/util/rcu.c b/util/rcu.c index 13ac0f75cb2a..f047f8ee8d16 100644 --- a/util/rcu.c +++ b/util/rcu.c @@ -348,6 +348,18 @@ void drain_call_rcu(void) } +/* + * This function drains the rcu queue until there are no tasks left + * and is aimed at cases when one needs to ensure that no work hangs + * in the rcu thread before proceeding, e.g. on qemu shutdown. + */ +void flush_rcu(void) +{ +while (qatomic_read(&rcu_call_count) > 0) { +drain_call_rcu(); +} +} + void rcu_register_thread(void) { assert(rcu_reader.ctr == 0);
[PATCH v0] vl: flush all task from rcu queue before exiting
Device destruction may overlap with qemu shutdown. In this case a management layer that requested a device unplug and is waiting for the DEVICE_DELETED event may never get this event. This happens because device_finalize() may never be called on qemu shutdown for some devices using address_space_destroy(). The latter is called from the rcu thread. On qemu shutdown, not all rcu callbacks may be called because the rcu thread may not have enough time to converge before the qemu main thread exits. To resolve this issue this patch makes the rcu thread finish all its callbacks explicitly by calling a new rcu interface function right before the qemu main thread exits. Signed-off-by: Denis Plotnikov --- include/qemu/rcu.h | 1 + softmmu/runstate.c | 3 +++ util/rcu.c | 12 3 files changed, 16 insertions(+) diff --git a/include/qemu/rcu.h b/include/qemu/rcu.h index 515d327cf11c..f7fbdc3781e5 100644 --- a/include/qemu/rcu.h +++ b/include/qemu/rcu.h @@ -134,6 +134,7 @@ struct rcu_head { extern void call_rcu1(struct rcu_head *head, RCUCBFunc *func); extern void drain_call_rcu(void); +extern void flush_rcu(void); /* The operands of the minus operator must have the same type, * which must be the one that we specify in the cast. diff --git a/softmmu/runstate.c b/softmmu/runstate.c index 10d9b7365aa7..28f319a97a2b 100644 --- a/softmmu/runstate.c +++ b/softmmu/runstate.c @@ -822,5 +822,8 @@ void qemu_cleanup(void) monitor_cleanup(); qemu_chr_cleanup(); user_creatable_cleanup(); + +/* finish all the tasks from rcu queue before exiting */ +flush_rcu(); /* TODO: unref root container, check all devices are ok */ } diff --git a/util/rcu.c b/util/rcu.c index 13ac0f75cb2a..f047f8ee8d16 100644 --- a/util/rcu.c +++ b/util/rcu.c @@ -348,6 +348,18 @@ void drain_call_rcu(void) } +/* + * This function drains the rcu queue until there are no tasks left + * and is aimed at cases when one needs to ensure that no work hangs + * in the rcu thread before proceeding, e.g. on qemu shutdown.
+ */ +void flush_rcu(void) +{ +while (qatomic_read(&rcu_call_count) > 0) { +drain_call_rcu(); +} +} + void rcu_register_thread(void) { assert(rcu_reader.ctr == 0); -- 2.25.1
[PATCH v0 1/2] vhost-user-blk: add a new vhost-user-virtio-blk type
The main reason for adding a new type is to make cross-device live migration between "virtio-blk" and "vhost-user-blk" devices possible in both directions. It might be useful for cases when a slow block layer should be replaced with a more performant one on a running VM without stopping it, i.e. with very low downtime comparable to that of migration. It's possible to achieve that for two reasons: 1. The VMStates of "virtio-blk" and "vhost-user-blk" are almost the same. They consist of the identical VMSTATE_VIRTIO_DEVICE and differ from each other only in the values of migration service fields. 2. The device driver used in the guest is the same: virtio-blk The new type uses the virtio-blk VMState instead of the vhost-user-blk specific VMState; it also implements migration save/load callbacks to be compatible with the migration stream produced by the "virtio-blk" device. Adding the new vhost-user-blk type instead of modifying the existing one is convenient. It makes it easy to distinguish the new virtio-blk-compatible vhost-user-blk device from the existing non-compatible one using qemu machinery without any other modifications. That gives all the variety of qemu device-related constraints out of the box.
Signed-off-by: Denis Plotnikov --- hw/block/vhost-user-blk.c | 63 ++ include/hw/virtio/vhost-user-blk.h | 2 + 2 files changed, 65 insertions(+) diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c index ba13cb87e520..877fe54e891f 100644 --- a/hw/block/vhost-user-blk.c +++ b/hw/block/vhost-user-blk.c @@ -30,6 +30,7 @@ #include "hw/virtio/virtio-access.h" #include "sysemu/sysemu.h" #include "sysemu/runstate.h" +#include "migration/qemu-file-types.h" #define REALIZE_CONNECTION_RETRIES 3 @@ -612,9 +613,71 @@ static const TypeInfo vhost_user_blk_info = { .class_init = vhost_user_blk_class_init, }; +/* + * this is the same as vmstate_virtio_blk + * we use it to allow virtio-blk <-> vhost-user-virtio-blk migration + */ +static const VMStateDescription vmstate_vhost_user_virtio_blk = { +.name = "virtio-blk", +.minimum_version_id = 2, +.version_id = 2, +.fields = (VMStateField[]) { +VMSTATE_VIRTIO_DEVICE, +VMSTATE_END_OF_LIST() +}, +}; + +static void vhost_user_virtio_blk_save(VirtIODevice *vdev, QEMUFile *f) +{ +/* + * put a zero byte in the stream to be compatible with virtio-blk + */ +qemu_put_sbyte(f, 0); +} + +static int vhost_user_virtio_blk_load(VirtIODevice *vdev, QEMUFile *f, + int version_id) +{ +if (qemu_get_sbyte(f)) { +/* + * on virtio-blk -> vhost-user-virtio-blk migration we don't expect + * to get any in-flight requests in the migration stream because + * we can't load them yet.
+ * TODO: consider putting those inflight requests to inflight region + */ +error_report("%s: can't load in-flight requests", + TYPE_VHOST_USER_VIRTIO_BLK); +return -EINVAL; +} + +return 0; +} + +static void vhost_user_virtio_blk_class_init(ObjectClass *klass, void *data) +{ +DeviceClass *dc = DEVICE_CLASS(klass); +VirtioDeviceClass *vdc = VIRTIO_DEVICE_CLASS(klass); + +/* override vmstate of vhost_user_blk */ +dc->vmsd = &vmstate_vhost_user_virtio_blk; + +/* adding callbacks to be compatible with virtio-blk migration stream */ +vdc->save = vhost_user_virtio_blk_save; +vdc->load = vhost_user_virtio_blk_load; +} + +static const TypeInfo vhost_user_virtio_blk_info = { +.name = TYPE_VHOST_USER_VIRTIO_BLK, +.parent = TYPE_VHOST_USER_BLK, +.instance_size = sizeof(VHostUserBlk), +/* instance_init is the same as in parent type */ +.class_init = vhost_user_virtio_blk_class_init, +}; + static void virtio_register_types(void) { type_register_static(&vhost_user_blk_info); +type_register_static(&vhost_user_virtio_blk_info); } type_init(virtio_register_types) diff --git a/include/hw/virtio/vhost-user-blk.h b/include/hw/virtio/vhost-user-blk.h index 7c91f15040eb..d81f18d22596 100644 --- a/include/hw/virtio/vhost-user-blk.h +++ b/include/hw/virtio/vhost-user-blk.h @@ -23,6 +23,8 @@ #include "qom/object.h" #define TYPE_VHOST_USER_BLK "vhost-user-blk" +#define TYPE_VHOST_USER_VIRTIO_BLK "vhost-user-virtio-blk" + OBJECT_DECLARE_SIMPLE_TYPE(VHostUserBlk, VHOST_USER_BLK) #define VHOST_USER_BLK_AUTO_NUM_QUEUES UINT16_MAX -- 2.25.1
[PATCH v0 2/2] vhost-user-blk-pci: add new pci device type to support vhost-user-virtio-blk
This allows the recently added vhost-user-virtio-blk to work via virtio-pci. This patch refactors the vhost-user-blk-pci object model to reuse the existing code. Signed-off-by: Denis Plotnikov --- hw/virtio/vhost-user-blk-pci.c | 43 +++--- 1 file changed, 40 insertions(+), 3 deletions(-) diff --git a/hw/virtio/vhost-user-blk-pci.c b/hw/virtio/vhost-user-blk-pci.c index 33b404d8a225..2f68296af22f 100644 --- a/hw/virtio/vhost-user-blk-pci.c +++ b/hw/virtio/vhost-user-blk-pci.c @@ -34,10 +34,18 @@ typedef struct VHostUserBlkPCI VHostUserBlkPCI; /* * vhost-user-blk-pci: This extends VirtioPCIProxy. */ +#define TYPE_VHOST_USER_BLK_PCI_ABSTRACT "vhost-user-blk-pci-abstract-base" +#define VHOST_USER_BLK_PCI_ABSTRACT(obj) \ +OBJECT_CHECK(VHostUserBlkPCI, (obj), TYPE_VHOST_USER_BLK_PCI_ABSTRACT) + #define TYPE_VHOST_USER_BLK_PCI "vhost-user-blk-pci-base" DECLARE_INSTANCE_CHECKER(VHostUserBlkPCI, VHOST_USER_BLK_PCI, TYPE_VHOST_USER_BLK_PCI) +#define TYPE_VHOST_USER_VIRTIO_BLK_PCI "vhost-user-virtio-blk-pci-base" +#define VHOST_USER_VIRTIO_BLK_PCI(obj) \ +OBJECT_CHECK(VHostUserBlkPCI, (obj), TYPE_VHOST_USER_VIRTIO_BLK_PCI) + struct VHostUserBlkPCI { VirtIOPCIProxy parent_obj; VHostUserBlk vdev; @@ -52,7 +60,7 @@ static Property vhost_user_blk_pci_properties[] = { static void vhost_user_blk_pci_realize(VirtIOPCIProxy *vpci_dev, Error **errp) { -VHostUserBlkPCI *dev = VHOST_USER_BLK_PCI(vpci_dev); +VHostUserBlkPCI *dev = VHOST_USER_BLK_PCI_ABSTRACT(vpci_dev); DeviceState *vdev = DEVICE(&dev->vdev); if (dev->vdev.num_queues == VHOST_USER_BLK_AUTO_NUM_QUEUES) { @@ -66,7 +74,8 @@ static void vhost_user_blk_pci_realize(VirtIOPCIProxy *vpci_dev, Error **errp) qdev_realize(vdev, BUS(&vpci_dev->bus), errp); } -static void vhost_user_blk_pci_class_init(ObjectClass *klass, void *data) +static void vhost_user_blk_pci_abstract_class_init(ObjectClass *klass, + void *data) { DeviceClass *dc = DEVICE_CLASS(klass); VirtioPCIClass *k = VIRTIO_PCI_CLASS(klass); @@ -81,6 +90,12 @@ static void
vhost_user_blk_pci_class_init(ObjectClass *klass, void *data) pcidev_k->class_id = PCI_CLASS_STORAGE_SCSI; } +static const VirtioPCIDeviceTypeInfo vhost_user_blk_pci_abstract_info = { +.base_name = TYPE_VHOST_USER_BLK_PCI_ABSTRACT, +.instance_size = sizeof(VHostUserBlkPCI), +.class_init = vhost_user_blk_pci_abstract_class_init, +}; + static void vhost_user_blk_pci_instance_init(Object *obj) { VHostUserBlkPCI *dev = VHOST_USER_BLK_PCI(obj); @@ -92,18 +107,40 @@ static void vhost_user_blk_pci_instance_init(Object *obj) } static const VirtioPCIDeviceTypeInfo vhost_user_blk_pci_info = { +.parent = TYPE_VHOST_USER_BLK_PCI_ABSTRACT, .base_name = TYPE_VHOST_USER_BLK_PCI, .generic_name= "vhost-user-blk-pci", .transitional_name = "vhost-user-blk-pci-transitional", .non_transitional_name = "vhost-user-blk-pci-non-transitional", .instance_size = sizeof(VHostUserBlkPCI), .instance_init = vhost_user_blk_pci_instance_init, -.class_init = vhost_user_blk_pci_class_init, +}; + +static void vhost_user_virtio_blk_pci_instance_init(Object *obj) +{ +VHostUserBlkPCI *dev = VHOST_USER_VIRTIO_BLK_PCI(obj); + +virtio_instance_init_common(obj, &dev->vdev, sizeof(dev->vdev), +TYPE_VHOST_USER_VIRTIO_BLK); +object_property_add_alias(obj, "bootindex", OBJECT(&dev->vdev), + "bootindex"); +} + +static const VirtioPCIDeviceTypeInfo vhost_user_virtio_blk_pci_info = { +.parent = TYPE_VHOST_USER_BLK_PCI_ABSTRACT, +.base_name = TYPE_VHOST_USER_VIRTIO_BLK_PCI, +.generic_name= "vhost-user-virtio-blk-pci", +.transitional_name = "vhost-user-virtio-blk-pci-transitional", +.non_transitional_name = "vhost-user-virtio-blk-pci-non-transitional", +.instance_size = sizeof(VHostUserBlkPCI), +.instance_init = vhost_user_virtio_blk_pci_instance_init, }; static void vhost_user_blk_pci_register(void) { +virtio_pci_types_register(&vhost_user_blk_pci_abstract_info); virtio_pci_types_register(&vhost_user_blk_pci_info); +virtio_pci_types_register(&vhost_user_virtio_blk_pci_info); } type_init(vhost_user_blk_pci_register) -- 2.25.1
[PATCH v0 0/2] virtio-blk and vhost-user-blk cross-device migration
It might be useful for cases when a slow block layer should be replaced with a more performant one on a running VM without stopping it, i.e. with very low downtime comparable to that of migration. It's possible to achieve that for two reasons: 1. The VMStates of "virtio-blk" and "vhost-user-blk" are almost the same. They consist of the identical VMSTATE_VIRTIO_DEVICE and differ from each other only in the values of migration service fields. 2. The device driver used in the guest is the same: virtio-blk In the series cross-migration is achieved by adding a new type. The new type uses the virtio-blk VMState instead of the vhost-user-blk specific VMState; it also implements migration save/load callbacks to be compatible with the migration stream produced by the "virtio-blk" device. Adding the new type instead of modifying the existing one is convenient. It makes it easy to distinguish the new virtio-blk-compatible vhost-user-blk device from the existing non-compatible one using qemu machinery without any other modifications. That gives all the variety of qemu device-related constraints out of the box. 0001: adds new type "vhost-user-virtio-blk" 0002: adds new type "vhost-user-virtio-blk-pci" Denis Plotnikov (2): vhost-user-blk: add a new vhost-user-virtio-blk type vhost-user-blk-pci: add new pci device type to support vhost-user-virtio-blk hw/block/vhost-user-blk.c | 63 ++ hw/virtio/vhost-user-blk-pci.c | 43 ++-- include/hw/virtio/vhost-user-blk.h | 2 + 3 files changed, 105 insertions(+), 3 deletions(-) -- 2.25.1
[PATCH v0] machine: remove non existent device tuning
Device "vhost-blk-device" doesn't exist in qemu yet, so its compatibility tuning is meaningless. The line with the non-existent device was introduced by mistake with the patch "1bf8a989a566b virtio: make seg_max virtqueue size dependent". The original patch was meant to set the "seg-max-adjust" property for the "vhost-scsi" device instead of "vhost-blk-device". So now we have the "seg-max-adjust" property enabled for all machine types on binaries that implement "seg-max-adjust". Replacing "vhost-blk-device" with "vhost-scsi" instead of removing the line seems to be a bad idea. Now, because of the mistake, "seg-max" for the "vhost-scsi" device is queue-size dependent even for machine types using "hw_compat_4_2" and older on new binaries. Thus, we now have two behaviors: * on old binaries (before the original patch) - seg_max is always 126 * on new binaries - seg_max is queue-size dependent Replacing the line would split the behavior of new binaries into two. This would make an investigation of related problems harder. To not make things worse, this patch just removes the line to keep the behavior as it is now. Signed-off-by: Denis Plotnikov --- hw/core/machine.c | 1 - 1 file changed, 1 deletion(-) diff --git a/hw/core/machine.c b/hw/core/machine.c index 067f42b528fd..d4f70cc01af0 100644 --- a/hw/core/machine.c +++ b/hw/core/machine.c @@ -87,7 +87,6 @@ GlobalProperty hw_compat_4_2[] = { { "virtio-blk-device", "x-enable-wce-if-config-wce", "off" }, { "virtio-blk-device", "seg-max-adjust", "off"}, { "virtio-scsi-device", "seg_max_adjust", "off"}, -{ "vhost-blk-device", "seg_max_adjust", "off"}, { "usb-host", "suppress-remote-wake", "off" }, { "usb-redir", "suppress-remote-wake", "off" }, { "qxl", "revision", "4" }, -- 2.25.1
Re: [PATCH v1 2/4] virtio: increase virtuqueue size for virtio-scsi and virtio-blk
On 09.09.2021 11:28, Stefano Garzarella wrote: On Wed, Sep 08, 2021 at 06:20:49PM +0300, Denis Plotnikov wrote: On 08.09.2021 16:22, Stefano Garzarella wrote: Message bounced, I use Denis's new email address. On Wed, Sep 08, 2021 at 03:17:16PM +0200, Stefano Garzarella wrote: Hi Denis, I just found this discussion since we still have the following line in hw/core/machine.c: { "vhost-blk-device", "seg_max_adjust", "off"} IIUC it was a typo, and I think we should fix it since in the future we can have `vhost-blk-device`. So, I think we have 2 options: 1. remove that line since for now it is useless 2. replace it with "vhost-scsi" I'm not sure which is best, what do you suggest? Thanks, Stefano Hi Stefano I prefer to just remove the line without replacing it. This will keep things exactly as they are now. Replacing it with "vhost-scsi" will affect seg_max (limit to 126) on newly created VMs with machine types using "hw_compat_4_2" and older. Now, because of the typo (error), their seg-max is queue-size dependent. I'm not sure if replacing the line may cause any problems, for example on migration: source (queue-size 256, seg max 254) -> destination (queue-size 256, seg max 126). But it will definitely introduce two different behaviors for VMs with "hw_compat_4_2" and older. So, I'd choose the lesser of two evils and keep things as they are now. Yep, makes sense. It was the same concern I had. Do you want to send a patch to remove it with this explanation? Yes, sure. I'll do it today. Denis Thanks for the clarification, Stefano
Re: [PATCH v1 2/4] virtio: increase virtuqueue size for virtio-scsi and virtio-blk
On 08.09.2021 16:22, Stefano Garzarella wrote: Message bounced, I use Denis's new email address. On Wed, Sep 08, 2021 at 03:17:16PM +0200, Stefano Garzarella wrote: Hi Denis, I just found this discussion since we still have the following line in hw/core/machine.c: { "vhost-blk-device", "seg_max_adjust", "off"} IIUC it was a typo, and I think we should fix it since in the future we can have `vhost-blk-device`. So, I think we have 2 options: 1. remove that line since for now it is useless 2. replace it with "vhost-scsi" I'm not sure which is best, what do you suggest? Thanks, Stefano Hi Stefano I prefer to just remove the line without replacing it. This will keep things exactly as they are now. Replacing it with "vhost-scsi" will affect seg_max (limit to 126) on newly created VMs with machine types using "hw_compat_4_2" and older. Now, because of the typo (error), their seg-max is queue-size dependent. I'm not sure if replacing the line may cause any problems, for example on migration: source (queue-size 256, seg max 254) -> destination (queue-size 256, seg max 126). But it will definitely introduce two different behaviors for VMs with "hw_compat_4_2" and older. So, I'd choose the lesser of two evils and keep things as they are now. Thanks! Denis On Fri, Feb 07, 2020 at 11:48:05AM +0300, Denis Plotnikov wrote: On 05.02.2020 14:19, Stefan Hajnoczi wrote: On Tue, Feb 04, 2020 at 12:59:04PM +0300, Denis Plotnikov wrote: On 30.01.2020 17:58, Stefan Hajnoczi wrote: On Wed, Jan 29, 2020 at 05:07:00PM +0300, Denis Plotnikov wrote: The goal is to reduce the number of requests issued by a guest on 1M reads/writes. This raises performance by up to 4% on that kind of disk access pattern. The maximum chunk size used for guest disk access is limited by the seg_max parameter, which represents the max number of pieces in the scatter-gather list in one guest disk request.
Since seg_max is virtqueue_size dependent, increasing the virtqueue size increases seg_max, which, in turn, increases the maximum size of data to be read/written from the guest disk. More details in the original problem statement: https://lists.gnu.org/archive/html/qemu-devel/2017-12/msg03721.html Suggested-by: Denis V. Lunev Signed-off-by: Denis Plotnikov --- hw/core/machine.c | 3 +++ include/hw/virtio/virtio.h | 2 +- 2 files changed, 4 insertions(+), 1 deletion(-) diff --git a/hw/core/machine.c b/hw/core/machine.c index 3e288bfceb..8bc401d8b7 100644 --- a/hw/core/machine.c +++ b/hw/core/machine.c @@ -28,6 +28,9 @@ #include "hw/mem/nvdimm.h" GlobalProperty hw_compat_4_2[] = { + { "virtio-blk-device", "queue-size", "128"}, + { "virtio-scsi-device", "virtqueue_size", "128"}, + { "vhost-blk-device", "virtqueue_size", "128"}, vhost-blk-device?! Who has this? It's not in qemu.git so please omit this line. ;-) So in this case the line: { "vhost-blk-device", "seg_max_adjust", "off"}, introduced by my patch: commit 1bf8a989a566b2ba41c197004ec2a02562a766a4 Author: Denis Plotnikov Date: Fri Dec 20 17:09:04 2019 +0300 virtio: make seg_max virtqueue size dependent is also wrong. It should be: { "vhost-scsi-device", "seg_max_adjust", "off"}, Am I right? It's just called "vhost-scsi": include/hw/virtio/vhost-scsi.h:#define TYPE_VHOST_SCSI "vhost-scsi" On the other hand, do you want to do this for the vhost-user-blk, vhost-user-scsi, and vhost-scsi devices that exist in qemu.git? Those devices would benefit from better performance too. After thinking about it for a while, I think we shouldn't extend queue sizes for vhost-user-blk, vhost-user-scsi and vhost-scsi. This is because increasing the queue sizes seems to be just useless for them: the whole point of increasing the queue sizes is to increase seg_max (it limits the max block request size from the guest).
For virtio-blk-device and virtio-scsi-device it makes sense, since they have the seg-max-adjust property which, if true, sets seg_max to virtqueue_size-2. vhost-scsi also has this property but it seems the property just doesn't affect anything (remove it?). Also vhost-user-blk, vhost-user-scsi and vhost-scsi don't do any seg_max settings. If I understand correctly, their backends are meant to be responsible for doing that. So, what about changing the queue sizes just for virtio-blk-device and virtio-scsi-device? Denis It seems to be so. We also have the test checking those settings: tests/acceptance/virtio_seg_max_adjust.py For now it checks virtio-scsi-pci and virtio-blk-pci. I'm going to extend it for the virtqueue size checking. If I change vhost-user-blk, vhost-user-scsi and vhost-scsi it's worth checking those devices too. But I don't know how
[PING][PING] [PATCH v4] vhost: make SET_VRING_ADDR, SET_FEATURES send replies
On 12.08.2021 11:04, Denis Plotnikov wrote: On 09.08.2021 13:48, Denis Plotnikov wrote: On vhost-user-blk migration, qemu normally sends a number of commands to enable logging if VHOST_USER_PROTOCOL_F_LOG_SHMFD is negotiated. Qemu sends VHOST_USER_SET_FEATURES to enable buffer logging and VHOST_USER_SET_VRING_ADDR per each started ring to enable "used ring" data logging. The issue is that qemu doesn't wait for a reply from the vhost daemon for these commands, which may result in races between qemu's expectation that logging has started and the actual start of logging in the vhost daemon. The race can appear as follows: on migration setup, qemu enables dirty page logging by sending VHOST_USER_SET_FEATURES. The command doesn't reach the vhost-user-blk daemon immediately and the daemon needs some time to turn the logging on internally. If qemu doesn't wait for a reply after sending the command, qemu may start migrating memory pages to a destination. At this time, the logging may not actually be turned on in the daemon, but some guest pages, which the daemon is about to write to, may have already been transferred without logging to the destination. Since the logging wasn't turned on, those pages won't be transferred again as dirty. So we may end up with corrupted data on the destination. The same scenario is applicable for "used ring" data logging, which is turned on with the VHOST_USER_SET_VRING_ADDR command. To resolve this issue, this patch makes qemu wait for the command result explicitly if VHOST_USER_PROTOCOL_F_REPLY_ACK is negotiated and logging is enabled.
Signed-off-by: Denis Plotnikov --- v3 -> v4: * join acked reply and get_features in enforce_reply [mst] * typos, rewording, cosmetic changes [mst] v2 -> v3: * send VHOST_USER_GET_FEATURES to flush out outstanding messages [mst] v1 -> v2: * send reply only when logging is enabled [mst] v0 -> v1: * send reply for SET_VRING_ADDR, SET_FEATURES only [mst] --- hw/virtio/vhost-user.c | 139 + 1 file changed, 98 insertions(+), 41 deletions(-) diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c index ee57abe04526..5bb9254acd21 100644 --- a/hw/virtio/vhost-user.c +++ b/hw/virtio/vhost-user.c @@ -1095,23 +1095,6 @@ static int vhost_user_set_mem_table(struct vhost_dev *dev, return 0; } -static int vhost_user_set_vring_addr(struct vhost_dev *dev, - struct vhost_vring_addr *addr) -{ - VhostUserMsg msg = { - .hdr.request = VHOST_USER_SET_VRING_ADDR, - .hdr.flags = VHOST_USER_VERSION, - .payload.addr = *addr, - .hdr.size = sizeof(msg.payload.addr), - }; - - if (vhost_user_write(dev, &msg, NULL, 0) < 0) { - return -1; - } - - return 0; -} - static int vhost_user_set_vring_endian(struct vhost_dev *dev, struct vhost_vring_state *ring) { @@ -1288,72 +1271,146 @@ static int vhost_user_set_vring_call(struct vhost_dev *dev, return vhost_set_vring_file(dev, VHOST_USER_SET_VRING_CALL, file); } -static int vhost_user_set_u64(struct vhost_dev *dev, int request, uint64_t u64) + +static int vhost_user_get_u64(struct vhost_dev *dev, int request, uint64_t *u64) { VhostUserMsg msg = { .hdr.request = request, .hdr.flags = VHOST_USER_VERSION, - .payload.u64 = u64, - .hdr.size = sizeof(msg.payload.u64), }; + if (vhost_user_one_time_request(request) && dev->vq_index != 0) { + return 0; + } + if (vhost_user_write(dev, &msg, NULL, 0) < 0) { + return -1; + } + if (vhost_user_read(dev, &msg) < 0) { + return -1; + } + + if (msg.hdr.request != request) { + error_report("Received unexpected msg type. 
Expected %d received %d", + request, msg.hdr.request); + return -1; + } + + if (msg.hdr.size != sizeof(msg.payload.u64)) { + error_report("Received bad msg size."); + return -1; + } + + *u64 = msg.payload.u64; + return 0; } -static int vhost_user_set_features(struct vhost_dev *dev, - uint64_t features) +static int vhost_user_get_features(struct vhost_dev *dev, uint64_t *features) { - return vhost_user_set_u64(dev, VHOST_USER_SET_FEATURES, features); + return vhost_user_get_u64(dev, VHOST_USER_GET_FEATURES, features); } -static int vhost_user_set_protocol_features(struct vhost_dev *dev, - uint64_t features) +static int enforce_reply(struct vhost_dev *dev, + const VhostUserMsg *msg) { - return vhost_user_set_u64(dev, VHOST_USER_SET_PROTOCOL_FEATURES, features); + uint64_t dummy; + + if (msg->hdr.flags & VHOST_USER_NEED_REPLY_MASK) { + return process_message_reply(dev, msg); + } + + /* + * We need to w
[PING] [PATCH v4] vhost: make SET_VRING_ADDR, SET_FEATURES send replies
On 09.08.2021 13:48, Denis Plotnikov wrote: On vhost-user-blk migration, qemu normally sends a number of commands to enable logging if VHOST_USER_PROTOCOL_F_LOG_SHMFD is negotiated. Qemu sends VHOST_USER_SET_FEATURES to enable buffer logging and VHOST_USER_SET_VRING_ADDR per each started ring to enable "used ring" data logging. The issue is that qemu doesn't wait for a reply from the vhost daemon for these commands, which may result in races between qemu's expectation that logging has started and the actual start of logging in the vhost daemon. The race can appear as follows: on migration setup, qemu enables dirty page logging by sending VHOST_USER_SET_FEATURES. The command doesn't reach the vhost-user-blk daemon immediately and the daemon needs some time to turn the logging on internally. If qemu doesn't wait for a reply after sending the command, qemu may start migrating memory pages to a destination. At this time, the logging may not actually be turned on in the daemon, but some guest pages, which the daemon is about to write to, may have already been transferred without logging to the destination. Since the logging wasn't turned on, those pages won't be transferred again as dirty. So we may end up with corrupted data on the destination. The same scenario is applicable for "used ring" data logging, which is turned on with the VHOST_USER_SET_VRING_ADDR command. To resolve this issue, this patch makes qemu wait for the command result explicitly if VHOST_USER_PROTOCOL_F_REPLY_ACK is negotiated and logging is enabled.
Signed-off-by: Denis Plotnikov --- v3 -> v4: * join acked reply and get_features in enforce_reply [mst] * typos, rewording, cosmetic changes [mst] v2 -> v3: * send VHOST_USER_GET_FEATURES to flush out outstanding messages [mst] v1 -> v2: * send reply only when logging is enabled [mst] v0 -> v1: * send reply for SET_VRING_ADDR, SET_FEATURES only [mst] --- hw/virtio/vhost-user.c | 139 + 1 file changed, 98 insertions(+), 41 deletions(-) diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c index ee57abe04526..5bb9254acd21 100644 --- a/hw/virtio/vhost-user.c +++ b/hw/virtio/vhost-user.c @@ -1095,23 +1095,6 @@ static int vhost_user_set_mem_table(struct vhost_dev *dev, return 0; } -static int vhost_user_set_vring_addr(struct vhost_dev *dev, - struct vhost_vring_addr *addr) -{ -VhostUserMsg msg = { -.hdr.request = VHOST_USER_SET_VRING_ADDR, -.hdr.flags = VHOST_USER_VERSION, -.payload.addr = *addr, -.hdr.size = sizeof(msg.payload.addr), -}; - -if (vhost_user_write(dev, &msg, NULL, 0) < 0) { -return -1; -} - -return 0; -} - static int vhost_user_set_vring_endian(struct vhost_dev *dev, struct vhost_vring_state *ring) { @@ -1288,72 +1271,146 @@ static int vhost_user_set_vring_call(struct vhost_dev *dev, return vhost_set_vring_file(dev, VHOST_USER_SET_VRING_CALL, file); } -static int vhost_user_set_u64(struct vhost_dev *dev, int request, uint64_t u64) + +static int vhost_user_get_u64(struct vhost_dev *dev, int request, uint64_t *u64) { VhostUserMsg msg = { .hdr.request = request, .hdr.flags = VHOST_USER_VERSION, -.payload.u64 = u64, -.hdr.size = sizeof(msg.payload.u64), }; +if (vhost_user_one_time_request(request) && dev->vq_index != 0) { +return 0; +} + if (vhost_user_write(dev, &msg, NULL, 0) < 0) { return -1; } +if (vhost_user_read(dev, &msg) < 0) { +return -1; +} + +if (msg.hdr.request != request) { +error_report("Received unexpected msg type. 
Expected %d received %d", + request, msg.hdr.request); +return -1; +} + +if (msg.hdr.size != sizeof(msg.payload.u64)) { +error_report("Received bad msg size."); +return -1; +} + +*u64 = msg.payload.u64; + return 0; } -static int vhost_user_set_features(struct vhost_dev *dev, - uint64_t features) +static int vhost_user_get_features(struct vhost_dev *dev, uint64_t *features) { -return vhost_user_set_u64(dev, VHOST_USER_SET_FEATURES, features); +return vhost_user_get_u64(dev, VHOST_USER_GET_FEATURES, features); } -static int vhost_user_set_protocol_features(struct vhost_dev *dev, -uint64_t features) +static int enforce_reply(struct vhost_dev *dev, + const VhostUserMsg *msg) { -return vhost_user_set_u64(dev, VHOST_USER_SET_PROTOCOL_FEATURES, features); +uint64_t dummy; + +if (msg->hdr.flags & VHOST_USER_NEED_REPLY_MASK) { +return process_message_reply(dev, msg); +} + + /* +* We need to wait for a reply but the backend does not +* support
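The gating condition the series converges on (send an acked command only when the backend supports REPLY_ACK and the command actually turns dirty logging on) can be sketched as a small self-contained model. Bit positions are taken from the vhost/vhost-user specifications; `need_acked_reply` is an illustrative helper, not a QEMU function:

```c
#include <stdbool.h>
#include <stdint.h>

/* VHOST_F_LOG_ALL is feature bit 26 in the vhost spec; REPLY_ACK is
 * vhost-user protocol feature bit 3. */
#define VHOST_F_LOG_ALL                 26
#define VHOST_USER_PROTOCOL_F_REPLY_ACK 3

/* Mirror of the patch's log_enabled(): is dirty logging being enabled
 * by this feature set? */
static bool log_enabled(uint64_t features)
{
    return !!(features & (1ULL << VHOST_F_LOG_ALL));
}

/* A command should carry the NEED_REPLY flag only when the backend
 * supports acked replies *and* the command switches dirty logging on. */
static bool need_acked_reply(uint64_t protocol_features, uint64_t features)
{
    bool reply_supported =
        !!(protocol_features & (1ULL << VHOST_USER_PROTOCOL_F_REPLY_ACK));

    return reply_supported && log_enabled(features);
}
```

This captures why plain feature negotiation (logging off) stays on the fast path while migration setup (logging on) pays the extra round trip.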
Re: [PATCH v3] vhost: make SET_VRING_ADDR, SET_FEATURES send replies
On 09.08.2021 12:34, Michael S. Tsirkin wrote: Looks good. Some cosmetics (all comments addressed in the already-sent v4): On Mon, Aug 09, 2021 at 12:03:30PM +0300, Denis Plotnikov wrote: On vhost-user-blk migration, qemu normally sends a number of commands to enable logging if VHOST_USER_PROTOCOL_F_LOG_SHMFD is negotiated. Qemu sends VHOST_USER_SET_FEATURES to enable buffers logging and VHOST_USER_SET_VRING_ADDR per each started ring to enable "used ring" data logging. The issue is that qemu doesn't wait for reply from the vhost daemon for these commands which may result in races between qemu expectation of logging starting and actual login starting in vhost daemon. The race can appear as follows: on migration setup, qemu enables dirty page logging by sending VHOST_USER_SET_FEATURES. The command doesn't arrive to a vhost-user-blk daemon immediately and the daemon needs some time to turn the logging on internally. If qemu doesn't wait for reply, after sending the command, qemu may start migrate memory pages to a destination. [mst: "start migrating"] At this time, the logging may not be actually turned on in the daemon but some guest pages, which the daemon is about to write to, may have already been transferred without logging to the destination. Since the logging wasn't turned on, those pages won't be transferred again as dirty. So we may end up with corrupted data on the destination. The same scenario is applicable for "used ring" data logging, which is turned on with VHOST_USER_SET_VRING_ADDR command. To resolve this issue, this patch makes qemu wait for the commands result [mst: "command result"] explicilty if VHOST_USER_PROTOCOL_F_REPLY_ACK is negotiated and logging enabled. 
typo Signed-off-by: Denis Plotnikov --- v2 -> v3: * send VHOST_USER_GET_FEATURES to flush out outstanding messages [mst] v1 -> v2: * send reply only when logging is enabled [mst] v0 -> v1: * send reply for SET_VRING_ADDR, SET_FEATURES only [mst] --- hw/virtio/vhost-user.c | 130 - 1 file changed, 89 insertions(+), 41 deletions(-) diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c index ee57abe04526..18f685df549f 100644 --- a/hw/virtio/vhost-user.c +++ b/hw/virtio/vhost-user.c @@ -1095,23 +1095,6 @@ static int vhost_user_set_mem_table(struct vhost_dev *dev, return 0; } -static int vhost_user_set_vring_addr(struct vhost_dev *dev, - struct vhost_vring_addr *addr) -{ -VhostUserMsg msg = { -.hdr.request = VHOST_USER_SET_VRING_ADDR, -.hdr.flags = VHOST_USER_VERSION, -.payload.addr = *addr, -.hdr.size = sizeof(msg.payload.addr), -}; - -if (vhost_user_write(dev, , NULL, 0) < 0) { -return -1; -} - -return 0; -} - static int vhost_user_set_vring_endian(struct vhost_dev *dev, struct vhost_vring_state *ring) { @@ -1288,72 +1271,137 @@ static int vhost_user_set_vring_call(struct vhost_dev *dev, return vhost_set_vring_file(dev, VHOST_USER_SET_VRING_CALL, file); } -static int vhost_user_set_u64(struct vhost_dev *dev, int request, uint64_t u64) + +static int vhost_user_get_u64(struct vhost_dev *dev, int request, uint64_t *u64) { VhostUserMsg msg = { .hdr.request = request, .hdr.flags = VHOST_USER_VERSION, -.payload.u64 = u64, -.hdr.size = sizeof(msg.payload.u64), }; +if (vhost_user_one_time_request(request) && dev->vq_index != 0) { +return 0; +} + if (vhost_user_write(dev, , NULL, 0) < 0) { return -1; } +if (vhost_user_read(dev, ) < 0) { +return -1; +} + +if (msg.hdr.request != request) { +error_report("Received unexpected msg type. 
Expected %d received %d", + request, msg.hdr.request); +return -1; +} + +if (msg.hdr.size != sizeof(msg.payload.u64)) { +error_report("Received bad msg size."); +return -1; +} + +*u64 = msg.payload.u64; + return 0; } -static int vhost_user_set_features(struct vhost_dev *dev, - uint64_t features) +static int vhost_user_get_features(struct vhost_dev *dev, uint64_t *features) { -return vhost_user_set_u64(dev, VHOST_USER_SET_FEATURES, features); +return vhost_user_get_u64(dev, VHOST_USER_GET_FEATURES, features); } -static int vhost_user_set_protocol_features(struct vhost_dev *dev, -uint64_t features) +static int enforce_reply(struct vhost_dev *dev) { -return vhost_user_set_u64(dev, VHOST_USER_SET_PROTOCOL_FEATURES, features); + /* +* we need a reply but can't get it from some command directly, +* so send the command which must send a reply to make sure +* the command we sent before is actually completed. better:
[PATCH v4] vhost: make SET_VRING_ADDR, SET_FEATURES send replies
On vhost-user-blk migration, qemu normally sends a number of commands to enable logging if VHOST_USER_PROTOCOL_F_LOG_SHMFD is negotiated. Qemu sends VHOST_USER_SET_FEATURES to enable buffer logging and VHOST_USER_SET_VRING_ADDR for each started ring to enable "used ring" data logging. The issue is that qemu doesn't wait for a reply from the vhost daemon for these commands, which may result in races between qemu's expectation of logging having started and logging actually starting in the vhost daemon. The race can appear as follows: on migration setup, qemu enables dirty page logging by sending VHOST_USER_SET_FEATURES. The command doesn't arrive at the vhost-user-blk daemon immediately, and the daemon needs some time to turn the logging on internally. If qemu doesn't wait for a reply after sending the command, qemu may start migrating memory pages to the destination. At this time, logging may not actually be turned on in the daemon, but some guest pages, which the daemon is about to write to, may have already been transferred to the destination without logging. Since logging wasn't turned on, those pages won't be transferred again as dirty. So we may end up with corrupted data on the destination. The same scenario applies to "used ring" data logging, which is turned on with the VHOST_USER_SET_VRING_ADDR command. To resolve this issue, this patch makes qemu wait for the command result explicitly if VHOST_USER_PROTOCOL_F_REPLY_ACK is negotiated and logging is enabled. 
Signed-off-by: Denis Plotnikov --- v3 -> v4: * join acked reply and get_features in enforce_reply [mst] * typos, rewording, cosmetic changes [mst] v2 -> v3: * send VHOST_USER_GET_FEATURES to flush out outstanding messages [mst] v1 -> v2: * send reply only when logging is enabled [mst] v0 -> v1: * send reply for SET_VRING_ADDR, SET_FEATURES only [mst] --- hw/virtio/vhost-user.c | 139 + 1 file changed, 98 insertions(+), 41 deletions(-) diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c index ee57abe04526..5bb9254acd21 100644 --- a/hw/virtio/vhost-user.c +++ b/hw/virtio/vhost-user.c @@ -1095,23 +1095,6 @@ static int vhost_user_set_mem_table(struct vhost_dev *dev, return 0; } -static int vhost_user_set_vring_addr(struct vhost_dev *dev, - struct vhost_vring_addr *addr) -{ -VhostUserMsg msg = { -.hdr.request = VHOST_USER_SET_VRING_ADDR, -.hdr.flags = VHOST_USER_VERSION, -.payload.addr = *addr, -.hdr.size = sizeof(msg.payload.addr), -}; - -if (vhost_user_write(dev, , NULL, 0) < 0) { -return -1; -} - -return 0; -} - static int vhost_user_set_vring_endian(struct vhost_dev *dev, struct vhost_vring_state *ring) { @@ -1288,72 +1271,146 @@ static int vhost_user_set_vring_call(struct vhost_dev *dev, return vhost_set_vring_file(dev, VHOST_USER_SET_VRING_CALL, file); } -static int vhost_user_set_u64(struct vhost_dev *dev, int request, uint64_t u64) + +static int vhost_user_get_u64(struct vhost_dev *dev, int request, uint64_t *u64) { VhostUserMsg msg = { .hdr.request = request, .hdr.flags = VHOST_USER_VERSION, -.payload.u64 = u64, -.hdr.size = sizeof(msg.payload.u64), }; +if (vhost_user_one_time_request(request) && dev->vq_index != 0) { +return 0; +} + if (vhost_user_write(dev, , NULL, 0) < 0) { return -1; } +if (vhost_user_read(dev, ) < 0) { +return -1; +} + +if (msg.hdr.request != request) { +error_report("Received unexpected msg type. 
Expected %d received %d", + request, msg.hdr.request); +return -1; +} + +if (msg.hdr.size != sizeof(msg.payload.u64)) { +error_report("Received bad msg size."); +return -1; +} + +*u64 = msg.payload.u64; + return 0; } -static int vhost_user_set_features(struct vhost_dev *dev, - uint64_t features) +static int vhost_user_get_features(struct vhost_dev *dev, uint64_t *features) { -return vhost_user_set_u64(dev, VHOST_USER_SET_FEATURES, features); +return vhost_user_get_u64(dev, VHOST_USER_GET_FEATURES, features); } -static int vhost_user_set_protocol_features(struct vhost_dev *dev, -uint64_t features) +static int enforce_reply(struct vhost_dev *dev, + const VhostUserMsg *msg) { -return vhost_user_set_u64(dev, VHOST_USER_SET_PROTOCOL_FEATURES, features); +uint64_t dummy; + +if (msg->hdr.flags & VHOST_USER_NEED_REPLY_MASK) { +return process_message_reply(dev, msg); +} + + /* +* We need to wait for a reply but the backend does not +* support replies for the command we just sent. +* Send VHOST_USER_GET_FEATURES which make
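The v4 enforce_reply() logic cut off above follows a two-way rule: if the message just sent carried the NEED_REPLY flag, consume its explicit ack; otherwise fall back to a VHOST_USER_GET_FEATURES round trip, which acts purely as a barrier flushing all earlier messages on the socket. A minimal model of that decision (the enum and helper are illustrative, not QEMU API; the flag bit matches the vhost-user header spec):

```c
#include <stdint.h>

/* Bit 3 of the vhost-user message header flags, per the spec. */
#define VHOST_USER_NEED_REPLY_MASK (1u << 3)

/* How must the caller wait after sending a message with these flags? */
enum wait_kind {
    WAIT_ACK,                  /* consume this message's explicit ack */
    WAIT_GET_FEATURES_BARRIER  /* GET_FEATURES round trip as a barrier */
};

static enum wait_kind enforce_reply_kind(uint32_t msg_flags)
{
    return (msg_flags & VHOST_USER_NEED_REPLY_MASK)
        ? WAIT_ACK
        : WAIT_GET_FEATURES_BARRIER;
}
```

Either branch guarantees that by the time the call returns, the backend has processed every message sent before it, which is what closes the migration race.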
[PATCH v3] vhost: make SET_VRING_ADDR, SET_FEATURES send replies
On vhost-user-blk migration, qemu normally sends a number of commands to enable logging if VHOST_USER_PROTOCOL_F_LOG_SHMFD is negotiated. Qemu sends VHOST_USER_SET_FEATURES to enable buffer logging and VHOST_USER_SET_VRING_ADDR for each started ring to enable "used ring" data logging. The issue is that qemu doesn't wait for a reply from the vhost daemon for these commands, which may result in races between qemu's expectation of logging having started and logging actually starting in the vhost daemon. The race can appear as follows: on migration setup, qemu enables dirty page logging by sending VHOST_USER_SET_FEATURES. The command doesn't arrive at the vhost-user-blk daemon immediately, and the daemon needs some time to turn the logging on internally. If qemu doesn't wait for a reply after sending the command, qemu may start migrating memory pages to the destination. At this time, logging may not actually be turned on in the daemon, but some guest pages, which the daemon is about to write to, may have already been transferred to the destination without logging. Since logging wasn't turned on, those pages won't be transferred again as dirty. So we may end up with corrupted data on the destination. The same scenario applies to "used ring" data logging, which is turned on with the VHOST_USER_SET_VRING_ADDR command. To resolve this issue, this patch makes qemu wait for the command result explicitly if VHOST_USER_PROTOCOL_F_REPLY_ACK is negotiated and logging is enabled. 
Signed-off-by: Denis Plotnikov --- v2 -> v3: * send VHOST_USER_GET_FEATURES to flush out outstanding messages [mst] v1 -> v2: * send reply only when logging is enabled [mst] v0 -> v1: * send reply for SET_VRING_ADDR, SET_FEATURES only [mst] --- hw/virtio/vhost-user.c | 130 - 1 file changed, 89 insertions(+), 41 deletions(-) diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c index ee57abe04526..18f685df549f 100644 --- a/hw/virtio/vhost-user.c +++ b/hw/virtio/vhost-user.c @@ -1095,23 +1095,6 @@ static int vhost_user_set_mem_table(struct vhost_dev *dev, return 0; } -static int vhost_user_set_vring_addr(struct vhost_dev *dev, - struct vhost_vring_addr *addr) -{ -VhostUserMsg msg = { -.hdr.request = VHOST_USER_SET_VRING_ADDR, -.hdr.flags = VHOST_USER_VERSION, -.payload.addr = *addr, -.hdr.size = sizeof(msg.payload.addr), -}; - -if (vhost_user_write(dev, , NULL, 0) < 0) { -return -1; -} - -return 0; -} - static int vhost_user_set_vring_endian(struct vhost_dev *dev, struct vhost_vring_state *ring) { @@ -1288,72 +1271,137 @@ static int vhost_user_set_vring_call(struct vhost_dev *dev, return vhost_set_vring_file(dev, VHOST_USER_SET_VRING_CALL, file); } -static int vhost_user_set_u64(struct vhost_dev *dev, int request, uint64_t u64) + +static int vhost_user_get_u64(struct vhost_dev *dev, int request, uint64_t *u64) { VhostUserMsg msg = { .hdr.request = request, .hdr.flags = VHOST_USER_VERSION, -.payload.u64 = u64, -.hdr.size = sizeof(msg.payload.u64), }; +if (vhost_user_one_time_request(request) && dev->vq_index != 0) { +return 0; +} + if (vhost_user_write(dev, , NULL, 0) < 0) { return -1; } +if (vhost_user_read(dev, ) < 0) { +return -1; +} + +if (msg.hdr.request != request) { +error_report("Received unexpected msg type. 
Expected %d received %d", + request, msg.hdr.request); +return -1; +} + +if (msg.hdr.size != sizeof(msg.payload.u64)) { +error_report("Received bad msg size."); +return -1; +} + +*u64 = msg.payload.u64; + return 0; } -static int vhost_user_set_features(struct vhost_dev *dev, - uint64_t features) +static int vhost_user_get_features(struct vhost_dev *dev, uint64_t *features) { -return vhost_user_set_u64(dev, VHOST_USER_SET_FEATURES, features); +return vhost_user_get_u64(dev, VHOST_USER_GET_FEATURES, features); } -static int vhost_user_set_protocol_features(struct vhost_dev *dev, -uint64_t features) +static int enforce_reply(struct vhost_dev *dev) { -return vhost_user_set_u64(dev, VHOST_USER_SET_PROTOCOL_FEATURES, features); + /* +* we need a reply but can't get it from some command directly, +* so send the command which must send a reply to make sure +* the command we sent before is actually completed. +*/ +uint64_t dummy; +return vhost_user_get_features(dev, ); } -static int vhost_user_get_u64(struct vhost_dev *dev, int request, uint64_t *u64) +static int vhost_user_set_vring_addr(struct vhost_dev *dev, + struct vhost_vring_add
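The reply validation added to vhost_user_get_u64() in the diffs above performs two sanity checks before trusting the payload: the reply's request type must echo the request that was sent, and the payload size must be exactly one u64. A self-contained sketch of those checks (the struct is a minimal stand-in for the real VhostUserMsg header, not the QEMU definition):

```c
#include <stdint.h>

/* Minimal stand-in for the vhost-user message header layout. */
struct vhost_user_hdr {
    uint32_t request;
    uint32_t flags;
    uint32_t size;
};

/* Mirror of the checks vhost_user_get_u64() performs on a reply:
 * returns 0 if the reply is well-formed, -1 otherwise. */
static int check_u64_reply(const struct vhost_user_hdr *hdr,
                           uint32_t expected_request)
{
    if (hdr->request != expected_request) {
        return -1;  /* "Received unexpected msg type" */
    }
    if (hdr->size != sizeof(uint64_t)) {
        return -1;  /* "Received bad msg size" */
    }
    return 0;
}
```

Rejecting mismatched replies matters here because the same socket carries both acks and data replies; a stale or reordered message must not be misread as the u64 result.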
Re: [PATCH v2] vhost: make SET_VRING_ADDR, SET_FEATURES send replies
On 03.08.2021 18:05, Michael S. Tsirkin wrote: On Mon, Jul 19, 2021 at 05:21:38PM +0300, Denis Plotnikov wrote: On vhost-user-blk migration, qemu normally sends a number of commands to enable logging if VHOST_USER_PROTOCOL_F_LOG_SHMFD is negotiated. Qemu sends VHOST_USER_SET_FEATURES to enable buffers logging and VHOST_USER_SET_VRING_ADDR per each started ring to enable "used ring" data logging. The issue is that qemu doesn't wait for reply from the vhost daemon for these commands which may result in races between qemu expectation of logging starting and actual login starting in vhost daemon. The race can appear as follows: on migration setup, qemu enables dirty page logging by sending VHOST_USER_SET_FEATURES. The command doesn't arrive to a vhost-user-blk daemon immediately and the daemon needs some time to turn the logging on internally. If qemu doesn't wait for reply, after sending the command, qemu may start migrate memory pages to a destination. At this time, the logging may not be actually turned on in the daemon but some guest pages, which the daemon is about to write to, may have already been transferred without logging to the destination. Since the logging wasn't turned on, those pages won't be transferred again as dirty. So we may end up with corrupted data on the destination. The same scenario is applicable for "used ring" data logging, which is turned on with VHOST_USER_SET_VRING_ADDR command. To resolve this issue, this patch makes qemu wait for the commands result explicilty if VHOST_USER_PROTOCOL_F_REPLY_ACK is negotiated and logging is enabled. 
Signed-off-by: Denis Plotnikov --- v1 -> v2: * send reply only when logging is enabled [mst] v0 -> v1: * send reply for SET_VRING_ADDR, SET_FEATURES only [mst] hw/virtio/vhost-user.c | 37 ++--- 1 file changed, 34 insertions(+), 3 deletions(-) diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c index ee57abe04526..133588b3961e 100644 --- a/hw/virtio/vhost-user.c +++ b/hw/virtio/vhost-user.c @@ -1095,6 +1095,11 @@ static int vhost_user_set_mem_table(struct vhost_dev *dev, return 0; } +static bool log_enabled(uint64_t features) +{ +return !!(features & (0x1ULL << VHOST_F_LOG_ALL)); +} + static int vhost_user_set_vring_addr(struct vhost_dev *dev, struct vhost_vring_addr *addr) { @@ -1105,10 +1110,21 @@ static int vhost_user_set_vring_addr(struct vhost_dev *dev, .hdr.size = sizeof(msg.payload.addr), }; +bool reply_supported = virtio_has_feature(dev->protocol_features, + VHOST_USER_PROTOCOL_F_REPLY_ACK); + +if (reply_supported && log_enabled(msg.hdr.flags)) { +msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK; +} + if (vhost_user_write(dev, , NULL, 0) < 0) { return -1; } +if (msg.hdr.flags & VHOST_USER_NEED_REPLY_MASK) { +return process_message_reply(dev, ); +} + return 0; } OK this is good, but the problem is that we then still have a race if VHOST_USER_PROTOCOL_F_REPLY_ACK is not set. Bummer. Let's send VHOST_USER_GET_FEATURES in this case to flush out outstanding messages? Ok, I've already sent v3 with related changes. 
@@ -1288,7 +1304,8 @@ static int vhost_user_set_vring_call(struct vhost_dev *dev, return vhost_set_vring_file(dev, VHOST_USER_SET_VRING_CALL, file); } -static int vhost_user_set_u64(struct vhost_dev *dev, int request, uint64_t u64) +static int vhost_user_set_u64(struct vhost_dev *dev, int request, uint64_t u64, + bool need_reply) { VhostUserMsg msg = { .hdr.request = request, @@ -1297,23 +1314,37 @@ static int vhost_user_set_u64(struct vhost_dev *dev, int request, uint64_t u64) .hdr.size = sizeof(msg.payload.u64), }; +if (need_reply) { +bool reply_supported = virtio_has_feature(dev->protocol_features, + VHOST_USER_PROTOCOL_F_REPLY_ACK); +if (reply_supported) { +msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK; +} +} + if (vhost_user_write(dev, , NULL, 0) < 0) { return -1; } +if (msg.hdr.flags & VHOST_USER_NEED_REPLY_MASK) { +return process_message_reply(dev, ); +} + return 0; } static int vhost_user_set_features(struct vhost_dev *dev, uint64_t features) { -return vhost_user_set_u64(dev, VHOST_USER_SET_FEATURES, features); +return vhost_user_set_u64(dev, VHOST_USER_SET_FEATURES, features, + log_enabled(features)); } static int vhost_user_set_protocol_features(struct vhost_dev *dev, uint64_t features) { -return vhost_user_set_u64(dev, VHOST_USER_SET_
[PING][PING][PATCH v2] vhost: make SET_VRING_ADDR, SET_FEATURES send replies
On 23.07.2021 12:59, Denis Plotnikov wrote: ping! On 19.07.2021 17:21, Denis Plotnikov wrote: On vhost-user-blk migration, qemu normally sends a number of commands to enable logging if VHOST_USER_PROTOCOL_F_LOG_SHMFD is negotiated. Qemu sends VHOST_USER_SET_FEATURES to enable buffer logging and VHOST_USER_SET_VRING_ADDR for each started ring to enable "used ring" data logging. The issue is that qemu doesn't wait for a reply from the vhost daemon for these commands, which may result in races between qemu's expectation of logging having started and logging actually starting in the vhost daemon. The race can appear as follows: on migration setup, qemu enables dirty page logging by sending VHOST_USER_SET_FEATURES. The command doesn't arrive at the vhost-user-blk daemon immediately, and the daemon needs some time to turn the logging on internally. If qemu doesn't wait for a reply after sending the command, qemu may start migrating memory pages to the destination. At this time, logging may not actually be turned on in the daemon, but some guest pages, which the daemon is about to write to, may have already been transferred to the destination without logging. Since logging wasn't turned on, those pages won't be transferred again as dirty. So we may end up with corrupted data on the destination. The same scenario applies to "used ring" data logging, which is turned on with the VHOST_USER_SET_VRING_ADDR command. To resolve this issue, this patch makes qemu wait for the command result explicitly if VHOST_USER_PROTOCOL_F_REPLY_ACK is negotiated and logging is enabled. 
Signed-off-by: Denis Plotnikov --- v1 -> v2: * send reply only when logging is enabled [mst] v0 -> v1: * send reply for SET_VRING_ADDR, SET_FEATURES only [mst] hw/virtio/vhost-user.c | 37 ++--- 1 file changed, 34 insertions(+), 3 deletions(-) diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c index ee57abe04526..133588b3961e 100644 --- a/hw/virtio/vhost-user.c +++ b/hw/virtio/vhost-user.c @@ -1095,6 +1095,11 @@ static int vhost_user_set_mem_table(struct vhost_dev *dev, return 0; } +static bool log_enabled(uint64_t features) +{ +return !!(features & (0x1ULL << VHOST_F_LOG_ALL)); +} + static int vhost_user_set_vring_addr(struct vhost_dev *dev, struct vhost_vring_addr *addr) { @@ -1105,10 +1110,21 @@ static int vhost_user_set_vring_addr(struct vhost_dev *dev, .hdr.size = sizeof(msg.payload.addr), }; +bool reply_supported = virtio_has_feature(dev->protocol_features, + VHOST_USER_PROTOCOL_F_REPLY_ACK); + +if (reply_supported && log_enabled(msg.hdr.flags)) { +msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK; +} + if (vhost_user_write(dev, , NULL, 0) < 0) { return -1; } +if (msg.hdr.flags & VHOST_USER_NEED_REPLY_MASK) { +return process_message_reply(dev, ); +} + return 0; } @@ -1288,7 +1304,8 @@ static int vhost_user_set_vring_call(struct vhost_dev *dev, return vhost_set_vring_file(dev, VHOST_USER_SET_VRING_CALL, file); } -static int vhost_user_set_u64(struct vhost_dev *dev, int request, uint64_t u64) +static int vhost_user_set_u64(struct vhost_dev *dev, int request, uint64_t u64, + bool need_reply) { VhostUserMsg msg = { .hdr.request = request, @@ -1297,23 +1314,37 @@ static int vhost_user_set_u64(struct vhost_dev *dev, int request, uint64_t u64) .hdr.size = sizeof(msg.payload.u64), }; +if (need_reply) { +bool reply_supported = virtio_has_feature(dev->protocol_features, + VHOST_USER_PROTOCOL_F_REPLY_ACK); +if (reply_supported) { +msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK; +} +} + if (vhost_user_write(dev, , NULL, 0) < 0) { return -1; } +if (msg.hdr.flags & 
VHOST_USER_NEED_REPLY_MASK) { +return process_message_reply(dev, &msg); } + return 0; } static int vhost_user_set_features(struct vhost_dev *dev, uint64_t features) { -return vhost_user_set_u64(dev, VHOST_USER_SET_FEATURES, features); +return vhost_user_set_u64(dev, VHOST_USER_SET_FEATURES, features, + log_enabled(features)); } static int vhost_user_set_protocol_features(struct vhost_dev *dev, uint64_t features) { -return vhost_user_set_u64(dev, VHOST_USER_SET_PROTOCOL_FEATURES, features); +return vhost_user_set_u64(dev, VHOST_USER_SET_PROTOCOL_FEATURES, features, + false); } static int vhost_user_get_u64(struct vhost_dev *dev, int request, uint64_t *u64)
[PING][PATCH v2] vhost: make SET_VRING_ADDR, SET_FEATURES send replies
ping! On 19.07.2021 17:21, Denis Plotnikov wrote: On vhost-user-blk migration, qemu normally sends a number of commands to enable logging if VHOST_USER_PROTOCOL_F_LOG_SHMFD is negotiated. Qemu sends VHOST_USER_SET_FEATURES to enable buffer logging and VHOST_USER_SET_VRING_ADDR for each started ring to enable "used ring" data logging. The issue is that qemu doesn't wait for a reply from the vhost daemon for these commands, which may result in races between qemu's expectation of logging having started and logging actually starting in the vhost daemon. The race can appear as follows: on migration setup, qemu enables dirty page logging by sending VHOST_USER_SET_FEATURES. The command doesn't arrive at the vhost-user-blk daemon immediately, and the daemon needs some time to turn the logging on internally. If qemu doesn't wait for a reply after sending the command, qemu may start migrating memory pages to the destination. At this time, logging may not actually be turned on in the daemon, but some guest pages, which the daemon is about to write to, may have already been transferred to the destination without logging. Since logging wasn't turned on, those pages won't be transferred again as dirty. So we may end up with corrupted data on the destination. The same scenario applies to "used ring" data logging, which is turned on with the VHOST_USER_SET_VRING_ADDR command. To resolve this issue, this patch makes qemu wait for the command result explicitly if VHOST_USER_PROTOCOL_F_REPLY_ACK is negotiated and logging is enabled. 
Signed-off-by: Denis Plotnikov --- v1 -> v2: * send reply only when logging is enabled [mst] v0 -> v1: * send reply for SET_VRING_ADDR, SET_FEATURES only [mst] hw/virtio/vhost-user.c | 37 ++--- 1 file changed, 34 insertions(+), 3 deletions(-) diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c index ee57abe04526..133588b3961e 100644 --- a/hw/virtio/vhost-user.c +++ b/hw/virtio/vhost-user.c @@ -1095,6 +1095,11 @@ static int vhost_user_set_mem_table(struct vhost_dev *dev, return 0; } +static bool log_enabled(uint64_t features) +{ +return !!(features & (0x1ULL << VHOST_F_LOG_ALL)); +} + static int vhost_user_set_vring_addr(struct vhost_dev *dev, struct vhost_vring_addr *addr) { @@ -1105,10 +1110,21 @@ static int vhost_user_set_vring_addr(struct vhost_dev *dev, .hdr.size = sizeof(msg.payload.addr), }; +bool reply_supported = virtio_has_feature(dev->protocol_features, + VHOST_USER_PROTOCOL_F_REPLY_ACK); + +if (reply_supported && log_enabled(msg.hdr.flags)) { +msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK; +} + if (vhost_user_write(dev, , NULL, 0) < 0) { return -1; } +if (msg.hdr.flags & VHOST_USER_NEED_REPLY_MASK) { +return process_message_reply(dev, ); +} + return 0; } @@ -1288,7 +1304,8 @@ static int vhost_user_set_vring_call(struct vhost_dev *dev, return vhost_set_vring_file(dev, VHOST_USER_SET_VRING_CALL, file); } -static int vhost_user_set_u64(struct vhost_dev *dev, int request, uint64_t u64) +static int vhost_user_set_u64(struct vhost_dev *dev, int request, uint64_t u64, + bool need_reply) { VhostUserMsg msg = { .hdr.request = request, @@ -1297,23 +1314,37 @@ static int vhost_user_set_u64(struct vhost_dev *dev, int request, uint64_t u64) .hdr.size = sizeof(msg.payload.u64), }; +if (need_reply) { +bool reply_supported = virtio_has_feature(dev->protocol_features, + VHOST_USER_PROTOCOL_F_REPLY_ACK); +if (reply_supported) { +msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK; +} +} + if (vhost_user_write(dev, , NULL, 0) < 0) { return -1; } +if (msg.hdr.flags & 
VHOST_USER_NEED_REPLY_MASK) { +return process_message_reply(dev, &msg); } + return 0; } static int vhost_user_set_features(struct vhost_dev *dev, uint64_t features) { -return vhost_user_set_u64(dev, VHOST_USER_SET_FEATURES, features); +return vhost_user_set_u64(dev, VHOST_USER_SET_FEATURES, features, + log_enabled(features)); } static int vhost_user_set_protocol_features(struct vhost_dev *dev, uint64_t features) { -return vhost_user_set_u64(dev, VHOST_USER_SET_PROTOCOL_FEATURES, features); +return vhost_user_set_u64(dev, VHOST_USER_SET_PROTOCOL_FEATURES, features, + false); } static int vhost_user_get_u64(struct vhost_dev *dev, int request, uint64_t *u64)
[PATCH v2] vhost: make SET_VRING_ADDR, SET_FEATURES send replies
On vhost-user-blk migration, qemu normally sends a number of commands to enable logging if VHOST_USER_PROTOCOL_F_LOG_SHMFD is negotiated. Qemu sends VHOST_USER_SET_FEATURES to enable buffer logging and VHOST_USER_SET_VRING_ADDR for each started ring to enable "used ring" data logging. The issue is that qemu doesn't wait for a reply from the vhost daemon for these commands, which may result in races between qemu's expectation of logging having started and logging actually starting in the vhost daemon. The race can appear as follows: on migration setup, qemu enables dirty page logging by sending VHOST_USER_SET_FEATURES. The command doesn't arrive at the vhost-user-blk daemon immediately, and the daemon needs some time to turn the logging on internally. If qemu doesn't wait for a reply after sending the command, qemu may start migrating memory pages to the destination. At this time, logging may not actually be turned on in the daemon, but some guest pages, which the daemon is about to write to, may have already been transferred to the destination without logging. Since logging wasn't turned on, those pages won't be transferred again as dirty. So we may end up with corrupted data on the destination. The same scenario applies to "used ring" data logging, which is turned on with the VHOST_USER_SET_VRING_ADDR command. To resolve this issue, this patch makes qemu wait for the command result explicitly if VHOST_USER_PROTOCOL_F_REPLY_ACK is negotiated and logging is enabled. 
Signed-off-by: Denis Plotnikov --- v1 -> v2: * send reply only when logging is enabled [mst] v0 -> v1: * send reply for SET_VRING_ADDR, SET_FEATURES only [mst] hw/virtio/vhost-user.c | 37 ++--- 1 file changed, 34 insertions(+), 3 deletions(-) diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c index ee57abe04526..133588b3961e 100644 --- a/hw/virtio/vhost-user.c +++ b/hw/virtio/vhost-user.c @@ -1095,6 +1095,11 @@ static int vhost_user_set_mem_table(struct vhost_dev *dev, return 0; } +static bool log_enabled(uint64_t features) +{ +return !!(features & (0x1ULL << VHOST_F_LOG_ALL)); +} + static int vhost_user_set_vring_addr(struct vhost_dev *dev, struct vhost_vring_addr *addr) { @@ -1105,10 +1110,21 @@ static int vhost_user_set_vring_addr(struct vhost_dev *dev, .hdr.size = sizeof(msg.payload.addr), }; +bool reply_supported = virtio_has_feature(dev->protocol_features, + VHOST_USER_PROTOCOL_F_REPLY_ACK); + +if (reply_supported && log_enabled(msg.hdr.flags)) { +msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK; +} + if (vhost_user_write(dev, , NULL, 0) < 0) { return -1; } +if (msg.hdr.flags & VHOST_USER_NEED_REPLY_MASK) { +return process_message_reply(dev, ); +} + return 0; } @@ -1288,7 +1304,8 @@ static int vhost_user_set_vring_call(struct vhost_dev *dev, return vhost_set_vring_file(dev, VHOST_USER_SET_VRING_CALL, file); } -static int vhost_user_set_u64(struct vhost_dev *dev, int request, uint64_t u64) +static int vhost_user_set_u64(struct vhost_dev *dev, int request, uint64_t u64, + bool need_reply) { VhostUserMsg msg = { .hdr.request = request, @@ -1297,23 +1314,37 @@ static int vhost_user_set_u64(struct vhost_dev *dev, int request, uint64_t u64) .hdr.size = sizeof(msg.payload.u64), }; +if (need_reply) { +bool reply_supported = virtio_has_feature(dev->protocol_features, + VHOST_USER_PROTOCOL_F_REPLY_ACK); +if (reply_supported) { +msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK; +} +} + if (vhost_user_write(dev, , NULL, 0) < 0) { return -1; } +if (msg.hdr.flags & 
VHOST_USER_NEED_REPLY_MASK) { +return process_message_reply(dev, &msg); } + return 0; } static int vhost_user_set_features(struct vhost_dev *dev, uint64_t features) { -return vhost_user_set_u64(dev, VHOST_USER_SET_FEATURES, features); +return vhost_user_set_u64(dev, VHOST_USER_SET_FEATURES, features, + log_enabled(features)); } static int vhost_user_set_protocol_features(struct vhost_dev *dev, uint64_t features) { -return vhost_user_set_u64(dev, VHOST_USER_SET_PROTOCOL_FEATURES, features); +return vhost_user_set_u64(dev, VHOST_USER_SET_PROTOCOL_FEATURES, features, + false); } static int vhost_user_get_u64(struct vhost_dev *dev, int request, uint64_t *u64) -- 2.25.1
Re: [PATCH v1] vhost: make SET_VRING_ADDR, SET_FEATURES send replies
On 08.07.2021 16:02, Denis Plotnikov wrote: On 08.07.2021 15:04, Michael S. Tsirkin wrote: On Thu, Jul 08, 2021 at 11:28:40AM +0300, Denis Plotnikov wrote: On vhost-user-blk migration, qemu normally sends a number of commands to enable logging if VHOST_USER_PROTOCOL_F_LOG_SHMFD is negotiated. Qemu sends VHOST_USER_SET_FEATURES to enable buffer logging and VHOST_USER_SET_VRING_ADDR for each started ring to enable "used ring" data logging. The issue is that qemu doesn't wait for a reply from the vhost daemon for these commands, which may result in races between qemu's expectation of logging having started and logging actually starting in the vhost daemon. The race can appear as follows: on migration setup, qemu enables dirty page logging by sending VHOST_USER_SET_FEATURES. The command doesn't arrive at the vhost-user-blk daemon immediately, and the daemon needs some time to turn the logging on internally. If qemu doesn't wait for a reply after sending the command, qemu may start migrating memory pages to the destination. At this time, logging may not actually be turned on in the daemon, but some guest pages, which the daemon is about to write to, may have already been transferred to the destination without logging. Since logging wasn't turned on, those pages won't be transferred again as dirty. So we may end up with corrupted data on the destination. The same scenario applies to "used ring" data logging, which is turned on with the VHOST_USER_SET_VRING_ADDR command. To resolve this issue, this patch makes qemu wait for the command result explicitly if VHOST_USER_PROTOCOL_F_REPLY_ACK is negotiated. 
Signed-off-by: Denis Plotnikov --- hw/virtio/vhost-user.c | 31 --- 1 file changed, 28 insertions(+), 3 deletions(-) diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c index ee57abe04526..15b5fac67cf3 100644 --- a/hw/virtio/vhost-user.c +++ b/hw/virtio/vhost-user.c @@ -1105,10 +1105,20 @@ static int vhost_user_set_vring_addr(struct vhost_dev *dev, .hdr.size = sizeof(msg.payload.addr), }; + bool reply_supported = virtio_has_feature(dev->protocol_features, + VHOST_USER_PROTOCOL_F_REPLY_ACK); + if (reply_supported) { + msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK; + } + if (vhost_user_write(dev, &msg, NULL, 0) < 0) { return -1; } + if (reply_supported) { + return process_message_reply(dev, &msg); + } + return 0; } Same - can we limit this to when logging is being enabled? I think it's possible, but do we really need the additional complexity? Are you concerned about delays on device initialization? Would the reply for the command introduce a significant device initialization delay? In my understanding, this is done rarely, on vhost-user device initialization, so maybe we can afford it to be a little bit longer. As for the migration case: in my understanding, most of the time a vhost-user migration should run with logging enabled, otherwise it's hard to guarantee that the memory migrates with consistent data. So here we shouldn't care too much about setup speed and should care more about data consistency. What do you think? Thanks! Denis please, let me know if my points above seem to be unreasonable and I'll send another version with reply sending only when logging is enabled. Thanks!
Denis @@ -1288,7 +1298,8 @@ static int vhost_user_set_vring_call(struct vhost_dev *dev, return vhost_set_vring_file(dev, VHOST_USER_SET_VRING_CALL, file); } -static int vhost_user_set_u64(struct vhost_dev *dev, int request, uint64_t u64) +static int vhost_user_set_u64(struct vhost_dev *dev, int request, uint64_t u64, + bool need_reply) { VhostUserMsg msg = { .hdr.request = request, @@ -1297,23 +1308,37 @@ static int vhost_user_set_u64(struct vhost_dev *dev, int request, uint64_t u64) .hdr.size = sizeof(msg.payload.u64), }; + if (need_reply) { + bool reply_supported = virtio_has_feature(dev->protocol_features, + VHOST_USER_PROTOCOL_F_REPLY_ACK); + if (reply_supported) { + msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK; + } + } + if (vhost_user_write(dev, , NULL, 0) < 0) { return -1; } + if (msg.hdr.flags & VHOST_USER_NEED_REPLY_MASK) { + return process_message_reply(dev, ); + } + return 0; } static int vhost_user_set_features(struct vhost_dev *dev, uint64_t features) { - return vhost_user_set_u64(dev, VHOST_USER_SET_FEATURES, features); + return vhost_user_set_u64(dev, VHOST_USER_SET_FEATURES, features, + true); } Same here. In fact, static int vhost_user_set_protocol_features(struct vhost_dev *dev, uint64_t features)
Re: [PATCH v1] vhost: make SET_VRING_ADDR, SET_FEATURES send replies
On 08.07.2021 15:04, Michael S. Tsirkin wrote: On Thu, Jul 08, 2021 at 11:28:40AM +0300, Denis Plotnikov wrote: On vhost-user-blk migration, qemu normally sends a number of commands to enable logging if VHOST_USER_PROTOCOL_F_LOG_SHMFD is negotiated. Qemu sends VHOST_USER_SET_FEATURES to enable buffers logging and VHOST_USER_SET_VRING_ADDR per each started ring to enable "used ring" data logging. The issue is that qemu doesn't wait for reply from the vhost daemon for these commands which may result in races between qemu expectation of logging starting and actual login starting in vhost daemon. The race can appear as follows: on migration setup, qemu enables dirty page logging by sending VHOST_USER_SET_FEATURES. The command doesn't arrive to a vhost-user-blk daemon immediately and the daemon needs some time to turn the logging on internally. If qemu doesn't wait for reply, after sending the command, qemu may start migrate memory pages to a destination. At this time, the logging may not be actually turned on in the daemon but some guest pages, which the daemon is about to write to, may have already been transferred without logging to the destination. Since the logging wasn't turned on, those pages won't be transferred again as dirty. So we may end up with corrupted data on the destination. The same scenario is applicable for "used ring" data logging, which is turned on with VHOST_USER_SET_VRING_ADDR command. To resolve this issue, this patch makes qemu wait for the commands result explicilty if VHOST_USER_PROTOCOL_F_REPLY_ACK is negotiated. 
Signed-off-by: Denis Plotnikov --- hw/virtio/vhost-user.c | 31 --- 1 file changed, 28 insertions(+), 3 deletions(-) diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c index ee57abe04526..15b5fac67cf3 100644 --- a/hw/virtio/vhost-user.c +++ b/hw/virtio/vhost-user.c @@ -1105,10 +1105,20 @@ static int vhost_user_set_vring_addr(struct vhost_dev *dev, .hdr.size = sizeof(msg.payload.addr), }; +bool reply_supported = virtio_has_feature(dev->protocol_features, + VHOST_USER_PROTOCOL_F_REPLY_ACK); +if (reply_supported) { +msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK; +} + if (vhost_user_write(dev, , NULL, 0) < 0) { return -1; } +if (reply_supported) { +return process_message_reply(dev, ); +} + return 0; } Same - can we limit this to when logging is being enabled? I think it's possible but do we really need some additional complexity? Do you bother about delays on device initialization? Would the reply for the command introduce significant device initialization time delay? In my understanding, this is done rarely on vhost-user device initialization. So, may be we can afford it to be a little bit longer? According to the migration case, in my understanding, major time the migration of vhost-user should be done with logging enabled. Otherwise it's hard to tell how to make sure that the memory migrates with consistent data. So here we shouldn't care too much about setup speed and should care more about data consistency. What do you think? Thanks! 
Denis @@ -1288,7 +1298,8 @@ static int vhost_user_set_vring_call(struct vhost_dev *dev, return vhost_set_vring_file(dev, VHOST_USER_SET_VRING_CALL, file); } -static int vhost_user_set_u64(struct vhost_dev *dev, int request, uint64_t u64) +static int vhost_user_set_u64(struct vhost_dev *dev, int request, uint64_t u64, + bool need_reply) { VhostUserMsg msg = { .hdr.request = request, @@ -1297,23 +1308,37 @@ static int vhost_user_set_u64(struct vhost_dev *dev, int request, uint64_t u64) .hdr.size = sizeof(msg.payload.u64), }; +if (need_reply) { +bool reply_supported = virtio_has_feature(dev->protocol_features, + VHOST_USER_PROTOCOL_F_REPLY_ACK); +if (reply_supported) { +msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK; +} +} + if (vhost_user_write(dev, , NULL, 0) < 0) { return -1; } +if (msg.hdr.flags & VHOST_USER_NEED_REPLY_MASK) { +return process_message_reply(dev, ); +} + return 0; } static int vhost_user_set_features(struct vhost_dev *dev, uint64_t features) { -return vhost_user_set_u64(dev, VHOST_USER_SET_FEATURES, features); +return vhost_user_set_u64(dev, VHOST_USER_SET_FEATURES, features, + true); } Same here. In fact, static int vhost_user_set_protocol_features(struct vhost_dev *dev, uint64_t features) { -return vhost_user_set_u64(dev, VHOST_USER_SET_PROTOCOL_FEATURES, features); +return vhost_user_set_u64(dev, VHOST_
[PATCH v1] vhost: make SET_VRING_ADDR, SET_FEATURES send replies
On vhost-user-blk migration, qemu normally sends a number of commands to enable logging if VHOST_USER_PROTOCOL_F_LOG_SHMFD is negotiated. Qemu sends VHOST_USER_SET_FEATURES to enable buffer logging and VHOST_USER_SET_VRING_ADDR for each started ring to enable "used ring" data logging. The issue is that qemu doesn't wait for a reply from the vhost daemon for these commands, which may result in races between qemu's expectation of logging starting and actual logging starting in the vhost daemon. The race can appear as follows: on migration setup, qemu enables dirty page logging by sending VHOST_USER_SET_FEATURES. The command doesn't arrive at the vhost-user-blk daemon immediately and the daemon needs some time to turn the logging on internally. If qemu doesn't wait for a reply after sending the command, qemu may start migrating memory pages to the destination. At this time, the logging may not be actually turned on in the daemon, but some guest pages, which the daemon is about to write to, may have already been transferred without logging to the destination. Since the logging wasn't turned on, those pages won't be transferred again as dirty. So we may end up with corrupted data on the destination. The same scenario is applicable to "used ring" data logging, which is turned on with the VHOST_USER_SET_VRING_ADDR command. To resolve this issue, this patch makes qemu wait for the commands' result explicitly if VHOST_USER_PROTOCOL_F_REPLY_ACK is negotiated.
Signed-off-by: Denis Plotnikov --- hw/virtio/vhost-user.c | 31 --- 1 file changed, 28 insertions(+), 3 deletions(-) diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c index ee57abe04526..15b5fac67cf3 100644 --- a/hw/virtio/vhost-user.c +++ b/hw/virtio/vhost-user.c @@ -1105,10 +1105,20 @@ static int vhost_user_set_vring_addr(struct vhost_dev *dev, .hdr.size = sizeof(msg.payload.addr), }; +bool reply_supported = virtio_has_feature(dev->protocol_features, + VHOST_USER_PROTOCOL_F_REPLY_ACK); +if (reply_supported) { +msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK; +} + if (vhost_user_write(dev, &msg, NULL, 0) < 0) { return -1; } +if (reply_supported) { +return process_message_reply(dev, &msg); +} + return 0; } @@ -1288,7 +1298,8 @@ static int vhost_user_set_vring_call(struct vhost_dev *dev, return vhost_set_vring_file(dev, VHOST_USER_SET_VRING_CALL, file); } -static int vhost_user_set_u64(struct vhost_dev *dev, int request, uint64_t u64) +static int vhost_user_set_u64(struct vhost_dev *dev, int request, uint64_t u64, + bool need_reply) { VhostUserMsg msg = { .hdr.request = request, @@ -1297,23 +1308,37 @@ static int vhost_user_set_u64(struct vhost_dev *dev, int request, uint64_t u64) .hdr.size = sizeof(msg.payload.u64), }; +if (need_reply) { +bool reply_supported = virtio_has_feature(dev->protocol_features, + VHOST_USER_PROTOCOL_F_REPLY_ACK); +if (reply_supported) { +msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK; +} +} + if (vhost_user_write(dev, &msg, NULL, 0) < 0) { return -1; } +if (msg.hdr.flags & VHOST_USER_NEED_REPLY_MASK) { +return process_message_reply(dev, &msg); +} + return 0; } static int vhost_user_set_features(struct vhost_dev *dev, uint64_t features) { -return vhost_user_set_u64(dev, VHOST_USER_SET_FEATURES, features); +return vhost_user_set_u64(dev, VHOST_USER_SET_FEATURES, features, + true); } static int vhost_user_set_protocol_features(struct vhost_dev *dev, uint64_t features) { -return vhost_user_set_u64(dev, VHOST_USER_SET_PROTOCOL_FEATURES, features); +return
vhost_user_set_u64(dev, VHOST_USER_SET_PROTOCOL_FEATURES, features, + false); } static int vhost_user_get_u64(struct vhost_dev *dev, int request, uint64_t *u64) -- 2.25.1
Re: [PATCH v0] vhost: make SET_VRING_ADDR, SET_[PROTOCOL_]FEATEURES send replies
On 07.07.2021 21:44, Michael S. Tsirkin wrote: On Wed, Jul 07, 2021 at 05:58:50PM +0300, Denis Plotnikov wrote: On 07.07.2021 17:39, Michael S. Tsirkin wrote: On Wed, Jul 07, 2021 at 03:19:20PM +0300, Denis Plotnikov wrote: On 07.07.2021 13:10, Michael S. Tsirkin wrote: On Fri, Jun 25, 2021 at 11:52:10AM +0300, Denis Plotnikov wrote: On vhost-user-blk migration, qemu normally sends a number of commands to enable logging if VHOST_USER_PROTOCOL_F_LOG_SHMFD is negotiated. Qemu sends VHOST_USER_SET_FEATURES to enable buffers logging and VHOST_USER_SET_FEATURES per each started ring to enable "used ring" data logging. The issue is that qemu doesn't wait for reply from the vhost daemon for these commands which may result in races between qemu expectation of logging starting and actual login starting in vhost daemon. Could you be more explicit please? What kind of race have you observed? Getting a reply slows down the setup considerably and should not be done lightly. I'm talking about the vhost-user-blk case. On migration setup, we enable logging by sending VHOST_USER_SET_FEATURES. The command doesn't arrive to a vhost-user-blk daemon immediately and the daemon needs some time turn the logging on internally. If qemu doesn't wait for reply, after sending the command qemu may start migrate memory pages. At this time the logging may not be actually turned on in the daemon but some guest pages, which the daemon is about to write to, may be already transferred without logging to a destination. Since the logging wasn't turned on, those pages won't be transferred again as dirty. So we may end up with corrupted data on the destination. Have I managed to explain the case clearly? Thanks! Denis OK so this is just about enabling logging. It would be cleaner to defer migrating memory until response ... if that is too hard, at least document why we are doing this please. And, let's wait for an ack just in that case then - why not? And what about VHOST_USER_SET_PROTOCOL_FEATURES? 
The code uses the same path for both VHOST_USER_SET_PROTOCOL_FEATURES and VHOST_USER_SET_FEATURES via vhost_user_set_u64(). So, I decided to suggest adding reply to both of them, so both feature setting commands work similarly as it doesn't contradicts with vhost-user spec. I'm not sure that it worth doing that, so if you think it's not I'll just remove them. Denis I'm inclined to say let's not add to the latency of setting up the device unnecessarily. ok I'll remove reply for VHOST_USER_SET_FEATURES and amend the commit message in v2 Thanks! Denis Thanks! To resolve this issue, this patch makes qemu wait for the commands result explicilty if VHOST_USER_PROTOCOL_F_REPLY_ACK is negotiated. Also, this patch adds the reply waiting for VHOST_USER_SET_PROTOCOL_FEATURES command to make the features setting functions work similary. Signed-off-by: Denis Plotnikov --- hw/virtio/vhost-user.c | 20 1 file changed, 20 insertions(+) diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c index ee57abe04526..e47b82adab00 100644 --- a/hw/virtio/vhost-user.c +++ b/hw/virtio/vhost-user.c @@ -1105,10 +1105,20 @@ static int vhost_user_set_vring_addr(struct vhost_dev *dev, .hdr.size = sizeof(msg.payload.addr), }; +bool reply_supported = virtio_has_feature(dev->protocol_features, + VHOST_USER_PROTOCOL_F_REPLY_ACK); +if (reply_supported) { +msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK; +} + if (vhost_user_write(dev, , NULL, 0) < 0) { return -1; } +if (reply_supported) { +return process_message_reply(dev, ); +} + return 0; } @@ -1297,10 +1307,20 @@ static int vhost_user_set_u64(struct vhost_dev *dev, int request, uint64_t u64) .hdr.size = sizeof(msg.payload.u64), }; +bool reply_supported = virtio_has_feature(dev->protocol_features, + VHOST_USER_PROTOCOL_F_REPLY_ACK); +if (reply_supported) { +msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK; +} + if (vhost_user_write(dev, , NULL, 0) < 0) { return -1; } +if (reply_supported) { +return process_message_reply(dev, ); +} + return 0; } -- 2.25.1
Re: [PATCH v0] vhost: make SET_VRING_ADDR, SET_[PROTOCOL_]FEATEURES send replies
On 07.07.2021 17:39, Michael S. Tsirkin wrote: On Wed, Jul 07, 2021 at 03:19:20PM +0300, Denis Plotnikov wrote: On 07.07.2021 13:10, Michael S. Tsirkin wrote: On Fri, Jun 25, 2021 at 11:52:10AM +0300, Denis Plotnikov wrote: On vhost-user-blk migration, qemu normally sends a number of commands to enable logging if VHOST_USER_PROTOCOL_F_LOG_SHMFD is negotiated. Qemu sends VHOST_USER_SET_FEATURES to enable buffers logging and VHOST_USER_SET_FEATURES per each started ring to enable "used ring" data logging. The issue is that qemu doesn't wait for reply from the vhost daemon for these commands which may result in races between qemu expectation of logging starting and actual login starting in vhost daemon. Could you be more explicit please? What kind of race have you observed? Getting a reply slows down the setup considerably and should not be done lightly. I'm talking about the vhost-user-blk case. On migration setup, we enable logging by sending VHOST_USER_SET_FEATURES. The command doesn't arrive to a vhost-user-blk daemon immediately and the daemon needs some time turn the logging on internally. If qemu doesn't wait for reply, after sending the command qemu may start migrate memory pages. At this time the logging may not be actually turned on in the daemon but some guest pages, which the daemon is about to write to, may be already transferred without logging to a destination. Since the logging wasn't turned on, those pages won't be transferred again as dirty. So we may end up with corrupted data on the destination. Have I managed to explain the case clearly? Thanks! Denis OK so this is just about enabling logging. It would be cleaner to defer migrating memory until response ... if that is too hard, at least document why we are doing this please. And, let's wait for an ack just in that case then - why not? And what about VHOST_USER_SET_PROTOCOL_FEATURES? 
The code uses the same path for both VHOST_USER_SET_PROTOCOL_FEATURES and VHOST_USER_SET_FEATURES via vhost_user_set_u64(). So, I decided to suggest adding reply to both of them, so both feature setting commands work similarly as it doesn't contradicts with vhost-user spec. I'm not sure that it worth doing that, so if you think it's not I'll just remove them. Denis Thanks! To resolve this issue, this patch makes qemu wait for the commands result explicilty if VHOST_USER_PROTOCOL_F_REPLY_ACK is negotiated. Also, this patch adds the reply waiting for VHOST_USER_SET_PROTOCOL_FEATURES command to make the features setting functions work similary. Signed-off-by: Denis Plotnikov --- hw/virtio/vhost-user.c | 20 1 file changed, 20 insertions(+) diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c index ee57abe04526..e47b82adab00 100644 --- a/hw/virtio/vhost-user.c +++ b/hw/virtio/vhost-user.c @@ -1105,10 +1105,20 @@ static int vhost_user_set_vring_addr(struct vhost_dev *dev, .hdr.size = sizeof(msg.payload.addr), }; +bool reply_supported = virtio_has_feature(dev->protocol_features, + VHOST_USER_PROTOCOL_F_REPLY_ACK); +if (reply_supported) { +msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK; +} + if (vhost_user_write(dev, , NULL, 0) < 0) { return -1; } +if (reply_supported) { +return process_message_reply(dev, ); +} + return 0; } @@ -1297,10 +1307,20 @@ static int vhost_user_set_u64(struct vhost_dev *dev, int request, uint64_t u64) .hdr.size = sizeof(msg.payload.u64), }; +bool reply_supported = virtio_has_feature(dev->protocol_features, + VHOST_USER_PROTOCOL_F_REPLY_ACK); +if (reply_supported) { +msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK; +} + if (vhost_user_write(dev, , NULL, 0) < 0) { return -1; } +if (reply_supported) { +return process_message_reply(dev, ); +} + return 0; } -- 2.25.1
Re: [PATCH v0] vhost: make SET_VRING_ADDR, SET_[PROTOCOL_]FEATEURES send replies
On 07.07.2021 13:10, Michael S. Tsirkin wrote: On Fri, Jun 25, 2021 at 11:52:10AM +0300, Denis Plotnikov wrote: On vhost-user-blk migration, qemu normally sends a number of commands to enable logging if VHOST_USER_PROTOCOL_F_LOG_SHMFD is negotiated. Qemu sends VHOST_USER_SET_FEATURES to enable buffers logging and VHOST_USER_SET_FEATURES per each started ring to enable "used ring" data logging. The issue is that qemu doesn't wait for reply from the vhost daemon for these commands which may result in races between qemu expectation of logging starting and actual login starting in vhost daemon. Could you be more explicit please? What kind of race have you observed? Getting a reply slows down the setup considerably and should not be done lightly. I'm talking about the vhost-user-blk case. On migration setup, we enable logging by sending VHOST_USER_SET_FEATURES. The command doesn't arrive to a vhost-user-blk daemon immediately and the daemon needs some time turn the logging on internally. If qemu doesn't wait for reply, after sending the command qemu may start migrate memory pages. At this time the logging may not be actually turned on in the daemon but some guest pages, which the daemon is about to write to, may be already transferred without logging to a destination. Since the logging wasn't turned on, those pages won't be transferred again as dirty. So we may end up with corrupted data on the destination. Have I managed to explain the case clearly? Thanks! Denis Thanks! To resolve this issue, this patch makes qemu wait for the commands result explicilty if VHOST_USER_PROTOCOL_F_REPLY_ACK is negotiated. Also, this patch adds the reply waiting for VHOST_USER_SET_PROTOCOL_FEATURES command to make the features setting functions work similary. 
Signed-off-by: Denis Plotnikov --- hw/virtio/vhost-user.c | 20 1 file changed, 20 insertions(+) diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c index ee57abe04526..e47b82adab00 100644 --- a/hw/virtio/vhost-user.c +++ b/hw/virtio/vhost-user.c @@ -1105,10 +1105,20 @@ static int vhost_user_set_vring_addr(struct vhost_dev *dev, .hdr.size = sizeof(msg.payload.addr), }; +bool reply_supported = virtio_has_feature(dev->protocol_features, + VHOST_USER_PROTOCOL_F_REPLY_ACK); +if (reply_supported) { +msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK; +} + if (vhost_user_write(dev, , NULL, 0) < 0) { return -1; } +if (reply_supported) { +return process_message_reply(dev, ); +} + return 0; } @@ -1297,10 +1307,20 @@ static int vhost_user_set_u64(struct vhost_dev *dev, int request, uint64_t u64) .hdr.size = sizeof(msg.payload.u64), }; +bool reply_supported = virtio_has_feature(dev->protocol_features, + VHOST_USER_PROTOCOL_F_REPLY_ACK); +if (reply_supported) { +msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK; +} + if (vhost_user_write(dev, , NULL, 0) < 0) { return -1; } +if (reply_supported) { +return process_message_reply(dev, ); +} + return 0; } -- 2.25.1
[PING] [PATCH v0] vhost: make SET_VRING_ADDR, SET_[PROTOCOL_]FEATEURES send replies
On 02.07.2021 12:41, Denis Plotnikov wrote: ping ping! On 25.06.2021 11:52, Denis Plotnikov wrote: On vhost-user-blk migration, qemu normally sends a number of commands to enable logging if VHOST_USER_PROTOCOL_F_LOG_SHMFD is negotiated. Qemu sends VHOST_USER_SET_FEATURES to enable buffers logging and VHOST_USER_SET_FEATURES per each started ring to enable "used ring" data logging. The issue is that qemu doesn't wait for reply from the vhost daemon for these commands which may result in races between qemu expectation of logging starting and actual login starting in vhost daemon. To resolve this issue, this patch makes qemu wait for the commands result explicilty if VHOST_USER_PROTOCOL_F_REPLY_ACK is negotiated. Also, this patch adds the reply waiting for VHOST_USER_SET_PROTOCOL_FEATURES command to make the features setting functions work similary. Signed-off-by: Denis Plotnikov --- hw/virtio/vhost-user.c | 20 1 file changed, 20 insertions(+) diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c index ee57abe04526..e47b82adab00 100644 --- a/hw/virtio/vhost-user.c +++ b/hw/virtio/vhost-user.c @@ -1105,10 +1105,20 @@ static int vhost_user_set_vring_addr(struct vhost_dev *dev, .hdr.size = sizeof(msg.payload.addr), }; +bool reply_supported = virtio_has_feature(dev->protocol_features, + VHOST_USER_PROTOCOL_F_REPLY_ACK); +if (reply_supported) { +msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK; +} + if (vhost_user_write(dev, , NULL, 0) < 0) { return -1; } +if (reply_supported) { +return process_message_reply(dev, ); +} + return 0; } @@ -1297,10 +1307,20 @@ static int vhost_user_set_u64(struct vhost_dev *dev, int request, uint64_t u64) .hdr.size = sizeof(msg.payload.u64), }; +bool reply_supported = virtio_has_feature(dev->protocol_features, + VHOST_USER_PROTOCOL_F_REPLY_ACK); +if (reply_supported) { +msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK; +} + if (vhost_user_write(dev, , NULL, 0) < 0) { return -1; } +if (reply_supported) { +return process_message_reply(dev, ); +} + 
return 0; }
Re: [PATCH v0] vhost: make SET_VRING_ADDR, SET_[PROTOCOL_]FEATEURES send replies
ping ping! On 25.06.2021 11:52, Denis Plotnikov wrote: On vhost-user-blk migration, qemu normally sends a number of commands to enable logging if VHOST_USER_PROTOCOL_F_LOG_SHMFD is negotiated. Qemu sends VHOST_USER_SET_FEATURES to enable buffers logging and VHOST_USER_SET_FEATURES per each started ring to enable "used ring" data logging. The issue is that qemu doesn't wait for reply from the vhost daemon for these commands which may result in races between qemu expectation of logging starting and actual login starting in vhost daemon. To resolve this issue, this patch makes qemu wait for the commands result explicilty if VHOST_USER_PROTOCOL_F_REPLY_ACK is negotiated. Also, this patch adds the reply waiting for VHOST_USER_SET_PROTOCOL_FEATURES command to make the features setting functions work similary. Signed-off-by: Denis Plotnikov --- hw/virtio/vhost-user.c | 20 1 file changed, 20 insertions(+) diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c index ee57abe04526..e47b82adab00 100644 --- a/hw/virtio/vhost-user.c +++ b/hw/virtio/vhost-user.c @@ -1105,10 +1105,20 @@ static int vhost_user_set_vring_addr(struct vhost_dev *dev, .hdr.size = sizeof(msg.payload.addr), }; +bool reply_supported = virtio_has_feature(dev->protocol_features, + VHOST_USER_PROTOCOL_F_REPLY_ACK); +if (reply_supported) { +msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK; +} + if (vhost_user_write(dev, , NULL, 0) < 0) { return -1; } +if (reply_supported) { +return process_message_reply(dev, ); +} + return 0; } @@ -1297,10 +1307,20 @@ static int vhost_user_set_u64(struct vhost_dev *dev, int request, uint64_t u64) .hdr.size = sizeof(msg.payload.u64), }; +bool reply_supported = virtio_has_feature(dev->protocol_features, + VHOST_USER_PROTOCOL_F_REPLY_ACK); +if (reply_supported) { +msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK; +} + if (vhost_user_write(dev, , NULL, 0) < 0) { return -1; } +if (reply_supported) { +return process_message_reply(dev, ); +} + return 0; }
[PATCH v0] vhost: make SET_VRING_ADDR, SET_[PROTOCOL_]FEATEURES send replies
On vhost-user-blk migration, qemu normally sends a number of commands to enable logging if VHOST_USER_PROTOCOL_F_LOG_SHMFD is negotiated. Qemu sends VHOST_USER_SET_FEATURES to enable buffer logging and VHOST_USER_SET_VRING_ADDR for each started ring to enable "used ring" data logging. The issue is that qemu doesn't wait for a reply from the vhost daemon for these commands, which may result in races between qemu's expectation of logging starting and actual logging starting in the vhost daemon. To resolve this issue, this patch makes qemu wait for the commands' result explicitly if VHOST_USER_PROTOCOL_F_REPLY_ACK is negotiated. Also, this patch adds reply waiting for the VHOST_USER_SET_PROTOCOL_FEATURES command to make the feature setting functions work similarly. Signed-off-by: Denis Plotnikov --- hw/virtio/vhost-user.c | 20 1 file changed, 20 insertions(+) diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c index ee57abe04526..e47b82adab00 100644 --- a/hw/virtio/vhost-user.c +++ b/hw/virtio/vhost-user.c @@ -1105,10 +1105,20 @@ static int vhost_user_set_vring_addr(struct vhost_dev *dev, .hdr.size = sizeof(msg.payload.addr), }; +bool reply_supported = virtio_has_feature(dev->protocol_features, + VHOST_USER_PROTOCOL_F_REPLY_ACK); +if (reply_supported) { +msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK; +} + if (vhost_user_write(dev, &msg, NULL, 0) < 0) { return -1; } +if (reply_supported) { +return process_message_reply(dev, &msg); +} + return 0; } @@ -1297,10 +1307,20 @@ static int vhost_user_set_u64(struct vhost_dev *dev, int request, uint64_t u64) .hdr.size = sizeof(msg.payload.u64), }; +bool reply_supported = virtio_has_feature(dev->protocol_features, + VHOST_USER_PROTOCOL_F_REPLY_ACK); +if (reply_supported) { +msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK; +} + if (vhost_user_write(dev, &msg, NULL, 0) < 0) { return -1; } +if (reply_supported) { +return process_message_reply(dev, &msg); +} + return 0; } -- 2.25.1
Re: [PATCH 1/5] vhost-user-blk: Don't reconnect during initialisation
Reviewed-by: Denis Plotnikov On 22.04.2021 20:02, Kevin Wolf wrote: This is a partial revert of commits 77542d43149 and bc79c87bcde. Usually, an error during initialisation means that the configuration was wrong. Reconnecting won't make the error go away, but just turn the error condition into an endless loop. Avoid this and return errors again. Additionally, calling vhost_user_blk_disconnect() from the chardev event handler could result in use-after-free because none of the initialisation code expects that the device could just go away in the middle. So removing the call fixes crashes in several places. For example, using a num-queues setting that is incompatible with the backend would result in a crash like this (dereferencing dev->opaque, which is already NULL): #0 0x55d0a4bd in vhost_user_read_cb (source=0x568f4690, condition=(G_IO_IN | G_IO_HUP), opaque=0x7fffcbf0) at ../hw/virtio/vhost-user.c:313 #1 0x55d950d3 in qio_channel_fd_source_dispatch (source=0x57c3f750, callback=0x55d0a478 , user_data=0x7fffcbf0) at ../io/channel-watch.c:84 #2 0x77b32a9f in g_main_context_dispatch () at /lib64/libglib-2.0.so.0 #3 0x77b84a98 in g_main_context_iterate.constprop () at /lib64/libglib-2.0.so.0 #4 0x77b32163 in g_main_loop_run () at /lib64/libglib-2.0.so.0 #5 0x55d0a724 in vhost_user_read (dev=0x57bc62f8, msg=0x7fffcc50) at ../hw/virtio/vhost-user.c:402 #6 0x55d0ee6b in vhost_user_get_config (dev=0x57bc62f8, config=0x57bc62ac "", config_len=60) at ../hw/virtio/vhost-user.c:2133 #7 0x55d56d46 in vhost_dev_get_config (hdev=0x57bc62f8, config=0x57bc62ac "", config_len=60) at ../hw/virtio/vhost.c:1566 #8 0x55cdd150 in vhost_user_blk_device_realize (dev=0x57bc60b0, errp=0x7fffcf90) at ../hw/block/vhost-user-blk.c:510 #9 0x55d08f6d in virtio_device_realize (dev=0x57bc60b0, errp=0x7fffcff0) at ../hw/virtio/virtio.c:3660 Signed-off-by: Kevin Wolf --- hw/block/vhost-user-blk.c | 54 ++- 1 file changed, 13 insertions(+), 41 deletions(-) diff --git a/hw/block/vhost-user-blk.c 
b/hw/block/vhost-user-blk.c index f5e9682703..e824b0a759 100644 --- a/hw/block/vhost-user-blk.c +++ b/hw/block/vhost-user-blk.c @@ -50,6 +50,8 @@ static const int user_feature_bits[] = { VHOST_INVALID_FEATURE_BIT }; +static void vhost_user_blk_event(void *opaque, QEMUChrEvent event); + static void vhost_user_blk_update_config(VirtIODevice *vdev, uint8_t *config) { VHostUserBlk *s = VHOST_USER_BLK(vdev); @@ -362,19 +364,6 @@ static void vhost_user_blk_disconnect(DeviceState *dev) vhost_dev_cleanup(&s->dev); } -static void vhost_user_blk_event(void *opaque, QEMUChrEvent event, - bool realized); - -static void vhost_user_blk_event_realize(void *opaque, QEMUChrEvent event) -{ -vhost_user_blk_event(opaque, event, false); -} - -static void vhost_user_blk_event_oper(void *opaque, QEMUChrEvent event) -{ -vhost_user_blk_event(opaque, event, true); -} - static void vhost_user_blk_chr_closed_bh(void *opaque) { DeviceState *dev = opaque; @@ -382,12 +371,11 @@ static void vhost_user_blk_chr_closed_bh(void *opaque) VHostUserBlk *s = VHOST_USER_BLK(vdev); vhost_user_blk_disconnect(dev); -qemu_chr_fe_set_handlers(&s->chardev, NULL, NULL, -vhost_user_blk_event_oper, NULL, opaque, NULL, true); +qemu_chr_fe_set_handlers(&s->chardev, NULL, NULL, vhost_user_blk_event, + NULL, opaque, NULL, true); } -static void vhost_user_blk_event(void *opaque, QEMUChrEvent event, - bool realized) +static void vhost_user_blk_event(void *opaque, QEMUChrEvent event) { DeviceState *dev = opaque; VirtIODevice *vdev = VIRTIO_DEVICE(dev); @@ -401,17 +389,7 @@ static void vhost_user_blk_event(void *opaque, QEMUChrEvent event, } break; case CHR_EVENT_CLOSED: -/* - * Closing the connection should happen differently on device - * initialization and operation stages. - * On initalization, we want to re-start vhost_dev initialization - * from the very beginning right away when the connection is closed, - * so we clean up vhost_dev on each connection closing.
- * On operation, we want to postpone vhost_dev cleanup to let the - * other code perform its own cleanup sequence using vhost_dev data - * (e.g. vhost_dev_set_log). - */ -if (realized && !runstate_check(RUN_STATE_SHUTDOWN)) { +if (!runstate_check(RUN_STATE_SHUTDOWN)) { /* * A close event may happen during a read/write, but vhost * code assumes the vhost_dev remains setup, so delay the @@ -431,8 +409,6 @@ static void vhost_u
Re: [PATCH v3 2/3] vhost-user-blk: perform immediate cleanup if disconnect on initialization
On 21.04.2021 22:59, Michael S. Tsirkin wrote: On Wed, Apr 21, 2021 at 07:13:24PM +0300, Denis Plotnikov wrote: On 21.04.2021 18:24, Kevin Wolf wrote: Am 25.03.2021 um 16:12 hat Denis Plotnikov geschrieben: Commit 4bcad76f4c39 ("vhost-user-blk: delay vhost_user_blk_disconnect") introduced postponing vhost_dev cleanup aiming to eliminate qemu aborts because of connection problems with the vhost-blk daemon. However, it introduces a new problem. Now, any communication errors during execution of vhost_dev_init() called by vhost_user_blk_device_realize() lead to a qemu abort on the assert in vhost_dev_get_config(). This happens because vhost_user_blk_disconnect() is postponed but it should have dropped the s->connected flag by the time vhost_user_blk_device_realize() performs a new connection opening. On the connection opening, vhost_dev initialization in vhost_user_blk_connect() relies on the s->connected flag and if it's not dropped, it skips vhost_dev initialization and returns with success. Then, vhost_user_blk_device_realize()'s execution flow goes to vhost_dev_get_config() where it's aborted on the assert. To fix the problem this patch adds immediate cleanup on device initialization (in vhost_user_blk_device_realize()) using the different event handlers for initialization and operation introduced in the previous patch. On initialization (in vhost_user_blk_device_realize()) we fully control the initialization process. At that point, nobody can use the device since it isn't initialized and we don't need to postpone any cleanups, so we can do the cleanup right away when there is a communication problem with the vhost-blk daemon. On operation we leave it as is, since the disconnect may happen when the device is in use, so the device users may want to use vhost_dev's data to do a rollback before vhost_dev is re-initialized (e.g. in vhost_dev_set_log()). Signed-off-by: Denis Plotnikov Reviewed-by: Raphael Norwitz I think there is something wrong with this patch.
I'm debugging an error case, specifically num-queues being larger in QEMU than in the vhost-user-blk export. Before this patch, it has just an unfriendly error message:

qemu-system-x86_64: -device vhost-user-blk-pci,chardev=vhost1,id=blk1,iommu_platform=off,disable-legacy=on,num-queues=4: Unexpected end-of-file before all data were read
qemu-system-x86_64: -device vhost-user-blk-pci,chardev=vhost1,id=blk1,iommu_platform=off,disable-legacy=on,num-queues=4: Failed to read msg header. Read 0 instead of 12. Original request 24.
qemu-system-x86_64: -device vhost-user-blk-pci,chardev=vhost1,id=blk1,iommu_platform=off,disable-legacy=on,num-queues=4: vhost-user-blk: get block config failed
qemu-system-x86_64: Failed to set msg fds.
qemu-system-x86_64: vhost VQ 0 ring restore failed: -1: Resource temporarily unavailable (11)

After the patch, it crashes:

#0 0x55d0a4bd in vhost_user_read_cb (source=0x568f4690, condition=(G_IO_IN | G_IO_HUP), opaque=0x7fffcbf0) at ../hw/virtio/vhost-user.c:313
#1 0x55d950d3 in qio_channel_fd_source_dispatch (source=0x57c3f750, callback=0x55d0a478 , user_data=0x7fffcbf0) at ../io/channel-watch.c:84
#2 0x77b32a9f in g_main_context_dispatch () at /lib64/libglib-2.0.so.0
#3 0x77b84a98 in g_main_context_iterate.constprop () at /lib64/libglib-2.0.so.0
#4 0x77b32163 in g_main_loop_run () at /lib64/libglib-2.0.so.0
#5 0x55d0a724 in vhost_user_read (dev=0x57bc62f8, msg=0x7fffcc50) at ../hw/virtio/vhost-user.c:402
#6 0x55d0ee6b in vhost_user_get_config (dev=0x57bc62f8, config=0x57bc62ac "", config_len=60) at ../hw/virtio/vhost-user.c:2133
#7 0x55d56d46 in vhost_dev_get_config (hdev=0x57bc62f8, config=0x57bc62ac "", config_len=60) at ../hw/virtio/vhost.c:1566
#8 0x55cdd150 in vhost_user_blk_device_realize (dev=0x57bc60b0, errp=0x7fffcf90) at ../hw/block/vhost-user-blk.c:510
#9 0x55d08f6d in virtio_device_realize (dev=0x57bc60b0, errp=0x7fffcff0) at ../hw/virtio/virtio.c:3660

The problem is that vhost_user_read_cb() still accesses dev->opaque even though
the device has been cleaned up meanwhile when the connection was closed (the vhost_user_blk_disconnect() added by this patch), so it's NULL now. This problem was actually mentioned in the comment that is removed by this patch. I tried to fix this by making vhost_user_read() cope with the fact that the device might have been cleaned up meanwhile, but then I'm running into the next set of problems. The first is that retrying is pointless, the error condition is in the configuration, it will never change. The other is that after many repetitions of the same error message, I got a crash where the device is cleaned up a second time in vhost_dev_init() and the virtqueues are already NULL. So it seems to me that erroring out during the initialisation phase makes a lot more sense than
Re: [PATCH v3 2/3] vhost-user-blk: perform immediate cleanup if disconnect on initialization
On 21.04.2021 18:24, Kevin Wolf wrote: Am 25.03.2021 um 16:12 hat Denis Plotnikov geschrieben: Commit 4bcad76f4c39 ("vhost-user-blk: delay vhost_user_blk_disconnect") introduced postponing vhost_dev cleanup aiming to eliminate qemu aborts because of connection problems with the vhost-blk daemon. However, it introduces a new problem. Now, any communication errors during execution of vhost_dev_init() called by vhost_user_blk_device_realize() lead to a qemu abort on the assert in vhost_dev_get_config(). This happens because vhost_user_blk_disconnect() is postponed but it should have dropped the s->connected flag by the time vhost_user_blk_device_realize() performs a new connection opening. On the connection opening, vhost_dev initialization in vhost_user_blk_connect() relies on the s->connected flag and if it's not dropped, it skips vhost_dev initialization and returns with success. Then, vhost_user_blk_device_realize()'s execution flow goes to vhost_dev_get_config() where it's aborted on the assert. To fix the problem this patch adds immediate cleanup on device initialization (in vhost_user_blk_device_realize()) using the different event handlers for initialization and operation introduced in the previous patch. On initialization (in vhost_user_blk_device_realize()) we fully control the initialization process. At that point, nobody can use the device since it isn't initialized and we don't need to postpone any cleanups, so we can do the cleanup right away when there is a communication problem with the vhost-blk daemon. On operation we leave it as is, since the disconnect may happen when the device is in use, so the device users may want to use vhost_dev's data to do a rollback before vhost_dev is re-initialized (e.g. in vhost_dev_set_log()). Signed-off-by: Denis Plotnikov Reviewed-by: Raphael Norwitz I think there is something wrong with this patch. I'm debugging an error case, specifically num-queues being larger in QEMU than in the vhost-user-blk export.
Before this patch, it has just an unfriendly error message:

qemu-system-x86_64: -device vhost-user-blk-pci,chardev=vhost1,id=blk1,iommu_platform=off,disable-legacy=on,num-queues=4: Unexpected end-of-file before all data were read
qemu-system-x86_64: -device vhost-user-blk-pci,chardev=vhost1,id=blk1,iommu_platform=off,disable-legacy=on,num-queues=4: Failed to read msg header. Read 0 instead of 12. Original request 24.
qemu-system-x86_64: -device vhost-user-blk-pci,chardev=vhost1,id=blk1,iommu_platform=off,disable-legacy=on,num-queues=4: vhost-user-blk: get block config failed
qemu-system-x86_64: Failed to set msg fds.
qemu-system-x86_64: vhost VQ 0 ring restore failed: -1: Resource temporarily unavailable (11)

After the patch, it crashes:

#0 0x55d0a4bd in vhost_user_read_cb (source=0x568f4690, condition=(G_IO_IN | G_IO_HUP), opaque=0x7fffcbf0) at ../hw/virtio/vhost-user.c:313
#1 0x55d950d3 in qio_channel_fd_source_dispatch (source=0x57c3f750, callback=0x55d0a478 , user_data=0x7fffcbf0) at ../io/channel-watch.c:84
#2 0x77b32a9f in g_main_context_dispatch () at /lib64/libglib-2.0.so.0
#3 0x77b84a98 in g_main_context_iterate.constprop () at /lib64/libglib-2.0.so.0
#4 0x77b32163 in g_main_loop_run () at /lib64/libglib-2.0.so.0
#5 0x55d0a724 in vhost_user_read (dev=0x57bc62f8, msg=0x7fffcc50) at ../hw/virtio/vhost-user.c:402
#6 0x55d0ee6b in vhost_user_get_config (dev=0x57bc62f8, config=0x57bc62ac "", config_len=60) at ../hw/virtio/vhost-user.c:2133
#7 0x55d56d46 in vhost_dev_get_config (hdev=0x57bc62f8, config=0x57bc62ac "", config_len=60) at ../hw/virtio/vhost.c:1566
#8 0x55cdd150 in vhost_user_blk_device_realize (dev=0x57bc60b0, errp=0x7fffcf90) at ../hw/block/vhost-user-blk.c:510
#9 0x55d08f6d in virtio_device_realize (dev=0x57bc60b0, errp=0x7fffcff0) at ../hw/virtio/virtio.c:3660

The problem is that vhost_user_read_cb() still accesses dev->opaque even though the device has been cleaned up meanwhile when the connection was closed (the vhost_user_blk_disconnect()
added by this patch), so it's NULL now. This problem was actually mentioned in the comment that is removed by this patch. I tried to fix this by making vhost_user_read() cope with the fact that the device might have been cleaned up meanwhile, but then I'm running into the next set of problems. The first is that retrying is pointless, the error condition is in the configuration, it will never change. The other is that after many repetitions of the same error message, I got a crash where the device is cleaned up a second time in vhost_dev_init() and the virtqueues are already NULL. So it seems to me that erroring out during the initialisation phase makes a lot more sense than retrying. Kevin But without the patch there will be another problem which the patch actually addresses. It seems to me
[BUG FIX][PATCH v3 0/3] vhost-user-blk: fix bug on device disconnection during initialization
This is a series fixing a bug in vhost-user-blk. Is there any chance for it to be considered for the next rc? Thanks! Denis

On 29.03.2021 16:44, Denis Plotnikov wrote: ping! On 25.03.2021 18:12, Denis Plotnikov wrote:

v3:
* 0003: a new patch added fixing the problem on vm shutdown. I stumbled on this bug after the v2 sending.
* 0001: grammar fixing (Raphael)
* 0002: commit message fixing (Raphael)

v2:
* split the initial patch into two (Raphael)
* rename init to realized (Raphael)
* remove unrelated comment (Raphael)

When the vhost-user-blk device loses the connection to the daemon during the initialization phase, it kills qemu because of an assert in the code. The series fixes the bug.

0001 is preparation for the fix
0002 fixes the bug, patch description has the full motivation for the series
0003 (added in v3) fixes a bug on vm shutdown

Denis Plotnikov (3): vhost-user-blk: use different event handlers on initialization vhost-user-blk: perform immediate cleanup if disconnect on initialization vhost-user-blk: add immediate cleanup on shutdown

hw/block/vhost-user-blk.c | 79 --- 1 file changed, 48 insertions(+), 31 deletions(-)
Re: [PATCH v3 0/3] vhost-user-blk: fix bug on device disconnection during initialization
ping! On 25.03.2021 18:12, Denis Plotnikov wrote:

v3:
* 0003: a new patch added fixing the problem on vm shutdown. I stumbled on this bug after the v2 sending.
* 0001: grammar fixing (Raphael)
* 0002: commit message fixing (Raphael)

v2:
* split the initial patch into two (Raphael)
* rename init to realized (Raphael)
* remove unrelated comment (Raphael)

When the vhost-user-blk device loses the connection to the daemon during the initialization phase, it kills qemu because of an assert in the code. The series fixes the bug.

0001 is preparation for the fix
0002 fixes the bug, patch description has the full motivation for the series
0003 (added in v3) fixes a bug on vm shutdown

Denis Plotnikov (3): vhost-user-blk: use different event handlers on initialization vhost-user-blk: perform immediate cleanup if disconnect on initialization vhost-user-blk: add immediate cleanup on shutdown

hw/block/vhost-user-blk.c | 79 --- 1 file changed, 48 insertions(+), 31 deletions(-)
[PATCH v3 3/3] vhost-user-blk: add immediate cleanup on shutdown
Qemu crashes on shutdown if the chardev used by vhost-user-blk has been finalized before the vhost-user-blk device. This happens with a char-socket chardev operating in listening mode (server). The char-socket chardev emits a "close" event at the end of finalizing, when its internal data is already destroyed. This calls the vhost-user-blk event handler, which in turn tries to manipulate the destroyed chardev by setting an empty event handler to postpone the vhost-user-blk cleanup. This patch separates the shutdown case from the cleanup postponing, removing the need to set an event handler. Signed-off-by: Denis Plotnikov --- hw/block/vhost-user-blk.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c index 4e215f71f152..0b5b9d44cdb0 100644 --- a/hw/block/vhost-user-blk.c +++ b/hw/block/vhost-user-blk.c @@ -411,7 +411,7 @@ static void vhost_user_blk_event(void *opaque, QEMUChrEvent event, * other code perform its own cleanup sequence using vhost_dev data * (e.g. vhost_dev_set_log). */ -if (realized) { +if (realized && !runstate_check(RUN_STATE_SHUTDOWN)) { /* * A close event may happen during a read/write, but vhost * code assumes the vhost_dev remains setup, so delay the -- 2.25.1
[PATCH v3 0/3] vhost-user-blk: fix bug on device disconnection during initialization
v3:
* 0003: a new patch added fixing the problem on vm shutdown. I stumbled on this bug after the v2 sending.
* 0001: grammar fixing (Raphael)
* 0002: commit message fixing (Raphael)

v2:
* split the initial patch into two (Raphael)
* rename init to realized (Raphael)
* remove unrelated comment (Raphael)

When the vhost-user-blk device loses the connection to the daemon during the initialization phase, it kills qemu because of an assert in the code. The series fixes the bug.

0001 is preparation for the fix
0002 fixes the bug, patch description has the full motivation for the series
0003 (added in v3) fixes a bug on vm shutdown

Denis Plotnikov (3): vhost-user-blk: use different event handlers on initialization vhost-user-blk: perform immediate cleanup if disconnect on initialization vhost-user-blk: add immediate cleanup on shutdown

hw/block/vhost-user-blk.c | 79 --- 1 file changed, 48 insertions(+), 31 deletions(-) -- 2.25.1
[PATCH v3 1/3] vhost-user-blk: use different event handlers on initialization
It is useful to use different connect/disconnect event handlers on device initialization and operation as seen from the further commit fixing a bug on device initialization. This patch refactors the code to make use of them: we don't rely any more on the VM state for choosing how to clean up the device, instead we explicitly use the proper event handler depending on whether the device has been initialized. Signed-off-by: Denis Plotnikov Reviewed-by: Raphael Norwitz --- hw/block/vhost-user-blk.c | 31 --- 1 file changed, 24 insertions(+), 7 deletions(-) diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c index b870a50e6b20..1af95ec6aae7 100644 --- a/hw/block/vhost-user-blk.c +++ b/hw/block/vhost-user-blk.c @@ -362,7 +362,18 @@ static void vhost_user_blk_disconnect(DeviceState *dev) vhost_dev_cleanup(&s->dev); } -static void vhost_user_blk_event(void *opaque, QEMUChrEvent event); +static void vhost_user_blk_event(void *opaque, QEMUChrEvent event, + bool realized); + +static void vhost_user_blk_event_realize(void *opaque, QEMUChrEvent event) +{ +vhost_user_blk_event(opaque, event, false); +} + +static void vhost_user_blk_event_oper(void *opaque, QEMUChrEvent event) +{ +vhost_user_blk_event(opaque, event, true); +} static void vhost_user_blk_chr_closed_bh(void *opaque) { @@ -371,11 +382,12 @@ static void vhost_user_blk_chr_closed_bh(void *opaque) VHostUserBlk *s = VHOST_USER_BLK(vdev); vhost_user_blk_disconnect(dev); -qemu_chr_fe_set_handlers(&s->chardev, NULL, NULL, vhost_user_blk_event, -NULL, opaque, NULL, true); +qemu_chr_fe_set_handlers(&s->chardev, NULL, NULL, +vhost_user_blk_event_oper, NULL, opaque, NULL, true); } -static void vhost_user_blk_event(void *opaque, QEMUChrEvent event) +static void vhost_user_blk_event(void *opaque, QEMUChrEvent event, + bool realized) { DeviceState *dev = opaque; VirtIODevice *vdev = VIRTIO_DEVICE(dev); @@ -406,7 +418,7 @@ static void vhost_user_blk_event(void *opaque, QEMUChrEvent event) * TODO: maybe it is a good idea to make
the same fix * for other vhost-user devices. */ -if (runstate_is_running()) { +if (realized) { AioContext *ctx = qemu_get_current_aio_context(); qemu_chr_fe_set_handlers(&s->chardev, NULL, NULL, NULL, NULL, @@ -473,8 +485,9 @@ static void vhost_user_blk_device_realize(DeviceState *dev, Error **errp) s->vhost_vqs = g_new0(struct vhost_virtqueue, s->num_queues); s->connected = false; -qemu_chr_fe_set_handlers(&s->chardev, NULL, NULL, vhost_user_blk_event, - NULL, (void *)dev, NULL, true); +qemu_chr_fe_set_handlers(&s->chardev, NULL, NULL, + vhost_user_blk_event_realize, NULL, (void *)dev, + NULL, true); reconnect: if (qemu_chr_fe_wait_connected(&s->chardev, &err) < 0) { @@ -494,6 +507,10 @@ reconnect: goto reconnect; } +/* we're fully initialized, now we can operate, so change the handler */ +qemu_chr_fe_set_handlers(&s->chardev, NULL, NULL, + vhost_user_blk_event_oper, NULL, (void *)dev, + NULL, true); return; virtio_err: -- 2.25.1
[PATCH v3 2/3] vhost-user-blk: perform immediate cleanup if disconnect on initialization
Commit 4bcad76f4c39 ("vhost-user-blk: delay vhost_user_blk_disconnect") introduced postponing vhost_dev cleanup aiming to eliminate qemu aborts because of connection problems with the vhost-blk daemon. However, it introduces a new problem. Now, any communication errors during execution of vhost_dev_init() called by vhost_user_blk_device_realize() lead to a qemu abort on the assert in vhost_dev_get_config(). This happens because vhost_user_blk_disconnect() is postponed but it should have dropped the s->connected flag by the time vhost_user_blk_device_realize() performs a new connection opening. On the connection opening, vhost_dev initialization in vhost_user_blk_connect() relies on the s->connected flag and if it's not dropped, it skips vhost_dev initialization and returns with success. Then, vhost_user_blk_device_realize()'s execution flow goes to vhost_dev_get_config() where it's aborted on the assert. To fix the problem this patch adds immediate cleanup on device initialization (in vhost_user_blk_device_realize()) using the different event handlers for initialization and operation introduced in the previous patch. On initialization (in vhost_user_blk_device_realize()) we fully control the initialization process. At that point, nobody can use the device since it isn't initialized and we don't need to postpone any cleanups, so we can do the cleanup right away when there is a communication problem with the vhost-blk daemon. On operation we leave it as is, since the disconnect may happen when the device is in use, so the device users may want to use vhost_dev's data to do a rollback before vhost_dev is re-initialized (e.g. in vhost_dev_set_log()).
Signed-off-by: Denis Plotnikov Reviewed-by: Raphael Norwitz --- hw/block/vhost-user-blk.c | 48 +++ 1 file changed, 24 insertions(+), 24 deletions(-) diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c index 1af95ec6aae7..4e215f71f152 100644 --- a/hw/block/vhost-user-blk.c +++ b/hw/block/vhost-user-blk.c @@ -402,38 +402,38 @@ static void vhost_user_blk_event(void *opaque, QEMUChrEvent event, break; case CHR_EVENT_CLOSED: /* - * A close event may happen during a read/write, but vhost - * code assumes the vhost_dev remains setup, so delay the - * stop & clear. There are two possible paths to hit this - * disconnect event: - * 1. When VM is in the RUN_STATE_PRELAUNCH state. The - * vhost_user_blk_device_realize() is a caller. - * 2. In tha main loop phase after VM start. - * - * For p2 the disconnect event will be delayed. We can't - * do the same for p1, because we are not running the loop - * at this moment. So just skip this step and perform - * disconnect in the caller function. - * - * TODO: maybe it is a good idea to make the same fix - * for other vhost-user devices. + * Closing the connection should happen differently on device + * initialization and operation stages. + * On initalization, we want to re-start vhost_dev initialization + * from the very beginning right away when the connection is closed, + * so we clean up vhost_dev on each connection closing. + * On operation, we want to postpone vhost_dev cleanup to let the + * other code perform its own cleanup sequence using vhost_dev data + * (e.g. vhost_dev_set_log). */ if (realized) { +/* + * A close event may happen during a read/write, but vhost + * code assumes the vhost_dev remains setup, so delay the + * stop & clear. + */ AioContext *ctx = qemu_get_current_aio_context(); qemu_chr_fe_set_handlers(&s->chardev, NULL, NULL, NULL, NULL, NULL, NULL, false); aio_bh_schedule_oneshot(ctx, vhost_user_blk_chr_closed_bh, opaque); -} -/* - * Move vhost device to the stopped state.
The vhost-user device - * will be clean up and disconnected in BH. This can be useful in - * the vhost migration code. If disconnect was caught there is an - * option for the general vhost code to get the dev state without - * knowing its type (in this case vhost-user). - */ -s->dev.started = false; +/* + * Move vhost device to the stopped state. The vhost-user device + * will be clean up and disconnected in BH. This can be useful in + * the vhost migration code. If disconnect was caught there is an + * option for the general vhost code to get the dev state without + * knowing its type (in this case vhost-user). + */ +s->dev.started = false; +} else { +vhost_user_blk_disconnect(dev); +} break; case CHR_EVENT_BREAK: case CHR_EVENT_MUX_IN: -- 2.25.1
[PATCH v2 1/2] vhost-user-blk: use different event handlers on initialization
It is useful to use different connect/disconnect event handlers on device initialization and operation as seen from the further commit fixing a bug on device initialization. The patch refactors the code to make use of them: we don't rely any more on the VM state for choosing how to clean up the device, instead we explicitly use the proper event handler depending on whether the device has been initialized. Signed-off-by: Denis Plotnikov --- hw/block/vhost-user-blk.c | 31 --- 1 file changed, 24 insertions(+), 7 deletions(-) diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c index b870a50e6b20..1af95ec6aae7 100644 --- a/hw/block/vhost-user-blk.c +++ b/hw/block/vhost-user-blk.c @@ -362,7 +362,18 @@ static void vhost_user_blk_disconnect(DeviceState *dev) vhost_dev_cleanup(&s->dev); } -static void vhost_user_blk_event(void *opaque, QEMUChrEvent event); +static void vhost_user_blk_event(void *opaque, QEMUChrEvent event, + bool realized); + +static void vhost_user_blk_event_realize(void *opaque, QEMUChrEvent event) +{ +vhost_user_blk_event(opaque, event, false); +} + +static void vhost_user_blk_event_oper(void *opaque, QEMUChrEvent event) +{ +vhost_user_blk_event(opaque, event, true); +} static void vhost_user_blk_chr_closed_bh(void *opaque) { @@ -371,11 +382,12 @@ static void vhost_user_blk_chr_closed_bh(void *opaque) VHostUserBlk *s = VHOST_USER_BLK(vdev); vhost_user_blk_disconnect(dev); -qemu_chr_fe_set_handlers(&s->chardev, NULL, NULL, vhost_user_blk_event, -NULL, opaque, NULL, true); +qemu_chr_fe_set_handlers(&s->chardev, NULL, NULL, +vhost_user_blk_event_oper, NULL, opaque, NULL, true); } -static void vhost_user_blk_event(void *opaque, QEMUChrEvent event) +static void vhost_user_blk_event(void *opaque, QEMUChrEvent event, + bool realized) { DeviceState *dev = opaque; VirtIODevice *vdev = VIRTIO_DEVICE(dev); @@ -406,7 +418,7 @@ static void vhost_user_blk_event(void *opaque, QEMUChrEvent event) * TODO: maybe it is a good idea to make the same fix * for other
vhost-user devices. */ -if (runstate_is_running()) { +if (realized) { AioContext *ctx = qemu_get_current_aio_context(); qemu_chr_fe_set_handlers(&s->chardev, NULL, NULL, NULL, NULL, @@ -473,8 +485,9 @@ static void vhost_user_blk_device_realize(DeviceState *dev, Error **errp) s->vhost_vqs = g_new0(struct vhost_virtqueue, s->num_queues); s->connected = false; -qemu_chr_fe_set_handlers(&s->chardev, NULL, NULL, vhost_user_blk_event, - NULL, (void *)dev, NULL, true); +qemu_chr_fe_set_handlers(&s->chardev, NULL, NULL, + vhost_user_blk_event_realize, NULL, (void *)dev, + NULL, true); reconnect: if (qemu_chr_fe_wait_connected(&s->chardev, &err) < 0) { @@ -494,6 +507,10 @@ reconnect: goto reconnect; } +/* we're fully initialized, now we can operate, so change the handler */ +qemu_chr_fe_set_handlers(&s->chardev, NULL, NULL, + vhost_user_blk_event_oper, NULL, (void *)dev, + NULL, true); return; virtio_err: -- 2.25.1
[PATCH v2 2/2] vhost-user-blk: perform immediate cleanup if disconnect on initialization
Commit 4bcad76f4c39 ("vhost-user-blk: delay vhost_user_blk_disconnect") introduced postponing vhost_dev cleanup aiming to eliminate qemu aborts because of connection problems with the vhost-blk daemon. However, it introduces a new problem. Now, any communication errors during execution of vhost_dev_init() called by vhost_user_blk_device_realize() lead to a qemu abort on the assert in vhost_dev_get_config(). This happens because vhost_user_blk_disconnect() is postponed but it should have dropped the s->connected flag by the time vhost_user_blk_device_realize() performs a new connection opening. On the connection opening, vhost_dev initialization in vhost_user_blk_connect() relies on the s->connected flag and if it's not dropped, it skips vhost_dev initialization and returns with success. Then, vhost_user_blk_device_realize()'s execution flow goes to vhost_dev_get_config() where it's aborted on the assert. The connection/disconnection processing should happen differently on initialization and operation of vhost-user-blk. On initialization (in vhost_user_blk_device_realize()) we fully control the initialization process. At that point, nobody can use the device since it isn't initialized and we don't need to postpone any cleanups, so we can do the cleanup right away when there are communication problems with the vhost-blk daemon. On operation the disconnect may happen when the device is in use, so the device users may want to use vhost_dev's data to do a rollback before vhost_dev is re-initialized (e.g. in vhost_dev_set_log()), so we postpone the cleanup. The patch splits those two cases: it performs the cleanup immediately on initialization and postpones the cleanup when the device is initialized and in use.
Signed-off-by: Denis Plotnikov --- hw/block/vhost-user-blk.c | 48 +++ 1 file changed, 24 insertions(+), 24 deletions(-) diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c index 1af95ec6aae7..4e215f71f152 100644 --- a/hw/block/vhost-user-blk.c +++ b/hw/block/vhost-user-blk.c @@ -402,38 +402,38 @@ static void vhost_user_blk_event(void *opaque, QEMUChrEvent event, break; case CHR_EVENT_CLOSED: /* - * A close event may happen during a read/write, but vhost - * code assumes the vhost_dev remains setup, so delay the - * stop & clear. There are two possible paths to hit this - * disconnect event: - * 1. When VM is in the RUN_STATE_PRELAUNCH state. The - * vhost_user_blk_device_realize() is a caller. - * 2. In tha main loop phase after VM start. - * - * For p2 the disconnect event will be delayed. We can't - * do the same for p1, because we are not running the loop - * at this moment. So just skip this step and perform - * disconnect in the caller function. - * - * TODO: maybe it is a good idea to make the same fix - * for other vhost-user devices. + * Closing the connection should happen differently on device + * initialization and operation stages. + * On initalization, we want to re-start vhost_dev initialization + * from the very beginning right away when the connection is closed, + * so we clean up vhost_dev on each connection closing. + * On operation, we want to postpone vhost_dev cleanup to let the + * other code perform its own cleanup sequence using vhost_dev data + * (e.g. vhost_dev_set_log). */ if (realized) { +/* + * A close event may happen during a read/write, but vhost + * code assumes the vhost_dev remains setup, so delay the + * stop & clear. + */ AioContext *ctx = qemu_get_current_aio_context(); qemu_chr_fe_set_handlers(&s->chardev, NULL, NULL, NULL, NULL, NULL, NULL, false); aio_bh_schedule_oneshot(ctx, vhost_user_blk_chr_closed_bh, opaque); -} -/* - * Move vhost device to the stopped state.
The vhost-user device - * will be clean up and disconnected in BH. This can be useful in - * the vhost migration code. If disconnect was caught there is an - * option for the general vhost code to get the dev state without - * knowing its type (in this case vhost-user). - */ -s->dev.started = false; +/* + * Move vhost device to the stopped state. The vhost-user device + * will be clean up and disconnected in BH. This can be useful in + * the vhost migration code. If disconnect was caught there is an + * option for the general vhost code to get the dev state without + * knowing its type (in this case vhost-user). + */ +s->dev.started = false; +} else { +vhost_user_blk_disconnect(dev); +} break;
[PATCH v2 0/2] vhost-user-blk: fix bug on device disconnection during initialization
v2:
* split the initial patch into two (Raphael)
* rename init to realized (Raphael)
* remove unrelated comment (Raphael)

When the vhost-user-blk device loses the connection to the daemon during the initialization phase, it kills qemu because of an assert in the code. The series fixes the bug.

0001 is preparation for the fix
0002 fixes the bug, patch description has the full motivation for the series

Denis Plotnikov (2): vhost-user-blk: use different event handlers on initialization vhost-user-blk: perform immediate cleanup if disconnect on initialization

hw/block/vhost-user-blk.c | 79 --- 1 file changed, 48 insertions(+), 31 deletions(-) -- 2.25.1
Re: [PATCH v1] vhost-user-blk: use different event handlers on init and operation
ping! On 11.03.2021 11:10, Denis Plotnikov wrote: Commit a1a20d06b73e "vhost-user-blk: delay vhost_user_blk_disconnect" introduced postponing vhost_dev cleanup aiming to eliminate qemu aborts because of connection problems with the vhost-blk daemon. However, it introduces a new problem. Now, any communication errors during execution of vhost_dev_init() called by vhost_user_blk_device_realize() lead to a qemu abort on the assert in vhost_dev_get_config(). This happens because vhost_user_blk_disconnect() is postponed but it should have dropped the s->connected flag by the time vhost_user_blk_device_realize() performs a new connection opening. On the connection opening, vhost_dev initialization in vhost_user_blk_connect() relies on the s->connected flag and if it's not dropped, it skips vhost_dev initialization and returns with success. Then, vhost_user_blk_device_realize()'s execution flow goes to vhost_dev_get_config() where it's aborted on the assert. It seems connection/disconnection processing should happen differently on initialization and operation of vhost-user-blk. On initialization (in vhost_user_blk_device_realize()) we fully control the initialization process. At that point, nobody can use the device since it isn't initialized and we don't need to postpone any cleanups, so we can do the cleanup right away when there are communication problems with the vhost-blk daemon. On operation the disconnect may happen when the device is in use, so the device users may want to use vhost_dev's data to do a rollback before vhost_dev is re-initialized (e.g. in vhost_dev_set_log()), so we postpone the cleanup. The patch splits those two cases: it performs the cleanup immediately on initialization and postpones the cleanup when the device is initialized and in use.
Signed-off-by: Denis Plotnikov --- hw/block/vhost-user-blk.c | 88 --- 1 file changed, 54 insertions(+), 34 deletions(-) diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c index b870a50e6b20..84940122b8ca 100644 --- a/hw/block/vhost-user-blk.c +++ b/hw/block/vhost-user-blk.c @@ -362,7 +362,17 @@ static void vhost_user_blk_disconnect(DeviceState *dev) vhost_dev_cleanup(&s->dev); } -static void vhost_user_blk_event(void *opaque, QEMUChrEvent event); +static void vhost_user_blk_event(void *opaque, QEMUChrEvent event, bool init); + +static void vhost_user_blk_event_init(void *opaque, QEMUChrEvent event) +{ +vhost_user_blk_event(opaque, event, true); +} + +static void vhost_user_blk_event_oper(void *opaque, QEMUChrEvent event) +{ +vhost_user_blk_event(opaque, event, false); +} static void vhost_user_blk_chr_closed_bh(void *opaque) { @@ -371,11 +381,11 @@ static void vhost_user_blk_chr_closed_bh(void *opaque) VHostUserBlk *s = VHOST_USER_BLK(vdev); vhost_user_blk_disconnect(dev); -qemu_chr_fe_set_handlers(&s->chardev, NULL, NULL, vhost_user_blk_event, -NULL, opaque, NULL, true); +qemu_chr_fe_set_handlers(&s->chardev, NULL, NULL, +vhost_user_blk_event_oper, NULL, opaque, NULL, true); } -static void vhost_user_blk_event(void *opaque, QEMUChrEvent event) +static void vhost_user_blk_event(void *opaque, QEMUChrEvent event, bool init) { DeviceState *dev = opaque; VirtIODevice *vdev = VIRTIO_DEVICE(dev); @@ -390,38 +400,42 @@ static void vhost_user_blk_event(void *opaque, QEMUChrEvent event) break; case CHR_EVENT_CLOSED: /* - * A close event may happen during a read/write, but vhost - * code assumes the vhost_dev remains setup, so delay the - * stop & clear. There are two possible paths to hit this - * disconnect event: - * 1. When VM is in the RUN_STATE_PRELAUNCH state. The - * vhost_user_blk_device_realize() is a caller. - * 2. In tha main loop phase after VM start. - * - * For p2 the disconnect event will be delayed.
We can't - * do the same for p1, because we are not running the loop - * at this moment. So just skip this step and perform - * disconnect in the caller function. - * - * TODO: maybe it is a good idea to make the same fix - * for other vhost-user devices. + * Closing the connection should happen differently on device + * initialization and operation stages. + * On initialization, we want to re-start vhost_dev initialization + * from the very beginning right away when the connection is closed, + * so we clean up vhost_dev on each connection closing. + * On operation, we want to postpone vhost_dev cleanup to let the + * other code perform its own cleanup sequence using vhost_dev data + * (e.g. vhost_dev_set_log). */ -if (runstate_is_running()) { +if (init) { +vhost_user_blk_disconnect(dev); +} else { +/* + * A close event m
[PATCH v1] configure: add option to implicitly enable/disable libgio
Now, compilation of util/dbus is implicit and depends on libgio presence on the building host. The patch adds options to manage libgio dependencies explicitly. Signed-off-by: Denis Plotnikov --- configure | 60 --- 1 file changed, 39 insertions(+), 21 deletions(-) diff --git a/configure b/configure index 34fccaa2bae6..23eed988be81 100755 --- a/configure +++ b/configure @@ -465,6 +465,7 @@ fuse_lseek="auto" multiprocess="auto" malloc_trim="auto" +gio="$default_feature" # parse CC options second for opt do @@ -1560,6 +1561,10 @@ for opt do ;; --disable-multiprocess) multiprocess="disabled" ;; + --enable-gio) gio=yes + ;; + --disable-gio) gio=no + ;; *) echo "ERROR: unknown option $opt" echo "Try '$0 --help' for more information" @@ -1913,6 +1918,7 @@ disabled with --disable-FEATURE, default is enabled if available fuseFUSE block device export fuse-lseek SEEK_HOLE/SEEK_DATA support for FUSE exports multiprocessOut of process device emulation support + gio libgio support NOTE: The object files are built at the place where configure is launched EOF @@ -3319,17 +3325,19 @@ if test "$static" = yes && test "$mingw32" = yes; then glib_cflags="-DGLIB_STATIC_COMPILATION $glib_cflags" fi -if $pkg_config --atleast-version=$glib_req_ver gio-2.0; then -gio_cflags=$($pkg_config --cflags gio-2.0) -gio_libs=$($pkg_config --libs gio-2.0) -gdbus_codegen=$($pkg_config --variable=gdbus_codegen gio-2.0) -if [ ! -x "$gdbus_codegen" ]; then -gdbus_codegen= -fi -# Check that the libraries actually work -- Ubuntu 18.04 ships -# with pkg-config --static --libs data for gio-2.0 that is missing -# -lblkid and will give a link error. 
-cat > $TMPC <<EOF +cat > $TMPC <<EOF int main(void) { @@ -3337,18 +3345,28 @@ int main(void) return 0; } EOF -if compile_prog "$gio_cflags" "$gio_libs" ; then -gio=yes -else -gio=no +if compile_prog "$gio_cflags" "$gio_libs" ; then +pass=yes +else +pass=no +fi + +if test "$pass" = "yes" && +$pkg_config --atleast-version=$glib_req_ver gio-unix-2.0; then +gio_cflags="$gio_cflags $($pkg_config --cflags gio-unix-2.0)" +gio_libs="$gio_libs $($pkg_config --libs gio-unix-2.0)" +fi fi -else -gio=no -fi -if $pkg_config --atleast-version=$glib_req_ver gio-unix-2.0; then -gio_cflags="$gio_cflags $($pkg_config --cflags gio-unix-2.0)" -gio_libs="$gio_libs $($pkg_config --libs gio-unix-2.0)" +if test "$pass" = "no"; then +if test "$gio" = "yes"; then +feature_not_found "gio" "Install libgio >= 2.0" +else +gio=no +fi +else +gio=yes +fi fi # Sanity check that the current size_t matches the -- 2.25.1
[PATCH v1] softmmu/vl: make default prealloc-threads work w/o -mem-prealloc
Preallocation in memory backends can be specified with either global QEMU option "-mem-prealloc", or with per-backend property "prealloc=true". In the latter case, however, the default for the number of preallocation threads is not set to the number of vcpus, but remains at 1 instead. Fix it by setting the "prealloc-threads" sugar property of "memory-backend" to the number of vcpus unconditionally. Fixes: ffac16fab3 ("hostmem: introduce "prealloc-threads" property") Signed-off-by: Denis Plotnikov --- softmmu/vl.c | 17 ++--- 1 file changed, 10 insertions(+), 7 deletions(-) diff --git a/softmmu/vl.c b/softmmu/vl.c index ff488ea3e7db..e392e226a2d3 100644 --- a/softmmu/vl.c +++ b/softmmu/vl.c @@ -2300,14 +2300,17 @@ static void qemu_validate_options(void) static void qemu_process_sugar_options(void) { -if (mem_prealloc) { -char *val; +char *val; -val = g_strdup_printf("%d", - (uint32_t) qemu_opt_get_number(qemu_find_opts_singleton("smp-opts"), "cpus", 1)); -object_register_sugar_prop("memory-backend", "prealloc-threads", val, - false); -g_free(val); +val = g_strdup_printf("%d", + (uint32_t) qemu_opt_get_number( + qemu_find_opts_singleton("smp-opts"), "cpus", 1)); + +object_register_sugar_prop("memory-backend", "prealloc-threads", val, +false); +g_free(val); + +if (mem_prealloc) { object_register_sugar_prop("memory-backend", "prealloc", "on", false); } -- 2.25.1
[PATCH v1] vhost-user-blk: use different event handlers on init and operation
Commit a1a20d06b73e "vhost-user-blk: delay vhost_user_blk_disconnect" introduced postponing vhost_dev cleanup, aiming to eliminate qemu aborts caused by connection problems with the vhost-blk daemon. However, it introduces a new problem. Now, any communication errors during execution of vhost_dev_init() called by vhost_user_blk_device_realize() lead to a qemu abort on an assert in vhost_dev_get_config(). This happens because vhost_user_blk_disconnect() is postponed, but it should have dropped the s->connected flag by the time vhost_user_blk_device_realize() performs a new connection opening. On the connection opening, vhost_dev initialization in vhost_user_blk_connect() relies on the s->connected flag, and if it's not dropped, it skips vhost_dev initialization and returns with success. Then, vhost_user_blk_device_realize()'s execution flow goes to vhost_dev_get_config(), where it's aborted on the assert. It seems connection/disconnection processing should happen differently on initialization and operation of vhost-user-blk. On initialization (in vhost_user_blk_device_realize()) we fully control the initialization process. At that point, nobody can use the device since it isn't initialized, and we don't need to postpone any cleanups, so we can do the cleanup right away when there are communication problems with the vhost-blk daemon. During operation the disconnect may happen when the device is in use, so the device users may want to use vhost_dev's data to do a rollback before vhost_dev is re-initialized (e.g. in vhost_dev_set_log()), so we postpone the cleanup. The patch splits those two cases: it performs the cleanup immediately on initialization, and postpones the cleanup when the device is initialized and in use. 
Signed-off-by: Denis Plotnikov --- hw/block/vhost-user-blk.c | 88 --- 1 file changed, 54 insertions(+), 34 deletions(-) diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c index b870a50e6b20..84940122b8ca 100644 --- a/hw/block/vhost-user-blk.c +++ b/hw/block/vhost-user-blk.c @@ -362,7 +362,17 @@ static void vhost_user_blk_disconnect(DeviceState *dev) vhost_dev_cleanup(&s->dev); } -static void vhost_user_blk_event(void *opaque, QEMUChrEvent event); +static void vhost_user_blk_event(void *opaque, QEMUChrEvent event, bool init); + +static void vhost_user_blk_event_init(void *opaque, QEMUChrEvent event) +{ +vhost_user_blk_event(opaque, event, true); +} + +static void vhost_user_blk_event_oper(void *opaque, QEMUChrEvent event) +{ +vhost_user_blk_event(opaque, event, false); +} static void vhost_user_blk_chr_closed_bh(void *opaque) { @@ -371,11 +381,11 @@ static void vhost_user_blk_chr_closed_bh(void *opaque) VHostUserBlk *s = VHOST_USER_BLK(vdev); vhost_user_blk_disconnect(dev); -qemu_chr_fe_set_handlers(&s->chardev, NULL, NULL, vhost_user_blk_event, -NULL, opaque, NULL, true); +qemu_chr_fe_set_handlers(&s->chardev, NULL, NULL, +vhost_user_blk_event_oper, NULL, opaque, NULL, true); } -static void vhost_user_blk_event(void *opaque, QEMUChrEvent event) +static void vhost_user_blk_event(void *opaque, QEMUChrEvent event, bool init) { DeviceState *dev = opaque; VirtIODevice *vdev = VIRTIO_DEVICE(dev); @@ -390,38 +400,42 @@ static void vhost_user_blk_event(void *opaque, QEMUChrEvent event) break; case CHR_EVENT_CLOSED: /* - * A close event may happen during a read/write, but vhost - * code assumes the vhost_dev remains setup, so delay the - * stop & clear. There are two possible paths to hit this - * disconnect event: - * 1. When VM is in the RUN_STATE_PRELAUNCH state. The - * vhost_user_blk_device_realize() is a caller. - * 2. In the main loop phase after VM start. - * - * For p2 the disconnect event will be delayed. 
We can't - * do the same for p1, because we are not running the loop - * at this moment. So just skip this step and perform - * disconnect in the caller function. - * - * TODO: maybe it is a good idea to make the same fix - * for other vhost-user devices. + * Closing the connection should happen differently on device + * initialization and operation stages. + * On initialization, we want to re-start vhost_dev initialization + * from the very beginning right away when the connection is closed, + * so we clean up vhost_dev on each connection closing. + * On operation, we want to postpone vhost_dev cleanup to let the + * other code perform its own cleanup sequence using vhost_dev data + * (e.g. vhost_dev_set_log). */ -if (runstate_is_running()) { +if (init) { +vhost_user_blk_disconnect(dev); +} else { +/* + * A close event may happen during a read/write, but vhost + * code assumes the vhos
Re: [PATCH v0 3/4] migration: add background snapshot
On 29.07.2020 16:27, Dr. David Alan Gilbert wrote: ... /** * ram_find_and_save_block: finds a dirty page and sends it to f * @@ -1782,6 +2274,7 @@ static int ram_find_and_save_block(RAMState *rs, bool last_stage) pss.block = rs->last_seen_block; pss.page = rs->last_page; pss.complete_round = false; +pss.page_copy = NULL; if (!pss.block) { pss.block = QLIST_FIRST_RCU(&ram_list.blocks); @@ -1794,11 +2287,30 @@ static int ram_find_and_save_block(RAMState *rs, bool last_stage) if (!found) { /* priority queue empty, so just search for something dirty */ found = find_dirty_block(rs, &pss, &again); + +if (found && migrate_background_snapshot()) { +/* + * make a copy of the page and + * pass it to the page search status + */ +int ret; +ret = ram_copy_page(pss.block, pss.page, &pss.page_copy); I'm a bit confused about why we hit this; the way I'd thought about your code was we turn on the write faulting, do one big save and then fixup the faults as the save is happening (doing the copies) as the writes hit; so when does this case hit? To make it more clear, let me draw the whole picture: When we do background snapshot, the vm is paused until all vmstate EXCEPT ram is saved. RAM isn't written at all. That vmstate part is saved in the temporary buffer. Then all the RAM is marked as read-only and the vm is un-paused. Note that at this moment all vm's vCPUs are running and can touch any part of memory. After that, the migration thread starts writing the ram content. Once a memory chunk is written, the write protection is removed for that chunk. If a vCPU wants to write to a memory page which is write protected (hasn't been written yet), this write is intercepted, the memory page is copied and queued for writing, and the memory page write access is restored. The intention behind that is to allow a vCPU to work with a memory page as soon as possible. So I think I'm confusing this description with the code I'm seeing above. 
The code above, being in ram_find_and_save_block, makes me think it's calling ram_copy_page for every page at the point just before it writes it - I'm not seeing how that corresponds to what you're saying about it being queued when the CPU tries to write it. You are right. The code should be different there. It seems that I confused myself as well by sending a wrong version of the patch set. I think this series should be re-sent so my description above corresponds to the code. Thanks, Denis Once all the RAM has been written, the rest of the vmstate is written from the buffer. This needs to be so because some of the emulated devices, saved in that buffered vmstate part, expect the RAM content to be available first on loading. Right, same type of problem as postcopy. Dave I hope this description will make things more clear. If not, please let me know, so I could add more details. Denis -- Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
Re: [PATCH v0 3/4] migration: add background snapshot
On 24.07.2020 03:08, Peter Xu wrote: On Wed, Jul 22, 2020 at 11:11:32AM +0300, Denis Plotnikov wrote: +/** + * ram_copy_page: make a page copy + * + * Used in the background snapshot to make a copy of a memory page. + * Ensures that the memory page is copied only once. + * When a page copy is done, restores read/write access to the memory + * page. + * If a page is being copied by another thread, wait until the copying + * ends and exit. + * + * Returns: + * -1 - on error + * 0 - the page wasn't copied by the function call + * 1 - the page has been copied + * + * @block: RAM block to use + * @page_nr: the page number to copy + * @page_copy: the pointer to return a page copy + * + */ +int ram_copy_page(RAMBlock *block, unsigned long page_nr, + void **page_copy) +{ +void *host_page; +int res = 0; + +atomic_inc(&ram_state->page_copier_cnt); + +if (test_and_set_bit_atomic(page_nr, block->touched_map)) { +while (!test_bit_atomic(page_nr, block->copied_map)) { +/* the page is being copied -- wait for the end of the copying */ +qemu_event_wait(&ram_state->page_copying_done); +} +goto out; +} + +*page_copy = ram_page_buffer_get(); +if (!*page_copy) { +res = -1; +goto out; +} + +host_page = block->host + (page_nr << TARGET_PAGE_BITS); +memcpy(*page_copy, host_page, TARGET_PAGE_SIZE); + +if (ram_set_rw(host_page, TARGET_PAGE_SIZE)) { +ram_page_buffer_free(*page_copy); +*page_copy = NULL; +res = -1; +goto out; +} + +set_bit_atomic(page_nr, block->copied_map); +qemu_event_set(&ram_state->page_copying_done); +qemu_event_reset(&ram_state->page_copying_done); + +res = 1; +out: +atomic_dec(&ram_state->page_copier_cnt); +return res; +} Is ram_set_rw() called on the page only if a page fault triggered? Shouldn't we also do that even in the background thread when we're proactively copying the pages? Yes, we should. It's a bug. Besides the current solution, do you think we can make it simpler by only delivering the fault request to the background thread? 
We can let the background thread do all the rest, and IIUC we can drop all the complicated sync bitmaps and so on by doing so. The process can look like: - background thread runs the general precopy migration, and, - it only does the ram_bulk_stage, which is the first loop, because for a snapshot there is no reason to send a page twice. - After copying one page, do ram_set_rw() always, so accessing this page will never trigger a write-protect page fault again, - take requests from the unqueue_page() just like what's done in this series, but instead of copying the page, the page request should look exactly like the postcopy one. We don't need copy_page because the page won't be changed before we unprotect it, so it should be safe. These pages will still be with high priority because when queued it means a vCPU wrote to this protected page and faulted in userfaultfd. We need to migrate these pages first to unblock them. - the fault handler thread only needs to do: - when it gets a uffd-wp message, translate it into a postcopy-like request (calculate ramblock and offset), then queue it. That's all. I believe we can avoid the copy_page parameter that was passed around, and we can also save the two extra bitmaps and the complicated synchronizations. Do you think this would work? Yes, it would. This scheme is much simpler. I like it, in general. I use such a complicated approach to reduce all possible vCPU delays: if the storage where the snapshot is being saved is quite slow, it could lead to vCPU freezing until the page is fully written to the storage. So, with the current approach, if we don't take into account the limitation on the number of page copies, the worst case is all the VM's ram is copied and then written to the storage. In other words, the current scheme provides minimal vCPU delays and thus minimal VM CPU performance slowdown at the cost of the host's memory consumption. 
The new scheme is simple and doesn't consume extra host memory, but it can freeze vCPUs for a longer time because: * usually copying a memory page is faster than writing it to storage (think of an HDD) * writing a page to disk depends on disk performance and current disk load So it seems that we have two different strategies: 1. lower CPU delays 2. lower memory usage To be honest, I would start from your scheme as it's much simpler, and add the other if needed in the future. What do you think? Denis Besides, have we disabled dirty tracking of memslots? IIUC that's not needed for background snapshot too, so neither do we need dirty tracking nor do we need to sync the dirty bitmap from outside us (kvm/vhost/...).
Re: [PATCH v0 3/4] migration: add background snapshot
On 24.07.2020 01:15, Peter Xu wrote: On Wed, Jul 22, 2020 at 11:11:32AM +0300, Denis Plotnikov wrote: +static void *background_snapshot_thread(void *opaque) +{ +MigrationState *m = opaque; +QIOChannelBuffer *bioc; +QEMUFile *fb; +int res = 0; + +rcu_register_thread(); + +qemu_file_set_rate_limit(m->to_dst_file, INT64_MAX); + +qemu_mutex_lock_iothread(); +vm_stop(RUN_STATE_PAUSED); + +qemu_savevm_state_header(m->to_dst_file); +qemu_mutex_unlock_iothread(); +qemu_savevm_state_setup(m->to_dst_file); Is it intended to skip the BQL for the setup phase? IIUC the main thread could start the vm before we take the lock again below if we released it... Good point! +qemu_mutex_lock_iothread(); + +migrate_set_state(&m->state, MIGRATION_STATUS_SETUP, + MIGRATION_STATUS_ACTIVE); + +/* + * We want to save the vm state for the moment when the snapshot saving was + * called, but also we want to write the RAM content with the vm running. The RAM + * content should appear first in the vmstate. + * So, we first save the non-ram part of the vmstate to a temporary buffer, + * then write the ram part of the vmstate to the migration stream with vCPUs + * running and, finally, write the non-ram part of the vmstate from the + * buffer to the migration stream. + */ +bioc = qio_channel_buffer_new(4096); +qio_channel_set_name(QIO_CHANNEL(bioc), "vmstate-buffer"); +fb = qemu_fopen_channel_output(QIO_CHANNEL(bioc)); +object_unref(OBJECT(bioc)); + +if (ram_write_tracking_start()) { +goto failed_resume; +} + +if (global_state_store()) { +goto failed_resume; +} Is this needed? We should always be in the stopped state here, right? 
Yes, seems it isn't needed + +cpu_synchronize_all_states(); + +if (qemu_savevm_state_complete_precopy_non_iterable(fb, false, false)) { +goto failed_resume; +} + +vm_start(); +qemu_mutex_unlock_iothread(); + +while (!res) { +res = qemu_savevm_state_iterate(m->to_dst_file, false); + +if (res < 0 || qemu_file_get_error(m->to_dst_file)) { +goto failed; +} +} + +/* + * By this moment we have RAM content saved into the migration stream. + * The next step is to flush the non-ram content (vm devices state) + * right after the ram content. The device state was stored in + * the temporary buffer prior to the ram saving. + */ +qemu_put_buffer(m->to_dst_file, bioc->data, bioc->usage); +qemu_fflush(m->to_dst_file); + +if (qemu_file_get_error(m->to_dst_file)) { +goto failed; +} + +migrate_set_state(&m->state, MIGRATION_STATUS_ACTIVE, + MIGRATION_STATUS_COMPLETED); +goto exit; + +failed_resume: +vm_start(); +qemu_mutex_unlock_iothread(); +failed: +migrate_set_state(&m->state, MIGRATION_STATUS_ACTIVE, + MIGRATION_STATUS_FAILED); +exit: +ram_write_tracking_stop(); +qemu_fclose(fb); +qemu_mutex_lock_iothread(); +qemu_savevm_state_cleanup(); +qemu_mutex_unlock_iothread(); +rcu_unregister_thread(); +return NULL; +} + void migrate_fd_connect(MigrationState *s, Error *error_in) { Error *local_err = NULL; @@ -3599,8 +3694,14 @@ void migrate_fd_connect(MigrationState *s, Error *error_in) migrate_fd_cleanup(s); return; } -qemu_thread_create(&s->thread, "live_migration", migration_thread, s, - QEMU_THREAD_JOINABLE); +if (migrate_background_snapshot()) { +qemu_thread_create(&s->thread, "bg_snapshot", Maybe the name "live_snapshot" suits more (since the other one is "live_migration")? 
Looks like it; another good name is async_snapshot, and all the related functions and properties should be renamed accordingly + background_snapshot_thread, s, + QEMU_THREAD_JOINABLE); +} else { +qemu_thread_create(&s->thread, "live_migration", migration_thread, s, + QEMU_THREAD_JOINABLE); +} s->migration_thread_running = true; } [...] @@ -1151,9 +1188,11 @@ static int save_normal_page(RAMState *rs, RAMBlock *block, ram_addr_t offset, ram_counters.transferred += save_page_header(rs, rs->f, block, offset | RAM_SAVE_FLAG_PAGE); if (async) { -qemu_put_buffer_async(rs->f, buf, TARGET_PAGE_SIZE, - migrate_release_ram() & - migration_in_postcopy()); +bool may_free = migrate_background_snapshot() || +(migrate_release_ram() && + migration_in_postcopy()); Does background snapshot need to free the memory? /me c
Re: [PATCH v0 3/4] migration: add background snapshot
On 27.07.2020 19:48, Dr. David Alan Gilbert wrote: * Denis Plotnikov (dplotni...@virtuozzo.com) wrote: ... +static void page_fault_thread_stop(void) +{ +if (page_fault_fd) { +close(page_fault_fd); +page_fault_fd = 0; +} I think you need to do that after you've done the quit and join, otherwise the fault thread might still be reading this. Seems to be so +if (thread_quit_fd) { +uint64_t val = 1; +int ret; + +ret = write(thread_quit_fd, &val, sizeof(val)); +assert(ret == sizeof(val)); + +qemu_thread_join(&page_fault_thread); +close(thread_quit_fd); +thread_quit_fd = 0; +} +} ... /** * ram_find_and_save_block: finds a dirty page and sends it to f * @@ -1782,6 +2274,7 @@ static int ram_find_and_save_block(RAMState *rs, bool last_stage) pss.block = rs->last_seen_block; pss.page = rs->last_page; pss.complete_round = false; +pss.page_copy = NULL; if (!pss.block) { pss.block = QLIST_FIRST_RCU(&ram_list.blocks); @@ -1794,11 +2287,30 @@ static int ram_find_and_save_block(RAMState *rs, bool last_stage) if (!found) { /* priority queue empty, so just search for something dirty */ found = find_dirty_block(rs, &pss, &again); + +if (found && migrate_background_snapshot()) { +/* + * make a copy of the page and + * pass it to the page search status + */ +int ret; +ret = ram_copy_page(pss.block, pss.page, &pss.page_copy); I'm a bit confused about why we hit this; the way I'd thought about your code was we turn on the write faulting, do one big save and then fixup the faults as the save is happening (doing the copies) as the writes hit; so when does this case hit? To make it more clear, let me draw the whole picture: When we do background snapshot, the vm is paused until all vmstate EXCEPT ram is saved. RAM isn't written at all. That vmstate part is saved in the temporary buffer. Then all the RAM is marked as read-only and the vm is un-paused. Note that at this moment all vm's vCPUs are running and can touch any part of memory. After that, the migration thread starts writing the ram content. 
Once a memory chunk is written, the write protection is removed for that chunk. If a vCPU wants to write to a memory page which is write protected (hasn't been written yet), this write is intercepted, the memory page is copied and queued for writing, and the memory page write access is restored. The intention behind that is to allow a vCPU to work with a memory page as soon as possible. Once all the RAM has been written, the rest of the vmstate is written from the buffer. This needs to be so because some of the emulated devices, saved in that buffered vmstate part, expect the RAM content to be available first on loading. I hope this description will make things more clear. If not, please let me know, so I could add more details. Denis -- Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
Re: [PATCH v0 0/4] background snapshot
On 23.07.2020 20:39, Peter Xu wrote: On Thu, Jul 23, 2020 at 11:03:55AM +0300, Denis Plotnikov wrote: On 22.07.2020 19:30, Peter Xu wrote: On Wed, Jul 22, 2020 at 06:47:44PM +0300, Denis Plotnikov wrote: On 22.07.2020 18:42, Denis Plotnikov wrote: On 22.07.2020 17:50, Peter Xu wrote: Hi, Denis, Hi, Peter ... How to use: 1. enable background snapshot capability virsh qemu-monitor-command vm --hmp migrate_set_capability background-snapshot on 2. stop the vm virsh qemu-monitor-command vm --hmp stop 3. Start the external migration to a file virsh qemu-monitor-command cent78-bs --hmp migrate exec:'cat ./vm_state' 4. Wait for the migration finish and check that the migration has completed state. Thanks for continued working on this project! I have two high level questions before dig into the patches. Firstly, is step 2 required? Can we use a single QMP command to take snapshots (which can still be a "migrate" command)? With this series it is required, but steps 2 and 3 should be merged into a single one. I'm not sure whether you're talking about the disk snapshot operations, anyway yeah it'll be definitely good if we merge them into one in the next version. After thinking for a while, I remembered why I split these two steps. The vm snapshot consists of two parts: disk(s) snapshot(s) and vmstate. With migrate command we save the vmstate only. So, the steps to save the whole vm snapshot is the following: 2. stop the vm virsh qemu-monitor-command vm --hmp stop 2.1. Make a disk snapshot, something like virsh qemu-monitor-command vm --hmp snapshot_blkdev drive-scsi0-0-0-0 ./new_data 3. Start the external migration to a file virsh qemu-monitor-command vm --hmp migrate exec:'cat ./vm_state' In this example, vm snapshot consists of two files: vm_state and the disk file. new_data will contain all new disk data written since [2.1.] executing. But that's slightly different to the current interface of savevm and loadvm which only requires a snapshot name, am I right? 
Yes. Now we need both a snapshot name (of the vmstate) and the name of the new snapshot image. Yes. I'm not familiar with qemu image snapshots... my understanding is that the current snapshot (save_snapshot) used internal image snapshots, while in this proposal you want the live snapshot to use external snapshots. Correct, I want to add the ability to make an external live snapshot (live = async RAM writing). Is there any criteria on making this decision/change? Internal snapshots are supported by qcow2 and sheepdog (I never heard of someone using the latter). Because of the qcow2 internal snapshot design, it's quite complex to implement a "background" snapshot there. More details here: https://www.mail-archive.com/qemu-devel@nongnu.org/msg705116.html So, I decided to start with external snapshots to implement and prove out the memory access intercepting part first. Once it's done for external snapshots we can start to approach internal snapshots. Thanks, Denis
Re: [PATCH v0 0/4] background snapshot
On 22.07.2020 19:30, Peter Xu wrote: On Wed, Jul 22, 2020 at 06:47:44PM +0300, Denis Plotnikov wrote: On 22.07.2020 18:42, Denis Plotnikov wrote: On 22.07.2020 17:50, Peter Xu wrote: Hi, Denis, Hi, Peter ... How to use: 1. enable background snapshot capability virsh qemu-monitor-command vm --hmp migrate_set_capability background-snapshot on 2. stop the vm virsh qemu-monitor-command vm --hmp stop 3. Start the external migration to a file virsh qemu-monitor-command cent78-bs --hmp migrate exec:'cat ./vm_state' 4. Wait for the migration finish and check that the migration has completed state. Thanks for continued working on this project! I have two high level questions before dig into the patches. Firstly, is step 2 required? Can we use a single QMP command to take snapshots (which can still be a "migrate" command)? With this series it is required, but steps 2 and 3 should be merged into a single one. I'm not sure whether you're talking about the disk snapshot operations, anyway yeah it'll be definitely good if we merge them into one in the next version. After thinking for a while, I remembered why I split these two steps. The vm snapshot consists of two parts: disk(s) snapshot(s) and vmstate. With migrate command we save the vmstate only. So, the steps to save the whole vm snapshot is the following: 2. stop the vm virsh qemu-monitor-command vm --hmp stop 2.1. Make a disk snapshot, something like virsh qemu-monitor-command vm --hmp snapshot_blkdev drive-scsi0-0-0-0 ./new_data 3. Start the external migration to a file virsh qemu-monitor-command vm --hmp migrate exec:'cat ./vm_state' In this example, vm snapshot consists of two files: vm_state and the disk file. new_data will contain all new disk data written since [2.1.] executing. Meanwhile, we might also want to check around the type of backend RAM. E.g., shmem and hugetlbfs are still not supported for uffd-wp (which I'm still working on). 
I didn't check explicitly whether we'll simply fail the migration for those cases since the uffd ioctls will fail for those kinds of RAM. It would be okay if we handle all the ioctl failures gracefully, The ioctl's result is processed but the patch has a flaw - it ignores the result of ioctl check. Need to fix it. It happens here: +int ram_write_tracking_start(void) +{ +if (page_fault_thread_start()) { +return -1; +} + +ram_block_list_create(); +ram_block_list_set_readonly(); << this returns -1 in case of failure but I just ignore it here + +return 0; +} or it would be even better if we directly fail when we want to enable live snapshot capability for a guest who contains other types of ram besides private anonymous memories. I agree, but to know whether shmem or hugetlbfs are supported by the current kernel we should execute the ioctl for all memory regions on the capability enabling. Yes, that seems to be a better solution, so we don't care about the type of ram backend anymore but check directly with the uffd ioctls. With these checks, it'll be even fine to ignore the above retcode, or just assert, because we've already checked that before that point. Thanks,
Re: [PATCH v0 0/4] background snapshot
On 22.07.2020 18:42, Denis Plotnikov wrote: On 22.07.2020 17:50, Peter Xu wrote: Hi, Denis, Hi, Peter ... How to use: 1. enable background snapshot capability virsh qemu-monitor-command vm --hmp migrate_set_capability background-snapshot on 2. stop the vm virsh qemu-monitor-command vm --hmp stop 3. Start the external migration to a file virsh qemu-monitor-command cent78-bs --hmp migrate exec:'cat > ./vm_state' 4. Wait for the migration finish and check that the migration has completed state. Thanks for continued working on this project! I have two high level questions before dig into the patches. Firstly, is step 2 required? Can we use a single QMP command to take snapshots (which can still be a "migrate" command)? With this series it is required, but steps 2 and 3 should be merged into a single one. Meanwhile, we might also want to check around the type of backend RAM. E.g., shmem and hugetlbfs are still not supported for uffd-wp (which I'm still working on). I didn't check explicitly whether we'll simply fail the migration for those cases since the uffd ioctls will fail for those kinds of RAM. It would be okay if we handle all the ioctl failures gracefully, The ioctl's result is processed but the patch has a flaw - it ignores the result of ioctl check. Need to fix it. It happens here: +int ram_write_tracking_start(void) +{ +if (page_fault_thread_start()) { +return -1; +} + +ram_block_list_create(); +ram_block_list_set_readonly(); << this returns -1 in case of failure but I just ignore it here + +return 0; +} or it would be even better if we directly fail when we want to enable live snapshot capability for a guest who contains other types of ram besides private anonymous memories. I agree, but to know whether shmem or hugetlbfs are supported by the current kernel we should execute the ioctl for all memory regions on the capability enabling. Thanks, Denis
Re: [PATCH v0 0/4] background snapshot
On 22.07.2020 17:50, Peter Xu wrote:
Hi, Denis,

Hi, Peter
...

How to use:
1. enable background snapshot capability
   virsh qemu-monitor-command vm --hmp migrate_set_capability background-snapshot on
2. stop the vm
   virsh qemu-monitor-command vm --hmp stop
3. Start the external migration to a file
   virsh qemu-monitor-command cent78-bs --hmp migrate exec:'cat > ./vm_state'
4. Wait for the migration to finish and check that the migration has reached the completed state.

Thanks for continuing to work on this project! I have two high-level questions before digging into the patches.

Firstly, is step 2 required? Can we use a single QMP command to take snapshots (which can still be a "migrate" command)?

With this series it is required, but steps 2 and 3 should be merged into a single one.

Meanwhile, we might also want to check around the type of backend RAM. E.g., shmem and hugetlbfs are still not supported for uffd-wp (which I'm still working on). I didn't check explicitly whether we'll simply fail the migration for those cases, since the uffd ioctls will fail for those kinds of RAM. It would be okay if we handle all the ioctl failures gracefully.

The ioctl's result is processed, but the patch has a flaw: it ignores the result of the ioctl check. This needs to be fixed.

Or it would be even better if we directly fail when we want to enable the live snapshot capability for a guest that contains other types of RAM besides private anonymous memory.

I agree, but to know whether shmem or hugetlbfs are supported by the current kernel we should execute the ioctl for all memory regions when the capability is enabled.

Thanks,
Denis
[PATCH v0 3/4] migration: add background snapshot
At the moment, taking a VM snapshot may cause significant VM downtime, depending on the VM RAM size and the performance of the disk storing the snapshot. This happens because the VM has to be paused until all vmstate, including RAM, is written. To reduce the downtime, the background snapshot capability is used. With the capability enabled, the VM is paused only for the small amount of time needed to write the smallest part of the vmstate (everything except RAM). RAM, the biggest part of the vmstate, is written while the VM is running.

Signed-off-by: Denis Plotnikov
---
 include/exec/ramblock.h |   8 +
 include/exec/ramlist.h  |   2 +
 migration/ram.h         |  19 +-
 migration/savevm.h      |   3 +
 migration/migration.c   | 107 +++-
 migration/ram.c         | 578 ++--
 migration/savevm.c      |   1 -
 7 files changed, 698 insertions(+), 20 deletions(-)

diff --git a/include/exec/ramblock.h b/include/exec/ramblock.h index 07d50864d8..421e128ef6 100644 --- a/include/exec/ramblock.h +++ b/include/exec/ramblock.h @@ -59,6 +59,14 @@ struct RAMBlock { */ unsigned long *clear_bmap; uint8_t clear_bmap_shift; + +/* The following 3 elements are for background snapshot */ +/* List of blocks used for background snapshot */ +QLIST_ENTRY(RAMBlock) bgs_next; +/* Pages currently being copied */ +unsigned long *touched_map; +/* Pages that have been copied already */ +unsigned long *copied_map; }; #endif #endif diff --git a/include/exec/ramlist.h b/include/exec/ramlist.h index bc4faa1b00..74e2a1162c 100644 --- a/include/exec/ramlist.h +++ b/include/exec/ramlist.h @@ -44,6 +44,8 @@ typedef struct { unsigned long *blocks[]; } DirtyMemoryBlocks; +typedef QLIST_HEAD(, RAMBlock) RamBlockList; + typedef struct RAMList { QemuMutex mutex; RAMBlock *mru_block; diff --git a/migration/ram.h b/migration/ram.h index 2eeaacfa13..769b8087ae 100644 --- a/migration/ram.h +++ b/migration/ram.h @@ -42,7 +42,8 @@ uint64_t ram_bytes_remaining(void); uint64_t ram_bytes_total(void); uint64_t ram_pagesize_summary(void); -int ram_save_queue_pages(const char *rbname, ram_addr_t start,
ram_addr_t len); +int ram_save_queue_pages(RAMBlock *block, const char *rbname, + ram_addr_t start, ram_addr_t len, void *page_copy); void acct_update_position(QEMUFile *f, size_t size, bool zero); void ram_debug_dump_bitmap(unsigned long *todump, bool expected, unsigned long pages); @@ -69,4 +70,20 @@ void colo_flush_ram_cache(void); void colo_release_ram_cache(void); void colo_incoming_start_dirty_log(void); +/* for background snapshot */ +void ram_block_list_create(void); +void ram_block_list_destroy(void); +RAMBlock *ram_bgs_block_find(uint64_t address, ram_addr_t *page_offset); + +void *ram_page_buffer_get(void); +int ram_page_buffer_free(void *buffer); + +int ram_block_list_set_readonly(void); +int ram_block_list_set_writable(void); + +int ram_copy_page(RAMBlock *block, unsigned long page_nr, void **page_copy); +int ram_process_page_fault(uint64_t address); + +int ram_write_tracking_start(void); +void ram_write_tracking_stop(void); #endif diff --git a/migration/savevm.h b/migration/savevm.h index ba64a7e271..4f4edffa85 100644 --- a/migration/savevm.h +++ b/migration/savevm.h @@ -64,5 +64,8 @@ int qemu_loadvm_state(QEMUFile *f); void qemu_loadvm_state_cleanup(void); int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis); int qemu_load_device_state(QEMUFile *f); +int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f, +bool in_postcopy, +bool inactivate_disks); #endif diff --git a/migration/migration.c b/migration/migration.c index 2ec0451abe..dc56e4974f 100644 --- a/migration/migration.c +++ b/migration/migration.c @@ -55,6 +55,7 @@ #include "net/announce.h" #include "qemu/queue.h" #include "multifd.h" +#include "sysemu/cpus.h" #define MAX_THROTTLE (32 << 20) /* Migration transfer speed throttling */ @@ -2473,7 +2474,7 @@ static void migrate_handle_rp_req_pages(MigrationState *ms, const char* rbname, return; } -if (ram_save_queue_pages(rbname, start, len)) { +if (ram_save_queue_pages(NULL, rbname, start, len, NULL)) { 
mark_source_rp_bad(ms); } } @@ -3536,6 +3537,100 @@ static void *migration_thread(void *opaque) return NULL; } +static void *background_snapshot_thread(void *opaque) +{ +MigrationState *m = opaque; +QIOChannelBuffer *bioc; +QEMUFile *fb; +int res = 0; + +rcu_register_thread(); + +qemu_file_set_rate_limit(m->to_dst_file, INT64_MAX); + +qemu_mutex_lock_iothread(); +vm_stop(RUN_STATE_PAUSED); + +qemu_savevm_state_header(m->to_dst_file); +qemu_mutex_unlock_iothread(); +qemu_savevm_state_setup(m->to_dst_file); +qemu
[PATCH v0 2/4] migration: add background snapshot capability
The capability is used for background snapshot enabling. The background snapshot logic is going to be added in the following patch. Signed-off-by: Denis Plotnikov --- qapi/migration.json | 7 ++- migration/migration.h | 1 + migration/migration.c | 35 +++ 3 files changed, 42 insertions(+), 1 deletion(-) diff --git a/qapi/migration.json b/qapi/migration.json index d5000558c6..46681a5c3c 100644 --- a/qapi/migration.json +++ b/qapi/migration.json @@ -424,6 +424,11 @@ # @validate-uuid: Send the UUID of the source to allow the destination # to ensure it is the same. (since 4.2) # +# @background-snapshot: If enabled, the migration stream will be a snapshot +# of the VM exactly at the point when the migration +# procedure starts. The VM RAM is saved with running VM. +# (since 5.2) +# # Since: 1.2 ## { 'enum': 'MigrationCapability', @@ -431,7 +436,7 @@ 'compress', 'events', 'postcopy-ram', 'x-colo', 'release-ram', 'block', 'return-path', 'pause-before-switchover', 'multifd', 'dirty-bitmaps', 'postcopy-blocktime', 'late-block-activate', - 'x-ignore-shared', 'validate-uuid' ] } + 'x-ignore-shared', 'validate-uuid', 'background-snapshot' ] } ## # @MigrationCapabilityStatus: diff --git a/migration/migration.h b/migration/migration.h index f617960522..63f2fde9a3 100644 --- a/migration/migration.h +++ b/migration/migration.h @@ -322,6 +322,7 @@ int migrate_compress_wait_thread(void); int migrate_decompress_threads(void); bool migrate_use_events(void); bool migrate_postcopy_blocktime(void); +bool migrate_background_snapshot(void); /* Sending on the return path - generic and then for each message type */ void migrate_send_rp_shut(MigrationIncomingState *mis, diff --git a/migration/migration.c b/migration/migration.c index 2ed9923227..2ec0451abe 100644 --- a/migration/migration.c +++ b/migration/migration.c @@ -1086,6 +1086,32 @@ static bool migrate_caps_check(bool *cap_list, error_setg(errp, "Postcopy is not compatible with ignore-shared"); return false; } + +if 
(cap_list[MIGRATION_CAPABILITY_BACKGROUND_SNAPSHOT]) { +error_setg(errp, "Postcopy is not compatible " +"with background snapshot"); +return false; +} +} + +if (cap_list[MIGRATION_CAPABILITY_BACKGROUND_SNAPSHOT]) { +if (cap_list[MIGRATION_CAPABILITY_RELEASE_RAM]) { +error_setg(errp, "Background snapshot is not compatible " +"with release ram capability"); +return false; +} + +if (cap_list[MIGRATION_CAPABILITY_COMPRESS]) { +error_setg(errp, "Background snapshot is not " +"currently compatible with compression"); +return false; +} + +if (cap_list[MIGRATION_CAPABILITY_XBZRLE]) { +error_setg(errp, "Background snapshot is not " +"currently compatible with XBZRLE"); +return false; +} } return true; @@ -2390,6 +2416,15 @@ bool migrate_use_block_incremental(void) return s->parameters.block_incremental; } +bool migrate_background_snapshot(void) +{ +MigrationState *s; + +s = migrate_get_current(); + +return s->enabled_capabilities[MIGRATION_CAPABILITY_BACKGROUND_SNAPSHOT]; +} + /* migration thread support */ /* * Something bad happened to the RP stream, mark an error -- 2.17.0
[PATCH v0 1/4] bitops: add some atomic versions of bitmap operations
1. test bit
2. test and set bit

Signed-off-by: Denis Plotnikov
Reviewed-by: Peter Xu
---
 include/qemu/bitops.h | 25 +
 1 file changed, 25 insertions(+)

diff --git a/include/qemu/bitops.h b/include/qemu/bitops.h
index f55ce8b320..63218afa5a 100644
--- a/include/qemu/bitops.h
+++ b/include/qemu/bitops.h
@@ -95,6 +95,21 @@ static inline int test_and_set_bit(long nr, unsigned long *addr)
     return (old & mask) != 0;
 }

+/**
+ * test_and_set_bit_atomic - Set a bit atomically and return its old value
+ * @nr: Bit to set
+ * @addr: Address to count from
+ */
+static inline int test_and_set_bit_atomic(long nr, unsigned long *addr)
+{
+    unsigned long mask = BIT_MASK(nr);
+    unsigned long *p = addr + BIT_WORD(nr);
+    unsigned long old;
+
+    old = atomic_fetch_or(p, mask);
+    return (old & mask) != 0;
+}
+
 /**
  * test_and_clear_bit - Clear a bit and return its old value
  * @nr: Bit to clear
@@ -135,6 +150,16 @@ static inline int test_bit(long nr, const unsigned long *addr)
     return 1UL & (addr[BIT_WORD(nr)] >> (nr & (BITS_PER_LONG-1)));
 }

+/**
+ * test_bit_atomic - Determine whether a bit is set atomically
+ * @nr: bit number to test
+ * @addr: Address to start counting from
+ */
+static inline int test_bit_atomic(long nr, const unsigned long *addr)
+{
+    long valid_nr = nr & (BITS_PER_LONG - 1);
+    return 1UL & (atomic_read(&addr[BIT_WORD(nr)]) >> valid_nr);
+}
 /**
  * find_last_bit - find the last set bit in a memory region
  * @addr: The address to start the search at
--
2.17.0
[PATCH v0 4/4] background snapshot: add trace events for page fault processing
Signed-off-by: Denis Plotnikov
---
 migration/ram.c        | 4
 migration/trace-events | 2 ++
 2 files changed, 6 insertions(+)

diff --git a/migration/ram.c b/migration/ram.c
index f187b5b494..29712a11c2 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -2172,12 +2172,16 @@ again:
             break;
         }

+        trace_page_fault_processing_start(msg.arg.pagefault.address);
+
         if (ram_process_page_fault(msg.arg.pagefault.address) < 0) {
             error_report("page fault: error on write protected page "
                          "processing [0x%llx]", msg.arg.pagefault.address);
             break;
         }
+
+        trace_page_fault_processing_finish(msg.arg.pagefault.address);
     }

     return NULL;
diff --git a/migration/trace-events
index 4ab0a503d2..f46b3b9a72 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -128,6 +128,8 @@ save_xbzrle_page_skipping(void) ""
 save_xbzrle_page_overflow(void) ""
 ram_save_iterate_big_wait(uint64_t milliconds, int iterations) "big wait: %" PRIu64 " milliseconds, %d iterations"
 ram_load_complete(int ret, uint64_t seq_iter) "exit_code %d seq iteration %" PRIu64
+page_fault_processing_start(unsigned long address) "HVA: 0x%lx"
+page_fault_processing_finish(unsigned long address) "HVA: 0x%lx"

 # migration.c
 await_return_path_close_on_source_close(void) ""
--
2.17.0
[PATCH v0 0/4] background snapshot
Currently there is no way to make a VM snapshot without pausing the VM for the whole time until the snapshot is done. So, the problem is the VM downtime during snapshotting. The downtime value depends on the vmstate size, the major part of which is RAM, and on the performance of the disk used for saving the snapshot.

The series proposes a way to reduce the VM snapshot downtime. This is done by saving RAM, the major part of the vmstate, in the background while the VM is running. The background snapshot uses the Linux UFFD write-protected mode for intercepting memory page accesses. UFFD write-protected mode was added in Linux v5.7. If UFFD write-protected mode isn't available, the background snapshot refuses to run.

How to use:
1. enable background snapshot capability
   virsh qemu-monitor-command vm --hmp migrate_set_capability background-snapshot on
2. stop the vm
   virsh qemu-monitor-command vm --hmp stop
3. Start the external migration to a file
   virsh qemu-monitor-command cent78-bs --hmp migrate exec:'cat > ./vm_state'
4. Wait for the migration to finish and check that the migration has reached the completed state.

Denis Plotnikov (4):
  bitops: add some atomic versions of bitmap operations
  migration: add background snapshot capability
  migration: add background snapshot
  background snapshot: add trace events for page fault processing

 qapi/migration.json     |   7 +-
 include/exec/ramblock.h |   8 +
 include/exec/ramlist.h  |   2 +
 include/qemu/bitops.h   |  25 ++
 migration/migration.h   |   1 +
 migration/ram.h         |  19 +-
 migration/savevm.h      |   3 +
 migration/migration.c   | 142 +-
 migration/ram.c         | 582 ++--
 migration/savevm.c      |   1 -
 migration/trace-events  |   2 +
 11 files changed, 771 insertions(+), 21 deletions(-)

--
2.17.0
Re: [RFC PATCH 0/3] block: Synchronous bdrv_*() from coroutine in different AioContext
I'm not quite sure I understand the point. Let's see the whole picture of the async snapshot: our goal is to minimize VM downtime during the snapshot. When we do an async snapshot, we save the vmstate except RAM while the VM is stopped, using the current L1 table (henceforth the initial L1 table). Then, we want the VM to start running while we write the RAM content. At this time all RAM is write-protected. We unprotect each RAM page once it has been written.

Oh, I see, you're basically doing something like postcopy migration. I was assuming it was more like regular live migration, except that you would overwrite updated RAM blocks instead of appending them. I can see your requirement then.

All those RAM pages should go to the snapshot using the initial L1 table. Since the VM is running, it may want to write new disk blocks, so we need to use a NEW L1 table to provide this ability. (Am I correct so far?) Thus, if I understand correctly, we need to use two L1 tables: the initial one to store RAM pages in the vmstate and the new one to allow disk writes. Maybe I just can't see a better way to achieve that. Please correct me if I'm wrong.

I guess I could imagine a different, though probably not better, way: we could internally have a separate low-level operation that moves the VM state from the active layer to an already existing disk snapshot. Then you would snapshot the disk and start writing the VM state to the active layer, and when the VM state write has completed you move it into the snapshot.

The other option is doing what you suggested. There is nothing in the qcow2 on-disk format that would prevent this, but we would have to extend the qcow2 driver to allow I/O to inactive L1 tables. This sounds like a non-trivial amount of code changes, though it would potentially enable more use cases we never implemented ((read-only) access to internal snapshots as block nodes, so you could e.g. use block jobs to export a snapshot).

Kevin

Ok, thanks for validating the possibilities and for the additional implementation ideas.
I think I should start by trying to post my background snapshot version that stores the vmstate to an external file, because write-protected userfaultfd is now available in Linux. And if it's accepted, I'll try to come up with an internal version for qcow2 (it seems this is the only format supporting this). Denis
Re: [RFC PATCH 0/3] block: Synchronous bdrv_*() from coroutine in different AioContext
On 19.05.2020 17:18, Kevin Wolf wrote: Am 19.05.2020 um 15:54 hat Denis Plotnikov geschrieben: On 19.05.2020 15:32, Vladimir Sementsov-Ogievskiy wrote: 14.05.2020 17:26, Kevin Wolf wrote: Am 14.05.2020 um 15:21 hat Thomas Lamprecht geschrieben: On 5/12/20 4:43 PM, Kevin Wolf wrote: Stefan (Reiter), after looking a bit closer at this, I think there is no bug in QEMU, but the bug is in your coroutine code that calls block layer functions without moving into the right AioContext first. I've written this series anyway as it potentially makes the life of callers easier and would probably make your buggy code correct. However, it doesn't feel right to commit something like patch 2 without having a user for it. Is there a reason why you can't upstream your async snapshot code? I mean I understand what you mean, but it would make the interface IMO so much easier to use, if one wants to explicit schedule it beforehand they can still do. But that would open the way for two styles doing things, not sure if this would seen as bad. The assert about from patch 3/3 would be already really helping a lot, though. I think patches 1 and 3 are good to be committed either way if people think they are useful. They make sense without the async snapshot code. My concern with the interface in patch 2 is both that it could give people a false sense of security and that it would be tempting to write inefficient code. Usually, you won't have just a single call into the block layer for a given block node, but you'll perform multiple operations. Switching to the target context once rather than switching back and forth in every operation is obviously more efficient. But chances are that even if one of these function is bdrv_flush(), which now works correctly from a different thread, you might need another function that doesn't implement the same magic. So you always need to be aware which functions support cross-context calls which ones don't. I feel we'd see a few bugs related to this. 
Regarding upstreaming, there was some historical attempt to upstream it from Dietmar, but in the time frame of ~ 8 to 10 years ago or so. I'm not quite sure why it didn't went through then, I see if I can get some time searching the mailing list archive. We'd be naturally open and glad to upstream it, what it effectively allow us to do is to not block the VM to much during snapshoting it live. Yes, there is no doubt that this is useful functionality. There has been talk about this every now and then, but I don't think we ever got to a point where it actually could be implemented. Vladimir, I seem to remember you (or someone else from your team?) were interested in async snapshots as well a while ago? Den is working on this (add him to CC) Yes, I was working on that. What I've done can be found here: https://github.com/denis-plotnikov/qemu/commits/bgs_uffd The idea was to save a snapshot (state+ram) asynchronously to a separate (raw) file using the existing infrastructure. The goal of that was to reduce the VM downtime on snapshot. We decided to postpone this work until "userfaultfd write protected mode" wasn't in the linux mainstream. Now, userfaultfd-wp is merged to linux. So we have plans to continue this work. According to the saving the "internal" snapshot to qcow2 I still have a question. May be this is the right place and time to ask. If I remember correctly, in qcow2 the snapshot is stored at the end of the address space of the current block-placement-table. Yes. Basically the way snapshots with VM state work is that you write the VM state to some offset after the end of the virtual disk, when the VM state is completely written you snapshot the current state (including both content of the virtual disk and VM state) and finally discard the VM state again in the active L1 table. We switch to the new block-placement-table after the snapshot storing is complete. 
In the case of a sync snapshot, we should switch the table before the snapshot is written; in other words, we should be able to preallocate the space for the snapshot and keep a link to that space until snapshot writing is completed.

I don't see a fundamental difference between sync and async in this respect. Why can't you write the VM state to the current L1 table first like we usually do?

I'm not quite sure I understand the point. Let's see the whole picture of the async snapshot: our goal is to minimize VM downtime during the snapshot. When we do an async snapshot, we save the vmstate except RAM while the VM is stopped, using the current L1 table (henceforth the initial L1 table). Then, we want the VM to start running while we write the RAM content. At this time all RAM is write-protected. We unprotect each RAM page once it has been written. All those RAM pages should go to the snapshot using the initial L1 table. Since the VM is running, it may want to write new disk blocks, so we need to use a NEW L1 table to provide this ability. (Am I correct so far?) Thus, if I
Re: [RFC PATCH 0/3] block: Synchronous bdrv_*() from coroutine in different AioContext
On 19.05.2020 15:32, Vladimir Sementsov-Ogievskiy wrote: 14.05.2020 17:26, Kevin Wolf wrote: Am 14.05.2020 um 15:21 hat Thomas Lamprecht geschrieben: On 5/12/20 4:43 PM, Kevin Wolf wrote: Stefan (Reiter), after looking a bit closer at this, I think there is no bug in QEMU, but the bug is in your coroutine code that calls block layer functions without moving into the right AioContext first. I've written this series anyway as it potentially makes the life of callers easier and would probably make your buggy code correct. However, it doesn't feel right to commit something like patch 2 without having a user for it. Is there a reason why you can't upstream your async snapshot code? I mean I understand what you mean, but it would make the interface IMO so much easier to use, if one wants to explicit schedule it beforehand they can still do. But that would open the way for two styles doing things, not sure if this would seen as bad. The assert about from patch 3/3 would be already really helping a lot, though. I think patches 1 and 3 are good to be committed either way if people think they are useful. They make sense without the async snapshot code. My concern with the interface in patch 2 is both that it could give people a false sense of security and that it would be tempting to write inefficient code. Usually, you won't have just a single call into the block layer for a given block node, but you'll perform multiple operations. Switching to the target context once rather than switching back and forth in every operation is obviously more efficient. But chances are that even if one of these function is bdrv_flush(), which now works correctly from a different thread, you might need another function that doesn't implement the same magic. So you always need to be aware which functions support cross-context calls which ones don't. I feel we'd see a few bugs related to this. 
Regarding upstreaming, there was a historical attempt by Dietmar to upstream it, but in a time frame of roughly 8 to 10 years ago. I'm not quite sure why it didn't go through then; I'll see if I can get some time to search the mailing list archive. We'd naturally be open and glad to upstream it; what it effectively allows us to do is avoid blocking the VM too much while snapshotting it live.

Yes, there is no doubt that this is useful functionality. There has been talk about this every now and then, but I don't think we ever got to a point where it actually could be implemented. Vladimir, I seem to remember you (or someone else from your team?) were interested in async snapshots as well a while ago?

Den is working on this (add him to CC)

Yes, I was working on that. What I've done can be found here: https://github.com/denis-plotnikov/qemu/commits/bgs_uffd

The idea was to save a snapshot (state + RAM) asynchronously to a separate (raw) file using the existing infrastructure. The goal was to reduce the VM downtime on snapshot. We decided to postpone this work until "userfaultfd write-protected mode" landed in the Linux mainline. Now userfaultfd-wp is merged into Linux, so we have plans to continue this work.

Regarding saving the "internal" snapshot to qcow2, I still have a question. Maybe this is the right place and time to ask. If I remember correctly, in qcow2 the snapshot is stored at the end of the address space of the current block-placement-table. We switch to the new block-placement-table after the snapshot storing is complete. In the case of a sync snapshot, we should switch the table before the snapshot is written; in other words, we should be able to preallocate the space for the snapshot and keep a link to that space until snapshot writing is completed. The question is whether this could be done without modifying qcow2 and, if not, could you please give some ideas on how to implement it?
Denis I pushed a tree[0] with mostly just that specific code squashed together (hope I did not break anything), most of the actual code is in commit [1]. It'd be cleaned up a bit and checked for coding style issues, but works good here. Anyway, thanks for your help and pointers! [0]: https://github.com/ThomasLamprecht/qemu/tree/savevm-async [1]: https://github.com/ThomasLamprecht/qemu/commit/ffb9531f370ef0073e4b6f6021f4c47ccd702121 It doesn't even look that bad in terms of patch size. I had imagined it a bit larger. But it seems this is not really just an async 'savevm' (which would save the VM state in a qcow2 file), but you store the state in a separate raw file. What is the difference between this and regular migration into a file? I remember people talking about how snapshotting can store things in a way that a normal migration stream can't do, like overwriting outdated RAM state instead of just appending the new state, but you don't seem to implement something like this. Kevin
[PATCH v24 3/4] qcow2: add zstd cluster compression
zstd significantly reduces cluster compression time. It provides better compression performance while maintaining the same compression ratio as zlib, which, at the moment, is the only compression method available.

The performance test results:
The test compresses and decompresses a qemu qcow2 image with a just-installed rhel-7.6 guest.
Image cluster size: 64K. Image on-disk size: 2.2G
The test was conducted with a brd disk to reduce the influence of the disk subsystem on the test results.
The results are given in seconds.

compress cmd:
  time ./qemu-img convert -O qcow2 -c -o compression_type=[zlib|zstd] src.img [zlib|zstd]_compressed.img
decompress cmd:
  time ./qemu-img convert -O qcow2 [zlib|zstd]_compressed.img uncompressed.img

         compression                decompression
         zlib     zstd              zlib     zstd
real     65.5     16.3 (-75 %)      1.9      1.6 (-16 %)
user     65.0     15.8              5.3      2.5
sys       3.3      0.2              2.0      2.0

Both ZLIB and ZSTD gave the same compression ratio: 1.57; the compressed image size in both cases: 1.4G

Signed-off-by: Denis Plotnikov
QAPI part:
Acked-by: Markus Armbruster
---
 docs/interop/qcow2.txt |   1 +
 configure              |   2 +-
 qapi/block-core.json   |   3 +-
 block/qcow2-threads.c  | 169 +
 block/qcow2.c          |   7 ++
 5 files changed, 180 insertions(+), 2 deletions(-)

diff --git a/docs/interop/qcow2.txt b/docs/interop/qcow2.txt
index 298a031310..cb723463f2 100644
--- a/docs/interop/qcow2.txt
+++ b/docs/interop/qcow2.txt
@@ -212,6 +212,7 @@ version 2.
Available compression type values: 0: zlib <https://www.zlib.net/> +1: zstd <http://github.com/facebook/zstd> === Header padding === diff --git a/configure b/configure index 23b5e93752..4e3a1690ea 100755 --- a/configure +++ b/configure @@ -1861,7 +1861,7 @@ disabled with --disable-FEATURE, default is enabled if available: lzfse support of lzfse compression library (for reading lzfse-compressed dmg images) zstdsupport for zstd compression library - (for migration compression) + (for migration compression and qcow2 cluster compression) seccomp seccomp support coroutine-pool coroutine freelist (better performance) glusterfs GlusterFS backend diff --git a/qapi/block-core.json b/qapi/block-core.json index 1522e2983f..6fbacddab2 100644 --- a/qapi/block-core.json +++ b/qapi/block-core.json @@ -4293,11 +4293,12 @@ # Compression type used in qcow2 image file # # @zlib: zlib compression, see <http://zlib.net/> +# @zstd: zstd compression, see <http://github.com/facebook/zstd> # # Since: 5.1 ## { 'enum': 'Qcow2CompressionType', - 'data': [ 'zlib' ] } + 'data': [ 'zlib', { 'name': 'zstd', 'if': 'defined(CONFIG_ZSTD)' } ] } ## # @BlockdevCreateOptionsQcow2: diff --git a/block/qcow2-threads.c b/block/qcow2-threads.c index 7dbaf53489..1914baf456 100644 --- a/block/qcow2-threads.c +++ b/block/qcow2-threads.c @@ -28,6 +28,11 @@ #define ZLIB_CONST #include +#ifdef CONFIG_ZSTD +#include +#include +#endif + #include "qcow2.h" #include "block/thread-pool.h" #include "crypto.h" @@ -166,6 +171,160 @@ static ssize_t qcow2_zlib_decompress(void *dest, size_t dest_size, return ret; } +#ifdef CONFIG_ZSTD + +/* + * qcow2_zstd_compress() + * + * Compress @src_size bytes of data using zstd compression method + * + * @dest - destination buffer, @dest_size bytes + * @src - source buffer, @src_size bytes + * + * Returns: compressed size on success + * -ENOMEM destination buffer is not enough to store compressed data + * -EIOon any other error + */ +static ssize_t qcow2_zstd_compress(void *dest, 
size_t dest_size, + const void *src, size_t src_size) +{ +ssize_t ret; +size_t zstd_ret; +ZSTD_outBuffer output = { +.dst = dest, +.size = dest_size, +.pos = 0 +}; +ZSTD_inBuffer input = { +.src = src, +.size = src_size, +.pos = 0 +}; +ZSTD_CCtx *cctx = ZSTD_createCCtx(); + +if (!cctx) { +return -EIO; +} +/* + * Use the zstd streamed interface for symmetry with decompression, + * where streaming is essential since we don't record the exact + * compressed size. + * + * ZSTD_compressStream2() tries to compress everything it could + * with a single call. Although the ZSTD docs say that: + * "You must continue calling ZSTD_compressStream2() with ZSTD_e_end + * until it returns 0, at which point you are free to start a new frame", + * in our tests we saw the only case when it returned with
[PATCH v24 1/4] qcow2: introduce compression type feature
The patch adds some preparation parts for the incompatible compression type feature to qcow2, allowing the use of different compression methods for (de)compressing image clusters. It is implied that the compression type is set at image creation and can only be changed later by image conversion; thus the compression type defines the only compression algorithm used for the image and, therefore, for all image clusters.

The goal of the feature is to add support for other compression methods to qcow2, for example ZSTD, which is more effective at compression than ZLIB. The default compression is ZLIB. Images created with the ZLIB compression type are backward compatible with older qemu versions.

Adding the compression type breaks a number of tests, because the compression type is now reported on image creation and there are some changes in the qcow2 header in size and offsets. The tests are fixed in the following ways:
* filter out compression_type for many tests
* fix header size, feature table size and backing file offset
  affected tests: 031, 036, 061, 080
  header_size += 8: 1 byte compression type, 7 bytes padding
  feature_table += 48: incompatible feature compression type
  backing_file_offset += 56 (8 + 48 -> header_change + feature_table_change)
* add "compression type" for test output matching when it isn't filtered
  affected tests: 049, 060, 061, 065, 082, 085, 144, 182, 185, 198, 206, 242, 255, 274, 280

Signed-off-by: Denis Plotnikov
Reviewed-by: Vladimir Sementsov-Ogievskiy
Reviewed-by: Eric Blake
Reviewed-by: Max Reitz
QAPI part:
Acked-by: Markus Armbruster
---
 qapi/block-core.json             |  22 +-
 block/qcow2.h                    |  20 +-
 include/block/block_int.h        |   1 +
 block/qcow2.c                    | 113 +++
 tests/qemu-iotests/031.out       |  14 ++--
 tests/qemu-iotests/036.out       |   4 +-
 tests/qemu-iotests/049.out       | 102 ++--
 tests/qemu-iotests/060.out       |   1 +
 tests/qemu-iotests/061.out       |  34 ++
 tests/qemu-iotests/065           |  28 +---
 tests/qemu-iotests/080           |   2 +-
 tests/qemu-iotests/082.out       |  48 +++--
 tests/qemu-iotests/085.out       |  38 +--
tests/qemu-iotests/144.out | 4 +- tests/qemu-iotests/182.out | 2 +- tests/qemu-iotests/185.out | 8 +-- tests/qemu-iotests/198.out | 2 + tests/qemu-iotests/206.out | 5 ++ tests/qemu-iotests/242.out | 5 ++ tests/qemu-iotests/255.out | 8 +-- tests/qemu-iotests/274.out | 49 +++--- tests/qemu-iotests/280.out | 2 +- tests/qemu-iotests/common.filter | 3 +- 23 files changed, 365 insertions(+), 150 deletions(-) diff --git a/qapi/block-core.json b/qapi/block-core.json index 943df1926a..1522e2983f 100644 --- a/qapi/block-core.json +++ b/qapi/block-core.json @@ -78,6 +78,8 @@ # # @bitmaps: A list of qcow2 bitmap details (since 4.0) # +# @compression-type: the image cluster compression method (since 5.1) +# # Since: 1.7 ## { 'struct': 'ImageInfoSpecificQCow2', @@ -89,7 +91,8 @@ '*corrupt': 'bool', 'refcount-bits': 'int', '*encrypt': 'ImageInfoSpecificQCow2Encryption', - '*bitmaps': ['Qcow2BitmapInfo'] + '*bitmaps': ['Qcow2BitmapInfo'], + 'compression-type': 'Qcow2CompressionType' } } ## @@ -4284,6 +4287,18 @@ 'data': [ 'v2', 'v3' ] } +## +# @Qcow2CompressionType: +# +# Compression type used in qcow2 image file +# +# @zlib: zlib compression, see <http://zlib.net/> +# +# Since: 5.1 +## +{ 'enum': 'Qcow2CompressionType', + 'data': [ 'zlib' ] } + ## # @BlockdevCreateOptionsQcow2: # @@ -4307,6 +4322,8 @@ # allowed values: off, falloc, full, metadata) # @lazy-refcounts: True if refcounts may be updated lazily (default: off) # @refcount-bits: Width of reference counts in bits (default: 16) +# @compression-type: The image cluster compression method +#(default: zlib, since 5.1) # # Since: 2.12 ## @@ -4322,7 +4339,8 @@ '*cluster-size':'size', '*preallocation': 'PreallocMode', '*lazy-refcounts': 'bool', -'*refcount-bits': 'int' } } +'*refcount-bits': 'int', +'*compression-type':'Qcow2CompressionType' } } ## # @BlockdevCreateOptionsQed: diff --git a/block/qcow2.h b/block/qcow2.h index f4de0a27d5..6a8b82e6cc 100644 --- a/block/qcow2.h +++ b/block/qcow2.h @@ -146,8 +146,16 @@ typedef struct 
QCowHeader { uint32_t refcount_order; uint32_t header_length; + +/* Additional fields */ +uint8_t compression_type; + +/* header must be a multiple of 8 */ +uint8_t padding[7]; } QEMU_PACKED QCowHeader; +QEMU_BUILD_BUG_ON(!QEMU_IS_ALIGNED(sizeof(QCowHeader), 8)); + typedef struct QEMU_PACKED QCowSnapshotHeader { /* heade
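The alignment requirement enforced by the new QEMU_BUILD_BUG_ON can be double-checked with a little header-size arithmetic. The sketch below is plain Python struct formats, not QEMU code; it reproduces the numbers from the commit message: the v2 header is 72 bytes, v3 adds 32 bytes for a total of 104, and the patch appends a 1-byte compression type plus 7 bytes of padding for 112, a multiple of 8.

```python
import struct

# qcow2 v2 header fields, big-endian: magic, version, backing_file_offset,
# backing_file_size, cluster_bits, size, crypt_method, l1_size,
# l1_table_offset, refcount_table_offset, refcount_table_clusters,
# nb_snapshots, snapshots_offset
V2_FIELDS = ">IIQIIQIIQQIIQ"
# v3 adds: incompatible/compatible/autoclear features, refcount_order,
# header_length
V3_EXTRA = ">QQQII"
# this patch appends: compression_type (u8) plus 7 padding bytes
PATCH_EXTRA = ">B7x"

v2_len = struct.calcsize(V2_FIELDS)
v3_len = v2_len + struct.calcsize(V3_EXTRA)
new_len = v3_len + struct.calcsize(PATCH_EXTRA)

assert (v2_len, v3_len, new_len) == (72, 104, 112)
assert new_len % 8 == 0  # what QEMU_BUILD_BUG_ON enforces at compile time
```

This also explains the test-fix arithmetic above: header_size grows by 8, and together with the 48-byte feature table entry the backing file offset moves by 56.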
[PATCH v24 0/4] implement zstd cluster compression method
v24: 01: add "compression_type" to the tests output [Max] hopefully, I've found them all. v23: Undecided: whether to add zstd(zlib) compression details to the qcow2 spec 03: tighten assertion on zstd decompression [Eric] 04: use _rm_test_img appropriately [Max] v22: 03: remove assignment in if condition v21: 03: * remove the loop on compression [Max] * use designated initializers [Max] 04: * don't erase user's options [Max] * use _rm_test_img [Max] * add unsupported qcow2 options [Max] v20: 04: fix a number of flaws [Vladimir] * don't use $RAND_FILE passing to qemu-io, so the check for $TEST_DIR is redundant * re-arrange $RAND_FILE writing * fix a typo v19: 04: fix a number of flaws [Eric] * remove redundant test case descriptions * fix stdout redirect * don't use (()) * use peek_file_be instead of od * check $TEST_DIR for spaces and other special characters before using * use $RAND_FILE more safely v18: * 04: add quotes to all file name variables [Vladimir] * 04: add Vladimir's comment regarding the "qemu-io write -s" option issue. v17: * 03: remove incorrect comment in zstd decompress [Vladimir] * 03: remove "paraniod" and rewrite the comment on decompress [Vladimir] * 03: fix dead assignment [Vladimir] * 04: add and remove quotes [Vladimir] * 04: replace long offset form with the short one [Vladimir] v16: * 03: ssize_t for ret, size_t for zstd_ret [Vladimir] * 04: small fixes according to the comments [Vladimir] v15: * 01: aiming at qemu 5.1 [Eric] * 03: change zstd_res definition place [Vladimir] * 04: add two new test cases [Eric] 1. test adjacent cluster compression with zstd 2. test incompressible cluster processing * 03, 04: much rewording and grammar fixing [Eric] v14: * fix bug on compression - looping until compress == 0 [Me] * apply reworked Vladimir's suggestions: 1. not mixing ssize_t with size_t 2. safe check for ENOMEM in compression part - avoid overflow 3. 
tolerate the sanity check: allow zstd to make progress on only one of the buffers v13: * 03: add progress sanity check to decompression loop [Vladimir] 03: add successful decompression check [Me] v12: * 03: again, rework compression and decompression loops to make them more correct [Vladimir] 03: move assert in compression to a more appropriate place [Vladimir] v11: * 03: the loops don't need "do{}while" form anymore and they were buggy (missed "do" in the beginning) replace them with usual "while(){}" loops [Vladimir] v10: * 03: fix zstd (de)compression loops for multi-frame cases [Vladimir] v9: * 01: fix error checking and reporting in the qcow2_amend compression type part [Vladimir] * 03: replace asserts with -EIO in qcow2_zstd_decompression [Vladimir, Alberto] * 03: reword/amend/add comments, fix typos [Vladimir] v8: * 03: switch zstd API from simple to stream [Eric] No need to state a special cluster layout for zstd compressed clusters. v7: * use qapi_enum_parse instead of open-coding [Eric] * fix wording, typos and spelling [Eric] v6: * "block/qcow2-threads: fix qcow2_decompress" is removed from the series since it has been accepted by Max already * add compile-time checking for Qcow2Header to be a multiple of 8 [Max, Alberto] * report error on qcow2 amending when the compression type is actually changed [Max] * remove the extra space and the extra new line [Max] * re-arrange acks and signed-off-s [Vladimir] v5: * replace -ENOTSUP with abort in qcow2_co_decompress [Vladimir] * set cluster size for all test cases in the beginning of the 287 test v4: * the series is rebased on top of 01 "block/qcow2-threads: fix qcow2_decompress" * 01 is just a no-change resend to avoid extra dependencies. 
Still, it may be merged separately v3: * remove redundant max compression type value check [Vladimir, Eric] (the switch below checks everything) * prevent compression type changing on "qemu-img amend" [Vladimir] * remove zstd config setting, since it has been added already by the "migration" patches [Vladimir] * change the compression type error message [Vladimir] * fix alignment and lines exceeding 80 chars [Vladimir] v2: * rework compression type setting [Vladimir] * squash iotest changes into the compression type introduction patch [Vladimir, Eric] * fix zstd availability checking in the zstd iotest [Vladimir] * remove unnecessary casting [Eric] * remove redundant checks [Eric] * fix compressed cluster layout in the qcow2 spec [Vladimir] * fix wording [Eric, Vladimir] * fix compression type filtering in iotests [Eric] v1: the initial s
[PATCH v24 2/4] qcow2: rework the cluster compression routine
The patch enables processing the image compression type defined for the image and chooses an appropriate method for image clusters (de)compression. Signed-off-by: Denis Plotnikov Reviewed-by: Vladimir Sementsov-Ogievskiy Reviewed-by: Alberto Garcia Reviewed-by: Max Reitz --- block/qcow2-threads.c | 71 --- 1 file changed, 60 insertions(+), 11 deletions(-) diff --git a/block/qcow2-threads.c b/block/qcow2-threads.c index a68126f291..7dbaf53489 100644 --- a/block/qcow2-threads.c +++ b/block/qcow2-threads.c @@ -74,7 +74,9 @@ typedef struct Qcow2CompressData { } Qcow2CompressData; /* - * qcow2_compress() + * qcow2_zlib_compress() + * + * Compress @src_size bytes of data using zlib compression method * * @dest - destination buffer, @dest_size bytes * @src - source buffer, @src_size bytes @@ -83,8 +85,8 @@ typedef struct Qcow2CompressData { * -ENOMEM destination buffer is not enough to store compressed data * -EIOon any other error */ -static ssize_t qcow2_compress(void *dest, size_t dest_size, - const void *src, size_t src_size) +static ssize_t qcow2_zlib_compress(void *dest, size_t dest_size, + const void *src, size_t src_size) { ssize_t ret; z_stream strm; @@ -119,10 +121,10 @@ static ssize_t qcow2_compress(void *dest, size_t dest_size, } /* - * qcow2_decompress() + * qcow2_zlib_decompress() * * Decompress some data (not more than @src_size bytes) to produce exactly - * @dest_size bytes. 
+ * @dest_size bytes using zlib compression method * * @dest - destination buffer, @dest_size bytes * @src - source buffer, @src_size bytes @@ -130,8 +132,8 @@ static ssize_t qcow2_compress(void *dest, size_t dest_size, * Returns: 0 on success * -EIO on fail */ -static ssize_t qcow2_decompress(void *dest, size_t dest_size, -const void *src, size_t src_size) +static ssize_t qcow2_zlib_decompress(void *dest, size_t dest_size, + const void *src, size_t src_size) { int ret; z_stream strm; @@ -191,20 +193,67 @@ qcow2_co_do_compress(BlockDriverState *bs, void *dest, size_t dest_size, return arg.ret; } +/* + * qcow2_co_compress() + * + * Compress @src_size bytes of data using the compression + * method defined by the image compression type + * + * @dest - destination buffer, @dest_size bytes + * @src - source buffer, @src_size bytes + * + * Returns: compressed size on success + * a negative error code on failure + */ ssize_t coroutine_fn qcow2_co_compress(BlockDriverState *bs, void *dest, size_t dest_size, const void *src, size_t src_size) { -return qcow2_co_do_compress(bs, dest, dest_size, src, src_size, -qcow2_compress); +BDRVQcow2State *s = bs->opaque; +Qcow2CompressFunc fn; + +switch (s->compression_type) { +case QCOW2_COMPRESSION_TYPE_ZLIB: +fn = qcow2_zlib_compress; +break; + +default: +abort(); +} + +return qcow2_co_do_compress(bs, dest, dest_size, src, src_size, fn); } +/* + * qcow2_co_decompress() + * + * Decompress some data (not more than @src_size bytes) to produce exactly + * @dest_size bytes using the compression method defined by the image + * compression type + * + * @dest - destination buffer, @dest_size bytes + * @src - source buffer, @src_size bytes + * + * Returns: 0 on success + * a negative error code on failure + */ ssize_t coroutine_fn qcow2_co_decompress(BlockDriverState *bs, void *dest, size_t dest_size, const void *src, size_t src_size) { -return qcow2_co_do_compress(bs, dest, dest_size, src, src_size, -qcow2_decompress); +BDRVQcow2State *s = 
bs->opaque; +Qcow2CompressFunc fn; + +switch (s->compression_type) { +case QCOW2_COMPRESSION_TYPE_ZLIB: +fn = qcow2_zlib_decompress; +break; + +default: +abort(); +} + +return qcow2_co_do_compress(bs, dest, dest_size, src, src_size, fn); } -- 2.17.0
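The switch-based dispatch in the patch can be sketched outside QEMU as a small table mapping the image's compression type to a (compress, decompress) pair. The sketch below is illustrative Python using the stdlib zlib module, not QEMU's thread-pool code; the function names are invented, and an unknown type is treated as a programming error, mirroring the patch's abort().

```python
import zlib

# Each known compression type maps to a (compress, decompress) pair.
# Only zlib exists at this point in the series.
COMPRESSION_FUNCS = {
    "zlib": (lambda buf: zlib.compress(buf, 9), zlib.decompress),
}

def co_compress(compression_type, buf):
    # KeyError here plays the role of the abort() in the patch:
    # reaching it means the image was opened with an unknown type.
    compress, _ = COMPRESSION_FUNCS[compression_type]
    return compress(buf)

def co_decompress(compression_type, buf):
    _, decompress = COMPRESSION_FUNCS[compression_type]
    return decompress(buf)

data = b"qcow2 cluster payload " * 64
packed = co_compress("zlib", data)
assert len(packed) < len(data)               # compressible data shrinks
assert co_decompress("zlib", packed) == data  # lossless round trip
```

Adding a new method (as patch 3/4 does for zstd) then only touches the table, not the callers.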
[PATCH v24 4/4] iotests: 287: add qcow2 compression type test
The test checks fulfilling qcow2 requirements for the compression type feature and zstd compression type operability. Signed-off-by: Denis Plotnikov Reviewed-by: Vladimir Sementsov-Ogievskiy Tested-by: Vladimir Sementsov-Ogievskiy Reviewed-by: Eric Blake --- tests/qemu-iotests/287 | 152 + tests/qemu-iotests/287.out | 67 tests/qemu-iotests/group | 1 + 3 files changed, 220 insertions(+) create mode 100755 tests/qemu-iotests/287 create mode 100644 tests/qemu-iotests/287.out diff --git a/tests/qemu-iotests/287 b/tests/qemu-iotests/287 new file mode 100755 index 00..f98a4cadc1 --- /dev/null +++ b/tests/qemu-iotests/287 @@ -0,0 +1,152 @@ +#!/usr/bin/env bash +# +# Test case for an image using zstd compression +# +# Copyright (c) 2020 Virtuozzo International GmbH +# +# This program is free software; you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation; either version 2 of the License, or +# (at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program. If not, see <http://www.gnu.org/licenses/>. +# + +# creator +owner=dplotni...@virtuozzo.com + +seq="$(basename $0)" +echo "QA output created by $seq" + +status=1 # failure is the default! + +# standard environment +. ./common.rc +. 
./common.filter + +# This tests qocw2-specific low-level functionality +_supported_fmt qcow2 +_supported_proto file +_supported_os Linux +_unsupported_imgopts 'compat=0.10' data_file + +COMPR_IMG="$TEST_IMG.compressed" +RAND_FILE="$TEST_DIR/rand_data" + +_cleanup() +{ +_cleanup_test_img +_rm_test_img "$COMPR_IMG" +rm -f "$RAND_FILE" +} +trap "_cleanup; exit \$status" 0 1 2 3 15 + +# for all the cases +CLUSTER_SIZE=65536 + +# Check if we can run this test. +if IMGOPTS='compression_type=zstd' _make_test_img 64M | +grep "Invalid parameter 'zstd'"; then +_notrun "ZSTD is disabled" +fi + +echo +echo "=== Testing compression type incompatible bit setting for zlib ===" +echo +_make_test_img -o compression_type=zlib 64M +$PYTHON qcow2.py "$TEST_IMG" dump-header | grep incompatible_features + +echo +echo "=== Testing compression type incompatible bit setting for zstd ===" +echo +_make_test_img -o compression_type=zstd 64M +$PYTHON qcow2.py "$TEST_IMG" dump-header | grep incompatible_features + +echo +echo "=== Testing zlib with incompatible bit set ===" +echo +_make_test_img -o compression_type=zlib 64M +$PYTHON qcow2.py "$TEST_IMG" set-feature-bit incompatible 3 +# to make sure the bit was actually set +$PYTHON qcow2.py "$TEST_IMG" dump-header | grep incompatible_features + +if $QEMU_IMG info "$TEST_IMG" >/dev/null 2>&1 ; then +echo "Error: The image opened successfully. The image must not be opened." +fi + +echo +echo "=== Testing zstd with incompatible bit unset ===" +echo +_make_test_img -o compression_type=zstd 64M +$PYTHON qcow2.py "$TEST_IMG" set-header incompatible_features 0 +# to make sure the bit was actually unset +$PYTHON qcow2.py "$TEST_IMG" dump-header | grep incompatible_features + +if $QEMU_IMG info "$TEST_IMG" >/dev/null 2>&1 ; then +echo "Error: The image opened successfully. The image must not be opened." 
+fi + +echo +echo "=== Testing compression type values ===" +echo +# zlib=0 +_make_test_img -o compression_type=zlib 64M +peek_file_be "$TEST_IMG" 104 1 +echo + +# zstd=1 +_make_test_img -o compression_type=zstd 64M +peek_file_be "$TEST_IMG" 104 1 +echo + +echo +echo "=== Testing simple reading and writing with zstd ===" +echo +_make_test_img -o compression_type=zstd 64M +$QEMU_IO -c "write -c -P 0xAC 64K 64K " "$TEST_IMG" | _filter_qemu_io +$QEMU_IO -c "read -P 0xAC 64K 64K " "$TEST_IMG" | _filter_qemu_io +# read on the cluster boundaries +$QEMU_IO -c "read -v 131070 8 " "$TEST_IMG" | _filter_qemu_io +$QEMU_IO -c "read -v 65534 8" "$TEST_IMG" | _filter_qemu_io + +echo +echo "=== Testing adjacent clusters reading and writing with zstd ===" +echo +_make_test_img -o compression_type=zstd 64M +$QEMU_IO -c "write -c -P 0xAB 0 64K " "$TEST_IMG" | _filter_qemu_io +$QEMU_IO -c "write -c -P 0xAC 64K 64K " "$TEST_IMG" | _filter_qemu_io +$QEMU_IO -c "write -c -P 0xAD 128K 64K " "$T
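The test's `peek_file_be "$TEST_IMG" 104 1` reads the compression-type byte, which sits at offset 104, immediately after the v3 header fields. A minimal sketch of that check against a fabricated header buffer (not a real image; the value 1 matches the test's `# zstd=1` comment):

```python
import struct

HEADER_LEN = 112   # v3 header plus compression_type and padding
ZSTD = 1           # compression type values in the test: zlib=0, zstd=1

header = bytearray(HEADER_LEN)
header[0:4] = b"QFI\xfb"                    # qcow2 magic
struct.pack_into(">B", header, 104, ZSTD)   # compression_type byte

def peek_be(buf, offset, size):
    """Big-endian unsigned read, like the iotests peek_file_be helper."""
    return int.from_bytes(buf[offset:offset + size], "big")

assert peek_be(header, 0, 4) == 0x514649FB  # magic "QFI\xfb"
assert peek_be(header, 104, 1) == ZSTD
```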
Re: [PATCH v23 0/4] implement zstd cluster compression method
On 05.05.2020 15:03, Max Reitz wrote: On 05.05.20 12:26, Max Reitz wrote: On 30.04.20 12:19, Denis Plotnikov wrote: v23: Undecided: whether to add zstd(zlib) compression details to the qcow2 spec 03: tighten assertion on zstd decompression [Eric] 04: use _rm_test_img appropriately [Max] Thanks, applied to my block branch: I’m afraid I have to unqueue this series again, because it makes many iotests fail due to an additional “compression_type=zlib” output when images are created, an additional “compression type” line in qemu-img info output where format-specific information is not suppressed, and an additional line in qemu-img create -f qcow2 -o help. Max Hmm, this is strange. I made some modifications to the tests in 0001 of the series (qcow2: introduce compression type feature). Among the other test-related changes, the patch contains the hunk: +++ b/tests/qemu-iotests/common.filter @@ -152,7 +152,8 @@ _filter_img_create() -e "s# refcount_bits=[0-9]\\+##g" \ -e "s# key-secret=[a-zA-Z0-9]\\+##g" \ -e "s# iter-time=[0-9]\\+##g" \ - -e "s# force_size=\\(on\\|off\\)##g" + -e "s# force_size=\\(on\\|off\\)##g" \ + -e "s# compression_type=[a-zA-Z0-9]\\+##g" } which should filter out "compression_type" on image creation. But you say that you can see "compression_type" on image creation. Maybe the patch wasn't fully applied? Or the test-related modifications were omitted? I've just re-based the series on top of: 681b07f4e76dbb700c16918d (vanilla/master, mainstream) Merge: a2261b2754 714eb0dbc5 Author: Peter Maydell Date: Tue May 5 15:47:44 2020 +0100 and ran the tests with 'make check-block' and got the following: Not run: 071 099 184 220 259 267 Some cases not run in: 030 040 041 Passed all 113 iotests Maybe I'm doing something wrong? Denis
Re: [PATCH v22 3/4] qcow2: add zstd cluster compression
On 30.04.2020 14:47, Max Reitz wrote: On 30.04.20 11:48, Denis Plotnikov wrote: On 30.04.2020 11:26, Max Reitz wrote: On 29.04.20 15:02, Vladimir Sementsov-Ogievskiy wrote: 29.04.2020 15:17, Max Reitz wrote: On 29.04.20 12:37, Vladimir Sementsov-Ogievskiy wrote: 29.04.2020 13:24, Max Reitz wrote: On 28.04.20 22:00, Denis Plotnikov wrote: zstd significantly reduces cluster compression time. It provides better compression performance maintaining the same level of the compression ratio in comparison with zlib, which, at the moment, is the only compression method available. The performance test results: Test compresses and decompresses qemu qcow2 image with just installed rhel-7.6 guest. Image cluster size: 64K. Image on disk size: 2.2G The test was conducted with brd disk to reduce the influence of disk subsystem to the test results. The results is given in seconds. compress cmd: time ./qemu-img convert -O qcow2 -c -o compression_type=[zlib|zstd] src.img [zlib|zstd]_compressed.img decompress cmd time ./qemu-img convert -O qcow2 [zlib|zstd]_compressed.img uncompressed.img compression decompression zlib zstd zlib zstd real 65.5 16.3 (-75 %) 1.9 1.6 (-16 %) user 65.0 15.8 5.3 2.5 sys 3.3 0.2 2.0 2.0 Both ZLIB and ZSTD gave the same compression ratio: 1.57 compressed image size in both cases: 1.4G Signed-off-by: Denis Plotnikov QAPI part: Acked-by: Markus Armbruster --- docs/interop/qcow2.txt | 1 + configure | 2 +- qapi/block-core.json | 3 +- block/qcow2-threads.c | 169 + block/qcow2.c | 7 ++ slirp | 2 +- 6 files changed, 181 insertions(+), 3 deletions(-) [...] diff --git a/block/qcow2-threads.c b/block/qcow2-threads.c index 7dbaf53489..a0b12e1b15 100644 --- a/block/qcow2-threads.c +++ b/block/qcow2-threads.c [...] +static ssize_t qcow2_zstd_decompress(void *dest, size_t dest_size, + const void *src, size_t src_size) +{ [...] + /* + * The compressed stream from the input buffer may consist of more + * than one zstd frame. Can it? 
If not, we must require it in the specification. Actually, now that you mention it, it would make sense anyway to add some note to the specification on what exactly "compressed with zstd" means. Hmm. If at some point we want multi-threaded compression of one big (2M) cluster... Could this be implemented with the zstd lib if multiple frames are allowed? Will allowing multiple frames help? I don't know actually, but I think it's better not to forbid it. On the other hand, I don't see any benefit in large compressed clusters. At least, in our scenarios (for compressed backups) we use 64k compressed clusters, for good granularity of incremental backups (whereas for a running VM we use 1M clusters). Is it really that important? Naïvely, it sounds rather complicated to introduce multithreading into block drivers. It is already here: compression and encryption are already multithreaded. But of course, one cluster is handled in one thread. Ah, good. I forgot. (Also, as for compression, it can only be used in backup scenarios anyway, where you write many clusters at once. So parallelism on the cluster level should be sufficient to get high usage, and it would benefit all compression types and cluster sizes.) Yes, it works this way already :) Well, OK then. So, we don't know whether we want a one-frame restriction or not. Do you have a preference? *shrug* Seems like it would be preferable to allow multiple frames still. A note in the spec would be nice (i.e., streaming format, multiple frames per cluster possible). We don't mention anything about zlib compression details in the spec. Yep. Which is bad enough. If we mention anything about ZSTD compression details we'll have to do it for zlib as well. I personally don’t like “If you can’t make it perfect, you shouldn’t do it at all”. Mentioning it for zstd would still be an improvement. Good approach. I like it. But I'm trying to be cautious. 
Also, I believe that “zlib compression” is pretty much unambiguous, considering all the code does is call deflate()/inflate(). I’m not saying we need to amend the spec in this series, but I don’t see a good argument against doing so at some point. So, I think we have two possibilities for the spec: 1. mention for both 2. don't mention at all I think the 2nd is better. It gives more degree of freedom for the future improvements. No, it absolutely doesn’t. There is a de-facto format, namely what qemu accepts. Changing that would be an incompatible change. Just because we don’t write what’s allowed into the spec doesn’t change
[PATCH v23 1/4] qcow2: introduce compression type feature
The patch adds some preparation parts for the incompatible compression type feature to qcow2, allowing the use of different compression methods for image cluster (de)compression. It is implied that the compression type is set at image creation and can later be changed only by image conversion; thus the compression type defines the only compression algorithm used for the image and, therefore, for all image clusters. The goal of the feature is to add support for other compression methods to qcow2, for example ZSTD, which is more effective at compression than ZLIB. The default compression is ZLIB. Images created with the ZLIB compression type are backward compatible with older qemu versions. Adding the compression type breaks a number of tests because the compression type is now reported on image creation and there are some changes to the qcow2 header size and offsets. The tests are fixed in the following ways: * filter out compression_type for many tests * fix header size, feature table size and backing file offset affected tests: 031, 036, 061, 080 header_size += 8: 1 byte compression type, 7 bytes padding feature_table += 48: incompatible feature compression type backing_file_offset += 56 (8 + 48 -> header_change + feature_table_change) * add "compression type" for test output matching when it isn't filtered affected tests: 049, 060, 061, 065, 144, 182, 242, 255 Signed-off-by: Denis Plotnikov Reviewed-by: Vladimir Sementsov-Ogievskiy Reviewed-by: Eric Blake Reviewed-by: Max Reitz QAPI part: Acked-by: Markus Armbruster --- qapi/block-core.json | 22 +- block/qcow2.h| 20 +- include/block/block_int.h| 1 + block/qcow2.c| 113 +++ tests/qemu-iotests/031.out | 14 ++-- tests/qemu-iotests/036.out | 4 +- tests/qemu-iotests/049.out | 102 ++-- tests/qemu-iotests/060.out | 1 + tests/qemu-iotests/061.out | 34 ++ tests/qemu-iotests/065 | 28 +--- tests/qemu-iotests/080 | 2 +- tests/qemu-iotests/144.out | 4 +- tests/qemu-iotests/182.out | 2 +- tests/qemu-iotests/242.out | 5 ++ 
tests/qemu-iotests/255.out | 8 +-- tests/qemu-iotests/common.filter | 3 +- 16 files changed, 267 insertions(+), 96 deletions(-) diff --git a/qapi/block-core.json b/qapi/block-core.json index 943df1926a..1522e2983f 100644 --- a/qapi/block-core.json +++ b/qapi/block-core.json @@ -78,6 +78,8 @@ # # @bitmaps: A list of qcow2 bitmap details (since 4.0) # +# @compression-type: the image cluster compression method (since 5.1) +# # Since: 1.7 ## { 'struct': 'ImageInfoSpecificQCow2', @@ -89,7 +91,8 @@ '*corrupt': 'bool', 'refcount-bits': 'int', '*encrypt': 'ImageInfoSpecificQCow2Encryption', - '*bitmaps': ['Qcow2BitmapInfo'] + '*bitmaps': ['Qcow2BitmapInfo'], + 'compression-type': 'Qcow2CompressionType' } } ## @@ -4284,6 +4287,18 @@ 'data': [ 'v2', 'v3' ] } +## +# @Qcow2CompressionType: +# +# Compression type used in qcow2 image file +# +# @zlib: zlib compression, see <http://zlib.net/> +# +# Since: 5.1 +## +{ 'enum': 'Qcow2CompressionType', + 'data': [ 'zlib' ] } + ## # @BlockdevCreateOptionsQcow2: # @@ -4307,6 +4322,8 @@ # allowed values: off, falloc, full, metadata) # @lazy-refcounts: True if refcounts may be updated lazily (default: off) # @refcount-bits: Width of reference counts in bits (default: 16) +# @compression-type: The image cluster compression method +#(default: zlib, since 5.1) # # Since: 2.12 ## @@ -4322,7 +4339,8 @@ '*cluster-size':'size', '*preallocation': 'PreallocMode', '*lazy-refcounts': 'bool', -'*refcount-bits': 'int' } } +'*refcount-bits': 'int', +'*compression-type':'Qcow2CompressionType' } } ## # @BlockdevCreateOptionsQed: diff --git a/block/qcow2.h b/block/qcow2.h index f4de0a27d5..6a8b82e6cc 100644 --- a/block/qcow2.h +++ b/block/qcow2.h @@ -146,8 +146,16 @@ typedef struct QCowHeader { uint32_t refcount_order; uint32_t header_length; + +/* Additional fields */ +uint8_t compression_type; + +/* header must be a multiple of 8 */ +uint8_t padding[7]; } QEMU_PACKED QCowHeader; +QEMU_BUILD_BUG_ON(!QEMU_IS_ALIGNED(sizeof(QCowHeader), 8)); + typedef 
struct QEMU_PACKED QCowSnapshotHeader { /* header is 8 byte aligned */ uint64_t l1_table_offset; @@ -216,13 +224,16 @@ enum { QCOW2_INCOMPAT_DIRTY_BITNR = 0, QCOW2_INCOMPAT_CORRUPT_BITNR= 1, QCOW2_INCOMPAT_DATA_FILE_BITNR = 2, +QCOW2_INCOMPAT_COMPRESSION_BITNR = 3, QCOW2_INCOMPAT_DIRTY= 1 << QCOW2_INCOMPAT_DIRTY_BITNR, QCOW2_INCOMPAT_CORRUPT = 1 <&l
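The incompatible-feature machinery the patch extends boils down to bit arithmetic: bit 3 marks a non-default compression type, and since qemu must refuse to open an image with an unknown incompatible bit, older versions cannot silently misread such an image. The helper below is an invented illustration of that logic, not QEMU code:

```python
# Incompatible feature bit numbers from the patch's enum.
QCOW2_INCOMPAT_DIRTY_BITNR = 0
QCOW2_INCOMPAT_CORRUPT_BITNR = 1
QCOW2_INCOMPAT_DATA_FILE_BITNR = 2
QCOW2_INCOMPAT_COMPRESSION_BITNR = 3  # added by this patch

QCOW2_INCOMPAT_COMPRESSION = 1 << QCOW2_INCOMPAT_COMPRESSION_BITNR

def incompatible_features_for(compression_type):
    # zlib is the default and stays readable by older qemu; any other
    # type must set the incompatible bit so old readers reject the image.
    if compression_type == "zlib":
        return 0
    return QCOW2_INCOMPAT_COMPRESSION

assert incompatible_features_for("zlib") == 0
assert incompatible_features_for("zstd") == 0x8
```

This matches the iotest in patch 4/4, which checks that `incompatible_features` stays clear for zlib images and has bit 3 set for zstd images.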
[PATCH v23 2/4] qcow2: rework the cluster compression routine
The patch enables processing the image compression type defined for the image and chooses an appropriate method for image clusters (de)compression. Signed-off-by: Denis Plotnikov Reviewed-by: Vladimir Sementsov-Ogievskiy Reviewed-by: Alberto Garcia Reviewed-by: Max Reitz --- block/qcow2-threads.c | 71 --- 1 file changed, 60 insertions(+), 11 deletions(-) diff --git a/block/qcow2-threads.c b/block/qcow2-threads.c index a68126f291..7dbaf53489 100644 --- a/block/qcow2-threads.c +++ b/block/qcow2-threads.c @@ -74,7 +74,9 @@ typedef struct Qcow2CompressData { } Qcow2CompressData; /* - * qcow2_compress() + * qcow2_zlib_compress() + * + * Compress @src_size bytes of data using zlib compression method * * @dest - destination buffer, @dest_size bytes * @src - source buffer, @src_size bytes @@ -83,8 +85,8 @@ typedef struct Qcow2CompressData { * -ENOMEM destination buffer is not enough to store compressed data * -EIOon any other error */ -static ssize_t qcow2_compress(void *dest, size_t dest_size, - const void *src, size_t src_size) +static ssize_t qcow2_zlib_compress(void *dest, size_t dest_size, + const void *src, size_t src_size) { ssize_t ret; z_stream strm; @@ -119,10 +121,10 @@ static ssize_t qcow2_compress(void *dest, size_t dest_size, } /* - * qcow2_decompress() + * qcow2_zlib_decompress() * * Decompress some data (not more than @src_size bytes) to produce exactly - * @dest_size bytes. 
+ * @dest_size bytes using zlib compression method * * @dest - destination buffer, @dest_size bytes * @src - source buffer, @src_size bytes @@ -130,8 +132,8 @@ static ssize_t qcow2_compress(void *dest, size_t dest_size, * Returns: 0 on success * -EIO on fail */ -static ssize_t qcow2_decompress(void *dest, size_t dest_size, -const void *src, size_t src_size) +static ssize_t qcow2_zlib_decompress(void *dest, size_t dest_size, + const void *src, size_t src_size) { int ret; z_stream strm; @@ -191,20 +193,67 @@ qcow2_co_do_compress(BlockDriverState *bs, void *dest, size_t dest_size, return arg.ret; } +/* + * qcow2_co_compress() + * + * Compress @src_size bytes of data using the compression + * method defined by the image compression type + * + * @dest - destination buffer, @dest_size bytes + * @src - source buffer, @src_size bytes + * + * Returns: compressed size on success + * a negative error code on failure + */ ssize_t coroutine_fn qcow2_co_compress(BlockDriverState *bs, void *dest, size_t dest_size, const void *src, size_t src_size) { -return qcow2_co_do_compress(bs, dest, dest_size, src, src_size, -qcow2_compress); +BDRVQcow2State *s = bs->opaque; +Qcow2CompressFunc fn; + +switch (s->compression_type) { +case QCOW2_COMPRESSION_TYPE_ZLIB: +fn = qcow2_zlib_compress; +break; + +default: +abort(); +} + +return qcow2_co_do_compress(bs, dest, dest_size, src, src_size, fn); } +/* + * qcow2_co_decompress() + * + * Decompress some data (not more than @src_size bytes) to produce exactly + * @dest_size bytes using the compression method defined by the image + * compression type + * + * @dest - destination buffer, @dest_size bytes + * @src - source buffer, @src_size bytes + * + * Returns: 0 on success + * a negative error code on failure + */ ssize_t coroutine_fn qcow2_co_decompress(BlockDriverState *bs, void *dest, size_t dest_size, const void *src, size_t src_size) { -return qcow2_co_do_compress(bs, dest, dest_size, src, src_size, -qcow2_decompress); +BDRVQcow2State *s = 
bs->opaque; +Qcow2CompressFunc fn; + +switch (s->compression_type) { +case QCOW2_COMPRESSION_TYPE_ZLIB: +fn = qcow2_zlib_decompress; +break; + +default: +abort(); +} + +return qcow2_co_do_compress(bs, dest, dest_size, src, src_size, fn); } -- 2.17.0
[PATCH v23 0/4] implement zstd cluster compression method
v23: Undecided: whether to add zstd(zlib) compression details to the qcow2 spec 03: tighten assertion on zstd decompression [Eric] 04: use _rm_test_img appropriately [Max] v22: 03: remove assignment in if condition v21: 03: * remove the loop on compression [Max] * use designated initializers [Max] 04: * don't erase user's options [Max] * use _rm_test_img [Max] * add unsupported qcow2 options [Max] v20: 04: fix a number of flaws [Vladimir] * don't use $RAND_FILE passing to qemu-io, so the check for $TEST_DIR is redundant * re-arrange $RAND_FILE writing * fix a typo v19: 04: fix a number of flaws [Eric] * remove redundant test case descriptions * fix stdout redirect * don't use (()) * use peek_file_be instead of od * check $TEST_DIR for spaces and other special characters before using * use $RAND_FILE more safely v18: * 04: add quotes to all file name variables [Vladimir] * 04: add Vladimir's comment regarding the "qemu-io write -s" option issue. v17: * 03: remove incorrect comment in zstd decompress [Vladimir] * 03: remove "paraniod" and rewrite the comment on decompress [Vladimir] * 03: fix dead assignment [Vladimir] * 04: add and remove quotes [Vladimir] * 04: replace long offset form with the short one [Vladimir] v16: * 03: ssize_t for ret, size_t for zstd_ret [Vladimir] * 04: small fixes according to the comments [Vladimir] v15: * 01: aiming at qemu 5.1 [Eric] * 03: change zstd_res definition place [Vladimir] * 04: add two new test cases [Eric] 1. test adjacent cluster compression with zstd 2. test incompressible cluster processing * 03, 04: much rewording and grammar fixing [Eric] v14: * fix bug on compression - looping until compress == 0 [Me] * apply reworked Vladimir's suggestions: 1. not mixing ssize_t with size_t 2. safe check for ENOMEM in compression part - avoid overflow 3. 
tolerate the sanity check: allow zstd to make progress on only one of the buffers v13: * 03: add progress sanity check to decompression loop [Vladimir] 03: add successful decompression check [Me] v12: * 03: again, rework compression and decompression loops to make them more correct [Vladimir] 03: move assert in compression to a more appropriate place [Vladimir] v11: * 03: the loops don't need "do{}while" form anymore and they were buggy (missed "do" in the beginning) replace them with usual "while(){}" loops [Vladimir] v10: * 03: fix zstd (de)compression loops for multi-frame cases [Vladimir] v9: * 01: fix error checking and reporting in the qcow2_amend compression type part [Vladimir] * 03: replace asserts with -EIO in qcow2_zstd_decompression [Vladimir, Alberto] * 03: reword/amend/add comments, fix typos [Vladimir] v8: * 03: switch zstd API from simple to stream [Eric] No need to state a special cluster layout for zstd compressed clusters. v7: * use qapi_enum_parse instead of open-coding [Eric] * fix wording, typos and spelling [Eric] v6: * "block/qcow2-threads: fix qcow2_decompress" is removed from the series since it has been accepted by Max already * add compile-time checking for Qcow2Header to be a multiple of 8 [Max, Alberto] * report error on qcow2 amending when the compression type is actually changed [Max] * remove the extra space and the extra new line [Max] * re-arrange acks and signed-off-s [Vladimir] v5: * replace -ENOTSUP with abort in qcow2_co_decompress [Vladimir] * set cluster size for all test cases in the beginning of the 287 test v4: * the series is rebased on top of 01 "block/qcow2-threads: fix qcow2_decompress" * 01 is just a no-change resend to avoid extra dependencies. 
Still, it may be merged separately.

v3:
* remove redundant max compression type value check [Vladimir, Eric] (the switch below checks everything)
* prevent compression type changing on "qemu-img amend" [Vladimir]
* remove zstd config setting, since it has been added already by "migration" patches [Vladimir]
* change the compression type error message [Vladimir]
* fix alignment and 80-chars exceeding [Vladimir]

v2:
* rework compression type setting [Vladimir]
* squash iotest changes into the compression type introduction patch [Vladimir, Eric]
* fix zstd availability checking in zstd iotest [Vladimir]
* remove unnecessary casting [Eric]
* remove redundant checks [Eric]
* fix compressed cluster layout in qcow2 spec [Vladimir]
* fix wording [Eric, Vladimir]
* fix compression type filtering in iotests [Eric]

v1: the initial series

Denis Plotnikov (4):
  qcow2: introduce compression type feature
  qcow2: rework the cluster compression rou
[PATCH v23 3/4] qcow2: add zstd cluster compression
zstd significantly reduces cluster compression time. It provides better compression performance while maintaining the same compression ratio as zlib, which, at the moment, is the only compression method available.

The performance test results:

The test compresses and decompresses a qemu qcow2 image with a freshly installed rhel-7.6 guest. Image cluster size: 64K. Image on disk size: 2.2G. The test was conducted with a brd disk to reduce the influence of the disk subsystem on the test results. The results are given in seconds.

compress cmd:
  time ./qemu-img convert -O qcow2 -c -o compression_type=[zlib|zstd] src.img [zlib|zstd]_compressed.img
decompress cmd:
  time ./qemu-img convert -O qcow2 [zlib|zstd]_compressed.img uncompressed.img

           compression            decompression
           zlib    zstd           zlib    zstd
real       65.5    16.3 (-75 %)   1.9     1.6 (-16 %)
user       65.0    15.8           5.3     2.5
sys         3.3     0.2           2.0     2.0

Both ZLIB and ZSTD gave the same compression ratio: 1.57; compressed image size in both cases: 1.4G.

Signed-off-by: Denis Plotnikov QAPI part: Acked-by: Markus Armbruster --- docs/interop/qcow2.txt | 1 + configure | 2 +- qapi/block-core.json | 3 +- block/qcow2-threads.c | 169 + block/qcow2.c | 7 ++ 5 files changed, 180 insertions(+), 2 deletions(-) diff --git a/docs/interop/qcow2.txt b/docs/interop/qcow2.txt index 640e0eca40..18a77f737e 100644 --- a/docs/interop/qcow2.txt +++ b/docs/interop/qcow2.txt @@ -209,6 +209,7 @@ version 2.
Available compression type values: 0: zlib <https://www.zlib.net/> +1: zstd <http://github.com/facebook/zstd> === Header padding === diff --git a/configure b/configure index 23b5e93752..4e3a1690ea 100755 --- a/configure +++ b/configure @@ -1861,7 +1861,7 @@ disabled with --disable-FEATURE, default is enabled if available: lzfse support of lzfse compression library (for reading lzfse-compressed dmg images) zstdsupport for zstd compression library - (for migration compression) + (for migration compression and qcow2 cluster compression) seccomp seccomp support coroutine-pool coroutine freelist (better performance) glusterfs GlusterFS backend diff --git a/qapi/block-core.json b/qapi/block-core.json index 1522e2983f..6fbacddab2 100644 --- a/qapi/block-core.json +++ b/qapi/block-core.json @@ -4293,11 +4293,12 @@ # Compression type used in qcow2 image file # # @zlib: zlib compression, see <http://zlib.net/> +# @zstd: zstd compression, see <http://github.com/facebook/zstd> # # Since: 5.1 ## { 'enum': 'Qcow2CompressionType', - 'data': [ 'zlib' ] } + 'data': [ 'zlib', { 'name': 'zstd', 'if': 'defined(CONFIG_ZSTD)' } ] } ## # @BlockdevCreateOptionsQcow2: diff --git a/block/qcow2-threads.c b/block/qcow2-threads.c index 7dbaf53489..1914baf456 100644 --- a/block/qcow2-threads.c +++ b/block/qcow2-threads.c @@ -28,6 +28,11 @@ #define ZLIB_CONST #include +#ifdef CONFIG_ZSTD +#include +#include +#endif + #include "qcow2.h" #include "block/thread-pool.h" #include "crypto.h" @@ -166,6 +171,160 @@ static ssize_t qcow2_zlib_decompress(void *dest, size_t dest_size, return ret; } +#ifdef CONFIG_ZSTD + +/* + * qcow2_zstd_compress() + * + * Compress @src_size bytes of data using zstd compression method + * + * @dest - destination buffer, @dest_size bytes + * @src - source buffer, @src_size bytes + * + * Returns: compressed size on success + * -ENOMEM destination buffer is not enough to store compressed data + * -EIOon any other error + */ +static ssize_t qcow2_zstd_compress(void *dest, 
size_t dest_size, + const void *src, size_t src_size) +{ +ssize_t ret; +size_t zstd_ret; +ZSTD_outBuffer output = { +.dst = dest, +.size = dest_size, +.pos = 0 +}; +ZSTD_inBuffer input = { +.src = src, +.size = src_size, +.pos = 0 +}; +ZSTD_CCtx *cctx = ZSTD_createCCtx(); + +if (!cctx) { +return -EIO; +} +/* + * Use the zstd streamed interface for symmetry with decompression, + * where streaming is essential since we don't record the exact + * compressed size. + * + * ZSTD_compressStream2() tries to compress everything it could + * with a single call. Although the ZSTD docs say that: + * "You must continue calling ZSTD_compressStream2() with ZSTD_e_end + * until it returns 0, at which point you are free to start a new frame", + * in our tests we saw the only case when it returned with
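The email is truncated here, but the compression contract documented for qcow2_zstd_compress() above (compressed size on success, -ENOMEM when the destination buffer is too small) can be sketched in a few lines of Python. This is only an illustration: zlib stands in for zstd (libzstd bindings are not in the Python standard library), and the function name is made up, not QEMU's.

```python
import errno
import os
import zlib

def qcow2_compress_sketch(src: bytes, dest_size: int):
    """Mimic the qcow2_zstd_compress() contract using zlib as a stand-in:
    return the compressed bytes on success, or -ENOMEM when the
    destination buffer cannot hold the whole compressed stream."""
    out = zlib.compress(src, 9)
    if len(out) > dest_size:
        # In qcow2 this error makes the caller fall back to
        # writing the cluster uncompressed.
        return -errno.ENOMEM
    return out

# Compressible data fits easily into half a 64K cluster...
ok = qcow2_compress_sketch(b"\0" * 65536, 32768)
assert isinstance(ok, bytes) and len(ok) <= 32768

# ...while incompressible (random) data does not, so the error path triggers.
err = qcow2_compress_sketch(os.urandom(65536), 32768)
assert err == -errno.ENOMEM
```

The same split drives the real code: the only caller-visible distinction is "fits in the destination buffer" versus "does not", which is why -ENOMEM is the one specially documented error.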
[PATCH v23 4/4] iotests: 287: add qcow2 compression type test
The test checks that the qcow2 requirements for the compression type feature are fulfilled, and that the zstd compression type is operable. Signed-off-by: Denis Plotnikov Reviewed-by: Vladimir Sementsov-Ogievskiy Tested-by: Vladimir Sementsov-Ogievskiy Reviewed-by: Eric Blake --- tests/qemu-iotests/287 | 152 + tests/qemu-iotests/287.out | 67 tests/qemu-iotests/group | 1 + 3 files changed, 220 insertions(+) create mode 100755 tests/qemu-iotests/287 create mode 100644 tests/qemu-iotests/287.out diff --git a/tests/qemu-iotests/287 b/tests/qemu-iotests/287 new file mode 100755 index 00..f98a4cadc1 --- /dev/null +++ b/tests/qemu-iotests/287 @@ -0,0 +1,152 @@ +#!/usr/bin/env bash +# +# Test case for an image using zstd compression +# +# Copyright (c) 2020 Virtuozzo International GmbH +# +# This program is free software; you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation; either version 2 of the License, or +# (at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program. If not, see <http://www.gnu.org/licenses/>. +# + +# creator +owner=dplotni...@virtuozzo.com + +seq="$(basename $0)" +echo "QA output created by $seq" + +status=1 # failure is the default! + +# standard environment +. ./common.rc +. 
./common.filter + +# This tests qcow2-specific low-level functionality +_supported_fmt qcow2 +_supported_proto file +_supported_os Linux +_unsupported_imgopts 'compat=0.10' data_file + +COMPR_IMG="$TEST_IMG.compressed" +RAND_FILE="$TEST_DIR/rand_data" + +_cleanup() +{ +_cleanup_test_img +_rm_test_img "$COMPR_IMG" +rm -f "$RAND_FILE" +} +trap "_cleanup; exit \$status" 0 1 2 3 15 + +# for all the cases +CLUSTER_SIZE=65536 + +# Check if we can run this test. +if IMGOPTS='compression_type=zstd' _make_test_img 64M | +grep "Invalid parameter 'zstd'"; then +_notrun "ZSTD is disabled" +fi + +echo +echo "=== Testing compression type incompatible bit setting for zlib ===" +echo +_make_test_img -o compression_type=zlib 64M +$PYTHON qcow2.py "$TEST_IMG" dump-header | grep incompatible_features + +echo +echo "=== Testing compression type incompatible bit setting for zstd ===" +echo +_make_test_img -o compression_type=zstd 64M +$PYTHON qcow2.py "$TEST_IMG" dump-header | grep incompatible_features + +echo +echo "=== Testing zlib with incompatible bit set ===" +echo +_make_test_img -o compression_type=zlib 64M +$PYTHON qcow2.py "$TEST_IMG" set-feature-bit incompatible 3 +# to make sure the bit was actually set +$PYTHON qcow2.py "$TEST_IMG" dump-header | grep incompatible_features + +if $QEMU_IMG info "$TEST_IMG" >/dev/null 2>&1 ; then +echo "Error: The image opened successfully. The image must not be opened." +fi + +echo +echo "=== Testing zstd with incompatible bit unset ===" +echo +_make_test_img -o compression_type=zstd 64M +$PYTHON qcow2.py "$TEST_IMG" set-header incompatible_features 0 +# to make sure the bit was actually unset +$PYTHON qcow2.py "$TEST_IMG" dump-header | grep incompatible_features + +if $QEMU_IMG info "$TEST_IMG" >/dev/null 2>&1 ; then +echo "Error: The image opened successfully. The image must not be opened." 
+fi + +echo +echo "=== Testing compression type values ===" +echo +# zlib=0 +_make_test_img -o compression_type=zlib 64M +peek_file_be "$TEST_IMG" 104 1 +echo + +# zstd=1 +_make_test_img -o compression_type=zstd 64M +peek_file_be "$TEST_IMG" 104 1 +echo + +echo +echo "=== Testing simple reading and writing with zstd ===" +echo +_make_test_img -o compression_type=zstd 64M +$QEMU_IO -c "write -c -P 0xAC 64K 64K " "$TEST_IMG" | _filter_qemu_io +$QEMU_IO -c "read -P 0xAC 64K 64K " "$TEST_IMG" | _filter_qemu_io +# read on the cluster boundaries +$QEMU_IO -c "read -v 131070 8 " "$TEST_IMG" | _filter_qemu_io +$QEMU_IO -c "read -v 65534 8" "$TEST_IMG" | _filter_qemu_io + +echo +echo "=== Testing adjacent clusters reading and writing with zstd ===" +echo +_make_test_img -o compression_type=zstd 64M +$QEMU_IO -c "write -c -P 0xAB 0 64K " "$TEST_IMG" | _filter_qemu_io +$QEMU_IO -c "write -c -P 0xAC 64K 64K " "$TEST_IMG" | _filter_qemu_io +$QEMU_IO -c "write -c -P 0xAD 128K 64K " "$T
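The iotest above pokes at two fixed header locations: incompatible-features bit 3 (set by `qcow2.py ... set-feature-bit incompatible 3`) and the compression-type byte at offset 104 (what `peek_file_be "$TEST_IMG" 104 1` reads). A hedged Python sketch of those offsets, built on a synthetic header rather than a real image — the field offsets follow the qcow2 on-disk spec, and the helper names are illustrative:

```python
import struct

HDR_LEN = 112  # 104-byte v3 header + 1 byte compression_type + 7 bytes padding

def make_header(compression_type: int, set_incompat_bit: bool) -> bytes:
    """Build just enough of a qcow2 v3 header to show the two fields
    the iotest checks. Not a loadable image, only the layout."""
    hdr = bytearray(HDR_LEN)
    hdr[0:4] = b"QFI\xfb"                            # magic
    struct.pack_into(">I", hdr, 4, 3)                # version 3
    incompat = (1 << 3) if set_incompat_bit else 0   # bit 3: compression type
    struct.pack_into(">Q", hdr, 72, incompat)        # incompatible_features
    struct.pack_into(">I", hdr, 100, HDR_LEN)        # header_length
    hdr[104] = compression_type                      # 0 = zlib, 1 = zstd
    return bytes(hdr)

zstd_hdr = make_header(1, True)
assert zstd_hdr[104] == 1                # what peek_file_be reports for zstd
assert struct.unpack_from(">Q", zstd_hdr, 72)[0] & (1 << 3)

zlib_hdr = make_header(0, False)
assert zlib_hdr[104] == 0                # zlib image: bit unset, type byte 0
```

This is also why the test's "zlib with incompatible bit set" case must fail to open: the bit promises a non-default compression type that the header byte then has to back up.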
Re: [PATCH v22 3/4] qcow2: add zstd cluster compression
On 30.04.2020 11:26, Max Reitz wrote: On 29.04.20 15:02, Vladimir Sementsov-Ogievskiy wrote: 29.04.2020 15:17, Max Reitz wrote: On 29.04.20 12:37, Vladimir Sementsov-Ogievskiy wrote: 29.04.2020 13:24, Max Reitz wrote: On 28.04.20 22:00, Denis Plotnikov wrote: zstd significantly reduces cluster compression time. It provides better compression performance maintaining the same level of the compression ratio in comparison with zlib, which, at the moment, is the only compression method available. The performance test results: Test compresses and decompresses qemu qcow2 image with just installed rhel-7.6 guest. Image cluster size: 64K. Image on disk size: 2.2G The test was conducted with brd disk to reduce the influence of disk subsystem to the test results. The results is given in seconds. compress cmd: time ./qemu-img convert -O qcow2 -c -o compression_type=[zlib|zstd] src.img [zlib|zstd]_compressed.img decompress cmd time ./qemu-img convert -O qcow2 [zlib|zstd]_compressed.img uncompressed.img compression decompression zlib zstd zlib zstd real 65.5 16.3 (-75 %) 1.9 1.6 (-16 %) user 65.0 15.8 5.3 2.5 sys 3.3 0.2 2.0 2.0 Both ZLIB and ZSTD gave the same compression ratio: 1.57 compressed image size in both cases: 1.4G Signed-off-by: Denis Plotnikov QAPI part: Acked-by: Markus Armbruster --- docs/interop/qcow2.txt | 1 + configure | 2 +- qapi/block-core.json | 3 +- block/qcow2-threads.c | 169 + block/qcow2.c | 7 ++ slirp | 2 +- 6 files changed, 181 insertions(+), 3 deletions(-) [...] diff --git a/block/qcow2-threads.c b/block/qcow2-threads.c index 7dbaf53489..a0b12e1b15 100644 --- a/block/qcow2-threads.c +++ b/block/qcow2-threads.c [...] +static ssize_t qcow2_zstd_decompress(void *dest, size_t dest_size, + const void *src, size_t src_size) +{ [...] + /* + * The compressed stream from the input buffer may consist of more + * than one zstd frame. Can it? If not, we must require it in the specification. 
Actually, now that you mention it, it would make sense anyway to add some note to the specification on what exactly "compressed with zstd" means. Hmm. If at some point we'll want multi-threaded compression of one big (2M) cluster... Could this be implemented with the zstd lib? If multiple frames are allowed, will allowing multiple frames help? I don't know actually, but I think it's better not to forbid it. On the other hand, I don't see any benefit in large compressed clusters. At least, in our scenarios (for compressed backups) we use 64k compressed clusters, for good granularity of incremental backups (whereas for a running vm we use 1M clusters). Is it really that important? Naïvely, it sounds rather complicated to introduce multithreading into block drivers. It is already here: compression and encryption are already multithreaded. But of course, one cluster is handled in one thread. Ah, good. I forgot. (Also, as for compression, it can only be used in backup scenarios anyway, where you write many clusters at once. So parallelism on the cluster level should be sufficient to get high usage, and it would benefit all compression types and cluster sizes.) Yes, it works in this way already :) Well, OK then. So, we don't know whether we want the one-frame restriction or not. Do you have a preference? *shrug* Seems like it would be preferential to allow multiple frames still. A note in the spec would be nice (i.e., streaming format, multiple frames per cluster possible). We don't mention anything about zlib compression details in the spec. If we mention anything about ZSTD compression details we'll have to do it for zlib as well. So, I think we have two possibilities for the spec: 1. mention for both 2. don't mention at all I think the 2nd is better. It gives more degrees of freedom for future improvements, and we've already covered both cases (one frame, many frames) in the code. I'm not sure I'm right. Any other opinions? Denis Max
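One reason multiple frames per cluster are tolerable at all is that qcow2 never records the exact compressed size: decompression is driven from the output side and simply stops once a cluster's worth of data has been produced, ignoring whatever padding (or, in principle, further frames) follows in the buffer. A sketch of that output-driven loop, with Python's zlib standing in for zstd (an assumption made here only because libzstd bindings are not in the stdlib; the behavior shown — stop at dest_size, error on short output — mirrors what the thread describes):

```python
import zlib

CLUSTER = 65536

def decompress_cluster(src: bytes, dest_size: int = CLUSTER) -> bytes:
    """Produce exactly dest_size bytes of output from src; trailing
    bytes in src (sector padding, or another frame) are ignored."""
    d = zlib.decompressobj()
    out = d.decompress(src, dest_size)   # max_length caps the output
    if len(out) != dest_size:
        raise IOError("short decompression")  # qcow2 returns -EIO here
    return out

data = bytes(range(256)) * 256                # one 64K cluster of data
src = zlib.compress(data) + b"\xff" * 512     # compressed stream + trailing junk
assert decompress_cluster(src) == data        # junk after the stream is ignored
```

Since the decoder never needs to know where the compressed data ends, nothing in the format forces a single frame — which is the degree of freedom Denis argues for leaving unspecified.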
Re: [PATCH v22 4/4] iotests: 287: add qcow2 compression type test
On 29.04.2020 13:26, Max Reitz wrote: On 28.04.20 22:00, Denis Plotnikov wrote: The test checks fulfilling qcow2 requirements for the compression type feature and zstd compression type operability. Signed-off-by: Denis Plotnikov Reviewed-by: Vladimir Sementsov-Ogievskiy Tested-by: Vladimir Sementsov-Ogievskiy --- tests/qemu-iotests/287 | 152 + tests/qemu-iotests/287.out | 67 tests/qemu-iotests/group | 1 + 3 files changed, 220 insertions(+) create mode 100755 tests/qemu-iotests/287 create mode 100644 tests/qemu-iotests/287.out diff --git a/tests/qemu-iotests/287 b/tests/qemu-iotests/287 new file mode 100755 index 00..21fe1f19f5 --- /dev/null +++ b/tests/qemu-iotests/287 @@ -0,0 +1,152 @@ +#!/usr/bin/env bash +# +# Test case for an image using zstd compression +# +# Copyright (c) 2020 Virtuozzo International GmbH +# +# This program is free software; you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation; either version 2 of the License, or +# (at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program. If not, see <http://www.gnu.org/licenses/>. +# + +# creator +owner=dplotni...@virtuozzo.com + +seq="$(basename $0)" +echo "QA output created by $seq" + +status=1 # failure is the default! + +# standard environment +. ./common.rc +. 
./common.filter + +# This tests qocw2-specific low-level functionality +_supported_fmt qcow2 +_supported_proto file +_supported_os Linux +_unsupported_imgopts 'compat=0.10' data_file + +COMPR_IMG="$TEST_IMG.compressed" +RAND_FILE="$TEST_DIR/rand_data" + +_cleanup() +{ + _rm_test_img _rm_test_img needs an argument (it basically replaces “rm”). What I thus meant was to keep the _cleanup_test_img here (that was completely correct), but... + rm -f "$COMPR_IMG" ...to use “_rm_test_img "$COMPR_IMG"” here instead of rm. Max ok, I got it. Denis
Re: [PATCH v22 3/4] qcow2: add zstd cluster compression
On 29.04.2020 13:24, Max Reitz wrote: On 28.04.20 22:00, Denis Plotnikov wrote: zstd significantly reduces cluster compression time. It provides better compression performance maintaining the same level of the compression ratio in comparison with zlib, which, at the moment, is the only compression method available. The performance test results: Test compresses and decompresses qemu qcow2 image with just installed rhel-7.6 guest. Image cluster size: 64K. Image on disk size: 2.2G The test was conducted with brd disk to reduce the influence of disk subsystem to the test results. The results is given in seconds. compress cmd: time ./qemu-img convert -O qcow2 -c -o compression_type=[zlib|zstd] src.img [zlib|zstd]_compressed.img decompress cmd time ./qemu-img convert -O qcow2 [zlib|zstd]_compressed.img uncompressed.img compression decompression zlib zstd zlib zstd real 65.5 16.3 (-75 %)1.9 1.6 (-16 %) user 65.0 15.85.3 2.5 sys 3.30.22.0 2.0 Both ZLIB and ZSTD gave the same compression ratio: 1.57 compressed image size in both cases: 1.4G Signed-off-by: Denis Plotnikov QAPI part: Acked-by: Markus Armbruster --- docs/interop/qcow2.txt | 1 + configure | 2 +- qapi/block-core.json | 3 +- block/qcow2-threads.c | 169 + block/qcow2.c | 7 ++ slirp | 2 +- 6 files changed, 181 insertions(+), 3 deletions(-) [...] diff --git a/block/qcow2-threads.c b/block/qcow2-threads.c index 7dbaf53489..a0b12e1b15 100644 --- a/block/qcow2-threads.c +++ b/block/qcow2-threads.c [...] +static ssize_t qcow2_zstd_decompress(void *dest, size_t dest_size, + const void *src, size_t src_size) +{ [...] +/* + * The compressed stream from the input buffer may consist of more + * than one zstd frame. Can it? Potentially, it can, if another implementation of qcow2 saves a couple of frames for some reason. Denis Max
Re: [RFC patch v1 2/3] qemu-file: add buffered mode
On 28.04.2020 20:54, Dr. David Alan Gilbert wrote: * Denis Plotnikov (dplotni...@virtuozzo.com) wrote: On 27.04.2020 15:14, Dr. David Alan Gilbert wrote: * Denis Plotnikov (dplotni...@virtuozzo.com) wrote: The patch adds the ability for qemu-file to write data asynchronously to improve write performance. Before, only synchronous writing was supported. Enabling the asynchronous mode is managed by the new "enabled_buffered" callback. It's a bit invasive, isn't it - changes a lot of functions in a lot of places! If you mean changing the qemu-file code - yes, it is. Yeh, that's what I worry about; qemu-file is pretty complex as it is. Especially when it then passes it to the channel code etc. If you mean changing the qemu-file usage in the code - no. The only place to change is the snapshot code when the buffered mode is enabled with a callback. The change is in patch 03 of the series. That's fine - that's easy. The multifd code separated the control headers from the data on separate fd's - but that doesn't help your case. yes, that doesn't help Is there any chance you could do this by using the existing 'save_page' hook (that RDMA uses)? I don't think so. My goal is to improve the writing performance of the internal snapshot to a qcow2 image. The snapshot is saved in qcow2 as a continuous stream placed at the end of the address space. To achieve the best writing speed I need a size- and base-aligned buffer containing the vm state (with ram), which looks like this (for the ram part): ... | ram page header | ram page | ram page header | ram page | ... and so on, to store the buffer in qcow2 with a single operation. 'save_page' would allow me not to store 'ram page' in the qemu-file internal structures, and to write my own ram page storing logic. I think that wouldn't help me a lot because: 1. I need a page with the ram page header 2. I want to reduce the number of io operations 3.
I want to save other parts of vm state as fast as possible. Maybe I can't see a better way of using the 'save page' callback. Could you suggest anything? I guess it depends if we care about keeping the format of the snapshot the same here; if we were open to changing it, then we could use the save_page hook to delay the writes, so we'd have a pile of headers followed by a pile of pages. I think we have to care about keeping the format, because many users already have internal snapshots saved in their qcow2 images. If we change the format, we can't load snapshots from those images, and new snapshots become unreadable for older qemu-s; otherwise we would need to support two versions of the format, which I think is too complicated. Denis In the cover letter you mention direct qemu_fflush calls - have we got a few too many in some places that you think we can clean out? I'm not sure that some of them are excessive. To the best of my knowledge, qemu-file is used for the source-destination communication on migration, and removing some qemu_fflush-es may break the communication logic. I can't see any obvious places where it's called during the ram migration; can you try and give me a hint to where you're seeing it? I think those qemu_fflush-es aren't in the ram migration but in other vm state parts. Although those parts are quite small in comparison to ram, I saw quite a lot of qemu_fflush-es while debugging. Still, we could benefit from saving them with a smaller number of io operations if we are going to use the buffered mode. Denis Snapshot is just a special case (if not the only one) when we know that we can do buffered (cached) writings. Do you know any other cases when the buffered (cached) mode could be useful? The RDMA code does it because it's really not good at small transfers, but maybe generally it would be a good idea to do larger writes if possible - something that multifd manages.
Dave Dave Signed-off-by: Denis Plotnikov --- include/qemu/typedefs.h | 1 + migration/qemu-file.c | 351 +--- migration/qemu-file.h | 9 ++ 3 files changed, 339 insertions(+), 22 deletions(-) diff --git a/include/qemu/typedefs.h b/include/qemu/typedefs.h index 88dce54..9b388c8 100644 --- a/include/qemu/typedefs.h +++ b/include/qemu/typedefs.h @@ -98,6 +98,7 @@ typedef struct QEMUBH QEMUBH; typedef struct QemuConsole QemuConsole; typedef struct QEMUFile QEMUFile; typedef struct QEMUFileBuffer QEMUFileBuffer; +typedef struct QEMUFileAioTask QEMUFileAioTask; typedef struct QemuLockable QemuLockable; typedef struct QemuMutex QemuMutex; typedef struct QemuOpt QemuOpt; diff --git a/migration/qemu-file.c b/migration/qemu-file.c index 285c6ef..f42f949 100644 --- a/migration/qemu-file.c +++ b/migration/qemu-file.c @@ -29,19 +29,25 @@ #include "qemu-file.h" #include "trace.h" #include "qapi/error.h" +#include "block/aio_task.h" -#define IO_BUF_SIZE 32768 +#define IO_BUF_SIZE (1024 * 1024) #define MAX_IOV_SI
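The core idea being debated — coalesce many small vmstate writes (page headers and pages) into one large write instead of flushing a small buffer each time — can be sketched in a few lines. This is only an illustration of the buffering pattern: the class and field names are invented, not QEMU's, and QEMU's patch additionally issues the large writes asynchronously via aio tasks, which this sketch omits.

```python
import io

class BufferedWriter:
    """Collect small writes and flush them as one large write.
    Flushes are counted so the effect on the number of I/O ops is visible."""
    def __init__(self, backing, buf_size=1024 * 1024):  # cf. the IO_BUF_SIZE bump
        self.backing = backing
        self.buf_size = buf_size
        self.buf = bytearray()
        self.flushes = 0

    def write(self, data: bytes):
        self.buf += data
        if len(self.buf) >= self.buf_size:
            self.flush()

    def flush(self):
        if self.buf:
            self.backing.write(bytes(self.buf))
            self.buf.clear()
            self.flushes += 1

dst = io.BytesIO()
w = BufferedWriter(dst)
for _ in range(4096):            # 4096 small header+page writes, 128 bytes each
    w.write(b"\x00" * 128)
w.flush()
assert w.flushes == 1            # one backing-store write instead of thousands
assert len(dst.getvalue()) == 4096 * 128
```

With a 32K buffer (the old IO_BUF_SIZE) the same workload would hit the backing store on nearly every other page; raising the buffer to 1M and skipping intermediate qemu_fflush-es is precisely what lets the snapshot land in qcow2 as a few large, aligned writes.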
[PATCH v22 4/4] iotests: 287: add qcow2 compression type test
The test checks fulfilling qcow2 requirements for the compression type feature and zstd compression type operability. Signed-off-by: Denis Plotnikov Reviewed-by: Vladimir Sementsov-Ogievskiy Tested-by: Vladimir Sementsov-Ogievskiy --- tests/qemu-iotests/287 | 152 + tests/qemu-iotests/287.out | 67 tests/qemu-iotests/group | 1 + 3 files changed, 220 insertions(+) create mode 100755 tests/qemu-iotests/287 create mode 100644 tests/qemu-iotests/287.out diff --git a/tests/qemu-iotests/287 b/tests/qemu-iotests/287 new file mode 100755 index 00..21fe1f19f5 --- /dev/null +++ b/tests/qemu-iotests/287 @@ -0,0 +1,152 @@ +#!/usr/bin/env bash +# +# Test case for an image using zstd compression +# +# Copyright (c) 2020 Virtuozzo International GmbH +# +# This program is free software; you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation; either version 2 of the License, or +# (at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program. If not, see <http://www.gnu.org/licenses/>. +# + +# creator +owner=dplotni...@virtuozzo.com + +seq="$(basename $0)" +echo "QA output created by $seq" + +status=1 # failure is the default! + +# standard environment +. ./common.rc +. 
./common.filter + +# This tests qocw2-specific low-level functionality +_supported_fmt qcow2 +_supported_proto file +_supported_os Linux +_unsupported_imgopts 'compat=0.10' data_file + +COMPR_IMG="$TEST_IMG.compressed" +RAND_FILE="$TEST_DIR/rand_data" + +_cleanup() +{ + _rm_test_img + rm -f "$COMPR_IMG" + rm -f "$RAND_FILE" +} +trap "_cleanup; exit \$status" 0 1 2 3 15 + +# for all the cases +CLUSTER_SIZE=65536 + +# Check if we can run this test. +if IMGOPTS='compression_type=zstd' _make_test_img 64M | +grep "Invalid parameter 'zstd'"; then +_notrun "ZSTD is disabled" +fi + +echo +echo "=== Testing compression type incompatible bit setting for zlib ===" +echo +_make_test_img -o compression_type=zlib 64M +$PYTHON qcow2.py "$TEST_IMG" dump-header | grep incompatible_features + +echo +echo "=== Testing compression type incompatible bit setting for zstd ===" +echo +_make_test_img -o compression_type=zstd 64M +$PYTHON qcow2.py "$TEST_IMG" dump-header | grep incompatible_features + +echo +echo "=== Testing zlib with incompatible bit set ===" +echo +_make_test_img -o compression_type=zlib 64M +$PYTHON qcow2.py "$TEST_IMG" set-feature-bit incompatible 3 +# to make sure the bit was actually set +$PYTHON qcow2.py "$TEST_IMG" dump-header | grep incompatible_features + +if $QEMU_IMG info "$TEST_IMG" >/dev/null 2>&1 ; then +echo "Error: The image opened successfully. The image must not be opened." +fi + +echo +echo "=== Testing zstd with incompatible bit unset ===" +echo +_make_test_img -o compression_type=zstd 64M +$PYTHON qcow2.py "$TEST_IMG" set-header incompatible_features 0 +# to make sure the bit was actually unset +$PYTHON qcow2.py "$TEST_IMG" dump-header | grep incompatible_features + +if $QEMU_IMG info "$TEST_IMG" >/dev/null 2>&1 ; then +echo "Error: The image opened successfully. The image must not be opened." 
+fi + +echo +echo "=== Testing compression type values ===" +echo +# zlib=0 +_make_test_img -o compression_type=zlib 64M +peek_file_be "$TEST_IMG" 104 1 +echo + +# zstd=1 +_make_test_img -o compression_type=zstd 64M +peek_file_be "$TEST_IMG" 104 1 +echo + +echo +echo "=== Testing simple reading and writing with zstd ===" +echo +_make_test_img -o compression_type=zstd 64M +$QEMU_IO -c "write -c -P 0xAC 64K 64K " "$TEST_IMG" | _filter_qemu_io +$QEMU_IO -c "read -P 0xAC 64K 64K " "$TEST_IMG" | _filter_qemu_io +# read on the cluster boundaries +$QEMU_IO -c "read -v 131070 8 " "$TEST_IMG" | _filter_qemu_io +$QEMU_IO -c "read -v 65534 8" "$TEST_IMG" | _filter_qemu_io + +echo +echo "=== Testing adjacent clusters reading and writing with zstd ===" +echo +_make_test_img -o compression_type=zstd 64M +$QEMU_IO -c "write -c -P 0xAB 0 64K " "$TEST_IMG" | _filter_qemu_io +$QEMU_IO -c "write -c -P 0xAC 64K 64K " "$TEST_IMG" | _filter_qemu_io +$QEMU_IO -c "write -c -P 0xAD 128K 64K " "$TEST_IMG" | _filte
[PATCH v22 0/4] implement zstd cluster compression method
v22:
03: remove assignment in if condition

v21:
03: * remove the loop on compression [Max]
    * use designated initializers [Max]
04: * don't erase user's options [Max]
    * use _rm_test_img [Max]
    * add unsupported qcow2 options [Max]

v20:
04: fix a number of flaws [Vladimir]
    * don't use $RAND_FILE passing to qemu-io, so the check of $TEST_DIR is redundant
    * re-arrange $RAND_FILE writing
    * fix a typo

v19:
04: fix a number of flaws [Eric]
    * remove redundant test case descriptions
    * fix stdout redirect
    * don't use (())
    * use peek_file_be instead of od
    * check $TEST_DIR for spaces and other characters before using
    * use $RAND_FILE more safely

v18:
* 04: add quotes to all file name variables [Vladimir]
* 04: add Vladimir's comment regarding the "qemu-io write -s" option issue.

v17:
* 03: remove incorrect comment in zstd decompress [Vladimir]
* 03: remove "paranoid" and rewrite the comment on decompress [Vladimir]
* 03: fix dead assignment [Vladimir]
* 04: add and remove quotes [Vladimir]
* 04: replace long offset form with the short one [Vladimir]

v16:
* 03: ssize_t for ret, size_t for zstd_ret [Vladimir]
* 04: small fixes according to the comments [Vladimir]

v15:
* 01: aiming at qemu 5.1 [Eric]
* 03: change zstd_res definition place [Vladimir]
* 04: add two new test cases [Eric]
      1. test adjacent cluster compression with zstd
      2. test incompressible cluster processing
* 03, 04: much rewording and grammar fixing [Eric]

v14:
* fix bug on compression - looping until compress == 0 [Me]
* apply Vladimir's reworked suggestions:
  1. don't mix ssize_t with size_t
  2. safely check for ENOMEM in the compression part - avoid overflow
  3.
tolerate the sanity check: allow zstd to make progress on only one of the buffers

v13:
* 03: add progress sanity check to decompression loop [Vladimir]
  03: add successful decompression check [Me]

v12:
* 03: again, rework compression and decompression loops to make them more correct [Vladimir]
  03: move assert in compression to a more appropriate place [Vladimir]

v11:
* 03: the loops don't need the "do{}while" form anymore, and they were buggy (missed "do" in the beginning); replace them with usual "while(){}" loops [Vladimir]

v10:
* 03: fix zstd (de)compressed loops for multi-frame cases [Vladimir]

v9:
* 01: fix error checking and reporting in qcow2_amend compression type part [Vladimir]
* 03: replace asserts with -EIO in qcow2_zstd_decompression [Vladimir, Alberto]
* 03: reword/amend/add comments, fix typos [Vladimir]

v8:
* 03: switch zstd API from simple to stream [Eric]
  No need to state a special cluster layout for zstd compressed clusters.

v7:
* use qapi_enum_parse instead of the open-coding [Eric]
* fix wording, typos and spelling [Eric]

v6:
* "block/qcow2-threads: fix qcow2_decompress" is removed from the series since it has been accepted by Max already
* add compile time checking for Qcow2Header to be a multiple of 8 [Max, Alberto]
* report error on qcow2 amending when the compression type is actually changed [Max]
* remove the extra space and the extra new line [Max]
* re-arrange acks and signed-off-s [Vladimir]

v5:
* replace -ENOTSUP with abort in qcow2_co_decompress [Vladimir]
* set cluster size for all test cases in the beginning of the 287 test

v4:
* the series is rebased on top of 01 "block/qcow2-threads: fix qcow2_decompress"
* 01 is just a no-change resend to avoid extra dependencies.
Still, it may be merged separately.

v3:
* remove redundant max compression type value check [Vladimir, Eric] (the switch below checks everything)
* prevent compression type changing on "qemu-img amend" [Vladimir]
* remove zstd config setting, since it has been added already by "migration" patches [Vladimir]
* change the compression type error message [Vladimir]
* fix alignment and 80-chars exceeding [Vladimir]

v2:
* rework compression type setting [Vladimir]
* squash iotest changes into the compression type introduction patch [Vladimir, Eric]
* fix zstd availability checking in zstd iotest [Vladimir]
* remove unnecessary casting [Eric]
* remove redundant checks [Eric]
* fix compressed cluster layout in qcow2 spec [Vladimir]
* fix wording [Eric, Vladimir]
* fix compression type filtering in iotests [Eric]

v1: the initial series

Denis Plotnikov (4):
  qcow2: introduce compression type feature
  qcow2: rework the cluster compression routine
  qcow2: add zstd cluster compression
  iotests: 287: add qcow2 compression type test

docs/interop/qcow2.txt | 1 + configure| 2 +- qapi/block-
[PATCH v22 1/4] qcow2: introduce compression type feature
The patch adds some preparatory parts for the incompatible compression type feature to qcow2, allowing the use of different compression methods for (de)compressing image clusters. It is implied that the compression type is set at image creation and can later be changed only by image conversion; thus the compression type defines the single compression algorithm used for the image, and therefore for all image clusters. The goal of the feature is to add support for other compression methods to qcow2. For example, ZSTD, which is more effective at compression than ZLIB. The default compression is ZLIB. Images created with the ZLIB compression type are backward compatible with older qemu versions.

Adding the compression type breaks a number of tests, because the compression type is now reported on image creation and there are some changes to the qcow2 header size and offsets. The tests are fixed in the following ways:
* filter out compression_type for many tests
* fix header size, feature table size and backing file offset
  affected tests: 031, 036, 061, 080
  header_size += 8: 1 byte compression type, 7 bytes padding
  feature_table += 48: incompatible feature "compression type"
  backing_file_offset += 56 (8 + 48 -> header change + feature table change)
* add "compression type" for test output matching when it isn't filtered
  affected tests: 049, 060, 061, 065, 144, 182, 242, 255

Signed-off-by: Denis Plotnikov Reviewed-by: Vladimir Sementsov-Ogievskiy Reviewed-by: Eric Blake QAPI part: Acked-by: Markus Armbruster --- qapi/block-core.json | 22 +- block/qcow2.h| 20 +- include/block/block_int.h| 1 + block/qcow2.c| 113 +++ tests/qemu-iotests/031.out | 14 ++-- tests/qemu-iotests/036.out | 4 +- tests/qemu-iotests/049.out | 102 ++-- tests/qemu-iotests/060.out | 1 + tests/qemu-iotests/061.out | 34 ++ tests/qemu-iotests/065 | 28 +--- tests/qemu-iotests/080 | 2 +- tests/qemu-iotests/144.out | 4 +- tests/qemu-iotests/182.out | 2 +- tests/qemu-iotests/242.out | 5 ++ tests/qemu-iotests/255.out | 8 +-- 
 tests/qemu-iotests/common.filter |   3 +-
 16 files changed, 267 insertions(+), 96 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 943df1926a..1522e2983f 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -78,6 +78,8 @@
 #
 # @bitmaps: A list of qcow2 bitmap details (since 4.0)
 #
+# @compression-type: the image cluster compression method (since 5.1)
+#
 # Since: 1.7
 ##
 { 'struct': 'ImageInfoSpecificQCow2',
@@ -89,7 +91,8 @@
             '*corrupt': 'bool',
             'refcount-bits': 'int',
             '*encrypt': 'ImageInfoSpecificQCow2Encryption',
-            '*bitmaps': ['Qcow2BitmapInfo']
+            '*bitmaps': ['Qcow2BitmapInfo'],
+            'compression-type': 'Qcow2CompressionType'
   } }
 
 ##
@@ -4284,6 +4287,18 @@
   'data': [ 'v2', 'v3' ] }
 
+##
+# @Qcow2CompressionType:
+#
+# Compression type used in qcow2 image file
+#
+# @zlib: zlib compression, see <http://zlib.net/>
+#
+# Since: 5.1
+##
+{ 'enum': 'Qcow2CompressionType',
+  'data': [ 'zlib' ] }
+
 ##
 # @BlockdevCreateOptionsQcow2:
 #
@@ -4307,6 +4322,8 @@
 #                  allowed values: off, falloc, full, metadata)
 # @lazy-refcounts: True if refcounts may be updated lazily (default: off)
 # @refcount-bits: Width of reference counts in bits (default: 16)
+# @compression-type: The image cluster compression method
+#                    (default: zlib, since 5.1)
 #
 # Since: 2.12
 ##
@@ -4322,7 +4339,8 @@
             '*cluster-size':    'size',
             '*preallocation':   'PreallocMode',
             '*lazy-refcounts':  'bool',
-            '*refcount-bits':   'int' } }
+            '*refcount-bits':   'int',
+            '*compression-type':'Qcow2CompressionType' } }
 
 ##
 # @BlockdevCreateOptionsQed:
diff --git a/block/qcow2.h b/block/qcow2.h
index f4de0a27d5..6a8b82e6cc 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -146,8 +146,16 @@ typedef struct QCowHeader {
     uint32_t refcount_order;
     uint32_t header_length;
+
+    /* Additional fields */
+    uint8_t compression_type;
+
+    /* header must be a multiple of 8 */
+    uint8_t padding[7];
 } QEMU_PACKED QCowHeader;
 
+QEMU_BUILD_BUG_ON(!QEMU_IS_ALIGNED(sizeof(QCowHeader), 8));
+
 typedef struct QEMU_PACKED QCowSnapshotHeader {
     /* header is 8 byte aligned */
     uint64_t l1_table_offset;
@@ -216,13 +224,16 @@ enum {
     QCOW2_INCOMPAT_DIRTY_BITNR       = 0,
     QCOW2_INCOMPAT_CORRUPT_BITNR     = 1,
     QCOW2_INCOMPAT_DATA_FILE_BITNR   = 2,
+    QCOW2_INCOMPAT_COMPRESSION_BITNR = 3,
     QCOW2_INCOMPAT_DIRTY    = 1 << QCOW2_INCOMPAT_DIRTY_BITNR,
     QCOW2_INCOMPAT_CORRUPT  = 1 << QCOW2_INCOMPAT_CORRUPT_BITNR,
     QCOW2_INCOMPAT_DATA_FILE = 1 << QCOW2_INCOMPAT_DATA_FILE_BITNR,
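The bit arithmetic and header-size change above are easy to check outside of QEMU. A minimal Python sketch (the constant names mirror the C enum; the 104-byte pre-patch v3 header length is taken from the qcow2 spec and the commit message's "+= 8" note):

```python
# Bit numbers from the qcow2 incompatible-feature enum in block/qcow2.h.
INCOMPAT_DIRTY_BITNR = 0
INCOMPAT_CORRUPT_BITNR = 1
INCOMPAT_DATA_FILE_BITNR = 2
INCOMPAT_COMPRESSION_BITNR = 3  # added by this patch

INCOMPAT_COMPRESSION = 1 << INCOMPAT_COMPRESSION_BITNR

# The v3 header grows by 8 bytes: 1 byte compression_type + 7 bytes padding,
# which keeps sizeof(QCowHeader) a multiple of 8 (the QEMU_BUILD_BUG_ON above).
OLD_V3_HEADER_LEN = 104
NEW_V3_HEADER_LEN = OLD_V3_HEADER_LEN + 1 + 7

assert INCOMPAT_COMPRESSION == 0x8
assert NEW_V3_HEADER_LEN % 8 == 0
print(INCOMPAT_COMPRESSION, NEW_V3_HEADER_LEN)  # 8 112
```

This is why the tests' expected output changes by exactly 8 bytes of header size, matching the "header_size += 8" fix-up described in the commit message.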
[PATCH v22 2/4] qcow2: rework the cluster compression routine
The patch enables processing of the compression type defined for the image and chooses an appropriate method for image cluster (de)compression.

Signed-off-by: Denis Plotnikov
Reviewed-by: Vladimir Sementsov-Ogievskiy
Reviewed-by: Alberto Garcia
---
 block/qcow2-threads.c | 71 ---
 1 file changed, 60 insertions(+), 11 deletions(-)

diff --git a/block/qcow2-threads.c b/block/qcow2-threads.c
index a68126f291..7dbaf53489 100644
--- a/block/qcow2-threads.c
+++ b/block/qcow2-threads.c
@@ -74,7 +74,9 @@ typedef struct Qcow2CompressData {
 } Qcow2CompressData;
 
 /*
- * qcow2_compress()
+ * qcow2_zlib_compress()
+ *
+ * Compress @src_size bytes of data using zlib compression method
  *
  * @dest - destination buffer, @dest_size bytes
  * @src - source buffer, @src_size bytes
@@ -83,8 +85,8 @@ typedef struct Qcow2CompressData {
  *          -ENOMEM destination buffer is not enough to store compressed data
  *          -EIO    on any other error
  */
-static ssize_t qcow2_compress(void *dest, size_t dest_size,
-                              const void *src, size_t src_size)
+static ssize_t qcow2_zlib_compress(void *dest, size_t dest_size,
+                                   const void *src, size_t src_size)
 {
     ssize_t ret;
     z_stream strm;
@@ -119,10 +121,10 @@ static ssize_t qcow2_compress(void *dest, size_t dest_size,
 }
 
 /*
- * qcow2_decompress()
+ * qcow2_zlib_decompress()
  *
  * Decompress some data (not more than @src_size bytes) to produce exactly
- * @dest_size bytes.
+ * @dest_size bytes using zlib compression method
  *
  * @dest - destination buffer, @dest_size bytes
  * @src - source buffer, @src_size bytes
  *
@@ -130,8 +132,8 @@ static ssize_t qcow2_compress(void *dest, size_t dest_size,
  * Returns:  0 on success
  *          -EIO on fail
  */
-static ssize_t qcow2_decompress(void *dest, size_t dest_size,
-                                const void *src, size_t src_size)
+static ssize_t qcow2_zlib_decompress(void *dest, size_t dest_size,
+                                     const void *src, size_t src_size)
 {
     int ret;
     z_stream strm;
@@ -191,20 +193,67 @@ qcow2_co_do_compress(BlockDriverState *bs, void *dest, size_t dest_size,
     return arg.ret;
 }
 
+/*
+ * qcow2_co_compress()
+ *
+ * Compress @src_size bytes of data using the compression
+ * method defined by the image compression type
+ *
+ * @dest - destination buffer, @dest_size bytes
+ * @src - source buffer, @src_size bytes
+ *
+ * Returns: compressed size on success
+ *          a negative error code on failure
+ */
 ssize_t coroutine_fn
 qcow2_co_compress(BlockDriverState *bs, void *dest, size_t dest_size,
                   const void *src, size_t src_size)
 {
-    return qcow2_co_do_compress(bs, dest, dest_size, src, src_size,
-                                qcow2_compress);
+    BDRVQcow2State *s = bs->opaque;
+    Qcow2CompressFunc fn;
+
+    switch (s->compression_type) {
+    case QCOW2_COMPRESSION_TYPE_ZLIB:
+        fn = qcow2_zlib_compress;
+        break;
+
+    default:
+        abort();
+    }
+
+    return qcow2_co_do_compress(bs, dest, dest_size, src, src_size, fn);
 }
 
+/*
+ * qcow2_co_decompress()
+ *
+ * Decompress some data (not more than @src_size bytes) to produce exactly
+ * @dest_size bytes using the compression method defined by the image
+ * compression type
+ *
+ * @dest - destination buffer, @dest_size bytes
+ * @src - source buffer, @src_size bytes
+ *
+ * Returns: 0 on success
+ *          a negative error code on failure
+ */
 ssize_t coroutine_fn
 qcow2_co_decompress(BlockDriverState *bs, void *dest, size_t dest_size,
                     const void *src, size_t src_size)
 {
-    return qcow2_co_do_compress(bs, dest, dest_size, src, src_size,
-                                qcow2_decompress);
+    BDRVQcow2State *s = bs->opaque;
+    Qcow2CompressFunc fn;
+
+    switch (s->compression_type) {
+    case QCOW2_COMPRESSION_TYPE_ZLIB:
+        fn = qcow2_zlib_decompress;
+        break;
+
+    default:
+        abort();
+    }
+
+    return qcow2_co_do_compress(bs, dest, dest_size, src, src_size, fn);
 }
-- 
2.17.0
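The structure of the patch — look up a (de)compression function from the image's compression type, then run it through the common worker — can be modelled outside of QEMU. The following is an illustrative Python sketch, not QEMU code; it uses zlib's raw-deflate mode (windowBits = -12), which is the mode qcow2's zlib routines configure, and mimics the -ENOMEM contract for a result that does not fit the destination:

```python
import zlib

QCOW2_COMPRESSION_TYPE_ZLIB = 0

def zlib_compress(src: bytes, dest_size: int) -> bytes:
    # qcow2 uses raw deflate (windowBits = -12); output that does not fit
    # into dest_size means "store the cluster uncompressed" (-ENOMEM in C).
    c = zlib.compressobj(9, zlib.DEFLATED, -12)
    out = c.compress(src) + c.flush()
    if len(out) > dest_size:
        raise MemoryError("compressed data does not fit destination buffer")
    return out

def zlib_decompress(src: bytes, dest_size: int) -> bytes:
    # Produce at most dest_size bytes from the raw-deflate stream.
    d = zlib.decompressobj(-12)
    return d.decompress(src, dest_size)

# Dispatch tables playing the role of the switch on s->compression_type.
COMPRESS_FN = {QCOW2_COMPRESSION_TYPE_ZLIB: zlib_compress}
DECOMPRESS_FN = {QCOW2_COMPRESSION_TYPE_ZLIB: zlib_decompress}

cluster = b"qcow2 cluster payload " * 100
packed = COMPRESS_FN[QCOW2_COMPRESSION_TYPE_ZLIB](cluster, len(cluster) - 1)
assert DECOMPRESS_FN[QCOW2_COMPRESSION_TYPE_ZLIB](packed, len(cluster)) == cluster
```

An unknown compression type simply has no table entry, which corresponds to the `default: abort()` arm in the patch: an image with an unregistered type must never have been opened in the first place.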
[PATCH v22 3/4] qcow2: add zstd cluster compression
zstd significantly reduces cluster compression time. It provides better compression performance while maintaining the same compression ratio as zlib, which, at the moment, is the only compression method available.

The performance test results:

The test compresses and decompresses a qemu qcow2 image with a just-installed rhel-7.6 guest.
Image cluster size: 64K. Image on-disk size: 2.2G.

The test was conducted with a brd disk to reduce the influence of the disk subsystem on the test results. The results are given in seconds.

compress cmd:
  time ./qemu-img convert -O qcow2 -c -o compression_type=[zlib|zstd]
                  src.img [zlib|zstd]_compressed.img
decompress cmd:
  time ./qemu-img convert -O qcow2
                  [zlib|zstd]_compressed.img uncompressed.img

           compression                 decompression
         zlib       zstd             zlib     zstd
------------------------------------------------------
real     65.5       16.3 (-75 %)      1.9      1.6 (-16 %)
user     65.0       15.8              5.3      2.5
sys       3.3        0.2              2.0      2.0

Both ZLIB and ZSTD gave the same compression ratio: 1.57;
compressed image size in both cases: 1.4G.

Signed-off-by: Denis Plotnikov
QAPI part:
Acked-by: Markus Armbruster
---
 docs/interop/qcow2.txt |   1 +
 configure              |   2 +-
 qapi/block-core.json   |   3 +-
 block/qcow2-threads.c  | 169 +
 block/qcow2.c          |   7 ++
 slirp                  |   2 +-
 6 files changed, 181 insertions(+), 3 deletions(-)

diff --git a/docs/interop/qcow2.txt b/docs/interop/qcow2.txt
index 640e0eca40..18a77f737e 100644
--- a/docs/interop/qcow2.txt
+++ b/docs/interop/qcow2.txt
@@ -209,6 +209,7 @@ version 2.
 Available compression type values:
     0: zlib <https://www.zlib.net/>
+    1: zstd <http://github.com/facebook/zstd>
 
 === Header padding ===
 
diff --git a/configure b/configure
index 23b5e93752..4e3a1690ea 100755
--- a/configure
+++ b/configure
@@ -1861,7 +1861,7 @@ disabled with --disable-FEATURE, default is enabled if available:
   lzfse           support of lzfse compression library
                   (for reading lzfse-compressed dmg images)
   zstd            support for zstd compression library
-                  (for migration compression)
+                  (for migration compression and qcow2 cluster compression)
   seccomp         seccomp support
   coroutine-pool  coroutine freelist (better performance)
   glusterfs       GlusterFS backend
diff --git a/qapi/block-core.json b/qapi/block-core.json
index 1522e2983f..6fbacddab2 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -4293,11 +4293,12 @@
 # Compression type used in qcow2 image file
 #
 # @zlib: zlib compression, see <http://zlib.net/>
+# @zstd: zstd compression, see <http://github.com/facebook/zstd>
 #
 # Since: 5.1
 ##
 { 'enum': 'Qcow2CompressionType',
-  'data': [ 'zlib' ] }
+  'data': [ 'zlib', { 'name': 'zstd', 'if': 'defined(CONFIG_ZSTD)' } ] }
 
 ##
 # @BlockdevCreateOptionsQcow2:
diff --git a/block/qcow2-threads.c b/block/qcow2-threads.c
index 7dbaf53489..a0b12e1b15 100644
--- a/block/qcow2-threads.c
+++ b/block/qcow2-threads.c
@@ -28,6 +28,11 @@
 #define ZLIB_CONST
 #include <zlib.h>
 
+#ifdef CONFIG_ZSTD
+#include <zstd.h>
+#include <zstd_errors.h>
+#endif
+
 #include "qcow2.h"
 #include "block/thread-pool.h"
 #include "crypto.h"
@@ -166,6 +171,160 @@ static ssize_t qcow2_zlib_decompress(void *dest, size_t dest_size,
     return ret;
 }
 
+#ifdef CONFIG_ZSTD
+
+/*
+ * qcow2_zstd_compress()
+ *
+ * Compress @src_size bytes of data using zstd compression method
+ *
+ * @dest - destination buffer, @dest_size bytes
+ * @src - source buffer, @src_size bytes
+ *
+ * Returns: compressed size on success
+ *          -ENOMEM destination buffer is not enough to store compressed data
+ *          -EIO    on any other error
+ */
+static ssize_t qcow2_zstd_compress(void *dest, size_t dest_size,
+                                   const void *src, size_t src_size)
+{
+    ssize_t ret;
+    size_t zstd_ret;
+    ZSTD_outBuffer output = {
+        .dst = dest,
+        .size = dest_size,
+        .pos = 0
+    };
+    ZSTD_inBuffer input = {
+        .src = src,
+        .size = src_size,
+        .pos = 0
+    };
+    ZSTD_CCtx *cctx = ZSTD_createCCtx();
+
+    if (!cctx) {
+        return -EIO;
+    }
+    /*
+     * Use the zstd streamed interface for symmetry with decompression,
+     * where streaming is essential since we don't record the exact
+     * compressed size.
+     *
+     * ZSTD_compressStream2() tries to compress everything it could
+     * with a single call. Although the ZSTD docs say that:
+     * "You must continue calling ZSTD_compressStream2() with ZSTD_e_end
+     * until it returns 0, at which point you are free to start a new frame",
+     * in our tests we saw
Re: [PATCH v20 4/4] iotests: 287: add qcow2 compression type test
On 28.04.2020 16:01, Eric Blake wrote:
> On 4/28/20 7:55 AM, Max Reitz wrote:
>>>> +# This tests qcow2-specific low-level functionality
>>>> +_supported_fmt qcow2
>>>> +_supported_proto file
>>>> +_supported_os Linux
>>>
>>> This test doesn't work with compat=0.10 (because we can't store a
>>> non-default compression type there) or data_file (because those don't
>>> support compression), so those options should be marked as unsupported.
>>> (It does seem to work with any refcount_bits, though.)
>>
>> Could I ask how to achieve that? I can't find any _supported_* related.
>
> It's _unsupported_imgopts. Test 036 is an example of this.

Max, Eric, thanks!

Denis
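For readers unfamiliar with the iotests helpers: following Eric's pointer, the test preamble would likely gain an `_unsupported_imgopts` line. A sketch of what that might look like — the exact option patterns are an assumption modelled on test 036, and this fragment only runs inside the qemu-iotests harness:

```shell
# Hypothetical excerpt from the test 287 preamble; not standalone shell,
# it relies on helpers sourced from tests/qemu-iotests/common.rc.
_supported_fmt qcow2
_supported_proto file
_supported_os Linux
# compat=0.10 can't store a non-default compression type, and data_file
# clusters don't support compression at all:
_unsupported_imgopts 'compat=0.10' data_file
```

This makes the harness skip the test automatically when it is run with those image options instead of failing it.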