Re: [Qemu-block] [Qemu-devel] [PATCH v2 0/2] block latency histogram

2018-03-08 Thread Emilio G. Cota
On Thu, Mar 08, 2018 at 22:07:35 +0300, Vladimir Sementsov-Ogievskiy wrote:
> 08.03.2018 21:56, Emilio G. Cota wrote:
> >  * Binning happens only at print time, so that we retain the flexibility to
> >  * choose the binning. This might not be ideal for workloads that do not 
> > care
> >  * much about precision and insert many samples all with different x values;
> >  * in that case, pre-binning (e.g. entering both 0.115 and 0.097 as 0.1)
> >  * should be considered.
(snip)
> In this case, I'll have to do the same bin search (and store the same
> interval settings) as I already do on my side, just to calculate a parameter
> for the qdist interface. And I'll have to store almost all of the same data
> on my side. So it doesn't really help. And I need none of qdist's benefits:
> I don't need (and don't want) dynamic allocation of bins when adding an
> element, or any kind of visualization.

I see. You require a couple of features that qdist doesn't yet support:

- Arbitrarily-sized, pre-defined bins.
- Support for querying the data programmatically instead of just
  printing it out.

We could circumvent the first missing feature with pre-binning,
but in that case we'd do a bsearch twice as you point out (BTW
your concern about memory allocation wouldn't apply though).

The second missing feature should be easy to add to qdist.

That said, given that you want this in for 2.12, I'd go with your
approach for now. In the future we should look into supporting
your use case in qdist, since it is likely that there will be
more users with a similar need.

Thanks,

Emilio



Re: [Qemu-block] [Qemu-devel] [PATCH v2 1/2] block/accounting: introduce latency histogram

2018-03-08 Thread Eric Blake

On 03/08/2018 12:58 PM, Vladimir Sementsov-Ogievskiy wrote:
Hm, these numbers are actually the boundary points of the histogram 
intervals, not the intervals themselves. And Wikipedia says "The bins are 
usually specified as consecutive, non-overlapping intervals of a variable.", 
so intervals are bins.


So, what about:

1. interval_boundaries
2. bin_boundaries
3. boundaries

(and same names with s/_/-/ for qapi)


Any of those works for me; 1 is probably most precise, while 3 is the 
shortest to type.








Also, I now doubt whether it is a good idea to share the same bin 
boundaries across all I/O types.




So, for the QMP command, what about:

boundaries - optional, default boundaries for all I/O operations
boundaries-read - boundaries for read
boundaries-write - boundaries for write
...

A call without any boundaries: drop _all_ histograms.
A call with only specific boundaries-*: set or reset _only_ the corresponding 
specific histograms.

A call with only the boundaries parameter: set or reset _all_ histograms.
A call with the boundaries parameter and some of (or all, though that is not 
useful) the specific boundaries-*: set or reset _all_ histograms, some to the 
default and some to the specific boundaries.


Seems reasonable, and if that makes the command more useful for your 
timing measurements, I'm fine with it (your argument for different timing 
bins for read and write also makes sense, as there are some inherent 
differences in the amount of work done in the two directions).


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: [Qemu-block] [Qemu-devel] [PATCH v2 0/2] block latency histogram

2018-03-08 Thread Vladimir Sementsov-Ogievskiy

08.03.2018 21:56, Emilio G. Cota wrote:

On Thu, Mar 08, 2018 at 14:42:29 +0300, Vladimir Sementsov-Ogievskiy wrote:

Hi Emilio!

I looked through qdist; if I understand correctly, it saves each added
element (with a distinct value). That is not efficient for disk I/O timing -
we'd have too many elements. In my approach the histogram doesn't grow: it
initially has several ranges and counts hits in each range.

I thought about this use case, i.e. having a gazillion elements.
You should just do some pre-binning before inserting samples
into qdist, as pointed out in this comment in qdist.h:

/*
  * Samples with the same 'x value' end up in the same qdist_entry,
  * e.g. inc(0.1) and inc(0.1) end up as {x=0.1, count=2}.
  *
  * Binning happens only at print time, so that we retain the flexibility to
  * choose the binning. This might not be ideal for workloads that do not care
  * much about precision and insert many samples all with different x values;
  * in that case, pre-binning (e.g. entering both 0.115 and 0.097 as 0.1)
  * should be considered.
  */
struct qdist_entry {
 double x;
 unsigned long count;
};

Let me know if you need help with that.

Thanks,

Emilio


In this case, I'll have to do the same bin search (and store the same 
interval settings) as I already do on my side, just to calculate a parameter 
for the qdist interface. And I'll have to store almost all of the same data 
on my side. So it doesn't really help. And I need none of qdist's benefits: 
I don't need (and don't want) dynamic allocation of bins when adding an 
element, or any kind of visualization.


--
Best regards,
Vladimir




Re: [Qemu-block] [Qemu-devel] [PATCH v2 1/2] block/accounting: introduce latency histogram

2018-03-08 Thread Vladimir Sementsov-Ogievskiy

08.03.2018 21:21, Vladimir Sementsov-Ogievskiy wrote:

08.03.2018 21:14, Vladimir Sementsov-Ogievskiy wrote:

08.03.2018 20:31, Eric Blake wrote:

On 03/06/2018 09:32 AM, Stefan Hajnoczi wrote:
On Wed, Feb 07, 2018 at 03:50:36PM +0300, Vladimir 
Sementsov-Ogievskiy wrote:

Introduce latency histogram statistics for block devices.
For each accounted operation type, the latency region [0, +inf) is
divided into subregions by several points, and hits in each
subregion are counted.

Signed-off-by: Vladimir Sementsov-Ogievskiy 


---



According to Wikipedia and Mathworld, "intervals" and "bins" are
commonly used terms:
https://en.wikipedia.org/wiki/Histogram
http://mathworld.wolfram.com/Histogram.html

I suggest:

   typedef struct {
   /* The following histogram is represented like this:
    *
    * 5|   *
    * 4|   *
    * 3| * *
    * 2| * *    *
    * 1| *    *    *    *
    *  +--
    *  10   50   100
    *
    * BlockLatencyHistogram histogram = {
    * .nbins = 4,
    * .intervals = {10, 50, 100},
    * .bins = {3, 1, 5, 2},
    * };


The name 'intervals' is still slightly ambiguous: does it hold the 
boundary point (0-10 for 10 slots, 10-50 for 40 slots, 50-100, for 
50 slots, then 100-INF) or is it the interval size of each slot 
(first bin is 10 slots for 0-10, next bin is 50 slots wide so 10-60, 
next bin is 100 slots wide so 60-160, everything else is 160-INF).  
But the ascii-art diagram plus the text is sufficient to resolve the 
intent if you keep that name (I don't have a suggestion for a better 
name).




Hm, these numbers are actually the boundary points of the histogram 
intervals, not the intervals themselves. And Wikipedia says "The bins are 
usually specified as consecutive, non-overlapping intervals of a variable.", 
so intervals are bins.
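
To make the boundary semantics concrete, here is a minimal illustrative mapping
from a latency value to a bin index (a sketch only, not code from the patch;
the boundaries are treated as the upper bounds of all bins except the last):

    #include <stdint.h>

    /* Illustration: boundaries[] = {10, 50, 100} with nbins = 4 describes
     * the bins [0,10), [10,50), [50,100), [100,+inf). */
    static unsigned int latency_to_bin(const uint64_t *boundaries,
                                       unsigned int nbins, uint64_t latency)
    {
        unsigned int i;

        for (i = 0; i < nbins - 1; i++) {
            if (latency < boundaries[i]) {
                break;
            }
        }
        return i;   /* bin index in [0, nbins - 1] */
    }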


So, what about:

1. interval_boundaries
2. bin_boundaries
3. boundaries

(and same names with s/_/-/ for qapi)





Also, I now doubt whether it is a good idea to share the same bin 
boundaries across all I/O types.




So, for the QMP command, what about:

boundaries - optional, default boundaries for all I/O operations
boundaries-read - boundaries for read
boundaries-write - boundaries for write
...

A call without any boundaries: drop _all_ histograms.
A call with only specific boundaries-*: set or reset _only_ the corresponding 
specific histograms.

A call with only the boundaries parameter: set or reset _all_ histograms.
A call with the boundaries parameter and some of (or all, though that is not 
useful) the specific boundaries-*: set or reset _all_ histograms, some to the 
default and some to the specific boundaries.


--
Best regards,
Vladimir




Re: [Qemu-block] [Qemu-devel] [PATCH v2 0/2] block latency histogram

2018-03-08 Thread Emilio G. Cota
On Thu, Mar 08, 2018 at 14:42:29 +0300, Vladimir Sementsov-Ogievskiy wrote:
> Hi Emilio!
> 
> I looked through qdist; if I understand correctly, it saves each added
> element (with a distinct value). That is not efficient for disk I/O timing -
> we'd have too many elements. In my approach the histogram doesn't grow: it
> initially has several ranges and counts hits in each range.

I thought about this use case, i.e. having a gazillion elements.
You should just do some pre-binning before inserting samples
into qdist, as pointed out in this comment in qdist.h:

/*
 * Samples with the same 'x value' end up in the same qdist_entry,
 * e.g. inc(0.1) and inc(0.1) end up as {x=0.1, count=2}.
 *
 * Binning happens only at print time, so that we retain the flexibility to
 * choose the binning. This might not be ideal for workloads that do not care
 * much about precision and insert many samples all with different x values;
 * in that case, pre-binning (e.g. entering both 0.115 and 0.097 as 0.1)
 * should be considered.
 */
struct qdist_entry {
double x;
unsigned long count;
};

Let me know if you need help with that.
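
For illustration, a minimal sketch of what such pre-binning could look like
(the bin width, helper name and rounding scheme are assumptions made up for
this example, not part of qdist):

    #include "qemu/osdep.h"
    #include "qemu/qdist.h"

    /* Round each latency sample to the midpoint of a fixed-width bin so
     * that many distinct samples collapse into a few qdist_entry records. */
    #define LATENCY_BIN_NS 100000   /* 0.1 ms bins, chosen arbitrarily */

    static void latency_record(struct qdist *dist, int64_t latency_ns)
    {
        double binned = (double)(latency_ns / LATENCY_BIN_NS) * LATENCY_BIN_NS
                        + LATENCY_BIN_NS / 2.0;

        qdist_inc(dist, binned);    /* adds one sample at x == binned */
    }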

Thanks,

Emilio



[Qemu-block] [PATCH 1/5] nbd/server: move nbd_co_send_structured_error up

2018-03-08 Thread Vladimir Sementsov-Ogievskiy
To be reused in nbd_co_send_sparse_read() in the following patch.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 nbd/server.c | 48 
 1 file changed, 24 insertions(+), 24 deletions(-)

diff --git a/nbd/server.c b/nbd/server.c
index 4990a5826e..c406b0656d 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -1341,6 +1341,30 @@ static int coroutine_fn 
nbd_co_send_structured_read(NBDClient *client,
 return nbd_co_send_iov(client, iov, 2, errp);
 }
 
+static int coroutine_fn nbd_co_send_structured_error(NBDClient *client,
+ uint64_t handle,
+ uint32_t error,
+ const char *msg,
+ Error **errp)
+{
+NBDStructuredError chunk;
+int nbd_err = system_errno_to_nbd_errno(error);
+struct iovec iov[] = {
+{.iov_base = &chunk, .iov_len = sizeof(chunk)},
+{.iov_base = (char *)msg, .iov_len = msg ? strlen(msg) : 0},
+};
+
+assert(nbd_err);
+trace_nbd_co_send_structured_error(handle, nbd_err,
+   nbd_err_lookup(nbd_err), msg ? msg : "");
+set_be_chunk(&chunk.h, NBD_REPLY_FLAG_DONE, NBD_REPLY_TYPE_ERROR, handle,
+ sizeof(chunk) - sizeof(chunk.h) + iov[1].iov_len);
+stl_be_p(&chunk.error, nbd_err);
+stw_be_p(&chunk.message_length, iov[1].iov_len);
+
+return nbd_co_send_iov(client, iov, 1 + !!iov[1].iov_len, errp);
+}
+
 static int coroutine_fn nbd_co_send_sparse_read(NBDClient *client,
 uint64_t handle,
 uint64_t offset,
@@ -1400,30 +1424,6 @@ static int coroutine_fn 
nbd_co_send_sparse_read(NBDClient *client,
 return ret;
 }
 
-static int coroutine_fn nbd_co_send_structured_error(NBDClient *client,
- uint64_t handle,
- uint32_t error,
- const char *msg,
- Error **errp)
-{
-NBDStructuredError chunk;
-int nbd_err = system_errno_to_nbd_errno(error);
-struct iovec iov[] = {
-{.iov_base = &chunk, .iov_len = sizeof(chunk)},
-{.iov_base = (char *)msg, .iov_len = msg ? strlen(msg) : 0},
-};
-
-assert(nbd_err);
-trace_nbd_co_send_structured_error(handle, nbd_err,
-   nbd_err_lookup(nbd_err), msg ? msg : "");
-set_be_chunk(&chunk.h, NBD_REPLY_FLAG_DONE, NBD_REPLY_TYPE_ERROR, handle,
- sizeof(chunk) - sizeof(chunk.h) + iov[1].iov_len);
-stl_be_p(&chunk.error, nbd_err);
-stw_be_p(&chunk.message_length, iov[1].iov_len);
-
-return nbd_co_send_iov(client, iov, 1 + !!iov[1].iov_len, errp);
-}
-
 /* nbd_co_receive_request
  * Collect a client request. Return 0 if request looks valid, -EIO to drop
  * connection right away, and any other negative value to report an error to
-- 
2.11.1




[Qemu-block] [PATCH 3/5] nbd/server: fix: check client->closing before reply sending

2018-03-08 Thread Vladimir Sementsov-Ogievskiy
Signed-off-by: Vladimir Sementsov-Ogievskiy 
---

This is more like an RFC. I'm not sure, but this place looks like a bug. Shouldn't
we check client->closing even before the nbd_client_receive_next_request() call?

 nbd/server.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/nbd/server.c b/nbd/server.c
index e0de431e10..97b45a21fa 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -1547,10 +1547,6 @@ static coroutine_fn void nbd_trip(void *opaque)
 goto disconnect;
 }
 
-if (ret < 0) {
-goto reply;
-}
-
 if (client->closing) {
 /*
  * The client may be closed when we are blocked in
@@ -1559,6 +1555,10 @@ static coroutine_fn void nbd_trip(void *opaque)
 goto done;
 }
 
+if (ret < 0) {
+goto reply;
+}
+
 switch (request.type) {
 case NBD_CMD_READ:
 /* XXX: NBD Protocol only documents use of FUA with WRITE */
-- 
2.11.1




[Qemu-block] [PATCH 0/5] nbd server fixing and refactoring before BLOCK_STATUS

2018-03-08 Thread Vladimir Sementsov-Ogievskiy
01 and 02 are a split and updated version of "[PATCH] nbd/server: fix sparse read";
the others are new.

Vladimir Sementsov-Ogievskiy (5):
  nbd/server: move nbd_co_send_structured_error up
  nbd/server: fix sparse read
  nbd/server: fix: check client->closing before reply sending
  nbd/server: refactor nbd_trip: cmd_read and generic reply
  nbd/server: refactor nbd_trip: split out nbd_handle_request

 nbd/server.c | 302 ---
 1 file changed, 165 insertions(+), 137 deletions(-)

-- 
2.11.1




[Qemu-block] [PATCH 4/5] nbd/server: refactor nbd_trip: cmd_read and generic reply

2018-03-08 Thread Vladimir Sementsov-Ogievskiy
nbd_trip has convoluted reply-sending logic: it tries to use one
code path for all replies. That is fine for simple replies, but is not
convenient for structured replies. Also, the two types of error (and
the corresponding message in local_err) - fatal (leading to disconnect)
and non-fatal (just to be sent to the client) - are difficult to follow.

To make things a bit clearer, the following is done:
 - split the CMD_READ logic into a separate function. It is the most
   difficult command for now, and it is definitely cramped inside nbd_trip.
   Also, the CMD_READ logic is difficult to follow, shared between
   "case NBD_CMD_READ" and the "if"s under the "reply:" label.
 - create a separate helper function nbd_send_generic_reply() and use it
   both in the new nbd_do_cmd_read and for the other commands in nbd_trip,
   instead of the common code path under the "reply:" label in nbd_trip.
   The helper supports an error message, so the local_err logic in
   nbd_trip is simplified.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 nbd/server.c | 164 ---
 1 file changed, 88 insertions(+), 76 deletions(-)

diff --git a/nbd/server.c b/nbd/server.c
index 97b45a21fa..a2f5f73d52 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -1520,6 +1520,70 @@ static int nbd_co_receive_request(NBDRequestData *req, 
NBDRequest *request,
 return 0;
 }
 
+/* Send simple reply without a payload or structured error
+ * @error_msg is ignored if @ret >= 0 */
+static coroutine_fn int nbd_send_generic_reply(NBDClient *client,
+   uint64_t handle,
+   int ret,
+   const char *error_msg,
+   Error **errp)
+{
+if (client->structured_reply && ret < 0) {
+return nbd_co_send_structured_error(client, handle, -ret, error_msg,
+errp);
+} else {
+return nbd_co_send_simple_reply(client, handle, ret < 0 ? -ret : 0,
+NULL, 0, errp);
+}
+}
+
+/* Handle NBD_CMD_READ request.
+ * Return -errno if sending fails. Other errors are reported directly to the
+ * client as an error reply. */
+static coroutine_fn int nbd_do_cmd_read(NBDClient *client, NBDRequest *request,
+uint8_t *data, Error **errp)
+{
+int ret;
+NBDExport *exp = client->exp;
+
+assert(request->type == NBD_CMD_READ);
+
+/* XXX: NBD Protocol only documents use of FUA with WRITE */
+if (request->flags & NBD_CMD_FLAG_FUA) {
+ret = blk_co_flush(exp->blk);
+if (ret < 0) {
+return nbd_send_generic_reply(client, request->handle, ret,
+  "flush failed", errp);
+}
+}
+
+if (client->structured_reply && !(request->flags & NBD_CMD_FLAG_DF) &&
+request->len) {
+return nbd_co_send_sparse_read(client, request->handle, request->from,
+   data, request->len, errp);
+}
+
+ret = blk_pread(exp->blk, request->from + exp->dev_offset, data,
+request->len);
+if (ret < 0) {
+return nbd_send_generic_reply(client, request->handle, ret,
+  "reading from file failed", errp);
+}
+
+if (client->structured_reply) {
+if (request->len) {
+return nbd_co_send_structured_read(client, request->handle,
+   request->from, data,
+   request->len, true, errp);
+} else {
+return nbd_co_send_structured_done(client, request->handle, errp);
+}
+} else {
+return nbd_co_send_simple_reply(client, request->handle, 0,
+data, request->len, errp);
+}
+}
+
 /* Owns a reference to the NBDClient passed as opaque.  */
 static coroutine_fn void nbd_trip(void *opaque)
 {
@@ -1529,7 +1593,6 @@ static coroutine_fn void nbd_trip(void *opaque)
 NBDRequest request = { 0 };/* GCC thinks it can be used uninitialized 
*/
 int ret;
 int flags;
-int reply_data_len = 0;
 Error *local_err = NULL;
 char *msg = NULL;
 
@@ -1556,39 +1619,21 @@ static coroutine_fn void nbd_trip(void *opaque)
 }
 
 if (ret < 0) {
-goto reply;
+/* It's not a -EIO, so, according to nbd_co_receive_request()
+ * semantics, we should return the error to the client. */
+Error *export_err = local_err;
+
+local_err = NULL;
+ret = nbd_send_generic_reply(client, request.handle, -EINVAL,
+ error_get_pretty(export_err), &local_err);
+error_free(export_err);
+
+goto replied;
 }
 
 switch (request.type) {
 case NBD_CMD_READ:
-/* XXX: NBD Protocol only documents use 

[Qemu-block] [PATCH 5/5] nbd/server: refactor nbd_trip: split out nbd_handle_request

2018-03-08 Thread Vladimir Sementsov-Ogievskiy
Split out request handling logic.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 nbd/server.c | 129 +++
 1 file changed, 67 insertions(+), 62 deletions(-)

diff --git a/nbd/server.c b/nbd/server.c
index a2f5f73d52..7b236f404e 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -1584,17 +1584,79 @@ static coroutine_fn int nbd_do_cmd_read(NBDClient 
*client, NBDRequest *request,
 }
 }
 
+/* Handle NBD request.
+ * Return -errno if sending fails. Other errors are reported directly to the
+ * client as an error reply. */
+static coroutine_fn int nbd_handle_request(NBDClient *client,
+   NBDRequest *request,
+   uint8_t *data, Error **errp)
+{
+int ret;
+int flags;
+NBDExport *exp = client->exp;
+char *msg;
+
+switch (request->type) {
+case NBD_CMD_READ:
+return nbd_do_cmd_read(client, request, data, errp);
+
+case NBD_CMD_WRITE:
+flags = 0;
+if (request->flags & NBD_CMD_FLAG_FUA) {
+flags |= BDRV_REQ_FUA;
+}
+ret = blk_pwrite(exp->blk, request->from + exp->dev_offset,
+ data, request->len, flags);
+
+return nbd_send_generic_reply(client, request->handle, ret,
+  "writing to file failed", errp);
+case NBD_CMD_WRITE_ZEROES:
+flags = 0;
+if (request->flags & NBD_CMD_FLAG_FUA) {
+flags |= BDRV_REQ_FUA;
+}
+if (!(request->flags & NBD_CMD_FLAG_NO_HOLE)) {
+flags |= BDRV_REQ_MAY_UNMAP;
+}
+ret = blk_pwrite_zeroes(exp->blk, request->from + exp->dev_offset,
+request->len, flags);
+
+return nbd_send_generic_reply(client, request->handle, ret,
+  "writing to file failed", errp);
+case NBD_CMD_DISC:
+/* unreachable, thanks to special case in nbd_co_receive_request() */
+abort();
+
+case NBD_CMD_FLUSH:
+ret = blk_co_flush(exp->blk);
+
+return nbd_send_generic_reply(client, request->handle, ret,
+  "flush failed", errp);
+case NBD_CMD_TRIM:
+ret = blk_co_pdiscard(exp->blk, request->from + exp->dev_offset,
+  request->len);
+
+return nbd_send_generic_reply(client, request->handle, ret,
+  "discard failed", errp);
+default:
+msg = g_strdup_printf("invalid request type (%" PRIu32 ") received",
+  request->type);
+ret = nbd_send_generic_reply(client, request->handle, -EINVAL, msg,
+ errp);
+g_free(msg);
+
+return ret;
+}
+}
+
 /* Owns a reference to the NBDClient passed as opaque.  */
 static coroutine_fn void nbd_trip(void *opaque)
 {
 NBDClient *client = opaque;
-NBDExport *exp = client->exp;
 NBDRequestData *req;
 NBDRequest request = { 0 };/* GCC thinks it can be used uninitialized 
*/
 int ret;
-int flags;
 Error *local_err = NULL;
-char *msg = NULL;
 
 trace_nbd_trip();
 if (client->closing) {
@@ -1627,66 +1689,9 @@ static coroutine_fn void nbd_trip(void *opaque)
 ret = nbd_send_generic_reply(client, request.handle, -EINVAL,
  error_get_pretty(export_err), &local_err);
 error_free(export_err);
-
-goto replied;
-}
-
-switch (request.type) {
-case NBD_CMD_READ:
-ret = nbd_do_cmd_read(client, &request, req->data, &local_err);
-
-break;
-case NBD_CMD_WRITE:
-flags = 0;
-if (request.flags & NBD_CMD_FLAG_FUA) {
-flags |= BDRV_REQ_FUA;
-}
-ret = blk_pwrite(exp->blk, request.from + exp->dev_offset,
- req->data, request.len, flags);
-ret = nbd_send_generic_reply(client, request.handle, ret,
- "writing to file failed", _err);
-
-break;
-case NBD_CMD_WRITE_ZEROES:
-flags = 0;
-if (request.flags & NBD_CMD_FLAG_FUA) {
-flags |= BDRV_REQ_FUA;
-}
-if (!(request.flags & NBD_CMD_FLAG_NO_HOLE)) {
-flags |= BDRV_REQ_MAY_UNMAP;
-}
-ret = blk_pwrite_zeroes(exp->blk, request.from + exp->dev_offset,
-request.len, flags);
-ret = nbd_send_generic_reply(client, request.handle, ret,
- "writing to file failed", _err);
-
-break;
-case NBD_CMD_DISC:
-/* unreachable, thanks to special case in nbd_co_receive_request() */
-abort();
-
-case NBD_CMD_FLUSH:
-ret = blk_co_flush(exp->blk);
-ret = nbd_send_generic_reply(client, request.handle, ret,
- "flush failed", _err);
-
- 

Re: [Qemu-block] [Qemu-devel] [PATCH v2 1/2] block/accounting: introduce latency histogram

2018-03-08 Thread Vladimir Sementsov-Ogievskiy

08.03.2018 21:14, Vladimir Sementsov-Ogievskiy wrote:

08.03.2018 20:31, Eric Blake wrote:

On 03/06/2018 09:32 AM, Stefan Hajnoczi wrote:
On Wed, Feb 07, 2018 at 03:50:36PM +0300, Vladimir 
Sementsov-Ogievskiy wrote:

Introduce latency histogram statistics for block devices.
For each accounted operation type, the latency region [0, +inf) is
divided into subregions by several points, and hits in each
subregion are counted.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---



According to Wikipedia and Mathworld, "intervals" and "bins" are
commonly used terms:
https://en.wikipedia.org/wiki/Histogram
http://mathworld.wolfram.com/Histogram.html

I suggest:

   typedef struct {
   /* The following histogram is represented like this:
    *
    * 5|   *
    * 4|   *
    * 3| * *
    * 2| * *    *
    * 1| *    *    *    *
    *  +--
    *  10   50   100
    *
    * BlockLatencyHistogram histogram = {
    * .nbins = 4,
    * .intervals = {10, 50, 100},
    * .bins = {3, 1, 5, 2},
    * };


The name 'intervals' is still slightly ambiguous: does it hold the 
boundary point (0-10 for 10 slots, 10-50 for 40 slots, 50-100, for 50 
slots, then 100-INF) or is it the interval size of each slot (first 
bin is 10 slots for 0-10, next bin is 50 slots wide so 10-60, next 
bin is 100 slots wide so 60-160, everything else is 160-INF).  But 
the ascii-art diagram plus the text is sufficient to resolve the 
intent if you keep that name (I don't have a suggestion for a better 
name).




Hm, these numbers are actually the boundary points of the histogram 
intervals, not the intervals themselves. And Wikipedia says "The bins are 
usually specified as consecutive, non-overlapping intervals of a variable.", 
so intervals are bins.


So, what about:

1. interval_boundaries
2. bin_boundaries
3. boundaries

(and same names with s/_/-/ for qapi)





Also, I now doubt whether it is a good idea to share the same bin 
boundaries across all I/O types.


--
Best regards,
Vladimir




Re: [Qemu-block] [Qemu-devel] [PATCH v2 1/2] block/accounting: introduce latency histogram

2018-03-08 Thread Vladimir Sementsov-Ogievskiy

08.03.2018 20:31, Eric Blake wrote:

On 03/06/2018 09:32 AM, Stefan Hajnoczi wrote:
On Wed, Feb 07, 2018 at 03:50:36PM +0300, Vladimir 
Sementsov-Ogievskiy wrote:

Introduce latency histogram statistics for block devices.
For each accounted operation type, the latency region [0, +inf) is
divided into subregions by several points, and hits in each
subregion are counted.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---



According to Wikipedia and Mathworld, "intervals" and "bins" are
commonly used terms:
https://en.wikipedia.org/wiki/Histogram
http://mathworld.wolfram.com/Histogram.html

I suggest:

   typedef struct {
   /* The following histogram is represented like this:
    *
    * 5|   *
    * 4|   *
    * 3| * *
    * 2| * *    *
    * 1| *    *    *    *
    *  +--
    *  10   50   100
    *
    * BlockLatencyHistogram histogram = {
    * .nbins = 4,
    * .intervals = {10, 50, 100},
    * .bins = {3, 1, 5, 2},
    * };


The name 'intervals' is still slightly ambiguous: does it hold the 
boundary point (0-10 for 10 slots, 10-50 for 40 slots, 50-100, for 50 
slots, then 100-INF) or is it the interval size of each slot (first 
bin is 10 slots for 0-10, next bin is 50 slots wide so 10-60, next bin 
is 100 slots wide so 60-160, everything else is 160-INF).  But the 
ascii-art diagram plus the text is sufficient to resolve the intent if 
you keep that name (I don't have a suggestion for a better name).




Hm, these numbers are actually the boundary points of the histogram 
intervals, not the intervals themselves. And Wikipedia says "The bins are 
usually specified as consecutive, non-overlapping intervals of a variable.", 
so intervals are bins.


So, what about:

1. interval_boundaries
2. bin_boundaries
3. boundaries

(and same names with s/_/-/ for qapi)


--
Best regards,
Vladimir




Re: [Qemu-block] [PATCH v3 0/4] vl: introduce vm_shutdown()

2018-03-08 Thread Stefan Hajnoczi
On Wed, Mar 07, 2018 at 02:42:01PM +, Stefan Hajnoczi wrote:
> v3:
>  * Rebase on qemu.git/master after AIO_WAIT_WHILE() was merged [Fam]
> v2:
>  * Tackle the .ioeventfd_stop() vs vq handler race by removing the ioeventfd
>from a BH in the IOThread [Fam]
> 
> There are several race conditions in virtio-blk/virtio-scsi dataplane code.
> This patch series addresses them, see the commit description for details on 
> the
> individual cases.
> 
> Stefan Hajnoczi (4):
>   block: add aio_wait_bh_oneshot()
>   virtio-blk: fix race between .ioeventfd_stop() and vq handler
>   virtio-scsi: fix race between .ioeventfd_stop() and vq handler
>   vl: introduce vm_shutdown()
> 
>  include/block/aio-wait.h| 13 +
>  include/sysemu/iothread.h   |  1 -
>  include/sysemu/sysemu.h |  1 +
>  cpus.c  | 16 +---
>  hw/block/dataplane/virtio-blk.c | 24 +---
>  hw/scsi/virtio-scsi-dataplane.c |  9 +
>  iothread.c  | 31 ---
>  util/aio-wait.c | 31 +++
>  vl.c| 13 +++--
>  9 files changed, 83 insertions(+), 56 deletions(-)
> 
> -- 
> 2.14.3
> 
> 

Thanks, applied to my block tree:
https://github.com/stefanha/qemu/commits/block

Stefan




Re: [Qemu-block] [Qemu-devel] [PATCH 0/2] block: fix nbd-server-stop crash after blockdev-snapshot-sync

2018-03-08 Thread Stefan Hajnoczi
On Wed, Mar 07, 2018 at 05:27:45PM -0600, Eric Blake wrote:
> On 03/06/2018 02:48 PM, Stefan Hajnoczi wrote:
> > The blockdev-snapshot-sync command uses bdrv_append() to update all parents 
> > to
> > point at the external snapshot node.  This breaks BlockBackend's
> > blk_add/remove_aio_context_notifier(), which doesn't expect a BDS change.
> > 
> > Patch 1 fixes this by tracking AioContext notifiers in BlockBackend.
> > 
> > See the test case in Patch 2 for a reproducer.
> > 
> > Stefan Hajnoczi (2):
> >block: let blk_add/remove_aio_context_notifier() tolerate BDS changes
> >iotests: add 208 nbd-server + blockdev-snapshot-sync test case
> > 
> >   block/block-backend.c  | 63 
> > ++
> >   block/trace-events |  2 ++
> >   tests/qemu-iotests/208 | 55 
> >   tests/qemu-iotests/208.out |  9 +++
> >   tests/qemu-iotests/group   |  1 +
> >   5 files changed, 130 insertions(+)
> >   create mode 100755 tests/qemu-iotests/208
> >   create mode 100644 tests/qemu-iotests/208.out
> 
> Whose tree should this series go through?  MAINTAINERS didn't flag it as
> directly touching any files that normally affect my NBD queue, but given
> that the iotest that reproduces the problem uses NBD, I'm fine if you want
> it to go through me.

Good question.  Max and Kevin maintain block/block-backend.c so one of
them should be happy with this series before it gets merged.

When a patch affects multiple trees, the last sub-maintainer to review
it can do the merge.

So if they have already posted their R-b when you are finished, then
feel free to merge it!  And vice versa.

Stefan




Re: [Qemu-block] [Qemu-devel] [PATCH v2 1/2] block/accounting: introduce latency histogram

2018-03-08 Thread Eric Blake

On 03/06/2018 09:32 AM, Stefan Hajnoczi wrote:

On Wed, Feb 07, 2018 at 03:50:36PM +0300, Vladimir Sementsov-Ogievskiy wrote:

Introduce latency histogram statistics for block devices.
For each accounted operation type, the latency region [0, +inf) is
divided into subregions by several points, and hits in each
subregion are counted.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---



According to Wikipedia and Mathworld, "intervals" and "bins" are
commonly used terms:
https://en.wikipedia.org/wiki/Histogram
http://mathworld.wolfram.com/Histogram.html

I suggest:

   typedef struct {
   /* The following histogram is represented like this:
*
* 5|   *
* 4|   *
* 3| * *
* 2| * *    *
* 1| *    *    *    *
*  +--
*  10   50   100
*
* BlockLatencyHistogram histogram = {
* .nbins = 4,
* .intervals = {10, 50, 100},
* .bins = {3, 1, 5, 2},
* };


The name 'intervals' is still slightly ambiguous: does it hold the 
boundary point (0-10 for 10 slots, 10-50 for 40 slots, 50-100, for 50 
slots, then 100-INF) or is it the interval size of each slot (first bin 
is 10 slots for 0-10, next bin is 50 slots wide so 10-60, next bin is 
100 slots wide so 60-160, everything else is 160-INF).  But the 
ascii-art diagram plus the text is sufficient to resolve the intent if 
you keep that name (I don't have a suggestion for a better name).


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: [Qemu-block] [Qemu-devel] [PATCH v2 0/2] block latency histogram

2018-03-08 Thread Eric Blake

On 03/06/2018 10:00 AM, Stefan Hajnoczi wrote:

On Wed, Feb 07, 2018 at 03:50:35PM +0300, Vladimir Sementsov-Ogievskiy wrote:

v2:

01: add block_latency_histogram_clear()
02: fix spelling (sorry =()
 some rewordings
 remove histogram if latency parameter unspecified

Vladimir Sementsov-Ogievskiy (2):
   block/accounting: introduce latency histogram
   qapi: add block latency histogram interface

  qapi/block-core.json   | 73 +-
  include/block/accounting.h |  9 +
  block/accounting.c | 97 ++
  block/qapi.c   | 31 +++
  blockdev.c | 19 +
  5 files changed, 228 insertions(+), 1 deletion(-)


The feature looks good.  I posted documentation and code readability
suggestions.



I also think it makes sense; if a v3 hits the list in time, I will 
probably include it in my last QAPI pull before softfreeze.


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: [Qemu-block] Limiting coroutine stack usage

2018-03-08 Thread Stefan Hajnoczi
On Wed, Mar 07, 2018 at 09:36:38PM +0100, Peter Lieven wrote:
> Am 06.03.2018 um 12:51 schrieb Stefan Hajnoczi:
> > On Tue, Feb 20, 2018 at 06:04:02PM +0100, Peter Lieven wrote:
> >> I remember we discussed a long time ago to limit the stack usage of all 
> >> functions that are executed in a coroutine
> >> context to a very low value to be able to safely limit the coroutine stack 
> >> size as well.
> >>
> >> I checked through all functions in block/, migration/ and nbd/ and there 
> >> are only very few larger or unbound stack
> >> allocations that can easily be fixed.
> >>
> >> Now my question: Is there an easy way to add a cflag like 
> >> -Wstack-usage=2048 to all objects in a given directory only?
> >> I tried to add a limit to the whole project, but fixing this will be a 
> >> larger task.
> > 2KB is fine for QEMU code but actual coroutine stack sizes will have to
> > be at least 8KB, I guess, in order for third-party libraries to work
> > (e.g. curl, rbd).  PATH_MAX is 4KB on Linux.
> >
> > Nested event loops in QEMU code can also result in deep call stacks.
> > This happens when aio_poll() invokes an fd handler or BH that also
> > invokes aio_poll().
> 
> The plan was to limit the stack usage only as a compiler option. I would 
> leave the coroutine stack size at 1MB
> for now until we have a way to identify the worst case usage.

I'm not sure we'll be able to confidently set a small stack size, but a
compile-time check doesn't hurt.

Maybe someday we'll switch to stackless coroutines.  Clang now has
coroutines support and maybe gcc will get it too (for C++20).  If the
compiler transforms the code then coroutine state can be kept in a
smaller data area that is not a call stack.  A single per-thread call
stack is used to run any coroutine and only during coroutine execution.

Stefan




Re: [Qemu-block] [PATCH v3 0/4] vl: introduce vm_shutdown()

2018-03-08 Thread Paolo Bonzini
On 08/03/2018 17:01, Michael S. Tsirkin wrote:
> On Wed, Mar 07, 2018 at 02:42:01PM +, Stefan Hajnoczi wrote:
>> v3:
>>  * Rebase on qemu.git/master after AIO_WAIT_WHILE() was merged [Fam]
>> v2:
>>  * Tackle the .ioeventfd_stop() vs vq handler race by removing the ioeventfd
>>from a BH in the IOThread [Fam]
> 
> Acked-by: Michael S. Tsirkin 
> 
> who is merging this?

Probably Stefan himself?  Just in case, for 4/4:

Acked-by: Paolo Bonzini 

>> There are several race conditions in virtio-blk/virtio-scsi dataplane code.
>> This patch series addresses them, see the commit description for details on 
>> the
>> individual cases.
>>
>> Stefan Hajnoczi (4):
>>   block: add aio_wait_bh_oneshot()
>>   virtio-blk: fix race between .ioeventfd_stop() and vq handler
>>   virtio-scsi: fix race between .ioeventfd_stop() and vq handler
>>   vl: introduce vm_shutdown()
>>
>>  include/block/aio-wait.h| 13 +
>>  include/sysemu/iothread.h   |  1 -
>>  include/sysemu/sysemu.h |  1 +
>>  cpus.c  | 16 +---
>>  hw/block/dataplane/virtio-blk.c | 24 +---
>>  hw/scsi/virtio-scsi-dataplane.c |  9 +
>>  iothread.c  | 31 ---
>>  util/aio-wait.c | 31 +++
>>  vl.c| 13 +++--
>>  9 files changed, 83 insertions(+), 56 deletions(-)
>>
>> -- 
>> 2.14.3




Re: [Qemu-block] [PATCH] nbd/server: Honor FUA request on NBD_CMD_TRIM

2018-03-08 Thread Eric Blake

On 03/08/2018 09:22 AM, Paolo Bonzini wrote:

TRIM requests should not need FUA since they're just advisory.


Still, while you argue that TRIM is advisory (which I agree), if it does
nothing, then you've (implicitly) honored FUA (that transaction didn't
affect persistent storage, so you didn't have to wait any longer for
anything to land); but if it DOES change the disk contents, then waiting
for that change to land IS worth supporting, hence why the NBD protocol
requires the FUA flag to be honored on trim.


But if power fails, after restart you cannot see the difference between
a TRIM command that chose to do nothing, and one that chose to change
the disk contents but failed to persist the changes.  This is why I
thought there is no need for FUA.

I suppose in principle you could detect the change by reading the
TRIMmed sectors and writing to another disk.  So TRIM would have to be a
Schroedinger command that is persistent once you read the sectors, and
that makes little sense.  The problem is, SCSI doesn't have a FUA flag
either...


The documentation of NBD_CMD_TRIM says that in general you must not 
expect reliable read results from the area you trimmed (since the 
command is advisory, you don't know if you would read the old data 
unchanged, all zeroes, or even random unrelated data).  But if you know 
that a particular server treats TRIM as mandatory rather than advisory, 
and also guarantees a reads-as-zero after a successful TRIM, then for 
that particular server, the FUA flag on TRIM makes sense.  The 
documentation for NBD_CMD_BLOCK_STATUS also points out that block status 
may, but not must, be altered by NBD_CMD_TRIM, which might be another 
way to observer how much of a TRIM request was advisory.


At any rate, your argument makes sense that because bdrv_pdiscard() is 
advisory, we can't tell whether it made a difference, and therefore 
waiting for it to make a difference isn't worthwhile, and therefore 
plumbing BDRV_REQ_FUA through the block layer for bdrv_pdiscard() is 
pointless.  At this point, I will just go ahead and add the flush for 
qemu as NBD server if it ever sees NBD_CMD_TRIM + FUA (which is unlikely 
to happen in practice, as most clients are smart enough to realize that 
TRIM is advisory and reading after TRIM is unreliable anyways, so 
waiting for the TRIM to land is pointless); and qemu as a client will 
probably never send NBD_CMD_TRIM + FUA.
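
(For illustration only, a sketch of that fallback using the blk_* helpers
already used in nbd/server.c - an assumption about what such a change could
look like, not the actual patch:)

    /* Sketch: emulate FUA on NBD_CMD_TRIM with an explicit flush, since
     * bdrv_co_pdiscard() currently takes no flags argument. */
    ret = blk_co_pdiscard(exp->blk, request->from + exp->dev_offset,
                          request->len);
    if (ret == 0 && (request->flags & NBD_CMD_FLAG_FUA)) {
        ret = blk_co_flush(exp->blk);
    }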


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: [Qemu-block] [PATCH] virtio-blk: dataplane: Don't batch notifications if EVENT_IDX is present

2018-03-08 Thread Stefan Hajnoczi
On Wed, Mar 07, 2018 at 12:44:59PM +0100, Sergio Lopez wrote:
> Commit 5b2ffbe4d99843fd8305c573a100047a8c962327 ("virtio-blk: dataplane:
> notify guest as a batch") deferred guest notification to a BH in order to
> batch notifications, with the purpose of avoiding flooding the guest with
> interrupts.
> 
> This optimization came at a cost. The average latency perceived in the
> guest is increased by a few microseconds, but also, when multiple IO
> operations finish at the same time, the guest won't be notified until
> all completions from each operation have been run. On the contrary,
> virtio-scsi issues the notification at the end of each completion.
> 
> On the other hand, nowadays we have the EVENT_IDX feature that allows a
> better coordination between QEMU and the Guest OS to avoid sending
> unnecessary interruptions.
> 
> With this change, virtio-blk/dataplane only batches notifications if the
> EVENT_IDX feature is not present.
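
(For illustration, a rough sketch of the completion-path decision being
described; the struct and field names here are assumptions, not the actual
patch:)

    /* Sketch: with EVENT_IDX the guest can suppress interrupts it does not
     * need, so notify right away; otherwise keep batching via the BH. */
    if (virtio_vdev_has_feature(s->vdev, VIRTIO_RING_F_EVENT_IDX)) {
        virtio_notify_irqfd(s->vdev, req->vq);
    } else {
        qemu_bh_schedule(s->bh);
    }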
> 
> Some numbers obtained with fio (ioengine=sync, iodepth=1, direct=1):
>  - Test specs:
>* fio-3.4 (ioengine=sync, iodepth=1, direct=1)
>* qemu master
>* virtio-blk with a dedicated iothread (default poll-max-ns)
>* backend: null_blk nr_devices=1 irqmode=2 completion_nsec=28
>* 8 vCPUs pinned to isolated physical cores
>* Emulator and iothread also pinned to separate isolated cores
>* variance between runs < 1%
> 
>  - Not patched
>* numjobs=1:  lat_avg=327.32  irqs=29998
>* numjobs=4:  lat_avg=337.89  irqs=29073
>* numjobs=8:  lat_avg=342.98  irqs=28643
> 
>  - Patched:
>* numjobs=1:  lat_avg=323.92  irqs=30262
>* numjobs=4:  lat_avg=332.65  irqs=29520
>* numjobs=8:  lat_avg=335.54  irqs=29323
> 
> Signed-off-by: Sergio Lopez 
> ---
>  hw/block/dataplane/virtio-blk.c | 15 +--
>  1 file changed, 13 insertions(+), 2 deletions(-)

Thanks, applied to my block tree!

I merged this quickly because I don't want to forget this patch for the
upcoming QEMU 2.12 release.  If anyone has questions or wants to
discuss, please go ahead and I can hold back the patch.

https://github.com/stefanha/qemu/commits/block

Stefan




Re: [Qemu-block] [PATCH v3 0/4] vl: introduce vm_shutdown()

2018-03-08 Thread Michael S. Tsirkin
On Wed, Mar 07, 2018 at 02:42:01PM +, Stefan Hajnoczi wrote:
> v3:
>  * Rebase on qemu.git/master after AIO_WAIT_WHILE() was merged [Fam]
> v2:
>  * Tackle the .ioeventfd_stop() vs vq handler race by removing the ioeventfd
>from a BH in the IOThread [Fam]

Acked-by: Michael S. Tsirkin 

who is merging this?

> There are several race conditions in virtio-blk/virtio-scsi dataplane code.
> This patch series addresses them, see the commit description for details on 
> the
> individual cases.
> 
> Stefan Hajnoczi (4):
>   block: add aio_wait_bh_oneshot()
>   virtio-blk: fix race between .ioeventfd_stop() and vq handler
>   virtio-scsi: fix race between .ioeventfd_stop() and vq handler
>   vl: introduce vm_shutdown()
> 
>  include/block/aio-wait.h| 13 +
>  include/sysemu/iothread.h   |  1 -
>  include/sysemu/sysemu.h |  1 +
>  cpus.c  | 16 +---
>  hw/block/dataplane/virtio-blk.c | 24 +---
>  hw/scsi/virtio-scsi-dataplane.c |  9 +
>  iothread.c  | 31 ---
>  util/aio-wait.c | 31 +++
>  vl.c| 13 +++--
>  9 files changed, 83 insertions(+), 56 deletions(-)
> 
> -- 
> 2.14.3



Re: [Qemu-block] [PATCH] nbd/server: Honor FUA request on NBD_CMD_TRIM

2018-03-08 Thread Paolo Bonzini
On 08/03/2018 15:45, Eric Blake wrote:
> On 03/08/2018 12:50 AM, Paolo Bonzini wrote:
>>> The NBD spec states that since trim requests can affect disk contents,
>>> then they should allow for FUA semantics just like writes for ensuring
>>> the disk has settled before returning.  As bdrv_[co_]pdiscard() does
>>> not (yet?) support a flags argument, we can't pass FUA down the block
>>> layer stack, and must therefore emulate it with a flush at the NBD
>>> layer.
>>
>> TRIM requests should not need FUA since they're just advisory.
> 
> Still, while you argue that TRIM is advisory (which I agree), if it does
> nothing, then you've (implicitly) honored FUA (that transaction didn't
> affect persistent storage, so you didn't have to wait any longer for
> anything to land); but if it DOES change the disk contents, then waiting
> for that change to land IS worth supporting, hence why the NBD protocol
> requires the FUA flag to be honored on trim.

But if power fails, after restart you cannot see the difference between
a TRIM command that chose to do nothing, and one that chose to change
the disk contents but failed to persist the changes.  This is why I
thought there is no need for FUA.

I suppose in principle you could detect the change by reading the
TRIMmed sectors and writing to another disk.  So TRIM would have to be a
Schroedinger command that is persistent once you read the sectors, and
that makes little sense.  The problem is, SCSI doesn't have a FUA flag
either...

Paolo




Re: [Qemu-block] [PATCH] nbd/server: fix sparse read

2018-03-08 Thread Vladimir Sementsov-Ogievskiy

08.03.2018 18:17, Vladimir Sementsov-Ogievskiy wrote:

08.03.2018 14:50, Vladimir Sementsov-Ogievskiy wrote:

05.03.2018 22:47, Eric Blake wrote:

On 03/05/2018 12:04 PM, Vladimir Sementsov-Ogievskiy wrote:

In case of an I/O error in nbd_co_send_sparse_read we should not
"goto reply:", as it is a fatal error and the common behavior is to
disconnect in this case. We should not try to send the client an
error reply representing a channel I/O error from the previous
attempt to send a reply.


Good catch.



Fix this by handling the block-status error in nbd_co_send_sparse_read,
so that nbd_co_send_sparse_read fails only on an I/O error. Then just skip
the common "reply:" code path in nbd_trip.

Note: nbd_co_send_structured_error is moved without changes to be
called from nbd_co_send_sparse_read.


Might be easier to read as two patches, one for the code motion, the 
other for using the new code.  But I'm not going to insist on a split.




Signed-off-by: Vladimir Sementsov-Ogievskiy 
---

PS: this all shows that nbd_trip is too complex (and yes, I remember
that its final semantics was shaped by myself :(.. It should be
refactored into several smaller parts. Do you have any ideas?

The complexity here is that we should handle channel errors (fatal)
and export errors (non-fatal) in different ways, and both error types
may have an errp, which should be handled differently too..

Maybe the best way is to make separate functions for each command,
avoiding code duplication by using helper functions instead of the
common code in nbd_trip.


Yes, splitting nbd_trip into smaller helper functions may be 
worthwhile.




  nbd/server.c | 64 
+++-

  1 file changed, 37 insertions(+), 27 deletions(-)

diff --git a/nbd/server.c b/nbd/server.c
index 4990a5826e..ea6b9467e4 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -21,6 +21,7 @@
  #include "qapi/error.h"
  #include "trace.h"
  #include "nbd-internal.h"
+#include "qemu/error-report.h"


I'm a bit worried about this one.  The server has previously not 
needed this, so it may be the wrong thing to pull it in now.




+
  static int coroutine_fn nbd_co_send_sparse_read(NBDClient *client,
  uint64_t handle,
  uint64_t offset,


It doesn't help that we are lacking comments on the contract of this 
function.


@@ -1361,8 +1386,15 @@ static int coroutine_fn 
nbd_co_send_sparse_read(NBDClient *client,

  bool final;
    if (status < 0) {
-    error_setg_errno(errp, -status, "unable to check for 
holes");

-    return status;
+    char *msg = g_strdup_printf("unable to check for 
holes: %s",

+ strerror(-status));


Indentation looks off.


+
+    error_report("%s", msg);


Do we really need to report this on the server's stderr, or is 
sending the reply to the client good enough?



I'm just trying to preserve the current behavior in nbd_trip... Why not? Is 
it fair that the client knows about some strange I/O error on the server 
side, but the server logs lack this information?


Maybe it is worth switching from error_report to trace, but that should 
be a separate patch, changing all the places around.


Stop - it is already traced in nbd_co_send_simple_reply and 
nbd_co_send_structured_error. So we can safely drop it.







and it needs "qemu/error-report.h".. nbd_trip uses error_report_err 
instead. error_report_err calls error_report anyway.





+
+    ret = nbd_co_send_structured_error(client, handle, 
-status, msg,

+   errp);
+    g_free(msg);
+    return ret;


So if we have an early return here, then pre-patch, we were 
unconditionally setting errp and returning -1 (even if the client is 
still alive); post-patch errp is only set if we failed to do I/O 
with the client.  That change is right, but harder to see without 
comments giving a function contract.



@@ -1567,7 +1575,7 @@ static coroutine_fn void nbd_trip(void *opaque)
    request.from, 
req->data, request.len,

    &local_err);
  if (ret < 0) {
-    goto reply;
+    goto replied;
  }
  goto done;
  }
@@ -1664,6 +1672,8 @@ reply:
 req->data, reply_data_len, 
&local_err);

  }
  g_free(msg);
+
+replied:


This part makes sense.


  if (ret < 0) {
  error_prepend(&local_err, "Failed to send reply: ");
  goto disconnect;












--
Best regards,
Vladimir




Re: [Qemu-block] [PATCH] nbd/server: fix sparse read

2018-03-08 Thread Vladimir Sementsov-Ogievskiy

08.03.2018 14:50, Vladimir Sementsov-Ogievskiy wrote:

05.03.2018 22:47, Eric Blake wrote:

On 03/05/2018 12:04 PM, Vladimir Sementsov-Ogievskiy wrote:

In case of an I/O error in nbd_co_send_sparse_read we should not
"goto reply:", as it is a fatal error and the common behavior is to
disconnect in this case. We should not try to send the client an
error reply representing a channel I/O error from the previous
attempt to send a reply.


Good catch.



Fix this by handling the block-status error in nbd_co_send_sparse_read,
so that nbd_co_send_sparse_read fails only on an I/O error. Then just skip
the common "reply:" code path in nbd_trip.

Note: nbd_co_send_structured_error is moved without changes to be
called from nbd_co_send_sparse_read.


Might be easier to read as two patches, one for the code motion, the 
other for using the new code.  But I'm not going to insist on a split.




Signed-off-by: Vladimir Sementsov-Ogievskiy 
---

PS: this all shows that nbd_trip is too complex (and yes, I remember
that its final semantics was shaped by myself :(.. It should be
refactored into several smaller parts. Do you have any ideas?

The complexity here is that we should handle channel errors (fatal) and
export errors (non-fatal) in different ways, and both error types may
have an errp, which should be handled differently too..

Maybe the best way is to make separate functions for each command,
avoiding code duplication by using helper functions instead of the
common code in nbd_trip.


Yes, splitting nbd_trip into smaller helper functions may be worthwhile.



  nbd/server.c | 64 
+++-

  1 file changed, 37 insertions(+), 27 deletions(-)

diff --git a/nbd/server.c b/nbd/server.c
index 4990a5826e..ea6b9467e4 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -21,6 +21,7 @@
  #include "qapi/error.h"
  #include "trace.h"
  #include "nbd-internal.h"
+#include "qemu/error-report.h"


I'm a bit worried about this one.  The server has previously not 
needed this, so it may be the wrong thing to pull it in now.




+
  static int coroutine_fn nbd_co_send_sparse_read(NBDClient *client,
  uint64_t handle,
  uint64_t offset,


It doesn't help that we are lacking comments on the contract of this 
function.


@@ -1361,8 +1386,15 @@ static int coroutine_fn 
nbd_co_send_sparse_read(NBDClient *client,

  bool final;
    if (status < 0) {
-    error_setg_errno(errp, -status, "unable to check for 
holes");

-    return status;
+    char *msg = g_strdup_printf("unable to check for holes: 
%s",

+ strerror(-status));


Indentation looks off.


+
+    error_report("%s", msg);


Do we really need to report this on the server's stderr, or is 
sending the reply to the client good enough?



I'm just trying to preserve the current behavior in nbd_trip... Why not? Is it 
fair that the client knows about some strange I/O error on the server side, but 
the server logs lack this information?


Maybe it is worth switching from error_report to trace, but that should 
be a separate patch, changing all the places around.




and it needs "qemu/error-report.h".. nbd_trip uses error_report_err 
instead. error_report_err calls error_report anyway.





+
+    ret = nbd_co_send_structured_error(client, handle, 
-status, msg,

+   errp);
+    g_free(msg);
+    return ret;


So if we have an early return here, then pre-patch, we were 
unconditionally setting errp and returning -1 (even if the client is 
still alive); post-patch errp is only set if we failed to do I/O with 
the client.  That change is right, but harder to see without comments 
giving a function contract.



@@ -1567,7 +1575,7 @@ static coroutine_fn void nbd_trip(void *opaque)
    request.from, req->data, 
request.len,

    &local_err);
  if (ret < 0) {
-    goto reply;
+    goto replied;
  }
  goto done;
  }
@@ -1664,6 +1672,8 @@ reply:
 req->data, reply_data_len, 
&local_err);

  }
  g_free(msg);
+
+replied:


This part makes sense.


  if (ret < 0) {
  error_prepend(&local_err, "Failed to send reply: ");
  goto disconnect;









--
Best regards,
Vladimir




Re: [Qemu-block] [PATCH] nbd/server: Honor FUA request on NBD_CMD_TRIM

2018-03-08 Thread Eric Blake

On 03/08/2018 12:50 AM, Paolo Bonzini wrote:



The NBD spec states that since trim requests can affect disk contents,
then they should allow for FUA semantics just like writes for ensuring
the disk has settled before returning.  As bdrv_[co_]pdiscard() does
not (yet?) support a flags argument, we can't pass FUA down the block
layer stack, and must therefore emulate it with a flush at the NBD
layer.


TRIM requests should not need FUA since they're just advisory.  On
the other hand, WRITE ZEROES requests need to support FUA.


It looks like qemu's NBD implementation properly supports FUA on WRITE 
ZEROES requests.  The block layer in bdrv_co_do_pwrite_zeroes checks for 
FUA support for both an actual zero request to the driver, as well as 
any fallback write requests, and does a followup flush if any driver 
request did not support BDRV_REQ_FUA.  Even when BDRV_REQ_MAY_UNMAP is 
passed through to the driver and allows the zeroes to be written by a 
trim, then either that trim must also honor FUA semantics 
(block/nbd{,-client}.c does this), or the driver can't advertise FUA 
support on zeroes (block/iscsi.c does this) and the block layer fallback 
kicks in.  So I think we're fine on that front.


Still, while you argue that TRIM is advisory (which I agree), if it does 
nothing, then you've (implicitly) honored FUA (that transaction didn't 
affect persistent storage, so you didn't have to wait any longer for 
anything to land); but if it DOES change the disk contents, then waiting 
for that change to land IS worth supporting, hence why the NBD protocol 
requires the FUA flag to be honored on trim.


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: [Qemu-block] [PATCH 4/5] migration/block: limit the number of parallel I/O requests

2018-03-08 Thread Peter Lieven

Am 08.03.2018 um 13:50 schrieb Juan Quintela:

Peter Lieven  wrote:

the current implementation submits up to 512 I/O requests in parallel
which is much too high, especially for a background task.
This patch adds a maximum limit of 16 I/O requests that can
be submitted in parallel to avoid monopolizing the I/O device.

Signed-off-by: Peter Lieven 

This code is already a mess (TM).  It is difficult to understand what we
are doing, so not easy to see if your changes are better/worse than what
we have.

I am not sure that your solution help to improve things here.  Let's see
what I understand.

We have three fields (without a single comment):

- submitted: this is the number of blocks that we have asked the block
  device to read asynchronously to main memory, and that
  haven't been read yet.  I.e. "blocks_read_pending" would be a
  better name?

- read_done: this is the number of blocks that we have finished reading
  asynchronously from this block device.  When we finish, we
  do a submitted -- and a read_done++.  blocks_read_finished
  name?


Correct. The names should be adjusted.



- transferred: This is the number of blocks that we have transferred
since the beginning of migration.  At this point, we do a
read_done-- and a transferred++

Note also that we do malloc()/free() for each block


Yes, but that is a different story.



So, now that we have defined what our fields mean, we need to know what
is our test.  block_save_pending():

 get_remaining_dirty() + (submitted + read_done) * BLOCK_SIZE

Looks good.

But let's us see what test we do in block_save_iterate() (Yes, I have
been very liberal with reformatting and removal of struct names):

(submitted + read_done) * BLOCK_SIZE <  qemu_file_get_rate_limit(f) &&
(submitted + read_done) < MAX_INFLIGHT_IO

The idea of that test is to make sure that we _don't send_ through the
QEMUFile more than qemu_file_get_rate_limit(f).  But there are several
things here:
- we have already issued a flush_blks() call before we enter the loop
   And it is entirely possible that we have already sent too much data at
   this point, but we enter the while loop anyways.

   Notice that flush_blks() does the right thing and test in each loop if
   qemu_file_rate_limit() has been reached and stops sending more data if
   it returns true;

- At this point, we *could* have sent all that can be sent for this
   round, but we enter the loop anyways.  And we test two things:
  - that we haven't read from block devices more than
qemu_file_get_rate_limit() bytes (notice that this is the maximum
that we could put through the migration channel, not really
related  with what we read from block devices).

  - that we haven't read in this round more than MAX_INFLIGHT_IO
blocks.  That is 512 blocks, at BLOCK_SIZE bytes is 512MB. Why?
Who knows.


The idea behind this was only to limit the maximum memory that is allocated.
That was not meant as a rate limit for the storage backend.



- Now we exit the while loop, and we have pending blocks to send, the
 minimum between:
- qemu_file_get_rate_limit()/BLOCK_SIZE, and
- MAX_INFLIGHT_IO

But not all of them are ready to send, only "read_done" blocks are
ready, the "submitted" ones are still waiting for read completion.


Thats what I tried to address in patch 5.



And we call back flush_blks() that will try to send all the "read_done"
blocks through the migration channel until we hit
qemu_file_rate_limit().

So, looking at the problem from far away:

- how many read requests do we need to have in flight at any moment? Is
   the 16 from this series the right number? Notice that each request
   is 1MB, so this is 16MB (I have no clue what the right value is).


I chose that value because it helped to reduce the stalls in the
guest that I have been seeing. Stefan also said that 16 would be a good
value for a background task. 512 is definitely too much.



- how many blocks should we get on each round.  Notice that the reason
   for the 1st patch on this series is because the block layer is not
   sending enough blocks to prevent ram migration to start.  If there are
   enough dirty memory sent from the block layer, we shouldn't ever enter
   the ram stage.
   Notice how we register them in vl.c:

 blk_mig_init();
 ram_mig_init();


The idea of the 1st patch was to skip RAM migration until we have completed
the first round of block migration (aka the bulk phase), as this will take a
long time.
After that we are only doing incremental updates. You are right, this might still
be too early, but starting after the bulk phase was an easy choice without
introducing another heuristic.



So, after such a long mail, do I have a suggestion?

- should we make MAX_PARALLEL_IO auto-tuning?  I.e. if we are not
   being able to get 

Re: [Qemu-block] [PULL 0/6] Qio next patches

2018-03-08 Thread Peter Maydell
On 7 March 2018 at 11:25, Daniel P. Berrangé  wrote:
> The following changes since commit f2bb2d14c2958f3f5aef456bd2cdb1ff99f4a562:
>
>   Merge remote-tracking branch 'remotes/stefanha/tags/block-pull-request' 
> into staging (2018-03-05 16:41:20 +)
>
> are available in the Git repository at:
>
>   https://github.com/berrange/qemu tags/qio-next-pull-request
>
> for you to fetch changes up to 1939ccdaa61ce6a1f57d83277b3d41d3a9ad3c58:
>
>   qio: non-default context for TLS handshake (2018-03-06 10:19:07 +)
>
> 
>
> 
>
> Peter Xu (6):
>   qio: rename qio_task_thread_result
>   qio: introduce qio_channel_add_watch_{full|source}
>   qio: store gsources for net listeners
>   qio: non-default context for threaded qtask
>   qio: non-default context for async conn
>   qio: non-default context for TLS handshake
>

Applied, thanks.

-- PMM



Re: [Qemu-block] [PATCH 4/5] migration/block: limit the number of parallel I/O requests

2018-03-08 Thread Juan Quintela
Peter Lieven  wrote:
> the current implementation submits up to 512 I/O requests in parallel
> which is much too high, especially for a background task.
> This patch adds a maximum limit of 16 I/O requests that can
> be submitted in parallel to avoid monopolizing the I/O device.
>
> Signed-off-by: Peter Lieven 

This code is already a mess (TM).  It is difficult to understand what we
are doing, so it is not easy to see whether your changes are better or worse
than what we have.

I am not sure that your solution helps to improve things here.  Let's see
what I understand.

We have three fields (without a single comment):

- submitted: this is the number of blocks that we have asked the block
 device to read asynchronously to main memory, and that
 haven't been read yet.  I.e. "blocks_read_pending" would be a
 better name?

- read_done: this is the number of blocks that we have finished reading
 asynchronously from this block device.  When a read finishes, we
 do a submitted-- and a read_done++.  "blocks_read_finished" as a
 name?

- transferred: This is the number of blocks that we have transferred
   since the beginning of migration.  At this point, we do a
   read_done-- and a transferred++

Note also that we do malloc()/free() for each block
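To make that concrete, this is how I read those counters (a sketch with my
own comments; only the field names come from migration/block.c):

typedef struct BlkMigCountersSketch {
    int submitted;    /* async reads issued to the block device and not yet
                       * completed ("blocks_read_pending") */
    int read_done;    /* reads completed, buffers queued for the migration
                       * stream ("blocks_read_finished") */
    int transferred;  /* blocks already written to the QEMUFile since the
                       * start of migration */
} BlkMigCountersSketch;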

So, now that we have defined what our fields mean, we need to know what
is our test.  block_save_pending():

get_remaining_dirty() + (submitted + read_done) * BLOCK_SIZE

Looks good.

But let us see what test we do in block_save_iterate() (Yes, I have
been very liberal with reformatting and removal of struct names):

(submitted + read_done) * BLOCK_SIZE <  qemu_file_get_rate_limit(f) &&
(submitted + read_done) < MAX_INFLIGHT_IO

The idea of that test is to make sure that we _don't send_ through the
QEMUFile more than qemu_file_get_rate_limit(f).  But there are several
things here:
- we have already issued a flush_blks() call before we enter the loop
  And it is quite possible that we have already sent too much data at
  this point, but we enter the while loop anyways.

  Notice that flush_blks() does the right thing and tests in each loop
  iteration whether qemu_file_rate_limit() has been reached, and stops
  sending more data if it returns true;

- At this point, we *could* have sent all that can be sent for this
  round, but we enter the loop anyways.  And we test two things:
 - that we haven't read from block devices more than
   qemu_file_get_rate_limit() bytes (notice that this is the maximum
   that we could put through the migration channel, not really
    related to what we read from block devices).

 - that we haven't read in this round more than MAX_INFLIGHT_IO
    blocks.  That is 512 blocks, which at BLOCK_SIZE bytes is 512MB. Why?
   Who knows.

- Now we exit the while loop, and we have pending blocks to send, the
minimum between:
   - qemu_file_get_rate_limit()/BLOCK_SIZE, and
   - MAX_INFLIGHT_IO

But not all of them are ready to send, only "read_done" blocks are
ready, the "submitted" ones are still waiting for read completion.

And we call back flush_blks() that will try to send all the "read_done"
blocks through the migration channel until we hit
qemu_file_rate_limit().
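
Put together, the flow I am describing looks roughly like this (a heavily
simplified, self-contained sketch; the stub helpers are stand-ins, only the
loop condition mirrors the real code):

#include <stdint.h>

#define BLOCK_SIZE      (1 << 20)   /* 1MB */
#define MAX_INFLIGHT_IO 512

static int submitted, read_done;

static void flush_blks_stub(int64_t rate_limit)
{
    (void)rate_limit;
    /* real code: write read_done blocks to the QEMUFile, stopping as soon
     * as qemu_file_rate_limit() reports that the limit has been reached */
}

static void submit_one_read_stub(void)
{
    /* real code: blk_aio_preadv() of one 1MB block; submitted++ here,
     * submitted--/read_done++ in the completion callback */
    submitted++;
}

static void block_save_iterate_sketch(int64_t rate_limit)
{
    flush_blks_stub(rate_limit);    /* may already exhaust this round's limit */

    while ((submitted + read_done) * BLOCK_SIZE < rate_limit &&
           (submitted + read_done) < MAX_INFLIGHT_IO) {
        submit_one_read_stub();
    }

    flush_blks_stub(rate_limit);    /* only read_done blocks can be sent;
                                     * submitted ones are still in flight */
}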

So, looking at the problem from far away:

- how many read requests do we have to have in flight at any moment? Is
  the 16 from this series the right number? Notice that each request
  is 1MB, so this is 16MB (I have no clue what the right value is).

- how many blocks should we get on each round?  Notice that the reason
  for the 1st patch in this series is that the block layer is not
  sending enough blocks to prevent ram migration from starting.  If enough
  dirty data is sent from the block layer, we shouldn't ever enter
  the ram stage.
  Notice how we register them in vl.c:

blk_mig_init();
ram_mig_init();

So, after such a long mail, do I have some suggestions?

- should we make MAX_PARALLEL_IO autotuning?  I.e. if we are not able to get
  qemu_file_get_rate_limit()/BLOCK_SIZE read blocks per iteration, should we
  increase the MAX_PARALLEL_IO limit? (see the rough sketch after this list)

- should we take into account how many blocks were transferred by the 1st
  call to flush_blks() and only wait for "read_blocks" until the next
  flush_blks() instead of for the whole set?
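
For the autotuning point, something along these lines is what I have in mind
(purely a hypothetical illustration, not a patch; the constants follow this
series, with MAX_IO_BUFFERS being the rename from patch 3/5):

#include <stdint.h>

#define BLOCK_SIZE     (1 << 20)
#define MAX_IO_BUFFERS 512          /* upper bound for in-flight buffers */

static int parallel_io_limit = 16;  /* start at the proposed MAX_PARALLEL_IO */

static void maybe_grow_parallel_io(int read_done_this_round, int64_t rate_limit)
{
    int wanted = (int)(rate_limit / BLOCK_SIZE);

    /* If we could not read enough blocks to fill the rate limit, allow more
     * reads in flight on the next iteration, capped at MAX_IO_BUFFERS. */
    if (read_done_this_round < wanted && parallel_io_limit < MAX_IO_BUFFERS) {
        parallel_io_limit *= 2;
        if (parallel_io_limit > MAX_IO_BUFFERS) {
            parallel_io_limit = MAX_IO_BUFFERS;
        }
    }
}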

Later, Juan.



Re: [Qemu-block] [PATCH] nbd/server: fix space read

2018-03-08 Thread Vladimir Sementsov-Ogievskiy

05.03.2018 22:47, Eric Blake wrote:

On 03/05/2018 12:04 PM, Vladimir Sementsov-Ogievskiy wrote:

In case of an io error in nbd_co_send_sparse_read we should not
"goto reply:", as it is a fatal error and the common behavior is to
disconnect in this case. We should not try to send the client an
error reply representing a channel-io error from the previous attempt
to send a reply.


Good catch.



Fix this by handling the block-status error in nbd_co_send_sparse_read,
so that nbd_co_send_sparse_read fails only on an io error. Then just skip
the common "reply:" code path in nbd_trip.

Note: nbd_co_send_structured_error is moved without changes to be
called from nbd_co_send_sparse_read.


Might be easier to read as two patches, one for the code motion, the 
other for using the new code.  But I'm not going to insist on a split.




Signed-off-by: Vladimir Sementsov-Ogievskiy 
---

PS: this all shows that nbd_trip is too complex (and yes, I remember
that its final semantics was shaped by myself :( ). It should be
refactored into several smaller parts. Do you have any ideas?

The complexity here is that we should handle channel errors (fatal) and
export errors (non-fatal) in different ways, and both error types may
have an errp, which should be handled differently too.

Maybe the best way is to make separate functions for each command,
avoiding code duplication by using helper functions instead of common code
in nbd_trip.


Yes, splitting nbd_trip into smaller helper functions may be worthwhile.
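
To make the suggestion concrete, a hypothetical shape inside nbd/server.c
(the nbd_handle_* names are invented here, they are not existing functions):

static int coroutine_fn nbd_handle_cmd_read(NBDClient *client,
                                            NBDRequest *request,
                                            uint8_t *data, Error **errp)
{
    /* the current NBD_CMD_READ path, including nbd_co_send_sparse_read(),
     * would move here */
    return 0;
}

static int coroutine_fn nbd_handle_request(NBDClient *client,
                                           NBDRequest *request,
                                           uint8_t *data, Error **errp)
{
    switch (request->type) {
    case NBD_CMD_READ:
        return nbd_handle_cmd_read(client, request, data, errp);
    /* ... NBD_CMD_WRITE, NBD_CMD_DISC, NBD_CMD_FLUSH, NBD_CMD_TRIM ... */
    default:
        /* export error: the caller sends an error reply instead of
         * disconnecting; only channel errors set errp and disconnect */
        return -EINVAL;
    }
}

That way nbd_trip() would only read the request, dispatch, and decide between
"send an error reply" and "disconnect".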



  nbd/server.c | 64 +++-
  1 file changed, 37 insertions(+), 27 deletions(-)

diff --git a/nbd/server.c b/nbd/server.c
index 4990a5826e..ea6b9467e4 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -21,6 +21,7 @@
  #include "qapi/error.h"
  #include "trace.h"
  #include "nbd-internal.h"
+#include "qemu/error-report.h"


I'm a bit worried about this one.  The server has previously not 
needed this, so it may be the wrong thing to pull it in now.




+
  static int coroutine_fn nbd_co_send_sparse_read(NBDClient *client,
  uint64_t handle,
  uint64_t offset,


It doesn't help that we are lacking comments on the contract of this 
function.


@@ -1361,8 +1386,15 @@ static int coroutine_fn nbd_co_send_sparse_read(NBDClient *client,
  bool final;
    if (status < 0) {
-    error_setg_errno(errp, -status, "unable to check for holes");
-    return status;
+    char *msg = g_strdup_printf("unable to check for holes: %s",
+ strerror(-status));


Indentation looks off.


+
+    error_report("%s", msg);


Do we really need to report this on the server's stderr, or is sending 
the reply to the client good enough?



I'm just trying to preserve the current behavior of nbd_trip... Why not? Is it
fair that the client knows about some strange io error on the server side, but
the server logs lack this information?


and that is why it needs "qemu/error-report.h". nbd_trip uses error_report_err
instead, and error_report_err calls error_report anyway.





+
+    ret = nbd_co_send_structured_error(client, handle, -status, msg,
+   errp);
+    g_free(msg);
+    return ret;


So if we have an early return here, then pre-patch, we were 
unconditionally setting errp and returning -1 (even if the client is 
still alive); post-patch errp is only set if we failed to do I/O with 
the client.  That change is right, but harder to see without comments 
giving a function contract.
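
For instance, a contract along these lines (my wording of the post-patch
behavior discussed above, not text from the patch itself):

/*
 * nbd_co_send_sparse_read:
 * Send a structured read reply for the requested range, splitting data
 * from holes.
 *
 * Returns 0 on success.
 * Returns a negative errno and sets @errp only for channel (client I/O)
 * errors; a block-status failure is turned into a structured error reply
 * to the client and does not set @errp.
 */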



@@ -1567,7 +1575,7 @@ static coroutine_fn void nbd_trip(void *opaque)
    request.from, req->data, request.len,
    &local_err);
  if (ret < 0) {
-    goto reply;
+    goto replied;
  }
  goto done;
  }
@@ -1664,6 +1672,8 @@ reply:
 req->data, reply_data_len, &local_err);
  }
  g_free(msg);
+
+replied:


This part makes sense.


  if (ret < 0) {
  error_prepend(&local_err, "Failed to send reply: ");
  goto disconnect;






--
Best regards,
Vladimir




Re: [Qemu-block] [Qemu-devel] [PATCH v2 0/2] block latency histogram

2018-03-08 Thread Vladimir Sementsov-Ogievskiy

Hi Emilio!

Looked through qdist; if I understand correctly, it saves each added element
(with a different value). That is not efficient for disk io timing - we'd have
too many elements. In my approach, the histogram doesn't grow: it initially
has several ranges and counts hits in each range.
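
Roughly like this (a minimal sketch of the idea, not the actual code from the
patches):

#include <stdint.h>

typedef struct HistogramSketch {
    uint64_t *boundaries;   /* nbins - 1 interval boundaries, sorted, in ns */
    uint64_t *bins;         /* one hit counter per interval */
    int nbins;
} HistogramSketch;

static void histogram_account(HistogramSketch *h, uint64_t latency_ns)
{
    int i = 0;

    /* linear scan for clarity; a bsearch over the boundaries works too */
    while (i < h->nbins - 1 && latency_ns >= h->boundaries[i]) {
        i++;
    }
    h->bins[i]++;           /* no allocation: the histogram never grows */
}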



06.03.2018 20:49, Emilio G. Cota wrote:

On Tue, Mar 06, 2018 at 16:00:17 +, Stefan Hajnoczi wrote:

On Wed, Feb 07, 2018 at 03:50:35PM +0300, Vladimir Sementsov-Ogievskiy wrote:

Vladimir Sementsov-Ogievskiy (2):
   block/accounting: introduce latency histogram
   qapi: add block latency histogram interface

(snip)

  5 files changed, 228 insertions(+), 1 deletion(-)

The feature looks good.  I posted documentation and code readability
suggestions.

Hi Vladimir,

Did you consider using qdist? For reference, see commit bf3afd5f419
and/or grep 'qdist'.

Thanks,

Emilio



--
Best regards,
Vladimir




Re: [Qemu-block] [PATCH 2/5] migration/block: reset dirty bitmap before read in bulk phase

2018-03-08 Thread Juan Quintela
Peter Lieven  wrote:
> Reset the dirty bitmap before reading to make sure we don't miss
> any new data.
>
> Cc: qemu-sta...@nongnu.org
> Signed-off-by: Peter Lieven 

Reviewed-by: Juan Quintela 



Re: [Qemu-block] [PATCH 1/5] migration: do not transfer ram during bulk storage migration

2018-03-08 Thread Juan Quintela
Peter Lieven  wrote:
> this patch makes the bulk phase of a block migration take
> place before we start transferring ram. As the bulk block migration
> can take a long time, it's pointless to transfer ram during that phase.
>
> Signed-off-by: Peter Lieven 
> Reviewed-by: Stefan Hajnoczi 

Reviewed-by: Juan Quintela 



Re: [Qemu-block] [Qemu-devel] [PATCH v4 00/37] x-blockdev-create for protocols and qcow2

2018-03-08 Thread Kevin Wolf
On 08.03.2018 at 11:21, Daniel P. Berrangé wrote:
> On Wed, Mar 07, 2018 at 07:59:09PM +0100, Kevin Wolf wrote:
> > This series implements a minimal QMP command that allows to create an
> > image file on the protocol level or an image format on a given block
> > node.
> > 
> > Eventually, the interface is going to change to some kind of an async
> > command (possibly a (non-)block job), but that will require more work on
> > the job infrastructure first, so let's first QAPIfy image creation in
> > the block drivers. In this series, I'm going for a synchronous command
> > that is prefixed with x- for now.
> > 
> > This series converts qcow2 and all protocol drivers that allow an actual
> > image creation. This means that drivers which only check if the already
> > existing storage is good enough are not converted (e.g. host_device,
> > iscsi). The old behaviour was useful because 'qemu-img create' wants to
> > create both protocol and format layer, but with the separation in QMP,
> > you can just leave out the protocol layer creation when the device
> > already exists.
> 
> Are you going to convert the other format drivers in later versions of
> this series, or leave it up to individual maintainers to convert the
> rest? (personally thinking of the luks driver of course)

In a follow-up series, actually, or perhaps one series per format. This
one is already much longer than I hoped it would become. The longer a
series is, the harder it becomes to get it fully reviewed, address
reviews all over the place and somehow deal with merge conflicts. Had I
known how many changes some protocol drivers need, I would probably
have split it into two.

luks happens to be the one format driver that I did first (because it's
alphabetically the first one that supports image creation), so if you
want to look at the patches, they are contained here:

http://repo.or.cz/qemu/kevin.git/shortlog/refs/heads/blockdev-create

The branch probably doesn't build at the moment, but if you cherry-pick
the luks patches, that should work. I'm going to send them once this
series with the basic support is at least in my block branch (maybe
already later today?)

Kevin



[Qemu-block] [PATCH 0/5] block migration fixes

2018-03-08 Thread Peter Lieven
Peter Lieven (5):
  migration: do not transfer ram during bulk storage migration
  migration/block: reset dirty bitmap before read in bulk phase
  migration/block: rename MAX_INFLIGHT_IO to MAX_IO_BUFFERS
  migration/block: limit the number of parallel I/O requests
  migration/block: compare only read blocks against the rate limiter

 migration/block.c | 17 -
 migration/ram.c   |  8 
 2 files changed, 16 insertions(+), 9 deletions(-)

-- 
2.7.4





[Qemu-block] [PATCH 1/5] migration: do not transfer ram during bulk storage migration

2018-03-08 Thread Peter Lieven
this patch makes the bulk phase of a block migration take
place before we start transferring ram. As the bulk block migration
can take a long time, it's pointless to transfer ram during that phase.

Signed-off-by: Peter Lieven 
Reviewed-by: Stefan Hajnoczi 
---
 migration/ram.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/migration/ram.c b/migration/ram.c
index 5e33e5c..42468ee 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -2258,6 +2258,13 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
 int64_t t0;
 int done = 0;
 
+if (blk_mig_bulk_active()) {
+/* Avoid transferring ram during bulk phase of block migration as
+ * the bulk phase will usually take a long time and transferring
+ * ram updates during that time is pointless. */
+goto out;
+}
+
 rcu_read_lock();
 if (ram_list.version != rs->last_version) {
 ram_state_reset(rs);
@@ -2304,6 +2311,7 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
  */
 ram_control_after_iterate(f, RAM_CONTROL_ROUND);
 
+out:
 qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
 ram_counters.transferred += 8;
 
-- 
2.7.4





[Qemu-block] [PATCH 4/5] migration/block: limit the number of parallel I/O requests

2018-03-08 Thread Peter Lieven
the current implementation submits up to 512 I/O requests in parallel
which is much too high, especially for a background task.
This patch adds a maximum limit of 16 I/O requests that can
be submitted in parallel to avoid monopolizing the I/O device.

Signed-off-by: Peter Lieven 
---
 migration/block.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/migration/block.c b/migration/block.c
index 41b95d1..ce939e2 100644
--- a/migration/block.c
+++ b/migration/block.c
@@ -37,6 +37,7 @@
 #define MAX_IS_ALLOCATED_SEARCH (65536 * BDRV_SECTOR_SIZE)
 
 #define MAX_IO_BUFFERS 512
+#define MAX_PARALLEL_IO 16
 
 //#define DEBUG_BLK_MIGRATION
 
@@ -775,6 +776,7 @@ static int block_save_iterate(QEMUFile *f, void *opaque)
 while ((block_mig_state.submitted +
 block_mig_state.read_done) * BLOCK_SIZE <
qemu_file_get_rate_limit(f) &&
+   block_mig_state.submitted < MAX_PARALLEL_IO &&
(block_mig_state.submitted + block_mig_state.read_done) <
MAX_IO_BUFFERS) {
 blk_mig_unlock();
-- 
2.7.4





[Qemu-block] [PATCH 2/5] migration/block: reset dirty bitmap before read in bulk phase

2018-03-08 Thread Peter Lieven
Reset the dirty bitmap before reading to make sure we don't miss
any new data.

Cc: qemu-sta...@nongnu.org
Signed-off-by: Peter Lieven 
---
 migration/block.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/migration/block.c b/migration/block.c
index 1f03946..87bb35c 100644
--- a/migration/block.c
+++ b/migration/block.c
@@ -331,11 +331,10 @@ static int mig_save_device_bulk(QEMUFile *f, 
BlkMigDevState *bmds)
  */
 qemu_mutex_lock_iothread();
 aio_context_acquire(blk_get_aio_context(bmds->blk));
-blk->aiocb = blk_aio_preadv(bb, cur_sector * BDRV_SECTOR_SIZE, &blk->qiov,
-0, blk_mig_read_cb, blk);
-
 bdrv_reset_dirty_bitmap(bmds->dirty_bitmap, cur_sector * BDRV_SECTOR_SIZE,
 nr_sectors * BDRV_SECTOR_SIZE);
+blk->aiocb = blk_aio_preadv(bb, cur_sector * BDRV_SECTOR_SIZE, &blk->qiov,
+0, blk_mig_read_cb, blk);
 aio_context_release(blk_get_aio_context(bmds->blk));
 qemu_mutex_unlock_iothread();
 
-- 
2.7.4





[Qemu-block] [PATCH 5/5] migration/block: compare only read blocks against the rate limiter

2018-03-08 Thread Peter Lieven
only read_done blocks are queued to be flushed to the migration
stream; submitted blocks are still in flight.

Signed-off-by: Peter Lieven 
---
 migration/block.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/migration/block.c b/migration/block.c
index ce939e2..4e950c2 100644
--- a/migration/block.c
+++ b/migration/block.c
@@ -773,8 +773,7 @@ static int block_save_iterate(QEMUFile *f, void *opaque)
 
 /* control the rate of transfer */
 blk_mig_lock();
-while ((block_mig_state.submitted +
-block_mig_state.read_done) * BLOCK_SIZE <
+while (block_mig_state.read_done * BLOCK_SIZE <
qemu_file_get_rate_limit(f) &&
block_mig_state.submitted < MAX_PARALLEL_IO &&
(block_mig_state.submitted + block_mig_state.read_done) <
-- 
2.7.4





Re: [Qemu-block] block migration and dirty bitmap reset

2018-03-08 Thread Peter Lieven

On 08.03.2018 at 10:01, Fam Zheng wrote:

On Thu, Mar 8, 2018 at 4:57 PM, Peter Lieven  wrote:



On 08.03.2018 at 02:28, Fam Zheng wrote:


On Wed, 03/07 09:06, Peter Lieven wrote:
Hi,

while looking at the code I wonder if the blk_aio_preadv and the 
bdrv_reset_dirty_bitmap order must
be swapped in mig_save_device_bulk:

qemu_mutex_lock_iothread();
aio_context_acquire(blk_get_aio_context(bmds->blk));
blk->aiocb = blk_aio_preadv(bb, cur_sector * BDRV_SECTOR_SIZE, &blk->qiov,
0, blk_mig_read_cb, blk);

bdrv_reset_dirty_bitmap(bmds->dirty_bitmap, cur_sector * BDRV_SECTOR_SIZE,
nr_sectors * BDRV_SECTOR_SIZE);
aio_context_release(blk_get_aio_context(bmds->blk));
qemu_mutex_unlock_iothread();

In mig_save_device_dirty we first reset the dirty bitmap and then read, which
sounds like a better idea.

Yes, that sounds right to me.

Fam

You mean the order should be swapped in mig_save_device_bulk as well?

Yeah, resetting the dirty bitmap before reading makes sure we don't
miss any new data.


I wonder if this can really happen during blk_aio_preadv?
Anyway better safe than sorry. It at least will not hurt to swap the order.

Peter




Re: [Qemu-block] [Qemu-devel] [PATCH v4 00/37] x-blockdev-create for protocols and qcow2

2018-03-08 Thread Daniel P. Berrangé
On Wed, Mar 07, 2018 at 07:59:09PM +0100, Kevin Wolf wrote:
> This series implements a minimal QMP command that allows to create an
> image file on the protocol level or an image format on a given block
> node.
> 
> Eventually, the interface is going to change to some kind of an async
> command (possibly a (non-)block job), but that will require more work on
> the job infrastructure first, so let's first QAPIfy image creation in
> the block drivers. In this series, I'm going for a synchronous command
> that is prefixed with x- for now.
> 
> This series converts qcow2 and all protocol drivers that allow an actual
> image creation. This means that drivers which only check if the already
> existing storage is good enough are not converted (e.g. host_device,
> iscsi). The old behaviour was useful because 'qemu-img create' wants to
> create both protocol and format layer, but with the separation in QMP,
> you can just leave out the protocol layer creation when the device
> already exists.

Are you going to convert the other format drivers in later versions of
this series, or leave it up to individual maintainers to convert the
rest? (personally thinking of the luks driver of course)


Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|



Re: [Qemu-block] block migration and dirty bitmap reset

2018-03-08 Thread Fam Zheng
On Thu, Mar 8, 2018 at 4:57 PM, Peter Lieven  wrote:
>
>
>> On 08.03.2018 at 02:28, Fam Zheng wrote:
>>
>>> On Wed, 03/07 09:06, Peter Lieven wrote:
>>> Hi,
>>>
>>> while looking at the code I wonder if the blk_aio_preadv and the 
>>> bdrv_reset_dirty_bitmap order must
>>> be swapped in mig_save_device_bulk:
>>>
>>>qemu_mutex_lock_iothread();
>>>aio_context_acquire(blk_get_aio_context(bmds->blk));
>>>blk->aiocb = blk_aio_preadv(bb, cur_sector * BDRV_SECTOR_SIZE, &blk->qiov,
>>>0, blk_mig_read_cb, blk);
>>>
>>>bdrv_reset_dirty_bitmap(bmds->dirty_bitmap, cur_sector * BDRV_SECTOR_SIZE,
>>>nr_sectors * BDRV_SECTOR_SIZE);
>>>aio_context_release(blk_get_aio_context(bmds->blk));
>>>qemu_mutex_unlock_iothread();
>>>
>>> In mig_save_device_dirty we first reset the dirty bitmap and then read,
>>> which sounds like a better idea.
>>
>> Yes, that sounds right to me.
>>
>> Fam
>
> You mean the order should be swapped in mig_save_device_bulk as well?

Yeah, resetting the dirty bitmap before reading makes sure we don't
miss any new data.
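
To spell out the race (my reading; the ordering below mirrors what patch 2/5
of the "block migration fixes" series does):

/* reset first, then read: a guest write landing while the read is in
 * flight re-dirties the block and is picked up in a later dirty pass */
bdrv_reset_dirty_bitmap(bmds->dirty_bitmap, cur_sector * BDRV_SECTOR_SIZE,
                        nr_sectors * BDRV_SECTOR_SIZE);
blk->aiocb = blk_aio_preadv(bb, cur_sector * BDRV_SECTOR_SIZE, &blk->qiov,
                            0, blk_mig_read_cb, blk);
/* with the old order (read first, then reset), a write arriving between the
 * read and the reset would be cleared from the bitmap even though the read
 * returned the pre-write data, so that write would never be migrated */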

Fam

>
> Peter
>



Re: [Qemu-block] block migration and dirty bitmap reset

2018-03-08 Thread Peter Lieven


> On 08.03.2018 at 02:28, Fam Zheng wrote:
> 
>> On Wed, 03/07 09:06, Peter Lieven wrote:
>> Hi,
>> 
>> while looking at the code I wonder if the blk_aio_preadv and the 
>> bdrv_reset_dirty_bitmap order must
>> be swapped in mig_save_device_bulk:
>> 
>>qemu_mutex_lock_iothread();
>>aio_context_acquire(blk_get_aio_context(bmds->blk));
>>    blk->aiocb = blk_aio_preadv(bb, cur_sector * BDRV_SECTOR_SIZE, &blk->qiov,
>>0, blk_mig_read_cb, blk);
>> 
>>bdrv_reset_dirty_bitmap(bmds->dirty_bitmap, cur_sector * BDRV_SECTOR_SIZE,
>>nr_sectors * BDRV_SECTOR_SIZE);
>>aio_context_release(blk_get_aio_context(bmds->blk));
>>qemu_mutex_unlock_iothread();
>> 
>> In mig_save_device_dirty we first reset the dirty bitmap and then read,
>> which sounds like a better idea.
> 
> Yes, that sounds right to me.
> 
> Fam

You mean the order should be swapped in mig_save_device_bulk as well?

Peter